“Who Wrote That Email?”
Forensic Authorship Attribution and Stylometry
TASA ID: 3949
Some cases hinge on the authorship of a document. Whether we want to know about the author of a defamatory email, the source of a ransom note, or the authenticity of a will, one of the most important pieces of evidence is the one that establishes who wrote it. Historically, most documents were handwritten and handwriting experts (today they go by the title “forensic document examiners”) could determine who wrote something from the slant of an f or the height of a t. Even with typewritten documents, they could notice a chipped p or an out-of-line c and identify the specific typewriter that created the document. Physical creation also produces physical variance.
Today, things are a little different. [1] Computer characters are not physical, but mathematical; one flat-ASCII A is literally identical to any other, regardless of who created the document, or when or where. To know who wrote a defamatory comment on a Web page, looking just at the physical properties of the artifact may not be enough.
Authorship attribution [2,3,4], sometimes called stylistic analysis or stylometry, is an increasingly important forensic discipline with practical applications. One well-known case was the identification of J.K. Rowling (the author of the best-selling Harry Potter series of books) as the true author of Robert Galbraith’s detective novel The Cuckoo’s Calling. By looking at the writing style of Cuckoo, two different teams of scientists were able to show empirically that it was much more similar to the writing style of Rowling than to other authors’.[5]
It is a relatively simple matter to turn this kind of insight into the kind of evidence useful for court cases. McMenamin’s report [6] in Ceglia v. Zuckerberg is a good example. Among the issues in the case was a set of emails, allegedly written by Mark Zuckerberg, the founder of Facebook, that were important evidence for Ceglia’s claims, which amounted to 50% of Facebook.
McMenamin hand-identified eleven specific “style markers” and checked to see if they were present in a sample of email known to be by Zuckerberg as well as in the “questioned” documents, the disputed email in Ceglia’s complaint. He found, for example, that “[a]postrophe’s indicating contraction and possession are sometimes absent in QUESTIONED, but always present in KNOWN-Zuckerberg,” a stylistic difference between the two groups. Similarly, “[t]he word ‘internet’ starts with a small-i in the QUESTIONED writing but with a capital-I in KNOWN-Zuckerberg,” another difference. Of the eleven style markers, nine were shown to be different between the two groups, and, in McMenamin’s opinion, “the differences demonstrat[e] a sufficiently significant set of differences,” and, therefore, Zuckerberg was “not the author of the excerpted QUESTIONED references.”
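A marker comparison of this kind can be sketched in a few lines of Python. This is only an illustration, not McMenamin’s procedure: the two marker functions are loose approximations of the two markers quoted above, the list of apostrophe-less contractions is an assumption, and the sample sentences are invented for the example.

```python
import re

# Hypothetical approximations of two of the quoted style markers.
# The word list is an assumption chosen for illustration only.
NO_APOSTROPHE = re.compile(
    r"\b(?:dont|cant|didnt|doesnt|isnt|wasnt|couldnt)\b", re.IGNORECASE
)
LOWERCASE_INTERNET = re.compile(r"\binternet\b")  # case-sensitive on purpose


def marker_profile(texts):
    """Count how often each marker occurs across a group of texts."""
    joined = " ".join(texts)
    return {
        "missing_apostrophe": len(NO_APOSTROPHE.findall(joined)),
        "lowercase_internet": len(LOWERCASE_INTERNET.findall(joined)),
    }


def differing_markers(known, questioned):
    """List markers present in one group of texts but absent in the other."""
    k, q = marker_profile(known), marker_profile(questioned)
    return [name for name in k if bool(k[name]) != bool(q[name])]


# Invented sample sentences, not taken from any real case.
known_sample = ["I don't think the Internet site is ready.", "We can't launch yet."]
questioned_sample = ["I dont think the internet site is ready.", "We cant launch yet."]

print(marker_profile(known_sample))
print(marker_profile(questioned_sample))
print(differing_markers(known_sample, questioned_sample))
```

A count like this only flags where two groups of writing differ; deciding whether the differences are significant remains a matter for the examiner’s judgment.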
The Ceglia case dealt, of course, with a legal dispute under civil law. The Rowling case did not appear in court, but was of substantial scholarly (and public) interest—and, of course, would be a model for a copyright dispute. Grant [7] was able to help on a genuine whodunit, a murder case. One night in January, 2009, a fire broke out in a house in Staffordshire, UK. A woman named Amanda Birks apparently died in the fire, but “forensic examination showed that fibers recovered from Amanda’s body were from her daytime clothes, and toxicology reports indicated that Amanda’s lungs contained little or no carbon monoxide.” [7] Was she murdered and the fire set to cover it up?
Grant was able to analyze the SMS (text) messages sent from Amanda’s phone, and showed that “a shift in texting style occurred […] at 12:07 p.m.” Put simply, after this time, the messages sent from Amanda’s phone lacked many features characteristic of her writing, and instead showed features more typical of her husband. For example, Amanda tended to write “dont” for “don’t,” while her husband tended to write “dnt.” Based on features like this, Grant concluded that the messages after 12:07 (the time of her actual death?) were not consistent with her own undisputed writing, but were instead consistent with her husband’s. Presumably based in part on this evidence, “On the morning before trial, [the husband] changed his pleas to ‘guilty’ [of the murder of his wife, of arson, and of the endangerment of his children and the firefighters]” and was sentenced to life in prison.
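The idea of locating a shift in a time-ordered run of messages can be illustrated with a toy Python sketch. It is not Grant’s method, which drew on many features and on measures of consistency and distinctiveness; it assumes only the single “dont”/“dnt” contrast mentioned above, and the messages and timestamps are invented.

```python
import re

# The single spelling contrast mentioned in the text; a real analysis
# would rely on many features rather than one.
VARIANTS = {
    "variant_dont": re.compile(r"\bdont\b", re.IGNORECASE),
    "variant_dnt": re.compile(r"\bdnt\b", re.IGNORECASE),
}


def label_message(text):
    """Return the variant a message uses, or None if it uses neither or both."""
    hits = [name for name, pattern in VARIANTS.items() if pattern.search(text)]
    return hits[0] if len(hits) == 1 else None


def first_shift(messages):
    """Index of the first message whose variant differs from the previous labeled one."""
    previous = None
    for index, (_, text) in enumerate(messages):
        label = label_message(text)
        if label is None:
            continue  # message carries no evidence either way
        if previous is not None and label != previous:
            return index
        previous = label
    return None


# Invented (time, text) pairs in chronological order.
messages = [
    ("09:10", "dont forget the milk"),
    ("10:30", "see you later"),
    ("11:45", "i dont know yet"),
    ("13:20", "dnt wait up"),
    ("14:05", "dnt worry"),
]

shift = first_shift(messages)
print(messages[shift][0] if shift is not None else "no shift found")
```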
Chaski [1] provides another example of a murder case, where a dead body was found, an apparent suicide, with a word-processed “note” left on a home computer. As in Grant’s case, no physical documents were present to analyze, but, also as in Grant’s case, there were suspicious circumstances surrounding the death (the death was apparently from injected drugs, but no needles were found near the body [8]). Stylometric analysis showed that the “note” lacked key features of the victim’s writing, but was consistent with the writing of his roommate. His roommate eventually admitted to writing the note and was convicted.
Juola [9] describes an administrative case before the US immigration courts. In this case, an online gadfly and critic of a foreign government sought to remain in the United States, fearing persecution if he were returned to that country. Juola was able to establish that the anonymously-published articles critical of that government were consistent in writing style with other articles he had published under his own name. Based in part upon this evidence, the man was permitted to remain in the United States. Authorship attribution thus can be an important element of many different types of dispute resolution.
How does this work? The basic idea, as expressed by Coulthard [10], is that
“[A]ll speaker/writers of a given language have their own personal form of that language, technically labeled an idiolect. A speaker/writer's idiolect will manifest itself in distinctive and cumulatively unique rule-governed choices for encoding meaning linguistically in the written and spoken communications they produce. For example, in the case of vocabulary, every speaker/writer has a very large learned and stored set of words built up over many years. Such sets may differ slightly or considerably from the word sets that all other speaker/writers have similarly built up, in terms both of stored individual items in their passive vocabulary and, more importantly, in terms of their preferences for selecting and then combining these individual items in the production of texts.”
A simple example of this can be found in well-known regional variations. A speaker/writer who refers to a “lorry” parked on the “pavement” in front of an “ironmonger” is using words very common in Commonwealth English, but uncommon in US English. An obvious question to the investigating officer, then, would be “who among your suspects is not from the United States?” While it is possible for an American to make a point of using British vocabulary, or for a British editor to “regularise” spellings, there are other cues that are both more subtle and harder to control and change.
Figure 1: Where is the salad fork? (Image courtesy of clker.com, used by permission.)
Figure 1 shows an example of a complex, formal table setting. The reader is invited to answer the question “where is the salad fork?” Perhaps surprisingly, there are many subtly different ways to answer it. For example, it’s “the fork on the outside,” but also “to the left of the dinner fork” or “to the right of the napkin.” Even more subtly, it can be “to the left of the dinner fork,” “on the left of the dinner fork,” or “at the left of the dinner fork.” While the meaning of the expression does not change depending on which preposition is used, the details of the expression do. Furthermore, this kind of subtle change is often not even noticed by readers [11], who focus on the meaning rather than the exact expression. As discussed above, Amanda’s husband may not even have noticed that his wife spelled “don’t” differently than he did.
An article by Binongo [12] illustrates this kind of analysis quite well. In a study of the Oz books, started by L. Frank Baum and continued by Ruth Plumly Thompson, he focused on the authorship of the 15th book, The Royal Book of Oz. After Baum’s death, the publishers asked Thompson to finish “notes and a fragmentary draft” of what would become The Royal Book, and then Thompson herself continued the series until 1939, writing nearly twenty more books. The question, then, is not who wrote the email, but who actually wrote the 15th Oz book?
Binongo collected frequency statistics on the fifty most common words in the undisputed works by both Baum and Thompson. These fifty most common words, of course, included exactly the sort of “little” words seen in the table-setting example: prepositions, of course, but also articles, conjunctions, common adverbs, and similar instances of what linguists call “function words.” These function words are so called because they don’t have much content or meaning (consider trying to create a definition of “of,” for example), but instead describe the functional relations between words, such as an attribute or possessor. Psycholinguists have shown [11] that people have difficulty remembering the differences in sentences such as “Three turtles rested on a floating log and a fish swam beneath (it/them),” where the meaning of the two alternate sentences is the same. Many aspects of language appear to happen at a level below our conscious choices. The scientific foundation is firm enough that admissibility is rarely a problem.
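The function-word approach can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: it uses a hand-picked list of ten function words rather than the fifty most common words of a real corpus, and it swaps in a plain mean absolute difference between frequency profiles instead of the multivariate analysis of the cited study. The author labels and the short texts are invented.

```python
import re
from collections import Counter

# A short, hand-picked list for illustration; an actual study would
# derive the most common words from the undisputed works themselves.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "on", "at", "it", "that"]


def profile(text):
    """Relative frequency of each function word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # guard against empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def distance(p, q):
    """Mean absolute difference between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def closest_author(questioned, known):
    """Name of the known author whose profile is nearest the questioned text."""
    target = profile(questioned)
    return min(known, key=lambda name: distance(target, profile(known[name])))


# Invented toy texts; real comparisons need far more writing per author.
known_texts = {
    "author_a": "the fork is to the left of the plate and the knife is to the right of the plate",
    "author_b": "a fork sits at a spot on a table that it shares at a meal in a room",
}
questioned_text = "the spoon is to the right of the knife and the cup is to the left of the spoon"

print(closest_author(questioned_text, known_texts))
```

With samples this short the result means very little; the sketch only shows the mechanics of counting function words and comparing the resulting profiles.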
One advantage of this approach, and especially of the computational variation, is that it is not limited to English. Much research has been done [13] showing that the same types of analysis can produce evidence in many different languages. Another advantage of the computational approach is speed and volume; Juola’s computers could read eight mystery novels in a few minutes to analyze Rowling’s work, while Binongo’s computers could read almost fifty Oz books. In some cases, such as the famous Chevron v. Donziger litigation, computers have been able to help develop evidence from more than 200,000 pages of text.
Documents and their authorship have been key to litigation for centuries, but modern electronic documents introduce important changes into how to handle and authenticate them. Authorship attribution is an important new field of forensic science that can help journalists, scholars, and litigators develop the evidence they need to win their cases.
References:
1. Chaski, Carole E. "Who’s at the Keyboard: Authorship Attribution in Digital Evidence Investigations." International Journal of Digital Evidence 4(1) (2005). Web. http://www.ijde.org.
2. Juola, Patrick. "Authorship Attribution." Foundations and Trends in Information Retrieval 1(3) (2006).
3. Koppel, Moshe, Schler, Jonathan, and Argamon, Shlomo. "Computational Methods in Authorship Attribution." Journal of the American Society for Information Science and Technology 60(1) (2009):9-26.
4. Stamatatos, Efstathios. "A Survey of Modern Authorship Attribution Methods." Journal of the American Society for Information Science and Technology 60(3) (2009):538-56.
5. Juola, Patrick. "The Rowling Case: A Proposed Standard Analytic Protocol for Authorship Questions." Digital Scholarship in the Humanities 30(suppl_1) (2015):i100-i113. doi: 10.1093/llc/fqv040
6. McMenamin, Gerald. "Declaration of Gerald McMenamin." 2011. Web. http://www.scribd.com/doc/67951469/Expert-Report-Gerald-McMenamin.
7. Grant, Tim. "Txt 4n6: Describing and Measuring Consistency and Distinctiveness in the Analysis of SMS Text Messages." Journal of Law and Policy XXI(2) (2013):467-94.
8. Ramsland, Katherine. "Whether You’re Talking or Typing, You Can’t Hide Your Lies: A Fascinating New Branch of Forensic Science Spots Unconscious “Tells”." Psychology Today. Web. 14 July 2014. https://www.psychologytoday.com/blog/shadow-boxing/201407/whether-youre-talking-or-typing-you-cant-hide-your-lies
9. Juola, Patrick. "Stylometry and Immigration: A Case Study." Journal of Law and Policy XXI(2) (2013):287-98.
10. Coulthard, Malcolm. "On Admissible Linguistic Evidence." Journal of Law and Policy XXI(2) (2013):441-66.
11. Bransford, John D., Barclay, J. Richard, and Franks, Jeffery J. "Sentence Memory: A Constructive Versus Interpretive Approach." Cognitive Psychology 3(2) (1972):193-209.
12. Binongo, José Nilo G. "Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution." Chance 16(2) (2003):9-17.
13. Rosso, Paolo, et al. "Overview of PAN’16." In: Fuhr, N., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science 9822 (2016).
This article discusses issues of general interest and does not give any specific legal or business advice pertaining to any specific circumstances. Before acting upon any of its information, you should obtain appropriate advice from a lawyer or other qualified professional.
This article may not be duplicated, altered, distributed, saved, incorporated into another document or website, or otherwise modified without the permission of TASA. Contact marketing@tasanet.com for any questions.