Forensic linguistics: the science of proving who wrote what.
Sunday, 01 February 2015
"Forensic Linguistics" is a generic term for a group of scientific disciplines that are recognised by Australian courts as having probative value in helping to identify the likely author of a piece of text.
This Wikipedia extract (produced by practitioners) will give you some idea of what I'm talking about.
Author identification
The identification of whether a given individual said or wrote something relies on analysis of their idiolect,[8] or particular patterns of language use (vocabulary, collocations, pronunciation, spelling, grammar, etc.). The idiolect is a theoretical construct based on the idea that there is linguistic variation at the group level and hence there may also be linguistic variation at the individual level. William Labov has stated that nobody has found homogeneous data in idiolects,[9] and there are many reasons why it is difficult to provide such evidence.
Firstly, language is not an inherited property, but one which is socially acquired.[10] Because acquisition is continuous and life-long, an individual's use of language is always susceptible to variation from a variety of sources, including other speakers, the media and macro-social changes. Education can have a profoundly homogenizing effect on language use.[2] Research into authorship identification is ongoing. The term authorship attribution is now felt to be too deterministic.[11]
The paucity of documents in most criminal cases (ransom notes, threatening letters, etc.) means there is often too little text upon which to base a reliable identification. However, the information available may be adequate to eliminate a suspect as an author or to narrow the field to a small group of suspects.
Authorship measures that analysts use include average word length, average number of syllables per word, article frequency, type-token ratio, punctuation (both overall density and placement at syntactic boundaries) and measurements of hapax legomena (words that occur only once in a text). Statistical approaches include factor analysis, Bayesian statistics, the Poisson distribution, multivariate analysis, and discriminant function analysis of function words.
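To make those measures a little more concrete, here is a minimal sketch in Python of how a few of them might be computed over a piece of text. It is my own illustration, not any practitioner's tool: the tokenisation is deliberately naive (lowercased alphabetic runs) and the function and variable names are invented for the example.

```python
import re
from collections import Counter

def stylometric_profile(text):
    """Compute a few simple authorship measures for a piece of text.
    Tokenisation is deliberately naive; a real analysis would use a
    proper tokeniser and a much richer feature set."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {
        # Average word length in characters
        "avg_word_length": sum(len(w) for w in words) / total,
        # Type-token ratio: distinct words divided by total words
        "type_token_ratio": len(counts) / total,
        # Hapax legomena: number of words occurring exactly once
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        # Article frequency: proportion of tokens that are articles
        "article_frequency": sum(counts[a] for a in ("a", "an", "the")) / total,
        # Punctuation density: punctuation marks per word
        "punctuation_density": len(re.findall(r"[.,;:!?]", text)) / total,
    }

if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog. A dog barks; the fox runs."
    for measure, value in stylometric_profile(sample).items():
        print(f"{measure}: {value:.3f}")
```

On their own these numbers prove nothing; the point is that profiles like this, computed over known and disputed texts, feed the statistical comparisons mentioned above.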
The Cusum (Cumulative Sum) method for text analysis has also been developed.[12] Cusum analysis is claimed to work even on short texts and relies on the assumption that each speaker has a unique set of habits that shows no significant difference between their speech and their writing. Speakers tend to use two- and three-letter words within a sentence, and their utterances tend to include vowel-initial words.
To carry out the Cusum test on these habits (use of two- and three-letter words and of vowel-initial words), the occurrences of each type of word are counted in every sentence and their distribution plotted across the text. The Cusum distribution for the two habits is then compared with the cumulative distribution of sentence length. The two sets of values should track each other; any altered or tampered section of the text shows up as a distinct divergence between them, exhibiting a different pattern from the rest of the text.
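Roughly, the technique plots the running (cumulative) deviation of each habit from its own average, sentence by sentence, and compares that curve with the cumulative deviation of sentence length. The Python sketch below is my own illustration of that idea, not the published Cusum procedure: sentence splitting, the word tests and the names used are all simplifications I have assumed for the example.

```python
import re

def cusum_series(values):
    """Cumulative sum of each value's deviation from the series mean."""
    mean = sum(values) / len(values)
    series, running = [], 0.0
    for v in values:
        running += v - mean
        series.append(running)
    return series

def cusum_profile(text):
    """Per-sentence sentence-length and habit-word counts, each turned
    into a cumulative-deviation curve."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths, habit_counts = [], []
    for s in sentences:
        words = re.findall(r"[A-Za-z]+", s)
        lengths.append(len(words))
        # Habit words: two- and three-letter words plus vowel-initial words
        habit = sum(1 for w in words
                    if len(w) in (2, 3) or w[0].lower() in "aeiou")
        habit_counts.append(habit)
    return cusum_series(lengths), cusum_series(habit_counts)

if __name__ == "__main__":
    text = ("I saw the old man at the gate. He asked me in and we sat down. "
            "Notwithstanding subsequent complications, arrangements proceeded satisfactorily. "
            "An hour later he was gone.")
    length_curve, habit_curve = cusum_profile(text)
    # If the two curves track each other, the text is consistent with one writer;
    # a stretch where they diverge sharply would be flagged as anomalous.
    for i, (a, b) in enumerate(zip(length_curve, habit_curve), 1):
        print(f"sentence {i}: length cusum {a:+.1f}, habit cusum {b:+.1f}")
```

In practice the two curves are normally scaled and inspected on a single chart, and it is the divergent stretch, rather than any single sentence, that is treated as the possibly tampered section.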
More soon!