Even today, much of the most insightful work in corpus linguistics relies on techniques whose use in computer-based corpus studies was pioneered 50 years ago by John Sinclair: collocation and keyword analysis combined with a careful interpretation of the corresponding KWIC (key word in context) concordances.
On 25 June 2018, Professor Stefan Evert delivered the 2018 Sinclair Lecture at the University of Birmingham.
Enormous technological advances seem to have had little impact except for allowing corpus linguists to analyze ever larger corpora (even on their own laptop computers) and to make use of automatic linguistic annotation (such as part-of-speech tagging, or the automatic detection of direct speech in novels).
At the same time, research in other fields has been transformed fundamentally. Digital humanities applies a wide range of state-of-the-art techniques for data analysis and visualization, providing exciting new perspectives on language that are, however, often far removed from the actual object of study (a divorce often embraced as “distant reading”). In computer science, the age of deep learning has brought advances in artificial intelligence that may have a lasting impact on commerce and industry as well as society: algorithms are claimed to achieve superhuman performance; end-to-end learning translates between dozens of languages without any linguistic knowledge. As a result, the need for human understanding is increasingly questioned.
Evert discussed perspectives for the future of corpus-linguistic research in such an environment. Rather than uncritically embracing new data analysis techniques or applying deep learning models devoid of any linguistic understanding, he argued that our field needs to develop approaches that combine human interpretation with quantitative analysis and visualization, merging man and machine into what he likes to call, with a little bit of hyperbole, the Hermeneutic Cyborg.
Stefan Evert holds the Chair of Computational Corpus Linguistics at the University of Erlangen-Nuremberg, Germany. After studying mathematics, physics and English linguistics, he received a PhD in computational linguistics from the University of Stuttgart, Germany. His research interests include the statistical analysis of corpus frequency data (significance tests in corpus linguistics, statistical association measures, Zipf’s law and word frequency distributions), quantitative approaches to lexical semantics (collocations, multiword expressions and distributional semantics), multidimensional analysis (linguistic variation, language comparison, translation studies), and the processing of large text corpora (IMS Open Corpus Workbench, data model and query language of the Nite XML Toolkit, tools for the Web as corpus).