'Colorful language' and Big Data – linguistics in the digital age

Language matters. On Monday, after less than ten days in the post, the White House communications director Anthony Scaramucci was fired. During his short tenure, he amply showcased his communications skills by using foul, or as he put it, ‘colorful’ language. The link between Scaramucci’s profanities and the speed of his departure did not go unnoticed. Scaramucci’s case also illustrates how certain linguistic effects become immediately measurable. His infamous interview resulted in record numbers of unique visitors to the New Yorker’s website. In the digital age, news is reported instantly, shared widely and rapidly commented on by readers.

The way politicians use language has always been of interest to linguists. Speeches by great historical leaders show how language and action are inextricably linked, and political discourses are useful illustrations of the complex relationship between language and reality (take Brexit discourse as an example).

While such examples have always been drawn on, the digital age has added a new dimension to linguistics. The object of study has become more diverse as communication takes place on social media and people no longer just talk to people but also talk to the likes of Alexa and Siri.

New forms of communication aside, the single most significant challenge for linguistics in the digital age is the vast amount of data that can usefully provide empirical evidence to underpin scientific accounts of languages. And this data is not only of interest to linguists. The term ‘culturomics’ refers to the study of culture using data science methods and Google Books data; literary texts are studied with ‘distant reading’ methods; and the term ‘Digital Humanities’ covers a variety of approaches using ‘digital archives’ or even ‘Big Data’.

In linguistics, the study of large amounts of electronic texts is known as ‘corpus linguistics’. When corpus linguistics started off in the second half of the last century, corpora were carefully designed and creating an electronic copy of a text could mean keying it in. Corpus linguistics triggered a revolution in lexicography. Instead of relying heavily on examples from literary texts, modern dictionaries are now based on evidence of widely used language patterns. The forerunner of this kind of dictionary was published in 1987 by COBUILD, a collaboration between Collins Publishers and the University of Birmingham. The editor-in-chief was John Sinclair, Chair of Modern English Language at Birmingham from 1965 to 2000. The year 2017 marks the tenth anniversary of John Sinclair's death and the 30th anniversary of the publication of the first COBUILD dictionary.

Given the University of Birmingham's role in the history of corpus linguistics, it was only fitting that it hosted two weeks of corpus linguistics events this July. Starting with a summer school to train the next generation of corpus linguists, events included a workshop on Pattern Grammar, an approach developed out of the COBUILD project and first published in 1999 by Professor Susan Hunston, OBE, and Dr Gill Francis.

More than 300 international corpus linguists came to Birmingham for the biennial Corpus Linguistics conference and the Annual Sinclair Open Lecture. The plenaries showcased the interdisciplinary reach of corpus linguistics, from engagement with NLP approaches such as topic modelling and the challenges of identifying the aboutness of news downloads, to the study of spoken, multilingual and literary corpora and the applicability of corpus linguistics to engineering education.

In the same week that Scaramucci’s ‘colorful’ language made the news, linguists were listening to a presentation at CL2017 on ‘swearing in the Spoken BNC2014’. The Spoken BNC2014 is a new resource created by Cambridge University Press and the Centre for Corpus Approaches to Social Science (CASS) at Lancaster University. It contains transcriptions of spoken interactions that make it possible to quantify the use of bad language in British English.

Among presentations on spoken language, the AHRC-funded CLiC project illustrated the study of fictional speech in Dickens – even here we find examples of profanities. Conference delegates also learned about the NOW (“News on the Web”) corpus, which grows daily and so already enables us to analyse how Scaramucci’s stint at the White House was represented in the news over the course of its short duration.

Similarly, Twitter data allows us to trace developments over time and identify geographical patterns of language use – including swearing, as demonstrated by the work of Jack Grieve, who joined the University only last month as a professorial research fellow.

In the digital age, the range of resources for the scientific study of language is ever increasing. These resources might sometimes be called Big Data and be usefully shared across disciplines. However, corpus linguistics highlights that Big Data is often language data. Even if it isn’t, we need to make sense of it and articulate what we learn from its analysis. Language matters.

Professor Michaela Mahlberg

Chair in Corpus Linguistics, Director of the Centre for Corpus Research, University of Birmingham