The 9th International Corpus Linguistics Conference will take place from Monday 24 to Friday 28 July at the University of Birmingham.


Corpus Linguistics is a biennial conference which has been running since 2001 and has been hosted by Lancaster University, the University of Liverpool, and the University of Birmingham.

At CL2017 we are proud to have the following plenary speakers:

Opening plenary

  • Andrew Hardie (Lancaster University, UK)
    'Exploratory analysis of word frequencies across corpus texts: towards a critical contrast of approaches'
    Tuesday 10:20 - 11:30


A recent trend in corpus linguistics is the adoption of Latent Dirichlet Allocation (LDA), already widely used by digital humanists (Blevins, 2010; Underwood, 2012) as a method for exploratory corpus analysis. LDA is a machine-learning approach to inducing structure in the content of a corpus based solely on word occurrence across texts or documents as data objects, one of a range of approaches usually, if potentially misleadingly, dubbed topic modelling. However, adopting this approach to the many-dimensional data of word frequency comes with a high price tag in terms of knowledge that the system ignores or makes nontransparent. The question this raises is whether that price tag is justified.

Various advantages have been asserted for LDA, albeit not without caveats (see Blei, 2012 for a selection of both). All such advantages notwithstanding, LDA has at least three substantive disadvantages. First, it is nondeterministic: randomisation is central to the algorithm. This is problematic from the perspective of scientific replicability for reasons too obvious to belabour. Second, its operation is opaque: the relationship between the underlying distribution data and the resulting statistical model is nontransparent to the analyst. Third, the theory of text generation underpinning the LDA algorithm is dubiously compatible with linguistic understandings of text, topic and discourse.

Moreover, although the lack of linguistic knowledge used in the construction of the model is presented as an advantage of LDA, this is equally characterisable as a disadvantage: the field of corpus analysis has invested much effort in the creation of precisely the knowledge resources which LDA is lauded for not requiring. What exactly does our acceptance of these disadvantages buy us? In examining this issue, we must venture comparisons to exploratory multivariate analysis approaches that are longer-established in corpus linguistics (cf. Biber, 1988, 1989).

Using example data drawn from the FLOB corpus, I will compare and contrast outcomes of different analytic procedures including LDA models and alternative approaches, with two questions in mind. First, to what extent are these outcomes compatible with one another? Second, to what extent are they transparently interpretable in linguistically meaningful terms?


Pre-conference opening speaker

  • Susan Hunston (University of Birmingham)
    'Corpus Linguistics in 2017: a personal view'
    Monday 18:00
    Muirhead Tower (R21 on the campus map), Lecture Theatre G15


This paper offers a personal reflection on the development of Corpus Linguistics over the last couple of decades. This time period has seen a massive increase in the amount of research in this field, while at the same time reflecting three persistent preoccupations: the primacy of comparison; the primacy of lexis; and an ongoing questioning of the relationship between theory and methodology.

I shall suggest that the major changes and developments in the field can be summarised in terms of five ‘turns’: the quantitative; the cognitive; the modality; the specialisation; and the paradigmatic. The paper will discuss each of these in turn and will offer some speculations about the future.

Keynote speakers

  • Mike Scott (Aston University, UK)
    'News downloads and aboutness'
    Wednesday 11:30 - 12:40


Many of us are using LexisNexis, Factiva or other online sources, often in order to study a specific topic within such overall fields as gender studies, journalism, history, sociology, medicine, psychology, law.

Among the issues raised by such downloads as supplied by an online search engine, there are choice of search-terms, duplicate articles, repeated sections within articles, online comments and discussion, disparities in formatting. But the main aim of the presentation is to focus on the problem of relevance: many of the articles retrieved may have a merely incidental mention of the desired topic. The main aboutness of such articles doesn’t really include the topic but concerns another, quite different one. For example, an article returned by a search on Brexit (Guardian, 12 January 2017) concentrates on problems in the UK’s National Health Service, contrasting these problems incidentally with the “theoretical risks of Brexit”, and claims that deficiencies in the Health Service are very obvious to ordinary voters. Its aboutness does include Brexit but at a very minor level.

The question we will be considering is then, how do we filter aboutness so as to reduce unwanted dross? There are various aspects of relevance to identify in order to find ways of filtering out irrelevance. One concerns identifying carefully what we are really seeking in the first place, since almost any topic such as climate change, austerity, Brexit has numerous aspects (legal, social, geographical etc.), some of which are more central (within the field of knowledge) than others (gardening, hill-walking, DIY). Once it is clear which aspect of our topic is wanted, means have to be found to get rid of the others. Easier said than done!

  • Christian Mair (University of Freiburg, Germany)
    'Downsizing and upgrading: why we need more spoken, more multilingual and more nonstandard corpora'
    Thursday 11:15 - 12:25


Today, students of English (and a few other mostly European languages) are privileged in that they can rely on extremely rich corpus-linguistic working environments. In a brief review of 50 years’ corpus-linguistic research I will demonstrate how the availability of increasingly large corpora and increasingly sophisticated tools for analysis has left a profound mark on the discipline of linguistics. Traditional descriptive work can now be carried out to higher empirical standards. More importantly, new areas of linguistic inquiry have been opened up to rigorous empirical investigation, and corpus-based research has given a general boost to usage-based theoretical frameworks of all kinds.

As I will show, however, the story of the past fifty years has not been one of undiluted progress and success. It seems that a “conspiracy” of technological and ideological factors has favoured the creation of large monolingual standard written corpora. Data which does not fit this template tends to be made to conform to it. For example, much corpus-based work on spoken English is based on transcriptions rather than the original audio or audiovisual recordings. Similarly, complex multilingual realities tend to be simplified in corpus-compilation, for example by annotating code-switches into other languages as “extra-corpus material.”

Today, corpus technology and corpus-linguistic theorising have advanced to such an extent that these biases can and should be redressed. In the digital textual universe in which the humanities and social sciences are all operating today, the classic definition of the corpus, as a usually digital database compiled by linguists for the purposes of linguistic analysis, has become increasingly difficult to uphold and corpus-linguistics will sooner or later merge with the digital humanities movement. A kind of corpus-linguistics which emphasises spoken, multilingual and nonstandard data more than has been the case in the past will make a richer contribution to this development.

  • Susan Conrad (Portland State University, US)
    'From a plate of spaghetti to a cable-stayed bridge: increasing the impact of corpus linguistics in disciplinary education'
    Thursday 17:00 - 18:30 (Sinclair Lecture)


In the 1980s, John Sinclair was instrumental in showing the profound impact corpus linguistics could have on our understanding of language. Now, ten years after his death, I want to urge corpus linguists to think again about having an impact – this time on fields that most people don't associate with language study, such as engineering.

Why does an engineer need corpus linguistics? How can corpus-based studies improve engineering education? What does it take to move from language descriptions to applications that encourage changes in what people do? What challenges face corpus linguists in working with professionals who don’t “speak linguistics”? These are the general questions I will address, using my work in the Civil Engineering Writing Project as a concrete example.

Begun in 2009, the Civil Engineering Writing Project is a corpus-based project that addresses a long-standing problem in engineering education: students' lack of preparation for writing in the workplace. Despite decades of discussion, there had been almost no empirical investigation of the problem in the United States. I immediately saw the role corpus linguistics could play in defining the problem, informing teaching materials, and assessing improvements. The project materials have now been piloted at four universities, with significant improvements in students’ writing.

My talk will include examples of the corpus-based analyses of words and grammar that helped us understand the gaps between student and practitioner writing. The analyses have, for example, clarified the highly controversial areas of passive voice and first person pronoun use, and highlighted the importance of clausal simplicity and certain word choice issues. They demonstrate that language choices are fundamental to effective engineering. However, the linguistic analyses have also become intertwined with techniques that are less typical in corpus studies. We maintain ongoing collaborations with professionals in the community, to mine their context expertise and get their help interpreting the linguistic findings. We interview students to gain insight into reasons behind their language patterns – insights that no amount of corpus analysis can reveal. We have made additions to the research methodology to include judgments of writing effectiveness, a transition from description to evaluation that is necessary for an applied project. And we are constantly seeking new ways of turning corpus analyses into information and practice that engineers value. Although the additional techniques increase the complexity of the project, I argue in this talk that expanding corpus research in these ways can make it more useful in more disciplines.

I will reflect on the successes and the continuing challenges of the project. How exactly the plate of spaghetti and the cable-stayed bridge figure in – well, that will become clear in the talk.

  • Dan McIntyre (University of Huddersfield, UK)
    'Just what is corpus stylistics?'
    Friday 11:15 - 12:35


Over a relatively short period of time, corpus linguistic methods have been embraced by a wide range of sub-disciplines of linguistics (and, more recently, by other disciplines entirely). Corpus linguistics has had a transformative effect on such areas as historical linguistics, child language acquisition and critical discourse analysis, to name but a few. In stylistics, corpus methods are increasingly being adopted, not least because of the influential work of corpus linguists such as Stubbs (2005) and Mahlberg (2013). Indeed, such is the popularity of the corpus approach in stylistics that it is now common to see the term corpus stylistics used to describe any stylistic work that utilises corpus methods. This adoption of corpus as a premodifier to designate a particular type of stylistics is unusual when compared against the practices of other sub-disciplines that use corpus methods. So just what is corpus stylistics and how, if at all, does it differ from corpus linguistics? My talk aims to offer answers to these questions by exploring how stylisticians have used corpora in their work. I begin with an overview of research in corpus stylistics before going on to consider issues with the presuppositions inherent in some definitions of the term. I then discuss topics in stylistics that have benefitted particularly from corpus methods. These include the analysis of speech and thought presentation (e.g. Semino et al., 1997; Semino & Short, 2004), where corpora have enabled the discovery of quantitative as well as semantic norms. Following this, I consider the washback effects that corpus linguistics has had on methodological practices in stylistics. I illustrate some of these by introducing a software tool called Worldbuilder, developed by linguists and computer scientists at the University of Huddersfield to provide a means of improving the systematicity of cognitive stylistic analyses that utilise Text World Theory (Werth, 1999). I suggest that the incorporation of basic principles from corpus linguistics such as data sampling and annotation is improving methodological and analytical practice in stylistics. Finally, having outlined the impact of corpus linguistics on stylistics, I consider what stylistics has to offer to corpus linguistics. I suggest that foregrounding theory, arguably the cornerstone of stylistics, offers valuable analytical insight when connected to notions of statistical salience.


  • Mahlberg, M. (2013). Corpus Stylistics and Dickens’s Fiction. Abingdon: Routledge.
  • Semino, E., Short, M. and Culpeper, J. (1997). Using a corpus to test and refine a model of speech and thought presentation. Poetics, 25(1), 17-43. doi:10.1016/S0304-422X(97)00007-7
  • Semino, E. and Short, M. (2004). Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Writing. London: Routledge.
  • Stubbs, M. (2005). Conrad in the computer: examples of quantitative stylistic methods. Language and Literature, 14(1), 5-24. doi:10.1177/0963947005048873
  • Werth, P. (1999). Text Worlds: Representing Conceptual Space in Discourse. London: Longman.

