Corpus statistics group

The corpus statistics group is a collaboration between the University of Birmingham and the University of Nottingham. It brings together researchers from corpus linguistics and statistics who are interested in investigating linguistic patterns across large electronic data sets.

Professor Michaela Mahlberg opening the group’s launch event.

A key focus of the group is exploring how far corpus-statistical methods reach across a range of humanities, social science and science disciplines. The group aims to provide opportunities for discussion around the availability of data sets, infrastructural needs and challenges in the development of appropriate tools. The first platform for this dialogue was the group’s launch event on 11 February 2016, supported by the EPSRC ISF. Members of the group presented work-in-progress research from the collaboration between the two host universities. More than 75 participants from institutions across the country and from a variety of disciplines attended the event and joined in the discussion.

Professor Michaela Mahlberg, chair of corpus linguistics at the University of Birmingham, opened the event by emphasizing the potential of interdisciplinary perspectives on analyzing corpus data. Researchers from the University of Nottingham – Dr Simon Preston, Dr Yves van Gennip and Anthony Hennessey – provided a mathematical perspective on working with corpora. Dr Preston’s talk explained how corpus analysis applies functions to the raw text data and illustrated the use of matrices in the analysis. Anthony Hennessey demonstrated the use of kernels to examine time dependency in the properties of a corpus. The talk by Dr van Gennip introduced graphical representations and clustering of the patterns in a corpus. A case study presented by Viola Wiegand illustrated how such clustering methods on graphs can highlight themes in the co-occurrence patterns of surveillance discourse. A team of researchers from the University of Birmingham and the University of Cambridge – Dr Paul Thompson, Dr Akira Murakami and Professor Susan Hunston – presented findings from their ESRC-funded project Interdisciplinary Research Discourse. Their talk focused on the use of topic modeling in exploring a corpus of research articles. In another cross-institutional talk, the librarians Sarah Bull and Neil Smyth presented developments in the provision of library resources and recent changes in copyright law.

The event concluded with the keynote by Professor Laurence Anthony from Waseda University. As the developer of numerous corpus tools, including the popular AntConc, Professor Anthony was well placed to present ‘arguments for and against DIY corpus tools’ and to open a debate about programming.

The need for Corpus Statistics: Corpus analysis and the identification of linguistically relevant patterns - Michaela Mahlberg (University of Birmingham)
Presentation slides (PDF - 719KB)
Getting to know your corpus: Applying Topic Modelling to a corpus of research articles - Paul Thompson (University of Birmingham), Akira Murakami (University of Cambridge) and Susan Hunston (University of Birmingham)
Presentation slides (PDF - 3.4MB)
Identifying surveillance discourses - Viola Wiegand (University of Birmingham)
Presentation slides (PDF - 6.4MB)
Corpus Analysis from a mathematical perspective - Simon Preston (University of Nottingham)
Presentation slides (PDF - 3.4MB)
Preliminary results on modelling time dependence in the Times Digital Archive - Tony Hennessey (University of Nottingham)
Presentation slides (PDF - 4.1MB)
Graphical representations of a corpus, and clustering on graphs - Yves van Gennip (University of Nottingham)
Presentation slides (PDF - 5.2MB)
The right to read is the right to mine library resources for cross-disciplinary work - Sarah Price (University of Birmingham) and Neil Smyth (University of Nottingham)
Presentation slides (PDF - 5.4MB)
[Keynote] Arguments for and against DIY corpus tools creation: A debate about programming - Laurence Anthony (Waseda University)
Presentation slides (PDF - 3.7MB)

The second event to feature contributions from the Corpus Statistics Group was the first Corpus Linguistics Summer School, held at the University of Birmingham from 20 to 24 June 2016.
