Corpus statistics group

The corpus statistics group is a collaboration between the University of Birmingham and the University of Nottingham. It brings together researchers from corpus linguistics and statistics who are interested in investigating linguistic patterns across large electronic data sets.

Professor Michaela Mahlberg opening the group’s launch event.

A key focus of the group is to explore the reach of corpus and statistical methods across a range of disciplines in the humanities, social sciences and sciences. The group aims to provide opportunities for discussion of the availability of data sets, infrastructural needs and the challenges of developing appropriate tools. The first platform for this dialogue was the group’s launch event on 11 February 2016, supported by the EPSRC ISF. Members of the group presented work-in-progress research from the collaboration between the two host universities. Over 75 participants from a variety of disciplines and from institutions across the country attended the event and joined in the discussion.

Professor Michaela Mahlberg, chair of corpus linguistics at the University of Birmingham, opened the event by emphasizing the potential of interdisciplinary perspectives on analyzing corpus data. Researchers from the University of Nottingham – Dr Simon Preston, Dr Yves van Gennip and Anthony Hennessey – provided a mathematical perspective on working with corpora. Dr Preston’s talk explained how corpus analysis applies functions to the raw text data and illustrated the use of matrices in the analysis. Anthony Hennessey demonstrated the use of kernels to examine time dependency in the properties of a corpus. The talk by Dr van Gennip introduced graphical representations and clustering of the patterns in a corpus. A case study presented by Viola Wiegand illustrated how such clustering methods can be applied to graphs to highlight themes in the co-occurrence patterns of surveillance discourse.

A team of researchers from the University of Birmingham and the University of Cambridge – Dr Paul Thompson, Dr Akira Murakami and Professor Susan Hunston – presented findings from their ESRC-funded project Interdisciplinary Research Discourse. Their talk focused on the use of topic modeling in exploring a corpus of research articles. In another cross-institutional talk, the librarians Sarah Bull and Neil Smyth presented developments in the provision of library resources and recent changes in copyright law.

The event concluded with the keynote by Professor Laurence Anthony from Waseda University. As the developer of a multitude of corpus tools, including the popular AntConc, Professor Anthony was in an excellent position to present on ‘arguments for and against DIY corpus tools’ and programming.

The second event with contributions by the Corpus Statistics Group was the first Corpus Linguistics Summer School, held at the University of Birmingham from 20 to 24 June 2016.