The Centre for Corpus Research at Birmingham has a wide range of corpus resources and tools for research purposes. This page provides information on these resources.
Corpora and interfaces
Bank of English
For access, you need a Bank of English user name and password. The corpus is accessed over a Telnet connection. On Unix-based operating systems (e.g. Mac or Linux), open a 'Terminal' window and type: telnet titania.bham.ac.uk. On Windows, if Telnet is installed, go to 'Run' in the 'Start' menu and type: telnet titania.bham.ac.uk. In either case you will then be prompted to enter your user name and password. If you do not have Telnet, you can use a freeware programme called 'PuTTY' to create a 'Telnet' connection.
The Bank of English is now also available on CQPWeb, hosted on a Birmingham server. This allows access to the corpus and (in the course of time) other corpora through the CQPWeb interface. You need a bham.ac.uk email address to register. Registration is at http://www.cqpweb.bham.ac.uk/usr/?thisQ=create&uT=y.
Thereafter you can use the portal at http://www.cqpweb.bham.ac.uk/
British Sign Language Corpus Project
Access to some of the video data and ELAN annotation files that form the British Sign Language (BSL) Corpus, based at University College London, is available at http://www.bslcorpusproject.org/data/. Creating the British Sign Language Corpus was a joint venture involving five UK universities during 2008-2011, led by Dr Adam Schembri, who is now based here at the University of Birmingham and continues to work on corpus-based approaches to the study of BSL linguistics.
CLiC is a web application for the corpus linguistic analysis of Dickens’s novels and other literary texts. The web app is being developed as part of the CLiC Dickens project, a collaboration between the University of Birmingham and the University of Nottingham, funded by the AHRC. For the source code, information on updates, and bug reports, please see the CLiC GitHub page.
CorporaCoCo is an R package that identifies statistically significant co-occurrence count differences between two corpora and reports an effect size and confidence interval for each of the identified differences. The package produces high-quality, customizable plots for use in reports.
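To give a rough idea of what comparing co-occurrence counts between two corpora involves, here is a minimal Python sketch. It is not CorporaCoCo's implementation and does not use its API: the function names, the two toy corpora, the span of two tokens and the smoothed log2 ratio used as an effect size are all assumptions made for illustration, and the significance testing and confidence intervals that the package reports are left out.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, node, span=2):
    """Count words occurring within `span` tokens either side of each `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - span)
        end = min(len(tokens), i + span + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def log2_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Smoothed log2 ratio of relative co-occurrence frequencies (illustrative effect size)."""
    rate_a = (count_a + smoothing) / (total_a + smoothing)
    rate_b = (count_b + smoothing) / (total_b + smoothing)
    return math.log2(rate_a / rate_b)


# Two toy corpora, invented for this example.
corpus_a = "the old man sat by the fire and the old dog slept".split()
corpus_b = "the young man ran to the river and the young dog barked".split()

counts_a = cooccurrence_counts(corpus_a, "the")
counts_b = cooccurrence_counts(corpus_b, "the")
total_a = sum(counts_a.values())
total_b = sum(counts_b.values())

# Positive values: the word co-occurs with "the" relatively more often in corpus A.
effect = {
    word: log2_ratio(counts_a[word], total_a, counts_b[word], total_b)
    for word in set(counts_a) | set(counts_b)
}

for word, value in sorted(effect.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:>6}  {value:+.2f}")
```

Words with a positive value co-occur with the node word relatively more often in the first corpus, and words with a negative value more often in the second. With real corpora, each difference would also need a significance test before it is reported.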
CorporaCorpus is an R data package containing a collection of small corpora. The package currently includes the 'DNov' corpus containing the 15 Dickens novels, and the '19C' corpus containing 29 other 19th century novels. The package is designed to make it easy to get the corpora into R and the documentation contains worked examples of how to ingest and process the corpora.
EuroCoAT (European Corpus of Academic Talk) provides transcripts of academic conversations between undergraduate Erasmus students (L1 Spanish) and their lecturers at different host universities. The EuroCoAT project is a collaboration between the Universities of Extremadura, Birmingham, Limerick, Dalarna and VU Amsterdam.
BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). You need a bham.ac.uk email address to register. Registration is at http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php, and thereafter you can use the portal at http://bncweb.lancs.ac.uk.
Institutional access to the Sketch Engine interface is available on any computer on the University network (but not outside the network). Approximately 160 corpora are included.
Institutional access to the Wordbanks Online service is available on any computer on the University network (but not outside the network). Wordbanks Online is the HarperCollins interface for the 550 million word version of the Bank of English.
Birmingham University users can use the networked version of Wordsmith Tools from any machine on the university network, provided that they have logged into the network. A PDF worksheet explains how to access the programme and the basics of using it.
CLAWS Part of Speech Tagger
The Centre has a licence for the CLAWS tagger (UCREL, Lancaster). Staff or students who are interested in POS-tagging large quantities of data for research should contact Paul Thompson.
The Centre also has a licence for the WMatrix suite of semantic and POS annotation and analysis tools developed by Paul Rayson (UCREL, Lancaster). Staff or students who are interested in using WMatrix for research should contact Paul Thompson.
The Centre has a large number of corpora for use by researchers and students at the university. These include:
- AHRC corpora
- Australian Corpus of English
- British Academic Spoken English corpus
- British Academic Written English corpus
- British National Corpus
- Bank of English
- Brown Corpus
- Freiburg-LOB (FLOB)
- Freiburg-Brown (Frown)
- German Parole
- Global Web-Based English (GloWbE)
- The Helsinki Corpus of English Texts: Diachronic Part
- The Helsinki Corpus of Older Scots
- Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET)
- The International Corpus of English - East African component
- Italian Corpus
- Italian Newspaper
- Kolhapur Corpus (India)
- Lancaster/IBM Spoken English Corpus (SEC)
- LOB Corpus
- London Lund Corpus
- MICASE corpus
- Multilingual Plato
- Newdigate Newsletters
- Polytechnic of Wales Corpus
- Wellington New Zealand Spoken
- Wellington New Zealand Written
- Wolverhampton Business English
Access to some of these corpora is restricted; for further information, contact Paul Thompson at firstname.lastname@example.org