Applying Corpus Linguistics: impacts of corpus research in a minoritised language context

Barber Institute - Lecture Theatre (R14 on campus map)
Monday 15 July 2024 (18:00-19:00)


Annual Sinclair Lecture 2024

Speaker: Professor Dawn Knight (Cardiff University)

This year’s John Sinclair lecture examines how corpus linguistic research can respond to real-world problems and challenges, showcasing what can be achieved through interdisciplinary and cross-institutional corpus linguistic research projects.

During this talk, Professor Dawn Knight will present a range of user-based case studies (and representative voices) on the impact of corpus-based applied linguistics, with a particular focus on the minoritised language context of Welsh. She will demonstrate how meaningful collaboration and co-creation with potential user groups drove the design and construction of three key resources in the Welsh language: the National Corpus of Contemporary Welsh (CorCenCC), the Geirfan word list and FreeTxt.

Launched in September 2020, CorCenCC is the first corpus of the Welsh language that includes contemporary spoken, written and electronically mediated language. CorCenCC extends to 11 million words, offering a snapshot of the Welsh language across a range of contexts of use. The Geirfan word list is a curated list of 500 of the most frequent words in the Welsh language, designed for use by learners at A1/A2 levels of proficiency. This vocabulary list was developed by combining corpus-based methods (using data from the CorCenCC corpus) with expert-led introspection and reflection, an approach which can be replicated and adapted for use in any other language context. Finally, FreeTxt is a novel open-source toolkit designed to support the analysis and visualisation of multiple forms of open-ended, free-text data in both English and Welsh. It is designed for non-expert users, with a focus on making the toolkit as widely accessible and intuitive as possible.

Through the development of FreeTxt in particular, Dawn and her collaborators have created a transformational approach which empowers end-users to direct and lead their own analyses of both small-scale and more extensive qualitative datasets, maximising the reach and potential impact of those analyses. The approaches explored in this presentation, and the resources developed, can be replicated and extended for use in other language contexts and across a range of public and professional sectors.

Speaker biography

Dawn Knight is a Professor of English Language and Applied Linguistics at Cardiff University, Wales. Her research interests lie in the areas of corpus linguistics, multimodality and discourse analysis. Dawn has expertise in conceptualising, theorising and applying innovative interdisciplinary approaches and methodologies for extracting and predicting language patterns within and across social and linguistic contexts. Her pioneering work on Welsh language resource development (including CorCenCC and FreeTxt), supported by major AHRC, ESRC and Welsh Government grants, is helping to change the landscape of minoritised language research and the potential real-world applications of corpora and corpus-based enquiry.

The lecture will be followed by a drinks reception in the Barber Institute foyer.

Cymhwyso Ieithyddiaeth Corpws: effeithiau ymchwil corpws mewn cyd-destun iaith leiafrifol

Mae darlith John Sinclair eleni yn archwilio sut y gall ymchwil iaith corpws ymateb i broblemau a heriau’r byd go iawn, gan arddangos yr hyn y gellir ei gyflawni drwy brosiectau ymchwil corpws ieithyddol rhyngddisgyblaethol a thraws-sefydliadol.

Yn ystod y sgwrs hon, bydd yr Athro Dawn Knight yn cyflwyno amrywiaeth o astudiaethau achos defnyddwyr (a lleisiau cynrychioliadol) ar effaith ieithyddiaeth gymhwysol ar sail corpws, gan ganolbwyntio’n benodol ar gyd-destun iaith leiafrifoledig y Gymraeg. Bydd yn dangos sut yr ysgogodd cydweithio a chyd-greu ystyrlon gyda grwpiau defnyddwyr posibl y gwaith o ddylunio ac adeiladu tri adnodd allweddol yn yr iaith Gymraeg: Corpws Cenedlaethol Cymraeg Cyfoes (CorCenCC), rhestr eiriau Geirfan a ThestunRhydd.

Wedi’i lansio ym mis Medi 2020, CorCenCC yw’r corpws cyntaf o’r Gymraeg sy’n cynnwys iaith lafar, ysgrifenedig ac electronig gyfoes. Mae CorCenCC yn ymestyn i 11 miliwn o eiriau, gan gynnig cipolwg ar y Gymraeg ar draws ystod o gyd-destunau. Mae rhestr eiriau Geirfan yn rhestr wedi’i churadu o 500 o’r geiriau mwyaf cyffredin yn yr iaith Gymraeg, wedi’i dylunio i’w defnyddio gan ddysgwyr ar lefelau hyfedredd A1/A2. Datblygwyd y rhestr eirfa hon gan ddefnyddio symbiosis arloesol o ddulliau seiliedig ar gorpws (gan ddefnyddio data o gorpws CorCenCC) a myfyrio dan arweiniad arbenigwyr; dull y gellir ei ailadrodd a'i addasu i'w ddefnyddio mewn unrhyw gyd-destun iaith arall. Yn olaf, pecyn cymorth ffynhonnell agored newydd yw TestunRhydd a ddyluniwyd i gefnogi dadansoddi a delweddu ffurfiau lluosog o ddata testun rhydd penagored yn y Gymraeg a’r Saesneg. Cafodd ei ddylunio ar gyfer defnyddwyr nad ydyn nhw’n arbenigwyr, a’r nod yw gwneud y pecyn cymorth mor hygyrch a greddfol ag y bo modd.

Trwy ddatblygu TestunRhydd, mae Dawn a’i chydweithwyr wedi creu dull trawsnewidiol sy’n grymuso defnyddwyr terfynol i gyfeirio ac arwain eu dadansoddiadau eu hunain o setiau data ansoddol graddfa fach a helaeth er mwyn cynyddu’r cyrhaeddiad a’r effaith bosibl a gynhyrchir. Gellir ailadrodd y dulliau a archwiliwyd yn y cyflwyniad hwn, a’r adnoddau a ddatblygwyd, i’w defnyddio mewn cyd-destunau ieithyddol eraill ac ar draws ystod o sectorau cyhoeddus a phroffesiynol.