Frequency is overrated: Using text dispersion to measure word importance

Tuesday 20 February 2018 (16:15-17:15)



Speaker: Jesse Egbert (Northern Arizona University, US)
Venue: ERI (G52) European Research Institute


Frequency has long been treated as the primary indicator of a word’s importance in a corpus. The corpus frequency approach has been used for a range of applications, including the creation of vocabulary lists and the identification of keywords. However, corpus frequency information alone has major drawbacks. Words are prone to uneven or “bursty” distributions across texts in corpora, making frequency an unreliable measure of word importance (Kilgarriff, 1996; Leech & Rayson, 2014). In this talk I introduce two new methods, one for keyword analysis and one for measuring lexical dispersion, both based on the dispersion of words across texts in a corpus. For keyword analysis, text dispersion keyness is compared with traditional measures of keyness, showing that the new dispersion-based measure produces higher-quality keyword lists (Egbert & Biber, in press). For the lexical dispersion measure, comparisons are made between DA (measured across texts) and the traditional approach of measuring dispersion across arbitrary corpus parts of equal size. The results reveal clear advantages of the text-based dispersion approach (Burch, Egbert & Biber, 2017; Egbert, Burch & Biber, in preparation). Together, these studies strongly suggest that word importance is best measured using text dispersion rather than corpus frequency.
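To make the contrast in the abstract concrete, the sketch below compares raw corpus frequency with a simple text-based count: the number of texts in which a word occurs at least once. It is only an illustration under stated assumptions. The toy corpus and the function name are hypothetical, and it implements neither the text dispersion keyness statistic nor DA as used in the cited studies.

```python
from collections import Counter


def frequency_and_text_count(texts):
    """Return two Counters for a list of tokenized texts:
    total occurrences per word, and number of texts containing each word."""
    freq = Counter()
    text_count = Counter()
    for tokens in texts:
        freq.update(tokens)             # every occurrence counts
        text_count.update(set(tokens))  # each text counts at most once per word
    return freq, text_count


# Hypothetical toy corpus: three short "texts", already tokenized.
texts = [
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
    "zebra zebra zebra zebra zebra".split(),
]

freq, text_count = frequency_and_text_count(texts)
for word in ("the", "zebra"):
    share = text_count[word] / len(texts)
    print(f"{word}: frequency={freq[word]}, texts={text_count[word]}, share={share:.2f}")
```

In this toy corpus "zebra" is more frequent than "the" but occurs in only one text, which is the kind of bursty distribution the abstract warns about. A text-based count ranks "the" higher because it appears in two of the three texts.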


Jesse Egbert is Assistant Professor of Applied Linguistics at Northern Arizona University, where he received a Ph.D. in Applied Linguistics in 2014. Prior to joining the faculty at NAU, he was Assistant Professor in the Linguistics Department at Brigham Young University. Jesse specializes in register variation, particularly in academic and online writing. His research also explores issues related to quantitative linguistic research, including corpus design and representativeness, methodological triangulation, and the application of advanced statistical techniques to language data. He is General Editor of the new journal Register Studies. He is co-Editor of the book Triangulating Methodological Approaches in Corpus Linguistic Research (Routledge, 2016) and co-author of the book Register Variation Online (Cambridge, forthcoming). He has published more than 40 book chapters and articles in journals such as Language Variation and Change, Corpus Linguistics and Linguistic Theory, Journal of English Linguistics, Journal of the Association for Information Science and Technology, and International Journal of Corpus Linguistics.