Implications and applications of dependency-based phraseology extraction

ERI Building, Pritchatts Road, Room G51
Wednesday 17 February 2016 (17:00)

Paul Thompson:

Photograph of Piotr Pęzik


Piotr Pęzik is an assistant professor in the Institute of English, University of Lodz, with a PhD in linguistics. He is a member of corpus and computational linguistics research groups and projects in Poland (PELCRA, National Corpus of Polish, CLARIN-PL) and a former member of the text-mining group at the European Bioinformatics Institute. He has designed and implemented a number of online search engines for written and spoken corpora, automatic combinatorial dictionaries and text mining systems. His recent research interests include corpus-based studies of linguistic prefabrication and practical applications of automatic phraseology extraction.


Collocation extraction methods are used in corpus-based studies of phraseology as well as in lexicography and language pedagogy. Most existing methods of identifying potential phraseological units in corpora rely on either positionally or relationally defined co-occurrences between their constituents (Evert 2005). Once such co-occurrences have been identified in a reference corpus, they can be ranked by their frequency, strength of association, dispersion or independence. Approaches can also be combined in order to maximise the precision and/or the recall of automatic phraseology extraction (PE). PE systems can be used in an ad hoc fashion to summarize the results of corpus queries (Church and Hanks 1990), but they can also serve to generate automatic combinatorial dictionaries (Kilgarriff & Rychlý 2010).
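The positional approach mentioned above can be illustrated with a minimal sketch. It assumes a toy corpus given as lists of tokens, counts ordered word pairs within a fixed window, and ranks them by pointwise mutual information, an association measure in the spirit of Church and Hanks (1990). The function name, the window size and the frequency threshold are illustrative assumptions, and the probability estimates are deliberately crude.

```python
import math
from collections import Counter


def pmi_ranked_pairs(sentences, window=2, min_count=2):
    """Rank positionally defined word pairs by pointwise mutual information."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            # Pair each word with the words that follow it within the window.
            for right in tokens[i + 1 : i + 1 + window]:
                pair_counts[(left, right)] += 1

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scored = []
    for (left, right), count in pair_counts.items():
        if count < min_count:
            continue  # a frequency threshold filters out unreliable rare pairs
        p_pair = count / total_pairs
        p_left = word_counts[left] / total_words
        p_right = word_counts[right] / total_words
        scored.append(((left, right), math.log2(p_pair / (p_left * p_right))))
    # Strongest associations first.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Frequency, dispersion or other association measures could be substituted for the score in the same loop; the sketch only shows the general shape of a count-then-rank pipeline.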

In this presentation I will introduce a dependency-based method of extracting phraseology from large reference corpora. In general, dependency grammars share the basic assumption ‘that syntactic structure consists of lexical elements linked by binary asymmetrical relations called dependencies’ (Nivre 2005: 2). It can be argued that the basic assumptions behind dependency representations, which are derived from simple binary relations between words, are potentially relevant to the study of collocations, collocational chains and collocational networks. After all, collocational relations often have highly predictable syntactic exponents. The approach to phraseology extraction proposed here is more directly inspired by the so-called Continuity Constraint on idioms (O’Grady 1998), which has been revisited in recent work on dependency syntax (Osborne et al. 2012). The method assumes that most phraseological units form lexically recurrent dependency trees or ‘catenae’, even when they are not complete or formally valid syntactic constituents. I will argue that it is possible to use this assumption to automatically identify a large variety of potential phraseological units, including lexical and grammatical collocations, idioms, speech formulas, lexical bundles and many more. One of the clear advantages of this approach over most positional collocation extraction methods is that it provides a more robust mechanism for extracting higher-order phraseological units such as collocational chains and even sentential idioms.
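The idea of lexically recurrent catenae can be sketched in a few lines. The example below is not the method presented in the talk; it is a simplified illustration that assumes each sentence has already been parsed into (word form, head index) pairs, with -1 marking the root. It enumerates every connected set of nodes in the dependency tree up to a maximum size and counts how often the same word sequence recurs. The input format, function names and thresholds are all assumptions, and identifying a catena by its word forms in linear order ignores differences in tree shape.

```python
from collections import Counter, defaultdict


def catenae_rooted_at(node, children, max_size):
    """Return all connected node sets whose highest node is `node`."""
    if max_size < 1:
        return []
    results = [frozenset([node])]
    for child in children[node]:
        child_sets = catenae_rooted_at(child, children, max_size - 1)
        extended = []
        # Optionally attach one connected set from this child's subtree.
        for base in results:
            for sub in child_sets:
                if len(base) + len(sub) <= max_size:
                    extended.append(base | sub)
        results.extend(extended)
    return results


def recurrent_catenae(parsed_sentences, max_size=3, min_count=2):
    """Count multi-word catenae (as word-form tuples) across parsed sentences."""
    counts = Counter()
    for sentence in parsed_sentences:
        children = defaultdict(list)
        for index, (_, head) in enumerate(sentence):
            if head >= 0:
                children[head].append(index)
        for node in range(len(sentence)):
            for nodes in catenae_rooted_at(node, children, max_size):
                if len(nodes) > 1:
                    # Simplification: key on word forms in linear order.
                    key = tuple(sentence[i][0] for i in sorted(nodes))
                    counts[key] += 1
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical hand-annotated parses: (word form, index of head, -1 = root).
parses = [
    [("He", 1), ("spilled", -1), ("the", 3), ("beans", 1), ("yesterday", 1)],
    [("She", 1), ("spilled", -1), ("all", 4), ("the", 4), ("beans", 1)],
]
recurrent = dict(recurrent_catenae(parses))
```

In the second parse the words of "spilled the beans" are not adjacent in the surface string, yet they still form a connected piece of the tree, which is the kind of case where a purely positional window tends to struggle.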

Following this introduction, I will demonstrate an online application called Phrime. Phrime uses dependency-based phraseology extraction to a) identify potentially prefabricated expressions in large reference corpora, b) organize them into customizable combinatorial dictionaries, c) detect phraseology in user-submitted texts and d) develop data-driven phraseodidactic materials. The presentation will conclude with some theoretical implications of dependency-based combinatorial dictionaries for our assessment of the incidence of phraseological prefabrication in language.


  • Church, Kenneth Ward, and Patrick Hanks. “Word Association Norms, Mutual Information, and Lexicography.” Computational Linguistics 16, no. 1 (March 1990): 22–29.
  • Evert, Stefan. “The Statistics of Word Cooccurrences: Word Pairs and Collocations.” PhD dissertation, Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, 2005.
  • Kilgarriff, Adam, and Pavel Rychlý. “Semi-Automatic Dictionary Drafting.” In A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks, edited by Gilles-Maurice De Schryver, 299–312. Kampala: Menha Publishers, 2010.
  • O’Grady, William. “The Syntax of Idioms.” Natural Language & Linguistic Theory 16, no. 2 (1998): 279–312.
  • Osborne, Timothy, Michael Putnam, and Thomas Groß. “Catenae: Introducing a Novel Unit of Syntactic Analysis.” Syntax 15, no. 4 (December 2012): 354–96.
  • Nivre, Joakim. “Dependency Grammar and Dependency Parsing.” MSI Report 5133, no. 1959 (2005): 1–32.

If there are any queries about this talk, please contact Paul Thompson at