CL2017 Pre-conference workshop 8

University of Birmingham, Monday 24 July 2017, 13:30 - 16:30

CMLC 5 + Big NLP 2017

IMPORTANT: Please note that this workshop now incorporates Workshop 5, The 11th Web as Corpus Workshop (WAC-XI). CMLC 5 was originally conceived as a half-day afternoon workshop, but will now begin in the late morning with a WAC ‘guest session’ of 3 papers, chaired by Stefan Evert. The afternoon session will then proceed normally as a half-day session devoted to CMLC papers.

Workshop convenors

Workshop summary

This workshop continues the successful series of “Challenges in the management of large corpora” events (previously hosted at LREC conferences and CL2015) and will be organised jointly with the second event in the Big-NLP series, which began last year at the IEEE Big Data 2016 conference. This will allow us to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing and data science. Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval and curation to be of use for a wide range of research questions and users across a number of disciplines. More historical archives are being digitised, more publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media. A number of key themes and questions of interest to the contributing research communities emerge: (a) is having more data always better? (b) is the full range of text types available online, and what quality issues should we be aware of? (c) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (d) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (e) what are the key legal and ethical issues related to the use of large corpora?

A volume of proceedings is planned.

Call for papers

We welcome papers that focus on the union of the standard topics of CMLC and Big NLP. Topics include the following:

Technical issues:

  • Storage and retrieval solutions for big textual data corpora: primary data, metadata, and annotation data
  • Scalable and efficient NLP tooling for annotating and analysing large datasets
  • Distributed and GPGPU computing; using big data analysis frameworks (Hadoop, Spark, etc.) for language processing

Licensing, legal and privacy issues:

  • Licensing models of open and closed data
  • Coping with intellectual property restrictions

Linguistic content issues:

  • Dealing with the variety of language: multilinguality, historical texts, user-generated content, etc.
  • Integration of human computation (crowdsourcing) and automatic annotation
  • Quality management of annotations

Exploitation issues:

  • Query languages
  • Innovative approaches for aggregation and visualisation of text analytics

In the tradition of CMLC, we also invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.

Important dates

  • Submission deadline: 12th March 2017, midnight UTC
  • Notification of acceptance: 18th April 2017
  • Camera-ready papers due: 18th June 2017

Workshop home page

Joint Organising Committee

  • Piotr Bański (IDS Mannheim)
  • Adrien Barbaresi (ICLTT Vienna)
  • Hanno Biber (ICLTT Vienna)
  • Evelyn Breiteneder (ICLTT Vienna)
  • Simon Clematide (University of Zurich, CH)
  • Marc Kupietz (IDS Mannheim)
  • John Mariani (Lancaster University, UK)
  • Harald Lüngen (IDS Mannheim)
  • Paul Rayson (Lancaster University, UK)
  • Mark Stevenson (Sheffield University, UK)

Programme committee

  • Felix Bildhauer (IDS Mannheim)
  • Steve Cassidy (Macquarie University)
  • Damir Ćavar (Indiana University, USA)
  • Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
  • Mark Davies (BYU, USA)
  • Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
  • Stefan Evert (Friedrich-Alexander-Universität Erlangen-Nürnberg)
  • Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
  • Johannes Graën (University of Zurich, Switzerland)
  • Andrew Hardie (Lancaster University, UK)
  • Serge Heiden (University of Lyon, France)
  • Miloš Jakubíček (Lexical Computing Ltd.)
  • Michal Křen (Charles University, Prague)
  • Sandra Kübler (Indiana University, USA)
  • Jochen Leidner (Thomson Reuters, UK)
  • Piotr Pęzik (University of Łódź, Poland)
  • Adam Przepiórkowski (Polish Academy of Sciences)
  • Laura Irina Rusu (IBM Australia)
  • Roland Schäfer (FU Berlin)
  • Roman Schneider (Institut für Deutsche Sprache, Germany)
  • Serge Sharoff (University of Leeds)
  • Gandhi Sivakumar (IBM Australia)
  • Dan Tufiş (Romanian Academy, Bucharest)
  • Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)
  • Amir Zeldes (Georgetown University, USA)