CL2017 Pre-conference workshop 5

University of Birmingham, Monday 24 July 2017, 09:30 - 16:30

The 11th Web as Corpus Workshop (WAC-XI)


IMPORTANT: Please note that this workshop has now been incorporated into Workshop 8, CMLC 5 + Big NLP 2017. CMLC 5 was originally conceived as a half-day afternoon workshop, but will now begin in the late morning with a WAC ‘guest session’ of 3 papers, chaired by Stefan Evert. The afternoon session will then proceed normally as a half-day session devoted to CMLC papers. 

If you have already registered for Workshop 5, you can either transfer your registration to the new joint workshop as described above (in which case you don’t need to do anything), or contact the Conference Event Team at cl2017@contacts.bham.ac.uk if you wish to cancel your registration and receive a refund.

Workshop convenors

Workshop summary

For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially the highly successful Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and use of web-derived corpora and other types of corpora containing forms of computer-mediated communication. Past workshops were co-located with major conferences on corpus linguistics and/or computational linguistics (such as ACL, EACL, Corpus Linguistics, LREC, NAACL, WWW). Virtually all creators of large web corpora (such as Aranea, COW, SketchEngine) have been an active part of the WAC scene for many years, and a WAC workshop co-located with the Corpus Linguistics conference can serve as an optimal platform bringing together creators and users of web corpora from different traditions of corpus linguistics and language technology. Similar WAC meetings were held successfully in 2005 (WAC-1, Birmingham) and 2013 (WAC-8, Lancaster).

Even though the WAC workshops have a ten-year tradition, the field is still young compared to corpus linguistics as a whole, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., the assessment of corpus composition or the handling of web spam and duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, automatic generation of document-level meta data, or large-scale parallelization to achieve web-scale corpus construction).

The eleventh Web as Corpus workshop (WAC-XI) emphasizes the linguistic aspects of web corpus research more than the technological aspects while keeping in mind that the two are inseparable, as we will elaborate in the following paragraphs.

The World Wide Web has become increasingly popular as a source of linguistic data, not only within the computational linguistics community, but also with theoretical linguists facing problems such as data sparseness or the lack of variation in written data. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres and text types. In lexicography, web data have become a major and well-established source of data mostly thanks to the SketchEngine corpora and infrastructure (Kilgarriff et al. 2014), although internet lexicography is not restricted to the SketchEngine (e.g., Geyken 2014). In other areas of linguistics, the adoption of web corpora has been slower compared to lexicography, but the number of publications is rising. In recent years, web data have been used, for example, in morphology (e.g., Battefeld et al. 2016, van Goethem & Hüning 2015, Norde & van Goethem 2015, Schäfer 2016a), construction grammar and syntax (Flach 2016, Stefanowitsch & Flach 2016, Tummers et al. 2015, Zeldes 2012), graphemics (Schäfer & Sayatz 2016) as well as computational linguistics (Lapesa & Evert 2014) and psycholinguistics or neurolinguistics (Willems et al. 2015). Some areas of research dealing exclusively with web data have even emerged, such as the construction of corpora from Twitter data (e.g., Scheffler 2014). Another example is the (manual or automatic) classification of web texts by genre, register, or – more generally speaking – text type, as well as topic area (e.g., Mehler et al. 2011, Dalan & Sharoff 2016, Schäfer & Bildhauer 2016).

Similarly, the areas of corpus evaluation and corpus comparison have been advanced greatly through the rise of web corpora (e.g., Biemann et al. 2013 and many publications by Adam Kilgarriff as summarized in chapter 5 of Schäfer & Bildhauer 2013), mostly because web corpora (especially larger ones in the region of several billion tokens) are often created by downloading texts from the web unselectively with respect to their text type or content. Comparing web corpora to corpora that have been compiled in a traditional way is therefore key in determining the ‘quality’ of web corpora with respect to certain research questions. In other words, while the composition (or stratification) of large web corpora cannot be determined before their construction, it is desirable to at least evaluate it after their construction. Highly specific to web corpora are questions regarding the document collection strategies used in downloading the data – in other words, the crawling algorithms. While such questions might seem to be purely technical in nature at first sight, they involve deeply theoretical questions of corpus design (Schäfer 2016b).
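One widely used family of corpus-comparison methods operates on word frequency lists, ranking words by how strongly their relative frequencies differ between two corpora. The following is a minimal, illustrative sketch of Dunning-style log-likelihood keyword extraction; it is not the implementation of any specific tool mentioned here, and the toy corpora and function names are our own.

```python
import math
from collections import Counter

def log_likelihood(freq1, freq2, n1, n2):
    """Simplified Dunning-style log-likelihood score for one word,
    given its frequencies in two corpora of n1 and n2 tokens."""
    expected1 = n1 * (freq1 + freq2) / (n1 + n2)
    expected2 = n2 * (freq1 + freq2) / (n1 + n2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

def keywords(corpus_a, corpus_b, top=5):
    """Rank words by how strongly their frequency differs
    between two tokenised corpora (lists of word tokens)."""
    fa, fb = Counter(corpus_a), Counter(corpus_b)
    na, nb = len(corpus_a), len(corpus_b)
    scored = [(w, log_likelihood(fa[w], fb[w], na, nb))
              for w in set(fa) | set(fb)]
    return sorted(scored, key=lambda pair: -pair[1])[:top]

# Toy 'web' vs. 'traditional' samples, purely for illustration.
web = "click here to read more about our great offer".split()
trad = "the study of language requires careful analysis of data".split()
```

Words scoring highest in such a comparison are those most over- or under-represented in one corpus relative to the other, which is one way to make the composition of an unselectively crawled corpus visible after the fact.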

Call for papers

The eleventh Web as Corpus workshop (WAC-XI) takes a (corpus) linguistic look at the state of the art in all these areas. More specifically, in some of the abovementioned publications featuring linguistic case studies, the authors explicitly discuss and/or defend the validity of web corpus data for a specific type of research question while others simply take web corpora as just another source of data. For example, Stefanowitsch & Flach (2016) discuss web data in the context of the corpus-as-input hypothesis in cognitively oriented theoretical linguistics. Accordingly, the aim of the 11th Web as Corpus Workshop is to provide a platform for researchers to present the following:

  • case studies in corpus or computational linguistics where web data have been used
  • research specifically related to the validity of web data in corpus, computational, and theoretical linguistics
  • research on the technical aspects of web corpus construction which have a strong influence on theoretical aspects of corpus design (cf. Schäfer & Bildhauer 2013, Barbaresi 2015, and many papers in the proceedings of previous Web as Corpus workshops).

While corpus linguists are putatively more sensitive to questions of data quality, we do not consider computational linguists who use web corpora to be facing a totally different situation, and we in no way intend to exclude computational linguistics from the scope of WAC-XI. Hence, the workshop is open to all types of research pertaining to web corpora (see also below), and we are specifically interested in papers by corpus linguists and computational linguists addressing questions (either as part of a case study or in the form of primary research) such as:

  • Are there substantial differences in theoretical inferences when web data are used instead of data from traditionally compiled corpora? If so: Why? Are they expected?
  • Do findings from traditionally compiled corpora and web corpora converge when compared with evidence from other sources (such as psycholinguistic experiments)? If not: Which type of data matches the external findings better?
  • Is it possible to analyse lectal variation with web corpora, given the frequent lack of relevant meta data?
  • How good is the quality of the (automatic) linguistic annotation of web data compared to traditionally compiled corpora? How does annotation quality affect empirical linguistic research based on web corpora? What could corpus designers do to improve it?
  • Are there differences with regard to the dispersion of linguistic entities in web corpora compared to traditionally compiled corpora? If so: Why? Does it matter? How can we deal with it or even profit from it?
  • How do very large web corpora compare to smaller, more intentionally stratified web corpora created for a specific task? How can it be decided which type of corpus is better for a given research question?

As part of the workshop and consistent with its general theme, we plan to organize a panel discussion as the first meeting of the CleanerEval shared task on combined paragraph and document quality detection for (web) documents. The CleanerEval shared task follows the successful CleanEval shared task organized by SIGWAC in 2006. While CleanEval focussed specifically on so-called boilerplate removal (the removal of automatically inserted and frequently repeated non-corpus material from web pages), CleanerEval goes beyond this and asks for systems that determine the linguistic quality of paragraphs and whole documents in an automatic fashion, such that corpus designers can decide whether or not to include them in their corpus. In the CleanerEval setting, boilerplate paragraphs are paragraphs with low quality, but there might be other, non-boilerplate paragraphs with low quality as well. CleanerEval was proposed by the organizers of WAC-XI during the final discussion of WAC-X, where the proposal was met with enthusiasm. The WAC-XI panel discussion is intended to serve as a platform for the development of the operationalization of the notions of paragraph and document quality, the annotation guidelines, and the final schedule for the shared task. There can be no doubt that corpus linguists should have a say in what counts as good corpus material and what does not, and that this question is not at all a purely technical one. The final meeting of the shared task is planned to be part of WAC-XII in 2018.
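To illustrate the kind of system CleanerEval asks for, here is a deliberately naive sketch of a paragraph-quality scorer. The features (stopword density and paragraph length), the weights, and the threshold are illustrative assumptions of ours, not the task's official definition; real systems would use richer features and learned models.

```python
# Hypothetical illustration of paragraph-quality scoring; NOT the
# official CleanerEval task definition or baseline.

# A tiny English stopword list, sufficient for the toy examples below.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "it",
             "that", "for", "on", "with", "as", "this", "are"}

def quality_score(paragraph: str) -> float:
    """Score a paragraph between 0 (boilerplate-like) and 1 (text-like)
    using two crude cues: stopword density and paragraph length."""
    tokens = paragraph.lower().split()
    if not tokens:
        return 0.0
    stopword_ratio = sum(t.strip(".,;:!?") in STOPWORDS
                         for t in tokens) / len(tokens)
    length_cue = min(len(tokens) / 25.0, 1.0)  # short fragments score low
    return 0.5 * stopword_ratio + 0.5 * length_cue

def keep(paragraph: str, threshold: float = 0.3) -> bool:
    """Corpus-inclusion decision: keep paragraphs above the threshold."""
    return quality_score(paragraph) >= threshold
```

Navigation menus and similar boilerplate ("Home | About | Contact") score low on both cues, while running text scores high; the harder cases the shared task targets are non-boilerplate paragraphs of low linguistic quality, which such surface heuristics cannot reliably separate.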

To summarize, the 11th Web as Corpus workshop (WAC-XI) invites reports of the use of web corpora in (corpus) linguistics and language technology, for example:

  • usage-based linguistics
  • cognitively oriented empirical linguistics, psycholinguistics, neurolinguistics
  • lectal and stratal analyses of linguistic web data
  • linguistic studies of rare and non-standard phenomena in web data
  • linguistic studies of web-specific forms of communication
  • web-specific lexicography, grammaticography, and language documentation
  • information extraction & opinion mining
  • language modeling, distributional semantics

as well as work related to web corpus construction, including but not restricted to:

  • data collection (large web corpora, smaller web corpora created for specific tasks, and other types of corpora of computer-mediated communication)
  • cleaning and handling of noise
  • quality evaluation at the document, paragraph, and sentence levels
  • duplicate removal and document filtering
  • linguistic post-processing (especially for non-standard web data)
  • automatic generation of meta data (including content, genre, register, etc.)

Furthermore, aspects of usability and availability of web-derived corpora have always been highly relevant in the context of the WAC workshops, and WAC-XI is no exception. Topics of interest include the following:

  • development of user interfaces
  • visualization techniques
  • tools for statistical analysis of very large corpora
  • long-term archiving
  • documentation and standardization
  • legal and ethical issues

Organising committee

References

Barbaresi, A. (2015), ‘Ad hoc and general-purpose corpus construction from web sources’, PhD thesis, ENS Lyon, 2015. https://tel.archives-ouvertes.fr/tel-01167309

Battefeld, M.; Leuschner, T. & Rawoens, G. (2016), Evaluative Morphology in German, Dutch and Swedish: Constructional Networks and the Loci of Change, in Coussé, E.; Norde, M.; Vanderbauwhede, G. & Van Goethem, K., eds., ‘Category Change from a Constructional Perspective’, Amsterdam/Philadelphia: Benjamins.

Biemann, Chr.; Bildhauer, F.; Evert, St.; Goldhahn, D.; Quasthoff, U.; Schäfer, R.; Simon, J.; Swiezinski, L.; Zesch, T. (2013), ‘Scalable construction of high-quality Web corpora’, Journal for Language Technology and Computational Linguistics, 28(2), 23–59.

Dalan, E. & Sharoff, S. (2016), Genre classification for a corpus of academic webpages, in Cook, P.; Evert, St.; Schäfer, R. & Stemle, E., eds., ‘Proceedings of WAC-X’.

Flach, S. (2016), Let’s go look at usage: A constructional approach to formal constraints on go-VERB, in Thomas Herbst & Peter Uhrig, ed., ‘Yearbook of the German Cognitive Linguistics Association’, Berlin: De Gruyter.

Geyken, A. (2014), Methoden bei der Wörterbuchplanung in Zeiten der Internetlexikographie, in Heid, U.; Schierholz, St.; Schweickard, W.; Wiegand, H. E.; Gouws, R. H.; Wolski, W., eds., ‘Lexicographica’. Berlin/New York: De Gruyter, 77–112.

Goethem, K. V. & Hüning, M. (2015), ‘From noun to evaluative adjective: conversion or debonding? Dutch top and its equivalents in German’, Journal of Germanic Linguistics 27(4), 365–408.

Kilgarriff, A.; Baisa, V.; Bušta, J.; Jakubíček, M.; Kovář, V.; Michelfeit, J.; Rychlý, P. & Suchomel, V. (2014), ‘The Sketch Engine: ten years on’, Lexicography 1(1), 7–36.

Lapesa, G. & Evert, St. (2014), ‘A large scale evaluation of distributional semantic models: Parameters, interactions and model selection’, Transactions of the Association for Computational Linguistics, 2, 531–545.

Mehler, A.; Sharoff, S.; Santini, M. (2011), ‘Genres on the Web’, Berlin/New York: Springer.

Norde, M. & Goethem, K. V. (2014), ‘Bleaching, productivity and debonding of prefixoids. A corpus-based analysis of ‘giant’ in German and Swedish’, Lingvisticae Investigationes 37(2), 256–274.

Schäfer, R. (2016a), ‘Prototype-driven alternations: The case of German weak nouns’, Corpus Linguistics and Linguistic Theory, ahead of print.

Schäfer, R. (2016b), On Bias-free Crawling and Representative Web Corpora, in Cook, P.; Evert, St.; Schäfer, R. & Stemle, E., eds., ‘Proceedings of WAC-X’.

Schäfer, R. & Bildhauer, F. (2013), ‘Web Corpus Construction’, San Francisco: Morgan & Claypool.

Schäfer, R. & Bildhauer, F. (2016), Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison, in Cook, P.; Evert, St.; Schäfer, R. & Stemle, E., eds., ‘Proceedings of WAC-X’.

Schäfer, R. & Sayatz, U. (2017), ‘Punctuation and Syntactic Structure in Obwohl and Weil Clauses in Nonstandard Written German’, Written Language & Literacy, to appear.

Scheffler, T. (2014), A German Twitter Snapshot, in ‘Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC)’, 2284–2289.

Stefanowitsch, A. & Flach, S. (2016), A corpus-based perspective on entrenchment, in Schmid, H.-J., ed., ‘Entrenchment and the psychology of language: How we reorganize and adapt linguistic knowledge’, Berlin: De Gruyter.

Tummers, J.; Speelman, D.; Heylen, K. & Geeraerts, D. (2015), ‘Lectal constraining of lexical collocations’, Constructions and Frames 7(1), 1–46.

Willems, R. M.; Frank, S. L.; Nijhof, A. D.; Hagoort, P. & van den Bosch, A. (2015), ‘Prediction During Natural Language Comprehension’, Cerebral Cortex, 1–11.

Zeldes, A. (2012), ‘Productivity in Argument Selection’, Berlin/Boston: De Gruyter.