

9th Web as Corpus Workshop (WAC-9) @ EACL 2014

April 26, 2014 (Gothenburg, Sweden)

Endorsed by ACL SIGWAC.

Accepted Papers (alphabetically by first author's first name)

  • Adrien Barbaresi: Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources
  • Magali Sanches Duran, Lucas Avanço, Sandra Aluísio, Thiago Pardo and Maria da Graça Volpe Nunes: Some issues on the normalization of a corpus of product reviews in Portuguese
  • Maik Stührenberg: Less destructive cleaning of web documents by using standoff annotation
  • Nikola Ljubešić: {bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian
  • Roland Schäfer, Adrien Barbaresi and Felix Bildhauer: Focused Web Corpus Crawling
  • Varvara Magomedova, Natalia Slioussar and Maria Kholodilova: Internet data in a study of language change and a program helping to work with them
  • Verena Lyding, Egon Stemle, Andrea Abel, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci and Vito Pirrelli: The PAISÀ Corpus of Italian Web Texts

Information for authors

  • Please submit your camera-ready full paper, formatted according to the EACL stylesheet, by March 03, 2014. There will be no extension of this deadline. Failure to submit the manuscript in time means that your paper will not be included in the proceedings.
  • Papers can have a maximum length of 8 pages including everything.
  • LaTeX and MS Word templates are available here.

Online Survey

Please fill out this online survey regarding a panel discussion about a potential shared task following up on CLEANEVAL by Sunday, March 02, 2014.


The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. However, the field is still new, and a number of issues in web corpus construction still need much research (fundamental and applied), ranging from questions of corpus design (e.g., corpus composition assessment, sampling strategies and their relation to crawling algorithms, handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only lately shifted into focus.

For almost a decade, the ACL SIGWAC, and especially the highly successful Web as Corpus (WaC) workshops, have served as a platform for researchers interested in building and working with web-derived corpora. Past workshops have been co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, LREC, WWW, Corpus Linguistics). As part of the workshop, we will have a panel discussion dedicated to the planning of a shared task for WaC10 (2015), including the nomination of organizers of the shared task. The tracks of the shared task will focus on the quality of web corpus creation tools, tools for linguistic annotation (at least lemmatization, possibly also POS tagging, etc.), and the quality of web corpora themselves.

Organizing Committee

  • Felix Bildhauer, Freie Universität Berlin
  • Roland Schäfer, Freie Universität Berlin

The organizers are not part of the program committee.

Program Committee

  • Adrien Barbaresi, École Normale Supérieure de Lyon
  • Silvia Bernardini, Università di Bologna
  • Chris Biemann, Technische Universität Darmstadt
  • Jesse Egbert, Northern Arizona University
  • Stefan Evert, Friedrich-Alexander Universität Erlangen-Nürnberg
  • Adriano Ferraresi, Università di Bologna
  • William Fletcher, United States Naval Academy
  • Dirk Goldhahn, Universität Leipzig
  • Adam Kilgarriff, Lexical Computing Ltd.
  • Anke Lüdeling and Burkhard Dietterle, Humboldt-Universität zu Berlin
  • Alexander Mehler, Goethe-Universität Frankfurt am Main
  • Uwe Quasthoff, Universität Leipzig
  • Paul Rayson, Lancaster University
  • Sabine Schulte im Walde, Universität Stuttgart
  • Serge Sharoff, University of Leeds
  • Egon Stemle, European Academy of Bolzano
  • Stephen Wattam, Lancaster University
  • Yannick Versley, Universität Heidelberg
  • Torsten Zesch, Universität Darmstadt

Important dates

  • 11 November 2013: First Call for Workshop Papers
  • 12 December 2013: Second Call for Workshop Papers
  • 4 January 2014: Final Call for Workshop Papers
  • 30 January 2014 (extended from 23 January 2014): Workshop Paper Due Date (0:00 UTC-12)
  • 20 February 2014: Notification of Acceptance
  • 3 March 2014: Camera-ready papers due
  • 26-27 April 2014: EACL Workshop Dates