Changes between Initial Version and Version 1 of WAC9


Timestamp:
11/05/13 14:53:21 (10 years ago)
Author:
Felix Bildhauer
  • WAC9

    v1 v1  
     1= 9th Web as Corpus Workshop (WAC9) @ [http://eacl2014.org/ EACL 2014] =
     2== 26-27 April 2014 (Gothenburg, Sweden) ==
     3
     4//Endorsed by [http://www.sigwac.org.uk ACL SIGWAC].//
     5
     6The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity.
     7Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types.
     8However, the field is still new, and a number of issues in web corpus construction still need much research (fundamental and applied), ranging from questions of corpus design (e.g., corpus composition assessment, sampling strategies and their relation to crawling algorithms, handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction).
     9Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only lately shifted into focus.
     10
     11For almost a decade, the ACL SIGWAC, and especially the highly successful Web as Corpus (WaC) workshops, have served as a platform for researchers interested in building and working with web-derived corpora.
     12Past workshops have been co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, LREC, WWW, Corpus Linguistics).
     13As in previous years, the 9th Web as Corpus workshop (WaC9) invites contributions pertaining to all aspects of web corpora, including data collection, cleaning, duplicate removal, document filtering, linguistic post-processing, and use of web corpora in language technology and linguistics.
     14
     15However, a major challenge in the construction of web corpora is the quality and evaluation of both the software used to build them and the resulting corpora.
     16Therefore, WaC9 seeks to put special emphasis on these topics, and it particularly encourages submissions addressing the following points:
     17
     18* noise in web corpora: normalization and implications for linguistic annotation (lemmatization, POS tagging, parsing, etc.)
     19* task-based ("extrinsic") evaluation of web corpora, especially in comparison to traditional corpus resources and n-gram databases (Web 1T 5-Grams, Google Books)
     20* missing metadata in web corpora: enriching web corpora with metadata by high-accuracy automatic classification
     21* sampling strategies/crawling algorithms and their effect on corpus composition/corpus quality
     22* non-destructive cleaning and normalization of web data (Currently available web corpora have usually undergone radical cleaning procedures in order to produce "high-quality" data. At least for some uses of the data, aggressive and sometimes arbitrary removal of material in the form of whole documents or parts thereof can be problematic. The same is true for aggressive normalization of the data. To address such problems, ways of cleaning and normalizing the data transparently, i.e., preserving the non-normalized forms, should be discussed.)
     23
     24As part of the workshop, we will have a panel discussion dedicated to the planning of a shared task for WaC10 (2015), including the nomination of organizers of the shared task.
     25The tracks of the shared task will focus on the quality of web corpus creation tools, tools for linguistic annotation (at least lemmatization, possibly also POS tagging, etc.), and the quality of web corpora themselves.
     26
     27== Organising Committee ==
     28
     29Felix Bildhauer, Freie Universität Berlin
     30Roland Schäfer, Freie Universität Berlin
     31
     32== Program Committee ==
     33
     34Organising committee, plus
     35
     36* Adrien Barbaresi, École Normale Supérieure de Lyon
     37* Silvia Bernardini, Università di Bologna
     38* Chris Biemann, Technische Universität Darmstadt
     39* Jesse Egbert, Northern Arizona University
     40* Stefan Evert, Friedrich-Alexander Universität Erlangen-Nürnberg
     41* Adriano Ferraresi, Università di Bologna
     42* William Fletcher, United States Naval Academy
     43* Dirk Goldhahn, Universität Leipzig
     44* Adam Kilgarriff, Lexical Computing Ltd.
     45* Anke Lüdeling, Humboldt-Universität zu Berlin
     46* Alexander Mehler, Goethe-Universität Frankfurt am Main
     47* Uwe Quasthoff, Universität Leipzig
     48* Paul Rayson, Lancaster University
     49* Serge Sharoff, University of Leeds
     50* Sabine Schulte im Walde, Universität Stuttgart
     51* Egon Stemle, European Academy of Bozen/Bolzano
     52* Yannick Versley, Universität Heidelberg
     53* Torsten Zesch, Universität Darmstadt
     54* Stephen Wattam, Lancaster University