Changes between Version 1 and Version 2 of WAC-XI


Timestamp:
10/01/16 11:34:36 (8 years ago)
Author:
Roland Schäfer
Comment:

--

  • WAC-XI

    v1 v2  
    22
    33= 11th Web as Corpus Workshop (WAC-XI) =
     4featuring the First !CleanerEval Shared Task panel discussion
    45
    56Endorsed by the Special Interest Group of the ACL on Web as Corpus (SIGWAC)
    6 
    7 All details tba.
    87
    98=== Organizers ===
     
    1312* [http://rolandschaefer.net Roland Schäfer (Freie Universität Berlin)]
    1413
     14== Main workshop ==
    1515
    16 {{{#!comment
     16The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems such as data sparseness or the lack of variation in written data. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., the assessment of corpus composition or the handling of web spam and duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, automatic generation of document-level meta data, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. Finally, other forms of computer-mediated communication (e.g., Twitter) have recently received a lot of attention from corpus designers.
    1717
    18 == Main workshop ==
     18For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/) and especially its highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in the compilation, processing, and application of web-derived corpora and other types of CMC corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as ACL, EACL, NAACL, LREC, WWW, Corpus Linguistics). As in previous years, the 11th Web as Corpus workshop (WAC-XI) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to:
     19
     20* data collection (both large web corpora and other types of CMC corpora)
     21* cleaning/handling of noise
     22* duplicate removal/document filtering
     23* linguistic post-processing (including non-standard data)
     24* automatic generation of meta data (including register, genre, etc.)
     25
     26Furthermore, aspects of usability and availability of web-derived corpora are highly relevant in the context of WAC-XI:
     27
     28* development of user interfaces
     29* visualization techniques
     30* tools for statistical analysis of very large (e.g., web-derived) corpora
     31* long-term archiving
     32* documentation and standardization
     33* legal issues
     34
     35Finally, reports of the use of web corpora in language technology and linguistics are welcome, for example:
     36
     37* linguistic studies of web-specific forms of communication
     38* linguistic studies of rare phenomena in web data
     39* web-specific lexicography, grammaticography, and language documentation
     40* information extraction & opinion mining
     41* language modeling, distributional semantics
     42* machine translation
     43
    1944
    2045== Panel discussion: == #cleanereval
    2146
     47As part of the workshop, we plan to organize a panel discussion as the first meeting of the !CleanerEval shared task on combined paragraph and document quality detection for (web) documents. The !CleanerEval shared task follows the successful !CleanEval shared task organized by SIGWAC in 2006. While !CleanEval focussed specifically on so-called boilerplate removal, !CleanerEval goes beyond this and asks for systems that automatically determine the linguistic quality of paragraphs and whole documents, such that corpus designers can decide whether to include them in their corpus or not. In the !CleanerEval setting, boilerplate paragraphs are paragraphs with low quality, but there might be other, non-boilerplate paragraphs with low quality as well. !CleanerEval was proposed by the organizers of WAC-XI during the final discussion of WAC-X, where the proposal was met with enthusiasm. The WAC-XI panel discussion is intended to serve as a platform for operationalizing the notions of paragraph and document quality and for developing the annotation guidelines and the final schedule for the shared task. The final meeting of the shared task is planned to be part of WAC-XII in 2018.
     48
    2249== Program committee ==
     50
     51Confirmed reviewers so far:
     52
     53* Masayuki Asahara, National Institute for Japanese Language and Linguistics
     54* Silvia Bernardini, University of Bologna
     55* Niels Brügger, University of Aarhus
     56* Cédrick Fairon, UC Louvain
     57* William H. Fletcher, U.S. Naval Academy
     58* Jack Grieve, Aston University
     59* Aurelie Herbelot, University of Trento
     60* Miloš Jakubíček, Masaryk University Brno
     61* Iztok Kosem, Trojina, Institute for Applied Slovene Studies
     62* Steffen Remus, TU Darmstadt
     63* Antonio Ruiz Tinoco, Sophia University
     64* Kevin Scannell, Saint Louis University
     65* Serge Sharoff, University of Leeds
     66* Sabine Schulte im Walde, IMS Stuttgart
     67* Klaus Schulz, LMU München
     68* Egon Stemle, EURAC Bozen / Bolzano
     69* Peter Uhrig, FAU Erlangen
     70* Marieke van Erp, VU Amsterdam
     71* Wajdi Zaghouani, CMU, Qatar
     72* Amir Zeldes, Georgetown University, Washington
     73* Arne Zeschel, Institut für Deutsche Sprache, Mannheim
     74
     75=== Important dates ===#dates
    2376
    2477tba
    2578
    26 == Details ==
    27 
    28 === Important dates ===#dates
    29 
    3079=== Call for papers === #cfp
    3180
    32 === Panel discussion: === #cleanereval
     81tba
    3382
    3483=== Submission website ===
    3584
     85tba
     86
    3687=== Submission format ===
    3788
    38 }}}
     89tba