Changes between Version 1 and Version 2 of WAC-X

11/20/15 15:35:48
Roland Schäfer



  • WAC-X

    v1  v2
    1       10th Web as Corpus Workshop (WAC-X) and EmpiriST Shared Task
        1   = 10th Web as Corpus Workshop (WAC-X) and EmpiriST Shared Task =
        3   == WAC-X main workshop ==
    3   5   The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., corpus composition assessment, sampling strategies and their relation to crawling algorithms, handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus.
    29  31  * web-specific lexicography, grammaticography, and language documentation
    31      EmpiriST 2015 shared task
        33  == EmpiriST 2015 shared task ==
    33  35  The EmpiriST 2015 shared task aims to encourage the developers of NLP applications to adapt their tools and resources to the processing of German discourse in genres of computer-mediated communication (CMC), including both dialogical (chat, SMS, social networks, etc.) and monological (web pages, blogs, etc.) texts. Since there has been relatively little work in this area for German so far, the shared task focuses on tokenization and part-of-speech tagging as the core annotation steps required by virtually all NLP applications. While we have a particular interest in robust tools that can be applied to dialogical CMC and web corpora alike, participants are allowed to use different systems for the two subsets or submit results for one subset only.