16 | | {{{#!comment |
| 16 | The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems such as data sparseness or the lack of variation in written data. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., the assessment of corpus composition or the handling of web spam and duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, automatic generation of document-level meta data, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. Finally, other forms of computer-mediated communication (e.g., Twitter) have recently received a lot of attention from corpus designers. |
18 | | == Main workshop == |
| 18 | For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially its highly successful Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and application of web-derived corpora and other types of CMC corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as ACL, EACL, NAACL, LREC, WWW, Corpus Linguistics). As in previous years, the 11th Web as Corpus workshop (WAC-XI) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to: |
| 19 | |
| 20 | * data collection (both large web corpora and other types of CMC corpora) |
| 21 | * cleaning/handling of noise |
| 22 | * duplicate removal/document filtering |
| 23 | * linguistic post-processing (including non-standard data) |
| 24 | * automatic generation of meta data (including register, genre, etc.) |
| 25 | |
| 26 | Furthermore, aspects of usability and availability of web-derived corpora are highly relevant in the context of WAC-XI: |
| 27 | |
| 28 | * development of user interfaces |
| 29 | * visualization techniques |
| 30 | * tools for statistical analysis of very large (e.g., web-derived) corpora |
| 31 | * long-term archiving |
| 32 | * documentation and standardization |
| 33 | * legal issues |
| 34 | |
| 35 | Finally, reports of the use of web corpora in language technology and linguistics are welcome, for example: |
| 36 | |
| 37 | * linguistic studies of web-specific forms of communication |
| 38 | * linguistic studies of rare phenomena in web data |
| 39 | * web-specific lexicography, grammaticography, and language documentation |
| 40 | * information extraction & opinion mining |
| 41 | * language modeling, distributional semantics |
| 42 | * machine translation |
| 43 | |
| 50 | |
| 51 | Confirmed reviewers so far: |
| 52 | |
| 53 | * Masayuki Asahara, National Institute for Japanese Language and Linguistics |
| 54 | * Silvia Bernardini, University of Bologna |
| 55 | * Niels Brügger, University of Aarhus |
| 56 | * Cédrick Fairon, UC Louvain |
| 57 | * William H. Fletcher, U.S. Naval Academy |
| 58 | * Jack Grieve, Aston University |
| 59 | * Aurelie Herbelot, University of Trento |
| 60 | * Miloš Jakubíček, Masaryk University Brno |
| 61 | * Iztok Kosem, Trojina, Institute for Applied Slovene Studies |
| 62 | * Steffen Remus, TU Darmstadt |
| 63 | * Antonio Ruiz Tinoco, Sophia University |
| 64 | * Kevin Scannell, Saint Louis University |
| 65 | * Serge Sharoff, University of Leeds |
| 66 | * Sabine Schulte im Walde, IMS Stuttgart |
| 67 | * Klaus Schulz, LMU München |
| 68 | * Egon Stemle, EURAC Bozen / Bolzano |
| 69 | * Peter Uhrig, FAU Erlangen |
| 70 | * Marieke van Erp, VU Amsterdam |
| 71 | * Wajdi Zaghouani, CMU, Qatar |
| 72 | * Amir Zeldes, Georgetown University, Washington |
| 73 | * Arne Zeschel, Institut für Deutsche Sprache, Mannheim |
| 74 | |
| 75 | === Important dates ===#dates |