== WAC-X main workshop ==

The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP community but also among theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and their diversity in terms of genres and text types. The field is still new, though, and a number of issues in web corpus construction require substantial further research, both fundamental and applied. These issues range from questions of corpus design (e.g., corpus composition assessment, sampling strategies and their relation to crawling algorithms, handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/) and, especially, the highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in the compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, Corpus Linguistics).

== Organizers ==
As in previous years, the 10th Web as Corpus workshop (WAC-X) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to:

* data collection (both for large web corpora and smaller custom web corpora)
* cleaning/handling of noise
* duplicate removal/document filtering
* linguistic post-processing (including non-standard data)
* automatic generation of metadata (including register, genre, etc.)
* corpus evaluation (quality of text and annotations, comparison to other corpora, etc.)

Furthermore, aspects of usability and availability of web-derived corpora are highly relevant in the context of WAC-X:

* development of interfaces
* visualization techniques
* tools for statistical analysis of very large (e.g., web-derived) corpora
* long-term archiving
* documentation and standardization
* legal issues

Finally, reports of the use of web corpora in language technology and linguistics are welcome, for example:

* information extraction & opinion mining
* language modeling, distributional semantics
* machine translation
* linguistic studies of web-specific forms of communication
* linguistic studies of rare phenomena
* web-specific lexicography, grammaticography, and language documentation

=== Submission format ===

All submissions must be in PDF format and should follow the ACL 2015 style guidelines. We strongly recommend the use of the ACL 2015 LaTeX style files or Microsoft Word style files. We reserve the right to reject submissions that do not conform to these styles, including font and page size restrictions.

* [http://acl2015.org/files/acl2015.pdf General instructions (PDF)]
* LaTeX: [http://acl2015.org/files/acl.bst BST], [http://acl2015.org/files/acl2015.sty STY], [http://acl2015.org/files/acl2015.tex TEX]
* MS Word: [http://acl2015.org/files/acl2015.dot DOT]

Full paper submissions may consist of up to eight (8) pages of content plus any number of pages consisting of only references. Short papers may consist of up to four (4) pages of content plus any number of pages consisting of only references. Full papers will be distinguished from short papers in the proceedings.

Papers will be presented either orally or as posters at the workshop. There will be no distinction between papers presented orally and those presented as posters in the proceedings.

Reviewing of papers will be double-blind. Therefore, papers must not include the authors' names and affiliations. Furthermore, self-references that reveal the authors' identity, e.g., "We previously showed (Smith, 1991) ...", must be avoided. Instead, use citations such as "Smith (1991) previously showed ...". Papers not conforming to these requirements will be rejected without review.

=== Important dates ===

* 8 May 2016: Workshop paper due date (23:59 GMT-12)
* 5 June 2016: Notification of acceptance
* 22 June 2016: Camera-ready papers due
* 12 August 2016: Workshop date

=== Program committee ===