10th Web as Corpus Workshop (WAC-X)
Endorsed by the Special Interest Group of the ACL on Web as Corpus (SIGWAC)
Co-located with ACL 2016
August 12, 2016, Berlin
Location: Humboldt University, Berlin
Room: 2093 (Please see the ACL 2016 homepage for details.)
The proceedings of WAC-X and the EmpiriST shared task are available in the ACL Anthology. Slides are linked in the program below.
WAC-X morning session | |
9:30–9:40 | Welcome and Introduction |
9:40–10:00 | Roland Schäfer and Felix Bildhauer: Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison (Slides) |
10:00–10:30 | Adrien Barbaresi: Efficient construction of metadata-enhanced web corpora (Slides) |
WAC-X noon session | |
11:00–11:30 | Andrew Salway, Dag Elgesem, Knut Hofland, Øystein Reigem and Lubos Steskal: Topically-focused Blog Corpora for Multiple Languages (Slides) |
11:30–12:00 | Anne Krause: The Challenges and Joys of Analysing Ongoing Language Change in Web-based Corpora: a Case Study (Slides) |
12:00–12:30 | Quirin Würschinger, Mohammad Fazleh Elahi, Desislava Zhekova and Hans-Jörg Schmid: Using the Web and Social Media as Corpora for Monitoring the Spread of Neologisms. The case of ‘rapefugee’, ‘rapeugee’, and ‘rapugee’. (Slides) |
EmpiriST session | |
13:30–13:50 | Michael Beißwenger, Sabine Bartsch, Stefan Evert and Kay-Michael Würzner: EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora (Slides) |
13:50–14:10 | Thomas Proisl and Peter Uhrig: SoMaJo: State-of-the-art tokenization for German web and social media texts (Slides) |
14:10–14:30 | Jakob Prange, Andrea Horbach and Stefan Thater: UdS-(retrain|distributional|surface): Improving POS Tagging for OOV Words in German CMC and Web Data (Slides) |
WAC-X and EmpiriST teaser talks | |
14:30–14:35 | Gideon Mendels, Erica Cooper and Julia Hirschberg: Babler - Data Collection from the Web to Support Speech Recognition and Keyword Search |
14:35–14:40 | Nikola Ljubešić and Darja Fišer: A Global Analysis of Emoji Usage (Slides) |
14:40–14:45 | Erika Dalan and Serge Sharoff: Genre classification for a corpus of academic webpages (Slides) |
14:45–14:50 | Roland Schäfer: On Bias-free Crawling and Representative Web Corpora (Poster) |
14:55–15:00 | Steffen Remus, Gerold Hintz, Chris Biemann, Christian M. Meyer, Darina Benikova, Judith Eckle-Kohler, Margot Mieskes and Thomas Arnold: EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres (Slides) |
15:00–15:05 | Egon Stemle: bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data) |
15:05–15:10 | Tobias Horsmann and Torsten Zesch: LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text (Slides) |
Posters and discussions | |
15:10–16:30 | WAC-X and EmpiriST poster session |
16:30–17:30 | WAC-X and EmpiriST closing discussion |
17:30–18:30 |
WAC-X main workshop
The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., assessment of corpus composition, sampling strategies and their relation to crawling algorithms, and handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleaning and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially the highly successful Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, and Corpus Linguistics).
WAC-X will also feature the final workshop of the EmpiriST 2015 shared task "Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media" (see https://sites.google.com/site/empirist2015/ for details) and the panel discussion "Corpora, open science, and copyright reforms" (see https://www.sigwac.org.uk/wiki/WAC-X#paneldisc for details).
Workshop organizers
- Paul Cook (University of New Brunswick)
- Stefan Evert (Friedrich-Alexander-Universität Erlangen-Nürnberg)
- Roland Schäfer (Freie Universität Berlin)
- Egon Stemle (European Academy of Bozen/Bolzano)
Program committee
The workshop organizers were not part of the program committee.
- Adrien Barbaresi, ÖAW (AT)
- Silvia Bernardini, University of Bologna (IT)
- Douglas Biber, Northern Arizona University (US)
- Felix Bildhauer, Institut für Deutsche Sprache Mannheim (DE)
- Katrien Depuydt, INL, Leiden (NL)
- Jesse de Does, INL, Leiden (NL)
- Cédrick Fairon, UC Louvain (BE)
- William H. Fletcher, U.S. Naval Academy (US)
- Iztok Kosem, Trojina, Institute for Applied Slovene Studies (SI)
- Simon Krek, Jožef Stefan Institute (SI)
- Lothar Lemnitzer, BBAW (DE)
- Nikola Ljubešić, Sveučilište u Zagrebu (HR)
- Siva Reddy, University of Edinburgh (UK)
- Steffen Remus, TU Darmstadt (DE)
- Pavel Rychly, Masaryk University (CZ)
- Kevin Scannell, Saint Louis University (US)
- Serge Sharoff, University of Leeds (UK)
- Klaus Schulz, LMU München (DE)
- Kay-Michael Würzner, BBAW (DE)
- Torsten Zesch, University of Duisburg-Essen (DE)
- Pierre Zweigenbaum, LIMSI (FR)
EmpiriST 2015 shared task
The EmpiriST 2015 shared task aims to encourage the developers of NLP applications to adapt their tools and resources to the processing of German discourse in genres of computer-mediated communication (CMC), including both dialogical (chat, SMS, social networks, etc.) and monological (web pages, blogs, etc.) texts. Since there has been relatively little work in this area for German so far, the shared task focuses on tokenization and part-of-speech tagging as the core annotation steps required by virtually all NLP applications. While we have a particular interest in robust tools that can be applied to dialogical CMC and web corpora alike, participants are allowed to use different systems for the two subsets or to submit results for one subset only. A substantial number of teams from German-speaking countries have already expressed interest in participating in EmpiriST 2015. Knowledge of German is not essential for participation, though, since a sufficient amount of manually annotated training data (at least 10,000 tokens) is available and key documents are provided in English.
The final workshop of EmpiriST 2015 will be co-located with WAC-X. It will include a detailed presentation of the task and results, a poster session with all participating systems, oral presentations of selected systems, and a plenary discussion about the challenges of CMC in general as well as German CMC genres in particular.
Attachments (14)
- Slides for Proisl & Uhrig (1.4 MB)
- Slides for Horsmann & Zesch (2.2 MB)
- Slides for Ljubešić & Fišer (1.9 MB)
- Slides for Salway et al. (Topically-focused Blog Corpora for Multiple Languages.pdf, 1.1 MB)
- Slides for Dalan & Sharoff (247.8 KB)
- Slides for Schäfer & Bildhauer (1.8 MB)
- Slides for Beißwenger et al. (1.1 MB)
- Slides for Prange et al. (Shared Task final.pdf, 394.9 KB)
- Slides for Salway et al. (Topically-focused Blog Corpora for Multiple Languages.2.pdf, 1.1 MB)
- Slides for Würschinger et al. (541.4 KB)
- Slides for Krause (WAC-X - Krause - slides.pdf, 542.0 KB)
- Slides for Remus et al. (6.7 MB)
- Poster for Schäfer (908.4 KB)
- Slides for Barbaresi (312.0 KB)