Version 23 (modified by Roland Schäfer, 10 years ago)

10th Web as Corpus Workshop (WAC-X)

Endorsed by the Special Interest Group of the ACL on Web as Corpus (SIGWAC)

Co-located with ACL 2016
August 12, 2016, Berlin

Contact email: wacx2016 [at] gmail.com

13 July 2016: Workshop program available.

Program

WAC-X morning session

9:30–9:40	Welcome and Introduction
9:40–10:00	Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison
	Roland Schäfer and Felix Bildhauer
10:00–10:30	Efficient construction of metadata-enhanced web corpora
	Adrien Barbaresi

WAC-X noon session

11:00–11:30	Topically-focused Blog Corpora for Multiple Languages
	Andrew Salway, Dag Elgesem, Knut Hofland, Øystein Reigem and Lubos Steskal
11:30–12:00	The Challenges and Joys of Analysing Ongoing Language Change in Web-based Corpora: a Case Study
	Anne Krause
12:00–12:30	Using the Web and Social Media as Corpora for Monitoring the Spread of Neologisms. The case of ’rapefugee’, ’rapeugee’, and ’rapugee’.
	Quirin Würschinger, Mohammad Fazleh Elahi, Desislava Zhekova and Hans-Jörg Schmid

EmpiriST session

13:30–13:50	EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora
	Michael Beißwenger, Sabine Bartsch, Stefan Evert and Kay-Michael Würzner
13:50–14:10	SoMaJo: State-of-the-art tokenization for German web and social media texts
	Thomas Proisl and Peter Uhrig
14:10–14:30	UdS-(retrain|distributional|surface): Improving POS Tagging for OOV Words in German CMC and Web Data
	Jakob Prange, Andrea Horbach and Stefan Thater

WAC-X and EmpiriST teaser talks

14:30–14:35	Babler - Data Collection from the Web to Support Speech Recognition and Keyword Search
	Gideon Mendels, Erica Cooper and Julia Hirschberg
14:35–14:40	A Global Analysis of Emoji Usage
	Nikola Ljubešić and Darja Fišer
14:40–14:45	Genre classification for a corpus of academic webpages
	Erika Dalan and Serge Sharoff
14:45–14:50	On Bias-free Crawling and Representative Web Corpora
	Roland Schäfer
14:55–15:00	EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres
	Steffen Remus, Gerold Hintz, Chris Biemann, Christian M. Meyer, Darina Benikova, Judith Eckle-Kohler, Margot Mieskes and Thomas Arnold
15:00–15:05	bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)
	Egon Stemle
15:05–15:10	LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text
	Tobias Horsmann and Torsten Zesch

Posters and discussion

15:10–16:30	WAC-X and EmpiriST poster session
16:30–17:30	WAC-X and EmpiriST closing discussion
17:30–18:30	Panel discussion "Corpora, open science, and copyright reforms"

WAC-X main workshop

The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., assessment of corpus composition, sampling strategies and their relation to crawling algorithms, and handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleaning and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/) and especially the highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in the compilation, processing, and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, and Corpus Linguistics).

WAC-X will also feature the final workshop of the EmpiriST 2015 shared task "Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media" (see https://sites.google.com/site/empirist2015/ for details) and the panel discussion "Corpora, open science, and copyright reforms" (see https://www.sigwac.org.uk/wiki/WAC-X#paneldisc for details).

Organizers

Contact email: wacx2016 [at] gmail.com

Important dates

~~8 May 2016~~ 15 May 2016 (extended): Workshop Paper Due date (23:59 GMT-12)
5 June 2016: Notification of Acceptance
22 June 2016: Camera-ready papers due
12 August 2016: Workshop Date

Call for Papers

As in previous years, the 10th Web as Corpus workshop (WAC-X) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to:

data collection (both for large web corpora and smaller custom web corpora)
cleaning/handling of noise
duplicate removal/document filtering
linguistic post-processing (including non-standard data)
automatic generation of meta data (including register, genre, etc.)
corpus evaluation (quality of text and annotations, comparison to other corpora, etc.)

Furthermore, aspects of usability and availability of web-derived corpora are highly relevant in the context of WAC-X:

development of corpus interfaces
visualization techniques
tools for statistical analysis of very large (e.g., web-derived) corpora
long-term archiving
documentation and standardization
legal issues

Finally, reports on the use of web corpora in language technology and linguistics are welcome, for example:

information extraction & opinion mining
language modeling, distributional semantics
machine translation
linguistic studies of web-specific forms of communication
linguistic studies of rare phenomena
web-specific lexicography, grammaticography, and language documentation

Submission website

Submissions are managed via the SoftConf tool at https://www.softconf.com/acl2016/WAC-X/

Submission format

All submissions must be in PDF format and should follow the ACL 2016 style guidelines. We strongly recommend the use of the ACL 2016 LaTeX style files or Microsoft Word Style files. We reserve the right to reject submissions that do not conform to these styles including font and page size restrictions. Note: Unfortunately, the ACL have not released a Word template for ACL 2016. Below is the link to the template from 2015. If you submit in Word format and your paper is accepted, you will have to make sure that your final submission conforms to the ACL 2016 guidelines in order for it to appear in the proceedings. This should be fairly easy, however.

Full paper submissions may consist of up to eight (8) pages of content plus any number of pages consisting of only references. Short papers may consist of up to four (4) pages of content plus any number of pages consisting of only references. Full papers will be distinguished from short papers in the proceedings.

Papers will be presented either orally or as posters at the workshop. There will be no distinction between papers presented orally and those presented as posters in the proceedings.

Reviewing of papers will be double-blind. Therefore, the paper must not include the authors' names and affiliations. Furthermore, self-references that reveal the authors' identity, e.g., "We previously showed (Smith, 1991) ...", must be avoided. Instead, use citations such as "Smith (1991) previously showed ...". Papers not conforming to these requirements will be rejected without review.

Program committee

The workshop organizers were not part of the program committee.

Adrien Barbaresi, ÖAW (AT)
Silvia Bernardini, University of Bologna (IT)
Douglas Biber, Northern Arizona University (US)
Felix Bildhauer, Institut für Deutsche Sprache Mannheim (DE)
Katrien Depuydt, INL, Leiden (NL)
Jesse de Does, INL, Leiden (NL)
Cédrick Fairon, UC Louvain (BE)
William H. Fletcher, U.S. Naval Academy (US)
Iztok Kosem, Trojina, Institute for Applied Slovene Studies (SI)
Simon Krek, Jožef Stefan Institute (SI)
Lothar Lemnitzer, BBAW (DE)
Nikola Ljubešić, Sveučilišta u Zagrebu (HR)
Siva Reddy, University of Edinburgh (UK)
Steffen Remus, TU Darmstadt (DE)
Pavel Rychly, Masaryk University (CZ)
Kevin Scannell, Saint Louis University (US)
Serge Sharoff, University of Leeds (UK)
Klaus Schulz, LMU München (DE)
Kay-Michael Würzner, BBAW (DE)
Torsten Zesch, University of Duisburg-Essen (DE)
Pierre Zweigenbaum, LIMSI (FR)

Co-located events

EmpiriST 2015 shared task

The EmpiriST 2015 shared task aims to encourage the developers of NLP applications to adapt their tools and resources to the processing of German discourse in genres of computer-mediated communication (CMC), including both dialogical (chat, SMS, social networks, etc.) and monological (web pages, blogs, etc.) texts. Since there has been relatively little work in this area for German so far, the shared task focuses on tokenization and part-of-speech tagging as the core annotation steps required by virtually all NLP applications. While we have a particular interest in robust tools that can be applied to dialogical CMC and web corpora alike, participants are allowed to use different systems for the two subsets or submit results for one subset only. A substantial number of teams from German-speaking countries have already expressed their interest in participating in EmpiriST 2015. Knowledge of German is not essential for participation, though, since a sufficient amount of manually annotated training data (at least 10,000 tokens) is available and key documents are provided in English.

The final workshop of EmpiriST 2015 will be co-located with WAC-X. It will include a detailed presentation of the task and results, a poster session with all participating systems, oral presentations of selected systems, and a plenary discussion about the challenges of CMC in general as well as German CMC genres in particular.

Panel discussion "Corpora, open science, and copyright reforms"

As part of the 10th Web as Corpus workshop (WAC-X), a panel discussion will be organized. Web corpus designers are probably those who are most affected by issues and uncertainties of copyright legislation and intellectual property rights, especially in the EU. While in some countries, such as the U.S., a Fair Use doctrine allows the use of data for non-commercial research purposes, the situation in Europe is more problematic. For example, German copyright law ("Urheberrecht") requires that any re-use of a work which reaches a certain threshold of creativity be explicitly approved by the author. This poses numerous problems for any corpus creator, and obtaining such approval is completely infeasible for large web corpora containing texts written by millions of different authors. Thus, corpora are re-distributed in crippled form as sentence shuffles (e.g., COW and the Leipzig Corpora Collection), and it is not even clear whether there really is a reliable legal exemption for single sentences. In the famous Infopaq case, the Court of Justice of the European Union decided, on a referral from a Danish court, that even snippets of 11 words might be protected under EU copyright laws (http://bit.ly/1GYTDjR).

This situation is highly unsatisfactory. Large web corpora have been shown to be indispensable for many tasks in computational linguistics, in the documentation of standard and non-standard language, and in empirically oriented theoretical linguistics.

Reports written by legal experts – such as the one recently commissioned by the German Research Council (http://bit.ly/1PG4Gq6) – only provide an interpretation of the given legal situation. Only active lobbying in favor of a reasonable copyright reform will eventually bring about the necessary changes such that researchers can build corpus resources and share them freely for academic purposes. Therefore, the goal of this panel discussion is to bring together corpus creators, active users of web corpora, and open science activists in order to share and discuss views on the copyright problem as a political rather than a legal problem. Ideally, a first draft of a joint declaration might come out of this discussion. With such a declaration, the (web) corpus community could make sure that its voice is heard, especially in the ongoing discussion about reforms of the European copyright legislation.

Attachments (14)

somajo.pdf (1.4 MB ) - added by Roland Schäfer 10 years ago. Slides for Proisl & Uhrig
LtlEmpiriWacX.pdf (2.2 MB ) - added by Roland Schäfer 10 years ago. Slides for Horsmann & Zesch
nl_wac2016_pres.pdf (1.9 MB ) - added by Roland Schäfer 10 years ago. Slides for Ljubesic & Fiser
Topically-focused Blog Corpora for Multiple Languages.pdf (1.1 MB ) - added by Roland Schäfer 10 years ago. Slides for Salway et al.
2016-WAC-present.pdf (247.8 KB ) - added by Roland Schäfer 10 years ago. Slides for Dalan & Sharoff
td-wacx.pdf (1.8 MB ) - added by Roland Schäfer 10 years ago. Slides for Schäfer & Bildhauer
WACX_EmpiriST_final.pdf (1.1 MB ) - added by Roland Schäfer 10 years ago. Slides for Beißwenger et al.
Shared Task final.pdf (394.9 KB ) - added by Roland Schäfer 10 years ago. Slides for Prange et al.
Topically-focused Blog Corpora for Multiple Languages.2.pdf (1.1 MB ) - added by Roland Schäfer 10 years ago. Slides for Salway et al.
wuerschinger.pdf (541.4 KB ) - added by Roland Schäfer 10 years ago. Würschinger et al.
WAC-X - Krause - slides.pdf (542.0 KB ) - added by Roland Schäfer 10 years ago. Slides for Krause
empirist_poster_pitch_aiphes_remus_et_al.pdf (6.7 MB ) - added by Roland Schäfer 10 years ago. Slides for Remus et al.
clarax.pdf (908.4 KB ) - added by Roland Schäfer 10 years ago. Poster for Schäfer
ABarbaresi_WAC-X_slides.pdf (312.0 KB ) - added by Roland Schäfer 10 years ago. Slides for Barbaresi
