WAC-X – ACL SIGWAC

Version 1 (modified by Roland Schäfer, 9 years ago)

10th Web as Corpus Workshop (WAC-X) and EmpiriST Shared Task

The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also among theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and their diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., corpus composition assessment, sampling strategies and their relation to crawling algorithms, handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus.

For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/) and especially the highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in the compilation, processing, and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, Corpus Linguistics). As in previous years, the 10th Web as Corpus workshop (WAC-X) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to:

data collection (both for large web corpora and smaller custom web corpora)
cleaning/handling of noise
duplicate removal/document filtering
linguistic post-processing (including non-standard data)
automatic generation of meta data (including register, genre, etc.)
corpus evaluation (quality of text and annotations, comparison to other corpora, etc.)

Furthermore, aspects of usability and availability of web-derived corpora are highly relevant in the context of WAC-X:

development of interfaces
visualization techniques
tools for statistical analysis of very large (e.g., web-derived) corpora
long-term archiving
documentation and standardization
legal issues

Finally, reports on the use of web corpora in language technology and linguistics are welcome, for example:

information extraction & opinion mining
language modeling, distributional semantics
machine translation
linguistic studies of web-specific forms of communication
linguistic studies of rare phenomena
web-specific lexicography, grammaticography, and language documentation

EmpiriST 2015 shared task

The EmpiriST 2015 shared task aims to encourage the developers of NLP applications to adapt their tools and resources to the processing of German discourse in genres of computer-mediated communication (CMC), including both dialogical (chat, SMS, social networks, etc.) and monological (web pages, blogs, etc.) texts. Since there has been relatively little work in this area for German so far, the shared task focuses on tokenization and part-of-speech tagging as the core annotation steps required by virtually all NLP applications. While we have a particular interest in robust tools that can be applied to dialogical CMC and web corpora alike, participants are allowed to use different systems for the two subsets or to submit results for one subset only. A substantial number of teams from German-speaking countries have already expressed interest in participating in EmpiriST 2015. Knowledge of German is not essential for participation, though, since a sufficient amount of manually annotated training data (at least 10,000 tokens) is available and key documents are provided in English.

The final workshop of EmpiriST 2015 will be co-located with WAC-X. It will include a detailed presentation of the task and results, a poster session with all participating systems, oral presentations of selected systems, and a plenary discussion about the challenges posed by CMC in general and by German CMC genres in particular.

Attachments (14)

somajo.pdf (1.4 MB ) - added by Roland Schäfer 8 years ago. Slides for Proisl & Uhrig
LtlEmpiriWacX.pdf (2.2 MB ) - added by Roland Schäfer 8 years ago. Slides for Horsmann & Zesch
nl_wac2016_pres.pdf (1.9 MB ) - added by Roland Schäfer 8 years ago. Slides for Ljubesic & Fiser
Topically-focused Blog Corpora for Multiple Languages.pdf (1.1 MB ) - added by Roland Schäfer 8 years ago. Slides for Salway et al.
2016-WAC-present.pdf (247.8 KB ) - added by Roland Schäfer 8 years ago. Slides for Dalan & Sharoff
td-wacx.pdf (1.8 MB ) - added by Roland Schäfer 8 years ago. Slides for Schäfer & Bildhauer
WACX_EmpiriST_final.pdf (1.1 MB ) - added by Roland Schäfer 8 years ago. Slides for Beißwenger et al.
Shared Task final.pdf (394.9 KB ) - added by Roland Schäfer 8 years ago. Slides for Prange et al.
Topically-focused Blog Corpora for Multiple Languages.2.pdf (1.1 MB ) - added by Roland Schäfer 8 years ago. Slides for Salway et al.
wuerschinger.pdf (541.4 KB ) - added by Roland Schäfer 8 years ago. Würschinger et al.
WAC-X - Krause - slides.pdf (542.0 KB ) - added by Roland Schäfer 8 years ago. Slides for Krause
empirist_poster_pitch_aiphes_remus_et_al.pdf (6.7 MB ) - added by Roland Schäfer 8 years ago. Slides for Remus et al.
clarax.pdf (908.4 KB ) - added by Roland Schäfer 8 years ago. Poster for Schäfer
ABarbaresi_WAC-X_slides.pdf (312.0 KB ) - added by Roland Schäfer 8 years ago. Slides for Barbaresi
