7th Web as Corpus Workshop (WAC-7)

Lyon, France; 17th April 2012

To be held in association with WWW2012.

Sponsored by ACL SIGWAC and PRESEMT

More and more people are using Web data for linguistic and NLP research: the Web provides an easy source of linguistic data in a great variety of languages. However, a ‘crawl’ is not ready for exploration in the same way that a traditional ‘corpus’ is. We need to turn a crawl into a corpus. The workshop, the seventh in an annual series, provides a venue for exploring what this involves, how to do it, and what we find out if we do.

We invite submissions which:

  • describe Web corpus collection projects, or modules for one part of the process (crawling, filtering, de-duplication, language-id, tokenising, indexing, ...)
  • explore characteristics of Web data from a linguistics/NLP perspective, including registers, domains, frequency distributions, and comparisons between datasets
  • use crawled Web data for NLP purposes (with emphasis on the data rather than the use)

The previous WAC workshops were co-located with various conferences in computational linguistics. This time the workshop is co-located with WWW2012, the main world conference on Web technologies and their impact on society.


Room Saint Clair 4 at Convention Centre, WWW2012

The proceedings are available from here

9.00 Welcome
9.10 Invited Talk: Benno Stein
Exploiting the Web for Text and Language Reuse Applications
10.00 Marco Brunello
Understanding the composition of parallel corpora from the web
10.25 Vit Suchomel, Jan Pomikalek
Efficient Web Crawling for Large Text Corpora
10.40 Coffee
11.00 Ed Chow, Dayne Freitag, Paul Kalmar, Tulay Muezzinoglu, John Niekrasz
A corpus of online discussions for research into linguistic memes
11.25 Paul Rayson, Oliver Charles, Ian Auty
Can Google count? Estimating search engine result consistency
11.50 Tobias Roth
Using Web Corpora for the Recognition of Regional Variation in Standard German Collocations
12.15 Yannick Versley, Yana Panchenko
Not Just Bigger: Towards Better-Quality Web Corpora
12.40 Discussion, wrap-up
13.00 End

Organising committee

  • Adam Kilgarriff (Lexical Computing Ltd.)
  • Serge Sharoff (University of Leeds, Workshop Chair)

Programme committee

Organising committee plus:

  • Silvia Bernardini, U of Bologna, Italy
  • Stefan Evert, U of Osnabrück, Germany
  • Cédrick Fairon, UCLouvain, Belgium
  • William H. Fletcher, U.S. Naval Academy, USA
  • Gregory Grefenstette, Exalead, France
  • Igor Leturia, Elhuyar Fundazioa, Basque Country, Spain
  • Preslav Nakov, National U of Singapore
  • Jan Pomikalek, Masaryk U, Czech Republic
  • Reinhard Rapp, U Mainz, Germany
  • Kevin Scannell, Saint Louis U, USA
  • Gilles-Maurice de Schryver, U Gent, Belgium
  • Pierre Zweigenbaum, LIMSI, France