= 7th Web as Corpus Workshop (WAC-7) =
== Lyon, France; 17th April 2012 ==

To be held in association with [http://www2012.org/ WWW2012]. Sponsored by [http://www.sigwac.org.uk ACL SIGWAC].

More and more people are using Web data for linguistic and NLP research: the Web provides an easy source of linguistic data in a great variety of languages. However, a ‘crawl’ is not ready for exploration in the same way a traditional ‘corpus’ is. We need to turn a crawl into a corpus. The workshop, the seventh in an annual series, provides a venue for exploring what this involves, how to do it, and what we find out if we do.

We invite submissions which:
 * describe Web corpus collection projects, or modules for one part of the process (crawling, filtering, de-duplication, language-id, tokenising, indexing, ...)
 * explore characteristics of Web data from a linguistics/NLP perspective, including registers, domains, frequency distributions, and comparisons between datasets
 * use crawled Web data for NLP purposes (with emphasis on the data rather than the use)

The previous WAC workshops have been co-located with various conferences in computational linguistics. This time the workshop is co-located with WWW2012, the main world conference on Web technologies and their impact on society.

== Programme ==

Room Saint Clair 4 at Convention Centre, WWW2012

The proceedings are available from [https://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf here]

||9.00|| '''Welcome''' ||
||9.10|| '''Invited Talk''': ''Benno Stein'' ||
|| || Exploiting the Web for Text and Language Reuse Applications ||
||10.00|| ''Marco Brunello'' ||
|| || Understanding the composition of parallel corpora from the web ||
||10.25|| ''Vit Suchomel, Jan Pomikalek'' ||
|| || Efficient Web Crawling for Large Text Corpora ||
||10.40|| '''Coffee''' ||
||11.00|| ''Ed Chow, Dayne Freitag, Paul Kalmar, Tulay Muezzinoglu, John Niekrasz'' ||
|| || A corpus of online discussions for research into linguistic memes ||
||11.25|| ''Paul Rayson, Oliver Charles, Ian Auty'' ||
|| || Can Google count? Estimating search engine result consistency ||
||11.50|| ''Tobias Roth'' ||
|| || Using Web Corpora for the Recognition of Regional Variation in Standard German Collocations ||
||12.15|| ''Yannick Versley, Yana Panchenko'' ||
|| || Not Just Bigger: Towards Better-Quality Web Corpora ||
||12.40|| '''Discussion, wrap-up''' ||
||13.00|| '''End''' ||

{{{#!comment
== Important dates ==
 * Submission by '''January 30 2012,''' to be made through [https://www.easychair.org/conferences/?conf=wac7 EasyChair]
 * Notification of acceptance by February 6
 * Camera-ready copy due February 15

Submissions should be formatted using the [http://www.acm.org/sigs/publications/proceedings-templates ACM SIG stylefiles], and not exceeding 8 pages plus an extra page for references. Each submission will be reviewed by at least two members of the programme committee. Accepted papers will be published in the workshop proceedings.
}}}

== Organising committee ==
 * Adam Kilgarriff (Lexical Computing Ltd.)
 * Serge Sharoff (University of Leeds, Workshop Chair)

== Programme committee ==

Organising committee plus:
 * Silvia Bernardini, U of Bologna, Italy
 * Stefan Evert, U of Osnabrück, Germany
 * Cédrick Fairon, UCLouvain, Belgium
 * William H. Fletcher, U.S. Naval Academy, USA
 * Gregory Grefenstette, Exalead, France
 * Igor Leturia, Elhuyar Fundazioa, Basque Country, Spain
 * Preslav Nakov, National U of Singapore
 * Jan Pomikalek (Masaryk University)
 * Reinhard Rapp, U Mainz, Germany
 * Kevin Scannell, Saint Louis U, USA
 * Gilles-Maurice de Schryver, U Gent, Belgium
 * Pierre Zweigenbaum, LIMSI, France