Version 13 (modified by Serge Sharoff, 14 years ago)


ACL SIGWAC home page

SIGWAC is the Special Interest Group of the Association for Computational Linguistics (ACL) on Web as Corpus. Its aims are:


  • to promote interest in the use of the web as a source of linguistic data, and as an object of study in its own right;
  • to provide ACL members who have a special interest in the web as corpus with a means of exchanging news of recent research developments and other matters of interest;
  • to sponsor meetings and workshops on the web as corpus that appear to be timely and worthwhile.


We invite papers on various topics concerning the use of Web resources for corpus research and NLP applications, including (but not limited to) the following:

  • linguistic Web crawler technology and Web corpus collection projects
  • applications of Web-derived corpora and other kinds of Web data
  • how far does the “easy way” get you? (using search engines, or Google's n-gram lists; we are particularly interested in a critical discussion of the usefulness and limitations of such approaches)
  • methods and tools for “cleaning” Web pages to turn them into a corpus (contributors to this topic will be encouraged to participate in the second CLEANEVAL competition to be held in 2009)
  • automatic linguistic annotation of Web data: tokenisation, POS tagging, lemmatisation, semantic tagging, etc. (established tools often perform very poorly on Web data)
  • search engine architectures for linguists: bringing linguistics to commercial search engines, or high-performance search technology to linguistics?
  • search engine-related topics such as result ranking (e.g. how to identify “typical” uses rather than returning 50 very similar matches on the first page), duplicate detection, interactive query refinement, etc.
  • reviews and clever uses of search engine APIs (Google, Yahoo, AltaVista, and in particular Microsoft's current generous LiveSearch API)


  • CLEANEVAL, a competition for cleaning webpages
  • Mailing list:
    • sign up here
    • address to send mail to: sigwac@…


Constitution here.

Useful resources
