15 | | * [WAC5] is scheduled for 8 September 2009, San Sebastian, Spain |
| 15 | * WAC5 is scheduled for 8 September 2009, San Sebastian, Spain |
| 16 | |
| 17 | We invite papers on various topics concerning the use of Web resources for corpus research and NLP applications, including (but not limited to) the following: |
| 18 | |
| 19 | * linguistic Web crawler technology and Web corpus collection projects |
| 20 | * applications of Web-derived corpora and other kinds of Web data |
| 21 | * how far does the “easy way” get you? (using search engines, or Google's n-gram lists; we are particularly interested in a critical discussion of the usefulness and limitations of such approaches) |
| 22 | * methods and tools for “cleaning” Web pages to turn them into a corpus (contributors to this topic will be encouraged to participate in the second CLEANEVAL competition to be held in 2009) |
| 23 | * automatic linguistic annotation of Web data: tokenisation, POS tagging, lemmatisation, semantic tagging, etc. (established tools often perform very poorly on Web data) |
| 24 | * search engine architectures for linguists: bringing linguistics to commercial search engines, or high-performance search technology to linguistics? |
| 25 | * search engine-related topics such as result ranking (e.g. how to identify “typical” uses rather than returning 50 very similar matches on the first page) |
| 26 | * duplicate detection, interactive query refinement, etc. |
| 27 | * reviews and clever uses of search engine APIs (Google, Yahoo, AltaVista, and in particular Microsoft's current generous LiveSearch API) |