15 | | * WAC5 is scheduled for 8 September 2009, San Sebastian, Spain |
16 | | |
17 | | We invite papers on various topics concerning the use of Web resources for corpus research and NLP applications, including (but not limited to) the following: |
18 | | |
19 | | * linguistic Web crawler technology and Web corpus collection projects |
20 | | * applications of Web-derived corpora and other kinds of Web data |
21 | | * how far does the “easy way” get you? (using search engines, or Google's n-gram lists; we are particularly interested in a critical discussion of the usefulness and limitations of such approaches) |
22 | | * methods and tools for “cleaning” Web pages to turn them into a corpus (contributors to this topic will be encouraged to participate in the second CLEANEVAL competition to be held in 2009) |
23 | | * automatic linguistic annotation of Web data: tokenisation, POS tagging, lemmatisation, semantic tagging, etc. (established tools often perform very poorly on Web data) |
24 | | * search engine architectures for linguists: bringing linguistics to commercial search engines, or high-performance search technology to linguistics? |
25 | | * search engine-related topics such as result ranking (e.g. how to identify “typical” uses rather than returning 50 very similar matches on the first page) |
26 | | * duplicate detection, interactive query refinement, etc. |
27 | | * reviews and clever uses of search engine APIs (Google, Yahoo, AltaVista, and in particular Microsoft's currently generous LiveSearch API) |
| 15 | * [wiki:WAC5] is scheduled for 8 September 2009, San Sebastian, Spain |