* task-based ("extrinsic") evaluation of web corpora, especially in comparison to traditional corpus resources and n-gram databases (Web 1T 5-Grams, Google Books)
* missing metadata in web corpora: enriching web corpora with metadata through high-accuracy automatic classification
* sampling strategies/crawling algorithms and their effect on corpus composition/corpus quality
* non-destructive cleaning and normalization of web data (Currently available web corpora have usually undergone radical cleaning procedures in order to produce "high-quality" data. At least for some uses of the data, aggressive and sometimes arbitrary removal of whole documents or parts thereof can be problematic. The same is true of aggressive normalization. To address such problems, ways of cleaning and normalizing the data transparently, i.e., while preserving the non-normalized forms, should be discussed.)
* Serge Sharoff, University of Leeds
* Sabine Schulte im Walde, Universität Stuttgart
* Egon Stemle, European Academy of Bolzano
* Yannick Versley, Universität Heidelberg
* Torsten Zesch, Universität Darmstadt
* Stephen Wattam, Lancaster University
