Changes between Version 15 and Version 16 of WAC-XI


Timestamp: Feb 13, 2017, 3:47:22 PM (3 years ago)
Author: Roland Schäfer
Comment: --

  • WAC-XI

    v15 v16  
    65 65 === !CleanerEval first panel discussion ===
    66 66
    67    As part of the workshop and consistent with its general theme, we plan to organise a panel discussion as the first meeting of the !CleanerEval shared task on combined paragraph and document quality detection for (web) documents. The !CleanerEval shared task follows the successful CleanEval shared task organised by SIGWAC in 2006. While !CleanEval focused specifically on boilerplate removal (the removal of automatically inserted and frequently repeated non-corpus material from web pages), !CleanerEval goes beyond this basic task. Participating systems should be able to determine the linguistic quality of paragraphs and whole documents in an automatic fashion, such that corpus designers and/or users can decide whether to include them in their corpus or not. In the "CleanerEval setting, boilerplate paragraphs are paragraphs with low quality, but there might be other, non-boilerplate paragraphs with low quality as well. !CleanerEval was proposed by the organisers of WAC-XI during the final discussion of WAC-X, where the proposal was met with great interest. The WAC-XI panel discussion is intended to serve as a platform for the development of the operationalisation of the notions of paragraph and document quality, the annotation guidelines, and the final schedule for the shared task. There can be no doubt that corpus linguists should define what counts as good corpus material and what does not. It would be misguided to treat this question as a purely technical one. The final meeting of the shared task is planned to be part of WAC-XII in 2018.
       67 As part of the workshop and consistent with its general theme, we plan to organise a panel discussion as the first meeting of the !CleanerEval shared task on combined paragraph and document quality detection for (web) documents. The !CleanerEval shared task follows the successful CleanEval shared task organised by SIGWAC in 2006. While !CleanEval focused specifically on boilerplate removal (the removal of automatically inserted and frequently repeated non-corpus material from web pages), !CleanerEval goes beyond this basic task. Participating systems should be able to determine the linguistic quality of paragraphs and whole documents in an automatic fashion, such that corpus designers and/or users can decide whether to include them in their corpus or not. In the !CleanerEval setting, boilerplate paragraphs are paragraphs with low quality, but there might be other, non-boilerplate paragraphs with low quality as well. !CleanerEval was proposed by the organisers of WAC-XI during the final discussion of WAC-X, where the proposal was met with great interest. The WAC-XI panel discussion is intended to serve as a platform for the development of the operationalisation of the notions of paragraph and document quality, the annotation guidelines, and the final schedule for the shared task. There can be no doubt that corpus linguists should define what counts as good corpus material and what does not. It would be misguided to treat this question as a purely technical one. The final meeting of the shared task is planned to be part of WAC-XII in 2018.
    68 68
    69 69