The script available at http://corpus1.leeds.ac.uk/cleaneval/cleanset.pl was used to pre-process the data. You may choose to use it, something based on it, or neither.
The scoring script is a Perl script which takes two arguments: first, the file to be scored, and second, the gold-standard file to compare it with. It calculates two scores: (1) one based on the edit distance between the two files and on the extent to which contestant-inserted markup tags indicate blocks of text starting and ending in the same places; and (2) one based on alignment of the text alone, ignoring the contestant-inserted markup tags. Comments in the code provide more detail. The script has been well tested for English but less thoroughly for Chinese; we hope to publish an amended version for Chinese shortly. It is available at http://cleaneval.sigwac.org.uk/cleaneval_scorer.zip (zipped so that our server does not try to run it).
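For illustration only, the following Python sketch mimics the second kind of score (alignment of text alone, ignoring markup): it strips tags from the candidate and gold-standard files and reports a word-level similarity ratio. It is not the official scorer, and the tag-stripping and difflib-based comparison are assumptions of this sketch, not a description of the Perl code.

    # Unofficial sketch of an alignment-only comparison between a candidate
    # file and the gold standard; the real Perl scorer differs in detail.
    import re
    import sys
    from difflib import SequenceMatcher

    def strip_markup(text):
        """Remove tags and collapse whitespace so only running text is compared."""
        text = re.sub(r"<[^>]+>", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def alignment_score(candidate_path, gold_path):
        """Return a 0-1 word-level similarity ratio between the de-tagged texts."""
        with open(candidate_path, encoding="utf-8", errors="replace") as f:
            candidate = strip_markup(f.read()).split()
        with open(gold_path, encoding="utf-8", errors="replace") as f:
            gold = strip_markup(f.read()).split()
        return SequenceMatcher(None, candidate, gold).ratio()

    if __name__ == "__main__":
        # Same argument order as the Perl scorer: candidate file, then gold standard.
        print(alignment_score(sys.argv[1], sys.argv[2]))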
Prepared by Francis Chantree.
There are three versions of each file: original, pre-processed ("stripped"), and manually cleaned. All files of each kind are gathered in a single directory, and the file number remains the same across the three versions of the same file.
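Because the file number is shared, pairing a stripped file with its gold-standard cleaned version is a matter of matching numbers across directories. The sketch below shows one way to do this in Python; the directory names and the assumption that each file name contains its number are guesses about how the archives unpack, not part of the distribution.

    # Hypothetical pairing of the three versions of each file by number;
    # directory and file naming conventions are assumptions.
    import os
    import re

    def index_by_number(directory):
        """Map the numeric part of each file name to its full path."""
        index = {}
        for name in os.listdir(directory):
            match = re.search(r"\d+", name)
            if match:
                index[int(match.group())] = os.path.join(directory, name)
        return index

    def paired_files(original_dir, stripped_dir, cleaned_dir):
        """Yield (original, stripped, cleaned) paths sharing a file number."""
        original = index_by_number(original_dir)
        stripped = index_by_number(stripped_dir)
        cleaned = index_by_number(cleaned_dir)
        for number in sorted(original.keys() & stripped.keys() & cleaned.keys()):
            yield original[number], stripped[number], cleaned[number]

    if __name__ == "__main__":
        for paths in paired_files("en-original", "en-stripped", "en-cleaned"):
            print(*paths)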
There are around 60 items for each language, mirrored at Leeds and Trento (a download sketch follows the list):
Leeds
English original | http://corpus1.leeds.ac.uk/cleaneval/devel/en-original.tgz |
Chinese original | http://corpus1.leeds.ac.uk/cleaneval/devel/zh-original.tgz |
English stripped | http://corpus1.leeds.ac.uk/cleaneval/devel/en-stripped.tgz |
Chinese stripped | http://corpus1.leeds.ac.uk/cleaneval/devel/zh-stripped.tgz |
English cleaned | http://corpus1.leeds.ac.uk/cleaneval/devel/en-cleaned.tgz |
Chinese cleaned | http://corpus1.leeds.ac.uk/cleaneval/devel/zh-cleaned.tgz |
Trento
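As a minimal sketch, assuming you want to fetch and unpack one of the archives listed above, something like the following would do. The URL is the Leeds copy of the English stripped set, taken from the table; the layout after extraction is not guaranteed.

    # Download and extract one of the development archives; the choice of
    # archive and the local destination are only examples.
    import tarfile
    import urllib.request

    URL = "http://corpus1.leeds.ac.uk/cleaneval/devel/en-stripped.tgz"

    def fetch_and_extract(url, destination="."):
        """Download a .tgz archive and unpack it into the given directory."""
        local_name, _ = urllib.request.urlretrieve(url)
        with tarfile.open(local_name, "r:gz") as archive:
            archive.extractall(path=destination)

    if __name__ == "__main__":
        fetch_and_extract(URL)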
The development dataset is not designed as a training set for supervised methods. We suspect that the task is too heterogeneous for supervised training to be appropriate, and the set will in any case be too small. Participants are of course free to explore whether supervised methods nevertheless perform well.