CLEANEVAL: Guidelines for annotators
Marco Baroni, Serge Sharoff
Last updated by Adam Kilgarriff, 22 Jan 2007
Introduction
Your task is to "clean up" a set of webpages so that their contents
can be easily used for further linguistic processing and analysis. In
short, this involves:
- removing all HTML/Javascript code and "boilerplate" (headers,
copyright notices, link lists, materials repeated across most pages of
a site, etc.);
- adding a basic encoding of the structure of the page using a
minimal set of symbols to mark the beginning of headers, paragraphs
and list elements.
Input
You will start with a set of links to webpages and numbered text files
that contain preprocessed versions of those pages with some code
already removed.
- Open the local copy of your HTML file in a browser and scroll through the page before cleaning (if the page is not displayed properly, you can open the original page via the link provided).
- Save its text version to your local disk under the same name, with your email account name appended. For instance, if your account is ml06zz, save your 163.txt file as 163-ml06zz.txt
- Open the text file in a text editor (preferably Notepad++, available from V:\MODL5007\n++\, and ABSOLUTELY NOT MS Word) and clean the text file, paying attention to the formatting displayed in the browser.
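The naming convention above can be sketched in a few lines of Python. This is only an illustration: the function name annotated_name is hypothetical and is not part of any CLEANEVAL tooling.

```python
from pathlib import Path


def annotated_name(original: str, account: str) -> str:
    """Build the annotator's file name, e.g. 163.txt -> 163-ml06zz.txt."""
    path = Path(original)
    # Keep the numeric stem and the extension, insert the account in between.
    return f"{path.stem}-{account}{path.suffix}"


print(annotated_name("163.txt", "ml06zz"))  # 163-ml06zz.txt
```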
Code Removal
Despite the preliminary cleaning we perform, some code from the HTML
pages might remain. You can spot it because it does not appear on the
webpage as displayed in the browser. It will often look like this:
text-decoration: none;
color: #33F;
background: #FFFFF5;
or
start += "9";
var end = allcookies.indexOf(';', start);
Please remove such fragments if this text is not displayed on the
webpage itself.
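If you want help spotting such fragments, a rough heuristic along the following lines can flag candidate lines. It is a sketch only: the two patterns are assumptions modelled on the examples above, they will miss other kinds of code, and they can flag ordinary text that happens to end in a semicolon. Always check a flagged line against the page in your browser before deleting it.

```python
import re

# Assumed pattern for a CSS declaration such as "color: #33F;".
CSS_DECLARATION = re.compile(r"^\s*[a-zA-Z-]+\s*:\s*[^;]+;\s*$")
# Assumed pattern for simple JavaScript statements such as
# "var end = ...;" or 'start += "9";'.
JS_STATEMENT = re.compile(r"^\s*(var\s+\w+\s*=|\w+\s*[+\-*/]?=\s).*;\s*$")


def looks_like_code(line: str) -> bool:
    """Return True if the line resembles leftover CSS or JavaScript."""
    return bool(CSS_DECLARATION.match(line) or JS_STATEMENT.match(line))
```

With these patterns, all five example lines shown above are flagged, while a sentence such as "Today, it's been a very productive day." is not.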
Boilerplate removal
Most webpages contain what we call "boilerplate", i.e., textual
materials that, intuitively, are extraneous to the proper, coherent
contents of the page.
Boilerplate is often machine-generated, and includes (but is not
necessarily limited to):
- Navigation information
- Internal and external link lists
- Copyright notices and other legal information
- Standard header, footer and template materials that are repeated across (a subset of) the pages of the same site
- Advertisements
- Web-spam, such as automated postings by spammers to blogs
When cleaning a webpage from a discussion forum, you may find replies
that quote substantial portions of other postings, either in boxes or
after '>'. Delete such quoted fragments as well.
Boilerplate must be removed from all the pages in the corpus.
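For forum replies quoted after '>', the deletion can be sketched as below. The function name drop_quoted_lines is hypothetical, and the sketch assumes that every quoted line starts with '>' in the text file. Quotes shown in boxes carry no such marker and still have to be found by comparing with the browser.

```python
def drop_quoted_lines(text: str) -> str:
    """Remove lines that begin with '>' (ignoring leading whitespace)."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


post = "> I think the patch is wrong.\nI agree with this.\n  >> older quote\nThanks!"
print(drop_quoted_lines(post))
```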
Structural annotation
We would like to preserve some basic information about the structure
of the page.
- <p>
- Insert the symbol <p> before any paragraph in the document (a
paragraph might look like a traditional printed paragraph, or it might
be a textual/typographic unit more specific to the Web, such as a post
to a bulletin board or a comment on a blog entry).
- <h>
- Use the symbol <h> at the beginning of each
section which looks, on the original page viewed in your browser, like
a header of some sort or other information "about" the text, rather
than part of the text itself (a title-like sequence at the beginning
of the document, the title of a section, the page author, etc.). Displaying a paragraph in bold does not, by itself, make it a header.
- <l>
- Finally, use the symbol <l> for any element of a list. The same symbol is used for itemized and numbered lists.
An (artificially simple) annotated page might look like this:
<h>My blog
<p>Hi guys!
<p>Today, it's been a very productive day. In the morning, I did the
following three things:
<l>Brush teeth
<l>Take shower
<l>Shave
<h>Comments
<h>Becksy
<p>Great, man!
Do not worry about white-spaces and newlines: we will normalize them
afterward, using your annotation as the only reliable source of
information about the structure of the page.
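To illustrate how the annotation alone can drive such normalization, here is a minimal sketch. It is not the actual CLEANEVAL post-processing: the function name parse_annotated is hypothetical, and the sketch assumes the file contains only the three symbols described above.

```python
import re
from typing import List, Tuple

# Matches one of the three structural symbols and captures its letter.
TAG = re.compile(r"<([hpl])>")


def parse_annotated(text: str) -> List[Tuple[str, str]]:
    """Split annotated text into (symbol, content) pairs.

    Newlines and runs of spaces inside an element are collapsed to single
    spaces, so only the symbols determine the structure.
    """
    # With a capturing group, re.split returns
    # [text before first symbol, symbol1, content1, symbol2, content2, ...].
    parts = TAG.split(text)
    elements = []
    for tag, content in zip(parts[1::2], parts[2::2]):
        elements.append((tag, " ".join(content.split())))
    return elements
```

Applied to the sample page above, this yields nine elements. The second <p> element comes out as one line even though it is wrapped over two lines in the file.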
No nesting
Note that the markup is not nested in any way. The current scheme does not
represent "lists within paragraphs" or "headings within lists" or any other nested structure, and each
opening tag will be interpreted as 'closing' the previous element. While we fully realise that this scheme is linguistically impoverished, allowing nested structures would introduce a range of complexities and sources of disagreement.
Deletions only
No regular text at all is to be added to the input. The task is to make deletions, and add tags; never to add text (even if it appears as regular text in the browser).
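One way to check your own work against this rule is sketched below. It assumes that deletions remove whole whitespace-separated tokens; if you delete only part of a token (for instance a footnote number attached to a word), the check will report a mismatch that you need to inspect by hand. The function name is_deletion_only is hypothetical.

```python
import re

# The three structural symbols that annotators are allowed to add.
TAG = re.compile(r"<[hpl]>")


def is_deletion_only(original: str, annotated: str) -> bool:
    """Return True if the annotated text, with symbols stripped, can be
    obtained from the original by deleting whole tokens."""
    original_tokens = original.split()
    annotated_tokens = TAG.sub(" ", annotated).split()
    remaining = iter(original_tokens)
    # `token in remaining` consumes the iterator up to the first match,
    # so the annotated tokens must occur in the original in the same order.
    return all(token in remaining for token in annotated_tokens)


original = "Home | About My blog Hi guys! Copyright 2007"
print(is_deletion_only(original, "<h>My blog\n<p>Hi guys!"))     # True
print(is_deletion_only(original, "<h>My blog\n<p>Hello guys!"))  # False
```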
Troubleshooting
- Keep lists of links (for instance, links to the full text of news articles or to software downloads) only if the links contain full sentences. Delete them otherwise.
- Remove forms, such as those for entering a name, address, etc. Also remove the headings of such forms.
- Do not add paragraph/header marks if the text in your browser does not look like a separate paragraph.
- Remove links to footnotes (in your browser they are normally displayed as superscript numbers with links).
Tips
The following are simple tips that can help you in annotating webpages faster. Feel free to use any other technique you like.
- Use two parallel windows aligned vertically.
- Use keyboard shortcuts instead of the mouse when possible. To switch between windows, use Alt-Tab; to navigate on the page, use the navigation keys (PageUp, PageDown). To select text for deletion, hold Shift and use the navigation keys.
- <p> is the most frequent symbol you will insert. You can type it once, then select it and copy it to the clipboard by pressing Ctrl-C. After that you can insert it wherever needed using Ctrl-V.