CLEANEVAL: Guidelines for annotators

Marco Baroni, Serge Sharoff
Last updated Adam Kilgarriff 22 Jan 2007

Introduction

Your task is to "clean up" a set of webpages so that their contents can be easily used for further linguistic processing and analysis. In short, this involves

  1. removing all HTML/Javascript code and "boilerplate" (headers, copyright notices, link lists, materials repeated across most pages of a site, etc.);
  2. adding a basic encoding of the structure of the page using a minimal set of symbols to mark the beginning of headers, paragraphs and list elements.

Input

You will start with a set of links to webpages and numbered text files that contain preprocessed versions of those pages with some code already removed.

  1. Open the local copy of your HTML file in a browser and scroll through the page before cleaning (if the page is not displayed properly, you can use the original page with the link provided).
  2. Save a copy of its text version to your local disk under the same name, with your email account name appended. For instance, if your account is ml06zz, save your 163.txt file as 163-ml06zz.txt
  3. Open the text file in a text editor (preferably Notepad++, available from V:\MODL5007\n++\, and ABSOLUTELY NOT MS Word) and clean the text file paying attention to the formatting displayed in the browser.
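
If you have many files to rename, the naming convention in step 2 can be scripted. The following is only a minimal sketch: the function name annotated_name is hypothetical, and it assumes the account name is joined to the file number with a single hyphen, as in the example above.

```python
from pathlib import Path


def annotated_name(original: str, account: str) -> str:
    """Build the annotator's file name, e.g. 163.txt -> 163-ml06zz.txt."""
    path = Path(original)
    # Keep the original stem and extension; append the account after a hyphen.
    return f"{path.stem}-{account}{path.suffix}"


print(annotated_name("163.txt", "ml06zz"))  # prints 163-ml06zz.txt
```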

Code Removal

Despite the preliminary cleaning we perform, some code from HTML pages might remain. It is possible to detect it as text that does not appear on the webpage. It will often look like this:

text-decoration:	none;
color:	#33F;
background:	#FFFFF5;

or

start += "9";
var end = allcookies.indexOf(';', start);

Please remove such fragments if this text is not displayed on the webpage itself.
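
If you want help spotting such fragments in a long file, a rough filter along the following lines can flag candidate lines. This is a heuristic sketch only, not part of the official procedure: the two patterns and the function name looks_like_code are assumptions chosen to match the examples above, and ordinary text such as "Note: bring your ID;" would also be flagged. Always check each flagged line against the page in your browser before deleting it.

```python
import re

# A CSS-like declaration: "property: value;" alone on a line.
CSS_DECL = re.compile(r"^\s*[a-zA-Z-]+\s*:\s*[^;]+;\s*$")
# A Javascript-like statement: "var x ..." or an assignment, ending in ";".
JS_HINT = re.compile(r"^\s*(?:var\s+\w+|\w+\s*\+?=[^=]).*;\s*$")


def looks_like_code(line: str) -> bool:
    """Return True if the line resembles leftover CSS or Javascript."""
    return bool(CSS_DECL.match(line) or JS_HINT.match(line))


for line in ["color:\t#33F;", 'start += "9";', "Today was a productive day."]:
    print(looks_like_code(line), repr(line))
```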

Boilerplate removal

Most webpages contain what we call "boilerplate", i.e., textual materials that, intuitively, are extraneous to the proper, coherent contents of the page.

Boilerplate is often machine-generated, and includes (but is not necessarily limited to) headers, copyright notices, link lists, and materials repeated across most pages of a site.

If you are cleaning a webpage from a discussion forum, you may find replies that quote substantial portions of other postings, either in boxes or after '>'. Delete such fragments as well.

Boilerplate must be removed from all the pages in the corpus.
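
For forum pages where quoted material is marked with '>', a short script can remove the quoted lines in one pass. This is a sketch under stated assumptions: the function name drop_quoted_lines is hypothetical, it drops every line whose first non-blank character is '>', and it cannot detect quotes that are shown in boxes, which you still need to find by looking at the page in your browser.

```python
def drop_quoted_lines(text: str) -> str:
    """Remove lines that start with '>' after optional leading whitespace."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


post = "> Did you try rebooting?\nYes, twice.\n>> original question\nStill broken."
print(drop_quoted_lines(post))
```

Review the result by hand afterwards, since a line of ordinary text may occasionally begin with '>'.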

Structural annotation

We would like to preserve some basic information about the structure of the page.

<p>
Insert the symbol <p> before any paragraph in the document (a paragraph might look like a traditional printed paragraph, or it might be a textual/typographic unit more specific to the Web, such as a post to a bulletin board or a comment on a blog entry).
<h>
Use the symbol <h> at the beginning of each section that looks, on the original page viewed in your browser, like a header of some sort or other information "about" the text, rather than part of the text itself (a title-like sequence at the beginning of the document, the title of a section, the page author, etc.). A paragraph that is merely displayed in bold is not a header.
<l>
Finally, use the symbol <l> for any element of a list. The same symbol is used for itemized and numbered lists.

An (artificially simple) annotated page might look like this:


<h>My blog

<p>Hi guys!

<p>Today, it's been a very productive day. In the morning, I did the
following three things:

<l>Brush teeth

<l>Take shower

<l>Shave

<h>Comments

<h>Becksy

<p>Great, man!

Do not worry about white-spaces and newlines: we will normalize them afterward, using your annotation as the only reliable source of information about the structure of the page.
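
To illustrate why whitespace does not matter, here is a minimal sketch of how annotated text could be read back into a flat list of elements. It is not the actual normalization script: the function name parse_annotated and the choice to collapse every run of whitespace into a single space are assumptions made for this example.

```python
import re

# The three structural symbols used in the annotation.
TAG_RE = re.compile(r"<(p|h|l)>")


def parse_annotated(text: str) -> list[tuple[str, str]]:
    """Split annotated text into (tag, content) pairs, collapsing whitespace."""
    parts = TAG_RE.split(text)
    # parts[0] is any text before the first tag; after that the list
    # alternates between a tag name and the content that follows it.
    segments = []
    for tag, content in zip(parts[1::2], parts[2::2]):
        segments.append((tag, " ".join(content.split())))
    return segments


sample = "<h>My blog\n\n<p>In the morning, I did the\nfollowing three things:\n\n<l>Brush teeth"
for tag, content in parse_annotated(sample):
    print(tag, "|", content)
```

In this sketch each element simply runs from its symbol to the next symbol, so the line break inside the paragraph has no effect on the result.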

No nesting

Note that the markup is not nested in any way. The current scheme does not represent "lists within paragraphs" or "headings within lists" or any other nested structure, and each opening tag will be interpreted as 'closing' the previous element. While we fully realise that this scheme is linguistically impoverished, allowing nested structures would introduce a range of complexities and sources of disagreement.

Deletions only

No regular text at all is to be added to the input. The task is to make deletions, and add tags; never to add text (even if it appears as regular text in the browser).
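
One way to check your own work against this rule is to test whether the cleaned text, with the structural symbols stripped, can be obtained from the input by deleting words only. The sketch below is an illustration under assumptions, not an official validation tool: the function name is_deletion_only is hypothetical, and the comparison works on whitespace-separated tokens, so deleting only part of a token (for instance a footnote number attached to a word) would be reported as a violation.

```python
import re

# Structural symbols are allowed additions, so they are stripped first.
TAG_RE = re.compile(r"<(?:p|h|l)>")


def is_deletion_only(original: str, cleaned: str) -> bool:
    """True if cleaned (minus symbols) is a token subsequence of original."""
    source_tokens = iter(original.split())
    cleaned_tokens = TAG_RE.sub(" ", cleaned).split()
    # "token in iterator" consumes the iterator up to the first match,
    # so tokens must appear in the same order as in the original.
    return all(token in source_tokens for token in cleaned_tokens)


original = "Home | About My blog Hi guys! Copyright 2007"
print(is_deletion_only(original, "<h>My blog\n<p>Hi guys!"))
print(is_deletion_only(original, "<h>My blog\n<p>Hello guys!"))
```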

Troubleshooting

  1. Keep lists of links (for instance, to the full text of news articles or to software downloads) only if the links contain full sentences. Delete them otherwise.
  2. Remove forms, such as those for filling in a name, address, etc. Also remove the headings of such forms.
  3. Do not add paragraph/header marks if the text in your browser does not look like a separate paragraph.
  4. Remove links to footnotes (in your browser they are normally displayed as superscript numbers with links).

Tips

The following are simple tips that can help you in annotating webpages faster. Feel free to use any other technique you like.

  1. Use two parallel windows aligned vertically.
  2. Use keyboard shortcuts instead of the mouse when possible. To switch between windows, use Alt-Tab; to navigate within the page, use the navigation keys (PageUp, PageDown). To select text for deletion, hold Shift and use the navigation keys.
  3. <p> is the most frequent symbol you will insert. You can type it once, then select it and copy it to the clipboard by pressing Ctrl-C. After that you can insert it wherever needed using Ctrl-V.