= 8th Web as Corpus Workshop (WAC-8) @ [http://ucrel.lancs.ac.uk/cl2013/ Corpus Linguistics 2013] =
== Monday, 22 July 2013 (Lancaster, UK) ==

//Endorsed by [http://www.sigwac.org.uk ACL SIGWAC].//


Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing.  The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types.  However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corpus is.

Since the first Web as Corpus Workshop, organised at the Corpus Linguistics 2005 Conference, a highly successful series of yearly Web as Corpus workshops has provided a venue for interested researchers to meet, share ideas and discuss the problems and possibilities of compiling and using Web corpora.  After a stronger focus on application-oriented natural language processing and Web technology in recent years – with workshops taking place at NAACL-HLT 2010, 2011 and WWW 2012 – the 8th Web as Corpus Workshop returns to its roots in the corpus linguistics community.

Accordingly, the leading theme of this workshop is the application of Web data in language research, including linguistic evaluation of Web-derived corpora as well as strategies and tools for high-quality automatic annotation of Web text. We invite papers on all aspects of building and using Web corpora, with a particular focus on (but not limited to) the following:

 * applications of Web corpora and other Web-derived data sets for language research
 * automatic linguistic annotation of Web data such as tokenisation, part-of-speech tagging, lemmatisation and semantic tagging\\ (the accuracy of currently available off-the-shelf tools is still unsatisfactory for many types of Web data)
 * critical exploration of the characteristics of Web data from a linguistic perspective and its applicability to language research
 * presentation of Web corpus collection projects or software tools required for some part of this process (crawling, filtering, de-duplication, language identification, indexing, ...)


{{{#!comment
== Important dates ==
 * ~~March 3~~ March 7: Submission of extended abstract to be made through !EasyChair ([https://www.easychair.org/conferences/?conf=wac8 closed])
 * ~~March 17~~ ~~March 23~~ March 27: Notification of acceptance
 * June 23: Submission of full paper
 * July 22: Workshop
}}}


{{{#!comment
== Accepted Papers ==
([wiki:WAC8/accepted_papers Abstracts])
}}}

== Proceedings ==
Download the [raw-attachment:wac8-proceedings.pdf proceedings (PDF)].


{{{#!comment
{{{
@proceedings{Evert2013,
title = {Proceedings of the 8th Web as Corpus Workshop (WAC-8)},
year = {2013},
address = {Lancaster, UK},
editor = {Evert, Stefan and Stemle, Egon and Rayson, Paul},
}
}}}
}}}


== Programme ==

||  9:00 - 11:00||||=  '''Session 1 (Introduction & Methodology)'''  =||
||  9:00|| Akshay Minocha, Siva Reddy and Adam Kilgarriff        ([raw-attachment:talk01.pptx slides]) || Feed Corpus: An Ever Growing Up-to-date Corpus ||
||  9:30|| Stephen Wattam, Paul Rayson and Damon Berridge        ([raw-attachment:talk02.pdf slides]) || LWAC: Longitudinal Web-as-Corpus Sampling ||
|| 10:00|| Roland Schäfer, Adrien Barbaresi and Felix Bildhauer  ([raw-attachment:talk03.pdf slides]) || The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction ||
|| 10:30|| Jesse Egbert and Douglas Biber                        ([raw-attachment:talk04.pdf slides]) || Developing a User-based Method of Web Register Classification ||
|| 11:00 - 11:30||  '''Tea Break'''                               ||||
|| 11:30 - 13:00||||=  '''Session 2 (Methodology 2)'''             =||
|| 11:30|| Alexander Piperski, Vladimir Belikov, Nikolay Kopylov, Vladimir Selegey and Serge Sharoff ([raw-attachment:talk05.pdf slides]) || Big and diverse is beautiful: A large corpus of Russian to study linguistic variation ||
|| 12:00|| David Lutz, Parry Cadwallader and Mats Rooth          ([raw-attachment:talk06.pdf slides]) || A web application for filtering and annotating web speech data || 
|| 12:30|| Sarah Schulz, Verena Lyding and Lionel Nicolas        ([raw-attachment:talk07.pdf slides]) || STirWaC - Compiling a diverse corpus based on texts from the web for South Tyrolean German ||
|| 13:00 - 14:00||  '''Lunch'''                                   ||||
|| 14:00 - 15:30||||=  '''Session 3 (Compilation)'''               =|| 
|| 14:00|| Adam Kilgarriff and Vít Suchomel                      ([raw-attachment:talk08.pptx slides]) || Web Spam ||
|| 14:30|| Adriano Ferraresi and Silvia Bernardini               ([raw-attachment:talk09.pdf slides]) || The academic Web-as-Corpus ||
|| 15:00|| Silke Scheible, Sabine Schulte im Walde, Marion Weller and Max Kisselew ([raw-attachment:talk10.pdf slides]) || A Compact but Linguistically Detailed Database for German Verb Subcategorisation relying on Dependency Parses from a Web Corpus ||
|| 15:30 - 16:00||  '''Tea Break'''                               ||||
|| 16:00 - 18:00||||=  '''Session 4 (Applications)'''              =|| 
|| 16:00|| Andrew Brindle                                       ([raw-attachment:talk11.pdf slides]) || Thug breaks man's jaw: A Corpus Analysis of Responses to Interpersonal Street Violence ||
|| 16:30|| Colleen Crangle                                      ([raw-attachment:talk12.pdf slides]) || A web-based model of semantic relatedness and the analysis of electroencephalographic (EEG) data ||
|| 17:00|||| Discussion and wrap-up                               ||
|| 18:00||  '''Pub'''                                             ||||
|| 19:00||  '''Dinner'''                                          ||||


{{{#!comment
The proceedings are available [https://sigwac.org.uk/raw-attachment/wiki/WAC8/wac8-proc.pdf here]
|| || '''Plenary Discussion''': ''Shared Task?'' ||
}}}

{{{#!comment
== Submission Information ==


{{{#!comment
Authors are invited to submit extended abstracts on original, unpublished work in the topic area of this workshop.  Contributions must be submitted in PDF format and should not exceed two (2) pages, including references.  Submissions should be formatted using the format of [http://www.acl2013.org/site/call.html the ACL 2013 proceedings].

Authors of accepted papers will be invited to submit full papers (up to eight pages) before the workshop; these will appear in the online proceedings.
}}}

Long paper submissions should follow the two-column format of ACL 2013 proceedings without exceeding eight (8) pages of content plus two extra pages for references. Please use the ACL LaTeX style files or Microsoft Word style files; also, submissions must conform to the official ACL style guidelines, which are contained in the style files, and they must be in PDF.

||=  LaTeX     =||=  MS Word   =||
|| [raw-attachment:acl2013.tex]  || [raw-attachment:acl2013.doc]  ||
|| [raw-attachment:acl2013.sty]  || [raw-attachment:acl2013.msword.pdf acl2013.pdf]  ||
|| [raw-attachment:acl2013.latex.pdf acl2013.pdf]  || [raw-attachment:acl2013.dot]  ||
|| [raw-attachment:acl.bst]      || ||
}}}


== Organising committee ==
 * Stefan Evert, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
 * Egon Stemle, European Academy of !Bozen/Bolzano (EURAC)
 * Paul Rayson, Lancaster University
	

== Programme committee ==
Organising committee plus:

 * Silvia Bernardini, U of Bologna, Italy
 * Paul Cook, U of Melbourne, Australia
 * Cédrick Fairon, UCLouvain, Belgium
 * William H. Fletcher, U.S. Naval Academy, USA
 * Sebastian Hoffmann, U Trier, Germany
 * Adam Kilgarriff, Lexical Computing Ltd, UK
 * Preslav Nakov, QCRI, Qatar Foundation
 * Reinhard Rapp, U Aix-Marseille, France & U Mainz, Germany
 * Serge Sharoff, U of Leeds, UK
 * Stephen Wattam, Lancaster U, UK
 * Eros Zanchetta, U of Bologna, Italy
 * Pierre Zweigenbaum, LIMSI, France

{{{#!comment
process PC list with:
cat << }}} | sed -e "s/ -- \(.*\)//" | grep -E '^ \* [A-Z]'
}}}

{{{#!comment
 * Silvia Bernardini, U of Bologna, Italy -- silvia@sslmit.unibo.it
 * Paul Cook, U of Melbourne, Australia -- paulcook@unimelb.edu.au
 * ? Katrien Depuydt & Jesse de Does, INL, Leiden, The Netherlands -- katrien.depuydt@inl.nl, jesse.dedoes@inl.nl
 * Cédrick Fairon, UCLouvain, Belgium -- Cedrick.Fairon@uclouvain.be
 * William H. Fletcher, U.S. Naval Academy, USA -- fletcher@kwicfinder.com
 * ? Gregory Grefenstette, Exalead, France -- ggrefens@exalead.com
 * Sebastian Hoffmann, U Trier, Germany -- hoffmann@uni-trier.de
 * Adam Kilgarriff, Lexical Computing Ltd, UK -- adam@lexmasterclass.com
 * ? Igor Leturia, Elhuyar Fundazioa, Basque Country, Spain -- igor@elhuyar.com
 * Preslav Nakov, QCRI, Qatar Foundation -- pnakov@qf.org.qa, preslav.nakoV@gmail.com
 * Reinhard Rapp, U Aix-Marseille, France & U Mainz, Germany -- reinhardrapp@gmx.de
 * ? Kevin Scannell, Saint Louis U, USA -- kscanne@gmail.com
 * Serge Sharoff, U of Leeds, UK -- s.sharoff@leeds.ac.uk
 * Stephen Wattam, Lancaster U, UK -- stephenwattam@gmail.com
 * Eros Zanchetta, U of Bologna, Italy -- eros@sslmit.unibo.it
 * Pierre Zweigenbaum, LIMSI, France -- pz@limsi.fr
 - Marco Baroni, U of Trento, Italy -- marco.baroni@lett.unitn.it
 - Frank Keller, U of Edinburgh -- keller@inf.ed.ac.uk
 - Jan Pomikalek, Masaryk University, Czech Republic -- xpomikal@fi.muni.cz
 - Gilles-Maurice de Schryver, U Gent, Belgium -- gillesmaurice.deschryver@UGent.be
}}}

{{{#!comment
PC invite e-mail:
"""
Dear ###,

As organisers of the 8th Web as Corpus Workshop (WAC-8), held this year at Corpus Linguistics 2013 [http://ucrel.lancs.ac.uk/cl2013/], we would like to ask whether you would be willing to join the WAC-8 Program Committee. After a stronger focus on application-oriented natural language processing and Web technology in recent years, the workshop returns to its roots in the corpus linguistics community. Consequently, we expect very interesting submissions on this theme. Moreover, we are trying to assemble a reasonably large PC in order to keep reviewing assignments to a minimum.

The important dates are the following (and cf. [https://sigwac.org.uk/wiki/WAC8] for more information):
 * March 3: Submission of extended abstract to be made through EasyChair <https://www.easychair.org/conferences/?conf=wac8>
 * March 17: Notification of acceptance
 * June 23: Submission of full paper
Authors will be invited to submit extended abstracts of a maximum of two (2) pages, including references.

We would be grateful if you could let us know your availability as soon as possible, and in any case by Monday, February 18th. Please note that PC membership does not preclude you from submitting an extended abstract to the workshop.

We believe you would make a great addition to the WAC-8 WaCKy PC. Looking forward to your favourable response.
 

Thanks,
### (on behalf of the organising committee).
"""
}}}

{{{#!comment
Author notification e-mail, reject:
"""

Dear [*FIRST-NAME*],

We are sorry to inform you that your submission "[*TITLE*]" for the 8th Web as Corpus Workshop (WAC-8) was not accepted.

The Program Committee worked hard to thoroughly review all the submitted papers. Each paper received at least two reviews from leading professionals in the field.
The reviews and comments on your paper are appended to this email.

The program of the workshop will soon be available at https://sigwac.org.uk/wiki/WAC8#Programme.

We thank you very much for submitting your work to WAC-8 and we hope to see you at the workshop.


Sincerely,

PC Chairs
Stefan Evert, Egon Stemle, and Paul Rayson
"""

###
Author notification e-mail, accept:
"""
Dear [*FIRST-NAME*],

We are pleased to inform you that your submission "[*TITLE*]" for the 8th Web as Corpus Workshop (WAC-8) has been accepted.  Congratulations!

The Program Committee worked hard to thoroughly review all the submitted papers.  Please repay their efforts by following their suggestions when preparing the final version of your article.

Please submit the final version of your article (in accordance with the style guidelines) before June 23.

The program of the workshop will soon be available at https://sigwac.org.uk/wiki/WAC8#Programme.

We look forward to seeing you in Lancaster!


Sincerely, 

PC Chairs
Stefan Evert, Egon Stemle, and Paul Rayson
"""
}}}

{{{#!comment
SUBJECT: wac8: camera-ready copy
"""
Paper : [*NUMBER*]
Authors : [*AUTHORS*]
Title : [*TITLE*]

-------------------------------------------------------

Dear [*FIRST-NAME*],

You have already received the reviewers' comments in a previous email. Please take them carefully into account when preparing your camera-ready long paper for the proceedings. 

The final paper is due on June 23.
The page limit is 8+2 and is strict.

This is a firm deadline for the production of the proceedings. Please submit your paper using your EasyChair author account.

Long paper submissions should follow the two-column format of ACL 2013 proceedings without exceeding eight (8) pages of content plus two extra pages for references. Please use the ACL LaTeX style files or Microsoft Word style files, available at: [ http://sigwac.org.uk/wiki/WAC8#SubmissionInformation ]; also, submissions must conform to the official ACL style guidelines, which are contained in the style files, and they must be in PDF.

We greatly appreciate your cooperation in these matters. Thank you again for your contribution to WAC-8.

Sincerely, 

PC Chairs 
Stefan Evert, Egon Stemle, and Paul Rayson
"""
}}}

{{{#!comment
SUBJECT: Call for Participation: 8th Web as Corpus Workshop (22 July 2013, Lancaster, UK)
"""
CALL FOR PARTICIPATION

    8th Web as Corpus Workshop (WAC-8)
    Endorsed by ACL SIGWAC
    Hosted by the Corpus Linguistics 2013 Conference
 
    Monday, 22 July 2013 (Lancaster, UK)

** Note that registration for the workshop and the main conference closes on SUNDAY, JUNE 30. **
Registration URL: http://ucrel.lancs.ac.uk/cl2013/register.php

Further details can be found on the workshop homepage at

    http://sigwac.org.uk/wiki/WAC8

______________________________________________________________________
 
Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing. The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types. However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corpus is.
 
Since the first Web as Corpus Workshop, organised at the Corpus Linguistics 2005 Conference, a highly successful series of yearly Web as Corpus workshops has provided a venue for interested researchers to meet, share ideas and discuss the problems and possibilities of compiling and using Web corpora. After a stronger focus on application-oriented natural language processing and Web technology in recent years – with workshops taking place at NAACL-HLT 2010, 2011 and WWW 2012 – the 8th Web as Corpus Workshop returns to its roots in the corpus linguistics community.
 
Accordingly, the leading theme of this workshop is the application of Web data in language research, including linguistic evaluation of Web-derived corpora as well as strategies and tools for high-quality automatic annotation of Web text. The workshop brings together presentations on all aspects of building, using and evaluating Web corpora, with a particular focus on the following topics:
 
* applications of Web corpora and other Web-derived data sets for language research
* automatic linguistic annotation of Web data such as tokenisation, part-of-speech tagging, lemmatisation and semantic tagging (the accuracy of currently available off-the-shelf tools is still unsatisfactory for many types of Web data)
* critical exploration of the characteristics of Web data from a linguistic perspective and its applicability to language research
* presentation of Web corpus collection projects or software tools required for some part of this process (crawling, filtering, de-duplication, language identification, indexing, ...)

______________________________________________________________________

PROGRAMME

09:00 Akshay Minocha, Siva Reddy and Adam Kilgarriff -- Feed Corpus: An Ever Growing Up-to-date Corpus
09:30 Stephen Wattam, Paul Rayson and Damon Berridge -- LWAC: Longitudinal Web-as-Corpus Sampling
10:00 Roland Schäfer, Adrien Barbaresi and Felix Bildhauer -- The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction
10:30 Jesse Egbert and Douglas Biber -- Developing a User-based Method of Web Register Classification

11:00 - 11:30	Tea Break 	

11:30 Adam Kilgarriff and Vít Suchomel -- Web Spam
12:00 David Lutz, Parry Cadwallader and Mats Rooth -- A web application for filtering and annotating web speech data
12:30 Sarah Schulz, Verena Lyding and Lionel Nicolas -- STirWaC - Compiling a diverse corpus based on texts from the web for South Tyrolean German

13:00 - 14:00	Lunch 	

14:00 Alexander Piperski, Vladimir Belikov, Nikolay Kopylov, Vladimir Selegey and Serge Sharoff -- Big and diverse is beautiful: A large corpus of Russian to study linguistic variation
14:30 Adriano Ferraresi and Silvia Bernardini -- The academic Web-as-Corpus
15:00 Silke Scheible and Sabine Schulte im Walde -- A Compact but Linguistically Detailed Database for German Verb Subcategorisation relying on Dependency Parses from a Web Corpus

15:30 - 16:00	Tea Break 	

16:00 Andrew Brindle -- Thug breaks man's jaw: A Corpus Analysis of Responses to Interpersonal Street Violence
16:30 Colleen Crangle -- A web-based model of semantic relatedness and the analysis of electroencephalographic (EEG) data
17:00 Discussion and wrap-up

18:00 Pub

______________________________________________________________________ 

Looking forward to seeing you at the workshop,
The organising committee.
 
Stefan Evert, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Egon Stemle, European Academy of Bozen/Bolzano (EURAC)
Paul Rayson, Lancaster University
"""
}}}


{{{#!comment
SUBJECT: Missing Registration: 8th Web as Corpus Workshop
"""
Dear [*FIRST-NAME*],

Until about a week ago, no author of your paper had registered for the 8th Web as Corpus Workshop (WAC-8).

** Note that one of the authors needs to register and that registration for the workshop and the main conference closes on SUNDAY, JUNE 30. **
Registration URL: http://ucrel.lancs.ac.uk/cl2013/register.php

Further details can be found on the workshop homepage at

   http://sigwac.org.uk/wiki/WAC8


Looking forward to seeing you at the workshop,
The organising committee.

Stefan Evert, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Egon Stemle, European Academy of Bozen/Bolzano (EURAC)
Paul Rayson, Lancaster University
"""
}}}