ACL SIGWAC home page

The Special Interest Group of the Association for Computational Linguistics (ACL) on Web as Corpus.

Join the SIG by signing up to the mailing list!

The Special Interest Group on Web as Corpus aims to research the opportunities and limitations of using textual web data for:

performing linguistic research
modelling knowledge of language
modelling extralinguistic knowledge

Objectives

To build a community around web-as-corpus research
To support and promote information exchange and the dissemination of results and best practices
To organize workshops, hackathons and shared tasks

Download the constitution of ACL SIGWAC.

Topics of interest

Given the ever-growing data needs of Large Language Models (LLMs), web corpora have taken a central place in Natural Language Processing (NLP), Computational Linguistics (CL) and Machine Learning (ML). SIGWAC has therefore grouped its topics of interest into three areas:

Technical Aspects

Filtering strategies for web data in LLM pre-training.
Impact of web data in the pre-training data mix of LLMs.
Crawling and ranking.
Construction of web graphs.
Language identification, multilinguality, and Web as a Corpus for low-resource languages.
Web indexing, information retrieval and LLM application in document representations.
Semantic web and automatic annotation of multilingual web data.

Legal Aspects

Intellectual Property and licensing of Web data.
Robots Exclusion Protocol and other opt-out methods for AI training.
Privacy preservation in web corpora, automatic PII detection and redaction.
Study and application of the TDM directive in the EU.
Study and application of the AI Act in the EU.
Scope of data usage.

Societal Aspects

Socio-linguistic studies of web data.
Web-graph as a tool for web corpora exploration in a multidisciplinary setting.
Study of bias and toxicity in web corpora.
Study of illegal content prevalence in web corpora.
Web corpora as a means to promote multilingualism and multiculturalism.

Beyond these topics of interest, we also aim to:

Promote interest in the use of the web as a source of linguistic data, and as an object of study in its own right;
Provide ACL members who have a special interest in the web as corpus with a means of exchanging news of recent research developments and other matters of interest (e.g. the looming crisis of web data authenticity brought on by the recent rapid improvements in generative large language models);
Sponsor meetings and workshops on the web as corpus that appear to be timely and worthwhile.

Officers

Nikola Ljubešić (co-president)
Benoît Sagot (co-president)
Veronika Laippala (co-secretary)
Pedro Ortiz Suarez (co-secretary)

Resources

Corpora

Technologies

Additional information

Schäfer and Bildhauer's web corpus book
Stephanie Evert's WAC website
CLEANEVAL, a competition for cleaning webpages

Meetings

WAC-XII at LREC 2020, Marseille, France, 16 May 2020 (cancelled due to the Covid-19 outbreak, but the proceedings have been published)
WAC-XI at Corpus Linguistics 2017, Birmingham, UK, 24-27 July 2017
WAC-X at ACL 2016, Berlin, Germany, 12 August 2016
WAC@eLex2015, at eLex, Herstmonceux Castle, UK, 10 August 2015
WAC9, at EACL 2014, Gothenburg, Sweden, 26-27 April 2014
WAC8, at Corpus Linguistics 2013, Lancaster, UK, 22 July 2013
WAC7, at WWW12, Lyon, France, 17 April 2012
BUCC (Building and Using Comparable Corpora), at ACL 2011, Portland, Oregon, 24 June 2011
WAC6, at NAACL-HLT, Los Angeles, USA, 5 June 2010
WAC5, at SPLN, San Sebastian, Basque Country, Spain, 7 September 2009
WAC4 at LREC, Marrakech, Morocco, 1 June 2008
WAC3, Louvain-la-Neuve, Belgium, 15-16 September 2007
WAC2, at EACL, Trento, Italy, April 2006
WAC1, at Corpus Linguistics conference, Birmingham, UK, July 2005

ACL SIGWAC annual reports

Last modified on 07/02/24 16:23:57
