9 | | * to promote interest in the use of the web as a source of linguistic data, and as an object of study in its own right; |
10 | | * to provide ACL members who have a special interest in the web as corpus with a means of exchanging news of recent research developments and other matters of interest (e.g. the upcoming crisis in web data authenticity, given the recent staggering improvements in generative large language models); |
11 | | * to sponsor meetings and workshops on the web as corpus that appear to be timely and worthwhile. |
| 13 | * To build a community around web-as-corpus research
| 14 | * To support and promote information exchange and the dissemination of results and best practices |
| 15 | * To organize workshops, hackathons and shared tasks |
| 18 | |
| 19 | |
| 20 | == Officers == |
| 21 | * [https://nljubesi.github.io Nikola Ljubešić] (co-president) |
| 22 | * [http://alpage.inria.fr/~sagot/ Benoît Sagot] (co-president) |
| 23 | * [https://www.utu.fi/en/people/veronika-laippala Veronika Laippala] (co-secretary) |
| 24 | * [https://portizs.eu Pedro Ortiz Suarez] (co-secretary) |
| 25 | |
| 26 | |
| 27 | == Resources == |
| 28 | |
| 29 | === Corpora === |
| 30 | |
| 31 | * [https://commoncrawl.org CommonCrawl] |
| 32 | * [https://oscar-project.org OSCAR] |
| 33 | * [https://paracrawl.eu ParaCrawl] |
| 34 | * [https://macocu.eu MaCoCu] |
| 35 | * [https://www.clarin.si/info/new-classla-web-corpora-and-tutorial-on-usage-of-the-corpora-via-clarin-si-concordancers/ CLASSLA South Slavic web corpora] |
| 36 | * [http://sketch.juls.savba.sk/aranea_about/ Aranea web corpora] |
| 37 | * [https://www.clarin.si/noske/wacs.cgi/ CLARIN.SI web corpora] |
| 38 | * [http://corpus.leeds.ac.uk/internet.html University of Leeds (CTS) web corpora] |
| 39 | * [http://www.sketchengine.co.uk/ Web corpora on Sketch Engine (commercial product)]
| 40 | * [https://wacky.sslmit.unibo.it/doku.php?id=start WaCKy corpora] |
| 41 | |
| 42 | === Technologies === |
| 43 | * [https://corpus.tools A Masaryk University and Lexical Computing list of tools for harvesting and processing web data] |
| 44 | * [https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier The XGENRE multilingual text genre classifier] |
| 45 | * [https://github.com/TurkuNLP/multilingual-register-labeling Massively Multilingual Modeling of Web Registers by TurkuNLP] |
| 46 | |
| 47 | === Additional information === |
| 48 | * [https://link.springer.com/book/10.1007/978-3-031-02152-7 Schäfer and Bildhauer's book Web Corpus Construction]
| 49 | * [http://webascorpus.sf.net/ Stephanie Evert's WAC website] |
| 50 | * [https://sigwac.org.uk/cleaneval CLEANEVAL], a competition for cleaning web pages
| 51 | |