Changes between Version 70 and Version 71 of WikiStart


Timestamp: 08/14/23 14:31:46
Author: nikola
Comment: --

  • WikiStart

 v70  v71
   5    5    Join the SIG by [http://devel.sslmit.unibo.it/mailman/listinfo/sigwac signing up to the mailing list!]
   6    6
        7  + The Special Interest Group on '''Web as Corpus''' aims to research the opportunities and limitations of using textual web data for
        8  + 1. performing linguistic research
        9  + 2. modelling knowledge of language
       10  + 3. modelling extralinguistic knowledge
   7   11
   8   12    == Objectives ==
   9       -   * to promote interest in the use of the web as a source of linguistic data, and as an object of study in its own right;
  10       -   * to provide members of the ACL with a special interest in the web-as-corpus with a means of exchanging news of recent research developments and other matters of interest (e.g. the upcoming crisis on web data authenticity given the recent staggering improvements in generative large language models);
  11       -   * to sponsor meetings and workshops on the web as corpus that appear to be timely and worthwhile.
       13  + * To build a community around the web-as-corpus research
       14  + * To support and promote information exchange and the dissemination of results and best practices
       15  + * To organize workshops, hackathons and shared tasks
  12   16
  13   17    Download the [attachment:wiki:WikiStart:constitution.txt?format=raw constitution of ACL SIGWAC].
       18  +
       19  +
       20  + == Officers ==
       21  + * [https://nljubesi.github.io Nikola Ljubešić] (co-president)
       22  + * [http://alpage.inria.fr/~sagot/ Benoît Sagot] (co-president)
       23  + * [https://www.utu.fi/en/people/veronika-laippala Veronika Laippala] (co-secretary)
       24  + * [https://portizs.eu Pedro Ortiz Suarez] (co-secretary)
       25  +
       26  +
       27  + == Resources ==
       28  +
       29  + === Corpora ===
       30  +
       31  + * [https://commoncrawl.org CommonCrawl]
       32  + * [https://oscar-project.org OSCAR]
       33  + * [https://paracrawl.eu ParaCrawl]
       34  + * [https://macocu.eu MaCoCu]
       35  + * [https://www.clarin.si/info/new-classla-web-corpora-and-tutorial-on-usage-of-the-corpora-via-clarin-si-concordancers/ CLASSLA South Slavic web corpora]
       36  + * [http://sketch.juls.savba.sk/aranea_about/ Aranea web corpora]
       37  + * [https://www.clarin.si/noske/wacs.cgi/ CLARIN.SI web corpora]
       38  + * [http://corpus.leeds.ac.uk/internet.html University of Leeds (CTS) web corpora]
       39  + * [http://www.sketchengine.co.uk/ Web corpora on Sketchengine (commercial product)]
       40  + * [https://wacky.sslmit.unibo.it/doku.php?id=start WaCKy corpora]
       41  +
       42  + === Technologies ===
       43  + * [https://corpus.tools A Masaryk University and Lexical Computing list of tools for harvesting and processing web data]
       44  + * [https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier The XGENRE multilingual text genre classifier]
       45  + * [https://github.com/TurkuNLP/multilingual-register-labeling Massively Multilingual Modeling of Web Registers by TurkuNLP]
       46  +
       47  + === Additional information ===
       48  +   * [https://link.springer.com/book/10.1007/978-3-031-02152-7 Schäfer and Bildhauer's web corpus book]
       49  +   * [http://webascorpus.sf.net/ Stephanie Evert's WAC website]
       50  +   * [https://sigwac.org.uk/cleaneval CLEANEVAL], a competition for cleaning webpages
       51  +
  14   52
  15   53    == Meetings ==
   …    …
  30   68
  31   69
  32       - == Other Activities ==
  33       -   * [https://sigwac.org.uk/cleaneval CLEANEVAL], a competition for cleaning webpages
  34       -   * Mailing list:
  35       -     * sign up [http://devel.sslmit.unibo.it/mailman/listinfo/sigwac here]
  36       -     * address to send mail to sigwac at sslmit.unibo.it
  37       -
  38       - == Officers ==
  39       - * Nikola Ljubešić (co-president)
  40       - * Benoît Sagot (co-president)
  41       - * Veronika Laippala (co-secretary)
  42       - * Pedro Ortiz Suarez (co-secretary)
  43       -
  44   70    == ACL SIGWAC annual reports ==
  45   71
   …    …
  55   81     * [http://aclweb.org/adminwiki/index.php?title=2012Q3_Reports:_SIGWAC ACL SIGWAC 2012 Q3 report]
  56   82     * [http://aclweb.org/adminwiki/index.php?title=Reports Older reports...]
  57       -
  58       - == Resources ==
  59       -
  60       - === Information ===
  61       -   * [http://webascorpus.sf.net/ Stephanie Evert's WAC website]
  62       -   * [http://www.morganclaypool.com/doi/abs/10.2200/S00508ED1V01Y201305HLT022 Schäfer and Bildhauer's web corpus book]
  63       -
  64       - === Web corpora ===
  65       -   * [http://sketch.juls.savba.sk/aranea_about/ Aranea web corpora]
  66       -   * [http://corporafromtheweb.org/ Corpora from the Web (COW) of Freie Universität Berlin]
  67       -   * [http://corpus.leeds.ac.uk/internet.html University of Leeds (CTS) web corpora]
  68       -   * [http://www.sketchengine.co.uk/ Web corpora on Sketchengine]
  69       -   * [http://wacky.sslmit.unibo.it/ WaCky corpora]
  70       -