Changes between Version 71 and Version 72 of WikiStart


Timestamp:
07/02/24 16:01:51 (2 months ago)
Author:
nikola
Comment:

--

  • WikiStart

    v71 v72  
    1717Download the [attachment:wiki:WikiStart:constitution.txt?format=raw constitution of ACL SIGWAC].
    1818
     19== Topics of interest ==
     20Given the ever-growing data needs of Large Language Models (LLMs), web corpora have taken a central place in Natural Language Processing (NLP), Computational Linguistics (CL) and Machine Learning (ML). SIGWAC therefore groups its topics of interest into three aspects:
     21
     22=== Technical Aspects ===
     23* Filtering strategies for web data in LLM pre-training.
     24* Impact of web data in the pre-training data mix of LLMs.
     25* Crawling and ranking.
     26* Construction of web graphs.
     27* Language identification, multilinguality, and Web as a Corpus for low-resource languages.
     28* Web indexing, information retrieval and LLM application in document representations.
     29* Semantic web and automatic annotation of multilingual web data.
     30
     31=== Legal Aspects ===
     32* Intellectual Property and licensing of Web data.
     33* Robots Exclusion Protocol and other opt-out methods for AI training.
     34* Privacy preservation in web corpora, automatic PII detection and redaction.
     35* Study and application of the TDM directive in the EU.
     36* Study and application of the AI Act in the EU.
     37* Scope of data usage.
     38
     39=== Societal Aspects ===
     40* Socio-linguistic studies of web data.
     41* Web graphs as a tool for web corpora exploration in a multidisciplinary setting.
     42* Study of bias and toxicity in web corpora.
     43* Study of illegal content prevalence in web corpora.
     44* Web corpora as a means to promote multilingualism and multiculturalism.
     45
     46Beyond these topics of interest, we also aim to:
     47* Promote interest in the use of the web as a source of linguistic data, and as an object of study in its own right;
     48* Provide ACL members who have a special interest in the web as corpus with a means of exchanging news of recent research developments and other matters of interest (e.g. the looming crisis of web data authenticity brought on by recent rapid improvements in generative large language models);
     49* Sponsor meetings and workshops on the web as corpus that appear to be timely and worthwhile.
     50
     51
    1952
    2053== Officers ==
     
    3063
    3164* [https://commoncrawl.org Common Crawl]
     65* [https://hplt-project.org HPLT]
    3266* [https://oscar-project.org OSCAR]
    3367* [https://paracrawl.eu ParaCrawl]