The World Wide Web has become increasingly popular as a source of linguistic evidence, not only within the computational linguistics community, but also among theoretical linguists facing problems such as data sparseness or the lack of variation in traditional corpora of written language. Accordingly, web corpora continue to gain relevance, given their size and their diversity in terms of genres and text types. In lexicography, web data have become a major and well-established resource with dedicated research data and specialised tools such as the SketchEngine. In other areas of linguistics, the adoption of web corpora has been slower but steady. Furthermore, some completely new areas of research dealing exclusively with web (or similar) data have emerged, such as the construction and exploitation of corpora based on short messages. Another example is the (manual or automatic) classification of web texts by genre, register, or – more generally speaking – text type, as well as by topic area. Similarly, the areas of corpus evaluation and corpus comparison have advanced greatly through the rise of web corpora, mostly because web corpora (especially larger ones in the region of several billion tokens) are often created by downloading texts from the web unselectively with respect to their text type or content. While the composition (or stratification) of such corpora cannot be determined before their construction, it is desirable at least to evaluate it afterwards. Also, comparing web corpora to corpora that have been compiled in a more traditional way is key to determining the quality of web corpora with respect to a given research question.