Changes between Version 1 and Version 2 of WAC8/accepted_papers
- Timestamp:
- 05/23/13 12:45:26 (12 years ago)
WAC8/accepted_papers
v1 v2 2 2 ||=Andrew Brindle =|| 3 3 ||=Thug breaks man's jaw: A Corpus Analysis of Responses to Interpersonal Street Violence =|| 4 || Abstract: A great deal of what is bad in the world, from genocide to interpersonal violence, is the product of men and their masculinities ( DeKeseredy and Schwartz, 2005).Work by criminologists such as Anderson (1990) have argued that instances of interpersonal violence originate from strongly held values in the construction and defence of personal street status and that violence is a tool for both the formation of and the protection of self-image. Furthermore, Messerschmidt (2004) writes that among certain men violence is a core component of masculinity and a means of proving one’s manhood. However, Winlow (2001) considers that street and pub fights function as a means for working-class men to actualize a masculine identity due to the loss of traditional industrial job opportunities in a post-modern society. Clearly, violence is one means by which certain men live up to the ideals of hegemonic masculinity; such practices may be learned through interactions with particular peer groups, or virtual peer groups. \\ This paper examines a corpus constructed of online responses to an article in an online edition of the British tabloid newspaper The Sun describing an act of interpersonal street violence between two men. The report describes how a man, in an unprovoked attack, left another man unconscious in a street after breaking his jaw. The article produced 190 responses from readers, the majority of whom either through avatars or online names indicated that they were male. The responses were collected and compiled into a corpus containing 6,606 tokens. This was then analysed using theWordSmith Tools software package. Taking a corpus-based approach, the data was analysed by undertaking concordance analyses of keywords and collocates of those words. 
\\The findings of the study of keyword collocates and concordance lines indicate that regardless of the negative depiction of the aggressor in the online article, the assailant and his actions were defended, and at times admired and praised, while the victim was criticized for his lack of fighting skills, and not considered as innocent. However, the findings also provide data revealing that other respondents reject such actions, clearly demonstrating that multiple constructs of masculine identity exist among the tabloid readership who responded to the article. \\ The paper concludes by discussing the hypothesis that masculine identity and specifically hegemonic masculinity is constructed of multiple identities, and rejecting the notion that violence is a response to the destabilizing effects of post-modernism, while arguing that interpersonal violence is a means by which certain men express and validate masculinity. Furthermore, the importance of investigating and analysing online peer groups is emphasised as an invaluable source in comprehending aspects of social behaviour within contemporary society. ||4 || Abstract: A great deal of what is bad in the world, from genocide to interpersonal violence, is the product of men and their masculinities (!DeKeseredy and Schwartz, 2005). Work by criminologists such as Anderson (1990) has argued that instances of interpersonal violence originate from strongly held values in the construction and defence of personal street status and that violence is a tool for both the formation of and the protection of self-image. Furthermore, Messerschmidt (2004) writes that among certain men violence is a core component of masculinity and a means of proving one’s manhood. However, Winlow (2001) considers that street and pub fights function as a means for working-class men to actualize a masculine identity due to the loss of traditional industrial job opportunities in a post-modern society. 
Clearly, violence is one means by which certain men live up to the ideals of hegemonic masculinity; such practices may be learned through interactions with particular peer groups, or virtual peer groups. \\ This paper examines a corpus constructed of online responses to an article in an online edition of the British tabloid newspaper The Sun describing an act of interpersonal street violence between two men. The report describes how a man, in an unprovoked attack, left another man unconscious in a street after breaking his jaw. The article produced 190 responses from readers, the majority of whom either through avatars or online names indicated that they were male. The responses were collected and compiled into a corpus containing 6,606 tokens. This was then analysed using the !WordSmith Tools software package. Taking a corpus-based approach, the data was analysed by undertaking concordance analyses of keywords and collocates of those words. \\The findings of the study of keyword collocates and concordance lines indicate that regardless of the negative depiction of the aggressor in the online article, the assailant and his actions were defended, and at times admired and praised, while the victim was criticized for his lack of fighting skills, and not considered as innocent. However, the findings also provide data revealing that other respondents reject such actions, clearly demonstrating that multiple constructs of masculine identity exist among the tabloid readership who responded to the article. \\ The paper concludes by discussing the hypothesis that masculine identity and specifically hegemonic masculinity is constructed of multiple identities, and rejecting the notion that violence is a response to the destabilizing effects of post-modernism, while arguing that interpersonal violence is a means by which certain men express and validate masculinity. 
Furthermore, the importance of investigating and analysing online peer groups is emphasised as an invaluable source in comprehending aspects of social behaviour within contemporary society. || 5 5 6 6 [[BR]] … … 14 14 ||=Adam Kilgarriff and Vít Suchomel =|| 15 15 ||=Web Spam =|| 16 || Abstract: Web spamming 'refers to actions intended to mislead search engines into ranking some pages higher than they deserve'. Web spam is a problem for web corpus builders because it is quite like the material we want to gather, but we do not want it. It is on the increase: when we compare two corpora gathered using the same methods in 2008 and 2012, enTenTen08 andenTenTen12, the web spam in the later one is a striking difference. In this paper we first review some relevant literature, and then identify some characteristics of web spam that we have noted, and suggest corresponding strategies for distinguishing it from good text. ||16 || Abstract: Web spamming 'refers to actions intended to mislead search engines into ranking some pages higher than they deserve'. Web spam is a problem for web corpus builders because it is quite like the material we want to gather, but we do not want it. It is on the increase: when we compare two corpora gathered using the same methods in 2008 and 2012, !enTenTen08 and !enTenTen12, the web spam in the later one is a striking difference. In this paper we first review some relevant literature, and then identify some characteristics of web spam that we have noted, and suggest corresponding strategies for distinguishing it from good text. || 17 17 18 18 [[BR]] … … 32 32 ||=Alexander Piperski, Vladimir Belikov, Nikolay Kopylov, Vladimir Selegey and Serge Sharoff =|| 33 33 ||=Big and diverse is beautiful: A large corpus of Russian to study linguistic variation =|| 34 || Abstract: The General Internet Corpus of Russian (GICR) is aimed at studying linguistic varia -tion in present-day Russian available on the Web. 
In addition to traditional morphosyntac-tic annotation, the corpus will be richly anno-tated with metadata aimed at sociolinguistics research of language variation, including re-gional, gender, age, and genre variation. The sources of metadata include explicit informa-tion available about the author in his/her pro-file, information coming from IP or URL, as well as machine learning from textual features. ||34 || Abstract: The General Internet Corpus of Russian (GICR) is aimed at studying linguistic variation in present-day Russian available on the Web. In addition to traditional morphosyntactic annotation, the corpus will be richly annotated with metadata aimed at sociolinguistics research of language variation, including regional, gender, age, and genre variation. The sources of metadata include explicit information available about the author in his/her profile, information coming from IP or URL, as well as machine learning from textual features. || 35 35 36 36 [[BR]] … … 50 50 ||=Stephen Wattam, Paul Rayson and Damon Berridge =|| 51 51 ||=LWAC: Longitudinal Web-as-Corpus Sampling =|| 52 ||Abstract: As the web develops, issues surrounding network and content stability increasingly affect sampling of web data. The needs of those aiming to investigate the impact network-based effects such as link rot have upon language content are currently poorly served by linguistic search engines such as WebCorp, which attempt to produce language samples more comparable to offline corpora. \\ We present here an open-source tool, LWAC, for formal longitudinal sampling of URI lists, designed to download portions of the web in a fast, parallel manner that imitates end users. LWAC is designed to run on commodity hardware and provide a high-performance method of corpus construction for investigating both language change online (in a conventional manner) and epistemic issues in the web-as-corpus field. 
||52 ||Abstract: As the web develops, issues surrounding network and content stability increasingly affect sampling of web data. The needs of those aiming to investigate the impact network-based effects such as link rot have upon language content are currently poorly served by linguistic search engines such as !WebCorp, which attempt to produce language samples more comparable to offline corpora. \\ We present here an open-source tool, LWAC, for formal longitudinal sampling of URI lists, designed to download portions of the web in a fast, parallel manner that imitates end users. LWAC is designed to run on commodity hardware and provide a high-performance method of corpus construction for investigating both language change online (in a conventional manner) and epistemic issues in the web-as-corpus field. || 53 53 54 54 [[BR]] … … 68 68 ||=Colleen Crangle =|| 69 69 ||=A web-based model of semantic relatedness and the analysis of electroencephalographic (EEG) data =|| 70 || Abstract: Recent studies of language and the brain have shown that models of semantics extracted from web-based corpora can predict brain activity as measured by functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG), or electroencephalography (EEG). In Mitchell et al. (2008) the semantics of a word was represented by its distributional properties in the data set contributed by Google Inc. This data set consists of English word n-grams and their frequencies in an approximately 1-trillion-word set of web pages (Brants, Franz 2006). For nouns referring to physical objects, co-occurrence patterns with 25 manually-selected sensory-motor verbs provided the semantic model. Taking 60 such nouns and their fMRI images, statistically significant pre-dictions were made as to the semantic catego-ry (mammal or tool, for example) of the words in this set. 
\\ Since Mitchell, other web-based corpora and other ways of selecting semantic features have been investigated to see if they offered improved methods of predicting from brain data the word someone is seeing or hearing or otherwise attending to. \\ Murphy et al. (2012), for example, used a 16 billion-word set of English-language web-page documents as their corpus and point-wise mutual information (Turney, 2001) combined with co-occurrence frequencies to provide a semantic model. Pereira et al. (2010) used a large text corpus consisting of pertinent articles from Wikipedia and Latent Dirichlet allocation (LDA, Blei et al., 2003) to provide the semantic model. In Jelodor et al. (2010) we find WordNet (Fellbaum, 1998) used as a supplementary source of information to construct a semantic model. Several WordNet similarity measures com-puted the similarity of each of the 60 nouns with each of the 25 sensory-motor verbs of Mitchell et al. \\ In this paper, I take a model of semantic relatedness extracted from the Web and examine the extent to which it corresponds to predictions made from EEG data about the relations between sets of words participants are attending to. Unlike previous work that looked at isolated word predictions, this work examines sets of words and the relations between them. The brain data are drawn from experiments in which statements about commonly known geographic facts of Europe were presented auditorily to participants who were asked to determine the truth or falsity of each statement while EEG recordings were made (Suppes et al, 1999; Suppes et al., 2009). The corpus is the Google Inc. data set and semantic relatedness is obtained from a point-wise mutual infor-mation measure. \\ Corpus-based models of semantics face the unavoidable evaluation question, namely how well distributional information extracted from a corpus matches the semantic knowledge of language users. 
Corpus-based studies of semantics and the brain potentially offer a new way to answer this question. ||70 || Abstract: Recent studies of language and the brain have shown that models of semantics extracted from web-based corpora can predict brain activity as measured by functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG), or electroencephalography (EEG). In Mitchell et al. (2008) the semantics of a word was represented by its distributional properties in the data set contributed by Google Inc. This data set consists of English word n-grams and their frequencies in an approximately 1-trillion-word set of web pages (Brants and Franz, 2006). For nouns referring to physical objects, co-occurrence patterns with 25 manually-selected sensory-motor verbs provided the semantic model. Taking 60 such nouns and their !fMRI images, statistically significant predictions were made as to the semantic category (mammal or tool, for example) of the words in this set. \\ Since Mitchell, other web-based corpora and other ways of selecting semantic features have been investigated to see if they offered improved methods of predicting from brain data the word someone is seeing or hearing or otherwise attending to. \\ Murphy et al. (2012), for example, used a 16 billion-word set of English-language web-page documents as their corpus and point-wise mutual information (Turney, 2001) combined with co-occurrence frequencies to provide a semantic model. Pereira et al. (2010) used a large text corpus consisting of pertinent articles from Wikipedia and Latent Dirichlet allocation (LDA, Blei et al., 2003) to provide the semantic model. In Jelodor et al. (2010) we find !WordNet (Fellbaum, 1998) used as a supplementary source of information to construct a semantic model. Several !WordNet similarity measures computed the similarity of each of the 60 nouns with each of the 25 sensory-motor verbs of Mitchell et al. 
\\ In this paper, I take a model of semantic relatedness extracted from the Web and examine the extent to which it corresponds to predictions made from EEG data about the relations between sets of words participants are attending to. Unlike previous work that looked at isolated word predictions, this work examines sets of words and the relations between them. The brain data are drawn from experiments in which statements about commonly known geographic facts of Europe were presented auditorily to participants who were asked to determine the truth or falsity of each statement while EEG recordings were made (Suppes et al., 1999; Suppes et al., 2009). The corpus is the Google Inc. data set and semantic relatedness is obtained from a point-wise mutual information measure. \\ Corpus-based models of semantics face the unavoidable evaluation question, namely how well distributional information extracted from a corpus matches the semantic knowledge of language users. Corpus-based studies of semantics and the brain potentially offer a new way to answer this question. ||