- Lal Zimman
- Will Hayworth
Compared to the study of phonetic and grammatical variables, the study of lexical variation remains marginal in quantitative sociolinguistics. While the lexicon is often ignored because of its propensity for rapid, non-linear change and its status as “above the level of awareness” (Silverstein 1981), these characteristics also make it well suited to the analysis of rapid sociopolitical change. Corpus linguistics provides a set of tools for the analysis of sociolinguistic variation in lexical usage (e.g. Baker 2010), but these methods have yet to be integrated into the study of language change. Corpus studies are often limited to synchronic perspectives and, more often than not, draw on data from relatively non-vernacular contexts (e.g. newspapers). The current study departs from existing corpus sociolinguistic research by offering two primary contributions. The first is an examination of change in everyday counter-hegemonic discourse in a transgender community; the second is the use of novel, general-purpose computing tools for the analysis of relatively unstructured internet data.
The first contribution is an analysis of change in the use of body part terminology over the course of more than 15 years of interactions (2000–2017) in an online community for trans men and others on the trans masculine identity spectrum. For decades, the most important principle of transgender language activism has been the notion that gender identity is a matter of self-identification (Zimman 2016, 2017). During the period in which this online community was active, a parallel, empowering discourse emerged holding that biological sex, too, is open to self-identification rather than being an objective fact about the body. A preliminary synchronic analysis of talk about genitals demonstrates that trans speakers in this community use a combination of terms that are normatively “male” (e.g. dick), normatively “female” (e.g. vagina), gender-neutral (e.g. privates), and creative or trans-specific (e.g. front hole), all in reference to trans men’s surgically unmodified genitals. These practices were part of a complex expansion of self-identification discourse that gained traction within English-speaking trans communities during the period examined, and this analysis highlights the relationship between the emergence of non-normative lexical usage and discourses that overtly challenged medical and scientific authority. Whereas most corpus studies of language, gender, and sexuality focus on hegemonic and oppressive discourses (e.g. Baker 2004), this study offers a view of linguistic strategies of empowerment.
The other major contribution of this paper is methodological. The creation of corpora based on relatively unstructured internet-based data is generally a laborious process. We demonstrate a series of corpus methods that rely on cutting-edge computing techniques for information retrieval and analysis. Specifically, data were collected through a crawling pipeline that parses social media data, stores it in a cloud database, and allows for analysis using commodity tools on offer from major cloud providers. The pipeline and analysis tools are “serverless,” scaling on demand with no fixed infrastructure investment and at minimal cost. Our methods thus provide a reference architecture for flexible analysis of social media and other internet-based data for a range of sociolinguistic purposes.
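The store-then-analyze flow described above might be sketched as follows. This is a minimal illustration, not the pipeline used in the study: an in-memory SQLite table stands in for the cloud database, the crawling step is omitted, and the function names (`store_posts`, `count_categories_by_year`) and the `TERM_CATEGORIES` mapping are hypothetical, built only from the example terms given in this abstract.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical category mapping, limited to the example terms in the abstract.
TERM_CATEGORIES = {
    "dick": "normatively male",
    "vagina": "normatively female",
    "privates": "gender-neutral",
    "front hole": "trans-specific",
}


def store_posts(conn, posts):
    """Store (year, body) pairs. SQLite is an assumed stand-in for a cloud database."""
    conn.execute("CREATE TABLE IF NOT EXISTS posts (year INTEGER, body TEXT)")
    conn.executemany("INSERT INTO posts VALUES (?, ?)", posts)
    conn.commit()


def count_categories_by_year(conn):
    """Count whole-word term occurrences, keyed by (year, category)."""
    counts = Counter()
    for year, body in conn.execute("SELECT year, body FROM posts"):
        text = body.lower()
        for term, category in TERM_CATEGORIES.items():
            pattern = r"\b" + re.escape(term) + r"\b"
            n = len(re.findall(pattern, text))
            if n:
                counts[(year, category)] += n
    return counts


if __name__ == "__main__":
    # Invented toy strings for illustration only; not data from the corpus.
    conn = sqlite3.connect(":memory:")
    store_posts(conn, [(2001, "toy post with privates"), (2015, "toy post with front hole")])
    for key, n in sorted(count_categories_by_year(conn).items()):
        print(key, n)
```

Grouping counts by year and category is one simple way to expose the kind of diachronic shift in lexical usage that the first contribution examines; a real analysis would also need to normalize by corpus size per year.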
Baker, Paul. 2004. “Unnatural acts”: Discourses of homosexuality within the House of Lords debates on gay male law reform. Journal of Sociolinguistics 8(1). 88–106.
Silverstein, Michael. 1981. The limits of awareness. Sociolinguistic Working Paper 84. 1–30.