Semantic change and socio-semantic variation: the case of COVID-related neologisms on Reddit

Quirin Würschinger; Barbara McGillivray

doi:10.1515/lingvan-2023-0106

Published/Copyright: March 15, 2024

From the journal Linguistics Vanguard Volume 11 Issue s3

Abstract

COVID-19 has triggered innovations in science and society globally, leading to the emergence or establishment of formal neologisms such as infodemic and working from home (WFH). While previous work on COVID-related lexical innovation has focused on such formal neologisms, this paper uses data from Reddit to study semantic neologisms like lockdown and mask, which have changed in meaning due to the pandemic. First, we identify words that have undergone meaning changes since the start of the pandemic. Our approach, based on word embeddings, successfully detects a variety of COVID-related terms that dominate the resulting list of semantic neologisms. Next, we generate community-specific semantic representations for the communities r/Coronavirus and r/conspiracy, which are both highly engaged in COVID-related discourse. We analyse socio-semantic variation along two dimensions: an evaluative dimension, based on amelioration/pejorization, and the loyalty/betrayal dimension of Moral Foundations Theory. Our findings reveal that the detected semantic neologisms exhibit more negative and betrayal-related associations in r/conspiracy, a subreddit critical of COVID-related sociopolitical measures. Mapping the community-specific representations for the term vaccines on a shared semantic space confirms these differences and reveals more fine-grained denotational and connotational differences between the two communities.

Keywords: lexical innovation; semantic change; social variation; word embeddings; Reddit

1 Introduction

COVID-19^[1] has led to significant changes in social practices and spurred rapid innovations in science and society. Language, particularly at the lexical level, reflects such cultural shifts. New words emerge, and the meanings of existing words change. Previous research on COVID-related lexical innovation has mainly focused on formal neologisms like infodemic or pancession (Mahlberg and Brookes 2021; Roig-Marín 2020). We aim to add to this work by investigating semantic change and socio-semantic variation in COVID-related terms like lockdown or mask on Reddit. Specifically, we examine the potential variation in the use of these terms, particularly in their evaluative associations and their connections to concepts of loyalty or betrayal. This approach allows us to study the socio-semantic profiles of these neologisms and provides insights into societal attitudes towards the pandemic.

2 Theoretical background and previous work

2.1 Semantic neology

Lexical innovations include formal neologisms, that is new lexical items, like Zoom fatigue, and semantic neologisms, that is new meanings of existing lexical items, such as booster (Geeraerts 2010; Tournier 1985). Lexical semantic change and innovation depend on the semantic relationships between old and new meanings of lexemes (Koch 2016) and can relate to denotational or connotational aspects (Geeraerts 2010; Koch 2016; Leech 1981; Lipka 1992). Denotational change refers to new concepts and practices, often involving generalization versus specialization (e.g. lockdown narrowing to COVID-related shutdowns), and metonymic change (e.g. jab for vaccine injections). Conversely, connotational change involves shifting associations and attitudes, which are often related to evaluative dimension changes (Koch 2016), with words adopting increasingly negative (pejorization) or positive meanings (amelioration).

Researchers have used web data, social media, and large corpora to identify COVID-related formal neologisms (Roig-Marín 2020; Scott 2020; Thorne 2020). However, COVID-related semantic neologisms have received less attention, partly due to the methodological complexity of studying semantic change. Existing research mainly consists of qualitative case studies on COVID-related meaning variation and change (Dong et al. 2021; Irshad et al. 2021; Ullah Shaheen et al. 2021).

2.2 Computational analyses of semantic change and socio-semantic variation

Recent advances in natural language processing (NLP) have enabled large-scale, quantitative studies of lexical semantic change. Most NLP approaches use word embeddings (Mikolov et al. 2013) as representations of lexeme meanings based on distributional properties (Firth 1957). Although various types of change are acknowledged (Tahmasebi et al. 2021: Table 1.2), most previous research has not differentiated between denotational and connotational change. Many studies investigate “usage change”, including cases of homonymy, but do not distinguish between these two types (Gonen et al. 2020). Despite the success of word embeddings in large-scale studies of long-term change (Hamilton et al. 2016; Kim et al. 2014; Kutuzov et al. 2018), linguistic processes underlying semantic innovation and their effects have remained underexplored.

Recent advances allow for more precise examinations of short-term lexical semantic shifts over years, not decades, and focus on the relation between such shifts and socio-semantic variation (Del Tredici et al. 2019; Robertson et al. 2021; Shoemark et al. 2019; Tsakalidis et al. 2019), confirming the significance of social meaning in NLP (Nguyen et al. 2021). For example, Hofmann et al. (2021) used dynamic word embeddings incorporating temporal and social dynamics to investigate meaning change, while Lucy and Bamman (2021) explored community-specific usage patterns and word senses on Reddit. Signoroni et al. (2022) examined COVID-related meaning changes on Reddit with an approach similar to ours, but focused on the Italian language without exploring socio-semantic variation or distinguishing between distinct types of usage changes.

2.3 Moral foundations theory

In our analysis, we draw upon Moral Foundations Theory (MFT) to explore the moral dimensions that shape the meanings and interpretations of COVID-related terms on Reddit. MFT describes the basis of human moral reasoning (Graham et al. 2013; Haidt 2007, 2012) and argues that there are five core moral foundations that underlie moral judgements across cultures: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, and sanctity/degradation, with liberty/oppression sometimes considered as a sixth foundation. These moral foundations have been found to be associated with moral intuitions, emotions, and values concerning societal issues such as COVID.

MFT offers a useful framework for examining how moral values influence word meaning and use across communities (Dehghani et al. 2016; Graham et al. 2009), revealing the emphasis on specific moral foundations in a community and their impact on language use (Hofmann et al. 2022; Sagi and Dehghani 2014). We focus on MFT’s loyalty/betrayal dimension, concerning group allegiance and disloyalty. Previous work indicates that binding moral foundations, including loyalty/betrayal, influence attitudes towards COVID-related sociopolitical measures (Bruchmann and LaPierre 2022; Nan et al. 2022; Tarry et al. 2022; Zhou et al. 2022). The loyalty/betrayal foundation gains particular relevance due to heightened social polarization since the start of the pandemic (Lang et al. 2021), resulting in the emergence of conspiracy theories and communities, such as r/conspiracy. Loyalty/betrayal values have been linked with conspiracy thinking (Leone et al. 2019) and conspiracy communities often exhibit strong (dis)loyalty towards particular narratives or beliefs. We assess the similarity between the semantic representations of COVID-related neologisms and the semantic representations of loyalty and betrayal, thereby gauging the extent to which these terms align with the poles of this moral foundation.

2.4 Socio-semantic variation in neologisms

Earlier theoretical work has highlighted the importance of socio-semantic variation for the study of meaning and meaning change (Clark 1996; Geeraerts 2015; Hasan 1989; Nguyen et al. 2021). Apart from occasional earlier attempts (e.g. Peirsman et al. 2010), there have been relatively few large-scale empirical approaches to inter-community variation (Del Tredici et al. 2019; Gonen et al. 2020; Hofmann et al. 2022; Schmid et al. 2020).

We aim to add to this work by focusing on community-specific differences in the use of semantic neologisms. New words, by definition, have low degrees of conventionality in the speech community, as do new meanings associated with existing words. Semantic neologisms associated with COVID are particularly susceptible to socio-semantic variation, due to the heightened social polarization that has accompanied the pandemic. This trend has been documented in news reports (Hart et al. 2020) as well as on social media platforms (Green et al. 2020; Jiang et al. 2020; Jing and Ahn 2021; Lang et al. 2021). From a sociolinguistic perspective, fragmentation and polarization of the speech community into echo chambers can drive socio-semantic variation and contribute to semantic change. Echo chambers can impede the diffusion of linguistic conventions, which is essential for establishing shared norms of language use across the speech community (Schmid 2020).

3 Data

We draw on data from the social media platform Reddit, which provides a large sample of authentic language use spanning the period before and after the pandemic. The platform features a large community of about 52 million daily active users, who participate in about 130,000 active communities known as “subreddits”.^[2] Communities are referred to using the prefix r/, and typically form around types of content (r/pics), specific topics (r/UkrainianConflict), shared interests (r/politics), or attitudes (r/Conservative). Users join subreddits according to their interests and contribute forum posts (“submissions”) and comments in a hierarchically organized structure.

We collect Reddit data (Baumgartner et al. 2020) using the Python library psaw (https://github.com/dmarx/psaw). First, to identify COVID-related semantic neologisms that have changed in meaning since the start of the pandemic, we retrieve a random sample of comments for the years 2019 and 2020. These two data sets are used to train historical word embedding models. Second, to study socio-semantic variation between communities, we use our 2020 data set to identify those communities on Reddit that are most actively involved in the COVID discourse, based on keyword queries for “COVID” using the Reddit API. We then select two communities that are representative of two main stances in COVID-related discourse: neutral, mainstream positions (r/Coronavirus), and more sceptical and critical stances towards the predominant public and sociopolitical attitudes and measures (r/conspiracy). For both subreddits, we obtain all comments for the year 2020. We then train word embedding models for each of these data sets to investigate socio-semantic variation between the communities. Table 1 contains an overview of the data sets used.

Table 1:

Data sets for semantic change detection (2019 and 2020) and socio-semantic variation; total number of comments and tokens (in millions).

Data set	Comments (in millions)	Tokens (in millions)
2019	5.3	178
2020	5.4	185
r/Coronavirus	4.1	110
r/conspiracy	4.0	109

4 Methods

We performed a series of preprocessing steps to clean up the Reddit data sets before generating and analysing semantic representations.^[3] We removed duplicate comments, non-English comments, and comments shorter than 10 tokens. We then lowercased and tokenized all texts and removed punctuation and numeric tokens. Within each comment, we further eliminated tokens with three or fewer characters due to the high prevalence of ambiguous and non-standard variants in this category. Additionally, we used a blacklist to remove material involving bots and usernames, subreddit titles (e.g. AskReddit), and Reddit-specific jargon (e.g. submission).
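The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the blacklist entries, the whitespace tokenizer, and the choice to apply the 10-token minimum before short tokens are dropped are all assumptions, and language filtering is omitted because it would require an external library.

```python
import string

# Hypothetical blacklist entries; the actual list used in the paper is not given here.
BLACKLIST = {"askreddit", "submission"}
MIN_COMMENT_TOKENS = 10   # comments shorter than this are discarded
MIN_TOKEN_CHARS = 4       # i.e. tokens with three or fewer characters are dropped


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation,
    and drop tokens that consist only of digits."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if token and not token.isdigit():
            tokens.append(token)
    return tokens


def preprocess(comments):
    """Return one token list per retained comment."""
    seen = set()
    cleaned = []
    for text in comments:
        if text in seen:          # remove exact duplicate comments
            continue
        seen.add(text)
        tokens = tokenize(text)
        if len(tokens) < MIN_COMMENT_TOKENS:
            continue
        tokens = [
            t for t in tokens
            if len(t) >= MIN_TOKEN_CHARS and t not in BLACKLIST
        ]
        cleaned.append(tokens)
    return cleaned
```

The output is a list of token lists, which is the input shape most word embedding toolkits expect.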

For each of our data sets (Table 1), we trained word2vec word embeddings (Mikolov et al. 2013) as implemented in Gensim (Rehurek and Sojka 2011).^[4] We identified the intersecting vocabulary between models and subset both models to this shared vocabulary. Subsequently, we applied orthogonal Procrustes alignment (Hamilton et al. 2016; Schönemann 1966) to align their 300-dimensional vector spaces.
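To make the alignment step concrete, the following sketch shows orthogonal Procrustes alignment on two embedding matrices using NumPy. It assumes the two matrices have already been restricted to the shared vocabulary with rows in the same word order; the function names are illustrative, and any normalization or centring the authors may have applied is not reproduced.

```python
import numpy as np


def intersect_vocab(vocab_a, vocab_b):
    """Words present in both vocabularies, in the order of vocab_a."""
    shared = set(vocab_b)
    return [w for w in vocab_a if w in shared]


def procrustes_align(base, other):
    """Rotate `other` into the space of `base`.

    Both arguments are (n_words, n_dims) arrays whose rows refer to the
    same words. The orthogonal matrix R minimising the Frobenius norm
    ||other @ R - base|| is U @ Vt, where U, Vt come from the SVD of
    other.T @ base.
    """
    u, _, vt = np.linalg.svd(other.T @ base)
    rotation = u @ vt
    return other @ rotation
```

Because the rotation is orthogonal, distances and angles within the rotated model are preserved; only its orientation relative to the base model changes.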

To measure change over time, we calculated cosine distances between the embeddings of the same words in the two historical models. To analyse dimensions of semantic variation, we projected the embeddings onto two semantic axes, following the approach by An et al. (2018).^[5] This allows us to study how the representations of our models differ along axes defined by antonyms such as good versus bad. Finally, we visualized semantic spaces by performing dimensionality reduction using t-SNE (van der Maaten and Hinton 2008) on the vectors.

5 Results

5.1 Meaning change and semantic neology

To detect semantic neologisms, we used the historical data sets (Table 1) to train embedding models for the years 2019 and 2020. The models’ vocabulary sizes are 252,564 (2019) and 277,707 (2020), respectively, and the shared vocabulary contains 190,756 types. We then aligned the models with orthogonal Procrustes alignment (Hamilton et al. 2016).

Next, as described in Section 4, we computed pairwise cosine distances for all words in the shared vocabulary, comparing their vector representations in the 2019 and 2020 models. This allowed us to identify those words that exhibit the highest semantic distance. To ensure reliable semantic representations and mitigate effects of frequency differences on change scores, we excluded words used fewer than 100 times in either of the two subcorpora.
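The ranking step can be illustrated with a short sketch. It assumes the two models are already aligned and are available as plain dictionaries mapping words to vectors, with separate frequency dictionaries per subcorpus; these data structures and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    return 1.0 - cosine_similarity(u, v)


def rank_by_change(vecs_a, vecs_b, freq_a, freq_b, min_freq=100):
    """Rank shared words by cosine distance between the two aligned models,
    skipping words used fewer than `min_freq` times in either subcorpus."""
    scores = {}
    for word in vecs_a.keys() & vecs_b.keys():
        if freq_a.get(word, 0) < min_freq or freq_b.get(word, 0) < min_freq:
            continue
        scores[word] = cosine_distance(vecs_a[word], vecs_b[word])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Since cosine similarity can be negative, the distance can exceed one, which is how a word can receive a change score slightly above 1.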

Table 2 presents the resulting list of 20 candidates^[6] for semantic change, along with the primary type of change^[7] associated with each. At first glance, our approach seems to effectively detect semantic neologisms, as most lexemes are COVID-related and likely experienced meaning shifts due to the socio-cultural impact of the pandemic. We further analyse the semantic representations of these change candidates below to validate the observed changes.

Table 2:

Words that show the highest degree of semantic change between 2019 and 2020. Semantic distance (SemDist) is the cosine distance between a word’s vector representations in the 2019 and the 2020 embedding models. (The cosine distance of lockdowns is greater than one since its cosine similarity score based on the word2vec models is slightly below zero.) All COVID-related words are highlighted in bold, and we specify the primary type of change associated with each.

Word	SemDist	Change
lockdowns	1.02	denotational
maskless	1.00	denotational
sunsetting	1.00	connotational
childe	0.98	homonymous
megalodon	0.98	homonymous
Newf	0.96	homonymous
corona	0.93	homonymous
filtrate	0.92	connotational
Chaz	0.90	homonymous
Klee	0.89	homonymous
Rona	0.89	homonymous
Cerb	0.87	homonymous
rittenhouse	0.87	homonymous
vacuo	0.86	connotational
moderna	0.84	homonymous
pandemic	0.84	denotational
spreader	0.84	denotational
distancing	0.83	denotational
Sars	0.83	connotational
Quarantines	0.82	denotational

In addition to many well-established COVID-related words such as lockdowns or distancing, our model also detects less widespread terms such as cerb (“Canada Emergency Response Benefit for COVID”) or vacuo (medical term for vacuum). The semantic changes for the majority of the detected COVID-related neologisms are denotational rather than connotational in nature, according to Koch (2016): the terms spreader, distancing, maskless, pandemic, and quarantines were used in more diverse contexts before the start of the pandemic, but were used to refer to a narrower set of COVID-related concepts in 2020. For instance, distancing has evolved to predominantly refer to social distancing practices for infection prevention.

A second set of words in Table 2 display connotational differences that can be related to the social and stylistic dimensions of meaning (Leech 1981), showing that the diachronic change captured by our models can be traced back to socio-semantic variation at the community level. Our models appear to capture stylistic semantic variation for the terms filtrate, sars, and vacuo. The terms filtrate and PCR test still refer to the same concepts and entities in 2020, but their stylistic signature has changed significantly. Initially limited to formal, academic discourse, these terms spread into public discourse and have partly lost their connotations of jargon and formality. The term sunsetting represents an example of socio-semantic variation. Not related to COVID, sunsetting mainly refers to terminating programmes and services, often in legal or business contexts.^[8] Since 2020, gamers have increasingly used it to denote the disappearance of virtual items.

In addition, our models capture distributional changes that are not the result of semantic change in the narrow, linguistic sense. We detect cases of homonymy where new words surface that are identical in form to existing words, yet are etymologically and semantically unrelated. This applies to the COVID-related terms moderna, corona, cerb, and rona,^[9] and all remaining cases of non-COVID-related terms in Table 2 can also be attributed to emerging homonymy.^[10]

5.2 Socio-semantic variation

5.2.1 COVID-related communities

To study socio-semantic variation in the use of these COVID-related neologisms, we identified those communities that were most actively engaged in COVID-related discourse. To this end, we extracted all COVID-related comments from our 2020 data set that contain the term COVID and related formal variants, using frequency of occurrence as a proxy for agenda-setting, following Hofmann et al. (2022).^[11] The resulting data set contains 3.8 million comments and 145 million word tokens. We then determined the communities with the highest number of COVID-related comments. Figure 1 presents the 15 most active communities in this data set.
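As a rough illustration of this selection step, the sketch below counts matching comments per subreddit. The regular expression is a stand-in: the actual list of formal variants of COVID is given in the footnote and is not reproduced here, and the `(subreddit, text)` input format is an assumption.

```python
import re
from collections import Counter

# Hypothetical pattern; the paper's full list of formal variants is not shown here.
COVID_RE = re.compile(r"covid", re.IGNORECASE)


def most_active_communities(comments, top_n=15):
    """Count COVID-related comments per subreddit.

    `comments` is an iterable of (subreddit, text) pairs; the result is a
    list of (subreddit, count) pairs sorted by descending count.
    """
    counts = Counter(
        subreddit for subreddit, text in comments if COVID_RE.search(text)
    )
    return counts.most_common(top_n)
```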

Figure 1:

Most active COVID-related communities in our 2020 data set.

As described in Section 3, we selected two communities that represent diverging viewpoints in the COVID discourse and provide enough data for generating community-specific semantic representations. At the time of data collection, the subreddit r/Coronavirus had about 2.4 million users and contained open discussions and mainstream positions on the pandemic.^[12] The subreddit r/conspiracy had about 1.7 million users and represents a slightly smaller community of sceptics who are critical of COVID-related measures such as masks, lockdowns, and the general response by science, media, and politics. As stated in its community description, its scepticism extends to sociopolitical issues beyond the pandemic: “We hope to challenge issues which have captured the public’s imagination, from JFK and UFOs to 9/11.” For both subreddits, we retrieved all comments for the year 2020, resulting in the data sets described in Table 1.

5.2.2 Dimensions of socio-semantic variation

To study the degree to which the neologisms show socio-semantic variation, we trained community-specific word embeddings models. We studied social differences for the semantic neologisms identified in the previous step by analysing variation along two semantic dimensions to generalize beyond single cases.

As pointed out in Section 2.1, semantic variation and change often involve connotational differences along an evaluative dimension (Koch 2016). We aimed to detect such differences by projecting the semantic representations of the detected neologisms on an evaluative continuum. We followed the methodology proposed by An et al. (2018) and constructed two semantic poles around the seed pole words good and bad for each model and generated a semantic axis by subtracting the two pole vectors from each other.^[13] We then projected the target words on this axis, measuring their cosine similarity with the axis. Higher similarity values indicate a closer association with good, lower values with bad.
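A minimal sketch of this kind of axis projection is shown below. It assumes each pole is the mean of one or more pole word vectors and that embeddings are held in a plain dictionary; how the poles were actually expanded around the seed words is described in the footnote and is not reproduced here. The two-dimensional vectors in the usage example are made up purely for illustration.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def semantic_axis(embeddings, positive_words, negative_words):
    """Axis pointing from the negative pole to the positive pole."""
    pos = mean_vector([embeddings[w] for w in positive_words])
    neg = mean_vector([embeddings[w] for w in negative_words])
    return [p - q for p, q in zip(pos, neg)]


def project(embeddings, word, axis):
    """Cosine similarity of a word with the axis: higher means closer to
    the positive pole, lower means closer to the negative pole."""
    return cosine_similarity(embeddings[word], axis)


# Toy usage with invented vectors (not real model output).
toy = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "target": [0.8, 0.6]}
axis = semantic_axis(toy, ["good"], ["bad"])
score = project(toy, "target", axis)
```

Running the same projection separately on each community-specific model yields one score per word and community, which can then be compared.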

On a second dimension, we investigated whether community-specific semantic representations align with the loyalty/betrayal dimension of MFT. Since loyalty/betrayal correlates with attitudes towards COVID-related sociopolitical measures (Bruchmann and LaPierre 2022; Nan et al. 2022; Tarry et al. 2022; Zhou et al. 2022) and conspiracy thinking (Leone et al. 2019), we aimed to determine whether these differences are reflected in semantic representations between communities. Analogous to the approach for the evaluative axis described above, we generated semantic poles for the two concepts by using the semantic representations of the words loyalty and betrayal to represent the conceptual poles of this moral axis. We used both terms as seed words to construct a semantic axis and projected the target neologisms to assess how closely they are associated with loyalty/betrayal in the two communities.

Figure 2 presents the results for the projections on both dimensions. It covers the set of semantic neologisms detected in the previous step, except for those unrelated to COVID^[14] and those that were not used in the selected communities.^[15] To compensate for this removal, we added four words that have been recognized as contentious topics in COVID discourse: masks, vaccines, science, and research (Lang et al. 2021; Nan et al. 2022; Ullah Shaheen et al. 2021).

Figure 2:

Projecting semantic representations for the communities r/Coronavirus and r/conspiracy on two semantic axes. Higher values in cosine similarity between the target words and the semantic axes indicate closer association with good and loyalty, respectively. Words are sorted according to the difference in their semantic similarity with the semantic axes.

Overall, Figure 2 reveals substantial differences between the two communities, which were determined to be statistically significant using the Wilcoxon signed-rank test (Wilcoxon 1945).^[16] These differences do not seem to result from frequency effects, as the terms are used with similar frequency in the two communities (Table 4).

Figure 2a, which covers the evaluative dimension, shows that the COVID-related semantic neologisms are generally more negatively connotated in r/conspiracy than in r/Coronavirus. The cosine similarity between the target words’ semantic representations and the semantic axis is generally lower for this community. Several words such as corona, pandemic, and spreader show little difference between communities. However, terms related to sociopolitical measures combating the pandemic, such as lockdowns, quarantines, distancing, vaccines, and masks receive notably more negative evaluations in r/conspiracy. For instance, an r/Coronavirus user views vaccines positively: “it seems too good to be true that the vaccines is [sic] already ready”. Conversely, an r/conspiracy user regards vaccines as “dangerous and ineffective and remember every fake problem breeds a multitude of fake solutions”. These examples illustrate the connotational differences in semantic representations between the communities along the evaluative dimension.

Figure 2b presents the results of the projection onto the loyalty/betrayal axis. On the whole, the target words are more closely associated with the betrayal pole in r/conspiracy than in r/Coronavirus. Similarly to the results on the evaluative dimension, the first set of terms such as corona, sars, and pandemic exhibits little difference between communities. However, words related to government measures such as masks, lockdowns, quarantines, and vaccines are more strongly associated with betrayal in r/conspiracy, and more strongly with loyalty in r/Coronavirus. Finally, the terms research and science show the greatest differences between the two communities. The subreddit r/Coronavirus exhibits a stronger association of both terms with loyalty compared to the conspiracy community. For instance, an r/Coronavirus user expresses trust, stating they are “comfortable trusting the people for whom creating, testing, administering, approving vaccines is their life’s work”. In contrast, an r/conspiracy user declares, “[the government] has been caught in so many lies regarding vaccines, there’s no way I trust anything”. This illustrates a stronger sense of loyalty towards institutions, science, and sociopolitical measures in r/Coronavirus, while the r/conspiracy community tends to exhibit a sense of betrayal in relation to these issues.

Overall, we observe a significant overlap in socio-semantic variation, with the same words generally exhibiting differences in their semantics between communities along both dimensions. In particular, terms related to sociopolitical measures, such as lockdowns, quarantines, and vaccines, are more closely associated with bad and betrayal for r/conspiracy. The word distancing constitutes an exception, since it is connected more strongly with bad in r/conspiracy, but shows little variation in terms of loyalty/betrayal. Furthermore, more general terms like research and science, which are assessed neutrally or positively in the mainstream r/Coronavirus community, do not exhibit associations with loyalty in the conspiracy community. This is in contrast to the positive and loyal connections that these terms, along with those indicating sociopolitical measures, demonstrate within the r/Coronavirus community. These findings are consistent with the conspiracy community’s overarching goal to “challenge issues which have captured the public’s imagination”, as stated in its community description. This scepticism appears to apply to COVID-related issues as well as research and science as authoritative sources of objective truth.

5.2.3 Maps of socio-semantic variation

Finally, we aimed to get a more differentiated picture of the types of socio-semantic variation detected above. We visualized the semantic space surrounding the word vaccines, which has been demonstrated to be particularly revealing of sociopolitical divides in previous studies (Nan et al. 2022), and exhibited the highest degree of socio-semantic variation in Figure 2. To this end, we aligned the embedding models of both communities and used t-SNE (van der Maaten and Hinton 2008) for reducing the dimensionality of the vector representations to visualize lexical meanings in a two-dimensional space.

Figure 3 illustrates the semantic space of vaccines, with Figure 3a displaying the 10 nearest semantic neighbours for each community. Despite minor differences, the plot indicates a largely shared semantic space for vaccines in both communities. Semantic clusters form without clear community separation. A central bottom cluster contains the singular form vaccine and its suffixation vaccination. While semantically close to vaccines, they differ as singular forms. Our models capture these grammatical differences, consistent with prior studies (e.g. Giulianelli et al. 2020). The last three terms, orthographic variants (vax, vx) or spelling mistakes (vaccins), are scattered across the space. Overall, Figure 3a demonstrates a high overlap for vaccines, suggesting both communities agree on its core denotational meaning.

Figure 3:

Semantic maps for the meaning of the term vaccines in the communities r/Coronavirus and r/conspiracy.

To get a better view of the semantic differences between the two communities, we now focus on the discrepancies in the semantic space of the term vaccines. Figure 3b presents the 20 nearest neighbours for each community after having filtered out those words that are shared between the two communities. The resulting semantic space of vaccines shows a clear separation between the two communities. The semantic neighbours for r/Coronavirus cluster towards the top right of the plot. They cover a broad range of vaccine-related terms, including biological terms (adenoviruses) and terms related to vaccine development (assays, supercomputers) and the pharmaceutical industry (drugmakers).
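One way to obtain such community-specific neighbour lists is sketched below. It assumes embeddings are stored as dictionaries and takes the k nearest neighbours per community before removing the shared ones; whether filtering happened before or after selecting the 20 neighbours in the original analysis is not specified here, so this ordering is an assumption.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(embeddings, target, k):
    """The k words most similar to `target` by cosine similarity."""
    target_vec = embeddings[target]
    sims = [
        (word, cosine_similarity(vec, target_vec))
        for word, vec in embeddings.items()
        if word != target
    ]
    sims.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in sims[:k]]


def community_specific_neighbours(emb_a, emb_b, target, k=20):
    """Neighbours of `target` that appear in only one community's top k."""
    nn_a = nearest_neighbours(emb_a, target, k)
    nn_b = nearest_neighbours(emb_b, target, k)
    shared = set(nn_a) & set(nn_b)
    only_a = [w for w in nn_a if w not in shared]
    only_b = [w for w in nn_b if w not in shared]
    return only_a, only_b
```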

The semantic neighbours for r/conspiracy are less diverse and show two main clusters. The first set towards the bottom centre covers vaccines that are unrelated to COVID, such as chickenpox, measles, and hpv. The appearance of these terms may be explained by the fact that speakers in this community are not only critical of the COVID vaccines, but show general scepticism towards vaccination. The second main group in the mid-left part of the plot contains terms that are associated with conspiracy theories. These theories generally regard the vaccines as dangerous and claim that they cause a range of bio-medical side effects: causing brain damage due to neurotoxins and decreased fertility due to hcg,^[17] and turning people into genetically modified organisms (gmos).

These differences in the semantic space of the term vaccines in the two communities shed light on the socio-semantic variation on the evaluative and loyalty/betrayal dimensions identified by our embeddings projection approach in the previous section. The two communities seem to share a core denotational semantic representation of vaccines, as shown by the similarity in closest neighbours in Figure 3a. However, Figure 3b highlights connotational and denotational differences between the communities. Users in r/Coronavirus generally view vaccines as effective medical tools, express loyalty towards their deployment, and evaluate them positively: “the components in modern vaccines have been studied and we have massive amounts of empirical data regarding their safety and efficacy”. In contrast, users in r/conspiracy often perceive vaccines as dangerous or manipulative, display disloyalty towards involved institutions, and evaluate them negatively: “scientists are evil people, insane, and yes, vaccines have side effects”.

6 Discussion

In this paper, we found that the coronavirus pandemic has caused considerable semantic innovation in the English lexicon. This demonstrates the potential of word embedding models for studying very recent semantic change, as well as the strength of the impact of COVID on language and society (Signoroni et al. 2022). Our results show that a high proportion of words that show the greatest shifts between 2019 and 2020 are related to the pandemic. After a closer inspection, we find that the detected shifts represent different types of semantic variation over time. The first group contains homonyms like the proper noun rittenhouse that fall outside the scope of semantic neology. The detected neologisms mainly show denotational change, as in the semantic specialization of distancing. In addition, we find connotational changes that can be related to stylistic (e.g. filtrate) and social (e.g. sunsetting) dimensions of meaning. Overall, our semantic change detection method successfully identified neologisms to a significant degree. However, our findings emphasize the necessity of differentiating between various types of “usage change” (Tahmasebi et al. 2021). A more in-depth analysis was needed to eliminate cases of homonymy and differentiate between denotational and connotational changes, ultimately providing a more precise understanding of the semantic shifts occurring. More differentiated analyses of candidates for semantic changes have been outside the focus of most previous approaches in NLP, but recent approaches using token-based embeddings show promising results in that direction (Giulianelli et al. 2020).

In the second part of the paper, we focused on the socio-semantic variation of COVID-related neologisms. Drawing from theoretical models of semantic change (Koch 2016) and Moral Foundations Theory (Graham et al. 2013), we examined variations on evaluative (good/bad) and moral (loyalty/betrayal) dimensions of meaning. We observed significant differences between communities, with r/conspiracy showing more negative and betrayal-related associations for COVID-related terms overall. Terms related to sociopolitical measures (e.g. vaccines, lockdowns, masks) were evaluated more negatively in r/conspiracy than in r/Coronavirus. Additionally, these terms, along with general terms like research and science, had stronger associations with disloyalty or betrayal in r/conspiracy.

To get a more detailed picture of socio-semantic variation, we analysed the semantic space surrounding the term vaccines by visualizing its nearest semantic neighbours in both communities. We noticed substantial overlap between communities, indicating a shared core denotational meaning across the groups. Nevertheless, we also identified denotational and connotational differences, aligning with the general pattern of variations observed in the evaluative and loyalty/betrayal dimensions of meaning.
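One simple way to quantify the overlap between the neighbourhoods of a term in two communities is to compare the sets of nearest neighbours directly. The sketch below retrieves the k most similar words by cosine similarity in each model and reports their Jaccard overlap. It complements, but does not replace, the visual inspection of the semantic space described above; the vocabulary, vectors, and function names are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest_neighbours(model, target, k):
    """Return the k words most similar to `target`, excluding the target itself."""
    scores = {
        word: cosine(model[target], vector)
        for word, vector in model.items()
        if word != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented toy embeddings for two communities (not taken from the paper).
community_a = {
    "vaccines": [1.0, 0.0],
    "doses": [0.9, 0.1],
    "trials": [0.8, 0.3],
    "immunity": [0.7, 0.6],
    "mandates": [-0.5, 0.8],
}
community_b = {
    "vaccines": [1.0, 0.0],
    "doses": [0.9, 0.2],
    "mandates": [0.8, 0.3],
    "trials": [0.3, 0.9],
    "immunity": [0.1, 1.0],
}

neighbours_a = set(nearest_neighbours(community_a, "vaccines", k=2))
neighbours_b = set(nearest_neighbours(community_b, "vaccines", k=2))

# Shared neighbours hint at a common core meaning; the remaining neighbours
# point to community-specific associations.
shared = neighbours_a & neighbours_b
jaccard = len(shared) / len(neighbours_a | neighbours_b)
```

A high Jaccard value would suggest largely shared usage, whereas neighbours found in only one community are candidates for closer qualitative inspection.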

These results concur with previous research that identified diverging views on COVID-related concepts between communities (Lang et al. 2021). Prior studies have demonstrated differences in positive versus negative evaluations and loyalty/betrayal attitudes towards COVID issues (Bruchmann and LaPierre 2022; Nan et al. 2022; Tarry et al. 2022; Zhou et al. 2022). In our data set, the conspiracy community exhibits stronger associations with negative evaluation and betrayal, consistent with earlier work (Leone et al. 2019). This suggests that differing attitudes between communities manifest in diverging semantic representations of COVID-related terms.

Furthermore, our results emphasize the significance of incorporating socio-semantic variation in studies of semantic change. As community-specific variation can drive semantic shifts through processes such as (inter-)subjectification, amelioration, and pejoration (Koch 2016), examining socio-semantic variation can enrich investigations of meaning change. Additionally, studies of semantic change that overlook community-specific effects might misinterpret the observed variation as evidence of diachronic change. This consideration is particularly crucial in studies utilizing social media data, where highly active, polarized communities can skew aggregate measures of semantic change. In such instances, taking socio-semantic variation into account can offer a more nuanced perspective of semantic change, while also shedding light on the underlying social distinctions between communities.

Corresponding author: Quirin Würschinger, LMU, Munich, Germany, E-mail: q.wuerschinger@lmu.de

Funding source: Ludwig-Maximilians-Universität München

Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: QW: conceptualization, data curation, formal analysis, investigation, methodology, software, visualization, validation, writing – original draft, writing – review & editing. BMcG: conceptualization, funding acquisition, supervision, methodology, writing – original draft, writing – review & editing.
Competing interests: The authors state no conflict of interest.
Research funding: This work was supported by Ludwig-Maximilians-Universität München.
Data availability: Not applicable.

Appendix

Extended list of candidates for semantic change (Table 3)

Table 3:

An extended list of 50 words that show the highest semantic difference between our 2019 and 2020 models. COVID-related terms are marked in bold. Proper nouns were removed. In total, 19 terms were manually identified as being related to COVID.

Rank	Lexeme	SemDist
1	lockdowns	1.02
2	maskless	1.00
3	sunsetting	1.00
4	newf	0.96
5	corona	0.93
6	filtrate	0.92
7	chaz	0.90
8	rona	0.89
9	cerb	0.87
10	vacuo	0.86
11	moderna	0.84
12	pandemic	0.84
13	spreader	0.84
14	distancing	0.83
15	sars	0.83
16	quarantines	0.82
17	yada	0.82
18	recounts	0.82
19	alway	0.81
20	yadda	0.80
21	pandemics	0.80
22	pansies	0.79
23	tosser	0.79
24	bipoc	0.79
25	ventilators	0.79
26	budging	0.79
27	diys	0.78
28	thst	0.78
29	flyweight	0.77
30	yeap	0.77
31	mrna	0.77
32	tiktoks	0.77
33	buuuut	0.76
34	coomer	0.76
35	unfortunatly	0.75
36	anywho	0.75
37	quarantining	0.74
38	venti	0.74
39	webrip	0.74
40	obvi	0.74
41	fkin	0.74
42	modus	0.73
43	tink	0.73
44	duplicating	0.73
45	retinoids	0.73
46	parasol	0.72
47	copypastas	0.72
48	excercise	0.72
49	newbies	0.72
50	mers	0.72

Frequency counts per community (Table 4)

Table 4:

Frequency of target words in each community for both social embedding models.

Word	r/Coronavirus	r/conspiracy
masks	115,111	118,227
pandemic	73,902	67,364
vaccines	41,005	37,084
science	40,419	36,256
research	33,946	30,612
distancing	25,353	23,914
corona	21,750	20,878
sars	19,670	19,031
lockdowns	16,637	15,881
moderna	3,805	4,120
rona	1,720	1,879
quarantines	1,634	1,806
spreader	1,566	1,741
maskless	1,150	1,323

References

An, Jisun, Haewoon Kwak & Yong-Yeol Ahn. 2018. Semaxis: A lightweight framework to characterize domain-specific word semantics beyond sentiment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long papers), 2450–2461. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1228.

Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire & Jeremy Blackburn. 2020. The Pushshift Reddit dataset. Proceedings of the International AAAI Conference on Web and Social Media 14. 830–839. https://doi.org/10.1609/icwsm.v14i1.7347.

Bruchmann, Kathryn & Liya LaPierre. 2022. Moral foundations predict perceptions of moral permissibility of COVID-19 public health guideline violations in United States university students. Frontiers in Psychology 12. 795278. https://doi.org/10.3389/fpsyg.2021.795278.

Clark, Herbert H. 1996. Using language. Cambridge: Cambridge University Press.

Dehghani, Morteza, Kate Johnson, Joe Hoover, Eyal Sagi, Justin Garten, Niki Jitendra Parmar, Stephen Vaisey, Rumen Iliev & Jesse Graham. 2016. Purity homophily in social networks. Journal of Experimental Psychology: General 145(3). 366–375. https://doi.org/10.1037/xge0000139.

Del Tredici, Marco, Raquel Fernández & Gemma Boleda. 2019. Short-term meaning shift: A distributional exploration. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and short papers), 2069–2075. Minneapolis, MN: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1210.

Dong, Jihua, Louisa Buckingham & Hao Wu. 2021. A discourse dynamics exploration of attitudinal responses towards COVID-19 in academia and media. International Journal of Corpus Linguistics 26(4). 532–556. https://doi.org/10.1075/ijcl.21103.don.

Firth, John R. 1957. A synopsis of linguistic theory, 1930–1955 (Studies in Linguistic Analysis). Oxford: Basil Blackwell.

Geeraerts, Dirk. 2010. Theories of lexical semantics. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198700302.001.0001.

Geeraerts, Dirk. 2015. How words and vocabularies change. In John R. Taylor (ed.), The Oxford handbook of the word, 416–430. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199641604.013.026.

Giulianelli, Mario, Marco Del Tredici & Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3960–3973. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.365.

Gonen, Hila, Ganesh Jawahar, Djamé Seddah & Yoav Goldberg. 2020. Simple, interpretable and stable method for detecting words with usage change across corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 538–555. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.51.

Graham, Jesse, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P. Wojcik & Peter H. Ditto. 2013. Moral foundations theory: The pragmatic validity of moral pluralism. Advances in Experimental Social Psychology 47. 55–130. https://doi.org/10.1016/B978-0-12-407236-7.00002-4.

Graham, Jesse, Jonathan Haidt & Brian A. Nosek. 2009. Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology 96. 1029–1046. https://doi.org/10.1037/a0015141.

Green, Jon, Jared Edgerton, Daniel Naftel, Kelsey Shoub & S. Cranmer. 2020. Elusive consensus: Polarization in elite communication on the COVID-19 pandemic. Science Advances 6(28). 1–5. https://doi.org/10.1126/sciadv.abc2717.

Haidt, Jonathan. 2007. The new synthesis in moral psychology. Science 316(5827). 998–1002. https://doi.org/10.1126/science.1137651.

Haidt, Jonathan. 2012. The righteous mind: Why good people are divided by politics and religion. New York: Knopf Doubleday.

Hamilton, William L., Jure Leskovec & Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 1489–1501. Berlin: Association for Computational Linguistics. Available at: http://www.aclweb.org/anthology/P16-1141. https://doi.org/10.18653/v1/P16-1141.

Hart, P. Sol, Sedona Chinn & Stuart Soroka. 2020. Politicization and polarization in COVID-19 news coverage. Science Communication 42(5). 679–697. https://doi.org/10.1177/1075547020950735.

Hasan, Ruqaiya. 1989. Semantic variation and sociolinguistics. Australian Journal of Linguistics 9(2). 221–275. https://doi.org/10.1080/07268608908599422.

Hofmann, Valentin, Xiaowen Dong, Janet Pierrehumbert & Hinrich Schuetze. 2022. Modeling ideological salience and framing in polarized online groups with graph neural networks and structured sparsity. In Findings of the Association for Computational Linguistics: NAACL 2022, 536–550. Seattle: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-naacl.41.

Hofmann, Valentin, Janet Pierrehumbert & Hinrich Schütze. 2021. Dynamic contextualized word embeddings. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Vol. 1: Long papers), 6970–6984. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.542.

Irshad, Sadia, Sadia Arshad & Kaukab Saba. 2021. Lexicogrammatical features of Covid-19: A syntagmatic and paradigmatic corpus based analysis. CORPORUM: Journal of Corpus Linguistics 4(2). 76–94.

Jiang, Julie, Emily Chen, Shen Yan, Kristina Lerman & Emilio Ferrara. 2020. Political polarization drives online conversations about COVID-19 in the United States. Human Behavior and Emerging Technologies 2(3). 200–211. https://doi.org/10.1002/hbe2.202.

Jing, Elise & Yong-Yeol Ahn. 2021. Characterizing partisan political narrative frameworks about COVID-19 on Twitter. EPJ Data Science 10(1). 1–18. https://doi.org/10.1140/epjds/s13688-021-00308-4.

Kim, Yoon, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde & Slav Petrov. 2014. Temporal analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, 61–65. Baltimore, MD: Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-2517.

Koch, Peter. 2016. Meaning change and semantic shifts. In Päivi Juvonen & Maria Koptjevskaja-Tamm (eds.), The lexical typology of semantic shifts, 21–66. Berlin/Boston: De Gruyter Mouton. https://doi.org/10.1515/9783110377675-002.

Kutuzov, Andrey, Lilja Øvrelid, Terrence Szymanski & Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: A survey. In Proceedings of the 27th International Conference on Computational Linguistics, 1384–1397. Santa Fe, NM: Association for Computational Linguistics. https://www.aclweb.org/anthology/C18-1117 (accessed 3 August 2020).

Lang, Jun, Wesley W. Erickson & Zhuo Jing-Schmidt. 2021. #MaskOn! #MaskOff! Digital polarization of mask-wearing in the United States during COVID-19. PLoS One 16(4). e0250817. https://doi.org/10.1371/journal.pone.0250817.

Leech, Geoffrey N. 1981. Semantics, 2nd edn. Harmondsworth: Penguin Books.

Leone, Luigi, Mauro Giacomantonio & Marco Lauriola. 2019. Moral foundations, worldviews, moral absolutism and belief in conspiracy theories. International Journal of Psychology 54(2). 197–204. https://doi.org/10.1002/ijop.12459.

Lipka, Leonhard. 1992. An outline of English lexicology (Forschung und Studium Anglistik). Tübingen: Niemeyer.

Lucy, Li & David Bamman. 2021. Characterizing English variation across social media communities with BERT. Transactions of the Association for Computational Linguistics 9. 538–556. https://doi.org/10.1162/tacl_a_00383.

van der Maaten, Laurens & Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(86). 2579–2605.

Mahlberg, Michaela & Gavin Brookes. 2021. Language and Covid-19: Corpus linguistics and the social reality of the pandemic. International Journal of Corpus Linguistics 26(4). 441–443. https://doi.org/10.1075/ijcl.00043.mah.

Mickus, Timothee, Denis Paperno, Mathieu Constant & Kees van Deemter. 2020. What do you mean, BERT? In Proceedings of the Society for Computation in Linguistics 2020, 279–290. New York, NY: Association for Computational Linguistics. Available at: https://aclanthology.org/2020.scil-1.35 (accessed 24 July 2023).

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado & Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, vol. 26. Red Hook, NY, USA: Curran Associates. Available at: https://papers.nips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.

Nan, Xiaoli, Yuan Wang, Kathryn Thier, Clement Adebamowo, Sandra Quinn & Shana Ntiri. 2022. Moral foundations predict COVID-19 vaccine hesitancy: Evidence from a national survey of Black Americans. Journal of Health Communication 27(11–12). 801–811. https://doi.org/10.1080/10810730.2022.2160526.

Nguyen, Dong, Laura Rosseel & Jack Grieve. 2021. On learning and representing social meaning in NLP: A sociolinguistic perspective. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, 603–612. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.50.

Peirsman, Yves, Kris Heylen & Dirk Geeraerts. 2010. Applying word space models to sociolinguistics: Religion names before and after 9/11. In Dirk Geeraerts, Gitte Kristiansen & Yves Peirsman (eds.), Advances in cognitive sociolinguistics, 111–138. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110226461.111.

Rehurek, Radim & Petr Sojka. 2011. Gensim: Python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic 3(2). 45–50.

Robertson, Alexander, Farhana Ferdousi Liza, Nguyen Dong, Barbara McGillivray & Scott A. Hale. 2021. Semantic journeys: Quantifying change in emoji meaning from 2012–2018. In Workshop Proceedings of the 15th International AAAI Conference on Web and Social Media.

Roig-Marín, Amanda. 2020. English-based coroneologisms: A short survey of our Covid-19-related vocabulary. English Today 37. 193–195. https://doi.org/10.1017/S0266078420000255.

Sagi, Eyal & Morteza Dehghani. 2014. Measuring moral rhetoric in text. Social Science Computer Review 32(2). 132–144. https://doi.org/10.1177/0894439313506837.

Schmid, Hans-Jörg. 2020. The dynamics of the linguistic system: Usage, conventionalization, and entrenchment. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780198814771.001.0001.

Schmid, Hans-Jörg, Quirin Würschinger, Melanie Keller & Ursula Lenker. 2020. Battling for semantic territory across social networks: The case of Anglo-Saxon on Twitter. Yearbook of the German Cognitive Linguistics Association 8(1). 3–26. https://doi.org/10.1515/gcla-2020-0002.

Schönemann, Peter H. 1966. A generalized solution of the orthogonal procrustes problem. Psychometrika 31(1). 1–10. https://doi.org/10.1007/BF02289451.

Scott, Ben. 2020. Know your covidiots from your cove-dwellers. Bloomberg.com. https://www.bloomberg.com/opinion/articles/2020-04-03/coronavirus-know-your-covidiots-from-your-cove-dwellers (accessed 22 August 2021).

Shoemark, Philippa, Farhana Ferdousi Liza, Nguyen Dong, Hale Scott & Barbara McGillivray. 2019. Room to glo: A systematic comparison of semantic change detection approaches with word embeddings. In Kentaro Inui, Jing Jiang, Vincent Ng & Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 66–76. https://doi.org/10.18653/v1/D19-1007.

Signoroni, Edoardo, Elisabetta Jezek & Rachele Sprugnoli. 2022. Word usage change and the pandemic: A computational analysis of short-term usage change in the Italian Reddit community. IJCoL: Italian Journal of Computational Linguistics 8(2). 39–62. https://doi.org/10.4000/ijcol.1076.

Tahmasebi, Nina, Lars Borin & Jatowt Adam. 2021. Survey of computational approaches to lexical semantic change detection. In Nina Tahmasebi, Lars Borin, Yang Xu & Simon Hengchen (eds.), Computational approaches to semantic change, 1–91. Berlin: Language Science Press.

Tarry, Hammond, Valérie Vézina, Jacob Bailey & Leah Lopes. 2022. Political orientation, moral foundations, and COVID-19 social distancing. PLoS One 17(6). e0267136. https://doi.org/10.1371/journal.pone.0267136.

Thorne, Tony. 2020. #CORONASPEAK – the language of Covid-19 goes viral – 2. https://language-and-innovation.com/2020/04/15/coronaspeak-part-2-the-language-of-covid-19-goes-viral/ (accessed 12 February 2021).

Tournier, Jean. 1985. Introduction descriptive à la lexicogénétique de l’anglais contemporain. Paris: Champion-Slatkine.

Tsakalidis, Adam, Marya Bazzi, Mihai Cucuringu, Pierpaolo Basile & Barbara McGillivray. 2019. Mining the UK web archive for semantic change detection. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 1212–1221. Varna, Bulgaria: INCOMA. https://doi.org/10.26615/978-954-452-056-4_139.

Ullah Shaheen, Zafar, Ayyaz Qadeer & Fouzia Rehman Khan. 2021. Conspiracy theories (CT) vs truth based reporting: A corpus driven analysis of Covid-19 online newspaper(s) discourse. CORPORUM: Journal of Corpus Linguistics 4(2). 112–135.

Wilcoxon, Frank. 1945. Individual comparisons by ranking methods. Biometric Bulletin 1(6). 80–83. https://doi.org/10.2307/3001968.

Zhou, Alvin, Wenlin Liu, Hye Min Kim, Eugene Lee, Jieun Shin, Yafei Zhang, Ke M. Huang-Isherwood, Chuqing Dong & Aimei Yang. 2022. Moral foundations, ideological divide, and public engagement with U.S. government agencies’ COVID-19 vaccine communication on social media. Mass Communication and Society 25. 1–27. https://doi.org/10.1080/15205436.2022.2151919.

Received: 2022-03-25

Accepted: 2023-09-07

Published Online: 2024-03-15

https://doi.org/10.1515/lingvan-2023-0106
