Article Open Access

What can NLP do for linguistics? Towards using grammatical error analysis to document non-standard English features

Li Nguyen, Shiva Taslimipoor and Zheng Yuan
Published/Copyright: September 27, 2024

Abstract

This paper proposes a novel use of grammatical error detection/correction (GED/GEC) tools to document non-standard English varieties. This is motivated by the fact that both GED/GEC technology and sociolinguistics aim to identify linguistic deviation from the so-called standard, yet there has been little communication between the two fields. We thus investigate whether state-of-the-art GED/GEC models can be effectively repurposed to automatically detect dialectal differences and assist linguistic missions in this regard. We explore this in the context of written Singaporean English (Cambridge Write & Improve), and spoken Vietnamese English (CanVEC), representing an established non-standard variety (Case 1) and an emerging variety (Case 2), respectively. We find that our GED/GEC systems are able to successfully detect a number of both established and new features. We further highlight some of the remaining areas that the systems overlook, as well as opportunities for future developments. This research bridges the gap between GED/GEC and dialectology, emphasizing their shared theme of linguistic deviation from a socially defined standard.

1 Introduction

The study of non-standard forms of English has been a subject of considerable interest in the field of sociolinguistics, since at least Labov’s foundational studies in the 1960s (Labov 1966 and following work). In the context of world Englishes, Kachru (1992) was the first to introduce a differentiation between errors and innovations, which he referred to as “mistakes” and “deviations” respectively. As Kachru argued, while mistakes might not be acceptable to native English speakers, deviations may be more tolerable as they emerge in contexts of new Englishes and are systematic within a particular variety (cf. Mollin 2006). Singular concord in Belfast English, as in (1), copula deletion in Singaporean English in (2), or polarity-based levelling in Multicultural London English (MLE) in (3), for example, are well-established features of non-standard English varieties.

(1)
The eggs is/are cracked. [Belfast English]
= The eggs are cracked. [Standard English]
(2)
Flower Ø there ah. [Singaporean English]
= Flower s are there. [Standard English]
(3)
It’s like Ramses’ revenge [MLE]
and we was at Chessington […]
= It’s like Ramses’ revenge [Standard English]
and we were at Chessington […]

The distinction between errors and dialectal features, nonetheless, can be empirically muddy. Bamgbose (1998) proposed a comprehensive framework comprising five criteria to distinguish errors from innovations: (i) demographic strength, or the number of users employing the form; (ii) geographic spread of the form’s usage; (iii) support from influential individuals or media promoting the form; (iv) the form’s codification, that is, its systematic recognition in language usage; and (v) acceptability, that is, the level of approval or recognition the form receives. However, each of these criteria has shown certain conceptual and practical limitations (Hamid and Baldauf 2013; Ranta 2022). “Acceptability” in particular has been found to mean different things to different people (O’Hara-Davies 2010), suggesting that there are no clearly identifiable patterns of correctness.

In this work, we make the case that the mission of grammatical error detection/correction (GED/GEC) – tools that automatically detect and correct English grammar mistakes – is similar to that of dialectology and sociolinguistics, that is, to identify features that deviate from the standard. Documentation of dialectal features, however, has moved at a much slower pace than the development of GED/GEC technology, mainly due to the enormous amount of time, money, and training efforts that go into manual annotation (Anderwald and Szmrecsanyi 2009; Leech 2006). This work thus sets out to investigate whether GED/GEC tools can be used to automate the documentation process and expedite research into the characterization of non-standard features.

More specifically, we test the effectiveness of some state-of-the-art GED/GEC systems (Bryant et al. 2023) on two datasets of non-standard English: written Singaporean English from the Cambridge Write & Improve corpus (Yannakoudakis et al. 2018; https://writeandimprove.com/), and transcribed spoken Vietnamese English in Canberra, Australia, from the CanVEC corpus (Nguyen and Bryant 2020). Together, these two datasets represent both an established non-standard variety that is already well documented (Case 1) and an emerging variety that has barely been examined (Case 2). It should be emphasized that although we explore this using GED/GEC tools, we do not mean to support the “correction” of such non-standard features. Rather, we are working towards accelerating the process of making these features known and more widely accepted.

2 Background

2.1 Grammatical error detection and correction

The automatic detection and correction of errors in text or speech, such as spelling, punctuation, grammar, and word choice errors, are referred to as grammatical error detection (GED) and grammatical error correction (GEC) respectively. These tasks are useful for improving the writing skills of students, professionals, and non-native speakers (Bryant et al. 2023). Most previous work has treated GED and GEC separately, using sequence labelling methods for GED (Rei and Yannakoudakis 2016) and machine translation techniques for GEC (Yuan and Briscoe 2016). More recently, however, Omelianchuk et al. (2020) proposed a sequence labelling approach, named GECToR, that can perform both error detection and correction, and benefits from fast inference compared to sequence-to-sequence models. In the case of sequence labelling for GED, the system assigns error-tags to each word token in the sequence, while in GEC a correction to the whole sentence is proposed. Table 1 exemplifies the difference between the two tasks; the “GED” row shows an example of binary GED error-tags and the following row shows the corrected form of the original sentence.

Table 1:

An example sentence with equivalent GED and GEC output. “C” indicates “correct” and “I” indicates “incorrect”.

Source Output
Original He go to school yesterday .
GED C I C C C C
GEC He went to school yesterday .
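The binary GED labels in Table 1 can be recovered mechanically by aligning the original sentence against its correction. The following minimal sketch (illustrative only, not the authors' pipeline) uses Python's standard difflib to label each original token C or I:

```python
import difflib

def binary_ged_tags(original, corrected):
    """Label each original token C (correct) or I (incorrect) by
    aligning it token-by-token against the corrected sentence.
    Note: pure insertions (missing tokens) leave no original token
    to flag, so a full GED scheme would mark the following token."""
    orig, corr = original.split(), corrected.split()
    tags = ["C"] * len(orig)
    matcher = difflib.SequenceMatcher(a=orig, b=corr)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":            # 'replace' or 'delete' spans are erroneous
            for i in range(i1, i2):
                tags[i] = "I"
    return tags

print(binary_ged_tags("He go to school yesterday .",
                      "He went to school yesterday ."))
# ['C', 'I', 'C', 'C', 'C', 'C']
```

This reproduces the C/I row of Table 1 for the example sentence.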

A useful tool that links the two tasks of GED and GEC is ERRANT (Bryant et al. 2017), which automatically generates error-tags for any text that has been corrected either manually or automatically. Its error-tags can be more fine-grained, with part-of-speech and morphology-enhanced information; so, for example, replacement noun number (R:NOUN:NUM), replacement verb tense (R:VERB:TENSE), and missing verb (M:VERB) correlate with our target non-standard features. Sequence labelling GED models are able to learn these tags from annotated corpora. While various machine learning approaches have been used for modelling GED (Rei 2017), fine-tuning pre-trained transformer language models has proved most successful (Yuan et al. 2021). State-of-the-art GEC models are also transformer-based in the form of encoder-decoder architecture (Bryant et al. 2023).
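To illustrate the shape of ERRANT-style error types without invoking the toolkit itself, the toy function below combines an operation code with a part-of-speech label. This is an illustrative assumption about the tag structure only; ERRANT's actual classification logic relies on spaCy POS tagging and lemma-based heuristics.

```python
def errant_style_tag(orig_tokens, corr_tokens, pos):
    """Toy approximation of an ERRANT error type: an operation code
    (M = missing, U = unnecessary, R = replacement) plus a POS label."""
    if not orig_tokens:       # nothing on the source side: a token is missing
        op = "M"
    elif not corr_tokens:     # nothing on the corrected side: token is unnecessary
        op = "U"
    else:                     # material on both sides: a replacement
        op = "R"
    return f"{op}:{pos}"

# 'flower' -> 'flowers': a replacement noun-number edit
print(errant_style_tag(["flower"], ["flowers"], "NOUN:NUM"))  # R:NOUN:NUM
# Ø -> 'are': a missing verb edit, as in the copula-deletion example (2)
print(errant_style_tag([], ["are"], "VERB"))                  # M:VERB
```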

As we discuss in Section 3.2, we apply state-of-the-art GED and GEC methods to predict the relevant tags in our datasets, and then analyse the error-tags where they correlate with non-standard features.

2.2 Case 1: Singaporean English

There are generally two types of Singapore English. One is a fairly standard form used primarily for formal communication, known as Standard Singapore English. The other is a more casual, colloquial form called Singapore Colloquial English or Singlish. Singlish has undergone significant restructuring influenced by local substrates such as Baba Malay, Bazaar Malay, Teochew, Tamil, Hokkien, Cantonese, and Mandarin (Kuo and Jernudd 1993; Low 2012; Sim 2021, 2023; Tan 2017, 2023). Forbes (1993) described this variety as a “restricted code”, akin to a pidgin arising from long-term contact between English and other indigenous and ethnic languages. Although governmental efforts such as the Speak Good English Movement (https://www.languagecouncils.sg/goodenglish) have attempted to replace it entirely with Standard Singapore English, Singlish is still widely used in the community, both in spoken and written form (Babcock 2022; Cavallaro and Ng 2009). It should also be noted that while Standard Singapore English generally exhibits similarities to British or American English, it still carries some distinct features, particularly in the mesolect form, such as copula deletion, article deletion, or lack of subject-verb agreement in present tense (Platt 1975; Platt and Weber 1980; Tay 1993). These features resemble Singlish. In fact, the coexistence of Standard Singapore English and Singlish is not diglossic, and speakers frequently use features from both varieties in their daily speech or even in one utterance.

In this work, we thus use the term Singaporean English to broadly refer to English spoken in Singapore, be it Standard Singapore English or Singapore Colloquial English. Some well-documented features of this variety include lack of overt inflection (both in the verbal and nominal domains), copula/auxiliary deletion, reduplication, serial verb constructions, and widespread use of discourse particles (Ansaldo 2009; Babcock 2022; Fong 2004; Gupta 1994; Khoo 2015; Lim and Ansaldo 2013; Siemund 2013; Soh et al. 2022; Wee 2004, among others). Table 2 provides a non-exhaustive list of these features with examples from the literature. These features will form the basis for our evaluation of GED/GEC models on the Singaporean English dataset in later sections.

Table 2:

Documented typical Singaporean English features with examples from existing literature.

Feature Example Source
Optional plural marking Buy 2 children ticket Ø to get 20 % off your tickets ? Lu (2023)
Optional verbal agreement When you drive, everybody is scared … Fong (2004)
Ya, then everybody bring Ø .
Optional tense marking We went in, take half an hour to come out. Fong (2004)
Copula deletion Careful, laksa Ø very hot. Ansaldo (2009)
Auxiliary deletion She Ø beaten the eggs. Lim (2001)
Reduplication We buddy-buddy . You don’t play me out, OK? Wee (2004)
Serial verb construction Take the book bring come . Lim and Ansaldo (2013)
Sentence-final particles You still have class meh ? Soh et al. (2022)
In situ interrogative This bus go where ? Siemund (2013)
Subject drop Ø don’t want. Gupta (1994)

2.3 Case 2: Vietnamese-Australian English

In comparison to Singaporean English, Vietnamese English is much less well described. We specifically examine Vietnamese-Australian English in the context of an established migrant community in Canberra, Australia. This community is highly bilingual, relatively young, and highly educated. Code-switching within the community is thus naturally observed (Nguyen 2021; Thai 2005), and conversations within families are on average a mix of half Vietnamese, a quarter code-switching, and a quarter English (Nguyen 2021: 54). As such, the nature of the English spoken here is closer to that of the English spoken in a country where English is a second language, like Singapore, rather than to that of speakers of English as a foreign language.

The first and only work to document some features of this English variety is that of Nguyen (2021). Using data from the CanVEC corpus, this study reports considerable variation between null and overt realization for (i) verbal agreement for third person singular present tense, (ii) present/past tense marking, and (iii) auxiliary. Table 3 summarizes these documented features with manually annotated examples from the corpus. Similar to Case 1 (Section 2.2), these features will later form the basis for our evaluation of GED/GEC models on the Vietnamese English dataset.

Table 3:

Documented non-standard Vietnamese-Australian English features in CanVEC.

Feature Example
Optional verbal agreement he just walk Ø around and go es anywhere
Optional tense marking […] but actually they left , […] at the time the Vietnamese military invite Cambodia right?
Optional auxiliary because of course they will ask you, where do you want to go, how long Ø you going, and when Ø you coming back?

3 Data and method

3.1 Data

3.1.1 Write & Improve Singaporean data

Cambridge English Write & Improve (W&I) is an online web platform that assists non-native English learners with their writing (Yannakoudakis et al. 2018). Learners from around the world submit letters, stories, articles, and essays in response to various prompts for automated assessment. The W&I system then provides instant feedback. Since 2014, W&I annotators have manually annotated some of these submissions with corrections and CEFR ability levels (Little 2006).

In this project, we introduce the W&I Singaporean data, which consists of error-annotated texts that were submitted from Singapore. Since evaluation of GEC is typically carried out at the sentence level, we split the data into 485 sentences using ERRANT. It should be noted that although other Singaporean English data is available, very few datasets, if any, are open access and annotated for non-standard features. Since our purpose is to evaluate how good these systems are for the task, human annotation is essential to benchmark their performances against the gold standard.

3.1.2 CanVEC data

The second dataset is the Canberra Vietnamese English natural speech corpus (CanVEC; https://github.com/Bak3rLi/CanVEC), which consists of 23 self-recorded conversations among 45 Vietnamese immigrants living in Canberra, Australia (Nguyen and Bryant 2020). The data was fully transcribed in ELAN and consists of 14,047 clauses,[1] of which 2,582 are monolingual English.

Since Vietnamese-Australian English is not very well documented (Section 2.3), we employed two human annotators to manually identify all the non-standard features in this dataset, which included but were not limited to the features listed in Table 3. Both annotators were native speakers of Standard English and reached a reported agreement rate of 98 %; cases that were mismatched were later identified as human errors and corrected accordingly.

3.2 Methods

3.2.1 GED

We approached GED as a multi-class sequence labelling task that assigned a label to each token (i.e. word) in the input sentence. The label indicates whether a token is correct (C) or erroneous, and if erroneous, what operation is required to correct it: whether it is an unnecessary token (U), should be replaced (R), or requires a missing token before it (M). This is a four-class token classification task, which we call GED-4. For more linguistically motivated fine-grained error annotation, 55 labels were used to combine operations (M, R, U) with part-of-speech-based error-tags, such as verb, verb tense, noun, noun number, preposition, and so on; we refer to this as GED-55.
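Assuming the two schemes are related as described, a GED-55 tag can be collapsed into its GED-4 counterpart by keeping only the operation prefix; a minimal sketch:

```python
def collapse_to_ged4(tag):
    """Map a fine-grained GED-55 tag (e.g. 'R:VERB:TENSE') down to the
    four-class scheme (C, M, R, U) by keeping only the operation prefix."""
    return tag if tag == "C" else tag.split(":")[0]

fine = ["C", "R:VERB:TENSE", "M:DET", "U:PRON", "C"]
print([collapse_to_ged4(t) for t in fine])
# ['C', 'R', 'M', 'U', 'C']
```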

We employed ELECTRA (Clark et al. 2020; https://github.com/shivaat/electra-GED), one of the state-of-the-art pre-trained transformer models for GED, for token classification, and fine-tuned it for a few epochs using the Adam optimizer with a learning rate of 3 × 10⁻⁵. We performed the fine-tuning on the Cambridge Learner Corpus (CLC; Nicholls 2003). The GED tags for the dataset were extracted using ERRANT (Bryant et al. 2017).

3.2.2 GEC

We treated GEC as a sequence-to-sequence task, where systems learn to “translate” an ungrammatical input sentence to a grammatical output sentence. We used the Transformer sequence-to-sequence model (Vaswani et al. 2017) and applied a two-step training strategy following Yuan et al. (2021):

  1. we pre-trained a GEC system on CLC data

  2. we fine-tuned the system on high-quality, in-domain data, which consisted of:

    1. the Cambridge English Write & Improve + LOCNESS (W&I) corpus (Bryant et al. 2019)

    2. the First Certificate in English corpus (Yannakoudakis et al. 2011)

    3. the National University of Singapore Corpus of Learner English (Dahlmeier et al. 2013)

All training details, including hyper-parameters, followed the unconstrained GEC Baseline system proposed in Yuan et al. (2021).[2]

4 Results and discussion

To conduct our linguistic analysis, a human annotator went through the system output and manually marked whether each sentence contained relevant results. Since GED/GEC systems were primarily trained to process written text, many corrections were made for punctuation, capitalization, spelling, and so on. These “errors” are deemed irrelevant for our purpose, and therefore excluded from the qualitative assessment.

It should be noted that as Singaporean English is already well documented, we rely more on previously documented features as a benchmark, rather than on frequency within our small sample. On the other hand, as Vietnamese-Australian English is much less described but yielded a reasonably larger sample, we attribute more weight to frequency across speakers as an indication of varietal features.

4.1 Case 1: established non-standard features

Out of 485 Singaporean English sentences, only 7 % (36 of 485) contained a relevant sample, that is, a sample that contains error-tags that are not punctuation or spelling mistakes. Since the relevant sample for this dataset is too small, a percentage-based analysis is not considered meaningful and so we only examine whether the output labels (if any) for each system match the previously documented non-standard features for Singaporean English. Table 4 presents an example.[3]

Table 4:

Evaluation of system output for an example sentence of Singaporean English. “R” stands for “replacement”, e.g. <R:VERB:TENSE> signals that the tense of the verb needs to be replaced; ✓ represents a successful case of output labels matching a previously documented feature; × represents an unsuccessful case.

Source Output
Original I simply smiled, not knowing how to respond. The nurse walks into the room with a tray of food in her hands.
GED-4 I simply smiled, not knowing how to respond. The nurse walks into the room with a tray of food in her hands. ×
GED-55 I simply smiled, not knowing how to respond. The nurse <R:VERB:TENSE> walks </R:VERB:TENSE> into the room with a tray of food in her hands. ✓
GEC I simply smiled, not knowing how to respond. The nurse walks into the room with a tray of food in her hands. ×
Human I simply smiled, not knowing how to respond. The nurse came in the room with a tray of food in her hands.

As Table 4 shows, the target non-standard feature here is the optional past tense marking. From context, we know that the event of the nurse walking into the room happened in the past, and so Standard English would require the verb ‘walk’ to be in past tense. GED-55 was the only system that was able to identify this difference, hence earning a tick mark (✓) for its performance. It should be noted that system performance is very consistent on this small sample of Singaporean English; that is, if a system can identify optional past tense marking, it is able to do so in all relevant instances, while in contrast, if a system misses out on this feature, it consistently fails in this regard. Table 5 reports system performance on the relevant sample of the W&I Singaporean English dataset.

Table 5:

System performance on documented non-standard features of Singaporean English.

Feature In data GED-4 GED-55 GEC Human annotation
Documented features
Optional plural marking
Copula deletion
Auxiliary deletion
Serial verb construction
Optional verbal agreement ×
Optional tense marking × ×
Subject drop × ×
Reduplication ×
Sentence-final particles ×
In situ interrogative ×
New feature
Existential ‘have’ for listing ×

Results show that seven out of 10 documented features of Singaporean English appear in the W&I dataset, and when a feature is present, it is identified by at least one GED/GEC system. This result seems promising, with GED-55 appearing the best as it successfully detected all the relevant features.

We further observe that GED systems in particular were also able to pick up another non-standard feature that had not been previously documented, which is the use of the existential verb ‘have’ in listing constructions. Table 6 provides an illustration.

Table 6:

System output for an example sentence showing the non-documented feature of using existential ‘have’ instead of ‘be’ in a listing construction in Write & Improve. “R” stands for “replacement”, e.g. <R:VERB> signals the verb that needs to be replaced.

Source Output
Original My favourite food has pasta, salad, dessert, and orange juice
GED-4 My favourite food <R> has </R> pasta, salad, dessert, and orange juice.
GED-55 My favourite food <R:VERB> has </R:VERB> pasta, salad, dessert, and orange juice.
GEC My favourite food has pasta, salad, dessert, and orange juice.
Human My favourite food is pasta, salad, dessert, and orange juice.

In this case, the writer used the finite form of ‘have’ instead of ‘be’ in an existential construction where more than one item was being named. This is a known feature of Mandarin Chinese (Huang 1990; Her 1991), which differs from Standard English, where only ‘be’ is acceptable. Given the multilingual setting in Singapore (Section 2.2), this is likely a non-standard feature that has arisen as a result of language contact. Both GED-4 and GED-55 successfully identified this deviation, which is also aligned with the human edits.

4.2 Case 2: emergent non-standard features

Compared to the Singaporean English data, there is a higher proportion of relevant output in the Vietnamese-Australian English dataset, accounting for 17 % of the sample (427 of 2,582 sentences; 1,101 morphosyntactic tags). This allows us to take consistency – that is, the frequency of use within and across different speakers (van Rooy 2011) – into account. Although different researchers have different thresholds for what makes a feature “consistent”, the idea is that the more frequently a feature is used, the more likely that it is a systematic innovation. For linguists interested in varietal features, cross-speaker consistency is more important than within-speaker consistency, as it indicates integration and acceptance of a particular feature at a community level.

Following previous work in sociolinguistics which measures linguistic integration and acceptance at a community level, we distinguish between “frequency” and “diffusion” (Nguyen 2018, 2021; Poplack 1988). The former refers to the number of tokens occurring in the corpus, and the latter refers to the number of different speakers using that feature. The benchmark is arbitrary, but features that occur more than 10 times are often considered frequent, and those that are used by more than 10 speakers are diffused. In this work, however, we redefine “diffusion” relative to the sample size, and consider features that are used by more than five speakers as “community consistent”. Table 7 lists 10 features that meet both of these criteria and are hence considered established non-standard features in the Canberra Vietnamese community. This table also reports system performance as compared to the gold-standard human annotation for each of these features (with a total of 916 cases).
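The frequency and diffusion filter described above can be sketched as follows. The observation records and speaker IDs here are hypothetical, and the thresholds follow this section's criteria of more than 10 tokens and more than five speakers:

```python
from collections import defaultdict

def community_consistent(observations, min_freq=10, min_speakers=5):
    """Keep features that are both frequent (> min_freq tokens) and
    diffused (> min_speakers distinct speakers using them)."""
    tokens = defaultdict(int)
    speakers = defaultdict(set)
    for feature, speaker in observations:
        tokens[feature] += 1
        speakers[feature].add(speaker)
    return {f for f in tokens
            if tokens[f] > min_freq and len(speakers[f]) > min_speakers}

# Hypothetical data: copula deletion used 12 times by 6 speakers (kept);
# subject drop used 12 times but only by 2 speakers (excluded).
obs = ([("copula deletion", f"spk{i % 6}") for i in range(12)]
       + [("subject drop", f"spk{i % 2}") for i in range(12)])
print(community_consistent(obs))  # {'copula deletion'}
```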

Table 7:

System performances on the Vietnamese English dataset. True positives (TP) = correctly identified cases; false positives (FP) = incorrectly identified cases; false negatives (FN) = cases missed; precision (P) measures the accuracy of positive identifications (TP/(TP+FP)); and recall (R) gauges the system’s ability to capture all relevant cases (TP/(TP+FN)). For TP, P, and R, higher is better; for FP and FN, lower is better. The best results for each feature are highlighted in boldface.

Tool Feature TP FP FN P R Human
GED-4 Documented features
Auxiliary deletion 31 6 20 83.78 60.78 51
Optional verbal marking 173 2 39 98.86 81.60 212
Optional tense marking 0 0 280 0.00 0.00 280
New features
Optional plural marking 37 3 5 92.50 88.10 42
Optional preposition 50 5 7 90.91 87.72 57
Optional article 52 4 4 92.86 92.86 56
Copula deletion 58 1 4 98.31 93.55 62
Object drop 0 0 86 0.00 0.00 86
Full-form negation in interrogatives 0 0 38 0.00 0.00 38
Subject drop 0 0 32 0.00 0.00 32

GED-55 Documented features

Auxiliary deletion 51 0 0 100.00 100.00 51
Optional verbal marking 210 0 2 100.00 99.06 212
Optional tense marking 260 5 20 98.11 92.86 280
New features
Optional plural marking 42 0 0 100.00 100.00 42
Optional preposition 50 2 7 96.15 87.72 57
Optional article 55 1 1 98.21 98.21 56
Copula deletion 62 0 0 100.00 100.00 62
Object drop 82 3 4 96.47 95.35 86
Full-form negation in interrogatives 38 0 0 100.00 100.00 38
Subject drop 0 2 32 0.00 0.00 32

GEC Documented features

Auxiliary deletion 33 2 18 94.29 64.71 51
Optional verbal marking 0 0 212 0.00 0.00 212
Optional tense marking 0 0 280 0.00 0.00 280
New features
Optional plural marking 36 6 6 85.71 85.71 42
Optional preposition 52 3 5 94.55 91.23 57
Optional article 50 5 6 90.91 89.29 56
Copula deletion 53 7 9 88.33 85.48 62
Object drop 0 0 86 0.00 0.00 86
Full-form negation in interrogatives 0 0 38 0.00 0.00 38
Subject drop 31 0 1 100.00 96.88 32
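The precision and recall figures in Table 7 follow the standard definitions, which can be verified against a couple of rows:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN); both as percentages,
    defined as zero when the denominator is zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return round(100 * p, 2), round(100 * r, 2)

# GED-4 on auxiliary deletion (Table 7): TP=31, FP=6, FN=20
print(precision_recall(31, 6, 20))   # (83.78, 60.78)
# GED-55 on optional tense marking: TP=260, FP=5, FN=20
print(precision_recall(260, 5, 20))  # (98.11, 92.86)
```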

Like the results for Singaporean English (Table 5), previously documented features were picked up by at least one GED/GEC system. All systems managed to identify new features such as optional plural marking or copula deletion, features that were also found in Singaporean English. In particular, all systems were able to highlight the new features, including optional prepositions (and then you go out Ø the sea) and optional articles (they got Ø house like us here).

It is also clear that GED-55 remains the most robust, giving the best performance for almost every feature. It is, furthermore, particularly good at identifying features that other systems were not able to detect, for example extra pronouns, object drop, and full-form negation in interrogatives. These cases are illustrated in Table 8.

Table 8:

System output for example sentences showing some of the undocumented features in CanVEC. “R” stands for “replacement”, “U” stands for “unnecessary”, “M” stands for “missing”; e.g. <M:PRON> signals a missing pronoun, <R:CONTR> signals a to-replace contraction.

Source Extra pronouns & object drop Full negation in interrogative
Original and everyone they tend to put down to the bottom near the engine does not it get boring?
GED-4 and everyone they <R> tend </R> to put down to the bottom near the engine <R> does </R> <R> not </R> it get boring?
GED-55 and everyone <U:PRON> they </U:PRON> tend to put <M:PRON> down </M:PRON> to the bottom near the engine does <R:CONTR> not </R:CONTR> it get boring?
GEC Everyone and everyone they tend to put down to the bottom near the engine But does not it get boring?
Human and everyone they tend to put it down to the bottom near the engine doesn’t it get boring?

4.3 Discussion

Results on both datasets show that established non-standard features were successfully identified by at least one GED/GEC system, with several features being picked up by all systems. At a broad level, this is particularly promising given the fact that one dataset is written English (W&I), while the other is transcribed spoken English (CanVEC). For Vietnamese-Australian English, GED/GEC systems are even helpful in pulling out new features that are clearly consistent in the community, but had hitherto not been documented.

Results on both datasets also strongly suggest that GED-55 performs best, linguistically, for this kind of task. This is possibly because GED-55 has the most sophisticated labelling scheme and is trained to find less obvious deviations that are more likely to be non-standard features (e.g. subject/object drop, full negation in interrogatives).

It should be noted, however, that these systems tend to have a bias towards one type of labelling over another. Table 9 gives an example. The deviation tell lie could be classified either as an optional article or as optional plural marking, yet all systems suggested the former. This trend is consistently observed across both datasets (n = 32, 100 % of the overlapping cases). Although the systems’ choice in this case is aligned with the gold-standard human annotation, there is no clear basis upon which the feature should be described as one category and not the other. This is to say that these systems may still bear biases similar to humans, of which linguists need to be aware. These system biases appear extremely systematic, however, as all systems tend to gravitate in the same direction when there are competing choices. This allows us to mass-modify the output if need be, depending on the context of the documentation.

Table 9:

System output demonstrating system bias in labelling Singaporean English. “M” stands for “missing”; e.g. <M:DET> signals a missing determiner.

Source Output
Original In my opinion, in certain circumstances it is not bad to tell lie.
GED-4 In my opinion, in certain circumstances it is not bad to tell <M> lie </M>.
GED-55 In my opinion, in certain circumstances it is not bad to tell <M:DET> lie </M:DET>.
GEC In my opinion, in certain circumstances it is not bad to tell a lie.
Human In my opinion, in certain circumstances it is not bad to tell a lie.

Furthermore, it is worth noting that not all non-standard features are defined syntactically, and since these systems were built to deal with typical grammatical “errors”, we may not expect them to detect subtle deviations in terms of style or semantics. We observe, nonetheless, that while GED/GEC systems missed instances such as overuse of pronouns in coordinated clauses (He went to the shop and he bought the flowers) or unidiomatic expressions (The film was about a serial murderer), they were able to pick up a few others in this category, such as word choice or modal deletion. Table 10 provides some illustrations. Note that these features were not documented in Table 7 as they did not meet the “consistency” threshold (Section 4.2).

Table 10:

System output for some examples of non-grammatical deviation in CanVEC. “R” stands for “replacement”, “M” stands for “missing”; e.g. <M:OTHER> signals a missing undefined part of speech.

Source Word choice Modal deletion
Original I need much time at home. oh I see closer to the time.
GED-4 I need <R> much </R> time at home. oh I see closer to the time.
GED-55 I need <R:OTHER> much </R:OTHER> time at home. oh I <M:OTHER> see </M:OTHER> closer to the time.
GEC I need a lot of time at home. oh I will see closer to the time.
Human I need a lot of time at home. oh I may see closer to the time.

Finally, although we acknowledge the limitations of these systems in that they are still heavily biased towards written standards (a large proportion of the output is in fact changes such as punctuation, spelling mistakes, added fragments, etc.), this issue can be easily overcome. In particular, irrelevant labels can be systematically ignored as part of the pre- and post-processing steps.
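Such pre- and post-processing can be as simple as filtering out orthographic error types before the qualitative analysis. A minimal sketch follows; the set of excluded categories is our assumption about what counts as irrelevant, though PUNCT, SPELL, and ORTH are genuine ERRANT category names:

```python
# Assumed set of orthographic categories to exclude from dialect analysis.
IRRELEVANT = {"PUNCT", "SPELL", "ORTH"}

def keep_relevant(edits):
    """Drop error-tags whose final category is orthographic, keeping
    morphosyntactic ones such as R:VERB:TENSE or M:DET."""
    return [e for e in edits if e.split(":")[-1] not in IRRELEVANT]

edits = ["R:VERB:TENSE", "M:PUNCT", "R:SPELL", "M:DET"]
print(keep_relevant(edits))  # ['R:VERB:TENSE', 'M:DET']
```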

5 Conclusions

In this work, we have examined how useful GED/GEC systems can be for documenting non-standard English varieties, and evaluated whether these systems can be effectively repurposed to automatically detect dialectal features. Preliminary results are promising, showing that recent GED/GEC systems are able to automatically detect both previously documented features and new features in Singaporean English and Vietnamese-Australian English. One GED system, namely GED-55, was particularly competent in picking up on some little-described but potentially significant features such as the use of ‘have’ instead of ‘be’ in existential listing constructions in Singlish, or full-form negation in interrogatives in Vietnamese-Australian English. However, since GED/GEC systems were built to be conservative, there remain some complex discourse-pragmatic features that they overlook, such as the overuse of subject pronouns in coordinated clauses or unidiomatic multi-word expressions. There is thus scope for fine-tuning existing systems to further develop them into technologies that are more sensitive to all kinds of non-standard features in different English varieties.

For language testing and learning, a computational analysis of this kind also has the potential to lead to improvements that make technology more inclusive. An ideal system, for example, would automatically detect dialectal features, interact with users, and adjust its assessment accordingly (e.g. “You appear to be writing in Singlish. Would you like Singlish constructions to be allowed in the grammar checker?”). Such a system would move us towards a future where technology works to minimize social biases rather than amplifying them, as recent investigations have documented (Blodgett et al. 2020; Helm et al. 2023). It is also worth emphasizing that while our study focuses on English, GED and GEC models for various other languages have developed significantly in recent years, including Czech, German, Italian and Swedish (Volodina et al. 2023), Arabic (Alhafni et al. 2023), and Chinese (Yue et al. 2022). The potential impact of this approach could thus generalize well beyond English.

On a broader scale, we also hope to have demonstrated the need, and the opportunities, for bringing NLP developments closer to the linguistic realities in which we live. Recent linguistic and NLP studies have likewise exposed this gap when testing modern technologies on real-life, non-canonical language production (Doğruöz and Sitaram 2022; Doğruöz et al. 2021, 2023; Nguyen et al. 2022, 2023a, 2023b; Yong et al. 2023; Zhang et al. 2023). Even in a language like English, which has long served as a lingua franca, variation is anything but exceptional: what counts as the “standard” in one place might not in another. Linguistic diversity therefore needs to be at the forefront of future technological development, whether for improving existing tools or for building new ones to serve different communities and purposes.


Corresponding author: Li Nguyen, Nanyang Technological University, Singapore, Singapore, E-mail:

Funding source: Cambridge University Press & Assessment

Acknowledgement

We thank Diane Nicholls, Scott Thomas, and Andrew Caines for enabling access to the Singaporean data in the Cambridge English W&I corpus. We are also grateful to our colleagues at Cambridge for their helpful comments on an earlier draft, and to the two anonymous reviewers for their generous appraisal.

  1. Research funding: This work was funded by Cambridge University Press & Assessment.

References

Alhafni, Bashar, Go Inoue, Christian Khairallah & Nizar Habash. 2023. Advancements in Arabic grammatical error detection and correction: An empirical investigation. In Houda Bouamor, Juan Pino & Kalika Bali (eds.), Proceedings of the 2023 conference on empirical methods in natural language processing, 6430–6448. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.396.

Anderwald, Lieselotte & Benedikt Szmrecsanyi. 2009. Corpus linguistics and dialectology. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics: An international handbook, 1126–1140. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110213881.2.1126.

Ansaldo, Umberto. 2009. The Asian typology of English: Theoretical and methodological considerations. English World-Wide 30(2). 133–148. https://doi.org/10.1075/eww.30.2.02ans.

Babcock, Joshua. 2022. Postracial policing, “mother tongue” sourcing, and images of Singlish standard. Journal of Linguistic Anthropology 32(2). 326–344. https://doi.org/10.1111/jola.12354.

Bamgbose, Ayo. 1998. Torn between the norms: Innovations in world Englishes. World Englishes 17(1). 1–14. https://doi.org/10.1111/1467-971X.00078.

Blodgett, Su Lin, Solon Barocas, Hal Daumé III & Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics, 5454–5476. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.485.

Bryant, Christopher, Mariano Felice, Øistein E. Andersen & Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In Proceedings of the fourteenth workshop on innovative use of NLP for building educational applications, 52–75. Florence: Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-4406.

Bryant, Christopher, Mariano Felice & Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th annual meeting of the Association for Computational Linguistics (vol. 1: Long papers), 793–805. Vancouver: Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1074.

Bryant, Christopher, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng & Ted Briscoe. 2023. Grammatical error correction: A survey of the state of the art. Computational Linguistics 49(3). 643–701. https://doi.org/10.1162/coli_a_00478.

Cavallaro, Francesco & Bee Chin Ng. 2009. Between status and solidarity in Singapore. World Englishes 28(2). 143–159. https://doi.org/10.1111/j.1467-971x.2009.01580.x.

Clark, Kevin, Minh-Thang Luong, Quoc V. Le & Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International conference on learning representations. https://openreview.net/forum?id=r1xMH1BtvB.

Dahlmeier, Daniel, Hwee Tou Ng & Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the eighth workshop on innovative use of NLP for building educational applications, 22–31. Atlanta, GA: Association for Computational Linguistics. https://aclanthology.org/W13-1703.

Doğruöz, A. Seza & Sunayana Sitaram. 2022. Language technologies for low resource languages: Sociolinguistic and multilingual insights. In Maite Melero, Sakriani Sakti & Claudia Soria (eds.), Proceedings of the 1st annual meeting of the ELRA/ISCA special interest group on under-resourced languages, 92–97. Marseille: European Language Resources Association. https://aclanthology.org/2022.sigul-1.12.

Doğruöz, A. Seza, Sunayana Sitaram, Barbara E. Bullock & Almeida Jacqueline Toribio. 2021. A survey of code-switching: Linguistic and social perspectives for language technologies. In Chengqing Zong, Fei Xia, Wenjie Li & Roberto Navigli (eds.), Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (vol. 1: Long papers), 1654–1666. Bangkok: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.131.

Doğruöz, A. Seza, Sunayana Sitaram & Zheng Xin Yong. 2023. Representativeness as a forgotten lesson for multilingual and code-switched data collection and preparation. In Houda Bouamor, Juan Pino & Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, 5751–5767. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.382.

Fong, Vivienne. 2004. The verbal cluster. In Lisa Lim (ed.), Singapore English: A grammatical description, 75–104. Amsterdam: John Benjamins. https://doi.org/10.1075/veaw.g33.06fon.

Forbes, Duncan. 1993. Singlish. English Today 9(2). 18–22. https://doi.org/10.1017/S0266078400000304.

Gupta, Anthea Fraser. 1994. The step-tongue: Children’s English in Singapore. Clevedon: Multilingual Matters.

Hamid, M. Obaidul & Richard B. Baldauf Jr. 2013. Second language errors and features of World Englishes. World Englishes 32(4). 476–494. https://doi.org/10.1111/weng.12056.

Helm, Paula, Gábor Bella, Gertraud Koch & Fausto Giunchiglia. 2023. Diversity and language technology: How techno-linguistic bias can cause epistemic injustice. arXiv. https://doi.org/10.48550/arXiv.2307.13714.

Her, One-Soon. 1991. On the Mandarin possessive and existential verb YOU and its idiomatic expressions. Language Sciences 13(3/4). 381–398. https://doi.org/10.1016/0388-0001(91)90023-t.

Huang, C. T. James. 1990. Shuo shi he you – on ‘be’ and ‘have’ in Chinese. Bulletin of the Institute of History and Philology Academia Sinica 59. 43–64.

Kachru, Braj B. 1992. The other tongue: English across cultures, 2nd edn. Urbana: University of Illinois Press.

Khoo, Johanna Wei Ling. 2015. Serial verb constructions in Singapore Colloquial English. Düsseldorf: Heinrich-Heine Universität Düsseldorf PhD thesis.

Kuo, Eddie C. Y. & Björn H. Jernudd. 1993. Balancing macro- and micro-sociolinguistic perspectives in language management: The case of Singapore. Language Problems and Language Planning 17(1). 1–21. https://doi.org/10.1075/lplp.17.1.01kuo.

Labov, William. 1966. The social stratification of English in New York City. Cambridge: Cambridge University Press.

Leech, Geoffrey. 2006. Adding linguistic annotation. In Martin Wynne (ed.), Developing linguistic corpora: A guide to good practice, 17–29. Oxford: Oxbow Books.

Lim, Lisa. 2001. Towards a reference grammar of Singaporean English: Final research report. Singapore: National University of Singapore.

Lim, Lisa & Umberto Ansaldo. 2013. Structure dataset 21: Singlish. In Susanne Maria Michaelis, Philippe Maurer, Martin Haspelmath & Magnus Huber (eds.), Atlas of pidgin and creole language structures online. Leipzig: Max Planck Institute for Evolutionary Anthropology. http://apics-online.info/contributions/21.

Little, David. 2006. The common European framework of reference for languages: Content, purpose, origin, reception and impact. Language Teaching 39(3). 167–190. https://doi.org/10.1017/S0261444806003557.

Low, Ee Ling. 2012. English in Singapore and Malaysia. In Andy Kirkpatrick (ed.), The Routledge handbook of world Englishes. London: Routledge.

Lu, Luke. 2023. Exploring Singlish as a pedagogical resource in the ELT classroom: Implementing bidialectal pedagogy in Singapore. TESOL Quarterly 57(2). 323–350. https://doi.org/10.1002/tesq.3148.

Mollin, Sandra. 2006. Euro-English: Assessing variety status. Tübingen: Gunter Narr.

Myers-Scotton, Carol. 1993. Duelling languages: Grammatical structure in codeswitching. Oxford: Clarendon. https://doi.org/10.1093/oso/9780198240594.001.0001.

Nguyen, Li. 2018. Borrowing or code-switching? Traces of community norms in Vietnamese-English speech. Australian Journal of Linguistics 38(4). 443–466. https://doi.org/10.1080/07268602.2018.1510727.

Nguyen, Li. 2021. Cross-generational linguistic variation in the Canberra Vietnamese heritage language community: A corpus-centred investigation. Cambridge: University of Cambridge PhD thesis.

Nguyen, Li & Christopher Bryant. 2020. CanVEC – the Canberra Vietnamese-English code-switching natural speech corpus. In Proceedings of the 2020 international conference on language resources and evaluation, 4121–4129. Marseille: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.507.

Nguyen, Li, Christopher Bryant, Oliver Mayeux & Zheng Yuan. 2023a. How effective is machine translation on low-resource code-switching? A case study comparing human and automatic metrics. In Anna Rogers, Jordan Boyd-Graber & Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, 14186–14195. Toronto: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.893.

Nguyen, Li, Oliver Mayeux & Zheng Yuan. 2023b. Code-switching input for machine translation: A case study of Vietnamese-English data. International Journal of Multilingualism. 1–22. Advance online publication. https://doi.org/10.1080/14790718.2023.2224013.

Nguyen, Li, Zheng Yuan & Graham Seed. 2022. Building educational technologies for code-switching: Current practices, difficulties and future directions. Languages 7(3). https://doi.org/10.3390/languages7030220.

Nicholls, Diane. 2003. The Cambridge learner corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the corpus linguistics 2003 conference, Lancaster, UK, 572–581.

O’Hara-Davies, Breda. 2010. Brunei English: A developing variety. World Englishes 29(3). 406–419. https://doi.org/10.1111/j.1467-971X.2010.01653.

Omelianchuk, Kostiantyn, Vitaliy Atrasevych, Artem Chernodub & Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Proceedings of the fifteenth workshop on innovative use of NLP for building educational applications, 163–170. Seattle, WA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.bea-1.16.

Platt, John. 1975. The Singapore English speech continuum and its basilect ‘Singlish’ as a ‘creoloid’: Implicational scaling and its pedagogical implications. Anthropological Linguistics 17. 363–374.

Platt, John & Heidi Weber. 1980. English in Singapore and Malaysia: Status, features, functions. Oxford: Oxford University Press.

Poplack, Shana. 1988. Language status and language accommodation along a linguistic border. In Peter H. Lowenberg (ed.), Language spread and language policy: Issues, implications, and case studies, 90–118. Washington, DC: Georgetown University Press.

Ranta, Elina. 2022. From learners to users – errors, innovations, and universals. ELT Journal 76(3). 311–319. https://doi.org/10.1093/elt/ccac024.

Rei, Marek. 2017. Semi-supervised multitask learning for sequence labeling. In Proceedings of the 55th annual meeting of the Association for Computational Linguistics (vol. 1: Long papers), 2121–2130. Vancouver: Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1194.

Rei, Marek & Helen Yannakoudakis. 2016. Compositional sequence labeling models for error detection in learner writing. In Proceedings of the 54th annual meeting of the Association for Computational Linguistics (vol. 1: Long papers), 1181–1191. Berlin: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1112.

van Rooy, Bertus. 2011. A principled distinction between error and conventionalized innovation in African Englishes. In Joybrato Mukherjee & Marianne Hundt (eds.), Exploring second-language varieties of English and learner Englishes: Bridging a paradigm gap, 189–207. Amsterdam: John Benjamins. https://doi.org/10.1075/scl.44.10roo.

Siemund, Peter. 2013. Varieties of English: A typological approach. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139028240.

Sim, Jasper Hong. 2021. Sociophonetic variation in English /l/ in the child-directed speech of English-Malay bilinguals. Journal of Phonetics 88. https://doi.org/10.1016/j.wocn.2021.101084.

Sim, Jasper Hong. 2023. Negotiating social meanings in a plural society: Social perceptions of variants of /l/ in Singapore English. Language in Society 52. 617–644. https://doi.org/10.1017/S0047404522000173.

Soh, Ying Qi, Junwen Lee & Ying-Ying Tan. 2022. Ethnicity and tone production on Singlish particles. Languages 7(3). https://doi.org/10.3390/languages7030243.

Tan, Ying-Ying. 2017. Singlish: An illegitimate conception in Singapore’s language policies? European Journal of Language Policy 9(1). 85–104. https://doi.org/10.3828/ejlp.2017.6.

Tan, Ying-Ying. 2023. The curious case of nomenclatures: Singapore English, Singlish, and Singaporean English. English Today 39(3). 194–206. https://doi.org/10.1017/S0266078423000044.

Tay, Mary Wan Joo. 1993. The English language in Singapore: Issues and development. Singapore: UniPress.

Thai, Bao Duy. 2005. Code choice and code convergent borrowing in Canberra Vietnamese. In Thao Le (ed.), Proceedings of the international conference on critical discourse analysis: Theory into research. Tasmania: University of Tasmania.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser & Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan & R. Garnett (eds.), Advances in neural information processing systems, Vol. 30, 5998–6008. Red Hook, NY: Curran Associates. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf (accessed 26 August 2024).

Volodina, Elena, Christopher Bryant, Andrew Caines, Orphée De Clercq, Jennifer-Carmen Frey, Elizaveta Ershova, Alexandr Rosen & Olga Vinogradova. 2023. MultiGED-2023 shared task at NLP4CALL: Multilingual grammatical error detection. In David Alfter, Elena Volodina, Thomas François, Arne Jönsson & Evelina Rennes (eds.), Proceedings of the 12th workshop on NLP for computer assisted language learning, 1–16. Tórshavn, Faroe Islands: LiU Electronic Press. https://aclanthology.org/2023.nlp4call-1.1.

Wee, Lionel. 2004. Reduplication and discourse particles. In Lisa Lim (ed.), Singapore English: A grammatical description, 105–126. Amsterdam: John Benjamins. https://doi.org/10.1075/veaw.g33.07wee.

Yannakoudakis, Helen, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe & Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education 31(3). 251–267. https://doi.org/10.1080/08957347.2018.1464447.

Yannakoudakis, Helen, Ted Briscoe & Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies, 180–189. Portland, OR: Association for Computational Linguistics. https://aclanthology.org/P11-1019.

Yong, Zheng Xin, Ruochen Zhang, Jessica Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio & Alham Aji. 2023. Prompting multilingual large language models to generate code-mixed texts: The case of South East Asian languages. In Genta Winata, Sudipta Kar, Marina Zhukova, Thamar Solorio, Mona Diab, Sunayana Sitaram, Monojit Choudhury & Kalika Bali (eds.), Proceedings of the 6th workshop on computational approaches to linguistic code-switching, 43–63. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.calcs-1.5.

Yuan, Zheng & Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In Proceedings of the 2016 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, 380–386. San Diego: Association for Computational Linguistics. https://doi.org/10.18653/v1/N16-1042.

Yuan, Zheng, Shiva Taslimipoor, Christopher Davis & Christopher Bryant. 2021. Multi-class grammatical error detection for correction: A tale of two systems. In Proceedings of the 2021 conference on empirical methods in natural language processing, 8722–8736. Punta Cana, Dominican Republic: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.687.

Yue, Tianchi, Shulin Liu, Huihui Cai, Tao Yang, Shengkang Song & TingHao Yu. 2022. Improving Chinese grammatical error detection via data augmentation by conditional error generation. In Smaranda Muresan, Preslav Nakov & Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, 2966–2975. Dublin: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-acl.233.

Zhang, Ruochen, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata & Alham Aji. 2023. Multilingual large language models are not (yet) code-switchers. In Houda Bouamor, Juan Pino & Kalika Bali (eds.), Proceedings of the 2023 conference on empirical methods in natural language processing, 12567–12582. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.774.

Received: 2024-01-03
Accepted: 2024-07-16
Published Online: 2024-09-27

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
