Article · Open Access

Supervised prediction of production patterns using machine learning algorithms

  • Jungyeon Kim
Published/Copyright: June 21, 2024

Abstract

When an English word ending in a stop is adapted into Korean, a vowel is variably inserted after the final stop: some words always take the epenthetic vowel, some never do, and some vary between the two alternatives. Although several linguistic factors plausibly affect this insertion, it is not easy to determine which pattern will be chosen when a new word enters the borrowing language. This study conducted classification analyses of production patterns using machine learning algorithms, namely support vector machines and random forests. The two classifiers show similar results, with vowel tenseness emerging as the best of all the candidate predictors. This indicates that vowel tenseness is the most influential factor in classifying the patterns (no vowel insertion, optional vowel insertion, or vowel insertion). The results also suggest that while vowel tenseness remains the strongest predictor, other factors such as stop voicing and stop place hold some importance, albeit to a lesser degree. The contribution of this study is that it provides insight into the factors that regulate vowel insertion, and these findings support the need for a behavioral experiment to see whether the current results make correct predictions with respect to the behavior of nonce items.

1 Introduction

Loanword adaptation (or repair) is the process by which a word’s original structure is altered to phonologically fit the borrowing language; that is, a word from a source language cannot enter a borrowing language without being phonologically adapted (Haspelmath 2009). However, there are some loanword patterns that cannot be explained by the phonological grammar of the borrowing language. One of those patterns involves what Peperkamp (2005) calls unnecessary repair, in which loanword repair occurs even when there are no illegal sequences in the native phonotactics of the borrowing language. Korean provides a particularly interesting case of this apparently unnecessary adaptation: a voiceless stop is permitted in the word-final position in the Korean native phonology (e.g., [pap˺] ‘meal’, [ap˺] ‘front’), so we would expect that words borrowed from English ending in a voiceless stop would be adapted as ending in a Korean voiceless stop, but it is surprising to find that a vowel is frequently inserted after the final stop (see Table 1). Previous studies have suggested that different factors affect the possibility of vowel epenthesis in this environment, including vowel tenseness, stop voicing, stop place, final stress, and word size (Jun 2002; Kang 2003; Kim 2018; Kwon 2017; Rhee and Choi 2001), as indicated in Table 2.

Table 1:

Vowel insertion patterns of English words ending in a stop.

No vowel insertion: lab → [læp˺]; hot → [hɑt˺]; book → [puk˺]
Optional vowel insertion: type → [thaɪp˺]∼[thaɪphɨ]; flute → [phɨllut˺]∼[phɨlluthɨ]; tag → [thæk˺]∼[thægɨ]
Vowel insertion: loop → [luphɨ]; brand → [pɨrændɨ]; peak → [phikhɨ]

Table 2:

Linguistic features influencing the likelihood of vowel epenthesis following English word-final stops.

Vowel tenseness: Vowel insertion is more likely when the vowel preceding the English word-final stop is tense than when it is lax (lax: step → [sɨthɛp˺]; tense: state → [sɨthɛithɨ]).
Stop voicing: Vowel insertion is more likely when the English word-final stop is voiced than when it is voiceless (voiceless: plot → [phɨllot˺]; voiced: plug → [phɨllʌgɨ]).
Stop place: Vowel insertion is more likely when the English word-final stop is coronal than when it is labial or dorsal (labial/dorsal: cap/bag → [khæp˺]/[pæk˺]; coronal: bat → [pæthɨ]).
Word size: Vowel insertion is more likely when the English word is monosyllabic than when it is polysyllabic (polysyllabic: moonlight → [munlait˺]; monosyllabic: light → [laithɨ]).
Final stress: Vowel insertion is more likely when the English final syllable is stressed than when it is unstressed (unstressed: ˈhandbag → [hændɨbæk˺]; stressed: handˈmade → [hændɨmɛidɨ]).

Since the specific adaptation pattern is not easy to predict even with the many features proposed by prior research, the purpose of the present study is to identify which production pattern will be chosen when a new English word ending in a stop is adapted into Korean. That is, this study attempts to determine which linguistic factors serve to produce the optimal output. To discover the best predictor, a corpus was built by collecting loanword lists published by the National Academy of the Korean Language (2001, 2002, 2007a, 2007b, 2010). This corpus consisted of 540 Korean loanwords from English whose source word ends in a stop; of these 540 words, 264 were consistently adapted with final vowel insertion and 214 were always adapted without it, while 62 were variably adapted both with and without vowel insertion.

This study employed a classification technique as a predictive model (Hastie et al. 2009; Kuhn and Johnson 2013) and a classification task was conducted using support vector machines (SVMs; Cortes and Vapnik 1995). Introduced by Cortes and Vapnik (1995) for binary classification, SVMs have been increasingly used for research in various areas including linguistics, communication, bioinformatics, and so on (Razzaghi et al. 2016; Rodríguez-Pérez et al. 2017; Schohn and Cohn 2000; Sebastiani 2002; Yang 1999). SVMs are a popular choice for classification, with several advantages. They are less prone to overfitting, which occurs when a model learns the training data too well but fails to generalize to new data. SVMs find a balance between maximizing the margin and minimizing the training error, which allows them to generalize well to unseen data. Also, they are memory efficient and can handle datasets with a limited number of training instances.

Despite the strengths of SVMs, their trained parameters can be difficult to analyze and interpret. Thus, a different classifier, random forests (RFs; Breiman 2001), was also applied to the same data to properly interpret the contribution of the linguistic factors and to see whether SVMs and RFs yield similar variable importance results. RFs are among the most versatile machine learning algorithms, as they require little tuning of their settings. They have been applied to sociolinguistic data and to acoustic cue weighting in perception (Brown et al. 2014; Tagliamonte and Baayen 2012). Breiman (2001) proposed RFs as an ensemble learning technique that constructs a forest of independently grown decision trees for classification tasks: each decision tree is built from a randomly selected subset of the data, the process is repeated many times, and a prediction is made by majority vote (Liaw and Wiener 2002). The algorithm splits the dataset into two sets, a training set comprising approximately 66.7 % of the data (the in-bag set) and a testing set consisting of the remaining 33.3 % (the out-of-bag set), with samples drawn from the dataset without replacement during the growth of the forest (Strobl et al. 2009). Since no study has conducted classification-based analyses in this specific context of Korean loanwords from English, the contribution of this study is that it provides a methodology for classifying different patterns of loanword adaptation and predicting a pattern from a large corpus of established loanwords. The goal of the current study is to demonstrate how machine learning classifiers can be used to determine the primary phonological factors that account for the variable adaptation of loanwords in Korean.

2 Methods

In this study, the adaptation pattern of vowel insertion was modeled using an ordinal fixed-effects logistic regression model implemented in the ordinal package (Christensen 2022) in R (R Core Team 2022) to assess the significance of the linguistic factors that have been proposed to affect vowel insertion. An ordinal regression model was employed since the dependent variable categories are ordered, with no vowel insertion (NVI) being the lowest and vowel insertion (VI) the highest (see Table 1). The vowel insertion pattern was modeled in a single analysis where the dependent variable was the adaptation pattern (NVI, OVI [optional vowel insertion], or VI) and the five context predictors were vowel tenseness (lax or tense), stop voicing (voiceless or voiced), stop place (labial, coronal, or dorsal), final stress (unstressed or stressed), and word size (monosyllabic or polysyllabic). All fixed factors were treatment-coded, with the reference levels for the intercept set to lax, voiceless, labial, unstressed, and monosyllabic. Ordinal rather than multinomial logistic regression was run since the former is more parsimonious: it returns a single coefficient describing adjacent pairs of the outcome levels rather than a separate coefficient for every possible pair. This model, however, could not incorporate a random-effects structure (i.e., by-item intercepts) because the variance-covariance matrix of the parameters could not be calculated; this is likely because the present data lack a hierarchical structure with sufficient grouping levels, in which case the estimation of random effects becomes unstable and model reliability decreases.
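To make the cumulative-logit mechanics concrete, the following sketch (in Python rather than R; the helper function is ours, not part of the ordinal package) computes predicted pattern probabilities from the threshold and slope estimates reported in Table 5:

```python
import math

def cumulative_logit_probs(eta, thresholds):
    """Turn a linear predictor eta and ordered thresholds into category
    probabilities under a cumulative-logit (proportional-odds) model."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    # P(Y <= k) = sigmoid(theta_k - eta) for each threshold theta_k
    cum = [sigmoid(t - eta) for t in thresholds] + [1.0]
    # Adjacent differences give the per-category probabilities
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Thresholds (NVI | OVI and OVI | VI) and the tenseness slope from Table 5
thresholds = [3.397, 4.456]
beta_tense = 3.323

p_lax = cumulative_logit_probs(0.0, thresholds)           # baseline: lax vowel
p_tense = cumulative_logit_probs(beta_tense, thresholds)  # tense vowel

# Each list is [P(NVI), P(OVI), P(VI)]: tenseness sharply raises P(VI)
print([round(p, 3) for p in p_lax], [round(p, 3) for p in p_tense])
```

With the baseline (lax) predictor, nearly all probability mass falls on NVI; shifting only tenseness to tense moves substantial mass toward OVI and VI, which mirrors the pattern plotted in Figure 1A.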

The current study employed the SVM algorithm in Python (Python Software Foundation 2022) to help determine which category of adaptation patterns a new data point belongs to when it is adapted into the borrowing language. This computational study involved two types of classification: binary classification (NVI or VI) and ternary classification (NVI, OVI, or VI). The binary classification labels each data point along two dimensions: binary C (true or false that the word ends in a consonant) and binary V (true or false that the word ends in a vowel). In the binary C classification, the presence or absence of a final consonant is at issue: if a vowel is inserted after the final stop, so that the word ends in a vowel, C is false; if no vowel is inserted and the word ends in a consonant, C is true; and if a word shows both VI and NVI variants (the OVI case), C is also true. Similarly, in the binary V classification, the presence or absence of a final vowel is at issue: if no vowel is inserted and the word ends in a consonant, V is false; if a vowel is inserted, V is true; and if a word belongs to the OVI case, V is also true. The ternary classification, on the other hand, labels each data point directly as NVI, OVI, or VI. Taking the different chance levels into account, we expected higher accuracy from the binary classification models than from the ternary one.
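The two binary codings can be stated compactly as a mapping (the function name binary_labels is ours, chosen for illustration; the mapping itself follows the description above):

```python
def binary_labels(pattern):
    """Map a ternary adaptation pattern onto the two binary dimensions.

    C is true (1) iff some attested adaptation ends in a consonant;
    V is true (1) iff some attested adaptation ends in a vowel.
    OVI words, which vary, are true on both dimensions.
    """
    table = {
        "NVI": (1, 0),  # no vowel insertion: consonant-final only
        "VI":  (0, 1),  # vowel insertion: vowel-final only
        "OVI": (1, 1),  # optional insertion: both endings attested
    }
    return table[pattern]

print(binary_labels("OVI"))  # → (1, 1)
```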

The production patterns following the English final stop were fitted using the sklearn package in Python (Pedregosa et al. 2011). We fitted separate models for the two classification types, binary (NVI or VI) and ternary (NVI, OVI, or VI). Each model had class as the outcome variable (i.e., NVI, OVI, or VI) and the linguistic attributes as the predictor variables (i.e., lax, tense, voiceless, voiced, labial, coronal, dorsal, monosyllabic, polysyllabic, unstressed, or stressed; see Table 3). Data preprocessing was required because SVMs cannot operate on categorical variables directly: each data case must be represented as a vector of real numbers, so the original loanword list was converted into numeric data by creating dummy variables with values of 0 or 1. A categorical variable with two levels is represented by a set of two dummy variables, and a variable with three levels by three: a two-category attribute such as {voiced, voiceless} was represented as (0, 1) and (1, 0), and a three-category attribute like {labial, coronal, dorsal} as (0, 0, 1), (0, 1, 0), and (1, 0, 0). Each attribute set was randomly divided into two subsets: a training subset comprising 75 % of the data and a testing subset comprising the remaining 25 %. All models were fitted with a linear kernel, the simplest of the kernel functions used in SVMs; its simplicity makes it a suitable choice when training data are limited. The nu-parameterized SVM formulation (nu-SVC) was chosen, as it is relatively robust to outliers and allows the flexibility of the model to be adjusted via the nu parameter. Model performance was evaluated using five-fold cross-validation. In addition, to mitigate the challenges of working with a limited dataset, we employed Monte Carlo cross-validation (Kuhn and Johnson 2013; Picard and Cook 1984): the train/test process was repeated 100 times for each model with a new random train/test split each time, and the classification rates from the 100 iterations were averaged to obtain the overall classification rate.
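A minimal sketch of this Monte Carlo protocol in scikit-learn, using toy one-hot data in place of the actual corpus (the data-generating rule, seed, and attribute names are illustrative assumptions, and for simplicity the sketch uses the standard C-parameterized SVC rather than the nu formulation):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-in for the one-hot-coded attributes: two dummy columns for
# tenseness {lax, tense} and two for voicing {voiceless, voiced}
n = 200
tense = rng.integers(0, 2, n)
voiced = rng.integers(0, 2, n)
X = np.column_stack([1 - tense, tense, 1 - voiced, voiced])
# Illustrative rule: insertion (1) always after tense vowels and
# sometimes after voiced stops
y = (tense | (voiced & (rng.random(n) < 0.5))).astype(int)

# Monte Carlo cross-validation: 100 random 75/25 splits, accuracy averaged
scores = []
for i in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=i)
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

mean_acc = float(np.mean(scores))
print(round(mean_acc, 2))
```

Averaging over many random splits gives a more stable accuracy estimate than a single split, which matters with only a few hundred items.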

Table 3:

Thirty-one classifiers for the five different factors (vowel tenseness, stop voicing, stop place, word size, and final stress).

Number of factors Classifier
1 (i) tenseness, (ii) voicing, (iii) place, (iv) size, (v) stress
2 (i) tenseness + voicing, (ii) tenseness + place, (iii) tenseness + size, (iv) tenseness + stress, (v) voicing + place, (vi) voicing + size, (vii) voicing + stress, (viii) place + size, (ix) place + stress, (x) size + stress
3 (i) tenseness + voicing + place, (ii) tenseness + voicing + size, (iii) tenseness + voicing + stress, (iv) tenseness + place + size, (v) tenseness + place + stress, (vi) tenseness + size + stress, (vii) voicing + place + size, (viii) voicing + place + stress, (ix) voicing + size + stress, (x) place + size + stress
4 (i) tenseness + voicing + place + size, (ii) tenseness + voicing + place + stress, (iii) tenseness + voicing + size + stress, (iv) tenseness + place + size + stress, (v) voicing + place + size + stress
5 (i) tenseness + voicing + place + size + stress

3 Results and discussion

The regression model built for the corpus analysis found significant effects of several linguistic factors. Figure 1 shows the predicted probability of vowel insertion patterns classified as VI, OVI, and NVI depending on each predictor. First, tense pre-stop vowels significantly increased the likelihood of vowel insertion relative to NVI (see Table 5 in Appendix). At baseline (i.e., tenseness = lax, voicing = voiceless, place = labial, size = monosyllabic, and stress = unstressed), the model predicted NVI patterns (the reference category) to be significantly more common than OVI patterns, and VI patterns to be significantly less common than OVI patterns, as shown by the significant positive threshold estimates (β = 3.397, p < 0.001 for NVI | OVI; β = 4.456, p < 0.001 for OVI | VI). Importantly, tense pre-stop vowels significantly increased the predicted probability of VI patterns relative to the other patterns, compared with the reference likelihood (β = 3.323, OR = 27.74, p < 0.001). Second, words ending in voiced stops significantly increased the relative rates of VI patterns compared with words ending in voiceless stops (β = 2.215, OR = 9.16, p < 0.001). The model also found a significant effect of stop place: the relative rates of VI patterns were significantly higher in words ending in coronal and dorsal stops than in words ending in labial stops (β = 3.572, OR = 35.57, p < 0.001 for labials vs. coronals; β = 2.034, OR = 7.65, p < 0.001 for labials vs. dorsals).

Figure 1: 
Predicted probability of vowel insertion patterns in the regression model classified as vowel insertion (VI; dark blue), optional vowel insertion (OVI; mid blue), and no vowel insertion (NVI; light blue) depending on each linguistic factor (vowel tenseness in A, stop voicing in B, stop place in C, word size in D, and final stress in E).

The SVM model for ternary data in Table 6 (see Appendix) shows an overall classification rate of 71 %, well above chance. The ternary column of the classification hierarchy in Table 7 shows that the five-attribute classifier vowel tenseness + stop voicing + stop place + word size + final stress is classified most accurately, at 78 %, while the one-attribute classifiers stop voicing and word size are the worst, at 56 %. The models for binary data, on the other hand, show an overall classification rate of 78 %, higher than for the ternary one. Four-attribute classifiers are the most accurate, at 86 % in both binary models: vowel tenseness + stop voicing + stop place + final stress and vowel tenseness + stop voicing + stop place + word size for binary C, and vowel tenseness + stop voicing + stop place + final stress for binary V. The one-attribute classifier word size is the worst for binary data, at 62 % for binary C and 58 % for binary V.

It is evident that classifiers incorporating the vowel tenseness attribute perform strongly, achieving classification rates of at least 77 % in the binary C model. This finding suggests that the tenseness of the vowel preceding the English final stop holds phonological/phonetic importance in the Korean grammar. Additionally, increasing the number of attributes generally resulted in higher accuracy rates across the three models (see Table 6). The increase in accuracy with the addition of more attributes is not surprising, as including a greater number of explanatory variables naturally enhances the fit of the model in SVM analyses (Srivastava and Bhambhu 2010); more context in any multivariate analysis tends to lead to better model performance (Altman 1992; Cunningham and Delany 2007; Vapnik 1995). However, an exception is found in the models for binary data: the five-attribute classifier (vowel tenseness + stop voicing + stop place + word size + final stress) is classified less accurately than the best four-attribute classifiers (vowel tenseness + stop voicing + stop place + final stress and vowel tenseness + stop voicing + stop place + word size), at 85 % versus 86 % (see Table 7). This result shows that a larger number of variables does not necessarily guarantee a better outcome. A possible reason is that the cues point in different directions: in the model for binary V, the four attributes other than word size may point in the same direction, so that the four-attribute classifier has a stronger effect and is classified more accurately than the one containing all possible cues.

As shown in Figure 2, which illustrates the classifiers with the top three overall classification rates for each of the three models, the most accurate classification rate of the ternary model was 8 percentage points lower than that of the two binary models (78 % vs. 86 %). The best result (86 %) was obtained in both the binary C and binary V models with the classifier vowel tenseness + stop voicing + stop place + final stress. We can thus assume that this combination of attributes has the greatest impact on vowel insertion patterns when an English word ending in a stop is adapted into Korean. Specifically, the classification results revealed a substantial effect of both vowel tenseness and stop voicing on the vowel insertion phenomenon. Classifiers including the tenseness attribute consistently achieve greater accuracy than those containing only the other attributes. All 16 classifiers containing vowel tenseness showed high accuracy in both the ternary and binary models: all were at 70 % or higher in the ternary model and at 77 % or higher in the binary C model (see Table 7 in Appendix). Although the binary V model does not exhibit exactly the same pattern as the other two, it still generally yields very high accuracy: of the 16 classifiers that include vowel tenseness, 12 had accuracy of 79 % or higher, and four ranged from 70 % to 73 %.

Figure 2: 
Classifiers with the top three overall classification rates in the support vector machine models of the data for ternary, binary C(onsonant), and binary V(owel).

In this study, an RF classifier was also applied to the same production data to see whether SVM and RF classifiers yield similar results. We made use of the classical RF algorithm implemented in the R package randomForest (Liaw and Wiener 2002). RFs were fitted to the data to predict the production pattern with three levels (NVI, OVI, and VI), with the dependent variable modeled as a function of the linguistic factors vowel tenseness, stop voicing, stop place, final stress, and word size. The mtry parameter of the RF algorithm determines the number of predictors considered at each split of a decision tree, and experimenting with various mtry values is recommended to enhance a model's performance; the train() function from the caret package was therefore used to tune mtry (Kuhn 2008). To optimize mtry, a ten-fold cross-validation with five repetitions was conducted, and the final RF model exhibited a high classification accuracy of 80.75 % across the three production patterns. The confusion matrix in Table 4 shows the model's ability to distinguish the three patterns. The columns represent the model's predictions and the rows the observed categories; accurately classified instances sit on the diagonal and misclassifications off it. The fourth column presents the balanced accuracy. The model accurately classified NVI and VI patterns with an accuracy of approximately 87 %, a strong result for unseen data. The accuracy for the OVI pattern, however, indicates that this type was not effectively learned by the model, with a balanced accuracy at chance level (0.500). A possible reason for the poor performance on OVI is that the model predominantly predicted OVI instances as VI (n = 70); the majority class appears to drive the model's predictions in such cases.

Table 4:

Confusion matrix for the RF (random forests) model. Rows represent the observed labels and the columns represent predicted ones along with balanced accuracy. Accurately classified instances are situated on the diagonal.

NVI OVI VI Accuracy
NVI 60 10 9 0.870
OVI 0 0 0 0.500
VI 4 8 70 0.869
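The RF fitting and mtry tuning described above can be sketched with scikit-learn as a stand-in for the randomForest/caret workflow (synthetic placeholder data; max_features is sklearn's analogue of mtry, and the grid search below simplifies the repeated ten-fold scheme to a single ten-fold cross-validation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic stand-in for the loanword data: five binary predictors,
# three production-pattern classes (0 = NVI, 1 = OVI, 2 = VI)
X = rng.integers(0, 2, size=(300, 5))
noise = (rng.random(300) < 0.5).astype(int)
# Hypothetical dependency: the class is mostly driven by the first predictor
y = np.where(X[:, 0] == 1, 2, X[:, 1] & noise)

# Tune max_features (sklearn's analogue of mtry) by ten-fold cross-validation
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0),
    param_grid={"max_features": [1, 2, 3, 4, 5]},
    cv=10,
)
grid.fit(X, y)

best = grid.best_estimator_
print(grid.best_params_, round(best.oob_score_, 3))
```

The out-of-bag score of the refitted best model plays the same diagnostic role as the OOB error reported by the R randomForest package.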

In order to compute predictor importance in the RF model, the varImp() function from the caret package was employed (Kuhn 2008). Permutation-based variable importance was used to assess differences in the predictive accuracy of the model. Estimating variable importance with RFs differs from more conventional analyses like regression: permutation-based importance measures the impact of each predictor by randomly permuting its values and observing the change in the model's performance (Breiman 2001). In other words, if a certain predictor (e.g., vowel tenseness) is connected to a specific level of the response variable (e.g., VI), randomly shuffling the values of vowel tenseness weakens the connection between VI and the predictor. This analysis directly ranks the relative importance of the linguistic factors in predicting the different patterns of vowel insertion. The estimated relative importance of the predictors is shown in Figure 3, where predictors closer to zero are considered to contribute minimally to predicting the dependent variable. The results show that the predictors fall into three distinct groups by importance. The primary predictors are vowel tenseness and coronal stop place, with vowel tenseness the most important predictor of the dependent variable, followed by coronal stop place; random permutation of the values of these predictors noticeably decreased the accuracy of the model's predictions. Stop voicing and dorsal stop place are also important predictors, although less so than those in the primary group. The remaining predictors, word size and final stress, constitute the third group and are the least important in the RF model. Final stress in particular is associated with a mean decrease in accuracy very close to zero, indicating that it contributes little as a predictor in this model. Overall, vowel tenseness is the best predictor of production patterns in the data, whereas the other factors are less informative. These findings from RFs are consistent with the SVM results in that the vowel tenseness predictor contributes greatly to classification accuracy in both models.
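Permutation-based importance of the kind reported by varImp() can be illustrated with scikit-learn's permutation_importance (synthetic data; the "informative first column" setup is an assumption made for the demonstration, not a property of the corpus):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)

# Synthetic data: the first predictor fully determines the class,
# the remaining four are uninformative noise
X = rng.integers(0, 2, size=(400, 5))
y = X[:, 0]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Randomly permute each predictor in turn and record the mean drop
# in accuracy over 20 repeats
result = permutation_importance(rf, X, y, n_repeats=20, random_state=0)
drops = result.importances_mean

print(int(np.argmax(drops)))  # the informative predictor ranks first → 0
```

Shuffling the informative column destroys its link to the outcome and the accuracy collapses, while shuffling a noise column changes almost nothing; this is exactly the mean-decrease-in-accuracy logic behind Figure 3.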

Figure 3: 
The estimated relative importance of variables in random forests considering all predictors. Greater values of mean decrease in accuracy rates are predicted to have a more substantial impact on the model’s classification accuracy.

In this study, we trained SVM and RF classifiers to gain insight into the factors that contribute to the prediction of vowel insertion after an English word-final stop when a word is borrowed into Korean. From the SVM results, we found that binary classification (NVI or VI) achieved better results than ternary classification (NVI, OVI, or VI), with the average accuracy of the ternary model being 7 percentage points lower than that of the two binary models (71 % vs. 78 %; see Table 6). More importantly, classifiers including vowel tenseness consistently achieved greater accuracy than those built on the other predictors. Similar results were obtained with the RF model, where vowel tenseness was the most important predictor of production pattern: random permutation of its values noticeably decreased the accuracy of the model's predictions. Predictors other than vowel tenseness were less informative, although stop voicing and stop place also held some variable importance. These findings support the need for a behavioral experiment to see whether the current results make correct predictions with respect to the behavior of nonce items.

Table 5:

Summary of fixed effect coefficients in the logistic regression model for the ordinal distribution of no vowel insertion (NVI), optional vowel insertion (OVI), and vowel insertion (VI). NVI is the reference level. Fixed factors are vowel tenseness (lax vs. tense), stop voicing (voiceless vs. voiced), stop place (labial vs. coronal vs. dorsal), final stress (unstressed vs. stressed), and word size (monosyllabic vs. polysyllabic), with lax, voiceless, labial, unstressed, and monosyllabic set as the baseline levels. Values represent model coefficient estimate, standard error, odds ratio, 95 % confidence interval, and p value.

Fixed effect β SE OR 95 % CI p
NVI | OVI (Intercept) 3.397 0.595 29.87 9.29–96.02 <0.001***
OVI | VI (Intercept) 4.456 0.612 86.18 25.90–286.72 <0.001***
Tenseness = tense 3.323 0.324 27.74 15.10–54.04 <0.001***
Voicing = voiced 2.215 0.315 9.16 5.02–17.32 <0.001***
Place = coronal 3.572 0.397 35.57 16.84–80.10 <0.001***
Place = dorsal 2.034 0.397 7.65 3.59–17.12 <0.001***
Stress = stressed 0.847 0.483 2.33 0.92–6.15 0.080
Size = polysyllabic −0.369 0.491 0.69 0.27–1.84 0.452
***p < 0.001.

Table 6:

Support vector machine (SVM) classification matrix for the data of ternary, binary C(onsonant), and binary V(owel). Attributes are vowel tenseness, stop voicing, stop place, word size, and final stress. Values represent the percentage of correct classification (rounded to zero decimal places) with the standard deviation (rounded to one decimal place) in parentheses.

Classifier Average accuracy (%) of 100 rounds (SD)
Ternary Binary C Binary V
Tenseness 70 (3.8) 80 (1.5) 73 (3.3)
Voicing 56 (5.8) 66 (4.4) 60 (3.5)
Place 67 (3.5) 70 (5.8) 75 (6.7)
Size 56 (4.1) 62 (5.5) 58 (4.1)
Stress 60 (1.4) 65 (3.3) 66 (5.9)
Mean of one-attribute classifiers 62 69 66

Tenseness + voicing 76 (3.1) 84 (2.8) 79 (2.4)
Tenseness + place 75 (0.8) 77 (4.8) 84 (3.1)
Tenseness + size 70 (3.8) 80 (0.9) 71 (5.2)
Tenseness + stress 71 (3.4) 80 (4.3) 70 (4.2)
Voicing + place 69 (5.1) 72 (3.5) 78 (2.7)
Voicing + size 64 (3.2) 69 (4.8) 71 (5.6)
Voicing + stress 66 (3.8) 70 (4.0) 72 (4.0)
Place + size 69 (3.3) 69 (2.5) 79 (3.1)
Place + stress 70 (1.0) 69 (3.5) 80 (2.0)
Size + stress 60 (3.8) 66 (1.7) 66 (4.4)
Mean of two-attribute classifiers 69 74 75

Tenseness + voicing + place 77 (1.6) 85 (3.0) 85 (2.6)
Tenseness + voicing + size 75 (3.9) 84 (1.9) 79 (2.0)
Tenseness + voicing + stress 76 (4.7) 84 (2.1) 79 (3.7)
Tenseness + place + size 75 (3.3) 80 (3.3) 84 (4.7)
Tenseness + place + stress 76 (1.5) 80 (2.9) 85 (1.2)
Tenseness + size + stress 71 (4.0) 80 (3.8) 71 (3.7)
Voicing + place + size 68 (4.8) 74 (3.2) 79 (4.1)
Voicing + place + stress 69 (3.9) 72 (3.6) 78 (6.0)
Voicing + size + stress 64 (3.0) 70 (4.3) 71 (4.8)
Place + size + stress 69 (3.9) 71 (3.9) 79 (4.5)
Mean of three-attribute classifiers 72 78 79

Tenseness + voicing + place + size 77 (2.3) 86 (4.8) 85 (1.5)
Tenseness + voicing + place + stress 77 (4.7) 86 (2.8) 86 (1.7)
Tenseness + voicing + size + stress 76 (3.4) 84 (4.5) 79 (3.8)
Tenseness + place + size + stress 75 (3.1) 82 (2.3) 85 (4.6)
Voicing + place + size + stress 69 (1.4) 76 (2.3) 80 (5.2)
Mean of four-attribute classifiers 75 83 83

Tenseness + voicing + place + size + stress 78 (3.3) 85 (1.9) 85 (3.2)
Mean of all classifiers 71 78 78
Table 7:

Hierarchy of overall classification accuracy rates (%) for each model of ternary, binary C(onsonant), and binary V(owel). Attributes are vowel tenseness, stop voicing, stop place, word size, and final stress.

Ternary:
78: tenseness + voicing + place + size + stress
77: tenseness + voicing + place; tenseness + voicing + place + size; tenseness + voicing + place + stress
76: tenseness + voicing; tenseness + voicing + stress; tenseness + place + stress; tenseness + voicing + size + stress
75: tenseness + place; tenseness + place + size; tenseness + voicing + size; tenseness + place + size + stress
71: tenseness + stress; tenseness + size + stress
70: tenseness; tenseness + size; place + stress
69: voicing + place; voicing + place + size + stress; voicing + place + stress; place + size; place + size + stress
68: voicing + place + size
67: place
66: voicing + stress
64: voicing + size + stress; voicing + size
60: stress; size + stress
56: voicing; size

Binary C:
86: tenseness + voicing + place + stress; tenseness + voicing + place + size
85: tenseness + voicing + place; tenseness + voicing + place + size + stress
84: tenseness + voicing + stress; tenseness + voicing + size; tenseness + voicing + size + stress; tenseness + voicing
82: tenseness + place + size + stress
80: tenseness; tenseness + size; tenseness + stress; tenseness + place + size; tenseness + place + stress; tenseness + size + stress
77: tenseness + place
76: voicing + place + size + stress
74: voicing + place + size
72: voicing + place; voicing + place + stress
71: place + size + stress
70: place; voicing + stress; voicing + size + stress
69: voicing + size; place + size; place + stress
66: voicing; size + stress
65: stress
62: size

Binary V:
86: tenseness + voicing + place + stress
85: tenseness + voicing + place; tenseness + voicing + place + size; tenseness + voicing + place + size + stress; tenseness + place + size + stress; tenseness + place + stress
84: tenseness + place; tenseness + place + size
80: place + stress; voicing + place + size + stress
79: tenseness + voicing; place + size; place + size + stress; tenseness + voicing + size; tenseness + voicing + stress; tenseness + voicing + size + stress; voicing + place + size
78: voicing + place; voicing + place + stress
75: place
73: tenseness
72: voicing + stress
71: voicing + size + stress; tenseness + size + stress; voicing + size; tenseness + size
70: tenseness + stress
66: stress; size + stress
60: voicing
58: size

The contribution of the current study is that it provides methods for classifying different adaptation patterns and for determining which predictor is most influential in that classification. In this study we show that using machine learning classifiers in the investigation of loanword adaptation presents several advantages and opportunities for further study. Machine learning techniques offer quantitative methods for analyzing large datasets and identifying subtle phonological patterns. These classifiers can also be trained to predict the outcomes of loanword adaptation for new entries into a language. By uncovering predictive patterns in existing loanwords, we have developed models capable of anticipating how future loanwords will be adapted. This interdisciplinary approach could potentially open up new avenues for research and provide valuable insights into the dynamics of language contact.
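The predictive use described above can be illustrated with a random forest that both ranks the predictors by importance and classifies a nonce item. The sketch below is a hypothetical example, not the study's code: the data are synthetic, and the integer feature coding and the nonce item's values are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
ATTRS = ["tenseness", "voicing", "place", "size", "stress"]

# Synthetic stand-in data: integer-coded attributes and a ternary outcome
# (0 = no insertion, 1 = optional insertion, 2 = insertion).
X = rng.integers(0, 3, size=(300, 5))
y = (2 * X[:, 0] + X[:, 1] + rng.integers(0, 2, size=300)) % 3

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank predictors by mean decrease in impurity (Breiman 2001).
for name, imp in sorted(zip(ATTRS, forest.feature_importances_), key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")

# Predict the adaptation pattern of a hypothetical nonce loanword.
nonce = np.array([[2, 1, 0, 1, 0]])
print("predicted pattern:", forest.predict(nonce)[0])
```

A behavioral experiment with nonce items, as suggested above, could then test whether such predictions match speakers' actual adaptations.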


Corresponding author: Jungyeon Kim, Kangwon National University, Chuncheon, South Korea

Appendix

References

Altman, Naomi. 1992. An introduction to kernel and nearest-neighbor nonparametric regression. American Statistician 46. 175–185. https://doi.org/10.1080/00031305.1992.10475879.

Breiman, Leo. 2001. Random forests. Machine Learning 45(1). 5–32. https://doi.org/10.1023/a:1010933404324.

Brown, Lucien, Bodo Winter, Kaori Idemaru & Sven Grawunder. 2014. Phonetics and politeness: Perceiving Korean honorific and non-honorific speech through phonetic cues. Journal of Pragmatics 66. 45–60. https://doi.org/10.1016/j.pragma.2014.02.011.

Christensen, Rune Haubo Bojesen. 2022. ordinal – Regression models for ordinal data, version 2022.11-16 [R package]. Available at: https://CRAN.R-project.org/package=ordinal.

Cortes, Corinna & Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20(3). 273–297. https://doi.org/10.1007/bf00994018.

Cunningham, Pádraig & Sarah Jane Delany. 2007. K-nearest neighbor classifiers. Multiple Classifier Systems 34. 1–17.

Haspelmath, Martin. 2009. Lexical borrowing: Concepts and issues. In Martin Haspelmath & Uri Tadmor (eds.), Loanwords in the world’s languages: A comparative handbook, 35–54. Berlin: de Gruyter Mouton. https://doi.org/10.1515/9783110218442.35.

Hastie, Trevor, Robert Tibshirani & Jerome Friedman. 2009. The elements of statistical learning: Data mining, inference, and prediction. New York: Springer. https://doi.org/10.1007/978-0-387-84858-7.

Jun, Eun. 2002. Yeongeo chayongeo eumjeol mal pyeswaeeumui payeol yeobuwa moeum sabibe gwanhan silheomjeok yeongu [An experimental study of the effect of release of English syllable final stops on vowel epenthesis in English loanwords]. Eumseong Eumun Hyeongtaeron Yeongu [Studies in Phonetics, Phonology and Morphology] 8. 117–134.

Kang, Yoonjung. 2003. Perceptual similarity in loanword adaptation: English postvocalic word-final stops in Korean. Phonology 20. 219–273. https://doi.org/10.1017/s0952675703004524.

Kim, Jungyeon. 2018. Production and perception of English word-final stops by Korean speakers. Stony Brook: Stony Brook University dissertation.

Kuhn, Max. 2008. Building predictive models in R using the caret package. Journal of Statistical Software 28(5). 1–26. https://doi.org/10.18637/jss.v028.i05.

Kuhn, Max & Kjell Johnson. 2013. Applied predictive modeling. New York: Springer. https://doi.org/10.1007/978-1-4614-6849-3.

Kwon, Harim. 2017. Language experience, speech perception and loanword adaptation: Variable adaptation of English word-final plosives into Korean. Journal of Phonetics 60. 1–19. https://doi.org/10.1016/j.wocn.2016.10.001.

Liaw, Andy & Matthew Wiener. 2002. Classification and regression by randomForest. R News 2(3). 18–22.

National Academy of the Korean Language. 2001. Oeraeeo bareum siltae josa [Survey of pronunciation of loanwords]. Seoul: National Academy of the Korean Language.

National Academy of the Korean Language. 2002. Oeraeeo pyogi yongnyejip [Guidelines for loanword orthography]. Seoul: National Academy of the Korean Language.

National Academy of the Korean Language. 2007a. Oeraeeo sayong [Usage of loanwords]. Seoul: National Academy of the Korean Language.

National Academy of the Korean Language. 2007b. Oeraeeo injido ihaedo sayongdo mit taedojosa [Recognition, understanding, usage and attitude to loanwords]. Seoul: National Academy of the Korean Language.

National Academy of the Korean Language. 2010. Oeraeeo pyogi gyubeom yeonghyang pyeongga [Evaluation for influence of loanword orthography]. Seoul: National Academy of the Korean Language.

Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, M. Perrot & Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12. 2825–2830.

Peperkamp, Sharon. 2005. A psycholinguistic theory of loanword adaptation. In Mare Ettlinger, Nicholas Fleischer & Mischa Park-Doob (eds.), Proceedings of the 30th annual meeting of the Berkeley Linguistics Society, 341–352. Berkeley: Berkeley Linguistics Society. https://doi.org/10.3765/bls.v30i1.919.

Picard, Richard & Dennis Cook. 1984. Cross-validation of regression models. Journal of the American Statistical Association 79(387). 575–583. https://doi.org/10.1080/01621459.1984.10478083.

Python Software Foundation. 2022. Python, version 3.10.4 [Programming language]. Available at: http://www.python.org.

R Core Team. 2022. R: A language and environment for statistical computing, version 4.1.3. Vienna: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.

Razzaghi, Talayeh, Oleg Roderick, Ilya Safro & Nicholas Marko. 2016. Multilevel weighted support vector machine for classification on healthcare data with missing values. PLoS One 11(5). e0155119. https://doi.org/10.1371/journal.pone.0155119.

Rhee, Seok-Jae & Yoo-Kyung Choi. 2001. Yeongeo chayongeoui moeum sabibe daehan tonggyegwanchalgwa geu uiui [A statistical observation of vowel insertion in English loanwords in Korean and its significance]. Eumseong Eumun Hyeongtaeron Yeongu [Studies in Phonetics, Phonology and Morphology] 7. 153–176.

Rodríguez-Pérez, Raquel, Martin Vogt & Jürgen Bajorath. 2017. Support vector machine classification and regression prioritize different structural features for binary compound activity and potency value prediction. ACS Omega 2. 6371–6379. https://doi.org/10.1021/acsomega.7b01079.

Schohn, Greg & David Cohn. 2000. Less is more: Active learning with support vector machines. In Pat Langley (ed.), Proceedings of the 17th International Conference on Machine-learning, 839–846. San Francisco: Morgan Kaufmann.

Sebastiani, Fabrizio. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1). 1–47. https://doi.org/10.1145/505282.505283.

Srivastava, Durgesh & Lekha Bhambhu. 2010. Data classification using support vector machine. Journal of Theoretical and Applied Information Technology 12. 1–7.

Strobl, Carolin, James Malley & Gerhard Tutz. 2009. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14(4). 323–348. https://doi.org/10.1037/a0016973.

Tagliamonte, Sali & Harald Baayen. 2012. Models, forests, and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2). 135–178. https://doi.org/10.1017/s0954394512000129.

Vapnik, Vladimir. 1995. The nature of statistical learning theory. New York: Springer. https://doi.org/10.1007/978-1-4757-2440-0.

Yang, Yiming. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval 1(1–2). 69–90. https://doi.org/10.1023/a:1009982220290.

Received: 2022-04-29
Accepted: 2024-04-08
Published Online: 2024-06-21

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
