Article · Open Access

Supervised prediction of production patterns using machine learning algorithms

  • Jungyeon Kim
Published/Copyright: June 21, 2024

Abstract

When an English word ending in a stop is adapted into Korean, a vowel is variably inserted after the final stop: some words always take the epenthetic vowel, some never do, and some vary between the two alternatives. Although several linguistic factors plausibly affect this insertion, it is not easy to determine which pattern will be chosen when a new word enters the borrowing language. This study conducted classification analyses of production patterns using machine learning algorithms, namely support vector machines and random forests. The two classifiers show similar results, with vowel tenseness emerging as the best of all the candidate predictors. This indicates that vowel tenseness is the most influential factor in classifying the patterns (no vowel insertion, optional vowel insertion, or vowel insertion). The results also suggest that while vowel tenseness remains the strongest predictor, other factors such as stop voicing and stop place hold some importance, albeit to a lesser degree. The contribution of this study is that it provides insight into the factors that regulate vowel insertion, and these findings support the need for a behavioral experiment to see whether the current results make correct predictions with respect to the behavior of nonce items.

1 Introduction

Loanword adaptation (or repair) is the process by which a word’s original structure is altered to phonologically fit the borrowing language; that is, a word from a source language cannot enter a borrowing language without being phonologically adapted (Haspelmath 2009). However, there are some loanword patterns that cannot be explained by the phonological grammar of the borrowing language. One of those patterns involves what Peperkamp (2005) calls unnecessary repair, in which loanword repair occurs even when there are no illegal sequences in the native phonotactics of the borrowing language. Korean provides a particularly interesting case of this apparently unnecessary adaptation: a voiceless stop is permitted in the word-final position in the Korean native phonology (e.g., [pap˺] ‘meal’, [ap˺] ‘front’), so we would expect that words borrowed from English ending in a voiceless stop would be adapted as ending in a Korean voiceless stop, but it is surprising to find that a vowel is frequently inserted after the final stop (see Table 1). Previous studies have suggested that different factors affect the possibility of vowel epenthesis in this environment, including vowel tenseness, stop voicing, stop place, final stress, and word size (Jun 2002; Kang 2003; Kim 2018; Kwon 2017; Rhee and Choi 2001), as indicated in Table 2.

Table 1:

Vowel insertion patterns of English words ending in a stop.

No vowel insertion: lab → [læp˺]; hot → [hɑt˺]; book → [puk˺]
Optional vowel insertion: type → [thaɪp˺]∼[thaɪphɨ]; flute → [phɨllut˺]∼[phɨlluthɨ]; tag → [thæk˺]∼[thægɨ]
Vowel insertion: loop → [luphɨ]; brand → [pɨrændɨ]; peak → [phikhɨ]

Table 2:

Linguistic features influencing the likelihood of vowel epenthesis following English word-final stops.

Vowel tenseness: Vowel insertion is more likely when the vowel preceding the English word-final stop is tense than when it is lax (lax: step → [sɨthɛp˺]; tense: state → [sɨthɛithɨ]).
Stop voicing: Vowel insertion is more likely when the English word-final stop is voiced than when it is voiceless (voiceless: plot → [phɨllot˺]; voiced: plug → [phɨllʌgɨ]).
Stop place: Vowel insertion is more likely when the English word-final stop is coronal than when it is labial or dorsal (labial/dorsal: cap/bag → [khæp˺]/[pæk˺]; coronal: bat → [pæthɨ]).
Word size: Vowel insertion is more likely when the English word is monosyllabic than when it is polysyllabic (polysyllabic: moonlight → [munlait˺]; monosyllabic: light → [laithɨ]).
Final stress: Vowel insertion is more likely when the English final syllable is stressed than when it is unstressed (unstressed: ˈhandbag → [hændɨbæk˺]; stressed: handˈmade → [hændɨmɛidɨ]).

Since the specific adaptation pattern is not easy to predict even with the many features proposed by prior research, the purpose of the present study is to identify which production pattern will be chosen when a new English word ending in a stop is adapted into Korean. That is, this study attempts to determine which linguistic factors serve to produce the optimal output. To discover the best predictor, a corpus was built by collecting loanword lists published by the National Academy of the Korean Language (2001, 2002, 2007a, 2007b, 2010). This corpus consisted of 540 Korean loanwords from English whose source word ends in a stop; of these 540 words, 264 were consistently adapted with final vowel insertion and 214 were always adapted without it, while 62 were variably adapted both with and without vowel insertion.

This study employed a classification technique as a predictive model (Hastie et al. 2009; Kuhn and Johnson 2013) and a classification task was conducted using support vector machines (SVMs; Cortes and Vapnik 1995). Introduced by Cortes and Vapnik (1995) for binary classification, SVMs have been increasingly used for research in various areas including linguistics, communication, bioinformatics, and so on (Razzaghi et al. 2016; Rodríguez-Pérez et al. 2017; Schohn and Cohn 2000; Sebastiani 2002; Yang 1999). SVMs are a popular choice for classification, with several advantages. They are less prone to overfitting, which occurs when a model learns the training data too well but fails to generalize to new data. SVMs find a balance between maximizing the margin and minimizing the training error, which allows them to generalize well to unseen data. Also, they are memory efficient and can handle datasets with a limited number of training instances.

Despite the strengths of SVMs, their trained parameters can be difficult to analyze and interpret. Thus, a different classifier, random forests (RFs; Breiman 2001), was also applied to the same data to properly interpret the contribution of the linguistic factors and to see whether SVMs and RFs yield similar variable importance results. RFs are among the most versatile machine learning algorithms, as they require little tuning of their settings. They have been applied to sociolinguistic data and to acoustic cue weighting in perception (Brown et al. 2014; Tagliamonte and Baayen 2012). Breiman (2001) proposed RFs as an ensemble learning technique that constructs a forest of independently grown decision trees for classification tasks: each decision tree is built from a randomly selected subset of the data, the process is repeated many times, and a prediction is made by majority vote (Liaw and Wiener 2002). The algorithm splits the dataset into two sets, a training set comprising approximately 66.7 % of the data (the in-bag set) and a testing set consisting of the remaining 33.3 % (the out-of-bag set), with samples drawn from the dataset without replacement during the growth of the forest (Strobl et al. 2009). Since no study has conducted classification-based analyses in this specific context of Korean loanwords from English, the contribution of this study is that it provides a methodology for classifying different patterns of loanword adaptation and predicting a pattern from a large corpus of established loanwords. The goal of the current study is to demonstrate how machine learning classifiers can be used to determine the primary phonological factors that account for the variable adaptation of loanwords in Korean.

2 Methods

In this study, the adaptation pattern of vowel insertion was modeled using an ordinal fixed-effects logistic regression model implemented in the ordinal package (Christensen 2022) in R (R Core Team 2022) to assess the significance of the linguistic factors that have been proposed to affect vowel insertion. An ordinal regression model was employed since the dependent variable categories are ordered, with no vowel insertion (NVI) being the lowest and vowel insertion (VI) the highest (see Table 1). The vowel insertion pattern was modeled in a single analysis where the dependent variable was the adaptation pattern (NVI, OVI [optional vowel insertion], or VI) and the five context predictors were vowel tenseness (lax or tense), stop voicing (voiceless or voiced), stop place (labial, coronal, or dorsal), final stress (unstressed or stressed), and word size (monosyllabic or polysyllabic). All fixed factors were treatment-coded, with the reference levels for the intercept set to lax, voiceless, labial, unstressed, and monosyllabic. Ordinal rather than multinomial logistic regression was run since the former is more parsimonious: it returns a single coefficient describing adjacent pairs of the outcome levels rather than a separate coefficient for every possible pair. This model, however, could not incorporate a random-effects structure (i.e., by-item intercepts) because the variance-covariance matrix of the parameters could not be calculated; this is likely because the present data lack a hierarchical structure with sufficient grouping levels, in which case the estimation of random effects becomes unstable and model reliability decreases.
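To make the cumulative-logit mechanics concrete, the following sketch (in Python rather than R; the helper function is ours, not part of the ordinal package) computes predicted pattern probabilities from the threshold and slope estimates reported in Table 5:

```python
import math

def cumulative_logit_probs(eta, thresholds):
    """Turn a linear predictor eta and ordered thresholds into category
    probabilities under a cumulative-logit (proportional-odds) model."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    # P(Y <= k) = sigmoid(theta_k - eta) for each threshold theta_k
    cum = [sigmoid(t - eta) for t in thresholds] + [1.0]
    # Adjacent differences give the per-category probabilities
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Thresholds (NVI | OVI and OVI | VI) and the tenseness slope from Table 5
thresholds = [3.397, 4.456]
beta_tense = 3.323

p_lax = cumulative_logit_probs(0.0, thresholds)           # baseline: lax vowel
p_tense = cumulative_logit_probs(beta_tense, thresholds)  # tense vowel

# Each list is [P(NVI), P(OVI), P(VI)]: tenseness sharply raises P(VI)
print([round(p, 3) for p in p_lax], [round(p, 3) for p in p_tense])
```

With the baseline (lax) predictor, nearly all probability mass falls on NVI; shifting only tenseness to tense moves substantial mass toward OVI and VI, which mirrors the pattern plotted in Figure 1A.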

The current study employed the SVM algorithm in Python (Python Software Foundation 2022) to help determine which category of adaptation patterns a new data point belongs to when it is adapted into the borrowing language. This computational study involved two types of classification: binary classification (NVI or VI) and ternary classification (NVI, OVI, or VI). The binary classification labels each data point along two dimensions: binary C (true or false that the word ends in a consonant) and binary V (true or false that the word ends in a vowel). In the binary C classification, the presence or absence of a final consonant is at issue: if a vowel is inserted after the final stop, so that the word ends in a vowel, C is false; if no vowel is inserted and the word ends in a consonant, C is true; and if a word shows both VI and NVI variants (the OVI case), C is also true. Similarly, in the binary V classification, the presence or absence of a final vowel is at issue: if no vowel is inserted and the word ends in a consonant, V is false; if a vowel is inserted, V is true; and if a word belongs to the OVI case, V is also true. The ternary classification, on the other hand, labels each data point directly as NVI, OVI, or VI. Taking the different chance levels into account, we expected higher accuracy from the binary classification models than from the ternary one.
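The two binary codings can be stated compactly as a mapping (the function name binary_labels is ours, chosen for illustration; the mapping itself follows the description above):

```python
def binary_labels(pattern):
    """Map a ternary adaptation pattern onto the two binary dimensions.

    C is true (1) iff some attested adaptation ends in a consonant;
    V is true (1) iff some attested adaptation ends in a vowel.
    OVI words, which vary, are true on both dimensions.
    """
    table = {
        "NVI": (1, 0),  # no vowel insertion: consonant-final only
        "VI":  (0, 1),  # vowel insertion: vowel-final only
        "OVI": (1, 1),  # optional insertion: both endings attested
    }
    return table[pattern]

print(binary_labels("OVI"))  # → (1, 1)
```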

The production patterns following the English final stop were fitted using the sklearn package in Python (Pedregosa et al. 2011). We fitted separate models for the two classification types, binary (NVI or VI) and ternary (NVI, OVI, or VI). Each model had class as the outcome variable (i.e., NVI, OVI, or VI) and the linguistic attributes as the predictor variables (i.e., lax, tense, voiceless, voiced, labial, coronal, dorsal, monosyllabic, polysyllabic, unstressed, or stressed; see Table 3). Data preprocessing was required because SVMs cannot operate on categorical variables directly: each data case must be represented as a vector of real numbers, so the original loanword list was converted into numeric data by creating dummy variables with values of 0 or 1. A categorical variable with two levels is represented by a set of two dummy variables, and a variable with three levels by three: a two-category attribute such as {voiced, voiceless} was represented as (0, 1) and (1, 0), and a three-category attribute like {labial, coronal, dorsal} as (0, 0, 1), (0, 1, 0), and (1, 0, 0). Each attribute set was randomly divided into two subsets: a training subset comprising 75 % of the data and a testing subset comprising the remaining 25 %. All models were fitted with a linear kernel, the simplest of the kernel functions used in SVMs; its simplicity makes it a suitable choice when training data are limited. The nu-parameterized SVM formulation (nu-SVC) was chosen, as it is relatively robust to outliers and allows the flexibility of the model to be adjusted via the nu parameter. Model performance was evaluated using five-fold cross-validation. In addition, to mitigate the challenges of working with a limited dataset, we employed Monte Carlo cross-validation (Kuhn and Johnson 2013; Picard and Cook 1984): the train/test process was repeated 100 times for each model with a new random train/test split each time, and the classification rates from the 100 iterations were averaged to obtain the overall classification rate.
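A minimal sketch of this Monte Carlo protocol in scikit-learn, using toy one-hot data in place of the actual corpus (the data-generating rule, seed, and attribute names are illustrative assumptions, and for simplicity the sketch uses the standard C-parameterized SVC rather than the nu formulation):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-in for the one-hot-coded attributes: two dummy columns for
# tenseness {lax, tense} and two for voicing {voiceless, voiced}
n = 200
tense = rng.integers(0, 2, n)
voiced = rng.integers(0, 2, n)
X = np.column_stack([1 - tense, tense, 1 - voiced, voiced])
# Illustrative rule: insertion (1) always after tense vowels and
# sometimes after voiced stops
y = (tense | (voiced & (rng.random(n) < 0.5))).astype(int)

# Monte Carlo cross-validation: 100 random 75/25 splits, accuracy averaged
scores = []
for i in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=i)
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

mean_acc = float(np.mean(scores))
print(round(mean_acc, 2))
```

Averaging over many random splits gives a more stable accuracy estimate than a single split, which matters with only a few hundred items.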

Table 3:

Thirty-one classifiers for the five different factors (vowel tenseness, stop voicing, stop place, word size, and final stress).

Number of factors Classifier
1 (i) tenseness, (ii) voicing, (iii) place, (iv) size, (v) stress
2 (i) tenseness + voicing, (ii) tenseness + place, (iii) tenseness + size, (iv) tenseness + stress, (v) voicing + place, (vi) voicing + size, (vii) voicing + stress, (viii) place + size, (ix) place + stress, (x) size + stress
3 (i) tenseness + voicing + place, (ii) tenseness + voicing + size, (iii) tenseness + voicing + stress, (iv) tenseness + place + size, (v) tenseness + place + stress, (vi) tenseness + size + stress, (vii) voicing + place + size, (viii) voicing + place + stress, (ix) voicing + size + stress, (x) place + size + stress
4 (i) tenseness + voicing + place + size, (ii) tenseness + voicing + place + stress, (iii) tenseness + voicing + size + stress, (iv) tenseness + place + size + stress, (v) voicing + place + size + stress
5 (i) tenseness + voicing + place + size + stress

3 Results and discussion

The regression model built for the corpus analysis found significant effects of several linguistic factors. Figure 1 shows the predicted probability of vowel insertion patterns classified as VI, OVI, and NVI depending on each predictor. First, tense pre-stop vowels significantly increased the likelihood of vowel insertion relative to NVI (see Table 5 in Appendix). At baseline (i.e., tenseness = lax, voicing = voiceless, place = labial, size = monosyllabic, and stress = unstressed), the model predicted NVI patterns (the reference category) to be significantly more common than OVI patterns, and VI patterns to be significantly less common than OVI patterns, as shown by the significant positive threshold estimates (β = 3.397, p < 0.001 for NVI | OVI; β = 4.456, p < 0.001 for OVI | VI). Importantly, tense pre-stop vowels significantly increased the predicted probability of VI patterns relative to the other patterns, compared with the reference likelihood (β = 3.323, OR = 27.74, p < 0.001). Second, words ending in voiced stops significantly increased the relative rates of VI patterns compared with words ending in voiceless stops (β = 2.215, OR = 9.16, p < 0.001). The model also found a significant effect of stop place: the relative rates of VI patterns were significantly higher in words ending in coronal and dorsal stops than in words ending in labial stops (β = 3.572, OR = 35.57, p < 0.001 for labials vs. coronals; β = 2.034, OR = 7.65, p < 0.001 for labials vs. dorsals).

Figure 1: 
Predicted probability of vowel insertion patterns in the regression model classified as vowel insertion (VI; dark blue), optional vowel insertion (OVI; mid blue), and no vowel insertion (NVI; light blue) depending on each linguistic factor (vowel tenseness in A, stop voicing in B, stop place in C, word size in D, and final stress in E).

The SVM model for ternary data in Table 6 (see Appendix) shows an overall classification rate of 71 %, well above chance. The ternary column of the classification hierarchy in Table 7 shows that the five-attribute classifier vowel tenseness + stop voicing + stop place + word size + final stress is classified most accurately, at 78 %, while the one-attribute classifiers stop voicing and word size are the worst, at 56 %. The models for binary data, on the other hand, show an overall classification rate of 78 %, higher than for the ternary one. Four-attribute classifiers are the most accurate, at 86 % in both binary models: vowel tenseness + stop voicing + stop place + final stress and vowel tenseness + stop voicing + stop place + word size for binary C, and vowel tenseness + stop voicing + stop place + final stress for binary V. The one-attribute classifier word size is the worst for binary data, at 62 % for binary C and 58 % for binary V.

It is evident that classifiers incorporating the vowel tenseness attribute perform strongly, achieving classification rates of at least 77 % in the binary C model. This finding suggests that the tenseness of the vowel preceding the English final stop holds phonological/phonetic importance in the Korean grammar. Additionally, increasing the number of attributes generally resulted in higher accuracy rates across the three models (see Table 6). The increase in accuracy with the addition of more attributes is not surprising, as including a greater number of explanatory variables naturally enhances the fit of the model in SVM analyses (Srivastava and Bhambhu 2010); more context in any multivariate analysis tends to lead to better model performance (Altman 1992; Cunningham and Delany 2007; Vapnik 1995). However, an exception is found in the models for binary data: the five-attribute classifier (vowel tenseness + stop voicing + stop place + word size + final stress) is classified less accurately than the best four-attribute classifiers (vowel tenseness + stop voicing + stop place + final stress and vowel tenseness + stop voicing + stop place + word size), at 85 % versus 86 % (see Table 7). This result shows that a larger number of variables does not necessarily guarantee a better outcome. A possible reason is that the cues point in different directions: in the model for binary V, the four attributes other than word size may point in the same direction, so that the four-attribute classifier has a stronger effect and is classified more accurately than the one containing all possible cues.

As shown in Figure 2, which illustrates the classifiers with the top three overall classification rates for each of the three models, the most accurate classification rate of the ternary model was 8 percentage points lower than that of the two binary models (78 % vs. 86 %). The best result (86 %) was obtained in both the binary C and binary V models with the classifier vowel tenseness + stop voicing + stop place + final stress. We can thus assume that this combination of attributes has the greatest impact on vowel insertion patterns when an English word ending in a stop is adapted into Korean. Specifically, the classification results revealed a substantial effect of both vowel tenseness and stop voicing on the vowel insertion phenomenon. Classifiers including the tenseness attribute consistently achieve greater accuracy than those containing only the other attributes. All 16 classifiers containing vowel tenseness showed high accuracy in both the ternary and binary models: all were at 70 % or higher in the ternary model and at 77 % or higher in the binary C model (see Table 7 in Appendix). Although the binary V model does not exhibit exactly the same pattern as the other two, it still generally yields very high accuracy: of the 16 classifiers that include vowel tenseness, 12 had accuracy of 79 % or higher, and four ranged from 70 % to 73 %.

Figure 2: 
Classifiers with the top three overall classification rates in the support vector machine models of the data for ternary, binary C(onsonant), and binary V(owel).

In this study, an RF classifier was also applied to the same production data to see whether SVM and RF classifiers yield similar results. We made use of the classical RF algorithm implemented in the R package randomForest (Liaw and Wiener 2002). RFs were fitted to the data to predict the production pattern with three levels (NVI, OVI, and VI), with the dependent variable modeled as a function of the linguistic factors vowel tenseness, stop voicing, stop place, final stress, and word size. The mtry parameter of the RF algorithm determines the number of predictors considered at each split of a decision tree, and experimenting with various mtry values is recommended to enhance a model's performance; the train() function from the caret package was therefore used to tune mtry (Kuhn 2008). To optimize mtry, a ten-fold cross-validation with five repetitions was conducted, and the final RF model exhibited a high classification accuracy of 80.75 % across the three production patterns. The confusion matrix in Table 4 shows the model's ability to distinguish the three patterns. The columns represent the model's predictions and the rows the observed categories; accurately classified instances sit on the diagonal and misclassifications off it. The fourth column presents the balanced accuracy. The model accurately classified NVI and VI patterns with an accuracy of approximately 87 %, a strong result for unseen data. The accuracy for the OVI pattern, however, indicates that this type was not effectively learned by the model, with a balanced accuracy at chance level (0.500). A possible reason for the poor performance on OVI is that the model predominantly predicted OVI instances as VI (n = 70); the majority class appears to drive the model's predictions in such cases.

Table 4:

Confusion matrix for the RF (random forests) model. Rows represent the observed labels and the columns represent predicted ones along with balanced accuracy. Accurately classified instances are situated on the diagonal.

NVI OVI VI Accuracy
NVI 60 10 9 0.870
OVI 0 0 0 0.500
VI 4 8 70 0.869
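The RF fitting and mtry tuning described above can be sketched with scikit-learn as a stand-in for the randomForest/caret workflow (synthetic placeholder data; max_features is sklearn's analogue of mtry, and the grid search below simplifies the repeated ten-fold scheme to a single ten-fold cross-validation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic stand-in for the loanword data: five binary predictors,
# three production-pattern classes (0 = NVI, 1 = OVI, 2 = VI)
X = rng.integers(0, 2, size=(300, 5))
noise = (rng.random(300) < 0.5).astype(int)
# Hypothetical dependency: the class is mostly driven by the first predictor
y = np.where(X[:, 0] == 1, 2, X[:, 1] & noise)

# Tune max_features (sklearn's analogue of mtry) by ten-fold cross-validation
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0),
    param_grid={"max_features": [1, 2, 3, 4, 5]},
    cv=10,
)
grid.fit(X, y)

best = grid.best_estimator_
print(grid.best_params_, round(best.oob_score_, 3))
```

The out-of-bag score of the refitted best model plays the same diagnostic role as the OOB error reported by the R randomForest package.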

In order to compute predictor importance in the RF model, the varImp() function from the caret package was employed (Kuhn 2008). Permutation-based variable importance was used to assess differences in the predictive accuracy of the model. Estimating variable importance with RFs differs from more conventional analyses like regression: permutation-based importance measures the impact of each predictor by randomly permuting its values and observing the change in the model's performance (Breiman 2001). In other words, if a certain predictor (e.g., vowel tenseness) is connected to a specific level of the response variable (e.g., VI), randomly shuffling the values of vowel tenseness weakens the connection between VI and the predictor. This analysis directly ranks the relative importance of the linguistic factors in predicting the different patterns of vowel insertion. The estimated relative importance of the predictors is shown in Figure 3, where predictors closer to zero are considered to contribute minimally to predicting the dependent variable. The results show that the predictors fall into three distinct groups by importance. The primary predictors are vowel tenseness and coronal stop place, with vowel tenseness the most important predictor of the dependent variable, followed by coronal stop place; random permutation of the values of these predictors noticeably decreased the accuracy of the model's predictions. Stop voicing and dorsal stop place are also important predictors, although less so than those in the primary group. The remaining predictors, word size and final stress, constitute the third group and are the least important in the RF model. Final stress in particular is associated with a mean decrease in accuracy very close to zero, indicating that it contributes little as a predictor in this model. Overall, vowel tenseness is the best predictor of production patterns in the data, whereas the other factors are less informative. These findings from RFs are consistent with the SVM results in that the vowel tenseness predictor contributes greatly to classification accuracy in both models.
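Permutation-based importance of the kind reported by varImp() can be illustrated with scikit-learn's permutation_importance (synthetic data; the "informative first column" setup is an assumption made for the demonstration, not a property of the corpus):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)

# Synthetic data: the first predictor fully determines the class,
# the remaining four are uninformative noise
X = rng.integers(0, 2, size=(400, 5))
y = X[:, 0]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Randomly permute each predictor in turn and record the mean drop
# in accuracy over 20 repeats
result = permutation_importance(rf, X, y, n_repeats=20, random_state=0)
drops = result.importances_mean

print(int(np.argmax(drops)))  # the informative predictor ranks first → 0
```

Shuffling the informative column destroys its link to the outcome and the accuracy collapses, while shuffling a noise column changes almost nothing; this is exactly the mean-decrease-in-accuracy logic behind Figure 3.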

Figure 3: 
The estimated relative importance of variables in random forests considering all predictors. Greater values of mean decrease in accuracy rates are predicted to have a more substantial impact on the model’s classification accuracy.

In this study, we trained SVM and RF classifiers to gain insight into the factors that contribute to the prediction of vowel insertion after an English word-final stop when a word is borrowed into Korean. From the SVM results, we found that binary classification (NVI or VI) achieved better results than ternary classification (NVI, OVI, or VI), with the average accuracy of the ternary model being 7 percentage points lower than that of the two binary models (71 % vs. 78 %; see Table 6). More importantly, classifiers including vowel tenseness consistently achieved greater accuracy than those built on the other predictors. Similar results were obtained with the RF model, where vowel tenseness was the most important predictor of production pattern: random permutation of its values noticeably decreased the accuracy of the model's predictions. Predictors other than vowel tenseness were less informative, although stop voicing and stop place also held some variable importance. These findings support the need for a behavioral experiment to see whether the current results make correct predictions with respect to the behavior of nonce items.

Table 5:

Summary of fixed effect coefficients in the logistic regression model for the ordinal distribution of no vowel insertion (NVI), optional vowel insertion (OVI), and vowel insertion (VI). NVI is the reference level. Fixed factors are vowel tenseness (lax vs. tense), stop voicing (voiceless vs. voiced), stop place (labial vs. coronal vs. dorsal), final stress (unstressed vs. stressed), and word size (monosyllabic vs. polysyllabic), with lax, voiceless, labial, unstressed, and monosyllabic set as the baseline levels. Values represent model coefficient estimate, standard error, odds ratio, 95 % confidence interval, and p value.

Fixed effect β SE OR 95 % CI p
NVI | OVI (Intercept) 3.397 0.595 29.87 9.29–96.02 <0.001***
OVI | VI (Intercept) 4.456 0.612 86.18 25.90–286.72 <0.001***
Tenseness = tense 3.323 0.324 27.74 15.10–54.04 <0.001***
Voicing = voiced 2.215 0.315 9.16 5.02–17.32 <0.001***
Place = coronal 3.572 0.397 35.57 16.84–80.10 <0.001***
Place = dorsal 2.034 0.397 7.65 3.59–17.12 <0.001***
Stress = stressed 0.847 0.483 2.33 0.92–6.15 0.080
Size = polysyllabic −0.369 0.491 0.69 0.27–1.84 0.452
***p < 0.001.

Table 6:

Support vector machine (SVM) classification matrix for the data of ternary, binary C(onsonant), and binary V(owel). Attributes are vowel tenseness, stop voicing, stop place, word size, and final stress. Values represent the percentage of correct classification (rounded to zero decimal places) with the standard deviation (rounded to one decimal place) in parentheses.

Classifier Average accuracy (%) of 100 rounds (SD)
Ternary Binary C Binary V
Tenseness 70 (3.8) 80 (1.5) 73 (3.3)
Voicing 56 (5.8) 66 (4.4) 60 (3.5)
Place 67 (3.5) 70 (5.8) 75 (6.7)
Size 56 (4.1) 62 (5.5) 58 (4.1)
Stress 60 (1.4) 65 (3.3) 66 (5.9)
Mean of one-attribute classifiers 62 69 66

Tenseness + voicing 76 (3.1) 84 (2.8) 79 (2.4)
Tenseness + place 75 (0.8) 77 (4.8) 84 (3.1)
Tenseness + size 70 (3.8) 80 (0.9) 71 (5.2)
Tenseness + stress 71 (3.4) 80 (4.3) 70 (4.2)
Voicing + place 69 (5.1) 72 (3.5) 78 (2.7)
Voicing + size 64 (3.2) 69 (4.8) 71 (5.6)
Voicing + stress 66 (3.8) 70 (4.0) 72 (4.0)
Place + size 69 (3.3) 69 (2.5) 79 (3.1)
Place + stress 70 (1.0) 69 (3.5) 80 (2.0)
Size + stress 60 (3.8) 66 (1.7) 66 (4.4)
Mean of two-attribute classifiers 69 74 75

Tenseness + voicing + place 77 (1.6) 85 (3.0) 85 (2.6)
Tenseness + voicing + size 75 (3.9) 84 (1.9) 79 (2.0)
Tenseness + voicing + stress 76 (4.7) 84 (2.1) 79 (3.7)
Tenseness + place + size 75 (3.3) 80 (3.3) 84 (4.7)
Tenseness + place + stress 76 (1.5) 80 (2.9) 85 (1.2)
Tenseness + size + stress 71 (4.0) 80 (3.8) 71 (3.7)
Voicing + place + size 68 (4.8) 74 (3.2) 79 (4.1)
Voicing + place + stress 69 (3.9) 72 (3.6) 78 (6.0)
Voicing + size + stress 64 (3.0) 70 (4.3) 71 (4.8)
Place + size + stress 69 (3.9) 71 (3.9) 79 (4.5)
Mean of three-attribute classifiers 72 78 79

Tenseness + voicing + place + size 77 (2.3) 86 (4.8) 85 (1.5)
Tenseness + voicing + place + stress 77 (4.7) 86 (2.8) 86 (1.7)
Tenseness + voicing + size + stress 76 (3.4) 84 (4.5) 79 (3.8)
Tenseness + place + size + stress 75 (3.1) 82 (2.3) 85 (4.6)
Voicing + place + size + stress 69 (1.4) 76 (2.3) 80 (5.2)
Mean of four-attribute classifiers 75 83 83

Tenseness + voicing + place + size + stress 78 (3.3) 85 (1.9) 85 (3.2)
Mean of all classifiers 71 78 78
Table 7:

Hierarchy of overall classification accuracy rates (%) for each model of ternary, binary C(onsonant), and binary V(owel). Attributes are vowel tenseness, stop voicing, stop place, word size, and final stress.

Ternary:
78: tenseness + voicing + place + size + stress
77: tenseness + voicing + place; tenseness + voicing + place + size; tenseness + voicing + place + stress
76: tenseness + voicing; tenseness + voicing + stress; tenseness + place + stress; tenseness + voicing + size + stress
75: tenseness + place; tenseness + place + size; tenseness + voicing + size; tenseness + place + size + stress
71: tenseness + stress; tenseness + size + stress
70: tenseness; tenseness + size; place + stress
69: voicing + place; voicing + place + size + stress; voicing + place + stress; place + size; place + size + stress
68: voicing + place + size
67: place
66: voicing + stress
64: voicing + size + stress; voicing + size
60: stress; size + stress
56: voicing; size

Binary C:
86: tenseness + voicing + place + stress; tenseness + voicing + place + size
85: tenseness + voicing + place; tenseness + voicing + place + size + stress
84: tenseness + voicing + stress; tenseness + voicing + size; tenseness + voicing + size + stress; tenseness + voicing
82: tenseness + place + size + stress
80: tenseness; tenseness + size; tenseness + stress; tenseness + place + size; tenseness + place + stress; tenseness + size + stress
77: tenseness + place
76: voicing + place + size + stress
74: voicing + place + size
72: voicing + place; voicing + place + stress
71: place + size + stress
70: place; voicing + stress; voicing + size + stress
69: voicing + size; place + size; place + stress
66: voicing; size + stress
65: stress
62: size

Binary V:
86: tenseness + voicing + place + stress
85: tenseness + voicing + place; tenseness + voicing + place + size; tenseness + voicing + place + size + stress; tenseness + place + size + stress; tenseness + place + stress
84: tenseness + place; tenseness + place + size
80: place + stress; voicing + place + size + stress
79: tenseness + voicing; place + size; place + size + stress; tenseness + voicing + size; tenseness + voicing + stress; tenseness + voicing + size + stress; voicing + place + size
78: voicing + place; voicing + place + stress
75: place
73: tenseness
72: voicing + stress
71: voicing + size + stress; tenseness + size + stress; voicing + size; tenseness + size
70: tenseness + stress
66: stress; size + stress
60: voicing
58: size

The contribution of the current study is that it provides methods for classifying different adaptation patterns and for determining which predictor is most influential in that classification. In this study we show that using machine learning classifiers in the investigation of loanword adaptation presents several advantages and opportunities for further study. Machine learning techniques offer quantitative methods for analyzing large datasets and identifying subtle phonological patterns. These classifiers can also be trained to predict the outcomes of loanword adaptation for new entries into a language. By uncovering predictive patterns in existing loanwords, we have developed models capable of anticipating how future loanwords will be adapted. This interdisciplinary approach could potentially open up new avenues for research and provide valuable insights into the dynamics of language contact.
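The predictive use described above can be illustrated with a random forest that both ranks the predictors by importance and classifies a nonce item. The sketch below is a hypothetical example, not the study's code: the data are synthetic, and the integer feature coding and the nonce item's values are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
ATTRS = ["tenseness", "voicing", "place", "size", "stress"]

# Synthetic stand-in data: integer-coded attributes and a ternary outcome
# (0 = no insertion, 1 = optional insertion, 2 = insertion).
X = rng.integers(0, 3, size=(300, 5))
y = (2 * X[:, 0] + X[:, 1] + rng.integers(0, 2, size=300)) % 3

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank predictors by mean decrease in impurity (Breiman 2001).
for name, imp in sorted(zip(ATTRS, forest.feature_importances_), key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")

# Predict the adaptation pattern of a hypothetical nonce loanword.
nonce = np.array([[2, 1, 0, 1, 0]])
print("predicted pattern:", forest.predict(nonce)[0])
```

A behavioral experiment with nonce items, as suggested above, could then test whether such predictions match speakers' actual adaptations.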


Corresponding author: Jungyeon Kim, Kangwon National University, Chuncheon, South Korea

Appendix

References

Altman, Naomi. 1992. An introduction to kernel and nearest-neighbor nonparametric regression. American Statistician 46. 175–185. https://doi.org/10.1080/00031305.1992.10475879.

Breiman, Leo. 2001. Random forests. Machine Learning 45(1). 5–32. https://doi.org/10.1023/a:1010933404324.

Brown, Lucien, Bodo Winter, Kaori Idemaru & Sven Grawunder. 2014. Phonetics and politeness: Perceiving Korean honorific and non-honorific speech through phonetic cues. Journal of Pragmatics 66. 45–60. https://doi.org/10.1016/j.pragma.2014.02.011.

Christensen, Rune Haubo Bojesen. 2022. ordinal – Regression models for ordinal data, version 2022.11-16 [R package]. Available at: https://CRAN.R-project.org/package=ordinal.

Cortes, Corinna & Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20(3). 273–297. https://doi.org/10.1007/bf00994018.

Cunningham, Pádraig & Sarah Jane Delany. 2007. K-nearest neighbor classifiers. Multiple Classifier Systems 34. 1–17.

Haspelmath, Martin. 2009. Lexical borrowing: Concepts and issues. In Martin Haspelmath & Uri Tadmor (eds.), Loanwords in the world’s languages: A comparative handbook, 35–54. Berlin: de Gruyter Mouton. https://doi.org/10.1515/9783110218442.35.

Hastie, Trevor, Robert Tibshirani & Jerome Friedman. 2009. The elements of statistical learning: Data mining, inference, and prediction. New York: Springer. https://doi.org/10.1007/978-0-387-84858-7.

Jun, Eun. 2002. Yeongeo chayongeo eumjeol mal pyeswaeeumui payeol yeobuwa moeum sabibe gwanhan silheomjeok yeongu [An experimental study of the effect of release of English syllable final stops on vowel epenthesis in English loanwords]. Eumseong Eumun Hyeongtaeron Yeongu [Studies in Phonetics, Phonology and Morphology] 8. 117–134.

Kang, Yoonjung. 2003. Perceptual similarity in loanword adaptation: English postvocalic word-final stops in Korean. Phonology 20. 219–273. https://doi.org/10.1017/s0952675703004524.

Kim, Jungyeon. 2018. Production and perception of English word-final stops by Korean speakers. Stony Brook: Stony Brook University dissertation.

Kuhn, Max. 2008. Building predictive models in R using the caret package. Journal of Statistical Software 28(5). 1–26. https://doi.org/10.18637/jss.v028.i05.

Kuhn, Max & Kjell Johnson. 2013. Applied predictive modeling. New York: Springer. https://doi.org/10.1007/978-1-4614-6849-3.

Kwon, Harim. 2017. Language experience, speech perception and loanword adaptation: Variable adaptation of English word-final plosives into Korean. Journal of Phonetics 60. 1–19. https://doi.org/10.1016/j.wocn.2016.10.001.

Liaw, Andy & Matthew Wiener. 2002. Classification and regression by randomForest. R News 2(3). 18–22.

National Academy of the Korean Language. 2001. Oeraeeo bareum siltae josa [Survey of pronunciation of loanwords]. Seoul: National Academy of the Korean Language.

National Academy of the Korean Language. 2002. Oeraeeo pyogi yongnyejip [Guidelines for loanword orthography]. Seoul: National Academy of the Korean Language.

National Academy of the Korean Language. 2007a. Oeraeeo sayong [Usage of loanwords]. Seoul: National Academy of the Korean Language.

National Academy of the Korean Language. 2007b. Oeraeeo injido ihaedo sayongdo mit taedojosa [Recognition, understanding, usage and attitude to loanwords]. Seoul: National Academy of the Korean Language.

National Academy of the Korean Language. 2010. Oeraeeo pyogi gyubeom yeonghyang pyeongga [Evaluation for influence of loanword orthography]. Seoul: National Academy of the Korean Language.

Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, M. Perrot & Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12. 2825–2830.

Peperkamp, Sharon. 2005. A psycholinguistic theory of loanword adaptation. In Mare Ettlinger, Nicholas Fleischer & Mischa Park-Doob (eds.), Proceedings of the 30th annual meeting of the Berkeley Linguistics Society, 341–352. Berkeley: Berkeley Linguistics Society. https://doi.org/10.3765/bls.v30i1.919.

Picard, Richard & Dennis Cook. 1984. Cross-validation of regression models. Journal of the American Statistical Association 79(387). 575–583. https://doi.org/10.1080/01621459.1984.10478083.

Python Software Foundation. 2022. Python, version 3.10.4 [Programming language]. Available at: http://www.python.org.

R Core Team. 2022. R: A language and environment for statistical computing, version 4.1.3. Vienna: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.

Razzaghi, Talayeh, Oleg Roderick, Ilya Safro & Nicholas Marko. 2016. Multilevel weighted support vector machine for classification on healthcare data with missing values. PLoS One 11(5). e0155119. https://doi.org/10.1371/journal.pone.0155119.

Rhee, Seok-Jae & Yoo-Kyung Choi. 2001. Yeongeo chayongeoui moeum sabibe daehan tonggyegwanchalgwa geu uiui [A statistical observation of vowel insertion in English loanwords in Korean and its significance]. Eumseong Eumun Hyeongtaeron Yeongu [Studies in Phonetics, Phonology and Morphology] 7. 153–176.

Rodríguez-Pérez, Raquel, Martin Vogt & Jürgen Bajorath. 2017. Support vector machine classification and regression prioritize different structural features for binary compound activity and potency value prediction. ACS Omega 2. 6371–6379. https://doi.org/10.1021/acsomega.7b01079.

Schohn, Greg & David Cohn. 2000. Less is more: Active learning with support vector machines. In Pat Langley (ed.), Proceedings of the 17th International Conference on Machine-learning, 839–846. San Francisco: Morgan Kaufmann.

Sebastiani, Fabrizio. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1). 1–47. https://doi.org/10.1145/505282.505283.

Srivastava, Durgesh & Lekha Bhambhu. 2010. Data classification using support vector machine. Journal of Theoretical and Applied Information Technology 12. 1–7.

Strobl, Carolin, James Malley & Gerhard Tutz. 2009. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14(4). 323–348. https://doi.org/10.1037/a0016973.

Tagliamonte, Sali & Harald Baayen. 2012. Models, forests, and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2). 135–178. https://doi.org/10.1017/s0954394512000129.

Vapnik, Vladimir. 1995. The nature of statistical learning theory. New York: Springer. https://doi.org/10.1007/978-1-4757-2440-0.

Yang, Yiming. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval 1(1–2). 69–90. https://doi.org/10.1023/a:1009982220290.

Received: 2022-04-29
Accepted: 2024-04-08
Published Online: 2024-06-21

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
