
An Experimental Comparison of Modeling Techniques and Combination of Speaker – Specific Information from Different Languages for Multilingual Speaker Identification


Abstract

Most state-of-the-art speaker identification systems work in a monolingual (preferably English) scenario, so countries where English is the dominant language can use such systems efficiently for speaker recognition. However, many countries, including India, are multilingual in nature, and people in such countries are accustomed to speaking multiple languages. An existing speaker identification system may yield poor performance if a speaker's train and test data are in different languages. Developing a robust multilingual speaker identification system is thus an open issue in many countries. In this work, an experimental evaluation of modeling techniques, including self-organizing map (SOM), learning vector quantization (LVQ), and Gaussian mixture model-universal background model (GMM-UBM) classifiers, for multilingual speaker identification is presented. Monolingual and crosslingual speaker identification studies are conducted using 50 speakers of our own database. The experimental results show that the GMM-UBM classifier gives better identification performance than the SOM and LVQ classifiers. Furthermore, we propose a combination of speaker-specific information from different languages for crosslingual speaker identification, and we observe that the combined feature gives better performance in all the crosslingual speaker identification experiments.

1 Introduction

The demand for intelligent information retrieval has greatly increased as the Internet and telecommunications grow rapidly [24]. Speech is one of the most effective communication media between humans and machines, and a biometric security system based on human speech has great potential in several domains [17]. Speaker recognition is the task of identifying speakers from their voice [4]. It can be classified into speaker identification and speaker verification [6]. In speaker identification, as there is no identity claim, the system identifies the most likely speaker of the test speech signal. In a speaker verification task, a user's speech is used to classify the speaker as either who the speaker claims to be or an impostor [14]. Depending on the mode of operation, speaker identification can be either text dependent (with a constraint on what is spoken) or text independent (no constraint on what is spoken) [26].

Most speaker recognition systems work in a single-language (English/European languages) environment. However, some nations (India, South Africa, Canada, etc.) are bilingual or multilingual [19], and citizens of a multilingual country can speak more than one language fluently. Criminals often switch to another language, especially after committing a crime. Therefore, training a person's voice in one language and identifying that person in some other language, or in a multilingual environment, is an interesting task [2]. Speaker identification can be performed in monolingual, crosslingual, and multilingual modes [2, 19]. In monolingual speaker identification, the training and testing languages are the same for a speaker, whereas in crosslingual speaker identification, training is done in one language (say x) and testing in a different language (say y) [2, 20]. In multilingual speaker identification, speaker-specific models are trained in one language and tested in multiple languages.

The performance degradation due to mismatch between Mandarin and Sichuan dialect in training and testing was studied in Ref. [13]. In that study, a combined Gaussian mixture model (GMM) trained by using Mandarin and Sichuan dialects was proposed to alleviate the mismatch problem. The results showed that the combined GMM is more robust for the language-mismatch condition than the GMM trained solely using Mandarin or Sichuan dialect speech. An attempt was made to recognize a trilingual or multilingual speaker in Ref. [3]. In that work, training data of 60 s and different testing data of 1, 3, 7, 10, and 15 s were used. The degradation in crosslingual speaker recognition performance was analyzed using feature clustering in diverse regions of feature space.

The relevance of language in speaker recognition was investigated in Ref. [8] using a set of bilingual speakers in two different languages (Spanish and Catalan). The recognition system was built using linear prediction cepstral coefficient features, covariance matrices, and vector quantization (VQ) classifiers. It was observed that there is a small degradation in recognition performance if the training and testing languages are different. The conventional Mel frequency cepstral coefficient (MFCC) realization, based on a windowed (Hamming) discrete Fourier transform (DFT), may not yield good performance due to the high variance of the spectrum estimate [15, 16, 22]. To overcome this problem, Kinnunen et al. demonstrated the use of multitaper MFCC features for speaker verification tasks in Refs. [15, 16]. The experimental results on the NIST-2002 and NIST-2008 databases showed that the multitaper approach outperforms the conventional single-window technique. In this work, experiments are conducted in the monolingual and crosslingual contexts on our own database, using the sine-weighted cepstrum estimator (SWCE) multitaper MFCC feature and different modeling techniques, namely, self-organizing map (SOM), learning vector quantization (LVQ), and GMM-universal background model (UBM).

The rest of the paper is organized as follows. The speech database details for the study are discussed in Section 2. In Section 3, feature extraction using a multitaper MFCC technique is discussed. Section 4 presents the speaker identification study using the SOM modeling technique. Monolingual and crosslingual speaker identification using the LVQ technique are given in Section 5. Section 6 describes the speaker identification study using the GMM-UBM classifier. A combination of speaker-specific information from different languages for crosslingual speaker identification is presented in Section 7. Limitations and future work are given in Section 8. Section 9 gives the summary of the present work.

2 Speech Database for the Study

Experiments are carried out on our own database of 50 speakers, created from speakers who can speak three different languages (E – English, H – Hindi, and K – Kannada). The database includes 30 male and 20 female speakers. The voice recording was done in an engineering college laboratory; the speakers were undergraduate students and faculty of the college, aged 18 to 35 years. The speakers were asked to read short stories in three different languages. The training and testing data were recorded in different sessions with a minimum gap of 2 days, and the approximate training and testing data length is 2 min. Recording was done using the freely downloadable WaveSurfer 1.8.8p3 software [Royal Institute of Technology (KTH), Stockholm, Sweden] and a Beetel Head phone-250 [Beetel Teletech Limited, India] with a frequency range of 20 Hz–20 kHz. The speech files are stored in wav format. The detailed specifications used for collecting the database are shown in Table 1.

Table 1

Description of Database.

Item                  Description
Number of speakers    50
Sessions              Training and testing
Sampling rate         8 kHz
Sampling format       One-channel, Lin16 sample encoding
Languages covered     English, Hindi, and Kannada
Microphone            Beetel Head phone-250
Recording software    WaveSurfer 1.8.8p3
Maximum duration      120 s/story/language
Minimum duration      Depends on speaker

In this work, a subset of the Indian Institute of Technology Guwahati multivariability speaker recognition (IITG-MV) database [10] is used to train the UBM. The IITG-MV database was collected in a setup with five different sensors, two different environments, two different languages, and two different speaking styles [10]. The five sensors include a headphone mounted close to the speaker, an in-built tablet PC microphone, two mobile phones, and one digital voice recorder [23]. The speech data are sampled at 8 kHz and stored with 16 bits/sample resolution. The recording was done in the office (controlled environment) and in hostel rooms, laboratories, corridors, etc. (uncontrolled environments). The recording was done in two languages, namely, Indian English and the favorite language of the speaker, which is one of the Indian languages such as Hindi, Kannada, Tamil, Oriya, and so on [23].

3 Feature Extraction Using Multitaper MFCC

The basic idea in multitapering is to pass the analysis frame through multiple window functions and then compute a weighted average of the individual subspectra to obtain the final spectrum estimate [15]. In particular, the side-lobe leakage effect of the conventional windowed DFT can be partially suppressed by the multitaper approach. As each taper in a multitaper method is pairwise orthogonal to all the other tapers, the windowed speech signals provide statistically independent approximations of the underlying spectrum [1]. Different types of tapers for spectrum estimation were proposed in Ref. [16], where it was mentioned that the number of tapers K should be between 3 and 8, with K = 6 recommended as a starting point. In this work, the SWCE [15] multitaper is used.

A frame duration of 20 ms with an overlap of 10 ms is used for multitaper MFCC feature extraction. Let $F = [f(0)\ f(1)\ \ldots\ f(N-1)]^T$ denote one frame ($N$ samples) of the speech signal. The multitaper spectrum estimate is obtained using Eq. (1) [16]:

(1)   $\hat{S}(f) = \sum_{j=1}^{K} \lambda(j) \left| \sum_{n=0}^{N-1} w_j[n]\, f[n]\, e^{-i 2\pi f n / N} \right|^2$.

Here, $K$ represents the number of tapers used, and $w_j = [w_j(0)\ w_j(1)\ \ldots\ w_j(N-1)]^T$, $j = 1, 2, \ldots, K$, is the $j$th taper, applied with the corresponding weight $\lambda(j)$. Figure 1 shows the block diagram of the multitaper MFCC method. We use only the first 13 cepstral coefficients, excluding the 0th coefficient, computed using 22 filters in the filter bank. Cepstral mean subtraction is applied to the multitaper MFCCs to remove linear channel effects. Silence and low-energy speech parts are removed using an energy-based voice activity detection technique, with a threshold of 0.06 times the average frame energy for selecting speech frames. A sketch of this front end is given after Figure 1.

Figure 1: Block Diagram of the Multitaper MFCC Technique.
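To make the front end concrete, the following is a minimal Python/NumPy sketch of Eq. (1) together with the cepstral and voice-activity-detection steps described above. Sine tapers with uniform weights λ(j) = 1/K stand in for the SWCE taper/weight set of Ref. [15], the 22-filter mel filter bank `mel_fbank` is assumed to be precomputed, and all function names are illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def multitaper_spectrum(frame, K=6):
    """Multitaper spectrum estimate of one frame, Eq. (1).

    Sine tapers with uniform weights lambda(j) = 1/K are used here as a
    simple stand-in for the SWCE taper/weight set of Ref. [15].
    """
    N = len(frame)
    n = np.arange(N)
    S = np.zeros(N)
    for j in range(1, K + 1):
        # j-th sine taper; the taper family is pairwise orthogonal
        w = np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * j * (n + 1) / (N + 1))
        S += (1.0 / K) * np.abs(np.fft.fft(w * frame)) ** 2
    return S

def multitaper_mfcc(frame, mel_fbank, K=6, n_ceps=13):
    """Log mel-filter-bank energies -> DCT; keep c1..c13, dropping c0.
    mel_fbank: assumed (22, N/2 + 1) filter-bank matrix."""
    spec = multitaper_spectrum(frame, K)[: len(frame) // 2 + 1]
    logmel = np.log(mel_fbank @ spec + 1e-10)
    return dct(logmel, norm='ortho')[1 : n_ceps + 1]

def vad_mask(frames):
    """Energy-based VAD: keep frames whose energy exceeds 0.06 times
    the average frame energy (threshold from the text)."""
    e = np.sum(frames ** 2, axis=1)
    return e > 0.06 * e.mean()
```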

4 Speaker Modeling using SOM

The SOM learning process is unsupervised and competitive [18]. SOMs form an "elastic network" of points fitted to a given distribution [18]. The advantage of the clustering formed by the SOM is that it retains the underlying structure of the input space while reducing its dimensionality. During training, the SOM learns both the distribution and the topology of the input vectors [27]. In SOM competitive learning, instead of updating only the winner node, the SOM also updates the neighborhood nodes around the winning node. In SOM training, the weight vectors are initialized randomly with small values. Each instance of the input data is presented to the network, and the distance between that instance and each node's weight vector is calculated using the Euclidean distance measure given by Eq. (2):

(2)   $d = \sqrt{\sum_{i=1}^{n} (x_i - w_i)^2}$,

where $x_i$ is the $i$th component of the input vector and $w_i$ is the corresponding component of the weight vector. The winning neuron is the one with the smallest Euclidean distance. The weights of the winning neuron and its topological neighborhood are then adjusted in the direction of the input vector using Eqs. (3)–(5):

(3)   $h_j = \exp\left( -\dfrac{\| u_j - u_c \|^2}{2\sigma^2} \right)$,

where $h_j$ is the topological neighborhood function for neuron $j$, $\sigma$ is the effective width of the topological neighborhood, and $u_c$ denotes the position of the winning neuron. The change to the weight vector $w_j$ is obtained as

(4)   $\Delta w_j = \eta\, h_j\, (x - w_j)$,

where η is the learning rate parameter of the algorithm. The weights of the winning neuron and all nodes within a defined neighborhood of the winning neuron are then updated.

(5)   $w(t+1) = w(t) + \eta(t)\, h(t)\, [x(t) - w(t)]$,

where $h(t)$ is the topological neighborhood at time $t$. The entire process, following the initialization of the weights, is repeated over a number of iterations until no noticeable change in the weight vectors is observed. In this work, the initial weight vectors are taken as random samples from the input data, and the winning neuron is selected by the smallest Euclidean distance; a minimal training sketch is given below. The performance of the SOM depends on parameters such as η, the neighborhood h, and the number of iterations [9]. Experiments are conducted for different values of η, h, and the number of iterations for different train and test languages. The best performance for each train/test language pair is quoted in Table 2.
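As referenced above, the following is a minimal sketch of the SOM training loop of Eqs. (2)–(5) for a one-dimensional node grid. The grid geometry, decay schedules, and hyperparameter values are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def train_som(X, n_nodes=64, n_iter=5000, eta0=0.5, sigma0=3.0, seed=0):
    """Minimal 1-D SOM codebook training following Eqs. (2)-(5).

    X: (n_frames, dim) feature matrix of one speaker's training data.
    Returns the (n_nodes, dim) codebook used as the speaker model.
    """
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_nodes, replace=False)].copy()  # init from data
    grid = np.arange(n_nodes, dtype=float)                    # node positions
    for t in range(n_iter):
        x = X[rng.integers(len(X))]               # random input vector
        d = np.sum((W - x) ** 2, axis=1)          # Eq. (2), squared distances
        c = np.argmin(d)                          # winning node
        eta = eta0 * np.exp(-t / n_iter)          # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)      # shrinking neighborhood width
        h = np.exp(-(grid - grid[c]) ** 2 / (2 * sigma ** 2))  # Eq. (3)
        W += eta * h[:, None] * (x - W)           # Eqs. (4)-(5)
    return W
```

At test time, a common decision rule (assumed here) accumulates each test frame's distance to its nearest code vector in every speaker's codebook and picks the speaker with the smallest total distance.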

Table 2

Speaker Identification Performance (%) for the 50 Speakers of Our Own Database Using 20-s Training and Testing Data for the SOM Modeling Technique.

Train/Test language   Codebook size
                      16    32    64    128   256
E/E                   56    58    60    70    76
H/H                   50    60    60    66    72
K/K                   52    60    66    72    72
H/E                   46    54    64    68    72
K/E                   48    60    62    70    70
E/H                   50    60    60    68    66
K/H                   44    50    56    62    70
E/K                   46    52    60    58    60
H/K                   44    50    54    50    58

5 Speaker Modeling using LVQ

LVQ is a supervised version of VQ that uses class information to optimize the positions of the code vectors obtained by the SOM, so as to improve the quality of the classifier decision regions. LVQ algorithms directly describe class boundaries based on the nearest-neighbor rule and a winner-takes-all paradigm [9]. If the class labels of the input vector and the code vector agree, the code vector is moved toward the input vector; otherwise, it is moved away from the input vector. An input vector is picked at random from the input space. LVQ initially follows the steps involved in the SOM for tuning the weight vectors (codebooks); these weight vectors are then further fine-tuned by LVQ as follows:

Suppose $X$ is the input vector presented at time $t$, and let $W_c$ be the code vector nearest to $X$. Let $\xi_{W_c}$ denote the class associated with the weight vector $W_c$ and $\xi_X$ denote the class label of the input vector $X$. The weight vector $W_c$ is adjusted as follows [11, 12]:

  • If $\xi_{W_c} = \xi_X$, then

    $W_c(t+1) = W_c(t) + \eta(t)\,[X - W_c(t)]$,

    where $0 < \eta(t) < 1$;

  • otherwise,

    $W_c(t+1) = W_c(t) - \eta(t)\,[X - W_c(t)]$.

  • The other weight vectors are not modified.

The algorithm shows that LVQ performance depends on parameters such as η and the number of iterations; a fine-tuning sketch is given below. Experiments are conducted for different values of η and the number of iterations for different train and test languages, and the best performance is quoted in Table 3. The performance of LVQ is better than that of the SOM modeling technique. The improvement in identification performance may be due to applying supervised learning on top of the initially obtained unsupervised code vectors in the LVQ modeling technique [12].
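A minimal sketch of this LVQ1-style fine-tuning step follows; the linearly decaying learning-rate schedule and function name are illustrative assumptions.

```python
import numpy as np

def lvq1_finetune(W, code_labels, X, y, n_iter=5000, eta0=0.05, seed=0):
    """LVQ fine-tuning of SOM code vectors using class information.

    W: (n_codes, dim) code vectors from the SOM; code_labels: (n_codes,)
    class of each code vector; X, y: training frames and their speaker
    labels.
    """
    rng = np.random.default_rng(seed)
    for t in range(n_iter):
        i = rng.integers(len(X))                      # random input vector
        x = X[i]
        c = np.argmin(np.sum((W - x) ** 2, axis=1))   # nearest code vector
        eta = eta0 * (1.0 - t / n_iter)               # 0 < eta(t) < 1, decaying
        if code_labels[c] == y[i]:
            W[c] += eta * (x - W[c])   # classes agree: move toward input
        else:
            W[c] -= eta * (x - W[c])   # classes disagree: move away
    return W
```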

Table 3

Speaker Identification Performance (%) for the 50 Speakers of Our Own Database Using 20-s Training and Testing Data for the LVQ Modeling Technique.

Train/Test language   Codebook size
                      16    32    64    128   256
E/E                   60    68    76    80    84
H/H                   60    62    70    76    78
K/K                   58    66    70    76    80
H/E                   48    56    64    70    80
K/E                   58    60    62    68    76
E/H                   48    60    70    72    80
K/H                   40    48    56    70    72
E/K                   42    60    72    72    80
H/K                   50    54    62    64    64

6 Speaker Modeling using GMM-UBM

The expectation-maximization algorithm is used to estimate the parameters (mean vectors, covariance matrices, and mixture weights) of the GMMs [25]. The k-means algorithm is used to obtain the initial estimate for each cluster [10]. A gender-independent speaker-specific model is created by maximum a posteriori (MAP) adaptation of the UBM using the speaker-specific training speech; a sketch of the adaptation and scoring follows. The monolingual experimental results for the 50 speakers of our own database, using 20 s of speech data for training and testing and different numbers of Gaussian mixtures, are given in Table 4. The UBMs were trained in the English, Hindi, and Kannada languages with approximately 1800 s of speech data from the IITG-MV database [23] for the E/E, H/H, and K/K monolingual experiments, respectively. The speaker identification system trained and tested with English (E/E) gives the highest performance of 90% for 128 Gaussian mixtures. The performance of the system trained and tested with Hindi (H/H) is 82% for 64 and 128 Gaussian mixtures, and that of the system trained and tested with Kannada (K/K) is 82% for 128 and 256 Gaussian mixtures.
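The following sketch illustrates mean-only MAP adaptation and log-likelihood-ratio scoring with scikit-learn's GaussianMixture. Mean-only adaptation and the relevance factor r = 16 are common choices in the GMM-UBM literature [25] and are assumed here, as the text does not state these settings; function names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, X, r=16.0):
    """Mean-only MAP adaptation of a fitted UBM to one speaker's frames."""
    P = ubm.predict_proba(X)                            # responsibilities (n, K)
    n_k = P.sum(axis=0)                                 # soft counts per mixture
    E_k = (P.T @ X) / np.maximum(n_k, 1e-10)[:, None]   # per-mixture data means
    alpha = (n_k / (n_k + r))[:, None]                  # adaptation coefficients
    return alpha * E_k + (1 - alpha) * ubm.means_       # adapted mean vectors

def score(ubm, adapted_means, X_test):
    """Average log-likelihood ratio: adapted speaker model vs. UBM."""
    spk = GaussianMixture(n_components=ubm.n_components,
                          covariance_type=ubm.covariance_type)
    # Weights and covariances stay shared with the UBM; only means differ.
    spk.weights_ = ubm.weights_
    spk.covariances_ = ubm.covariances_
    spk.precisions_cholesky_ = ubm.precisions_cholesky_
    spk.means_ = adapted_means
    return spk.score(X_test) - ubm.score(X_test)
```

The UBM itself can be trained with, e.g., `GaussianMixture(n_components=128, covariance_type='diag').fit(ubm_frames)`; sklearn's default k-means initialization mirrors the initialization described above.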

Table 4

Speaker Identification Performance (%) for the 50 Speakers of Our Own Database Using 20-s Training and Testing Data for the GMM-UBM Modeling Technique.

Train/Test language   Gaussian mixtures
                      16    32    64    128   256
E/E                   64    76    84    90    88
H/H                   68    72    82    82    80
K/K                   64    80    80    82    82

6.1 GMM-UBM Crosslingual Experiments

In the GMM-UBM technique, the UBM is a general, speaker-independent model trained using a large set of speakers to represent the speaker-independent feature distribution [25]. In this subsection, we examine the effect of a training language-specific UBM and a multilanguage UBM in a crosslingual scenario. The training language-specific UBMs were trained in English (for E/H and E/K), Hindi (for H/E and H/K), and Kannada (for K/E and K/H), and the multilanguage UBMs were trained in multiple languages, such as Telugu, Tamil, Malayalam, Assamese, etc., each with approximately 1800 s of speech data from the IITG-MV database. Table 5 shows the effect of the training language-specific and multilanguage UBMs on crosslingual speaker identification for the 50 speakers of our own database using 20 s of train and test speech data.

Table 5

Crosslingual Speaker Identification Performance (%) for the 50 Speakers of Our Own Database Using 20-s Training and Testing Data for the GMM-UBM Modeling Technique.

Train/Test language   Technique                        Gaussian mixtures
                                                       16    32    64    128   256
H/E                   Training language-specific UBM   54    66    72    76    84
                      Multilanguage UBM                50    64    70    78    80
K/E                   Training language-specific UBM   60    68    70    74    78
                      Multilanguage UBM                56    60    64    74    72
E/H                   Training language-specific UBM   58    66    70    74    80
                      Multilanguage UBM                60    64    68    72    78
K/H                   Training language-specific UBM   46    60    70    74    78
                      Multilanguage UBM                50    60    66    76    76
E/K                   Training language-specific UBM   56    66    70    80    82
                      Multilanguage UBM                60    66    74    78    82
H/K                   Training language-specific UBM   54    60    66    70    70
                      Multilanguage UBM                52    60    60    66    66

Some of the observations we made from the monolingual and crosslingual results are as follows:

  • The results are better in the monolingual than in the crosslingual scenario. This may be due to the variation in fluency and word stress when the same speaker speaks different languages, and also due to the different phonetic and prosodic patterns of the languages [7].

  • The GMM-UBM classifier gives a better identification performance than the SOM and LVQ classifiers.

  • The performance of LVQ is better than that of the SOM modeling technique.

  • It is observed from the GMM-UBM crosslingual experiments that the training language-specific UBM gives better identification performance than the multilanguage UBM in a crosslingual scenario. This may be due to the presence of almost similar phonetic contents in multiple training languages that update the Gaussian components analogous to those phonemes [5].

7 Combination of Speaker-Specific Information from Different Languages for Crosslingual Speaker Identification

The crosslingual experimental results showed that the language mismatch between train and test data leads to considerable performance degradation. If a priori knowledge of the languages known to the speaker is available, it is possible to build a speaker identification system that is robust to the language-mismatch (crosslingual) environment. This section proposes a combination of speaker-specific information from different languages for crosslingual speaker identification. However, naively combining features doubles the number of feature vectors and thus increases the complexity of the speaker identification system. To counter this, frame reduction and smoothing are performed using an adaptive weighted-sum algorithm [21]. Figure 2 shows the block diagram of the proposed method.

Figure 2: Block Diagram of the Combination of Speaker-Specific Information from Different Languages for Crosslingual Speaker Identification.

Given a sequence of original feature vectors $F = (f[0], f[1], \ldots, f[L-1])^T$, where the $n$th vector $f_n$ contains $M$ feature elements, frame reduction with a sliding window of length $\alpha$ is obtained by Eq. (6) [21]:

(6)   $F_{\mathrm{reduction}}(m) = \dfrac{1}{N} \sum_{i=n}^{n+\alpha} w_i(m)\, f_{2n+i-1}(m)$,

where $f_i(m)$ denotes the $m$th feature element of the vector $f_i$ and $w_i(m)$ is the weight of that feature element. The weight values are derived from the Euclidean distance between feature vector frames within the sliding window. The subscript $(2n+i-1)$ implements frame skipping of the sliding window, which reduces the length of the feature sequence by 50%. $N$ is a normalization factor; for a sliding window of length $\alpha$, $N$ is given by Eq. (7) [21]:

(7)   $N = \sum_{i=1}^{\alpha} w_i(m)$.
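A sketch of the frame reduction and smoothing step follows. The exact distance-to-weight transformation of Ref. [21] is not fully specified in the text, so an inverse-distance weighting with one weight per frame is assumed here (Eq. (6) allows a separate weight per feature element), and the window is taken as α consecutive frames starting at frame 2n.

```python
import numpy as np

def frame_reduce_smooth(F, alpha=3):
    """Frame reduction and smoothing in the spirit of Eqs. (6)-(7).

    F: (n_frames, dim) feature sequence. Frame skipping by 2 halves the
    sequence length; each output frame is a normalized weighted sum of
    alpha input frames.
    """
    out = []
    for n in range((len(F) - alpha) // 2):
        win = F[2 * n : 2 * n + alpha]              # alpha frames at offset 2n
        centre = win.mean(axis=0)
        d = np.linalg.norm(win - centre, axis=1)    # distances within window
        w = 1.0 / (d + 1e-6)                        # assumed inverse-distance weights
        out.append((w[:, None] * win).sum(axis=0) / w.sum())  # Eqs. (6), (7)
    return np.asarray(out)
```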

The following steps are used in the speaker identification process (a sketch combining them appears after the list):

  1. Choose the training data of one language.

  2. Extract multitaper MFCC features.

  3. Perform frame reduction and feature smoothing.

  4. Choose the training data of another language.

  5. Extract multitaper MFCC features.

  6. Perform frame reduction and feature smoothing.

  7. Combine speaker-specific information.

  8. Generate the speaker model using GMM-UBM.

  9. Choose the testing data.

  10. Extract multitaper MFCC features.

  11. Compare test data with speaker models.

  12. Use the decision logic to find out the winner.
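The sketch below, referenced before the list, strings these steps together. It reuses the hypothetical helpers from the earlier sketches: multitaper MFCC extraction is assumed to have already produced the feature matrices, and frame_reduce_smooth, map_adapt_means, and score are the functions sketched above.

```python
import numpy as np

def build_combined_model(ubm, feats_lang1, feats_lang2, alpha=3, r=16.0):
    """Steps 1-8: reduce each language's frames, pool them, adapt the UBM."""
    f1 = frame_reduce_smooth(feats_lang1, alpha)   # steps 1-3 (language 1)
    f2 = frame_reduce_smooth(feats_lang2, alpha)   # steps 4-6 (language 2)
    combined = np.vstack([f1, f2])                 # step 7: combine information
    return map_adapt_means(ubm, combined, r)       # step 8: speaker model

def identify(ubm, speaker_means, X_test):
    """Steps 9-12: score test features against every speaker model and
    declare the highest-scoring speaker the winner."""
    scores = {s: score(ubm, m, X_test) for s, m in speaker_means.items()}
    return max(scores, key=scores.get)
```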

Speaker identification experiments are conducted using the 50 speakers of our own database. The multitaper MFCC is used as the feature, and GMM-UBM is used as the modeling technique. The training language-specific UBMs are trained using the IITG-MV database (controlled-environment, headphone speech data) with approximately 1800 s of speech data. A sliding window of length α = 3 is considered. The crosslingual experimental results for the 50 speakers of our own database using 20 s of training and testing data for different numbers of Gaussian mixtures are given in Table 6.

Table 6

Crosslingual Speaker Identification Performance (%) for the 50 Speakers of Our Own Database Using 20-s Training and Testing Data for Baseline (Multitaper MFCC-GMM-UBM) and Proposed System.

Train/Test language   Technique   Gaussian mixtures
                                  16    32    64    128   256
H/E                   Baseline    54    66    72    76    84
                      Proposed    60    70    74    80    88
K/E                   Baseline    60    68    70    74    78
                      Proposed    64    72    78    80    80
E/H                   Baseline    58    66    70    74    80
                      Proposed    62    74    74    82    80
K/H                   Baseline    46    60    70    74    78
                      Proposed    54    66    80    80    76
E/K                   Baseline    56    66    70    80    82
                      Proposed    58    66    74    82    82
H/K                   Baseline    54    60    66    70    70
                      Proposed    58    66    74    76    76

Some of the observations we made from the baseline results (Tables 2–5) and the proposed system results (Table 6) are as follows:

  • The speaker identification system trained in Hindi and tested in English language (H/E) gives the highest performance of 84% and 88% for baseline and proposed system, respectively.

  • The performance of the speaker identification system trained in Kannada and tested in English language (K/E) is 78% and 80% for baseline and proposed system, respectively.

  • The speaker identification system trained in English and tested in Hindi language (E/H) yields the highest performance of 80% and 82% for baseline and proposed system, respectively.

  • The performance of the speaker identification system trained in Kannada and tested in Hindi language (K/H) is 78% and 80% for baseline and proposed system, respectively.

  • The speaker identification system trained in English and tested in Kannada language (E/K) yields the highest performance of 82% for both the baseline and the proposed system.

  • The performance of the speaker identification system trained in Hindi and tested in Kannada language (H/K) is 70% and 76% for baseline and proposed system, respectively.

  • The results show that the combination of speaker-specific information from different languages performs better than using only one training language in all the crosslingual speaker identification experiments.

  • As in the previous sections, speaker identification performance is poorer here as well when Kannada is the training and/or testing language.

To illustrate the distribution of the features in the feature space for two different languages, we took 1 s of speech data in each of two languages (English and Hindi), computed only the first four multitaper MFCCs, i.e. c0, c1, c2, and c3, and plotted them, excluding c0, in Figure 3. It can be observed that combining speaker-specific information from different languages yields a larger number of feature vectors for the same speaker. Furthermore, the presence of feature vectors in different regions of the space indicates differing spectral information. Hence, the improvement in performance may be due to the different aspects of speaker-specific information captured in different languages.

Figure 3: Distribution of Features for 1 s of Speech Data for Two Different Languages (o – English and * – Hindi).

8 Limitations and Future Work

The issues in multilingual speaker recognition may be summarized as follows:

  • A major problem of carrying out research in a language-mismatched environment is the non-availability of standard multilingual databases for Indian languages and those of other multilingual countries. Creating a multilingual database is an important issue that needs to be addressed.

  • In real-life applications, it is not always possible to predict which languages a speaker who knows multiple languages will use for training and testing. Therefore, in a multilingual environment, obtaining better identification performance requires leveraging existing resources (language parameters) from resource-rich languages to reduce the language-mismatch impact, rather than using language-dependent systems.

  • As an alternative to the existing features, new features that are independent of the languages used in a multilingual scenario need to be explored.

  • A broad review of the literature has yet to reveal research measuring the pause rate of either the first or the second language of multilingual speakers.

  • The effectiveness of the methods needs to be verified with different languages, different data sizes, and a larger set of speakers.

9 Conclusions

In this work, we compared the performance of the SOM, LVQ, and GMM-UBM classifiers for monolingual and crosslingual speaker identification and found that the GMM-UBM classifier outperforms the others. We then examined the effect of a training language-specific UBM and a multilanguage UBM in the crosslingual mode using a multitaper MFCC-GMM-UBM system, and found that the training language-specific UBM gives better performance than the multilanguage background model under the language-mismatch condition. Furthermore, we proposed a combination of speaker-specific information from different languages for crosslingual speaker identification and observed that the combined feature gives better performance in all the crosslingual speaker identification experiments.


Corresponding authors: H.S. Jayanna, Siddaganga Institute of Technology, Department of Information Science and Engineering, Tumkur-572103, Karnataka, India, e-mail: ; and B.G. Nagaraja, Jain Institute of Technology, Department of Electronics and Communication Engineering, Davangere-577005, Karnataka, India, e-mail:

Bibliography

[1] M. J. Alam, T. Kinnunen, P. Kenny, P. Ouellet and D. O’Shaughnessy, Multitaper MFCC features for speaker verification using i-vectors, in: Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 547–552, 2011. doi:10.1109/ASRU.2011.6163886.

[2] P. H. Arjun, Speaker recognition in Indian languages: a feature based approach, Ph.D. dissertation, Indian Institute of Technology, Kharagpur, India, 2005.

[3] P. H. Arjun, S. Sitaram and E. Sharma, DA-IICT cross-lingual and multilingual corpora for speaker recognition, in: Proc. IEEE Int. Conf. on Advances in Pattern Recognition, Kolkata, pp. 187–190, 2009. doi:10.1109/ICAPR.2009.72.

[4] B. S. Atal, Automatic recognition of speakers from their voices, Proc. IEEE 64 (1976), 460–475. doi:10.1109/PROC.1976.10155.

[5] U. Bhattacharjee and K. Sarmah, A multilingual speech database for speaker recognition, in: Proc. IEEE Int. Conf. on Signal Processing, Computing and Control, Waknaghat Solan, pp. 1–5, 2012. doi:10.1109/ISPCC.2012.6224374.

[6] J. Campbell, Speaker recognition: a tutorial, Proc. IEEE 85 (1997), 1437–1462. doi:10.1109/5.628714.

[7] G. Durou, Multilingual text-independent speaker identification, in: Proc. Multi-lingual Interoperability in Speech Technology (MIST), Leusden, Netherlands, pp. 115–118, 1999.

[8] M. Faundez-Zanuy and A. Satu-Villar, Speaker recognition experiments on a bilingual database, in: Proc. 14th European Signal Processing Conference, Florence, Italy, pp. 261–264, 2006.

[9] T. E. Filgueiras, R. O. Messina and E. F. Cabral Jr., Learning vector quantization in text-independent automatic speaker recognition, in: Proc. 5th Brazilian Symposium, pp. 135–139, 1998.

[10] B. C. Haris, G. Pradhan, A. Misra, S. Shukla, R. Sinha and S. R. M. Prasanna, Multivariability speech database for robust speaker recognition, in: Proc. IEEE National Conference on Communications (NCC), pp. 1–5, 2011. doi:10.1109/NCC.2011.5734775.

[11] S. Haykin, Neural networks: a comprehensive foundation, Prentice-Hall, Upper Saddle River, NJ, 1999.

[12] H. S. Jayanna, Limited data speaker recognition, Ph.D. thesis, Indian Institute of Technology, Guwahati, India, 2009.

[13] Z. Jing, G. Wei-Guo and Y. Li-Ping, Mandarin-Sichuan dialect bilingual text-independent speaker verification using GMM, J. Comput. Appl. 28 (2008), 792–794. doi:10.3724/SP.J.1087.2008.00792.

[14] T. Kinnunen, E. Karpov and P. Fränti, Real-time speaker identification and verification, IEEE Trans. Audio Speech Lang. Process. 14 (2006), 277–288. doi:10.1109/TSA.2005.853206.

[15] T. Kinnunen, R. Saeidi, J. Sandberg and M. Hansson-Sandsten, What else is new than the Hamming window? Robust MFCCs for speaker recognition via multitapering, in: Proc. Interspeech, Makuhari, Japan, pp. 2734–2737, 2010. doi:10.21437/Interspeech.2010-724.

[16] T. Kinnunen, R. Saeidi, F. Sedlak, K. A. Lee, J. Sandberg, M. Hansson-Sandsten and H. Li, Low-variance multitaper MFCC features: a case study in robust speaker verification, IEEE Trans. Audio Speech Lang. Process. 20 (2012), 1990–2001. doi:10.1109/TASL.2012.2191960.

[17] N. T. Kleynhans and E. Barnard, Language dependence in multilingual speaker verification, in: Proc. Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, pp. 117–121, 2005.

[18] T. Kohonen, The self-organizing map, Proc. IEEE 78 (1990), 1464–1480. doi:10.1109/5.58325.

[19] B. G. Nagaraja and H. S. Jayanna, Combination of features for crosslingual speaker identification with the constraint of limited data, in: Proc. Fourth Int. Conf. on Signal and Image Processing, Springer, Coimbatore, Tamilnadu, India, pp. 143–148, 2012. doi:10.1007/978-81-322-0997-3_13.

[20] B. G. Nagaraja and H. S. Jayanna, Mono and cross lingual speaker identification with the constraint of limited data, in: Proc. IEEE PRIME-2012, Periyar University, Salem, Tamilnadu, India, pp. 457–461, 2012. doi:10.1109/ICPRIME.2012.6208386.

[21] S. Nuratch, P. Boonpramuk and C. Wutiwiwatchai, Feature smoothing and frame reduction for speaker recognition, in: Proc. IEEE Int. Conf. on Asian Language Processing, Harbin, pp. 311–314, 2010. doi:10.1109/IALP.2010.49.

[22] D. B. Percival and A. T. Walden, Spectral analysis for physical applications, Cambridge Univ. Press, Cambridge, UK, 1993. doi:10.1017/CBO9780511622762.

[23] G. Pradhan and S. R. M. Prasanna, Significance of vowel onset point information for speaker verification, Int. J. Comput. Commun. Technol. 2 (2011), 56–61. doi:10.47893/IJCCT.2013.1164.

[24] X. K. Qing and K. Chen, On use of GMM for multilingual speaker verification: an empirical study, in: Proc. Int. Symposium on Chinese Spoken Language Processing (ISCSLP 2000), Beijing, China, pp. 263–266, 2000.

[25] D. A. Reynolds and R. C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Trans. Speech Audio Process. 3 (1995), 72–83. doi:10.1109/89.365379.

[26] A. E. Rosenberg, Automatic speaker verification: a review, Proc. IEEE 64 (1976), 475–487. doi:10.1109/PROC.1976.10156.

[27] M. L. Westerlund, Classification with Kohonen self-organizing maps, Soft Computing, Haskoli Islands, 2005.

Received: 2014-9-3
Published Online: 2015-8-1
Published in Print: 2016-10-1

©2016 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
