
Employing Emotion Cues to Verify Speakers in Emotional Talking Environments

  • Ismail Shahin
Published/Copyright: February 24, 2015

Abstract

Usually, people talk neutrally in environments where there are no abnormal talking conditions such as stress and emotion. Emotional conditions that might affect people's talking tone include happiness, anger, and sadness, and such emotions are directly affected by the patient's health status. In neutral talking environments, speakers can be verified easily; in emotional talking environments, however, they cannot be verified as easily. Consequently, speaker verification systems do not perform as well in emotional talking environments as they do in neutral talking environments. In this work, a two-stage approach has been employed and evaluated to improve speaker verification performance in emotional talking environments. This approach employs the speaker's emotion cues (a text-independent and emotion-dependent speaker verification problem) based on both hidden Markov models (HMMs) and suprasegmental HMMs as classifiers. The approach is composed of two cascaded stages that combine and integrate an emotion recognizer and a speaker recognizer into one recognizer. The architecture has been tested on two different and separate emotional speech databases: our collected database and the Emotional Prosody Speech and Transcripts database. The results of this work show that the proposed approach gives promising results, with a significant improvement over previous studies and over other approaches such as the emotion-independent speaker verification approach and the emotion-dependent speaker verification approach based completely on HMMs.

1 Introduction

Listeners can obtain different types of information from speech signals. Such types of information are as follows: (i) speech recognition, which conveys information about the content of the speech signal; (ii) speaker recognition, which yields information about the speaker’s identity; (iii) emotion recognition, which gives information about the emotional state of the speaker; (iv) health recognition, which provides information on the patient’s health status; (v) language recognition, which produces information on the language being spoken; (vi) accent recognition, which generates information about the speaker’s accent; (vii) age recognition, which delivers information about the speaker’s age; (viii) gender recognition, which gives information about the speaker’s gender.

There are two types of speaker recognition: speaker identification and speaker verification (authentication). Speaker identification is the task of automatically determining who is speaking from a set of known speakers. Speaker verification is the task of automatically determining if a person really is the person he or she claims to be. Speaker verification can be used in intelligent health-care systems [6, 9, 25, 30]. Speaker verification systems are used in hospitals that include computerized emotion categorization and assessment techniques [30]. These systems can also be used in the pathological voice assessment (functional dysphonic voices) [6]. Dysphonia is the medical term for disorders of the voice: an impairment in the ability to produce voice sounds using the vocal organs. Thus, dysphonia is a phonation disorder. The dysphonic voice can be hoarse or excessively breathy, harsh, or rough [3]. Furthermore, speaker verification systems can be used in the diagnosis of Parkinson’s disease [9]. Max Little and his team at Massachusetts Institute of Technology (MIT) did some work on analyzing and evaluating the voice characteristics of patients who had been diagnosed with Parkinson’s disease. They discovered that they could create a tool to detect such a disease in the speech patterns of individuals [9]. In addition, speaker verification systems can be exploited to provide assistance to multidisciplinary evaluation teams as they evaluate each child who is referred for an assessment to determine if he/she is a child with a disability and in need of special education services. The verification of children with disabilities is one of the most important aspects of both federal law and state special education regulation [25].

Speaker recognition has been an interesting research field in the last few decades, which still yields a number of challenging problems. One of the most challenging problems that face speaker recognition systems is the low performance of such systems in emotional talking environments [13, 19, 23, 28]. Emotion-based speaker recognition is one of the vital research fields in the human-computer interaction or affective computing area [12]. The foremost goal of intelligent human-machine interaction is to enable computers with affective computing capability so that computers can verify the identity of the user in intelligent health-care services.

2 Prior Work

There are many studies [10, 18, 29] that focus on speaker verification in neutral talking environments. The authors of Ref. [10] addressed the issues related to language and speaker recognition, focusing on prosodic features extracted from speech signals. Their proposed approach was evaluated using the National Institute of Standards and Technology (NIST) language recognition evaluation 2003 and the extended data task of NIST speaker recognition evaluation 2003 for language and speaker recognition, respectively. The authors of Ref. [18] described the major elements of MIT Lincoln Laboratory’s Gaussian mixture model (GMM)-based speaker verification system in neutral talking environments. The authors of Ref. [29] focused their work on text-dependent speaker verification systems in such talking environments. In their proposed approach, they used suprasegmental and source features, besides spectral features, to verify speakers. The combination of suprasegmental, source, and spectral features significantly enhances the performance of speaker verification systems [29].

On the other hand, there are a limited number of studies [13, 19, 23, 28] that address the issue of speaker verification in emotional talking environments. The authors of Ref. [13] presented investigations into the effectiveness of the state-of-the-art speaker verification techniques: GMM-universal background model (GMM-UBM) and GMM-support vector machine (GMM-SVM) in mismatched noise conditions. The authors of Ref. [19] examined whether speaker verification algorithms that are trained in emotional environments yield better performance when applied to speech samples obtained under stressful or emotional conditions than those trained in neutral environments only. They concluded that training of speaker verification algorithms on a broader range of speech samples, including stressful and emotional talking conditions, rather than the neutral talking condition, is a promising method to enhance speaker authentication performance [19]. The author of Ref. [23] proposed, implemented, and tested a two-stage approach for speaker verification systems in emotional talking environments based entirely on hidden Markov models (HMMs). He tested the proposed approach using his collected speech database and obtained 84.1% as a speaker verification performance. The authors of Ref. [28] studied the influence of emotion on the performance of a GMM-UBM-based speaker verification system in such talking environments. In their work, they proposed an emotion-dependent score normalization technique for speaker verification on emotional speech. They achieved an average speaker verification performance of 88.5% [28].

The main contribution of this work is focused on employing and evaluating a two-stage approach to verify the claimed speaker in emotional talking environments. This approach consists of two recognizers that are combined and integrated into one recognizer using both HMMs and suprasegmental HMMs (SPHMMs) as classifiers. The two recognizers are an emotion identification recognizer followed by a speaker verification recognizer. Our present work focuses on enhancing the performance of text-independent and emotion-dependent speaker verification systems. This work deals with intersession variability caused by different emotional states of the claimed speaker. On the basis of the current approach, the claimed speaker should be registered in advance in the test set (closed set). Our present work is different from one of our prior works [24] that focused on identifying speakers based on a two-stage approach. In Ref. [24], the first stage is to identify the unknown emotion and the second stage is to identify the unknown speaker given that the emotion of the unknown speaker was identified.

The motivation of this work is that speaker verification systems do not perform as well in emotional talking environments as they do in neutral talking environments [13, 23, 28]. The proposed architecture aims at enhancing the degraded speaker verification performance in emotional talking environments by employing emotion cues. The present work is a continuation of one of our previous studies [23], which was devoted to proposing, implementing, and testing a two-stage approach to verify speakers in emotional talking environments based completely on HMMs as a classifier and using only the collected database. In addition, five extensive experiments have been conducted in the current work to assess the two-stage approach.

The remainder of this paper is organized as follows. The fundamentals of SPHMMs are covered in Section 3. Section 4 describes the two speech databases used in this work and the extraction of features. Section 5 discusses the two-stage approach and the experiments. Decision threshold is presented in Section 6. Section 7 demonstrates the results obtained in the present work, and their discussion. Finally, concluding remarks are presented in Section 8.

3 Fundamentals of SPHMMs

SPHMMs have been developed, implemented, and evaluated by the author in Refs. [20–22] in the fields of speaker recognition [21, 22] and emotion recognition [20]. SPHMMs have proven to be superior to HMMs for speaker recognition in both shouted [21] and emotional [22] talking environments.

A suprasegmental is a vocal effect that extends over many sound segments in an utterance, such as pitch and stress. It is usually associated with tone, vowel length, and features such as nasalization. SPHMMs have the ability to summarize several states of HMMs into what is termed a suprasegmental state. A suprasegmental state looks at the observation sequence through a larger window, which allows observations at rates suitable for the modeling task; prosodic information, for example, cannot be detected at the rate used for acoustic modeling. The prosodic features of a unit of speech are called suprasegmental features because they influence all segments of the unit. As a result, prosodic events at the levels of phone, syllable, word, and utterance are represented by means of suprasegmental states, while acoustic events are modeled using conventional hidden Markov states.

Within HMMs, prosodic and acoustic information can be combined as given in the following formula [15]:

(1)   log P(λv, Ψv | O) = (1 − α) · log P(λv | O) + α · log P(Ψv | O),

where α is a weighting factor whose value biases the combined score as follows:

(2)
  0 < α < 0.5: biased toward the acoustic model;
  0.5 < α < 1: biased toward the prosodic model;
  α = 0: biased completely toward the acoustic model, with no effect of the prosodic model;
  α = 0.5: not biased toward either model;
  α = 1: biased completely toward the prosodic model, with no impact of the acoustic model.

In Eq. (1), λv is the acoustic model of the vth emotion; Ψv is the suprasegmental model of the vth emotion; and O is the observation vector or sequence of an utterance.

P(λv|O) and P(Ψv|O) can be calculated using Bayes' theorem, as given in Eqs. (3) and (4), respectively [16]:

(3)   P(λv | O) = [P(O | λv) · P0(λv)] / P(O),
(4)   P(Ψv | O) = [P(O | Ψv) · P0(Ψv)] / P(O),

where P0(λv) and P0(Ψv) are the prior distributions of the acoustic model and the suprasegmental model, respectively.
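To make the combination in Eq. (1) concrete, the following minimal Python sketch mixes an acoustic log-likelihood with a suprasegmental log-likelihood under the weighting factor α. The function name and the example scores are illustrative only and are not part of the original system.

```python
def combined_log_prob(log_p_acoustic: float, log_p_suprasegmental: float,
                      alpha: float = 0.5) -> float:
    """Weighted combination of acoustic and prosodic evidence, as in Eq. (1).

    alpha = 0   -> rely only on the acoustic (conventional HMM) score
    alpha = 0.5 -> no bias toward either model
    alpha = 1   -> rely only on the suprasegmental (prosodic) score
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * log_p_acoustic + alpha * log_p_suprasegmental


# Illustrative usage with hypothetical log-likelihoods for one utterance:
score = combined_log_prob(log_p_acoustic=-1523.4, log_p_suprasegmental=-987.2, alpha=0.5)
```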

4 Speech Databases and Extraction of Features

4.1 Collected Database

The collected speech data corpus is composed of 20 male and 20 female untrained healthy adult native speakers of American English. Untrained speakers were selected to utter sentences naturally and to avoid exaggerated expressions. Each speaker was asked to utter eight sentences where each sentence was portrayed nine times under each of the neutral, angry, sad, happy, disgust, and fear emotions. The eight sentences were unbiased toward any emotion. These sentences are as follows:

  1. He works five days a week.

  2. The sun is shining.

  3. The weather is fair.

  4. The students study hard.

  5. Assistant professors are looking for promotion.

  6. University of Sharjah.

  7. Electrical and Computer Engineering Department.

  8. He has two sons and two daughters.

The first four sentences of this database were used in the training phase, while the last four sentences were used in the evaluation phase (text-independent experiment). The collected speech data corpus was captured in a clean environment by a speech acquisition board using a 16-bit linear coding A/D converter at a sampling rate of 16 kHz; the database therefore consists of wideband, 16-bit per sample linear speech data. The signal samples were preemphasized and then segmented into frames of 16 ms each with 9 ms overlap between consecutive frames.
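As an illustration of this front end, the short Python sketch below pre-emphasizes a signal and segments it into 16 ms frames with 9 ms overlap (a 7 ms hop) at 16 kHz. The pre-emphasis coefficient of 0.97 is a common default and is assumed here, since the paper does not report the value used.

```python
import numpy as np

def preemphasize(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """First-order pre-emphasis filter: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: float = 16.0, overlap_ms: float = 9.0) -> np.ndarray:
    """Cut a signal into 16 ms frames with 9 ms overlap (assumes at least one full frame)."""
    frame_len = int(sample_rate * frame_ms / 1000)            # 256 samples at 16 kHz
    hop = int(sample_rate * (frame_ms - overlap_ms) / 1000)   # 112 samples (7 ms hop)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]                                        # shape: (n_frames, frame_len)

# Example on one second of dummy audio at 16 kHz.
frames = frame_signal(preemphasize(np.random.randn(16000)))
```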

4.2 Emotional Prosody Speech and Transcripts (EPST) Database

The EPST data corpus was produced by the Linguistic Data Consortium [4]. This data corpus comprises eight professional speakers (three actors and five actresses) uttering a series of semantically neutral utterances composed of dates and numbers, spoken in 15 different emotions including the neutral state. Only six emotions were used in this work: neutral, hot anger, sadness, happiness, disgust, and panic. In this database, four utterances were used in the training phase, and four different utterances were used in the test phase.

4.3 Extraction of Features

In this work, the features that characterize the phonetic content of speech signals in the two databases are the mel-frequency cepstral coefficients (MFCCs). These coefficients have been broadly used in many studies in the fields of speech recognition [14, 30], speaker recognition [5, 28], and emotion recognition [1, 7, 8, 11, 27], because they outperform other coefficients in these three fields and because they offer a good approximation of human auditory perception.
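A possible MFCC front end consistent with the framing described in Section 4.1 is sketched below using librosa; the number of coefficients (13) is an assumption, since the paper does not state how many MFCCs were extracted per frame.

```python
import librosa

def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 13):
    """Compute MFCCs with a 16 ms analysis window and a 7 ms hop (9 ms overlap)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.016 * sr),       # 256-sample window
                                hop_length=int(0.007 * sr))  # 112-sample hop
    return mfcc.T  # one feature vector per frame
```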

Most of the work [2, 11, 26] performed in the last few decades in the fields of speech recognition, speaker recognition, and emotion recognition with HMMs has used left-to-right HMMs (LTRHMMs), because phonemes follow a strictly left-to-right sequence. In this work, left-to-right suprasegmental HMMs (LTRSPHMMs) have been derived from LTRHMMs. Figure 1 shows an example of the basic structure of LTRSPHMMs obtained from LTRHMMs. In this figure, q1, q2, …, q6 are conventional hidden Markov states; p1 is a suprasegmental state that consists of q1, q2, and q3; p2 is a suprasegmental state that is made up of q4, q5, and q6; p3 is a suprasegmental state that is composed of p1 and p2; aij is the transition probability between the ith and jth conventional hidden Markov states; and bij is the transition probability between the ith and jth suprasegmental states.

Figure 1: Basic Structure of LTRSPHMMs.

In this work, the number of conventional states of LTRHMMs, N, is six. The number of mixture components, M, is 10 per state, and a continuous mixture observation density is selected for these models. In LTRSPHMMs, the number of suprasegmental states is two; therefore, every three conventional states of LTRHMMs are summarized into one suprasegmental state.
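The paper does not name a modeling toolkit. As one possible realization of the acoustic LTRHMM described above, the sketch below configures a six-state, ten-mixture left-to-right model with hmmlearn; the exact transition structure (self-loop or advance by one state) is an assumption about the topology.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

N, M = 6, 10  # conventional states and Gaussian mixture components per state

# Left-to-right transition matrix: each state either stays or moves one state forward.
transmat = np.zeros((N, N))
for i in range(N - 1):
    transmat[i, i] = transmat[i, i + 1] = 0.5
transmat[N - 1, N - 1] = 1.0

startprob = np.zeros(N)
startprob[0] = 1.0  # every utterance starts in the first state

model = GMMHMM(n_components=N, n_mix=M, covariance_type="diag",
               init_params="mcw", params="stmcw", n_iter=20)
model.startprob_ = startprob
model.transmat_ = transmat
# model.fit(features, lengths) would then be called on stacked MFCC frames.
```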

5 Speaker Verification Based on the Two-Stage Approach and the Experiments

Given a registered speaker talking in m emotions, the overall proposed approach to verify the claimed speaker based on his/her emotion cues is shown in Figure 2. The aim of the two-stage approach is to deal with intersession variability caused by different emotional states of the claimed speaker. Figure 2 shows that the overall two-stage architecture is composed of two cascaded stages. The two stages are as follows:

1. Stage a: Emotion Identification

Figure 2: Block Diagram of the Overall Two-Stage Approach.

The first stage of the overall approach is to identify the unknown emotion that belongs to the claimed speaker (emotion identification problem). In this stage, m probabilities are computed on the basis of SPHMMs and the maximum probability is chosen as the identified emotion as given in the following formula:

(5)   E* = arg max_{1 ≤ e ≤ m} {P(O | λe, Ψe)},

where E* is the index of the identified emotion; O is the observation sequence of the unknown emotion that belongs to the claimed speaker; and P(O|λe, Ψe) is the probability of the observation sequence O of the unknown emotion that belongs to the claimed speaker given the eth SPHMM emotion model (λe, Ψe).

The eth SPHMM emotion model has been derived in the training phase for every emotion using the 40 speakers uttering all of the first four sentences, with nine repetitions per sentence. Therefore, the total number of utterances used to derive each SPHMM emotion model in this phase is 1440 (40 speakers × 4 sentences × 9 utterances/sentence). The SPHMM training phase is very similar to the conventional HMM training phase; in it, suprasegmental models are trained on top of the acoustic models of HMMs. A block diagram of this stage is illustrated in Figure 3.
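A minimal sketch of stage a is given below. It assumes each emotion has a trained acoustic model and a trained suprasegmental model, each exposing a score(features) method that returns a log-likelihood (as hmmlearn models do); this interface is an assumption used only to illustrate the arg max of Eq. (5).

```python
import numpy as np

def identify_emotion(features, emotion_models, alpha: float = 0.5) -> str:
    """Stage a: return the emotion whose SPHMM gives the largest combined score, Eq. (5).

    emotion_models maps an emotion label to a pair (acoustic_model, suprasegmental_model),
    each assumed to provide score(features) = log P(O | model).
    """
    best_emotion, best_score = None, -np.inf
    for emotion, (acoustic, suprasegmental) in emotion_models.items():
        score = ((1.0 - alpha) * acoustic.score(features)
                 + alpha * suprasegmental.score(features))
        if score > best_score:
            best_emotion, best_score = emotion, score
    return best_emotion
```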

2. Stage b: Speaker Verification

Figure 3: Block Diagram of Stage a of the Overall Two-Stage Approach.

The next stage of the two-stage approach is to verify the speaker identity based on HMMs given that his/her emotion was identified in the previous stage (emotion-specific speaker verification problem), as given in the following formula:

(6)   Λ(O) = log[P(O | E*)] − log[P(O | E̅*)],

where Λ(O) is the log-likelihood ratio in the log domain; P(O|E*) is the probability of the observation sequence O that belongs to the claimed speaker given the true identified emotion; and P(O|E̅*) is the probability of the observation sequence O that belongs to the claimed speaker given the false identified emotion. Equation (6) shows that the likelihood ratio is computed between models trained using data from the claimed speaker and recognized emotion.

The probability of the observation sequence O that belongs to the claimed speaker given the true identified emotion can be computed as [17]

(7)   log P(O | E*) = (1/T) Σ_{t=1}^{T} log P(ot | E*),

where O= o1, o2, …, ot, …, oT.

The probability of the observation sequence O that belongs to the claimed speaker given the false identified emotion can be computed using a set of B imposter emotion models {E̅1, E̅2, …, E̅B} as

(8)   log P(O | E̅*) = (1/B) Σ_{b=1}^{B} log[P(O | E̅b)],

where P(O | E̅b) can be computed using Eq. (7). The value of B in this work is equal to 6 − 1 = 5 emotions. Figure 4 demonstrates a block diagram of this stage.
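The sketch below shows how the stage-b score of Eqs. (6)–(8) could be computed. It assumes the claimed speaker has one emotion-specific HMM per emotion, each exposing score(features) = log P(O | model); the division by the frame count T follows the per-frame form of Eq. (7). The model interface is an assumption.

```python
import numpy as np

def verification_score(features, claimant_models, identified_emotion) -> float:
    """Log-likelihood ratio of Eq. (6) for the claimed speaker.

    claimant_models maps each emotion label to the claimed speaker's HMM for that emotion.
    The identified emotion supplies the true-emotion score; the remaining B = 5 emotions
    act as imposter emotion models, averaged as in Eq. (8).
    """
    T = len(features)
    true_score = claimant_models[identified_emotion].score(features) / T
    imposter_emotions = [e for e in claimant_models if e != identified_emotion]
    imposter_score = np.mean(
        [claimant_models[e].score(features) / T for e in imposter_emotions])
    return true_score - imposter_score
```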

Figure 4: Block Diagram of Stage b of the Overall Two-Stage Approach.

In the evaluation phase, each of the 40 speakers used nine utterances per sentence of the last four sentences (text-independent) under each emotion. The total number of utterances used in this phase is 8640 (40 speakers × 4 sentences × 9 utterances/sentence × 6 emotions). In this work, 34 speakers (17 per gender) are used as claimants, and the remaining speakers are used as imposters.

6 Decision Threshold

Two types of error can take place in a speaker verification problem, namely false rejection (miss probability) and false acceptance (false alarm probability). When a valid identity claim is rejected, it is called a false rejection error; on the other hand, when an identity claim from an imposter is accepted, it is called a false acceptance error.
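For illustration, the sketch below computes both error rates from claimant and imposter trial scores and locates the equal error rate (EER) used later in Section 7, i.e. the operating point where the two rates coincide; the score arrays themselves would come from verification trials and are hypothetical here.

```python
import numpy as np

def error_rates(claimant_scores, imposter_scores, threshold):
    """False rejection rate and false acceptance rate at a given threshold."""
    frr = np.mean(np.asarray(claimant_scores) < threshold)   # true claims rejected
    far = np.mean(np.asarray(imposter_scores) >= threshold)  # imposter claims accepted
    return frr, far

def equal_error_rate(claimant_scores, imposter_scores):
    """Sweep candidate thresholds and return the rate where FRR and FAR are closest."""
    candidates = np.sort(np.concatenate([claimant_scores, imposter_scores]))
    rates = [error_rates(claimant_scores, imposter_scores, t) for t in candidates]
    frr, far = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (frr + far) / 2.0
```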

A speaker verification problem based on emotion identification requires making a binary decision based on two hypotheses: hypothesis H0, that the claimed speaker belongs to the true emotion, and hypothesis H1, that the claimed speaker comes from a false emotion.

The log-likelihood ratio in the log domain can be defined as

(9)   Λ(O) = log[P(O | λC, ΨC)] − log[P(O | λC̅, ΨC̅)],

where O is the observation sequence of the claimed speaker; (λC, ΨC) is the SPHMM claimant emotion model; P(O | λC, ΨC) is the probability that the claimed speaker belongs to a true identified emotion; (λC̅, ΨC̅) is the SPHMM imposter emotion model; and P(O | λC̅, ΨC̅) is the probability that the claimed speaker comes from a false identified emotion.

The last step in the verification process is to compare the log-likelihood ratio with the threshold (θ) in order to accept or reject the claimed speaker, i.e.,

Accept the claimed speaker if Λ(O) ≥ θ;
Reject the claimed speaker if Λ(O) < θ.

Open-set speaker verification often uses thresholding to decide whether a speaker is outside the set. Both types of error in the speaker verification problem depend on the threshold used in the decision-making process. A tight threshold makes it difficult for false speakers to be falsely accepted, but at the expense of falsely rejecting true speakers. In contrast, a loose threshold allows true speakers to be accepted consistently, at the expense of falsely accepting false speakers. In order to set a threshold that meets a desired trade-off between true speaker rejection and false speaker acceptance, it is essential to know the distributions of true speaker and false speaker scores. An acceptable procedure for setting the threshold is to assign a loose initial value and then let it adjust by setting it to the average of up-to-date trial scores; such a loose initial value gives inadequate protection against false speaker trials.
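A minimal sketch of this threshold-setting procedure is given below: the threshold starts from a loose initial value and is then re-centred on the average of recent trial scores after each trial. The window length of 50 trials is an assumption, since the paper does not specify how many recent scores are averaged.

```python
from collections import deque

class AdaptiveThreshold:
    """Loose initial threshold that tracks the mean of up-to-date trial scores."""

    def __init__(self, initial_threshold: float, window: int = 50):
        self.threshold = initial_threshold
        self.recent_scores = deque(maxlen=window)  # most recent trial scores

    def decide(self, llr: float) -> bool:
        """Accept the claimed speaker if the log-likelihood ratio reaches the threshold."""
        accept = llr >= self.threshold
        self.recent_scores.append(llr)
        # Re-centre the threshold on the average of the recent trials.
        self.threshold = sum(self.recent_scores) / len(self.recent_scores)
        return accept
```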

7 Results and Discussion

In the current work, a two-stage approach based on both HMMs and SPHMMs as classifiers has been employed and tested by separately using the collected and EPST databases when α = 0.5 for speaker verification in emotional talking environments. This specific value of α has been chosen to avoid biasing toward either the acoustic or prosodic model.

Tables 1 and 2 show the confusion matrices of stage a using the collected and EPST databases, respectively. The two matrices represent the percentage of confusion of the unknown emotion with the other emotions based on SPHMMs. Table 1 (for example) demonstrates the following:

  1. The most easily recognizable emotion is neutral (99%). Hence, the performance of verifying speakers talking neutrally is the highest compared to that of verifying speakers talking in other emotions, as shown in Table 3 (lowest percentage equal error rate, EER) using the same database.

  2. The least easily recognizable emotion is angry (86%). Therefore, speaker verification performance when speakers talk in the angry emotion is the lowest compared to that when speakers talk in other emotions, as shown in Table 3 (highest percentage EER) using the same database.

  3. Column 3 (angry emotion), for example, shows that 2% of the utterances that were portrayed in angry emotion were evaluated as produced in neutral state; 3% of the utterances that were uttered in angry emotion were recognized as generated in sad emotion. This column shows that angry emotion has the highest confusion percentage with disgust emotion (6%). Therefore, angry emotion is highly confusable with disgust emotion. The column also shows that angry emotion has the least confusion percentage with happy emotion (1%).

Table 1:

Confusion Matrix Based on SPHMMs of Stage a Using the Collected Database When α = 0.5 (Percentage of Confusion of the Unknown Emotion with Other Emotions).

Model      Neutral   Angry   Sad    Happy   Disgust   Fear
Neutral    99%       2%      1%     1%      0%        0%
Angry      0%        86%     1%     1%      3%        5%
Sad        0%        3%      96%    2%      4%        1%
Happy      0%        1%      0%     92%     1%        1%
Disgust    0%        6%      0%     2%      87%       3%
Fear       1%        2%      2%     2%      5%        90%
Table 2:

Confusion Matrix Based on SPHMMs of Stage a Using the EPST Database When α = 0.5 (Percentage of Confusion of the Unknown Emotion with Other Emotions).

Model       Neutral   Hot Anger   Sad    Happy   Disgust   Panic
Neutral     99%       4%          1%     3%      1%        1%
Hot anger   0%        88%         1%     1%      4%        4%
Sad         0%        1%          97%    1%      3%        1%
Happy       1%        1%          0%     94%     0%        0%
Disgust     0%        5%          0%     1%      86%       3%
Panic       0%        1%          1%     0%      6%        91%
Table 3:

Percentage Equal Error Rate Based on the Two-Stage Approach Using the Collected and EPST Databases When α = 0.5.

Emotion            Collected Database EER (%)   EPST Database EER (%)
Neutral            1.5                          2
Angry/hot anger    10.5                         12
Sad                8                            7.5
Happy              8.5                          9
Disgust            9.5                          10.5
Fear/panic         8.5                          8

Table 3 shows percentage EER in emotional talking environments based on the two-stage framework using each of the collected and EPST databases when α = 0.5. This table indicates that the average value of EER using the collected database is 7.75%, while the average value of EER using the EPST database is 8.17%. The table shows that the least value of EER happens when the claimed speaker speaks neutrally, while the highest value of EER occurs when the claimed speaker talks in angry emotion. This table shows that the percentage EER under all emotions, except under the neutral state, is high. This high percentage EER may be attributed to the following reasons:

  1. The emotion of the claimed speaker is not always correctly identified. The average emotion identification performance based on SPHMMs is 91.67% and 92.15% using the collected and EPST databases, respectively.

  2. The verification stage (stage b) introduces additional performance degradation on top of the degradation in emotion identification performance, because some claimants are rejected as imposters and some imposters are accepted as claimants. Therefore, the EER given in Table 3 is the result of the EERs of both stage a and stage b. As the performance of the emotion identification stage is imperfect, the two-stage framework can have a negative impact on the overall performance, especially when the emotion in stage a has been falsely identified.

The authors of Ref. [28] achieved an average EER of 11.48% in emotional talking environments using GMM-UBM based on an emotion-independent method. In the present work, the achieved average EER based on the two-stage approach is less than that obtained based on their method [28]. The author of Ref. [23] obtained 15.9% as an average EER in emotional talking environments based on HMMs only. It is evident that the attained results of average EER based on the two-stage approach are less than those achieved in Ref. [23].

Five extensive experiments have been conducted in this work to assess the achieved results based on the two-stage architecture. The five experiments are as follows:

  1. The percentage EER based on the two-stage approach is compared with that based on an emotion-independent speaker verification approach using the collected and EPST databases separately. The average EER obtained using the emotion-independent approach based on HMMs only is given in Table 4 for each of the collected and EPST databases. From this table, the average value of EER using the collected and EPST databases is 14.75% and 14.58%, respectively.

Table 4:

Percentage Equal Error Rate Based on the Emotion-Independent Approach Using the Collected and EPST Databases.

Emotion            Collected Database EER (%)   EPST Database EER (%)
Neutral            6                            6
Angry/hot anger    18.5                         18
Sad                13.5                         13.5
Happy              15.5                         15.5
Disgust            16.5                         16.5
Fear/panic         18.5                         18

A statistical significance test has been performed to show whether EER differences (EER based on the two-stage framework and that based on the emotion-independent approach) are real or simply due to statistical fluctuations. The statistical significance test has been carried out based on the Student’s t distribution test as given in the following formula:

(10)   t1,2 = (x̄1 − x̄2) / SDpooled,

where x̅1 is the mean of the first sample of size n; x̅2 is the mean of the second sample of the same size; and SDpooled is the pooled standard deviation of the two samples given as

(11)   SDpooled = √[(SD1² + SD2²) / 2],

where SD1 is the standard deviation of the first sample of size n and SD2 is the standard deviation of the second sample of the same size.

In this work, x̅3,collect = 7.75, SD3,collect = 2.91, x̅3,EPST = 8.17, SD3,EPST = 3.14, x̅4,collect = 14.75, SD4,collect = 4.28, x̅4,EPST = 14.58, and SD4,EPST = 4.14. These values have been calculated from Table 3 (collected and EPST databases) and Table 4 (collected and EPST databases), respectively. On the basis of these values, the calculated t value using the collected database of Tables 3 and 4 is t4,3(collected) = 1.913, and the calculated t value using the EPST database of Tables 3 and 4 is t4,3(EPST) = 1.745. Each calculated t value is higher than the tabulated critical value at the 0.05 significance level, t0.05 = 1.645. Therefore, the conclusion drawn from this experiment is that the two-stage speaker verification approach outperforms the emotion-independent speaker verification approach. Hence, inserting an emotion identification stage into a speaker verification system in emotional talking environments significantly enhances speaker verification performance compared to a system without such a stage.
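The reported t values can be reproduced from the per-emotion EERs of Tables 3 and 4, as in the sketch below. The population standard deviation (ddof = 0) is assumed, since it matches the SD values quoted above.

```python
import numpy as np

def students_t(sample_1, sample_2):
    """t statistic of Eqs. (10) and (11) with a pooled standard deviation."""
    x1, x2 = np.asarray(sample_1, dtype=float), np.asarray(sample_2, dtype=float)
    sd_pooled = np.sqrt((x1.std(ddof=0) ** 2 + x2.std(ddof=0) ** 2) / 2.0)
    return (x1.mean() - x2.mean()) / sd_pooled

eer_two_stage = [1.5, 10.5, 8.0, 8.5, 9.5, 8.5]      # Table 3, collected database
eer_one_stage = [6.0, 18.5, 13.5, 15.5, 16.5, 18.5]  # Table 4, collected database
t_collected = students_t(eer_one_stage, eer_two_stage)
# t_collected is about 1.91, above the critical value t_0.05 = 1.645.
```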

  2. In stage a, the m probabilities are computed based on SPHMMs. To compare the impact of using acoustic features for emotion identification (stage a) with that of using suprasegmental features, Eq. (5) becomes

(12)   E* = arg max_{1 ≤ e ≤ m} {P(O | λe)}.

Therefore, the m probabilities in this experiment are computed on the basis of HMMs. The percentage EER obtained when employing emotion cues with the two-stage approach, using HMMs only in both stage a and stage b, is given in Table 5 for the collected and EPST databases. The average value of EER using the collected and EPST databases is 15.58% and 14.50%, respectively.

Table 5:

Percentage Equal Error Rate Based on HMMs Only in Both Stage a and Stage b Using the Collected and EPST Databases.

Emotion            Collected Database EER (%)   EPST Database EER (%)
Neutral            8                            7
Angry/hot anger    20.5                         18.5
Sad                15.5                         14.5
Happy              15                           14
Disgust            16.5                         15.5
Fear/panic         18                           17.5

In this experiment, x̅5,collect = 15.58, SD5,collect = 3.85, x̅5,EPST = 14.50, SD5,EPST = 3.71. From these values, the calculated t value of both Tables 3 and 5 using the collected database is t5,3 (collect) = 2.294 and the calculated t value of the two tables using the EPST database is t5,3 (EPST) = 1.842. Each calculated t value is larger than t0.05 = 1.645. Therefore, the percentage EER based on using SPHMMs in stage a is lower than that based on using HMMs in the same stage. It can be concluded from this experiment that SPHMMs are superior to HMMs for speaker verification in emotional talking environments.

Figures 5 and 6 show detection error trade-off (DET) curves using the collected and EPST databases, respectively. Each curve compares speaker verification in emotional talking environments based on the two-stage approach with that based on the emotion-independent approach. These two figures evidently demonstrate that the two-stage approach is superior to the emotion-independent approach for speaker verification in emotional talking environments.

  3. The two-stage approach has been evaluated for different values of α. Figures 7 and 8 show the average percentage EER based on the two-stage framework for different values of α (0.0, 0.1, …, 0.9, 1.0) using the collected and EPST databases, respectively. The two figures indicate that increasing the value of the weighting factor has a significant effect on reducing the EER and, hence, improving speaker verification performance in emotional talking environments (excluding the neutral state) based on the two-stage architecture. Therefore, it is apparent, based on this architecture, that suprasegmental HMMs have more influence on speaker verification performance in such talking environments than acoustic HMMs. These two figures also show that the lowest percentage EER occurs when the classifiers are entirely biased toward suprasegmental models, with no impact of acoustic models (α = 1).

  4. The two-stage approach has been assessed for the worst-case scenario, which occurs when stage b receives a false input (a falsely identified emotion) from stage a. The average percentage EER for the worst-case scenario based on SPHMMs when α = 0.5 is 15.11% and 15.25% using the collected and EPST databases, respectively. These values are very close to those attained using the one-stage approach (14.75% and 14.58% using the collected and EPST databases, respectively). It can be concluded from this experiment that the percentage EER for the worst-case scenario based on the two-stage approach is very close to that based on the one-stage approach.

  5. An informal subjective assessment of the two-stage approach has been performed with 10 nonprofessional listeners (human judges) using the collected speech data corpus. A total of 960 utterances (20 speakers × 2 genders × 6 emotions × the last 4 sentences of the database) have been used in this assessment. During the evaluation, each listener is asked two separate questions for every test utterance: identify the unknown emotion, and verify the claimed speaker given that the unknown emotion was identified. The average emotion identification performance and the average speaker verification performance are 90.5% and 88.14%, respectively.

Figure 5: DET Curve Based on Each of the Two-Stage and Emotion-Independent Approaches Using the Collected Database.

Figure 6: DET Curve Based on Each of the Two-Stage and Emotion-Independent Approaches Using the EPST Database.

Figure 7: Average EER (%) versus α Based on the Two-Stage Approach Using the Collected Database.

Figure 8: Average EER (%) versus α Based on the Two-Stage Approach Using the EPST Database.

8 Concluding Remarks

This work employed and evaluated a two-stage approach that combines and integrates an emotion recognizer and a speaker recognizer into one recognizer, using both HMMs and SPHMMs as classifiers, to enhance speaker verification performance in emotional talking environments. Several experiments have been carried out separately in such environments using two different and separate speech databases. Some conclusions can be drawn from this work. First, the emotional state of the speaker has a negative impact on speaker verification performance. Second, the significant improvement of speaker verification performance in emotional talking environments based on the two-stage approach shows that such an approach is promising. Third, the emotion-dependent speaker verification architecture is superior to the emotion-independent speaker verification architecture (one-stage approach); therefore, emotion cues contribute significantly to alleviating the deteriorated speaker verification performance in these talking environments. Fourth, suprasegmental HMMs outperform conventional HMMs for speaker verification in such talking environments; furthermore, the highest speaker verification performance occurs when the classifiers are completely biased toward suprasegmental models, with no influence of acoustic models. Finally, the two-stage recognizer performs almost the same as the one-stage recognizer when the second stage (stage b) receives a falsely identified emotion from the first stage (stage a).

There are some limitations in this work. First, the processing computations and the time consumed in the two-stage approach are slightly greater than those in the one-stage approach. Second, the two-stage approach requires all emotions of the claimed speaker to be available to the system in the training phase. Hence, the two-stage architecture is restricted to a closed set case. Finally, speaker verification performance based on the two-stage approach is limited. This is because the performance of the overall approach is a resultant of the performances of both stage a and stage b. As the performance of each stage is imperfect, the overall performance is consequently imperfect.


Corresponding author: Ismail Shahin, Department of Electrical and Computer Engineering, University of Sharjah, P.O. Box 27272, Sharjah, United Arab Emirates, Tel.: +(971) 6 5050967, Fax: +(971) 6 5050877, e-mail:

Bibliography

[1] H. Bao, M. Xu and T. F. Zheng, Emotion attribute projection for speaker recognition on emotional speech, in: INTERSPEECH 2007, Antwerp, Belgium, pp. 758–761, August 2007. doi: 10.21437/Interspeech.2007-142.

[2] L. T. Bosch, Emotions, speech and the ASR framework, Speech Commun. 40 (2003), 213–225. doi: 10.1016/S0167-6393(02)00083-3.

[3] Dysphonia, http://en.wikipedia.org/wiki/Dysphonia, Accessed 8 January 2015.

[4] Emotional Prosody Speech and Transcripts database, www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S28, Accessed 10 May 2014.

[5] T. H. Falk and W. Y. Chan, Modulation spectral features for robust far-field speaker identification, IEEE T. Audio Speech Language Process. 18 (2010), 90–100. doi: 10.1109/TASL.2009.2023679.

[6] C. Fredouille, G. Pouchoulin, J.-F. Bonastre, M. Azzarello, A. Giovanni and A. Ghio, Application of automatic speaker recognition techniques to pathological voice assessment (dysphonia), in: Proceedings of European Conference on Speech Communication and Technology (Eurospeech 2005), pp. 149–152, 2005. doi: 10.21437/Interspeech.2005-90.

[7] A. B. Kandali, A. Routray and T. K. Basu, Emotion recognition from Assamese speeches using MFCC features and GMM classifier, in: Proc. IEEE Region 10 Conference TENCON 2008, Hyderabad, India, pp. 1–5, November 2008. doi: 10.1109/TENCON.2008.4766487.

[8] A. B. Kandali, A. Routray and T. K. Basu, Vocal emotion recognition in five native languages of Assam using new wavelet features, Int. J. Speech Technol. 12 (2009), 1–13. doi: 10.1007/s10772-009-9046-4.

[9] B. Kepes, Using technology to diagnose Parkinson's disease, http://www.forbes.com/sites/benkepes/2014/05/27/using-technology-to-diagnose-parkinsons-disease/, Accessed 8 January 2015.

[10] L. Mary and B. Yegnanarayana, Extraction and representation of prosodic features for language and speaker recognition, Speech Commun. 50 (2008), 782–796. doi: 10.1016/j.specom.2008.04.010.

[11] T. L. Nwe, S. W. Foo and L. C. De Silva, Speech emotion recognition using hidden Markov models, Speech Commun. 41 (2003), 603–623. doi: 10.1016/S0167-6393(03)00099-2.

[12] R. W. Picard, Affective computing, MIT Media Laboratory Perceptual Computing Section Technical Report No. 321, 1995.

[13] S. G. Pillay, A. Ariyaeeinia, M. Pawlewski and P. Sivakumaran, Speaker verification under mismatched data conditions, IET Signal Process. 3 (2009), 236–246. doi: 10.1049/iet-spr.2008.0175.

[14] V. Pitsikalis and P. Maragos, Analysis and classification of speech signals by generalized fractal dimension features, Speech Commun. 51 (2009), 1206–1223. doi: 10.1016/j.specom.2009.06.005.

[15] T. S. Polzin and A. H. Waibel, Detecting emotions in speech, in: Cooperative Multimodal Communication, Second International Conference (CMC 1998), 1998.

[16] L. R. Rabiner and B. H. Juang, Fundamentals of speech recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.

[17] D. A. Reynolds, Automatic speaker recognition using Gaussian mixture speaker models, The Lincoln Laboratory Journal 8 (1995), 173–192.

[18] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Process. 10 (2000), 19–41. doi: 10.1006/dspr.1999.0361.

[19] K. R. Scherer, T. Johnstone, G. Klasmeyer and T. Banziger, Can automatic speaker verification be improved by training the algorithms on emotional speech?, in: Proceedings of International Conference on Spoken Language Processing, vol. 2, pp. 807–810, October 2000. doi: 10.21437/ICSLP.2000-392.

[20] I. Shahin, Speaking style authentication using suprasegmental hidden Markov models, Univ. Sharjah J. Pure Appl. Sci. 5 (2008), 41–65.

[21] I. Shahin, Speaker identification in the shouted environment using suprasegmental hidden Markov models, Signal Process. J. 88 (2008), 2700–2708. doi: 10.1016/j.sigpro.2008.05.012.

[22] I. Shahin, Speaker identification in emotional environments, Iran. J. Elec. Comput. Eng. 8 (2009), 41–46.

[23] I. Shahin, Verifying speakers in emotional environments, in: The 9th IEEE International Symposium on Signal Processing and Information Technology, Ajman, United Arab Emirates, pp. 328–333, December 2009. doi: 10.1109/ISSPIT.2009.5407568.

[24] I. Shahin, Identifying speakers using their emotion cues, Int. J. Speech Technol. 14 (2011), 89–98. doi: 10.1007/s10772-011-9089-1.

[25] Verification guidelines for children with disabilities, Technical Assistance Document, Nebraska Department of Education, Special Education Office, September 2008.

[26] D. Ververidis and C. Kotropoulos, Emotional speech recognition: resources, features, and methods, Speech Commun. 48 (2006), 1162–1181. doi: 10.1016/j.specom.2006.04.003.

[27] T. Vogt and E. Andre, Improving automatic emotion recognition from speech via gender differentiation, in: Proceedings of Language Resources and Evaluation Conference, Genoa, Italy, 2006.

[28] W. Wu, T. F. Zheng, M. X. Xu and H. J. Bao, Study on speaker verification on emotional speech, in: INTERSPEECH 2006 – Proceedings of International Conference on Spoken Language Processing, pp. 2102–2105, September 2006.

[29] B. Yegnanarayana, S. R. M. Prasanna, J. M. Zachariah and C. S. Gupta, Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system, IEEE T. Speech Audio Process. 13 (2005), 575–582. doi: 10.1109/TSA.2005.848892.

[30] G. Zhou, J. H. L. Hansen and J. F. Kaiser, Nonlinear feature based classification of speech under stress, IEEE T. Speech Audio Process. 9 (2001), 201–216. doi: 10.1109/89.905995.

Received: 2014-8-9
Published Online: 2015-2-24
Published in Print: 2016-1-1

©2016 by De Gruyter
