
Gender Differences in Acoustic-Perceptual Mapping of Emotional Prosody in Mandarin Speech

Xuyi Wang, Ruomei Fang and Hongwei Ding
Published/Copyright: December 6, 2024

Abstract

The discrepancies in the existing literature regarding the gender/sex effect on voice-emotion mapping have left the nature of cross-gender differences unclear. To enrich the knowledge of gender differences in acoustic-perceptual mapping in emotional communication, the present study employed an acoustic-integrated approach to investigate how Mandarin speech prosody is perceived by male and female listeners. One hundred native Mandarin participants recognized the affective states and rated the emotional intensity of 4,500 audio files conveying five basic emotional prosodies (i.e., anger, joy, sadness, fear, neutrality) produced by a female speaker. The results showed that females generally identified emotions more accurately and rated them lower in intensity than males did. Meanwhile, the acoustic-perceptual analysis revealed a higher predictive power of acoustic measures for male performance. The research extends previous findings by showing a general female advantage in emotion detection, especially for high-arousal emotions such as anger, joy, and fear. The current study suggests that female sensitivity to minimal affective cues should be attributed to high-level enhancement through a subjective empathetic filter rather than to low-level superiority in objective acoustic sensation. The complicated mechanism of gender differences in emotional communication and the importance of explaining recognition ability from an acoustic-integrated perspective are highlighted.

1 Introduction

Early studies of gender/sex[1] differences in emotional prosody focused on either the encoding or the decoding level. However, it has been theoretically emphasized that a full investigation of the underlying mechanism of emotional communication must attend to all aspects and their interrelationships throughout the entire process (Scherer 2003). Here, we build on this perspective by examining gender-related differences in perception and investigating the corresponding acoustic signals.

Being a highly complex tool for communication, the voice is one of the most indicative channels of an individual’s emotional state. Through emotional prosody, which encompasses the modulation and integration of various acoustic-prosodic vocal cues including fundamental frequency (also called F0 or pitch), duration, intensity, and voice quality (Belyk and Brown 2014; Cutler and Pearson 2018), we express and interpret the emotional messages in utterances along with other semantic, grammatical, and pragmatic information (Ding and Zhang 2023). Similar to the mapping between acoustic variations and linguistic categories (Peng et al. 2012), there exists a mapping between emotions and speech prosody (Van Rijn and Larrouy-Maestri 2023), by which listeners successfully retrieve the corresponding emotional messages via certain acoustic cues. As the socio-cognitive ability to accurately perceive and identify one another’s emotional states plays a fundamental role in successful interpersonal interactions (Fischer and Manstead 2008; Levenson and Ruef 1992), a considerable part of our pragmatic and emotive communication capacity depends on the prosodic mapping of physiological acoustic signals onto internal psychological emotional states (Lucarini et al. 2020).

One recommended theoretical framework for understanding the mapping between prosodic vocalization and emotional states is the modified version of Brunswik’s lens model proposed in Scherer (2003), with a focus on the vocal-acoustic-auditory channel. This theory describes how objective physiological signals, called distal indicators, contribute to psychological judgement, called proximal percepts, in the process of emotional communication. According to this model, the complete process of vocal emotional mapping comprises three main steps: the encoding (or expression), the transmission, and the decoding (or impression) of emotions. First, encoding is the process in which the speaker conveys their emotional states through typical voice and speech characteristics, such as particular patterns of respiration, phonation, and articulation, to produce emotion-specific patterns of acoustic parameters. Second, transmission is the intermediate process accounting for the distortion between the cues that are encoded (the distal cues) and the cues that are decoded (the proximal representations) due to interference in the vocal-acoustic-auditory transmission channel (e.g., distance, noise) or the structural characteristics determined by human hearing mechanisms. Finally, decoding refers to the process in which the listener infers the speaker’s emotional state based on the internalized representations of emotional speech modifications (the proximal cues) and finally produces a subjective attribution (e.g., category judgements and intensity ratings of the emotions). The model highlights the importance of studying the emotional mapping of the entire communication process in a comprehensive and systematic fashion, with particular attention to all possible intermediate variables and their interrelationships (Scherer 2003). In the current study, we aim to examine the effect of the decoder’s gender on voice-emotion perceptual mapping with acoustic-supported analysis, covering acoustic measurement for the encoded expressions, gender-related biological differences for transmission, and perceptual responses for the inferentially decoded attribution.

Regarding the variability in emotional mapping, gender was one of the first attributes of decoders to be examined (Hall 1978). As a consequence, gender has been frequently suggested as a key attribute that affects one’s performance on emotion recognition tasks (Hall 1978; Lausen and Schacht 2018; Lin and Ding 2020; Lin, Ding, and Zhang 2021; Schirmer, Kotz, and Friederici 2002, 2005; Schmid, Mast, and Mast 2011; Thompson and Voyer 2014). One of the most recognized theories in this regard is the “child-rearing hypothesis,” which claims that females are more emotionally perceptive than males due to their lack of physical strength and their societally expected roles as the primary caretakers (e.g., Babchuk, Hames, and Thompson 1985). Many studies have reported supportive evidence that women are more adept than men at identifying both verbal and non-verbal emotional cues (e.g., Hall 1978; Bonebright, Thompson, and Leger 1996; Scherer, Banse, and Wallbott 2001; Toivanen, Väyrynen, and Seppänen 2005; Collignon et al. 2010; Fujisawa and Shinohara 2011; Lambrecht, Kreifelts, and Wildgruber 2014; Paulmann and Uskul 2014; Demenescu, Kato, and Mathiak 2015; Keshtiari and Kuhlmann 2016; Rafiee and Schacht 2023). However, due to the diversified nature of affective expressions and embedded contexts, female superiority in emotion recognition is not conclusively established (Lin, Ding, and Zhang 2021), with contradictory results on the existence and magnitude of the gender effect, especially in the auditory domain (Fujisawa and Shinohara 2011; Lausen and Schacht 2018), challenging the global female advantage initially proposed by the “child-rearing hypothesis”. To address these diverging findings, it has been suggested that the effects of specific emotion categories should be investigated separately (de Gelder 2016; Lausen and Hammerschmidt 2020; Lausen and Schacht 2018; Lin, Ding, and Zhang 2021; Korb et al. 2023).

A theoretical account supporting an emotion-specific female advantage is the “fitness threat” prediction. As one of the two theoretical lines developed from the “child-rearing hypothesis,” the “fitness threat” prediction emphasizes emotion type as an important moderator. In contrast to the other line, the “attachment promotion” prediction, which assumes a female superiority for all emotions to ensure better attachment of infants, the “fitness threat” hypothesis claims a selective female advantage limited to negative emotions (e.g., fear, sadness, and anger) because of the special alarming function of threat signaling in these affects (Hampson, van Anders, and Mullin 2006; Menezes et al. 2017; Thompson and Voyer 2014). Several meta-analyses indicate that emotion type significantly affects the variability in gender differences in emotion perception (Hall 1978; McClure 2000; Thompson and Voyer 2014). The effect sizes for female advantages in perceiving emotions, whether overall or specific, range from d = 1.56 to d = −0.87 (Thompson and Voyer 2014). Notably, negative emotions (e.g., anger, sadness, fear, disgust, and contempt) show slightly different effect sizes compared to positive emotions (such as happiness and interest), both of which are considerably larger than those for “other emotions” (such as surprise and neutrality). This selective female advantage in emotional sensitivity is also partly reflected in the gender gap in mental disorders. For example, starting in adolescence, women show higher rates of emotional disorders such as depressive/mood disorders and anxiety disorders, which involve the experience and expression of internalizing negative emotions such as sadness, guilt, and fear (Chaplin and Cole 2005; Keenan and Hipwell 2005; Zahn-Waxler, Shirtcliff, and Marceau 2008). In contrast, there is a male preponderance in conduct disorders and autistic, antisocial, and psychopathic behaviors, which often embody expressions of anger and anxiety (Chaplin 2015; Zahn-Waxler, Shirtcliff, and Marceau 2008).

To provide new evidence on gender-based differences in the auditory perception of emotion, the current work combines a perceptual experiment with acoustic-supported analysis to help reveal the complicated mechanisms underlying various emotional and behavioral issues. In previous literature, examinations of gender effects (Fujisawa and Shinohara 2011; Lausen and Schacht 2018; Lin, Ding, and Zhang 2021) and investigations of acoustic-perceptual mapping (Ekberg et al. 2023; Lausen and Hammerschmidt 2020; Liu and Pell 2012, 2014) in emotional prosody have often been disconnected. Few studies have inspected gender-based performance disparities with acoustic-assisted analysis. However, many inconsistent findings in acoustic research may be better understood by considering human-related aspects, whereas subtle perceptual variations across genders may necessitate examination at the acoustic level. Aiming to bridge the gap between the two lines, we explore the effect of listeners’ gender on emotion perception with acoustic examinations.

Accordingly, we formulated two main research questions. First, at the perceptual level, how does the decoder’s gender influence performance on different decoded attribution measures? Specifically, do females enjoy a global advantage in affective perception, with higher recognition accuracy and intensity ratings regardless of emotional state? If not, how do the emotional categories affect the behavioral responses? Second, with regard to acoustic-perceptual mapping, how much do the acoustic characteristics account for the perceptual behaviors, and how much of the gender-based difference can be explained by certain acoustic measures? Do females and males adopt different strategies in voice-emotion inferential processing?

2 Methods

2.1 Materials

According to the “child-rearing hypothesis”, females should be better decoders than males specifically under minimal-stimulus conditions (Babchuk, Hames, and Thompson 1985). We therefore chose disyllabic words, which are the lowest-level and most frequent constituents in prosody processing (Qian and Pan 2002) as well as the smallest unit for efficient emotional prosody processing in native Mandarin speakers (Xiao and Liu 2024), as carriers of emotional prosody in this research. Moreover, disyllabic words also provide better comparability, as they are commonly used stimuli in previous research on Mandarin emotion recognition (Chen, Wang, and Ding 2024; Lin and Ding 2020; Lin, Ding, and Zhang 2021; Lin et al. 2022). Meanwhile, given the overall female advantage in emotional expressivity (Belin, Fillion-Bilodeau, and Gosselin 2008; Collignon et al. 2010; Koeda et al. 2013; Lausen and Schacht 2018; Lin, Ding, and Zhang 2021; Pell 2002; Scherer, Banse, and Wallbott 2001; Vasconcelos et al. 2017) and the primary focus of this study on the decoder’s gender effect, we used only audio recordings encoded by one female speaker. We employed a validated corpus of disyllabic word stimuli produced by a female speaker, the Chinese Emotional Speech Audiometry Corpus (CESAP) (Tang et al. 2024), as our research materials. Established by our research team in previous work (Tang et al. 2024), this audiometry corpus is an audio emotional speech corpus designed to be suitable for both perceptual and acoustic investigations. For perceptual purposes, the semantic content was controlled to be emotionally neutral and all items had similar lexical familiarity. In its acoustic profiles, CESAP is well representative of the linguistic characteristics of Mandarin Chinese.

The audio recordings in the CESAP corpus were textually based on a set of 30 disyllabic word lists, which were strictly compiled to satisfy four criteria for audiometry materials: phonetic balance, minimal repetition, familiarity, and emotional valence. The word lists in CESAP can therefore serve as reliable, equivalent testing units for perceptual tasks. A brief description of the compilation, recording, and norming of the materials is presented here; for more detailed explanations of the stimulus validation procedures and selection criteria, please refer to our recent article (Tang et al. 2024). In the CESAP corpus, each of the 30 lists contains 30 Mandarin disyllabic words. All words were emotionally neutral in semantics and frequently used in everyday contexts; see the following examples with the corresponding English translations in square brackets (e.g., 手机 [mobile phone], 书包 [schoolbag]). The corresponding audio samples for the two exemplar words in each of the five emotions are available in the Supplemental materials. Within each list, the proportions of phonetic unit occurrences (vowels, consonants, and lexical tones) were matched to their natural statistical distributions (Wang 1998), and across lists, the phonetic variations were minimized to ±1. All audio materials were produced by a selected female speaker at a normal speaking rate (mean disyllable duration: 1,471 ms, ranging from 1,144 to 1,752 ms). The speaker was a 22-year-old native Mandarin speaker with a Level 1B certificate in the Chinese National Mandarin Proficiency Test. She was instructed to clearly and naturally utter all 900 words in the 30 lists with the assigned prosody of anger, sadness, joy, fear, and neutrality, resulting in a total of 4,500 (30 × 30 × 5) source files. Words were recorded prosody by prosody to maximize the consistency of the recording style within each emotional condition. All audio clips were recorded at a sampling rate of 44.1 kHz with 16-bit resolution and peak-normalized to 0.99 with the phonetic software Praat (version 6.2.01).

2.2 Perceptual Experiment

2.2.1 Participants

The current study was carried out with ethical approval from our institution’s Institutional Review Board (IRB), in compliance with the Declaration of Helsinki. We initially recruited 126 Chinese college students as participants through online advertisements. All of them were native Mandarin speakers born in mainland China. They all had normal hearing and reported no history of neurological or psychiatric disorders. To screen for anxiety or depression, the participants completed 1) the Chinese version of the seven-item Generalized Anxiety Disorder Scale (GAD-7) (Williams 2014) and 2) the nine-item Patient Health Questionnaire Depression Scale (PHQ-9) (Ehde 2018). Seventeen participants who scored higher than 10 on the GAD-7 or PHQ-9 were excluded. Eight participants who took more than 3 h to finish the experiment were rejected, as they might have been distracted during the experiment. One participant was ruled out for entering via an incorrect link. In the end, 100 participants were included (51 females and 49 males; mean age = 23.51 years, SD = 2.19 years). All participants confirmed that their sex assigned at birth aligned with their gender identity. At the beginning of the experiment, written informed consent was obtained from all participants, who were financially compensated for their time and involvement.

2.2.2 Procedure

The experiment was created and hosted on the Gorilla Experiment Builder (www.gorilla.sc) (Anwyl-Irvine et al. 2021), a widely used online experiment platform that provides reasonable accuracy and precision (Anwyl-Irvine et al. 2020). Participants took the experiment remotely on a computer in a quiet environment. For each recording, participants performed two consecutive judgments: they were asked to 1) identify the emotion expressed in the utterance by selecting one answer from six given choices (happiness, sadness, fear, anger, neutrality, or none of the above), and 2) rate the emotional intensity of the audio on a 5-point scale (1 = very weak, 5 = very intense). Before the formal experimental sessions, ten practice trials were carried out, and participants could repeat the practice trials until they were familiar with the experimental procedure. The experiment tested the 4,500 items with a Latin square design, in which each of the 30 lists of 30 lexical items was tested in all five emotional conditions; accordingly, one list under one emotional condition was assigned to 20 of the 100 participants. Each participant listened to 900 items in total, divided into six blocks, and was encouraged to rest between blocks. The formal experiment session took about 1.5 h per participant.
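For illustration, the counterbalancing described above can be sketched in R by rotating the five emotional conditions over the 30 lists across five participant groups. The object names and the specific rotation rule are illustrative assumptions, not the exact assignment implemented on Gorilla.

```r
# Illustrative counterbalancing: 30 lists x 5 emotions over 5 participant groups.
emotions <- c("anger", "joy", "sadness", "fear", "neutrality")
n_lists  <- 30
n_groups <- 5

assignment <- expand.grid(list_id = seq_len(n_lists), group = seq_len(n_groups))
# Rotate emotions over lists so that, across the five groups,
# every list is presented in every emotional condition exactly once.
assignment$emotion <- emotions[(assignment$list_id + assignment$group - 2) %% 5 + 1]

# Each group corresponds to 20 of the 100 participants (100 / 5), and every
# participant hears all 30 lists, i.e., 30 x 30 = 900 trials in total.
head(assignment[assignment$group == 1, ])
xtabs(~ emotion + group, data = assignment)   # 6 lists per emotion per group
```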

2.3 Data Analysis

Two female participants and one male participant were removed because their accuracy rates were more than 2.5 standard deviations away from the group average. Then, a dual criterion was applied to select valid items: (1) an accuracy rate of at least 50 % correct (i.e., three times chance performance in the six-choice emotion identification task) for the target emotion, and (2) recognition rates of less than 50 % for any other emotion category (Gong et al. 2023; Liu and Pell 2012). In this study, all 4,500 items met the two criteria and were therefore included as valid in the final results. All 87,300 (97 participants × 900 trials) responses were included in the analyses of recognition accuracy. For the analyses of the intensity rating data, trials incorrectly classified by the participants were excluded (3,450 trials in total), leaving 83,850 trials. The statistical analysis was conducted with R (version 4.3.2) in the RStudio integrated development environment.
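As an illustration, the two screening steps could be implemented in R roughly as follows; the data frame resp and its column names are assumptions about the trial-level data layout, not the authors’ actual scripts.

```r
# 'resp' is an assumed trial-level data frame with columns: subject, item,
# emotion (true label), response (chosen label), accuracy (0/1), where
# 'response' is a factor covering all six answer options.

# 1) Exclude participants whose mean accuracy lies beyond +/- 2.5 SD of the group
subj_acc <- aggregate(accuracy ~ subject, data = resp, FUN = mean)
lims <- mean(subj_acc$accuracy) + c(-2.5, 2.5) * sd(subj_acc$accuracy)
keep <- subj_acc$subject[subj_acc$accuracy > lims[1] & subj_acc$accuracy < lims[2]]
resp <- subset(resp, subject %in% keep)

# 2) Item validity: at least 50 % correct for the target emotion (three times
#    chance in the six-choice task) and under 50 % for every non-target category
rates   <- prop.table(table(resp$item, resp$response), margin = 1)
targets <- as.character(resp$emotion[match(rownames(rates), resp$item)])
target_rate <- rates[cbind(rownames(rates), targets)]
other_max   <- sapply(seq_len(nrow(rates)),
                      function(i) max(rates[i, colnames(rates) != targets[i]]))
valid_items <- rownames(rates)[target_rate >= 0.5 & other_max < 0.5]
```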

2.3.1 Perceptual-Focused Analysis

We ran generalized linear mixed-effects models (GLMMs) with binomial logistic regression for the binary emotion recognition responses using the R package lme4 (Bates et al. 2015), and cumulative link mixed-effects models (CLMMs) with ordinal logistic regression for the ordinal 5-point Likert intensity ratings using the clmm function (Haubo and Christensen 2018) in the R package ordinal. As intensity rating is recommended as an influential factor in the analysis of recognition accuracy (Lausen and Schacht 2018), we controlled for intensity rating as a confounder in the analysis of the accuracy data. Therefore, the GLMM for the accuracy scores was fitted with three fixed-effect factors: emotion, gender, and intensity rating. For the intensity rating analysis, the CLMM had two fixed-effect factors: emotion and gender. The two best models were selected through a procedure starting from the full model, whose random-effects structure included by-subject and by-item random slopes for all fixed factors (Barr et al. 2013). The final random-effects structures were determined by backward model selection according to the Akaike information criterion (AIC) (Akaike 1974) and significance tests based on χ2-distributed likelihood-ratio tests (Matuschek et al. 2017). Post hoc tests were conducted with the emmeans function using Bonferroni correction at α = 0.05. The final best models for recognition accuracy and intensity ratings are described in Equations (1) and (2), respectively.

(1) $\operatorname{logit}\big(P(Y_i = 1)\big) = \alpha + \beta_1\,\text{gender} + \beta_2\,\text{emotion} + \beta_3\,\text{intensity rating} + \beta_4\,(\text{gender} \times \text{emotion}) + \beta_5\,(\text{gender} \times \text{intensity rating}) + \beta_6\,(\text{emotion} \times \text{intensity rating}) + \beta_7\,(\text{gender} \times \text{emotion} \times \text{intensity rating}) + u_{\text{subject}(i)} + \varepsilon_i, \quad i = 1, \ldots, n$

(2) $\operatorname{logit}\big(P(Y_{ij} \le k)\big) = \alpha_k - \beta_1\,\text{gender} - \beta_2\,\text{emotion} - \beta_3\,(\text{gender} \times \text{emotion}) - u_{\text{subject}(i)} - v_{\text{audio}(j)} + \varepsilon_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, m, \; k = 1, \ldots, K-1$
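To make the model specifications concrete, the following R sketch shows how models of the form in Equations (1) and (2) can be fitted with lme4 and ordinal and followed up with emmeans. The data-frame and variable names (dat, accuracy, intensity, subject, audio) are assumptions for illustration, and only the intercept-only random effects written in the equations are shown rather than the exact final structures retained by the backward selection.

```r
library(lme4)      # glmer() for the binomial GLMM (Equation 1)
library(ordinal)   # clmm() for the cumulative link mixed model (Equation 2)
library(emmeans)   # post hoc contrasts with Bonferroni adjustment

# 'dat' is assumed to hold one row per trial with columns:
# accuracy (0/1), intensity (1-5), gender, emotion, subject, audio.

# Equation (1): accuracy ~ gender x emotion x intensity, by-subject intercept
m_acc <- glmer(
  accuracy ~ gender * emotion * intensity + (1 | subject),
  data = dat, family = binomial(link = "logit")
)

# Equation (2): ordinal intensity ratings on correctly classified trials only,
# with gender x emotion fixed effects and crossed random intercepts
dat_correct <- subset(dat, accuracy == 1)
dat_correct$intensity <- factor(dat_correct$intensity, ordered = TRUE)
m_int <- clmm(
  intensity ~ gender * emotion + (1 | subject) + (1 | audio),
  data = dat_correct, link = "logit"
)

# Backward selection of random effects would compare nested models by AIC
# and likelihood-ratio tests, e.g. anova(m_full, m_reduced).

# Post hoc gender contrasts within each emotion, Bonferroni corrected
emmeans(m_acc, pairwise ~ gender | emotion, adjust = "bonferroni")
emmeans(m_int, pairwise ~ gender | emotion, adjust = "bonferroni")
```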

2.3.2 Acoustic-Perceptual Analysis

To investigate the acoustic-perceptual relationship, twelve acoustic parameters of prosodic significance (Ekberg et al. 2023; Liu and Pell 2012) in the GeMAPS parameter set (Eyben et al. 2016) were extracted from all 4,500 recordings using the Python package OpenSMILE (version 2.4.1, Eyben et al. 2010, 2013; run in Python 3.9.12). The GeMAPS parameter set is a minimal standard set commonly used in affective speech processing, created as a baseline feature set to facilitate replication and enhance parameter comparability across studies. All 12 acoustic parameters are presented in Table 1: three frequency-related features (F0 mean, F0 range, jitter) providing information on pitch; four amplitude-related features (loudness mean, loudness range, shimmer, HNR) measuring loudness; two temporal-related features (loudness peak rate, pseudosyllable rate) representing timing profiles; and three spectral-balance-related features, namely the Hammarberg index (Hammarberg et al. 1980), harmonic difference H1-H2 (e.g., Fischer-Jørgensen 1967; Bickley 1982; Esposito 2010), and harmonic difference H1-A3 (Esposito 2010; Stevens and Hanson 1995), as indicators of voice quality and phonation type.

Table 1:

Acoustic parameters analyzed in the current study based on descriptions given by Eyben et al. (2016), and categorized by feature type.

Acoustic features (parameters) Definition
Frequency related
F0 mean Mean logarithmic fundamental frequency (F0) on a semitone scale starting at 27.5 Hz
F0 range Range of the 20th to 80th percentile of the logarithmic fundamental frequency (F0) on a semitone scale starting at 27.5 Hz
Jitter Mean deviations in individual consecutive F0 period lengths
Amplitude related
Loudness mean Estimate of the mean perceived signal intensity from an auditory spectrum
Loudness range Range of the 20th to 80th percentile of the estimate of the mean perceived signal intensity from an auditory spectrum
Shimmer Mean difference of the peak amplitudes of consecutive F0 periods
Harmonics-to-noise ratio (HNR) Mean ratio of energy in harmonic components to energy in noise-like components
Temporal related
Loudness peak rate Mean number of loudness peaks per second
Pseudo syllable rate Mean number of continuous voiced regions per second
Spectral-balance related
Hammarberg index Mean ratio of the strongest energy peak in the 0–2 kHz region to the strongest energy peak in the 2–5 kHz region
Harmonic difference H1-H2 Mean ratio of energy of the first F0 harmonic to the energy of the second F0 harmonic
Harmonic difference H1-A3 Mean ratio of the energy of the first F0 harmonic to the highest harmonic in the third formant range

To better present the variations in acoustic parameters across emotions, we z-normalized the feature values of the emotional audio recordings using the mean and standard deviation of all 4,500 recordings. Further, to investigate the variation in acoustic parameters between emotional states, we performed a one-way ANOVA for each of the 12 acoustic features, with emotional state (anger, joy, fear, sadness, neutrality) as the independent variable and the z-scores of each recording as the dependent variable. The ANOVA analysis was conducted with IBM SPSS v.27. Significant results were followed up by post-hoc t-tests using Bonferroni correction for multiple comparisons, with the significance level set at 5 %. In addition, the interrelationships between the acoustic measures were explored in a Pearson correlation analysis of the z-normalized scores of all parameters, using the corr_coef function in the metan R package (Olivoto and Lucio 2020).
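The normalization and comparison steps can be sketched in R as follows. The original ANOVAs were run in SPSS and the correlations with metan::corr_coef; the base-R calls below (aov, pairwise.t.test, cor) are substitutes for illustration, and the data frame feat and its column names are assumptions.

```r
# Assumed data frame 'feat': one row per recording, an 'emotion' column,
# and one numeric column per acoustic parameter (names are illustrative).
params <- c("F0Mean", "F0Range", "Jitter", "LoudMean", "LoudRange", "Shimmer",
            "HNR", "LoudRate", "Pseudo", "Hammarberg", "H1_H2", "H1_A3")

# z-normalize each parameter against the mean and SD of all 4,500 recordings
feat[params] <- scale(feat[params])

# One-way ANOVA per parameter with emotion as the factor, followed by
# Bonferroni-corrected pairwise t-tests (base-R stand-ins for the SPSS analysis)
for (p in params) {
  fit <- aov(reformulate("emotion", response = p), data = feat)
  print(summary(fit))
  print(pairwise.t.test(feat[[p]], feat$emotion, p.adjust.method = "bonferroni"))
}

# Pearson correlations among the z-scored parameters
round(cor(feat[params], method = "pearson"), 2)
```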

Then, to integrate the acoustic profiles with the perceptual responses, we calculated the mean recognition accuracy and the mean intensity rating for each of the 4,500 audio files, separately for male and female data. Four (2 × 2) linear regression models were then built with the lm function in base R to measure the predictive power of the z-scored acoustic parameters on the mean emotion recognition accuracy and the mean intensity ratings for male and female data. The order of the acoustic parameters in the models was determined by their importance in a backward stepwise variable selection.
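As an illustration of this step, one of the four models could be fitted as below, with step() standing in for the backward variable selection; item_dat and its column names are assumed, and effect sizes such as partial η2 would be computed separately.

```r
# One of the four item-level models: male recognition accuracy regressed on
# the 12 z-scored acoustic parameters, with backward selection by AIC.
full_male_acc <- lm(
  mean_acc_male ~ F0Mean + F0Range + Jitter + LoudMean + LoudRange + Shimmer +
    HNR + LoudRate + Pseudo + Hammarberg + H1_H2 + H1_A3,
  data = item_dat
)
male_acc <- step(full_male_acc, direction = "backward", trace = FALSE)

summary(male_acc)$adj.r.squared   # adjusted R-squared as reported in Table 4
# The same procedure is repeated for female accuracy and for male and
# female mean intensity ratings, giving the four (2 x 2) models.
```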

3 Results

3.1 Perceptual Results

3.1.1 Recognition Accuracy

The confusion matrices for male and female participants are shown in Figure 1, and an overview of the recognition accuracy for the different emotions across males and females can be found in Table 2 and is visualized in the left-side bar plot in Figure 2. The overall mean recognition accuracy was considerably high in both male (M = 0.956, SD = 0.20) and female participants (M = 0.964, SD = 0.19). Participants of both genders had more difficulty recognizing joyful and fearful voices, in particular tending to misclassify joyful utterances as neutral ones: of all joyful items, male participants misclassified 5.17 % as neutral, and female participants 5.05 %. A common confusion between fear and sadness was observed for both genders, with males misclassifying 2.21 % of fearful items and 1.68 % of sad items, and females 1.76 % of fearful items and 1.33 % of sad items. Notably, male participants misrecognized 2.03 % of angry voices as neutral, whereas females mistook 2.2 % of neutral utterances as angry. In addition, males made a gender-specific error by labeling 1.59 % of fearful items as joyful.
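For reference, row-normalized confusion matrices of the kind plotted in Figure 1 could be computed in R along the following lines, again using the assumed trial-level data frame resp (with gender assumed to be coded as "male"/"female").

```r
# Percentage of responses per true emotion label, split by listener gender
conf_by_gender <- lapply(split(resp, resp$gender), function(d) {
  round(100 * prop.table(table(true = d$emotion, chosen = d$response),
                         margin = 1), 2)
})
conf_by_gender[["male"]]    # e.g., the (joy, neutrality) cell ~ 5.17 %
```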

Figure 1:
Confusion matrices of emotion recognition by male and female participants. The values represent the percentage of each recognition pattern calculated against the total number of trials with the true emotional labels across different categories.

Table 2:

Mean values and standard deviations of identification accuracy and intensity rating in emotion recognition by female and male participants.

Accuracy score Intensity rating
Male Female Male Female
M (SD) M (SD) M (SD) M (SD)
Joy 0.925 (0.26) 0.940 (0.24) 3.40 (1.06) 3.17 (1.09)
Anger 0.966 (0.18) 0.980 (0.14) 3.92 (0.94) 3.79 (0.89)
Fear 0.945 (0.23) 0.961 (0.19) 3.79 (0.93) 3.60 (0.95)
Sadness 0.971 (0.17) 0.980 (0.14) 3.83 (0.94) 3.79 (0.94)
Neutrality 0.976 (0.15) 0.963 (0.19) 3.04 (1.33) 3.14 (1.30)
All 0.956 (0.20) 0.964 (0.19) 3.60 (1.10) 3.50 (1.08)
Figure 2:
Bar plots of mean recognition accuracy and intensity ratings by male and female participants. Mean values are displayed in the bar charts, with error bars showing 95 % confidence intervals.

The results of the GLMM showed main effects of listener’s gender [χ2(1) = 45.49, p < 0.001] and emotion [χ2(4) = 119.03, p < 0.001]. Post-hoc tests showed that female listeners were generally, though not significantly, more accurate in emotion recognition than their male counterparts (Cohen’s d = 0.28, SE = 0.23, z = 1.25, p = 0.21). The interaction between gender and emotion was significant [χ2(4) = 123.85, p < 0.001]. Females significantly outperformed males in recognizing angry (Cohen’s d = 0.56, SE = 0.25, z = 2.27, p = 0.0235), joyful (Cohen’s d = 0.52, SE = 0.23, z = 2.26, p = 0.0236), and fearful prosody (Cohen’s d = 0.51, SE = 0.24, z = 2.17, p = 0.03). Females were also numerically more accurate than males in identifying sadness, though not significantly so (Cohen’s d = 0.24, SE = 0.25, z = 0.96, p = 0.34). Conversely, males scored higher than females in neutral prosody recognition, again without reaching statistical significance (Cohen’s d = 0.41, SE = 0.28, z = 1.46, p = 0.14).

3.1.2 Intensity Ratings

The right-side bar plot in Figure 2 shows the intensity ratings of male and female participants for correctly answered trials, and Table 2 includes the descriptive statistics. Both genders gave higher ratings for anger (Mm = 3.92, Mf = 3.79), sadness (Mm = 3.83, Mf = 3.79), and fear (Mm = 3.78, Mf = 3.60), while the ratings for joy were relatively lower (Mm = 3.40, Mf = 3.18). Neutrality, as expected, received lower rating scores (Mm = 3.04, Mf = 3.14) than the four other emotional conditions. The CLMM results suggested no significant main effect of emotion (p > 0.05) or listener’s gender (p > 0.05). According to the post-hoc test results, the intensity ratings given by male participants were generally, but not significantly, higher than those given by female participants (Cohen’s d = 0.29, SE = 0.34, z = 0.87, p = 0.38). The interaction between gender and emotion was significant [χ2(4) = 331.45, p < 0.001], and the gender effect on intensity ratings varied across emotions. Compared with their female counterparts, males rated utterances conveying affective prosody higher (joy: Cohen’s d = 0.53, SE = 0.34, z = 1.58, p = 0.11; fear: Cohen’s d = 0.48, SE = 0.34, z = 1.44, p = 0.15; anger: Cohen’s d = 0.43, SE = 0.34, z = 1.27, p = 0.20; sadness: Cohen’s d = 0.14, SE = 0.34, z = 0.41, p = 0.69). However, females tended to give slightly higher rating scores for neutral voices than males (Cohen’s d = 0.12, SE = 0.34, z = 0.36, p = 0.72).

3.2 Acoustic-Perceptual Results

3.2.1 Comparison of Acoustic Parameters

An overview of the acoustic patterns for different emotions is shown in Figure 3, and a more comprehensive presentation of z-scores and significant differences by the one-way ANOVA analysis is displayed in Table 3. The correlation heatmap is presented in the Appendix (Figure A.1).

Figure 3:
The bar plot of z-normalized scores for the 10 acoustic parameters across different emotional statuses. Mean values are displayed in the bar charts, with error bars showing 95 % confidence intervals. F0Mean = F0 mean, F0Range = F0 range, LoudMean = Loudness mean, LoudRange = Loudness range, LoudRate = Loudness peak rate, Pseudo = Pseudo syllable rate, Hammarberg = Hammarberg index, H1-H2 = Harmonic difference H1-H2, H1-A3 = Harmonic difference H1-A3.

Table 3:

ANOVA results for the comparisons of acoustic parameter values between emotional states.

Acoustic features (parameters) Anger M (SD) Joy M (SD) Fear M (SD) Sadness M (SD) Neutral M (SD) F p pEta2 Post hoc tests (Bonferroni adjusted)
Frequency related
F0Mean 0.318 (0.379) 0.503 (0.475) 1.243 (0.401) −0.974 (0.480) −1.090 (0.451) 4,704.921 <0.001 0.807 Fear > Joy (p < 0.001); Joy > Anger (p < 0.001); Anger > Sadness (p < 0.001); Sadness > Neutrality (p < 0.001)
F0Range 0.110 (0.807) 0.676 (1.108) −0.347 (0.807) −0.579 (0.736) 0.139 (0.995) 261.489 <0.001 0.189 Joy > Neutrality (p < 0.001); Neutrality ≈ Anger (p = 1); Anger > Fear (p < 0.001); Fear > Sadness (p < 0.001)
Jitter 0.638 (1.099) 0.072 (0.873) −0.397 (0.679) −0.445 (0.804) 0.132 (1.068) 209.155 <0.001 0.157 Anger > Neutrality (p < 0.001); Neutrality ≈ Joy (p = 1); Joy > Fear (p < 0.001); Fear ≈ Sadness (p = 1)
Amplitude related
LoudMean 0.449 (0.942) 0.341 (0.801) −0.985 (0.786) 0.075 (0.912) 0.119 (0.847) 398.131 <0.001 0.262 Anger ≥ Joy (p = 0.075); Joy > Neutrality (p < 0.001); Neutrality ≈ Sadness (p = 1); Sadness > Fear (p < 0.001)
LoudRange 0.613 (0.894) 0.720 (0.757) −0.651 (0.585) −0.642 (0.791) −0.040 (0.965) 595.106 <0.001 0.346 Joy > Anger (p = 0.0052); Anger > Neutrality (p < 0.001); Neutrality > Sadness (p < 0.001); Sadness ≈ Fear (p = 1)
Shimmer 0.400 (0.912) 0.134 (0.845) −0.132 (0.819) −0.642 (0.891) 0.241 (1.152) 172.533 <0.001 0.133 Anger > Neutrality (p = 0.0030); Neutrality ≈ Joy (p = 0.15); Joy > Fear (p < 0.001); Fear > Sadness (p < 0.001)
HNR −0.655 (0.686) −0.112 (0.885) 0.918 (0.820) 0.460 (0.664) −0.611 (0.875) 670.917 <0.001 0.374 Fear > Sadness (p < 0.001); Sadness > Joy (p < 0.001); Joy > Neutrality (p < 0.001); Neutrality ≈ Anger (p = 1)
Temporal related
LoudRate 0.000 (0.809) 0.084 (0.626) 1.105 (0.848) −0.847 (0.863) −0.341 (0.657) 788.770 <0.001 0.412 Fear > Joy (p < 0.001); Joy ≈ Anger (p = 0.21); Anger > Neutrality (p = 0.0069); Neutrality > Sadness (p < 0.001)
Pseudo 0.268 (0.979) 0.321 (0.991) 0.030 (0.967) −0.738 (0.720) 0.120 (0.935) 193.842 <0.001 0.147 Joy ≈ Anger (p = 1); Anger > Neutrality (p < 0.001); Neutrality > Fear (p < 0.001); Fear > Sadness (p < 0.001)
Spectral-balance related
Hammarberg −0.813 (0.700) −0.529 (0.697) 1.011 (0.764) 0.529 (0.619) −0.199 (0.882) 943.620 <0.001 0.456 Fear > Sadness (p < 0.001); Sadness > Neutrality (p < 0.001); Neutrality > Joy (p < 0.001); Joy > Anger (p < 0.001)
H1-H2 −0.668 (0.851) 0.024 (0.879) −0.008 (0.758) 0.474 (1.114) 0.178 (0.994) 183.853 <0.001 0.141 Sadness > Neutrality (p < 0.001); Neutrality > Joy (p = 0.0041); Joy ≈ Fear (p = 1); Fear > Anger (p < 0.001)
H1-A3 −0.671 (0.836) −0.389 (0.855) 0.193 (1.085) 0.651 (0.714) 0.217 (0.881) 320.806 <0.001 0.222 Sadness > Neutrality (p < 0.001); Neutrality ≈ Fear (p = 1); Fear > Joy (p < 0.001); Joy > Anger (p < 0.001)
Note: F0Mean, F0 mean; F0Range, F0 range; LoudMean, Loudness mean; LoudRange, Loudness range; LoudRate, Loudness peak rate; Pseudo, Pseudo syllable rate; Hammarberg, Hammarberg index; H1-H2, Harmonic difference H1-H2; H1-A3, Harmonic difference H1-A3. In the post hoc comparisons, “>” is used when p < 0.05, “≥” when p < 0.1, and “≈” when p > 0.1.

The one-way ANOVAs revealed that all 12 acoustic parameters differed significantly across the five emotional states. Non-significant comparisons were observed most often between joy and anger (in three features: loudness mean, loudness peak rate, and pseudosyllable rate), followed by fear and sadness (jitter and loudness range), neutrality and anger (pitch range and HNR), and neutrality and joy (jitter and shimmer). Close acoustic feature values were also found between joy and fear (H1-H2), neutrality and sadness (loudness mean), and neutrality and fear (H1-A3). Of all the emotions, anger was characterized by significantly extreme values in six acoustic parameters (the highest jitter and shimmer, and the lowest HNR, Hammarberg index, H1-H2, and H1-A3), and so was sadness (the highest H1-H2 and H1-A3, and the lowest pitch range, shimmer, loudness peak rate, and pseudosyllable rate). Fear was distinguished by the highest values in four parameters (pitch mean, HNR, loudness peak rate, Hammarberg index) and the lowest value in loudness mean. Joy had a relatively moderate acoustic profile, with only two notably extreme cues: the largest pitch range and loudness range.

3.2.2 Acoustic Prediction of Perceptual Response

The proportion of variance explained by the acoustic measures for emotion recognition accuracy and intensity ratings is presented in Table 4.

Table 4:

Proportion of variance explained by the acoustic measures for emotion recognition and intensity ratings by male and female participants.

Accuracy model Intensity model
Male Female Male Female
Frequency related
F0Mean 0.0018*** (p = 0.0001) 0.28*** (p < 0.001) 0.19*** (p < 0.001)
F0Range 0.0068*** (p < 0.001) 0.0065*** (p < 0.001) 0.0015*** (p < 0.001) 0.01*** (p < 0.001)
Jitter 0.0034. (p = 0.091) 0.01*** (p < 0.001) 0.03*** (p < 0.001)
Amplitude related
LoudMean 0.0060. (p = 0.091) 0.000095* (p = 0.012) 0.06*** (p < 0.001) 0.0075*** (p < 0.001)
LoudRange 0.0070*** (p < 0.001) 0.0054*** (p < 0.001) 0.0020*** (p < 0.001) 0.0027*** (p < 0.001)
Shimmer 0.00020* (p = 0.0431)
HNR 0.000051** (p = 0.003) 0.0043** (p = 0.002) 0.01*** (p = 0.0003) 0.0057 (p = 0.128)
Temporal related
LoudRate 0.0026*** (p < 0.001) 0.03*** (p < 0.001) 0.01*** (p < 0.001)
Pseudo 0.0009* (p = 0.012)
Spectral-balance related
Hammarberg 0.0042*** (p < 0.001) 0.05*** (p < 0.001) 0.01*** (p < 0.001)
H1-H2 0.0027*** (p = 0.0005) 0.0052*** (p < 0.001) 0.02*** (p < 0.001)
H1-A3 0.0012* (p = 0.022) 0.01*** (p < 0.001) 0.0074*** (p < 0.001)
Adjusted R-squared 0.028 0.020 0.367 0.255
Note: The values represent the estimated effect sizes (partial η2) for the parameters, and the adjusted R2 for the full models. All p-values were Bonferroni corrected by the number of acoustic parameters (***p < 0.001, **p < 0.01, *p < 0.05, .p < 0.1). No values were calculated for parameters dropped by the backward variable selection. F0Mean, F0 mean; F0Range, F0 range; LoudMean, Loudness mean; LoudRange, Loudness range; LoudRate, Loudness peak rate; Pseudo, Pseudo syllable rate; Hammarberg, Hammarberg index; H1-H2, Harmonic difference H1-H2; H1-A3, Harmonic difference H1-A3.

An inspection of Table 4 shows that the proportions of explained variance for recognition accuracy (adjusted R2: males = 0.028, females = 0.020) were much lower than those for intensity ratings (males = 0.367, females = 0.255). In general, the perceptual responses of male participants were better predicted by the acoustic parameters. Emotion recognition and intensity ratings from decoders of both genders were significantly driven by many common acoustic features, but the specific constellation of predictors varied considerably across the two groups. With regard to emotion recognition, accuracy for both male and female listeners was commonly predicted by frequency-related (pitch range) and amplitude-related (loudness range, loudness mean, and HNR) parameters; the recognition responses of males were additionally predicted by one more frequency-related (pitch mean) and two more spectral-balance-related (Hammarberg index and H1-A3) features, whereas those of females were further predicted by one temporal-related (loudness peak rate) and another spectral-balance-related (H1-H2) feature. For intensity ratings, the behavioral data from male and female participants largely shared a common set of acoustic predictors; however, male and female responses were best predicted by different parameters (males: loudness mean, loudness peak rate, and H1-A3; females: pitch range, loudness range, and H1-H2).

4 Discussion

This study examined how decoder gender, emotion type, and acoustic measures relate to the perception of emotional prosody, in terms of recognition accuracy and intensity ratings, for speech produced by a native female speaker of Mandarin Chinese. In particular, we investigated how the gender of listeners may influence perceptual responses to speech in several distinct emotional states (i.e., anger, joy, fear, sadness, neutrality), and how the profile of specific acoustic signals may help explain the underlying mechanism, using a well-controlled emotional speech database appropriate for both acoustic analysis and perceptual studies. The current research offers two important contributions to the study of acoustic-perceptual mapping of Mandarin emotional prosody across genders. First, we investigate gender-specific advantages and characteristics in the decoding of vocal emotions across different emotional categories using a standard audiometry corpus with careful control over semantic and phonetic features. Second, by simultaneously analyzing the acoustic profiles and the behavioral responses, we link perceptual performance and acoustic signals throughout the process of emotion prosody perception, which provides a valuable window for examining the nuanced details of emotional acoustic-perceptual mapping from multiple perspectives.

Overall, our results revealed that male and female decoders differ considerably in the acoustic-perceptual mapping of emotional prosody, with females being generally more accurate and sensitive in discerning affective expressions. The gender modulation of perceptual voice-emotion mapping varies across emotion categories and decoded attribution types (i.e., recognition accuracy and intensity rating), and can be further explained by acoustic measures. Possible gender-related inferential preferences in emotional processing mechanisms are revealed. These major findings are discussed in detail in the following subsections.

4.1 Gender Differences in Decoded Attribution by Perceptual Responses

For our first research question, regarding the gender advantage in decoded attributions (i.e., recognition accuracy and intensity rating) in the vocal-acoustic-auditory channel, our findings suggested that females might be generally more accurate and more sensitive in inferring and attributing affective states from speech prosody.

Regarding recognition accuracy, females showed a general tendency to outperform males at categorizing emotions (i.e., anger, joy, fear, sadness) in vocal stimuli from a female encoder, which is in line with previous reports (Fujisawa and Shinohara 2011; Hall 1978; Lausen and Schacht 2018; Lin and Ding 2020; Lin, Ding, and Zhang 2021). However, female superiority in affective decoding was not globally significant (Ertürk, Gürses, and Kayıkcı 2024; Lausen and Schacht 2018). Our results replicated significant female advantages in decoding angry (Rafiee and Schacht 2023; Thompson and Voyer 2014), joyful (Bonebright, Thompson, and Leger 1996; Demenescu, Kato, and Mathiak 2015; Fujisawa and Shinohara 2011; Lambrecht, Kreifelts, and Wildgruber 2014; Lin, Ding, and Zhang 2021), and fearful (Bonebright, Thompson, and Leger 1996; Demenescu, Kato, and Mathiak 2015; Rafiee and Schacht 2023; Zupan et al. 2016) prosody, and we found no evidence for gender differences in the recognition of sad (Chen, Wang, and Ding 2024; Demenescu, Kato, and Mathiak 2015; Lausen and Schacht 2018; Rafiee and Schacht 2023) and neutral voices (Bonebright, Thompson, and Leger 1996; Demenescu, Kato, and Mathiak 2015; Lausen and Schacht 2018; Rafiee and Schacht 2023). The enhanced female sensitivity in discerning the three highly aroused emotions (i.e., anger, joy, fear) compared with the two more deactivated ones (i.e., sadness, neutrality) can be further explained from an evolutionary perspective. The emotion of joy may carry information about love and affection crucial to mate-seeking, while fear and anger signal potential threats and danger, compelling people to fight or flee. All three high-arousal emotions induce action, energy, and mobilization, making their recognition highly relevant to reproductive and survival success (e.g., Skuse 2003; Al-Shawaf et al. 2016). Given that the “fitness threat” hypothesis posits an exclusively negative bias for women in affective decoding, the “attachment promotion” hypothesis seems to be a more suitable fit for our data (Hampson, van Anders, and Mullin 2006; Menezes et al. 2017; Thompson and Voyer 2014), whereby women are more sensitive to both positive and negative affective signals due to their biological competence and social roles in caretaking, romantic relations, and socialization (Brody and Hall 2010; Collignon et al. 2010; Fischer, Kret, and Broekens 2018; Hall 1978).

The results for intensity judgement revealed that males assigned relatively higher ratings than females to utterances conveying emotional prosody, suggesting that males might have a higher emotional perceptual threshold, while females were capable of extracting more affective information from nuanced emotional vocalizations. Examining the intensity ratings for correctly recognized trials, we found that males gave relatively higher ratings than females for all affective states, while females gave slightly higher intensity ratings for the neutral voices. A further investigation of the confusion patterns revealed that while males were more likely to misrecognize negative emotions as positive or neutral ones, for example taking anger as neutrality or fear as joy, females did the opposite, assigning negative emotional labels (e.g., anger, sadness) to neutral utterances. Integrating this pattern into biological and socialization frameworks, one could suppose that women, given their designated nurturing, affiliative, and less dominant roles (Hess, Adams, and Kleck 2005; Schirmer 2013), may have developed a heightened sensitivity to minimal emotional cues (e.g., the fleeting and subtle affective signals of infants).

Nevertheless, we should be cautious in concluding here with a simple “female advantage” or “female sensitivity” account for vocal emotion processing. As is emphasized by Brunswik’s lens model proposed in Scherer (2003), the decoded attribution only revealed the final performance as a result of the whole process of emotional communication. To better understand the mechanism, we must check the inferential chain from acoustic signals to behavioral responses to see possible acoustic-integrated explanations for gender differences.

4.2 Gender Differences in Cue Utilization by Acoustic-Perceptual Analysis

Scrutinizing the results of the acoustic-perceptual analysis, we observed considerable explanatory power of the prosodic measures for the perceptual results. Meanwhile, our study revealed remarkable gender/sex differences as well as similarities in the cue utilization mechanisms underlying acoustic-perceptual mapping. Based on these results, we speculate that, in constructing their judgements of vocal emotions, males might rely more on objective acoustic signals, while females draw more on subjective inference.

The overall acoustic profile across emotions in our materials is largely compatible with previous reports (Ekberg et al. 2023; Liu and Pell 2014; Wang and Ding 2024). The acoustic measurement using the 12 selected parameters showed that, compared to the other emotions, joy had a relatively moderate prosodic profile (Ekberg et al. 2023; Wang and Ding 2024), which helps explain the overall lower recognition accuracy for happy voices across male and female participants (see a review in Zupan and Eskritt 2024). Meanwhile, in line with other findings (Ekberg et al. 2023; Wang and Ding 2024; Yildirim et al. 2004), anger and happiness were relatively similar in acoustic characteristics, which largely accounts for the poorer separability of the two emotions. In addition, fear and sadness were also close in several acoustic parameters, which may be a source of the mutual confusions commonly observed in the identification task.

Further examining the acoustic-perceptual analytical results, an intriguing finding of the current study was that males demonstrated a higher correlation than females between their performance and the objective acoustic parameters, indicating that males may rely more on objective acoustic information. Considering the female tendency to misrecognize neutral voices as angry and to give higher intensity ratings for neutral voices, it seems that females may apply more self-imposed filters that possibly exaggerate the emotional sensations. In other words, females may not be more sensitive to the objective auditory signals per se; rather, they might amplify any emotion-like cues and react quickly with empathic responses, as indicated by prior findings of a higher degree of categorical perception of vocal emotion (Chen, Wang, and Ding 2024) and an earlier use of affective prosody in lexical processing (Schirmer, Kotz, and Friederici 2002). Additionally, as the current study employed disyllabic word materials conveying linguistic meaning, the gender-related difference in cue utilization may be further attributed to a brain lateralization hypothesis, whereby language functions could be left-lateralized in males but bilateral in females (Shaywitz et al. 1995; Yu et al. 2014). Given the highlighted role of the right hemisphere in emotional communication (Blonder, Bowers, and Hellman 1991; Buchanan et al. 2000), females might be more affectively active because of shorter neural connections between regions responsible for language processing and regions responsible for emotion processing.

Another alternative explanation for the higher dependence on objective acoustic signals in male participants is the simulation theory approach (Goldman 2006), which claims that we gain information about other minds by pretending to be in their “mental shoes.” In other words, people use their own mind as a model to “mirror” or “mimic” the minds of others. Thus, as the default encoding-decoding mapping model used by males failed to perfectly match the expressive pattern of the female speaker, men were required to extract more information from auditory cues to support affective inference. Specifically, the literature on speech production has suggested that men’s phonation categories (i.e., breathy vs. creaky) are distinguished by H1-A3 whereas women’s are distinguished by H1-H2 (Esposito 2010), and that the Hammarberg index may be a better signal of mental stress in men than in women (Gerczuk et al. 2024). Accordingly, in our results, although the female speaker encoded breathiness mainly with raised H1-H2 values, males still preferred the Hammarberg index and H1-A3, in contrast to their female counterparts, who relied more on H1-H2. Earlier neuroimaging research by Sokhi et al. (2005) also showed that female voices elicited higher activation in the right anterior superior temporal gyrus, which is assumed to be a correlate of the extraction of prosodic cues (Mitchell et al. 2003; Zatorre, Evans, and Meyer 1994), while male voices evoked higher activation in the precuneus, which is thought to be related to internal monitoring mechanisms or comparisons drawn between the heard voice and the representation of one’s own (Posner and Petersen 1990; Pugh et al. 1996). Nevertheless, this explanation is speculative and requires further testing and validation.

4.3 Implications, Limitations, and Future Research

Previous studies of the gender effect on affective prosody perception have lacked a thorough investigation of the whole process of emotional communication, from encoding and transmission to decoding. Following the modified version of Brunswik’s lens model proposed by Scherer (2003), we examined the whole vocal-acoustic-auditory chain and provide theoretical implications concerning gender differences in acoustic-perceptual mapping mechanisms. Supporting the “attachment promotion” prediction within the “child-rearing hypothesis,” we corroborated an overall female advantage in detecting vocal affective prosody encoded by the female speaker, especially for the high-arousal emotions of anger, fear, and joy, which are closely related to survival and reproductive success. Moreover, by relating the behavioral performance to acoustic measures, we found that the “female sensitivity” was not the result of a bottom-up superiority in detecting objective physiological signals, but of a top-down dominance that exaggerates minimal affective cues through imposed subjective filters.

Nevertheless, there are important limitations to consider when interpreting these findings. First of all, the current materials are limited by their single-speaker, female-only scope. The patterns observed in the current research may therefore be limited to female-encoder conditions. Future studies could explore how decoder gender moderates the acoustic-perceptual mapping mechanism for prosody encoded by male speakers. Second, the present study investigated the decoders’ gender effect only in a group of adult participants. Recent evidence indicates that gender-based differences in emotional processing can be detected as early as infancy, with biases toward various affective stimuli (i.e., negative or positive) showing age-dependent patterns (see a review in Chaplin and Aldao 2013). This suggests promising directions for future research, including investigating age-related variations in emotion bias and examining how emotional competence develops through life experience and maturation. Moreover, concerning the acoustic measurement, the current analysis was conducted with a restricted focus on 12 acoustic features. Parameters we did not cover might be of great significance for affective processing, especially in the frequency and spectral-balance domains (e.g., F1-, F2-, and F3-related as well as MFCC-related parameters). Further investigation and elaboration should be included in future work.

5 Conclusions

Through a combination of perceptual experimental results and acoustic analysis, this paper enriches the knowledge of the gender effect on the acoustic-perceptual mapping mechanism of emotional prosody with female voices in Mandarin Chinese. We inspected how emotion type, decoder gender, and acoustic portrayals influence the recognition accuracy and intensity judgement of emotional prosody conveyed by a female native Mandarin speaker in emotion perception tasks. The study adds to the literature on gender differences in the recognition of vocal emotions by showing a female advantage in decoding accuracy and a female sensitivity to minimal affective signals. The results further revealed the complicated nature of female superiority in auditory emotional communication, suggesting that males may rely more on objective sensing of acoustic characteristics, while females might impose subjective inferences that amplify subtle emotional cues. The investigation of emotional speech prosody encoded by male voices, over a wider range of acoustic measures and age groups, is required for further testing and validation.


Corresponding author: Hongwei Ding, Speech-Language-Hearing Center, School of Foreign Languages, Shanghai Jiao Tong University, Shanghai, China; and National Research Centre for Language and Well-being, Shanghai, China, E-mail: 

Award Identifier / Grant number: 18ZDA293

Appendix

See Figure A.1.

Figure A.1:
Pearson correlation heatmap of the 12 parameters, with Pearson’s r presented and significance level marked for each pair of parameters.

References

Akaike, H. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control 19 (6): 716–23. https://doi.org/10.1109/tac.1974.1100705.

Al-Shawaf, L., D. Conroy-Beam, K. Asao, and D. M. Buss. 2016. “Human Emotions: An Evolutionary Psychological Perspective.” Emotion Review 8 (2): 173–86. https://doi.org/10.1177/1754073914565518.

Anwyl-Irvine, A. L., J. Massonnié, A. Flitton, N. Kirkham, and J. K. Evershed. 2020. “Gorilla in Our Midst: An Online Behavioral Experiment Builder.” Behavior Research Methods 52 (1): 388–407. https://doi.org/10.3758/s13428-019-01237-x.

Anwyl-Irvine, A., E. S. Dalmaijer, N. Hodges, and J. K. Evershed. 2021. “Realistic Precision and Accuracy of Online Experiment Platforms, Web Browsers, and Devices.” Behavior Research Methods 53 (4): 1407–25. https://doi.org/10.3758/s13428-020-01501-5.

Babchuk, W. A., R. B. Hames, and R. A. Thompson. 1985. “Sex Differences in the Recognition of Infant Facial Expressions of Emotion: The Primary Caretaker Hypothesis.” Ethology and Sociobiology 6 (2): 89–101. https://doi.org/10.1016/0162-3095(85)90002-0.

Barr, D., R. Levy, C. Scheepers, and H. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep it Maximal.” Journal of Memory and Language 68 (3): 255–78. https://doi.org/10.1016/j.jml.2012.11.001.

Bates, D., M. Mächler, B. Bolker, and S. Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Belin, P., S. Fillion-Bilodeau, and F. Gosselin. 2008. “The Montreal Affective Voices: A Validated Set of Nonverbal Affect Bursts for Research on Auditory Affective Processing.” Behavior Research Methods 40 (2): 531–9. https://doi.org/10.3758/BRM.40.2.531.

Belyk, M., and S. Brown. 2014. “Perception of Affective and Linguistic Prosody: An ALE Meta-Analysis of Neuroimaging Studies.” Social Cognitive and Affective Neuroscience 9 (9): 1395–403. https://doi.org/10.1093/scan/nst124.

Bickley, C. 1982. “Acoustic Analysis and Perception of Breathy Vowels.” Speech Communication Group Working Papers, Vol. 1, 73–93. Cambridge, MA: MIT.

Blonder, L. X., D. Bowers, and K. M. Heilman. 1991. “The Role of the Right Hemisphere in Emotional Communication.” Brain 114 (3): 1115–27. https://doi.org/10.1093/brain/115.2.645.

Bonebright, T. L., J. Thompson, and D. W. Leger. 1996. “Gender Stereotypes in the Expression and Perception of Vocal Affect.” Sex Roles 34: 429–45. https://doi.org/10.1007/bf01547811.

Brody, L. R., and J. A. Hall. 2010. “Gender, Emotion, and Socialization.” In Handbook of Gender Research in Psychology, edited by J. Chrisler, and D. McCreary, 429–54. New York: Springer. https://doi.org/10.1007/978-1-4419-1465-1_21.

Buchanan, T. W., K. Lutz, S. Mirzazade, K. Specht, N. J. Shah, K. Zilles, and L. Jäncke. 2000. “Recognition of Emotional Prosody and Verbal Components of Spoken Language: An fMRI Study.” Cognitive Brain Research 9 (3): 227–38. https://doi.org/10.1016/S0926-6410(99)00060-9.

Chaplin, T. M. 2015. “Gender and Emotion Expression: A Developmental Contextual Perspective.” Emotion Review 7 (1): 14–21. https://doi.org/10.1177/1754073914544408.

Chaplin, T., and A. Aldao. 2013. “Gender Differences in Emotion Expression in Children: A Meta-Analytic Review.” Psychological Bulletin 139 (4): 735–65. https://doi.org/10.1037/a0030737.

Chaplin, T. M., and P. M. Cole. 2005. “The Role of Emotion Regulation in the Development of Psychopathology.” In Development of Psychopathology: A Vulnerability-Stress Perspective, edited by B. L. Hankin, and J. R. Z. Abela, 49–74. Thousand Oaks: Sage Publications, Inc. https://doi.org/10.4135/9781452231655.n3.

Chen, Yu, Ting Wang, and Hongwei Ding. 2024. “Effect of Age and Gender on Categorical Perception of Vocal Emotion Under Tonal Language Background.” Journal of Speech, Language, and Hearing Research 67 (11): 4567–83. https://doi.org/10.1044/2024_JSLHR-23-00716.

Collignon, O., S. Girard, F. Gosselin, D. Saint-Amour, F. Lepore, and M. Lassonde. 2010. “Women Process Multisensory Emotion Expressions More Efficiently Than Men.” Neuropsychologia 48 (1): 220–5. https://doi.org/10.1016/j.neuropsychologia.2009.09.007.

Cutler, A., and M. Pearson. 2018. “On the Analysis of Prosodic Turn-Taking Cues.” In Intonation in Discourse, edited by C. Johns-Lewis, 139–55. Abingdon, UK: Routledge. https://doi.org/10.4324/9780429468650-8.

de Gelder, B. 2016. “Gender, Culture and Context Differences in Recognition of Bodily Expressions.” In Emotions and the Body, 1st ed., edited by B. de Gelder, 163–91. New York, NY: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195374346.003.0008.

Demenescu, L. R., Y. Kato, and K. Mathiak. 2015. “Neural Processing of Emotional Prosody Across the Adult Lifespan.” BioMed Research International 2015 (1): 590216–9. https://doi.org/10.1155/2015/590216.

Ding, H., and Y. Zhang. 2023. “Speech Prosody in Mental Disorders.” Annual Review of Linguistics 9 (1): 335–55. https://doi.org/10.1146/annurev-linguistics-030421-065139.

Ehde, D. M. 2018. “Patient Health Questionnaire.” In Encyclopedia of Clinical Neuropsychology, edited by J. S. Kreutzer, J. DeLuca, and B. Caplan, 2601–4. New York: Springer International Publishing. https://doi.org/10.1007/978-3-319-57111-9_2002.

Ekberg, M., G. Stavrinos, J. Andin, S. Stenfelt, and Ö. Dahlström. 2023. “Acoustic Features Distinguishing Emotions in Swedish Speech.” Journal of Voice: Official Journal of the Voice Foundation. Advance online publication. https://doi.org/10.1016/j.jvoice.2023.03.010.

Ertürk, Ayşe, Emre Gürses, and Maviş Emel Kulak Kayıkcı. 2024. “Sex Related Differences in the Perception and Production of Emotional Prosody in Adults.” Psychological Research 88 (2): 449–57. https://doi.org/10.1007/s00426-023-01865-1.

Esposito, C. M. 2010. “Variation in Contrastive Phonation in Santa Ana Del Valle Zapotec.” Journal of the International Phonetic Association 40 (2): 181–98. https://doi.org/10.1017/s0025100310000046.

Eyben, F., M. Wöllmer, and B. Schuller. 2010. “Opensmile: The Munich Versatile and Fast Open-Source Audio Feature Extractor.” In Proceedings of the 18th ACM International Conference on Multimedia, 1459–62. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1873951.1874246.

Eyben, F., F. Weninger, F. Gross, and B. Schuller. 2013. “Recent Developments in Opensmile, the Munich Open-Source Multimedia Feature Extractor.” In Proceedings of the 21st ACM International Conference on Multimedia, 835–8. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2502081.2502224.

Eyben, F., K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, et al. 2016. “The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing.” IEEE Transactions on Affective Computing 7 (2): 190–202. https://doi.org/10.1109/TAFFC.2015.2457417.

Fischer, A., and A. Manstead. 2008. Handbook of Emotions, 3rd ed. New York: Guilford.

Fischer, A. H., M. E. Kret, and J. Broekens. 2018. “Gender Differences in Emotion Perception and Self-Reported Emotional Intelligence: A Test of the Emotion Sensitivity Hypothesis.” PLoS One 13 (1): e0190712. https://doi.org/10.1371/journal.pone.0190712.

Fischer-Jørgensen, E. 1967. “Phonetic Analysis of Breathy (Murmured) Vowels.” Indian Linguistics 28: 71–139.

Fujisawa, T., and K. Shinohara. 2011. “Sex Differences in the Recognition of Emotional Prosody in Late Childhood and Adolescence.” The Journal of Physiological Sciences 61 (5): 429–35. https://doi.org/10.1007/s12576-011-0156-9.

Gerczuk, M., S. Amiriparian, J. Lutz, W. Strube, I. Papazova, A. Hasan, and B. W. Schuller. 2024. “Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment.” In Proceedings of Interspeech 2024, 1095–9. https://doi.org/10.21437/Interspeech.2024-1097.

Goldman, A. 2006. Simulating Minds: The Philosophy, Psychology, and Neuroscience of Mindreading. Oxford: Oxford University Press. https://doi.org/10.1093/0195138929.001.0001.

Gong, B., N. Li, Q. Li, X. Yan, J. Chen, L. Li, X. Wu, and C. Wu. 2023. “The Mandarin Chinese Auditory Emotions Stimulus Database: A Validated Set of Chinese Pseudosentences.” Behavior Research Methods 55 (3): 1441–59. https://doi.org/10.3758/s13428-022-01868-7.

Hall, J. A. 1978. “Gender Effects in Decoding Nonverbal Cues.” Psychological Bulletin 85: 845–57. https://doi.org/10.1037/0033-2909.85.4.845.

Hammarberg, B., B. Fritzell, J. Gaufin, J. Sundberg, and L. Wedin. 1980. “Perceptual and Acoustic Correlates of Abnormal Voice Qualities.” Acta Oto-Laryngologica 90 (1–6): 441–51. https://doi.org/10.3109/00016488009131746.

Hampson, E., S. M. van Anders, and L. I. Mullin. 2006. “A Female Advantage in the Recognition of Emotional Facial Expressions: Test of an Evolutionary Hypothesis.” Evolution and Human Behavior 27 (6): 401–16. https://doi.org/10.1016/j.evolhumbehav.2006.05.002.

Haubo, R., and B. Christensen. 2018. “Cumulative Link Models for Ordinal Regression with the R Package Ordinal.” Journal of Statistical Software 35. https://cran.r-project.org/web/packages/ordinal/vignettes/clm_article.pdf.

Hess, U., R. Adams, and R. Kleck. 2005. “Who May Frown and Who Should Smile? Dominance, Affiliation, and the Display of Happiness and Anger.” Cognition and Emotion 19 (4): 515–36. https://doi.org/10.1080/02699930441000364.

Keenan, K., and A. E. Hipwell. 2005. “Preadolescent Clues to Understanding Depression in Girls.” Clinical Child and Family Psychology Review 8 (2): 89–105. https://doi.org/10.1007/s10567-005-4750-3.

Keshtiari, N., and M. Kuhlmann. 2016. “The Effects of Culture and Gender on the Recognition of Emotional Speech: Evidence from Persian Speakers Living in a Collectivist Society.” International Journal of Society, Culture and Language 4 (2): 71–86. https://doi.org/10.13140/rg.2.1.1159.0001.

Koeda, M., P. Belin, T. Hama, T. Masuda, M. Matsuura, and Y. Okubo. 2013. “Cross-Cultural Differences in the Processing of Non-verbal Affective Vocalizations by Japanese and Canadian Listeners.” Frontiers in Psychology 4: 105. https://doi.org/10.3389/fpsyg.2013.00105.

Korb, S., N. Mikus, C. Massaccesi, J. Grey, S. X. Duggirala, S. A. Kotz, and M. Mehu. 2023. “EmoSex: Emotion Prevails Over Sex in Implicit Judgments of Faces and Voices.” Emotion 23 (2): 569–88. https://doi.org/10.1037/emo0001089.

Lambrecht, L., B. Kreifelts, and D. Wildgruber. 2014. “Gender Differences in Emotion Recognition: Impact of Sensory Modality and Emotional Category.” Cognition and Emotion 28 (3): 452–69. https://doi.org/10.1080/02699931.2013.837378.

Lausen, A., and K. Hammerschmidt. 2020. “Emotion Recognition and Confidence Ratings Predicted by Vocal Stimulus Type and Prosodic Parameters.” Humanities and Social Sciences Communications 7: 2. https://doi.org/10.1057/s41599-020-0499-z.

Lausen, A., and A. Schacht. 2018. “Gender Differences in the Recognition of Vocal Emotions.” Frontiers in Psychology 9: 882. https://doi.org/10.3389/fpsyg.2018.00882.

Levenson, R. W., and A. M. Ruef. 1992. “Empathy: A Physiological Substrate.” Journal of Personality and Social Psychology 63 (2): 234–46. https://doi.org/10.1037/0022-3514.63.2.234.

Lin, Y., and H. Ding. 2020. “Effects of Communication Channels and Actor’s Gender on Emotion Identification by Native Mandarin Speakers.” In Proceedings of Interspeech 2020, 3151–5. https://doi.org/10.21437/Interspeech.2020-1498.

Lin, Y., H. Ding, and Y. Zhang. 2021. “Gender Differences in Identifying Facial, Prosodic, and Semantic Emotions Show Category- and Channel-specific Effects Mediated by Encoder’s Gender.” Journal of Speech, Language, and Hearing Research 64 (8): 2941–55. https://doi.org/10.1044/2021_jslhr-20-00553.

Lin, Yi, Xinran Fan, Yueqi Chen, Hao Zhang, Fei Chen, Hui Zhang, Hongwei Ding, and Yang Zhang. 2022. “Neurocognitive Dynamics of Prosodic Salience Over Semantics During Explicit and Implicit Processing of Basic Emotions in Spoken Words.” Brain Sciences 12 (12): 1706. https://doi.org/10.3390/brainsci12121706.

Liu, P., and M. Pell. 2012. “Recognizing Vocal Emotions in Mandarin Chinese: A Validated Database of Chinese Vocal Emotional Stimuli.” Behavior Research Methods 44: 1042–51. https://doi.org/10.3758/s13428-012-0203-3.

Liu, P., and M. Pell. 2014. “Processing Emotional Prosody in Mandarin Chinese: A Cross-Language Comparison.” In Proceedings of Speech Prosody 2014, 95–9. https://doi.org/10.21437/SpeechProsody.2014-7.

Lucarini, V., M. Grice, F. Cangemi, J. Zimmermann, C. Marchesi, K. Vogeley, and M. Tonna. 2020. “Speech Prosody as a Bridge between Psychopathology and Linguistics: The Case of the Schizophrenia Spectrum.” Frontiers in Psychiatry 11: 531863. https://doi.org/10.3389/fpsyt.2020.531863.

Matuschek, H., R. Kliegl, S. Vasishth, H. Baayen, and D. Bates. 2017. “Balancing Type I Error and Power in Linear Mixed Models.” Journal of Memory and Language 94: 305–15. https://doi.org/10.1016/j.jml.2017.01.001.

McClure, E. B. 2000. “A Meta-Analytic Review of Sex Differences in Facial Expression Processing and Their Development in Infants, Children, and Adolescents.” Psychological Bulletin 126 (3): 424–53. https://doi.org/10.1037/0033-2909.126.3.424.

Menezes, C. B., J. C. Hertzberg, F. E. das Neves, P. F. Prates, J. F. Silveira, and S. J. L. Vasconcellos. 2017. “Gender and the Capacity to Identify Facial Emotional Expressions.” Estudos de Psicologia 22 (1): 1–9. https://doi.org/10.22491/1678-4669.20170001.

Mitchell, R. L. C., R. Elliott, M. Barry, A. Cruttenden, and P. W. R. Woodruff. 2003. “The Neural Response to Emotional Prosody, as Revealed by Functional Magnetic Resonance Imaging.” Neuropsychologia 41 (10): 1410–21. https://doi.org/10.1016/s0028-3932(03)00017-4.

Olivoto, T., and A. D. Lucio. 2020. “Metan: An R Package for Multi-Environment Trial Analysis.” Methods in Ecology and Evolution 11 (6): 783–9. https://doi.org/10.1111/2041-210X.13384.

Paulmann, S., and A. K. Uskul. 2014. “Cross-Cultural Emotional Prosody Recognition: Evidence from Chinese and British Listeners.” Cognition and Emotion 28 (2): 230–44. https://doi.org/10.1080/02699931.2013.812033.

Pell, M. D. 2002. “Evaluation of Nonverbal Emotion in Face and Voice: Some Preliminary Findings on a New Battery of Tests.” Brain and Cognition 48 (2): 499–504. https://doi.org/10.1006/brcg.2001.1406.

Peng, G., C. Zhang, H.-Y. Zheng, J. W. Minett, and W. S.-Y. Wang. 2012. “The Effect of Intertalker Variations on Acoustic–Perceptual Mapping in Cantonese and Mandarin Tone Systems.” Journal of Speech, Language, and Hearing Research 55 (2): 579–95. https://doi.org/10.1044/1092-4388(2011/11-0025).

Posner, M. I., and S. E. Petersen. 1990. “The Attention System of the Human Brain.” Annual Review of Neuroscience 13 (1): 25–42. https://doi.org/10.1146/annurev.ne.13.030190.000325.

Pugh, K. R., B. A. Shaywitz, S. E. Shaywitz, R. K. Fulbright, D. Byrd, P. Skudlarski, D. P. Shankweiler, et al. 1996. “Auditory Selective Attention: An fMRI Investigation.” NeuroImage 4: 159–73. https://doi.org/10.1006/nimg.1996.0067.

Qian, Y., and W. Pan. 2002. “Prosodic Word: The Lowest Constituent in the Mandarin Prosody Processing.” In Proceedings of Speech Prosody 2002, 591–4. https://doi.org/10.21437/SpeechProsody.2002-133.

Rafiee, Y., and A. Schacht. 2023. “Sex Differences in Emotion Recognition: Investigating the Moderating Effects of Stimulus Features.” Cognition and Emotion 37 (5): 863–73. https://doi.org/10.1080/02699931.2023.2222579.

Rochat, M. J. 2023. “Sex and Gender Differences in the Development of Empathy.” Journal of Neuroscience Research 101 (5): 718–29. https://doi.org/10.1002/jnr.25009.

Scherer, K. R. 2003. “Vocal Communication of Emotion: A Review of Research Paradigms.” Speech Communication 40 (1–2): 227–56. https://doi.org/10.1016/S0167-6393(02)00084-5.

Scherer, K. R., R. Banse, and H. G. Wallbott. 2001. “Emotion Inferences from Vocal Expression Correlate Across Languages and Cultures.” Journal of Cross-Cultural Psychology 32 (1): 76–92. https://doi.org/10.1177/0022022101032001009.

Schirmer, A. 2013. “Sex Differences in Emotion.” In The Cambridge Handbook of Human Affective Neuroscience, 1st ed., edited by J. Armony, and P. Vuilleumier, 591–611. New York, NY: Cambridge University Press. https://doi.org/10.1017/CBO9780511843716.033.

Schirmer, A., S. A. Kotz, and A. D. Friederici. 2002. “Sex Differentiates the Role of Emotional Prosody During Word Processing.” Cognitive Brain Research 14 (2): 228–33. https://doi.org/10.1016/s0926-6410(02)00108-8.

Schirmer, A., S. A. Kotz, and A. D. Friederici. 2005. “On the Role of Attention for the Processing of Emotions in Speech: Sex Differences Revisited.” Cognitive Brain Research 24 (3): 442–52. https://doi.org/10.1016/j.cogbrainres.2005.02.022.

Schmid, P., M. Mast, and F. Mast. 2011. “Gender Effects in Information Processing on a Nonverbal Decoding Task.” Sex Roles 65 (1): 102–7. https://doi.org/10.1007/s11199-011-9979-3.

Shaywitz, B. A., S. E. Shaywitz, K. R. Pugh, R. T. Constable, P. Skudlarski, R. K. Fulbright, R. A. Bronen, et al. 1995. “Sex Differences in the Functional Organization of the Brain for Language.” Nature 373 (6515): 607–9. https://doi.org/10.1038/373607a0.

Skuse, D. 2003. “Fear Recognition and the Neural Basis of Social Cognition.” Child and Adolescent Mental Health 8 (2): 50–60. https://doi.org/10.1111/1475-3588.00047.

Sokhi, D. S., M. D. Hunter, I. D. Wilkinson, and P. W. Woodruff. 2005. “Male and Female Voices Activate Distinct Regions in the Male Brain.” NeuroImage 27 (3): 572–8. https://doi.org/10.1016/j.neuroimage.2005.04.023.

Stevens, K., and H. Hanson. 1995. “Classification of Glottal Vibration from Acoustic Measurements.” In Vocal Fold Physiology and Voice Quality Control, edited by Fujimura, and Hirano, 147–70. San Diego, CA: Singular Publishing Group Inc.

Tang, E., J. Gong, J. Zhang, J. Zhang, R. Fang, J. Guan, and H. Ding. 2024. “Chinese Emotional Speech Audiometry Project (CESAP): Establishment and Validation of a New Material Set with Emotionally Neutral Disyllabic Words.” Journal of Speech, Language, and Hearing Research 67 (6): 1945–63. https://doi.org/10.1044/2024_JSLHR-23-00625.

Thompson, A. E., and D. Voyer. 2014. “Sex Differences in the Ability to Recognise Non-Verbal Displays of Emotion: A Meta-Analysis.” Cognition and Emotion 28 (7): 1164–95. https://doi.org/10.1080/02699931.2013.875889.

Toivanen, J., E. Väyrynen, and T. Seppänen. 2005. “Gender Differences in the Ability to Discriminate Emotional Content from Speech.” In Proceedings FONETIK, edited by A. Eriksson, and J. Lindh, 119–23. Göteborg: Reprocentralen, Humanisten, Göteborg University.

Van Rijn, Pol, and Pauline Larrouy-Maestri. 2023. “Modelling Individual and Cross-Cultural Variation in the Mapping of Emotions to Speech Prosody.” Nature Human Behaviour 7 (3): 386–96. https://doi.org/10.1038/s41562-022-01505-5.

Vasconcelos, M., M. Dias, A. P. Soares, and A. P. Pinheiro. 2017. “What Is the Melody of that Voice? Probing Unbiased Recognition Accuracy with the Montreal Affective Voices.” Journal of Nonverbal Behavior 41 (3): 239–67. https://doi.org/10.1007/s10919-017-0253-4.

Wang, H. 1998. “Statistical Analysis of Mandarin Acoustic Units and Automatic Extraction of Phonetically Rich Sentences Based Upon a Very Large Chinese Text Corpus.” International Journal of Computational Linguistics and Chinese Language Processing 3 (2).

Wang, X., and H. Ding. 2024. “Acoustic-Prosodic Analysis for Mandarin Disyllabic Words Conveying Vocal Emotions.” In Proceedings of Speech Prosody 2024, 956–60. ISCA. https://doi.org/10.21437/SpeechProsody.2024-193.

Williams, N. 2014. “The GAD-7 Questionnaire.” Occupational Medicine 64 (3): 224. https://doi.org/10.1093/occmed/kqt161.

Xiao, C., and J. Liu. 2024. “The Perception of Emotional Prosody in Mandarin Chinese Words and Sentences.” Second Language Research 1–28. https://doi.org/10.1177/02676583241286748.

Yildirim, S., M. Bulut, C. M. Lee, A. Kazemzadeh, Z. Deng, S. Lee, S. S. Narayanan, and C. Busso. 2004. “An Acoustic Study of Emotions Expressed in Speech.” In Proceedings of Interspeech 2004, 2193–6. https://doi.org/10.21437/interspeech.2004-242.

Yu, V. Y., M. J. MacDonald, A. Oh, G. N. Hua, L. F. De Nil, and E. W. Pang. 2014. “Age-Related Sex Differences in Language Lateralization: A Magnetoencephalography Study in Children.” Developmental Psychology 50 (9): 2276–84. https://doi.org/10.1037/a0037470.

Zahn-Waxler, C., E. A. Shirtcliff, and K. Marceau. 2008. “Disorders of Childhood and Adolescence: Gender and Psychopathology.” Annual Review of Clinical Psychology 4 (1): 275–303. https://doi.org/10.1146/annurev.clinpsy.3.022806.091358.

Zatorre, R. J., A. C. Evans, and E. Meyer. 1994. “Neural Mechanisms Underlying Melodic Perception and Memory for Pitch.” Journal of Neuroscience 14 (4): 1908–9. https://doi.org/10.1523/jneurosci.14-04-01908.1994.

Zupan, Barbra, and Michelle Eskritt. 2024. “Facial and Vocal Emotion Recognition in Adolescence: A Systematic Review.” Adolescent Research Review 9 (2): 253–77. https://doi.org/10.1007/s40894-023-00219-7.

Zupan, B., D. Babbage, D. Neumann, and B. Willer. 2016. “Sex Differences in Emotion Recognition and Emotional Inferencing Following Severe Traumatic Brain Injury.” Brain Impairment 18 (1): 36–48. https://doi.org/10.1017/BrImp.2016.22.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/csh-2024-0025).


Received: 2024-09-19
Accepted: 2024-11-19
Published Online: 2024-12-06
Published in Print: 2024-11-25

© 2024 the author(s), published by De Gruyter on behalf of Shanghai International Studies University

This work is licensed under the Creative Commons Attribution 4.0 International License.
