Abstract
Our study tests the acoustic fidelity of remote recordings, using a large variety of stimuli and recording environments. Remote recording supports crucial uses such as reaching isolated populations and including more speakers; however, it is important to understand its limitations. A 188-word list was constructed by pairing each English consonant with each vowel. Words recorded by one male and one female speaker in a sound-attenuated booth served as input for the test recordings. Stimuli were played over an electronic speaker and recorded on six devices across five operating systems, four teleconferencing platforms, and two browsers, using internal and external microphones, a more extensive set than has been analyzed in prior research. Acoustic analysis investigates the impact of these recording configurations on fundamental frequency, loudness, and vowel formant measures. Across recording conditions, browser choice had the smallest effect on the vowel measures, while choice of hardware resulted in a greater number of significant differences and greater variability within and across conditions. Recording configuration effects differed across vowels and speakers, resulting in inconsistent perturbations. These results highlight the need for careful selection and detailed documentation of all recording configurations.
1 Introduction
As a result of the COVID-19 pandemic, there has been a significant increase in research being conducted remotely, including studies relying on phonetic measures such as formants and fundamental frequency (f0). The benefits of conducting research remotely are not limited to the pandemic: remote work allows more participants to be included in studies and provides an opportunity to reach remote or diverse populations. However, there is still much to learn about how measures may be affected by the devices or software being used. The goal of this study is to expand that knowledge and to provide some guidance, as well as directions for further inquiry, for conducting phonetic research online.
A few studies prior to 2019 evaluated certain aspects of remote recordings, focusing on device hardware, microphones, and networks. De Decker and Nycz (2011) examined Apple devices and found that the MacBook Pro and iPhone produced reliable F1 and F2 measures but unreliable F3 measures. A preliminary study comparing remote versus local telephone recordings for medical diagnostic purposes showed that speech rate, pitch variability/range, and voice onset time were preserved in remote recordings (Cannizzaro et al. 2005). Studies from the voice disorder domain identified further sources of interference: Parsa et al. (2001) showed that the type, placement, and sensitivity level of the microphone can affect the discrimination between normal and pathological voices (see also Titze and Winholtz 1993), while Xue and Lower (2010) and Manfredi et al. (2017) showed that internet bandwidth over videoconferencing can interfere with voice quality assessment, although local recordings on smartphone devices can be reliable. These studies used electronic speakers or software to play out pre-recorded stimuli, a technique also used in the methodology of our present study.
Since the pandemic, research has also explored videoconferencing software such as Zoom. Freeman and De Decker (2021b) found that Zoom (lo-fi), Skype, and Microsoft Teams were sufficient for identifying relative vowel space and nasalization patterns but could not produce reliable absolute measures. They found the same pattern across personal devices including a MacBook Air, an Acer laptop, an iPad, an iPhone, and a Google Pixel phone (Freeman and De Decker 2021a), keeping the stimuli constant by simultaneously recording one talker on multiple devices. Further studies found that Zoom (lo-fi) recordings yielded lower raw F1 values and higher F2 values for vowels, though Lobanov normalization mitigated the effects (Calder et al. 2022); likewise for sibilants, Zoom stretched the spectral space (Calder and Wheeler 2022). With mobile phones, Zhang et al. (2021) found that pitch can be reliably tracked using both lossless phone applications and Zoom (lo-fi), but native phone applications performed better for formant analysis and did not experience the drops in intensity that Zoom did. Mobile phones were also found to capture a wide frequency range and reliable differences in F1, F2, and center of gravity, although signal-to-noise ratios and F3 were unreliable (Guan and Li 2021). Ge et al. (2021) also used Zoom across several devices and concluded that f0, jitter, and shimmer measurements were reliable while segment duration, F2, spectral moments, and H1*–H2* were not. In a comparison of hardware and software by Sanker et al. (2021), duration, signal-to-noise ratio, and frequency measures were affected while phonemic distinctions were retained. A methodological comparison across these studies is given in the Supplementary Material.
To generalize across some effects in these studies, we observe that formant spaces are often shifted. F1 and F2 may be more reliable across device hardware comparisons, while F3 is not (De Decker and Nycz 2011; Guan and Li 2021). Software like Zoom tends to distort F1 and F2 (Calder et al. 2022; Ge et al. 2021), and recommendations for handling this issue are to examine relative patterns (Freeman and De Decker 2021b) or conduct normalization (Calder and Wheeler 2022). Zoom and other applications could be reliable for pitch, but not for intensity (Ge et al. 2021; Zhang et al. 2021).
Our study is unique in its systematic exploration of well-known devices and software. Whereas prior work used separate stimuli for device and software comparisons (e.g., Freeman and De Decker 2021a, 2021b; Sanker et al. 2021), our stimuli are held constant across all device, software, browser, and microphone settings. Our work is also the first to compare across common internet browser types. The research questions we seek to answer are the following:
Does controlling for recording setups facilitate cross-participant acoustic comparisons?
Are there advantages to requiring a particular piece of software or device depending on the measure(s) of interest?
2 Methods
2.1 Stimuli
The stimuli in our experiment are real words, each containing one of the English monophthongs /æ, ɪ, i, ɛ, u, ɑ/ or the diphthong /ɑɪ/, preceded by one of the consonants /m, b, p, n, d, t, g, k, h, l, ɹ, v, f, ð, θ, z, s, ʃ, ʒ, tʃ, dʒ, j, w, ʔ/ or the cluster /tɹ/. Two nonsense words, /ðut/ and /ðɑt/, were used where real words could not be selected. The complete 188-token word list appears in the Supplementary Material.
2.2 Original recordings
Baseline recordings were made in a sound-attenuated booth using a Zoom H4n recorder and an AT-4041 tabletop microphone. Two of the authors, both trained Michigan American English-speaking phoneticians, one female and one male, each read the stimuli a total of three times from randomized word lists. One instance of each word from each speaker was chosen for the main experiments, and was scaled to peak intensity.
2.3 Testing configurations
Devices tested in this study were four laptop computers and two smartphones: two MacBook Airs (2013 and 2019, both running macOS Big Sur, denoted MBA13 and MBA19), two Lenovo laptop PCs (a Lenovo T490 running Windows 10, denoted LenT490, and a Lenovo P50 running Ubuntu 16.04, denoted LenP50), a Samsung Galaxy S8 running Android 9 (denoted S8), and an iPhone 12 mini running iOS (denoted iPhone). Four teleconferencing software platforms were used: Zoom (in both hi-fi and lo-fi settings), Cleanfeed, Google Meet, and Skype. On the browser-based Cleanfeed app, recordings were made using Chrome and Firefox.[1] Finally, recordings were made with the internal microphones of the devices as well as with the same external microphone (a JV603 desktop microphone) where possible. This microphone was selected for its high consumer and reviewer ratings, its type (cardioid condenser), and its low price point (under US$20), to represent the microphones researchers might send to participants. Not all combinations of software, device, and microphone were successfully recorded, due to issues such as device incompatibility and experimenter error. The full set of configurations is shown in Tables 1 and 2.
Table 1: Device and software configurations in current study. Microphone options are indicated with int(ernal), ext(ernal), and X (no recording).

| Device | Google Meet | Skype | Zoom hi-fi | Zoom lo-fi | Native |
|---|---|---|---|---|---|
| LenT490 | int/ext | int/ext | int/ext | int/ext | Voice Recorder int |
| LenP50 | int | int | int | int | X |
| MBA19 | int/ext | int/ext | int/ext | int/ext | X |
| MBA13 | X | X | int/ext | int/ext | Praat int/ext |
| S8 | int | int | X | int | int |
| iPhone | int | int | X | int | int |
Table 2: Device and browser configurations in current study, using Cleanfeed. Microphone options are indicated with int(ernal), ext(ernal), and X (no recording).

| Device | Chrome | Firefox | Edge |
|---|---|---|---|
| LenT490 | int/ext | int/ext | int/ext |
| LenP50 | int | X | X |
| MBA19 | int/ext | int/ext | X |
| MBA13 | int/ext | X | X |
| S8 | int | int | int |
| iPhone | X | X | X |
2.4 Test recording procedure
The test recordings were made in researchers’ homes using the following protocol: a digital speaker (Anker Soundcore Hi-Res 30W) was positioned uniformly to mimic a human speaker, 12 inches above and 12 inches in front of the recording device. The digital speaker was connected to a play-out device containing the original recordings, and both devices had their volume set to the midpoint. The recording setup is illustrated in Figure 1.
Figure 1: Test recording setup, with the digital speaker on a stand 12 inches above and 12 inches in front of the computer microphone.
2.5 Analysis
2.5.1 TextGrid creation for original recordings
For each reference token, a TextGrid was produced with hand-alignments at the phoneme level. Vowels were marked from the onset of regular voicing; when preceded by a liquid or approximant, the midpoint of the two phones was treated as the vowel onset.
2.5.2 Alignment of test recordings
Input for all analyses was in WAV format, either as generated by the recording or, where necessary, converted to WAV using FFmpeg. To extract the individual words from the test recordings and match them with the phoneme alignments, a cross-correlation was computed for each pair of original and test files using Parselmouth (Jadoul et al. 2018), a Python library interfacing with Praat (Boersma and Weenink 2021), with amplitude scaling to a peak of 0.99 and the signal assumed to be zero outside the time domain. The time at the maximum of this cross-correlation was computed using Whittaker–Shannon interpolation with 70 points, and that value was used to shift the test audio into alignment with the original.
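For concreteness, a minimal sketch of this alignment step using Parselmouth’s Praat interface is given below. The file names are hypothetical, and the exact sign of the shift should be checked against Praat’s definition of cross-correlation; the commands mirror the settings described above (peak scaling of 0.99, zero signal outside the time domain, sinc70 interpolation).

```python
import parselmouth
from parselmouth.praat import call

# Hypothetical file names: each test recording is paired with the
# corresponding original booth recording.
original = parselmouth.Sound("original.wav")
test = parselmouth.Sound("test.wav")

# Cross-correlate the pair, with amplitude scaling to a peak of 0.99
# and the signal assumed to be zero outside the time domain.
cc = call([original, test], "Cross-correlate", "peak 0.99", "zero")

# Time at the maximum of the cross-correlation, using Whittaker-Shannon
# (sinc70) interpolation; this estimates the lag of the test audio.
lag = call(cc, "Get time of maximum", 0, 0, "sinc70")

# Shift the test recording's time axis into alignment with the original
# (sign convention per Praat's definition of cross-correlation).
call(test, "Shift times by", -lag)
```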
Certain configurations, particularly those recorded over internet connections, experienced variable lag, resulting in misalignment. To correct for these lags, after the preliminary phase shift was computed, the cross-correlation method was applied again to individual words. Each re-aligned token file was verified by one researcher, and 109 tokens (0.5 % of recorded tokens) were removed across 34 recording conditions due to persistent poor time alignment (55 % from male recordings, 45 % from female). Eleven of the removed tokens showed evidence of temporal stretching within a word. Recordings on the S8 were removed most often, comprising 61 of the removed tokens; 32 of these came from the set recorded with Cleanfeed in the Edge browser on the S8. This dwarfs all other conditions: in no non-S8 recording condition were more than four tokens removed. The validated tokens were matched to the corresponding token TextGrid objects, yielding phoneme alignments for all tokens.
2.5.3 Measure extraction
For each vowel, we collected the measurements described below at 10 % increments between 20 % and 80 % of the vowel duration (7 time points). Vowels shorter than 80 ms were excluded. The following measures were computed every 10 ms from the original and test recordings (a Parselmouth sketch of the extraction follows the list):
f0 (in Hz): Using Praat via the Parselmouth package. Pitch maximum was set to 300 Hz for the male speaker, and 400 Hz for the female speaker. Pitch minimum was set to the default.
Formants (in Hz): Also using Parselmouth, values for F1–F3 were extracted. The maximum frequency was 5,000 Hz for the male speaker, and 5,500 Hz for the female speaker, with the number of formants set to 5 for both speakers.
Loudness (in sones): Using OpenSMILE (Eyben et al. 2010), computed based on auditory weighting of the spectrum.
MFCCs (Mel-frequency cepstral coefficients): Popularized by research in automatic speech recognition, MFCCs provide an alternative view of the distribution of energy at different frequencies and can avoid the tracking errors which affect formant extraction. MFCC1 can be interpreted as capturing the difference in energy at high and low frequencies, though higher order MFCCs lack such natural interpretations. We used OpenSMILE to extract MFCCs over 25 ms windows with 10 ms steps. MFCCs are unitless.
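As an illustration, the following sketch extracts f0 and the first two formants at the seven time points with Parselmouth, using the male speaker’s settings described above; the file name and vowel interval are hypothetical stand-ins for the TextGrid alignments.

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("aligned_token.wav")   # hypothetical file
vowel_start, vowel_end = 0.212, 0.384          # hypothetical TextGrid interval

# Male-speaker settings from above; for the female speaker, use a pitch
# ceiling of 400 Hz and a maximum formant frequency of 5,500 Hz.
pitch = snd.to_pitch(pitch_ceiling=300.0)      # pitch floor left at default
formants = snd.to_formant_burg(max_number_of_formants=5.0,
                               maximum_formant=5000.0)

duration = vowel_end - vowel_start
if duration >= 0.080:                          # vowels under 80 ms excluded
    # 7 time points at 10 % increments between 20 % and 80 % of duration
    for prop in np.linspace(0.2, 0.8, 7):
        t = vowel_start + prop * duration
        f0 = pitch.get_value_at_time(t)
        f1 = formants.get_value_at_time(1, t)
        f2 = formants.get_value_at_time(2, t)
        print(f"{prop:.0%}: f0={f0:.1f} Hz, F1={f1:.0f} Hz, F2={f2:.0f} Hz")
```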
2.5.4 Statistical analysis
Differences in acoustic measurements relative to the original recording[2] across devices, software, web browsers, and microphones were analyzed using linear mixed-effects models fit with the lmerTest package (Kuznetsova et al. 2017) for R (R Core Team 2022). R formulas are specified under each table or figure for which models were fit.
Loudness and fundamental frequency (f0) models were fit using measures for all vowels, while models for the first two formants were fit separately for each vowel. Separate models were fit for each of the two speakers. All measures were averaged across each vowel instance.[3] All models except those comparing hardware also included a random effect for device. When reporting significance, * refers to p < 0.05, ** to p < 0.01, and *** to p < 0.001.
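The models themselves were fit in R; for readers working in Python, an approximately equivalent random-intercept model can be sketched with statsmodels, as below. The data frame is a hypothetical stand-in for one speaker’s per-token loudness averages.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one mean loudness value (in sones) per
# word and condition, for a single speaker.
hardware = pd.DataFrame({
    "device":   ["Original", "iPhone", "S8"] * 4,
    "word":     sum([[w] * 3 for w in ["bat", "beat", "boot", "bought"]], []),
    "loudness": [3.7, 1.6, 1.7, 3.5, 1.5, 1.8, 3.6, 1.7, 1.6, 3.4, 1.6, 1.7],
})

# Mirrors lmer(loudness ~ device + (1|word), REML = FALSE): a fixed effect
# of device (with "Original" as the reference level) and a random
# intercept per word, fit by maximum likelihood.
model = smf.mixedlm("loudness ~ C(device, Treatment(reference='Original'))",
                    data=hardware, groups="word")
print(model.fit(reml=False).summary())
```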
3 Results
3.1 Presentation of data
3.1.1 F1 and F2 heatmaps
We present comparisons for F1 and F2 obtained through the linear mixed-effects models across phonemes and contrastive recording conditions as heatmaps, created with the Seaborn Python package (Waskom 2021). The color saturation of each cell corresponds to the estimated effect size relative to the original recording, with a diverging color scale distinguishing positive from negative estimates. A short vertical dash in a cell indicates that the depicted difference did not reach statistical significance (p < 0.05).
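A sketch of how such a heatmap can be assembled with Seaborn is given below; the effect sizes and p values are randomly generated stand-ins for the model estimates, not values from the study.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

vowels = ["æ", "ɪ", "i", "ɛ", "u", "ɑ", "ɑɪ"]
devices = ["iPhone", "LenP50", "LenT490", "MBA13", "MBA19", "S8"]

# Hypothetical estimated effects (Hz) and p values standing in for the
# per-vowel lmer estimates.
rng = np.random.default_rng(0)
effects = pd.DataFrame(rng.normal(0, 50, (len(devices), len(vowels))),
                       index=devices, columns=vowels)
pvals = pd.DataFrame(rng.uniform(0, 0.1, effects.shape),
                     index=devices, columns=vowels)

# Mark non-significant cells (p >= 0.05) with a short vertical dash and
# use a diverging palette centered at zero.
marks = pvals.applymap(lambda p: "" if p < 0.05 else "|")
ax = sns.heatmap(effects, center=0, cmap="coolwarm", annot=marks, fmt="",
                 cbar_kws={"label": "Estimated difference from original (Hz)"})
ax.set(xlabel="Vowel", ylabel="Device")
plt.tight_layout()
plt.show()
```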
3.1.2 MFCC heatmaps
Results for MFCC1 and MFCC2 are displayed in heatmaps (as described for F1 and F2) and discussed for the hardware comparison. MFCC plots for the remaining comparisons are available in the Supplementary Material.
3.1.3 Vowel plots
F1 and F2 measures were computed as the average value of the respective measure between the 20 % and 80 % points. Entries with missing data were removed, and the data were trimmed at the upper and lower quartiles (for the purposes of plotting and the Pillai scores below only). Plots were created using the dplyr (Wickham et al. 2022) and phonR (McCloy 2012) packages for R. Two types of plots were created for each speaker and each comparison: (1) distribution plots of F1 × F2 for each instance of each vowel by condition; and (2) vowel space plots of the average F1 × F2 for each vowel by condition. For readability, /æ/ was excluded from the distribution plots. In all plots, the IPA symbol of the vowel is plotted at the mean F1 × F2 for that vowel.
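The plots in this paper were made with phonR in R; purely for illustration, a minimal matplotlib equivalent of a vowel space plot, with hypothetical mean formant values and the axes reversed in the conventional way, might look as follows.

```python
import matplotlib.pyplot as plt

# Hypothetical mean (F1, F2) values in Hz for one speaker and condition.
means = {"i": (310, 2250), "ɪ": (400, 2000), "ɛ": (550, 1850),
         "æ": (680, 1700), "ɑ": (720, 1150), "u": (340, 1100)}

fig, ax = plt.subplots()
for vowel, (f1, f2) in means.items():
    # Plot the IPA symbol at the vowel's mean F1 x F2.
    ax.text(f2, f1, vowel, ha="center", va="center", fontsize=14)

# Reverse both axes so that F2 decreases left to right and F1 increases
# top to bottom, matching phonetic convention.
ax.set_xlim(2400, 900)
ax.set_ylim(800, 250)
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
plt.show()
```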
3.1.4 Pillai statistic
The Pillai score is a test statistic produced by a MANOVA. Pillai scores and statistical significance based on p values are reported for the F1 × F2 of each condition compared to the F1 × F2 of the original recording, with one table per comparison, computed from the same trimmed data used to generate the F1 × F2 plots. The p value associated with the Pillai–Bartlett trace from the MANOVA indicates whether each vowel distribution is statistically significantly differentiated from the original recording by F1 and/or F2 values (Hay et al. 2006). The Pillai–Bartlett trace itself ranges from 0, indicating complete overlap between the vowel distributions in the two conditions, to 1, indicating complete separation.
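In Python, an analogous Pillai score can be obtained from statsmodels’ MANOVA, as sketched below with hypothetical F1 × F2 data for a single vowel; the study’s scores were computed from the trimmed data described above.

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical F1 x F2 measurements (Hz) for one vowel: five tokens from
# the original recording and five from one device condition.
df = pd.DataFrame({
    "F1": [512, 498, 530, 505, 520, 471, 455, 490, 462, 480],
    "F2": [1810, 1792, 1845, 1826, 1833, 1733, 1759, 1801, 1744, 1766],
    "condition": ["original"] * 5 + ["device"] * 5,
})

# MANOVA of F1 and F2 on condition; the Pillai-Bartlett trace and its
# p value are read from the test results for the condition term.
res = MANOVA.from_formula("F1 + F2 ~ condition", data=df).mv_test()
stat = res.results["condition"]["stat"]
print(stat.loc["Pillai's trace", ["Value", "Pr > F"]])
```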
3.2 Hardware
All hardware devices were compared using recordings made with Zoom lo-fi, using the internal microphone.
3.2.1 Loudness
All hardware configurations resulted in lower loudness than the original, and all differences were statistically significant (Table 3). Similar patterns of reduced loudness were found for all software, browser, and microphone comparisons; these results are not repeated in subsequent sections but are available in the Supplementary Material.
Table 3: Loudness (in sones) by hardware and speaker. Call: lmer(loudness ∼ device + (1|word), data = hardware.speaker, REML = FALSE).

| Device | Estimate | Standard error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| Male speaker | | | | |
| Original (intercept) | 3.66 | 0.03 | 106.52 | <2e−16*** |
| iPhone | −2.07 | 0.03 | −61.2 | <2e−16*** |
| LenP50 | −2.25 | 0.03 | −66.6 | <2e−16*** |
| LenT490 | −2.28 | 0.03 | −67.62 | <2e−16*** |
| MBA13 | −3.38 | 0.03 | −99.59 | <2e−16*** |
| MBA19 | −1.94 | 0.03 | −57.11 | <2e−16*** |
| S8 | −1.96 | 0.03 | −57.47 | <2e−16*** |
| Female speaker | | | | |
| Original (intercept) | 3.55 | 0.04 | 82.4 | <2e−16*** |
| iPhone | −1.67 | 0.04 | −44 | <2e−16*** |
| LenP50 | −1.91 | 0.04 | −50.63 | <2e−16*** |
| LenT490 | −2.08 | 0.04 | −54.97 | <2e−16*** |
| MBA13 | −3.28 | 0.04 | −86.48 | <2e−16*** |
| MBA19 | −1.7 | 0.04 | −45.02 | <2e−16*** |
| S8 | −1.94 | 0.04 | −51.11 | <2e−16*** |
3.2.2 f0
For the female speaker, across all devices, there were no statistically significant differences from the original (Table 4). For the male speaker, there were statistically significant decreases in the mean f0 from the original for two devices, the S8 by 2.36 Hz and the LenP50 by 1.48 Hz. For all other devices, differences did not reach significance.
Table 4: f0 (in Hz) by hardware and speaker. Call: lmer(f0 ∼ device + (1|word), data = hardware.speaker, REML = FALSE).

| Device | Estimate | Standard error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| Male speaker | | | | |
| Original (intercept) | 121 | 0.9 | 134.28 | <2e−16*** |
| iPhone | −0.01 | 0.64 | −0.01 | 0.99 |
| LenP50 | −1.48 | 0.64 | −2.31 | 0.021* |
| LenT490 | −0.08 | 0.64 | −0.12 | 0.903 |
| MBA13 | 0.37 | 0.64 | 0.58 | 0.561 |
| MBA19 | −0.89 | 0.64 | −1.38 | 0.166 |
| S8 | −2.36 | 0.64 | −3.68 | 2.51e−04*** |
| Female speaker | | | | |
| Original (intercept) | 229.28 | 0.61 | 375.67 | <2e−16*** |
| iPhone | −0.07 | 0.51 | −0.14 | 0.886 |
| LenP50 | −0.55 | 0.51 | −1.1 | 0.273 |
| LenT490 | 0.13 | 0.51 | 0.26 | 0.799 |
| MBA13 | −0.77 | 0.51 | −1.53 | 0.127 |
| MBA19 | −0.64 | 0.51 | −1.27 | 0.205 |
| S8 | −0.07 | 0.51 | −0.15 | 0.883 |
3.2.3 Differences in F1 and F2 and MFCCs from original
As depicted in the estimated effect sizes presented in the heatmaps in Figure 2, there are significant effects of hardware configuration for both speakers across all vowels for F1, and for all vowels for F2 except the male speaker’s /ɪ/. Both increases and decreases relative to the original recordings are observed. Effect sizes are larger for F2 than for F1 and for the female speaker than for the male speaker, with the female speaker’s most extreme F1 values being negative rather than positive. However, few systematic patterns are readily discernible for specific hardware configurations.
Figure 2: Heatmap of differences in F1 and F2 from the original (in Hz), across devices. Short vertical dashes indicate that the differences are not significant for that phoneme. Calls: lmer(F1 ∼ device + (1|word), data = device.speaker.vowel, REML = FALSE).
If we consider the heatmaps for the MFCC1 and MFCC2 estimated effect sizes (Figure 3), we again see near universal significant differences across the different hardware configurations for all vowels and both speakers. In contrast to the varied pattern observed for F1 and F2 differences, we observe relatively consistent positive effects for MFCC1 and consistent negative effects for MFCC2. We also note more reliable influences of specific hardware configurations across vowels and speakers, with smaller, though still significant, magnitudes for the S8 and LenT490. Interestingly, these relatively salient patterns capturing shifts in spectral energy distribution based on hardware configuration are obscured in observed formant values, likely due to interactions between these spectral differences and formant tracking processes.
Figure 3: Heatmap of differences in MFCC1 and MFCC2 from the original. Short vertical dashes indicate that the differences are not significant for that phoneme. Calls: lmer(mfcc1 ∼ device + (1|word), data = device.speaker.vowel, REML = FALSE).
3.2.4 Distribution of F1 × F2 and vowel space
For the male speaker (Figure 4, top left), distribution of individual instances of each vowel largely overlapped between each of the hardware settings and the original, although visual inspection indicates greater deviations for the LenT490 for /u/, /ɑ/, and /ɛ/, as well as /i/ and /ɛ/ for the iPhone, /ɛ/ for the MBA13, and /ɑ/ for the MBA19. The female speaker’s vowels (Figure 4, top right) had wider and more varied individual distributions across hardware. Distributions varied by vowel, with the largest differences in both F1 and F2 observed for /i/, /ɪ/, and /ɛ/ and in F1 for /u/ and /ɑ/.
Figure 4: Distribution plots of each vowel instance for the male (top left) and female speakers (top right) and vowel space plots for the male (bottom left) and female speakers (bottom right), across devices.
For the male speaker, the vowel spaces for the different hardware varied (Figure 4, bottom left). Overall, the S8 had the vowel space closest to the original, while the iPhone and LenT490 had the greatest differences in F1 × F2 from the original (as illustrated by the Pillai scores in Table 5).
Table 5: Pillai scores computed from F1 and F2, between each device condition and the original vowel. Higher values (min: 0, max: 1) indicate greater separation. The range of degrees of freedom associated with model errors was 30–41 for the male speaker and 41–47 for the female speaker.

| Phoneme | iPhone | LenT490 | MBA13 | S8 | MBA19 | LenP50 |
|---|---|---|---|---|---|---|
| Male speaker | | | | | | |
| ɛ | 0.21** | 0.33*** | 0.33*** | 0.03 | 0.02 | 0.11 |
| ɑ | 0.56*** | 0.31** | 0.01 | 0.005 | 0.08 | 0.37*** |
| æ | 0.09 | 0.17 | 0.34** | 0.09 | 0.04 | 0.01 |
| i | 0.01 | 0.23** | 0.07 | 0.08 | 0.25** | 0.09 |
| ɪ | 0.04 | 0.14 | 0.05 | 0.02 | 0.06 | 0.12 |
| u | 0.19* | 0.31** | 0.09 | 0.01 | 0.06 | 0.24** |
| Female speaker | | | | | | |
| ɛ | 0.44*** | 0.15* | 0.22** | 0.24** | 0.38*** | 0.22** |
| ɑ | 0.06 | 0.06 | 0.14* | 0.11 | 0.22** | 0.1 |
| æ | 0.38*** | 0.33*** | 0.07 | 0.18* | 0.6*** | 0.11 |
| i | 0.49*** | 0.3*** | 0.11 | 0.04 | 0.67*** | 0.04 |
| ɪ | 0.27** | 0.3*** | 0.39*** | 0.04 | 0.44*** | 0.33*** |
| u | 0.33*** | 0.04 | 0.29*** | 0.12 | 0.63*** | 0.07 |
The female speaker’s vowel spaces varied between hardware with larger differences in F1 × F2 than the male’s (Figure 4, bottom right). Overall, for the female speaker, the iPhone, MBA13, and MBA19 were most different in F1 × F2 from the original (Table 5).
3.3 Microphone
Microphones (internal vs. external) were compared using recordings made with Cleanfeed for Chrome on the LenT490, MBA13, and MBA19.
3.3.1 f0
For the female speaker, both microphones yielded differences in f0 of less than 7 Hz from the original 229 Hz; neither difference is significant. For the male speaker, neither the internal nor the external microphone’s average f0 differs from the original by more than 2 Hz, and neither difference is statistically significant (Table 6). However, manual inspection of data points indicates pitch tracking failures for a number of recordings for this speaker, due to low-frequency noise when using the external microphone with Mac laptops.
Table 6: f0 (in Hz) by microphone and speaker, relative to original. Call: lmer(f0 ∼ mic + (1|device) + (1|word), data = mic.speaker, REML = FALSE).

| Microphone | Estimate | Standard error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| Male speaker | | | | |
| Original (intercept) | 121 | 1.36 | 88.72 | <2e−16*** |
| External | 1.73 | 1.52 | 1.14 | 0.255 |
| Internal | −1.01 | 1.38 | −0.73 | 0.463 |
| Female speaker | | | | |
| Original (intercept) | 229.28 | 3.75 | 61.1 | 5.08e−05*** |
| External | −6.87 | 4.26 | −1.61 | 0.231 |
| Internal | −0.88 | 4.25 | −0.21 | 0.853 |
3.3.2 Differences in F1 and F2 from original
Overall, the magnitude of differences in F1 and F2 from the original were similar for external and internal microphones (Figure 5). For F1, for both speakers, differences between the external microphone and the original are largely not significant. For F2, for both microphone setups, the female speaker generally has larger differences from the original than the male speaker.
Figure 5: Heatmap of differences in F1 and F2 from the original (in Hz), across microphones. Short vertical dashes indicate that a comparison is not statistically significant. Measures for male /u/ are excluded due to frequent pitch tracking failures impacting formant tracking with the external microphone and Mac laptops for this vowel. Calls: lmer(F1 ∼ mic + (1|device) + (1|word), data = mic.speaker.vowel, REML = FALSE).
3.3.3 Distribution of F1 × F2 and vowel space
Comparisons of vowel distributions and vowel space for the internal and external microphones appear in Figure 6 with Pillai scores in Table 7. For the male speaker, the distribution of individual vowel instances for the external microphone for /u/ differs greatly from both the internal microphone and original setups, particularly at F2, such that it overlaps with the distribution of /i/. The external microphone setup also results in slightly wider distributions for /i/, /ɛ/, and /ɑ/. The distribution with the internal microphone setup more closely matches the overall distribution of the original. For the male speaker, the vowel space with the internal microphone setup more closely matches that of the original overall, with largest differences in F1 × F2 for /ɛ/ and /u/ (Figure 6 and Table 7). While mitigated here by quartile filtering, the measures for /u/ may be affected by the presence of low-frequency noise in the case of external microphones on Mac laptops that led to the pitch tracking issues noted above for the male speaker.
Figure 6: Distribution plots of each vowel instance for the male (top left) and female speakers (top right) and vowel space plots for the male (bottom left) and female speakers (bottom right), across microphones.
Table 7: Pillai scores computed from F1 and F2, between each microphone condition and the original vowel. Higher values (min: 0, max: 1) indicate greater separation. The range of degrees of freedom associated with model errors was 24–63 for the male speaker and 43–72 for the female speaker.

| Phoneme | ExtMic (male) | IntMic (male) | ExtMic (female) | IntMic (female) |
|---|---|---|---|---|
| ɛ | 0.21* | 0.09* | 0.01 | 0.02 |
| ɑ | 0.25** | 0.03 | 0.1 | 0.02 |
| æ | 0.35** | 0.09 | 0.08 | 0.14** |
| i | 0.16* | 0.07 | 0.57*** | 0.1* |
| ɪ | 0.08 | 0.01 | 0.22** | 0.24*** |
| u | 0.24* | 0.13* | 0.54*** | 0.15** |
For the female speaker, vowel distributions for the external and internal microphones pattern together except for smaller and shifted distributions for the internal microphone for /i/ and /u/ (Figure 6). Vowel spaces for this speaker are similar between the two microphone setups, with both differing from the original (Figure 6). Average F1 × F2 differs most from the original with the internal microphone for /ɪ/ and /æ/ and with the external microphone for /i/ and /u/ (Table 7).
3.4 Software
All software was compared using recordings made with the LenT490, LenP50, and MBA19. Full software results with tables and figures can be found in the Supplementary Material, while a summary is included here. For both speakers, all software was faithful to the original for f0, with differences of less than 1 Hz. None of the differences for either speaker were statistically significant. Differences from the original in F1 and F2 for several of the vowels were not statistically significant for both speakers, across software configurations. More marked differences in F1 and F2 from the original were observed between different vowels than between software configurations (e.g., higher magnitude differences in F1 from the original for /æ/ for the female speaker and for /aɪ/ for the male speaker than with other vowels). Some of the software configurations were less faithful to the original than others for a particular vowel. However, generally, the different software configurations patterned together and software advantages or disadvantages did not hold across speakers or measures.
3.5 Browser
The two web browsers (Firefox and Chrome, for Cleanfeed) were compared using recordings made using the LenT490 and MBA19. Full browser results with tables and figures can be found in the Supplementary Material, while a summary is included here. There were no significant differences in f0 for the two browsers with respect to the original. The distributions of F1 × F2 and vowel spaces of the two browsers tracked each other very closely. However, for the female speaker, the two browsers exhibited a shift relative to the original vowel means, most noticeably for /ɪ/, /ɑ/, and /æ/, which differed significantly by Pillai score. In terms of raw values, most F1 differences were not significant and patterned similarly across browsers.
4 Discussion and conclusions
4.1 Consistency between individual configurations
The results of this study indicate no clear advantage to any particular setup across measures. Even within a particular measure, different combinations proved more faithful for one speaker, but not for the other. While no recommendation can be made for a specific setup, there are some general conclusions that can be drawn for each of the comparisons.
In the hardware comparisons, there was considerable variability, even within the same measures. The magnitude of differences from the original varied by device and speaker for f0. The direction (lower vs. higher) and magnitude of differences for F1 and F2 varied by both speaker and vowel. MFCC measures, in contrast to F1 and F2, were more predictably affected by hardware configurations (see Section 3.2.3). Thus, the distortions introduced by different hardware configurations are ones to which formant tracking is particularly sensitive and which could potentially be avoided with alternative measures.
As with hardware, there was no clear software choice across the board. For f0, F1, and F2, the best-performing software depended on the speaker and/or the vowel. One comparison of interest was whether Zoom’s high-fidelity setting performed more faithfully for the purposes of phonetic measurement than its low-fidelity setting; for loudness, f0, F1, and F2, there was no clear advantage to the hi-fi setting.
In the browser comparisons, Chrome and Firefox had very similar performances across measures in terms of differences from the original. In places where performance differed, for example with F1 and F2, each browser variably outperformed the other in particular combinations of vowel and speaker.
Finally, in the microphone comparison, contrary to the recommendations of Sanker et al. (2021) and to the common intuition that an external microphone should improve performance, or at least lead to more uniform distortions between speakers, this was not the case. Differences in the magnitudes and directions of effects between the two speakers were found with both internal and external microphones. For both speakers, the external microphone did not yield uniformly more faithful measures of f0, F1, or F2; in fact, the measures were generally less faithful than those from internal microphones.
4.2 Differences for measures of interest (loudness; f0; F1 and F2)
4.2.1 Loudness
A statistically significant decrease in loudness was observed for all hardware, software, microphone, and browser configurations, with the size of the decrease varying by configuration. Loudness can be adjusted post-recording and would not pose a problem for most analyses; however, a study in which loudness is a measure of interest, for example for stress or prominence, would benefit from relative measures of loudness or other forms of normalization to support comparisons across speakers.
4.2.2 f0
Overall, with suitable pitch range constraints and where pitch tracking was successful, both speakers’ f0 measurements remained faithful to the original regardless of setup, with differences well below the just-noticeable difference for pitch discrimination for falling pitch contours in speech (Turner et al. 2019). However, we do observe some systematic pitch tracking failures, associated with the external microphone on Mac laptop hardware. These recordings are marred by strong low-frequency noise, possibly due to hardware compatibility issues, which interferes with pitch tracking for the lower-pitched speaker, but not for the higher-pitched voice, highlighting differential impacts of the same acoustic distortion.
4.2.3 F1 × F2 vowel space and distribution
Vowel space distributions showed clear inconsistencies in formant tracking. This is particularly evident with the external microphone for the male speaker. Plotting F1 × F2 with both the distributions and the vowel spaces allowed for a clearer picture of how these measures were affected.
4.2.4 Differences between speakers
Different outcomes by configuration were observed for the male and the female speaker for all measurement types. Due to the small sample size of speakers, conclusions cannot be drawn as to whether differences were correlated with speaker sex or other aspects of their speech. Based on these findings, however, there is a strong argument for documenting recording conditions in as much detail as possible and employing speaker normalization methods when performing online studies with cross-speaker comparisons. Additionally, future study with multiple male and female speakers would be helpful to better determine the cause of differences and aid in normalization approaches.
4.3 Future directions
This study focused exclusively on vowels due to space limitations; it is also important to consider effects on other sounds. In an upcoming paper, we will discuss effects of different remote configurations on fricatives.
While there was a clear advantage to using recorded speech played from the loudspeaker as the input for all test recordings, namely that all input was identical, tests could be carried out in multiple recording sites. A further comparison could be done with live speech input, as would be the case for a typical study such as in Freeman and De Decker (2021a, 2021b), rather than with recorded speech played back through the digital speaker.
Finally, given the differences between speakers found in this study, a future study with a much larger sample size of speakers could provide further insight into these observed differences. A comparison could be made, for instance, between male and female speakers, given the baseline differences in average and range for measures such as f0.
4.4 Overview of issues and proposed approaches and solutions
The outcomes of this study raise relevant concerns for performing online phonetic studies involving measurements of vowels with f0, F1, F2, and loudness as measures of interest. There are a number of ways to mitigate these concerns.
When it comes to analysis, it is highly advisable to take care in assessing and normalizing any vowel comparisons of f0, F1, and F2 in cross-participant studies. Our results indicate differences in both the magnitude and the direction of effects between two speakers recorded with the same setup, in the same recording location, minutes apart, using the same playback. Further research, particularly focusing on male versus female speakers, could greatly improve normalization efforts. One widely used option is Lobanov normalization, the per-speaker z-scoring of formants applied in some of the studies discussed in Section 1, sketched below.
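The sketch below applies Lobanov normalization to a hypothetical long-format data frame with speaker, F1, and F2 columns; the column names and values are illustrative assumptions.

```python
import pandas as pd

def lobanov(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score F1 and F2 within each speaker (Lobanov normalization)."""
    out = df.copy()
    for formant in ("F1", "F2"):
        by_speaker = out.groupby("speaker")[formant]
        out[f"{formant}_z"] = ((out[formant] - by_speaker.transform("mean"))
                               / by_speaker.transform("std"))
    return out

# Hypothetical usage: cross-speaker comparisons are then made on the
# normalized F1_z and F2_z columns rather than on raw Hz values.
data = pd.DataFrame({
    "speaker": ["M", "M", "F", "F"],
    "F1": [512, 680, 620, 810],
    "F2": [1810, 1700, 2100, 1950],
})
print(lobanov(data))
```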
Given the issues of lag introduced by recording online (see Section 2.5.2), particularly for subsequent alignment, there is an advantage to participants recording themselves locally and then uploading their recording(s) to the researcher.
For data analysis, if hand-correction of pitch and formant tracking is an option, errors due to pitch and formant tracking could be lessened or eliminated (e.g., those behind the wide F1 × F2 distribution of /ɪ/ for the female speaker in Figure 4). Standard outlier trimming, which was applied to the data in all vowel plots, did not eliminate this issue.
Finally, it may be worthwhile to consider the choice of software for automatic measures such as pitch tracking. While OpenSMILE performed well for MFCC and loudness measures, the number of pitch and formant tracking errors was much larger with OpenSMILE than with Praat. Depending on the distortion from remote recordings, a particular algorithm or way of tracking formants may perform better or worse, so it may be worth trying a few options if there are large errors across the board; Zhang et al. (2021) have explored some of these options.
Acknowledgments
The authors would like to thank Will Bowers for his help collecting data, and our reviewers for their thoughtful feedback.
References
Boersma, Paul & David Weenink. 2021. Praat: Doing phonetics by computer (version 6.1.38) [computer program]. Available at: http://www.praat.org/.
Calder, Jeremy & Rebecca Wheeler. 2022. Is Zoom viable for sociophonetic research? A comparison of in-person and online recordings for sibilant analysis. Linguistics Vanguard 20210014. https://doi.org/10.1515/lingvan-2021-0014.
Calder, Jeremy, Rebecca Wheeler, Sarah Adams, Daniel Amarelo, Katherine Arnold-Murray, Justin Bai, Meredith Church, Josh Daniels, Sarah Gomez, Jacob Henry, Yunan Jia, Brienna Johnson-Morris, Kyo Lee, Kit Miller, Derrek Powell, Caitlin Ramsey-Smith, Sydney Rayl, Sara Rosenau & Salvador Nadine. 2022. Is Zoom viable for sociophonetic research? A comparison of in-person and online recordings for vocalic analysis. Linguistics Vanguard 20200148. https://doi.org/10.1515/lingvan-2020-0148.
Cannizzaro, Michael S., Nicole Reilly, James C. Mundt & Peter J. Snyder. 2005. Remote capture of human voice acoustical data by telephone: A methods study. Clinical Linguistics and Phonetics 19(8). 649–658. https://doi.org/10.1080/02699200412331271125.
De Decker, Paul & Jennifer Nycz. 2011. For the record: Which digital media can be used for sociophonetic analysis? University of Pennsylvania Working Papers in Linguistics 17(2). 51–59.
Eyben, Florian, Martin Wöllmer & Björn Schuller. 2010. OpenSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on multimedia, 1459–1462. Firenze, Italy: Association for Computing Machinery. https://doi.org/10.1145/1873951.1874246.
Freeman, Valerie & Paul De Decker. 2021a. Remote sociophonetic data collection: Vowels and nasalization from self-recordings on personal devices. Language and Linguistics Compass 15(7). https://doi.org/10.1111/lnc3.12435.
Freeman, Valerie & Paul De Decker. 2021b. Remote sociophonetic data collection: Vowels and nasalization over video conferencing apps. Journal of the Acoustical Society of America 149(2). 1211–1223. https://doi.org/10.1121/10.0003529.
Ge, Chunyu, Yixuan Xiong & Peggy Mok. 2021. How reliable are phonetic data collected remotely? Comparison of recording devices and environments on acoustic measurements. Proceedings of Interspeech 2021. 3984–3988. https://doi.org/10.21437/Interspeech.2021-1122.
Guan, Yihan & Bin Li. 2021. Usability and practicality of speech recording by mobile phones for phonetic analysis. In 2021 12th international symposium on Chinese spoken language processing (ISCSLP), 1–5. Hong Kong. https://doi.org/10.1109/ISCSLP49672.2021.9362082.
Hay, Jennifer, Paul Warren & Katie Drager. 2006. Factors influencing speech perception in the context of a merger-in-progress. Journal of Phonetics 34(4). 458–484. https://doi.org/10.1016/j.wocn.2005.10.001.
Jadoul, Yannick, Bill Thompson & Bart de Boer. 2018. Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics 71. 1–15. https://doi.org/10.1016/j.wocn.2018.07.001.
Kuznetsova, Alexandra, Per B. Brockhoff & Rune H. B. Christensen. 2017. lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software 82(13). 1–26. https://doi.org/10.18637/jss.v082.i13.
Manfredi, Claudia, Jean Lebacq, Giovanna Cantarella, Jean Schoentgen, Silvia Orlandi, Andrea Bandini & Philippe H. DeJonckere. 2017. Smartphones offer new opportunities in clinical voice research. Journal of Voice 31(1). 111.e1–111.e7. https://doi.org/10.1016/j.jvoice.2015.12.020.
McCloy, Daniel R. 2012. Vowel normalization and plotting with the phonR package. Technical Reports of the UW Linguistic Phonetics Laboratory 1. 1–8.
Parsa, Vijay, Donald G. Jamieson & Bradley R. Pretty. 2001. Effects of microphone type on acoustic measures of voice. Journal of Voice 15(3). 331–343. https://doi.org/10.1016/s0892-1997(01)00035-2.
R Core Team. 2022. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.
Sanker, Chelsea, Sarah Babinski, Roslyn Burns, Marisha Evans, Jeremy Johns, Juhyae Kim, Slater Smith, Natalie Weber & Claire Bowern. 2021. (Don’t) try this at home! The effects of recording devices and software on phonetic analysis. Language 97(4). e360–e382. https://doi.org/10.1353/lan.2021.0079.
Titze, Ingo R. & William S. Winholtz. 1993. Effect of microphone type and placement on voice perturbation measurements. Journal of Speech, Language, and Hearing Research 36(6). 1177–1190. https://doi.org/10.1044/jshr.3606.1177.
Turner, Daniel R., Ann R. Bradlow & Jennifer S. Cole. 2019. Perception of pitch contours in speech and nonspeech. In Proceedings of Interspeech 2019, 2275–2279. Graz, Austria. https://doi.org/10.21437/Interspeech.2019-2619.
Waskom, Michael L. 2021. Seaborn: Statistical data visualization. Journal of Open Source Software 6(60). 3021. https://doi.org/10.21105/joss.03021.
Wickham, Hadley, Romain François, Lionel Henry & Kirill Müller. 2022. dplyr: A grammar of data manipulation. Version 1.1.2 [R package]. Available at: https://dplyr.tidyverse.org.
Xue, Steve An & Amy Lower. 2010. Acoustic fidelity of internet bandwidths for measures used in speech and voice disorders. Journal of the Acoustical Society of America 128(3). 1366–1376. https://doi.org/10.1121/1.3467764.
Zhang, Cong, Kathleen Jepson, Georg Lohfink & Amalia Arvaniti. 2021. Comparing acoustic analyses of speech data collected remotely. Journal of the Acoustical Society of America 149(6). 3910–3916. https://doi.org/10.1121/10.0005132.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/lingvan-2022-0169).
© 2024 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.