Abstract
This paper presents a learning simulation of phonotactics using an attention-based long short-term memory autoencoder trained on raw audio input. Unlike previous models that use abstract phonological representations, the current method imitates early phonotactic acquisition stages by processing continuous acoustic signals. Focusing on an English phonotactic pattern, specifically the distribution of aspirated and unaspirated voiceless stops, the model implicitly acquires phonotactic knowledge through reconstruction tasks. The results demonstrate the model’s ability to acquire essential phonotactic relations through attention mechanisms, exhibiting increased attention to phonological contexts with higher phonotactic predictability. The learning trajectory begins with a strong reliance on contextual cues to identify phonotactic patterns. Over time, the system internalizes these constraints, leading to a decreased reliance on specific phonotactic cues. This study suggests the feasibility of early phonotactic learning models based on raw auditory input and provides insights into both computational modeling and infants’ phonotactic acquisition.
1 Introduction
Previous modeling studies have primarily utilized theory-driven, abstract phonological representations to simulate phonotactic learning: constraint-based models (Hayes and Wilson 2008), n-gram models (Albright 2009; Coleman and Pierrehumbert 1997; Jarosz and Rysling 2017; Kirby 2021), and neural network models such as sRNNs (Mayer and Nelson 2020), LSTMs (Mirea and Bicknell 2019), and Seq2Seq RNNs (Smith et al. 2021) were trained on various types of phonological representations as input, including segments (Albright and Hayes 2003; Vitevitch and Luce 2004), phonological features (Albright 2009; Hayes and Wilson 2008), or larger prosodic structures (Coleman and Pierrehumbert 1997). These models demonstrated varying degrees of success in well-formedness judgment tests and exhibited biases similar to those of human participants (cf. Daland et al. 2011).
In natural language learning, however, infants start acquiring knowledge of phonotactics as early as seven months (Thiessen and Saffran 2003) and can detect word boundaries from birth (Fló et al. 2019). During these early stages, infants have not yet mastered abstract phonological representations (Escudero and Kalashnikova 2020; Yoshida et al. 2010), so they must rely heavily on “pure” acoustic cues when learning phonotactics. This differs from learning phonotactics over transformed, abstract symbols, whose existence has been challenged by some behavioral and computational evidence (Davis and Redford 2019; Feldman et al. 2021; McMurray et al. 2002; Qi and Zevin 2024; Schatz et al. 2021). Against this background, our goal is to investigate the feasibility of phonotactic learning from acoustic input alone, without access to abstract, discrete phonological representations. The model is trained on continuous, unsegmented raw audio clips. This approach does not assume prior mastery of phonetic or phonological units but treats the learning of these units as distinct trajectories that may not progress linearly over time.
We focus on an English phonotactic pattern in which aspirated voiceless stops occur in word-initial positions, while unaspirated voiceless stops occur when followed by the sibilant fricative /s/ (e.g., [tʰ]ea (*[t]ea) versus s[t]ar (*s[tʰ]ar)). The acquisition of phonotactic knowledge is analyzed by exploring the level of attention a model directs toward particular phonological environments. A key premise is that the model is expected to pay attention to crucial contextual cues that assist in predicting phonotactic patterns (see Section 2 for details on the attention mechanisms assumed in this study).
2 Methods
All scripts relevant to the methodology outlined herein, along with the associated data, are available at https://osf.io/6mbvd/?view_only=43f7a6f1bd7c4c12a5824738a3d4c6d5.
2.1 Dataset preparation
We used the “train-clean-100” subset from the LibriSpeech corpus (Panayotov et al. 2015), a collection of 16 kHz speech recordings from 251 speakers (125 female and 126 male), totaling approximately 100 h of read English. Readily available transcriptions generated with the Montreal Forced Aligner (Lugosch et al. 2019) were employed to break the sentences down into sub-word clusters. These transcriptions guided the segmentation process in the testing phase but were not used during training, so the model never received transcriptions as input.
From the corpus, sequences matching either #PV (initial voiceless stops) or sPV (voiceless stops after [s]) patterns were extracted, where P was /p/, /t/, or /k/ and V was any English vowel. The dataset included 9,859 instances of #PV and 2,363 instances of sPV. To address this asymmetry in token frequencies between the #PV and sPV patterns, both an unbalanced and a balanced condition were tested. The unbalanced condition retained the original distributions of the two sequence types; the balanced condition equalized the frequency of #PV and sPV patterns by randomly sampling from the #PV dataset to match the size of the sPV dataset. The audio recordings underwent Mel-spectrogram extraction using the transforms.MelSpectrogram function in the torchaudio library (Yang et al. 2022). To align the lengths of #PV tokens with sPV tokens, white noise of random length was inserted at the beginning of #PV tokens.
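As a concrete illustration, a minimal preprocessing sketch is given below. The 16 kHz sampling rate, the 64 Mel filter banks, and the use of torchaudio’s transforms.MelSpectrogram follow the text; the window and hop sizes, the noise amplitude, and the noise-length range are illustrative assumptions rather than reported settings.

```python
# Minimal preprocessing sketch: Mel-spectrogram extraction plus white-noise
# padding for word-initial (#PV) tokens. Several parameter values below are
# assumptions, as noted in the comments.
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # LibriSpeech recordings are sampled at 16 kHz
    n_fft=400,          # assumed 25 ms analysis window
    hop_length=160,     # assumed 10 ms hop
    n_mels=64,          # matches D_input = 64 (Section 2.2.1)
)

def preprocess(waveform: torch.Tensor, is_word_initial: bool) -> torch.Tensor:
    """Turn a (1, samples) waveform into a (frames, 64) Mel-spectrogram.

    For #PV tokens, white noise of random length is prepended so that their
    durations become comparable to sPV tokens.
    """
    if is_word_initial:
        pad_len = int(torch.randint(800, 4800, (1,)).item())  # assumed range
        noise = 0.01 * torch.randn(waveform.size(0), pad_len)  # assumed amplitude
        waveform = torch.cat([noise, waveform], dim=1)
    mel = mel_transform(waveform)          # shape: (1, 64, frames)
    return mel.squeeze(0).transpose(0, 1)  # shape: (frames, 64)
```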
2.2 Model setting
An autoencoder structure (Bank et al. 2021) was employed, which learns from auditory input by extensively listening, reorganizing, and repeating what it has processed. The autoencoder was guided to compress the auditory input into a latent representation to capture its essential features (Hinton and Salakhutdinov 2006). Figure 1 depicts the structure of the model, which includes a long short-term memory (LSTM) network. The LSTM network, a type of recurrent neural network (RNN; Sherstinsky 2020), served as the core architecture for both the encoder and the decoder. The current model structure takes a slice of recording at each timestep as input and processes it sequentially, considering all previous timesteps when handling the current timestep. This configuration was chosen not only because of its reported success (Chen et al. 2018; Chung et al. 2016; Liu et al. 2020; Wang et al. 2018), but crucially due to its sequential processing ability, which is essential for phonotactic learning. Cross-attention was incorporated as a link between the encoder and decoder to enable selective information transfer (see Section 2.2.3).

Figure 1: Illustration of the current model.
2.2.1 Encoder
The extracted Mel-spectrogram features with the shape of (B, L, D_input) were fed into the encoder. B, the batch size, was set to 32, allowing parallel processing. L represents sequence length. Within each batch, all tokens were zero-padded to match the length of the longest sequence. D_input, set to 64, corresponds to the number of Mel filter banks, defining the dimensionality of each frame vector in the Mel-spectrogram time window. The input Mel-spectrogram features first went through a fully connected (FC) layer for linear transformation from the dimension of 64 to the target hidden dimension. Then the transformed features were fed into a two-layered bidirectional LSTM[1] component, which processed the information sequentially while incorporating contextual information from previous timesteps. The outputs from the LSTM at all timesteps went through another FC layer to integrate the bidirectional outputs.
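A minimal PyTorch sketch of this encoder layout is shown below. Only the FC → two-layer bidirectional LSTM → FC structure follows the description above; the class and attribute names, and the absence of dropout, are assumptions.

```python
# Sketch of the encoder: linear projection, two-layer bidirectional LSTM,
# and a second linear layer that merges the two directions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, d_input: int = 64, d_hidden: int = 8):
        super().__init__()
        self.fc_in = nn.Linear(d_input, d_hidden)        # 64 -> hidden dimension
        self.lstm = nn.LSTM(d_hidden, d_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc_out = nn.Linear(2 * d_hidden, d_hidden)  # integrate both directions

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (B, L, 64) zero-padded batch of Mel-spectrogram frames
        x = self.fc_in(mel)    # (B, L, d_hidden)
        x, _ = self.lstm(x)    # (B, L, 2 * d_hidden)
        return self.fc_out(x)  # (B, L, d_hidden): the hidden representation
```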
2.2.2 Hidden representation
The encoder’s output was transformed into a hidden representation, which encapsulated knowledge the model acquired during training. This hidden representation is crucial for enhancing the decoder’s reconstruction process. To investigate the effect of hidden dimensions on learning, analogous to the learner’s mental storage ability, we designed models with 4, 8, 16, and 32 dimensions.
2.2.3 Decoder and cross-attention
The hidden representation, similar to the phonotactic knowledge acquired by the model through encoding, was then passed to the decoder model. The decoder consisted of one FC layer, a two-layered unidirectional LSTM module, a cross-attention layer, and another FC layer following the cross-attention layer, as illustrated in Figure 1. The decoding process was autoregressive (Dalal et al. 2019), whereby each timestep relied on the output from the previous timesteps to generate the output. Information transmission from the encoder to the decoder was facilitated by the cross-attention mechanism (Vaswani et al. 2023). To initiate the decoder, a starting token of shape (B, D_output = 64) was created with all values initialized to zero, to provide a consistent starting point for decoding. This token was then used as the input for the first decoding timestep. For each subsequent timestep, the LSTM module considered the historical information from prior timesteps. This output was then passed to the cross-attention layer.
We employed the scaled dot-product attention mechanism (Vaswani et al. 2023), whereby the decoder information obtained from the LSTM was utilized as the query, and both the key and value were derived from the hidden state. Following individual linear transformations, the query and key were multiplied, resulting in an “attention score”. This score indicates the level of similarity between the query and key at each timestep. Subsequently, information was extracted from the hidden representation, weighted by the attention score.[2] Cross-attention thus connected the encoder and decoder by assigning attention scores that indicate the significance of information from each timestep in the encoded sequence for the current decoding step. This is akin to predicting the correct realization of a phoneme within a specific phonological context, much as phonotactic constraints do.
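The sketch below illustrates this scaled dot-product cross-attention step in PyTorch. It assumes single-head attention and learned projections whose dimensions all equal the hidden size; these are simplifying assumptions, not reported details.

```python
# Scaled dot-product cross-attention: query from the decoder LSTM state,
# key and value from the encoder's hidden representation.
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # query projection
        self.k_proj = nn.Linear(d_model, d_model)  # key projection
        self.v_proj = nn.Linear(d_model, d_model)  # value projection

    def forward(self, dec_state: torch.Tensor, hidden: torch.Tensor):
        # dec_state: (B, 1, d_model), decoder LSTM output at the current timestep
        # hidden:    (B, L, d_model), encoder output for the whole sequence
        q = self.q_proj(dec_state)
        k = self.k_proj(hidden)
        v = self.v_proj(hidden)
        scores = q @ k.transpose(1, 2) / math.sqrt(q.size(-1))  # (B, 1, L)
        weights = torch.softmax(scores, dim=-1)  # attention scores over encoder timesteps
        context = weights @ v                    # (B, 1, d_model) weighted information
        return context, weights
```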
After cross-attention, the data were passed through a final FC layer, which transformed the hidden dimension back to the output dimension (D_output = 64), yielding the reconstructed Mel-spectrogram frame for that timestep.
2.3 Training and evaluation
Eighty percent of the selected items were allocated to the model’s training phase; the remaining 20 percent were reserved for the testing phase. Training was carried out using Adam optimization with a learning rate of 0.001, following Kingma and Ba (2014). Each experiment was conducted five times. In each run, the model was trained for 100 epochs, by which point the loss on the validation set had stabilized, ensuring convergence. Analysis of the training and validation losses confirmed that overfitting did not occur during training. For each run, all model parameters were saved at every epoch during the 100 epochs of training, and the trained model was tested after each epoch. The results from all five runs were averaged to ensure reliability of the results.
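A minimal sketch of this training setup is given below, assuming a mean squared reconstruction error over padded batches; the data-loading details, checkpoint file names, and loss reduction are illustrative assumptions.

```python
# Sketch of the training loop: Adam with learning rate 0.001, 100 epochs,
# MSE reconstruction loss, and per-epoch checkpointing for later testing.
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, n_epochs: int = 100) -> None:
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.MSELoss()
    for epoch in range(n_epochs):
        model.train()
        for mel in train_loader:           # (B, L, 64) zero-padded batches
            optimizer.zero_grad()
            recon = model(mel)             # autoencoder reconstruction
            loss = criterion(recon, mel)   # per-frame reconstruction error
            loss.backward()
            optimizer.step()
        # Parameters are saved after every epoch so the model can be
        # evaluated at each point of the learning trajectory.
        torch.save(model.state_dict(), f"model_epoch{epoch:03d}.pt")
```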
3 Results
3.1 Reconstruction
We tracked the model’s reconstruction quality (mean squared error averaged per frame) on the validation set throughout training. In addition, for each run, we randomly selected one item and monitored its reconstruction across epochs to assess the model’s learning progress. These procedures ensured that all models achieved sufficient proficiency in reconstructing input sequences. Due to space constraints, only the results of the balanced condition are reported here. The results of the unbalanced condition were similar and can be found in the OSF repository. The reconstruction loss throughout training (Figure 2) and the reconstructed Mel-spectrogram (Figure 3) demonstrate that all models reached convergence, and effectively captured the overall sequential patterns of segments, with the hidden dimension size positively correlated with reconstruction quality.

Figure 2: Plot showing the reconstruction quality over the training epochs on the validation set.

Figure 3: Plots showing the reconstruction outputs of a #PV sequence (#KAA) and an sPV sequence (SKAA). The #KAA examples are not aligned, since the noise before the burst was of random length.
3.2 Attention trajectory
Attention scores determined how much each segment of the hidden representation contributed to the decoder’s reconstruction process. For each testing item, cross-attention yielded an attention weight matrix of shape (decoding timesteps × encoding timesteps); from this matrix, we computed the SAS, which quantifies, at a given decoding timestep, the attention weight allocated to the timesteps of an encoded source segment.
Low SAS indicates that the model is minimally leveraging information from the encoded source segment at the current timestep, suggesting a weaker interaction with the decoded segment. Conversely, high SAS signifies that the model is strongly reliant on that segment’s information for decoding and reconstructing the current timestep. For analysis, the testing items were segmented according to the phonetic transcription. To examine the evolution of SAS over the timesteps within a segment, we calculated SAS for each decoding timestep. The resulting trajectory comprises the collection of SAS values for all decoding timesteps of segment Q, providing insight into the model’s focus on the encoded segment P during decoding. The trajectory is formalized as follows:
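A sketch of this formalization, under the assumption that a_{t,i} denotes the cross-attention weight assigned at decoding timestep t to encoding timestep i (the symbol choices are illustrative assumptions):

```latex
% SAS of encoded source segment P at decoding timestep t: the cross-attention
% weights a_{t,i} are summed over the encoding timesteps i belonging to P.
% The trajectory collects these values over the decoding timesteps of the
% target segment Q.
\mathrm{SAS}_{P}(t) = \sum_{i \in P} a_{t,i},
\qquad
T_{P \rightarrow Q} = \bigl(\mathrm{SAS}_{P}(t)\bigr)_{t \in Q}
```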
To ensure consistent evaluation of trajectories across various testing items, we employed linear interpolation to standardize each trajectory to 100 timesteps. Specifically, both forward and backward SAS trajectories were collected: from /s/ or the random noise placeholder (#) to the plosive (s→P for sPV and #→P for #PV), from the plosive back to /s/ or the placeholder (P→s for sPV and P→# for #PV), from the plosive to the vowel (P→V for both cases), as well as from the vowel back to the plosive (V→P for both cases). These trajectories were expected to reflect the model’s attention shifts within the decoding segment and across segmental boundaries. All trajectories spanned timesteps from 0 to 100, corresponding to the start and end points of the source segment. Here, we report the SAS trajectories for the 8-dimensional model. Modeling results from higher dimensions exhibit consistent trends; see the OSF repository for details.
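The standardization step can be sketched as follows; the use of numpy.interp is an assumed implementation choice, not necessarily the exact code used.

```python
# Linearly interpolate a variable-length SAS trajectory onto a fixed
# 100-point grid so that trajectories from different items are comparable.
import numpy as np

def standardize_trajectory(sas_values: np.ndarray, n_points: int = 100) -> np.ndarray:
    """Resample a variable-length SAS trajectory to n_points timesteps."""
    original_grid = np.linspace(0.0, 1.0, num=len(sas_values))
    target_grid = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(target_grid, original_grid, sas_values)
```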
In Figure 4, each SAS trajectory line represents the timespan of a single source segment, with segment boundaries occurring at the start and end of each line. All trajectories exhibit a similar trend, with the SAS peaking at the beginning or the end of the trajectory. The observed higher SAS at the segment boundaries (i.e., near 0 and 100 for each line) aligns with the expected intensification of phonetic and phonological interactions in these regions. As the model transitions between segments, its reliance on information from adjacent segments increases significantly, facilitating segment reconstruction.

Figure 4: Plot of SAS trajectories for sPV and #PV conditions (epochs 20–40).
For the forward s/#→P (blue line) and P→V (red line), SAS is higher toward the end of the curves, with s/#→P (sPV/#PV) scoring 0.14/0.09 and P→V scoring 0.30/0.31. In contrast, for the backward P→s/# (green line) and V→P (yellow line), the SAS peaks at the beginning, with P→s/# having a score of 0.55/0.74 and V→P being 0.36/0.41. At the other end of their trajectories, that is, the beginning for s/#→P (blue line) and P→V (red line), and the end for P→s/# (green line) and V→P (yellow line), the SAS is low with all scores below 0.1. While both ends represent segment boundaries,[4] SAS was consistently higher at one end over the other, indicating the model’s selective attention on specific segment boundaries.
It was further found that the SAS at the s/#|P boundary for the backward P→s/# trajectory (green line) consistently exceeds that for the forward s/#→P (blue line). This indicates a specific attentional bias whereby the model allocates more backward attention. In contrast, the attention scores for the P|V boundary (red and yellow lines) do not exhibit a consistent pattern: the backward V→P attention score is only slightly higher than the forward P→V score. Given the lack of a comparable directional asymmetry at the P|V boundary, the asymmetry observed between P→s/# and s/#→P attention scores cannot simply be attributed to backward tracking, in which statistical dependencies in a sequence are tracked by working backward from the end of the sequence to identify patterns (Pelucchi et al. 2009). This leads us to infer that the model intentionally referred back to the preceding segment at the s/#|P boundary to reconstruct the subsequent plosive; that is, the model focused on the phonological context that determines the realization of aspirated and unaspirated plosives. The SAS for P→s/# (green line) was also consistently higher than for P→V (red line), suggesting that at the segment boundary the model’s attention was directed more to the preceding /s/ than to the following vowel. In other words, the model relied more on contextual information at the s/#|P boundary than at the P|V boundary. This distinction is understandable, given that a systematic phonotactic pattern holds at the s/#|P boundary (i.e., unaspirated P after s and aspirated P after #), whereas it is random at the P|V boundary (P may be aspirated or unaspirated depending only on the preceding environment).
If SAS values directly reflect attention to the phonotactic cues, we must recognize that the discrepancy between the forward s/#→P and backward P→s/# attention is, in fact, puzzling. If our reasoning holds, the results suggest that information from /s/ or the word-initial position (#) is more helpful for the model in decoding the plosive segment than the reverse, even though forward and backward predictability are theoretically equal, that is, aspirated and unaspirated plosives can be predicted based on the preceding environment, and the preceding environment can be equally predicted from the plosive’s realization. We attribute this disparity to the autoregressive decoding mechanism of the model. The decoding process in our model operated timestep-by-timestep, resulting in a unidirectional flow of information. This led to the decoding of not only segments, but also Mel-spectrogram frames of each segment in a unidirectional manner. Consequently, the decoding process relied less on information from the final timesteps of s/#, as they were consistent with previous frames. Information from the preceding timesteps of the segment was sufficient, reducing the importance of information from the subsequent segment. In contrast, decoding the plosive segment at the abrupt transition from the fricative or white noise to a new segment necessitated a greater amount of attention. This may explain why a similar disparity pattern was not observed at the P|V boundary.
3.3 Attention trajectories
Next, we investigate the trajectories of the model’s attention patterns during training, mirroring changes in attention observed in learners as they progress in acquisition. We also consider their relation to hidden dimensionality to understand how the model’s attention changes as memory capacity expands. We consider the peak SAS of trajectories, which is observed at either the first or last timestep of each trajectory – first for P→s/# and V→P, and last for s/#→P and P→V – and plot the developmental trajectory of peak SAS over the training epochs. See Figure 5.

Figure 5: The peak SAS for sPV and #PV conditions under balanced training for hidden dimension = 4, 8, 16, and 32.
In all conditions except the lowest dimensionality (n = 4), a consistent pattern was observed. The peak SAS for the backward P→s/# (green line) exhibited an initial increase and peak during training, underscoring its importance in predicting the aspiration properties of plosives. Conversely, the peak SAS trends for V→P and P→V initially lacked clear patterns, suggesting weak phonological conditioning. The peak SAS of the forward s/#→P showed a contrasting trend to P→s/#, decreasing initially and reaching its minimum point, consistent with our observation of the SAS scores. This behavior can also be attributed to the unidirectional flow of information in the autoregressive decoding mechanism (see Section 3.2), whose impact overrides the importance of the phonotactic cues. As training advanced, all four peak SAS values, regardless of their predictability associations, tended to converge and decrease further (cf. Section 3.4).
In addition to training epochs, hidden dimensionality also significantly impacted the development of the peak SAS (ρ = −0.819): larger hidden dimensions led to earlier merging epochs. For instance, the earliest merge was noted for the 32-dimensional case in our simulation. Additionally, for an 8-dimensional case, the merge occurred near the end of training, whereas for the lowest dimensionality (n = 4), the merge was not observed within the 100 training epochs. However, hidden dimensionality does not seem to influence the maximum of this trajectory, which remains at 0.6 for sPV and 0.8 for #PV.
3.4 Well-formedness test
To confirm the model’s sensitivity to phonotactic constraints, we conducted a well-formedness test by assessing its reconstruction quality on sequences which either adhered to or violated the specified phonotactic constraints related to aspiration. We created a test set by systematically altering the aspiration of plosives. For example, aspirated plosives in the initial position were substituted with unaspirated plosives of the same place of articulation and vowel context, and vice versa.[5] We then compared the reconstruction quality of this swapped test set to the original test set to assess the model’s capacity to distinguish between sequences that conformed to or violated the trained phonotactic rules.
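The comparison can be sketched as follows; function names and the per-item evaluation are illustrative assumptions rather than the reported implementation.

```python
# Well-formedness comparison: mean reconstruction error on the original
# (conforming) test items versus their aspiration-swapped counterparts.
import torch
import torch.nn as nn

@torch.no_grad()
def mean_reconstruction_error(model: nn.Module, items) -> float:
    """Average per-item MSE over a list of (L, 64) Mel-spectrogram tensors."""
    criterion = nn.MSELoss()
    losses = [criterion(model(mel.unsqueeze(0)), mel.unsqueeze(0)).item()
              for mel in items]
    return sum(losses) / len(losses)

def wellformedness_difference(model, conforming_items, swapped_items) -> float:
    # Positive values indicate better reconstruction of conforming sequences.
    return (mean_reconstruction_error(model, swapped_items)
            - mean_reconstruction_error(model, conforming_items))
```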
In line with the peak SAS pattern of P → s/#, the differences in reconstruction quality initially increased, peaked, and then decreased as training progressed. These quality differences strongly correlated[6] with the peak SAS pattern across all dimensions (ρ = 0.939, 0.893, 0.950, and 0.951 for hidden dimensions 4, 8, 16, and 32, respectively), confirming a correlation between phonotactic patterning and attention scores. As shown in Figure 6, the preference toward conforming sequences became apparent early in the learning trajectory and was also reflected in the attention patterns.
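A sketch of this correlation, assuming Spearman’s ρ is computed, for each hidden dimension, between the epoch-wise reconstruction-quality difference and the epoch-wise peak SAS (array names are illustrative):

```python
# Correlate the per-epoch quality difference with the per-epoch peak SAS
# for a given hidden dimension (Spearman's rho is an assumption here).
from scipy.stats import spearmanr

def quality_sas_correlation(quality_diff_by_epoch, peak_sas_by_epoch):
    """Both inputs hold one value per training epoch for a given dimension."""
    rho, p_value = spearmanr(quality_diff_by_epoch, peak_sas_by_epoch)
    return rho, p_value
```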

Figure 6: Reconstruction quality difference between conforming and violating sequences under balanced training for hidden dimension = 4, 8, 16, and 32.
4 Discussion and conclusion
The current work simulated phonotactic learning using raw audio input. The training task focused on reconstruction, offering minimal explicit guidance for the model to learn phonotactic constraints. The results indicate that the model acquired phonotactic knowledge, thereby enhancing its ability to reconstruct expected segment sequences from raw audio input. Our approach deviates from prior studies that assume abstract phonological representations (e.g., Albright 2009; Coleman and Pierrehumbert 1997; Hayes and Wilson 2008; Jarosz and Rysling 2017; Kirby 2021; Mayer and Nelson 2020; Mirea and Bicknell 2019; Smith et al. 2021). Still, the current model showed the ability to learn and represent phonotactic relationships between segments exclusively from audio input, without assuming specific categorical phonemic representations (Davis and Redford 2019; Feldman et al. 2021; McMurray et al. 2002; Qi and Zevin 2024; Schatz et al. 2021).
Our results revealed a correlation between the model’s attention scores and the corresponding phonotactic cues. Specifically, the model showed increased attention to the contexts where the distribution of aspiration properties was phonotactically conditioned, while it displayed decreased attention where the distribution was less predictable. Higher attention scores consistently indicated a preference for well-formed sequences over constraint-violating ones. (Similar trends have been observed in syntax and semantics; see Jang et al. 2024; Ravishankar et al. 2021.) The results also demonstrate that the predictability-based attention mechanism interacts with the model’s autoregressive decoding mechanism when processing timesteps.
Notably, the model’s reliance on phonotactic structure was not static. During the stages when reconstruction loss was improving rapidly, phonotactic knowledge clearly emerged and actively contributed to the reconstruction task. As the rate of improvement in reconstruction performance slowed down, approached an elbow, and gradually converged, attention scores similarly decreased and stabilized, accompanied by a reduced preference for well-formed sequences (as seen in Sections 3.3 and 3.4). This developmental trajectory reflects an evolution in the model’s decoding strategy across the learning process (Jang et al. 2024), from an initial stage in which relying specifically on phonotactic constraints was the optimal strategy for reconstruction to a stage in which the model relied instead on acoustic cues for the reconstruction task. Importantly, this shift was especially pronounced in models with higher dimensionalities, which showed a more rapid decline in phonotactic sensitivity. This suggests that increased representational capacity is linked to enhanced utilization of intricate acoustic features, rather than to dependence on abstract constraints such as phonotactics during the reconstruction process (Abitbul and Dar 2024; Radhakrishnan et al. 2019).
This transition from relying on abstract phonotactic constraints to acoustic details during the reconstruction process raises concerns regarding potential overparameterization and overpowering in models with increased dimensionalities, especially given the demands of the reconstruction task. The reconstruction task, which entails replicating Mel-spectrograms with high precision, inherently requires accurate preservation of acoustic details. In such scenarios, reliance on acoustic details emerged as a preferred strategy in the later stages of the current experiment. Previous research on both neural networks (Abitbul and Dar 2024; Arpit et al. 2017; Radhakrishnan et al. 2019) and human learners (Platzer and Bröder 2013; Rouder and Ratcliff 2006) suggests that tasks with high complexity or noise tend to promote memory-based strategies over abstract rule or constraint generalization. Consistent with these findings, the current models with higher dimensionalities and greater memory capacity were more adept at focusing on and memorizing intricate acoustic patterns, reducing their reliance on abstract phonotactic constraints. This is akin to the finding that longer training durations increase the model’s expressive power, increasing the likelihood of memorizing details (Arpit et al. 2017). It is crucial to clarify that the model’s focus on acoustic details in later stages of learning does not mean conventional overfitting: the model does not memorize specific training tokens per se, but rather develops representations finely tuned to reconstruct highly detailed acoustic features, effectively approaching an identity function (Radhakrishnan et al. 2019).
Given these observations, lower-dimensional models (e.g., dimensionalities of 4 or 8) appear more successful at preserving phonotactic generalizations for longer periods, displaying learning behaviors which more closely resemble human-like reliance on abstract phonotactic constraints. However, this resemblance should be interpreted with caution, as it is likely dependent on the specific characteristics of the reconstruction task. In real-world language acquisition, human learners perform multiple tasks simultaneously, rarely requiring the precise replication of every acoustic detail. Consequently, the reconstruction task employed here might artificially inflate the value of acoustic details, particularly for high-capacity models. Thus, while lower-dimensional models might appear more “human-like” in the context of this task, this observation may not be generalized to suggest these dimensionalities inherently reflect human cognitive mechanisms. Moreover, human learners are known to switch between constraint-based strategies and item memory-based approaches based on factors such as task presentation, stimulus complexity, and available cognitive resources (Platzer and Bröder 2013; Rouder and Ratcliff 2006). Therefore, a more systematic investigation of these factors across a broader range of tasks is necessary to establish which representational dimensionality most accurately reflects human phonological generalization. Incorporating more diverse tasks may also introduce implicit regularization during model training, which is shown to decrease reliance on memorization without sacrificing overall performance (Arpit et al. 2017).
In conclusion, this study demonstrates the feasibility of phonotactic learning from raw audio input. The current approach aligns with naturalistic learning conditions, making few assumptions about phonological knowledge and highlighting the potential for more empirically driven phonotactic learning models in future research. The model’s ability to implicitly learn phonotactic patterns through continuous exposure to natural speech supports parallel phonetic and phonotactic development without presuming concrete knowledge of one over the other, which may contribute to understanding the mechanisms underlying human language acquisition. This parallel development is in line with recent proposals arguing against the necessity of strict categorical phonemic knowledge in phonological acquisition (Davis and Redford 2019; Feldman et al. 2021; McMurray et al. 2002; Qi and Zevin 2024; Schatz et al. 2021). Our findings suggest that the model develops phonotactic awareness very early in training. While not the primary focus of the current study, this result implies that phonotactic knowledge may potentially develop in parallel with phonemic and allophonic knowledge, which are typically acquired in the early stages of phonological learning. This aligns with the demonstrated capabilities of neural networks to learn phonetic and phonological knowledge from raw audio input (Martin et al. 2023; Matusevych et al. 2023).
Acknowledgments
This paper has been greatly improved by incorporating feedback from Dr. Vsevolod Kapatsinski, Dr. Canaan Breiss, the anonymous reviewers, and the participants of the 19th Conference on Laboratory Phonology. A special thank you to the Art Tech Lab at the University of Hong Kong for providing the server used for the modeling work presented in this paper.
References
Abitbul, Koren & Yehuda Dar. 2024. How much training data is memorized in overparameterized autoencoders? An inverse problem perspective on memorization evaluation. arXiv. https://doi.org/10.48550/arXiv.2310.02897.
Albright, Adam. 2009. Feature-based generalisation as a source of gradient acceptability. Phonology 26(1). 9–41. https://doi.org/10.1017/S0952675709001705.
Albright, Adam & Bruce Hayes. 2003. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition 90(2). 119–161. https://doi.org/10.1016/S0010-0277(03)00146-X.
Arpit, Devansh, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio & Simon Lacoste-Julien. 2017. A closer look at memorization in deep networks. arXiv. https://doi.org/10.48550/arXiv.1706.05394.
Bank, Dor, Noam Koenigstein & Raja Giryes. 2021. Autoencoders. arXiv. https://doi.org/10.48550/arXiv.2003.05991.
Chen, Yi-Chen, Chia-Hao Shen, Sung-Feng Huang & Hung-yi Lee. 2018. Towards unsupervised automatic speech recognition trained by unaligned speech and text only. arXiv. https://doi.org/10.48550/arXiv.1803.10952.
Chung, Yu-An, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee & Lin-Shan Lee. 2016. Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv. https://doi.org/10.48550/arXiv.1603.00982.
Coleman, John & Janet Pierrehumbert. 1997. Stochastic phonological grammars and acceptability. In Computational Phonology: Third Meeting of the ACL Special Interest Group in Computational Phonology, 49–56. https://aclanthology.org/W97-1107 (accessed 18 August 2025).
Dalal, Murtaza, Alexander C. Li & Rohan Taori. 2019. Autoregressive models: What are they good for? arXiv. https://doi.org/10.48550/arXiv.1910.07737.
Daland, Robert, Bruce Hayes, James White, Marc Garellek, Andrea Davis & Ingrid Norrmann. 2011. Explaining sonority projection effects. Phonology 28(2). 197–234. https://doi.org/10.1017/S0952675711000145.
Davis, Maya & Melissa A. Redford. 2019. The emergence of discrete perceptual-motor units in a production model that assumes holistic phonological representations. Frontiers in Psychology 10. 2121. https://doi.org/10.3389/fpsyg.2019.02121.
Escudero, Paola & Marina Kalashnikova. 2020. Infants use phonetic detail in speech perception and word learning when detail is easy to perceive. Journal of Experimental Child Psychology 190. 104714. https://doi.org/10.1016/j.jecp.2019.104714.
Feldman, Naomi H., Sharon Goldwater, Emmanuel Dupoux & Thomas Schatz. 2021. Do infants really learn phonetic categories? Open Mind 5. 113–131. https://doi.org/10.1162/opmi_a_00046.
Fló, Ana, Perrine Brusini, Francesco Macagno, Marina Nespor, Jacques Mehler & Alissa L. Ferry. 2019. Newborns are sensitive to multiple cues for word segmentation in continuous speech. Developmental Science 22(4). e12802. https://doi.org/10.1111/desc.12802.
Hayes, Bruce & Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39(3). 379–440. https://doi.org/10.1162/ling.2008.39.3.379.
Hinton, Geoffrey E. & Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786). 504–507. https://doi.org/10.1126/science.1127647.
Jang, Dongjun, Sungjoo Byun & Hyopil Shin. 2024. A study on how attention scores in the BERT model are aware of lexical categories in syntactic and semantic tasks on the GLUE benchmark. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti & Nianwen Xue (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 1684–1689. Torino: ELRA and ICCL.
Jarosz, Gaja & Amanda Rysling. 2017. Sonority sequencing in Polish: The combined roles of prior bias and experience. Proceedings of the Annual Meetings on Phonology 4. 1–12. https://doi.org/10.3765/amp.v4i0.3975.
Kingma, Diederik P. & Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv. https://doi.org/10.48550/arXiv.1412.6980.
Kirby, James. 2021. Incorporating tone in the calculation of phonotactic probability. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 32–38. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.sigmorphon-1.4.
Liu, Alexander H., Tao Tu, Hung-yi Lee & Lin-Shan Lee. 2020. Towards unsupervised speech recognition and synthesis with quantized speech representation learning. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7259–7263. Barcelona: IEEE. https://doi.org/10.1109/ICASSP40776.2020.9053571.
Lugosch, Loren, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar & Yoshua Bengio. 2019. Speech model pre-training for end-to-end spoken language understanding. arXiv. https://doi.org/10.48550/arXiv.1904.03670.
Martin, Kinan, Jon Gauthier, Canaan Breiss & Roger Levy. 2023. Probing self-supervised speech models for phonetic and phonemic information: A case study in aspiration. arXiv. https://doi.org/10.48550/arXiv.2306.06232.
Matusevych, Yevgen, Thomas Schatz, Herman Kamper, Naomi H. Feldman & Sharon Goldwater. 2023. Infant phonetic learning as perceptual space learning: A crosslinguistic evaluation of computational models. Cognitive Science 47(7). e13314. https://doi.org/10.1111/cogs.13314.
Mayer, Connor & Max Nelson. 2020. Phonotactic learning with neural language models. Society for Computation in Linguistics 3(1). 149–159. https://doi.org/10.7275/G3Y2-FX06.
McMurray, Bob, Michael K. Tanenhaus & Richard N. Aslin. 2002. Gradient effects of within-category phonetic variation on lexical access. Cognition 86(2). B33–B42. https://doi.org/10.1016/S0010-0277(02)00157-9.
Mirea, Nicole & Klinton Bicknell. 2019. Using LSTMs to assess the obligatoriness of phonological distinctive features for phonotactic learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1595–1605. Florence: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1155.
Panayotov, Vassil, Guoguo Chen, Daniel Povey & Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. South Brisbane, Qld: IEEE. https://doi.org/10.1109/ICASSP.2015.7178964.
Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai & Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv. https://doi.org/10.48550/arXiv.1912.01703.
Pelucchi, Bruna, Jessica F. Hay & Jenny R. Saffran. 2009. Statistical learning in a natural language by 8-month-old infants. Child Development 80(3). 674–685. https://doi.org/10.1111/j.1467-8624.2009.01290.x.
Platzer, Christine & Arndt Bröder. 2013. When the rule is ruled out: Exemplars and rules in decisions from memory. Journal of Behavioral Decision Making 26(5). 429–441. https://doi.org/10.1002/bdm.1776.
Qi, Wendy & Jason D. Zevin. 2024. Statistical learning of syllable sequences as trajectories through a perceptual similarity space. Cognition 244. 105689. https://doi.org/10.1016/j.cognition.2023.105689.
Radhakrishnan, Adityanarayanan, Karren Yang, Mikhail Belkin & Caroline Uhler. 2019. Memorization in overparameterized autoencoders. arXiv. https://doi.org/10.48550/arXiv.1810.10333.
Ravishankar, Vinit, Artur Kulmizev, Mostafa Abdou, Anders Søgaard & Joakim Nivre. 2021. Attention can reflect syntactic structure (if you let it). arXiv. https://doi.org/10.48550/arXiv.2101.10927.
Rouder, Jeffrey N. & Roger Ratcliff. 2006. Comparing exemplar- and rule-based theories of categorization. Current Directions in Psychological Science 15(1). 9–13. https://doi.org/10.1111/j.0963-7214.2006.00397.x.
Schatz, Thomas, Naomi H. Feldman, Sharon Goldwater, Xuan-Nga Cao & Emmanuel Dupoux. 2021. Early phonetic learning without phonetic categories: Insights from large-scale simulations on realistic input. Proceedings of the National Academy of Sciences 118(7). e2001844118. https://doi.org/10.1073/pnas.2001844118.
Sherstinsky, Alex. 2020. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena 404. 132306. https://doi.org/10.1016/j.physd.2019.132306.
Smith, Caitlin, Charlie O’Hara, Eric Rosen & Paul Smolensky. 2021. Emergent gestural scores in a recurrent neural network model of vowel harmony. Society for Computation in Linguistics 4(1). 61–70. https://doi.org/10.7275/QYEY-4J04.
Thiessen, Erik D. & Jenny R. Saffran. 2003. When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental Psychology 39(4). 706–716. https://doi.org/10.1037/0012-1649.39.4.706.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser & Illia Polosukhin. 2023. Attention is all you need. arXiv. https://doi.org/10.48550/arXiv.1706.03762.
Vitevitch, Michael S. & Paul A. Luce. 2004. A web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers 36(3). 481–487. https://doi.org/10.3758/BF03195594.
Wang, Yu-Hsuan, Hung-Yi Lee & Lin-Shan Lee. 2018. Segmental audio Word2Vec: Representing utterances as sequences of vectors with applications in spoken term detection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6269–6273. Calgary, AB: IEEE. https://doi.org/10.1109/ICASSP.2018.8462002.
Yang, Yao-Yuan, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair & Yangyang Shi. 2022. TorchAudio: Building blocks for audio and speech processing. arXiv. https://doi.org/10.48550/arXiv.2110.15018.
Yoshida, Katherine A., Ferran Pons, Jessica Maye & Janet F. Werker. 2010. Distributional phonetic learning at 10 months of age. Infancy 15(4). 420–433. https://doi.org/10.1111/j.1532-7078.2009.00024.x.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/lingvan-2024-0210).
© 2025 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.