The implementation of a proposed deep-learning algorithm to classify music genres

Lili Liu

doi:10.1515/comp-2023-0106

Article Open Access

The implementation of a proposed deep-learning algorithm to classify music genres

Lili Liu

Published/Copyright: July 9, 2024

Published by

Become an author with De Gruyter Brill

Author Information Explore this Subject

From the journal Open Computer Science Volume 14 Issue 1

Abstract

To improve the classification effect of music genres in the digital music era, the article employs deep-learning algorithms to improve the performance of the classification of music genres. An auxiliary (estimated) model is constructed to estimate the amount of unmeasured data in the dual-rate system to enhance the recognition effect of music features. Moreover, a dual-rate output error model to identify such impacts is proposed to eliminate the impact of corrupt data caused by the estimation, which eventually leads to the further improvement of the proposed model called dual-rate multi-innovation forgetting gradient algorithm based on the auxiliary model. In addition, the article employs linear time-varying forgetting factors to improve the stability of the system, advances the recognition effect of music features through enhancement processing, and combines a deep-learning algorithm to construct a classification system of music genres. The result shows that the classification of the music genre system based on a deep-learning algorithm has a good music genre classification effect.

Keywords: deep learning; music genre; classification; model

1 Introduction

Music genres are constituted of various music types, forms, and styles composed by several different composers or instrumentalists in distinct periods and have come from several cultural backgrounds. Music specialists usually divide a piece of music into genres based on attributes of musical instruments, forms of expression, regional culture, and content of expression. However, when conventional classification methods are substituted, results could be artificial and consist of large subjective factors. Thus, there is no clear absolute standard that could be used for dividing music genres without error [1].

The extraction of music features is an indispensable step in a classification task, and the quality of the featured music is a crucial factor that affects the accuracy of the classification system. Hence, the conventional method of extracting features requires rich prior knowledge and complex mathematical tools, which makes it difficult to break through the bottleneck.

With the popularity of deep-learning algorithms, it is possible to train data to obtain valuable information and accurately characterize the musical features of music genres. Hence, there is no need to design separate processes that deal with extracting features and running classifications [2]. According to the literature, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the most widely implemented deep-learning models in the research of music genre classification. Due to the different structures of CNNs and RNNs, the focus of the mastered features could not be the same [3].

Designing a system that could automatically classify music genres and improve the accuracy of the classification process as much as possible is a significant research subject. The deep-learning models have proved their capability in many disciplines and research fields. Generally, the spectrogram of the music signals can be regarded as a picture in essence, and spectrograms that are used as inputs into CNNs help reach a better classification effect. As a result, a growing number of studies have introduced several deep-learning models to classify music genres due to the high achievements of those methods. Although CNNs have achieved very good results in some fields, especially in image processing, the classification tasks of music genres using CNNs have rarely been implemented. The same is true for RNNs.

In the article, the input of the deep-learning model is a piece of music signal, which is essentially represented by an input sequence called a time series. Even though conventional networks pay the same attention to classifying music genres, RNNs are more suitable than others to master relationships when input has time series characteristics, which are called sequenced inputs.

According to the analysis of the music signal, the prelude, the end, and other relatively less important moments could be given less attention. However, the section could highlight the rhythm style of the whole music in the climax section of the song. Then, a higher degree of attention should be given so that the network can employ limited resources and pay more attention to the most salient and closely related information in an input sequence. Thus, an attention mechanism is also added to RNNs to improve its performance, so that different time series features could be assigned to different weights when the model is trained. Hence, the attention probability distribution corresponding to the feature representation is calculated through the attention mechanism. Therefore, obtaining a feature representation could more accurately characterize musical characteristics and improve classification accuracy.

Even though the number of related research covering deep-learning methods grows exponentially, the number of studies has still been limited. Nevertheless, deep-learning-based methods have gained momentum to derive music features and classify music genres recently. Prabhakar and Lee [4] proposed five interesting and novel approaches for music genre classifications, which are called weighted visibility graph-based elastic net sparse classifier, stacked denoising autoencoder classifier, Riemannian alliance tangent space mapping transfer learning, transfer support vector machine algorithm, and lastly bidirectional long short-term memory attention model with graphical convolution network. Hongdan et al. [5] developed a deep-learning method taking into account the disparities in spectrums that can predict and classify song genres better. Foleis and Tavares [6] presented a novel texture selector based on K-Means aimed to identify diverse sound textures within each track. The results show that capturing texture diversity within tracks is important for improving classification performance. Salazar [7] proposed an music genre classification system by using two levels of hierarchical mining, gray-level co-occurrence matrix networks generated by the Mel-spectrogram, and a multi-hybrid feature strategy. Yu et al. [8] proposed a new model incorporating with attention mechanism based on a bidirectional recurrent neural network. Furthermore, two attention-based models (serial attention and parallelized attention) are implemented to have better classification outcomes. Folorunso et al. [9] implemented the global mean (tree SHAP) method to determine feature importance and impact on the classification model. Further analysis of the individual genres found some nearness in the timbral properties between some of the genres. Chapaneri et al. [10] studied features that are extracted from the music signal for an effective representation to aid in genre classification. The feature set comprises dynamic, rhythm, tonal, and spectral features comprising a total of several useful features. Kumaraswamy and Poonacha [11] proposed a new music genre classification model that includes two major processes: feature extraction and classification. In the feature extraction phase, features like “non-negative matrix factorization features, short-time Fourier transform features, and pitch features” are extracted. Farajzadeh et al. [12] suggested a tailored deep neural network-based method, termed PMG-Net, that automatically classifies Persian music genres. Singh and Biswas [13] assessed and compared the robustness of some commonly used musical and non-musical features on deep-learning models for the MGC task by evaluating the performance of selected models on multiple employed features extracted from various datasets accounting for billions of segmented data samples.

Deep-learning algorithms help concurrently design processes that conduct feature extractions and run classifications. Thus, designing a system that could automatically classify music genres and improve the accuracy of the classification process as much as possible is a broadly examined research subject. An attention mechanism is implemented in the network so limited resources that are characterized as the most salient and closely related information in the input sequence are captured. An RNN model with the attention mechanism assigns different time series features to distinct weights when the model is trained. Hence, the attention probability distribution corresponding to the feature representation is calculated through the attention mechanism. Besides, linear time-varying forgetting factors are employed to improve the stability of the system, and the effect of feature recognition of music genres through enhancement processing is progressed. The results suggest that obtaining a feature representation more accurately characterizes music genres and improves classification accuracy in the article.

2 Related work

The literature has research mainly focusing on music genre classifications. A convolutional deep belief network (CDBN) was employed to pre-train the entire dataset in an unsupervised manner on the Million-Song dataset and then used the mastered parameters to initialize a convolutional multilayer perceptron with the same architecture. A decent accuracy in music genre classification and artist identification tasks was achieved [14]. Partesotti et al. [15] proposed a music genre classification method based on the segment features of a long short-term memory (LSTM) that was used to master the representation of frame-level features to obtain segment features and then combined the LSTM segment features with initial frame features to attain fused segment features. Evaluations on the ISMIR database showed that the LSTM segment features outperformed frame features. Babich [16] applied a CNN to music genre classification and compared the results with those obtained with hand-crafted features and Support vector machine classifiers. Gonçalves and Schiavoni [17] fused CNN-learned features and hand-crafted features to evaluate the complementarity between these representations in the task of music genre classification. Gorbunova [18] took a small set of eight musical features that embodied dynamics, timbre, and pitch as input to the CNN. The CNN was then trained in such a way that the filter dimensions were interpretable in time and frequency domains, and the results on the GTZAN dataset showed that eight musical features based on dynamics, timbre, and pitch perform better than Mel-spectrograms were found. Dickens et al. [19] and Vereshchahina-Biliavska et al. [20] proposed the utilization of masked CNNs for music genre recognition. Thus, conditional neural networks preserved inter-frame relationships. Then, the multivariate conditional neural network extended conditional neural network by performing masking operations on network links. The masking process induced the network to learn and automatically explore a range of feature combinations within the frequency band and helped neurons in hidden layers become feature vector local region experts. Unlike typical frame-level feature representations, Tabuena [21] proposed a CNN architecture that used sample-level filters to learn feature representations and mainly conducted three aspects of work: reducing the sampling frequency of audio signals to shorten the training time, combining transfer learning to expand multi-level and multi-scale aggregation features, and visualizing the filters learned by the sample layer CNN, and explaining the learned features. Turchet and Barthet [22] proposed a deep RNN automatic labeling algorithm based on scattering transformation features. The five-layer RNN with a gated recurrent unit (GRU) could fully utilize the scattering transform spectrogram, and the effect was better than that based on features such as the MFCC and Mel spectrogram. Khulusi et al. [23] applied an LSTM to music genre classification, extracted three features of MFCC, spectral centroid, and spectral contrast from the GTZAN dataset, and then trained the LSTM based on this feature. Cano et al. [24] compared Bi-directional recurrent neural network (BRNN), GRU, GRU parallel, and serial attention models based on the BRNN through experiments and verified the effectiveness of the attention mechanism.

Magnusson [25] designed dense inception, which is a novel CNN architecture for music genre classification that could improve the transfer of information between the input and output and the multi-scale fusion feature to choose the kernel size independently. Gorbunova and Petrnova [26] proposed a music classification method based on RNNs and attention mechanisms. The music was segmented, and the feature sequence was extracted from the main melody of the segment. Then, an RNN was used to learn the semantic features of the audio from the feature sequence of the musical segment, and an attention mechanism was added to assign different attention weights to the mastered features.

Amarillas [27] applied CDBNs to unlabeled auditory data such as speech and music and evaluated unsupervised mastered feature representations in multiple audio classification tasks. Scavone and Smith [28] used a complex network to model music, where each vertex represented a song, and the edges between vertices were represented by conditional probability vectors computed by first-order Markov chains. Finally, the rhythm features were extracted from the community detection method in the complex network for hierarchical clustering to resolve the problem of automatic classification of music genres. Turchet et al. [29] extracted local patches from time-frequency transformed music signals, which were then preprocessed and used for the K-means clustering for unsupervised learning of a local feature dictionary. The local feature dictionary was further convoluted to extract feature responses for classification. Way [30] decomposed the data matrix of unlabeled samples into basis and activation matrices through sparse coding. Each sample was represented as a linear combination of the columns in the basis matrix, followed by a basis that remained fixed to obtain activations for labeled data. Finally, these activations were utilized for music genre classification.

3 Error classification model of music genre

A summary of the proposed method is articulated as follows: Noise added to the dual-rate discrete state space model leads to the output error model of the dual-rate system by introducing an intermediate variable. Since the output data contained unmeasured data, the general method of identifying single-rate systems could not be used. Thus, the dual-rate system is implemented and transformed into an equivalent form that uses a polynomial transformation method, and the output signal in the information vector of the equivalent model will become the dual-rate output signal. However, the parameter estimates of the transformed model are biased. Therefore, the article proposes a method based on deviation compensation to successfully resolve the problem. Also, the polynomial transformation method increases the number of parameters that need to be identified, so the accuracy would be low, and its convergence would be difficult to prove. Hence, the iterative algorithm is implemented and could use a large amount of data, thereby improving the accuracy of parameter estimation. Finally, the estimated model is called an auxiliary model. An estimate of the data could be easily attained that could be measured in the dual-rate system. To reach optimal parameter scores, the recursive augmented stochastic gradient search algorithm is employed. Eventually, the dual-rate multi-innovation forgetting gradient algorithm based on the auxiliary model (DR-AMMSG) is proposed. Finally, the steps of the proposed method are presented in an algorithm to be followed easily.

By adding noise to the dual-rate discrete state space model, the output error model of the dual-rate system is obtained as follows:

(1) y ( k q h ) = b 1 z − h + b 2 z − 2 h + … + b n z − n h 1 + a 1 z − h + a 2 z − 2 h + … + a n z − n h u ( k q h ) + v ( k q h ) = B ( z ) A ( z ) u ( k q h ) + v ( k q h ) .

By introducing an intermediate variable x ( t ) , equation (1) can be transformed into equation (2) as follows:

(2) x ( k q h ) = B ( z ) A ( z ) u ( k q h ) , y ( k q h ) = x ( k q h ) + v ( k q h ) .

Here, y ( t ) is the output of the disturbed system. Without loss of generality, when t ≤ 0 , u ( t ) = 0 , y ⇔ 0 , ∀ , the input and output data are represented by { u ( k h ) , y ( k q h ) , q > 1 , q ∈ z } , that is, the dual-rate output error model was investigated. Since the output data contained an unmeasured item y ( k q − h ) i , h ( i 1 , 2 , − ) , the general method of identifying single-rate systems could not be used.

The study of the dual-rate model transformed into an equivalent form uses a polynomial transformation method, and the output signal in the information vector of the equivalent model is the dual-rate output signal. However, the parameter estimates of the transformed model are biased. Therefore, the article proposes a method based on deviation compensation to successfully resolve the problem. However, the polynomial transformation method increases the number of parameters that need to be identified, so the accuracy is low, and its convergence is difficult to prove.

The recursive algorithm only uses limited output and input datasets { u ( i h ) , y ( i q h ) , i = 0 , 1 , 2 , … , k } but does not use data { u ( i h ) , y ( i q h ) , i = k + 1 , k + 2 , … , L } . However, the iterative algorithm uses a large amount of data, thereby improving the accuracy of parameter estimation. The schematic diagram of the equivalent model constructed by the intermediate variables of the output error class is depicted in Figure 1.

Figure 1

An output error system with an auxiliary model.

The idea of the auxiliary model is to establish an equivalent model B a ( z ) A a ( z ) with the structure same as the structure of the original system B ( z ) A ( z ) for the unavailable intermediate variable x ( t ) in the system, and the equivalent intermediate variable is represented by x a ( t ) = B a ( z ) A a ( z ) u ( t ) . The estimated model B ˆ ( z ) A ˆ ( z ) of B ( z ) A ( z ) is usually used as the equivalent model. With the aid of the auxiliary-based equivalent model, an estimate of the data could be easily attained that could be measured in the dual-rate system. However, this processing will result in a large number of estimates in the parameter vector.

The intermediate variable model is examined by

(3) x ( k q h ) = B ( z ) A ( z ) u ( k q h ) = b 1 z − h + b 2 z − 2 h + … + b n z − n h 1 + a 1 z − h + a 2 z − 2 h + … + a n z − n h u ( k q h ) .

The algorithm converts the intermediate variable x ( k q h ) into the following form:

(4) x ( k q h ) = φ T ( k q h ) θ , φ T ( k q h ) = [ − x ( k q h − h ) , − x ( k q h − 2 h ) , … , − x ( k q h − n h ) u ( k q h − h ) , u ( k q h − 2 h ) , … , u ( k q h − n h ) ] θ = [ a 1 , a 2 , … , a n , b 1 , b 2 , … , b n ] T .

The expression φ ( k q h ) is the information vector of the x ( k q h ) model and θ is the parameter vector of the intermediate variable x ( k q h ) model. Then, equation (2) can be rewritten in the following form:

(5) y ( k q h ) = φ T ( k q h ) θ + v ( k q h ) .

Here, θ ^ ( k q h ) is the estimate of the parameter vector θ at a time k q h and ∥ X ∥ 2 = tr [ X X T ] represents the norm of X . A criterion function is defined by

(6) J ( θ ) = ∥ y ( k q h ) − φ T ( k q h ) θ ∥ 2 .

The optimal parameter estimation can be obtained by minimizing equation (6). However, an issue exists. The information vector φ ( q t ) contains unmeasurable x ( k q h − i h ) , i = 1 , 2 , … , n , and the incomplete data volume makes it impossible to optimize equation (6). The estimate x ˆ ( k q h − i h ) of x ( k q h − i h ) is equivalent to it, and the model of the intermediate variable x ˆ ( k q h − i h ) after the equivalence is represented as follows:

(7) x ˆ ( k q h − i h ) = φ ^ T ( k q h − i h ) θ ^ ( k q h − i h ) , φ ^ T ( k q h − i h ) = [ − x ˆ ( k q h − ( i + 1 ) h ) , − x ˆ ( k q h − ( i + 2 ) h ) , … , − x ˆ ( k q h − ( i + n ) h ) u ( k q h − ( i + 1 ) h ) , u ( k q h − ( i + 2 ) h ) , … , u ( k q h − ( i + n ) h ) ] .

The unknown parameter vector θ ^ ( k q h − i h ) in equation (7) is calculated by using the following equation:

(8) θ ^ ( k q h − i h ) = θ ^ ( k q h − q h ) , i = 1 , 2 , … , q − 1 , θ ^ ( k q h ) , i = 0 .

Therefore, the auxiliary (estimated) model of x ˆ ( k q h − i h ) is presented as follows:

(9) x ˆ ( k q h − i h ) = φ ^ T ( k q h − i h ) θ ^ ( k q h − q h ) , i = 1 , 2 … , q − 1 , φ ^ T ( k q h ) θ ^ ( k q h ) , i = 0 , φ ^ T ( k q h − i h ) = [ − x ˆ ( k q h − ( i + 1 ) h ) , − x ˆ ( k q h − ( i + 2 ) h ) , … , − x ˆ ( k q h − ( i + n ) h ) , u ( k q h − ( i + 1 ) h ) , u ( k q h − ( i + 2 ) h ) , … , u ( k q h − ( i + n ) h ) ] ∈ R 2 n .

After establishing such an auxiliary (estimated) model, it is impossible to obtain a measurable approximation.

The method of stochastic gradient search is chosen to optimize equation (6). The parameter estimation algorithm is obtained based on gradient search as follows:

(10) θ ^ ( k q h ) = θ ^ ( k q h − q h ) + φ ^ ( k q h ) r ( k q h ) [ y ( k q h ) − φ ^ T ( k q h ) θ ^ ( k q h − q h ) ] , r ( k q h ) = r ( k q h − q h ) + ∥ φ ^ ( k q h ) ∥ 2 , r ( 0 ) = 1 .

The expression e ( k q h ) = y ( k q h ) − φ ^ T ( k q h ) θ ^ ( k q h − q h ) is the innovation at the time k q h .

For the least squares identification algorithm and the recursive augmented stochastic gradient algorithm using the auxiliary (estimated) model, the parameter vector recursive equations of these two algorithms contain innovation scalars. The recurrence expression of the parameter vector of the dual-rate multiple innovation identification method is proposed that contains the innovation vector.

The following vector matrix is assumed:

(11) Φ ( p , k q h ) = [ φ ^ ( k q h ) , φ ^ ( k q h − q h ) , … , φ ^ ( k q h − ( p − 1 ) q h ) ] , Y ( p , k q h ) = [ y ( k q h ) , y ( k q h − q h ) , … , y ( k q h − ( p − 1 ) q h ) ] T , V ( p , k q h ) = [ v ( k q h ) , v ( k q h − q h ) , … , v ( k q h − ( p − 1 ) q h ) ] T .

The expression Φ ( p , k q h ) , Y ( p , k q h ) , V ( p , k q h ) is a vector matrix containing p vectors. In this way, the output vector Y ( p , k q h ) model is obtained as follows:

(12) Y ( p , k q h ) = Φ T ( p , k q h ) θ + V ( p , k q h ) .

When the following criterion function is considered

(13) J ( θ ) = ∥ Y ( p , k q h ) − Φ T ( p , k q h ) θ ∥ 2 .

The gradient search method is used to minimize the function:

(14) θ ^ ( k q h ) = θ ^ ( k q h − q h ) − μ k q h 2 grad [ J ( θ ^ ( k q h − q h ) ) ] = θ ^ ( k q h − q h ) + μ k q h Φ ^ ( p , k q h ) [ Y ( p , k q h ) − Φ ^ T ( p , k q h ) θ ^ ( k q h − q h ) ] .

The expression μ k q h is the step length of the iterative search of the gradient of k q h sampling points. Considering the step size of the recursive gradient search, the forgetting factor λ is added, and the choices are presented as follows:

(15) μ k q h = 1 r ( k q h ) , r ( k q h ) = λ r ( k q h − q h ) + ∥ Φ ^ ( p , k q h ) ∥ 2 , r ( 0 ) = 1 .

In summary, the steps of the DR-AMMSG based on the auxiliary (estimated) model are presented as follows:

θ ^ ( k q h ) = θ ^ ( k q h − q h ) + Φ ^ ( p , k q h ) r ( k q h ) E ( p , k q h )
E ( p , k q h ) = Y ( p , k q h ) − Φ ^ T ( p , k q h ) θ ^ ( k q h − q h ) = y ( k q h ) − φ ^ T ( k q h ) θ ^ ( k q h − q h ) y ( k q h − q h ) − φ ^ T ( k q h − q h ) θ ^ ( k q h − q h ) ⋯ ⋯ y ( k q h − ( p − 1 ) q h ) − φ ^ T ( k q h − ( p − 1 ) q h ) θ ^ ( k q h − q h )
r ( k q h ) = λ r ( k q h − q h ) + ∥ Φ ^ ( p , k q h ) ∥ 2 , r ( 0 ) = 1
Φ ^ ( p , k q h ) = [ φ ^ ( k q h ) , φ ^ ( k q h − q h ) , … , φ ^ ( k q h − ( p − 1 ) q h ) ]
φ ˆ T ( k q h ) = [ − x ˆ ( k q h − h ) , − x ˆ ( k q h − 2 h ) , … , − x ˆ ( k q h − n h ) , u ( k q h − h ) , u ( k q h − 2 h ) , … , u ( k q h − n h ) ] ∈ R 2 n
x ˆ ( k q h − i h ) = φ ^ T ( k q h − i h ) θ ^ ( k q h − q h ) , i = 1 , 2 … , q − 1 φ ^ T ( k q h ) θ ^ ( k q h ) , i = 0 ,
(16) φ ˆ T ( k q h − i h ) = [ − x ˆ ( k q h − ( i + 1 ) h ) , − x ˆ ( k q h − ( i + 2 ) h ) , … , − x ˆ ( k q h − ( i + n ) h ) u ( k q h − ( i + 1 ) h ) , u ( k q h − ( i + 2 ) h ) , … , u ( k q h − ( i + n ) h ) ] ∈ R 2 n .

The expression E(p,kqh) is the innovation vector of length p.

The flow of the proposed algorithm is presented as follows:

The algorithm is initialized and k = 1 , θ ^ ( 0 ) = I n / p 0 , r ( 0 ) = 1 , p 0 = 10 6 is set.
The algorithm collects input and output datasets { u ( k h ) , y ( k q h ) } and finds φ ˆ ( k q h ) them according to equation (5), where the unmeasurable x ( k q h − i h ) is calculated by equations (6) and (7), and Φ ^ ( p , k q h ) is calculated according to equation (4).
The algorithm obtains the innovation vector E ( p , k q h ) and r ( k q h ) equations (2) and (3), respectively.
The algorithm finds θ ^ ( k q h ) from equation (1).
The algorithm increases the time k by 1 and jumps to step 2.

The dual-rate sampling data system model is presented as follows:

(17) y ( 2 t ) = B ( z − 1 ) A ( z − 1 ) u ( 2 t ) + v ( 2 t ) , A ( z − 1 ) = 1 + a 1 z − 1 + a 2 z − 2 = 1 − 1.1 z − 1 + 0.5 z − 2 , B ( z − 1 ) = b 1 z − 1 + b 2 z − 2 = 0.16 z − 1 − 0.8 z − 2 , θ = [ a 1 , a 2 , b 1 , b 2 ] .

The parameter scores q = 2 , h = 1 are used in the dual-rate system, and the sampled dual-rate signal is denoted by { u ( t ) , y ( 2 t ) } . The two-rate least squares identification algorithm (AMLS) using the auxiliary (estimated) model and the two-rate multiple innovation stochastic gradient identification algorithm with a forgetting factor (auxilary model multi-innovation forgetting gradient) based on the auxiliary model are compared to identify the system. The error of the parameter estimations is calculated by δ = ∥ θ ^ − θ ∥ / ∥ θ ∥ , where θ ^ represents the estimation of the parameter vector, and θ represents the true value of the system parameter vector.

$Figure 2 The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 1 \lambda =1 varies with time t.$

Figure 2

The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 1 varies with time t.

3.1 Implementation

When the forgetting factor is represented by λ = 1 , the system parameter estimation error is shown in Figure 2.

Figure 2 depicts that as the innovation length p increases, the convergence speed of the parameter estimation system and the identification accuracy gradually become faster and higher.

When the forgetting factor takes λ = 0.95 and 0.9, respectively, the schematic diagrams of the system parameter estimation error are shown in Figures 3 and 4.

$Figure 3 The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 0.95 \lambda =0.95 varies with time t.$

Figure 3

The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 0.95 varies with time t.

$Figure 4 The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 0.9 \lambda =0.9 varies with time t.$

Figure 4

The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 0.9 varies with time t.

From Figures 3–5, adjusting the forgetting factor could speed up the convergence speed of the parameter estimation, but the addition of the forgetting factor will increase the fluctuation of the system parameter estimation. Therefore, the article considers a linear time-varying forgetting factor λ ( t ) = 0.9 + 0.1 N t , where N is the number of recursive steps, which is set to 3,000 in the example. Thus, the stability of the system is improved, and the effect of music feature recognition through enhancement processing is advanced. The simulation results are shown in Figure 6.

Figure 5

The variation curve of parameter estimation error δ with time t when the DR-AM-MSG algorithm has a time-varying λ.

Figure 6

A schematic diagram of the basic structure of an RNN.

4 Music genre classification based on a deep-learning algorithm

In this section, we briefly explain what are the components of an RNN structure and how it will be implemented in the classification of the music genres based on the proposed algorithm in the previous section. The RNN is a neural network that specializes in processing time series sequences. Its basic structure is shown in Figure 6. The left side of Figure 6 shows the structure of the RNN. Module A represents the hidden nodes in the network, and xt is the value of the input sequence x at the t-th time. Ot is the output of the hidden node at the t-th time, ht is the hidden state of the hidden node at the t-th time, and U, V, and W are the parameter matrices in the network.

Due to the limitations of the shallow structures, the classifier makes it difficult to express the music sequence and semantic information at a deeper level, which affects the performance of the classification. The proposed method, according to the feature sequence of the input music, employs both the RNN and the attention mechanism concurrently.

Hence, the Bi-GRU and attention mechanisms are implemented to design the network to classify music. The Bi-GRU is good at processing sequenced data and automatically masters music context semantics and high-level features from the sequenced features.

By expanding the loop edges of the hidden node along the time axis, the chain structure shown on the right side of Figure 6 can be obtained. RNNs share parameters at different moments and positions and have two advantages. On the one hand, as the parameter space can be reduced, the scale of the neural network can be reduced. Thus, the generalization ability can be guaranteed. On the other hand, it gives the RNN both memory and learning abilities and stores useful information in the parameter matrix U, V, and W.

The input and output modes of the cyclic neural network are very flexible, and there can be multiple modes, including one-to-many, many-to-many, many-to-one, and can be adapted to a variety of tasks, as shown in Figure 7.

Figure 7

A schematic diagram of the input and output of a cyclic neural network.

Figure 7(a) shows the one-to-many mode, which is often used for decoder modeling. It is suitable for inputting a code vector, decoding it, and outputting the corresponding decoding sequence. Figure 7(b) shows the many-to-one model, which is often used for sequence classifier modeling. It is suitable for deep-learning tasks of the input sequence and single output. The output node of the recurrent unit passes directly through the classifier. Figure 7(c) shows the synchronous many-to-many mode. Each time step of the sequence corresponds to an output, which can be used for tasks such as text generation and music synthesis. Figure 7(d) shows the asynchronous many-to-many mode, which is often used for encoder–decoder modeling, which can be obtained by coupling and connecting two RNNs based on context. Both the input data sequence and the output target sequence are variable in length, and the lengths are allowed to be unequal, which is suitable for machine translation problems.

On the other hand, preprocessing is an indispensable stage in processing music signals, and its main purpose is to facilitate feature extraction in the next stage. The extracted characteristics are another form of expression of the music signals. Because the music signal originally contains a lot of redundancy, if the time domain audio signal is directly input into the classification system, the amount of calculation would be catastrophic. Finally, the extracted feature parameters are input into the classifier, and then the feature is modeled by adjusting the parameters of the classifier. The best model obtained by training is implemented to realize the discrimination of the genre utilizing test music samples.

Figure 8 shows a music genre classification system based on the RNN.

Figure 8

Music genre classification system based on the RNN.

Since the data are limited, the method employed for the data set enhancement is required and is shown in Figure 9. As mentioned above, in the two data sets used for classification, the duration of each song excerpt segment C is 30 s. In the article, each segment is cut, and the duration of each sub-segment ci, after being cut, is 3 s. Moreover, there is a 50% overlap between two adjacent sub-segments. The excerpts of each song are cut into 18 sub-segments with a duration of 3 s (because the sample duration is about 30 s, the last slice may be less than 3 s, so it is discarded). In addition, each sub-segment carries the same genre tag as the source segment.

Figure 9

A schematic diagram of data enhancement.

The structure of the specific features is shown in Figure 10.

Figure 10

Feature classification of a music signal.

4.1 Result of music genre classification

The dual-rate system and auxiliary (estimated) model are used to generate errors and also to optimize the parameters of the RNN model. A simulation experiment is carried out using test parameters, namely, audio recognition error, feature classification effect of music signal, and classification effect of music genre. The results are shown in Tables 1–3. The model is trained based on the two data sets using 0.80 for training and 0.20 for the test. Simulation data are used to generate the predictions for Tables 1–3. When the average audio error rate (the mean of Table 1) is 0.33, the feature classification of the musical signal reaches 0.8945, and the genre classification reaches 82.55.

Table 1

Audio recognition error

Num	Audio error (%)	Num	Audio error (%)	Num	Audio error (%)
1	0.0280	27	0.0783	53	0.4244
2	0.4582	28	0.4826	54	0.2415
3	0.5972	29	0.1616	55	0.3695
4	0.0205	30	0.5913	56	0.0267
5	0.5946	31	0.5011	57	0.5165
6	0.2911	32	0.5818	58	0.4447
7	0.0151	33	0.2085	59	0.5772
8	0.4425	34	0.3836	60	0.1382
9	0.5340	35	0.1526	61	0.0896
10	0.0458	36	0.2599	62	0.3846
11	0.2517	37	0.0804	63	0.0417
12	0.3598	38	0.1525	64	0.3797
13	0.3856	39	0.4676	65	0.0691
14	0.3569	40	0.4867	66	0.3118
15	0.3577	41	0.4868	67	0.3589
16	0.1420	42	0.3443	68	0.5953
17	0.4973	43	0.1902	69	0.4151
18	0.0627	44	0.2261	70	0.1155
19	0.3401	45	0.4719	71	0.5692
20	0.5157	46	0.3057	72	0.3158
21	0.3039	47	0.4056	73	0.5894
22	0.2468	48	0.2800	74	0.2500
23	0.0631	49	0.4487	75	0.1793
24	0.0880	50	0.5337	76	0.5303
25	0.3448	51	0.0426	77	0.5372
26	0.5567	52	0.5400	78	0.5618

Table 2

Feature classification effect of music signal

Num	Feature recognition (%)	Num	Feature recognition (%)	Num	Feature recognition (%)
1	88.47	27	80.51	53	86.81
2	91.84	28	83.59	54	87.99
3	94.93	29	91.00	55	94.38
4	82.63	30	80.42	56	80.02
5	82.87	31	82.67	57	88.24
6	87.81	32	91.68	58	82.03
7	92.88	33	92.35	59	85.70
8	81.18	34	81.69	60	92.88
9	82.44	35	84.39	61	85.30
10	88.98	36	92.30	62	88.55
11	87.66	37	79.15	63	85.16
12	94.12	38	90.05	64	79.51
13	90.27	39	81.64	65	90.08
14	90.11	40	80.33	66	94.18
15	85.08	41	93.25	67	83.60
16	89.27	42	86.59	68	87.32
17	86.58	43	87.90	69	80.45
18	79.80	44	91.03	70	93.19
19	79.26	45	84.37	71	90.66
20	82.48	46	79.09	72	92.75
21	87.55	47	80.57	73	84.21
22	90.32	48	90.63	74	85.62
23	86.91	49	80.73	75	81.72
24	85.49	50	79.15	76	87.95
25	82.86	51	85.89	77	93.57
26	94.39	52	83.22	78	81.10

Table 3

Classification effect of music genre

Num	Genre classification (%)	Num	Genre classification (%)	Num	Genre classification (%)
1	79.16	27	90.02	53	83.95
2	77.89	28	79.89	54	76.58
3	78.28	29	91.61	55	87.26
4	85.66	30	76.30	56	75.51
5	86.15	31	78.38	57	75.27
6	75.88	32	76.16	58	82.38
7	75.45	33	75.65	59	87.05
8	83.74	34	90.70	60	76.49
9	83.82	35	91.08	61	78.79
10	77.80	36	79.11	62	75.68
11	87.94	37	83.77	63	81.13
12	78.96	38	87.05	64	89.59
13	78.04	39	88.90	65	87.73
14	76.13	40	81.78	66	88.81
15	79.98	41	81.39	67	83.09
16	80.16	42	87.48	68	89.29
17	83.06	43	89.92	69	82.32
18	81.78	44	87.67	70	81.69
19	85.79	45	90.31	71	82.90
20	84.62	46	75.55	72	81.76
21	85.37	47	90.83	73	90.78
22	77.27	48	78.04	74	87.98
23	83.05	49	78.17	75	79.07
24	77.89	50	81.98	76	87.43
25	78.37	51	87.22	77	82.45
26	79.40	52	75.03	78	91.80

The results imply that even though the audio error with a large mean occurs in the system, the proposed system could handle it in feature and genre classifications.

5 Conclusion

With the rapid development of network technology and multimedia platforms, the amount of digital music has increased rapidly, and it is difficult for listeners to manage these huge amounts of stored music. Moreover, listeners need to quickly and accurately retrieve the music they are interested in from a huge music database. Music genres are different musical styles formed by different melodies, instruments, rhythms, and other characteristics under different periods and different cultural backgrounds. Therefore, the classification of music genres has become a very important research direction in the field of music information retrieval.

Designing a system that automatically classified music genres and improved the accuracy of the classification process as much as possible has been a highly desired consequence. The article employed a deep-learning algorithm (RNN) to study music genre classification and proposed a music genre classification system with an intelligent music feature recognition function. Hence, the RNN helped separately design processes that conducted feature extractions and ran classifications concurrently. Thus, an attention mechanism was implemented in the network so limited resources that were characterized as the most salient and closely related information in the input sequence was captured. The RNN model with the attention mechanism assigned different time series features to distinct weights when the model was trained. Hence, the attention probability distribution corresponding to the feature representation was calculated through the attention mechanism. Moreover, obtaining a feature representation more accurately found musical characteristics and improved classification accuracy in the article.

The RNN shared parameters at different moments and positions and had two advantages. On the one hand, as the parameter space was reduced, the scale of the neural network was deceased. Thus, the generalization ability was guaranteed. On the other hand, it gave the RNN both memory and learning abilities and stored useful information in the parameter matrices. Moreover, adjusting the forgetting factor could speed up the convergence speed of the parameter estimation, but the addition of the forgetting factor would increase the fluctuation of the system parameter estimation. Therefore, the article also considered a linear time-varying forgetting factor. Even though improvement has been based on the forgetting factor λ, fluctuated estimations of the system parameters required to use of another forgetting factor called a linear time-varying forgetting factor, which is the same function for a time. Hence, the stabilization of the classification system is better reached. So, instead of using a fixed value, a linear time-varying form of the forgetting factor is implemented.

When the forgetting factor is denoted by λ = 1 , and the innovation length p increases, the convergence speed of the parameter estimation system and the identification accuracy gradually become faster and higher. When the forgetting factor takes λ = 0.95 and λ = 0.9 , respectively, adjusting the forgetting factor could speed up the convergence speed of the parameter estimation, but the addition of the forgetting factor will increase the fluctuation of the system parameter estimation. Finally, the proposed model has provided a good music genre classification effect.

Future direction will be based on the performance comparison of the proposed method with other implementable methods to classify music genres.

Funding information: The research did not receive any funding.
Author contributions: The article is written by a single author.
Conflict of interest: Author declares no conflict of interest.
Ethical approval: No ethical approval is needed.
Informed consent: No consent is needed.

References

[1] F. Calegario, M. Wanderley, S. Huot, G. Cabral, and G. Ramalho, “A method and toolkit for digital musical instruments: generating ideas and prototypes,” IEEE Multimed, vol. 24, no. 1, pp. 63–71, 2017.10.1109/MMUL.2017.18Search in Google Scholar

[2] D. Tomašević, S. Wells, I. Y. Ren, A. Volk, and M. Pesek, “Exploring annotations for musical pattern discovery gathered with digital annotation tools,” J. Math. Music., vol. 15, no. 2, pp. 194–207, 2021.10.1080/17459737.2021.1943026Search in Google Scholar

[3] X. Serra, “The computational study of musical culture through its digital traces,” Acta Musicologica, vol. 89, no. 1, pp. 24–44, 2017.Search in Google Scholar

[4] S. K. Prabhakar and S. W. Lee, “Holistic approaches to music genre classification using efficient transfer and deep learning techniques,” Expert. Syst. Appl., vol. 211, p. 118636, 2023.10.1016/j.eswa.2022.118636Search in Google Scholar

[5] W. Hongdan, S. SalmiJamali, C. Zhengping, S. Qiaojuan, and R. Le, “An intelligent music genre analysis using feature extraction and classification using deep learning techniques,” Comput. Electr. Eng., vol. 100, p. 107978, 2022.10.1016/j.compeleceng.2022.107978Search in Google Scholar

[6] J. H. Foleis and T. F. Tavares, “Texture selection for automatic music genre classification,” Appl. Soft Comput., vol. 89, p. 10612, 2022.10.1016/j.asoc.2020.106127Search in Google Scholar

[7] A. E. C. Salazar, “Hierarchical mining with complex networks for music genre classification,” Digital Signal. Process., vol. 127, p. 103559, 2022.10.1016/j.dsp.2022.103559Search in Google Scholar

[8] Y. Yu, S. Luo, S. Liu, H. Qiao, Y. Liu, and L. Feng, “Deep attention based music genre classification,” Neurocomputing, vol. 372, pp. 84–91, 2020.10.1016/j.neucom.2019.09.054Search in Google Scholar

[9] S. O. Folorunso, S. A. Afolabi, and A. B. Owodeyi, “Dissecting the genre of Nigerian music with machine learning models,” J. King Saud. Univ. – Comput. Inf. Sci., vol. 34, no. 8, Part B, pp. 6266–6279, 2022.10.1016/j.jksuci.2021.07.009Search in Google Scholar

[10] S. Chapaneri, R. Lopes, and D. Jayaswal, “Evaluation of music features for PUK kernel based genre classification,” Procedia Comput. Sci., vol. 45, pp. 186–196, 2021.10.1016/j.procs.2015.03.119Search in Google Scholar

[11] B. Kumaraswamy and P. G. Poonacha, “Deep convolutional neural network for musical genre classification via new self adaptive sea lion optimization,” Appl. Soft Comput., vol. 108, p. 107446, 2021.10.1016/j.asoc.2021.107446Search in Google Scholar

[12] N. Farajzadeh, N. Sadeghzadeh, and M. Hashemzadeh, “PMG-Net: Persian music genre classification using deep neural networks,” Entertain. Comput., vol. 44, p. 100518, 2023.10.1016/j.entcom.2022.100518Search in Google Scholar

[13] Y. Singh and A. Biswas, “Robustness of musical features on deep learning models for music genre classification,” Expert. Syst. Appl., vol. 199, p. 116879, 2022.10.1016/j.eswa.2022.116879Search in Google Scholar

[14] I. B. Gorbunova and N. N. Petrova, “Digital sets of instruments in the system of contemporary artistic education in music: socio-cultural aspect,” J. Crit. Rev., vol. 7, no. 19, pp. 982–989, 2022.Search in Google Scholar

[15] E. Partesotti, A. Peñalba, and J. Manzolli, “Digital instruments and their uses in music therapy,” Nordic J. Music. Ther., vol. 27, no. 5, pp. 399–418, 2018.10.1080/08098131.2018.1490919Search in Google Scholar

[16] B. Babich, “Musical “Covers” and the culture industry: From antiquity to the age of digital reproducibility,” Res. Phenomenol., vol. 48, no. 3, pp. 385–407, 2018.10.1163/15691640-12341403Search in Google Scholar

[17] L. L. Gonçalves and F. L. Schiavoni, “Creating digital musical instruments with lib mosaic-sound and mosaicode,” Rev. de. Inform. Teórica e Apl., vol. 27, no. 4, pp. 95–107, 2020.10.22456/2175-2745.104342Search in Google Scholar

[18] I. B. Gorbunova, “Music computer technologies in the perspective of digital humanities, arts, and research,” Opcion, vol. 35, no. SpecialEdition24, pp. 360–375, 2018.Search in Google Scholar

[19] A. Dickens, C. Greenhalgh, and B. Koleva, “Facilitating accessibility in performance: participatory design for digital musical instruments,” J. Audio Eng. Soc., vol. 66, no. 4, pp. 211–219, 2018.10.17743/jaes.2018.0010Search in Google Scholar

[20] O. Y. Vereshchahina-Biliavska, O. V. Cherkashyna, Y. O. Moskvichova, O. M. Yakymchuk, and O. V. Lys, “Anthropological view on the history of musical art,” Linguist. Cult. Rev., vol. 5, no. S2, pp. 108–120, 2021.10.21744/lingcure.v5nS2.1334Search in Google Scholar

[21] A. C. Tabuena, “Chord-interval, direct-familiarization, musical instrument digital interface, circle of fifths, and functions as basic piano accompaniment transposition techniques,” Int. J. Res. Publ., vol. 66, no. 1, pp. 1–11, 2021.10.47119/IJRP1006611220201595Search in Google Scholar

[22] L. Turchet and M. Barthet, “A ubiquitous smart guitar system for collaborative musical practice,” J. N. Music. Res., vol. 48, no. 4, pp. 352–365, 2019.10.1080/09298215.2019.1637439Search in Google Scholar

[23] R. Khulusi, J. Kusnick, C. Meinecke, C. Gillmann, J. Focht, and S. Jänicke, “A survey on visualizations for musical data,” Comput. Graph. Forum, vol. 39, no. 6, pp. 82–110, 2020.10.1111/cgf.13905Search in Google Scholar

[24] E. Cano, D. FitzGerald, A. Liutkus, M. D. Plumbley, and F. R. Stöter, “Musical source separation: An introduction,” IEEE Signal. Process. Mag., vol. 36, no. 1, pp. 31–40, 2020.10.1109/MSP.2018.2874719Search in Google Scholar

[25] T. Magnusson, “The migration of musical instruments: On the socio-technological conditions of musical evolution,” J. N. Music. Res., vol. 50, no. 2, pp. 175–183, 2020.10.1080/09298215.2021.1907420Search in Google Scholar

[26] I. B. Gorbunova and N. N. Petrova, “Music computer technologies, supply chain strategy, and transformation processes in the socio-cultural paradigm of performing art: Using digital button accordion,” Int. J. Supply Chain Manag., vol. 8, no. 6, pp. 436–445, 2020.Search in Google Scholar

[27] J. A. A. Amarillas, “Marketing musical: música, industria y promoción en la era digital,” INTERdisciplina, vol. 9, no. 25, pp. 333–335, 2021.Search in Google Scholar

[28] G. Scavone and J. O. Smith, “A landmark article on nonlinear time-domain modeling in musical acoustics,” J. Acoust. Soc. Am., vol. 150, no. 2, pp. R3–R4, 2021.10.1121/10.0005725Search in Google Scholar PubMed

[29] L. Turchet, T. West, and M. M. Wanderley, “Touching the audience: musical haptic wearables for augmented and participatory live music performances,” Personal. Ubiquitous Comput., vol. 25, no. 4, pp. 749–769, 2021.10.1007/s00779-020-01395-2Search in Google Scholar

[30] C. J. Way, “Populism in musical mash-ups: recontextualizing Brexit,” Soc. Semiotics, vol. 31, no. 3, pp. 489–506, 2021.10.1080/10350330.2021.1930857Search in Google Scholar

Received: 2023-05-08

Revised: 2023-10-31

Accepted: 2023-11-14

Published Online: 2024-07-09

This work is licensed under the Creative Commons Attribution 4.0 International License.

Articles in the same Issue

https://doi.org/10.1515/comp-2023-0106

Keywords for this article

deep learning; music genre; classification; model

Creative Commons

BY 4.0