A deep learning-based mathematical modeling strategy for classifying musical genres in musical industry

Xiaoquan He; Fang Dong

doi:10.1515/nleng-2022-0302

Published/Copyright: July 27, 2023

From the journal Nonlinear Engineering Volume 12 Issue 1

Abstract

Since the beginning of the digital music era, the number of available digital music resources has skyrocketed. Genre is an important label for describing music, and music tags play an essential role in locating and categorizing music in digital music services. Categorizing such a large music archive manually would be prohibitively expensive and time-consuming, which makes manual labeling impractical. The main contributions of this study are as follows. First, each piece of MIDI (musical instrument digital interface) music is divided into segments whose playing style is locally similar, features are extracted from each segment, and the segment features are arranged into a sequence. The procedure includes the following steps: extracting the note feature matrix, extracting the main melody and segmenting the piece based on the note feature matrix, selecting and extracting useful features from the main melody of each segment, and composing the feature sequence. Second, because of their shallow structure, traditional classifiers struggle to capture spatial and contextual information about music. This study therefore feeds the sequences of MIDI segment features to a classifier that combines a recurrent neural network with an attention mechanism. The approach is evaluated against the test accuracy of classification with equal-length segments, using a dataset of 1,920 MIDI tracks with genre tags collected from the Internet.

Keywords: deep learning; musical genres; classifying

1 Introduction

Physical album formats are being phased out in favor of digital music streaming [1]. Because everyone has a unique taste in music, it stands to reason that accurate and efficient music genre identification is a vital component of any audience management, collection, or recommendation system. Sturm argued that music can be described in terms of distinct musical aspects and then divided into different genres [2], given that different kinds of music vary in style, accompanying instruments, and vocal features.

Algorithms built on deep learning have outperformed more conventional machine learning methods for identifying musical genres. This is due to their ability to learn subtle features directly from raw audio samples. However, because of their greater computing cost and larger data requirements, deep learning models can be less practical than conventional machine learning strategies when compute or labeled data is limited.

The neural network approach used by Bogdanov et al. [3] offered fresh ideas for classifying musical genres through machine learning, but it has drawbacks, including a slow rate of convergence and a tendency to get trapped in local optima.

Previous research is extensive. Dannenberg et al. used a genetic algorithm to improve the back-propagation (BP) neural network, which considerably improved classification accuracy [4], but the proposed improvement to the particle swarm optimization approach fell short in the area of music feature extraction. In this study, we extract information from the most basic elements of the music, since doing so improves classification accuracy.

According to Chillara et al. [5], the importance of music tags is heightened by the increasingly competitive digital music industry, the huge digital music libraries kept in the cloud, and the rising number of digital music listeners. A music label is a short phrase that accurately describes the music it identifies and helps classify various types of digital music files. Music labels group songs that share qualities such as musical style, origin, or period under a more general heading [6]. Common music label categories include “genre,” “writer,” “emotion,” and “applicable scene.” Taking genre as an example, music resources are divided into genres such as popular, traditional, rock, jazz, dance, country, blues, and folk music. Music from the same genre tends to share a similar creative style.

Human listeners’ subjective judgments of music can result in significant inconsistencies and mistakes when attempting to classify it. This problem can be solved by using deep learning-based algorithms for genre categorization, which allow for the direct construction of objective mathematical models of musical components. These algorithms may find commonalities and trends in the features of numerous songs in order to make reliable genre predictions. As a result, machines trained with deep learning can classify music in a more precise and impartial manner than humans can.

There are many essential uses for automatic music classification in the modern Internet and digital music eras.

1.1 Control of storage space

Music labels partition the vast digital music libraries on the World Wide Web, which simplifies administration and management and supports shared music repositories and fast, automated retrieval of resources.

1.2 Internet music search engines

Listeners benefit greatly when musical works are categorized accurately. The results returned by music-specific search engines are more relevant and up-to-date when they are tailored to the specific genre of music being sought.

1.3 Recommendation systems

Providers of music streaming media can drive the expansion of the digital music market by catering to their users’ tastes by mining data about their interests and needs and then proactively referring them to the genres and artists they are most likely to enjoy.

1.4 Music creation

Artificial intelligence technology allows music to be composed automatically from user input such as musical genre, emotional state, and key, with further composition and arrangement recommendations made available to composers as needed.

Music search relies heavily on the ability to distinguish between different types of songs. The purpose of a genre recognition and classification scheme is to help people enjoy music by organizing it according to their preferences, whether for simple access and management or so that they can explore an unfamiliar style in depth. Most musical works combine human voices with an assortment of instruments. Moreover, genres differ in their compositional qualities, and a single performer may produce a wide variety of vocal tones when singing songs from different genres. These obstacles make it hard to extract features from music data by hand and therefore hard to raise the accuracy of genre recognition and classification. As a result, identifying and labeling musical genres has become a popular and quickly evolving research field in recent years. Current work in this area [7] focuses on improving the accuracy and efficiency of music genre recognition and classification. Because songs usually include accompanying instruments, accurate recognition and classification of instruments can be a powerful aid to music information retrieval and is therefore a crucial component of it. Identifying and categorizing musical instruments is relatively simple for people with a high level of musical training, but the average listener lacks this skill, so it is essential to train a computer to recognize and categorize instruments automatically. From an auditory perspective, the sound of an instrument is the primary criterion for identifying it and its individual qualities. The vibration of the moving part of an instrument is crucial to the creation of its unique timbre.

The timbre of a musical instrument is determined by the relative strengths of the harmonics in its overtones, which in turn are determined by the vibrational state of the instrument. When the same instrument is played with different techniques, however, its timbre may differ noticeably and could be mistaken for that of a different instrument. This also makes it hard to extract features from musical signals, which reduces the reliability of instrument recognition and classification. In the music business, deep learning-based approaches to feature extraction and modeling can potentially be applied to a wide variety of tasks, including singer identification and song recognition. A model can be trained to identify the distinctive vocal and artistic characteristics of a singer, leading to more precise artist identification. Models can also be trained to identify musical elements in audio signals (such as melody or beat) and then used to search a database of known songs for a match. Applying deep learning to music discovery, copyright protection, and recommendation systems could therefore be useful for the music business. Music information retrieval will increasingly concentrate on improving the accuracy of musical instrument recognition and classification, even though this area has so far received very little attention. Technology also has a substantial impact on the cultural significance of music: the spiritual importance of instruments, the cultural ramifications of the dissemination of recordings, and the creative possibilities afforded by a wide variety of media are just a few examples. Classification results may be used to support investigations into the social and psychological underpinnings of our perception of melodic similarity and our grouping of musical works, as well as their relationship to the “objective” reality given by machine classifications.

Studies of music genre and musical instrument recognition and classification focus on two things: extracting important features and constructing the classifier. There is no standard for which feature quantities to use for recognition and classification. Some studies combine several existing single features, while others start from the way music is produced to design new and useful features.

While the aforementioned techniques can help improve recognition accuracy, they are not unified, and for broader classification tasks it is much harder to rely on handcrafted feature extraction because it is unclear which features are needed. Compared with other machine learning models, deep learning has the advantage that it is loosely modeled on the structure of the human brain, can process huge quantities of data, and can capture the internal statistical structure of the data (i.e., extract more meaningful features) to boost recognition and classification accuracy.

Deep learning is now popular in image processing but remains underused in music, particularly in music information retrieval. This study therefore investigates and refines a deep belief network-based method for music genre recognition and classification. Compared with the traditional approach, which simply extracts acoustic or musical features from the music and trains a classifier on them, the algorithm significantly improves the accuracy of genre recognition and classification. Similarly, no study is available that uses deep learning for musical instrument recognition and classification, and this is especially true for traditional Chinese musical instruments. For this reason, this article also proposes a deep belief network-based method for recognizing and classifying traditional Chinese musical instruments. The RoboOptiNet is a useful neural network for traditional Chinese musical instrument recognition because it improves on conventional feature extraction, recognition, and classification processes while yielding more accurate results than classical algorithms.

2 Literature review

Musical instrument digital interface (MIDI) music classification has received less attention than classification of other audio file types, since most researchers prefer waveform-storing audio formats such as MPEG Audio Layer-3 (MP3) and Waveform Audio File Format (WAV). Instead of storing waveforms, MIDI files use a structured data format to record information about events, such as the time of the event, the event type, and the event content.

2.1 Five characterizations of music

At present, the character of a genre is usually conveyed through musical features, such as the lyricism of jazz, the highly complex rhythm of disco, the powerful melodies of metal, and the lively rhythm and active melodic arrangement of dance music, in contrast to everyday sounds, which tend to be unstructured.

The timbre, rhythm, harmony, melody, and tempo of the music are only a few of the many properties and features that deep learning algorithms use to categorize music. Spectrogram analysis, waveform processing, and signal processing are only a few of the methods used to recreate these features from the raw audio data. By delving into this data, we may better train deep learning algorithms to spot patterns indicative of various genres. Some algorithms could go a step further and use lyrics or other metadata to further refine the categorization process.

2.1.1 Tempo

The tempo and intensity of a song’s rhythm are two key characteristics that define its genre. Pop and rock, for instance, have stronger, faster, and more regular rhythms than jazz and blues. The number of beats per minute (BPM) is one way to measure the tempo of a song. The BPM of a recorded audio signal is estimated with a correlation computation: the rhythmic sequence with the highest correlation value determines the estimate. The beat pattern of a music signal can be viewed as a signal made up of a certain number of pulses. If the peak value is high, the music likely has a strong beat.

2.1.2 Melody

From the standpoint of the sound wave, pitch is determined by the frequency of the audio signal, that is, the vibration rate of the sound source in hertz. Melody is therefore expressed with frequency-domain features: the time-domain signal is converted to the frequency domain with the Fourier transform, which also helps cope with noise, and the melody is characterized by the mean in the frequency domain, with a larger mean indicating a higher pitch.

2.1.3 Harmony

Harmony arises when several instruments sound together, or when the human voice blends with instrumental chords. Harmony is important because it shapes the color of the music and the balance between heavy and light passages. Accordingly, this work uses the time-domain variance of the music signal as a stand-in for harmony. When the time-domain variance is high, the energy of the sound varies dramatically, suggesting that several instruments or voices overlap. When the time-domain variance is low, the energy of the sound changes very little over time, and it is assumed that a single instrument or performer is responsible for the sound. Once the time-domain mean has been computed, the time-domain variance can be used to quantify the presence or absence of harmony.
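As a rough illustration of the time-domain variance described above, the following sketch computes the variance of each non-overlapping frame of a signal. The function name, the frame length, and the toy sample values are assumptions made for this example and do not come from the original study.

```python
from statistics import pvariance


def frame_variances(samples, frame_len):
    """Return the population variance of each non-overlapping frame."""
    last_start = len(samples) - frame_len
    return [
        pvariance(samples[start:start + frame_len])
        for start in range(0, last_start + 1, frame_len)
    ]


# Hypothetical toy signal: a steady passage followed by a busier one.
steady = [0.1, -0.1] * 4
busy = [0.9, -0.8, 0.7, -0.9] * 2
variances = frame_variances(steady + busy, frame_len=8)
```

Under the reading given in the text, the second frame, with its larger variance, would be the more likely candidate for overlapping instruments or voices.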

2.1.4 Strength of sound

The decibel is a unit for measuring how loud a sound is. In this study, we use the short-time energy of the music signal to describe the volume of the music. The short-time energy computed for each frame of the music signal reflects the sound intensity. If the short-time energy is large, a large amount of energy is packed into that frame and the resulting sound level is high; if the short-time energy is small, the sound level is low, as argued by Panagakis et al. [8].
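The short-time energy feature can be sketched as the sum of squared samples in each frame. This is a minimal example under assumed settings: the function name, the frame length, the hop size, and the sample values are illustrative and not taken from the original study.

```python
def short_time_energy(samples, frame_len, hop):
    """Return the sum of squared samples for each frame of the signal."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


# Hypothetical toy signal: a quiet frame followed by a loud frame.
quiet = [0.1] * 4
loud = [0.5] * 4
energies = short_time_energy(quiet + loud, frame_len=4, hop=4)
```

A frame with larger energy corresponds to a louder passage, which matches the interpretation given in the text.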

2.1.5 Timbre

Timbre features allow musical styles to be distinguished, since the vocal and instrumental timbres used in each style are distinctive. The mel frequency cepstral coefficients (MFCC) [9] are a feature often used in speech recognition.

2.2 Status of research

Music information retrieval relies heavily on accurate genre identification and categorization, which has been the subject of intense research since the 1990s. In early work, the fast Fourier transform (FFT) was applied to the audio data, and a logarithmic transform was then used to extract features, which were fed into a neural network with two hidden layers to determine whether a piece of music was classical or popular.
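The Fourier-transform-then-logarithm feature extraction mentioned above can be illustrated with a short sketch. It is not the original system: a naive discrete Fourier transform stands in for a fast implementation, and the function name, the small `eps` offset, and the test tone are assumptions made for this example.

```python
import cmath
import math


def log_magnitude_spectrum(frame, eps=1e-10):
    """Return log magnitudes of the non-negative frequency bins of a frame."""
    n = len(frame)
    features = []
    for k in range(n // 2 + 1):
        # Naive DFT coefficient for bin k (an FFT would give the same value).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        # eps (an assumed constant) avoids taking the logarithm of zero.
        features.append(math.log(abs(coeff) + eps))
    return features


# Hypothetical test tone: exactly two cycles across a 16-sample frame.
frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
features = log_magnitude_spectrum(frame)
strongest_bin = max(range(len(features)), key=features.__getitem__)
```

A feature vector of this kind is what would then be passed to the classifier.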

Recently, building on advances in deep learning for image recognition, significant progress has been achieved in speech recognition. The potential of deep learning for audio information retrieval has lately attracted the attention of researchers. In 2009, Xu et al. [10] were the first to apply a deep belief network to classify musical genres, and additional research has confirmed the network’s superior performance compared to traditional machine learning methods. Karatana [11] applied restricted Boltzmann machines to music genres, constructing a five-layer restricted Boltzmann machine, but this approach has a clear weakness: its accuracy drops off sharply as the number of genres to be classified grows.

Although research in China into identifying and classifying musical genres began later than in other nations, significant progress has been made in this area. In a 2009 study of music genre recognition and classification, Zhen Chao et al. thoroughly screened fundamental features such as root mean square energy, zero-crossing rate, strongest lb, majority strong mix size, spectrum emphasizes value, spectral flux, spectral variance, cepstral features, and linear prediction coefficients. A novel multimodal approach that makes use of music labels was presented to identify and categorize eight musical genres accurately. With an accuracy rate of over 87%, these styles include blues, country, conventional crackling, jazz, technology improvements reggae, and rock [12]. Sturm [13] used a two-dimensional representation of pitch and rhythm to create feature vectors describing the melody of songs, and the identification rate increased to 81% [14,15].

In addition to the classical approaches above, several other successful methods for identifying and classifying musical genres have been proposed in the USA and elsewhere. Many of them aim to maximize recognition and classification performance by manually extracting as many low-level or high-level musical features as possible and applying a wide range of classifiers. Classifier design focuses on increasing recognition and classification accuracy, whereas manual feature extraction focuses on capturing the fundamental properties of music signals. Differences in musical structure and cultural context may make it difficult for deep learning-based genre classification to work on non-Western styles and genres. However, deep learning-based models can be trained to identify non-Western genres successfully, provided that the feature extraction and modeling methods are adapted to the particular qualities of the music. Large volumes of labeled data for these categories may be difficult to come by, although transfer learning and other methods may help make up the difference. Two key drawbacks of manual feature extraction are the difficulty of introducing new features and the need for separate procedures for different acoustic properties. This article presents an investigation of a deep neural network-based system for recognizing and labeling musical genres [16]. To better extract abstract features that express the core characteristics of each genre, the neural network learns on its own from the input feature data, including multilinear principal components, the fast Fourier transform, and MFCC. The amount of prior knowledge required for feature extraction is much reduced [17], while the capability and accuracy of music genre recognition and classification both increase.

Research into identifying and classifying musical instruments seems to have followed the same route as research into music genres, with scholars emphasizing feature extraction techniques and classifier development.

Deep learning is a prominent new feature extraction approach, but it has not been widely used for musical instrument recognition and classification, particularly for traditional Chinese musical instruments. To that end, this study proposes a method for identifying and categorizing traditional Chinese musical instruments using a deep belief network [18]. Given the central role that identifying and classifying musical instruments plays in music education, it stands to reason that the label assigned to an instrument may be used to infer the associated emotions and musical context.

In music information retrieval, automatic music classification can be enhanced by considering the particulars of the instrument used to perform a given piece. Since this work may also aid the study of recognition and classification in other areas of audio information retrieval, it is attracting a rising amount of attention from researchers.

3 Methodology

Deep learning-based algorithms for music genre classification have a number of drawbacks, including a high risk of bias and overfitting, a need for large volumes of labeled data, and difficulty in interpreting the learned representations. Transfer learning can reduce the quantity of labeled data required, visualization and analysis of the learned representations can give insight into the model’s behavior, and fairness and transparency can be built into the process from the start. Performance can also be improved, and the impact of these limitations reduced, by combining several models or by using hybrid approaches that pair deep learning with other techniques.

3.1 Classification of MIDI tracks

In this study, we follow Chen Delong’s line of thought on segmenting MIDI music, drawing on Foote’s approach to music segmentation based on local self-similarity. The times at which the music changes abruptly are detected, and the MIDI piece is cut at those times into smaller segments, each with a consistent style and mood.

The basic flowchart of the MIDI segmentation algorithm is shown in Figure 1. After parsing the MIDI file, building the piano roll matrix that represents the music, and computing the similarity between frames with the Euclidean distance, we obtain the novelty curve, a time-varying curve that describes how the playing changes. Finally, the novelty curve is split into segments at the locations of its peaks. Moments of high novelty are characterized by high self-similarity within the past and within the future, and by low cross-similarity between the past and the future.

If certain categories have far fewer samples than others, it can be difficult to identify music by genre. Data augmentation, class weighting, and transfer learning are techniques that can be used to approach this problem. When underrepresented classes are small, data augmentation can enlarge them by adding to or altering existing data. Class weighting gives minority classes more weight in the loss function during model training. With transfer learning, a pre-trained model can be fine-tuned on a smaller dataset to perform better on the minority classes.

At moments of high novelty, the expressiveness and style of the performance change markedly, so the MIDI piece can be segmented there. MIDI music can be broken up into smaller pieces using the technique discussed in this section (Figure 2).

Figure 1

Example spectrogram from the GTZAN dataset, including a rock tune.

Figure 2

Simple diagram showing how MIDI tracks are broken up.

3.1.1 Piano roll

A piano roll is a convenient way to represent MIDI music on a computer. Each note played is displayed as a separate horizontal bar, with time along the horizontal axis and pitch along the vertical axis. Each column of the piano roll matrix therefore records the notes sounding in one frame together with their volumes. In this work, the piano roll matrix is produced as follows:

Step 1. Read the note feature matrix of the MIDI file and sort its note vectors in ascending order of start time.
Step 2. Sample the music at regular intervals of dt, so that each frame of MIDI music corresponds to one sampling moment and there are M frames in total. Initialize the 128 by M piano roll matrix G with all elements equal to zero, set the index of the current note vector to i = 1, and let N be the number of note vectors in the note feature matrix.
Step 3. Compute the first and last frame indices, n_start and n_end, of the i-th note vector by dividing its start time t_start and end time t_end by dt:

(1) $n_{\text{start}} = t_{\text{start}} / dt,$

(2) $n_{\text{end}} = t_{\text{end}} / dt.$

Step 4. For every n in n_start, n_start + 1, …, n_end, write the volume of the i-th note vector into the row of G given by its pitch:

G[pitch][n] = volume

Step 5. Let i → i + 1. If i does not exceed N, the construction of the piano roll matrix continues with step 3; otherwise, the process ends.
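The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors’ implementation: the `(pitch, volume, t_start, t_end)` note layout and the function name are assumptions, and rounding the frame indices down and clamping them to the last frame are choices made here because the text does not specify them.

```python
def build_piano_roll(notes, dt, num_frames):
    """Build a 128 x num_frames piano roll from note tuples.

    Each note is an assumed (pitch, volume, t_start, t_end) tuple, with
    times in the same unit as the sampling interval dt.
    """
    roll = [[0] * num_frames for _ in range(128)]
    # Step 1: process the notes in ascending order of start time.
    for pitch, volume, t_start, t_end in sorted(notes, key=lambda n: n[2]):
        # Step 3: frame indices, rounded down (an assumption of this sketch).
        n_start = int(t_start // dt)
        n_end = min(int(t_end // dt), num_frames - 1)
        # Step 4: write the volume into every frame the note covers.
        for n in range(n_start, n_end + 1):
            roll[pitch][n] = volume
    return roll


# Two hypothetical notes sampled every 0.25 time units over four frames.
notes = [(64, 70, 0.5, 0.9), (60, 90, 0.0, 0.4)]
roll = build_piano_roll(notes, dt=0.25, num_frames=4)
```

Iterating over the sorted list replaces the explicit counter i of steps 2 and 5, and each column of `roll` is one frame of the kind compared in the next section.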

3.1.2 The self-similarity matrix

The MIDI piano roll obtained in the previous section is a two-dimensional 128 by M matrix. It has M columns, and each column is the encoding of one frame obtained by periodic sampling. Each column can be viewed as a vector in a 128-dimensional space (Figure 3).

Figure 3

Architecture of a network of convolutional neural networks as an illustration.

The self-similarity matrix S is an M by M matrix formed by computing the Euclidean distance between every pair of frames. Because it is built from Euclidean distances, S is symmetric and its diagonal entries are zero. We therefore need to compute only the upper triangular part of S, since the remaining entries follow by symmetry. The distance between frames i and j is:

(3) $D(i, j) = \sqrt{\sum_{k=1}^{128} (x_{ik} - x_{jk})^{2}}.$
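A minimal sketch of this computation, assuming each frame is given as a list of numbers (for example, one column of the piano roll), fills only the upper triangle and mirrors it. The function names and the tiny two-dimensional frames are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_distance_matrix(frames):
    """Return the symmetric M x M matrix of pairwise frame distances."""
    m = len(frames)
    matrix = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            # Compute the upper triangle and mirror it into the lower one.
            matrix[i][j] = matrix[j][i] = euclidean(frames[i], frames[j])
    return matrix


# Hypothetical frames with two dimensions instead of 128, for readability.
frames = [[0, 0], [3, 4], [3, 4]]
S = self_distance_matrix(frames)
```

Identical frames give a distance of zero, so a value near zero in this matrix indicates high similarity.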

3.1.3 Novelty function and peak detection

The novelty function is a time series curve that quantifies the degree to which the music being played changes from one moment to the next.

A high novelty value at a given time shows that the local playing style differs sharply before and after that time. In this section, we explain the basics of calculating the novelty function and identifying the breakpoints between segments. To determine the novelty at each sampling time, we first need the self-similarity matrix. We begin with the simplest possible convolution kernel, a 2 × 2 matrix made up of two components:

(4) $C = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} - \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$

The first term on the right-hand side of formula (4) measures the similarity between the regions before and after the time point, while the second term measures the similarity within each of the two regions. The novelty of the music is evaluated at the center of the kernel. When the regions before and after a given time are dissimilar to each other but each region is highly self-similar, moving the center of the kernel to that time yields a high novelty value. The next step is to compute the novelty at each sampling time:

(5) $N(i) = \sum_{m=-k}^{k} \sum_{n=-k}^{k} C(m, n)\, S(i + m, i + n).$

Formula (5) shows that the novelty function N(i) cannot be computed where the convolution window extends beyond the border of the self-similarity matrix. Padding the self-similarity matrix directly is possible but inappropriate, since each of its entries represents the similarity of the music between two sampling times. Therefore, this study pads the piano roll matrix instead, adding k all-zero columns at its beginning and at its end. During these padded periods at the start and the end, the music is simply silent. After this padding, the self-similarity matrix is computed and the novelty function is then formed.
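The padding and the novelty computation can be sketched as follows. This is a sketch under stated assumptions, not the authors’ code: the kernel C(m, n) is taken to be +1 when m and n have opposite signs (one frame before and one after the time point), -1 when they have the same sign, and 0 when either offset is zero. The function names and the toy frames are also assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kernel(m, n):
    """Assumed checkerboard kernel for a distance-based matrix."""
    if m == 0 or n == 0:
        return 0
    return 1 if (m < 0) != (n < 0) else -1


def novelty_curve(frames, k):
    """Novelty at every original frame after padding k silent frames."""
    dim = len(frames[0])
    silence = [[0] * dim for _ in range(k)]
    padded = silence + list(frames) + silence
    size = len(padded)
    dist = [[euclidean(padded[i], padded[j]) for j in range(size)]
            for i in range(size)]
    curve = []
    for i in range(k, size - k):
        total = 0.0
        for m in range(-k, k + 1):
            for n in range(-k, k + 1):
                total += kernel(m, n) * dist[i + m][i + n]
        curve.append(total)
    return curve


# Hypothetical piece: four frames of one pattern, then four of another.
frames = [[1, 0]] * 4 + [[0, 1]] * 4
curve = novelty_curve(frames, k=2)
peak = max(range(len(curve)), key=curve.__getitem__)
```

In this toy input the largest novelty falls at the boundary between the two patterns, which is where a segment break would be placed. The ends of the curve also show raised novelty, because the padded silence differs from the music that follows or precedes it.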

3.1.4 Construction of segment feature sequences

The previous sections describe how a MIDI file is divided into segments and how the main melody is isolated, with each segment containing its own set of note vectors from the main melody. However, a computer cannot directly perceive the genre information conveyed by a piece of music; this information can be expressed only by extracting suitable feature parameters. In this research, we extract musical features from the MIDI segments and generate a sequence of segment features that is fed into a deep neural network trained to classify music. Breaking a MIDI piece into segments relates the way listeners take in music to the style of musical expression. This study selects a set of features that vary substantially between genres and are strongly expressive. The feature vector of a segment is built from a combination of these features because together they characterize the segment well. The feature sequence of the piece is then constructed by arranging the segment feature vectors in order. The following subsections describe how each feature is extracted.

3.1.4.1 Average pitch

The vibration frequency of the sound source determines the pitch. Melody relies on pitch, and the average pitch of a segment represents the segment’s overall pitch level.

(6) $p_{\mathrm{avg}} = \frac{\sum_{i} p_i}{n}.$

3.1.4.2 Pitch stability

Music changes continuously, as do the themes and emotions it conveys. The stability of pitch reflects, to some degree, the way in which a musical composition expresses itself. The pitch stability of a segment is computed from the average pitch and the individual note pitches of that segment.

(7) $p_{\mathrm{std}} = \sqrt{\frac{1}{n}\sum_{i}\left(p_i - p_{\mathrm{avg}}\right)^2}.$

3.1.4.3 Pitch range

A segment’s register specifies its pitch range, that is, the interval between its highest and lowest notes.

(8) $P_r = \max(p) - \min(p).$
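The three pitch features of equations (6) to (8) can be sketched in a few lines of Python. This is a minimal example under stated assumptions: pitches are given as MIDI note numbers, the "std" subscript is read as the population standard deviation (the square root of the mean squared deviation), and the function name pitch_features is chosen only for illustration.

```python
from statistics import fmean, pstdev


def pitch_features(pitches):
    """Return (p_avg, p_std, P_r) for the note pitches of one segment."""
    p_avg = fmean(pitches)                   # equation (6)
    p_std = pstdev(pitches, mu=p_avg)        # equation (7), population form (assumption)
    p_range = max(pitches) - min(pitches)    # equation (8)
    return p_avg, p_std, p_range


# Hypothetical segment: C4, D4, E4, F4, G4 as MIDI note numbers.
p_avg, p_std, p_range = pitch_features([60, 62, 64, 65, 67])
print(p_avg, p_range)
```

For this segment the mean squared deviation is 5.84, so p_std is its square root, roughly 2.42 semitones.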

3.1.4.4 Performance tempo

The tempo at which a piece of music is performed is an important factor in determining its genre. This work characterizes the performance tempo of a segment by the mean beats per minute (BPM) of the main melody, while also taking into account the two forms of the division field in the MIDI header chunk. The division field establishes the time precision of a MIDI file and defines the global timing standard for the music. The average tempo of a segment is obtained as a weighted mean over its main-melody notes, as in equation (9).

(9) $\overline{\mathrm{tempo}} = \frac{\sum_{i} \mathrm{BPM}_i \times D_i}{\sum_{i} D_i}.$
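A minimal Python sketch of the weighted mean in equation (9) follows. The article does not define D_i explicitly; the example assumes it is the duration of note i. The helper bpm_from_midi_tempo relies on the standard MIDI convention that a set-tempo event stores microseconds per quarter note; both function names are hypothetical.

```python
def bpm_from_midi_tempo(microseconds_per_quarter):
    """Convert a MIDI set-tempo value (microseconds per quarter note) to BPM."""
    return 60_000_000 / microseconds_per_quarter


def weighted_tempo(bpms, weights):
    """Equation (9): mean of BPM_i weighted by D_i (assumed to be note durations)."""
    if len(bpms) != len(weights):
        raise ValueError("bpms and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("the weights must sum to a positive value")
    return sum(b * d for b, d in zip(bpms, weights)) / total


# Hypothetical segment: three beats played at 120 BPM, one beat at 90 BPM.
print(weighted_tempo([120, 90], [3, 1]))   # (120 * 3 + 90 * 1) / 4
print(bpm_from_midi_tempo(500_000))        # the MIDI default tempo value
```

Weighting by duration keeps a brief tempo change from dominating the segment average, which is one plausible reason for preferring equation (9) over a plain mean.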

3.1.4.5 Intensity

This article describes the intensity of a segment by the average intensity and the intensity stability of the main-melody notes played in it. Note intensity is a basic component of a song and is closely linked to the thoughts and feelings conveyed through music.

(10) $I_{\mathrm{avg}} = \frac{\sum_{i} I_i}{n},$

(11) $I_{\mathrm{std}} = \sqrt{\frac{1}{n}\sum_{i}\left(I_i - I_{\mathrm{avg}}\right)^2}.$
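To show how the intensity features of equations (10) and (11) combine with the other features into a segment feature sequence, the following Python sketch assembles one six-element vector per segment. Several details are assumptions made for the example rather than statements of the article: intensity is taken to be MIDI velocity, D_i is taken to be note duration, the standard deviations use the population form, and the feature order, the Note type, and the function names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import NamedTuple


class Note(NamedTuple):
    pitch: int       # MIDI note number p_i
    velocity: int    # MIDI velocity, taken here as the intensity I_i (assumption)
    bpm: float       # tempo in effect at the note, BPM_i
    duration: float  # weight D_i, taken here as the note duration (assumption)


def segment_features(notes):
    """Return [p_avg, p_std, P_r, mean tempo, I_avg, I_std] for one segment."""
    pitches = [n.pitch for n in notes]
    intensities = [n.velocity for n in notes]
    total_duration = sum(n.duration for n in notes)
    tempo = sum(n.bpm * n.duration for n in notes) / total_duration
    return [
        fmean(pitches),
        pstdev(pitches),
        max(pitches) - min(pitches),
        tempo,
        fmean(intensities),   # equation (10)
        pstdev(intensities),  # equation (11), population form (assumption)
    ]


def feature_sequence(segments):
    """Keep the segments in playing order, one feature vector per segment."""
    return [segment_features(segment) for segment in segments]


# Two hypothetical segments of two notes each.
segments = [
    [Note(60, 80, 120.0, 1.0), Note(64, 100, 120.0, 1.0)],
    [Note(72, 60, 90.0, 2.0), Note(72, 60, 90.0, 2.0)],
]
sequence = feature_sequence(segments)
print(len(sequence), len(sequence[0]))
```

The resulting list of vectors has one row per segment, which is the shape a recurrent classifier would consume as a time series.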

4 Results and discussion

Complex models and large amounts of training data may drive up the computing cost of a deep learning-based approach to music genre classification. Using cutting-edge hardware like graphics processing units or tensor processing units may speed up training and cut down on training time. However, the price of these tools might be out of reach for certain groups or people. If frequent model updates and retraining are needed, the total cost of computing may rise significantly.

4.1 A study on music genre analysis

A music genre or subgenre may be defined by its musical techniques, its cultural context, and the content or spirit of its themes. A genre’s place of origin is not always used to characterize it, and a single regional category often includes a wide range of subgenres. This section describes the experimental setup, how the experiments are conducted, and how the results are analyzed. New musical subgenres may be classified by updating existing deep learning-based models with new data. Before the model can capture the specifics of a new genre, it must be trained on a sample of data that is representative of that genre. Further fine-tuning of the model may be required to adequately represent genre-specific features. If enough labeled training data are available, this approach has the potential to improve the accuracy with which deep learning models classify new types of music (Figure 4).

Figure 4

Some illustrative spectrograms gathered from many types of music.

4.1.1 Evidence from experiments

This section studies music genre classification using MIDI files from five genres: classical, folk, dance, country (rural), and metal. Classification methods and models are commonly evaluated by their accuracy, precision, recall, and F1 value, where F1 = 2TP/(2TP + FP + FN) and TP, FP, and FN denote the numbers of true positives, false positives, and false negatives. A confusion matrix is a useful tool for representing the relationship between the predicted categories and the true categories in the validation set or test set and for visualizing the outcome of a classification problem. It is a square matrix in which each element M(i, j) gives the number of samples of true class i that were predicted as class j. The rows indicate the actual class, while the columns indicate the predicted class.
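The evaluation metrics can be derived directly from a confusion matrix of counts. The Python sketch below assumes rows are true classes and columns are predicted classes, and it uses a small made-up 3-class matrix purely for illustration; the function name per_class_metrics is hypothetical and the numbers are not taken from the experiments.

```python
def per_class_metrics(matrix):
    """Return overall accuracy and a (precision, recall, F1) tuple per class.

    matrix[i][j] counts samples of true class i predicted as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    metrics = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                          # row total minus diagonal
        fp = sum(matrix[r][i] for r in range(n)) - tp     # column total minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        metrics.append((precision, recall, f1))
    return accuracy, metrics


# Made-up toy confusion matrix with three classes.
toy = [[5, 1, 0],
       [2, 6, 1],
       [0, 1, 4]]
accuracy, metrics = per_class_metrics(toy)
print(accuracy)
```

Averaging the per-class tuples with equal weight gives macro-averaged precision, recall, and F1; the article does not state which averaging scheme it reports.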

4.1.2 Research conditions

To classify MIDI music into genres, this work uses the Python programming language and the Keras library with the TensorFlow backend.

In the experiment, 80% of the MIDI files in each genre are used as the training set, and the remaining 20% are used as the validation set. Table 1 lists the number of MIDI files per genre in the training set and in the validation set.

Table 1

MIDI music distribution

Music genre	Classical	Rural	Dance music	Folk	Metal
Training set	320	308	268	320	320
Validation set	80	78	67	80	79
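A per-genre 80/20 split of this kind can be sketched as follows. This is an illustrative example only: the function name split_by_genre, the file names, the fixed random seed, and the rounding rule are assumptions, and the article does not state how fractional counts were rounded.

```python
import random


def split_by_genre(files_by_genre, train_fraction=0.8, seed=0):
    """Shuffle each genre separately and split it into training and validation lists."""
    rng = random.Random(seed)
    train, validation = {}, {}
    for genre, files in files_by_genre.items():
        shuffled = list(files)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))  # rounding rule is an assumption
        train[genre] = shuffled[:cut]
        validation[genre] = shuffled[cut:]
    return train, validation


# Hypothetical file lists with the classical and dance totals implied by Table 1.
files = {
    "classical": [f"classical_{i}.mid" for i in range(400)],
    "dance": [f"dance_{i}.mid" for i in range(335)],
}
train, validation = split_by_genre(files)
print({g: (len(train[g]), len(validation[g])) for g in files})
```

Splitting within each genre, rather than over the pooled files, keeps the genre proportions the same in both sets.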

Concerns about bias and privacy are raised when deep learning is used to categorize musical genres. Privacy issues may occur during large-scale model training if significant or identifiable user data are used. Inaccurate or unjust categorization may result from biases within the model’s training information or the model itself, which is problematic for users and artists alike. These moral worries may be reduced if the models are created and used in a transparent, accountable, and fair manner.

4.1.3 Evaluation of the test results

Table 2 displays the outcomes of the experiments.

Table 2

Comparison of experimental results of music classification

Number	Acc	P	R	F1
1	0.7526	0.7229	0.7369	0.7298
2	0.8646	0.856	0.8556	0.8558
3	0.8828	0.8797	0.8804	0.8801
4	0.901	0.8995	0.8991	0.8993
5	0.8724	0.8643	0.8652	0.8648
6	0.888	0.8852	0.8855	0.8854

Experiment 1 uses the feature set retrieved from the research [6] for audio emotion categorization as input to a BP neural network for the classification test.

After the network model has been trained, its classification performance is assessed on the validation set. The confusion matrix of the resulting predictions is displayed in Table 3.

Table 3

Confusion matrix (%)

		Predicted
		Dance music	Metal	Rural	Classical	Folk
True	Dance music	85.07	5.97	8.96	0	0
	Metal	3.8	94.94	0	0	1.27
	Rural	15.38	0	82.05	0	2.56
	Classical	2.5	1.25	2.5	92.5	1.25
	Folk	1.25	0	0	3.75	95

Table 3 shows that metal, classical, and folk music are all classified with high success rates: 94.94, 92.50, and 95.00%, respectively. Dance music and country (rural) music are mislabeled more often: 15.38% of the country pieces are classified as dance music, and 8.96% of the dance pieces are classified as country. Some country music may serve as a background to traditional dances in a way comparable to theatrical music, which may explain why the genre is frequently misclassified as dance music. Dance music is also occasionally confused with metal (5.97%), maybe due to the fact that both place a heavy focus on rhythm.
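As a consistency check, the per-class rates on the diagonal of Table 3 can be combined with the validation-set sizes of Table 1 to recover the overall accuracy. The sketch below is a worked calculation under one assumption: the sizes 80 and 79 in the last two columns of Table 1 are assigned to folk and metal, respectively, an assignment inferred from the percentages rather than stated in the article.

```python
# Diagonal of Table 3 (percentage of each true class predicted correctly).
recall_pct = {"dance": 85.07, "metal": 94.94, "rural": 82.05,
              "classical": 92.50, "folk": 95.00}
# Validation-set sizes from Table 1 (folk/metal assignment is an assumption).
val_size = {"dance": 67, "metal": 79, "rural": 78, "classical": 80, "folk": 80}

# Recover the number of correctly classified files per genre.
correct = {g: round(recall_pct[g] / 100 * val_size[g]) for g in recall_pct}

accuracy = sum(correct.values()) / sum(val_size.values())
macro_recall = sum(recall_pct.values()) / len(recall_pct) / 100

print(correct)
print(round(accuracy, 4), round(macro_recall, 4))
```

The computed accuracy (346 of 384 files, about 0.901) and the unweighted mean recall (about 0.8991) agree with the Acc and R values listed for experiment 4 in Table 2. This suggests, without confirming, that Table 3 belongs to that experiment and that R is a macro-averaged recall.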

5 Conclusion

Managing the vast and growing volume of digital music files is difficult, and the conventional method of classifying them by manual annotation requires too much time and money to keep up with modern demands. The scholarly relevance of automatic music classification has grown steadily in recent years. To overcome the shortcomings of existing approaches to classifying MIDI files, this article proposes improved feature extraction and an improved classifier design. The main work of this study is as follows:

To describe the music more fully, the feature extraction procedure divides a MIDI file into smaller segments with similar local playing styles, from which features are extracted to produce a segment feature sequence. The procedure comprises extracting the note feature matrix, extracting the theme and dividing the segments based on the note feature matrix, and extracting effective features from the themed segments to form the feature sequence.
For the classification technique, a deep learning-based approach to MIDI music classification is presented.
Experiments on MIDI music genre classification are carried out in code to test the practicality and effectiveness of the above methods.

A preliminary experiment on segment division and theme extraction is carried out. Both a conventional music classification experiment using a BP neural network and the deep learning-based music classification experiment presented in this study are conducted. The experimental results are compared, and the set of features extracted in this work is found to be well suited to classifying MIDI music by genre. In this work, a segment-based approach is used to extract feature sequences. Incorporating a bidirectional gated recurrent unit (BI-GRU) improves classification accuracy over the BP neural network by better describing the sequential properties of music and learning its contextual meaning and high-level features.

The proposed classification approach is built on bidirectional gated recurrent units (BI-GRU), with the addition of an attention mechanism to capture the more salient musical features. An accuracy of 90.1% was attained on the validation set, demonstrating the system’s efficacy. The test accuracy obtained when classifying segments of equal length further supports the reliability of this approach.
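The role of the attention mechanism can be illustrated with a small Python sketch of attention pooling over a sequence of hidden states. The article does not give the exact form of its attention layer, so this example assumes a simple dot-product score against a context vector; in a trained model the hidden states would come from the BI-GRU and the context vector would be learned. The function name attention_pool and the numbers are hypothetical.

```python
import math


def attention_pool(hidden_states, context):
    """Weight each time step by softmax(h_t . context) and return the weighted sum."""
    scores = [sum(h_j * c_j for h_j, c_j in zip(h, context)) for h in hidden_states]
    peak = max(scores)                               # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    pooled = [
        sum(w * h[j] for w, h in zip(weights, hidden_states))
        for j in range(len(context))
    ]
    return weights, pooled


# Two hypothetical 2-dimensional hidden states, one per segment.
states = [[1.0, 0.0], [0.0, 1.0]]
uniform_weights, uniform_pooled = attention_pool(states, [0.0, 0.0])
focused_weights, focused_pooled = attention_pool(states, [10.0, 0.0])
print(uniform_weights, focused_weights)
```

A zero context vector weights all segments equally, whereas a context vector aligned with the first state concentrates almost all of the weight on that segment; this is the sense in which attention emphasizes the more salient segments.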

Future studies of deep learning-based approaches in the music business should examine their potential for inventive and original uses, such as the creation of novel musical styles or the improvement of music creation software. In addition, further study is required to address issues of bias, fairness, and security in the creation and use of these models. When deep learning-based techniques are integrated with other cutting-edge technologies such as augmented and virtual reality, new possibilities emerge for music creation, performance, and consumption.

Funding information: This research does not receive any kind of funding from any source.
Author contributions: Xiaoquan He – conceptualization, methodology, supervision, data curation, visualization, validation, writing original draft; Fang Dong – conceptualization, software, methodology, validation, visualization, writing original draft, reviewing and revising manuscript.
Conflict of interest: The authors report no conflict of interest.
Data availability statement: Data will be available at the request of the corresponding author.

References

[1] Oramas S, Nieto O, Barbieri F, Xavier S. Multi-label music genre classification from audio, text, and images using deep features. In: Cunningham SJ, Duan Z, Hu X, Turnbull D, editors. Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017); 2017 Oct 23–27; Suzhou, China. ISMIR, 2017. p. 23–30.

[2] Sturm BL. A survey of evaluation in music genre recognition. Adapt Multimed Retr. 2012;32:29–66. doi:10.1007/978-3-319-12093-5_2

[3] Bogdanov D, Porter A, Herrera P, Xavier S. Cross-collection evaluation for music classification tasks. In: Mandel MI, Devaney J, Turnbull D, Tzanetakis G, editors. Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016); 2016 Aug 7–11; New York (NY), USA. ISMIR, 2016. p. 379–85.

[4] Dannenberg RB, Thom B, Watson D. A machine learning approach to musical style recognition. Proceedings of the International Computer Music Conference; 1997 Sep 25–30; Thessaloniki, Greece. Michigan Publishing, 1997.

[5] Chillara S, Kavitha AS, Neginhal SA, Haldia S, Vidyullatha KS. Music genre classification using machine learning algorithms: a comparison. Int Res J Eng Technol. 2019;6(5):851–8.

[6] Tzanetakis G, Cook P. Musical genre classification of audio signals. IEEE Trans Speech Audio Process. 2002;10(5):293–302. doi:10.1109/TSA.2002.800560

[7] Ajoodha R, Klein R, Rosman B. Single-labelled music genre classification using content-based features. 2015 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech); 2015 Nov 26–27; Port Elizabeth, South Africa. IEEE, 2015. p. 66–71. doi:10.1109/RoboMech.2015.7359500

[8] Panagakis Y, Kotropoulos C, Arce GR. Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification. IEEE Trans Audio Speech Lang Process. 2010;18(3):576–88. doi:10.1109/TASL.2009.2036813

[9] Zheng F, Zhang G, Song Z. Comparison of different implementations of MFCC. J Comput Sci Technol. 2001;16(6):582–9. doi:10.1007/BF02943243

[10] Xu C, Maddage MC, Shao X, Fang C. Musical genre classification using support vector machines. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03); 2003 Apr 6–10; Hong Kong. IEEE, 2003. p. V-429.

[11] Karatana A, Yildiz O. Music genre classification with machine learning techniques. 2017 25th Signal Processing and communications applications conference (SIU); Antalya, Turkey. IEEE, 2017. doi:10.1109/SIU.2017.7960694

[12] Li SZ. Content-based audio classification and retrieval using the nearest feature line method. IEEE Trans Speech Audio Process. 2000;8(5):619–25. doi:10.1109/89.861383

[13] Sturm BL. Alexander Lerch: An introduction to audio content analysis: Applications in signal processing and music informatics. Comput Music J. 2013;37(4):90–1. doi:10.1162/COMJ_r_00208

[14] Elbir A, Bilal Çam H, Emre Iyican M, Öztürk B, Aydin N. Music genre classification and recommendation by using machine learning techniques. 2018 Innovations in intelligent systems and applications conference (ASYU); Adana, Turkey. IEEE, 2018. doi:10.1109/ASYU.2018.8554016

[15] Nkambule T, Ajoodha R. Classification of music by genre using probabilistic graphical models and deep learning models. In: Yang XS, Sherratt S, Dey N, Joshi A, editors. Proceedings of the Sixth International Congress on Information and Communication Technology (6th ICICT); 2021 Feb 25–26; London, UK. Springer, 2021. p. 185–93. doi:10.1007/978-981-16-2102-4_17

[16] He K, Zhang X, Ren S. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27–30; Las Vegas (NV), USA. IEEE, 2016. p. 770–8. doi:10.1109/CVPR.2016.90

[17] Donahue J, Hendricks LA, Guadarrama S, Venugopalan S, Guadarrama S, Saenko K. Long-term recurrent convolutional networks for visual recognition and description. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015 Jun 7–12; Boston (MA), USA. IEEE, 2015. p. 2625–34. doi:10.1109/CVPR.2015.7298878

[18] Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7. doi:10.1126/science.1127647

Received: 2023-04-29

Revised: 2023-05-29

Accepted: 2023-06-10

Published Online: 2023-07-27

This work is licensed under the Creative Commons Attribution 4.0 International License.

Keywords for this article

deep learning; musical genres; classifying
