Influence of input data representations for time-dependent instrument recognition

  • Markus Schwabe and Michael Heizmann

Published/Copyright: February 25, 2021

Abstract

An important preprocessing step for several music signal processing algorithms is the estimation of the instruments playing in a music recording. To this end, time-dependent instrument recognition is realized in this approach by a neural network with residual blocks. Since music signal processing tasks use diverse time-frequency representations as input matrices, the influence of different input representations on instrument recognition is analyzed in this work. Three-dimensional inputs consisting of short-time Fourier transform (STFT) magnitudes and an additional phase-based time-frequency representation are investigated, as well as two-dimensional STFT or constant-Q transform (CQT) magnitudes. As additional phase representations, the product spectrum (PS), based on the modified group delay, and the frequency error (FE) matrix, related to the instantaneous frequency, are used. Training and evaluation are carried out on the MusicNet dataset, which enables the estimation of seven instruments. With a higher number of frequency bins in the input representations, an improvement of about 2 % in F1-score can be achieved. Compared to the literature, frame-level instrument recognition is improved for different input representations.
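
To make the candidate input representations concrete, the following Python sketch (not the authors' code) illustrates how comparable matrices could be computed with NumPy and librosa. The file name, frame length, hop size, and the exact product-spectrum and frequency-error formulas are assumptions for illustration; the paper's own parameterization may differ.

```python
# Hypothetical sketch of the input representations named in the abstract;
# parameters and the exact PS/FE formulas are assumptions, not the authors' setup.
import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=None, mono=True)    # placeholder recording
n_fft, hop = 2048, 512                                      # assumed analysis parameters

# Frame the signal manually so that |STFT|, PS, and FE share one time-frequency grid.
y_pad = np.pad(y, n_fft // 2)
n_frames = 1 + (len(y_pad) - n_fft) // hop
frames = np.stack([y_pad[i * hop: i * hop + n_fft] for i in range(n_frames)], axis=1)
win = np.hanning(n_fft)[:, None]
ramp = np.arange(n_fft)[:, None]                            # sample index inside each frame

X = np.fft.rfft(frames * win, axis=0)                       # STFT of x[n]
Y = np.fft.rfft(frames * win * ramp, axis=0)                # STFT of n*x[n], used for group delay

# Two-dimensional magnitude inputs: |STFT| and |CQT|.
stft_mag = np.abs(X)
cqt_mag = np.abs(librosa.cqt(y, sr=sr, hop_length=hop))

# Product spectrum (PS): power spectrum times group delay, via the common
# group-delay identity PS = Re{X}Re{Y} + Im{X}Im{Y}.
product_spectrum = X.real * Y.real + X.imag * Y.imag

# Frequency error (FE): deviation of the frame-to-frame phase advance
# (instantaneous frequency) from each bin's expected advance, wrapped to [-pi, pi).
phase = np.angle(X)
expected = 2 * np.pi * np.arange(1 + n_fft // 2)[:, None] * hop / n_fft
dphi = np.diff(phase, axis=1, prepend=phase[:, :1]) - expected
freq_error = (dphi + np.pi) % (2 * np.pi) - np.pi

# Three-dimensional inputs: stack |STFT| with one phase-based representation
# along a channel axis (channels x frequency x time).
x_ps = np.stack([stft_mag, product_spectrum], axis=0)
x_fe = np.stack([stft_mag, freq_error], axis=0)
```

The manual framing keeps the magnitude and phase-based matrices on the same grid, so they can be stacked as channels of a three-dimensional network input; the CQT magnitude, having a different frequency axis, would serve as an alternative two-dimensional input.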

About the authors

Markus Schwabe

Markus Schwabe studied electrical engineering and information technology at the Karlsruhe Institute of Technology (KIT) and received his M.Sc. in 2016. He is currently working as a research associate at the Institute of Industrial Information Technology (IIIT) at KIT. His research interests include signal and audio processing, machine learning, and music signal separation.

Michael Heizmann

Michael Heizmann is Professor of Mechatronic Measurement Systems and Director at the Institute of Industrial Information Technology at the Karlsruhe Institute of Technology. His research areas include machine vision, image processing, image and information fusion, measurement technology, machine learning, artificial intelligence and their applications in industry.

Received: 2020-12-18
Accepted: 2021-02-04
Published Online: 2021-02-25
Published in Print: 2021-05-26

© 2021 Walter de Gruyter GmbH, Berlin/Boston
