Polyphonic pitch tracking with deep layered learning.

Journal: The Journal of the Acoustical Society of America
Abstract

This article presents a polyphonic pitch tracking system that extracts both framewise and note-based estimates from audio. The system uses several artificial neural networks trained individually in a deep layered learning setup. First, cascading networks are applied to a spectrogram for framewise fundamental frequency (f0) estimation. A sparse receptive field is learned by the first network and then used as a filter kernel for parameter sharing throughout the system. The f0 activations are connected across time to extract pitch contours. These contours define a framework within which subsequent networks perform onset and offset detection, operating jointly across time and smaller pitch fluctuations. As input, the networks use, among other features, variations of latent representations from the f0 estimation network. Finally, erroneous tentative notes are removed one by one in an iterative procedure that allows a network to classify notes within a correct context. The system was evaluated on four public test sets (MAPS, Bach10, TRIOS, and the MIREX Woodwind quintet) and achieved state-of-the-art results on all four. It performs well across all subtasks: f0, pitched onset, and pitched offset tracking.
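The step of connecting f0 activations across time into pitch contours can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the activation threshold, and the one-bin linking tolerance (`max_jump`) are all assumptions chosen for the example, which simply links per-frame activation peaks to the nearest open contour in the previous frame.

```python
# Illustrative sketch (not the published system): greedy linking of
# framewise f0 activation peaks into pitch contours.
# `threshold` and `max_jump` are assumed parameters for this example.

def extract_contours(activations, threshold=0.5, max_jump=1):
    """activations: list of frames, each frame a list of per-bin scores.
    Returns contours as lists of (frame_index, bin_index) pairs."""
    contours = []  # every contour ever started, in creation order
    active = []    # contours that were extended in the previous frame
    for t, frame in enumerate(activations):
        peaks = [b for b, v in enumerate(frame) if v >= threshold]
        next_active = []
        for b in peaks:
            # Find the closest open contour within max_jump bins.
            best = None
            for c in active:
                last_bin = c[-1][1]
                if abs(last_bin - b) <= max_jump and (
                        best is None or abs(best[-1][1] - b) > abs(last_bin - b)):
                    best = c
            if best is not None:
                best.append((t, b))      # extend the matched contour
                active.remove(best)      # each contour takes one peak per frame
                next_active.append(best)
            else:
                c = [(t, b)]             # no match: start a new contour
                contours.append(c)
                next_active.append(c)
        active = next_active
    return contours
```

For example, a three-frame activation map with two concurrent pitches yields two contours, each following its peak through small bin-to-bin fluctuations.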

Authors

  • Anders Elowsson
    KTH Royal Institute of Technology, School of Computer Science and Communication, Speech, Music and Hearing, Stockholm, Sweden.