Web demo with audio examples for "A Learned Loss to Leverage Large Multi-stem Datasets with Bleeding for Music Source Separation"
This is a web demo for the paper "A Learned Loss to Leverage Large Multi-stem Datasets with Bleeding for Music Source Separation".
The paper has been submitted to the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025).
Abstract: Separating the individual sources in a music mixture is a challenging problem currently being addressed using deep
learning models. Compiling clean multi-track stems to train these systems is complex and expensive compared to gathering them from
live performances. However, stems recorded in live shows often have source bleeding between tracks, which degrades model quality.
Using Carnatic music as a use case, we leverage large amounts of multi-track data with bleeding to pre-train a separation network.
Then, we propose a CNN-based bleeding estimator trained with artificially generated bleeding on a small set of clean studio-recorded
Carnatic music stems. This approach is used to fine-tune the pre-trained separation model, improving its ability to handle real-world
bleeding in multi-track recordings. We further investigate the optimal amount of clean data required to train the bleeding
estimator, as well as the use of an out-of-domain dataset. Code and audio examples are made available.
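As a rough illustration of the artificial bleeding described in the abstract, the sketch below mixes a small random amount of every other clean stem into each track. The function name, gain range, and mixing scheme are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def add_artificial_bleeding(stems, max_gain=0.3, seed=0):
    """Return a copy of each clean stem with low-gain leakage from the others.

    stems: dict mapping source name -> mono waveform (1-D np.ndarray),
           all of equal length. max_gain caps the random leakage gain.
    """
    rng = np.random.default_rng(seed)
    bled = {}
    for name, clean in stems.items():
        track = clean.astype(np.float64).copy()
        for other_name, other in stems.items():
            if other_name == name:
                continue
            # Draw a small random gain and leak the other source in.
            gain = rng.uniform(0.0, max_gain)
            track += gain * other
        bled[name] = track
    return bled
```

Pairs of clean and artificially bled stems like these could then serve as training targets and inputs for a bleeding estimator.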
All audio examples in this demo are from the CMC dataset. They are not copyrighted, but please do not share this demo or the
audio it contains, as they are provided solely for the purpose of this demo.
Comparing our complete system with baselines
This comparison relates to Table 1 in the paper. The goal is to perceptually compare our complete system, trained with all the
available data (the full Saraga dataset for the separator and the full CMC dataset for the bleeding estimator), against the baselines.
Example 1
Mix
Vocals
Baseline 1, Saraga separator pre-training
Baseline 2, MUSDB18HQ separator pre-training
Saraga pre-training, CMC bleeding estimator
ColdDiffSep, trained only using Saraga
Example 2
Mix
Vocals
Baseline 1, Saraga separator pre-training
Baseline 2, MUSDB18HQ separator pre-training
Saraga pre-training, CMC bleeding estimator
ColdDiffSep, trained only using Saraga
Example 3
Mix
Vocals
Baseline 1, Saraga separator pre-training
Baseline 2, MUSDB18HQ separator pre-training
Saraga pre-training, CMC bleeding estimator
ColdDiffSep, trained only using Saraga
Example 4
Mix
Vocals
Baseline 1, Saraga separator pre-training
Baseline 2, MUSDB18HQ separator pre-training
Saraga pre-training, CMC bleeding estimator
ColdDiffSep, trained only using Saraga
Comparing different sizes of the clean multi-stem dataset
This comparison relates to Table 2 in the paper. The goal is to perceptually compare the separation ability of our complete system
when trained with small clean sets of different sizes, drawn from both CMC and MUSDB18HQ, in order to study how robust our system is
to domain shift.
Note: all compared systems here have the separator pre-trained with Saraga. Therefore, we are perceptually comparing solely
the effect of the data used to train our bleeding estimator.