Web demo with audio examples for "A Learned Loss to Leverage Large Multi-stem Datasets with Bleeding for Music Source Separation"

This is a web demo for the paper "A Learned Loss to Leverage Large Multi-stem Datasets with Bleeding for Music Source Separation". The paper has been submitted to the 2025 International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Abstract: Separating the individual sources in a music mixture is a challenging problem currently being addressed using deep learning models. Compiling clean multi-track stems to train these systems is complex and expensive compared to gathering these from live performances. However, stems recorded in live shows often have source bleeding between tracks, which degrades model quality. Using Carnatic music as a use case, we leverage large amounts of multi-track data with bleeding to pre-train a separation network. Then, we propose a CNN-based bleeding estimator trained with artificially generated bleeding on a small set of clean studio-recorded Carnatic music stems. This approach is used to fine-tune the pre-trained separation model, improving its ability to handle real-world bleeding in multi-track recordings. We investigate further the optimal amount of clean data required for the bleeding estimator's training and the usage of an out-of-domain dataset. Code and audio examples are made available.
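The abstract mentions training the bleeding estimator on artificially generated bleeding. As a rough illustration of that idea (not the paper's actual pipeline; the function name, leakage-gain range, and per-pair random gains are assumptions for this sketch), bleeding can be simulated by adding a randomly scaled copy of every other clean stem to each target stem:

```python
import numpy as np

def simulate_bleeding(stems, leakage_range=(0.05, 0.3), seed=None):
    """Add artificial bleeding to clean stems.

    stems: array of shape (n_stems, n_samples), one row per clean stem.
    Each output stem receives a randomly scaled copy of every other
    stem, mimicking microphone leakage in a live recording.
    """
    rng = np.random.default_rng(seed)
    stems = np.asarray(stems, dtype=np.float64)
    bled = stems.copy()
    n_stems = stems.shape[0]
    for i in range(n_stems):
        for j in range(n_stems):
            if i != j:
                # Random leakage gain from stem j into stem i
                gain = rng.uniform(*leakage_range)
                bled[i] += gain * stems[j]
    return bled
```

Because the clean stems and the applied gains are both known, such synthetic pairs provide supervision for an estimator that predicts the amount of bleeding in real multi-track recordings.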

All audio examples in this demo are from the CMC dataset. They are not copyrighted, but please do not share this demo or the audio presented here, as they are provided solely for the purposes of this demo.

Comparing our complete system with baselines

This comparison relates to Table 1 in the paper. The goal is to perceptually compare our complete system, trained with all the available data (the full Saraga dataset for the separator and the full CMC dataset for the bleeding estimator), with the baselines.

Example 1

Mix

Vocals

Baseline 1, Saraga separator pre-training

Baseline 2, MUSDB18HQ separator pre-training

Saraga pre-training, CMC bleeding estimator

ColdDiffSep, trained only using Saraga

Example 2

Mix

Vocals

Baseline 1, Saraga separator pre-training

Baseline 2, MUSDB18HQ separator pre-training

Saraga pre-training, CMC bleeding estimator

ColdDiffSep, trained only using Saraga

Example 3

Mix

Vocals

Baseline 1, Saraga separator pre-training

Baseline 2, MUSDB18HQ separator pre-training

Saraga pre-training, CMC bleeding estimator

ColdDiffSep, trained only using Saraga

Example 4

Mix

Vocals

Baseline 1, Saraga separator pre-training

Baseline 2, MUSDB18HQ separator pre-training

Saraga pre-training, CMC bleeding estimator

ColdDiffSep, trained only using Saraga

Comparing different sizes of the clean multi-stem dataset

This comparison relates to Table 2 in the paper. The goal is to perceptually compare the separation ability of our complete system when trained with different small clean sets, drawn from both CMC and MUSDB18HQ, in order to study how robust our system is to domain shift.

Note: all systems compared here use a separator pre-trained on Saraga. Therefore, the perceptual differences reflect solely the effect of the data used to train the bleeding estimator.

Example 1

Mix

Vocals

1-track CMC

15-track CMC

1-track MUSDB18HQ

15-track MUSDB18HQ

Example 2

Mix

Vocals

1-track CMC

15-track CMC

1-track MUSDB18HQ

15-track MUSDB18HQ

Example 3

Mix

Vocals

1-track CMC

15-track CMC

1-track MUSDB18HQ

15-track MUSDB18HQ

Example 4

Mix

Vocals

1-track CMC

15-track CMC

1-track MUSDB18HQ

15-track MUSDB18HQ