Singing voice extraction#

Music source separation for Carnatic Music#

As seen in the music segmentation example, vocal separation for Carnatic and Hindustani music remains an unsolved (and barely explored!) problem. Let’s first introduce the task, its challenges, and the current solutions.

Music source separation#

Music source separation (MSS) aims at automatically estimating the individual elements in a music mixture. MSS systems, which nowadays are mostly based on deep learning (DL) architectures, operate on the waveform, on the time-frequency domain, or even on a combination of both. Since this is a core problem in the field of music information research, considerable effort has gone into building open and high-performing models, and several pre-trained systems are made freely available to be used out-of-the-box.
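
To make the time-frequency idea concrete, below is a minimal sketch of mask-based separation (not the method of any particular system): the mixture is transformed with an STFT, a mask that a real model would predict is applied (here just a placeholder all-ones mask), and the masked spectrogram is inverted back to a waveform. It assumes librosa is available and uses a placeholder file name.

import numpy as np
import librosa

# Load a mixture (the path is a placeholder)
mixture, sr = librosa.load("mixture.wav", sr=None, mono=True)

# Move to the time-frequency domain
mix_stft = librosa.stft(mixture, n_fft=2048, hop_length=512)

# A real MSS model would predict a soft mask per source from the mixture;
# here a dummy all-ones mask stands in for that prediction
vocal_mask = np.ones_like(np.abs(mix_stft))

# Apply the mask and invert back to the waveform domain
vocal_estimate = librosa.istft(vocal_mask * mix_stft, hop_length=512, length=len(mixture))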

MSS systems are normally trained using the mixture as input and the target sources as expected output, so that the model is optimized to estimate the isolated sources from the mixture. There are a few datasets in the literature that may be used for that purpose: musdb18hq, moisesdb, and medleydb. However, these datasets mostly include recordings that can be framed within pop and rock styles, and therefore, as normally happens in DL, a model trained on data from a particular domain does not generalize well to out-of-domain use cases.
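
As a rough illustration of that training setup (a toy sketch, not the actual pipeline behind these datasets or any specific model), the supervision is simply the pair of mixture and stems: the mixture is the sum of the isolated stems, and the loss compares the model estimates against each target source.

import numpy as np

# Hypothetical isolated stems of one training example (placeholder random audio)
stems = {
    "vocals": np.random.randn(44100),
    "bass": np.random.randn(44100),
    "drums": np.random.randn(44100),
    "other": np.random.randn(44100),
}

# The input to the model is the mixture of all stems...
mixture = sum(stems.values())

# ...and the expected outputs are the isolated stems themselves.
# A trained model would map mixture -> estimates; a dummy stand-in here:
estimates = {name: mixture / len(stems) for name in stems}

# A typical objective is an L1 (or similar) reconstruction loss per source
loss = np.mean([np.mean(np.abs(estimates[s] - stems[s])) for s in stems])
print(f"Toy reconstruction loss: {loss:.3f}")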

The problem with Carnatic Music#

We observe exactly this problem in the case of Carnatic Music. Not only do the available models in the literature have no knowledge of this repertoire, but the task of MSS also normally targets the following source setup: vocals, bass, drums, and other, which does not match the actual arrangement and nature of Carnatic Music.

Some well-known models for MSS are Spleeter [HKVM20] by Deezer, Meta’s Demucs [DUBB19], and their related extensions and evolutions.
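
As an example of such out-of-the-box usage, a pre-trained Demucs model can be applied to an arbitrary mixture in a few lines. This is a sketch assuming the demucs Python package and its get_model/apply_model API; exact names and defaults may vary between versions.

import torch
from demucs.pretrained import get_model
from demucs.apply import apply_model

# Load a pre-trained Demucs model (trained to output vocals/bass/drums/other)
model = get_model("htdemucs")

# A placeholder stereo mixture (10 seconds of silence) at the model's sample rate
mixture = torch.zeros(1, model.audio_channels, model.samplerate * 10)

# Separate: the output has shape (batch, n_sources, channels, time)
with torch.no_grad():
    sources = apply_model(model, mixture)
vocals = sources[0, model.sources.index("vocals")]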

Some efforts have been made to improve separation for Carnatic Music, building on the Saraga dataset [SGRS20], which includes multi-track recordings of Carnatic renditions. However, the stems of the different sources are not completely isolated: since the recordings were collected in live performances, each stem contains leakage (or bleeding) from the rest of the sources in the background, convolved with the room response. Some works have tried to take advantage of these noisy data and the knowledge inherent in them, although there is still room for improvement.
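
To hear this bleeding for yourself, the Saraga multi-track stems can be accessed through compiam (which wraps mirdata): the vocal stem is a close microphone on the singer, yet the accompaniment still leaks into it. This is only a sketch; the stem attribute names (e.g. audio_vocal_path) follow the mirdata saraga_carnatic loader and may differ across versions, and the multi-track audio needs to be downloaded beforehand.

import soundfile as sf
from compiam import load_dataset

saraga_carnatic = load_dataset("saraga_carnatic")
saraga_tracks = saraga_carnatic.load_tracks()
example = next(iter(saraga_tracks.values()))

# The vocal stem still contains the rest of the ensemble in the background
vocal_stem, sr = sf.read(example.audio_vocal_path)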

## Installing (if not) and importing compiam to the project
import importlib.util
if importlib.util.find_spec('compiam') is None:
    ## Bear in mind this will only run in a Jupyter notebook / Colab session
    %pip install compiam
import compiam

# Import extras and suppress warnings to keep the tutorial clean
import os
import numpy as np
from pprint import pprint

import warnings
warnings.filterwarnings('ignore')

Leakage-aware source separation model#

This model is able to separate clean singing voices even though it has been trained solely on data with bleeding in the multi-track stems. Let’s test how it works on a real example. Since the model is DL-based, we first need to install tensorflow.

%pip install "tensorflow==2.15.0" "keras<3"
# Importing soundfile and loading the leakage-aware separation model
import soundfile as sf
from compiam import load_model, load_dataset
separation_model = load_model('separation:cold-diff-sep')

# Loading and normalizing an example track
saraga_carnatic = load_dataset(
    "saraga_carnatic",
    data_home=os.path.join("..", "audio", "mir_datasets")
)
saraga_tracks = saraga_carnatic.load_tracks()
example = saraga_tracks["109_Sri_Raghuvara_Sugunaalaya"]
input_mixture, sr = sf.read(example.audio_path)

input_mixture = input_mixture.T
mean = np.mean(input_mixture, keepdims=True)
std = np.std(input_mixture, keepdims=True)
input_mixture = (input_mixture - mean) / (1e-6 + std)
## Getting the first 20 seconds and separating
input_mixture = input_mixture[:, :44100*20]
separation = separation_model.separate(
    input_data=input_mixture,
    input_sr=sr,
    clusters=6,
    scheduler=5,
)
import IPython.display as ipd

# And we play it!
ipd.Audio(
    data=separation,
    rate=separation_model.sample_rate,
)
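
If you want to keep the result for later use, the separated signal can be written to disk with soundfile. This assumes the output is a mono (1-D) array after squeezing; if it is channels-first stereo, transpose it before writing.

# Write the separated vocals to a wav file for later use
vocals_out = np.squeeze(separation)
sf.write("separated_vocals.wav", vocals_out, samplerate=separation_model.sample_rate)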

Although some artifacts are perceptible in the vocals, the separation is surprisingly clean, which will hopefully help musicians and musicologists extract relevant information from it. Also, less pitched noise is present in the signal, so melodic feature extraction systems may work better on these data than on the complete mixture or on a singing voice with source bleeding in the background.
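
As a quick illustration of that downstream use, a generic pitch tracker can be run directly on the separated voice. This is a sketch using librosa.pyin (not the Carnatic-specific extractors in compiam), with placeholder frequency bounds.

import librosa

# Pitch-track the separated vocals: a cleaner input usually yields fewer
# octave errors and spurious voiced frames than the full mixture
vocals = np.squeeze(separation)
if vocals.ndim > 1:  # downmix to mono if the output is channels-first stereo
    vocals = vocals.mean(axis=0)

f0, voiced_flag, voiced_prob = librosa.pyin(
    vocals,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
    sr=separation_model.sample_rate,
)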