Singing voice extraction

As seen in the music segmentation example, vocal separation for Carnatic and Hindustani music remains an unsolved (and largely unexplored!) field. In this section we walk through a tool that has been trained using the Saraga dataset [SGRS20]. Given the live-performance nature of Carnatic music, it is difficult, in fact currently impossible, to find fully-isolated multi-stem recordings to train or fine-tune existing separation approaches. Saraga includes multi-stem recordings, but these have source bleeding in the background, since they were captured in live performances. Here we present an approach that has been designed with the bleeding problem in mind.

## Installing compiam (if not installed) and importing it into the project
import importlib.util
if importlib.util.find_spec('compiam') is None:
    ## Bear in mind this will only run in a Jupyter notebook / Colab session
    %pip install compiam
import compiam

# Import extras and suppress warnings to keep the tutorial clean
import os
import numpy as np
from pprint import pprint

import warnings
warnings.filterwarnings('ignore')

Leakage-aware source separation model

This model is able to separate clean singing voices even though it has been trained solely on data that have bleeding in the multi-track stems. Let’s test how it works on a real example. Since the model is DL-based, we first need to install tensorflow.

%pip install tensorflow
%pip install tensorflow_addons
# Importing and initializing the separation model
import soundfile as sf
from compiam import load_model, load_dataset
separation_model = load_model('separation:cold-diff-sep')

# Loading and normalizing an example track
saraga_carnatic = load_dataset(
    "saraga_carnatic",
    data_home=os.path.join("..", "audio", "mir_datasets")
)
saraga_tracks = saraga_carnatic.load_tracks()
example = saraga_tracks["109_Sri_Raghuvara_Sugunaalaya"]
input_mixture, sr = sf.read(example.audio_path)

input_mixture = input_mixture.T
mean = np.mean(input_mixture, keepdims=True)
std = np.std(input_mixture, keepdims=True)
input_mixture = (input_mixture - mean) / (1e-6 + std)
## Taking the first 20 seconds and separating
input_mixture = input_mixture[:, :44100*20]
separation = separation_model.separate(
    input_data=input_mixture,
    input_sr=sr,
    clusters=6,
    scheduler=5,
)
import IPython.display as ipd

# And we play it!
ipd.Audio(
    data=separation,
    rate=separation_model.sample_rate,
)
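
If you want to reuse the separated vocals later on, you can also write them to disk with soundfile. This is a minimal sketch that reuses the `separation` array and the model sample rate from above; the output file name is just an example.

# Optionally write the separated vocals to disk for later use
sf.write(
    "separated_vocals.wav",  # example output path, pick your own
    separation,
    separation_model.sample_rate,
)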

Although perceptible artifacts can be heard in the vocals, the separation is surprisingly clean, hopefully helping musicians and musicologists to extract relevant information from it. Also, less pitched noise is present in the signal, so melodic feature extraction systems may work better on these data than on the complete mixture or on a singing voice with source bleeding in the background.
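
As a rough sketch of that last point, one could feed the separated vocals to a pitch tracker available in compiam, for instance the Carnatic-trained FTANet model. Note that the model identifier and the predict call below are assumptions that may differ across compiam versions, so check the compiam documentation before running them.

# Hypothetical follow-up: extract a vocal pitch track from the separated signal
# (the identifier and predict() usage are assumptions, see the compiam docs)
ftanet_carnatic = load_model("melody:ftanet-carnatic")
vocal_pitch = ftanet_carnatic.predict("separated_vocals.wav")  # file written above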