Music segmentation#
We have seen in the musicological introduction that we may come across different formats of Carnatic and Hindustani performances. Let us review some example works that have been proposed for this task. [VVPR15] propose using tempo cues combined with additional handcrafted features that aim at capturing tradition-relevant information. On the other hand, [RPS19] quantize the pitch contours, identify repeated note sequences or patterns, and then compute posterior probabilities to identify the different sections. Alternatively, [GGSR16] builds on top of hand-crafted features of pitch time-series. Finally, DL models have also been used to estimate the surface tempo of Dhrupad Bandish performances for segmentation [MAVR20].
## Installing (if not) and importing compiam to the project
import importlib.util
if importlib.util.find_spec('compiam') is None:
    ## Bear in mind this will only run in a Jupyter notebook / Colab session
    %pip install git+https://github.com/MTG/compIAM.git
import compiam
# Import extras and suppress warnings to keep the tutorial clean
import os
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')
Let’s first list the available tools for music segmentation in compiam.
compiam.structure.segmentation.list_tools()
['DhrupadBandishSegmentation*']
Dhrupad Bandish segmentation#
In this section we will showcase a tool that attempts to identify, through the use of rhythmic features, the different sections in Dhrupad Bandish performances [MAVR20], one of the main formats in Hindustani music. As seen in the documentation, this segmentation model is based on PyTorch. Therefore, we proceed to install torch.
%pip install torch==1.8.0
This tool may be accessed from the structure.segmentation module. However, the tool name has an * appended, which means we can use the model wrapper to rapidly initialize it with the pre-trained weights loaded.
Tip
Get the correct identifier for the wrapper by running compiam.list_models().
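For instance, we can list the identifiers of all models available through the wrapper; the Dhrupad Bandish segmentation tool should appear among them.
# List the identifiers of the pre-trained models available through the wrapper
pprint(compiam.list_models())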
dbs = compiam.load_model("structure:dhrupad-bandish-segmentation")
help(dbs)
In the documentation we observe that this model includes quite a number of attributes, two of which are particularly interesting:
- mode
- fold
These attributes are important because they define the training pipeline that has been used and, therefore, a different mode of operating with this model. mode has three options: net, voc, or pakh, which indicate the source for surface tempo multiple (s.t.m.) estimation. The net mode is for input mixture signals, voc is for clean or source-separated singing voice recordings, and pakh is for pakhawaj tracks (the pakhawaj is a percussion instrument from Northern India). fold is an integer indicating which validation fold is considered for training.
These configuration variables are loaded by default as net and 0 respectively; however, they may be easily changed.
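As a quick check (assuming mode and fold are exposed as plain attributes on the loaded instance, as the documentation suggests), we could inspect the defaults before updating them:
# Inspect the current configuration of the loaded model
# NOTE: assumes `mode` and `fold` are plain attributes of the instance
print(dbs.mode)  # expected default: "net"
print(dbs.fold)  # expected default: 0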
dbs.update_mode(mode="voc")
dbs.update_fold(fold=1)
At this moment, the mode and fold have been updated and, consequently, the class has automatically loaded the model weights corresponding to mode=voc and fold=1.
Note
Typically in compiam, importing a model from the corresponding module or initializing it through the wrapper makes an important difference in how the loaded instance works. Generally speaking, if you use the wrapper you are probably only interested in running inference. If your goal is to train or dive deeper into a particular model, you should avoid the model wrapper and start from a clean model instance.
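As a rough sketch of the two access patterns (the exact constructor arguments for a clean instance are omitted here; check the compiam documentation for the full signature):
# Through the wrapper: pre-trained weights are loaded, ready for inference
dbs = compiam.load_model("structure:dhrupad-bandish-segmentation")
# Through the module: a clean instance, e.g. for training or experimentation
# (constructor defaults assumed; see the documentation for the arguments)
from compiam.structure.segmentation import DhrupadBandishSegmentation
dbs_clean = DhrupadBandishSegmentation()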
Let’s now run prediction on an input file. Our mode is now voc, therefore the model expects a clean or source-separated vocal signal. Isolated singing voice signals are not commonly available for Carnatic and Hindustani music, so we will use a state-of-the-art, out-of-the-box model, spleeter, to try to separate the singing voice from the accompaniment.
%pip install spleeter
%pip install numba --upgrade
We will now directly download the pre-trained models for spleeter and use these for inference in this walkthrough. We will use wget (UNIX-based) to download the available pre-trained weights for spleeter online.
!wget https://github.com/deezer/spleeter/releases/download/v1.4.0/2stems.tar.gz
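If wget is not available on your system, the same archive can be fetched with the Python standard library instead (a minimal alternative, writing to the same file name):
# Download the same archive without wget, using the standard library
import urllib.request
urllib.request.urlretrieve(
    "https://github.com/deezer/spleeter/releases/download/v1.4.0/2stems.tar.gz",
    "2stems.tar.gz",
)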
We need to use tarfile to uncompress the downloaded file into a desired location. We will uncompress the downloaded model weights to the default location where spleeter looks for pre-trained weights.
import tarfile
# Open file
file = tarfile.open("2stems.tar.gz")
# Creating directory where spleeter looks for models by default
os.mkdir("pretrained_models/")
# Extracting files in tar
file.extractall(
os.path.join("pretrained_models", "2stems")
)
# Closing file
file.close()
spleeter is based on TensorFlow. We disable the GPU usage and the TensorFlow-related warnings just like we did in the pitch extraction walkthrough.
# Disabling tensorflow warnings and debugging info
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
# Importing tensorflow and disabling GPU usage
import tensorflow as tf
tf.config.set_visible_devices([], "GPU")
We may now load the spleeter separator, which will automatically load the pre-trained weights for the model. We will use the 2stems model, which has been trained to separate vocals and accompaniment.
Note
The other option, 4stems, separates vocals, bass, drums, and other.
from spleeter.separator import Separator
# Load default 2-stem spleeter separation
separator = Separator("spleeter:2stems")
The Separator class in spleeter has a method to directly separate the singing voice from an audio file, storing the prediction in a given output folder. Let’s use this method and get a source-separated version of an example Dhrupad file.
pprint(compiam.list_datasets())
['saraga_carnatic',
'saraga_hindustani',
'mridangam_stroke',
'four_way_tabla',
'compmusic_carnatic_rhythm',
'compmusic_hindustani_rhythm',
'compmusic_raga',
'compmusic_indian_tonic',
'compmusic_carnatic_varnam',
'scms']
Oops… We note that no Dhrupad Bandish dataset is available in mirdata. Therefore, we will need to refer to the Carnatic and Hindustani corpora in Dunya. Let’s get an audio example from Dunya using the Corpora class. As already mentioned before, you need a personal and non-shareable token to access the data in Dunya. Within the context of this tutorial, we provide here the snippet of code that we have used beforehand to fetch the audio from the Dunya database.
If we browse the folder compiam/models/structure/dhrupad_bandish_segmentation/audio_original within the installed compiam package, we can see a .pdf file detailing the files that form the dataset for the Dhrupad Segmentation tool. That .pdf file is also found in the original repository of the tool. We select the following example:
# UNCOMMENT AND RUN THIS CODE WITH YOUR PERSONAL TOKEN
#from compiam import load_corpora
#corpora = load_corpora(
#    "hindustani",
#    cc=False,  # Indicating we use the private collection
#    token="<your-access-token>",
#)
#corpora.download_mp3(
#    "59c88c32-0bde-433b-b194-0f65281e5714",
#    os.path.join("..", "audio")
#)
Let’s make sure the audio has actually been downloaded into the audio folder.
%ls ../audio
59c88c32-0bde-433b-b194-0f65281e5714.mp3 pattern_finding/ testing_samples/
mir_datasets/ separation/
Cool! There it is. Let’s now run the spleeter separation on this track.
# Separating file
separator.separate_to_file(
os.path.join(
"..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714.mp3"
),
os.path.join("..", "audio")
)
Separation done! We can now run inference with the segmentation model on the source separated signal.
dbs.predict_stm(
input_data=os.path.join(
"..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714", "vocals.wav"
)
)
We can observe the estimated sections (given rhythmic characteristics) in the output image. The x axis provides the actual time-stamps in seconds for each estimation.
As a final experiment, let’s listen to the source-separated file produced by spleeter.
import IPython.display as ipd
import librosa
vocals, sr = librosa.load(
os.path.join(
"..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714", "vocals.wav"
),
)
ipd.Audio(
data=vocals[-sr*30:], # Taking only the last 30 seconds
rate=sr,
)
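Assuming spleeter’s default output layout, an accompaniment stem should have been written next to the separated vocals; listening to it gives an informal sense of how much the sources bleed into each other.
# Load and play the accompaniment stem written by spleeter
# NOTE: path assumes spleeter's default <track-id>/accompaniment.wav layout
accompaniment, sr = librosa.load(
    os.path.join(
        "..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714", "accompaniment.wav"
    ),
)
ipd.Audio(
    data=accompaniment[-sr*30:],  # last 30 seconds, matching the vocals example
    rate=sr,
)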
A note on source separation for Indian Art Music#
As you may have noticed in the displayed audio above, even though spleeter is among the best out-of-the-box source separation models available, we need to take into account some considerations regarding the task of music source separation for Indian Art Music signals.
First of all, although spleeter is trained on a massive amount of recordings, we can safely assume that Carnatic and Hindustani music do not have considerable representation in the training set (this applies to the other out-of-the-box source separation models out there). In that sense, it is expected that these models struggle to generalize to the instruments and arrangements specific to Indian Art Music, which may cause abnormally low interference removal performance. The predominance of melodic monophonic accompaniment instruments in both Carnatic (the violin being the most common case) and Hindustani (the harmonium in this case), the tambura drone, the pitched percussion… these are high-level examples of elements that may cause standard source separation models to not generalize properly to Indian Art Music signals.
What is more, standard source separation models target either vocals and accompaniment, or vocals, bass, drums, and other. While separating the singing voice from the accompaniment is reasonable, the 4-stem configuration is far from representative of actual Carnatic and Hindustani music arrangements.
As a final note, another factor that is currently blocking research on music source separation for Indian Art Music is the shortage of available datasets for this task. We have observed that the Saraga Carnatic collection has multi-track audio, but it contains leakage (it has been recorded in live performances). In that case, a leakage-aware approach would be needed to use this data. Alternatively, a music source separation dataset including completely isolated and aligned tracks, which to the best of our knowledge is unavailable as of now, would open the door to music source separation research on Indian Art Music.
Nov 2023 Update: A Carnatic-specific singing voice separation model has been developed and presented at ISMIR 2023 in Milan, Italy. See the separation walkthrough for an example.