Music segmentation
As we have seen in the musicological introduction, Carnatic and Hindustani performances come in different formats. Let us review some of the works that have been proposed for the task of segmenting them. [VVPR15] proposes using tempo cues combined with additional handcrafted features that aim to capture tradition-relevant information. [RPS19], on the other hand, quantizes the pitch contours, identifies repeated note sequences or patterns, and then computes posterior probabilities to identify the different sections. Alternatively, [GGSR16] builds on hand-crafted features of pitch time-series. Finally, DL models are also used to identify the meter and tempo surfaces of Dhrupad Bandish performances for segmentation [MAVR20].
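As a rough illustration of the quantize-then-find-repetitions idea mentioned above, here is a minimal sketch. It is not the method of [RPS19], which additionally computes posterior probabilities. It assumes a pitch track in Hz where unvoiced frames are encoded as 0, a known tonic, and a semitone resolution; all function names are illustrative.

```python
import math
from collections import Counter


def quantize_to_semitones(pitch_hz, tonic_hz):
    """Map each voiced pitch value (Hz) to the nearest semitone relative to the tonic."""
    notes = []
    for f in pitch_hz:
        if f <= 0:  # assumption: unvoiced frames are encoded as 0 Hz
            continue
        cents = 1200.0 * math.log2(f / tonic_hz)
        notes.append(round(cents / 100.0))
    return notes


def collapse_repeats(notes):
    """Merge consecutive identical values so each held note appears once."""
    return [n for i, n in enumerate(notes) if i == 0 or n != notes[i - 1]]


def repeated_ngrams(notes, n=3):
    """Return the note n-grams that occur more than once in the sequence."""
    counts = Counter(tuple(notes[i:i + n]) for i in range(len(notes) - n + 1))
    return {gram: c for gram, c in counts.items() if c > 1}


# Toy pitch track around a 200 Hz tonic
pitch = [200, 200, 0, 224.5, 252, 200, 224.5, 252]
notes = collapse_repeats(quantize_to_semitones(pitch, tonic_hz=200))
print(notes)
print(repeated_ngrams(notes, n=3))
```

A real system would work on much longer contours and at a finer pitch resolution, since the melodic movements of Indian Art Music are not well described by equal-tempered semitones alone.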
## Installing (if not) and importing compiam to the project
import importlib.util
if importlib.util.find_spec('compiam') is None:
    ## Bear in mind this will only run in a Jupyter notebook / Colab session
    %pip install git+https://github.com/MTG/compIAM.git
import compiam
# Import extras and suppress warnings to keep the tutorial clean
import os
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')
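The install-if-missing check used above can be wrapped in a small helper, which is handy because this walkthrough depends on several optional packages. This is a minimal sketch using only the standard library; the helper name `is_installed` is an illustrative choice, not part of compiam.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Report which of the packages used in this walkthrough are available
for name in ("compiam", "torch", "spleeter"):
    print(f"{name}: {'found' if is_installed(name) else 'missing'}")
```

Running this before the later cells makes it easier to spot a missing dependency early instead of hitting an import error halfway through.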
Let’s first list the available tools for music segmentation in compiam.
compiam.structure.segmentation.list_tools()
['DhrupadBandishSegmentation*']
Dhrupad Bandish segmentation
In this section we will showcase a tool that attempts to identify, through the use of rhythmic features, the different sections in a Dhrupad Bandish performance [MAVR20], one of the main formats in Hindustani music. As seen in the documentation, this segmentation model is based on PyTorch. Therefore, we proceed to install torch.
%pip install torch==1.13.0
This tool may be accessed from structure.segmentation. However, the tool name has an * appended, so we can use the model wrapper to quickly initialize it with the pre-trained weights loaded.
Tip
Get the correct identifier for the wrapper by running compiam.list_models().
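To make the meaning of the trailing * concrete, the following sketch separates the tool names that carry the marker from those that do not. It works on a plain list of strings, such as the one printed by list_tools() earlier; the helper name is hypothetical and not part of compiam.

```python
def split_by_wrapper_support(tool_names):
    """Separate tool names ending in '*' from the rest, stripping the marker."""
    with_wrapper, without_wrapper = [], []
    for name in tool_names:
        if name.endswith("*"):
            with_wrapper.append(name.rstrip("*"))
        else:
            without_wrapper.append(name)
    return with_wrapper, without_wrapper


# List copied from the list_tools() output shown earlier
tools = ["DhrupadBandishSegmentation*"]
with_wrapper, without_wrapper = split_by_wrapper_support(tools)
print(with_wrapper)
print(without_wrapper)
```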
dbs = compiam.load_model("structure:dhrupad-bandish-segmentation")
100%|██████████| 1.59M/1.59M [00:00<00:00, 2.56MiB/s]
[2024-12-01 23:45:27,494] INFO [compiam.utils.download.download_zip:95] Download complete: /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/compiam/models/structure/dhrupad_bandish_segmentation/dhrupad_bandish_segmentation.zip
[2024-12-01 23:45:27,498] INFO [compiam.utils.download.extract_zip:103] Extracting /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/compiam/models/structure/dhrupad_bandish_segmentation/dhrupad_bandish_segmentation.zip...
[2024-12-01 23:45:27,515] INFO [compiam.utils.download.extract_zip:106] Extraction complete: Files extracted to /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/compiam/models/structure/dhrupad_bandish_segmentation
[2024-12-01 23:45:27,516] INFO [compiam.utils.download.download_remote_model:71] Files downloaded and extracted successfully.
help(dbs)
Help on DhrupadBandishSegmentation in module compiam.structure.segmentation.dhrupad_bandish_segmentation object:
class DhrupadBandishSegmentation(builtins.object)
| DhrupadBandishSegmentation(mode='net', fold=0, model_path=None, splits_path=None, annotations_path=None, features_path=None, original_audios_path=None, processed_audios_path=None, download_link=None, download_checksum=None, device=None)
|
| Dhrupad Bandish Segmentation
|
| Methods defined here:
|
| __init__(self, mode='net', fold=0, model_path=None, splits_path=None, annotations_path=None, features_path=None, original_audios_path=None, processed_audios_path=None, download_link=None, download_checksum=None, device=None)
| Dhrupad Bandish Segmentation init method.
|
| :param mode: net, voc, or pakh. That indicates the source for s.t.m. estimation. Use the net
| mode if audio is a mixture signal, else use voc or pakh for clean/source-separated vocals or
| pakhawaj tracks.
| :param fold: 0, 1 or 2, it is the validation fold to use during training.
| :param model_path: path to file to the model weights.
| :param splits_path: path to file to audio splits.
| :param annotations_path: path to file to the annotations.
| :param features_path: path to file to the computed features.
| :param original_audios_path: path to file to the original audios from the dataset (see README.md in
| compIAM/models/structure/dhrupad_bandish_segmentation/audio_original)
| :param processed_audios_path: path to file to the processed audio files.
| :param download_link: link to the remote pre-trained model.
| :param download_checksum: checksum of the model file.
| :param device: indicate whether the model will run on the GPU.
|
| download_model(self, model_path=None, force_overwrite=False)
| Download pre-trained model.
|
| load_model(self, model_path)
| Loading weights for model, given self.mode and self.fold
|
| :param model_path: path to model weights
|
| predict_stm(self, input_data, input_sr=44100, save_output=False, output_path=None)
| Predict Dhrupad Bandish Segmentation
|
| :param input_data: path to audio file or numpy array like audio signal.
| :param input_sr: sampling rate of the input array of data (if any). This variable is only
| relevant if the input is an array of data instead of a filepath.
| :param save_output: boolean indicating whether the output figure for the estimation is
| stored.
| :param output_path: if the input is an array, and the user wants to save the estimation,
| the output_path must be provided, path/to/picture.png.
|
| train(self, verbose=True)
| Train the Dhrupad Bandish Segmentation model
|
| :param verbose: showing details of the model
|
| update_fold(self, fold)
| Update data fold for the training and sampling
|
| :param fold: new fold to use
|
| update_mode(self, mode)
| Update mode for the training and sampling. Mode is one of net, voc,
| pakh, indicating the source for s.t.m. estimation. Use the net mode if
| audio is a mixture signal, else use voc or pakh for clean/source-separated
| vocals or pakhawaj tracks.
|
| :param mode: new mode to use
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables
|
| __weakref__
| list of weak references to the object
In the documentation we observe that this model includes quite a number of attributes, two of which are particularly interesting:
mode
fold
These attributes are important because they define the training pipeline that has been used and, therefore, a different way of operating with this model. mode has three options: net, voc, or pakh, which indicate the source for surface tempo multiple (s.t.m.) estimation. The net mode is for an input mixture signal, voc is for clean or source-separated singing voice recordings, and pakh is for pakhawaj tracks (the pakhawaj is a percussion instrument from Northern India). fold is an integer indicating which validation fold is considered for training.
These configuration variables are loaded by default as net and 0, respectively; however, they can easily be changed.
dbs.update_mode(mode="voc")
dbs.update_fold(fold=1)
At this point, the mode and fold have been updated and, consequently, the class has automatically loaded the model weights corresponding to mode=voc and fold=1.
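Since only a few values are documented for these two settings, it can be useful to validate them before updating the model. The sketch below is one possible way to do so; the allowed values are taken from the class docstring shown above, while the helper `check_config` is a hypothetical name and not part of compiam.

```python
VALID_MODES = ("net", "voc", "pakh")  # options listed in the class docstring
VALID_FOLDS = (0, 1, 2)               # folds listed in the class docstring


def check_config(mode, fold):
    """Raise ValueError if mode or fold is outside the documented options."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if fold not in VALID_FOLDS:
        raise ValueError(f"fold must be one of {VALID_FOLDS}, got {fold!r}")
    return mode, fold


mode, fold = check_config("voc", 1)
# With a loaded model, the validated values could then be applied:
# dbs.update_mode(mode=mode)
# dbs.update_fold(fold=fold)
```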
Note
Typically in compiam, whether you import a model from the corresponding module or initialize it using the wrapper can make an important difference in how the loaded instance works. Generally speaking, if you use the wrapper you will probably only be interested in running inference. If your goal is to train or deep-dive into a particular model, you should avoid the model wrapper and start from a clean model instance.
Let’s now run prediction on an input file. Our mode is now voc, so the model expects a clean or source-separated vocal signal. Isolated singing voice signals are not commonly available for Carnatic and Hindustani music. We will use a state-of-the-art, out-of-the-box model, spleeter, to try to separate the singing voice from the accompaniment.
%pip install spleeter
%pip install numba --upgrade
We will now directly download the pre-trained models for spleeter and use them for inference in this walkthrough. We will use wget (available on UNIX-based systems) to download the pre-trained weights for spleeter that are available online.
!wget https://github.com/deezer/spleeter/releases/download/v1.4.0/2stems.tar.gz
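On systems where wget is not available, the same file could be fetched with the Python standard library. The following is a minimal sketch under that assumption; `download_if_missing` is a hypothetical helper name, and the URL is the one used in the wget command above.

```python
import os
import urllib.request


def download_if_missing(url, dest_path):
    """Download url to dest_path unless a file is already there; return the path."""
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path


# Uncomment to download the weights without wget
# download_if_missing(
#     "https://github.com/deezer/spleeter/releases/download/v1.4.0/2stems.tar.gz",
#     "2stems.tar.gz",
# )
```

Skipping the download when the file already exists keeps re-runs of the notebook fast.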
We need to use tarfile to uncompress the downloaded file into a desired location. We will uncompress the downloaded model weights to the default location where spleeter looks for the pre-trained weights.
import tarfile
# Creating directory where spleeter looks for models by default
# (exist_ok avoids an error if the cell is run more than once)
os.makedirs("pretrained_models", exist_ok=True)
# Opening the archive and extracting its files; the file is closed automatically
with tarfile.open("2stems.tar.gz") as tar:
    tar.extractall(os.path.join("pretrained_models", "2stems"))
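To double-check the extraction step, one option is to compare the archive contents against the target folder. The sketch below uses only the standard library; both helper names are hypothetical, and no assumption is made about which files the spleeter archive actually contains.

```python
import os
import tarfile


def list_archive_members(archive_path):
    """Return the member names stored in a tar archive without extracting it."""
    with tarfile.open(archive_path) as tar:
        return tar.getnames()


def missing_after_extraction(archive_path, target_dir):
    """Return archive members that are not present under target_dir."""
    return [
        name for name in list_archive_members(archive_path)
        if not os.path.exists(os.path.join(target_dir, name))
    ]


# Example call for the archive used in this walkthrough:
# missing_after_extraction(
#     "2stems.tar.gz", os.path.join("pretrained_models", "2stems")
# )
```

An empty list means every archive member was found in the target folder.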
spleeter is based on TensorFlow. We disable GPU usage and the TensorFlow-related warnings, just like we did in the pitch extraction walkthrough.
# Disabling tensorflow warnings and debugging info
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
# Importing tensorflow and disabling GPU usage
import tensorflow as tf
tf.config.set_visible_devices([], "GPU")
We may now load the spleeter separator, which will automatically load the pre-trained weights for the model. We will use the 2stems model, which has been trained to separate vocals and accompaniment.
Note
The other option, the 4stems model, separates vocals, bass, drums, and other.
from spleeter.separator import Separator
# Load default 2-stem spleeter separation
separator = Separator("spleeter:2stems")
The Separator class in spleeter has a method to directly separate the singing voice from an audio file, storing the prediction in a given output folder. Let’s use this method to get a source-separated version of an example Dhrupad file.
pprint(compiam.list_datasets())
['saraga_carnatic',
'saraga_hindustani',
'mridangam_stroke',
'four_way_tabla',
'compmusic_carnatic_rhythm',
'compmusic_hindustani_rhythm',
'compmusic_raga',
'compmusic_indian_tonic',
'compmusic_carnatic_varnam',
'scms']
Oops… We note that no Dhrupad Bandish dataset is available in mirdata. Therefore, we will need to refer to the Carnatic and Hindustani corpora in Dunya. Let’s get an audio example from Dunya using the Corpora class. As mentioned before, you need a personal and non-shareable token to access the data in Dunya. Within the context of this tutorial, we provide here the snippet of code that we used beforehand to retrieve the audio from the Dunya database.
If we look at the folder compiam/models/structure/dhrupad_bandish_segmentation/audio_original within the installed compiam package, we can see a .pdf file with the details of the files that form the dataset for the Dhrupad Segmentation tool. That .pdf file is also found in the original repository of the tool. We select the following example:
# UNCOMMENT AND RUN THIS CODE WITH YOUR PERSONAL TOKEN
#from compiam import load_corpora
#corpora = load_corpora(
#    "hindustani",
#    cc=False,  # Indicating we import the private collection
#    token="<your-access-token>",
#)
#corpora.download_mp3(
#    "59c88c32-0bde-433b-b194-0f65281e5714",
#    os.path.join("..", "audio")
#)
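Because the Dunya token is personal and non-shareable, it may be preferable not to paste it into the notebook at all. The sketch below reads it from an environment variable instead; the variable name `DUNYA_TOKEN` and the helper `get_dunya_token` are hypothetical choices, not something defined by compiam or Dunya.

```python
import os


def get_dunya_token(env_var="DUNYA_TOKEN"):
    """Read the access token from an environment variable (name is hypothetical)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your personal Dunya token."
        )
    return token


# The returned value could then be passed as the token argument shown above:
# token = get_dunya_token()
```

This keeps the token out of saved notebooks and version control.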
Let’s make sure the audios are actually downloaded in the audio folder.
%ls ../audio
59c88c32-0bde-433b-b194-0f65281e5714.mp3 mir_datasets/ separation/
demos/ pattern_finding/ testing_samples/
Cool! There it is. Let’s now run the spleeter separation on this track.
# Separating file
separator.separate_to_file(
os.path.join(
"..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714.mp3"
),
os.path.join("..", "audio")
)
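Before using the separated audio, it can help to confirm where the stems were written. The paths used in this walkthrough follow the pattern of one folder per input file, with one .wav file per stem. The sketch below builds that path and checks that it exists; it assumes that output layout, and `expected_stem_path` is a hypothetical helper name.

```python
import os


def expected_stem_path(audio_path, output_dir, stem="vocals"):
    """Build <output_dir>/<audio file name without extension>/<stem>.wav."""
    track_name = os.path.splitext(os.path.basename(audio_path))[0]
    return os.path.join(output_dir, track_name, f"{stem}.wav")


audio_path = os.path.join(
    "..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714.mp3"
)
vocals_path = expected_stem_path(audio_path, os.path.join("..", "audio"))
if os.path.exists(vocals_path):
    print("Found separated vocals at", vocals_path)
else:
    print("No file at", vocals_path, "(check that the separation step ran)")
```

If the file is missing, the later inference and loading steps will fail with a file-not-found error, so this check is a cheap way to catch the problem early.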
Separation done! We can now run inference with the segmentation model on the source separated signal.
dbs.predict_stm(
input_data=os.path.join(
"..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714", "vocals.wav"
)
)
We can observe the estimated sections (given rhythmic characteristics) in the output image. The x axis provides the actual time-stamps, in seconds, for each estimation.
As a final experiment, let’s listen to the file that was source-separated using spleeter.
import IPython.display as ipd
import librosa
vocals, sr = librosa.load(
os.path.join(
"..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714", "vocals.wav"
),
)
ipd.Audio(
data=vocals[-sr*30:], # Taking only the last 30 seconds
rate=sr,
)
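The negative slice used above keeps the last 30 seconds of the signal. As a small sketch, the same idea can be written as a helper that takes the duration as a parameter; the name `last_seconds` is illustrative, and the function works on any sliceable 1-D sequence, including plain lists and NumPy arrays.

```python
def last_seconds(signal, sr, seconds=30):
    """Return the final `seconds` of a 1-D signal, or all of it if it is shorter."""
    n_samples = int(sr * seconds)
    if n_samples <= 0:
        return signal[:0]  # empty excerpt for a non-positive duration
    return signal[-n_samples:]


# Toy example: 10 samples at 2 samples per second, keep the last 3 seconds
excerpt = last_seconds(list(range(10)), sr=2, seconds=3)
print(excerpt)
```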
A note on source separation for Indian Art Music
As you may have noticed in the audio displayed above, even though spleeter is among the best out-of-the-box source separation models available, some considerations need to be taken into account regarding music source separation for Indian Art Music signals.
First of all, although spleeter is trained on a massive amount of recordings, we can safely assume that Carnatic and Hindustani music are not well represented in the training set (this also applies to the other out-of-the-box source separation models out there). In that sense, these models are expected to struggle to generalize to the instruments and arrangements specific to Indian Art Music, which may cause abnormally low interference removal performance. The predominance of melodic monophonic accompaniment instruments in both Carnatic (the violin being the most common case) and Hindustani (harmonium in this case) music, the tambura drone, the pitched percussion… These are high-level examples of elements that may prevent the standardized source separation models from generalizing properly to Indian Art Music signals.
What is more, the standardized source separation models target either vocals and accompaniment, or vocals, bass, drums, and other. While separating the singing voice from the accompaniment is reasonable, the 4-stem configuration is far from representative of actual Carnatic and Hindustani music arrangements.
As a final note, another factor currently blocking research on music source separation for Indian Art Music is the shortage of available datasets for this task. We have observed that the Saraga Carnatic collection has multi-track audio, but it contains leakage (it was recorded in live performances). In that case, a leakage-aware approach would be needed to use this data. Alternatively, a music source separation dataset with completely isolated and aligned tracks, which to the best of our knowledge is unavailable as of now, would open the door to music source separation research on Indian Art Music.
Nov 2023 Update: A Carnatic-specific singing voice separation model has been developed and presented at ISMIR 2023 in Milan, Italy. See the separation walkthrough for an example.