Exploring Carnatic Performance#

Thomas Nuttall, Genís Plaja-Roglans, Lara Pearson, Brindha Manickavasakan, Xavier Serra.

This notebook demonstrates the wide range of tools available as part of the compIAM package. We demonstrate their use on a single performance from the Saraga Audiovisual Dataset, the multi-modal portion of the wider Saraga Dataset [SGRS20]. The tools showcased here are not accompanied by exhaustive usage documentation; that can be found on their respective pages in other parts of this webbook, linked in each section.

1. Import Dependencies and Data#

Due to restrictions in accessing the Saraga API from the GitHub-hosted webbook, we access the data through a custom shared Google Drive created specifically for this tutorial. Users wishing to work with audio from Saraga should follow the instructions here.

## Install (if missing) and import compiam and its dependencies
import importlib.util
%pip install -U compiam==0.4.1  # Install the compIAM version used in this tutorial
if importlib.util.find_spec('essentia') is None:
    %pip install essentia
if importlib.util.find_spec('torch') is None:
    %pip install "torch==1.13"
if importlib.util.find_spec('tensorflow') is None:
    %pip install "tensorflow==2.15.0" "keras<3"

import compiam
import essentia.standard as estd

# Import extras and suppress warnings to keep the tutorial clean
import os
import shutil
import gdown
import zipfile

import numpy as np
import IPython.display as ipd
import matplotlib.pyplot as plt

from pprint import pprint

import warnings
warnings.filterwarnings('ignore')

AUDIO_PATH = os.path.join("..", "audio", "demos")

Since Saraga Audiovisual is a brand new dataset, recently published at ISMIR 2024 (San Francisco, USA), it is not yet available through mirdata and compIAM. Instead, we manually download and load an example concert recording from this dataset: a concert performed by Dr. Brindha Manickavasakan during the December Season in Chennai in 2023. Dr. Manickavasakan is also a collaborator in the ongoing efforts on the computational melodic analysis of Carnatic Music, a collaboration between a group of researchers from the Music Technology Group and Dr. Lara Pearson from the Max Planck Institute for Empirical Aesthetics.

url = "https://drive.google.com/uc?id=1iR0bfxDLQbH8fEeHU_GFsg2kh7brZ0HZ&export=download"
output =  os.path.join(AUDIO_PATH,  "dr-brindha-manickavasakan.zip")
gdown.download(url, output, quiet=False)
# Unzip file
with zipfile.ZipFile(output, 'r') as zip_ref:
    zip_ref.extractall(AUDIO_PATH)

# Delete zip file after extraction
os.remove(output)

Alright, the data is downloaded and uncompressed. Let’s get the path to it and analyse a rendition from the concert.

2. Loading and visualising the data#

We work with a single performance from a concert by Brindha Manickavasakan at the Arkay Convention Center, recorded in 2023 in Chennai, South India. The composition is Bhavanuta by Tyaagaraaja in raga mohanam.

rendition = "Bhavanuta"
folder_path = os.path.join(AUDIO_PATH, 'dr-brindha-manickavasakan', rendition)

For hundreds of performances in the Saraga dataset, audio stems corresponding to each instrument/performer are available. In this performance, these comprise the lead vocal, the mridangam (left and right microphones), the violin, and the tanpura. The full mix of all instruments is also available.

Let us select the preview versions of the multitrack audio, which are shortened and compressed versions of the rendition, for easier handling in this visualisation.

audio_path_pre = os.path.join(folder_path, "preview", f"{rendition}.mp3")
mrid_left_path_pre = os.path.join(folder_path, "preview", f"{rendition}.mridangam-left.mp3")
mrid_right_path_pre = os.path.join(folder_path, "preview", f"{rendition}.mridangam-right.mp3")
violin_path_pre = os.path.join(folder_path, "preview", f"{rendition}.multitrack-violin.mp3")
vocal_path_pre = os.path.join(folder_path, "preview", f"{rendition}.multitrack-vocal.mp3")
tanpura_path_pre = os.path.join(folder_path, "preview", f"{rendition}.tanpura.mp3")

2.1 Multitrack player#

We can use the compIAM waveform player to visualise and listen to all of the tracks at the same time, panning, or changing the volume of each as required.

from compiam.visualisation.waveform_player import Player
# list of paths to load and listen
all_audio_paths = [
    vocal_path_pre,
    violin_path_pre,
    mrid_left_path_pre,
    mrid_right_path_pre,
    tanpura_path_pre
]
# List of labels for each path
all_names = ["Vocal", "Violin", "Mridangam left", "Mridangam right", "Tanpura"]
Player(all_names, all_audio_paths)

2.2 Video and Gesture Tracks#

The Saraga Audiovisual dataset includes videos of the performances, together with gesture tracks for the lead performer extracted using MMPose [3]. Let’s take a look at a sample for this performance.

import cv2
import IPython.display as ipd
from IPython.core.display import HTML
vid_out_path =  f'{folder_path}/output_segment.mp4'
# Load keypoints and scores
keypoints_file = f"{folder_path}/singer/Brindha_Manickavasakan_Segment1_0-513_kpts.npy"
scores_file = f"{folder_path}/singer/Brindha_Manickavasakan_Segment1_0-513_scores.npy"
video_file = f"{folder_path}/{rendition}.mov"  # Performance video from the downloaded concert folder
keypoints = np.load(keypoints_file)
scores = np.load(scores_file)
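# (Optional check, not part of the original notebook) Inspect the loaded arrays:
# keypoints holds per-frame (x, y) coordinates, scores the per-keypoint confidences.
print("keypoints shape:", keypoints.shape, "| scores shape:", scores.shape)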

# Skeleton connections for the body keypoints of the MMPose output
skeleton = [
    (0, 1), (1, 2),      # Eyes (left to right)
    (0, 3), (0, 4),      # Nose to ears (left and right)
    (5, 6),              # Shoulders (left and right)
    (5, 7), (7, 9),      # Left arm (shoulder -> elbow -> wrist)
    (6, 8), (8, 10),     # Right arm (shoulder -> elbow -> wrist)
    (11, 12),            # Hips (left and right)
    (5, 11), (6, 12),    # Shoulders to hips
    (11, 13), (13, 15),  # Left leg (hip -> knee -> ankle)
    (12, 14), (14, 16)   # Right leg (hip -> knee -> ankle)
]
# Open video file
cap = cv2.VideoCapture(video_file)
fps = int(cap.get(cv2.CAP_PROP_FPS))  # Frames per second
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Define start and end frames for the 20-second segment
start_time = 10  # Start time in seconds (adjust as needed)
end_time = start_time + 20  # End time in seconds
start_frame = int(start_time * fps)
end_frame = int(end_time * fps)

# Output video writer
out = cv2.VideoWriter(vid_out_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))

# Process the selected frames
frame_idx = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    if start_frame <= frame_idx < end_frame:
        # Get keypoints and scores for the current frame
        if frame_idx < len(keypoints):
            frame_keypoints = keypoints[frame_idx]
            frame_scores = scores[frame_idx]

            # Draw keypoints and skeleton
            for i, (x, y) in enumerate(frame_keypoints):
                # Only draw if confidence score is above threshold
                if frame_scores[i] > 0.5:  # Adjust threshold as needed
                    cv2.circle(frame, (int(x), int(y)), 5, (0, 255, 0), -1)

            # Draw skeleton
            for connection in skeleton:
                start, end = connection
                if frame_scores[start] > 0.5 and frame_scores[end] > 0.5:
                    x1, y1 = frame_keypoints[start]
                    x2, y2 = frame_keypoints[end]
                    cv2.line(frame, (int(x1), int(y1)), (int(x2), int(y2)), (255, 0, 0), 2)

        # Write frame to output video
        out.write(frame)

    frame_idx += 1

    # Stop processing after the end frame
    if frame_idx >= end_frame:
        break

# Release resources
cap.release()
out.release()
cv2.destroyAllWindows()
print("20-second video segment processing complete. Output saved as 'output_segment.mp4'")
20-second video segment processing complete. Output saved as 'output_segment.mp4'
import subprocess
# Re-encode the segment (H.264 video, AAC audio) so it can be embedded and played in the notebook
subprocess.run([
    "ffmpeg", "-y", "-i", vid_out_path,
    "-vcodec", "libx264", "-acodec", "aac",
    vid_out_path.replace('.mp4', '_re-encoded.mp4'),
])
#video_html = f"""
#<video width="640" height="480" controls>
#  <source src="{vid_out_path}" type="video/mp4">
#  Your browser does not support the video tag.
#</video>
#"""
#ipd.display(HTML(video_html))

from IPython.core.display import Video

Video(vid_out_path.replace('.mp4', '_re-encoded.mp4'), embed=True)

3. Feature Extraction#

In this and the following sections we extract various audio and musical features from the raw performance audio: the singer’s tonic, the raga, the predominant pitch track of the lead vocal melody, source-separated vocal audio, akshara pulses, and repeated melodic patterns.

Let’s first get the path of the full and uncompressed mixture and vocal tracks.

audio_path = os.path.join(folder_path, f"{rendition}.wav")
vocal_path = os.path.join(folder_path, f"{rendition}.multitrack-vocal.wav")

3.1 Tonic Identification#

The tonic of the lead singer is useful for normalising pitch and for comparing across performers. Here we can use the TonicIndianMultiPitch tool, which is available through Essentia and compIAM.

# Importing the tool
from compiam.melody.tonic_identification import TonicIndianMultiPitch

# We first initialize the tool we have just imported
tonic_multipitch = TonicIndianMultiPitch()
tonic = tonic_multipitch.extract(audio_path)
print(f'Performer tonic: {round(tonic, 2)} Hz')
Performer tonic: 196.42 Hz

We can quickly listen to the estimated tonic on top of the original audio to perceptually evaluate whether it sounds reasonable for the chosen rendition.

# Let's get the audio for the track
sr = 44100
audio_mix = estd.MonoLoader(filename=audio_path, sampleRate=sr)()

# Let's synthesize a tambura
synthesized_tambura = 0.75*np.sin(
    2*np.pi*float(tonic)*np.arange(0, len(audio_mix)//sr, 1/sr)
)
# Adding some harmonics
synthesized_tambura += 0.25*np.sin(
    2*np.pi*float(tonic)*2*np.arange(0, len(audio_mix)//sr, 1/sr)
)
synthesized_tambura += 0.5*np.sin(
    2*np.pi*float(tonic)*3*np.arange(0, len(audio_mix)//sr, 1/sr)
)
synthesized_tambura += 0.125*np.sin(
    2*np.pi*float(tonic)*4*np.arange(0, len(audio_mix)//sr, 1/sr)
)

# We take just a minute of music (60 sec * 44100 samples)
audio_tonic = audio_mix[:60*sr] + synthesized_tambura[:60*sr]*0.02
# And we play it!
ipd.Audio(
    data=audio_tonic[None],
    rate=sr,
)

That sounds good! This is the tonic of the recording. It is really valuable information: it allows us to characterise and normalise the melodies, and may give relevant information about the artist and the performed concert.
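As a minimal sketch of what tonic normalisation looks like (not part of the original tutorial; the example frequency below is hypothetical), any pitch value in Hz can be expressed in cents above the tonic, making melodies comparable across singers with different tonics.

# Express a (hypothetical) pitch value in cents above the extracted tonic
example_pitch_hz = 2 * tonic  # one octave above the tonic
cents_above_tonic = 1200 * np.log2(example_pitch_hz / tonic)
print(f"{example_pitch_hz:.2f} Hz is {cents_above_tonic:.0f} cents above the tonic")  # -> 1200 cents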

For further reference, please visit the tonic identification page.

3.2 Raga Recognition#

Whilst raga metadata is available as part of Saraga, compIAM also contains a raga recognition tool, DeepSRGM, which we can use to automatically identify the raga. Be aware that this model was trained on ten ragas only: Bhairav, Madhukauns, Mōhanaṁ, Hamsadhvāni, Varāḷi, Dēś, Kamās, Yaman kalyāṇ, Bilahari, and Ahira bhairav, and hence it can only assign one of those classes.

from compiam import load_model

# This model uses torch in the backend!
deepsrgm = load_model("melody:deepsrgm")
# Computing features
feat = deepsrgm.get_features(vocal_path)
# Predict raga using subset of features from the very beginning of audio for faster prediction
predicted_raga = deepsrgm.predict(feat[:8])
print(f'Raga: {predicted_raga}')
Raga: Bilahari
We can also build the DeepSRGM input features ourselves: we estimate a predominant pitch track with FTANet-Carnatic, convert it to a tonic-normalised representation, and sample random subsequences from it to feed to the model.

from compiam import load_model
from compiam.melody.pitch_extraction import Melodia

melodia = Melodia()
ftanet_carnatic = load_model("melody:ftanet-carnatic")
PITCH_EXTRACTION_SR = 44100

# Predominant pitch track (in Hz) and tonic estimated from the full mixture
freqs = ftanet_carnatic.predict(audio_mix)[:, 1]
tonic = tonic_multipitch.extract(audio_mix)

k = 9
N = 200
new_feat = []

# Tonic-normalised pitch feature, as expected by DeepSRGM
feature = np.round(1200 * np.log2(freqs / tonic) * (k / 100)).clip(0)

if len(feature) <= 5000:
    raise ValueError("Audio signal is not long enough for a proper estimation. Please provide a longer audio.")
for i in range(N):
    c = np.random.randint(0, len(feature) - 5000)
    new_feat.append(feature[c : c + 5000])
new_feat = np.array(new_feat)

raga = deepsrgm.predict(new_feat[:8])  # Again, only take a few subsequences for a faster prediction
print(f'Raga: {raga}')
Raga: Bilahari

3.3 Music Source Separation#

Music source separation is the task of automatically estimating the individual elements in a musical mixture. Apart from its creative uses, it can be an important building block in research pipelines, acting as a very handy pre-processing step [PRMSS23]. To carefully analyse the singing voice and its components, it is normally beneficial to have it isolated from the rest of the instruments.

There are several models in the literature addressing this problem, most of them based on deep learning architectures. Some provide pre-trained weights, such as Demucs [DUBB19] or Spleeter [HKVM20], the latter being broadly used in Carnatic Music computational research. However, these systems have two problems: (1) their training data does not normally include Carnatic Music examples, so there are instruments and practices that are completely unseen by these models, and (2) they have a restricted set of target sources, namely vocals, bass, drums, and other, which does not fit Carnatic Music arrangements at all.

To address problem (1), there have been a few attempts to use the multi-stem data presented above to develop Carnatic-tailored source separation systems, although the available multi-stem recordings are captured from mixing consoles at live performances, and therefore the individual tracks are noisy (they have background leakage from the rest of the sources). We can test one example of these systems here: [PRMSS23].

from compiam import load_model

# This model uses tensorflow in the backend!
separation_model = load_model("separation:cold-diff-sep")
SEPARATION_SR = separation_model.sample_rate
audio_mix = estd.MonoLoader(filename=audio_path, sampleRate=SEPARATION_SR)()
separation_input = audio_mix[:SEPARATION_SR*30]  # Get 30s
separated_vocals = separation_model.separate(
    separation_input,
    input_sr=SEPARATION_SR
)
ipd.Audio(separated_vocals, rate=SEPARATION_SR)

Let’s try being a bit more restrictive with the configuration of the separation algorithm.

separated_vocals = separation_model.separate(
    separation_input,
    input_sr=SEPARATION_SR,
    clusters=8,
    scheduler=7,
)
ipd.Audio(separated_vocals, rate=SEPARATION_SR)
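If you want to keep the separated excerpt for further analysis, a minimal sketch for writing it to disk could look like this (assuming the soundfile package, a librosa dependency, is available in the environment; the output filename is arbitrary):

import soundfile as sf  # assumed to be available (librosa dependency)

# Write the 30-second separated vocal excerpt next to the rest of the rendition files
sf.write(
    os.path.join(folder_path, f"{rendition}.separated-vocal-30s.wav"),
    np.asarray(separated_vocals).squeeze(),  # ensure a plain 1D numpy array
    SEPARATION_SR,
)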

Although there is still a long way to go on this problem, the ongoing efforts to improve the separation of the singing voice (and also the rest of the instrumentation!) in Carnatic Music set an interesting baseline to build on top of.

For further reference, please visit the music source separation page.

3.4 Pitch Extraction#

The f0 pitch track of the predominant vocal melody has proved useful for a range of computational analysis tasks in Indian Art Music. We can extract it for our performance using Melodia [SG12], a broadly used knowledge-based method. We will also test a recently published deep learning model that achieves the same goal: FTA-Net, which has been trained specifically for the Carnatic Music use case and is also included in compIAM [PRNP+23].

from compiam import load_model
from compiam.melody.pitch_extraction import Melodia

melodia = Melodia() 
ftanet_carnatic = load_model("melody:ftanet-carnatic")
PITCH_EXTRACTION_SR = 44100

To extract an example pitch track, we first load the mixture recording and take a 30-second excerpt, and then run prediction with both methods. Once the methods are initialized, running pitch extraction is easily done in one line of code.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

audio_mix = estd.MonoLoader(filename=audio_path, sampleRate=PITCH_EXTRACTION_SR)()
prediction_input = audio_mix[PITCH_EXTRACTION_SR*60:PITCH_EXTRACTION_SR*90]

# Predominant extraction models
melodia_pitch_track = melodia.extract(prediction_input, input_sr=PITCH_EXTRACTION_SR)
ftanet_pitch_track = ftanet_carnatic.predict(prediction_input, input_sr=PITCH_EXTRACTION_SR)

Let’s now plot both pitch tracks and compare!

fig, ax = plt.subplots(nrows=1, ncols=1, sharex=True, figsize=(15, 12))
D = librosa.amplitude_to_db(np.abs(librosa.stft(prediction_input)), ref=np.max)
img = librosa.display.specshow(D, y_axis='linear', x_axis='time', sr=PITCH_EXTRACTION_SR, ax=ax);
ax.set_ylim(0, 2000)
ax.set_xlim(0, 8)  # Visualising 8 seconds
plt.plot(
    melodia_pitch_track[:, 0], melodia_pitch_track[:, 1],
    color="white", label="Melodia",
)
plt.plot(
    ftanet_pitch_track[:, 0], ftanet_pitch_track[:, 1],
    color="black",label="FTANet-Carnatic",
)
plt.legend()
plt.show()
[Figure: spectrogram of the excerpt with the Melodia (white) and FTANet-Carnatic (black) pitch tracks overlaid]
ipd.Audio(prediction_input[:8*PITCH_EXTRACTION_SR], rate=PITCH_EXTRACTION_SR)

Notice that the violin fools the Melodia algorithm, which captures many vocal pitch values one octave too high, while FTA-Net is able to get the right pitch values. This is a very common issue when analysing melody in the context of Carnatic Music: the violin shadowing the singing voice is an enormous challenge for vocal melody models and algorithms.
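To make this concrete, here is a small follow-up sketch (not part of the original tutorial) that compares the two estimates in cents on frames where both report a pitch; differences around +1200 cents correspond to frames where Melodia sits one octave above FTANet-Carnatic.

# Resample the Melodia track onto the FTANet time axis and compare in cents
melodia_on_ftanet_grid = np.interp(
    ftanet_pitch_track[:, 0],   # FTANet time axis
    melodia_pitch_track[:, 0],  # Melodia time axis
    melodia_pitch_track[:, 1],  # Melodia pitch values (Hz)
)
both_voiced = (ftanet_pitch_track[:, 1] > 0) & (melodia_on_ftanet_grid > 0)
diff_cents = 1200 * np.log2(
    melodia_on_ftanet_grid[both_voiced] / ftanet_pitch_track[both_voiced, 1]
)
print(f"Frames more than half an octave apart: {np.mean(np.abs(diff_cents) > 600):.1%}")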

Let’s now extract the entire pitch track from the available vocal stem.

pitch_track = ftanet_carnatic.predict(vocal_path, input_sr=PITCH_EXTRACTION_SR)
pitch = pitch_track[:, 1]  # Pitch in Hz
time  = pitch_track[:, 0]  # Time in seconds
timestep  = time[2] - time[1]  # Time in seconds between elements of pitch track

We interpolate small gaps owing to glottal stops or consonant sounds.

from compiam.utils.pitch import interpolate_below_length
pitch = interpolate_below_length(
    pitch,  # track to interpolate
    0,  # value to treat as a gap (here 0, i.e. unvoiced)
    200*0.001/timestep  # maximum gap to interpolate over, in number of sequence elements (here 200 ms)
)

Let’s visualise our pitch track using the plot_pitch function from compIAM.visualisation.pitch. We can manually change the yticks to correspond to theoretical svara positions by passing a custom dictionary of {ytick labels : y values}. Since we know the raga (mohanam, from the performance metadata) and the singer tonic (from section 3.1), we can use get_svara_pitch_carnatic to query the svaras relevant to that raga, and pass the resulting dictionary and the tonic to plot_pitch to alter the pitch plot.

from compiam.visualisation.pitch import plot_pitch, flush_matplotlib
from compiam.utils import ipy_audio
from compiam.utils import get_svara_pitch_carnatic
# let's load the audio also
audio = estd.MonoLoader(filename=audio_path, sampleRate=PITCH_EXTRACTION_SR)()
t1 = 304  # in seconds
t2 = 324  # in seconds
t1s = round(t1/timestep)  # in sequence elements
t2s = round(t2/timestep)  # in sequence elements
this_pitch = pitch[t1s:t2s]
this_time = time[t1s:t2s]
silence_mask = this_pitch == 0
svara_pitch = get_svara_pitch_carnatic('mohanam', tonic=tonic)
plot_pitch(
    this_pitch,
    this_time,
    mask=silence_mask,
    yticks_dict=svara_pitch,
    tonic=tonic,
    cents=True,
    title=f'Excerpt of {rendition} by Brindha Manickavasakan'
)
ipy_audio(audio, t1, t2, sr=PITCH_EXTRACTION_SR)
[Figure: pitch plot of the excerpt (304–324 s) with mohanam svara positions as y-ticks]

Pitch curves are a really important feature for the computational analysis of Carnatic Music. The prominent, ever-present ornamentation of notes, and the transitions between notes, are better represented with continuous pitch value arrays. However, since the lead melodic instruments are normally mixed with accompanying instruments, we require methods that are able to capture the melodies in the presence of background music.

Pitch curves, or tracks, can be used for many tasks that aim at extracting relevant melodic information from the performed melodies. Several classification and tagging tasks (e.g. raga classification) also build on top of melodic features, which normally include pitch information, whether explicitly or embedded in other representations.
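As a small illustration of such a feature (a sketch, not part of the original tutorial), we can compute a tonic-normalised, octave-folded pitch histogram from the vocal pitch track extracted above; histograms of this kind are a common starting point for raga-related analyses.

# Octave-folded pitch histogram in cents above the tonic, from voiced frames only
voiced_pitch = pitch[pitch > 0]
folded_cents = (1200 * np.log2(voiced_pitch / tonic)) % 1200

hist, edges = np.histogram(folded_cents, bins=120, range=(0, 1200), density=True)
plt.figure(figsize=(12, 3))
plt.bar(edges[:-1], hist, width=10)
plt.xlabel("Cents above tonic (folded to one octave)")
plt.ylabel("Density")
plt.title(f"Pitch histogram for {rendition}")
plt.show()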

For further reference, please visit the pitch extraction page.

4. Rhythm analysis: Percussion onset detection#

We now estimate the underlying akshara pulses of the rendition with the AksharaPulseTracker included in compIAM, and visualise the detected pulses on top of the audio waveform.

# We import the tool
from compiam.rhythm.meter import AksharaPulseTracker

# Let's initialize an instance of the tool
apt = AksharaPulseTracker()
predicted_aksharas = apt.extract(audio_path)
from compiam.visualisation.audio import plot_waveform
pulses = predicted_aksharas['aksharaPulses']
predicted_beats_dict = {
    time_step: idx for idx, time_step in enumerate(pulses)
}

# And we plot!
plot_waveform(
    input_data=audio_path,
    t1=272,
    t2=276,
    labels=predicted_beats_dict,
);
[Figure: waveform excerpt (272–276 s) with the predicted akshara pulses marked]
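As a quick follow-up sketch (not part of the original tutorial), we can look at the spacing of the detected pulses to get a rough sense of the underlying akshara rate.

# Inter-pulse intervals in seconds, and the implied akshara rate
ipis = np.diff(pulses)
print(f"Median inter-pulse interval: {np.median(ipis):.3f} s")
print(f"Implied akshara rate: {60 / np.median(ipis):.1f} pulses per minute")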

See some demos of related beat tracking and percussion pattern research done within the context of the CompMusic project.

from IPython.display import YouTubeVideo

YouTubeVideo("wvrGhXFXtv8", width=800, height=450)

5. Melodic analysis: Melodic pattern discovery#

Melodic patterns are important building blocks in Carnatic music. Here we search for repeated melodic patterns in the vocal line, combining features from a complex autoencoder (CAE) trained on Carnatic music with a self-similarity-based segmentation of the performance.

# Pattern Extraction for a Given Audio
from compiam import load_model

# Feature Extraction: CAE features
cae = load_model("melody:cae-carnatic")
# returns magnitude and phase
ampl, _ = cae.extract_features(vocal_path)
ampl
from compiam.melody.pattern import self_similarity
from compiam.utils.pitch import (
    extract_stability_mask,
    pitch_seq_to_cents,
)
pitch_cents = pitch_seq_to_cents(pitch, tonic=tonic)
stability_mask = extract_stability_mask(
    pitch=pitch_cents,  # pitch track
    min_stab_sec=1.0,  # minimum cummulative length of stable windows to warrant annotation
    hop_sec=0.2,  # hop length in seconds
    var=60,  # minimum variation from the mean in each window to be considered stable
    timestep=timestep  # time in seconds between consecutice elements in <pitch>
)
silence_mask = pitch==0
stability_mask = extract_stability_mask(
    pitch=pitch_cents,  # pitch track
    min_stab_sec=1.0,  # minimum cummulative length of stable windows to warrant annotation
    hop_sec=0.2,  # hop length in seconds
    var=60,  # minimum variation from the mean in each window to be considered stable
    timestep=timestep  # time in seconds between consecutice elements in <pitch>
)
t1 = 304  # in seconds
t2 = t1 + 10  # in seconds

t1s = round(t1/timestep)  # in sequence elements
t2s = round(t2/timestep)  # in sequence elements

this_pitch = pitch[t1s:t2s]
this_time = time[t1s:t2s]
this_silence_mask = silence_mask[t1s:t2s]
this_stable_mask = stability_mask[t1s:t2s]

# Get pitch plot
fig, ax = plot_pitch(
    this_pitch,
    this_time, 
    mask=this_silence_mask,
    tonic=tonic,
    yticks_dict=svara_pitch,
    cents=True,
    title=f'Excerpt of {rendition} by Brindha Manickavasakan'
)

# On alternative axis plot stable mask values
ax2 = ax.twinx()
ax2.plot(this_time, this_stable_mask, 'g', linewidth=1, alpha=1, linestyle='--')
ax2.set_yticks([0,1])
ax2.set_ylabel("Is stable region?")
    
# Accompanying audio
ipy_audio(audio, t1, t2, sr=sr)
[Figure: pitch plot of the excerpt with the stable-region mask shown on a secondary axis]
exclusion_mask = np.logical_or(silence_mask==1, stability_mask==1)
# Mask of regions (silences and stable svaras) that we are not interested in when searching for patterns
ss = self_similarity(
    ampl,  # features
    exclusion_mask=exclusion_mask,  # exclusion mask
    timestep=timestep,  # time in seconds between elements of exclusion mask
    hop_length=cae.hop_length,  # window size in audio frames
    sr=cae.sr  # sample rate of audio
)

# Sparsely computed self similarity matrix 
X = ss[0]
# Mapping of index between theoretical full matrix and sparse one
orig_sparse_lookup = ss[1]
# Mapping of index between sparse matrix and theoretical full matrix one
sparse_orig_lookup = ss[2]
# Indices of boundaries between split regions in full matrix
boundaries_orig = ss[3]
# Indices of boundaries between split regions in sparse matrix
boundaries_sparse = ss[4]
fig, ax = plt.subplots(figsize=(10,10))
plt.title(f'Self similarity matrix for {rendition}', fontsize=9)
ax.imshow(X[2000:5000,2000:5000], interpolation='nearest')
plt.axis('off')
plt.tight_layout()
plt.show()
[Figure: self-similarity matrix for Bhavanuta (excerpt of the full matrix)]
from compiam.melody.pattern import segmentExtractor
# Emphasize Diagonal
se = segmentExtractor(
    X,  # self sim matrix
    window_size=cae.hop_length,  # window size
    cache_dir='.cache/'  # cache directory for faster computation in future
)
for i in np.arange(0.05, 0.15, 0.01):
    X_proc = se.emphasize_diagonals(bin_thresh=i)
    se.display_matrix(X_proc[2000:5000,2000:5000], title=f'bin_thresh={round(i,2)}', figsize=(5,5))
[Figures: diagonal-emphasised self-similarity matrices for bin_thresh values from 0.05 to 0.14]
X_proc = se.emphasize_diagonals(bin_thresh=0.11)
all_segments = se.extract_segments(
    timestep=timestep,  # time in seconds between pitch track elements
    boundaries=boundaries_sparse,  # boundaries of sparse regions (for conversion)
    lookup=sparse_orig_lookup,  # to convert between sparse and true indices
    break_mask=exclusion_mask)  # mask of break points; any segment that traverses these points is broken in two
print("Format: [(x0, y0), (x1, y1)]...")
all_segments[:10]
Format: [(x0, y0), (x1, y1)]...
[[(875, 1025), (933, 1079)],
 [(1025, 875), (1079, 933)],
 [(5956, 6232), (6032, 6305)],
 [(6233, 5956), (6305, 6032)]]
from compiam.utils import add_center_to_mask
exclusion_mask_center = add_center_to_mask(exclusion_mask)  # center of masked regions is annotated as "2"
anchor_mask = np.array([1 if i==2 else 0 for i in exclusion_mask_center])
# Returns patterns in units of pitch sequence elements
starts_seq, lengths_seq = se.group_segments(
    all_segments,  # segments from se.extract_segments()
    anchor_mask,  # Extend patterns to these points
    pitch,  # pitch track
    min_pattern_length_seconds=2,  # minimum pattern length,
    thresh_dtw=None
)
starts_sec = [[x*timestep for x in p] for p in starts_seq]
lengths_sec = [[x*timestep for x in l] for l in lengths_seq]
print(f"Number of groups: {len(starts_sec)}")
Number of groups: 2
from compiam.visualisation.pitch import plot_subsequence
# kwargs for plot_pitch
plot_kwargs = {
    'yticks_dict': svara_pitch,
    'cents':True,
    'tonic':tonic,
    'figsize':(15,4)
}
i = 1 # Choose pattern group

S = starts_seq[i] # get group
L = lengths_seq[i] # get lengths

for j,s in enumerate(S):
    l = L[j] # this pattern length
    ss = starts_sec[i][j] # this pattern start in seconds
    ls = lengths_sec[i][j] # this pattern length in seconds
    ipd.display(ipy_audio(audio, ss, ss+ls, sr=sr)) # display audio
    # display pitch plot
    plot_subsequence(s, l, pitch, time, timestep, path=None, plot_kwargs=plot_kwargs)
    plt.show()
[Figures: pitch plots for the occurrences in the selected pattern group]
# Finally, delete the downloaded concert data to free up space
full_concert_path = os.path.join(AUDIO_PATH, 'dr-brindha-manickavasakan')
shutil.rmtree(full_concert_path)