A collection of TensorFlow models for Essentia
16 Jan 2020
Note: the models were updated on 2020-07-08 due to a new naming convention and some upgrades. See the full CHANGELOG for more details.
In our last post, we introduced the TensorFlow wrapper for Essentia. It can be used with virtually any TensorFlow model and here we present a collection of models we supply with Essentia out of the box.
First, we prepared a set of pre-trained auto-tagging models achieving state-of-the-art performance. Then, we used those models as feature extractors to generate a set of transfer learning classifiers trained on our in-house datasets.
Along with the models, we provide helper algorithms that make it possible to run predictions in just a couple of lines of code. We also include C++ extractors that can be built statically and used as standalone executables.
Supplied models
Auto-tagging models
Our auto-tagging models were pre-trained as part of Jordi Pons’ Ph.D. thesis [1]. They were trained on two research datasets, MSD and MTT, to predict the top 50 tags in each. Check the original repository for more details about the training process.
The following table shows the available architectures and the datasets used for training. For every model, its complexity is reported in terms of the number of trainable parameters. The models were trained and evaluated using the standardized train/test splits proposed for these datasets in research, and their performance is reported in terms of ROC-AUC and PR-AUC obtained on the test splits. Additionally, the table provides download links for the models and README files containing the output labels, the names of the relevant layers, and some details about the training datasets.
Architecture | Dataset | Params. | ROC-AUC | PR-AUC | Model | README |
---|---|---|---|---|---|---|
MusiCNN | MSD | 790k | 0.88 | 0.29 | msd-musicnn | README |
MusiCNN | MTT | 790k | 0.91 | 0.38 | mtt-musicnn | README |
VGG | MSD | 605k | 0.88 | 0.28 | msd-vgg | README |
VGG | MTT | 605k | 0.90 | 0.38 | mtt-vgg | README |
Transfer learning classifiers
In transfer learning, a model is first trained on a source task (typically one with a larger amount of available data) so that the acquired knowledge can be leveraged for a smaller target task. In this case, we used the aforementioned MusiCNN models and a large VGG-like model (VGGish) as source tasks. As target tasks, we considered our in-house classification datasets listed below. We use the penultimate layer of the source-task models as a feature extractor for small classifiers consisting of two fully connected layers.
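To give an idea of what such a classifier head looks like, here is a minimal Keras sketch. The embedding size, number of hidden units, and training settings are placeholders, not the exact values used for the released models:
import tensorflow as tf

# Assumed sizes: the embedding dimensionality and hidden units are
# placeholders, not the exact values used for the released models.
embedding_size = 200   # size of the source model's penultimate layer
n_classes = 8          # e.g., genre_rosamerica has 8 classes

classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='relu', input_shape=(embedding_size,)),
    tf.keras.layers.Dense(n_classes, activation='softmax'),
])
classifier.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# embeddings: (n_examples, embedding_size) features from the source model
# labels: one-hot encoded targets
# classifier.fit(embeddings, labels, epochs=..., validation_split=...)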
The following tables present all trained classifiers grouped into genre, mood, and miscellaneous task categories. For each classifier, the source task name is a combination of the architecture and the training dataset used, and its complexity is reported in terms of the number of trainable parameters. The performance of each model is measured as normalized accuracy (Acc.) in a 5-fold cross-validation experiment conducted before the final classifier was trained on all data (a sketch of this kind of evaluation follows the tables).
Genre
Source task | Target task | Params. | Acc. | Model | README |
---|---|---|---|---|---|
MusiCNN_MSD | genre_dortmund | 790k | 0.51 | genre_dortmund-musicnn-msd | README |
MusiCNN_MSD | genre_electronic | 790k | 0.95 | genre_electronic-musicnn-msd | README |
MusiCNN_MSD | genre_rosamerica | 790k | 0.92 | genre_rosamerica-musicnn-msd | README |
MusiCNN_MSD | genre_tzanetakis | 790k | 0.83 | genre_tzanetakis-musicnn-msd | README |
MusiCNN_MTT | genre_dortmund | 790k | 0.44 | genre_dortmund-musicnn-mtt | README |
MusiCNN_MTT | genre_electronic | 790k | 0.71 | genre_electronic-musicnn-mtt | README |
MusiCNN_MTT | genre_rosamerica | 790k | 0.92 | genre_rosamerica-musicnn-mtt | README |
MusiCNN_MTT | genre_tzanetakis | 790k | 0.80 | genre_tzanetakis-musicnn-mtt | README |
VGGish_AudioSet | genre_dortmund | 62M | 0.52 | genre_dortmund-vggish-audioset | README |
VGGish_AudioSet | genre_electronic | 62M | 0.93 | genre_electronic-vggish-audioset | README |
VGGish_AudioSet | genre_rosamerica | 62M | 0.94 | genre_rosamerica-vggish-audioset | README |
VGGish_AudioSet | genre_tzanetakis | 62M | 0.86 | genre_tzanetakis-vggish-audioset | README |
Mood
Source task | Target task | Params. | Acc. | Model | README |
---|---|---|---|---|---|
MusiCNN_MSD | mood_acoustic | 790k | 0.90 | mood_acoustic-musicnn-msd | README |
MusiCNN_MSD | mood_aggressive | 790k | 0.95 | mood_aggressive-musicnn-msd | README |
MusiCNN_MSD | mood_electronic | 790k | 0.95 | mood_electronic-musicnn-msd | README |
MusiCNN_MSD | mood_happy | 790k | 0.81 | mood_happy-musicnn-msd | README |
MusiCNN_MSD | mood_party | 790k | 0.89 | mood_party-musicnn-msd | README |
MusiCNN_MSD | mood_relaxed | 790k | 0.90 | mood_relaxed-musicnn-msd | README |
MusiCNN_MSD | mood_sad | 790k | 0.86 | mood_sad-musicnn-msd | README |
MusiCNN_MTT | mood_acoustic | 790k | 0.93 | mood_acoustic-musicnn-mtt | README |
MusiCNN_MTT | mood_aggressive | 790k | 0.96 | mood_aggressive-musicnn-mtt | README |
MusiCNN_MTT | mood_electronic | 790k | 0.91 | mood_electronic-musicnn-mtt | README |
MusiCNN_MTT | mood_happy | 790k | 0.79 | mood_happy-musicnn-mtt | README |
MusiCNN_MTT | mood_party | 790k | 0.92 | mood_party-musicnn-mtt | README |
MusiCNN_MTT | mood_relaxed | 790k | 0.88 | mood_relaxed-musicnn-mtt | README |
MusiCNN_MTT | mood_sad | 790k | 0.85 | mood_sad-musicnn-mtt | README |
VGGish_AudioSet | mood_acoustic | 62M | 0.94 | mood_acoustic-vggish-audioset | README |
VGGish_AudioSet | mood_aggressive | 62M | 0.98 | mood_aggressive-vggish-audioset | README |
VGGish_AudioSet | mood_electronic | 62M | 0.93 | mood_electronic-vggish-audioset | README |
VGGish_AudioSet | mood_happy | 62M | 0.86 | mood_happy-vggish-audioset | README |
VGGish_AudioSet | mood_party | 62M | 0.91 | mood_party-vggish-audioset | README |
VGGish_AudioSet | mood_relaxed | 62M | 0.89 | mood_relaxed-vggish-audioset | README |
VGGish_AudioSet | mood_sad | 62M | 0.89 | mood_sad-vggish-audioset | README |
Miscellaneous
Source task | Target task | Params. | Acc. | Model | README |
---|---|---|---|---|---|
MusiCNN_MSD | danceability | 790k | 0.93 | danceability-musicnn-msd | README |
MusiCNN_MSD | gender | 790k | 0.88 | gender-musicnn-msd | README |
MusiCNN_MSD | tonal_atonal | 790k | 0.60 | tonal_atonal-musicnn-msd | README |
MusiCNN_MSD | voice_instrumental | 790k | 0.98 | voice_instrumental-musicnn-msd | README |
MusiCNN_MTT | danceability | 790k | 0.91 | danceability-musicnn-mtt | README |
MusiCNN_MTT | gender | 790k | 0.87 | gender-musicnn-mtt | README |
MusiCNN_MTT | tonal_atonal | 790k | 0.91 | tonal_atonal-musicnn-mtt | README |
MusiCNN_MTT | voice_instrumental | 790k | 0.98 | voice_instrumental-musicnn-mtt | README |
VGGish_AudioSet | danceability | 62M | 0.94 | danceability-vggish-audioset | README |
VGGish_AudioSet | gender | 62M | 0.84 | gender-vggish-audioset | README |
VGGish_AudioSet | tonal_atonal | 62M | 0.97 | tonal_atonal-vggish-audioset | README |
VGGish_AudioSet | voice_instrumental | 62M | 0.98 | voice_instrumental-vggish-audioset | README |
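For reference, a normalized (class-balanced) accuracy of the kind reported above can be estimated with a cross-validation procedure along these lines. This is only a sketch using scikit-learn with random placeholder data; it is not the exact evaluation code, and the classifier settings are assumptions:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for source-task embeddings and labels
embeddings = np.random.rand(400, 200)   # (n_examples, n_features)
labels = np.random.randint(0, 8, 400)   # (n_examples,)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(embeddings, labels):
    clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
    clf.fit(embeddings[train_idx], labels[train_idx])
    scores.append(balanced_accuracy_score(labels[test_idx], clf.predict(embeddings[test_idx])))

print('normalized accuracy: {:.2f}'.format(np.mean(scores)))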
Architecture details
- MusiCNN is a musically-motivated Convolutional Neural Network. It uses vertical and horizontal convolutional filters aiming to capture timbral and temporal patterns, respectively. The model contains 6 layers and 790k parameters.
- VGG is an architecture from computer vision based on a deep stack of commonly used 3x3 convolutional filters. It contains 5 layers with 128 filters each. Batch normalization and dropout are applied before each layer. The model has 605k trainable parameters. We are using the implementation by Jordi Pons.
- VGGish [2, 3] follows the configuration E from the original implementation for computer vision, with the difference that the number of output units is set to 3087. This model has 62 million parameters.
Datasets details
- MSD contains 200k tracks from the train set of the publicly available Million Song Dataset (MSD), annotated with tags from Last.fm. Only the top 50 tags are used.
- MTT contains 25k tracks from Magnatune with tags assigned by human annotators. Only the top 50 tags are used.
- AudioSet contains 1.8 million audio clips from YouTube annotated with the AudioSet taxonomy, which is not specific to music.
- MTG in-house datasets are a collection of small, highly curated datasets used for training classifiers. A set of SVM classifiers based on these datasets is also available.
Dataset | Classes | Size |
---|---|---|
genre_dortmund | alternative, blues, electronic, folkcountry, funksoulrnb, jazz, pop, raphiphop, rock | 1820 |
genre_gtzan | blues, classic, country, disco, hip hop, jazz, metal, pop, reggae, rock | 1000 |
genre_rosamerica | classic, dance, hip hop, jazz, pop, rhythm and blues, rock, speech | 400 |
genre_electronic | ambient, dnb, house, techno, trance | 250 |
mood_acoustic | acoustic, not acoustic | 321 |
mood_electronic | electronic, not electronic | 332 |
mood_aggressive | aggressive, not aggressive | 280 |
mood_relaxed | not relaxed, relaxed | 446 |
mood_happy | happy, not happy | 302 |
mood_sad | not sad, sad | 230 |
mood_party | not party, party | 349 |
danceability | danceable, not danceable | 306 |
voice_instrumental | voice, instrumental | 1000 |
gender | female, male | 3311 |
tonal_atonal | atonal, tonal | 345 |
Helper algorithms
Algorithms for MusiCNN- and VGG-based models:
- TensorflowInputMusiCNN. Computes mel-bands with a particular parametrization specific to MusiCNN-based models.
- TensorflowPredictMusiCNN. Makes predictions using MusiCNN models.
Algorithms for VGGish-based models:
- TensorflowInputVGGish. Computes mel-bands with a particular parametrization specific to VGGish-based models.
- TensorflowPredictVGGish. Makes predictions using VGGish models.
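These input algorithms can also be used on their own to inspect the features fed to the models. The sketch below computes MusiCNN-style mel-bands frame by frame; the frame and hop sizes (512 and 256 samples) are our assumption about the expected parametrization, so check the algorithm reference before relying on them:
import numpy as np
from essentia.standard import MonoLoader, FrameGenerator, TensorflowInputMusiCNN

sr = 16000
audio = MonoLoader(filename='/your/amazing/song.wav', sampleRate=sr)()

mel_extractor = TensorflowInputMusiCNN()
# Assumed frame/hop sizes for the MusiCNN-style mel-bands
bands = np.array([mel_extractor(frame)
                  for frame in FrameGenerator(audio, frameSize=512, hopSize=256)])
print(bands.shape)  # (n_frames, n_mel_bands)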
Usage examples
Let’s exemplify some use cases.
Auto-tagging
Here we replicate the example from our previous post. With the new algorithms, the code is reduced to just a few lines!
import numpy as np
from essentia.standard import *
msd_labels = ['rock','pop','alternative','indie','electronic','female vocalists','dance','00s','alternative rock','jazz','beautiful','metal','chillout','male vocalists','classic rock','soul','indie rock','Mellow','electronica','80s','folk','90s','chill','instrumental','punk','oldies','blues','hard rock','ambient','acoustic','experimental','female vocalist','guitar','Hip-Hop','70s','party','country','easy listening','sexy','catchy','funk','electro','heavy metal','Progressive rock','60s','rnb','indie pop','sad','House','happy']
# Our models take audio streams at 16kHz
sr = 16000
# Instantiate a MonoLoader and run it in the same line
audio = MonoLoader(filename='/your/amazing/song.wav', sampleRate=sr)()
# Instantiate the tagger and pass it the audio
predictions = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb')(audio)
# Retrieve the top_n tags
top_n = 3
# The shape of the predictions matrix is [n_patches, n_labels]
# Take advantage of NumPy to average them over the time axis
averaged_predictions = np.mean(predictions, axis=0)
# Sort the predictions and get the top N
for i, l in enumerate(averaged_predictions.argsort()[-top_n:][::-1], 1):
    print('{}: {}'.format(i, msd_labels[l]))
1: electronic
2: chillout
3: ambient
Classification
In this example, we are using our genre_rosamerica classifier based on the VGGish embeddings. Note that this time we use TensorflowPredictVGGish instead of TensorflowPredictMusiCNN so that the model is fed the correct input features.
import numpy as np
from essentia.standard import *
labels = ['classic', 'dance', 'hip hop', 'jazz',
'pop', 'rnb', 'rock', 'speech']
sr = 16000
audio = MonoLoader(filename='/your/amazing/song.wav', sampleRate=sr)()
predictions = TensorflowPredictVGGish(graphFilename='genre_rosamerica-vggish-audioset-1.pb')(audio)
# Average predictions over the time axis
predictions = np.mean(predictions, axis=0)
order = predictions.argsort()[::-1]
for i in order:
    print('{}: {:.3f}'.format(labels[i], predictions[i]))
hip hop: 0.411
dance: 0.397
jazz: 0.056
pop: 0.053
rnb: 0.051
rock: 0.011
classic: 0.004
speech: 0.001
Latent feature extraction
As the last example, let’s use one of the models as a feature extractor by retrieving the output of the penultimate layer. This is done by setting the output
parameter in the predictor algorithm.
A list of the supported output layers is available in the README
files supplied with the models.
from essentia.standard import MonoLoader, TensorflowPredictMusiCNN
sr = 16000
audio = MonoLoader(filename='/your/amazing/song.wav', sampleRate=sr)()
# Retrieve the output of the penultimate layer
penultimate_layer = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb', output='model/dense/BiasAdd')(audio)
To get an idea of what these features look like, they can be plotted as a matrix of activations over time.
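A quick way to do this with matplotlib, reusing the penultimate_layer array from the snippet above (purely illustrative, not the original figure):
import matplotlib.pyplot as plt

# penultimate_layer has shape (n_patches, n_units)
plt.imshow(penultimate_layer.T, aspect='auto', origin='lower')
plt.xlabel('patch index (time)')
plt.ylabel('unit')
plt.title('msd-musicnn penultimate-layer activations')
plt.colorbar()
plt.show()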
References
[1] Pons, J., & Serra, X. (2019). musicnn: Pre-trained convolutional neural networks for music audio tagging. arXiv preprint arXiv:1909.06654.
[2] Gemmeke, J. F., et al. (2017). Audio Set: An ontology and human-labeled dataset for audio events. ICASSP 2017.
[3] Hershey, S., et al. (2017). CNN architectures for large-scale audio classification. ICASSP 2017.