A collection of TensorFlow models for Essentia

Note: the models were updated on 2020-07-08 due to a new naming convention and some upgrades. See the full CHANGELOG for more details.

In our last post, we introduced the TensorFlow wrapper for Essentia. It can be used with virtually any TensorFlow model, and here we present a collection of models that we supply with Essentia out of the box.

First, we prepared a set of pre-trained auto-tagging models achieving state-of-the-art performance. Then, we used those models as feature extractors to build a set of transfer learning classifiers trained on our in-house datasets.

Along with the models, we provide helper algorithms that make it possible to run predictions in just a couple of lines of code. We also include C++ extractors that can be built statically and used as standalone executables.

Supplied models

Auto-tagging models

Our auto-tagging models were pre-trained as part of Jordi Pons’ Ph.D. thesis [1]. They were trained on two research datasets, the Million Song Dataset (MSD) and MagnaTagATune (MTT), to predict the top 50 tags in each dataset. Check the original repository for more details about the training process.

The following table shows the available architectures and the datasets used for training. For every model, its complexity is reported in terms of the number of trainable parameters. The models were trained and evaluated using the standard train/test splits proposed in the literature for these datasets. The performance of the models is reported in terms of ROC-AUC and PR-AUC obtained on the test splits. Additionally, the table provides download links for the models and README files containing the output labels, the names of the relevant layers, and some details about the training datasets.

Architecture Dataset Params. ROC-AUC PR-AUC Model README
MusiCNN MSD 790k 0.88 0.29 msd-musicnn README
MusiCNN MTT 790k 0.91 0.38 mtt-musicnn README
VGG MSD 605k 0.88 0.28 msd-vgg README
VGG MTT 605k 0.90 0.38 mtt-vgg README
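
For reference, these multi-label metrics are computed per tag and then averaged. Below is a minimal sketch with scikit-learn (purely illustrative; this is not the original evaluation code, and the dummy data here is random):

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Dummy data: binary tag annotations and model activations
# for 100 tracks and 50 tags
y_true = np.random.randint(2, size=(100, 50))
y_score = np.random.rand(100, 50)

print('ROC-AUC: {:.2f}'.format(roc_auc_score(y_true, y_score, average='macro')))
print('PR-AUC: {:.2f}'.format(average_precision_score(y_true, y_score, average='macro')))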

Transfer learning classifiers

In transfer learning, a model is first trained on a source task (typically one with a larger amount of available data) so that the acquired knowledge can be leveraged on a smaller target task. In our case, the source tasks are the aforementioned MusiCNN models and a large VGG-like model (VGGish [3]) trained on AudioSet [2]. As target tasks, we considered the in-house classification datasets listed below. We used the penultimate layer of the source-task models as a feature extractor for small classifiers consisting of two fully connected layers.
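
To make the setup concrete, here is a minimal sketch of such a classifier head in Keras. The layer sizes and activations are assumptions for illustration only; the exact configuration of the supplied models may differ:

import tensorflow as tf

n_embedding = 200  # dimensionality of the source-task penultimate layer (assumed)
n_classes = 8      # e.g., the 8 classes of genre_rosamerica

# Two fully connected layers on top of the frozen source-task embeddings
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='relu', input_shape=(n_embedding,)),
    tf.keras.layers.Dense(n_classes, activation='softmax'),
])
classifier.compile(optimizer='adam', loss='categorical_crossentropy')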

The following tables present all trained classifiers, grouped into genre, mood, and miscellaneous task categories. For each classifier, the source task name is a combination of the architecture and the training dataset used, and its complexity is reported in terms of the number of trainable parameters. The performance of each model is measured as normalized accuracy (Acc.) in a 5-fold cross-validation experiment conducted before the final classifier was trained on all data.


Source task Target task Params. Acc. Model README
MusiCNN_MSD genre_dortmund 790k 0.51 genre_dortmund-musicnn-msd README
MusiCNN_MSD genre_electronic 790k 0.95 genre_electronic-musicnn-msd README
MusiCNN_MSD genre_rosamerica 790k 0.92 genre_rosamerica-musicnn-msd README
MusiCNN_MSD genre_tzanetakis 790k 0.83 genre_tzanetakis-musicnn-msd README
MusiCNN_MTT genre_dortmund 790k 0.44 genre_dortmund-musicnn-mtt README
MusiCNN_MTT genre_electronic 790k 0.71 genre_electronic-musicnn-mtt README
MusiCNN_MTT genre_rosamerica 790k 0.92 genre_rosamerica-musicnn-mtt README
MusiCNN_MTT genre_tzanetakis 790k 0.80 genre_tzanetakis-musicnn-mtt README
VGGish_AudioSet genre_dortmund 62M 0.52 genre_dortmund-vggish-audioset README
VGGish_AudioSet genre_electronic 62M 0.93 genre_electronic-vggish-audioset README
VGGish_AudioSet genre_rosamerica 62M 0.94 genre_rosamerica-vggish-audioset README
VGGish_AudioSet genre_tzanetakis 62M 0.86 genre_tzanetakis-vggish-audioset README


Source task Target task Params. Acc. Model README
MusiCNN_MSD mood_acoustic 790k 0.90 mood_acoustic-musicnn-msd README
MusiCNN_MSD mood_aggressive 790k 0.95 mood_aggressive-musicnn-msd README
MusiCNN_MSD mood_electronic 790k 0.95 mood_electronic-musicnn-msd README
MusiCNN_MSD mood_happy 790k 0.81 mood_happy-musicnn-msd README
MusiCNN_MSD mood_party 790k 0.89 mood_party-musicnn-msd README
MusiCNN_MSD mood_relaxed 790k 0.90 mood_relaxed-musicnn-msd README
MusiCNN_MSD mood_sad 790k 0.86 mood_sad-musicnn-msd README
MusiCNN_MTT mood_acoustic 790k 0.93 mood_acoustic-musicnn-mtt README
MusiCNN_MTT mood_aggressive 790k 0.96 mood_aggressive-musicnn-mtt README
MusiCNN_MTT mood_electronic 790k 0.91 mood_electronic-musicnn-mtt README
MusiCNN_MTT mood_happy 790k 0.79 mood_happy-musicnn-mtt README
MusiCNN_MTT mood_party 790k 0.92 mood_party-musicnn-mtt README
MusiCNN_MTT mood_relaxed 790k 0.88 mood_relaxed-musicnn-mtt README
MusiCNN_MTT mood_sad 790k 0.85 mood_sad-musicnn-mtt README
VGGish_AudioSet mood_acoustic 62M 0.94 mood_acoustic-vggish-audioset README
VGGish_AudioSet mood_aggressive 62M 0.98 mood_aggressive-vggish-audioset README
VGGish_AudioSet mood_electronic 62M 0.93 mood_electronic-vggish-audioset README
VGGish_AudioSet mood_happy 62M 0.86 mood_happy-vggish-audioset README
VGGish_AudioSet mood_party 62M 0.91 mood_party-vggish-audioset README
VGGish_AudioSet mood_relaxed 62M 0.89 mood_relaxed-vggish-audioset README
VGGish_AudioSet mood_sad 62M 0.89 mood_sad-vggish-audioset README


Source task Target task Params. Acc. Model README
MusiCNN_MSD danceability 790k 0.93 danceability-musicnn-msd README
MusiCNN_MSD gender 790k 0.88 gender-musicnn-msd README
MusiCNN_MSD tonal_atonal 790k 0.60 tonal_atonal-musicnn-msd README
MusiCNN_MSD voice_instrumental 790k 0.98 voice_instrumental-musicnn-msd README
MusiCNN_MTT danceability 790k 0.91 danceability-musicnn-mtt README
MusiCNN_MTT gender 790k 0.87 gender-musicnn-mtt README
MusiCNN_MTT tonal_atonal 790k 0.91 tonal_atonal-musicnn-mtt README
MusiCNN_MTT voice_instrumental 790k 0.98 voice_instrumental-musicnn-mtt README
VGGish_AudioSet danceability 62M 0.94 danceability-vggish-audioset README
VGGish_AudioSet gender 62M 0.84 gender-vggish-audioset README
VGGish_AudioSet tonal_atonal 62M 0.97 tonal_atonal-vggish-audioset README
VGGish_AudioSet voice_instrumental 62M 0.98 voice_instrumental-vggish-audioset README
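
As a reference for the accuracy figures above: assuming that normalized accuracy denotes accuracy averaged per class, so that every class contributes equally regardless of its size (our reading; the actual evaluation code may differ), it can be computed as in this minimal NumPy sketch:

import numpy as np

def normalized_accuracy(y_true, y_pred):
    # Mean of per-class recalls: each class weighs equally,
    # regardless of how many examples it has.
    classes = np.unique(y_true)
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])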

Architecture details

MusiCNN is a musically motivated convolutional neural network that uses vertical and horizontal convolutional filters aimed at capturing timbral and temporal patterns, respectively [1]. VGG is a computer-vision-style architecture based on stacks of small square filters. VGGish [3] is a considerably larger VGG variant trained on the AudioSet dataset [2].

Dataset details

Dataset Classes Size (tracks)
genre_dortmund alternative, blues, electronic, folkcountry, funksoulrnb, jazz, pop, raphiphop, rock 1820
genre_tzanetakis (GTZAN) blues, classic, country, disco, hip hop, jazz, metal, pop, reggae, rock 1000
genre_rosamerica classic, dance, hip hop, jazz, pop, rhythm and blues, rock, speech 400
genre_electronic ambient, dnb, house, techno, trance 250
mood_acoustic acoustic, not acoustic 321
mood_electronic electronic, not electronic 332
mood_aggressive aggressive, not aggressive 280
mood_relaxed not relaxed, relaxed 446
mood_happy happy, not happy 302
mood_sad not sad, sad 230
mood_party not party, party 349
danceability danceable, not danceable 306
voice_instrumental voice, instrumental 1000
gender female, male 3311
tonal_atonal atonal, tonal 345

Helper algorithms

Algorithms for MusiCNN and VGG based models: TensorflowPredictMusiCNN

Algorithms for VGGish based models: TensorflowPredictVGGish

Usage examples

Let’s go through some example use cases.


Music auto-tagging

In this example, we replicate the auto-tagging demo from our previous post. With the new algorithms, the code is reduced to just a couple of lines!

import numpy as np
from essentia.standard import *

msd_labels = ['rock','pop','alternative','indie','electronic','female vocalists','dance','00s','alternative rock','jazz','beautiful','metal','chillout','male vocalists','classic rock','soul','indie rock','Mellow','electronica','80s','folk','90s','chill','instrumental','punk','oldies','blues','hard rock','ambient','acoustic','experimental','female vocalist','guitar','Hip-Hop','70s','party','country','easy listening','sexy','catchy','funk','electro','heavy metal','Progressive rock','60s','rnb','indie pop','sad','House','happy']

# Our models take audio streams at 16kHz
sr = 16000

# Instantiate a MonoLoader and run it in the same line
audio = MonoLoader(filename='/your/amazing/song.wav', sampleRate=sr)()

# Instantiate the tagger and pass it the audio
predictions = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb')(audio)

# Retrieve the top_n tags
top_n = 3

# The shape of the predictions matrix is [n_patches, n_labels]
# Take advantage of NumPy to average them over the time axis
averaged_predictions = np.mean(predictions, axis=0)

# Sort the predictions and get the top N
for i, l in enumerate(averaged_predictions.argsort()[-top_n:][::-1], 1):
    print('{}: {}'.format(i, msd_labels[l]))

1: electronic
2: chillout
3: ambient


Genre classification

In this example, we use our genre_rosamerica classifier based on the VGGish embeddings. Note that this time we use TensorflowPredictVGGish instead of TensorflowPredictMusiCNN so that the model is fed the correct input features.

import numpy as np
from essentia.standard import *

labels = ['classic', 'dance', 'hip hop', 'jazz',
          'pop', 'rnb', 'rock', 'speech']

sr = 16000
audio = MonoLoader(filename='/your/amazing/song.wav', sampleRate=sr)()

predictions = TensorflowPredictVGGish(graphFilename='genre_rosamerica-vggish-audioset-1.pb')(audio)

# Average predictions over the time axis
predictions = np.mean(predictions, axis=0)

order = predictions.argsort()[::-1]
for i in order:
    print('{}: {:.3f}'.format(labels[i], predictions[i]))

hip hop: 0.411
dance: 0.397
jazz: 0.056
pop: 0.053
rnb: 0.051
rock: 0.011
classic: 0.004
speech: 0.001

Latent feature extraction

As the last example, let’s use one of the models as a feature extractor by retrieving the output of its penultimate layer. This is done by setting the output parameter of the predictor algorithm. A list of the supported output layers is available in the README files supplied with the models.

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

sr = 16000
audio = MonoLoader(filename='/your/amazing/song.wav', sampleRate=sr)()

# Retrieve the output of the penultimate layer
penultimate_layer = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb', output='model/dense/BiasAdd')(audio)
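
To visualize these embeddings over time, we can plot the resulting activation matrix, for example with matplotlib (a minimal sketch continuing the snippet above; the styling of the original figure may differ):

import matplotlib.pyplot as plt

# Each row of `penultimate_layer` is the embedding of one audio patch
plt.matshow(penultimate_layer.T, aspect='auto')
plt.xlabel('Patch index')
plt.ylabel('Embedding dimension')
plt.show()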

The following plot shows what these features look like:



[1] Pons, J., & Serra, X. (2019). musicnn: Pre-trained convolutional neural networks for music audio tagging. arXiv preprint arXiv:1909.06654.

[2] Gemmeke, J. F., et al. (2017). Audio Set: An ontology and human-labeled dataset for audio events. ICASSP 2017.

[3] Hershey, S., et al. (2017). CNN architectures for large-scale audio classification. ICASSP 2017.
