Here we explain in detail how to re-create the Discogs-VI and Discogs-VI-YT datasets using a monthly Discogs dump.
Download the latest Discogs release dump. The Discogs database is growing and we can expect to build an even larger dataset using more recent dumps.
# create a folder for the metadata
mkdir -p data/discogs_metadata/
cd data/discogs_metadata/
wget https://discogs-data-dumps.s3-us-west-2.amazonaws.com/data/2024/discogs_20240701_releases.xml.gz
wget https://discogs-data-dumps.s3-us-west-2.amazonaws.com/data/2024/discogs_20240701_artists.xml.gz
# Unzip the files
gunzip discogs_20240701_releases.xml.gz
gunzip discogs_20240701_artists.xml.gz
If you are interested, each step of creating Discogs-VI from the downloaded dumps are described in the remainder of this section. Instead, you can use prepare_discogs_vi.sh and the whole process will be automated.
./prepare_discogs_vi.sh data/discogs_metadata/discogs_20240701_releases.xml data/discogs_metadata/discogs_20240701_artists.xml 1
Parse the XML files, pre-process, and convert to JSON. You can run these two scripts in parallel, i.e. they do not depend on each other and discogs_20240701_releases.xml takes a long time to process. The prepocessing will prepare and unify the fields and remove metadata that are irrelevant to the task.
$ python discogs_vi/preprocess_releases_xml.py discogs_20240701_releases.xml
> Processed 17,402,065 releases
>                    3 releases skipped due to errors
> Total processing time: 03:16:55
> Done!
$ python discogs_vi/preprocess_artists_xml.py discogs_20240701_artists.xml
>  9,187,518 artists loaded
> Total processing time: 00:06:10
There are problems related to artist IDs and relationships. In order to deal with them we clean the artists file.
python discogs_vi/clean_artists.py discogs_20240701_artists.xml.json
Clean the release metadata by:
After cleaning the metadata of a release, if no track is left with the necessary information, we discard the release.
We also remove tracks with
$ python discogs_vi/clean_releases.py discogs_20240701_releases.xml.json
> 17,402,065 releases are processed in total.
>  2,484,508 releases remain after cleaning.
> 14,407,815 tracks remain after cleaning.
>    174,958 genre-style matching errors found.
> Total processing time: 00:12:59
Done!
Genre-styles matching errors occur when a style does not correspond to any genre associated with the same release according to the genre taxonomy. These errors aren’t very common in the database and they appear to happen due to historical changes in the Discogs taxonomy. We discard such genre-style annotations.
In order to put the tracks in clique relationships we need to parse the releases to tracks. Each track will be a dictionary with the following fields:
{
    "track_title": ...,
    "release_title": ...,
    "track_writer_ids": ...,
    "track_writer_names": ...,
    "track_artist_ids": ...,
    "track_artist_names": ...,
    "track_feat_ids": ...,
    "track_feat_names": ...,
    "release_id": ...,
    "release_artist_ids": ...,
    "release_artist_names": ...,
    "release_writer_ids": ...,
    "release_writer_names": ...,
    "release_feat_ids": ...,
    "release_feat_names": ...,
    "genres": ...,
    "styles": ...,
    "country": ...,
    "labels": ...,
    "formats": ...,
    "master_id": ...,
    "main_release": ...,
    "release_videos": ...,
    "released": ...,
}
$ python discogs_vi/parse_releases_to_tracks.py discogs_20240701_releases.xml.json.clean
> Parsed 2,484,508 releases to 14,389,443 tracks.
> Processing time: 13:40
> Done!
A clique is a collection of music performances that are realizations of the same composition. Using writer information we put the parsed tracks into clique relationships where each clique has a unique UUID. Currently 2 tracks can be in a clique relationship only if they have exactly the same title. The unique elements of a clique are called versions. Since the Discogs dump contains different releases of the same track, we group the tracks that are exactly the same into versions.
A clique will be of the form,
{
    "clique_id": uuid,
    "versions": [
        {"version_id": version_id,
         "tracks": [track0, track1, ...]},
        {}, ...
    ]
}
where the tracks have the form as in the previous section. An example clique is provided in examples/example_clique.json
python discogs_vi/clique_finder.py discogs_20240701_releases.xml.json.clean.tracks
>   348,796 cliques are versioned into 1,911,611 versions with  8,038,309 tracks.
> Finished the search.
> Done!
You have created the Discogs-VI dataset and you are ready to match the versions to YouTube URLs. By default the file will be named Discogs-VI-20240701.jsonl
In this part our goal is to match the versions in Discogs-VI-20240701.jsonl to youtube IDs. We do this in 2 main steps and then post-process it.
NOTE: We download the audio files as mono, with 44,100 Hz sampling rate in AAC format. This took 1.8 TB of disk space. In many Version Identification systems, the audio is downsampled to 16,000 or 22,050 Hz. While training Discogs-VINet, we downsampled these files to 16,000 Hz on-the-fly.
In this step we search for each version in Youtube by creating queries and storing them. Use the following command to create query strings for the unmatched versions.
python discogs_vi_yt/query_yt/prepare_query_string.py Discogs-VI-20240701.jsonl
Once you have the query strings ready, make a youtube query for each string and store top 5 results’ metadata. You can parallelize this step. There are many queries to make and time is of the essence, therefore we split the queries into multiple parts and make the queries for each part in parallel. We used 16 processes but this many processes may get you banned from Youtube.
utilities/shuffle_and_split.sh Discogs-VI-20240701.jsonl.queries 16
Then for each part of the split you should use the below command in a separate terminal session (e.g. with TMUX).
python discogs_vi_yt/query_yt/query_and_download_yt_metadata.py Discogs-VI-20240701.jsonl.queries.split.00 <metadata_dir>
Once all the metadata is downloaded, now we can search for the versions inside this metadata collection.
python discogs_vi_yt/query_yt/search_tracks_in_queried_yt_metadata.py Discogs-VI-20240701.jsonl <metadata_dir>
After the matching you need to download the matched audio.
python discogs_vi_yt/audio_download_yt/download_missing_version_youtube_urls.py Discogs-VI-20240701.jsonl.youtube_query_matched music_dir/
In this step we check if each video is downloaded and remove the versions without a downloaded video. Then, betweeen the different versions of a same clique that are matched to the same Youtube URL are removed also. Finally, any clique with less than two versions are deleted.
python discogs_vi_yt/post_processing.py Discogs-VI-20240701.jsonl.youtube_query_matched music_dir/
This will create Discogs-VI-YT-20240701.jsonl and Discogs-VI-YT-20240701-light.json.
Since the Discogs-VI-YT has a large size and occupies a lot of memory, we create Discogs-VI-YT-20240701-light.json that only contains:
Discogs-VI-YT-20240701-light.json is a JSON file with the default encoding.
Congratulations, you are almost done!
NOTE: You will have less versions in the dataset than the downloaded music because of the post processing. You can move these extra files to a directory above with: notebooks/file.ipynb
Once you have the dataset, you can use the streamlit demo to visualize the cliques. But first we have to do preprocesesing by removing some redundant information to reduce the memory load.
python utils/prepare_demo.py Discogs-VI-YT-20240701.jsonl
We first find the intersections of our dataset and other datasets’ test sets and then partition our data acordingly.
Since Da-TACOS Train set is not public we do not work with it. We only find the intersections between Da-TACOS Benchmark set and Discogs-VI-YT. For SHS100K we used the metadata in https://github.com/NovaFrost/SHS100K2. We find out which cliques in Da-TACOS benchmark and SHS100K-Test sets are contained in Discogs-VI-YT by using a jupyter notebook discogs_vi_yt/partition/datacos_shs100k.ipynb. This will create a line separated text file of the clique IDs called data/Discogs-VI-20240701-DaTACOS_benchmark-SHS100K2_TEST-lost_cliques.txt
We use Discogs-VI first instead of Discogs-VI-YT since it contains more cliques.
We use the following jupyter notebook discogs_vi_yt/partition/datacos_shs100k.ipynb for this step.
Let $S =$ ( Discogs-VI $\cap$ Da-TACOS benchmark ) $\cup$ ( Discogs-VI $\cap$ SHS100K-Test )
Remember that Discogs-VI-YT $\subset$ Discogs-VI $\implies$ $S \subset $ Discogs-VI-YT. For simplicity we use $S’ = S \cup$ Discogs-VI-YT. That is, we will put all the cliques in $S’$ to our test partition.
For this step we use a jupyter notebook discogs_vi_yt/partition/create_splits.ipynb. Follow the instructions there.
data/Discogs-VI-20240701-DaTACOS_benchmark-SHS100K2_TEST-lost_cliques.txt) obtained in step and put the corresponding cliques into Discogs-VI-YT-Light-Test first.You are done!