Demos

English male voice (M1) - Take the A train

In the following examples, only timbre is generated by the model. Pitch and phonetic timings are extracted from a recording, in most cases of a different singer.

Mix
Acapella

English female voice (F1) - Locked out of love

Mix
Acapella

Spanish female voice (F2) - El último vals

Here, "Soft VQ" and "Powerful VQ" are different voice qualities (VQ) trained using smaller amounts of training data. They are not discussed in the paper or the journal article.

Mix
Acapella
Soft VQ (Mix)
Powerful VQ (Mix)

Japanese female voice (F3) - ふるさと (Furusato)

Acapella

Japanese female voice (F3) - ハナミズキ (Hanamizuki)

In this example, pitch and phonetic timings are predicted by the model from a "MIDI + lyrics"-like input score. This is discussed in the extended journal article, but not in the original paper.

Mix

Acknowledgments

The datasets used for voices F1 and M1 are provided by Zya. The dataset used for voice F2 is provided by Voctro Labs. The dataset used for voice F3, "NIT SONG070 F001" by Nagoya Institute of Technology, is licensed under CC BY 3.0. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.