Longer demos with music

These demo songs were generated with the proposed semi-supervised model, trained on the full target dataset (2h7min) of audio only; no other annotations were used.

For these demos, the model was controlled by a timed phonetic sequence and F0, in this case obtained from a reference recording.
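As a rough illustration of what such control inputs can look like, the sketch below expands a timed phonetic sequence into one label per frame and pairs it with a frame-level F0 contour. It is not the code used for these demos: the function names, the 10 ms hop, the `sil` filler label, the phoneme symbols, and the F0 values are all assumptions chosen for the example.

```python
from typing import List, Tuple

# Hypothetical representation: (phoneme, start_ms, end_ms), end exclusive.
Segment = Tuple[str, int, int]


def phonemes_to_frames(segments: List[Segment], hop_ms: int = 10,
                       silence: str = "sil") -> List[str]:
    """Expand timed phonemes into one label per frame.

    Frame i is taken to sit at time i * hop_ms; it receives the phoneme
    whose interval [start_ms, end_ms) contains that time, or `silence`
    if no segment covers it.
    """
    if not segments:
        return []
    total_ms = max(end for _, _, end in segments)
    n_frames = -(-total_ms // hop_ms)  # ceiling division
    labels = [silence] * n_frames
    for phoneme, start_ms, end_ms in segments:
        first = -(-start_ms // hop_ms)  # first frame at or after start
        last = -(-end_ms // hop_ms)     # first frame at or after end
        for i in range(first, min(last, n_frames)):
            labels[i] = phoneme
    return labels


def frame_controls(segments: List[Segment], f0_hz: List[float],
                   hop_ms: int = 10) -> List[Tuple[str, float]]:
    """Pair each frame's phoneme label with its F0 value (0.0 = unvoiced)."""
    labels = phonemes_to_frames(segments, hop_ms)
    if len(labels) != len(f0_hz):
        raise ValueError("phoneme timing and F0 contour differ in length")
    return list(zip(labels, f0_hz))


# Made-up example: three phonemes over 100 ms, with an invented F0 contour.
example_segments = [("f", 0, 30), ("i", 30, 80), ("l", 80, 100)]
example_f0 = [0.0, 0.0, 0.0, 220.0, 221.0, 222.0, 221.0, 220.0, 219.0, 218.0]
example_controls = frame_controls(example_segments, example_f0)
```

Here the first three frames carry the unvoiced "f" with an F0 of 0.0, and the remaining frames carry the voiced phonemes with their pitch values. In practice both the timings and the F0 contour would be extracted from the reference recording.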

The waveform was generated from the predicted mel-spectrogram using a neural vocoder. Effects and background music were added in the final mix.

Feel

Fake Plastic Trees

Comparison - Full target dataset (2h7min)

Reference
Supervised
Semi-supervised

Comparison - Voice cloning (3min)

Reference
Supervised cloning
Semi-supervised cloning

Acknowledgments

This work was funded by the TROMPA project (H2020 grant No 770376).