Sequence-to-sequence Singing Synthesis Using the Feed-forward Transformer

Feed-forward Transformer Seq2Seq model, with neural vocoder, effects and background music.

The model predicts timbre and phonetic timings, while F0 and note onsets (vowel onsets) are obtained from a reference recording.
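As a rough illustration of how predicted phonetic timings and a reference F0 contour could be lined up frame by frame, the sketch below repeats each phoneme label for its duration and pairs every resulting frame with an F0 value. The function names, the use of integer frame counts for durations, and the example values are all assumptions made for illustration; they are not taken from the FFT-NPSS implementation, and timbre prediction is not covered.

```python
from typing import List, Tuple


def expand_to_frames(phonemes: List[str], durations: List[int]) -> List[str]:
    """Repeat each phoneme label for its (hypothetical) duration in frames."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * n_frames)
    return frames


def attach_reference_f0(frames: List[str], f0_hz: List[float]) -> List[Tuple[str, float]]:
    """Pair each frame-level phoneme label with an F0 value from a reference contour."""
    if len(frames) != len(f0_hz):
        raise ValueError("frame sequence and F0 contour must have the same length")
    return list(zip(frames, f0_hz))


if __name__ == "__main__":
    # Hypothetical example: one consonant followed by one vowel.
    frames = expand_to_frames(["s", "a"], [2, 3])
    # Hypothetical reference F0 values in Hz (0.0 marks an unvoiced frame).
    f0_hz = [0.0, 0.0, 220.0, 221.5, 223.0]
    for phoneme, f0 in attach_reference_f0(frames, f0_hz):
        print(phoneme, f0)
```

In this toy setup the durations come from the model side and the F0 values from the reference side, which mirrors the split described above only at the level of data flow.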

FFT-NPSS (proposed)

FFT-NPSS w/ ground-truth durations

FFT-NPSS w/o self-attention

AR-NPSS (baseline)

The proprietary dataset used in these experiments was provided by Voctro Labs.