Recent character-based end-to-end text-to-speech (TTS) systems
have shown promising performance in natural speech generation,
especially for English. However, for Chinese TTS, a
character-based model is prone to generating speech with incorrect
pronunciations because of label sparsity. To address this
issue, we introduce an auxiliary character-to-pinyin
mapping task to strengthen the pronunciation learning of characters,
and leverage a pre-trained dictionary network to correct
pronunciation mistakes through joint training. Specifically, our
model predicts pinyin labels as an auxiliary task, which helps it
learn better hidden representations of Chinese characters; pinyin
is a standard phonetic representation for Chinese characters.
The dictionary network acts as a tutor that further
aids hidden representation learning. Experiments demonstrate
that employing the pinyin auxiliary task and an external dictionary
network clearly enhances the naturalness and intelligibility
of speech synthesized directly from Chinese character sequences.
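
The joint training described above can be illustrated with a minimal sketch of a combined objective. This is not the authors' implementation: the abstract does not specify the loss form, so the cross-entropy auxiliary term, the KL-divergence tutor term, and the names `joint_loss`, `aux_weight`, and `tutor_weight` are all assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the gold pinyin label (auxiliary task)."""
    return -math.log(softmax(logits)[target_index])


def kl_divergence(teacher_probs, student_logits):
    """KL(teacher || student): distance of the model from the tutor's output."""
    student_probs = softmax(student_logits)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def joint_loss(tts_loss, pinyin_logits, pinyin_targets, dict_probs,
               aux_weight=1.0, tutor_weight=1.0):
    """Hypothetical joint objective: TTS loss + pinyin auxiliary + tutor term.

    pinyin_logits  : per-character scores over the pinyin vocabulary
    pinyin_targets : gold pinyin index for each character
    dict_probs     : pinyin distribution from a pre-trained dictionary
                     network (assumed here to act as a distillation teacher)
    """
    n = len(pinyin_targets)
    aux = sum(cross_entropy(l, t)
              for l, t in zip(pinyin_logits, pinyin_targets)) / n
    tutor = sum(kl_divergence(p, l)
                for p, l in zip(dict_probs, pinyin_logits)) / n
    return tts_loss + aux_weight * aux + tutor_weight * tutor


# Toy example: two characters, a three-entry pinyin vocabulary.
logits = [[4.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
targets = [0, 1]
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = joint_loss(tts_loss=0.5, pinyin_logits=logits,
                  pinyin_targets=targets, dict_probs=teacher)
```

In this sketch both extra terms shrink as the character representations yield pinyin predictions that agree with the gold labels and with the dictionary network, which is one plausible reading of the "tutor" role; the weights would need tuning in practice.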