Speaker encoding trains a separate model to infer a new speaker embedding directly from the cloning audio; that embedding is then used with a multi-speaker generative model. The speaker encoding model uses time- and frequency-domain processing blocks to extract speaker identity information from each audio sample, and attention blocks to combine that information across samples.
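To make the combining step concrete, here is a minimal sketch of attention-weighted pooling over per-sample embeddings. It is an illustration under stated assumptions, not the model's actual architecture: the function names, the single `query` vector standing in for learned attention parameters, and the random inputs are all hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

def combine_sample_embeddings(sample_embeddings, query):
    """Attention-weighted average of per-sample embeddings.

    sample_embeddings: (n_samples, dim) array, one row per cloning audio sample.
    query: (dim,) vector; a hypothetical stand-in for learned attention weights.
    """
    dim = sample_embeddings.shape[1]
    # One scalar score per audio sample (scaled dot product with the query).
    scores = sample_embeddings @ query / np.sqrt(dim)
    # Normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    # The speaker embedding is the weighted sum of the per-sample embeddings.
    return weights @ sample_embeddings, weights

# Random placeholders for the outputs of the per-sample processing blocks.
rng = np.random.default_rng(0)
sample_embeddings = rng.normal(size=(5, 16))
query = rng.normal(size=16)
speaker_embedding, weights = combine_sample_embeddings(sample_embeddings, query)
```

With this kind of pooling, samples that score higher against the query contribute more to the final embedding, so a noisy or uninformative clip can be down-weighted instead of averaged in equally.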

The advantages of speaker encoding include fast cloning (only a few seconds) and a small number of parameters per speaker, which makes it well suited to low-resource deployment.

Besides accurately estimating the speaker embeddings, we observe that speaker encoders learn to map different speakers into the embedding space in a meaningful way.

For example, speakers of the same gender, or with accents from the same region, cluster together. Applying operations in this learned latent space can convert the gender or accent region of a speaker. Our results demonstrate that this approach is highly effective both for generating speech for new speakers and for morphing speaker characteristics.
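One simple way to picture an operation in the latent space is vector arithmetic between cluster means. The sketch below is an assumption for illustration only; the article does not specify which operations were used. The function `shift_attribute`, the `strength` parameter, and the synthetic clusters are hypothetical.

```python
import numpy as np

def shift_attribute(embedding, source_group, target_group, strength=1.0):
    """Move an embedding along the direction between two cluster means.

    source_group, target_group: (n_speakers, dim) arrays of embeddings for
    speakers who share an attribute (for example, an accent region).
    strength: 0.0 leaves the embedding unchanged; 1.0 applies the full shift.
    """
    direction = target_group.mean(axis=0) - source_group.mean(axis=0)
    return embedding + strength * direction

# Synthetic clusters standing in for two groups of learned speaker embeddings.
rng = np.random.default_rng(0)
dim = 8
center_a = np.zeros(dim)
center_a[0] = -1.0
center_b = np.zeros(dim)
center_b[0] = 1.0
group_a = center_a + 0.1 * rng.normal(size=(20, dim))
group_b = center_b + 0.1 * rng.normal(size=(20, dim))

speaker = group_a[0]
morphed = shift_attribute(speaker, group_a, group_b)

# Distances from the morphed embedding to each cluster mean.
dist_to_a = np.linalg.norm(morphed - group_a.mean(axis=0))
dist_to_b = np.linalg.norm(morphed - group_b.mean(axis=0))
```

In this toy setup the morphed embedding lands near the target cluster while keeping the speaker's offset from the source cluster mean. Intermediate values of `strength` would give a partial shift between the two groups.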

This article was republished with permission from Baidu Research.