Disentangling Timbre and Singing Style
with Multi-singer Singing Synthesis System

ABSTRACT

In this study, we define the identity of a singer with two independent concepts, timbre and singing style, and propose a multi-singer singing synthesis system that can model them separately. To this end, we extend our single-singer model into a multi-singer model in two ways. First, we design a singer identity encoder that can effectively reflect the identity of a singer. Second, we use the encoded singer identity to condition two independent decoders that model timbre and singing style, respectively. Through a user study with listening tests, we experimentally verify that the proposed framework can generate a natural, high-quality singing voice while independently controlling timbre and singing style. In addition, by changing the singing style while keeping the timbre fixed, we suggest that the proposed network can produce a more expressive singing voice.

PROPOSED SYSTEM

We propose a multi-singer singing voice synthesis (SVS) system that can model timbre and singing style independently. We designed the network with [1] as the baseline and extended that single-singer model to a multi-singer model by adding 1) a singer identity encoder and 2) a timbre/singing style conditioning method.
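The data flow described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' network: the lookup-table encoder, the function names (`encode_singer`, `timbre_decoder`, `style_decoder`, `synthesize`), and the dictionary outputs are all hypothetical stand-ins for learned modules. It only shows that the timbre and singing style branches receive their singer conditioning separately.

```python
from typing import Dict, List

# Hypothetical singer embeddings. In a real system these would come from
# a learned singer identity encoder, not a hand-written table.
SINGER_TABLE: Dict[str, List[float]] = {
    "A": [1.0, 0.0],
    "B": [0.0, 1.0],
}


def encode_singer(singer_id: str) -> List[float]:
    """Stand-in for the singer identity encoder (assumption: a lookup)."""
    return SINGER_TABLE[singer_id]


def timbre_decoder(score: List[str], timbre_emb: List[float]) -> dict:
    """Stub for the decoder conditioned on timbre."""
    return {"score": score, "timbre": timbre_emb}


def style_decoder(score: List[str], style_emb: List[float]) -> dict:
    """Stub for the decoder conditioned on singing style."""
    return {"score": score, "style": style_emb}


def synthesize(score: List[str], timbre_singer: str, style_singer: str) -> dict:
    """Condition each decoder on its own singer embedding."""
    timbre_emb = encode_singer(timbre_singer)
    style_emb = encode_singer(style_singer)
    return {
        "timbre_branch": timbre_decoder(score, timbre_emb),
        "style_branch": style_decoder(score, style_emb),
    }
```

Because the two branches take separate embeddings, passing the same singer to both reproduces that singer, while passing different singers mixes one singer's timbre with another's singing style.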

Listening Test Result

Qualitative evaluation results of our proposed system. (9-point scale)

Method | Pronunciation accuracy | Sound quality | Naturalness
Proposed (w/o cross-generation) | 7.30 ± 1.44 | 5.06 ± 1.44 | 5.64 ± 2.01
Proposed (w/ cross-generation) | 7.36 ± 1.39 | 5.19 ± 1.76 | 5.55 ± 2.02
Ground truth | 7.43 ± 1.50 | 6.40 ± 1.96 | 6.89 ± 1.89

Audio Samples


Singing voices generated with different singer identities. Note that singers 1 and 2 were trained on clean singing data recorded directly in a controlled environment, while singers 3 and 4 were trained on vocals obtained by source separation from in-the-wild recordings of professional singers.

[Audio samples: songs 1 to 5, one clip each for Singer 1, Singer 2, Singer 3, and Singer 4]

Singing voices produced through cross-generation. 'A-A' denotes the timbre of A combined with the singing style of A, and 'A-B' denotes the timbre of A combined with the singing style of B.

[Audio samples: songs 1 to 5, one clip each for A-A, B-B, A-B, and B-A]
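The labeling convention for cross-generation can be made concrete with a short sketch. The helper name `cross_generation_conditions` is hypothetical; it only enumerates which singer supplies the timbre and which supplies the singing style for each label, and does not synthesize audio.

```python
from itertools import product
from typing import List, Tuple


def cross_generation_conditions(singers: List[str]) -> List[Tuple[str, str, str]]:
    """Return (label, timbre_singer, style_singer) for every pairing.

    The label follows the convention used on this page:
    'A-B' means the timbre of A combined with the singing style of B.
    """
    return [(f"{t}-{s}", t, s) for t, s in product(singers, repeat=2)]


conditions = cross_generation_conditions(["A", "B"])
for label, timbre_singer, style_singer in conditions:
    print(f"{label}: timbre from {timbre_singer}, style from {style_singer}")
```

With two singers this yields the four conditions listed for each song: the two matched pairs (A-A, B-B) and the two crossed pairs (A-B, B-A).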

Singing voices generated by interpolating between the singer identity embeddings of two singers. The figure below shows how the spectrogram changes as the interpolation progresses.


[Audio samples: songs 1 to 4, each in four steps from singer A through two intermediate points to singer B]
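One simple way to realize the interpolation described above is a linear blend of the two embedding vectors. The page does not state which interpolation scheme was used, so the linear form, the function name `interpolate_embeddings`, and the example vectors are assumptions for illustration; the default of four steps mirrors the four sample columns from A to B.

```python
from typing import List


def interpolate_embeddings(
    emb_a: List[float], emb_b: List[float], steps: int = 4
) -> List[List[float]]:
    """Linearly interpolate from emb_a to emb_b, endpoints included.

    Assumption: linear interpolation; the scheme actually used for the
    samples is not specified on this page.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    path = []
    for i in range(steps):
        alpha = i / (steps - 1)  # 0.0 at singer A, 1.0 at singer B
        path.append([(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)])
    return path


# Hypothetical 2-dimensional embeddings for two singers.
path = interpolate_embeddings([0.0, 2.0], [3.0, 2.0], steps=4)
```

Each vector in `path` would then be used as the singer identity condition for one synthesized sample, moving gradually from singer A to singer B.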