VoiceGen: Describing and Generating Voices with Text Prompt
Abstract
Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text prompt face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech. In this work, we introduce VoiceGen to address these challenges with a variation network to provide variability information of voice not captured by text prompts, and a prompt generation pipeline to utilize the large language models (LLM) to compose high quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice variability) based on the text prompt representation. For the prompt generation pipeline, it generates text prompts for speech with a speech language understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate text prompts based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that compared to the previous works, VoiceGen generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation. Additionally, the prompt generation pipeline produces high-quality text prompts, eliminating the large labeling cost. The demo page of VoiceGen is available online.
Content
1. Audio Samples1.1 Attribute Control
1.2 Timbre Variation
1.3 Extension on Face2Voice
1. Audio Samples
We demonstrate the advantages of VoiceGen through the following three aspects:
Attribute Control: we aim to show that we can control a specific attribute with the text prompts of different meanings.
Timbre Variation: we aim to show that we can synthesize speech in the different timbre with different sampling results of variation network, while in the same timbre when changing the text prompt with the same intention, text content or sampling results of TTS backbone. In this case, the voice variability is mainly controlled by the variation network.
Extension on Face2Voice: we aim to show that we can synthesize speech that matches the facial image.
1.1 Attribute Control
Note that audio samples in this part are from VoiceGen, PromptTTS, and InstructTTS, which show that VoiceGen can control a specific attribute with the text prompts of different meanings.
1.1.1 Gender
Gender-1: Please ask a man with a normal voice to say: Fire a whole platoon, Major.
Gender-2: Please ask a woman with a normal voice to say: Fire a whole platoon, Major.
Model | Gender-1 | Gender-2 |
---|---|---|
VoiceGen | ||
PromptTTS | ||
InstructTTS |
1.1.2 Speed
Speed-1: Please speak at a slow speed, gentleman: Him sorely and yet, it was but a woman's fancy, a passing fancy. She would become reconciled to the inevitable, as women do, and when her children came, she would grow accustomed to her sorrow. And her trouble would be forgotten in their laughter.
Speed-2: Please speak at a normal speed, gentleman: Him sorely and yet, it was but a woman's fancy, a passing fancy. She would become reconciled to the inevitable, as women do, and when her children came, she would grow accustomed to her sorrow. And her trouble would be forgotten in their laughter.
Speed-3: Please speak at a fast speed, gentleman: Him sorely and yet, it was but a woman's fancy, a passing fancy. She would become reconciled to the inevitable, as women do, and when her children came, she would grow accustomed to her sorrow. And her trouble would be forgotten in their laughter.
Model | Speed-1 | Speed-2 | Speed-3 |
---|---|---|---|
VoiceGen | |||
PromptTTS | |||
InstructTTS |
1.1.3 Pitch
Pitch-1: She said in a low pitch: But it is not with a view to distinction that you should cultivate this talent, if you consult your own happiness.
Pitch-2: She said in a normal pitch: But it is not with a view to distinction that you should cultivate this talent, if you consult your own happiness.
Pitch-3: She said in a high pitch: But it is not with a view to distinction that you should cultivate this talent, if you consult your own happiness.
Model | Pitch-1 | Pitch-2 | Pitch-3 |
---|---|---|---|
VoiceGen | |||
PromptTTS | |||
InstructTTS |
1.1.4 Volume
Volume-1: Generate a boy's voice with a low volume for me: But their health and strength, child; they can never stand the severe application.
Volume-2: Generate a boy's voice with a normal volume for me: But their health and strength, child; they can never stand the severe application.
Volume-3: Generate a boy's voice with a high volume for me: But their health and strength, child; they can never stand the severe application.
Model | Volume-1 | Volume-2 | Volume-3 |
---|---|---|---|
VoiceGen | |||
PromptTTS | |||
InstructTTS |
1.2 Timbre Variation
Note that audio samples in this part are from VoiceGen, which show that VoiceGen can synthesize speech in the different timbre with different sampling results of variation network, while in the same timbre when changing the text prompt with the same intention, text content or sampling results of TTS backbone. In this case, the voice variability is mainly controlled by the variation network.
1.2.1 Variation Network
We can change the timbre by altering the sampling results of the variation network while maintaining the speech consistent with the intention of the text prompt.
Variation Network-[1, 2, 3]: I want a low pitched female voice: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.
Variation Network-1 | Variation Network-2 | Variation Network-3 |
---|---|---|
1.2.2 Text Prompt
Even if we change the text prompt with the same intention, the timbre will not be not be altered, which is exactly what we aim to achieve.
Text Prompt-1: I want a low pitched female voice: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.
Text Prompt-2: This madam talks to me with a deep voice: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.
Text Prompt-3: Decrease the pitch of her voice for me: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.
Text Prompt-1 | Text Prompt-2 | Text Prompt-3 |
---|---|---|
1.2.3 Text Content
Even if we change the text content, the timbre will not be not be altered, which is exactly what we aim to achieve.
Text-1: I want a low pitched female voice: Delaney had read one or two works on psychic phenomena and understood from them that spirit projection was not only quite feasible but far from uncommon.
Text-2: I want a low pitched female voice: And in this additional chapter to amplify and fortify, here and there, the result must necessarily be disconnected but a glance at the index will point the way to what is new.
Text-3: I want a low pitched female voice: If you wore the pink bonnet, I'll give it to you, and I'll back you up again, Mrs. Danvey. I think you might have done something with our member, as my father calls him, when you had him for so long in the house, but altogether.
Text Content-1 | Text Content-2 | Text Content-3 |
---|---|---|
1.2.4 TTS Backbone
We cannot alter the timbre by changing the sampling results of TTS backbone.
TTS Backbone-[1, 2, 3]: I want a low pitched female voice: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.
TTS Backbone-1 | TTS Backbone-2 | TTS Backbone-3 |
---|---|---|
1.3 Extension on Face2Voice
Note that audio samples in this part are from VoiceGen, SP-FaceVC, and ground-truth voice (Ground-Truth) which show that VoiceGen can synthesize speech that matches the facial image.
VoiceGen: Face-1: That summer is immigration, however, being mainly from the free states, greatly changed the relative strength of the two parties.
SP-FaceVC: Face-1: That summer is immigration, however, being mainly from the free states, greatly changed the relative strength of the two parties.
Ground-Truth: Face-1: That includes requiring background checks for students and other school employees, rights for student expulsion.
Face-1 | VoiceGen | SP-FaceVC | Ground-Truth |
---|---|---|---|
VoiceGen: Face-2: There was an average cost per lane for meter operation of 22 cents a year and each meter took care of an average of 17 lamps.
SP-FaceVC: Face-2: There was an average cost per lane for meter operation of 22 cents a year and each meter took care of an average of 17 lamps.
Ground-Truth: Face-2: We currently know it. This is important that we listen to the people will be negatively impacted and everyone who cares deeply about the direction this budget will take.
Face-2 | VoiceGen | SP-FaceVC | Ground-Truth |
---|---|---|---|
VoiceGen: Face-3: That summer is immigration, however, being mainly from the free states, greatly changed the relative strength of the two parties.
SP-FaceVC: Face-3: That summer is immigration, however, being mainly from the free states, greatly changed the relative strength of the two parties.
Ground-Truth: Face-3: Non-fiscal provision for me that would have devastating effects for our citizens. This budget also includes significant change.
Face-3 | VoiceGen | SP-FaceVC | Ground-Truth |
---|---|---|---|
VoiceGen: Face-4: There was an average cost per lane for meter operation of 22 cents a year and each meter took care of an average of 17 lamps.
SP-FaceVC: Face-4: There was an average cost per lane for meter operation of 22 cents a year and each meter took care of an average of 17 lamps.
Ground-Truth: Face-4: I am taking those opportunities away from our kids and giving them to private schools that all interests.
Face-4 | VoiceGen | SP-FaceVC | Ground-Truth |
---|---|---|---|