CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

Abstract & Framework

Paper abstract and model overview

This section introduces the paper before the listening cases, using the paper's abstract, its module framework figure, and its data construction and evaluation pipeline figure.

Abstract

CapTalk paper abstract

Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored.

In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes.

To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder to better balance stable timbre preservation and context-adaptive expression. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings.

Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue.

Contribution 1

Unified Single + Dialogue Voice Design

One text-audio autoregressive backbone for both settings: utterance-level captions for single-utterance, and speaker-level captions plus a CoT control sequence (emotion / tone / pitch / energy / speed) for dialogue. Prior voice design systems focus almost exclusively on single utterances.

Contribution 2

Hierarchical Variational Timbre Conditioning

FHVAE-inspired: an utterance-level speaker embedding defines the prior of a pooled segment latent, and KL regularization reinforces stable timbre while attenuating segment-specific affective variation. Avoids the timbre/emotion entanglement of two-stage "design + clone" pipelines like VoiceSculptor.

Contribution 3

Dialogue-Oriented Evaluation Protocol

A new three-part protocol — CoT prediction plausibility, CoT controllability, and overall dialogue preference — that complements InstructTTSEval, which only covers single-utterance speech.

Framework

CapTalk module framework

The figure below is the framework image from the paper. It shows the hierarchical variational timbre conditioning module and the unified caption-conditioned autoregressive generation pipeline for both single-utterance and dialogue settings.

Figure 1. CapTalk framework. Left: hierarchical variational timbre conditioning. Bottom/right: unified caption-conditioned autoregressive generation for single-utterance and dialogue speech.

Pipeline

Data construction, attribute extraction, and evaluation

Figure 2 clarifies where the single-utterance and dialogue data come from, how CoT attributes are extracted, and how the evaluation protocol is organized.

CapTalk data construction and evaluation pipeline

Figure 2. Data construction from public and internal speech data, CoT-style attribute extraction for dialogue, and the complete single-utterance and dialogue evaluation pipeline.

Section 1

Single-Utterance Voice Design

The single-utterance section serves two purposes: it matches the benchmark narrative in Table 1 and highlights why CapTalk is more balanced across caption styles, especially on RP.

Prompt compatibility note: CapTalk, Qwen3-TTS, Ming-omni-tts, and VoiceSculptor accept the original free-form caption directly. Fish Speech S2 Pro does not natively follow long natural-language speaker descriptions, so we convert the same caption semantics into short bracketed control tags following its official prompt interface. The Fish results should therefore be read as an interface-adapted comparison with matched caption meaning, rather than an identical raw prompt string match.
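The bracketed-tag form used for the Fish comparison can be illustrated with a small formatting helper. This is only a sketch of the string layout: the function name `to_bracket_tags` is hypothetical, and the choice of which attributes to keep from a long caption is assumed to be made by hand, since the page does not describe an automated conversion.

```python
def to_bracket_tags(attributes):
    """Join short attribute phrases into one bracketed tag string.

    Hypothetical helper: it only formats tags that were already chosen;
    it does not decide which caption attributes to keep.
    """
    cleaned = [a.strip() for a in attributes if a and a.strip()]
    return "".join(f"[{a}]" for a in cleaned)


# Attribute phrases taken from one of the adapted prompts on this page.
example_tags = [
    "middle-aged male",
    "low pitch",
    "steady",
    "clear mandarin",
    "confident",
    "professional",
]
print(to_bracket_tags(example_tags))
```

Keeping the tags as a plain list makes it easy to check that each tag maps back to an attribute in the original caption, which is what "matched caption meaning" relies on.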

Table 1. Results on the InstructTTSEval-ZH Benchmark

Model	APS	DSD	RP	AVG
Qwen3TTS-12Hz-1.7B-VD	87.10	76.00	55.20	72.77
Ming-omni-tts-0.5B	84.90	72.20	53.90	70.33
VoiceSculptor	73.77	65.40	47.60	62.26
Fish Speech S2 Pro	29.61	50.80	42.60	41.00
CapTalk-1.5B (Ours)	84.10	75.40	61.70	73.73
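The AVG column in Table 1 appears to be the unweighted mean of APS, DSD, and RP. The page does not state this explicitly, so the short check below treats it as an assumption and simply recomputes the mean from the three reported scores.

```python
# Scores copied from Table 1: (APS, DSD, RP, reported AVG).
table1 = {
    "Qwen3TTS-12Hz-1.7B-VD": (87.10, 76.00, 55.20, 72.77),
    "Ming-omni-tts-0.5B": (84.90, 72.20, 53.90, 70.33),
    "VoiceSculptor": (73.77, 65.40, 47.60, 62.26),
    "Fish Speech S2 Pro": (29.61, 50.80, 42.60, 41.00),
    "CapTalk-1.5B (Ours)": (84.10, 75.40, 61.70, 73.73),
}


def mean_score(aps, dsd, rp):
    """Unweighted mean of the three caption-style scores (assumed AVG definition)."""
    return (aps + dsd + rp) / 3


for model, (aps, dsd, rp, reported) in table1.items():
    recomputed = mean_score(aps, dsd, rp)
    print(f"{model}: reported {reported:.2f}, recomputed {recomputed:.2f}")
```

Under this reading, CapTalk's lead in AVG comes mainly from RP, since its APS and DSD scores are slightly below those of Qwen3TTS-12Hz-1.7B-VD.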

Table 2. Human evaluation for single-utterance voice design

All metrics rated on a 1–5 scale. MOS denotes human-rated naturalness.

Model	Overall	Identity	Timbre	Express.	Role	Stabil.	MOS
Ming-omni-tts-0.5B	3.95	4.22	4.03	4.06	3.88	4.27	3.91
Qwen3TTS-12Hz-1.7B-VD	4.20	3.78	4.15	4.12	3.85	4.35	3.82
VoiceSculptor	3.00	3.15	2.89	2.74	2.81	3.25	2.87
Fish Speech S2 Pro	2.07	2.19	2.02	1.82	1.86	2.29	2.11
CapTalk (Ours)	4.24	3.98	3.59	4.10	4.17	4.38	4.20

APS caption comparison

Structured prompt

APS uses structured acoustic attributes. This card shows two examples under the same prompt style.

Example A

Caption: 性别：男性，年龄：中年男性，音高：男性中低音区，音调沉稳有力，语速：语速平稳稍快，节奏清晰，音量：音量洪亮饱满，富有穿透力，清晰度：吐字清晰标准，发音准确，流畅度：表达流畅自然，无明显卡顿，口音：标准普通话，几无地方口音，音色质感：音色浑厚明亮，富有磁性，情感：中性，语气：庄严。性格：自信沉稳，专业严谨。

Text: 在我身后，就是世界上下潜深度最深的作业型载人潜水器焦龙号。

Fish adapted prompt: [middle-aged male][low pitch][steady][clear mandarin][confident][professional]

Model	Audio
CapTalk (Ours)
Qwen3TTS-12Hz-1.7B-VD
Ming-omni-tts-0.5B
VoiceSculptor
Fish Speech S2 Pro

Example B

Caption: 性别：女性，年龄：青壮年，音高：音调偏高，处于女性中高音区，发声位置靠前，略带轻微的尖锐感，语速：语速中心趋势偏快，节奏紧凑，但夹杂一些即兴停顿与语气词带来的快慢切换，音量：音量整体适中，动态范围较窄，偶尔因表达惊讶或调侃而略为提升，清晰度：发音清晰，咬字基本准确，无明显含糊，流畅度：整体表达流畅，但存在一些日常口语中的语气词填充（如“嘞”、“啦”），口音：普通话，带有轻微的南方口音色彩，音色质感：音色明亮，略带金属感与些许气声，显得活泼，情感：疑问，语气：疑问和反问，性格：活泼、外向、好聊天，略带撒娇与抱怨，沟通中带有较强的情绪表达与互动意愿。

Text: 哎我都没听说过了有这样子嘞，他不是系统的问题吗？

Fish adapted prompt: [young female][bright voice][lively][slightly southern accent][chatty][slightly sharp]

Model	Audio
CapTalk (Ours)
Qwen3TTS-12Hz-1.7B-VD
Ming-omni-tts-0.5B
VoiceSculptor
Fish Speech S2 Pro

DSD caption comparison

Speaker description

DSD uses free-form speaker description prompts. This card shows two examples under the same prompt style.

Example A

Caption: 展现中年男性的隐忍压抑型性格,语音沙哑紧绷并随着情绪变化变得有力,吐字清晰,情绪高涨时略混,音量从正常交谈声量迅速增大,语速随情绪加快显示内心冲突。

Text: 我怎么不关心儿子了。这么多年来，我一直受你的气。

Fish adapted prompt: [middle-aged male][husky voice][tense][emotional][getting louder][faster pace]

Model	Audio
CapTalk (Ours)
Qwen3TTS-12Hz-1.7B-VD
Ming-omni-tts-0.5B
VoiceSculptor
Fish Speech S2 Pro

Example B

Caption: 说话的是个三十岁左右的北方男人，声音挺沉稳的，不咋大声也不咋小，听起来很靠谱的样子。他讲起话来有条不紊的，但有时候也会突然问点扎心的问题，像是在认真跟你聊天，但又不会太煽情。

Text: 你比如说工厂的设备自动化那种。

Fish adapted prompt: [northern male][steady][calm][clear][conversational][serious]

Model	Audio
CapTalk (Ours)
Qwen3TTS-12Hz-1.7B-VD
Ming-omni-tts-0.5B
VoiceSculptor
Fish Speech S2 Pro

RP caption comparison

Role prompt

RP is role- and intent-heavy. This card shows two examples under the same prompt style.

Example A

Caption: 在公众场合对不公行为进行严厉批评,声音充满力量和决心。

Text: 你干了没有良心、不顾底线的事情，少拿孩子当借口。

Fish adapted prompt: [firm][powerful][critical][determined]

Model	Audio
CapTalk (Ours)
Qwen3TTS-12Hz-1.7B-VD
Ming-omni-tts-0.5B
VoiceSculptor
Fish Speech S2 Pro

Example B

Caption: 一个在家庭矛盾中试图主导解决方案的中年男性，正在用沉稳有力的声音劝说对方解决问题，语气中带有不容置疑的果断与责任担当。他在家庭讨论中扮演决策者角色，语重心长地推动事态发展。

Text: 我不怀念他呀，我走出来了我为什么还要去怀念呢？

Fish adapted prompt: [middle-aged male][firm][persuasive][responsible][steady]

Model	Audio
CapTalk (Ours)
Qwen3TTS-12Hz-1.7B-VD
Ming-omni-tts-0.5B
VoiceSculptor
Fish Speech S2 Pro

Single-Utterance Scaling Study

Within Section 1

This subsection stays within single-utterance voice design. Each prompt keeps the caption style and target text fixed and changes only the training data scale: 300h, 1500h, 3000h, 5000h, and 5000h+300h. The 300h set is public acted-style data; the 1500h / 3000h / 5000h sets are internal single-utterance speech in a natural casual-talk style. As reported in the paper, the casual-style data provides broader coverage while the acted speech contributes stronger expressive attributes, and the mixed 5000h+300h setup yields the best overall result (AVG 73.73).

Data Scale	Type	APS	DSD	RP	AVG
300h	Acted	82.78	71.41	49.10	67.76
1500h	Conversational	70.20	57.80	41.20	56.40
3000h	Conversational	74.00	62.40	46.80	61.07
5000h	Conversational	79.10	68.30	55.10	67.50
5000h+300h	Conversational + Acted	84.10	75.40	61.70	73.73

Table 5. Scaling law results for single-utterance voice design on InstructTTSEval-ZH.

APS scaling ladder

Structured prompt

Two APS examples are shown below to illustrate how stronger data scale improves prompt-following while keeping the voice natural.

Example A

Caption: 性别：女性，年龄：青年，音高：女性高音区，句末带有上扬，语速：初始略缓，后半段转为适中语速，音量：正常交谈音量，略有起伏，清晰度：吐字清晰标准，流畅度：表达流畅自然，口音：标准普通话，音色质感：音色清亮，略带甜美感，情感：开心愉悦，语气：询问，开心。性格：活泼外向，直接爽朗。

Text: 你怎么这样啊？来吧，晚上八点在我们家酒吧见好吗？

Scale	Audio
300h
1500h
3000h
5000h
5000h+300h

Example B

Caption: 性别：男性，年龄：中年，音高：音调中心趋势偏低，基频稳定在中低频范围，声线沉稳，语速：语速中心趋势适中，节奏平稳，但句与句之间存在明显停顿，整体快慢波动较小，音量：音量整体适中，动态范围较窄，无明显高低起伏，清晰度：发音清晰度高，吐字清晰，归音完整，无含糊现象，流畅度：表达流畅自然，语句连贯，无频繁重复或填充词，口音：带有明显北方方言特征，发音方式具有区域性，音色质感：音质浑厚，共鸣感强，属于典型的自然男声，情感：中性，调侃，语气：反问，性格：稳重、随和，表达中带有日常关切与温和询问的语气。

Text: 回老家起码有口饭也不用你自己不用你做呀，是不是？

Scale	Audio
300h
1500h
3000h
5000h
5000h+300h

DSD scaling ladder

Speaker description

These two DSD examples focus on whether larger training scale improves stability of persona, speaking manner, and delivery coherence.

Example A

Caption: 展现中年男性的压抑爆发型性格,语音沙哑紧绷并随着情绪变化变得有力,吐字清晰,情绪高涨时略混,音量从正常交谈声量迅速增大,语速随情绪加快显示内心冲突。

Text: 我怎么不关心儿子了。这么多年来，我一直受你的气。

Scale	Audio
300h
1500h
3000h
5000h
5000h+300h

Example B

Caption: 说话的是位四十来岁的中年女人，声音沉稳带点沙哑，说话不慌不忙的。她讲起事儿来特别实在，不像那些爱炫的人。有时候说到生气的地方，语速就加快一点，听着还挺真实的。

Text: 我没来过我妈妈去过我没去，她结婚的是我妈妈去了。

Scale	Audio
300h
1500h
3000h
5000h
5000h+300h

RP scaling ladder

Role prompt

RP is where data scale and data mix are easiest to hear. These two examples make the role-intent difference across scales much more obvious.

Example A

Caption: 在讲述社会不公时,语气果断有力,语速由快到慢。

Note: When the caption does not specify gender or age, any gender and age are acceptable as long as they fit the given context.

Text: 有多少人这一辈子，连个机会都看不着，这就是现实。

Scale	Audio
300h
1500h
3000h
5000h
5000h+300h

Example B

Caption: 在家庭或社交圈中与他人激烈争执时的强势女性角色，她正在抓住对方言语漏洞反复质问，情绪激动，语气咄咄逼人，每一句话都充满挑战意味。

Text: 你爱回来不回你爱干嘛干嘛谁还敢说你呢？你爱回不回。

Scale	Audio
300h
1500h
3000h
5000h
5000h+300h

Section 2

Dialogue Voice Design

This is the main strength of CapTalk. The dialogue section is anchored on the final dialogue model trained on 32,171 sessions / 6270.68 hours. It starts from short full-dialogue cases, then moves through speaker consistency, objective CoT control, with-versus-without CoT comparison, and cross-system comparison.

Full multi-turn dialogue

Conversation-level demo

We present two complete dialogue sessions, not sliced utterances. In each case, the target speaker is synthesized, while the other speaker remains the original recorded voice from the session. Across these examples, the synthesized target speaker maintains stable timbre for nearly three minutes of continuous dialogue.

Case A.

Caption note: The APS / DSD / RP captions below all describe the target speaker, and in this dialogue the target speaker is the first speaker.

APS caption:
性别: 男性，年龄: 青壮年，音高: 音调中心趋势偏低，基频稳定处于中低频区域，具备自然的胸腔共鸣基础，语速: 语速中心趋势偏快，节奏平稳，但在部分片段中出现快慢交替，尤其是在陈述信息时有明显的停顿与节奏调整，音量: 音量整体适中，动态范围较窄，表现出较为平稳的能量控制，清晰度: 发音清晰度良好，基本无模糊或含糊现象，流畅度: 表达流畅，偶有轻微填充词，整体无明显卡顿，口音: 带有明显南方口音，可能源自长江流域某地区，非标准普通话，音色质感: 音质干净，略带轻微的沙哑与喉音残留，发声质感偏向生活化和亲民，性格: 随和、自然，交谈风格偏向日常熟人间的轻松状态。

DSD caption:
一个带着点南方口音的30来岁男人，声音挺稳的，不是特别亮也不算闷。说话的时候较快，有点自己的节奏，跟朋友聊家常似的，挺实在的那种。

RP caption:
一个老哥，在和街边小店的老板寒暄聊天，话题围绕生意和市场行情，声音里带着烟火气，也有那么一点经验丰富的老道。

Original full dialogue:

Case B.

Caption note: The APS / DSD / RP captions below all describe the target speaker, and in this dialogue the target speaker is the first speaker.

APS caption:
性别: 男性，年龄: 中年，音高: 音调中心趋势适中偏低，整体维持在自然男声的中低频区域，声线稳定，无剧烈波动，语速: 语速中心趋势适中，节奏平稳，但快慢切换频繁，情绪带动时语速明显加快，音量: 音量整体适中，动态范围较宽，部分片段更偏响亮，清晰度: 发音清晰度一般，偶有轻微含糊，流畅度: 整体流畅，偶有重复语句和少量填充词，口音: 普通话，带有明显的南方地域口音，影响部分字音，音色质感: 音质略显粗糙，发声不够圆润，带有轻微沙哑感，性格: 关注家庭与现实问题，情感直接，表达带有焦虑与担忧。

DSD caption:
这是个中年的男人，声音带点南方口音，听上去挺实在的。说话时候挺关心家里的事，尤其是孩子身体和上学的事，老是提心吊胆的，时不时就唠叨两句，有点操心过头的感觉。

RP caption:
一位操心的儿子健康的中年父亲在向他人倾诉自己的家庭忧虑，他的语气里夹杂着对孩子成长的关心和对年迈父母的担忧，语速因情绪波动而时快时慢，声音中透露出一种真切的生活压力。

Original full dialogue:

Dialogue consistency across different contexts

Speaker consistency

Two cases are shown here. Each case keeps one target speaker fixed and compares three different dialogue contexts, making the main dialogue claim directly audible: stable target-speaker timbre across turns, while expression stays adaptive to the current local context.

Example A

Target speaker caption: 性别: 男性. 年龄: 中年. 音高: 音调偏低，基频稳定，处于男性自然声线的中低频区域，共鸣充足. 语速: 语速中心趋势适中，节奏平稳，但存在快慢切换，在问答与停顿间体现交流节奏. 音量: 音量整体均衡，无明显过大或过小现象，动态范围较窄. 清晰度: 发音清晰度良好，吐字准确. 流畅度: 语言流畅，表达自然，偶有口头禅与填空词. 口音: 带有明显的北方口音. 音色质感: 音质较为浑厚，略带低沉的共鸣感. 性格: 稳重、务实，沟通中表现出较强的回应意愿与自然交流态度.

Context	Target text	Audio
Neutral context	哦你上什么？
Warm context	嗯，好的，你，你你说嘛，你说呀。
Tense context	Ok不想睡睡睡不着躺躺一会儿睡，嗯。

Example B

Target speaker caption: 性别: 女性. 年龄: 中年(45-60岁). 音高: 音调处于中低频区域，基频稳定，整体感觉沉稳低柔. 语速: 语速中心趋势偏慢，节奏舒缓，但在思考或转折时出现短暂加速，快慢切换频繁. 音量: 音量整体适中，部分片段因情绪加强音量略提升，动态范围较宽. 清晰度: 发音清晰，吐字基本准确，偶有轻微气息残留. 流畅度: 整体较为流畅，表达自然，偶有口头禅“啊”、“呀”等填充. 口音: 标准普通话，无明显方言痕迹. 音色质感: 音质沉稳柔和，略带疲惫感，有轻微的气声. 性格: 冷静、内省、现实主义，对情感和物质关系有清醒认知.

Context	Target text	Audio
Reassuring reply	没事，我，最难的时候我都度过来了现在已经，风雨过了，我觉得自己想怎么做就怎么做。
Caring reply	嗯儿子也这我儿子也这我说嘛妈妈你想去干嘛干嘛想多去玩玩。
Reflective reply	这样子，但是他事业做的很好他就什么都很好。我在他身上学了很多东西，他是做事业型的男人我，我就在公司上班吗？

CoT Control

Objective controllability

This subsection keeps only the objective controllability part and uses the final dialogue model trained on 32,171 sessions / 6270.68 hours. Same target speaker, same dialogue context, same target text, and same random seed; only one CoT attribute changes at a time. The page focuses on the three most reliable and directly audible dimensions: pitch, energy, and speed.
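The "one attribute at a time" design can be sketched as a sweep over a base CoT setting. The seven level names per attribute mirror the labels used in the control tables on this page. The function name, the dictionary layout, and the base emotion and tone values are hypothetical, and the sketch does not call any model.

```python
# Seven speaker-relative levels per attribute, named as in the control tables.
LEVELS = {
    "pitch": ["extremely low", "noticeably low", "slightly low", "normal",
              "slightly high", "noticeably high", "extremely high"],
    "energy": ["extremely quieter", "noticeably quieter", "slightly quieter", "normal",
               "slightly louder", "noticeably louder", "extremely louder"],
    "speed": ["extremely slower", "noticeably slower", "slightly slower", "normal",
              "slightly faster", "noticeably faster", "extremely faster"],
}


def single_attribute_sweep(base_cot):
    """Return (attribute, variant) pairs that differ from base_cot in one attribute only."""
    variants = []
    for attribute, levels in LEVELS.items():
        for level in levels:
            variant = dict(base_cot)      # copy so the base setting is never mutated
            variant[attribute] = level    # change exactly one CoT attribute
            variants.append((attribute, variant))
    return variants


# Hypothetical base setting; emotion and tone stay fixed throughout the sweep.
base_cot = {"emotion": "calm", "tone": "statement",
            "pitch": "normal", "energy": "normal", "speed": "normal"}

for attribute, variant in single_attribute_sweep(base_cot)[:3]:
    print(attribute, "->", variant[attribute])
```

Each variant would then be synthesized with the same speaker, context, text, and seed, so that any audible difference can be attributed to the changed attribute.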

Evaluation Category	Metric	Emotion	Tone	Pitch	Energy	Speed
CoT Prediction	Accuracy	0.7850	0.7675	0.8375	0.8250	0.9125
CoT Controllability	Success Rate	0.7675	0.7675	0.8400	0.8550	0.8675

Table 3. Automatic evaluation results for dialogue voice design.

Fixed conditions for all controllability cases: same target speaker, same dialogue context, same target text, and same random seed. Only one CoT attribute changes at a time. The pitch, energy, and speed labels are defined from speaker-level statistics over that speaker's dialogue turns, so the current target sentence itself does not have to be normal. Each row therefore shows the measured value and its change relative to the selected target sentence, not relative to the normal row.
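The per-row changes in the control tables (semitones, dB, and a rate ratio) are consistent with the standard conversions sketched below, each taken relative to the measured baseline turn. The page does not spell out the formulas, so treat them as an assumption. The function names are illustrative, and the example values are the Control case A baseline and two of its rows.

```python
import math


def semitone_shift(f0_hz, ref_hz):
    """Pitch change in semitones relative to the baseline turn."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def db_change(rms, ref_rms):
    """Level change in dB for an RMS amplitude relative to the baseline turn."""
    return 20.0 * math.log10(rms / ref_rms)


def rate_ratio(cps, ref_cps):
    """Speech-rate ratio (characters per second) relative to the baseline turn."""
    return cps / ref_cps


# Control case A baseline turn: 149.59 Hz, 0.085844 RMS, 6.00 cps.
base_f0, base_rms, base_cps = 149.59, 0.085844, 6.00

print(f"{semitone_shift(164.88, base_f0):+.2f} st")   # 'Noticeably high' pitch row
print(f"{db_change(0.152708, base_rms):+.2f} dB")     # 'Noticeably louder' energy row
print(f"x{rate_ratio(3.60, base_cps):.3f}")           # 'Extremely slower' speed row
```

Because every change is measured against the selected target sentence rather than the "Normal" row, the row that matches the baseline profile always reads +0.00 st, +0.00 dB, or ×1.000.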

Control case A

Original baseline audio:

Target text: 我去过开封的少林寺。

Target-turn base profile: pitch = extremely high, energy = slightly quieter, speed = slightly faster.

Measured baseline-turn values: 149.59 Hz, 0.085844 RMS, 6.00 cps.

Pitch

The labels are speaker-relative because the paper models pitch relative to each speaker's baseline.

Label	Measured / change	Audio
Extremely low	140.07 Hz / -1.14 st
Noticeably low	127.02 Hz / -2.83 st
Slightly low	125.41 Hz / -3.05 st
Normal	135.71 Hz / -1.69 st
Slightly high	139.18 Hz / -1.25 st
Noticeably high	164.88 Hz / +1.68 st
Extremely high	149.59 Hz / +0.00 st

Energy

The energy labels are likewise speaker-relative, matching the extraction design in the paper.

Label	Measured / change	Audio
Extremely quieter	0.085331 RMS / -0.05 dB
Noticeably quieter	0.115417 RMS / +2.57 dB
Slightly quieter	0.085844 RMS / +0.00 dB
Normal	0.103108 RMS / +1.59 dB
Slightly louder	0.095289 RMS / +0.91 dB
Noticeably louder	0.152708 RMS / +5.00 dB
Extremely louder	0.150661 RMS / +4.89 dB

Speed

The speech-rate variations are meant to be audible without being extreme enough to sound unnatural.

Label	Measured / change	Audio
Extremely slower	3.60 cps / ×0.600
Noticeably slower	4.29 cps / ×0.715
Slightly slower	4.50 cps / ×0.750
Normal	5.00 cps / ×0.833
Slightly faster	6.00 cps / ×1.000
Noticeably faster	6.00 cps / ×1.000
Extremely faster	3.60 cps / ×0.600

Control case B

Original baseline audio:

Target text: 像我儿子都比你大。

Target-turn base profile: pitch = normal, energy = slightly louder, speed = noticeably faster.

Measured baseline-turn values: 172.99 Hz, 0.098876 RMS, 6.67 cps.

Pitch

The labels stay speaker-relative so the comparison remains within the same dialogue identity.

Label	Measured / change	Audio
Extremely low	140.62 Hz / -3.59 st
Noticeably low	157.95 Hz / -1.57 st
Slightly low	173.25 Hz / +0.03 st
Normal	172.99 Hz / +0.00 st
Slightly high	167.47 Hz / -0.56 st
Noticeably high	159.67 Hz / -1.39 st
Extremely high	171.52 Hz / -0.15 st

Energy

The same sentence and context are kept fixed so only loudness changes from row to row.

Label	Measured / change	Audio
Extremely quieter	0.035207 RMS / -8.97 dB
Noticeably quieter	0.038624 RMS / -8.16 dB
Slightly quieter	0.058547 RMS / -4.55 dB
Normal	0.048262 RMS / -6.23 dB
Slightly louder	0.098876 RMS / +0.00 dB
Noticeably louder	0.085840 RMS / -1.23 dB
Extremely louder	0.111139 RMS / +1.02 dB

Speed

Rate changes are shown on the same target turn so the contrast stays easy to hear.

Label	Measured / change	Audio
Extremely slower	3.48 cps / ×0.522
Noticeably slower	4.21 cps / ×0.631
Slightly slower	6.15 cps / ×0.922
Normal	5.71 cps / ×0.856
Slightly faster	6.15 cps / ×0.922
Noticeably faster	6.67 cps / ×1.000
Extremely faster	6.15 cps / ×0.922

Dialogue generation with and without CoT

Table 4 in audio form

This is the most important dialogue comparison on the page. The two representative cases below are ones where the gain from CoT is easy to hear in contextual appropriateness, delivery, or role intent.

Model Setting	Gemini	Human Pairwise Preference
Dialogue Eval. (w/ CoT)	72%	65.5%
Dialogue Eval. (w/o CoT)	28%	34.5%

Table 4. Context coherence comparison in dialogue evaluation.

Example A

Target speaker caption: 性别：女性；年龄：中年；音高处于中低频区域且较稳定；语速适中，情绪表达时会短暂加快；音量整体适中，动态范围略宽；发音清晰，偶有喉音；带轻微北方口音；音色略带沙哑与颗粒感；性格直爽，带有明显的抱怨倾向与焦虑紧绷感。

History

S2: 对我们就卖那个托盘的。 S1: 我们厂里面多的很，看你在搞批发吗？

Target

我们主要是网络销售。

emotion: 平静 tone: 陈述表达 pitch: 稍高 energy: 稍大 speed: 稍慢

Setting	Audio
CapTalk (w/ CoT)
CapTalk (w/o CoT)

Example B

Target speaker caption: 性别：女性；年龄：青年；音调处于典型高音区，句尾常有上扬；语速偏快、节奏紧凑；音量整体适中但情绪上来时会明显变大；吐字清晰、表达流畅；标准普通话，基本无地域口音；音色清亮明快，带年轻女性的稚嫩感；性格活泼外向、表达直接。

History

S2: 二十八啊。 S1: 二十八，二十八差十四岁。 S2: 嗯，你。 S1: 不对呀怎么算的呢？

Target

你多大了呀？

emotion: 惊讶 tone: 撒娇害羞 pitch: 稍低 energy: 极大 speed: 稍快

Setting	Audio
CapTalk (w/ CoT)
CapTalk (w/o CoT)
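The CoT lines shown in the two examples above follow a fixed five-field layout (emotion, tone, pitch, energy, speed). A minimal parser and formatter for that displayed layout is sketched below. The helper names are hypothetical, and the token format the model consumes internally is not specified on this page, so this mirrors only the text as displayed.

```python
import re

COT_FIELDS = ("emotion", "tone", "pitch", "energy", "speed")
_FIELD_PATTERN = re.compile(r"(emotion|tone|pitch|energy|speed):\s*(\S+)")


def parse_cot(line):
    """Parse a displayed CoT line into a dict with all five fields."""
    found = dict(_FIELD_PATTERN.findall(line))
    missing = [name for name in COT_FIELDS if name not in found]
    if missing:
        raise ValueError(f"missing CoT fields: {missing}")
    return {name: found[name] for name in COT_FIELDS}


def format_cot(cot):
    """Serialize a CoT dict back into the displayed single-line layout."""
    return " ".join(f"{name}: {cot[name]}" for name in COT_FIELDS)


# CoT line copied from Example A above.
line = "emotion: 平静 tone: 陈述表达 pitch: 稍高 energy: 稍大 speed: 稍慢"
cot = parse_cot(line)
print(cot["pitch"], cot["speed"])
print(format_cot(cot) == line)
```

The pattern assumes each value contains no whitespace, which holds for the Chinese labels shown here but would need adjusting for multi-word English labels.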

Cross-system dialogue comparison

Table 7 in audio form

This comparison uses one curated case. The target speaker in this snippet is the first speaker in the dialogue, and the key question is whether the system preserves that speaker's identity while staying coherent with the surrounding turns.

Prompt compatibility note: CapTalk uses the target-speaker natural-language caption directly. Fish Speech S2 Pro does not natively support long natural-language speaker descriptions in dialogue; it works better with short bracketed control tags. This means the Fish side is an interface-adapted comparison with matched caption meaning, but it is not a perfectly symmetric prompt setting.

Target-speaker caption (CapTalk): 性别: 男性. 年龄: 中年. 音高: 音调中心趋势适中偏低，整体维持在自然男声的中低频区域，声线稳定，无剧烈波动. 语速: 语速中心趋势适中，节奏平稳，但快慢切换频繁，情绪带动时语速明显加快. 音量: 音量整体适中，动态范围较宽，部分片段更偏响亮. 清晰度: 发音清晰度一般，偶有轻微含糊. 流畅度: 整体流畅，偶有重复语句和少量填充词. 口音: 普通话，带有明显的南方地域口音，影响部分字音. 音色质感: 音质略显粗糙，发声不够圆润，带有轻微沙哑感. 性格: 关注家庭与现实问题，情感直接，表达带有焦虑与担忧.

Fish adapted prompt: [中年男性][中低音][语速平稳][南方口音][轻微沙哑][担忧]

Metric	Fish Speech S2 Pro	CapTalk
SIM	0.806	0.808
Context Coherence	3.61	4.18
MOS	3.80	4.12

System	Audio
CapTalk
Fish Speech S2 Pro

Section 3

Factorized Hierarchical Variational Timbre Conditioning

This section follows the paper's Section 3.3 terminology. It highlights two complementary properties of the module: stable timbre reuse once a designed voice is fixed, and one-to-many caption-conditioned sampling when the speaker embedding is re-predicted for each generation.

Timbre Reuse and One-To-Many Sampling

What this section is designed to show

As described in Section 3.3 of the paper, the hierarchical variational timbre conditioning module is inspired by the factorization idea of FHVAE. It uses an utterance-level speaker encoder and a segment-level pooled latent, with KL regularization between the segment posterior and an utterance-conditioned prior, so that stable speaker and voice attributes are preserved while segment-specific affective variation is attenuated. Its purpose is to let a designed timbre be reused across utterances and dialogue turns while keeping expression adaptive to the current context.
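The KL term between the segment posterior and the utterance-conditioned prior can be illustrated with the closed form for diagonal Gaussians. This page does not state which distribution family CapTalk uses, so the Gaussian assumption (common in FHVAE-style models) and all names below are illustrative rather than the paper's implementation.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) for equal-length vectors.

    q plays the role of the segment posterior; p plays the role of the prior
    whose parameters come from the utterance-level speaker embedding.
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total


# Toy 2-dimensional example with hypothetical values.
prior_mean, prior_var = [0.0, 0.0], [1.0, 1.0]
close_segment = kl_diag_gaussians([0.1, -0.1], [1.0, 1.0], prior_mean, prior_var)
far_segment = kl_diag_gaussians([1.5, -1.0], [1.0, 1.0], prior_mean, prior_var)
print(close_segment < far_segment)
```

A segment latent that drifts away from the utterance-level prior incurs a larger penalty, which is the mechanism by which segment-specific variation is attenuated while stable speaker attributes are kept.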

At the same time, caption-conditioned voice design is inherently a one-to-many task. A caption is a high-level, underspecified description that naturally corresponds to a distribution of valid voices rather than a single unique one: many concrete timbres can satisfy the same caption equally well. Furthermore, the stochastic sampling process of the model means that each generation draws a different instance from this distribution. In the standard caption-only generation setting, each run independently predicts a new ê_spk from the caption-conditioned distribution, so repeated samples under the same caption may sound like different but still caption-consistent voices. This sample-level variability is a property of caption-conditioned sampling, not evidence that the module fails to preserve timbre.

To directly test the timbre-reuse claim of the module, the appendix reports a dedicated timbre reuse analysis that contrasts two inference modes: Fixed ê_spk, where a designed voice is predicted once and then reused across utterances, and Resampled ê_spk, where the speaker embedding is re-predicted for every utterance. The resulting cross-utterance SIM is 0.92 in the fixed mode and 0.42 in the resampled mode, showing that once a designed voice is fixed, the module preserves it consistently across utterances.

The listening panel below is therefore meant to complement that analysis from the perceptual side: it lets listeners hear the one-to-many nature of caption-only generation directly, while the appendix numbers quantify the separate timbre-reuse path enabled by fixing the designed speaker embedding.
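The cross-utterance SIM contrast described above can be sketched as a mean pairwise cosine similarity over per-utterance speaker embeddings. The embedding extractor used for the appendix numbers is not specified on this page, so the vectors below are hypothetical toy values and do not reproduce the reported 0.92 and 0.42.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_sim(embeddings):
    """Average cosine similarity over all pairs of utterance embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings: 'fixed' utterances share one designed voice,
# 'resampled' utterances each draw a new voice for the same caption.
fixed = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.95, 0.0, 0.05]]
resampled = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]

print(f"fixed: {mean_pairwise_sim(fixed):.2f}")
print(f"resampled: {mean_pairwise_sim(resampled):.2f}")
```

A high value in the fixed mode together with a much lower value in the resampled mode is the pattern the appendix analysis reports, and it is what separates timbre reuse from caption-conditioned diversity.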

Each of the six cases below fixes one caption and one target text, then shows three independent generations from three separate decoding runs. These examples illustrate caption-conditioned diversity under resampled ê_spk, whereas the appendix timbre-reuse analysis quantifies consistency under fixed ê_spk.

Caption A

APS · 3 independent samples

Caption: 性别: 男性. 年龄: 中年. 音高: 音调中心趋势偏低，基频稳定处于中低音区，具备自然的胸腔共鸣基础. 语速: 语速中心趋势适中，节奏平稳，但在表达观点或情绪时语速短暂加快，体现出情感带动下的快慢切换. 音量: 音量整体适中，部分片段动态范围较宽，特别是在情绪起伏时音量有明显增强. 清晰度: 发音清晰度一般，偶有含糊，部分词语存在模糊发音. 流畅度: 表达基本流畅，偶有迟疑和重复用词，未见明显卡顿. 口音: 普通话，带有一点南方色彩. 音色质感: 音质略带沙哑，气息感较强，发声时可听到轻微的喉音. 性格: 犹豫，焦虑紧绷，带有明显的自我怀疑与情感不安倾向。

Text: 今天外面的风有点大，出门的时候记得多穿件外套呀。

Run	Generation
Sample 1
Sample 2
Sample 3

Caption B

RP · 3 independent samples

Caption: 沉稳的中年女性正在家中与亲朋叙旧，话题围绕家庭琐事展开，她用带着现实感的语气讲述着生活不易。

Text: 比起那些重口味的菜，我还是更偏爱清淡的做法。

Run	Generation
Sample 1
Sample 2
Sample 3

Caption C

RP · 3 independent samples

Caption: 中年男性在家庭群里和晚辈轻松聊天，语气随意，顺便开个小玩笑调节气氛，话题围绕日常生活与家庭琐事展开。

Text: 你觉得明天会更好吗。

Run	Generation
Sample 1
Sample 2
Sample 3

Caption D

DSD · 3 independent samples

Caption: 说话的是位三四十岁的女性，声音低沉又沙哑，语速不紧不慢，时不时还停一下，像是在想接下来该说什么。她说话时有种不太想多讲的感觉，但又忍不住把心里的想法说出来，听起来挺有故事的。

Text: 太棒了，天气终于放晴了。

Run	Generation
Sample 1
Sample 2
Sample 3

Caption E

APS · 3 independent samples

Caption: 性别: 女性. 年龄: 青壮年. 音高: 音调处于女性中高音区，基频稳定，音质明亮. 语速: 语速中心趋势偏快，节奏紧凑，但在部分句子中因情绪表达或强调而出现短暂停顿和速度调整. 音量: 音量整体适中，动态范围中等，有轻微的情绪带动起伏. 清晰度: 吐字清晰，发音准确. 流畅度: 整体流畅，语流自然，偶有笑语插入. 口音: 标准普通话，无明显方言痕迹. 音色质感: 音质清亮，自然柔和，带有年轻女性特有的明亮感. 性格: 活泼、外向、直接，带有调侃和好奇的语气特征。

Text: 今天外面的风有点大，出门的时候记得多穿件外套呀。

Run	Generation
Sample 1
Sample 2
Sample 3

Caption F

DSD · 3 independent samples

Caption: 说话的是个中年男人，声音偏厚带点沙哑，讲话挺随意的，语速不快也不慢，时不时就拖个音或者叹口气。听起来挺自信的，还挺爱开玩笑，总想套点别人的话。

Text: 比起那些重口味的菜，我还是更偏爱清淡的做法。

Run	Generation
Sample 1
Sample 2
Sample 3