Dia TTS — Voice Clone

Generate dialogue audio from text using a cloned voice from a short reference audio and its transcript. Supports multi-speaker scripts.


Dia TTS converts a dialogue script into natural audio while cloning a target voice from a short reference clip and its matching transcript. The page presents three inputs — a script textarea, a reference audio field that accepts either a file upload or a public URL, and a reference transcript textarea — followed by a Generate action that returns a playable audio result and a download link.

How it works in the UI

Write the conversation in the script box and mark each turn with a speaker tag such as [S1] or [S2]: put the tag at the beginning of the turn, followed by the spoken text. Add short non-verbal cues in parentheses inside the same turn when they help convey tone or timing, for example (laughs) or (sighs). Provide the reference audio either by uploading a local file or by pasting a public URL into the reference audio field; both are processed the same way. Enter a reference transcript that matches the words spoken in the reference audio so the system can align voice characteristics reliably. Press Generate to process the inputs; the page shows progress, then displays an audio player and a download option.

Inputs

The script textarea expects a dialogue that uses [S<N>] tags to indicate who speaks. Start with [S1] for the first turn and [S2] for the second, then continue alternating naturally; introduce new speakers incrementally, for example [S3] and [S4], when a scene requires more voices. Keep sentences clear and punctuated to control phrasing and pauses, and place non-verbals in parentheses near the words they affect. Example as typed in the script textarea: [S1] Hello, how are you? [S2] I'm good, thank you. [S1] What's your name? [S2] My name is Dia. [S1] Nice to meet you. [S2] Nice to meet you too.
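To check a script before pasting it into the textarea, the tag format described above can be parsed locally. The following is a minimal sketch, not part of the tool itself: the function name parse_script is hypothetical, and the only assumption about the format is that each turn starts with a tag of the form [S<N>].

```python
import re

# Matches a speaker tag such as [S1] or [S12]; the number is captured.
TAG = re.compile(r"\[S(\d+)\]")

def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a tagged script into (speaker_number, text) turns."""
    # With one capture group, split() returns:
    # [text_before_first_tag, number, text, number, text, ...]
    parts = TAG.split(script)
    turns = []
    for i in range(1, len(parts), 2):
        turns.append((int(parts[i]), parts[i + 1].strip()))
    return turns

script = "[S1] Hello, how are you? [S2] I'm good, thank you."
print(parse_script(script))
# [(1, 'Hello, how are you?'), (2, "I'm good, thank you.")]
```

Text that appears before the first tag is ignored by this sketch; a stricter checker could report it as an error instead.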

The reference audio field takes either a pasted public URL or an uploaded file from the device. Short, clean, single-speaker clips work best; aim for about 5-15 seconds with stable volume and minimal background noise.
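The suggested clip length can be checked before uploading. This sketch assumes the reference clip is an uncompressed WAV file, which is only one of the formats a user might have, and uses Python's standard wave module. The helper names and the make_silence demo function are hypothetical, and the 5 to 15 second bounds are the guideline from the paragraph above rather than a hard limit enforced by the tool.

```python
import io
import wave

# Guideline range from the text above; not a limit enforced by the tool.
MIN_SECONDS = 5.0
MAX_SECONDS = 15.0

def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def in_recommended_range(seconds: float) -> bool:
    """True when the duration falls inside the suggested range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS

def make_silence(seconds: float, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf

duration = wav_duration_seconds(make_silence(8))
print(duration, in_recommended_range(duration))
```

For a real clip, pass the file path instead of the in-memory buffer, for example wav_duration_seconds("reference.wav"), where the file name is a placeholder.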

The reference transcript textarea holds the exact words spoken in the reference audio. Keep it closely matched; small punctuation differences are acceptable, yet the words should line up with what is said. Example as typed in the transcript textarea: [S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Fal.

Understanding [S<N>] tags

These tags identify the speaker for each turn, allowing the system to structure the dialogue and keep voices consistent across the scene. Use square brackets with S followed by a number, such as [S1] and [S2]. Place a single speaker tag at the start of the turn, write the line for that speaker, then start the next turn with the next tag. Non-verbal cues stay inside the same turn in parentheses, and should be brief and purposeful.

Examples:
[S1] Welcome to the demo.
[S2] Thanks for inviting me. (laughs)
[S1] Let's begin with the overview.

Another example: [S1] The storm passed overnight. (whispers) The streets felt new by morning.
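To review how many non-verbal cues a turn carries, the parenthesized cues can be separated from the spoken words. This is an illustrative sketch only: split_cues is a hypothetical helper, and it assumes cues are written in plain, non-nested parentheses as in the examples above.

```python
import re

# A cue is any non-empty text inside one pair of parentheses, e.g. (laughs).
CUE = re.compile(r"\(([^()]+)\)")

def split_cues(turn_text: str) -> tuple[str, list[str]]:
    """Return the spoken text without cues, plus the list of cues found."""
    cues = CUE.findall(turn_text)
    spoken = CUE.sub("", turn_text)
    # Collapse the extra whitespace left behind where a cue was removed.
    spoken = re.sub(r"\s+", " ", spoken).strip()
    return spoken, cues

text = "The storm passed overnight. (whispers) The streets felt new by morning."
print(split_cues(text))
# ('The storm passed overnight. The streets felt new by morning.', ['whispers'])
```

A turn that returns several cues is a candidate for trimming, in line with the advice to keep cues brief and purposeful.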

Field behavior

The Generate action becomes available when the script, the reference audio (via file or URL), and the reference transcript all contain input. The script must include at least one speaker tag and a corresponding line. The transcript should closely match the reference audio. After submission, the page shows progress and then renders a player with the final audio and a download link; format and sample rate appear alongside the result.
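The enabling rule described above can be expressed as a small validation function. This is a sketch of one possible implementation, not the page's actual code: the FormState fields, the function names, and the error messages are all hypothetical. It checks only what the paragraph states, namely that all three inputs are non-empty and that the script has at least one speaker tag followed by text.

```python
import re
from dataclasses import dataclass

# A speaker tag followed by some text that is not itself another tag.
TAGGED_LINE = re.compile(r"\[S\d+\]\s*(?!\[S\d+\])\S")

@dataclass
class FormState:
    script: str = ""
    reference_audio: str = ""       # uploaded file name or public URL
    reference_transcript: str = ""

def validation_errors(form: FormState) -> list[str]:
    """Return one message per unmet condition; an empty list means valid."""
    errors = []
    if not form.script.strip():
        errors.append("Script is empty.")
    elif not TAGGED_LINE.search(form.script):
        errors.append("Script needs at least one speaker tag followed by a line.")
    if not form.reference_audio.strip():
        errors.append("Reference audio is missing.")
    if not form.reference_transcript.strip():
        errors.append("Reference transcript is empty.")
    return errors

def can_generate(form: FormState) -> bool:
    """Mirror of the rule that enables the Generate action."""
    return not validation_errors(form)

form = FormState("[S1] Hello.", "clip.wav", "[S1] Hello there.")
print(can_generate(form))  # True
```

Returning a list of messages rather than a single boolean makes it straightforward to show a separate inline message under each field.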

Writing better scripts

Vary sentence length to keep speech natural, and rely on punctuation to guide phrasing and pauses. Break long monologues into multiple turns for clarity and pacing. Use non-verbals sparingly and keep them short, avoiding stacked stage directions in one line. When iterating, change only the words that need adjustment to keep timing stable across takes.

Troubleshooting

If the cloned voice does not match the reference, make sure the clip contains a single speaker and that the transcript aligns with the spoken words. When speech sounds rushed or flat, shorten sentences, add clear punctuation, or split long lines into separate turns. If non-verbals feel exaggerated, remove redundant cues and keep only what conveys intent. When a transcript mismatch appears, compare the text with the audio word-for-word, correct differences, and generate again.
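The word-for-word comparison can be done by hand, or with a small script when a second text of what is actually heard in the clip is available. The sketch below uses Python's standard difflib module. The helper names are hypothetical, the second text is assumed to be typed by the user after listening, and speaker tags, case, and punctuation are ignored so that only word differences are reported.

```python
import difflib
import re

def words(text: str) -> list[str]:
    """Lowercase the text, drop [S<N>] tags, and keep only word tokens."""
    text = re.sub(r"\[S\d+\]", " ", text)
    return re.findall(r"[a-z0-9']+", text.lower())

def word_differences(transcript: str, heard: str) -> list[tuple[str, str, str]]:
    """List (operation, transcript_words, heard_words) for each mismatch."""
    a, b = words(transcript), words(heard)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            diffs.append((op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs

transcript = "[S1] Dia is an open weights text to dialogue model."
heard = "Dia is an open weight text to dialogue model"
print(word_differences(transcript, heard))
# [('replace', 'weights', 'weight')]
```

An empty result means the two texts agree word for word; each reported tuple points to a place where the transcript should be corrected before generating again.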

Accessibility and UX

Ensure the file input and controls support keyboard navigation and screen readers. Display inline validation messages beneath each field and a visible status while generating. Provide captions or a transcript for the produced audio where possible, and show file size and duration near the player for quick verification.
