Inworld TTS 2
Text-to-Speech • Inworld • Proxied

Inworld's most powerful and expressive text-to-speech model. Builds on TTS 1.5 with rich expressive speech, real-time latency, natural language steering (e.g. [whisper], [say excitedly]), and stronger multilingual support across 15 production languages plus 90+ experimental languages.
| Model Info | |
|---|---|
| Terms and License | link ↗ |
| More information | link ↗ |
| Pricing | View pricing in the Cloudflare dashboard ↗ |
Usage
```js
const response = await env.AI.run(
  'inworld/tts-2',
  {
    output_format: 'mp3',
    temperature: 1,
    text: 'Hello! Welcome to Cloudflare AI Gateway. Let me show you what we can do.',
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)
```

```sh
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "inworld/tts-2",
    "input": {
      "output_format": "mp3",
      "temperature": 1,
      "text": "Hello! Welcome to Cloudflare AI Gateway. Let me show you what we can do.",
      "timestamp_type": "none",
      "voice_id": "Dennis"
    }
  }'
```

```json
{
  "gatewayMetadata": { "keySource": "Unified" },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/simple-speech.mp3"
  },
  "state": "Completed"
}
```

Examples
Natural Language Steering: Direct the voice with bracketed natural-language cues for emotion, pace, and style.
```js
const response = await env.AI.run(
  'inworld/tts-2',
  {
    output_format: 'mp3',
    temperature: 1,
    text: "[speak with excitement] I'm really excited about Inworld's new model. Have you tried out the steering capabilities? It's pretty cool!",
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)
```

```sh
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "inworld/tts-2",
    "input": {
      "output_format": "mp3",
      "temperature": 1,
      "text": "[speak with excitement] I'\''m really excited about Inworld'\''s new model. Have you tried out the steering capabilities? It'\''s pretty cool!",
      "timestamp_type": "none",
      "voice_id": "Dennis"
    }
  }'
```

```json
{
  "gatewayMetadata": { "keySource": "Unified" },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/natural-language-steering.mp3"
  },
  "state": "Completed"
}
```

Whisper: Use steering tags to whisper.
```js
const response = await env.AI.run(
  'inworld/tts-2',
  {
    output_format: 'mp3',
    temperature: 1,
    text: '[whisper] This is a secret just between us.',
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)
```

```sh
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "inworld/tts-2",
    "input": {
      "output_format": "mp3",
      "temperature": 1,
      "text": "[whisper] This is a secret just between us.",
      "timestamp_type": "none",
      "voice_id": "Dennis"
    }
  }'
```

```json
{
  "gatewayMetadata": { "keySource": "Unified" },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/whisper.mp3"
  },
  "state": "Completed"
}
```

High Quality Audio: Higher sample rate for studio quality.
```js
const response = await env.AI.run(
  'inworld/tts-2',
  {
    output_format: 'mp3',
    sample_rate: 48000,
    temperature: 1,
    text: 'This recording is generated at studio quality for the best possible listening experience.',
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)
```

```sh
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "inworld/tts-2",
    "input": {
      "output_format": "mp3",
      "sample_rate": 48000,
      "temperature": 1,
      "text": "This recording is generated at studio quality for the best possible listening experience.",
      "timestamp_type": "none",
      "voice_id": "Dennis"
    }
  }'
```

```json
{
  "gatewayMetadata": { "keySource": "Unified" },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/high-quality-audio.mp3"
  },
  "state": "Completed"
}
```

With Text Normalization: Expand numbers and abbreviations before synthesis.
```js
const response = await env.AI.run(
  'inworld/tts-2',
  {
    apply_text_normalization: true,
    output_format: 'mp3',
    temperature: 1,
    text: 'The meeting is at 3:30 PM on Jan 15th, 2026. Please confirm by calling 555-0123.',
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)
```

```sh
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "inworld/tts-2",
    "input": {
      "apply_text_normalization": true,
      "output_format": "mp3",
      "temperature": 1,
      "text": "The meeting is at 3:30 PM on Jan 15th, 2026. Please confirm by calling 555-0123.",
      "timestamp_type": "none",
      "voice_id": "Dennis"
    }
  }'
```

```json
{
  "gatewayMetadata": { "keySource": "Unified" },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/with-text-normalization.mp3"
  },
  "state": "Completed"
}
```

Parameters
| Parameter | Type | Required | Constraints | Description |
|---|---|---|---|---|
| apply_text_normalization | boolean | no | | When enabled, text normalization expands numbers, dates, times, and abbreviations before converting to speech. Turning this off may reduce latency. |
| (name not shown) | integer | no | minimum: -9007199254740991, maximum: 9007199254740991 | Bits per second of the audio. Only for compressed audio formats (mp3, opus). The default is 128,000. |
| output_format | string | yes | default: mp3; enum: mp3, opus, wav, flac | The output format for the audio. Supported formats are mp3, opus, wav, and flac. Defaults to mp3. |
| sample_rate | integer | no | minimum: -9007199254740991, maximum: 9007199254740991 | The synthesis sample rate in hertz. Accepts: 8000, 16000, 22050, 24000, 32000, 44100, 48000. The default is 48,000. |
| (name not shown) | number | no | minimum: 0.5, maximum: 1.5 | Speaking rate/speed, in the range [0.5, 1.5]. The default is 1.0. We recommend using values above 0.8 to ensure high quality. |
| temperature | number | yes | default: 1, minimum: 0.01, maximum: 2 | Determines the degree of randomness when sampling audio tokens. Defaults to 1.0. Accepts values between 0 (exclusive) and 2 (inclusive). Higher values are more expressive; lower values are more deterministic. |
| text | string | yes | maxLength: 2000 | The text to be synthesized into speech. Maximum input of 2,000 characters. |
| timestamp_type | string | yes | default: none; enum: none, word, character | Controls timestamp metadata returned with the audio. "word" returns word-level timing, "character" returns character-level timing. Note: adds latency. Defaults to none. |
| voice_id | string | yes | default: Dennis; enum: Loretta, Darlene, Marlene, Hank, Evelyn, Celeste, Pippa, Tessa, Liam, Callum, Hamish, Abby, Graham, Rupert, Mortimer, Snik, Anjali, Saanvi, Arjun, Claire, Oliver, Simon, Elliot, James, Serena, Gareth, Vinny, Lauren, Jessica, Ethan, Tyler, Jason, Chloe, Veronica, Victoria, Miranda, Sebastian, Victor, Malcolm, Nate, Brian, Amina, Kelsey, Derek, Evan, Kayla, Jake, Grant, Tristan, Nadia, Selene, Marcus, Riley, Damon, Cedric, Mia, Naomi, Jonah, Levi, Avery, Brandon, Conrad, Bianca, Lucian, Trevor, Alex, Ashley, Craig, Deborah, Dennis, Edward, Elizabeth, Hades, Julia, Pixie, Mark, Olivia, Priya, Ronald, Sarah, Shaun, Theodore, Timothy, Wendy, Dominus, Hana, Clive, Carter, Blake, Luna, Reed, Duncan, Felix, Eleanor, Sophie | The ID of the voice to use for synthesizing speech. Defaults to Dennis. |

| Output | Type | Description |
|---|---|---|
| audio | string | URL to the generated audio file |
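
The constraints listed under Parameters can be checked on the client before a request is sent, which avoids a round trip for obviously invalid input. The following is a minimal sketch, not part of any Cloudflare or Inworld SDK: `validateTtsInput` and its constants are hypothetical names, the field names are taken from the usage examples above, and text length is measured with JavaScript's `String.length` (UTF-16 code units), which is an assumption about how the 2,000-character limit is counted. The bit rate and speaking rate parameters are left out because their field names do not appear on this page, and `voice_id` is not checked against the voice list to keep the sketch short.

```javascript
// Hypothetical client-side check of an inworld/tts-2 input object.
// Values below mirror the documented constraints; the helper itself is an
// illustration, not an official API.
const SAMPLE_RATES = [8000, 16000, 22050, 24000, 32000, 44100, 48000]
const OUTPUT_FORMATS = ['mp3', 'opus', 'wav', 'flac']
const TIMESTAMP_TYPES = ['none', 'word', 'character']
const MAX_TEXT_LENGTH = 2000 // assumption: counted as String.length

// Returns a list of human-readable problems; an empty list means the input
// passed every check in this sketch.
function validateTtsInput(input) {
  const errors = []

  if (typeof input.text !== 'string') {
    errors.push('text is required and must be a string')
  } else if (input.text.length > MAX_TEXT_LENGTH) {
    errors.push(`text is ${input.text.length} characters; the limit is ${MAX_TEXT_LENGTH}`)
  }

  // Fields with documented defaults are only checked when present.
  if (input.output_format !== undefined && !OUTPUT_FORMATS.includes(input.output_format)) {
    errors.push(`output_format must be one of: ${OUTPUT_FORMATS.join(', ')}`)
  }

  if (input.sample_rate !== undefined && !SAMPLE_RATES.includes(input.sample_rate)) {
    errors.push(`sample_rate must be one of: ${SAMPLE_RATES.join(', ')}`)
  }

  if (input.temperature !== undefined) {
    const t = input.temperature
    // Documented range: greater than 0, up to and including 2.
    if (typeof t !== 'number' || !(t > 0 && t <= 2)) {
      errors.push('temperature must be a number greater than 0 and at most 2')
    }
  }

  if (input.timestamp_type !== undefined && !TIMESTAMP_TYPES.includes(input.timestamp_type)) {
    errors.push(`timestamp_type must be one of: ${TIMESTAMP_TYPES.join(', ')}`)
  }

  if (
    input.apply_text_normalization !== undefined &&
    typeof input.apply_text_normalization !== 'boolean'
  ) {
    errors.push('apply_text_normalization must be a boolean')
  }

  return errors
}
```

One way to use it is to call `validateTtsInput(input)` first and pass `input` to `env.AI.run('inworld/tts-2', input)` only when the returned list is empty. The service remains the source of truth, so a request that passes this check can still be rejected.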