Inworld TTS 2

Text-to-Speech • Inworld

Inworld TTS 2 is Inworld's most powerful and expressive text-to-speech model. It builds on TTS 1.5 with rich expressive speech, real-time latency, natural-language steering through bracketed cues (e.g. [whisper], [say excitedly]), and stronger multilingual support across 15 production languages plus 90+ experimental languages.

Model Info
Terms and License	link ↗
More information	link ↗
Zero data retention	Yes
Pricing	View pricing in the Cloudflare dashboard ↗

const response = await env.AI.run(
  'inworld/tts-2',
  {
    output_format: 'mp3',
    temperature: 1,
    text: 'Hello! Welcome to Cloudflare AI Gateway. Let me show you what we can do.',
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "inworld/tts-2",
  "input": {
    "output_format": "mp3",
    "temperature": 1,
    "text": "Hello! Welcome to Cloudflare AI Gateway. Let me show you what we can do.",
    "timestamp_type": "none",
    "voice_id": "Dennis"
  }
}'

Output
Raw response

{
  "gatewayMetadata": {
    "keySource": "Unified"
  },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/simple-speech.mp3"
  },
  "state": "Completed"
}
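
The raw response above carries the generated audio as a URL under result.audio. A minimal sketch of reading that field safely follows. The TtsResponse interface and the getAudioUrl helper are hypothetical names introduced here, and treating "Completed" as the only success state is an assumption based on the example response; other states are not documented on this page.

```typescript
// Hypothetical typing of the raw response shown above; only the fields used here are declared.
interface TtsResponse {
  state: string;
  result?: { audio?: string };
}

// Returns the audio URL, or throws if the response does not look like the
// "Completed" example. Assumption: any other state means no audio is available.
function getAudioUrl(response: TtsResponse): string {
  if (response.state !== 'Completed') {
    throw new Error(`Unexpected state: ${response.state}`);
  }
  const url = response.result?.audio;
  if (typeof url !== 'string' || url.length === 0) {
    throw new Error('Response has no result.audio URL');
  }
  return url;
}
```

In a Worker, the returned URL could then be passed to fetch to download the audio bytes.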

Examples

Natural Language Steering — Direct the voice with bracketed natural-language cues for emotion, pace, and style.

TypeScript
cURL

const response = await env.AI.run(
  'inworld/tts-2',
  {
    output_format: 'mp3',
    temperature: 1,
    text: "[speak with excitement] I'm really excited about Inworld's new model. Have you tried out the steering capabilities? It's pretty cool!",
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "inworld/tts-2",
  "input": {
    "output_format": "mp3",
    "temperature": 1,
    "text": "[speak with excitement] I'\''m really excited about Inworld'\''s new model. Have you tried out the steering capabilities? It'\''s pretty cool!",
    "timestamp_type": "none",
    "voice_id": "Dennis"
  }
}'

Output
Raw response

{
  "gatewayMetadata": {
    "keySource": "Unified"
  },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/natural-language-steering.mp3"
  },
  "state": "Completed"
}
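
The steering cue in the example above is simply a bracketed phrase placed at the start of the text field. As a rough sketch, a small helper can build that string; withSteering is a hypothetical name, not part of any SDK, and rejecting brackets inside the cue is a defensive assumption rather than a documented rule.

```typescript
// Hypothetical helper: prefixes text with a bracketed steering cue,
// producing the same "[cue] text" shape used in the examples on this page.
function withSteering(cue: string, text: string): string {
  const trimmed = cue.trim();
  if (trimmed.length === 0) {
    return text; // no cue given, leave the text unchanged
  }
  if (/[\[\]]/.test(trimmed)) {
    // Assumption: nested brackets are not meaningful, so reject them early.
    throw new Error('cue must not contain brackets');
  }
  return `[${trimmed}] ${text}`;
}
```

The result can be passed as the text value in the env.AI.run call, for example withSteering('whisper', 'This is a secret just between us.').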

Whisper — Use steering tags to whisper

TypeScript
cURL

const response = await env.AI.run(
  'inworld/tts-2',
  {
    output_format: 'mp3',
    temperature: 1,
    text: '[whisper] This is a secret just between us.',
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "inworld/tts-2",
  "input": {
    "output_format": "mp3",
    "temperature": 1,
    "text": "[whisper] This is a secret just between us.",
    "timestamp_type": "none",
    "voice_id": "Dennis"
  }
}'

Output
Raw response

{
  "gatewayMetadata": {
    "keySource": "Unified"
  },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/whisper.mp3"
  },
  "state": "Completed"
}

High Quality Audio — Higher sample rate for studio quality

TypeScript
cURL

const response = await env.AI.run(
  'inworld/tts-2',
  {
    output_format: 'mp3',
    sample_rate: 48000,
    temperature: 1,
    text: 'This recording is generated at studio quality for the best possible listening experience.',
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "inworld/tts-2",
  "input": {
    "output_format": "mp3",
    "sample_rate": 48000,
    "temperature": 1,
    "text": "This recording is generated at studio quality for the best possible listening experience.",
    "timestamp_type": "none",
    "voice_id": "Dennis"
  }
}'

Output
Raw response

{
  "gatewayMetadata": {
    "keySource": "Unified"
  },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/high-quality-audio.mp3"
  },
  "state": "Completed"
}

With Text Normalization — Expand numbers and abbreviations before synthesis

TypeScript
cURL

const response = await env.AI.run(
  'inworld/tts-2',
  {
    apply_text_normalization: true,
    output_format: 'mp3',
    temperature: 1,
    text: 'The meeting is at 3:30 PM on Jan 15th, 2026. Please confirm by calling 555-0123.',
    timestamp_type: 'none',
    voice_id: 'Dennis',
  },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "inworld/tts-2",
  "input": {
    "apply_text_normalization": true,
    "output_format": "mp3",
    "temperature": 1,
    "text": "The meeting is at 3:30 PM on Jan 15th, 2026. Please confirm by calling 555-0123.",
    "timestamp_type": "none",
    "voice_id": "Dennis"
  }
}'

Output
Raw response

{
  "gatewayMetadata": {
    "keySource": "Unified"
  },
  "result": {
    "audio": "https://pub-04a6d208d361438ea01b797e6973bd19.r2.dev/catalog/inworld__tts-2/with-text-normalization.mp3"
  },
  "state": "Completed"
}

Parameters

Input
Output

text

string | required | maxLength: 2000
The text to be synthesized into speech. Maximum input of 2,000 characters.
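
Because text is capped at 2,000 characters, longer scripts have to be split across several requests. The sketch below packs whole sentences into chunks under the limit. chunkText and splitSentences are hypothetical helpers, the sentence detection is a simple heuristic, whitespace between sentences is normalized to a single space, and JavaScript string length counts UTF-16 code units, which may differ from how the service counts characters.

```typescript
const MAX_CHARS = 2000; // maxLength documented for the text parameter

// Splits on runs of . ! ? that are followed by whitespace or end of input,
// so decimals such as "3.30" stay intact.
function splitSentences(text: string): string[] {
  const pieces = text.match(/[\s\S]*?[.!?]+(?=\s|$)\s*|[\s\S]+$/g) ?? [];
  return pieces.map((p) => p.trim()).filter((p) => p.length > 0);
}

// Greedily packs sentences into chunks of at most maxChars each.
function chunkText(text: string, maxChars: number = MAX_CHARS): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const sentence of splitSentences(text)) {
    let rest = sentence;
    // A single sentence longer than the limit is cut at the limit.
    while (rest.length > maxChars) {
      if (current.length > 0) {
        chunks.push(current);
        current = '';
      }
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    if (rest.length === 0) continue;
    const candidate = current.length > 0 ? `${current} ${rest}` : rest;
    if (candidate.length > maxChars) {
      chunks.push(current);
      current = rest;
    } else {
      current = candidate;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk can be sent as its own request; joining the resulting audio files is left to the caller.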

voice_id

string | required | default: Dennis
enum: Loretta, Darlene, Marlene, Hank, Evelyn, Celeste, Pippa, Tessa, Liam, Callum, Hamish, Abby, Graham, Rupert, Mortimer, Snik, Anjali, Saanvi, Arjun, Claire, Oliver, Simon, Elliot, James, Serena, Gareth, Vinny, Lauren, Jessica, Ethan, Tyler, Jason, Chloe, Veronica, Victoria, Miranda, Sebastian, Victor, Malcolm, Nate, Brian, Amina, Kelsey, Derek, Evan, Kayla, Jake, Grant, Tristan, Nadia, Selene, Marcus, Riley, Damon, Cedric, Mia, Naomi, Jonah, Levi, Avery, Brandon, Conrad, Bianca, Lucian, Trevor, Alex, Ashley, Craig, Deborah, Dennis, Edward, Elizabeth, Hades, Julia, Pixie, Mark, Olivia, Priya, Ronald, Sarah, Shaun, Theodore, Timothy, Wendy, Dominus, Hana, Clive, Carter, Blake, Luna, Reed, Duncan, Felix, Eleanor, Sophie
The ID of the voice to use for synthesizing speech. Defaults to Dennis.

output_format

string | required | default: mp3 | enum: mp3, opus, wav, flac
The output format for the audio. Supported formats are mp3, opus, wav, and flac. Defaults to mp3.

bit_rate

integer | minimum: -9007199254740991 | maximum: 9007199254740991
Bits per second of the audio. Only for compressed audio formats (mp3, opus). The default is 128,000.

sample_rate

integer | minimum: -9007199254740991 | maximum: 9007199254740991
The synthesis sample rate in hertz. Accepts: 8000, 16000, 22050, 24000, 32000, 44100, 48000. The default is 48,000.

speaking_rate

number | minimum: 0.5 | maximum: 1.5
Speaking rate/speed, in the range [0.5, 1.5]. The default is 1.0. We recommend using values above 0.8 to ensure high quality.

temperature

number | required | default: 1 | minimum: 0.01 | maximum: 2
Determines the degree of randomness when sampling audio tokens. Defaults to 1.0. Accepts values between 0 (exclusive) and 2 (inclusive). Higher values are more expressive; lower values are more deterministic.

timestamp_type

string | required | default: none | enum: none, word, character
Controls timestamp metadata returned with the audio. "word" returns word-level timing; "character" returns character-level timing. Note: requesting timestamps adds latency. Defaults to none.

apply_text_normalization

boolean
When enabled, text normalization expands numbers, dates, times, and abbreviations before converting to speech. Turning this off may reduce latency.
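
The input constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, not an official validator: TtsInput and validateTtsInput are hypothetical names, the limits are copied from the parameter descriptions on this page, and whether the service rejects bit_rate for wav or flac (rather than ignoring it) is an assumption.

```typescript
// Hypothetical typing of the input object, based on the parameters listed above.
interface TtsInput {
  text: string;
  voice_id: string;
  output_format: 'mp3' | 'opus' | 'wav' | 'flac';
  temperature: number;
  timestamp_type: 'none' | 'word' | 'character';
  bit_rate?: number;
  sample_rate?: number;
  speaking_rate?: number;
  apply_text_normalization?: boolean;
}

// Sample rates listed as accepted in the sample_rate description.
const SAMPLE_RATES: number[] = [8000, 16000, 22050, 24000, 32000, 44100, 48000];

// Returns a list of problems; an empty list means the input passed these checks.
function validateTtsInput(input: TtsInput): string[] {
  const problems: string[] = [];
  if (input.text.length > 2000) {
    problems.push('text exceeds 2000 characters');
  }
  if (input.temperature < 0.01 || input.temperature > 2) {
    problems.push('temperature must be between 0.01 and 2');
  }
  if (input.sample_rate !== undefined && SAMPLE_RATES.indexOf(input.sample_rate) === -1) {
    problems.push('sample_rate is not one of the accepted values');
  }
  if (input.speaking_rate !== undefined && (input.speaking_rate < 0.5 || input.speaking_rate > 1.5)) {
    problems.push('speaking_rate must be between 0.5 and 1.5');
  }
  const compressed = input.output_format === 'mp3' || input.output_format === 'opus';
  if (input.bit_rate !== undefined && !compressed) {
    // Assumption: flagged here because bit_rate is documented only for mp3 and opus.
    problems.push('bit_rate applies only to mp3 and opus');
  }
  return problems;
}
```

Running this check before env.AI.run gives a clear local error message instead of a failed request.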

audio

string
Output field: URL to the generated audio file, returned as result.audio in the raw response.
