Cloudflare Docs
whisper-large-v3-turbo

Automatic Speech Recognition · OpenAI
@cf/openai/whisper-large-v3-turbo

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.

Model Info
Batch: Yes
Unit Pricing: $0.00051 per audio minute
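
As a rough illustration of the unit price, the sketch below estimates the cost of transcribing a clip from its duration. It assumes billing is strictly proportional to audio length, with no rounding or minimum charge, which this page does not state. `estimate_cost_usd` is a hypothetical helper, not part of any Cloudflare SDK.

```python
PRICE_PER_AUDIO_MINUTE_USD = 0.00051  # unit price listed in Model Info


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimate cost, assuming linear and unrounded per-minute billing."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return duration_seconds / 60 * PRICE_PER_AUDIO_MINUTE_USD


# A one-hour recording: 60 minutes * $0.00051 per minute.
print(f"${estimate_cost_usd(3600):.4f}")  # prints $0.0306
```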

Usage

Workers - TypeScript

import { Buffer } from "node:buffer";

export interface Env {
  AI: Ai;
}

// Sample audio file. Named AUDIO_URL to avoid shadowing the global URL class.
const AUDIO_URL = "https://pub-dbcf9f0bd3af47ca9d40971179ee62de.r2.dev/02f6edc0-1f7b-4272-bd17-f05335104725/audio.mp3";

export default {
  async fetch(request, env, ctx): Promise<Response> {
    const mp3 = await fetch(AUDIO_URL);
    if (!mp3.ok) {
      return Response.json({ error: `Failed to fetch MP3: ${mp3.status}` }, { status: 502 });
    }

    // Encode the raw audio bytes as base64, as the `audio` input expects.
    const mp3Buffer = await mp3.arrayBuffer();
    const base64 = Buffer.from(mp3Buffer).toString("base64");

    try {
      const res = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
        audio: base64,
        // Specify the language using an ISO 639-1 code.
        // Examples: "en" (English), "es" (Spanish), "fr" (French)
        // If omitted, the model will auto-detect the language.
        language: "en",
      });
      return Response.json(res);
    } catch (e) {
      console.error(e);
      return Response.json({ error: "An unexpected error occurred" }, { status: 500 });
    }
  },
} satisfies ExportedHandler<Env>;

Python

import base64

import requests

# Replace {ACCOUNT_ID} and {API_KEY} with your own values.
API_BASE_URL = "https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/"
headers = {"Authorization": "Bearer {API_KEY}"}


def run(model, payload):
    """POST the JSON payload to the given model and return the decoded response."""
    response = requests.post(f"{API_BASE_URL}{model}", headers=headers, json=payload)
    return response.json()


# Read the audio file and encode it as base64.
with open("audio.mp3", "rb") as audio_file:
    audio_base64 = base64.b64encode(audio_file.read()).decode("utf-8")

# Specify the language using an ISO 639-1 code.
# Examples: "en" (English), "es" (Spanish), "fr" (French)
# If omitted, the model will auto-detect the language.
output = run("@cf/openai/whisper-large-v3-turbo", {
    "audio": audio_base64,
    "language": "en",
})
print(output)

curl

# Encode the audio file as base64 on a single line.
# Stripping newlines keeps the JSON body valid when base64 wraps its output.
AUDIO_BASE64=$(base64 < audio.mp3 | tr -d '\n')
# Specify the language using an ISO 639-1 code.
# Examples: "en" (English), "es" (Spanish), "fr" (French)
# If omitted, the model will auto-detect the language.
curl "https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/openai/whisper-large-v3-turbo" \
  -X POST \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  -d "{\"audio\": \"$AUDIO_BASE64\", \"language\": \"en\"}"

Parameters

Fields marked "required" must be provided; all other fields are optional.

Input

  • audio required (one of the following)

    • string

      Base64 encoded value of the audio data.

    • object

      • body object

      • contentType string

  • task string default transcribe

    Supported tasks are 'translate' or 'transcribe'.

  • language string

    The language of the audio being transcribed or translated.

  • vad_filter boolean default false

    Preprocess the audio with a voice activity detection model.

  • initial_prompt string

    A text prompt to help provide context to the model on the contents of the audio.

  • prefix string

    Text used as a prefix at the beginning of the transcription output; it can guide the transcription result.

  • beam_size integer default 5

    The number of beams to use in beam search decoding. Higher values may improve accuracy at the cost of speed.

  • condition_on_previous_text boolean default true

    Whether to condition on previous text during transcription. Setting to false may help prevent hallucination loops.

  • no_speech_threshold number default 0.6

    Threshold for detecting no-speech segments. Segments with no-speech probability above this value are skipped.

  • compression_ratio_threshold number default 2.4

    Threshold for filtering out segments with high compression ratio, which often indicate repetitive or hallucinated text.

  • log_prob_threshold number default -1

    Threshold for filtering out segments with low average log probability, indicating low confidence.

  • hallucination_silence_threshold number

    Optional threshold (in seconds) to skip silent periods that may cause hallucinations.
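
The input fields above can be combined into a single request body. The following is a minimal sketch of one way to assemble it; `build_payload` is a hypothetical helper, and rejecting unknown option names is a choice made here to catch typos, not a behavior of the API.

```python
import base64
from typing import Any, Optional

# Optional fields documented in the Input list above.
OPTIONAL_FIELDS = {
    "vad_filter", "initial_prompt", "prefix", "beam_size",
    "condition_on_previous_text", "no_speech_threshold",
    "compression_ratio_threshold", "log_prob_threshold",
    "hallucination_silence_threshold",
}


def build_payload(
    audio_bytes: bytes,
    task: str = "transcribe",
    language: Optional[str] = None,
    **options: Any,
) -> dict:
    """Build a JSON-serializable request body from raw audio bytes."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")

    payload = {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "task": task,
    }
    if language is not None:
        payload["language"] = language  # omit to let the model auto-detect
    payload.update(options)
    return payload


# Placeholder bytes stand in for real audio file contents.
payload = build_payload(
    b"fake audio bytes",
    task="translate",
    language="es",
    vad_filter=True,
    condition_on_previous_text=False,
)
print(sorted(payload))
```

The resulting dictionary can be passed as the JSON body of the REST request or as the second argument to the Workers binding.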

Output

  • transcription_info object

    • language string

      The language of the audio being transcribed or translated.

    • language_probability number

      The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.

    • duration number

      The total duration of the original audio file, in seconds.

    • duration_after_vad number

      The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.

  • text string required

    The complete transcription of the audio.

  • word_count number

    The total number of words in the transcription.

  • segments array

    • items object

      • start number

        The starting time of the segment within the audio, in seconds.

      • end number

        The ending time of the segment within the audio, in seconds.

      • text string

        The transcription of the segment.

      • temperature number

        The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.

      • avg_logprob number

        The average log probability of the predictions for the words in this segment, indicating overall confidence.

      • compression_ratio number

        The compression ratio of the segment's text (its size divided by its compressed size). High values often indicate repetitive or hallucinated text.

      • no_speech_prob number

        The probability that the segment contains no speech, represented as a decimal between 0 and 1.

      • words array

        • items object

          • word string

            The individual word transcribed from the audio.

          • start number

            The starting time of the word within the audio, in seconds.

          • end number

            The ending time of the word within the audio, in seconds.

  • vtt string

    The transcription in WebVTT format, which includes timing and text information for use in subtitles.
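
The per-segment fields above can be used to post-process a result on the client. The sketch below keeps only segments that look confident and prints them with timestamps. The `sample` dictionary is invented for illustration and is not real model output; the threshold values simply mirror the input defaults, and `confident_segments` and `format_timestamp` are hypothetical helpers. It assumes `result` is the model output object itself (for REST calls, that object may be nested inside a response envelope).

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def confident_segments(result, max_no_speech_prob=0.6, min_avg_logprob=-1.0):
    """Return segments that are likely speech and decoded with reasonable confidence."""
    kept = []
    for seg in result.get("segments", []):
        if seg.get("no_speech_prob", 0.0) > max_no_speech_prob:
            continue  # probably silence or noise
        if seg.get("avg_logprob", 0.0) < min_avg_logprob:
            continue  # low-confidence decoding
        kept.append(seg)
    return kept


# Invented sample shaped like the Output fields above (not real model output).
sample = {
    "text": "Hello world. Thanks for watching.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": "Hello world.",
         "avg_logprob": -0.2, "no_speech_prob": 0.01},
        {"start": 1.5, "end": 4.0, "text": "Thanks for watching.",
         "avg_logprob": -1.7, "no_speech_prob": 0.85},
    ],
}

for seg in confident_segments(sample):
    start = format_timestamp(seg["start"])
    end = format_timestamp(seg["end"])
    print(f"[{start} -> {end}] {seg['text']}")
```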

API Schemas

The following input schema is based on JSON Schema.

{
  "type": "object",
  "properties": {
    "audio": {
      "anyOf": [
        {
          "type": "string",
          "description": "Base64 encoded value of the audio data."
        },
        {
          "type": "object",
          "properties": {
            "body": {
              "type": "object"
            },
            "contentType": {
              "type": "string"
            }
          }
        }
      ]
    },
    "task": {
      "type": "string",
      "default": "transcribe",
      "description": "Supported tasks are 'translate' or 'transcribe'."
    },
    "language": {
      "type": "string",
      "description": "The language of the audio being transcribed or translated."
    },
    "vad_filter": {
      "type": "boolean",
      "default": false,
      "description": "Preprocess the audio with a voice activity detection model."
    },
    "initial_prompt": {
      "type": "string",
      "description": "A text prompt to help provide context to the model on the contents of the audio."
    },
    "prefix": {
      "type": "string",
      "description": "The prefix appended to the beginning of the output of the transcription and can guide the transcription result."
    },
    "beam_size": {
      "type": "integer",
      "default": 5,
      "description": "The number of beams to use in beam search decoding. Higher values may improve accuracy at the cost of speed."
    },
    "condition_on_previous_text": {
      "type": "boolean",
      "default": true,
      "description": "Whether to condition on previous text during transcription. Setting to false may help prevent hallucination loops."
    },
    "no_speech_threshold": {
      "type": "number",
      "default": 0.6,
      "description": "Threshold for detecting no-speech segments. Segments with no-speech probability above this value are skipped."
    },
    "compression_ratio_threshold": {
      "type": "number",
      "default": 2.4,
      "description": "Threshold for filtering out segments with high compression ratio, which often indicate repetitive or hallucinated text."
    },
    "log_prob_threshold": {
      "type": "number",
      "default": -1,
      "description": "Threshold for filtering out segments with low average log probability, indicating low confidence."
    },
    "hallucination_silence_threshold": {
      "type": "number",
      "description": "Optional threshold (in seconds) to skip silent periods that may cause hallucinations."
    }
  },
  "required": [
    "audio"
  ]
}
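
Before sending a request, a payload can be checked against the constraints in this schema. The sketch below is a hand-written, simplified check using only the standard library; it is not a full JSON Schema validator. `validate_input` is a hypothetical helper, and its integer check is stricter than JSON Schema (it rejects `5.0`). Unknown properties are allowed, since the schema does not forbid them.

```python
import numbers
from typing import List

# Property name -> expected Python type, transcribed from the schema above.
PROPERTY_TYPES = {
    "task": str,
    "language": str,
    "vad_filter": bool,
    "initial_prompt": str,
    "prefix": str,
    "beam_size": int,
    "condition_on_previous_text": bool,
    "no_speech_threshold": numbers.Real,
    "compression_ratio_threshold": numbers.Real,
    "log_prob_threshold": numbers.Real,
    "hallucination_silence_threshold": numbers.Real,
}


def validate_input(payload: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problems found."""
    errors = []
    if "audio" not in payload:
        errors.append("missing required property: audio")
    elif not isinstance(payload["audio"], (str, dict)):
        errors.append("audio must be a string or an object")

    for name, expected in PROPERTY_TYPES.items():
        if name not in payload:
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if expected is not bool and isinstance(value, bool):
            errors.append(f"{name} has the wrong type: bool")
        elif not isinstance(value, expected):
            errors.append(f"{name} has the wrong type: {type(value).__name__}")
    return errors


print(validate_input({"audio": "UklGRg==", "beam_size": 5}))
print(validate_input({"beam_size": "five"}))
```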