
whisper-large-v3-turbo Beta

Automatic Speech Recognition · OpenAI
@cf/openai/whisper-large-v3-turbo

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.

Features
Beta: Yes

Usage

Workers - TypeScript

import { Buffer } from 'node:buffer';

export interface Env {
  AI: Ai;
}

const URL = "https://pub-dbcf9f0bd3af47ca9d40971179ee62de.r2.dev/02f6edc0-1f7b-4272-bd17-f05335104725/audio.mp3";

export default {
  async fetch(request, env, ctx): Promise<Response> {
    // Fetch the source audio file.
    const mp3 = await fetch(URL);
    if (!mp3.ok) {
      return Response.json({ error: `Failed to fetch MP3: ${mp3.status}` });
    }

    // The model expects the audio as a base64-encoded string.
    const mp3Buffer = await mp3.arrayBuffer();
    const base64 = Buffer.from(mp3Buffer).toString("base64");

    try {
      const res = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
        audio: base64,
      });
      return Response.json(res);
    } catch (e) {
      console.error(e);
      return Response.json({ error: "An unexpected error occurred" });
    }
  },
} satisfies ExportedHandler<Env>;
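
The same binding also works when the audio arrives in the request itself rather than from a remote URL. The following variation is a sketch, not part of the official example: it assumes the client POSTs raw audio bytes in the request body, which are base64-encoded before calling the model.

import { Buffer } from 'node:buffer';

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("Send a POST request with an audio file in the body", { status: 405 });
    }

    // Read the uploaded audio and base64-encode it, as the model expects.
    const audioBuffer = await request.arrayBuffer();
    const base64 = Buffer.from(audioBuffer).toString("base64");

    try {
      const res = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
        audio: base64,
      });
      return Response.json(res);
    } catch (e) {
      console.error(e);
      return Response.json({ error: "An unexpected error occurred" });
    }
  },
} satisfies ExportedHandler<Env>;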

Parameters

* indicates a required field

Input

  • audio * string

    Base64 encoded value of the audio data.

  • task string default transcribe

    Supported tasks are 'translate' or 'transcribe'.

  • language string default en

    The language of the audio being transcribed or translated.

  • vad_filter string default false

    Preprocess the audio with a voice activity detection model.

  • initial_prompt string

    A text prompt to help provide context to the model on the contents of the audio.

  • prefix string

    The prefix is appended to the beginning of the transcription output and can guide the transcription result.
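
The optional fields above are passed alongside audio in the same env.AI.run call. The snippet below is illustrative only; the values and the base64Audio variable are placeholders, not defaults.

// Inside a Worker fetch handler, as in the usage example above.
// base64Audio is assumed to be a base64-encoded audio file.
const result = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
  audio: base64Audio,                 // required
  task: "translate",                  // 'transcribe' (default) or 'translate'
  language: "fr",                     // language spoken in the audio
  vad_filter: "true",                 // preprocess with voice activity detection
  initial_prompt: "A conference talk about speech recognition.",
});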

Output

  • transcription_info object

    • language string

      The language of the audio being transcribed or translated.

    • language_probability number

      The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.

    • duration number

      The total duration of the original audio file, in seconds.

    • duration_after_vad number

      The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.

  • text * string

    The complete transcription of the audio.

  • word_count number

    The total number of words in the transcription.

  • segments object

    • start number

      The starting time of the segment within the audio, in seconds.

    • end number

      The ending time of the segment within the audio, in seconds.

    • text string

      The transcription of the segment.

    • temperature number

      The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.

    • avg_logprob number

      The average log probability of the predictions for the words in this segment, indicating overall confidence.

    • compression_ratio number

      The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process.

    • no_speech_prob number

      The probability that the segment contains no speech, represented as a decimal between 0 and 1.

    • words array

      • items object

        • word string

          The individual word transcribed from the audio.

        • start number

          The starting time of the word within the audio, in seconds.

        • end number

          The ending time of the word within the audio, in seconds.

  • vtt string

    The transcription in WebVTT format, which includes timing and text information for use in subtitles.
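
These fields can be read directly from the object returned by env.AI.run. The snippet below is a sketch based on the fields documented above; it assumes segments is returned as an array of the objects described here.

// res is the result of env.AI.run("@cf/openai/whisper-large-v3-turbo", { audio: base64Audio });

// Detected language and its confidence.
console.log(res.transcription_info.language, res.transcription_info.language_probability);

// Full transcription and word count.
console.log(`${res.word_count} words:`, res.text);

// Per-segment timings (assuming segments is an array of the objects above).
for (const segment of res.segments ?? []) {
  console.log(`[${segment.start}s - ${segment.end}s] ${segment.text}`);
}

// Serve the WebVTT output as a subtitle track.
const vttResponse = new Response(res.vtt, {
  headers: { "content-type": "text/vtt" },
});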

API Schemas

The following schema describes the model input and is based on JSON Schema

{
  "type": "object",
  "properties": {
    "audio": {
      "type": "string",
      "description": "Base64 encoded value of the audio data."
    },
    "task": {
      "type": "string",
      "default": "transcribe",
      "description": "Supported tasks are 'translate' or 'transcribe'."
    },
    "language": {
      "type": "string",
      "default": "en",
      "description": "The language of the audio being transcribed or translated."
    },
    "vad_filter": {
      "type": "string",
      "default": "false",
      "description": "Preprocess the audio with a voice activity detection model."
    },
    "initial_prompt": {
      "type": "string",
      "description": "A text prompt to help provide context to the model on the contents of the audio."
    },
    "prefix": {
      "type": "string",
      "description": "The prefix is appended to the beginning of the transcription output and can guide the transcription result."
    }
  },
  "required": [
    "audio"
  ]
}
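
The same input schema applies when the model is called over the Workers AI REST API rather than a Worker binding. The following sketch is an assumption about that usage; the account ID, API token, and base64Audio variable are placeholders to substitute with your own values.

// Placeholders: substitute your own account ID, API token, and audio data.
const ACCOUNT_ID = "your-account-id";
const API_TOKEN = "your-api-token";

const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/openai/whisper-large-v3-turbo`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_TOKEN}`,
      "content-type": "application/json",
    },
    // Body fields follow the JSON Schema above; only `audio` is required.
    body: JSON.stringify({
      audio: base64Audio,
      task: "transcribe",
      language: "en",
    }),
  },
);

console.log(await response.json());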