
whisper-large-v3-turbo Beta

Automatic Speech Recognition · OpenAI
@cf/openai/whisper-large-v3-turbo

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.

    Usage

    Workers - TypeScript

    import { Buffer } from 'node:buffer';

    export interface Env {
      AI: Ai;
    }

    const URL = "https://pub-dbcf9f0bd3af47ca9d40971179ee62de.r2.dev/02f6edc0-1f7b-4272-bd17-f05335104725/audio.mp3";

    export default {
      async fetch(request, env, ctx): Promise<Response> {
        const mp3 = await fetch(URL);
        if (!mp3.ok) {
          return Response.json({ error: `Failed to fetch MP3: ${mp3.status}` }, { status: 502 });
        }
        const mp3Buffer = await mp3.arrayBuffer();
        // Buffer.from accepts the ArrayBuffer directly; no encoding argument is needed
        // (the 'binary' argument in the original would be misread as a byte offset).
        const base64 = Buffer.from(mp3Buffer).toString("base64");
        try {
          const res = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
            audio: base64,
          });
          return Response.json(res);
        } catch (e) {
          console.error(e);
          return Response.json({ error: "An unexpected error occurred" }, { status: 500 });
        }
      },
    } satisfies ExportedHandler<Env>;

    Parameters

    * indicates a required field

    Input

    • audio * string

      Base64 encoded value of the audio data.

    • task string default transcribe

      Supported tasks are 'translate' or 'transcribe'.

    • language string default en

      The language of the audio being transcribed or translated.

    • vad_filter string default false

      Preprocess the audio with a voice activity detection model.

    • initial_prompt string

      A text prompt to help provide context to the model on the contents of the audio.

    • prefix string

      The prefix is appended to the beginning of the transcription output and can guide the transcription result.
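The optional inputs above can be combined in a single request payload. As a sketch (the `WhisperInput` interface name is my own; only the field names and types come from the parameter list above), a payload asking the model to translate French audio might look like:

```typescript
// Hypothetical typed payload mirroring the documented input parameters.
interface WhisperInput {
  audio: string; // base64-encoded audio data (the only required field)
  task?: "transcribe" | "translate";
  language?: string;
  vad_filter?: string;
  initial_prompt?: string;
  prefix?: string;
}

const input: WhisperInput = {
  audio: "...", // base64 audio would go here
  task: "translate",
  language: "fr",
  initial_prompt: "A podcast about astronomy.",
};
```

The object would then be passed as the second argument to `env.AI.run(...)`, as in the usage example above.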

    Output

    • transcription_info object

      • language string

        The language of the audio being transcribed or translated.

      • language_probability number

        The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.

      • duration number

        The total duration of the original audio file, in seconds.

      • duration_after_vad number

        The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.

    • text * string

      The complete transcription of the audio.

    • word_count number

      The total number of words in the transcription.

    • segments object

      • start number

        The starting time of the segment within the audio, in seconds.

      • end number

        The ending time of the segment within the audio, in seconds.

      • text string

        The transcription of the segment.

      • temperature number

        The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.

      • avg_logprob number

        The average log probability of the predictions for the words in this segment, indicating overall confidence.

      • compression_ratio number

        The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process.

      • no_speech_prob number

        The probability that the segment contains no speech, represented as a decimal between 0 and 1.

      • words array

        • items object

          • word string

            The individual word transcribed from the audio.

          • start number

            The starting time of the word within the audio, in seconds.

          • end number

            The ending time of the word within the audio, in seconds.

    • vtt string

      The transcription in WebVTT format, which includes timing and text information for use in subtitles.
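Since each segment carries `start`, `end`, and `text`, the response can also be turned into WebVTT-style cues by hand when the returned `vtt` string is not used directly. A minimal sketch, assuming a response shaped like the `segments` output described above (the sample segment values are invented for illustration):

```typescript
// Hypothetical segments, shaped like the `segments` output documented above.
const segments = [
  { start: 0, end: 2.5, text: "Hello there." },
  { start: 2.5, end: 5, text: "Welcome to the show." },
];

// Format seconds as a WebVTT timestamp, e.g. 2.5 -> "00:00:02.500".
function toTimestamp(s: number): string {
  const h = Math.floor(s / 3600);
  const m = Math.floor((s % 3600) / 60);
  const sec = (s % 60).toFixed(3).padStart(6, "0");
  return `${String(h).padStart(2, "0")}:${String(m).padStart(2, "0")}:${sec}`;
}

// Assemble a WebVTT document: header line, blank line, then one cue per segment.
const vtt = ["WEBVTT", ""]
  .concat(segments.map(seg => `${toTimestamp(seg.start)} --> ${toTimestamp(seg.end)}\n${seg.text}`))
  .join("\n");
```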

    API Schemas

    The following input schema is based on JSON Schema.

    {
      "type": "object",
      "properties": {
        "audio": {
          "type": "string",
          "description": "Base64 encoded value of the audio data."
        },
        "task": {
          "type": "string",
          "default": "transcribe",
          "description": "Supported tasks are 'translate' or 'transcribe'."
        },
        "language": {
          "type": "string",
          "default": "en",
          "description": "The language of the audio being transcribed or translated."
        },
        "vad_filter": {
          "type": "string",
          "default": "false",
          "description": "Preprocess the audio with a voice activity detection model."
        },
        "initial_prompt": {
          "type": "string",
          "description": "A text prompt to help provide context to the model on the contents of the audio."
        },
        "prefix": {
          "type": "string",
          "description": "The prefix is appended to the beginning of the transcription output and can guide the transcription result."
        }
      },
      "required": [
        "audio"
      ]
    }
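The schema marks `audio` as the only required field, so a caller can guard a request with a minimal runtime check before invoking the model. A sketch (the helper name is my own):

```typescript
// Hypothetical guard: checks the one field the schema above marks as required.
function hasRequiredAudio(payload: Record<string, unknown>): boolean {
  return typeof payload.audio === "string" && payload.audio.length > 0;
}
```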