
whisper-large-v3-turbo Beta

Automatic Speech Recognition · OpenAI
@cf/openai/whisper-large-v3-turbo

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.

    Usage

    Workers - TypeScript

    export interface Env {
      AI: Ai;
    }

    export default {
      async fetch(request, env): Promise<Response> {
        const res = await fetch(
          "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/samples/cpp/windows/console/samples/enrollment_audio_katie.wav"
        );
        const blob = await res.arrayBuffer();
        const input = {
          audio: [...new Uint8Array(blob)],
        };
        const response = await env.AI.run(
          "@cf/openai/whisper-large-v3-turbo",
          input
        );
        return Response.json({ input: { audio: [] }, response });
      },
    } satisfies ExportedHandler<Env>;

    curl

    curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/openai/whisper-large-v3-turbo \
    -X POST \
    -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
    --data-binary "@talking-llama.mp3"

    Parameters

    Input

    • audio string

      Base64 encoded value of the audio data.

    • task string default transcribe

      Supported tasks are 'translate' or 'transcribe'.

    • language string default en

      The language of the audio being transcribed or translated.

    • vad_filter string default false

      Preprocess the audio with a voice activity detection model.

    • initial_prompt string

      A text prompt to help provide context to the model on the contents of the audio.

    • prefix string

      A prefix that is appended to the beginning of the transcription output; it can guide the transcription result.
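The input schema describes audio as a base64-encoded string, so a JSON request body can be assembled client-side from raw bytes. A minimal sketch, assuming a Node-style runtime where Buffer is available (buildWhisperInput is a hypothetical helper, not part of the API):

```typescript
// Sketch: building a JSON input body with base64 audio, per the schema above.
// buildWhisperInput is an illustrative helper; the API itself only sees the JSON.
function buildWhisperInput(
  audioBytes: Uint8Array,
  opts: { task?: string; language?: string; initial_prompt?: string } = {}
): { audio: string } & Record<string, unknown> {
  return {
    // Base64 encoded value of the audio data (the only required field).
    audio: Buffer.from(audioBytes).toString("base64"),
    // Optional parameters; omitted keys fall back to the documented defaults.
    ...opts,
  };
}

// The bytes "RIFF" (a WAV file's magic number), just as a small demo payload.
const input = buildWhisperInput(new Uint8Array([82, 73, 70, 70]), {
  task: "transcribe",
  language: "en",
});
console.log(input.audio); // "UklGRg=="
```

The resulting object can be sent as the JSON body of a POST to the REST endpoint shown in the curl example.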

    Output

    • transcription_info object

      • language string

        The language of the audio being transcribed or translated.

      • language_probability number

        The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.

      • duration number

        The total duration of the original audio file, in seconds.

      • duration_after_vad number

        The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.

    • text string

      The complete transcription of the audio.

    • word_count number

      The total number of words in the transcription.

    • segments object

      • start number

        The starting time of the segment within the audio, in seconds.

      • end number

        The ending time of the segment within the audio, in seconds.

      • text string

        The transcription of the segment.

      • temperature number

        The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.

      • avg_logprob number

        The average log probability of the predictions for the words in this segment, indicating overall confidence.

      • compression_ratio number

        The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process.

      • no_speech_prob number

        The probability that the segment contains no speech, represented as a decimal between 0 and 1.

      • words array

        • items object

          • word string

            The individual word transcribed from the audio.

          • start number

            The starting time of the word within the audio, in seconds.

          • end number

            The ending time of the word within the audio, in seconds.

    • vtt string

      The transcription in WebVTT format, which includes timing and text information for use in subtitles.
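The per-segment start, end, and text fields above carry enough timing information to assemble subtitle cues yourself, independent of the returned vtt string. A sketch (formatTimestamp and segmentsToVtt are hypothetical helpers, not part of the response):

```typescript
// Sketch: turning the documented "segments" timing fields into WebVTT cues.
// These helpers are illustrative, not part of the Workers AI response.
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as a WebVTT timestamp, e.g. 2.5 -> "00:00:02.500".
function formatTimestamp(seconds: number): string {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = (seconds % 60).toFixed(3).padStart(6, "0");
  return `${String(h).padStart(2, "0")}:${String(m).padStart(2, "0")}:${s}`;
}

function segmentsToVtt(segments: Segment[]): string {
  const cues = segments.map(
    (seg) =>
      `${formatTimestamp(seg.start)} --> ${formatTimestamp(seg.end)}\n${seg.text.trim()}`
  );
  return ["WEBVTT", ...cues].join("\n\n");
}

const vtt = segmentsToVtt([
  { start: 0, end: 2.5, text: " Hello there." },
  { start: 2.5, end: 4, text: " General Kenobi." },
]);
console.log(vtt);
```

In practice the returned vtt field already provides this, but rebuilding it from segments is useful when you want custom cue splitting or to filter segments by no_speech_prob first.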

    API Schemas

    The following schemas are based on JSON Schema.

    {
      "type": "object",
      "properties": {
        "audio": {
          "type": "string",
          "description": "Base64 encoded value of the audio data."
        },
        "task": {
          "type": "string",
          "default": "transcribe",
          "description": "Supported tasks are 'translate' or 'transcribe'."
        },
        "language": {
          "type": "string",
          "default": "en",
          "description": "The language of the audio being transcribed or translated."
        },
        "vad_filter": {
          "type": "string",
          "default": "false",
          "description": "Preprocess the audio with a voice activity detection model."
        },
        "initial_prompt": {
          "type": "string",
          "description": "A text prompt to help provide context to the model on the contents of the audio."
        },
        "prefix": {
          "type": "string",
          "description": "A prefix appended to the beginning of the transcription output; it can guide the transcription result."
        }
      },
      "required": [
        "audio"
      ]
    }
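Since audio is the only entry in the schema's required list, a client can mirror that check before making a request. A minimal sketch (hasRequiredFields is a hypothetical helper; the API performs its own validation server-side):

```typescript
// Sketch: pre-flight check against the schema's "required" list.
// Purely illustrative; the Workers AI endpoint validates requests itself.
function hasRequiredFields(
  body: Record<string, unknown>,
  required: string[] = ["audio"]
): boolean {
  return required.every((key) => typeof body[key] !== "undefined");
}

console.log(hasRequiredFields({ audio: "UklGRg==" })); // true
console.log(hasRequiredFields({ task: "translate" })); // false (no audio)
```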