whisper-large-v3-turbo
Automatic Speech Recognition • OpenAI
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
| Model Info | |
|---|---|
| Batch | Yes |
| Unit Pricing | $0.00051 per audio minute |
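At this rate, transcribing a 60-minute recording costs roughly 60 × $0.00051 ≈ $0.031.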
Usage
Workers - TypeScript
```ts
import { Buffer } from 'node:buffer';

export interface Env {
  AI: Ai;
}

const URL = "https://pub-dbcf9f0bd3af47ca9d40971179ee62de.r2.dev/02f6edc0-1f7b-4272-bd17-f05335104725/audio.mp3";

export default {
  async fetch(request, env, ctx): Promise<Response> {
    // Fetch the source audio file
    const mp3 = await fetch(URL);
    if (!mp3.ok) {
      return Response.json({ error: `Failed to fetch MP3: ${mp3.status}` });
    }

    // Base64-encode the raw audio bytes
    const mp3Buffer = await mp3.arrayBuffer();
    const base64 = Buffer.from(mp3Buffer).toString("base64");

    try {
      const res = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
        audio: base64,
        // Specify the language using an ISO 639-1 code.
        // Examples: "en" (English), "es" (Spanish), "fr" (French)
        // If omitted, the model will auto-detect the language.
        language: "en",
      });
      return Response.json(res);
    } catch (e) {
      console.error(e);
      return Response.json({ error: "An unexpected error occurred" });
    }
  },
} satisfies ExportedHandler<Env>;
```
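The Worker above assumes a Workers AI binding named `AI` is configured for the project. A minimal `wrangler.toml` sketch (adjust the binding name to match whatever your code expects):

```toml
# Expose Workers AI to the Worker as env.AI
[ai]
binding = "AI"
```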
Python

```python
import requests
import base64

API_BASE_URL = "https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/"
headers = {"Authorization": "Bearer {API_KEY}"}

def run(model, input):
    response = requests.post(f"{API_BASE_URL}{model}", headers=headers, json=input)
    return response.json()

# Base64-encode the audio file
with open("audio.mp3", "rb") as audio_file:
    audio_base64 = base64.b64encode(audio_file.read()).decode("utf-8")

# Specify the language using an ISO 639-1 code.
# Examples: "en" (English), "es" (Spanish), "fr" (French)
# If omitted, the model will auto-detect the language.
output = run("@cf/openai/whisper-large-v3-turbo", {
    "audio": audio_base64,
    "language": "en"
})
print(output)
```

curl
```sh
# Encode the audio file as base64
AUDIO_BASE64=$(base64 -i audio.mp3)

# Specify the language using an ISO 639-1 code.
# Examples: "en" (English), "es" (Spanish), "fr" (French)
# If omitted, the model will auto-detect the language.
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/openai/whisper-large-v3-turbo \
  -X POST \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  -d "{\"audio\": \"$AUDIO_BASE64\", \"language\": \"en\"}"
```

Parameters
* indicates a required field
Input

- `audio` *: One of the following:
  - `string`: Base64 encoded value of the audio data.
  - `object`:
    - `body` (object)
    - `contentType` (string)
- `task` (string, default `transcribe`): Supported tasks are 'translate' or 'transcribe'.
- `language` (string): The language of the audio being transcribed or translated.
- `vad_filter` (boolean, default `false`): Preprocess the audio with a voice activity detection model.
- `initial_prompt` (string): A text prompt to help provide context to the model on the contents of the audio.
- `prefix` (string): A prefix prepended to the beginning of the transcription output, which can guide the transcription result.
- `beam_size` (integer, default `5`): The number of beams to use in beam search decoding. Higher values may improve accuracy at the cost of speed.
- `condition_on_previous_text` (boolean, default `true`): Whether to condition on previous text during transcription. Setting this to `false` may help prevent hallucination loops.
- `no_speech_threshold` (number, default `0.6`): Threshold for detecting no-speech segments. Segments with a no-speech probability above this value are skipped.
- `compression_ratio_threshold` (number, default `2.4`): Threshold for filtering out segments with a high compression ratio, which often indicates repetitive or hallucinated text.
- `log_prob_threshold` (number, default `-1`): Threshold for filtering out segments with a low average log probability, indicating low confidence.
- `hallucination_silence_threshold` (number): Optional threshold (in seconds) to skip silent periods that may cause hallucinations.
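As a sketch of how these parameters fit together, the request below reuses the `run` helper and `audio_base64` value from the Python example above; the specific values are illustrative choices, not tuned recommendations:

```python
# Translate the audio to English and apply stricter anti-hallucination settings.
output = run("@cf/openai/whisper-large-v3-turbo", {
    "audio": audio_base64,
    "task": "translate",                   # default is "transcribe"
    "vad_filter": True,                    # strip silence before decoding
    "initial_prompt": "A conversation about astronomy.",  # illustrative context hint
    "condition_on_previous_text": False,   # may help prevent hallucination loops
    "hallucination_silence_threshold": 2,  # skip silent periods longer than 2 seconds
})
print(output)
```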
Output

- `transcription_info` (object):
  - `language` (string): The language of the audio being transcribed or translated.
  - `language_probability` (number): The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.
  - `duration` (number): The total duration of the original audio file, in seconds.
  - `duration_after_vad` (number): The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.
- `text` * (string): The complete transcription of the audio.
- `word_count` (number): The total number of words in the transcription.
- `segments` (array of objects), each with:
  - `start` (number): The starting time of the segment within the audio, in seconds.
  - `end` (number): The ending time of the segment within the audio, in seconds.
  - `text` (string): The transcription of the segment.
  - `temperature` (number): The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.
  - `avg_logprob` (number): The average log probability of the predictions for the words in this segment, indicating overall confidence.
  - `compression_ratio` (number): The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process.
  - `no_speech_prob` (number): The probability that the segment contains no speech, represented as a decimal between 0 and 1.
  - `words` (array of objects), each with:
    - `word` (string): The individual word transcribed from the audio.
    - `start` (number): The starting time of the word within the audio, in seconds.
    - `end` (number): The ending time of the word within the audio, in seconds.
- `vtt` (string): The transcription in WebVTT format, which includes timing and text information for use in subtitles.
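To make the response shape concrete, the sketch below walks the fields documented above. It reuses the `run` helper and `audio_base64` value from the Python example, and assumes the standard Cloudflare v4 API envelope, in which the model output is nested under a `result` key:

```python
response = run("@cf/openai/whisper-large-v3-turbo", {"audio": audio_base64})
result = response.get("result", {})

# Full transcription and word count
print(result.get("text", ""))
print("words:", result.get("word_count"))

# Per-segment timing, as documented above
for seg in result.get("segments", []):
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}")

# Save the WebVTT subtitles, if present
if result.get("vtt"):
    with open("audio.vtt", "w", encoding="utf-8") as f:
        f.write(result["vtt"])
```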
API Schemas
The following schemas are based on JSON Schema.
{ "type": "object", "properties": { "audio": { "anyOf": [ { "type": "string", "description": "Base64 encoded value of the audio data." }, { "type": "object", "properties": { "body": { "type": "object" }, "contentType": { "type": "string" } } } ] }, "task": { "type": "string", "default": "transcribe", "description": "Supported tasks are 'translate' or 'transcribe'." }, "language": { "type": "string", "description": "The language of the audio being transcribed or translated." }, "vad_filter": { "type": "boolean", "default": false, "description": "Preprocess the audio with a voice activity detection model." }, "initial_prompt": { "type": "string", "description": "A text prompt to help provide context to the model on the contents of the audio." }, "prefix": { "type": "string", "description": "The prefix appended to the beginning of the output of the transcription and can guide the transcription result." }, "beam_size": { "type": "integer", "default": 5, "description": "The number of beams to use in beam search decoding. Higher values may improve accuracy at the cost of speed." }, "condition_on_previous_text": { "type": "boolean", "default": true, "description": "Whether to condition on previous text during transcription. Setting to false may help prevent hallucination loops." }, "no_speech_threshold": { "type": "number", "default": 0.6, "description": "Threshold for detecting no-speech segments. Segments with no-speech probability above this value are skipped." }, "compression_ratio_threshold": { "type": "number", "default": 2.4, "description": "Threshold for filtering out segments with high compression ratio, which often indicate repetitive or hallucinated text." }, "log_prob_threshold": { "type": "number", "default": -1, "description": "Threshold for filtering out segments with low average log probability, indicating low confidence." }, "hallucination_silence_threshold": { "type": "number", "description": "Optional threshold (in seconds) to skip silent periods that may cause hallucinations." } }, "required": [ "audio" ]}{ "type": "object", "contentType": "application/json", "properties": { "transcription_info": { "type": "object", "properties": { "language": { "type": "string", "description": "The language of the audio being transcribed or translated." }, "language_probability": { "type": "number", "description": "The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1." }, "duration": { "type": "number", "description": "The total duration of the original audio file, in seconds." }, "duration_after_vad": { "type": "number", "description": "The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds." } } }, "text": { "type": "string", "description": "The complete transcription of the audio." }, "word_count": { "type": "number", "description": "The total number of words in the transcription." }, "segments": { "type": "array", "items": { "type": "object", "properties": { "start": { "type": "number", "description": "The starting time of the segment within the audio, in seconds." }, "end": { "type": "number", "description": "The ending time of the segment within the audio, in seconds." }, "text": { "type": "string", "description": "The transcription of the segment." }, "temperature": { "type": "number", "description": "The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs." 
}, "avg_logprob": { "type": "number", "description": "The average log probability of the predictions for the words in this segment, indicating overall confidence." }, "compression_ratio": { "type": "number", "description": "The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process." }, "no_speech_prob": { "type": "number", "description": "The probability that the segment contains no speech, represented as a decimal between 0 and 1." }, "words": { "type": "array", "items": { "type": "object", "properties": { "word": { "type": "string", "description": "The individual word transcribed from the audio." }, "start": { "type": "number", "description": "The starting time of the word within the audio, in seconds." }, "end": { "type": "number", "description": "The ending time of the word within the audio, in seconds." } } } } } } }, "vtt": { "type": "string", "description": "The transcription in WebVTT format, which includes timing and text information for use in subtitles." } }, "required": [ "text" ]}