Date: 2024-10-01
whisper-large-v3-turbo Beta
Automatic Speech Recognition • OpenAI
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
Usage
Workers - TypeScript
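A minimal sketch of calling the model from a Worker. It assumes the model identifier `@cf/openai/whisper-large-v3-turbo` and a Workers AI binding that exposes a `run(model, input)` method. Neither is stated on this page, so verify both against your own configuration. The `AiBinding` interface and the `toBase64` and `transcribe` helpers are hypothetical names used only for illustration.

```typescript
// Minimal shape of the AI binding used in this sketch (an assumption, not from this page).
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Assumed model identifier; check it against the model catalog before use.
const MODEL = "@cf/openai/whisper-large-v3-turbo";

// Encode raw audio bytes as a Base64 string, as the `audio` input expects.
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Send the audio to the model and return the raw response.
async function transcribe(ai: AiBinding, audioBytes: Uint8Array): Promise<unknown> {
  return ai.run(MODEL, { audio: toBase64(audioBytes) });
}
```

Inside a Worker's fetch handler, this could be called as `await transcribe(env.AI, new Uint8Array(await request.arrayBuffer()))`, assuming the binding is named `AI`.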
curl
Parameters
Input
audio string
Base64 encoded value of the audio data.
task string default transcribe
Supported tasks are 'translate' or 'transcribe'.
language string default en
The language of the audio being transcribed or translated.
vad_filter string default false
Preprocess the audio with a voice activity detection model.
initial_prompt string
A text prompt to help provide context to the model on the contents of the audio.
prefix string
A prefix added to the beginning of the transcription output, which can guide the transcription result.
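The input fields above can be captured in a small typed helper. This is a sketch under stated assumptions: `WhisperInput`, `DEFAULTS`, and `buildInput` are hypothetical names, and the defaults are applied on the client only to make the documented values visible. The service is expected to apply them itself when a field is omitted.

```typescript
// Input fields as listed in this section.
interface WhisperInput {
  audio: string; // Base64-encoded audio data
  task?: "transcribe" | "translate";
  language?: string;
  vad_filter?: string; // documented as a string with default "false"
  initial_prompt?: string;
  prefix?: string;
}

// Documented defaults, repeated here purely for illustration.
const DEFAULTS = { task: "transcribe", language: "en", vad_filter: "false" } as const;

// Hypothetical helper: check the inputs and merge caller options over the defaults.
function buildInput(
  audio: string,
  options: Omit<WhisperInput, "audio"> = {}
): WhisperInput {
  if (audio.length === 0) {
    throw new Error("audio must be a non-empty Base64 string");
  }
  const task = options.task ?? DEFAULTS.task;
  // Runtime guard for untyped callers; the type system already restricts this.
  if (task !== "transcribe" && task !== "translate") {
    throw new Error("unsupported task: " + String(task));
  }
  return { ...DEFAULTS, ...options, audio, task };
}
```

For example, `buildInput(base64Audio, { task: "translate", initial_prompt: "Speaker names: Ana, Bo" })` yields a payload that keeps the default `language` and `vad_filter` values.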
Output
transcription_info object
  language string
  The language of the audio being transcribed or translated.
  language_probability number
  The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.
  duration number
  The total duration of the original audio file, in seconds.
  duration_after_vad number
  The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.
text string
The complete transcription of the audio.
word_count number
The total number of words in the transcription.
segments object
  start number
  The starting time of the segment within the audio, in seconds.
  end number
  The ending time of the segment within the audio, in seconds.
  text string
  The transcription of the segment.
  temperature number
  The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.
  avg_logprob number
  The average log probability of the predictions for the words in this segment, indicating overall confidence.
  compression_ratio number
  The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process.
  no_speech_prob number
  The probability that the segment contains no speech, represented as a decimal between 0 and 1.
  words array
    items object
      word string
      The individual word transcribed from the audio.
      start number
      The starting time of the word within the audio, in seconds.
      end number
      The ending time of the word within the audio, in seconds.
vtt string
The transcription in WebVTT format, which includes timing and text information for use in subtitles.
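The output fields can be modeled as types, with a couple of small helpers for common checks. This is a sketch, not an official schema. It assumes `segments` arrives as a list of segment objects, although this page labels it `object`. The 0.6 threshold is an arbitrary example value. All type and function names are hypothetical.

```typescript
interface WhisperWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
}

interface WhisperSegment {
  start: number;
  end: number;
  text: string;
  temperature: number;
  avg_logprob: number;
  compression_ratio: number;
  no_speech_prob: number; // between 0 and 1
  words: WhisperWord[];
}

interface TranscriptionInfo {
  language: string;
  language_probability: number; // between 0 and 1
  duration: number; // seconds
  duration_after_vad: number; // seconds
}

interface WhisperOutput {
  transcription_info: TranscriptionInfo;
  text: string;
  word_count: number;
  segments: WhisperSegment[]; // assumption: delivered as a list
  vtt: string;
}

// Fraction of the original audio that remained after VAD.
function speechFraction(info: TranscriptionInfo): number {
  return info.duration > 0 ? info.duration_after_vad / info.duration : 0;
}

// Keep only segments whose no_speech_prob is at or below the threshold.
function likelySpeech(
  segments: WhisperSegment[],
  maxNoSpeechProb: number = 0.6
): WhisperSegment[] {
  return segments.filter((s) => s.no_speech_prob <= maxNoSpeechProb);
}
```

A caller might drop segments that are probably silence before rendering captions, or log `speechFraction` to see how much audio the VAD step removed.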
API Schemas
