whisper-large-v3-turbo Beta
Automatic Speech Recognition • OpenAI
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
Usage
Workers - TypeScript
curl
Parameters
Input
- audio (string): Base64-encoded audio data.
- task (string, default: transcribe): The task to perform. Supported tasks are 'translate' and 'transcribe'.
- language (string, default: en): The language of the audio being transcribed or translated.
- vad_filter (string, default: false): Preprocess the audio with a voice activity detection (VAD) model.
- initial_prompt (string): A text prompt that gives the model context about the contents of the audio.
- prefix (string): A prefix added to the beginning of the transcription output, which can guide the transcription result.
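As a rough illustration of how these input fields fit together, the sketch below builds a request payload from raw audio bytes. The field names come from the list above; the `WhisperInput` type, the `toBase64` encoder, and the `buildWhisperInput` helper are hypothetical names introduced here and are not part of any official SDK. The encoder is written by hand only so the example has no runtime dependencies; in practice a platform Base64 utility would likely be used instead.

```typescript
// Field names follow the Input list above; everything else is illustrative.
interface WhisperInput {
  audio: string; // Base64-encoded audio data
  task?: "transcribe" | "translate";
  language?: string;
  vad_filter?: string; // listed as a string in the parameter list
  initial_prompt?: string;
  prefix?: string;
}

const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Dependency-free Base64 encoder (standard alphabet, '=' padding).
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i] ?? 0;
    const b1 = has1 ? bytes[i + 1] ?? 0 : 0;
    const b2 = has2 ? bytes[i + 2] ?? 0 : 0;
    out += B64_ALPHABET.charAt(b0 >> 2);
    out += B64_ALPHABET.charAt(((b0 & 3) << 4) | (b1 >> 4));
    out += has1 ? B64_ALPHABET.charAt(((b1 & 15) << 2) | (b2 >> 6)) : "=";
    out += has2 ? B64_ALPHABET.charAt(b2 & 63) : "=";
  }
  return out;
}

// Hypothetical helper: encodes the audio and merges any optional parameters.
function buildWhisperInput(
  audioBytes: Uint8Array,
  options: Omit<WhisperInput, "audio"> = {},
): WhisperInput {
  return { audio: toBase64(audioBytes), ...options };
}

// Placeholder bytes stand in for real audio data.
const input = buildWhisperInput(new Uint8Array([77, 97, 110]), {
  task: "translate",
  language: "es",
});
console.log(JSON.stringify(input));
```

Only `audio` is set unconditionally here; the remaining fields are left out unless the caller supplies them, so the defaults listed above apply.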
Output
- transcription_info (object)
  - language (string): The language of the audio being transcribed or translated.
  - language_probability (number): The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.
  - duration (number): The total duration of the original audio file, in seconds.
  - duration_after_vad (number): The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.
- text (string): The complete transcription of the audio.
- word_count (number): The total number of words in the transcription.
- segments (object)
  - start (number): The starting time of the segment within the audio, in seconds.
  - end (number): The ending time of the segment within the audio, in seconds.
  - text (string): The transcription of the segment.
  - temperature (number): The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.
  - avg_logprob (number): The average log probability of the predictions for the words in this segment, indicating overall confidence.
  - compression_ratio (number): The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process.
  - no_speech_prob (number): The probability that the segment contains no speech, represented as a decimal between 0 and 1.
  - words (array)
    - items (object)
      - word (string): The individual word transcribed from the audio.
      - start (number): The starting time of the word within the audio, in seconds.
      - end (number): The ending time of the word within the audio, in seconds.
- vtt (string): The transcription in WebVTT format, which includes timing and text information for use in subtitles.
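To show how the segment fields above might be consumed, here is a minimal sketch that drops segments likely to contain no speech and prints the rest with WebVTT-style timestamps. It assumes `segments` arrives as a list of segment objects (the listing types it as `object`), and the `WhisperSegment` type, both helper functions, the 0.5 threshold, and the sample values are all hypothetical choices made for illustration.

```typescript
// Shapes mirror the Output list above; treating `segments` as an array is an assumption.
interface WhisperSegment {
  start: number;
  end: number;
  text: string;
  avg_logprob?: number;
  no_speech_prob?: number;
}

// Format a time in seconds as HH:MM:SS.mmm, the timestamp layout WebVTT cues use.
function formatTimestamp(seconds: number): string {
  const pad = (n: number, width: number): string => {
    let s = String(n);
    while (s.length < width) s = "0" + s;
    return s;
  };
  const totalMs = Math.round(seconds * 1000);
  const totalSec = Math.floor(totalMs / 1000);
  const h = Math.floor(totalSec / 3600);
  const m = Math.floor(totalSec / 60) % 60;
  const s = totalSec % 60;
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(totalMs % 1000, 3)}`;
}

// Keep segments whose no_speech_prob is at or below the threshold.
// The 0.5 default is an arbitrary illustrative value, not a documented one.
function speechSegments(
  segments: WhisperSegment[],
  maxNoSpeechProb = 0.5,
): WhisperSegment[] {
  return segments.filter((seg) => (seg.no_speech_prob ?? 0) <= maxNoSpeechProb);
}

// Render each segment as "start --> end text".
function toCueLines(segments: WhisperSegment[]): string[] {
  return segments.map(
    (seg) =>
      `${formatTimestamp(seg.start)} --> ${formatTimestamp(seg.end)} ${seg.text.trim()}`,
  );
}

// Made-up sample data, not real model output.
const sampleSegments: WhisperSegment[] = [
  { start: 0, end: 3.5, text: " Hello world.", no_speech_prob: 0.02 },
  { start: 3.5, end: 6, text: " ...", no_speech_prob: 0.93 },
];
const cues = toCueLines(speechSegments(sampleSegments));
console.log(cues.join("\n"));
```

When subtitles are all that is needed, the `vtt` field already carries the full WebVTT text; a filter like this is only useful when segments need to be screened or reformatted first.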
API Schemas
The following schemas are based on JSON Schema