whisper-large-v3-turbo Beta
Automatic Speech Recognition • OpenAI
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
Usage
Workers - TypeScript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    // Fetch a sample audio file and read it as an ArrayBuffer.
    const res = await fetch(
      "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/samples/cpp/windows/console/samples/enrollment_audio_katie.wav"
    );
    const blob = await res.arrayBuffer();

    // Pass the audio as an array of unsigned 8-bit integers.
    const input = {
      audio: [...new Uint8Array(blob)],
    };

    const response = await env.AI.run(
      "@cf/openai/whisper-large-v3-turbo",
      input
    );

    return Response.json({ input: { audio: [] }, response });
  },
} satisfies ExportedHandler<Env>;
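The example above sends the audio as an array of unsigned 8-bit integers. The input schema further down also documents optional parameters (task, language, vad_filter, initial_prompt, prefix) that can be passed in the same object. A minimal sketch, with illustrative values for the optional fields:

// Inside the same fetch handler as above; `blob` is the ArrayBuffer of the fetched audio.
// The optional values below are illustrative, not required by the model.
const response = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
  audio: [...new Uint8Array(blob)],
  task: "transcribe",        // or "translate"
  language: "en",
  vad_filter: "false",       // the schema types this as a string
  initial_prompt: "A conversation about llamas.", // illustrative context hint
  prefix: "Speaker:",                             // illustrative output prefix
});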
curl
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/openai/whisper-large-v3-turbo \
  -X POST \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --data-binary "@talking-llama.mp3"
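The same REST call can be made from Node.js 18+ with the built-in fetch. A minimal sketch mirroring the curl command above; the file name and the octet-stream content type are assumptions carried over from the example, not requirements stated by the API:

// Sketch of the REST call from Node.js (ESM, Node 18+ for built-in fetch).
// CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN are read from the environment.
import { readFile } from "node:fs/promises";

const accountId = process.env.CLOUDFLARE_ACCOUNT_ID;
const apiToken = process.env.CLOUDFLARE_API_TOKEN;

// Raw audio bytes, equivalent to curl's --data-binary "@talking-llama.mp3".
const audio = await readFile("talking-llama.mp3");

const res = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/@cf/openai/whisper-large-v3-turbo`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/octet-stream", // assumption; curl sends the bytes as-is
    },
    body: audio,
  }
);

console.log(await res.json());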
Parameters
* indicates a required field
Input
- audio * (string): Base64 encoded value of the audio data.
- task (string, default "transcribe"): Supported tasks are 'translate' or 'transcribe'.
- language (string, default "en"): The language of the audio being transcribed or translated.
- vad_filter (string, default "false"): Preprocess the audio with a voice activity detection model.
- initial_prompt (string): A text prompt to help provide context to the model on the contents of the audio.
- prefix (string): A prefix appended to the beginning of the transcription output; it can guide the transcription result.
Output
- transcription_info (object):
  - language (string): The language of the audio being transcribed or translated.
  - language_probability (number): The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.
  - duration (number): The total duration of the original audio file, in seconds.
  - duration_after_vad (number): The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.
- text * (string): The complete transcription of the audio.
- word_count (number): The total number of words in the transcription.
- segments (object):
  - start (number): The starting time of the segment within the audio, in seconds.
  - end (number): The ending time of the segment within the audio, in seconds.
  - text (string): The transcription of the segment.
  - temperature (number): The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.
  - avg_logprob (number): The average log probability of the predictions for the words in this segment, indicating overall confidence.
  - compression_ratio (number): The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process.
  - no_speech_prob (number): The probability that the segment contains no speech, represented as a decimal between 0 and 1.
  - words (array of objects):
    - word (string): The individual word transcribed from the audio.
    - start (number): The starting time of the word within the audio, in seconds.
    - end (number): The ending time of the word within the audio, in seconds.
- vtt (string): The transcription in WebVTT format, which includes timing and text information for use in subtitles.
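A sketch of consuming the fields listed above inside the Worker's fetch handler from the Usage section. Only text is required by the output schema, so the other fields are read defensively; treating segments as a list of segment objects is an assumption:

// `input` is the request object built in the Workers example above.
const result = await env.AI.run("@cf/openai/whisper-large-v3-turbo", input);

console.log(result.text);                          // complete transcription (required)
console.log(result.word_count);                    // total words, if present
console.log(result.transcription_info?.language);  // detected language, if present

// Treating `segments` as an array of segment objects (assumption; the schema labels it "object").
for (const segment of result.segments ?? []) {
  console.log(`${segment.start}s - ${segment.end}s: ${segment.text}`);
}

// The WebVTT captions, if present, can be served directly as a subtitle file.
if (result.vtt) {
  return new Response(result.vtt, { headers: { "Content-Type": "text/vtt" } });
}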
API Schemas
The following schemas are based on JSON Schema.
{
  "type": "object",
  "properties": {
    "audio": { "type": "string", "description": "Base64 encoded value of the audio data." },
    "task": { "type": "string", "default": "transcribe", "description": "Supported tasks are 'translate' or 'transcribe'." },
    "language": { "type": "string", "default": "en", "description": "The language of the audio being transcribed or translated." },
    "vad_filter": { "type": "string", "default": "false", "description": "Preprocess the audio with a voice activity detection model." },
    "initial_prompt": { "type": "string", "description": "A text prompt to help provide context to the model on the contents of the audio." },
    "prefix": { "type": "string", "description": "A prefix appended to the beginning of the transcription output; it can guide the transcription result." }
  },
  "required": ["audio"]
}
{
  "type": "object",
  "contentType": "application/json",
  "properties": {
    "transcription_info": {
      "type": "object",
      "properties": {
        "language": { "type": "string", "description": "The language of the audio being transcribed or translated." },
        "language_probability": { "type": "number", "description": "The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1." },
        "duration": { "type": "number", "description": "The total duration of the original audio file, in seconds." },
        "duration_after_vad": { "type": "number", "description": "The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds." }
      }
    },
    "text": { "type": "string", "description": "The complete transcription of the audio." },
    "word_count": { "type": "number", "description": "The total number of words in the transcription." },
    "segments": {
      "type": "object",
      "properties": {
        "start": { "type": "number", "description": "The starting time of the segment within the audio, in seconds." },
        "end": { "type": "number", "description": "The ending time of the segment within the audio, in seconds." },
        "text": { "type": "string", "description": "The transcription of the segment." },
        "temperature": { "type": "number", "description": "The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs." },
        "avg_logprob": { "type": "number", "description": "The average log probability of the predictions for the words in this segment, indicating overall confidence." },
        "compression_ratio": { "type": "number", "description": "The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process." },
        "no_speech_prob": { "type": "number", "description": "The probability that the segment contains no speech, represented as a decimal between 0 and 1." },
        "words": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "word": { "type": "string", "description": "The individual word transcribed from the audio." },
              "start": { "type": "number", "description": "The starting time of the word within the audio, in seconds." },
              "end": { "type": "number", "description": "The ending time of the word within the audio, in seconds." }
            }
          }
        }
      }
    },
    "vtt": { "type": "string", "description": "The transcription in WebVTT format, which includes timing and text information for use in subtitles." }
  },
  "required": ["text"]
}
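For TypeScript Workers, the schemas above can be mirrored as interfaces. The names below (WhisperTurboInput, WhisperTurboOutput, and so on) are a convenience sketch derived from the JSON Schemas on this page, not official type definitions; every output field except text is treated as optional:

// Sketch: TypeScript interfaces derived from the JSON Schemas above.
export interface WhisperTurboInput {
  audio: string;            // Base64 encoded audio data (required)
  task?: "transcribe" | "translate";
  language?: string;
  vad_filter?: string;      // the schema types this as a string, e.g. "false"
  initial_prompt?: string;
  prefix?: string;
}

export interface WhisperTurboWord {
  word?: string;
  start?: number;           // seconds
  end?: number;             // seconds
}

export interface WhisperTurboSegment {
  start?: number;
  end?: number;
  text?: string;
  temperature?: number;
  avg_logprob?: number;
  compression_ratio?: number;
  no_speech_prob?: number;
  words?: WhisperTurboWord[];
}

export interface WhisperTurboOutput {
  transcription_info?: {
    language?: string;
    language_probability?: number;
    duration?: number;
    duration_after_vad?: number;
  };
  text: string;             // the only field required by the output schema
  word_count?: number;
  segments?: WhisperTurboSegment[]; // the schema labels this "object"; an array of segments is assumed here
  vtt?: string;
}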