llama-4-scout-17b-16e-instruct

Text Generation • Meta • Hosted

Meta's Llama 4 Scout is a natively multimodal model with 17 billion active parameters and 16 experts. The Llama 4 models use a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.

Model Info
Context Window	131,000 tokens
Terms and License	link
Function calling	Yes
Vision	Yes
Batch	Yes
Unit Pricing	$0.27 per M input tokens, $0.85 per M output tokens
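
The unit prices above translate into a per-request estimate with simple arithmetic. The sketch below is illustrative only: the function name `estimate_cost_usd` is hypothetical, and it assumes billing is strictly linear in token count with no minimums or rounding rules, which this page does not confirm.

```python
# Prices copied from the Unit Pricing row above (USD per million tokens).
INPUT_PRICE_PER_M = 0.27
OUTPUT_PRICE_PER_M = 0.85


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the charge in USD for one request (hypothetical helper).

    Assumption: cost scales linearly with token counts.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Example: a 2,000-token prompt that produces 256 output tokens.
print(f"{estimate_cost_usd(2_000, 256):.6f} USD")
```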

Playground

Try out this model with the Workers AI LLM Playground. It requires no setup or authentication and offers an instant way to preview and test a model directly in the browser.

Launch the LLM Playground

Usage

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {

    const messages = [
      { role: "system", content: "You are a friendly assistant" },
      {
        role: "user",
        content: "What is the origin of the phrase Hello, World",
      },
    ];

    // With stream: true, run() resolves to a stream of server-sent events.
    const stream = await env.AI.run("@cf/meta/llama-4-scout-17b-16e-instruct", {
      messages,
      stream: true,
    });

    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
} satisfies ExportedHandler<Env>;

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {

    const messages = [
      { role: "system", content: "You are a friendly assistant" },
      {
        role: "user",
        content: "What is the origin of the phrase Hello, World",
      },
    ];
    // Without stream: true, run() resolves to the complete response object.
    const response = await env.AI.run("@cf/meta/llama-4-scout-17b-16e-instruct", { messages });

    return Response.json(response);
  },
} satisfies ExportedHandler<Env>;

import os
import requests

ACCOUNT_ID = "your-account-id"
AUTH_TOKEN = os.environ.get("CLOUDFLARE_AUTH_TOKEN")

prompt = "Tell me all about PEP-8"
response = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-4-scout-17b-16e-instruct",
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    json={
        "messages": [
            {"role": "system", "content": "You are a friendly assistant"},
            {"role": "user", "content": prompt},
        ]
    },
)
result = response.json()
print(result)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/meta/llama-4-scout-17b-16e-instruct \
  -X POST \
  -H "Authorization: Bearer $CLOUDFLARE_AUTH_TOKEN" \
  -d '{ "messages": [{ "role": "system", "content": "You are a friendly assistant" }, { "role": "user", "content": "Why is pizza so good" }]}'
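
The REST examples above print the raw JSON reply. A small helper for pulling out the generated text might look like the sketch below. It assumes the reply is wrapped in the usual Cloudflare v4 API envelope (`success`, `errors`, `result`) and that the text sits under `result.response`, matching the `response` output field documented on this page; the function name `extract_response` is hypothetical.

```python
def extract_response(envelope: dict) -> str:
    """Return the generated text from a REST reply, or raise on failure.

    Assumption: replies look like
    {"success": bool, "errors": [...], "result": {"response": "..."}}.
    """
    if not envelope.get("success", False):
        raise RuntimeError(f"Workers AI request failed: {envelope.get('errors')}")
    result = envelope.get("result") or {}
    return result.get("response", "")


# Usage with the Python example above:
#   text = extract_response(response.json())
```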

Parameters

Synchronous — Send a request and receive a complete response

Input

prompt

string, required, minLength: 1. The input text prompt for the model to generate a response.

guided_json

object. JSON schema that should be fulfilled for the response.

response_format

object

raw

boolean, default: false. If true, a chat template is not applied and you must adhere to the specific model's expected formatting.

stream

boolean, default: false. If true, the response is streamed back incrementally using server-sent events (SSE).

max_tokens

integer, default: 256. The maximum number of tokens to generate in the response.

temperature

number, default: 0.15, minimum: 0, maximum: 5. Controls the randomness of the output; higher values produce more random results.

top_p

number, minimum: 0, maximum: 2. Adjusts the creativity of the AI's responses by controlling how many possible words it considers. Lower values make outputs more predictable; higher values allow for more varied and creative responses.

top_k

integer, minimum: 1, maximum: 50. Limits the AI to choose from the top 'k' most probable words. Lower values make responses more focused; higher values introduce more variety and potential surprises.

seed

integer, minimum: 1, maximum: 9999999999. Random seed for reproducibility of the generation.

repetition_penalty

number, minimum: 0, maximum: 2. Penalty for repeated tokens; higher values discourage repetition.

frequency_penalty

number, minimum: 0, maximum: 2. Decreases the likelihood of the model repeating the same lines verbatim.

presence_penalty

number, minimum: 0, maximum: 2. Increases the likelihood of the model introducing new topics.

Output

response

string. The generated text response from the model.

usage

object. Usage statistics for the inference request.

tool_calls

array. An array of tool call requests made during the response generation.
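
Because most sampling parameters have documented bounds, it can be convenient to check them client-side before sending a request. The following is a minimal sketch, not part of the Workers AI SDK: `LIMITS` and `build_payload` are hypothetical names, and the ranges are copied from the synchronous input parameters listed above.

```python
# (minimum, maximum) pairs copied from the parameter list above.
LIMITS = {
    "temperature": (0, 5),
    "top_p": (0, 2),
    "top_k": (1, 50),
    "seed": (1, 9999999999),
    "repetition_penalty": (0, 2),
    "frequency_penalty": (0, 2),
    "presence_penalty": (0, 2),
}


def build_payload(prompt: str, **options) -> dict:
    """Build a prompt-style request body, rejecting out-of-range values."""
    if len(prompt) < 1:
        raise ValueError("prompt must contain at least one character")
    payload = {"prompt": prompt}
    for name, value in options.items():
        if name in LIMITS:
            low, high = LIMITS[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        # Options without documented bounds (max_tokens, raw, stream,
        # guided_json, response_format) are passed through unchanged.
        payload[name] = value
    return payload


payload = build_payload("Tell me all about PEP-8", temperature=0.15, max_tokens=256)
print(payload)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post` in the Python example from the Usage section.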

Streaming — Send a request with `stream: true` and receive server-sent events

Input

prompt

string, required, minLength: 1. The input text prompt for the model to generate a response.

guided_json

object. JSON schema that should be fulfilled for the response.

response_format

object

raw

boolean, default: false. If true, a chat template is not applied and you must adhere to the specific model's expected formatting.

stream

boolean, default: false. If true, the response is streamed back incrementally using server-sent events (SSE).

max_tokens

integer, default: 256. The maximum number of tokens to generate in the response.

temperature

number, default: 0.15, minimum: 0, maximum: 5. Controls the randomness of the output; higher values produce more random results.

top_p

number, minimum: 0, maximum: 2. Adjusts the creativity of the AI's responses by controlling how many possible words it considers. Lower values make outputs more predictable; higher values allow for more varied and creative responses.

top_k

integer, minimum: 1, maximum: 50. Limits the AI to choose from the top 'k' most probable words. Lower values make responses more focused; higher values introduce more variety and potential surprises.

seed

integer, minimum: 1, maximum: 9999999999. Random seed for reproducibility of the generation.

repetition_penalty

number, minimum: 0, maximum: 2. Penalty for repeated tokens; higher values discourage repetition.

frequency_penalty

number, minimum: 0, maximum: 2. Decreases the likelihood of the model repeating the same lines verbatim.

presence_penalty

number, minimum: 0, maximum: 2. Increases the likelihood of the model introducing new topics.

Output

type: string

contentType: text/event-stream

format: binary
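
A streamed reply arrives as `text/event-stream` data, so the client has to reassemble the text from individual events. The sketch below shows one way to do that. It assumes each event is a line of the form `data: {"response": "..."}` and that the stream ends with `data: [DONE]`; neither detail is spelled out in the schema above, and `iter_stream_text` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from decoded server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        text = chunk.get("response")
        if text:
            yield text


# Offline demonstration with hand-written events.
sample = ['data: {"response": "Hel"}', "", 'data: {"response": "lo"}', "", "data: [DONE]"]
print("".join(iter_stream_text(sample)))
```

With the `requests` library, one option is to call `requests.post(..., stream=True)` with `"stream": True` in the JSON body and feed `response.iter_lines(decode_unicode=True)` into the function.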

Batch — Send multiple requests in a single API call

Input

requests

array, required

Output

response

string. The generated text response from the model.

usage

object. Usage statistics for the inference request.

tool_calls

array. An array of tool call requests made during the response generation.
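
The batch input is a single object with a required `requests` array. As a rough sketch, a body could be assembled as shown below. It assumes each array entry has the same shape as a synchronous input (a `prompt` plus optional parameters), which this page does not state explicitly; `build_batch_payload` is a hypothetical helper.

```python
def build_batch_payload(prompts: list, **shared_options) -> dict:
    """Wrap several prompts in the documented `requests` array.

    Assumption: every entry mirrors the synchronous input schema.
    """
    if not prompts:
        raise ValueError("at least one prompt is required")
    return {"requests": [{"prompt": p, **shared_options} for p in prompts]}


batch = build_batch_payload(
    ["Tell me a joke", "Summarize PEP-8 in one sentence"],
    max_tokens=128,
)
print(batch)
```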
