Date: 2026-02-19
Changelog
New updates and improvements at Cloudflare.
Workers AI and AI Gateway have received a series of dashboard improvements to help you get started faster and manage your AI workloads more easily.
Navigation and discoverability
AI now has its own top-level section in the Cloudflare dashboard sidebar, so you can find AI features without digging through menus.
Onboarding and getting started
Getting started with AI Gateway is now simpler. When you create your first gateway, we now show your gateway's OpenAI-compatible endpoint and step-by-step guidance to help you configure it. The Playground also includes helpful prompts, and usage pages have clear next steps if you have not made any requests yet.
We've also combined the previously separate code example sections into one view with dropdown selectors for API type, provider, SDK, and authentication method so you can now customize the exact code snippet you need from one place.
Dynamic Routing
- The route builder is now more performant and responsive.
- You can now copy route names to your clipboard with a single click.
- Code examples use the Universal Endpoint format, making it easier to integrate routes into your application.
Observability and analytics
- Small monetary values now display correctly in cost analytics charts, so you can accurately track spending at any scale.
Accessibility
- Improvements to keyboard navigation within the AI Gateway, specifically when exploring usage by provider.
- Improvements to sorting and filtering components on the Workers AI models page.
For more information, refer to the AI Gateway documentation.
Introducing GLM-4.7-Flash on Workers AI, @cloudflare/tanstack-ai, and workers-ai-provider v3.1.1
We're excited to announce GLM-4.7-Flash on Workers AI, a fast and efficient text generation model optimized for multilingual dialogue and instruction-following tasks, along with the brand-new @cloudflare/tanstack-ai ↗ package and workers-ai-provider v3.1.1 ↗.
You can now run AI agents entirely on Cloudflare. With GLM-4.7-Flash's multi-turn tool calling support, plus full compatibility with TanStack AI and the Vercel AI SDK, you have everything you need to build agentic applications that run completely at the edge.
@cf/zai-org/glm-4.7-flash is a multilingual model with a 131,072 token context window, making it ideal for long-form content generation, complex reasoning tasks, and multilingual applications.
Key Features and Use Cases:
- Multi-turn Tool Calling for Agents: Build AI agents that can call functions and tools across multiple conversation turns
- Multilingual Support: Built to handle content generation in multiple languages effectively
- Large Context Window: 131,072 tokens for long-form writing, complex reasoning, and processing long documents
- Fast Inference: Optimized for low-latency responses in chatbots and virtual assistants
- Instruction Following: Excellent at following complex instructions for code generation and structured tasks
Use GLM-4.7-Flash through the Workers AI binding (env.AI.run()), the REST API at /run or /v1/chat/completions, AI Gateway, or via workers-ai-provider for the Vercel AI SDK.
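As a rough illustration of the REST path, the sketch below builds (but does not send) a request for the /v1/chat/completions endpoint. The helper name, the message type, and the Bearer token header are assumptions made for this example; the URL layout follows the account-scoped REST URLs shown elsewhere in this changelog.

```typescript
// Sketch only: prepares a chat completions request for GLM-4.7-Flash.
// buildChatCompletionsRequest is a hypothetical helper, not part of any SDK.

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const GLM_MODEL = "@cf/zai-org/glm-4.7-flash";

function buildChatCompletionsRequest(
  accountId: string,
  apiToken: string,
  messages: ChatMessage[],
): PreparedRequest {
  if (messages.length === 0) {
    throw new Error("at least one message is required");
  }
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1/chat/completions`,
    method: "POST",
    headers: {
      // Assumption: the API token is sent as a Bearer token.
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: GLM_MODEL, messages }),
  };
}

const prepared = buildChatCompletionsRequest("<account_id>", "<api_token>", [
  { role: "user", content: "Summarize this changelog entry in one sentence." },
]);
console.log(prepared.url);
```

The prepared url, method, headers, and body can then be handed to fetch or any other HTTP client.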
Pricing is available on the model page or pricing page.
We've released @cloudflare/tanstack-ai, a new package that brings Workers AI and AI Gateway support to TanStack AI ↗. This provides a framework-agnostic alternative for developers who prefer TanStack's approach to building AI applications.
Workers AI adapters support four configuration modes: plain binding (env.AI), plain REST, AI Gateway binding (env.AI.gateway(id)), and AI Gateway REST. These modes are available across all capabilities:
- Chat (createWorkersAiChat): Streaming chat completions with tool calling, structured output, and reasoning text streaming.
- Image generation (createWorkersAiImage): Text-to-image models.
- Transcription (createWorkersAiTranscription): Speech-to-text.
- Text-to-speech (createWorkersAiTts): Audio generation.
- Summarization (createWorkersAiSummarize): Text summarization.
AI Gateway adapters route requests to third-party providers (OpenAI, Anthropic, Gemini, Grok, and OpenRouter) through Cloudflare AI Gateway for caching, rate limiting, and unified billing.
To get started:
The Workers AI provider for the Vercel AI SDK ↗ now supports three new capabilities beyond chat and image generation:
- Transcription (provider.transcription(model)): Speech-to-text with automatic handling of model-specific input formats across binding and REST paths.
- Text-to-speech (provider.speech(model)): Audio generation with support for voice and speed options.
- Reranking (provider.reranking(model)): Document reranking for RAG pipelines and search result ordering.
This release also includes a comprehensive reliability overhaul (v3.0.5):
- Fixed streaming: Responses now stream token-by-token instead of buffering all chunks, using a proper TransformStream pipeline with backpressure.
- Fixed tool calling: Resolved issues with tool call ID sanitization, conversation history preservation, and a heuristic that silently fell back to non-streaming mode when tools were defined.
- Premature stream termination detection: Streams that end unexpectedly now report finishReason: "error" instead of silently reporting "stop".
- AI Search support: Added createAISearch as the canonical export (renamed from AutoRAG). createAutoRAG still works with a deprecation warning.
To upgrade:
We have partnered with Black Forest Labs (BFL) again to bring their optimized FLUX.2 [klein] 9B model to Workers AI. This distilled model offers enhanced quality compared to the 4B variant, while maintaining cost-effective pricing. With a fixed 4-step inference process, Klein 9B is ideal for rapid prototyping and real-time applications where both speed and quality matter.
Read the BFL blog ↗ to learn more about the model itself, or try it out yourself on our multi modal playground ↗.
Pricing documentation is available on the model page or pricing page.
The model hosted on Workers AI is optimized for speed with a fixed 4-step inference process and supports up to 4 image inputs. Since this is a distilled model, the steps parameter is fixed at 4 and cannot be adjusted. Like FLUX.2 [dev] and FLUX.2 [klein] 4B, this image model uses multipart form data inputs, even if you just have a prompt.
With the REST API, the multipart form data input looks like this:
With the Workers AI binding, you can use it as such:
The parameters you can send to the model are detailed here:
JSON Schema for Model
Required Parameters
- prompt (string) - Text description of the image to generate
Optional Parameters
- input_image_0 (string) - Binary image
- input_image_1 (string) - Binary image
- input_image_2 (string) - Binary image
- input_image_3 (string) - Binary image
- guidance (float) - Guidance scale for generation. Higher values follow the prompt more closely
- width (integer) - Width of the image, default 1024. Range: 256-1920
- height (integer) - Height of the image, default 768. Range: 256-1920
- seed (integer) - Seed for reproducibility
Note: Since this is a distilled model, the steps parameter is fixed at 4 and cannot be adjusted.
The FLUX.2 klein-9b model supports generating images based on reference images, just like FLUX.2 [dev] and FLUX.2 [klein] 4B. You can use this feature to apply the style of one image to another, add a new character to an image, or iterate on past generated images. You would use it with the same multipart form data structure, with the input images in binary. The model supports up to 4 input images.
For the prompt, you can reference the images based on the index, like "take the subject of image 1 and style it like image 0", or even use natural language like "place the dog beside the woman".
You must name the input parameters input_image_0, input_image_1, input_image_2, and input_image_3 for it to work correctly. All input images must be smaller than 512x512.
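The field naming rules above can be sketched as a small helper that plans the multipart fields before they are appended to a form. The helper and its option names are hypothetical; the field names, the 4-image limit, and the 256-1920 range come from the parameter list in this entry. The sketch does not check the 512x512 limit, since that would require decoding the images.

```typescript
// Sketch only: plans multipart field names and values for a FLUX.2 [klein] request.
// planKleinFields and KleinOptions are hypothetical names used for illustration.

type FieldValue = string | Uint8Array;

interface KleinOptions {
  prompt: string;
  inputImages?: Uint8Array[]; // binary image data, at most 4 entries
  width?: number; // 256-1920
  height?: number; // 256-1920
  guidance?: number;
  seed?: number;
}

const MAX_INPUT_IMAGES = 4;
const MIN_SIDE = 256;
const MAX_SIDE = 1920;

function checkSide(name: string, value: number): void {
  if (!Number.isInteger(value) || value < MIN_SIDE || value > MAX_SIDE) {
    throw new RangeError(`${name} must be an integer in ${MIN_SIDE}-${MAX_SIDE}`);
  }
}

function planKleinFields(options: KleinOptions): Array<[string, FieldValue]> {
  const fields: Array<[string, FieldValue]> = [["prompt", options.prompt]];
  const images = options.inputImages ?? [];
  if (images.length > MAX_INPUT_IMAGES) {
    throw new RangeError(`at most ${MAX_INPUT_IMAGES} input images are supported`);
  }
  // Reference images must be named input_image_0 through input_image_3.
  images.forEach((image, index) => {
    fields.push([`input_image_${index}`, image]);
  });
  if (options.width !== undefined) {
    checkSide("width", options.width);
    fields.push(["width", String(options.width)]);
  }
  if (options.height !== undefined) {
    checkSide("height", options.height);
    fields.push(["height", String(options.height)]);
  }
  if (options.guidance !== undefined) {
    fields.push(["guidance", String(options.guidance)]);
  }
  if (options.seed !== undefined) {
    fields.push(["seed", String(options.seed)]);
  }
  // steps is deliberately never emitted: it is fixed at 4 for the klein models.
  return fields;
}

const planned = planKleinFields({
  prompt: "take the subject of image 1 and style it like image 0",
  inputImages: [new Uint8Array([0]), new Uint8Array([1])],
  width: 1024,
  height: 768,
});
console.log(planned.map(([name]) => name).join(", "));
```

In a Worker or script, each planned entry would then be appended to a FormData instance, with binary values wrapped in a Blob.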
Through Workers AI Binding:
We've partnered with Black Forest Labs (BFL) again to bring their optimized FLUX.2 [klein] 4B model to Workers AI! This distilled model offers faster generation and cost-effective pricing, while maintaining great output quality. With a fixed 4-step inference process, Klein 4B is ideal for rapid prototyping and real-time applications where speed matters.
Read the BFL blog ↗ to learn more about the model itself, or try it out yourself on our multi modal playground ↗.
Pricing documentation is available on the model page or pricing page.
The model hosted on Workers AI is optimized for speed with a fixed 4-step inference process and supports up to 4 image inputs. Since this is a distilled model, the steps parameter is fixed at 4 and cannot be adjusted. Like FLUX.2 [dev], this image model uses multipart form data inputs, even if you just have a prompt.
With the REST API, the multipart form data input looks like this:
With the Workers AI binding, you can use it as such:
The parameters you can send to the model are detailed here:
JSON Schema for Model
Required Parameters
- prompt (string) - Text description of the image to generate
Optional Parameters
- input_image_0 (string) - Binary image
- input_image_1 (string) - Binary image
- input_image_2 (string) - Binary image
- input_image_3 (string) - Binary image
- guidance (float) - Guidance scale for generation. Higher values follow the prompt more closely
- width (integer) - Width of the image, default 1024. Range: 256-1920
- height (integer) - Height of the image, default 768. Range: 256-1920
- seed (integer) - Seed for reproducibility
Note: Since this is a distilled model, the steps parameter is fixed at 4 and cannot be adjusted.
Through Workers AI Binding:
We've partnered with Black Forest Labs (BFL) to bring their latest FLUX.2 [dev] model to Workers AI! This model excels in generating high-fidelity images with physical world grounding, multi-language support, and digital asset creation. You can also create specific super images with granular controls like JSON prompting.
Read the BFL blog ↗ to learn more about the model itself. Read our Cloudflare blog ↗ to see the model in action, or try it out yourself on our multi modal playground ↗.
Pricing documentation is available on the model page or pricing page. Note, we expect to drop pricing in the next few days after iterating on the model performance.
The model hosted on Workers AI is able to support up to 4 image inputs (512x512 per input image). Note, this image model is one of the most powerful in the catalog and is expected to be slower than the other image models we currently support. One catch to look out for is that this model takes multipart form data inputs, even if you just have a prompt.
With the REST API, the multipart form data input looks like this:
With the Workers AI binding, you can use it as such:
The parameters you can send to the model are detailed here:
JSON Schema for Model
Required Parameters
- prompt (string) - Text description of the image to generate
Optional Parameters
- input_image_0 (string) - Binary image
- input_image_1 (string) - Binary image
- input_image_2 (string) - Binary image
- input_image_3 (string) - Binary image
- steps (integer) - Number of inference steps. Higher values may improve quality but increase generation time
- guidance (float) - Guidance scale for generation. Higher values follow the prompt more closely
- width (integer) - Width of the image, default 1024. Range: 256-1920
- height (integer) - Height of the image, default 768. Range: 256-1920
- seed (integer) - Seed for reproducibility
Through Workers AI Binding:
The model supports prompting in JSON to get more granular control over images. You would pass the JSON as the value of the 'prompt' field in the multipart form data. See the JSON schema below on the base parameters you can pass to the model.
JSON Prompting Schema
- The model also supports the most common Latin and non-Latin character languages
- You can prompt the model with specific hex codes like #2ECC71
- Try creating digital assets like landing pages, comic strips, infographics too!
Developers can now programmatically retrieve a list of all file formats supported by the Markdown Conversion utility in Workers AI.
You can use the env.AI binding:
Or call the REST API:
Both return a list of file formats that users can convert into Markdown:
Learn more about our Markdown Conversion utility.
Deepgram's newest Flux model @cf/deepgram/flux is now available on Workers AI, hosted directly on Cloudflare's infrastructure. We're excited to be a launch partner with Deepgram and offer their new Speech Recognition model built specifically for enabling voice agents. Check out Deepgram's blog ↗ for more details on the release.
The Flux model can be used in conjunction with Deepgram's speech-to-text model @cf/deepgram/nova-3 and text-to-speech model @cf/deepgram/aura-1 to build end-to-end voice agents. Having Deepgram on Workers AI takes advantage of our edge GPU infrastructure, for ultra low latency voice AI applications.
For the month of October 2025, Deepgram's Flux model will be free to use on Workers AI. Official pricing will be announced soon and charged after the promotional pricing period ends on October 31, 2025. Check out the model page for pricing details in the future.
The new Flux model is WebSocket only as it requires live bi-directional streaming in order to recognize speech activity.
- Create a worker that establishes a websocket connection with @cf/deepgram/flux
- Deploy your worker
- Write a client script to connect to your worker and start sending random audio bytes to it
We're excited to be a launch partner alongside Google ↗ to bring their newest embedding model, EmbeddingGemma, to Workers AI. It delivers best-in-class performance for its size, enabling RAG and semantic search use cases.
@cf/google/embeddinggemma-300m is a 300M parameter embedding model from Google, built from Gemma 3 and the same research used to create Gemini models. This multilingual model supports 100+ languages, making it ideal for RAG systems, semantic search, content classification, and clustering tasks.
Using EmbeddingGemma in AI Search: Now you can leverage EmbeddingGemma directly through AI Search for your RAG pipelines. EmbeddingGemma's multilingual capabilities make it perfect for global applications that need to understand and retrieve content across different languages with exceptional accuracy.
To use EmbeddingGemma for your AI Search projects:
- Go to Create in the AI Search dashboard ↗
- Follow the setup flow for your new RAG instance
- In the Generate Index step, open up More embedding models and select @cf/google/embeddinggemma-300m as your embedding model
- Complete the setup to create an AI Search
Try it out and let us know what you think!
New state-of-the-art models have landed on Workers AI! This time, we're introducing new partner models trained by our friends at Deepgram ↗ and Leonardo ↗, hosted on Workers AI infrastructure.
We're also introducing a new turn detection model that enables you to detect when someone is done speaking, which is useful for building voice agents!
Read the blog ↗ for more details and check out some of the new models on our platform:
- @cf/deepgram/aura-1 is a text-to-speech model that allows you to input text and have it come to life in a customizable voice
- @cf/deepgram/nova-3 is a speech-to-text model that transcribes multilingual audio at a blazingly fast speed
- @cf/pipecat-ai/smart-turn-v2 helps you detect when someone is done speaking
- @cf/leonardo/lucid-origin is a text-to-image model that generates images with sharp graphic design, stunning full-HD renders, or highly specific creative direction
- @cf/leonardo/phoenix-1.0 is a text-to-image model with exceptional prompt adherence and coherent text
You can filter for new partner models with the Partner capability on our Models page.
We're also introducing WebSocket support for some of our audio models, which you can filter through the Realtime capability on our Models page. WebSockets allow you to create a bi-directional connection to our inference server with low latency, which is perfect for those building voice agents.
An example python snippet on how to use WebSockets with our new Aura model:
We're thrilled to be a Day 0 partner with OpenAI ↗ to bring their latest open models ↗ to Workers AI, including support for Responses API, Code Interpreter, and Web Search (coming soon).
Get started with the new models at @cf/openai/gpt-oss-120b and @cf/openai/gpt-oss-20b. Check out the blog ↗ for more details about the new models, and the gpt-oss-120b and gpt-oss-20b model pages for more information about pricing and context windows.
If you call the model through:
- Workers Binding, it will accept/return Responses API: env.AI.run("@cf/openai/gpt-oss-120b")
- REST API on /run endpoint, it will accept/return Responses API: https://api.cloudflare.com/client/v4/accounts/<account_id>/ai/run/@cf/openai/gpt-oss-120b
- REST API on new /responses endpoint, it will accept/return Responses API: https://api.cloudflare.com/client/v4/accounts/<account_id>/ai/v1/responses
- REST API for OpenAI Compatible endpoint, it will return Chat Completions (coming soon): https://api.cloudflare.com/client/v4/accounts/<account_id>/ai/v1/chat/completions
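To make the difference between the /run and /responses endpoints listed above concrete, here is a minimal sketch that builds (without sending) a request for either one. The helper is hypothetical, and two details are assumptions rather than statements from this entry: that /run takes the model from the URL path while /v1/responses takes it from the body, and that a plain string input field is accepted.

```typescript
// Sketch only: prepares gpt-oss requests for the two Responses API REST endpoints.
// buildGptOssRequest is a hypothetical helper, not part of any SDK.

type GptOssModel = "@cf/openai/gpt-oss-120b" | "@cf/openai/gpt-oss-20b";
type GptOssEndpoint = "run" | "responses";

interface GptOssRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildGptOssRequest(
  accountId: string,
  apiToken: string,
  endpoint: GptOssEndpoint,
  model: GptOssModel,
  input: string,
): GptOssRequest {
  const base = `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai`;
  // Assumption: /run names the model in the path, /v1/responses names it in the body.
  const url = endpoint === "run" ? `${base}/run/${model}` : `${base}/v1/responses`;
  const payload = endpoint === "run" ? { input } : { model, input };
  return {
    url,
    headers: {
      // Assumption: the API token is sent as a Bearer token.
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  };
}

const viaRun = buildGptOssRequest("<account_id>", "<api_token>", "run", "@cf/openai/gpt-oss-120b", "Hello");
const viaResponses = buildGptOssRequest("<account_id>", "<api_token>", "responses", "@cf/openai/gpt-oss-20b", "Hello");
console.log(viaRun.url);
console.log(viaResponses.url);
```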
The model is natively trained to support stateful code execution, and we've implemented support for this feature using our Sandbox SDK ↗ and Containers ↗. Cloudflare's Developer Platform is uniquely positioned to support this feature, so we're very excited to bring our products together to support this new use case.
We are working to implement Web Search for the model, where users can bring their own Exa API Key so the model can browse the Internet.
Workers AI for Developer Week - faster inference, new models, async batch API, expanded LoRA support
Happy Developer Week 2025! Workers AI is excited to announce a couple of new features and improvements available today. Check out our blog ↗ for all the announcement details.
We’re rolling out some in-place improvements to our models that can help speed up inference by 2-4x! Users of the models below will enjoy an automatic speed boost starting today:
- @cf/meta/llama-3.3-70b-instruct-fp8-fast gets a speed boost of 2-4x, leveraging techniques like speculative decoding, prefix caching, and an updated inference backend.
- @cf/baai/bge-small-en-v1.5, @cf/baai/bge-base-en-v1.5, and @cf/baai/bge-large-en-v1.5 get an updated backend, which should improve inference times by 2x.
  - With the bge models, we're also announcing a new parameter called pooling, which can take cls or mean as options. We highly recommend using pooling: cls, which will help generate more accurate embeddings. However, embeddings generated with cls pooling are not backwards compatible with mean pooling. To avoid a breaking change, the default remains mean pooling. Please specify pooling: cls to enjoy more accurate embeddings going forward.
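To see why the two pooling modes are not interchangeable, it helps to look at what each one computes. In BERT-style encoders such as the bge models, cls pooling takes the vector of the first ([CLS]) token, while mean pooling averages the vectors of all tokens. The sketch below illustrates this on a toy matrix; the pooling itself happens on the server, and the text field name in the example payload is an assumption.

```typescript
// Sketch only: illustrates cls vs. mean pooling on toy token embeddings.
// The real pooling runs server-side; this just shows why the outputs differ.

type Pooling = "cls" | "mean";

function pool(tokenEmbeddings: number[][], pooling: Pooling): number[] {
  const first = tokenEmbeddings[0];
  if (first === undefined) {
    throw new Error("at least one token embedding is required");
  }
  if (pooling === "cls") {
    // cls pooling: use the first token's vector as the sentence embedding.
    return [...first];
  }
  // mean pooling: average every dimension across all tokens.
  const sums = new Array<number>(first.length).fill(0);
  for (const row of tokenEmbeddings) {
    for (let i = 0; i < first.length; i++) {
      sums[i] = (sums[i] ?? 0) + (row[i] ?? 0);
    }
  }
  return sums.map((value) => value / tokenEmbeddings.length);
}

// Three tokens with two dimensions each.
const toyTokens = [
  [1, 2],
  [3, 4],
  [5, 6],
];
console.log(pool(toyTokens, "cls"));
console.log(pool(toyTokens, "mean"));

// Hypothetical request payload opting in to cls pooling.
// `text` is an assumed input field name; `pooling` is the parameter announced above.
const embeddingRequest = { text: ["What is Workers AI?"], pooling: "cls" as Pooling };
console.log(JSON.stringify(embeddingRequest));
```

Because the two modes produce different vectors for the same input, an index built with mean pooling should be re-embedded before switching queries to cls.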
We’re also excited to launch a few new models in our catalog to help round out your experience with Workers AI. We’ll be deprecating some older models in the future, so stay tuned for a deprecation announcement. Today’s new models include:
@cf/mistralai/mistral-small-3.1-24b-instruct: a 24B parameter model achieving state-of-the-art capabilities comparable to larger models, with support for vision and tool calling.
@cf/google/gemma-3-12b-it: well-suited for a variety of text generation and image understanding tasks, including question answering, summarization and reasoning, with a 128K context window, and multilingual support in over 140 languages.
@cf/qwen/qwq-32b: a medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.
@cf/qwen/qwen2.5-coder-32b-instruct: the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.
Introducing a new batch inference feature that allows you to send us an array of requests, which we will fulfill as fast as possible and send them back as an array. This is really helpful for large workloads such as summarization, embeddings, etc. where you don’t have a human-in-the-loop. Using the batch API will guarantee that your requests are fulfilled eventually, rather than erroring out if we don’t have enough capacity at a given time.
Check out the tutorial to get started! Models that support batch inference today include:
@cf/meta/llama-3.3-70b-instruct-fp8-fast
@cf/baai/bge-small-en-v1.5
@cf/baai/bge-base-en-v1.5
@cf/baai/bge-large-en-v1.5
@cf/baai/bge-m3
@cf/meta/m2m100-1.2b
We’ve upgraded our LoRA experience to include 8 newer models, and can support ranks of up to 32 with a 300MB safetensors file limit (previously limited to rank of 8 and 100MB safetensors). Check out our LoRAs page to get started. Models that support LoRAs now include:
@cf/meta/llama-3.2-11b-vision-instruct
@cf/meta/llama-3.3-70b-instruct-fp8-fast
@cf/meta/llama-guard-3-8b
@cf/meta/llama-3.1-8b-instruct-fast (coming soon)
@cf/deepseek-ai/deepseek-r1-distill-qwen-32b (coming soon)
@cf/qwen/qwen2.5-coder-32b-instruct
@cf/qwen/qwq-32b
@cf/mistralai/mistral-small-3.1-24b-instruct
@cf/google/gemma-3-12b-it
Document conversion plays an important role when designing and developing AI applications and agents. Workers AI now provides the toMarkdown utility method, which developers can use for quick, easy, and convenient conversion and summarization of documents in multiple formats to Markdown.
You can call this new tool through a binding by calling env.AI.toMarkdown() or by using the REST API endpoint.
In this example, we fetch a PDF document and an image from R2 and feed them both to env.AI.toMarkdown(). The result is a list of converted documents. Workers AI models are used automatically to detect and summarize the image.
This is the result:
See Markdown Conversion for more information on supported formats, REST API and pricing.
Workers AI is excited to add 4 new models to the catalog, including 2 brand new classes of models with a text-to-speech and reranker model. Introducing:
- @cf/baai/bge-m3 - a multi-lingual embeddings model that supports over 100 languages. It can also simultaneously perform dense retrieval, multi-vector retrieval, and sparse retrieval, with the ability to process inputs of different granularities.
- @cf/baai/bge-reranker-base - our first reranker model! Rerankers are a type of text classification model that takes a query and context, and outputs a similarity score between the two. When used in RAG systems, you can use a reranker after the initial vector search to find the most relevant documents to return to a user by reranking the outputs.
- @cf/openai/whisper-large-v3-turbo - a faster, more accurate speech-to-text model. This model was added earlier but is graduating out of beta with pricing included today.
- @cf/myshell-ai/melotts - our first text-to-speech model that allows users to generate an MP3 with voice audio from inputted text.
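The reranking step described for @cf/baai/bge-reranker-base can be sketched as follows: after an initial vector search returns candidate documents, the reranker's scores are used to reorder them and keep the best few. The score shape used here (an id that indexes into the candidate list plus a numeric score) is an assumption for illustration, as is the helper itself.

```typescript
// Sketch only: reorders vector-search candidates using reranker scores.
// RerankScore is an assumed shape; applyRerank is a hypothetical helper.

interface RerankScore {
  id: number; // index of the candidate in the original list
  score: number; // higher means more relevant to the query
}

function applyRerank<T>(candidates: T[], scores: RerankScore[], topK: number): T[] {
  return [...scores]
    // Ignore scores that do not point at a real candidate.
    .filter((s) => Number.isInteger(s.id) && s.id >= 0 && s.id < candidates.length)
    // Most relevant first.
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, topK))
    .map((s) => candidates[s.id] as T);
}

const candidates = ["pricing overview", "reranker guide", "embedding tutorial"];
const exampleScores: RerankScore[] = [
  { id: 0, score: 0.1 },
  { id: 1, score: 0.9 },
  { id: 2, score: 0.5 },
];
console.log(applyRerank(candidates, exampleScores, 2));
```

Keeping the reorder step separate from the model call makes it easy to test the ranking logic without a network request.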
Pricing is available for each of these models on the Workers AI pricing page.
This docs update includes a few minor bug fixes to the model schemas for llama-guard and llama-3.2-1b, which you can review on the product changelog.
Try it out and let us know what you think! Stay tuned for more models in the coming days.
Workers AI now supports structured JSON outputs with JSON mode, which allows you to request a structured output response when interacting with AI models.
This makes it much easier to retrieve structured data from your AI models, and avoids the (error prone!) need to parse large unstructured text responses to extract your data.
JSON mode in Workers AI is compatible with the OpenAI SDK's structured outputs ↗ response_format API, which can be used directly in a Worker:
To learn more about JSON mode and structured outputs, visit the Workers AI documentation.
We've updated the Workers AI text generation models to include context windows and limits definitions and changed our APIs to estimate and validate the number of tokens in the input prompt, not the number of characters.
This update allows developers to use larger context windows when interacting with Workers AI models, which can lead to better and more accurate results.
Our catalog page provides more information about each model's supported context window.
We've updated the Workers AI pricing to include the latest models and how model usage maps to Neurons.
- Each model's core input format(s) (tokens, audio seconds, images, etc) now include mappings to Neurons, making it easier to understand how your included Neuron volume is consumed and how you are charged at scale
- Per-model pricing, instead of the previous bucket approach, allows us to be more flexible on how models are charged based on their size, performance and capabilities. As we optimize each model, we can then pass on savings for that model.
- You will still only pay for what you consume: Workers AI inference is serverless, and not billed by the hour.
Going forward, models will be launched with their associated Neuron costs, and we'll be updating the Workers AI dashboard and API to reflect consumption in both raw units and Neurons. Visit the Workers AI pricing page to learn more about Workers AI pricing.
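As a back-of-the-envelope illustration of how raw usage maps to Neurons and then to cost, consider the sketch below. All rates in it are illustrative placeholders, not values from this changelog; the real per-model mappings and prices are listed on the Workers AI pricing page.

```typescript
// Sketch only: converts raw usage units to Neurons, then Neurons to a cost estimate.
// Every rate used in the example is a placeholder, not actual pricing.

function neuronsFor(units: number, neuronsPerMillionUnits: number): number {
  // A model's mapping is assumed here to be expressed per one million raw units (e.g. tokens).
  return (units / 1000000) * neuronsPerMillionUnits;
}

function billableUsd(
  neuronsUsed: number,
  includedNeurons: number,
  usdPerThousandNeurons: number,
): number {
  // Only usage beyond the included Neuron volume is charged.
  const billable = Math.max(0, neuronsUsed - includedNeurons);
  return (billable / 1000) * usdPerThousandNeurons;
}

// Placeholder figures: 2M input tokens at 5,000 Neurons per million tokens,
// 4,000 included Neurons, and a price of 0.01 USD per 1,000 Neurons.
const usedNeurons = neuronsFor(2000000, 5000);
const estimate = billableUsd(usedNeurons, 4000, 0.01);
console.log(`Neurons used: ${usedNeurons}, estimated cost: ${estimate.toFixed(4)} USD`);
```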