Set up Evaluations
This guide walks you through the process of setting up an evaluation in AI Gateway. These steps are done in the Cloudflare dashboard ↗.
Datasets are collections of logs stored for analysis that can be used in an evaluation. You can create datasets by applying filters in the Logs tab. Datasets update automatically as new logs match the filters you set.
- Apply filters to narrow down your logs. Filter options include provider, number of tokens, request status, and more.
- Select Create Dataset to store the filtered logs for future analysis.
You can manage datasets by selecting Manage datasets from the Logs tab. The table below lists the filters available when building a dataset.
| Filter category | Filter options | Description |
| --- | --- | --- |
| Status | error, status | Filter by error type or status. |
| Cache | cached, not cached | Filter by whether the request was cached. |
| Provider | specific providers | Filter by the selected AI provider. |
| AI Models | specific models | Filter by the selected AI model. |
| Cost | less than, greater than | Filter by cost, specifying a threshold. |
| Request type | Universal, Workers AI Binding, WebSockets | Filter by the type of request. |
| Tokens | Total tokens, Tokens In, Tokens Out | Filter by token count (less than or greater than). |
| Duration | less than, greater than | Filter by request duration. |
| Feedback | equals, does not equal (thumbs up, thumbs down, no feedback) | Filter by feedback type. |
| Metadata Key | equals, does not equal | Filter by specific metadata keys. |
| Metadata Value | equals, does not equal | Filter by specific metadata values. |
| Log ID | equals, does not equal | Filter by a specific Log ID. |
| Event ID | equals, does not equal | Filter by a specific Event ID. |
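If you plan to build datasets with the Metadata Key and Metadata Value filters, you can attach custom metadata to requests as they pass through the gateway using the `cf-aig-metadata` header. The sketch below shows one way to do that, assuming an OpenAI provider behind the gateway; the account ID, gateway ID, model, and metadata keys are placeholders for illustration.

```ts
// Minimal sketch: tag a gateway request with metadata so its log can later be
// filtered into a dataset. ACCOUNT_ID, GATEWAY_ID, the model, and the API key
// are placeholders you must supply.
const ACCOUNT_ID = "your_account_id";
const GATEWAY_ID = "your_gateway_id";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/${ACCOUNT_ID}/${GATEWAY_ID}/openai/chat/completions`,
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${OPENAI_API_KEY}`,
      // Keys and values in this JSON object can be matched with the
      // Metadata Key / Metadata Value filters in the Logs tab.
      "cf-aig-metadata": JSON.stringify({ team: "search", experiment: "prompt-v2" }),
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Hello!" }],
    }),
  },
);
console.log(response.status);
```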
After creating a dataset, choose the evaluation parameters:
- Cost: Calculates the average cost of inference requests within the dataset (only for requests with cost data).
- Speed: Calculates the average duration of inference requests within the dataset.
- Performance:
  - Human feedback: Measures performance based on human feedback, calculated as the percentage of thumbs-up ratings on the dataset's logs. Feedback is annotated from the Logs tab.
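For intuition only, here is a rough sketch of what these evaluators measure over a dataset of logs. AI Gateway computes these metrics for you in the dashboard; the log shape, field names, and the assumption that the feedback percentage is taken over logs that received feedback are illustrative, not a description of the actual implementation.

```ts
// Illustration only: AI Gateway computes these metrics in the dashboard.
// The LogEntry shape and field names are assumptions made for this sketch.
interface LogEntry {
  cost?: number; // inference cost; undefined when no cost data is available
  durationMs: number; // request duration in milliseconds
  feedback?: "thumbs_up" | "thumbs_down"; // undefined = no feedback annotated
}

function evaluateDataset(logs: LogEntry[]) {
  // Cost: average cost, considering only logs that have cost data.
  const withCost = logs.filter((log) => log.cost !== undefined);
  const avgCost =
    withCost.length > 0
      ? withCost.reduce((sum, log) => sum + (log.cost ?? 0), 0) / withCost.length
      : undefined;

  // Speed: average request duration across all logs in the dataset.
  const avgDurationMs =
    logs.length > 0
      ? logs.reduce((sum, log) => sum + log.durationMs, 0) / logs.length
      : undefined;

  // Performance (human feedback): percentage of thumbs-up annotations,
  // assuming the percentage is taken over logs that received feedback.
  const rated = logs.filter((log) => log.feedback !== undefined);
  const thumbsUpPercent =
    rated.length > 0
      ? (100 * rated.filter((log) => log.feedback === "thumbs_up").length) / rated.length
      : undefined;

  return { avgCost, avgDurationMs, thumbsUpPercent };
}
```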
- Create a unique name for your evaluation to reference it in the dashboard.
- Review the selected dataset and evaluators.
- Select Run to start the process.
Evaluation results will appear in the Evaluations tab. The results show the status of the evaluation (for example, in progress, completed, or error). Metrics for the selected evaluators are displayed, and any logs missing the relevant fields are excluded from the calculation. You will also see the number of logs used to calculate each metric.
While datasets automatically update based on filters, evaluations do not. You will have to create a new evaluation if you want to evaluate new logs.
Use these insights to optimize for your application's priorities. Depending on the results, you may choose to:
- Change the model or provider
- Adjust your prompts
- Explore further optimizations, such as setting up Retrieval Augmented Generation (RAG)