Magic Tunnels background information
Cloudflare combines different metrics to determine when to send you Magic Tunnel health alerts. Review key concepts below to better understand this process.
A ratio between the total number of positive events divided by the total number of events. An SLI of 0% is a state where the feature is not working at all, and an SLI of 100% is a state where the feature is fully working as expected.
SLOs are the threshold for the SLI and they set a target level of reliability for Magic Tunnels. For example, if the SLI is based on the percentage of Magic Tunnel states meeting the desired criteria, an SLO could be 99.9% of tunnel states being healthy over the past 30 days. The SLI values for the SLO are calculated based on the tunnel state values, not on the timeout results from tunnel health checks.
The error budget is the amount of unsuccessful events (from the SLI perspective) that can happen over the course of the SLO time window while maintaining the service at the level of availability defined by the SLO.
The SLO is a target percentage, and the error budget is defined as 100% minus the SLO. For example, let us assume that during a course of 30 days there were one million Magic Tunnel health checks in your account, and your SLO is set to 99%. The error budget for this case would be:
This means the SLO would allow for 1000 unsuccessful Magic health checks over the course of 30 days. However, what happens if all errors happen in one hour instead of 30 days? This leads us to the concept of burn rate.
The burn rate measures how fast the error budget is expended over a given time window relative to the SLO window. In the example from above, an SLO of 99% means it is acceptable to observe 100 Magic health check fails over the course of 30 days. However, those same 100 health check fails would not be acceptable during the course of one hour, for example.
To determine when to send Magic Tunnel health alerts, Cloudflare relies on a multi-window, multi-burn rate approach. Every five minutes, Cloudflare analyzes the last hour and the last five minutes of data. We calculate the SLI for the short window (five minutes) and large window (one hour) of data.
Cloudflare only alerts you when both the short and long SLOs are below the configured threshold, so the maximum of the two is the limiting factor. For example, if you defined a threshold of 99%:
- 99.2% (small window) + 99% (large window) = 99%. An alert would not be triggered.
- 98% (small window) + 98% (large window) = 98%. An alert would be triggered.
Cloudflare counts degraded health checks as failed health checks.