Tunnel health checks
Tunnel health checks monitor the status of the Generic Routing Encapsulation (GRE) tunnels that route traffic from Cloudflare to your origin network. Magic Transit relies on health checks to steer traffic to the best available routes.
A tunnel health check probe consists of an ICMP (Internet Control Message Protocol) reply packet that originates from an IP address on the origin side of the GRE tunnel and whose destination address is a public Cloudflare IP.
Cloudflare encapsulates the ICMP reply packet and transmits the probe across the GRE tunnel to the origin. When the probe reaches the origin router, the router decapsulates the ICMP reply and forwards it to the specified destination IP. The probe is successful when Cloudflare receives the reply.
Every Cloudflare edge server configured to process your traffic sends a tunnel health check probe every 60 seconds. When a probe attempt fails, each server detecting the failure quickly probes up to 2 more times to obtain an accurate result.
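The retry behavior above can be sketched as follows. This is a minimal illustration, not Cloudflare's implementation: `send_probe` is a hypothetical callable that returns `True` when the probe's reply arrives, and treating the first successful attempt as a successful check is an assumption.

```python
from typing import Callable

MAX_RETRIES = 2  # from the text: up to 2 additional probes after a failure


def check_tunnel(send_probe: Callable[[], bool]) -> tuple[bool, int]:
    """Run one health check cycle and return (healthy, attempts).

    `send_probe` is a hypothetical callable that returns True when the
    probe's reply is received.
    """
    attempts = 0
    for _ in range(1 + MAX_RETRIES):
        attempts += 1
        if send_probe():
            # Assumption: one successful reply ends the cycle as a success.
            return True, attempts
    # The initial probe and both retries went unanswered.
    return False, attempts


# Example: the first probe is lost and the first retry succeeds.
replies = iter([False, True])
print(check_tunnel(lambda: next(replies)))  # (True, 2)
```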
This Wireshark screenshot shows a collection of example health check packets:
Since each Cloudflare edge server that processes your traffic emits a probe every 60 seconds, expect your network to receive several hundred health check packets per second. This represents a relatively trivial amount of traffic.
Tunnel traffic management
Magic Transit uses tunnel health check packets to prioritize and steer traffic among tunnels.
Health state and prioritization
There are three tunnel health states: Healthy, Degraded, and Down. Healthy tunnels are preferred to Degraded tunnels, and Degraded tunnels are preferred to those that are Down.
Tunnel routes with lower priority values take precedence over those with higher values.
When 0.1% or more of tunnel health checks (at least 2) fail in the previous 5 minutes, Magic Transit considers the link lossy, immediately sets the tunnel state to Degraded, and applies a priority penalty. Magic Transit requires at least 2 failures so that a single lost packet does not trigger a penalty.
When all health checks (at least 3 samples) in the last 1 second fail, Magic Transit immediately transitions the tunnel from Healthy to Down and applies a priority penalty to routes through that tunnel.
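The two thresholds above can be expressed as a small classifier. This is a sketch under stated assumptions: the `Window` type and `classify` function are hypothetical names, the caller is assumed to supply probe counts for the two time windows, and checking the Down rule before the Degraded rule is an assumption about precedence.

```python
from dataclasses import dataclass

DEGRADED_MIN_FAILURES = 2  # at least 2 failures in the previous 5 minutes
DOWN_MIN_SAMPLES = 3       # at least 3 samples in the last 1 second


@dataclass
class Window:
    total: int   # probes observed in the window
    failed: int  # probes that received no reply


def classify(last_5_min: Window, last_1_sec: Window) -> str:
    """Return the state a currently Healthy tunnel should move to."""
    # Down: every probe in the last second failed, with at least 3 samples.
    if last_1_sec.total >= DOWN_MIN_SAMPLES and last_1_sec.failed == last_1_sec.total:
        return "Down"
    # Degraded: failure rate of 0.1% or more, with at least 2 failures.
    # Integer form of failed / total >= 0.001, which avoids float rounding.
    if (
        last_5_min.failed >= DEGRADED_MIN_FAILURES
        and last_5_min.failed * 1000 >= last_5_min.total
    ):
        return "Degraded"
    return "Healthy"


print(classify(Window(1000, 1), Window(5, 0)))  # Healthy: a single lost packet
print(classify(Window(1000, 2), Window(5, 0)))  # Degraded: 2 failures, 0.2%
print(classify(Window(1000, 3), Window(3, 3)))  # Down: all 3 recent probes failed
```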
When Magic Transit identifies a route that is not healthy, it applies these penalties:
- Degraded: Add 500,000 to priority.
- Down: Add 1,000,000 to priority.
Applying a penalty rather than removing the route altogether preserves redundancy and maintains options for customers with only one tunnel. It also supports the case when multiple tunnels are unhealthy.
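To illustrate why a penalty preserves redundancy, the sketch below adds the penalty to a route's configured priority and picks the lowest result. The function names and tunnel names are hypothetical, and how ties are broken is not covered by the text, so this sketch simply keeps the first candidate.

```python
# Penalties from the list above, keyed by tunnel health state.
PENALTY = {"Healthy": 0, "Degraded": 500_000, "Down": 1_000_000}


def effective_priority(base_priority: int, state: str) -> int:
    """Configured priority plus the penalty for the tunnel's health state."""
    return base_priority + PENALTY[state]


def pick_route(routes: dict[str, tuple[int, str]]) -> str:
    """Return the tunnel with the lowest effective priority.

    `routes` maps a tunnel name to (base priority, health state).
    """
    return min(routes, key=lambda name: effective_priority(*routes[name]))


# A customer with a single tunnel still has a route when it is Down.
print(pick_route({"tunnel-a": (100, "Down")}))  # tunnel-a

# With several unhealthy tunnels, the least penalized one wins.
print(pick_route({"tunnel-a": (100, "Down"), "tunnel-b": (200, "Degraded")}))  # tunnel-b
```

Had the Down route been removed instead, the single-tunnel case would have no route to select at all.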
Once a tunnel is in the Down state, edge servers continue to emit probes every 60 seconds. When a probe succeeds, the edge server that received the reply immediately sends two more probes. If both of these probes also succeed, Magic Transit sets tunnel status to Degraded.
Tunnels in a Degraded state transition to Healthy when the failure rate for the previous 30 probes is less than 0.1%. This transition may take up to 30 minutes.
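The 30-minute figure follows from the probe cadence: 30 probes at one probe every 60 seconds span 1,800 seconds. The sketch below tracks a sliding window of 30 results from a single prober. `RecoveryTracker` is a hypothetical name, the failure-rate threshold is passed in by the caller rather than hard-coded, and requiring a full window before recovery is an assumption.

```python
from collections import deque

PROBE_INTERVAL_S = 60  # one probe every 60 seconds
RECOVERY_WINDOW = 30   # the rule looks at the previous 30 probes
WORST_CASE_S = RECOVERY_WINDOW * PROBE_INTERVAL_S  # 1800 s = 30 minutes


class RecoveryTracker:
    """Sliding window over the most recent probe results of one prober."""

    def __init__(self, max_failure_rate: float) -> None:
        self.max_failure_rate = max_failure_rate
        self.results: deque[bool] = deque(maxlen=RECOVERY_WINDOW)

    def record(self, success: bool) -> bool:
        """Record one result; return True once the tunnel may become Healthy."""
        self.results.append(success)
        if len(self.results) < RECOVERY_WINDOW:
            return False  # assumption: a full window is required
        failures = self.results.count(False)
        return failures / RECOVERY_WINDOW < self.max_failure_rate


# 0.001 (0.1%) is the same figure used for the Healthy-to-Degraded rule.
tracker = RecoveryTracker(max_failure_rate=0.001)
outcomes = [tracker.record(True) for _ in range(RECOVERY_WINDOW)]
print(outcomes.index(True) + 1, "probes,", WORST_CASE_S // 60, "minutes")
```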
Magic Transit’s tunnel health check system allows a tunnel to transition quickly from Healthy to Degraded or Down but only slowly from Degraded or Down to Healthy. This dampens changes to tunnel routing caused by flapping and other intermittent network failures.
Consider 2 tunnels and their associated routing priorities. Lower route values have priority:
- Tunnel 1, route priority 100
- Tunnel 2, route priority 200
When both tunnels are in a Healthy state, routing priority directs traffic exclusively to Tunnel 1, since its route priority of 100 beats that of Tunnel 2. Tunnel 2 does not receive any traffic, except for tunnel health check probes. Endpoint health checks only flow over Tunnel 1 to their destination inside the origin network.
If the link between Tunnel 1 and Cloudflare becomes unusable, Cloudflare edge servers discover the failure on their next health check probe and immediately issue two more probes.
When an edge server does not receive the proper ICMP reply packets from these two additional probes, it labels Tunnel 1 Down and raises Tunnel 1's priority value to 1,000,100 (100 plus the 1,000,000 Down penalty), which gives Tunnel 2 precedence. Magic Transit immediately steers packets arriving at that edge server to Tunnel 2.
Suppose the connectivity issue that had set Tunnel 1 health to Down is now resolved. At the next health check interval, the issuing edge server receives a successful probe and immediately sends two more probes to validate tunnel health.
When all three probes return successfully, Magic Transit transitions the tunnel from Down to Degraded. As part of this transition, Cloudflare reduces the priority penalty for that route so that its priority is 500,100. Since Tunnel 2 has a priority of 200, traffic continues to flow over Tunnel 2.
Edge servers continue probing Tunnel 1. When the health check failure rate drops below 0.1% for a 5-minute period, Magic Transit sets tunnel status to Healthy. Tunnel 1’s routing priority is fully restored to 100, and traffic steering returns the data flow to Tunnel 1.
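The example can be replayed in a few lines to check the arithmetic at each stage. The base priorities and penalties come from the text; the function names are hypothetical, and Tunnel 2 is assumed to stay Healthy throughout.

```python
PENALTY = {"Healthy": 0, "Degraded": 500_000, "Down": 1_000_000}
BASE_PRIORITY = {"Tunnel 1": 100, "Tunnel 2": 200}  # values from the example


def effective(states: dict[str, str]) -> dict[str, int]:
    """Effective priority per tunnel: base priority plus health penalty."""
    return {name: BASE_PRIORITY[name] + PENALTY[states[name]] for name in BASE_PRIORITY}


def walk_through() -> list[tuple[str, int, str]]:
    """Return (Tunnel 1 state, Tunnel 1 priority, tunnel carrying traffic) per stage."""
    rows = []
    for t1_state in ("Healthy", "Down", "Degraded", "Healthy"):
        prio = effective({"Tunnel 1": t1_state, "Tunnel 2": "Healthy"})
        # The tunnel with the lowest effective priority carries the traffic.
        rows.append((t1_state, prio["Tunnel 1"], min(prio, key=prio.get)))
    return rows


for row in walk_through():
    print(row)
# ('Healthy', 100, 'Tunnel 1')
# ('Down', 1000100, 'Tunnel 2')
# ('Degraded', 500100, 'Tunnel 2')
# ('Healthy', 100, 'Tunnel 1')
```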