
Key server metrics

The gokeyless key server exposes a Prometheus metrics endpoint that you can use to monitor signing performance, error rates, connection health, and certificate expiry. This endpoint can also be scraped by the OpenTelemetry Collector Prometheus receiver, making the metrics available to any OpenTelemetry-compatible backend.

Metrics endpoint

By default, metrics are served at:

http://<host>:2406/metrics

The port is configurable via the metrics_port key in your configuration file, the --metrics-port flag, or the KEYLESS_METRICS_PORT environment variable.

The endpoint serves only /metrics. There are no additional HTTP endpoints such as /health or /debug.
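For a quick check of the endpoint outside Prometheus, the following Python sketch builds the default URL and parses a small sample of the text exposition format. The helper names (metrics_url, parse_samples) and the sample payload are hypothetical, and the parser is deliberately simplified: it ignores timestamps and escaped characters in label values.

```python
import re

DEFAULT_METRICS_PORT = 2406  # default port documented on this page


def metrics_url(host, port=DEFAULT_METRICS_PORT):
    """Build the URL of the /metrics endpoint (hypothetical helper)."""
    return f"http://{host}:{port}/metrics"


# name{labels} value  -- the label block is optional
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Parse simplified exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload, not captured from a real key server.
SAMPLE = '''# TYPE keyless_requests counter
keyless_requests{opcode="OpECDSASignSHA256"} 42
keyless_failed_connection 3
'''

for name, labels, value in parse_samples(SAMPLE):
    print(name, labels, value)
```

To read from a live server, the payload could be fetched with urllib.request.urlopen(metrics_url("localhost")) and decoded before being passed to parse_samples.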


Histogram buckets

All histogram metrics share the same bucket configuration: 15 exponential buckets starting at 100 microseconds, doubling each step up to approximately 1.64 seconds, plus a final +Inf bucket.

Bucket | Upper bound
1 | 100 µs
2 | 200 µs
3 | 400 µs
4 | 800 µs
5 | 1.6 ms
6 | 3.2 ms
7 | 6.4 ms
8 | 12.8 ms
9 | 25.6 ms
10 | 51.2 ms
11 | 102 ms
12 | 205 ms
13 | 410 ms
14 | 819 ms
15 | ~1.64 s
+Inf | Anything above ~1.64 s
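The bounds above follow directly from the stated rule (start at 100 microseconds, double 14 times). A minimal Python sketch of that arithmetic, with a hypothetical function name, is:

```python
def bucket_upper_bounds(start=100e-6, factor=2.0, count=15):
    """Return the finite bucket upper bounds in seconds.

    Defaults mirror the values described on this page; the function
    itself is illustrative and not part of gokeyless.
    """
    return [start * factor ** i for i in range(count)]


bounds = bucket_upper_bounds()
for index, bound in enumerate(bounds, start=1):
    # Print each bound in milliseconds for easy comparison with the table.
    print(f"bucket {index:2d}: {bound * 1000:.1f} ms")
```

The last finite bound is 100 µs × 2^14 = 1.6384 s, which the table rounds to 1.64 s. Observations above that value are only counted in the +Inf bucket.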

Metrics reference

keyless_requests

Type: Counter
Labels: opcode

Counts every incoming request received over an established connection, regardless of outcome. Incremented once per request before any processing begins.

The opcode label uses the full constant name from the gokeyless protocol.

RSA operations

opcode label | Wire value | Description
OpRSADecrypt | 0x01 | RSA raw decryption, used in TLS RSA key exchange (removed in TLS 1.3)
OpRSASignMD5SHA1 | 0x02 | RSA PKCS#1 v1.5 signature over MD5+SHA1 combined hash (TLS 1.0/1.1 handshake)
OpRSASignSHA1 | 0x03 | RSA PKCS#1 v1.5 signature over SHA1
OpRSASignSHA224 | 0x04 | RSA PKCS#1 v1.5 signature over SHA224
OpRSASignSHA256 | 0x05 | RSA PKCS#1 v1.5 signature over SHA256
OpRSASignSHA384 | 0x06 | RSA PKCS#1 v1.5 signature over SHA384
OpRSASignSHA512 | 0x07 | RSA PKCS#1 v1.5 signature over SHA512
OpRSAPSSSignSHA256 | 0x35 | RSASSA-PSS signature over SHA256 (primary RSA operation in TLS 1.3)
OpRSAPSSSignSHA384 | 0x36 | RSASSA-PSS signature over SHA384
OpRSAPSSSignSHA512 | 0x37 | RSASSA-PSS signature over SHA512

ECDSA operations

opcode label | Wire value | Description
OpECDSASignMD5SHA1 | 0x12 | ECDSA signature over MD5+SHA1 combined hash
OpECDSASignSHA1 | 0x13 | ECDSA signature over SHA1
OpECDSASignSHA224 | 0x14 | ECDSA signature over SHA224
OpECDSASignSHA256 | 0x15 | ECDSA signature over SHA256 (most common in TLS 1.2 and TLS 1.3)
OpECDSASignSHA384 | 0x16 | ECDSA signature over SHA384
OpECDSASignSHA512 | 0x17 | ECDSA signature over SHA512

Other signing

opcode label | Wire value | Description
OpEd25519Sign | 0x18 | Ed25519 signature over an arbitrary-length payload (not a pre-hashed digest)

Sealing and infrastructure operations

opcode label | Wire value | Description
OpSeal | 0x21 | Encrypt a blob using the server's sealing key (used for TLS session tickets)
OpUnseal | 0x22 | Decrypt a blob previously encrypted by OpSeal. Returns ErrExpired if the sealing key has rotated
OpRPC | 0x23 | Execute a named function registered on the server. Available to all connection types
OpCustom | 0x24 | Execute a custom function set in the server configuration. Available to unrestricted connections only
OpPing | 0xF1 | Health check: the server echoes the payload back as OpPong with no HSM or key lookup involved

keyless_request_exec_duration_per_opcode

Type: Histogram
Labels: type, error

Measures the time to execute a single operation, from when processing begins to when a response is produced. For operations backed by a PKCS#11 HSM, this includes the full time waiting for a session from the pool plus the HSM cryptographic operation time.

This metric does not include time a request spends waiting for a connection semaphore slot. That is captured by keyless_request_total_duration_per_opcode.

type label

Opcodes are grouped into coarser categories for this label:

type label | Opcodes included
rsa | OpRSADecrypt, all OpRSASign*, all OpRSAPSSSign*
ecdsa | All OpECDSASign*
ed25519 | OpEd25519Sign
rpc | OpRPC
custom | OpCustom
other | OpSeal, OpUnseal, OpPing, OpPong, OpResponse, OpError
unknown | Any unrecognised opcode byte
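Because keyless_requests is labelled by opcode while the duration histograms are labelled by type, it can help to map one onto the other when correlating the two in your own tooling. The Python sketch below reproduces the grouping in the table above from opcode names; the function name is hypothetical and the real server classifies by opcode byte, not by string.

```python
# Opcode names that fall into the "other" group, per the table above.
OTHER_OPCODES = {"OpSeal", "OpUnseal", "OpPing", "OpPong", "OpResponse", "OpError"}

# Opcode names that map one-to-one onto a type label.
EXACT_TYPES = {"OpEd25519Sign": "ed25519", "OpRPC": "rpc", "OpCustom": "custom"}


def type_label(opcode_name):
    """Return the coarse type label for an opcode label value (illustrative)."""
    if opcode_name == "OpRSADecrypt" or opcode_name.startswith(
        ("OpRSASign", "OpRSAPSSSign")
    ):
        return "rsa"
    if opcode_name.startswith("OpECDSASign"):
        return "ecdsa"
    if opcode_name in EXACT_TYPES:
        return EXACT_TYPES[opcode_name]
    if opcode_name in OTHER_OPCODES:
        return "other"
    return "unknown"


for name in ("OpRSAPSSSignSHA256", "OpECDSASignSHA256", "OpPing", "OpBogus"):
    print(name, "->", type_label(name))
```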

error label

For successful requests the value is no error. All other values indicate a failed operation.

error label | Description | Common cause
no error | Operation completed successfully | -
cryptography error | HSM or signing operation failed | PKCS#11 session pool exhaustion (resource pool timed out), HSM returned an error, key type mismatch
key not found due to no matching SKI/SNI/ServerIP | Key lookup returned no result | Key not loaded in keystore, incorrect SKI in request
read failure | I/O read error during the operation | Disk error reading key file
version mismatch | Protocol version not supported | Client and server version skew
bad opcode | Unknown opcode received | OpCustom sent with no custom handler configured
unexpected opcode | A response opcode was used as a request | Client sent OpPong, OpResponse, or OpError as a request
malformed message | TLV parse failure | Corrupt or truncated packet
internal error | Non-cryptographic server-side failure | Sealer is nil, RPC dispatch error
certificate not found | Certificate lookup failed | Certificate not loaded
sealing key expired | OpUnseal blob is too old to decrypt | TLS session ticket key rotation: blob sealed with a key that has since been retired
remote configuration error | Remote key server is misconfigured | Key points to an unreachable or misconfigured remote key server

keyless_request_total_duration_per_opcode

Type: Histogram
Labels: type, error (same values as keyless_request_exec_duration_per_opcode)

Measures the total time to satisfy a request, from when the request packet is read off the wire to when the response bytes are written back to the client.

total_duration = exec_duration + response_write_time

Both timestamps are captured after the connection semaphore is already held, so semaphore queue wait time is not included in either histogram. Under normal load, total duration and exec duration are approximately equal. A growing gap between them indicates slow writes back to the client — for example, network backpressure between the key server and the Cloudflare edge.
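One way to watch the gap between the two histograms is to compare their mean durations over the same window, using the _sum and _count series that every Prometheus histogram exposes. The Python sketch below assumes you have already computed the increase of each series between two scrapes; the function names and the numbers are illustrative only.

```python
def mean_duration(sum_delta, count_delta):
    """Mean observed duration in seconds over a window, or None if no samples."""
    if count_delta <= 0:
        return None
    return sum_delta / count_delta


def mean_write_time(total_sum, total_count, exec_sum, exec_count):
    """Estimate mean response write time as mean(total) - mean(exec).

    Inputs are increases of the *_sum and *_count series of
    keyless_request_total_duration_per_opcode and
    keyless_request_exec_duration_per_opcode over the same window.
    """
    total_mean = mean_duration(total_sum, total_count)
    exec_mean = mean_duration(exec_sum, exec_count)
    if total_mean is None or exec_mean is None:
        return None
    return total_mean - exec_mean


# Made-up window: 100 requests, 1.5 s total time, 1.2 s execution time.
gap = mean_write_time(1.5, 100, 1.2, 100)
print(f"estimated mean write time: {gap * 1000:.1f} ms")
```

A gap that stays near zero matches the normal case described above. A gap that grows over successive windows points at slow writes back to the client.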


keyless_key_load_duration

Type: Histogram
Labels: None

Measures the time taken by the keystore to locate and return the private key for each request, keyed by SKI, SNI, and server IP.

  • For file-backed keystores, this is a map lookup and is typically sub-millisecond.
  • For PKCS#11 or HSM keystores, this may include a network round-trip to the HSM if key references are not cached in memory.

This metric is recorded for all signing and decryption operations: OpRSADecrypt, all OpRSASign*, all OpRSAPSSSign*, all OpECDSASign*, and OpEd25519Sign.

It is not recorded for OpPing, OpSeal, OpUnseal, OpRPC, or OpCustom, which do not require a private key lookup.


keyless_failed_connection

Type: Counter
Labels: None

Counts connection-level transport failures. This metric reflects problems at the network or TLS layer — it does not count signing errors or key lookup failures, which are reported in the error label of the duration histograms.

Scenario | Counted?
TLS handshake failure | No
Client disconnected before TLS handshake (EOF) | No
Failure determining connection trust level after TLS | Yes
Non-EOF read error on an established connection | Yes
Write error when delivering a response | Yes
Read timeout (graceful connection drain) | No
Signing error, including PKCS#11 pool timeout | No
Key not found | No

certificate_expiration_timestamp_seconds

Type: Gauge
Labels: source, serial_no, cn, hostnames, ca, server, client

Reports the expiration time (NotAfter) of each certificate loaded by the key server as a Unix timestamp. One time series is emitted per certificate.

This metric is updated:

  • At startup, for the server authentication certificate (auth_cert) and the Cloudflare CA certificate (cloudflare_ca_cert).
  • On each successful inbound TLS connection, for the peer certificates presented by the connecting client.

Label | Description
source | File path for startup certs; listener: <addr> for peer certs from incoming connections
serial_no | Certificate serial number
cn | Subject Common Name
hostnames | Sorted, comma-separated list of DNS Subject Alternative Names
ca | 1 if the certificate is a CA certificate, 0 otherwise
server | 1 if the certificate includes ExtKeyUsageServerAuth, 0 otherwise
client | 1 if the certificate includes ExtKeyUsageClientAuth, 0 otherwise

Example PromQL queries

Request throughput by opcode

sum by (opcode) (rate(keyless_requests[1m]))

Error rate by error type

sum by (error) (
  rate(keyless_request_exec_duration_per_opcode_count{error!="no error"}[5m])
)

99th percentile signing latency for RSA

histogram_quantile(
  0.99,
  sum by (le) (
    rate(keyless_request_exec_duration_per_opcode_bucket{type="rsa"}[5m])
  )
)

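Operations that take close to 10 seconds indicate PKCS#11 session pool exhaustion. Because the largest finite bucket is about 1.64 seconds, histogram_quantile cannot return a larger value: such requests land in the +Inf bucket and the query result stays pinned at approximately 1.64 seconds. Treat a 99th percentile at that ceiling as the signal to investigate. Refer to Scaling and benchmarking and your HSM documentation for guidance on increasing the session pool size.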
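When reading this query, keep in mind that histogram_quantile interpolates linearly inside a bucket and, when the requested quantile falls in the +Inf bucket, returns the upper bound of the highest finite bucket. The Python sketch below is a simplified re-implementation of that behaviour for a single histogram, written for illustration; it is not the Prometheus source and skips edge cases such as negative bounds or non-monotonic counts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[index]
    if math.isinf(upper):
        # Quantile is in +Inf: report the highest finite bound instead.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev = buckets[index - 1][1] if index > 0 else 0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev) / (cum - prev)


# Made-up counts: 5 of 100 requests exceeded the last finite bucket.
slow = [(0.8192, 90), (1.6384, 95), (math.inf, 100)]
print(histogram_quantile(0.99, slow))
```

With these made-up counts the 99th percentile falls in the +Inf bucket, so the estimate is the last finite bound of 1.6384 seconds no matter how long the slow requests actually took.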

99th percentile key load latency

histogram_quantile(0.99, rate(keyless_key_load_duration_bucket[5m]))

A spike here without a corresponding spike in exec duration suggests the keystore lookup itself is slow — a possible disk I/O issue or PKCS#11 object enumeration delay.

Connection failure rate

rate(keyless_failed_connection[5m])

A sustained non-zero rate indicates network or TLS problems between the Cloudflare network and your key server.

Alert on certificate expiry within 30 days

(certificate_expiration_timestamp_seconds - time()) / 86400 < 30
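The expression above converts the remaining lifetime from seconds to days by dividing by 86400. The same arithmetic can be sketched in Python for use in ad hoc scripts; the function names, certificate names, and timestamps below are hypothetical.

```python
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after_ts, now=None):
    """Days remaining before a certificate's NotAfter Unix timestamp."""
    now = time.time() if now is None else now
    return (not_after_ts - now) / SECONDS_PER_DAY


def expiring_soon(gauges, threshold_days=30, now=None):
    """Return the names of certificates expiring within threshold_days.

    gauges: dict mapping an identifier (for example the cn label) to the
    value of certificate_expiration_timestamp_seconds for that series.
    """
    return sorted(
        name
        for name, not_after in gauges.items()
        if days_until_expiry(not_after, now) < threshold_days
    )


# Made-up example with a fixed "now" so the result is reproducible.
NOW = 1_700_000_000
GAUGES = {
    "keyserver.example.com": NOW + 10 * SECONDS_PER_DAY,  # 10 days left
    "ca.example.com": NOW + 400 * SECONDS_PER_DAY,        # 400 days left
}
print(expiring_soon(GAUGES, now=NOW))
```

Certificates that have already expired produce a negative number of days and are therefore also matched by the threshold.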