Monitoring

Edge Gateway follows the open-box monitoring principle: it exposes detailed Prometheus metrics covering request processing, cache behavior, render coordination, and system health. Use these metrics to understand how the system behaves and to identify potential issues during operation.

Configuration

Enable the metrics endpoint and give it a listen address on a port separate from the main server port:

```yaml
server:
  listen: ":10070"

metrics:
  enabled: true
  listen: ":10079"
```

The metrics endpoint is available at http://localhost:10079/metrics.
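
To collect these metrics, Prometheus needs a scrape job pointing at the metrics listener. The fragment below is a minimal sketch of such a job, assuming Prometheus runs on the same machine as Edge Gateway; the job name `edge-gateway` and the 15-second interval are illustrative choices, not values taken from Edge Gateway itself.

```yaml
# prometheus.yml (fragment): scrape the Edge Gateway metrics listener.
scrape_configs:
  - job_name: "edge-gateway"          # hypothetical job name
    metrics_path: /metrics            # Prometheus default, shown for clarity
    scrape_interval: 15s              # assumed interval; adjust as needed
    static_configs:
      - targets: ["localhost:10079"]  # matches metrics.listen above
```

If Edge Gateway runs on another machine, replace `localhost` with that machine's address and make sure port 10079 is reachable from Prometheus.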

Metrics reference

All metrics use the eg_ prefix. Most include labels for filtering by host, dimension, or other attributes; a few instance-wide gauges have no labels.

Request metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| eg_requests_total | counter | host, dimension, status | Total number of render requests processed. |
| eg_request_duration_seconds | histogram | host, dimension, status | Time taken to process render requests. Buckets: 5ms to 10s. |
| eg_active_requests | gauge | (none) | Number of currently active render requests. |
| eg_errors_total | counter | error_type, host | Total number of errors by type. |

Cache metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| eg_cache_hits_total | counter | host, dimension | Total number of cache hits. |
| eg_cache_misses_total | counter | host, dimension | Total number of cache misses. |
| eg_cache_hit_ratio | gauge | host, dimension | Cache hit ratio (0-1) for each host and dimension. |
| eg_stale_cache_served_total | counter | host, dimension | Total number of stale cache entries served when render fails. |
| eg_cache_size_bytes | gauge | (none) | Total size of cached content in bytes. |

Render service metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| eg_render_service_duration_seconds | histogram | host, dimension, service_id | Time taken by render service to process requests. Buckets: 100ms to 30s. |
| eg_status_code_responses_total | counter | host, dimension, status_range | Total rendered responses by status code range (2xx, 3xx, 4xx, 5xx). |

Bypass metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| eg_bypass_total | counter | host, reason | Total number of requests that used bypass mode instead of rendering. |

Wait and coordination metrics

These metrics track requests waiting for concurrent renders to complete (thundering herd protection).

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| eg_wait_total | counter | host, dimension, outcome | Total requests that waited for concurrent renders. Outcome: success or timeout. |
| eg_wait_duration_seconds | histogram | host, dimension, outcome | Time spent waiting for concurrent renders. Buckets: 10ms to 5s. |
| eg_wait_timeouts_total | counter | host, dimension | Total wait timeouts while waiting for concurrent renders. |
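
Because eg_wait_total carries an outcome label, the share of waiting requests that time out can be derived directly from it. The following Prometheus recording rules are a sketch of that idea; the group name and the recorded series names are hypothetical and only follow the common `level:metric:operation` naming pattern.

```yaml
# Recording rules (sketch): wait timeout share and p95 wait time per host.
groups:
  - name: edge-gateway-wait            # hypothetical group name
    rules:
      # Fraction of waiting requests that ended in a timeout (0-1).
      - record: host:eg_wait_timeout:ratio_rate5m
        expr: |
          sum by (host) (rate(eg_wait_total{outcome="timeout"}[5m]))
            /
          sum by (host) (rate(eg_wait_total[5m]))
      # 95th percentile of time spent waiting on a concurrent render.
      - record: host:eg_wait_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, host) (rate(eg_wait_duration_seconds_bucket[5m]))
          )
```

A rising timeout share suggests that renders are taking longer than waiting requests are willing to wait.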

Sharding metrics

These metrics are available when cache sharding is enabled across multiple Edge Gateway instances.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| eg_sharding_requests_total | counter | operation, status, target_eg_id | Total inter-EG requests for cache operations. |
| eg_sharding_request_duration_seconds | histogram | operation | Duration of inter-EG requests. Buckets: 10ms to 5s. |
| eg_sharding_bytes_transferred_total | counter | operation, direction | Total bytes transferred in inter-EG communication. |
| eg_sharding_cluster_size | gauge | (none) | Number of healthy Edge Gateways in the cluster. |
| eg_sharding_under_replicated_total | counter | host_id | Cache entries created with fewer replicas than target. |
| eg_sharding_errors_total | counter | error_type | Inter-EG communication errors. |
| eg_sharding_push_failures_total | counter | target_eg_id | Failed push operations per target Edge Gateway. |
| eg_sharding_local_cache_entries | gauge | (none) | Number of cache entries stored locally on this instance. |

Filesystem cleanup metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| eg_filesystem_cleanup_runs_total | counter | host_id, status | Total cleanup runs per host. |
| eg_filesystem_cleanup_directories_deleted_total | counter | host_id | Total directories deleted during cleanup. |
| eg_filesystem_cleanup_duration_seconds | histogram | host_id | Duration of cleanup operations. Buckets: 100ms to 60s. |
| eg_filesystem_cleanup_errors_total | counter | host_id, error_type | Cleanup errors by type. |
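
Cleanup failures are easy to miss because they do not affect request handling directly. One possible way to surface them is a Prometheus alerting rule on the error counter, sketched below; the alert name, the one-hour window, and the severity label are assumptions for illustration.

```yaml
# Alerting rule (sketch): any filesystem cleanup error within the last hour.
groups:
  - name: edge-gateway-cleanup         # hypothetical group name
    rules:
      - alert: EdgeGatewayCleanupErrors
        expr: |
          sum by (host_id, error_type) (
            increase(eg_filesystem_cleanup_errors_total[1h])
          ) > 0
        labels:
          severity: warning            # assumed severity scheme
        annotations:
          summary: "Cleanup errors ({{ $labels.error_type }}) for host {{ $labels.host_id }}"
```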

Example queries

Cache performance

```txt
# Cache hit ratio over last 5 minutes
rate(eg_cache_hits_total[5m]) / (rate(eg_cache_hits_total[5m]) + rate(eg_cache_misses_total[5m]))

# Stale cache usage rate
rate(eg_stale_cache_served_total[5m])
```
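
The cache queries above can also drive alerts. The rule group below is a sketch, not a recommended policy: the 50% hit-ratio threshold, the `for` durations, and all alert names are assumptions that should be tuned to the traffic of each host.

```yaml
# Alerting rules (sketch): low hit ratio and sustained stale serving.
groups:
  - name: edge-gateway-cache           # hypothetical group name
    rules:
      - alert: EdgeGatewayLowCacheHitRatio
        expr: |
          sum by (host) (rate(eg_cache_hits_total[5m]))
            /
          (
            sum by (host) (rate(eg_cache_hits_total[5m]))
            + sum by (host) (rate(eg_cache_misses_total[5m]))
          ) < 0.5                      # assumed threshold
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit ratio below 50% for {{ $labels.host }}"
      - alert: EdgeGatewayServingStaleCache
        # Stale entries are served when rendering fails, so a sustained
        # non-zero rate points at a render problem.
        expr: sum by (host) (rate(eg_stale_cache_served_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Stale cache is being served for {{ $labels.host }}"
```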

Request latency

```txt
# 95th percentile request duration
histogram_quantile(0.95, rate(eg_request_duration_seconds_bucket[5m]))

# 95th percentile render service latency, grouped by service
histogram_quantile(0.95, sum by (le, service_id) (rate(eg_render_service_duration_seconds_bucket[5m])))
```
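
Percentile queries over histograms are comparatively expensive, so dashboards often read them from recording rules instead. The sketch below precomputes a per-host p95 and adds an alert on slow render services; the recorded series name, the alert name, and the 10-second threshold are assumptions.

```yaml
# Recording and alerting rules (sketch) for request and render latency.
groups:
  - name: edge-gateway-latency         # hypothetical group name
    rules:
      # Precomputed p95 request duration per host.
      - record: host:eg_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, host) (rate(eg_request_duration_seconds_bucket[5m]))
          )
      # Fires when a render service's p95 stays above the assumed limit.
      - alert: EdgeGatewaySlowRenderService
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service_id) (rate(eg_render_service_duration_seconds_bucket[5m]))
          ) > 10                       # seconds; assumed threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Render service {{ $labels.service_id }} p95 latency above 10s"
```

Keeping `le` in every aggregation is required: histogram_quantile needs the bucket boundaries to interpolate the percentile.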

Error monitoring

```txt
# Error rate by type
sum by (error_type) (rate(eg_errors_total[5m]))

# Wait timeout rate
rate(eg_wait_timeouts_total[5m])
```
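
As a sketch of how these error queries might be turned into alerts, the rule group below fires on a sustained error rate per type and on persistent wait timeouts. The threshold of one error per second, the durations, and the alert names are assumptions, not values defined by Edge Gateway.

```yaml
# Alerting rules (sketch): sustained errors and wait timeouts.
groups:
  - name: edge-gateway-errors          # hypothetical group name
    rules:
      - alert: EdgeGatewayErrorRateHigh
        expr: sum by (error_type) (rate(eg_errors_total[5m])) > 1   # errors/s; assumed threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Elevated {{ $labels.error_type }} error rate"
      - alert: EdgeGatewayWaitTimeouts
        expr: sum by (host) (rate(eg_wait_timeouts_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Requests for {{ $labels.host }} are timing out while waiting for renders"
```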

Sharding health

```txt
# Cluster size over time
eg_sharding_cluster_size

# Under-replication issues
rate(eg_sharding_under_replicated_total[5m])
```
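
For sharded deployments, the two queries above map naturally onto alerts. The following sketch compares the current cluster size with its own maximum over the past hour, which avoids hard-coding an expected instance count; the alert names, windows, and durations are assumptions.

```yaml
# Alerting rules (sketch): shrinking cluster and ongoing under-replication.
groups:
  - name: edge-gateway-sharding        # hypothetical group name
    rules:
      - alert: EdgeGatewayClusterShrunk
        # Current healthy-instance count is below the highest count
        # this instance reported during the last hour.
        expr: eg_sharding_cluster_size < max_over_time(eg_sharding_cluster_size[1h])
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Fewer healthy Edge Gateways than in the past hour"
      - alert: EdgeGatewayUnderReplication
        expr: sum by (host_id) (rate(eg_sharding_under_replicated_total[5m])) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cache entries for host {{ $labels.host_id }} are created with too few replicas"
```

The cluster alert clears on its own once the smaller size has persisted for an hour, so a planned scale-down does not leave it firing indefinitely.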