18. Standard Metrics for Services using Prometheus
Date: 2023-12-29
Status
Accepted
Context
Standardizing metrics across our services gives us clear, consistent, and actionable data to guide operational decisions and to understand service performance. Prometheus, with its robust monitoring capabilities, is a suitable platform for this. The decision supports our commitment to high service availability and reliability, and it builds on our earlier ADR on Prometheus Metrics Naming (ADR #23).
Decision
We have decided to adopt standard metrics for our services using Prometheus.
Enhancements to Prometheus Metric Examples
SLA/SLO/SLI Metrics
service_availability_ratio: A gauge metric representing the ratio of uptime to total time, aligned with our SLA uptime guarantees.
http_request_duration_seconds: A histogram metric measuring request response times, used to track our response-time SLOs.
error_rate_per_minute: A counter metric tracking errors; the per-minute error rate, an SLI for system reliability, is derived from the counter at query time.
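To illustrate how a histogram such as http_request_duration_seconds supports a response-time SLO, the following is a minimal sketch of cumulative bucket counting in plain Python. The `Histogram` class and the bucket bounds are hypothetical stand-ins chosen for illustration; they are not the Prometheus client library, and a real service would use an official client instead.

```python
from bisect import bisect_left
from math import inf


class Histogram:
    """Illustrative stand-in for a Prometheus-style histogram (not a real client)."""

    def __init__(self, upper_bounds):
        # Bounds are sorted and +Inf is appended so every observation
        # falls into some bucket.
        self.upper_bounds = sorted(upper_bounds) + [inf]
        self._counts = [0] * len(self.upper_bounds)  # per-bucket, non-cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, seconds):
        # bisect_left returns the first bucket whose upper bound is >= the
        # observed value, which matches "less than or equal" bucket semantics.
        index = bisect_left(self.upper_bounds, seconds)
        self._counts[index] += 1
        self.sum += seconds
        self.count += 1

    def cumulative_buckets(self):
        # Buckets are exposed cumulatively: each one counts all observations
        # less than or equal to its upper bound.
        running = 0
        result = []
        for bound, n in zip(self.upper_bounds, self._counts):
            running += n
            result.append((bound, running))
        return result


# Hypothetical bucket bounds in seconds.
histogram = Histogram([0.1, 0.5, 1.0])
for duration in (0.05, 0.3, 0.3, 2.0):
    histogram.observe(duration)
```

With these sample observations, three of four requests fall in the 0.5-second bucket, so an SLO such as "requests complete within 0.5 s" can be checked by dividing that bucket's cumulative count by the total count.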
Cache Metrics
cache_hit_total and cache_miss_total: Counter metrics for the total number of cache hits and misses, respectively.
cache_hit_ratio: A ratio calculated from cache_hit_total and cache_miss_total to determine the effectiveness of the cache.
hot_key_access_frequency: A gauge metric indicating the frequency of access for hot keys.
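The cache_hit_ratio calculation can be sketched as follows. The function names are hypothetical, and the windowed variant assumes that neither counter was reset between the two readings.

```python
def cache_hit_ratio(hit_total, miss_total):
    """Return hits / (hits + misses), or None when there were no lookups."""
    lookups = hit_total + miss_total
    if lookups == 0:
        # No lookups yet: the ratio is undefined rather than zero.
        return None
    return hit_total / lookups


def windowed_hit_ratio(previous, current):
    """Ratio over the interval between two (hit_total, miss_total) readings.

    Assumes neither counter was reset between the readings.
    """
    hits = current[0] - previous[0]
    misses = current[1] - previous[1]
    return cache_hit_ratio(hits, misses)
```

A ratio over lifetime totals changes very slowly on a long-running service, so a ratio over a recent window is usually the more useful signal for judging cache effectiveness.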
Basic Service Metrics
requests_per_second (RPS): Counter metric measuring service request load; the per-second rate is derived from the counter at query time.
transactions_per_second (TPS) & queries_per_second (QPS): Counter metrics for transactions and database queries, respectively, with per-second rates derived in the same way.
response_time_seconds: Histogram metric tracking service response time.
error_rate_percentage: Gauge metric for the percentage of requests that result in errors.
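Because a counter only ever accumulates, a per-second figure such as RPS, TPS, or QPS has to be derived from successive counter readings. The sketch below shows one simplified way to do that, including handling of a counter reset. The `per_second_rate` function is hypothetical; Prometheus's own `rate()` function additionally extrapolates to the edges of the query window, which this sketch does not attempt.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (unix_timestamp_seconds, counter_value), oldest first.
    Returns None when there is not enough data to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # A drop means the counter was reset (for example, a process
            # restart), so everything counted since the reset is new.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

For example, readings of 100, 130, and 160 taken 15 seconds apart represent an increase of 60 over 30 seconds, or 2 requests per second.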
Resource Utilization Metrics
network_traffic_bytes: Counter metric for inbound and outbound network traffic.
cpu_usage_percentage: Gauge metric for CPU utilization.
ram_usage_bytes: Gauge metric for RAM usage.
disk_usage_bytes: Gauge metric for disk space usage (HDD/SSD).
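As one possible way to obtain the value behind disk_usage_bytes, the sketch below reads filesystem usage with Python's standard library. The function name and the default path are assumptions for illustration; a real deployment may rely on an existing exporter rather than collecting this in the service itself.

```python
import shutil


def read_disk_usage_bytes(path="/"):
    """Return the number of bytes currently used on the filesystem holding path."""
    # shutil.disk_usage returns a named tuple with total, used, and free bytes.
    return shutil.disk_usage(path).used
```

The returned value is a point-in-time reading that can go up or down, which is why a gauge is the appropriate metric type here.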
Additional Metrics
queue_size: Gauge metric for the size of each critical queue.
process_count & thread_count: Gauge metrics for monitoring the number of processes and threads.
Reference to ADR #23: Prometheus Metrics Naming
Consistent with ADR #23, all metrics will follow the prescribed naming conventions and utilize labels for additional dimensions. This will enhance clarity, ease of understanding, and consistency in metric categorization.
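To show how labels add dimensions to a metric, the following sketch renders a single sample as a line in the Prometheus text exposition format. The `format_sample` helper and the label names `queue` and `service` are hypothetical; the specific conventions prescribed by ADR #23 are not reproduced here.

```python
import re

# Character rules for metric and label names in the Prometheus data model.
_METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
_LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def _escape_label_value(value):
    # Backslash, double quote, and newline must be escaped in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render one sample as a text exposition line, e.g. name{label="v"} 1."""
    if not _METRIC_NAME.fullmatch(name):
        raise ValueError(f"invalid metric name: {name!r}")
    for label in labels:
        if not _LABEL_NAME.fullmatch(label):
            raise ValueError(f"invalid label name: {label!r}")
    if not labels:
        return f"{name} {value}"
    # Labels are sorted only to make the output deterministic.
    rendered = ",".join(
        f'{key}="{_escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{rendered}}} {value}"
```

Calling `format_sample("queue_size", {"queue": "orders", "service": "checkout"}, 7)` yields `queue_size{queue="orders",service="checkout"} 7`, so one metric name covers every queue while the labels distinguish them.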
Consequences
This standardization will enable systematic monitoring and improvement of service performance. However, challenges include ensuring accuracy in distributed systems and avoiding over-reliance on quantitative metrics. These will be mitigated through continuous monitoring strategy refinement and periodic metric reviews.