📊 Observability

Production Observability with Prometheus + Grafana

Real-time metrics, SLO dashboards, and alerting for Spring Boot and Node.js. RED and USE signals wired end-to-end — from HTTP handler to LLM pipeline.

By Rohit Raj · Founding Engineer · Published Jan 31, 2026 · Updated Apr 24, 2026

Live Prometheus Metrics

  • Request Rate · 1,247 req/s
  • Error Rate · 0.12%
  • P95 Latency · 23.4 ms
  • P99 Latency · 45.2 ms
  • Active Connections · 342
  • CPU Usage · 42.3%

What Metrics Should Every Production Service Emit?

Two families. RED for request health, USE for resource health. Ship both; a minimal instrumentation sketch follows the two lists below.

RED Metrics

Request-level health for any service

  • Rate · Requests per second across endpoints and methods.
  • Errors · Failed request rate — HTTP 5xx plus business errors.
  • Duration · Request latency at p50, p95, and p99 buckets.

USE Metrics

System-level saturation signals

  • Utilization · CPU, memory, and thread-pool usage relative to capacity.
  • Saturation · Queue depth, backlog, and connection pool waits.
  • Errors · System-level failures (GC stalls, OOM, disk full).
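
A minimal sketch of what wiring both families looks like with Micrometer, the metrics facade Spring Boot ships with. The meter names, the /api/query tag, the worker-pool gauge, and the RedUseMetrics class are illustrative assumptions, not names from any real service:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.concurrent.ThreadPoolExecutor;
import java.util.function.Supplier;

public class RedUseMetrics {

    private final Counter requests;   // RED: Rate (Prometheus derives req/s with rate())
    private final Counter errors;     // RED: Errors (5xx plus business failures)
    private final Timer duration;     // RED: Duration (histogram buckets for p50/p95/p99)

    public RedUseMetrics(MeterRegistry registry, ThreadPoolExecutor workerPool) {
        this.requests = Counter.builder("app.requests")
                .tag("endpoint", "/api/query")
                .register(registry);
        this.errors = Counter.builder("app.request.errors")
                .tag("endpoint", "/api/query")
                .register(registry);
        this.duration = Timer.builder("app.request.duration")
                .tag("endpoint", "/api/query")
                .publishPercentileHistogram() // emit buckets so Prometheus can compute quantiles
                .register(registry);

        // USE: Saturation, exposing the worker pool's queue depth as a gauge
        Gauge.builder("app.worker.queue.depth", workerPool, p -> p.getQueue().size())
                .register(registry);
    }

    // Wrap a request handler so every call feeds Rate, Errors, and Duration.
    public <T> T record(Supplier<T> handler) {
        return duration.record(() -> {
            requests.increment();
            try {
                return handler.get();
            } catch (RuntimeException e) {
                errors.increment();
                throw e;
            }
        });
    }
}
```

Prometheus turns the counters into per-second rates with rate(), and the percentile histogram supplies the buckets behind the p95/p99 panels in Grafana.
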
🚀 Real Use Case: Text2SQL Query Engine SLOs

How observability held 99.5% query accuracy for an AI-powered NL-to-SQL system

Challenge
Maintain sub-500ms p95 latency while translating natural language → SQL with high accuracy. Without visibility into the LLM pipeline, debugging query failures would require parsing logs across API, schema resolver, and LLM layers.
Solution
Prometheus metrics across the full pipeline, split into two dashboards.

🤖 LLM Pipeline Metrics

  • Token usage per request
  • LLM response latency (p50, p95, p99)
  • Schema context cache hit rate

📊 Query Accuracy Metrics

  • SQL syntax validation rate
  • Query execution success rate
  • Fallback/retry frequency
Impact
Mean time to detection (MTTD) fell from hours to minutes. When LLM latency spiked on context overflow, alerts fired before users reported timeouts — enabling proactive token optimization.
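
The pipeline metrics listed above map almost one-to-one onto custom meters. A hypothetical sketch with Micrometer; the meter names (llm.tokens.per.request, llm.response.latency, schema.cache.requests) and the LlmPipelineMetrics class are illustrative, not the Text2SQL service's actual names:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

public class LlmPipelineMetrics {

    private final DistributionSummary tokensPerRequest; // input + output tokens per call
    private final Timer llmLatency;                     // p50/p95/p99, labeled by model and provider
    private final Counter cacheHits;                    // schema-context cache hits
    private final Counter cacheMisses;                  // schema-context cache misses

    public LlmPipelineMetrics(MeterRegistry registry, String model, String provider) {
        this.tokensPerRequest = DistributionSummary.builder("llm.tokens.per.request")
                .tag("model", model).tag("provider", provider)
                .register(registry);
        this.llmLatency = Timer.builder("llm.response.latency")
                .tag("model", model).tag("provider", provider)
                .publishPercentileHistogram()
                .register(registry);
        this.cacheHits = Counter.builder("schema.cache.requests")
                .tag("result", "hit").register(registry);
        this.cacheMisses = Counter.builder("schema.cache.requests")
                .tag("result", "miss").register(registry);
    }

    // Call once per LLM round trip; the cache hit rate falls out of the two counters.
    public void recordCall(long inputTokens, long outputTokens,
                           Duration latency, boolean schemaCacheHit) {
        tokensPerRequest.record(inputTokens + outputTokens);
        llmLatency.record(latency);
        (schemaCacheHit ? cacheHits : cacheMisses).increment();
    }
}
```

Labeling by model and provider is what later makes A/B routing and cost comparisons possible from the same dashboards.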

How Does Production Observability Pay for Itself?

🎯 SLO Tracking

Define and track Service Level Objectives. Know when you're burning error budget before SLA violations occur.

🔍 Incident Response

Correlate metrics across services during incidents. Identify root causes faster with historical data and trend analysis.

📊 Capacity Planning

Use historical trends to predict resource needs. Scale proactively based on data, not guesswork.

Frequently Asked Questions

Questions engineers ask before wiring their first production metrics stack.

What is observability and how is it different from monitoring?

Monitoring answers predefined questions ("is CPU above 80%?"). Observability lets you ask any question about system state by combining metrics, logs, and traces with high-cardinality labels. In practice: monitoring gives you a fixed dashboard, observability gives you a query language (PromQL, LogQL, or equivalent) to explore behavior you did not predict.

What are RED and USE metrics?

RED — Rate, Errors, Duration — describes request-level health for any service (best fit for REST APIs, Kafka consumers, gRPC). USE — Utilization, Saturation, Errors — describes resource-level health (CPU, memory, disk, connection pools). Ship both. RED catches user-visible degradation; USE catches the root cause underneath.

How do I add Prometheus metrics to a Spring Boot application?

Add micrometer-registry-prometheus to your dependencies alongside Spring Boot Actuator. Actuator then serves /actuator/prometheus once you expose the endpoint (management.endpoints.web.exposure.include=prometheus). Use @Timed and @Counted annotations for method-level timing, or inject MeterRegistry to emit custom metrics. Scrape the endpoint with Prometheus, visualize in Grafana. Total setup: about 15 minutes for a greenfield service.
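
For illustration, a sketch of both styles in one service class. QueryService, the meter names, and the percentile choices are hypothetical; the real requirements are spring-boot-starter-actuator plus micrometer-registry-prometheus on the classpath, and a TimedAspect bean if you want @Timed on plain (non-controller) methods:

```java
import io.micrometer.core.annotation.Timed;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

@Service
public class QueryService {

    private final Counter fallbacks;

    public QueryService(MeterRegistry registry) {
        // Custom business metric; appears at /actuator/prometheus once the endpoint is exposed
        this.fallbacks = Counter.builder("query.fallbacks.total")
                .description("Queries that fell back to a retry path")
                .register(registry);
    }

    // Method-level timing; @Timed on service methods needs a TimedAspect bean registered
    @Timed(value = "query.execution", percentiles = {0.5, 0.95, 0.99})
    public String execute(String naturalLanguageQuery) {
        // ... translate natural language to SQL and run it ...
        return "SELECT 1";
    }

    public void recordFallback() {
        fallbacks.increment();
    }
}
```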

What is an SLO and how do I define one?

A Service Level Objective is a target for a Service Level Indicator (SLI) measured over a rolling window. Example SLI: "99.5% of /api/query requests complete in under 500ms." Example SLO: "maintain 99.5% over any 28-day window." The gap between SLO and 100% is the error budget. When you burn through it, you freeze feature work and fix reliability.
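
One convenient trick is to publish the SLO boundary as an explicit histogram bucket, so the "good requests" ratio is a single bucket division in PromQL. A sketch with Micrometer, assuming the hypothetical meter name api.query.latency:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

public class SloTimers {

    // The le="0.5" bucket divided by the total count gives the SLI over any window.
    public static Timer queryLatency(MeterRegistry registry) {
        return Timer.builder("api.query.latency")
                .publishPercentileHistogram()
                .serviceLevelObjectives(Duration.ofMillis(500)) // explicit 500 ms bucket
                .register(registry);
    }
}
```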

How do I monitor LLM pipelines with Prometheus?

Instrument four signals: token usage per request (input + output), LLM response latency (p95 per provider), cache hit rate for prompts and retrieval, and fallback/retry frequency. I documented the Text2SQL case in the use case above — the same pattern works for any RAG or LLM-orchestration stack. Label metrics by model and provider so you can A/B route and compare cost/quality.

What is mean time to detection (MTTD) and how does observability reduce it?

MTTD is the time between when a problem starts and when a human notices. Without observability, MTTD is often hours (someone reports it). With RED alerts on error rate and latency, MTTD drops to minutes. The Text2SQL case above took MTTD from hours to minutes by alerting on LLM latency breaking a p95 threshold before users noticed timeouts.

Related Reading

  • Kafka Consumer Testing with Embedded Kafka →
  • Load Testing REST APIs with k6 →
  • API Contract Testing with Postman + Newman →
  • Spring Boot + MCP: Tool-Using AI Agents →
  • Multi-Tenant SaaS on Spring Boot + Java 21 →
  • Hire Rohit: SRE & Observability Consulting →
← Back to Reliability Overview

Rohit Raj — Backend & AI Systems Engineer
