LLM Observability: Tracing AI Agents in Production

AI agents in production fail in ways that traditional monitoring doesn't catch: a prompt change silently degrades output quality, token usage creeps up and slowly increases costs, or a downstream LLM API timeout causes cascading failures with no visible error in your service metrics. Proper observability for LLM systems requires specific patterns. Here's what I use.

What makes LLM observability different

Traditional service observability: latency, error rate, throughput.

LLM service observability additionally needs:

Token usage per call (cost tracking)
Prompt content (for debugging quality issues)
Model routing (which model handled which request)
Cache hit rates (for prompt caching efficiency)
Output quality signals (did the LLM produce valid output?)

These require custom instrumentation — standard APM tools don't know about LLM semantics.

OpenTelemetry for LLM tracing

I use OpenTelemetry with a custom LLM span wrapper:

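// internal/observability/llm.go
package observability

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

// Span attribute keys for LLM calls.
const (
    AttrLLMModel        = attribute.Key("llm.model")
    AttrLLMInputTokens  = attribute.Key("llm.input_tokens")
    AttrLLMOutputTokens = attribute.Key("llm.output_tokens")
    AttrLLMCostUSD      = attribute.Key("llm.cost_usd")
    AttrLLMCacheHit     = attribute.Key("llm.cache_hit")
    AttrLLMTaskType     = attribute.Key("llm.task_type")
)

// TracingLLMClient wraps an LLMClient and records one span and one cost row per call.
type TracingLLMClient struct {
    inner   LLMClient
    tracer  trace.Tracer
    costLog *CostLogger
}

// NewTracingLLMClient takes its tracer from the global provider.
// The tracer name "llm" is a placeholder; use your own instrumentation name.
func NewTracingLLMClient(inner LLMClient, costLog *CostLogger) *TracingLLMClient {
    return &TracingLLMClient{inner: inner, tracer: otel.Tracer("llm"), costLog: costLog}
}

func (t *TracingLLMClient) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
    ctx, span := t.tracer.Start(ctx, "llm.complete",
        trace.WithAttributes(
            AttrLLMModel.String(req.Model),
            AttrLLMTaskType.String(req.TaskType),
        ),
    )
    defer span.End()

    resp, err := t.inner.Complete(ctx, req)
    if err != nil {
        // RecordError only attaches an event; the span status must be set
        // explicitly for the trace to show up as failed.
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return resp, err
    }

    // Compute the cost once so the span and the cost log always agree.
    cost := calculateCost(req.Model, resp.Usage)

    // Record usage
    span.SetAttributes(
        AttrLLMInputTokens.Int(resp.Usage.InputTokens),
        AttrLLMOutputTokens.Int(resp.Usage.OutputTokens),
        AttrLLMCostUSD.Float64(cost),
        AttrLLMCacheHit.Bool(resp.Usage.CacheHit),
    )

    // Log cost to DB for per-product tracking
    t.costLog.Record(ctx, CostRecord{
        Product:      extractProduct(ctx),
        Model:        req.Model,
        TaskType:     req.TaskType,
        InputTokens:  resp.Usage.InputTokens,
        OutputTokens: resp.Usage.OutputTokens,
        CostUSD:      cost,
        CacheHit:     resp.Usage.CacheHit, // assumes CostRecord has this field; it feeds llm_costs.cache_hit
    })

    return resp, nil
}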

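// llmRates holds prices in USD per 1k tokens as {input, output}.
var llmRates = map[string][2]float64{
    "claude-haiku-4-5":  {0.00025, 0.00125},
    "claude-sonnet-4-6": {0.003, 0.015},
    "claude-opus-4-8":   {0.015, 0.075},
}

func calculateCost(model string, usage Usage) float64 {
    rate, ok := llmRates[model]
    if !ok {
        // Unknown models are costed at 0; add them to llmRates so spend is not under-reported.
        return 0
    }
    return float64(usage.InputTokens)/1000*rate[0] + float64(usage.OutputTokens)/1000*rate[1]
}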

Grafana dashboard for LLM metrics

Key queries for the Grafana dashboards. These are PromQL, so they assume the LLM metrics are also exported to a Prometheus-compatible store; Tempo itself holds the traces:

# P95 LLM latency by model
histogram_quantile(0.95,
  sum(rate(llm_complete_duration_seconds_bucket[5m])) by (le, model)
)

# Token cost over the last hour by product
sum(increase(llm_cost_usd_total[1h])) by (product)

# Cache hit rate by model
sum(rate(llm_cache_hits_total[5m])) by (model)
/ sum(rate(llm_complete_total[5m])) by (model)

# Error rate by task type
sum(rate(llm_errors_total[5m])) by (task_type)
/ sum(rate(llm_complete_total[5m])) by (task_type)

Cost DB for per-product tracking

Every LLM call's cost is logged to a PostgreSQL table:

CREATE TABLE llm_costs (
    id            BIGSERIAL PRIMARY KEY,
    product       TEXT NOT NULL,
    model         TEXT NOT NULL,
    task_type     TEXT NOT NULL,
    input_tokens  INT NOT NULL,
    output_tokens INT NOT NULL,
    cost_usd      NUMERIC(10, 8) NOT NULL,
    cache_hit     BOOL DEFAULT FALSE,
    created_at    TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON llm_costs (product, created_at);
CREATE INDEX ON llm_costs (model, created_at);

Daily summary query:

SELECT 
    product,
    model,
    task_type,
    SUM(cost_usd) AS daily_cost,
    SUM(input_tokens) AS input_tokens,
    SUM(output_tokens) AS output_tokens,
    AVG(cache_hit::int) AS cache_hit_rate
FROM llm_costs
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY product, model, task_type
ORDER BY daily_cost DESC;

This query tells me exactly where money is going — and where cost optimization is worth effort.

Alerting: what to alert on

# Prometheus alerting rules (notifications are routed through Alertmanager)
groups:
  - name: llm-alerts
    rules:
      - alert: LLMHighCost
        expr: sum(increase(llm_cost_usd_total[1h])) by (product) > 5
        annotations:
          summary: "LLM cost for {{ $labels.product }} exceeds $5/hour"

      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, sum(rate(llm_complete_duration_seconds_bucket[5m])) by (le)) > 5
        annotations:
          summary: "LLM P95 latency exceeds 5s"

      - alert: LLMHighErrorRate
        expr: sum(rate(llm_errors_total[5m])) / sum(rate(llm_complete_total[5m])) > 0.05
        annotations:
          summary: "LLM error rate exceeds 5%"

      - alert: LLMCacheHitRateLow
        expr: |
          sum(rate(llm_cache_hits_total[30m])) / sum(rate(llm_complete_total[30m])) < 0.5
        annotations:
          summary: "Prompt cache hit rate below 50% — check system prompt stability"

Cache hit rate dropping below 50% usually means someone changed the system prompt, which invalidates the cache and can roughly double effective input token costs.

Output quality monitoring

The hardest thing to monitor: output quality. LLMs don't return errors when they produce bad output — they just return wrong text.

Lightweight quality signals:

Schema validation — if output should be JSON, validate it
Length bounds — if output should be 100–200 words, flag outliers
Confidence signals — ask the model to rate its own confidence (1-5); low scores trigger review
Spot sampling — randomly sample 1% of outputs to a quality review queue for human spot-check

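import (
    "encoding/json"
    "fmt"
    "strings"
)

// validateOutput runs cheap structural checks for each task type.
// Task types without a rule are treated as valid.
func validateOutput(taskType string, output string) QualitySignal {
    switch taskType {
    case "product_classify":
        // Structured output: must parse as JSON and carry a category.
        var result ClassifyResult
        if err := json.Unmarshal([]byte(output), &result); err != nil {
            return QualitySignal{Valid: false, Reason: "invalid JSON"}
        }
        if result.Category == "" {
            return QualitySignal{Valid: false, Reason: "empty category"}
        }
    case "summary":
        // Free-text output: flag summaries outside the expected length.
        words := len(strings.Fields(output))
        if words < 50 || words > 300 {
            return QualitySignal{Valid: false, Reason: fmt.Sprintf("word count %d out of range", words)}
        }
    }
    return QualitySignal{Valid: true}
}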
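
The confidence and spot-sampling signals from the list above could be combined along these lines. This is a sketch under stated assumptions: it swaps random sampling for a deterministic hash of a request ID, so a retried request gets the same decision. The requestID parameter, the function names, and the confidence threshold of 2 are all hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sampledForReview deterministically selects roughly `rate` of request IDs
// by hashing the ID into one of 10000 buckets.
func sampledForReview(requestID string, rate float64) bool {
	if rate <= 0 {
		return false
	}
	if rate >= 1 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(requestID))
	return float64(h.Sum32()%10000) < rate*10000
}

// needsReview sends low self-reported confidence (1-5 scale) straight to review
// and samples everything else. A missing rating (0) also counts as low.
func needsReview(requestID string, confidence int, rate float64) bool {
	const lowConfidence = 2 // assumed threshold
	if confidence <= lowConfidence {
		return true
	}
	return sampledForReview(requestID, rate)
}

func main() {
	sampled := 0
	for i := 0; i < 100000; i++ {
		if needsReview(fmt.Sprintf("req-%d", i), 5, 0.01) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of 100000 high-confidence outputs\n", sampled)
	fmt.Println(needsReview("req-1", 1, 0.01)) // low confidence is always reviewed
}
```

The sampled share will hover around the configured rate rather than match it exactly, which is fine for a human review queue.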

FAQ

What is LLM observability? LLM observability is the practice of monitoring AI systems beyond standard metrics — tracking token usage (for cost), model routing, prompt cache hit rates, and output quality signals that traditional APM tools don't capture.

What tool do you use for LLM tracing? OpenTelemetry for instrumentation (vendor-neutral), Grafana Tempo for trace storage, and Grafana for dashboards. Grafana Cloud free tier includes 14-day trace retention — sufficient for debugging.

How do you track LLM costs in production? A PostgreSQL table logging every LLM call with model, token counts, and computed cost. Daily aggregate queries show per-product, per-model cost breakdown. Prometheus alert rules fire when hourly costs exceed thresholds.

How do you monitor LLM output quality? Schema validation for structured outputs, length bounds for text outputs, and random sampling (1%) of outputs to a human review queue. For critical features, add a confidence rating step to the LLM call.

Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: LLM Cost Optimization: Cut Your AI API Bills 10x · Multi-Agent AI Systems: Architecture Patterns.