
Your Dashboards Are Not Telling You What You Think

There is a dangerous assumption baked into most engineering organizations: if the dashboard is green, everything is fine. This assumption has caused more outages, more silent data corruption, and more weekend pages than any single technical failure I can name. The problem is not that monitoring is bad — it is that monitoring and observability solve fundamentally different problems, and most teams are using the wrong tool for the job they actually have.

Monitoring tells you when something is broken. Observability tells you why. That distinction sounds academic until you are three hours into an incident, staring at a flat CPU graph and a healthy-looking error rate, while your users are tweeting screenshots of 500 errors you cannot reproduce.

The Three Pillars Are Not Enough

The observability community loves talking about “the three pillars” — logs, metrics, and traces. This framing was useful in 2018 when the primary goal was convincing teams to move beyond Nagios checks and log files. But in 2026, the three pillars framing is actively misleading because it implies that having all three automatically gives you observability. It does not.

You can have petabytes of logs, thousands of metrics, and fully instrumented traces and still be unable to answer basic questions about your system. Observability is not about having data — it is about being able to ask arbitrary questions of your system without deploying new code.

Here is the litmus test: can your current setup answer a question you have never asked before? If the answer is no, you have monitoring, not observability.

Where Monitoring Fails: Real Examples

Let me walk through three scenarios I have seen in production where monitoring dashboards showed green while systems were on fire.

Scenario 1: The P99 Trap

A payment processing service was reporting healthy P50 (12ms) and P99 (180ms) latencies. Everything looked fine. But a small subset of users — roughly 0.3% — were experiencing timeouts of 30+ seconds on every request. The P99 metric completely masked this because the distribution had a long, thin tail that only affected users hitting a specific database partition.

# What the dashboard showed:
# p50: 12ms  ✅
# p99: 180ms ✅
# error_rate: 0.1% ✅

# What was actually happening:
# Latency distribution (simplified):
# 0-50ms:    85% of requests
# 50-200ms:  14.7% of requests
# 200ms-1s:  0.2% of requests
# 1s-30s:    0.1% of requests  ← these users were furious

# The fix: high-cardinality instrumentation per user_id
# that could be queried ad-hoc
from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

# start_as_current_span also works as a decorator in the Python API
@tracer.start_as_current_span("process_payment")
def process_payment(user_id, amount):
    span = trace.get_current_span()
    span.set_attribute("user.id", user_id)
    span.set_attribute("payment.amount", amount)
    # get_partition is this service's own partition-routing helper
    span.set_attribute("db.partition", get_partition(user_id))
    # Now you can query: "show me all spans where
    # db.partition=7 AND duration > 5s"

Scenario 2: The Silent Data Corruption

An ETL pipeline was processing events from Kafka and writing to PostgreSQL. All metrics were healthy: throughput was stable, error rate was zero, lag was minimal. But the pipeline had a bug where it was silently dropping events that contained Unicode characters outside the Basic Multilingual Plane. No error was thrown — the events were simply filtered out during a transformation step.

Monitoring could not catch this because nothing was “wrong” from a systems perspective. The pipeline was doing exactly what the code told it to do. Finding this required the ability to drill into individual events and compare input versus output, which is an observability problem.
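To make the bug class concrete, here is a minimal sketch of how a transform can drop events without raising anything, plus the kind of input-versus-output reconciliation that would have surfaced it. The function names and the `payload` field are illustrative, not from the actual pipeline:

```python
# Hypothetical sketch of the bug class: a "sanitizing" transform that
# silently drops any event whose payload contains a character outside
# the Basic Multilingual Plane (code points above U+FFFF, e.g. emoji).
def transform(events):
    return [
        e for e in events
        if all(ord(ch) <= 0xFFFF for ch in e["payload"])
    ]

# The reconciliation check monitoring never performs: compare input and
# output counts per batch and surface any discrepancy as an event.
def reconcile(batch_in, batch_out):
    dropped = len(batch_in) - len(batch_out)
    if dropped > 0:
        # In a real pipeline, emit this as a structured event with
        # batch id and a sample of the dropped payloads.
        print(f"dropped {dropped} events in transform")
    return dropped

events = [{"payload": "hello"}, {"payload": "hi \U0001F600"}]
out = transform(events)
reconcile(events, out)  # reports 1 dropped event
```

Throughput and error-rate metrics stay green here; only an event-level comparison of what went in against what came out exposes the loss.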

Scenario 3: The Cascading Retry Storm

Service A calls Service B, which calls Service C. Service C experienced a brief spike in latency (200ms to 2s). Service B’s retry logic kicked in, tripling the load on Service C. Service A’s retry logic then kicked in on top of that. Within 60 seconds, Service C was receiving 27x its normal traffic.

The monitoring dashboard for each individual service looked reasonable in isolation. Service C showed elevated latency and increased request count. But understanding the causal chain — that the retries were making things worse — required distributed tracing that could follow a single request across all three services.
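The amplification itself is simple arithmetic. A hedged sketch, assuming three retrying layers (the end client included) each making up to three attempts when the layer below is slow:

```python
# Retry amplification: when every layer in a call chain makes up to
# `attempts` tries against the layer below, load on the bottom service
# multiplies at each hop. Numbers here are illustrative.
def amplification(attempts_per_layer):
    factor = 1
    for attempts in attempts_per_layer:
        factor *= attempts
    return factor

# Client -> Service A -> Service B, each retrying up to 3 times:
# Service C ends up serving 3 * 3 * 3 = 27x its normal traffic.
print(amplification([3, 3, 3]))  # 27
```

This is why retry budgets and backoff with jitter matter: each naive retry layer is a multiplier, not an addition.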

What Observability Actually Looks Like

True observability has a few distinguishing characteristics that separate it from traditional monitoring:

Characteristic | Monitoring                | Observability
Questions      | Pre-defined (dashboards)  | Ad-hoc (exploratory)
Data model     | Aggregated metrics        | High-cardinality events
Failure mode   | Known-unknowns            | Unknown-unknowns
Cost driver    | Number of metrics         | Event volume and cardinality
Primary tool   | Dashboards, alerts        | Query interface, trace explorer
Who uses it    | On-call engineer          | Anyone debugging

High-Cardinality Data Is the Key

The fundamental difference between monitoring and observability is cardinality. Monitoring systems aggregate data into low-cardinality dimensions: status code, endpoint, region. Observability systems preserve high-cardinality dimensions: user ID, request ID, trace ID, session ID.

# Low-cardinality metric (monitoring):
http_requests_total{method="GET", status="200", endpoint="/api/users"} 145832

# High-cardinality event (observability):
{
  "timestamp": "2026-03-15T14:32:01.445Z",
  "service": "api-gateway",
  "trace_id": "abc123def456",
  "span_id": "span_789",
  "user_id": "user_42981",
  "endpoint": "/api/users/42981/orders",
  "method": "GET",
  "status_code": 200,
  "duration_ms": 847,
  "db_query_count": 3,
  "db_duration_ms": 612,
  "cache_hit": false,
  "response_size_bytes": 14520,
  "feature_flags": ["new-order-index", "cache-v2"]
}

With the high-cardinality event, you can answer questions you never anticipated: “Which users have cache_hit=false AND duration_ms > 500 AND are using the new-order-index feature flag?” Try answering that with a Grafana dashboard.
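In miniature, that query is just a predicate over events. A sketch with `events` standing in for your event store, using the field names from the example above:

```python
# The kind of ad-hoc question an observability store answers: which
# users missed the cache, were slow, and had a given flag enabled?
def slow_cache_misses(events, flag="new-order-index", min_ms=500):
    return [
        e for e in events
        if not e["cache_hit"]
        and e["duration_ms"] > min_ms
        and flag in e["feature_flags"]
    ]

events = [
    {"user_id": "user_42981", "cache_hit": False, "duration_ms": 847,
     "feature_flags": ["new-order-index", "cache-v2"]},
    {"user_id": "user_100", "cache_hit": True, "duration_ms": 900,
     "feature_flags": ["new-order-index"]},
]
print([e["user_id"] for e in slow_cache_misses(events)])  # ['user_42981']
```

A real backend evaluates this predicate over billions of events; the point is that the question was never pre-aggregated into a metric, yet it is still answerable.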

The Tooling Landscape in 2026

The observability tooling market has matured significantly, but it is also more confusing than ever. Here is an honest assessment of the major players:

OpenTelemetry: The Standard That Won

OpenTelemetry has effectively won the instrumentation standards war. If you are starting a new project, instrument with OTel and do not look back. The collector architecture gives you vendor flexibility:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  
  # Tail sampling: only keep traces with errors or high latency
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp/honeycomb:
    endpoint: "api.honeycomb.io:443"
    headers:
      "x-honeycomb-team": "${HONEYCOMB_API_KEY}"
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/honeycomb]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Honeycomb vs Datadog vs Grafana Cloud

Honeycomb remains the gold standard for true observability — their query engine handles high-cardinality data better than anyone. But their pricing can be steep for high-volume services. Datadog has everything but charges for everything individually, and the bill can be shocking. Grafana Cloud (with Tempo and Loki) offers the best self-hosted fallback path but requires more operational investment.

For small teams, my honest recommendation: start with Grafana Cloud’s free tier for metrics and basic dashboards, add Honeycomb’s free tier for traces and high-cardinality queries, and use structured logging to stdout that you can search when needed.

Practical Steps to Move from Monitoring to Observability

Step 1: Instrument with OpenTelemetry

Add OTel auto-instrumentation to your services. For most languages, this is a few lines of code:

# Python with FastAPI
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

Step 2: Add Custom Attributes

Auto-instrumentation gives you the skeleton. Custom attributes give you the meat. Add business context to every span: user ID, tenant ID, feature flags, plan tier. These are the dimensions you will query when debugging.
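In OpenTelemetry you would attach each of these with span.set_attribute(key, value); here the attribute set is modeled as a plain dict so the shape is visible without any SDK, and every lookup is a placeholder for your own code:

```python
# The business context worth attaching to every span. Keys follow the
# dotted naming convention OpenTelemetry attributes use; values must
# be primitives (or sequences of primitives), hence the flat shapes.
def business_attributes(user):
    return {
        "user.id": user["id"],
        "tenant.id": user["tenant"],
        "plan.tier": user["plan"],       # e.g. "free", "pro"
        "feature_flags": user["flags"],  # list of active flag names
    }

attrs = business_attributes(
    {"id": "user_42981", "tenant": "acme", "plan": "pro",
     "flags": ["new-order-index"]}
)
```

These are exactly the dimensions the earlier scenarios needed: the partition a user hashes to, the flags they had enabled, the plan tier that routed them down a different code path.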

Step 3: Build Fewer Dashboards

This sounds counterintuitive, but dashboards are where observability goes to die. Every dashboard is a pre-baked question. Instead of building a dashboard for every possible failure mode, invest in teaching your team to write ad-hoc queries against your trace and event data. The tool should be a query interface, not a wall of graphs.

Step 4: Practice with Game Days

The best way to build observability muscle is to inject failures and practice debugging them using only your observability tooling. Can your team find the root cause of a latency spike in under 15 minutes using your current tools? If not, your observability stack needs work.

The Cost Question

Observability tooling is expensive because high-cardinality data is expensive to store and query. But the alternative — flying blind during incidents — is more expensive. The key is sampling intelligently: keep 100% of error traces, 100% of slow traces, and a statistical sample of everything else. This typically reduces data volume by 80-90% while preserving the ability to debug real issues.
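The same policy as the collector config earlier, reduced to a sketch of the per-trace decision (threshold and baseline rate are illustrative):

```python
# Keep every error trace, every slow trace, and a random baseline
# sample of the rest. `rng` is injectable so the decision is testable.
import random

def keep_trace(has_error, duration_ms, baseline_rate=0.10,
               slow_threshold_ms=1000, rng=random.random):
    if has_error:
        return True  # errors are always worth the storage
    if duration_ms >= slow_threshold_ms:
        return True  # slow traces are where debugging happens
    return rng() < baseline_rate  # statistical sample of normal traffic
```

If 1% of traffic is errors or slow and the baseline is 10%, this keeps roughly 11% of traces, in line with the 80-90% reduction mentioned above.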

Your dashboards are not lying to you on purpose. They are answering the questions you asked. The problem is that production systems fail in ways you did not think to ask about. That gap between the questions you asked and the questions you need to ask is exactly where observability lives.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
