For most of the 2010s, observability was a vendor war. Datadog used one agent, New Relic used another, Jaeger used its own SDK, and Prometheus scraped its own exposition format. Instrumenting an application for observability meant committing to a vendor and paying the switching cost later. OpenTelemetry changed this. By 2026, it has become the de facto standard for telemetry instrumentation — the layer that separates “how you instrument” from “where data goes.” Understanding it properly is no longer optional for backend engineers.
What OpenTelemetry Actually Is
OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral specification, API, SDK, and collector for three signal types: traces, metrics, and logs. The key architectural insight is the separation of concerns:
- Instrumentation libraries generate telemetry using the OTel API
- SDKs implement the API and handle batching, sampling, and export
- The Collector receives telemetry, processes it, and routes it to any backend
- Backends (Jaeger, Tempo, Prometheus, Datadog, Honeycomb) receive standardized data via OTLP
This means you instrument your application once with OTel and can send data to any combination of backends — Jaeger for distributed traces during development, Tempo in production, Datadog for specific teams — without changing application code.
The Three Signals and What Each Tells You
Traces: Distributed Request Flows
A trace represents the end-to-end journey of a single request through your system. It’s composed of spans — each span represents one operation (an HTTP call, a database query, a cache lookup) with a start time, duration, status, and arbitrary attributes.
The critical property is context propagation: a trace ID flows with the request across service boundaries. When service A calls service B calls service C, all three spans share the same trace ID and can be visualized as a single request tree. This is what makes distributed debugging tractable.
Metrics: Aggregated System State
Metrics are numerical measurements aggregated over time. Request rate, error rate, latency percentiles (p50, p95, p99), queue depth, cache hit ratio. Metrics are cheap to store and query at scale — you can keep years of metrics data that would be impossible to store as raw traces.
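As a concrete example of the aggregation involved, here is a stdlib-only sketch of computing latency percentiles from a batch of samples using the nearest-rank method. Real OTel metrics SDKs aggregate into histogram buckets rather than storing raw samples, which is precisely why metrics stay cheap at scale:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample covering fraction p of the batch."""
    ranked = sorted(samples)
    index = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[index]

# A batch of request latencies in milliseconds; two slow outliers
latencies_ms = [12, 15, 14, 13, 250, 16, 12, 900, 14, 13]
p50 = percentile(latencies_ms, 50)   # typical request
p99 = percentile(latencies_ms, 99)   # tail latency, dominated by the outliers
```

The gap between p50 and p99 here illustrates why averages are misleading for latency and why dashboards track percentiles instead.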
Logs: Discrete Events
Logs are timestamped records of discrete events. OTel’s log data model adds trace context (trace ID, span ID) to log records, enabling correlation: when you see an error in a log, you can jump directly to the trace that produced it.
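A sketch of what that correlation looks like in practice: a stdlib-only log formatter that stamps the active trace context onto every record as JSON fields, so a log line can later be joined to its trace. The `CURRENT_CONTEXT` dict is a hypothetical stand-in for the live span context; in a real service the OTel logging integration injects these fields for you:

```python
import io
import json
import logging

# Hypothetical stand-in for the active OTel span context
CURRENT_CONTEXT = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                   "span_id": "00f067aa0ba902b7"}

class TraceContextFormatter(logging.Formatter):
    """Emit JSON log lines carrying the active trace and span IDs."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "trace_id": CURRENT_CONTEXT["trace_id"],
            "span_id": CURRENT_CONTEXT["span_id"],
        })

# Wire the formatter to a handler and emit one correlated error line
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(TraceContextFormatter())
log = logging.getLogger("payment-service")
log.addHandler(handler)
log.error("Payment processing failed: insufficient funds")
line = json.loads(stream.getvalue())
```

Every record now carries the IDs a backend needs to link it to the exact span that was active when the error fired.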
Instrumentation in Practice
Auto-Instrumentation: The Fastest Path to Traces
For many common frameworks, OTel provides auto-instrumentation that requires no code changes. It patches the framework’s internals to generate spans automatically:
```bash
# Python: auto-instrument a FastAPI application
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install  # Installs instrumentation for detected libraries

# Run the application with auto-instrumentation
OTEL_SERVICE_NAME=order-api \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
  OTEL_TRACES_EXPORTER=otlp \
  OTEL_METRICS_EXPORTER=otlp \
  opentelemetry-instrument uvicorn main:app --host 0.0.0.0 --port 8000
```
Without changing a line of application code, this generates spans for every HTTP request, SQLAlchemy query, Redis operation, and outbound HTTP call. For the first 80% of observability coverage, auto-instrumentation is the right approach.
Manual Instrumentation: Adding Business Context
Auto-instrumentation doesn’t know what matters to your business. Adding custom spans and attributes for domain-specific operations is where you capture the context that makes debugging meaningful:
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("order-service", "1.0.0")

def process_order(order_id: str, customer_id: str) -> dict:
    with tracer.start_as_current_span("process_order") as span:
        # Add business-relevant attributes to the span
        span.set_attribute("order.id", order_id)
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("order.region", get_customer_region(customer_id))
        try:
            # Inventory check — creates a child span automatically if auto-instrumented
            inventory_result = check_inventory(order_id)
            span.set_attribute("inventory.available", inventory_result.available)
            if not inventory_result.available:
                span.set_attribute("order.outcome", "rejected_inventory")
                span.set_status(Status(StatusCode.ERROR, "Insufficient inventory"))
                return {"status": "rejected", "reason": "inventory"}

            # Payment processing
            with tracer.start_as_current_span("process_payment") as payment_span:
                payment_span.set_attribute("payment.method", get_payment_method(customer_id))
                result = charge_customer(customer_id, order_id)
                payment_span.set_attribute("payment.transaction_id", result.transaction_id)

            span.set_attribute("order.outcome", "fulfilled")
            return {"status": "success", "transaction_id": result.transaction_id}
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
```
Now when an order fails, you can filter traces by order.outcome = "rejected_inventory" or find all orders from a specific region. This contextual richness is what separates useful traces from traces that just show you which endpoints were called.
The OpenTelemetry Collector: Architecture Patterns
The Collector is the routing layer between your applications and your backends. It receives telemetry via OTLP (gRPC or HTTP), processes it, and exports to one or more destinations.
Agent Mode: One Collector Per Host
```yaml
# otel-collector-agent.yaml — runs as a DaemonSet on each Kubernetes node
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Also collect host metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  # Add Kubernetes metadata to all telemetry
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name
  # Sample high-volume, low-value traces
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 10  # Keep 10% of traces

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus.monitoring:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Enrich and sample first; batch runs last so it sees the final span stream
      processors: [k8sattributes, probabilistic_sampler, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [k8sattributes, batch]
      exporters: [prometheusremotewrite]
```
Tail-Based Sampling: Keeping the Traces That Matter
Head-based sampling (deciding at trace start whether to keep it) is simple but discards traces before you know if they’re interesting. Tail-based sampling buffers traces and makes the keep/drop decision after the trace completes — allowing you to always keep error traces and slow traces regardless of sampling rate.
```yaml
# Tail sampling processor configuration
processors:
  tail_sampling:
    decision_wait: 10s                # Wait up to 10s for all spans to arrive
    num_traces: 50000                 # Buffer up to 50k traces
    expected_new_traces_per_sec: 100
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow traces (p99 threshold)
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 2000}
      # Keep 5% of everything else
      - name: sample-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
This configuration keeps 100% of errors and slow traces — exactly what you need for debugging — while sampling down healthy fast requests that would otherwise dominate storage costs.
Correlating Traces, Metrics, and Logs
The real power of OTel emerges when you can navigate between signal types. A p99 latency spike in your metrics dashboard should link directly to example traces showing what was slow. A log error should link to its parent trace. This correlation requires consistent trace context across all signals.
In the Grafana stack (Tempo for traces, Loki for logs, Prometheus/Mimir for metrics), correlation is configured through data source links:
A Loki log line carrying OTel trace context:

```json
{
  "timestamp": "2026-03-28T10:23:45.123Z",
  "severity": "ERROR",
  "message": "Payment processing failed: insufficient funds",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service.name": "payment-service",
  "order.id": "ord_abc123"
}
```
The Grafana data source configuration (`datasources.yaml`) links the `trace_id` field in log lines to the Tempo trace viewer. Note the `\s*` in the matcher: without it, the pattern would match compact JSON but miss pretty-printed lines like the sample above, which have a space after the colon:

```yaml
- name: Loki
  type: loki
  url: http://loki:3100
  jsonData:
    derivedFields:
      - datasourceUid: tempo
        matcherRegex: '"trace_id":\s*"(\w+)"'
        name: TraceID
        url: '$${__value.raw}'  # Links to Tempo trace viewer
```
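It is worth sanity-checking the matcher regex against your actual log format, since a quoting or whitespace mismatch silently breaks the link with no error anywhere. A quick pure-Python check (no Grafana required) against both compact and pretty-printed variants of the log line:

```python
import re

# Derived-field pattern; \s* tolerates both compact and pretty-printed JSON
matcher = re.compile(r'"trace_id":\s*"(\w+)"')

compact = '{"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","severity":"ERROR"}'
pretty = '{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "severity": "ERROR"}'

ids = []
for line in (compact, pretty):
    match = matcher.search(line)
    ids.append(match.group(1) if match else None)
```

Running the same pattern you put in `datasources.yaml` over a few real log lines from each service is a cheap way to catch format drift before it reaches a dashboard.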
OpenTelemetry in 2026: What Has Changed
The OTel specification has reached stability across all three signal types as of 2024-2025. Key developments that matter for production adoption:
- Logs are now stable: The OTel log data model and SDK support have reached stability. Replacing direct Loki/Elasticsearch shipping with OTel-routed logs is now the recommended path.
- Profiles signal type: Continuous profiling is being added as a fourth signal type, with Pyroscope integration already in preview. This will unify the observability signal model further.
- OpAMP (Open Agent Management Protocol): Remote management of OTel Collector instances — updating configurations, sampling policies, and backend routing without redeployment — is now production-stable.
- Semantic conventions maturity: The semantic conventions for HTTP, databases, messaging, and cloud providers have stabilized, meaning attributes are now consistent across all auto-instrumentation libraries.
Getting Started: A Practical Order of Operations
- Deploy the OTel Collector as a DaemonSet (Kubernetes) or sidecar (other environments)
- Add auto-instrumentation to your highest-traffic services first — this generates immediate value with minimal effort
- Configure tail-based sampling from the start — it’s much harder to add later when you have high volumes
- Add manual instrumentation for your most important business operations (checkout, payment, user registration)
- Set up correlation links between your trace, metric, and log backends in Grafana
- Define SLO-based alerting on the metrics OTel generates — p99 latency, error rate, availability
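As a back-of-the-envelope illustration of the SLO math in that last step: a 99.9% availability SLO over 30 days allows roughly 43 minutes of error budget, and a burn-rate alert fires when the observed error rate would exhaust that budget too quickly. A stdlib-only sketch (the 14.4 threshold is a commonly cited fast-burn paging level, not an OTel construct):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly exhausting the budget' errors are burning.
    error_ratio: observed fraction of failed requests over the alert window.
    slo_target: e.g. 0.999 for a 99.9% availability SLO."""
    budget_ratio = 1 - slo_target        # allowed error fraction, e.g. 0.001
    return error_ratio / budget_ratio

# 30-day error budget for a 99.9% SLO, in minutes (~43.2)
budget_minutes = 30 * 24 * 60 * (1 - 0.999)

# Currently failing 1.4% of requests against a 99.9% SLO: burning ~14x too fast
rate = burn_rate(0.014, 0.999)
should_page = rate > 14.4  # a commonly used fast-burn threshold for a 1-hour window
```

The inputs to this calculation — request counts and error counts per service — are exactly the metrics that OTel auto-instrumentation emits for free.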
Conclusion
OpenTelemetry has delivered on its promise of vendor-neutral observability instrumentation. The ecosystem is stable, the auto-instrumentation coverage is broad, and the Collector’s routing flexibility means you’re never locked into a backend choice. Teams that instrument with OTel today can switch from Jaeger to Tempo, from self-hosted Prometheus to Grafana Cloud, or add a commercial backend for specific use cases — without touching application code.
The investment in proper instrumentation — spans with business-relevant attributes, tail-based sampling, log-trace correlation — pays compound interest. Every incident that gets resolved faster, every performance regression that gets caught earlier, every capacity planning decision made from real data: all of it depends on the quality of your observability foundation. OpenTelemetry provides that foundation in a form that won’t trap you in a vendor relationship you’ll regret.
