
What Good Logging Actually Looks Like in Production

Most applications log. Very few log well. The difference between logs that help you debug a production incident in 10 minutes versus logs that leave you guessing for hours comes down to three properties: structure, context, and consistency. This guide walks through the evolution from basic print statements to production-grade structured logging with OpenTelemetry correlation — with real code you can drop into your application today.

Why Unstructured Logs Fail at Scale

Consider this log line:

2026-03-15 14:23:01 ERROR Payment failed for user 4821 after 3 retries

Questions you can’t answer from this log: Which service emitted it? What was the trace ID? What error did the payment gateway return? What was the request that triggered this? How long did it take?

Now consider the structured equivalent:

{
  "timestamp": "2026-03-15T14:23:01.432Z",
  "level": "error",
  "service": "order-service",
  "version": "1.4.2",
  "environment": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "event": "payment_failed",
  "user_id": 4821,
  "order_id": "ord_9k2hf7",
  "amount_cents": 4999,
  "currency": "USD",
  "gateway": "stripe",
  "retry_count": 3,
  "gateway_error_code": "card_declined",
  "gateway_error_message": "Your card has insufficient funds.",
  "duration_ms": 2847,
  "message": "Payment failed after 3 retries"
}

This log is queryable, correlatable with traces, and gives a complete picture of what happened without needing additional context.

Setting Up Structured Logging in Python

The standard library’s logging module can emit JSON with the right configuration:

import json
import logging
import os
from datetime import datetime, timezone

class StructuredFormatter(logging.Formatter):
    def __init__(self, service_name: str, version: str, environment: str):
        super().__init__()
        self.service_name = service_name
        self.version = version
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service_name,
            "version": self.version,
            "environment": self.environment,
            "logger": record.name,
            "message": record.getMessage(),
        }

        # Add trace context if available (populated by OTel instrumentation)
        if hasattr(record, 'trace_id'):
            log_entry["trace_id"] = record.trace_id
        if hasattr(record, 'span_id'):
            log_entry["span_id"] = record.span_id

        # Add any extra fields passed to the logger
        if hasattr(record, 'extra_fields'):
            log_entry.update(record.extra_fields)

        # Exception info
        if record.exc_info:
            log_entry["exception"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
                "stacktrace": self.formatException(record.exc_info)
            }

        return json.dumps(log_entry, default=str)

def setup_logging(service_name: str, version: str, environment: str,
                  level: str = "INFO"):
    handler = logging.StreamHandler()
    handler.setFormatter(StructuredFormatter(service_name, version, environment))

    root_logger = logging.getLogger()
    root_logger.setLevel(getattr(logging, level.upper()))
    root_logger.handlers = [handler]

# Initialize once at startup
setup_logging(
    service_name="order-service",
    version="1.4.2",
    environment=os.getenv("APP_ENV", "development")
)
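One subtlety worth calling out: the stdlib copies everything in a log call's `extra` dict directly onto the LogRecord, so callers must nest their fields under a single `extra_fields` key for the formatter above to find them. A minimal self-contained sketch of the calling convention (`MinimalFormatter` is a trimmed stand-in for StructuredFormatter, for demonstration only):

```python
import io
import json
import logging

# Capture output in a buffer so we can inspect the emitted JSON line
buf = io.StringIO()
handler = logging.StreamHandler(buf)

class MinimalFormatter(logging.Formatter):
    # Trimmed stand-in: merges record.extra_fields into the JSON entry
    def format(self, record: logging.LogRecord) -> str:
        entry = {"message": record.getMessage()}
        if hasattr(record, "extra_fields"):
            entry.update(record.extra_fields)
        return json.dumps(entry)

handler.setFormatter(MinimalFormatter())
demo_logger = logging.getLogger("demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)

# Callers nest their fields under the single `extra_fields` key
demo_logger.info("Payment failed after 3 retries",
                 extra={"extra_fields": {"user_id": 4821, "retry_count": 3}})
print(buf.getvalue().strip())
```

Passing top-level keys in `extra` would scatter them as loose record attributes; the single nested dict keeps the formatter's lookup simple and collision-free.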

Context-Aware Logging with Bound Loggers

The most common mistake: passing context (user_id, request_id, etc.) as individual keyword arguments to every log call. This is tedious and inconsistent. Use bound loggers that carry context automatically:

import time
import uuid

import structlog

# structlog is the gold standard for Python structured logging
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.ExceptionRenderer(),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory()
)

logger = structlog.get_logger()

# In your request handler (FastAPI/Flask middleware)
async def request_middleware(request, call_next):
    request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
    user_id = get_user_from_token(request)

    # Bind context for ALL log calls in this request
    structlog.contextvars.bind_contextvars(
        request_id=request_id,
        user_id=user_id,
        path=str(request.url.path),
        method=request.method
    )

    start = time.monotonic()
    try:
        response = await call_next(request)
        logger.info("request_completed",
                    status_code=response.status_code,
                    duration_ms=round((time.monotonic() - start) * 1000, 2))
        return response
    except Exception as e:
        logger.error("request_failed",
                     error=str(e),
                     duration_ms=round((time.monotonic() - start) * 1000, 2))
        raise
    finally:
        structlog.contextvars.clear_contextvars()

# Anywhere in your call stack, user_id and request_id are automatic
def process_payment(order_id: str, amount: int):
    log = logger.bind(order_id=order_id, amount_cents=amount)
    log.info("payment_started")

    try:
        result = payment_gateway.charge(amount)
        log.info("payment_succeeded", transaction_id=result.id)
        return result
    except PaymentError as e:
        log.error("payment_failed",
                  gateway_error=e.code,
                  gateway_message=str(e))
        raise
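With the middleware's contextvars bound, the payment_failed event above renders as a single JSON line along these lines (field values are illustrative):

```json
{
  "timestamp": "2026-03-15T14:23:01.432Z",
  "level": "error",
  "event": "payment_failed",
  "request_id": "a1b2c3d4",
  "user_id": 4821,
  "path": "/orders/ord_9k2hf7/pay",
  "method": "POST",
  "order_id": "ord_9k2hf7",
  "amount_cents": 4999,
  "gateway_error": "card_declined",
  "gateway_message": "Your card has insufficient funds."
}
```

Note that request_id, user_id, path, and method came from the middleware's contextvars, while order_id and amount_cents came from the bound logger in process_payment; nothing was passed explicitly at the error call site except the gateway fields.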

Node.js: Pino for High-Performance Structured Logging

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: 'api-gateway',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: {
    level: (label) => ({ level: label }),
  },
  // In development, use pino-pretty for human-readable output
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
});

// Child loggers carry parent context
const requestLogger = logger.child({
  request_id: req.id,
  user_id: req.user?.id,
  ip: req.ip,
});

requestLogger.info({ path: req.path, method: req.method }, 'request_received');

Correlating Logs with OpenTelemetry Traces

The real power of structured logging emerges when logs are correlated with distributed traces. OpenTelemetry (OTel) provides the standard for this. When you instrument your application with OTel, you get a trace_id and span_id in context that you can inject into every log line.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import structlog

# Initialize OTel
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

# Custom structlog processor to inject OTel trace context
def add_otel_context(logger, method, event_dict):
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, '032x')
        event_dict["span_id"] = format(ctx.span_id, '016x')
        event_dict["trace_flags"] = format(ctx.trace_flags, '02x')
    return event_dict

# Add to structlog processors
structlog.configure(
    processors=[
        add_otel_context,  # Inject trace context into every log
        structlog.contextvars.merge_contextvars,
        # ... rest of processors
    ]
)

# Now every log line includes trace_id and span_id
# You can click from a log line in Grafana Loki directly to the trace in Tempo
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    logger.info("processing_order", order_id=order_id)
    # This log line has trace_id that links to the span in Jaeger/Tempo

What to Log: The Events That Matter

Logging too much is nearly as bad as logging too little. High-value events to always log:

  • Service lifecycle: startup, shutdown, config loaded, migrations run
  • Authentication events: login success/failure, token issued/revoked, permission denied
  • State changes: order created/cancelled, payment processed, user role changed
  • External service calls: request sent, response received (status + duration), errors
  • Background job execution: job started, completed, failed (with duration)
  • Errors and exceptions: always with stack trace, always with context

Things that often aren’t worth logging:

  • Every cache hit (log misses and errors instead)
  • Routine health check requests
  • Debug-level diagnostics in production (use sampling)
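Of the high-value events above, external service calls are the easiest to capture uniformly, since one wrapper covers every gateway and API client. A stdlib-only sketch (the decorator and `charge` stand-in are illustrative, not part of any real gateway SDK):

```python
import json
import logging
import time
from functools import wraps

outbound_log = logging.getLogger("outbound")

def logged_call(dependency: str):
    """Log every call to an external dependency with outcome and duration."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = func(*args, **kwargs)
                outbound_log.info(json.dumps({
                    "event": "dependency_call_succeeded",
                    "dependency": dependency,
                    "duration_ms": round((time.monotonic() - start) * 1000, 2),
                }))
                return result
            except Exception as exc:
                outbound_log.error(json.dumps({
                    "event": "dependency_call_failed",
                    "dependency": dependency,
                    "error": str(exc),
                    "duration_ms": round((time.monotonic() - start) * 1000, 2),
                }))
                raise
        return wrapper
    return decorator

@logged_call("stripe")
def charge(amount_cents: int) -> str:
    # Stand-in for a real gateway call
    return f"txn_{amount_cents}"
```

In a structlog codebase the two json.dumps calls would become logger.info/logger.error events, but the shape is the same: one success or failure event per outbound call, always with duration.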

Log Sampling for High-Volume Services

import random

class SamplingLogger:
    def __init__(self, logger, sample_rate: float = 0.01):
        self._logger = logger
        self.sample_rate = sample_rate

    def debug(self, event: str, **kwargs):
        """Sample debug logs at sample_rate"""
        if random.random() < self.sample_rate:
            self._logger.debug(event, **kwargs, _sampled=True,
                               _sample_rate=self.sample_rate)

    def error(self, event: str, **kwargs):
        """Always log errors"""
        self._logger.error(event, **kwargs)

# 1% sampling on verbose cache operations
cache_logger = SamplingLogger(logger, sample_rate=0.01)
cache_logger.debug("cache_hit", key=cache_key)  # Only logs 1% of the time
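Random sampling has one drawback: it scatters a request's debug logs, keeping some lines and dropping others. Hashing a stable key such as the request ID makes the decision deterministic, so all debug logs for one request are kept or dropped together. A sketch (the key name is an assumption; any stable per-request value works):

```python
import hashlib

def should_sample(key: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same key always gets the same decision."""
    # Map the key to a uniform value in [0, 1) via a stable hash
    digest = hashlib.sha256(key.encode()).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < sample_rate
```

Swapping this in for random.random() in SamplingLogger (passing the bound request_id as the key) gives you complete debug traces for the sampled requests instead of fragments from all of them.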

Testing Your Log Output

Log output should be tested like any other interface. Use snapshot testing to catch accidental changes:

import json
import structlog.testing

def test_payment_failure_log():
    with structlog.testing.capture_logs() as captured:
        try:
            process_payment("ord_123", 4999)
        except PaymentError:
            pass

    assert len(captured) == 2
    assert captured[0]["event"] == "payment_started"
    assert captured[1]["event"] == "payment_failed"
    assert captured[1]["order_id"] == "ord_123"
    assert captured[1]["amount_cents"] == 4999
    assert "gateway_error" in captured[1]
    # Verify no PII leaks into logs
    assert "card_number" not in json.dumps(captured[1])

The Migration Path

If you have an existing application with unstructured logs, migrate incrementally:

  1. Add structured logging to all new code immediately
  2. Wrap existing loggers with a structured adapter
  3. Add request-scoped context (request_id, user_id) via middleware first — highest value
  4. Instrument external service calls next
  5. Add OTel trace correlation last
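Step 2, the structured adapter, can be a thin class that accepts the old call style while emitting JSON, so legacy call sites migrate one at a time. A sketch (class and field names are illustrative):

```python
import json
import logging

class StructuredAdapter:
    """Wraps a stdlib logger so legacy call sites keep working while
    context and per-call fields ride along as structured JSON."""

    def __init__(self, logger: logging.Logger, **context):
        self._logger = logger
        self._context = context

    def bind(self, **fields) -> "StructuredAdapter":
        # Returns a new adapter carrying the merged context
        return StructuredAdapter(self._logger, **{**self._context, **fields})

    def info(self, message: str, **fields):
        self._logger.info(json.dumps(
            {"message": message, **self._context, **fields}))

    def error(self, message: str, **fields):
        self._logger.error(json.dumps(
            {"message": message, **self._context, **fields}))
```

Existing `logger.info("some message")` calls keep working unchanged, while migrated call sites can start adding fields and bound context immediately.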

The payoff is substantial: teams that invest in structured logging consistently report sharp reductions in mean time to resolution for incidents. When a customer reports a problem, you can find the exact log line within seconds rather than spending half an hour grepping through unstructured text. Invest in your logs: they are the primary diagnostic tool for everything that happens in production.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
