
Your System Will Fail. The Question Is How Gracefully.

I was on call when our payment service went down because a third-party fraud detection API started timing out. The timeout was 30 seconds. We had 200 concurrent requests, each holding a thread while waiting for a response that would never come. Within two minutes, the thread pool was exhausted, and our entire payment service — not just the fraud check — was unresponsive. Orders, refunds, balance queries — everything dead because one downstream dependency got slow.

This is the canonical failure mode that resilience patterns exist to prevent. Not the dramatic server-on-fire scenario, but the quiet, cascading kind where one slow dependency drags everything down with it. Circuit breakers, bulkheads, and retry patterns are the engineering tools that contain these failures before they become full outages.

Circuit Breakers: Stop Calling a Dead Service

A circuit breaker monitors calls to a downstream service and trips open when failures exceed a threshold. Once open, it immediately fails all requests without actually calling the downstream service, giving it time to recover.

The circuit breaker has three states:

  • Closed: Normal operation. Requests flow through. Failures are counted.
  • Open: Failures exceeded the threshold. All requests are immediately rejected without calling the downstream service.
  • Half-Open: After a cooldown period, a limited number of test requests are allowed through. If they succeed, the breaker closes. If they fail, it opens again.

Implementation in Python

import time
import asyncio
import httpx
from enum import Enum
from dataclasses import dataclass, field
from threading import Lock

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 3
    
    _state: CircuitState = field(default=CircuitState.CLOSED, init=False)
    _failure_count: int = field(default=0, init=False)
    _last_failure_time: float = field(default=0.0, init=False)
    _half_open_calls: int = field(default=0, init=False)
    _lock: Lock = field(default_factory=Lock, init=False)
    
    @property
    def state(self) -> CircuitState:
        with self._lock:
            if self._state == CircuitState.OPEN:
                if time.time() - self._last_failure_time > self.recovery_timeout:
                    self._state = CircuitState.HALF_OPEN
                    self._half_open_calls = 0
            return self._state
    
    def record_success(self):
        with self._lock:
            if self._state == CircuitState.HALF_OPEN:
                self._half_open_calls += 1
                if self._half_open_calls >= self.half_open_max_calls:
                    self._state = CircuitState.CLOSED
                    self._failure_count = 0
            else:
                self._failure_count = 0
    
    def record_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            # A single failed probe in half-open re-opens the circuit;
            # otherwise, open only once the threshold is reached.
            if (self._state == CircuitState.HALF_OPEN
                    or self._failure_count >= self.failure_threshold):
                self._state = CircuitState.OPEN

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            raise CircuitOpenError(
                f"Circuit is open. Will retry after {self.recovery_timeout}s. "
                f"Last failure: {time.time() - self._last_failure_time:.1f}s ago."
            )
        try:
            result = func(*args, **kwargs)
            self.record_success()
            return result
        except Exception:
            self.record_failure()
            raise

class CircuitOpenError(Exception):
    pass

# Usage
fraud_check_breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30.0,
)

async def check_fraud(transaction):
    try:
        # CircuitBreaker.call is synchronous, so run the blocking httpx call
        # in a worker thread to keep the event loop responsive.
        return await asyncio.to_thread(
            fraud_check_breaker.call,
            httpx.post,
            "https://fraud-api.example.com/check",
            json=transaction.dict(),
            timeout=5.0,
        )
    except CircuitOpenError:
        # Fallback: allow the transaction but flag for manual review
        return FraudResult(approved=True, requires_review=True)
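To make the three-state lifecycle tangible, here is a stripped-down, single-threaded sketch: no locking, thresholds shrunk so the transitions happen in a fraction of a second. It is a toy for illustration, not the thread-safe breaker above.

```python
import time

class MiniBreaker:
    """Toy breaker: just enough state to walk through closed -> open -> half-open."""
    def __init__(self, failure_threshold=2, recovery_timeout=0.1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state, self.failures, self.last_failure = "closed", 0, 0.0

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()
        # A failed half-open probe, or hitting the threshold, opens the circuit.
        if self.failures >= self.failure_threshold or self.state == "half_open":
            self.state = "open"

    def current_state(self):
        # After the recovery timeout, an open circuit lets probes through.
        if self.state == "open" and time.time() - self.last_failure > self.recovery_timeout:
            self.state = "half_open"
        return self.state

b = MiniBreaker()
assert b.current_state() == "closed"
b.record_failure(); b.record_failure()   # second failure trips the breaker
assert b.current_state() == "open"
time.sleep(0.15)                          # wait out the (shortened) recovery timeout
assert b.current_state() == "half_open"
b.record_failure()                        # a failing probe re-opens the circuit
assert b.current_state() == "open"
```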

Circuit Breakers in Practice

The fallback behavior when the circuit is open is where the real engineering judgment lives. Options include:

  • Return cached data: when data staleness is acceptable (product catalog, user preferences)
  • Return a default: when a safe default exists (default shipping estimate, feature flags off)
  • Degrade gracefully: when the feature is optional (skip recommendations, skip analytics)
  • Fail fast with a clear error: when no safe fallback exists (payment processing, auth checks)
  • Queue for later: when the action can be made async (email notifications, webhook delivery)
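As a concrete sketch of the first strategy, a cached-data fallback might look like this. The `fetch_catalog` wrapper and the module-level cache are illustrative stand-ins, not a library API; `CircuitOpenError` is re-defined here only to keep the snippet self-contained.

```python
import time

class CircuitOpenError(Exception):
    pass

_catalog_cache: dict = {}  # last known-good response plus its timestamp

def fetch_catalog(fetch):
    """Call `fetch`; refresh the cache on success, serve stale data when the circuit is open."""
    try:
        data = fetch()
        _catalog_cache.update(data=data, at=time.time())
        return data, False                         # fresh
    except CircuitOpenError:
        if "data" in _catalog_cache:
            return _catalog_cache["data"], True    # stale but usable
        raise                                      # nothing cached yet: no safe fallback

def _circuit_is_open():
    raise CircuitOpenError("catalog-api circuit is open")

catalog, stale = fetch_catalog(lambda: ["book", "lamp"])   # normal call fills the cache
catalog, stale = fetch_catalog(_circuit_is_open)           # open circuit: serve the cache
```

Note that the caller gets a staleness flag rather than a silent substitution, so the UI can decide whether to surface "data may be out of date".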

Bulkheads: Isolate the Blast Radius

The bulkhead pattern borrows from shipbuilding: ships have watertight compartments so that a hull breach in one compartment does not sink the entire vessel. In software, bulkheads isolate resources so that a failure in one component cannot exhaust resources needed by other components.

Thread Pool Bulkheads

The most common bulkhead implementation uses separate thread pools (or connection pools, or semaphores) for different downstream dependencies:

import asyncio
from dataclasses import dataclass

@dataclass
class Bulkhead:
    name: str
    max_concurrent: int
    max_wait: float = 5.0
    
    def __post_init__(self):
        self._semaphore = asyncio.Semaphore(self.max_concurrent)
        self._waiting = 0
    
    async def execute(self, coro):
        self._waiting += 1
        try:
            await asyncio.wait_for(
                self._semaphore.acquire(),
                timeout=self.max_wait,
            )
        except asyncio.TimeoutError:
            self._waiting -= 1
            raise BulkheadFullError(
                f"Bulkhead '{self.name}' is full. "
                f"{self.max_concurrent} calls in progress, "
                f"{self._waiting} waiting."
            )
        
        self._waiting -= 1
        try:
            return await coro
        finally:
            self._semaphore.release()

class BulkheadFullError(Exception):
    pass

# Separate bulkheads for each downstream service
payment_bulkhead = Bulkhead("payment-api", max_concurrent=20, max_wait=5.0)
fraud_bulkhead = Bulkhead("fraud-api", max_concurrent=10, max_wait=3.0)
inventory_bulkhead = Bulkhead("inventory-api", max_concurrent=30, max_wait=5.0)

async def process_order(order):
    # Each call is isolated. If fraud API is slow and all 10 slots
    # are occupied, it cannot steal capacity from payment or inventory.
    payment = await payment_bulkhead.execute(
        check_payment(order.payment_method)
    )
    fraud = await fraud_bulkhead.execute(
        check_fraud(order)
    )
    inventory = await inventory_bulkhead.execute(
        reserve_inventory(order.items)
    )

Without bulkheads, all downstream calls share a single resource pool. When one service gets slow, it monopolizes the shared resources and every other service call suffers. With bulkheads, a slow fraud API can only consume its allocated 10 concurrent slots. The remaining capacity for payments and inventory remains untouched.
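A minimal asyncio sketch of that isolation, with deliberately tiny pools (the pool sizes and service names are illustrative): two hung fraud calls saturate their semaphore, yet a payment call still completes immediately because its semaphore is separate.

```python
import asyncio

async def demo():
    fraud_slots = asyncio.Semaphore(2)     # tiny pools so the demo saturates quickly
    payment_slots = asyncio.Semaphore(2)

    async def hung_fraud_check():
        async with fraud_slots:
            await asyncio.sleep(10)        # simulated hung dependency

    async def payment_check():
        async with payment_slots:
            return "ok"

    # Saturate the fraud pool with hung calls...
    hung = [asyncio.create_task(hung_fraud_check()) for _ in range(2)]
    await asyncio.sleep(0.01)              # let them claim their slots

    # ...the payment pool is untouched, so this completes well within the timeout.
    result = await asyncio.wait_for(payment_check(), timeout=1.0)

    for task in hung:
        task.cancel()
    await asyncio.gather(*hung, return_exceptions=True)
    return result

result = asyncio.run(demo())   # payment capacity survives fraud saturation
```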

Retry Patterns: The Most Dangerous Tool in Your Toolbox

Retries are the most commonly implemented and most commonly misimplemented resilience pattern. A naive retry loop can turn a minor hiccup into a catastrophic retry storm that overwhelms the very service you are trying to reach.

The Wrong Way

# NEVER DO THIS
async def call_service(url, payload):
    async with httpx.AsyncClient() as client:
        for attempt in range(5):
            try:
                return await client.post(url, json=payload, timeout=10)
            except Exception:
                pass  # Retry immediately
    raise Exception("Service unavailable")

# Why this is dangerous:
# - No backoff: hammers the failing service as fast as possible
# - No jitter: all clients retry at the exact same time
# - Retries on ALL exceptions, including 400 Bad Request
# - 5 retries * N clients = 5N requests to an already struggling service

The Right Way: Exponential Backoff with Jitter

import random
import asyncio
import httpx

class NonRetryableError(Exception):
    pass

class RetriesExhaustedError(Exception):
    pass

class HttpError(Exception):
    def __init__(self, status_code: int, body: str):
        super().__init__(f"HTTP {status_code}: {body}")
        self.status_code = status_code

async def call_with_retry(
    url: str,
    payload: dict,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retryable_status_codes: frozenset = frozenset({429, 502, 503, 504}),
):
    last_exception = None
    
    async with httpx.AsyncClient() as client:  # one client reused across attempts
        for attempt in range(max_retries + 1):
            try:
                response = await client.post(url, json=payload, timeout=5.0)
                
                if response.status_code < 400:
                    return response
                
                if response.status_code not in retryable_status_codes:
                    # Client error (4xx) - do NOT retry
                    raise NonRetryableError(
                        f"Request failed with {response.status_code}: {response.text}"
                    )
                
                last_exception = HttpError(response.status_code, response.text)
                
            except (httpx.ConnectTimeout, httpx.ReadTimeout, httpx.ConnectError) as e:
                last_exception = e
            
            if attempt < max_retries:
                # Exponential backoff with full jitter
                delay = min(base_delay * (2 ** attempt), max_delay)
                await asyncio.sleep(random.uniform(0, delay))
    
    raise RetriesExhaustedError(
        f"Failed after {max_retries + 1} attempts. Last error: {last_exception}"
    )

Retry Budget Pattern

An even better approach is a retry budget that limits retries as a percentage of total traffic:

from collections import deque
from time import time

class RetryBudget:
    """Limits retries to a percentage of total requests over a time window."""
    
    def __init__(self, max_retry_ratio=0.1, window_seconds=60, min_retries_per_second=10):
        self.max_retry_ratio = max_retry_ratio
        self.window_seconds = window_seconds
        self.min_retries_per_second = min_retries_per_second
        self._requests = deque()
        self._retries = deque()
    
    def _cleanup(self):
        cutoff = time() - self.window_seconds
        while self._requests and self._requests[0] < cutoff:
            self._requests.popleft()
        while self._retries and self._retries[0] < cutoff:
            self._retries.popleft()
    
    def record_request(self):
        self._requests.append(time())
    
    def can_retry(self) -> bool:
        self._cleanup()
        total_requests = len(self._requests)
        total_retries = len(self._retries)
        
        # Always allow a minimum retry rate
        if total_retries < self.min_retries_per_second * self.window_seconds:
            return True
        
        # Check if retries exceed the budget
        if total_requests == 0:
            return True
        return (total_retries / total_requests) < self.max_retry_ratio
    
    def record_retry(self):
        self._retries.append(time())

# Usage: retry budget shared across all callers of a service
payment_retry_budget = RetryBudget(max_retry_ratio=0.1)  # Max 10% retries
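To see the ratio check in isolation (setting the minimum-retry floor aside), suppose 200 requests are in the sliding window with a 10% budget; the numbers here are illustrative:

```python
window_requests = 200      # requests recorded in the sliding window
max_retry_ratio = 0.1      # at most 10% of traffic may be retries

def budget_allows(total_retries: int) -> bool:
    # Same comparison as RetryBudget.can_retry: strict, so the budget
    # rejects the retry that would reach the ratio.
    return (total_retries / window_requests) < max_retry_ratio

assert budget_allows(19)        # 19/200 = 9.5%: under budget
assert not budget_allows(20)    # 20/200 = 10%: budget exhausted
```

The effect is that when a downstream service degrades and failures spike, retries are capped at a fixed fraction of real traffic instead of multiplying it.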

Combining the Patterns

These patterns work best together. Here is how they compose:

async def resilient_call(service_name, url, payload):
    """
    Call flow:
    1. Check circuit breaker (fail fast if open)
    2. Acquire bulkhead slot (fail if capacity exhausted)
    3. Make the call with retry logic
    4. Record result in circuit breaker
    """
    breaker = circuit_breakers[service_name]
    bulkhead = bulkheads[service_name]
    budget = retry_budgets[service_name]
    
    # Step 1: Circuit breaker check
    if breaker.state == CircuitState.OPEN:
        return get_fallback(service_name, payload)
    
    # Steps 2 and 3: acquire a bulkhead slot around the retried call.
    # (Bulkhead exposes execute(), not a context manager.)
    try:
        result = await bulkhead.execute(
            call_with_retry(
                url, payload,
                max_retries=2 if budget.can_retry() else 0,
            )
        )
        breaker.record_success()
        return result
    except BulkheadFullError:
        raise  # our own capacity is exhausted; not a downstream failure
    except Exception:
        breaker.record_failure()
        raise

Library Recommendations

You do not have to implement these patterns from scratch. Production-grade libraries exist for most languages:

  • Java/Kotlin: Resilience4j (circuit breaker, bulkhead, retry, rate limiter, time limiter)
  • Go: sony/gobreaker (circuit breaker), avast/retry-go (retry with backoff)
  • Python: tenacity (retry with backoff and jitter), pybreaker (circuit breaker)
  • Node.js: cockatiel (circuit breaker, bulkhead, retry, timeout)
  • .NET: Polly (circuit breaker, bulkhead, retry, timeout, fallback)

Resilience patterns are not optional complexity you add when things get serious. They are the difference between a minor dependency hiccup and a two-hour outage that wakes up the entire on-call rotation. Start with circuit breakers on your slowest downstream dependency. Add bulkheads when you have more than three downstream services. Implement retry budgets before your retry loops amplify the next incident. Your future self at 3 AM will be grateful.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
