
When Systems Fail Gracefully Instead of Catastrophically

Distributed systems fail. Networks partition. Dependencies slow down. Third-party APIs return 503. The question isn’t whether your system will encounter failures — it’s whether those failures cascade into full outages or get absorbed gracefully. Circuit breakers, bulkheads, and retry patterns are the three foundational resilience patterns every backend developer needs to internalize. This guide covers how each works, how to implement them, and crucially, how to combine them without creating new problems.

The Circuit Breaker Pattern

A circuit breaker sits between your service and a dependency. When the dependency starts failing repeatedly, the circuit breaker “opens” and immediately rejects requests rather than waiting for timeouts. This prevents slow or failing dependencies from consuming all your threads and degrading your entire application.

A circuit breaker has three states:

  • Closed: Normal operation. Requests flow through. Failures are counted.
  • Open: Dependency is failing. Requests are immediately rejected with a fallback response. No calls made to the failing service.
  • Half-Open: After a recovery timeout, a small number of test requests are allowed through. If they succeed, the circuit closes. If they fail, it opens again.

Implementing a Circuit Breaker in Python

import time
import threading
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Optional

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    success_threshold: int = 2
    timeout: float = 30.0  # seconds before trying half-open

    _state: State = field(default=State.CLOSED, init=False)
    _failure_count: int = field(default=0, init=False)
    _success_count: int = field(default=0, init=False)
    _last_failure_time: Optional[float] = field(default=None, init=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False)

    def call(self, func: Callable, *args, fallback=None, **kwargs):
        with self._lock:
            if self._state == State.OPEN:
                if time.time() - self._last_failure_time >= self.timeout:
                    self._state = State.HALF_OPEN
                    self._success_count = 0
                else:
                    if fallback is not None:
                        return fallback()
                    raise CircuitOpenError("Circuit is open")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            if fallback is not None:
                return fallback()
            raise

    def _on_success(self):
        with self._lock:
            if self._state == State.HALF_OPEN:
                self._success_count += 1
                if self._success_count >= self.success_threshold:
                    self._state = State.CLOSED
                    self._failure_count = 0
            elif self._state == State.CLOSED:
                self._failure_count = 0

    def _on_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            if (self._state == State.CLOSED and
                    self._failure_count >= self.failure_threshold):
                self._state = State.OPEN
            elif self._state == State.HALF_OPEN:
                self._state = State.OPEN

class CircuitOpenError(Exception):
    pass

# Usage
payment_circuit = CircuitBreaker(failure_threshold=5, timeout=30.0)

def charge_card(amount, card_token):
    return payment_circuit.call(
        payment_gateway.charge,
        amount,
        card_token,
        fallback=lambda: {"status": "queued", "message": "Payment queued for retry"}
    )

Using Resilience4j in Java/Kotlin

In the JVM ecosystem, Resilience4j is the de facto standard:

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(10)
    .failureRateThreshold(50)        // Open when 50% of requests fail
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(3)
    .recordExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(BusinessException.class)  // Don't count business logic errors
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("payment-service", config);

// Decorate your function
Supplier<PaymentResult> decoratedSupplier =
    CircuitBreaker.decorateSupplier(circuitBreaker, () -> paymentService.charge(amount));

// Execute with fallback
Try.ofSupplier(decoratedSupplier)
    .recover(CallNotPermittedException.class, ex -> PaymentResult.queued())
    .recover(IOException.class, ex -> PaymentResult.failed(ex.getMessage()));

The Bulkhead Pattern

A bulkhead isolates failures by dedicating separate resource pools to different operations. The name comes from the watertight compartments in a ship’s hull: if one compartment floods, the others remain intact.

Without bulkheads, a slow external API can exhaust your entire thread pool, blocking unrelated operations. With bulkheads, each downstream dependency gets its own bounded pool.

Thread Pool Bulkheads

import concurrent.futures
import threading

class BulkheadPool:
    def __init__(self, name: str, max_workers: int, max_queue: int = 10):
        self.name = name
        self.executor = concurrent.futures.ThreadPoolExecutor(
            max_workers=max_workers,
            thread_name_prefix=f"bulkhead-{name}"
        )
        self.semaphore = threading.Semaphore(max_workers + max_queue)

    def submit(self, fn, *args, **kwargs):
        if not self.semaphore.acquire(blocking=False):
            raise BulkheadFullError(
                f"Bulkhead '{self.name}' is full — rejecting request"
            )
        def release_and_call():
            try:
                return fn(*args, **kwargs)
            finally:
                self.semaphore.release()

        return self.executor.submit(release_and_call)

class BulkheadFullError(Exception):
    pass

# Separate pools for different external services
email_pool = BulkheadPool("email-service", max_workers=5, max_queue=20)
payment_pool = BulkheadPool("payment-service", max_workers=10, max_queue=5)
analytics_pool = BulkheadPool("analytics", max_workers=3, max_queue=50)

# Now a slow email service can't block payment processing
def send_confirmation_email(user_id, order_id):
    try:
        future = email_pool.submit(email_service.send, user_id, order_id)
        return future.result(timeout=5.0)
    except BulkheadFullError:
        # Degrade gracefully — queue for later
        queue_email_task(user_id, order_id)
    except concurrent.futures.TimeoutError:
        queue_email_task(user_id, order_id)
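
The fail-fast rejection is worth seeing end to end. Below is a deliberately tiny, hypothetical re-statement of the same semaphore-guarded pool (all names made up), small enough to run and watch the third submission get rejected immediately while the first two occupy the worker and the single queue slot:

```python
import concurrent.futures
import threading

class TinyBulkhead:
    """Minimal sketch of the semaphore-guarded pool idea above."""
    def __init__(self, max_workers: int, max_queue: int):
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        # One slot per worker plus one per queue position
        self._slots = threading.Semaphore(max_workers + max_queue)

    def submit(self, fn):
        # Non-blocking acquire turns "pool exhausted" into a fast, explicit failure
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        def wrapped():
            try:
                return fn()
            finally:
                self._slots.release()
        return self._executor.submit(wrapped)

pool = TinyBulkhead(max_workers=1, max_queue=1)
gate = threading.Event()

f1 = pool.submit(lambda: gate.wait(timeout=5))  # occupies the only worker
f2 = pool.submit(lambda: gate.wait(timeout=5))  # takes the single queue slot
rejected = False
try:
    pool.submit(lambda: gate.wait(timeout=5))   # no capacity left: rejected at once
except RuntimeError:
    rejected = True
gate.set()  # release the two in-flight tasks
print(rejected)  # True
```

The caller of the rejected submission never blocks, which is exactly the property that keeps a slow dependency from occupying threads that other operations need.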

Semaphore Bulkheads in Resilience4j

BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(10)
    .maxWaitDuration(Duration.ofMillis(100))  // Don't wait long if full
    .build();

Bulkhead bulkhead = Bulkhead.of("database-reads", bulkheadConfig);

Supplier<List<User>> decoratedSupplier =
    Bulkhead.decorateSupplier(bulkhead, () -> userRepository.findActive());

Try.ofSupplier(decoratedSupplier)
    .recover(BulkheadFullException.class, ex -> getCachedActiveUsers());

Retry Patterns

Retries handle transient failures — network blips, brief resource exhaustion, momentary database locks. Done naively, retries cause thundering herd problems. Done correctly, they make systems self-healing.

Exponential Backoff with Jitter

The most important rule: never retry immediately and never use a fixed retry interval. Use exponential backoff with jitter:

import random
import time

def exponential_backoff_with_jitter(
    func,
    max_retries: int = 5,
    base_delay: float = 0.1,    # 100ms base
    max_delay: float = 30.0,    # 30 second cap
    jitter_factor: float = 0.5,
    retryable_exceptions: tuple = (IOError, TimeoutError)
):
    last_exception = None

    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable_exceptions as e:
            last_exception = e
            if attempt == max_retries:
                break

            # Exponential backoff: 0.1, 0.2, 0.4, 0.8, 1.6...
            delay = min(base_delay * (2 ** attempt), max_delay)

            # Equal jitter: keep half the delay deterministic and randomize
            # the rest, spreading retries across time to prevent thundering herd
            jittered_delay = delay * (1 - jitter_factor) + random.uniform(0, delay * jitter_factor)

            time.sleep(jittered_delay)

    raise last_exception

# Usage
result = exponential_backoff_with_jitter(
    lambda: requests.post("https://api.example.com/charge", json=payload, timeout=5),
    max_retries=3,
    retryable_exceptions=(requests.exceptions.Timeout, requests.exceptions.ConnectionError)
)
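
To see what the jitter buys you, the sketch below (helper name hypothetical) computes the attempt-3 delay for 100 simulated clients. Without jitter they all collide on exactly the same instant; with equal jitter they spread across the randomized half of the backoff window:

```python
import random

def equal_jitter_delay(attempt, base=0.1, cap=30.0, jitter_factor=0.5):
    # Same formula as above: half deterministic, half randomized
    delay = min(base * (2 ** attempt), cap)
    return delay * (1 - jitter_factor) + random.uniform(0, delay * jitter_factor)

random.seed(7)  # deterministic for the demo
no_jitter = {min(0.1 * 2 ** 3, 30.0) for _ in range(100)}
with_jitter = {round(equal_jitter_delay(3), 4) for _ in range(100)}

print(len(no_jitter))    # 1 -> all 100 clients retry at exactly 0.8s
print(len(with_jitter))  # many distinct retry times between 0.4s and 0.8s
```

One synchronized spike versus a smear of retries is the difference between a dependency recovering and being knocked over again the moment it comes back.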

Retry Budgets: Preventing Retry Amplification

A dangerous failure mode: Service A calls Service B, which calls Service C, which calls Service D. If each layer makes up to 3 attempts (the original call plus 2 retries), a single failing request generates 3 x 3 x 3 = 27 attempts against Service D. Use retry budgets to limit total retries system-wide:

import threading
import time

class RetryBudget:
    """
    Allows retries only when the retry rate stays below a threshold.
    Inspired by Google's SRE approach to retry budgets.
    """
    def __init__(self, budget_percent: float = 10.0, window_seconds: float = 60.0):
        self.budget_percent = budget_percent
        self.window_seconds = window_seconds
        self._requests = []
        self._retries = []
        self._lock = threading.Lock()

    def can_retry(self) -> bool:
        now = time.time()
        cutoff = now - self.window_seconds

        with self._lock:
            self._requests = [t for t in self._requests if t > cutoff]
            self._retries = [t for t in self._retries if t > cutoff]

            if not self._requests:
                return True

            retry_rate = len(self._retries) / len(self._requests) * 100
            return retry_rate < self.budget_percent

    def record_request(self):
        with self._lock:
            self._requests.append(time.time())

    def record_retry(self):
        with self._lock:
            self._retries.append(time.time())

payment_retry_budget = RetryBudget(budget_percent=10.0)
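
Plugging a budget into the retry loop is the part that’s easy to get wrong: the budget gates each individual retry, never the first attempt. The sketch below uses a simplified token-style budget (a variant of the sliding-window class above, similar in spirit to Finagle’s retry budgets; all names hypothetical) so the whole thing is runnable:

```python
import threading

class TokenRetryBudget:
    """Token-style budget: each request deposits a fraction of a retry token;
    each retry withdraws a whole one. A simplification of the sliding-window
    version above."""
    def __init__(self, retry_ratio: float = 0.1, initial_tokens: float = 10.0):
        self._tokens = initial_tokens
        self._ratio = retry_ratio
        self._lock = threading.Lock()

    def record_request(self):
        with self._lock:
            self._tokens = min(self._tokens + self._ratio, 100.0)

    def try_spend(self) -> bool:
        with self._lock:
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

def call_with_budget(func, budget, max_retries=3):
    budget.record_request()
    for attempt in range(max_retries + 1):
        try:
            return func()
        except IOError:
            # Retry only while attempts AND budget remain
            if attempt == max_retries or not budget.try_spend():
                raise

budget = TokenRetryBudget(retry_ratio=0.1, initial_tokens=2.0)
attempts = {"n": 0}
def hard_down():
    attempts["n"] += 1
    raise IOError("dependency down")

for _ in range(5):
    try:
        call_with_budget(hard_down, budget)
    except IOError:
        pass

# The first call burns the 2 banked retry tokens (3 attempts total); once the
# budget is exhausted, the remaining 4 calls each make a single attempt.
print(attempts["n"])  # 7
```

Against a hard-down dependency the budget converges on one attempt per call — exactly the fail-fast behavior that stops retry amplification across layers.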

Combining the Patterns: The Right Order

These patterns work best together, and the order matters. The standard composition from outermost to innermost:

  1. Bulkhead — outermost, limits concurrent calls
  2. Circuit Breaker — wraps the operation, fails fast when open
  3. Retry — innermost, retries transient failures before the circuit breaker sees them
  4. Timeout — on the actual network call

# Python example combining all three
def resilient_payment_charge(amount, card_token):
    def charge():
        return payment_circuit.call(
            lambda: exponential_backoff_with_jitter(
                lambda: payment_api.charge(amount, card_token, timeout=3.0),
                max_retries=2
            ),
            fallback=lambda: {"status": "queued"}
        )

    try:
        future = payment_pool.submit(charge)
        return future.result(timeout=10.0)
    except BulkheadFullError:
        return {"status": "rejected", "reason": "system_busy"}

Timeouts: The Pattern Everyone Forgets

None of the above patterns work without timeouts. An operation that hangs indefinitely defeats circuit breakers (failure count never increments) and bulkheads (threads stay occupied forever).

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure connection-level and read timeouts
session = requests.Session()
session.mount('https://', HTTPAdapter(
    max_retries=Retry(total=0)  # Let our retry logic handle this
))

# Always set both connect timeout AND read timeout
response = session.post(
    "https://api.payment.com/charge",
    json=payload,
    timeout=(3.05, 10)  # (connect_timeout, read_timeout)
)

Observability for Resilience Patterns

These patterns are useless if you can’t observe them. Instrument every state transition:

from prometheus_client import Counter, Gauge

circuit_state = Gauge('circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half_open)',
    ['service'])
retry_attempts = Counter('retry_attempts_total',
    'Total retry attempts', ['service', 'outcome'])
bulkhead_rejected = Counter('bulkhead_rejected_total',
    'Requests rejected by bulkhead', ['pool'])

Set up alerts: a circuit breaker open for more than 2 minutes means your dependency is genuinely down. A bulkhead rejection rate above 1% means your pools are undersized. A retry rate above 5% means something is persistently wrong.
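
The metrics above only move if something records the transitions. One lightweight approach is a callback invoked wherever the breaker assigns a new state; in this sketch (all names hypothetical) a plain dict stands in for the Prometheus gauge so it runs without dependencies:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("resilience")

STATE_VALUES = {"closed": 0, "open": 1, "half_open": 2}
gauges = {}  # stand-in for circuit_state.labels(service=...).set(...)

def on_state_change(service: str, old: str, new: str):
    """Call this from the breaker at every _state assignment."""
    gauges[("circuit_breaker_state", service)] = STATE_VALUES[new]
    # State changes are rare and significant: log them at WARNING
    log.warning("circuit %s: %s -> %s", service, old, new)

on_state_change("payment-service", "closed", "open")
print(gauges[("circuit_breaker_state", "payment-service")])  # 1
```

With real Prometheus, the dict write becomes a `circuit_state.labels(...).set(...)` call, and the structured log line gives you the exact transition timestamps your alerts fire on.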

Real-World Implementation Checklist

  • Every external HTTP call has a connect timeout AND a read timeout
  • No retries without exponential backoff and jitter
  • Circuit breakers on every external service dependency
  • Separate thread pools (bulkheads) for different external services
  • Fallback responses defined for every circuit-broken operation
  • All circuit state changes logged and alerted
  • Load test your resilience patterns — chaos engineering before production

The investment in these patterns pays off the first time a dependency degrades at 2am and your service keeps serving cached or degraded responses instead of throwing 500s. Build resilience in from the start — retrofitting it into a fragile system is significantly harder.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
