Last month, one of our payment processing services went down for 47 minutes. The root cause wasn’t a bug in business logic or a database failure. A single integration partner started hammering our API at 50x their normal volume, exhausting connection pools and starving every other client. We had rate limiting in place—sort of. A naive counter that reset every 60 seconds. It wasn’t enough.
If you’ve built APIs for any length of time, you’ve probably lived through a version of this story. Rate limiting sounds simple until you actually have to implement it for a system handling thousands of requests per second across multiple regions. The devil is in the algorithms, the distributed state, and—perhaps most overlooked—how your clients handle being told “slow down.”
This article breaks down the three most common rate limiting algorithms, shows production-ready implementations, and covers the client-side patterns that make the whole system work.
Why Naive Rate Limiting Fails
The simplest approach to rate limiting is a fixed counter: allow N requests per time window, increment a counter on each request, reject when the counter exceeds N. Most tutorials start and end here. The trouble lies at the window boundary, in what's known as the “burst at the edge” problem.
Imagine a limit of 100 requests per minute. A client sends 0 requests for the first 59 seconds, then sends 100 requests at second 59. The counter resets at second 60, and the client immediately sends another 100 requests. That’s 200 requests in 2 seconds—double the intended rate—and your counter never triggered.
This isn’t theoretical. In production, traffic patterns are bursty by nature. Cron jobs fire on the minute. Mobile apps sync when screens wake. Webhook retries cluster together. A rate limiter that can’t handle burst patterns is barely a rate limiter at all.
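To make the failure concrete, here's a minimal in-memory sketch of a naive fixed window counter (the `FixedWindow` class is illustrative, not from any library) reproducing the 200-requests-in-2-seconds scenario:

```python
class FixedWindow:
    """Naive fixed-window counter: `limit` requests per `window` seconds."""
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.counts = {}  # window start timestamp -> request count

    def allow(self, now):
        bucket = int(now // self.window) * self.window
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.limit

rl = FixedWindow(limit=100, window=60)

# 100 requests at t=59s (end of window 1) all pass...
passed_t59 = sum(rl.allow(59.0) for _ in range(100))
# ...and 100 more at t=60s (start of window 2) also pass.
passed_t60 = sum(rl.allow(60.0) for _ in range(100))

print(passed_t59 + passed_t60)  # 200 -- double the intended rate, no limit triggered
```

Both bursts sail through because each window's counter starts from zero, exactly as described above.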
Algorithm 1: Fixed Window Counter
Despite its flaws, the fixed window counter has a place. It’s dead simple to implement and reason about, and for many internal services, its burst vulnerability is acceptable.
The implementation uses a key that includes the current time window:
-- Fixed Window in Redis (Lua script)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return 0
end
return 1
The key typically encodes both the client identifier and the window timestamp: ratelimit:client_abc:1711929600 where the timestamp is floored to the window boundary. When a new window starts, the key doesn’t exist yet, so INCR initializes it to 1 and we set the expiry.
The Lua script is atomic—Redis executes Lua scripts without interleaving other commands—so there’s no race condition between the INCR and the EXPIRE. This matters more than people think. I’ve seen implementations that use separate GET and SET commands, and under high concurrency, keys can end up without expiry times, leaking memory until Redis hits its maxmemory limit.
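Computing the window-floored key on the application side is a one-liner; a sketch (the `ratelimit:` prefix and `client_id` format are illustrative conventions from the example above):

```python
def fixed_window_key(client_id, now, window=60):
    """Floor the timestamp to the window boundary so every request
    in the same window maps to the same Redis key."""
    window_start = int(now // window) * window
    return f"ratelimit:{client_id}:{window_start}"

# Requests at different seconds within the same minute share one key:
print(fixed_window_key("client_abc", 1711929612))  # ratelimit:client_abc:1711929600
print(fixed_window_key("client_abc", 1711929659))  # ratelimit:client_abc:1711929600
```

Because the key changes when the window rolls over, old counters simply expire; there's no reset logic to get wrong.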
When to use it: Internal service-to-service communication where you trust the clients and just want a safety valve. Background job queues where approximate limiting is fine.
Algorithm 2: Sliding Window Log
The sliding window log fixes the burst-at-the-edge problem by tracking individual request timestamps rather than aggregate counts. For each request, you record the timestamp, remove entries older than the window, and check if the remaining count exceeds the limit.
-- Sliding Window Log in Redis (Lua script)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local member = ARGV[4]

-- Remove entries outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current entries
local count = redis.call('ZCARD', key)
if count >= limit then
    return 0
end

-- Add the new request
redis.call('ZADD', key, now, member)
redis.call('EXPIRE', key, window)
return 1
This uses a Redis sorted set where the score is the timestamp: ZREMRANGEBYSCORE prunes old entries, and ZCARD counts what remains. The member value needs to be unique per request—a UUID or a combination of timestamp and random suffix works.
The accuracy is perfect: at any moment, you’re counting exactly the requests within the trailing window. But the memory cost is significant. If you allow 10,000 requests per hour, each client’s sorted set can hold up to 10,000 entries. Multiply by thousands of clients, and you’re looking at serious Redis memory consumption.
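The accuracy claim is easy to verify with an in-memory mirror of the same logic (a sketch for illustration, not a distributed implementation): the edge burst that fooled the fixed window is now rejected.

```python
from collections import deque

class SlidingWindowLog:
    """In-memory mirror of the Redis sorted-set approach."""
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.log = deque()  # timestamps of allowed requests

    def allow(self, now):
        # Prune entries older than the trailing window (ZREMRANGEBYSCORE)
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:  # ZCARD check
            return False
        self.log.append(now)             # ZADD
        return True

rl = SlidingWindowLog(limit=100, window=60)
allowed_t59 = sum(rl.allow(59.0) for _ in range(100))  # 100 -- within limit
allowed_t60 = sum(rl.allow(60.0) for _ in range(100))  # 0 -- trailing window still full
allowed_t120 = rl.allow(120.0)                         # True -- old entries aged out
```

At t=60 the trailing 60-second window still contains all 100 requests from t=59, so the second burst is fully rejected; by t=120 those entries have aged out.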
When to use it: When accuracy matters more than memory, typically for expensive operations like payment processing or SMS sending where each request has real cost.
Algorithm 3: Token Bucket
The token bucket is the algorithm I reach for first in most production scenarios. It allows controlled bursting (which is usually what you actually want) while maintaining a steady-state rate limit. The mental model is straightforward: imagine a bucket that holds tokens. Tokens are added at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which determines the burst size.
-- Token Bucket in Redis (Lua script)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])  -- max tokens (burst size)
local rate = tonumber(ARGV[2])      -- tokens per second
local now = tonumber(ARGV[3])       -- current timestamp (ms)
local requested = tonumber(ARGV[4]) -- tokens to consume (usually 1)

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
    -- First request: initialize full bucket
    tokens = capacity
    last_refill = now
end

-- Calculate tokens to add since last refill
local elapsed = (now - last_refill) / 1000
local new_tokens = elapsed * rate
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
local remaining = tokens
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
    remaining = tokens
end

-- Store updated state
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)

return {allowed, math.floor(remaining)}
This implementation stores two values per client: the current token count and the last refill timestamp. On each request, it calculates how many tokens have accumulated since the last check, adds them (up to the capacity), and then tries to consume the requested number.
The beauty of the token bucket is that its two parameters map directly to business requirements. “We want clients to be able to make 100 requests per minute, with bursts up to 20” translates to rate=1.67 (100/60) and capacity=20. The burst handling is built into the algorithm rather than being an edge case you have to work around.
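That translation, and the resulting behavior, can be checked with a small simulation driven by a fake clock (an illustrative sketch, not a production implementation):

```python
class SimTokenBucket:
    """Toy token bucket driven by an explicit clock (no wall time)."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), 0.0

    def allow(self, now):
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

rate, capacity = 100 / 60, 20  # "100 req/min, bursts up to 20"
tb = SimTokenBucket(rate, capacity)

# A burst of 30 at t=0: the first 20 drain the bucket, the rest are rejected.
burst_allowed = sum(tb.allow(0.0) for _ in range(30))  # 20

# Then 3 requests per second for 60 seconds: steady state admits ~100.
steady_allowed = sum(tb.allow(float(t)) for t in range(1, 61) for _ in range(3))
```

The burst is absorbed instantly up to `capacity`, while the sustained admission rate converges on the configured 100 per minute regardless of how aggressively the client hammers the endpoint.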
Token Bucket in Application Code
Here’s a Go implementation for scenarios where you don’t want the Redis dependency—say, a CLI tool or a single-instance service:
package ratelimit

import (
    "sync"
    "time"
)

type TokenBucket struct {
    mu         sync.Mutex
    tokens     float64
    capacity   float64
    rate       float64 // tokens per second
    lastRefill time.Time
}

func NewTokenBucket(capacity float64, ratePerSecond float64) *TokenBucket {
    return &TokenBucket{
        tokens:     capacity,
        capacity:   capacity,
        rate:       ratePerSecond,
        lastRefill: time.Now(),
    }
}

func (tb *TokenBucket) Allow() bool {
    return tb.AllowN(1)
}

func (tb *TokenBucket) AllowN(n float64) bool {
    tb.mu.Lock()
    defer tb.mu.Unlock()

    now := time.Now()
    elapsed := now.Sub(tb.lastRefill).Seconds()
    tb.tokens += elapsed * tb.rate
    if tb.tokens > tb.capacity {
        tb.tokens = tb.capacity
    }
    tb.lastRefill = now

    if tb.tokens < n {
        return false
    }
    tb.tokens -= n
    return true
}
Note the mutex. Even in a single-process application, if you're handling concurrent HTTP requests, the bucket state is shared. I've seen production bugs where developers assumed single-threaded execution in a goroutine-based server and ended up with negative token counts.
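The same hazard exists in Python services that handle requests on multiple threads. A sketch of the locked check (refill omitted so the arithmetic is deterministic; the `Bucket` class is illustrative):

```python
import threading

class Bucket:
    def __init__(self, capacity):
        self.tokens = capacity
        self.lock = threading.Lock()

    def allow(self):
        # The read-modify-write on shared state must be atomic.
        with self.lock:
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

bucket = Bucket(capacity=100)
allowed = []

def worker():
    for _ in range(50):
        if bucket.allow():
            allowed.append(1)  # list.append is thread-safe in CPython

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly the capacity is granted across all threads; tokens never go negative.
print(len(allowed), bucket.tokens)  # 100 0
```

Drop the lock and the check-then-decrement becomes a race: two threads can both see `tokens >= 1` and both decrement, which is exactly how the negative token counts mentioned above arise.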
Server-Side Implementation Patterns
Redis + Lua: The Production Standard
For distributed systems, Redis with Lua scripts is the most battle-tested approach. The Lua scripts shown above are atomic, and Redis's single-threaded execution model means you don't need distributed locks. A few operational notes from running this in production:
Use Redis Cluster carefully. All keys for a single rate limit check must live on the same shard. Use hash tags to ensure this: ratelimit:{client_abc}:tokens where the {client_abc} portion determines the shard.
Set appropriate timeouts. If Redis is unreachable, you need a policy. Most services default to allowing the request (fail open) because a brief period without rate limiting is better than a total outage. But for security-sensitive limits (login attempts, OTP verification), you might want to fail closed.
async function checkRateLimit(clientId, limit, window) {
  try {
    const result = await redis.eval(luaScript, 1,
      `ratelimit:${clientId}`,
      limit, window, Date.now(), uuidv4()
    );
    return result === 1;
  } catch (err) {
    logger.warn('Rate limit check failed, failing open', {
      clientId, error: err.message
    });
    // Fail open for general API limits
    return true;
  }
}
Nginx Rate Limiting
For edge-level protection, nginx's built-in rate limiting module works well as a first line of defense. It uses a leaky bucket algorithm internally:
http {
    # Define a zone: 10MB shared memory, keyed on client IP, 50 req/s
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=50r/s;

    # For authenticated endpoints, key on API key from header
    map $http_x_api_key $api_key_limit {
        default $http_x_api_key;
        ""      $binary_remote_addr;
    }
    limit_req_zone $api_key_limit zone=auth_api_limit:20m rate=100r/s;

    server {
        location /api/ {
            limit_req zone=api_limit burst=20 nodelay;
            limit_req_status 429;
            proxy_pass http://backend;
        }

        location /api/v2/ {
            limit_req zone=auth_api_limit burst=50 nodelay;
            limit_req_status 429;
            proxy_pass http://backend;
        }
    }
}
The burst parameter allows queuing excess requests rather than rejecting them immediately. With nodelay, burst requests are processed immediately but count against the burst budget. Without nodelay, they're delayed to match the configured rate. In practice, you almost always want nodelay for API endpoints—clients would rather get a fast 429 than wait 10 seconds for a delayed 200.
API Gateway Rate Limiting
If you're running Kong, AWS API Gateway, or Envoy, rate limiting is a configuration concern rather than a code concern. Kong's rate-limiting plugin, for instance:
plugins:
  - name: rate-limiting
    config:
      second: 10
      minute: 500
      hour: 10000
      policy: redis
      redis_host: rate-limit-redis.internal
      redis_port: 6379
      redis_database: 0
      fault_tolerant: true
      hide_client_headers: false
The fault_tolerant: true setting means Kong will allow requests through if Redis is unreachable—the same fail-open pattern discussed earlier. The hide_client_headers: false ensures rate limit headers are passed to the client, which brings us to the client side.
Response Headers: The Communication Layer
Rate limiting is a conversation between server and client. The server's side of this conversation happens through HTTP headers. The IETF draft RateLimit Header Fields (draft-ietf-httpapi-ratelimit-headers) is converging on a standard, but in practice you'll see several conventions:
HTTP/1.1 200 OK
RateLimit-Limit: 100
RateLimit-Remaining: 67
RateLimit-Reset: 1711929660
Retry-After: 30
RateLimit-Limit tells the client their quota. RateLimit-Remaining tells them how much is left. RateLimit-Reset is a Unix timestamp for when the window resets. Retry-After appears on 429 responses and tells the client how long to wait in seconds.
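On the client side these headers are cheap to consume. A hypothetical parser (plain dict of headers for simplicity; real HTTP header lookups should be case-insensitive):

```python
def parse_rate_limit(headers, now):
    """Extract rate limit state from response headers.
    `headers` is a plain dict; `now` is a Unix timestamp."""
    info = {
        "limit": int(headers.get("RateLimit-Limit", 0)),
        "remaining": int(headers.get("RateLimit-Remaining", 0)),
        "reset_in": max(0, int(headers.get("RateLimit-Reset", now)) - now),
    }
    # Retry-After (seconds) takes precedence when present, e.g. on a 429
    info["wait"] = float(headers["Retry-After"]) if "Retry-After" in headers else 0.0
    return info

headers = {
    "RateLimit-Limit": "100",
    "RateLimit-Remaining": "67",
    "RateLimit-Reset": "1711929660",
    "Retry-After": "30",
}
print(parse_rate_limit(headers, now=1711929630))
# {'limit': 100, 'remaining': 67, 'reset_in': 30, 'wait': 30.0}
```

A well-behaved client can use `remaining` to throttle proactively, rather than waiting to be hit with a 429.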
Here's middleware that sets these headers in Express.js:
function rateLimitMiddleware(options) {
  const { limit, windowMs } = options;

  return async (req, res, next) => {
    const clientId = req.headers['x-api-key'] || req.ip;
    // Assumes a checkRateLimit variant that resolves to
    // { allowed, remaining, resetAt } rather than a bare boolean
    const result = await checkRateLimit(clientId, limit, windowMs);

    res.set('RateLimit-Limit', String(limit));
    res.set('RateLimit-Remaining', String(result.remaining));
    res.set('RateLimit-Reset', String(result.resetAt));

    if (!result.allowed) {
      const retryAfter = Math.ceil((result.resetAt - Date.now()) / 1000);
      res.set('Retry-After', String(retryAfter));
      return res.status(429).json({
        error: 'rate_limit_exceeded',
        message: `Rate limit of ${limit} requests per ${windowMs/1000}s exceeded`,
        retry_after: retryAfter
      });
    }

    next();
  };
}
Client-Side: Playing Nice with Rate Limits
Exponential Backoff with Jitter
When you receive a 429, the worst thing you can do is immediately retry. The second worst thing is to retry after a fixed delay—because every other client that got rate limited at the same time will also retry after that same delay, creating a thundering herd.
Exponential backoff with jitter is the standard solution. Here's a Python implementation that respects the Retry-After header when present:
import time
import random
import requests
from requests.adapters import HTTPAdapter

class RateLimitedClient:
    def __init__(self, base_url, max_retries=5):
        self.base_url = base_url
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.mount('https://', HTTPAdapter(max_retries=0))

    def request(self, method, path, **kwargs):
        url = f"{self.base_url}{path}"
        last_exception = None

        for attempt in range(self.max_retries + 1):
            try:
                response = self.session.request(method, url, **kwargs)
                if response.status_code != 429:
                    return response

                # Use Retry-After header if present
                retry_after = response.headers.get('Retry-After')
                if retry_after:
                    delay = float(retry_after)
                else:
                    # Exponential backoff: 1s, 2s, 4s, 8s, 16s
                    base_delay = min(2 ** attempt, 32)
                    # Full jitter: random between 0 and base_delay
                    delay = random.uniform(0, base_delay)
                time.sleep(delay)
            except requests.ConnectionError as e:
                last_exception = e
                base_delay = min(2 ** attempt, 32)
                time.sleep(random.uniform(0, base_delay))

        raise Exception(
            f"Request failed after {self.max_retries} retries "
            f"(last error: {last_exception or 'rate limited'})"
        )
The "full jitter" strategy (random.uniform(0, base_delay)) was among the best-performing approaches in AWS's analysis, published on their architecture blog, clearly beating no-jitter and "equal jitter" variants in total work and completion time under contention. The key insight is that spreading retries across the entire delay window maximizes the probability that at least some clients get through on each retry round.
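The difference is easy to see in a toy simulation: 20 clients retrying after the same 429. Fixed backoff lands them all in the same instant; full jitter spreads them across the window (seeded for reproducibility; numbers are illustrative):

```python
import random

random.seed(42)
attempt = 3
base_delay = min(2 ** attempt, 32)  # 8 seconds

# Fixed backoff: every client retries at exactly t=8 -- thundering herd.
fixed = [base_delay for _ in range(20)]

# Full jitter: retries spread across [0, 8).
jittered = [random.uniform(0, base_delay) for _ in range(20)]

def max_per_second(delays):
    """Worst-case number of retries landing in the same one-second slot."""
    buckets = {}
    for d in delays:
        buckets[int(d)] = buckets.get(int(d), 0) + 1
    return max(buckets.values())

print(max_per_second(fixed))     # 20 -- every retry in the same instant
print(max_per_second(jittered))  # far fewer retries per slot
```

With fixed backoff the server absorbs the entire herd again in a single second; with full jitter the retries arrive as a trickle the rate limiter can actually admit.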
Circuit Breaker Pattern
Exponential backoff handles transient rate limiting. But what if a downstream service is consistently rejecting your requests? Continuing to retry wastes resources on both sides. The circuit breaker pattern addresses this by tracking failure rates and "opening the circuit" when failures exceed a threshold:
import time

class CircuitBreaker:
    CLOSED = 'closed'        # Normal operation
    OPEN = 'open'            # Failing, reject immediately
    HALF_OPEN = 'half_open'  # Testing if service recovered

    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 success_threshold=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = self.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def can_execute(self):
        if self.state == self.CLOSED:
            return True
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = self.HALF_OPEN
                self.success_count = 0
                return True
            return False
        # HALF_OPEN: allow request through to test
        return True

    def record_success(self):
        if self.state == self.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = self.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
In practice, you'd wrap this around your rate-limited client so that when a service is persistently returning 429s, your application stops sending requests entirely for the recovery period. Netflix's Hystrix library popularized this pattern, and while Hystrix itself is in maintenance mode, the pattern lives on in resilience4j (Java), Polly (.NET), and gobreaker (Go).
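Wiring a breaker in front of a client looks roughly like this. The sketch below is self-contained, so it uses a trimmed-down breaker and a fake always-failing service (all names here are illustrative):

```python
import time

class SimpleBreaker:
    """Minimal breaker: open after N consecutive failures,
    allow one probe after `recovery_timeout` seconds."""
    def __init__(self, failure_threshold=3, recovery_timeout=0.1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open, not calling service")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result

breaker = SimpleBreaker()
calls = {"n": 0}

def always_429():
    calls["n"] += 1
    raise Exception("HTTP 429")

# Three failures trip the breaker; the fourth attempt is rejected
# locally without ever reaching the (already overloaded) service.
for _ in range(4):
    try:
        breaker.call(always_429)
    except Exception:
        pass

print(calls["n"])  # 3 -- the fourth call never left the process
```

The point is the asymmetry: retries with backoff still cost the downstream service a request each time, while an open circuit costs it nothing.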
Multi-Tenant Rate Limiting for SaaS
If you're building a SaaS product, rate limiting gets significantly more complex. You're not just protecting your system from overload—you're enforcing business rules, ensuring fairness across tenants, and often tying limits to pricing tiers.
Tiered Limits
Most SaaS APIs use a hierarchical limit structure:
const TIER_LIMITS = {
  free: {
    requests_per_minute: 60,
    requests_per_day: 1000,
    burst: 10,
    concurrent_connections: 2
  },
  pro: {
    requests_per_minute: 600,
    requests_per_day: 50000,
    burst: 50,
    concurrent_connections: 10
  },
  enterprise: {
    requests_per_minute: 6000,
    requests_per_day: 500000,
    burst: 200,
    concurrent_connections: 50
  }
};

async function multiTierRateLimit(req, res, next) {
  const apiKey = req.headers['x-api-key'];
  const tenant = await getTenantByApiKey(apiKey);
  const limits = TIER_LIMITS[tenant.tier];

  // Check multiple limits in parallel
  const [minuteOk, dailyOk, concurrentOk] = await Promise.all([
    checkTokenBucket(
      `minute:${tenant.id}`,
      limits.burst,
      limits.requests_per_minute / 60
    ),
    checkFixedWindow(
      `daily:${tenant.id}`,
      limits.requests_per_day,
      86400
    ),
    checkConcurrent(
      `concurrent:${tenant.id}`,
      limits.concurrent_connections
    )
  ]);

  if (!minuteOk || !dailyOk || !concurrentOk) {
    // Return specific information about which limit was hit
    return res.status(429).json({
      error: 'rate_limit_exceeded',
      limits: {
        per_minute: { limit: limits.requests_per_minute, exceeded: !minuteOk },
        per_day: { limit: limits.requests_per_day, exceeded: !dailyOk },
        concurrent: { limit: limits.concurrent_connections, exceeded: !concurrentOk }
      }
    });
  }

  next();
}
Notice that we check multiple limits simultaneously. A tenant could be within their per-minute limit but have exhausted their daily quota. The response body tells the client exactly which limit they hit, which is crucial for debugging—there's nothing more frustrating than a bare 429 with no context.
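A client can act on that structured body directly. A hypothetical helper matching the response shape above (the field names mirror the 429 body; the decision logic is illustrative):

```python
def exceeded_limits(body):
    """Return the names of the limits reported as exceeded in a 429 body."""
    return [name for name, info in body.get("limits", {}).items()
            if info.get("exceeded")]

body_429 = {
    "error": "rate_limit_exceeded",
    "limits": {
        "per_minute": {"limit": 600, "exceeded": False},
        "per_day": {"limit": 50000, "exceeded": True},
        "concurrent": {"limit": 10, "exceeded": False},
    },
}

# The daily quota is the problem -- backing off for a few seconds won't help;
# the client should pause until the day rolls over.
print(exceeded_limits(body_429))  # ['per_day']
```

Which limit was hit changes the correct client behavior: a per-minute limit calls for short backoff, a daily quota calls for stopping entirely, and a concurrency limit calls for reducing parallelism rather than request rate.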
Fairness Under Load
Here's a scenario that bit us: during a traffic spike, one enterprise tenant was consuming 80% of our API capacity. They were within their rate limits—they were paying for high limits. But their traffic was degrading the experience for hundreds of smaller tenants. Individual rate limits weren't enough; we needed global fairness.
The solution was a weighted fair queuing approach. Each tenant gets a weight proportional to their tier, and during overload, requests are admitted proportionally:
async function fairnessAwareRateLimit(req, tenant) {
  // First check: individual tenant limits (fast path)
  const withinTenantLimit = await checkTenantLimit(tenant);
  if (!withinTenantLimit) return false;

  // Second check: global system load
  const systemLoad = await getSystemLoad(); // 0.0 to 1.0
  if (systemLoad < 0.8) return true; // No fairness throttling needed

  // Under high load: weighted admission
  const weight = TIER_WEIGHTS[tenant.tier]; // free=1, pro=5, enterprise=20
  const totalWeight = await getTotalActiveWeight();
  const fairShare = weight / totalWeight;

  const tenantRecentRequests = await getRecentRequestCount(tenant.id, 10);
  const totalRecentRequests = await getTotalRecentRequests(10);
  const actualShare = tenantRecentRequests / totalRecentRequests;

  // Allow if tenant is using less than their fair share
  return actualShare <= fairShare * 1.2; // 20% grace margin
}
Monitoring and Observability
Rate limiting without monitoring is like having a smoke detector with no alarm. You need to know when limits are being hit, by whom, and whether your limits are calibrated correctly.
Key metrics to track:
- Rate limit hit rate by tier — If 40% of your free-tier users are hitting limits daily, your free tier is too restrictive (or your documentation doesn't set expectations).
- 429 response percentage — Track this as a percentage of total responses. A sudden spike means either a traffic pattern changed or a client is misbehaving.
- Retry storm detection — Monitor for patterns where 429 responses lead to immediate retries. This suggests clients aren't implementing backoff correctly.
- Headroom per tier — What percentage of their limit are tenants typically using? If most enterprise customers are routinely at 90%, you either need to raise limits or offer a higher tier.
# Prometheus metrics for rate limiting
from prometheus_client import Counter, Histogram, Gauge

rate_limit_decisions = Counter(
    'rate_limit_decisions_total',
    'Rate limit decisions',
    ['tenant_tier', 'decision', 'limit_type']
)

rate_limit_remaining = Gauge(
    'rate_limit_remaining_ratio',
    'Ratio of remaining rate limit quota',
    ['tenant_id', 'limit_type']
)

rate_limit_check_duration = Histogram(
    'rate_limit_check_duration_seconds',
    'Time to evaluate rate limit',
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1]
)
Common Pitfalls
Clock skew in distributed systems. If your rate limit checks happen across multiple application servers, and those servers have slightly different clocks, your fixed window and sliding window implementations will behave inconsistently. Use Redis server time (redis.call('TIME')) in your Lua scripts rather than passing timestamps from application servers.
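The skew problem is easy to demonstrate: two application servers whose clocks differ by two seconds can disagree about which window a boundary request belongs to (a pure-Python sketch; the fix in the Lua scripts is to call redis.call('TIME') so every check shares one clock):

```python
def window_key(client, now, window=60):
    """Same window-floored key scheme as the fixed window examples."""
    return f"ratelimit:{client}:{int(now // window) * window}"

true_time = 1711929659.5       # half a second before the window boundary
server_a = true_time + 0.0     # well-synced clock
server_b = true_time + 2.0     # clock running two seconds fast

# The same request, observed by two servers, lands in different windows:
print(window_key("client_abc", server_a))  # ratelimit:client_abc:1711929600
print(window_key("client_abc", server_b))  # ratelimit:client_abc:1711929660
```

Two servers counting against different keys means the effective limit near every boundary is wrong in an unpredictable direction, which is why a single time source matters more here than absolute clock accuracy.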
Rate limiting at the wrong layer. I've seen teams implement rate limiting in their application code but leave their database connection pool uncapped. A client that sends 1,000 requests that each trigger 10 database queries is effectively doing 10,000 operations. Consider rate limiting at multiple layers: edge (nginx), application (per-endpoint), and resource (database connections, external API calls).
Forgetting about WebSockets and streaming. Traditional request-based rate limiting doesn't work for persistent connections. For WebSocket APIs, you need message-based rate limiting within the connection, plus limits on the number of concurrent connections per client.
Not rate limiting internal services. "But it's an internal service, we trust our callers." Until a deployment bug in Service A causes an infinite retry loop against Service B. Every service-to-service call should have rate limits, even if they're generous.
Putting It All Together
A production rate limiting system typically layers multiple strategies:
- Edge layer (nginx/CDN): IP-based rate limiting to stop DDoS and obvious abuse. Generous limits, fail-closed.
- API Gateway: API-key-based rate limiting tied to subscription tiers. Token bucket for per-second limits, fixed window for daily quotas.
- Application layer: Per-endpoint limits for expensive operations (search, export, batch). Sliding window for accuracy.
- Resource layer: Connection pools, queue depths, and concurrency limits to protect databases and downstream services.
Each layer serves a different purpose and catches different failure modes. The edge layer stops volumetric attacks. The gateway enforces business rules. The application layer protects expensive operations. The resource layer prevents cascade failures.
Rate limiting is one of those problems that's easy to get 80% right and brutally hard to get 100% right. But getting to 95%—using a token bucket or sliding window, running the checks in Redis with Lua scripts, returning proper headers, and monitoring the results—is straightforward engineering. That remaining 5% is where you'll spend time tuning limits based on real traffic patterns, handling edge cases in distributed deployments, and negotiating with customers who are convinced their use case deserves an exception.
Start with the token bucket. Implement it in Redis with the Lua script from this article. Add the response headers. Build a client that respects them. You'll be ahead of 90% of the APIs out there.
