Why does plain exponential backoff cause retry storms?

Without jitter, every client instance that observed the same failure will compute identical retry intervals. Across dozens of workers this creates synchronized bursts that land at the same millisecond, overwhelming the recovering service and triggering further failures.

What is the difference between client-side jitter and server-side deduplication?

Jitter decorrelates the timing of retries so they do not arrive simultaneously; it does not prevent a request from being executed more than once. Server-side deduplication — via an idempotency key store — ensures that even if two retries do land at the same time, only one side effect is produced.

How long should a distributed lease TTL be?

Set the TTL to at least 2× the p99 execution time of the operation, with a hard minimum of 30 seconds. This prevents premature expiry during slow executions while still allowing the lock to self-heal after a crashed holder.

Implementing Exponential Backoff Without Overlapping Retries

Prerequisites. You should already understand idempotency fundamentals and API guarantees — specifically what an idempotency key is, why safe HTTP methods differ from mutating ones, and how a deduplication store prevents duplicate side effects. This page focuses on the narrower task of wiring a jittered backoff schedule to a distributed lease so that retries from concurrent workers never overlap.

The Problem: Why Backoff Alone Is Not Enough

Exponential backoff was designed for a single-client, single-server world. In a microservice mesh, dozens of worker processes may independently detect the same downstream timeout and each start their own retry countdown. Because every instance computes its delay from the same formula, base × 2^attempt, the countdowns correlate. With base = 1 s, workers that failed at t=0 all wait 1 s, then 2 s, then 4 s, so their retries land together at roughly t=1 s, t=3 s, and t=7 s, creating synchronized bursts that hit the recovering service just as it is trying to stabilize.

This is the thundering herd. The standard fix is full-jitter backoff, which makes each delay a random sample drawn uniformly from [0, min(cap, base × 2^attempt)]. Jitter decorrelates timing, but it does not prevent two workers from dispatching the same logical request in the same execution window. Preventing that requires a second layer: a server-side distributed lease keyed to a deterministic request fingerprint. Together these two mechanisms form the non-overlapping retry contract described in this runbook.
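
The contrast between correlated and decorrelated delays can be illustrated with a small simulation. This is a toy sketch, not a benchmark: the function names, the worker count, and the fixed seed are hypothetical choices made for the example.

```python
import random

BASE_S = 0.5   # base delay in seconds
CAP_S = 30.0   # upper bound on any single delay

def plain_delay(attempt: int) -> float:
    """Plain exponential backoff: every worker computes the same value."""
    return min(CAP_S, BASE_S * 2 ** attempt)

def full_jitter_delay(attempt: int, rng: random.Random) -> float:
    """Full jitter: uniform sample from [0, min(cap, base * 2^attempt)]."""
    return rng.uniform(0.0, plain_delay(attempt))

def distinct_delays(delays: list[float]) -> int:
    """Count distinct delays at millisecond resolution."""
    return len({round(d, 3) for d in delays})

rng = random.Random(42)  # fixed seed only to make the example repeatable
workers = 100
attempt = 3              # ceiling is 0.5 * 2^3 = 4 s

plain = [plain_delay(attempt) for _ in range(workers)]
jittered = [full_jitter_delay(attempt, rng) for _ in range(workers)]

print(distinct_delays(plain))     # 1: every worker retries at the same instant
print(distinct_delays(jittered))  # many distinct values spread over [0, 4] s
```

Even in the jittered case, nothing stops two workers from drawing nearly equal delays, which is why the lease layer is still needed.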

Architecture Overview

The diagram below shows the two-layer architecture. The client layer owns jitter and retry scheduling; the server layer owns lease acquisition and idempotency validation.

Step 1 — Generate a Deterministic Idempotency Key

Before writing any retry loop, establish the key contract. The key must be:

Deterministic: identical input always produces identical key, across all worker instances.
Tenant-scoped: prefix with an account or tenant identifier so identical payloads from different tenants do not collide.
Opaque to the client: use a SHA-256 hash rather than a client-supplied string that could be guessed or enumerated.

The key feeds both the client-side retry schedule (to detect already-in-flight attempts) and the server-side deduplication store. See idempotency key generation strategies for a full treatment of UUID, HMAC, and UUIDv7 variants.
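
The three properties above can be sketched in a few lines. The function name idempotency_key and the sample tenants are hypothetical; the key layout (idempotency:{tenant}:{sha256}) follows the one used in Step 4, and sorting the JSON keys is an assumed canonicalization that may need tightening for payloads containing floats or non-ASCII text.

```python
import hashlib
import json

def idempotency_key(tenant_id: str, payload: dict) -> str:
    """Derive a deterministic, tenant-scoped key from the request payload."""
    # Sorted keys make logically identical payloads serialize identically
    # on every worker, regardless of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(f"{tenant_id}:{canonical}".encode("utf-8")).hexdigest()
    # The tenant prefix keeps identical payloads from different tenants apart.
    return f"idempotency:{tenant_id}:{digest}"

a = idempotency_key("tenant-123", {"amount": 100, "currency": "USD"})
b = idempotency_key("tenant-123", {"currency": "USD", "amount": 100})
c = idempotency_key("tenant-456", {"amount": 100, "currency": "USD"})

print(a == b)  # True: key order in the payload does not matter
print(a == c)  # False: different tenants never share a key
```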

Step 2 — Implement Full-Jitter Backoff

Replace any plain sleep(base × 2^attempt) with a jittered variant. The AWS full-jitter formula is the most effective at decorrelating concurrent clients:

delay = random_uniform(0, min(cap, base × 2^attempt))

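Use base = 500 ms, cap = 30 s, and max_attempts = 8 as a production-safe default. The 30 s cap prevents the retry schedule from growing beyond a typical downstream recovery window. With 8 attempts there are 7 waits, so the total backoff is at most 61.5 s (about 31 s on average under full jitter), not counting the time spent inside the requests themselves.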
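
The retry budget implied by these defaults can be worked out directly. The sketch assumes, as the loops in this step do, that a wait follows every failed attempt except the last, and it ignores time spent inside the requests.

```python
BASE_S = 0.5
CAP_S = 30.0
MAX_ATTEMPTS = 8

# 8 attempts produce 7 waits (attempt indices 0..6).
ceilings = [min(CAP_S, BASE_S * 2 ** attempt) for attempt in range(MAX_ATTEMPTS - 1)]

worst_case_s = sum(ceilings)    # every draw lands exactly on its ceiling
expected_s = sum(ceilings) / 2  # the mean of uniform(0, c) is c / 2

print(ceilings)      # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
print(worst_case_s)  # 61.5
print(expected_s)    # 30.75
```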

Node.js / TypeScript

function fullJitterDelay(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.random() * ceiling;
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 8,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      const delay = fullJitterDelay(attempt);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable");
}

Go

import (
    "context"
    "math"
    "math/rand"
    "time"
)

const (
    baseDelay = 500 * time.Millisecond
    capDelay  = 30 * time.Second
    maxAttempts = 8
)

func fullJitter(attempt int) time.Duration {
    ceiling := math.Min(float64(capDelay), float64(baseDelay)*math.Pow(2, float64(attempt)))
    return time.Duration(rand.Float64() * ceiling)
}

func retryWithBackoff(ctx context.Context, fn func() error) error {
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err := fn(); err == nil {
            return nil
        } else if attempt == maxAttempts-1 {
            return err
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(fullJitter(attempt)):
        }
    }
    return nil
}

Java (Resilience4j)

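import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

RetryConfig config = RetryConfig.custom()
    .maxAttempts(8)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        Duration.ofMillis(500),   // base
        2.0,                      // multiplier
        0.5,                      // randomization factor: each delay varies ±50% around the exponential value (narrower than full jitter)
        Duration.ofSeconds(30)    // cap
    ))
    .retryExceptions(IOException.class, TimeoutException.class)
    .build();

Retry retry = Retry.of("paymentService", config);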

Python (tenacity)

import requests
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(
    stop=stop_after_attempt(8),
    wait=wait_random_exponential(multiplier=0.5, max=30),
)
def call_payment_api(payload: dict) -> dict:
    # requests needs an absolute URL; api.example.com is a placeholder host
    return requests.post("https://api.example.com/payments", json=payload, timeout=10).json()

wait_random_exponential implements full-jitter: random(0, min(max, multiplier × 2^n)).

Step 3 — Acquire a Distributed Lease Before Execution

Jitter reduces collision probability; it does not eliminate it. When two workers both draw a small random delay and hit the server within milliseconds of each other, the server must serialize them. The mechanism is an atomic SET key value NX EX ttl in Redis — the first writer wins the lease; the second receives nil and should either wait or return the cached result.

The TTL must be at least 2 × p99_execution_time. For typical OLTP operations set it to 30 s minimum. For long-running payment captures, use 120 s. Do not use wall-clock TTLs when cross-region clock skew exceeds ±2 s; switch to compare-and-swap (CAS) with logical timestamps instead, as described in handling stale locks in distributed systems.
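
As a rough sketch of the sizing rule, the TTL can be derived from observed execution times. The helper names and the sample data are hypothetical, and the nearest-rank percentile is one of several reasonable p99 definitions; a metrics backend may compute it differently.

```python
import math

MIN_TTL_S = 30  # floor for typical OLTP operations

def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of the observed execution times."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]

def lease_ttl_seconds(execution_times_s: list[float]) -> int:
    """TTL = max(30 s, 2 x p99), rounded up to whole seconds for EX."""
    return max(MIN_TTL_S, math.ceil(2 * p99(execution_times_s)))

fast = [0.2] * 99 + [3.0]          # p99 = 0.2 s, so the 30 s floor applies
slow = [20.0] * 90 + [55.0] * 10   # p99 = 55 s, so 2 x p99 applies

print(lease_ttl_seconds(fast))  # 30
print(lease_ttl_seconds(slow))  # 110
```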

-- Acquire lease (Lua, atomic)
local acquired = redis.call('SET', KEYS[1], ARGV[1], 'NX', 'EX', ARGV[2])
if acquired then
  return 1   -- lease granted, proceed
else
  return 0   -- already in-flight, return cached result or 409
end

-- Release lease after successful execution (separate command, issued by the lease holder)
DEL idempotency:{tenant}:{fingerprint}:lock

For multi-region deployments where a single Redis node is insufficient, use Redlock for high-availability deduplication. Redlock grants the lease only when it has been acquired on a majority of N independent Redis nodes (typically 3 of 5), reducing the window in which a partitioned node can issue a conflicting grant.
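
The lease semantics in this step can be modelled without a Redis server. The class below is a hypothetical, single-threaded stand-in for SET NX EX, written only to show the behaviour; it also assumes an owner token checked on release, a common refinement that keeps a stalled holder from deleting a lease that was re-acquired after its own expired. In Redis the atomicity comes from the server, not from this Python code.

```python
import time
import uuid

class InMemoryLease:
    """Toy stand-in for Redis SET NX EX plus a token-guarded release."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # key -> (owner_token, expires_at)

    def acquire(self, key: str, ttl_s: float):
        """Return an owner token if the lease was granted, else None."""
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return None  # an unexpired lease exists: the caller lost the race
        token = uuid.uuid4().hex
        self._locks[key] = (token, now + ttl_s)
        return token

    def release(self, key: str, token: str) -> bool:
        """Delete the lease only if the caller still owns it."""
        held = self._locks.get(key)
        if held is not None and held[0] == token:
            del self._locks[key]
            return True
        return False

# A fake clock shows expiry without sleeping.
now = [0.0]
lease = InMemoryLease(clock=lambda: now[0])
lock_key = "idempotency:tenant-123:abc123def456:lock"

first = lease.acquire(lock_key, ttl_s=30)
second = lease.acquire(lock_key, ttl_s=30)
print(first is not None, second is None)  # True True: the first writer wins

now[0] = 31.0                              # the first holder stalls past its TTL
third = lease.acquire(lock_key, ttl_s=30)  # a new worker takes over
stale_released = lease.release(lock_key, first)
owner_released = lease.release(lock_key, third)
print(stale_released, owner_released)      # False True
```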

Step 4 — Validate Idempotency and Execute Atomically

After acquiring the lease, check whether the idempotency key already has a stored result from a previous attempt. If it does, return that result immediately without re-executing. If it does not, execute the operation and store the result before releasing the lease.

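import hashlib, json, uuid
import redis

r = redis.Redis()

class ConflictError(Exception):
    """Another worker holds the lease and no result is cached yet."""

# Compare-and-delete: remove the lock only if it still carries our token
RELEASE_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
end
return 0
"""

def execute_idempotent(tenant: str, payload: dict, operation) -> dict:
    fingerprint = hashlib.sha256(
        f"{tenant}:{json.dumps(payload, sort_keys=True)}".encode()
    ).hexdigest()
    key = f"idempotency:{tenant}:{fingerprint}"
    lock_key = f"{key}:lock"
    token = uuid.uuid4().hex  # identifies this holder's lease

    # Acquire lease (30 s TTL)
    if not r.set(lock_key, token, nx=True, ex=30):
        # Another worker holds the lease: return cached result or signal a conflict (409)
        cached = r.get(key)
        if cached:
            return json.loads(cached)
        raise ConflictError("request in flight, retry after 1 s")

    try:
        # Check for existing result (previous attempt completed)
        cached = r.get(key)
        if cached:
            return json.loads(cached)

        # Execute once
        result = operation(payload)

        # Persist result for future duplicate requests (TTL: 24 h = 86400 s)
        r.set(key, json.dumps(result), ex=86400)
        return result
    finally:
        # Release only our own lease; a plain DEL could remove a lease that
        # another worker acquired after ours expired
        r.eval(RELEASE_LUA, 1, lock_key, token)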

The idempotency result TTL (86400 s / 24 h) is separate from the lease TTL (30 s). The result must outlive the retry window by a significant margin; use 24 h for payment APIs and 1 h for lower-stakes operations.

Step 5 — Handle HTTP Method Semantics

Retry safety varies by HTTP method. Map each to its correct boundary before adding retry logic:

Method	Retry-safe by default	Required safeguard
`GET`, `HEAD`, `OPTIONS`	Yes	None — safe and inherently idempotent
`PUT`, `DELETE`	Yes (with correct implementation)	Payload-level deduplication to prevent stale overwrites
`POST`, `PATCH`	No	Idempotency key header + server-side lease required

For POST requests to payment endpoints, enforce the Idempotency-Key header at the API gateway layer. Reject requests that omit it with 400 Bad Request. This prevents clients that have not adopted the key contract from silently causing duplicate charges.
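
A gateway-side check along these lines might look like the sketch below. The function name and return shape are hypothetical and not tied to any particular gateway product; the method set follows the table above.

```python
from typing import Optional

REQUIRES_KEY = {"POST", "PATCH"}  # methods the table marks as not retry-safe

def reject_if_missing_key(method: str, headers: dict) -> Optional[tuple[int, str]]:
    """Return (400, reason) when a mutating request lacks the key, else None."""
    if method.upper() not in REQUIRES_KEY:
        return None
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if not normalized.get("idempotency-key", "").strip():
        return 400, "missing Idempotency-Key header"
    return None

print(reject_if_missing_key("POST", {}))                                   # (400, 'missing Idempotency-Key header')
print(reject_if_missing_key("POST", {"idempotency-key": "abc123def456"}))  # None
print(reject_if_missing_key("GET", {}))                                    # None
```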

Verification and Testing

Simulate a Duplicate Request

# Capture a real request fingerprint from logs
KEY="idempotency:tenant-123:abc123def456"

# Store a fake in-flight result to simulate a concurrent worker
redis-cli SET "$KEY" '{"status":"processing"}' EX 30

# Send the same request — server must return the cached result
curl -X POST https://api.example.com/payments \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: abc123def456" \
  -d '{"amount": 100, "currency": "USD"}'
# Expected: {"status":"processing"} — not a new charge

Inspect Redis State

# Verify lease key exists and check TTL
redis-cli TTL "idempotency:tenant-123:abc123def456:lock"

# Verify result key is persisted after successful execution
redis-cli GET "idempotency:tenant-123:abc123def456"

# Scan for orphaned lock keys (TTL = -1 means no expiry set — misconfiguration)
redis-cli --scan --pattern "idempotency:*:lock" | xargs -I{} redis-cli TTL {}

Validate Jitter Distribution

Run 1000 concurrent workers with a shared failure timestamp and plot the retry interval histogram. A correctly jittered distribution should be approximately uniform within each attempt's ceiling. Any spike at a single millisecond value indicates a seeding bug (e.g. every worker calling random.seed(timestamp) with the same shared failure timestamp, so all processes draw identical delays).

# k6 script: inject 1000 VUs with forced failure at t=0
k6 run --vus 1000 --duration 120s scripts/retry-storm-test.js
# Then inspect: k6 dashboard → retry_dispatch_latency histogram
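
The same check can be approximated offline before running a load test. This sketch buckets simulated delays for a single attempt and contrasts independently seeded workers with one possible seeding bug, where every worker seeds from the same timestamp. The names, bucket count, and timestamp value are hypothetical.

```python
import random

def full_jitter_delay(attempt: int, rng: random.Random,
                      base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Uniform sample from [0, min(cap, base * 2^attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))

def bucket_counts(delays: list[float], ceiling_s: float, buckets: int = 10) -> list[int]:
    """Histogram of delays over [0, ceiling_s] with equal-width buckets."""
    counts = [0] * buckets
    for d in delays:
        index = min(buckets - 1, int(d / ceiling_s * buckets))
        counts[index] += 1
    return counts

attempt, ceiling_s, workers = 3, 4.0, 10_000

# Healthy case: each worker owns an independently seeded generator.
healthy = [full_jitter_delay(attempt, random.Random()) for _ in range(workers)]

# Seeding bug: every worker seeds from the same shared failure timestamp.
shared_timestamp = 1_700_000_000
buggy = [full_jitter_delay(attempt, random.Random(shared_timestamp))
         for _ in range(workers)]

print(bucket_counts(healthy, ceiling_s))     # roughly 1000 per bucket
print(max(bucket_counts(buggy, ceiling_s)))  # 10000: all workers drew the same delay
```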

Confirm Circuit Breaker Engagement

Use tc netem to inject 100% packet loss for 10 s and verify that:

The circuit breaker opens after 5 consecutive failures (the failure threshold).
No retries are dispatched while the circuit is open.
The circuit transitions to half-open after the configured sleep window.
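
The three expectations above correspond to a small state machine. The class below is a deliberately minimal sketch with hypothetical names, not the API of any circuit-breaker library; in particular it lets every request through while half-open, whereas production breakers usually admit only a limited number of probes.

```python
import time

class CircuitBreaker:
    """Minimal consecutive-failure breaker: closed, open, half-open."""

    def __init__(self, failure_threshold=5, sleep_window_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.sleep_window_s = sleep_window_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.sleep_window_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        """The retry loop should check this before every dispatch."""
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        current = self.state
        if current == "open":
            return                           # nothing should be dispatched while open
        if current == "half-open":
            self._opened_at = self._clock()  # failed probe: reopen
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

now = [0.0]  # fake clock, so the sleep window can be crossed without sleeping
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
state_after_failures = (breaker.state, breaker.allow_request())
now[0] = 10.0
state_after_window = (breaker.state, breaker.allow_request())
breaker.record_success()

print(state_after_failures)  # ('open', False)
print(state_after_window)    # ('half-open', True)
print(breaker.state)         # closed
```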

Failure Scenarios and Debugging

Failure Scenario	Remediation Steps	Observability Hooks
Clock skew invalidates TTL-based lease	Replace TTL locks with CAS operations using logical timestamps or vector clocks; set minimum lease TTL to `max_clock_skew × 3`	Alert: `clock_skew_ms > 500` on any node; span attribute `lease_acquired_at_logical_ts`
Idempotency store race condition (two workers both read `nil`, both execute)	Enforce atomic `SET key value NX EX ttl` in a Lua script; never use read-then-write without the `NX` flag	Metric: `deduplication_conflict_rate` counter; alert threshold `> 0.01%` over 5 min
Backoff curve clustering (GC pause or thread-pool exhaustion delays scheduling)	Implement adaptive jitter: measure actual dispatch timestamp and widen jitter bounds when `gc_pause_ms > 200`; use `time.AfterFunc` in Go rather than blocking `sleep`	Metric: `retry_dispatch_latency_p99`; histogram showing bimodal distribution signals GC interference
Lease not released after holder crash	Set lease TTL to `2 × p99_execution_time`; implement a reconciliation job that scans for orphaned locks older than 2× TTL	Alert: `orphaned_lock_count > 0`; log field `lock_holder_node_id` for crash correlation
Payment gateway timeout while server processes async	Return `202 Accepted` with a polling endpoint; validate idempotency key on each poll response to prevent double-charge on retry	Metric: `payment_retry_success_vs_dedup_ratio`; audit log entry with cryptographic signature per attempt

SRE Observability Checklist

Emit these signals from every service that uses this pattern:

retry_attempt_total (counter, labels: attempt_number, outcome=[success|dedup|exhausted]) — tracks retry volume and deduplication hit rate.
retry_dispatch_latency_ms (histogram, p50/p95/p99) — separates backoff scheduling overhead from actual network transit time.
deduplication_conflict_rate (counter) — incremented whenever two concurrent requests share an idempotency key; alert at > 0.01% over any 5 min window.
lock_acquisition_duration_ms (histogram) — monitors Redis coordination overhead; p99 above 50 ms indicates Redis saturation.
Log field idempotency_key_status (values: created / reused / expired) — enables trace-level auditing of every deduplication decision.
Trace span attribute retry_origin_node_id — correlates concurrent dispatch sources during post-mortems; essential when mitigating thundering herd during retry storms.

Related Pages

Retry Logic & Backoff Fundamentals — parent page covering the full backoff algorithm landscape, circuit breakers, and the guarantee model for at-least-once retry.
Using Redis SET NX for Distributed Request Deduplication — deep dive into the atomic SET NX operation, Lua scripting, and TTL management for the idempotency store used in Step 3 and Step 4 above.
Handling Duplicate Webhook Deliveries in Payment Gateways — applies the same lease-and-dedup pattern to inbound webhook events, where the sender controls retry timing.
Implementing Redlock for High-Availability Deduplication — multi-node lease acquisition for deployments where a single Redis instance is a SPOF.
Mitigating Thundering Herd During Retry Storms — complementary server-side rate limiting and load-shedding strategies when jitter alone is insufficient.