What guarantee does Raft provide for request deduplication?

Raft provides linearizability: every committed idempotency key is visible to all subsequent quorum reads regardless of which node handles the request. This prevents duplicate execution even when different nodes receive retried requests.

When does consensus-backed deduplication break under partition?

The minority partition loses write availability — it cannot commit new idempotency keys until quorum is restored. This CP trade-off is intentional for payment processing: the minority returns 503 rather than risk a duplicate execution.

Can etcd or ZooKeeper replace a custom consensus implementation?

Yes. etcd (Raft-based) and ZooKeeper (ZAB-based) are production-grade consensus stores that run as separate clusters. They add a network round trip (~1–5 ms on a LAN) per operation in exchange for removing the operational burden of running an embedded consensus library inside your application process.

Consensus Algorithms for Deduplication

Problem Framing

Local idempotency guarantees (enforced via single-node database constraints, in-memory caches, or application-level hash sets) break down when a node fails. When a node crashes after acknowledging a request but before persisting the idempotency state, subsequent retries hitting a different node have no shared source of truth to consult. Because stateless API gateways, load balancers, and client-side retry mechanisms all sit in front of shared financial state, request duplication becomes a systemic risk rather than an isolated edge case. Consensus algorithms solve this by replacing per-node deduplication state with a replicated, ordered log that all nodes read and write through a single coordination protocol. For the foundational contract that defines why this guarantee is needed across the full request lifecycle, see Distributed Coordination & Locking Strategies.

Guarantee Model

Consensus-backed deduplication provides linearizability: every committed idempotency key is visible to all subsequent reads, regardless of which node handles the incoming request. This is a stronger guarantee than causal or eventual consistency, and it is the minimum required for financial reconciliation, inventory reservations, and payment gateway callbacks.

Where the guarantee breaks:

Network partition (minority side): A partitioned minority cannot commit new entries without quorum. It returns 503 Service Unavailable rather than risk a duplicate commit. The majority partition continues normally.
Clock skew: Consensus protocols sequence operations by log index, not wall clock. TTL-based expiry that relies on wall-clock timestamps must account for clock drift across nodes; use a monotonic sequence number or a centrally assigned lease epoch instead.
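Leader lease expiry during commit: If a leader loses its lease mid-commit, the entry has either already reached a quorum, in which case the new leader preserves it (safe replay), or it has not, in which case the new leader may discard it and the client retries with the same key (safe retry). The client may not learn which outcome occurred until it retries, but the log never exposes an entry in an ambiguous half-committed state.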
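
The partition behavior above comes down to majority arithmetic. The following is a minimal sketch of that arithmetic, not code from any Raft library; the function names and the mapping to HTTP status codes are assumptions made for illustration.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest majority of a cluster: floor(N/2) + 1."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def can_commit(reachable_nodes: int, cluster_size: int) -> bool:
    """True if a partition that reaches `reachable_nodes` members
    (counting the local node) still holds a majority."""
    return reachable_nodes >= quorum_size(cluster_size)


def status_for_partition(reachable_nodes: int, cluster_size: int) -> int:
    """Hypothetical gateway mapping: 503 when quorum is lost, 200 otherwise."""
    return 200 if can_commit(reachable_nodes, cluster_size) else 503


# A 5-node cluster split 3/2: only the 3-node side can register new keys.
print(status_for_partition(3, 5))  # 200
print(status_for_partition(2, 5))  # 503
```

At most one side of any partition can hold a majority, which is why two sides can never both commit the same key.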

Sequence Diagram: Commit-Then-Execute

[Sequence diagram not reproduced in this text version: the full lifecycle of a deduplicated request through a Raft-backed idempotency store.]

Core Algorithm: State Machine Replication for Idempotency Keys

Raft, Paxos, and ZAB share a foundational property: they replicate a deterministic log across all participating nodes, ensuring all committed entries are applied in the same order on every node. Applied to request deduplication, the protocol functions as the coordination backbone for an idempotency state machine.

Commit-Then-Execute Protocol

Idempotency tokens are committed to the consensus log before business logic execution begins:

Proposal phase: The API gateway proposes a new log entry containing the idempotency key, payload hash (SHA-256 of the normalized request body), and a Unix epoch timestamp.
Quorum acknowledgment: The leader replicates the entry to its followers and marks it committed only once a quorum of ⌊N/2⌋ + 1 nodes, counting the leader itself, has acknowledged it (e.g., 3 of 5 nodes, meaning the leader plus 2 followers).
Linearizable read validation: Subsequent requests for the same key issue a linearizable read against the committed log. If the key exists, the cached response is returned without re-executing business logic.
Deterministic replay: On node restart or log truncation, the state machine replays committed entries to reconstruct the idempotency table. No committed key is lost.
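
The proposal phase depends on a payload hash that is stable across retries. The source does not say how the request body is normalized, so the canonical JSON form below (sorted keys, no insignificant whitespace, UTF-8) is an assumption; `payload_hash` is a hypothetical helper name.

```python
import hashlib
import json


def payload_hash(body: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the request body.

    Assumed normalization: sorted keys, compact separators, UTF-8 bytes.
    """
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order in the client's JSON must not change the hash.
first = payload_hash({"amount": 1000, "currency": "USD"})
retry = payload_hash({"currency": "USD", "amount": 1000})
tampered = payload_hash({"amount": 9999, "currency": "USD"})
print(first == retry)     # True
print(first == tampered)  # False
```

Whatever normalization is chosen, every node and every client library has to apply the same one. Otherwise a legitimate retry would be reported as a hash mismatch.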

This eliminates the TOCTOU (time-of-check to time-of-use) race inherent in check-then-act patterns: the consensus protocol serializes all key validations into a single ordered stream, so the check and the registration of a key happen as one step in log order.
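
To illustrate why a single ordered stream removes the race, here is a minimal in-memory sketch of an idempotency state machine. It models only the apply step. Replication, persistence, and response caching are left out, and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Record:
    payload_hash: str
    commit_index: int


@dataclass
class IdempotencyStateMachine:
    """Deterministic state machine: same committed log in, same table out."""
    records: Dict[str, Record] = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, index: int, key: str, payload_hash: str) -> str:
        """Apply one committed entry; returns "new", "duplicate", or "mismatch"."""
        if index != self.last_applied + 1:
            raise ValueError("committed entries must be applied in log order")
        self.last_applied = index
        existing = self.records.get(key)
        if existing is None:
            self.records[key] = Record(payload_hash, index)
            return "new"        # first entry in log order wins
        if existing.payload_hash != payload_hash:
            return "mismatch"   # same key, different payload
        return "duplicate"      # retry of an already registered request


# Concurrent requests are ordered by log index, not by arrival time.
log: List[Tuple[int, str, str]] = [
    (1, "pay-123", "hash-a"),
    (2, "pay-123", "hash-a"),  # retry that raced with the first request
    (3, "pay-123", "hash-b"),  # same key reused with a different payload
]

replica_a, replica_b = IdempotencyStateMachine(), IdempotencyStateMachine()
results_a = [replica_a.apply(*entry) for entry in log]
results_b = [replica_b.apply(*entry) for entry in log]
print(results_a)  # ['new', 'duplicate', 'mismatch']
```

Because every replica applies the same entries in the same order, each one reaches the same verdict for every request without any cross-node locking.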

Pseudo-code: Raft-Backed Key Registration

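// Sketch only: raftNode, LogEntry, CachedResponse, and ErrKeyHashMismatch are
// hypothetical application types, not the API of a specific Raft library.
// Imports needed: bytes, context, fmt, time.

// RegisterKey attempts to commit an idempotency key to the Raft log.
// Returns (cachedResponse, nil) on duplicate; (nil, nil) on new key (proceed to execute).
// Limitation: a duplicate whose original request is still executing has no cached
// response yet; a production version should report that case separately.
func RegisterKey(ctx context.Context, key string, payloadHash []byte) (*CachedResponse, error) {
    // Linearizable read first: avoid proposing a key that already exists.
    existing, err := raftNode.LinearizableRead(ctx, key)
    if err != nil {
        return nil, fmt.Errorf("linearizable read failed: %w", err)
    }
    if existing != nil {
        if !bytes.Equal(existing.PayloadHash, payloadHash) {
            return nil, ErrKeyHashMismatch // same key, different payload: reject
        }
        return existing.CachedResponse, nil
    }

    entry := &LogEntry{
        IdempotencyKey: key,
        PayloadHash:    payloadHash,
        ProposedAt:     time.Now().UTC().Unix(), // informational; ordering comes from the log index
    }

    // Propose the new entry; the returned log index identifies this proposal.
    index, err := raftNode.Propose(ctx, entry)
    if err != nil {
        return nil, fmt.Errorf("raft propose failed: %w", err)
    }

    // Wait for commit confirmation (quorum has acknowledged) at that index.
    // A second-resolution timestamp is not unique per proposal, so it cannot serve as the handle.
    committed, err := raftNode.AwaitCommit(ctx, index)
    if err != nil {
        return nil, fmt.Errorf("raft commit failed or timed out: %w", err)
    }
    // A concurrent proposal for the same key may have been applied first.
    if committed.Duplicate {
        if !bytes.Equal(committed.PayloadHash, payloadHash) {
            return nil, ErrKeyHashMismatch
        }
        return committed.CachedResponse, nil
    }
    return nil, nil // new key: caller proceeds with business logic
}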

Implementation Variants

Four deployment models cover the range from embedded libraries to fully managed services.

Variant A: Embedded Raft Library (HashiCorp Raft / etcd Raft)

The application process runs a Raft node internally. Zero extra network hops for log proposals; strong co-location of business logic and idempotency state.

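// Initialize embedded Raft using hashicorp/raft.
// Imports needed: os, path/filepath, time, github.com/hashicorp/raft,
// and raftboltdb "github.com/hashicorp/raft-boltdb".
config := raft.DefaultConfig()
config.LocalID = raft.ServerID(nodeID)
config.HeartbeatTimeout = 150 * time.Millisecond
config.ElectionTimeout = 300 * time.Millisecond
// The default LeaderLeaseTimeout (500 ms) exceeds the heartbeat timeout above and
// fails config validation, so lower it as well.
config.LeaderLeaseTimeout = 100 * time.Millisecond
config.CommitTimeout = 50 * time.Millisecond

// Errors are discarded here for brevity only; check every one in real code.
logStore, _ := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft-log.db"))
stableStore, _ := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft-stable.db"))
snapStore, _ := raft.NewFileSnapshotStore(dataDir, 3, os.Stderr)

transport, _ := raft.NewTCPTransport(bindAddr, nil, 3, 10*time.Second, os.Stderr)
ra, _ := raft.NewRaft(config, &idempotencyFSM{db: kvStore}, logStore, stableStore, snapStore, transport)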

Trade-offs: High operational complexity (lifecycle coupling between application and consensus); GC pressure from log buffer allocations in garbage-collected runtimes such as Go or the JVM.

Variant B: etcd as External Consensus Store

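etcd exposes a key-value API backed by Raft. Reads are linearizable by default; the clientv3.WithSerializable() option opts into faster but potentially stale reads and must not be used on the deduplication path. Running etcd externally decouples the application process from the consensus lifecycle.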

cli, err := clientv3.New(clientv3.Config{
    Endpoints:   []string{"etcd-0:2379", "etcd-1:2379", "etcd-2:2379"},
    DialTimeout: 3 * time.Second,
})
if err != nil {
    return nil, fmt.Errorf("etcd connect failed: %w", err)
}

// Atomic create-if-absent: returns existing value if key already present.
// Version == 0 means the key does not currently exist.
txn := cli.Txn(ctx).
    If(clientv3.Compare(clientv3.Version(idempotencyKey), "=", 0)).
    Then(clientv3.OpPut(idempotencyKey, serializedResponse, clientv3.WithLease(leaseID))).
    Else(clientv3.OpGet(idempotencyKey))

resp, err := txn.Commit()
if err != nil {
    // Check err before touching resp: resp is nil when the commit fails.
    return nil, fmt.Errorf("etcd txn failed: %w", err)
}
if !resp.Succeeded {
    // Key already exists: return cached response
    cached := resp.Responses[0].GetResponseRange().Kvs[0].Value
    return deserializeCachedResponse(cached), nil
}

Trade-offs: 1–5 ms LAN overhead per commit; dependency on etcd cluster availability; requires separate etcd monitoring.

Variant C: ZooKeeper + ZAB (Java/JVM Services)

ZooKeeper’s create with CreateMode.PERSISTENT is atomic: it fails with NodeExistsException if the path already exists. This makes ZAB semantics directly usable as a distributed deduplication gate.

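String path = "/idempotency/" + idempotencyKey;
try {
    // Atomic create-if-absent: the znode initially holds only the payload hash.
    zk.create(
        path,
        serializedPayloadHash,
        ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT
    );
    // Node created: new request, proceed with execution.
    var response = executeBusinessLogic(request);
    // Store the response so duplicates can replay it (version -1 matches any version).
    // serializeRecord is a hypothetical helper that packs the hash and the response.
    zk.setData(path, serializeRecord(serializedPayloadHash, response), -1);
    return response;
} catch (KeeperException.NodeExistsException e) {
    // Duplicate: retrieve and return cached response.
    // If the first request is still executing, the znode holds only the hash.
    byte[] cached = zk.getData(path, false, null);
    return deserializeCachedResponse(cached);
}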

Trade-offs: ZooKeeper keeps the entire data tree in memory on every server, which limits key cardinality; persistent znodes do not expire on their own, so TTL expiry requires a dedicated cleanup job.

Variant D: Managed Spanner / DynamoDB with Conditional Writes

Cloud-managed databases that provide external consistency (Spanner) or conditional writes (DynamoDB) without operating a consensus cluster directly. Best for teams that prioritize operational simplicity over minimizing per-commit latency.

# DynamoDB conditional write: idempotent key registration
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('idempotency_keys')


def register_key(key, payload_hash, cached_response):
    try:
        table.put_item(
            Item={
                'idempotency_key': key,
                'payload_hash':    payload_hash,
                'response_body':   cached_response,
                'ttl':             int(time.time()) + 86400,   # 24-hour TTL
            },
            ConditionExpression='attribute_not_exists(idempotency_key)'
        )
        return None  # new key: caller proceeds
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            # Strongly consistent read: the default eventually consistent read
            # may not yet reflect the item that made the condition fail.
            existing = table.get_item(
                Key={'idempotency_key': key},
                ConsistentRead=True,
            )['Item']
            return existing['response_body']  # cached response
        raise

Trade-offs: 5–20 ms RTT for cross-AZ Spanner or DynamoDB commits; no control over internal replication factor; vendor lock-in.

Comparison Table

Variant	Consistency	Median commit latency	Operational overhead	Best for
Embedded Raft	Linearizable	0.5–2 ms intra-process	High (lifecycle coupling)	Low-latency fintech, full control
etcd external	Linearizable	2–8 ms LAN	Medium (etcd cluster ops)	Go/polyglot microservices
ZooKeeper ZAB	Linearizable	3–10 ms LAN	Medium (Java-native)	JVM monoliths, Kafka-adjacent stacks
Managed DB	External consistency / strong	5–20 ms cross-AZ	Low (vendor-managed)	Teams avoiding consensus ops

Edge Cases & Failure Scenarios

Failure Scenario	Remediation Steps	Observability Hooks
Leader election during an in-flight commit	The new leader checks whether the log entry exists at the proposed index. If absent, the gateway retries the proposal with the original idempotency key; the state machine deduplicates any concurrent commit.	Alert on `raft_leader_changes_total > 3` per minute; trace `proposal.retry_count` per request.
Network partition isolates minority nodes	Minority nodes return `503 Service Unavailable`. The gateway uses exponential backoff with jitter to retry against the majority partition. Do not cache negative results: the key may have committed on the majority side.	`raft_quorum_loss_seconds` gauge; PagerDuty alert if quorum is absent for more than 10 seconds.
Payload hash mismatch on duplicate key	Reject the request with `422 Unprocessable Entity` and a body naming the conflict. Log both hashes for security audit. Do not overwrite the existing entry.	`dedup.hash_mismatch_total` counter; emit to security event stream with `idempotency_key`, `expected_hash`, `received_hash`.
Log compaction purges a key before retry window expires	Enforce a compaction safe-deletion window ≥ `max_processing_time_ms + max_client_retry_window_ms`. For payment flows, a minimum of 86,400 seconds (24 hours) after the last successful commit is standard.	`compaction.premature_deletion_total` counter; alert if any key is deleted before its TTL timestamp.
Stale read from a non-leader node (routing misconfiguration)	Enforce that all idempotency reads route exclusively to the current leader or use linearizable read mode (Raft read index protocol). For etcd, rely on the default linearizable reads and never pass `WithSerializable()` on this path.	`dedup.stale_read_total` counter; trace `read.node_id` vs `raft.leader_id` per request.
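
The partition row above calls for exponential backoff with jitter. The sketch below shows one common form (full jitter) under stated assumptions: `send` is a hypothetical callable that returns an HTTP status code, and the base delay, cap, and attempt count are illustrative values rather than recommendations from the source.

```python
import random
import time


def retry_on_503(send, idempotency_key, max_attempts=5, base_s=0.1, cap_s=5.0,
                 sleep=time.sleep, rng=None):
    """Call send(idempotency_key) until it returns something other than 503.

    The same idempotency key is reused on every attempt, so a commit that
    already succeeded on the majority side is deduplicated rather than repeated.
    Returns the last status code seen.
    """
    rng = rng or random.Random()
    status = None
    for attempt in range(max_attempts):
        status = send(idempotency_key)
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Full jitter: sleep a random time in [0, min(cap, base * 2^attempt)].
            sleep(rng.uniform(0, min(cap_s, base_s * 2 ** attempt)))
    return status


# Usage with a fake transport that recovers on the third attempt.
calls = []
responses = iter([503, 503, 200])

def fake_send(key):
    calls.append(key)
    return next(responses)

result = retry_on_503(fake_send, "pay-123", sleep=lambda s: None)
print(result, len(calls))  # 200 3
```

Injecting `sleep` and `rng` keeps the loop testable without real delays.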

Operational Concerns

TTL Management and Log Compaction

Every committed idempotency key generates a replicated log entry. Without proactive lifecycle management, storage exhaustion and degraded commit latency cascade across all replicas.

Safe-deletion window: Delete a key only after max_processing_time + max_retry_window. For payment systems this is a minimum of 86,400 seconds. For webhook callbacks with 72-hour retry windows, extend to 259,200 seconds.
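Snapshot compaction: A snapshot captures the state machine's current idempotency table, after which the log entries it covers can be discarded; on restart, a node restores the snapshot and replays only the entries that follow it. Schedule compaction when raft_log_entries_total > 100,000 or disk usage exceeds 70%.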
Lease renewal for long-running operations: Batch payment reconciliation jobs hold an idempotency lease for the full job duration. Implement heartbeat renewal every lease_ttl / 3 seconds. If the worker fails to renew, the consensus layer automatically expires the lease and marks the key available for retry routing. Align these patterns with lock timeout and lease management practices for consistent expiry behavior across the system.
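
The window and renewal rules above reduce to a few lines of arithmetic. This is a rough sketch with hypothetical function names; it assumes all timestamps come from one trusted clock source, so it does not address clock skew.

```python
def safe_deletion_window_s(max_processing_time_s: int, max_retry_window_s: int,
                           floor_s: int = 86_400) -> int:
    """Seconds a key must be retained: processing time plus retry window,
    never less than the floor (24 hours for payment flows in the text)."""
    return max(floor_s, max_processing_time_s + max_retry_window_s)


def may_delete(committed_at_s: int, now_s: int, window_s: int) -> bool:
    """True once the key has outlived its safe-deletion window."""
    return now_s - committed_at_s >= window_s


def renewal_interval_s(lease_ttl_s: float) -> float:
    """Heartbeat period for lease renewal: one third of the lease TTL."""
    return lease_ttl_s / 3


# Payment flow: 30 s processing, 1 h retries. The 24 h floor dominates.
print(safe_deletion_window_s(30, 3_600))    # 86400
# Webhook flow: 30 s processing, 72 h retries.
print(safe_deletion_window_s(30, 259_200))  # 259230
print(renewal_interval_s(30))               # 10.0
```

Renewing at a third of the TTL leaves room for two missed heartbeats before the lease lapses.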

Index Strategy

For etcd and ZooKeeper, namespace keys under a fixed prefix (/idempotency/v1/) to enable prefix-scan compaction jobs. For Spanner and DynamoDB, add a ttl attribute and enable TTL-based expiry at the table level. Look up records by idempotency_key as the primary key, and serve time-range compaction queries from a secondary index that leads with created_at: an index that leads with the unique key cannot narrow a time-range scan. Avoid full-table scans.

Memory and Storage Budgeting

A typical idempotency entry (key + SHA-256 hash + serialized response + metadata) is approximately 512–2048 bytes. At 10,000 requests per second with a 24-hour retention window and about 1 KiB per entry, the state store must hold roughly 10,000 × 86,400 × 1,024 bytes ≈ 885 GB (about 824 GiB) on every replica, assuming each request carries a distinct key. Size your consensus cluster storage accordingly and set compaction intervals to keep the active log under 10% of total disk capacity.
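
The budgeting arithmetic can be reproduced with a short calculation. The sketch assumes every request carries a distinct key and that each replica stores the full table; `retained_bytes` is a hypothetical helper.

```python
def retained_bytes(requests_per_s: int, retention_s: int, entry_bytes: int) -> int:
    """Bytes held at steady state when every request registers a distinct key."""
    return requests_per_s * retention_s * entry_bytes


total = retained_bytes(10_000, 86_400, 1_024)
print(total)                      # 884736000000
print(round(total / 1e9, 1))      # 884.7 (GB, decimal)
print(round(total / 2 ** 30, 1))  # 824.0 (GiB, binary)

# Bounds for the 512 to 2048 byte entry range quoted above.
low = retained_bytes(10_000, 86_400, 512)
high = retained_bytes(10_000, 86_400, 2_048)
print(round(low / 1e9, 1), round(high / 1e9, 1))  # 442.4 1769.5
```

Entry size dominates the estimate, so truncating or compressing the cached response body is usually the most effective way to shrink it.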

SRE Alert Thresholds

Consensus commit latency p99 > 50 ms: investigate follower replication lag or network congestion.
Deduplication hit rate < 0.1%: validate that clients are correctly retransmitting idempotency keys; a zero hit rate may indicate key regeneration on retry.
Quorum loss duration > 10 seconds: trigger PagerDuty; begin automated failover runbook.
Idempotency key hash mismatch rate > 0/hour: any non-zero value indicates a client bug or a security event; escalate immediately.
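
As a rough illustration, the four thresholds above can be written as a single evaluation function. The metric field names and alert labels are hypothetical. In practice these rules would normally live in the monitoring system's own rule language rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DedupMetrics:
    commit_latency_p99_ms: float
    dedup_hit_rate: float          # fraction of requests, so 0.001 means 0.1%
    quorum_loss_s: float
    hash_mismatches_per_hour: int


def firing_alerts(m: DedupMetrics) -> List[str]:
    """Return the labels of every threshold that the metrics breach."""
    alerts = []
    if m.commit_latency_p99_ms > 50:
        alerts.append("commit_latency_p99_high")
    if m.dedup_hit_rate < 0.001:
        alerts.append("dedup_hit_rate_low")
    if m.quorum_loss_s > 10:
        alerts.append("quorum_loss")
    if m.hash_mismatches_per_hour > 0:
        alerts.append("hash_mismatch")
    return alerts


healthy = DedupMetrics(12.0, 0.02, 0.0, 0)
degraded = DedupMetrics(75.0, 0.0, 14.0, 1)
print(firing_alerts(healthy))   # []
print(firing_alerts(degraded))
# ['commit_latency_p99_high', 'dedup_hit_rate_low', 'quorum_loss', 'hash_mismatch']
```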

Integration with Microservice Architecture

Consensus deduplication is the serialization layer that prevents race conditions across asynchronous boundaries. By guaranteeing total ordering of idempotency commits, consensus logs eliminate non-deterministic execution paths in webhook processing, payment gateway callbacks, and inventory deduction pipelines.

Preventing Race Conditions in Asynchronous Workflows

Asynchronous microservices frequently process events out of order due to message broker partitioning, consumer lag, or retry storms. Consensus logs provide a deterministic ordering mechanism:

Financial transactions: Payment confirmations, refunds, and chargebacks are sequenced by commit index. The state machine applies them in strict order, preventing double-spend anomalies. For a detailed treatment of preventing these race conditions, see preventing race conditions in microservices.
Webhook processing: Third-party callbacks often arrive as duplicates or with delayed delivery. The consensus layer filters duplicates at ingestion, ensuring downstream services process each event exactly once. For the full set of webhook delivery guarantees this depends on, see webhook delivery guarantees.
Inventory deduction: Concurrent checkout requests for limited stock are serialized through the idempotency log. The leader processes requests in arrival order, committing stock reservations atomically and rejecting subsequent duplicates before they reach the inventory service.

Client-Side Token Generation

The consensus layer is only as strong as the uniqueness of the tokens it receives. Clients must generate cryptographically strong idempotency keys: 128-bit UUIDs (v4 random or v7 time-ordered) or deterministic HMAC tokens derived from the canonical request payload. Deterministic tokens guarantee that a client that crashes and restarts generates the same key for the same logical operation, enabling safe retry without key proliferation.
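
Both token styles can be sketched with the Python standard library. The canonical JSON encoding and the per-client `secret` are assumptions; UUIDv7 is omitted because not every Python version ships a generator for it.

```python
import hashlib
import hmac
import json
import uuid


def random_key() -> str:
    """Random 128-bit UUIDv4. The client must persist it to reuse it on retry."""
    return str(uuid.uuid4())


def deterministic_key(secret: bytes, operation: dict) -> str:
    """HMAC-SHA256 over a canonical encoding of the logical operation.

    The same operation always yields the same key, even after a client restart.
    """
    canonical = json.dumps(operation, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


secret = b"per-client-secret"  # hypothetical; load from a secret store in practice
op = {"order_id": "A-1001", "action": "capture", "amount": 1000}

before_crash = deterministic_key(secret, op)
after_restart = deterministic_key(secret, dict(reversed(list(op.items()))))
print(before_crash == after_restart)  # True
print(random_key() == random_key())   # False
```

The trade-off is that a deterministic key treats two intentionally identical operations as one, so the hashed fields should include something that separates them, such as an order or invoice identifier.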

Cross-Stack Migration Paths

Adopting consensus-driven deduplication in legacy monoliths or polyglot microservices requires incremental rollout:

Shadow traffic: Route a percentage of production requests through the consensus layer in shadow mode. Compare deduplication outcomes against the legacy system without affecting live responses.
Dual-write validation: Write idempotency keys to both the existing datastore and the consensus log during a validation window. Reconcile discrepancies before switching the read path.
Rollback routing: Maintain a fallback layer that bypasses consensus during cluster instability. If commit latency exceeds 100 ms p99 or quorum loss is detected, route traffic to local database constraints with explicit post-incident reconciliation jobs.
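
The rollout steps above can be illustrated with two small helpers: one that picks the deduplication backend from current cluster health, and one that reconciles keys after a dual-write window. The names and return values are hypothetical; the 100 ms threshold is the one quoted in the text.

```python
from typing import Dict, Set


def choose_backend(commit_latency_p99_ms: float, quorum_available: bool,
                   threshold_ms: float = 100.0) -> str:
    """Route to the local-database fallback while the cluster is unhealthy."""
    if not quorum_available or commit_latency_p99_ms > threshold_ms:
        return "local_db_fallback"
    return "consensus"


def reconcile(legacy_keys: Set[str], consensus_keys: Set[str]) -> Dict[str, Set[str]]:
    """Report keys that only one side saw during dual-write validation."""
    return {
        "missing_in_consensus": legacy_keys - consensus_keys,
        "missing_in_legacy": consensus_keys - legacy_keys,
    }


print(choose_backend(20.0, True))    # consensus
print(choose_backend(140.0, True))   # local_db_fallback
print(choose_backend(20.0, False))   # local_db_fallback

diff = reconcile({"k1", "k2", "k3"}, {"k2", "k3", "k4"})
print(diff["missing_in_consensus"], diff["missing_in_legacy"])  # {'k1'} {'k4'}
```

Keys accepted while the fallback was active are exactly the ones the post-incident reconciliation job has to replay into the consensus log.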

For lightweight alternatives where strict linearizability is not required, distributed lock acquisition patterns offer advisory locking at lower operational cost — though they sacrifice the exactly-once guarantee under concurrent retries.

Distributed Coordination & Locking Strategies — parent: the full guarantee contract for cluster-wide coordination and the failure modes this page addresses.
Lock Timeout & Lease Management — sibling: TTL enforcement and lease heartbeat renewal patterns that complement consensus-backed key expiry.
Distributed Lock Acquisition Patterns — sibling: advisory locking with Redlock as a lower-latency alternative when linearizability is not required.
Preventing Race Conditions in Microservices — sibling: how consensus deduplication integrates with broader race-condition prevention across asynchronous service boundaries.
Idempotency Key Generation Strategies — foundational: UUIDv4 vs UUIDv7 vs HMAC-deterministic token generation for use with consensus stores.