
Resilience & Fault Tolerance

In distributed systems, failures are guaranteed. Networks drop packets, JVMs pause for garbage collection, and entire availability zones go offline. LoomCache combines active and passive resilience mechanisms to keep serving requests through these failures.

Resilience Circuit Breaker

LoomCache automatically isolates failing partitions. Connection pools fail fast to prevent cascading timeouts across the cluster.

CLOSED (Normal)

Traffic flows freely to the node.

OPEN (Failing)

Node isolated. Requests immediately fail fast.

HALF-OPEN (Recovery)

Testing node health with limited probe traffic.


Phi-Accrual Failure Detector

Dynamic suspicion level (Φ) instead of static timeouts

[Diagram: Node A (observer) monitors heartbeats from Node B (target) and tracks a suspicion level Φ against a threshold of 8.0. Initial state shown: Φ = 0.00, State: ALIVE, Network: HEALTHY.]
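To make the suspicion level concrete, here is a minimal sketch of how a phi-accrual detector can compute Φ from observed heartbeat gaps. It is not LoomCache's implementation: the class name, the sliding-window size, and the method names are hypothetical. It also assumes heartbeat intervals follow an exponential distribution, a common simplification (the original phi-accrual formulation fits a normal distribution). Under that assumption, Φ = -log10(P(next heartbeat arrives later than now)) reduces to elapsed / (mean interval × ln 10).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a phi-accrual detector; not LoomCache's actual class. */
final class PhiAccrualSketch {
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private final int windowSize;    // assumed: number of recent intervals to keep
    private final double threshold;  // e.g. 8.0, the threshold shown on this page
    private long lastHeartbeatMillis = -1;

    PhiAccrualSketch(int windowSize, double threshold) {
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    /** Record a heartbeat arrival and remember the gap since the previous one. */
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
            if (intervalsMillis.size() > windowSize) {
                intervalsMillis.removeFirst(); // keep only the most recent window
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** Suspicion level: grows continuously the longer the target stays silent. */
    double phi(long nowMillis) {
        if (intervalsMillis.isEmpty()) {
            return 0.0; // not enough history to suspect anything
        }
        double mean = intervalsMillis.stream()
                .mapToLong(Long::longValue).average().orElse(1.0);
        mean = Math.max(mean, 1.0); // guard against a zero mean interval
        double elapsed = nowMillis - lastHeartbeatMillis;
        // Exponential model: P(later) = exp(-elapsed / mean),
        // so -log10(P(later)) = elapsed / (mean * ln 10).
        return elapsed / (mean * Math.log(10.0));
    }

    boolean isSuspected(long nowMillis) {
        return phi(nowMillis) >= threshold;
    }

    public static void main(String[] args) {
        PhiAccrualSketch detector = new PhiAccrualSketch(100, 8.0);
        for (long t = 0; t <= 5_000; t += 1_000) {
            detector.heartbeat(t); // steady heartbeats, one per second
        }
        System.out.println(detector.isSuspected(6_000));  // false: 1 s of silence
        System.out.println(detector.isSuspected(30_000)); // true: 25 s of silence
    }
}
```

Because Φ scales with the observed mean interval, a node on a slow or jittery link gets proportionally more slack than a fixed timeout would allow. In this sketch, with one-second heartbeats, the threshold of 8.0 is crossed after roughly 18.4 seconds of silence.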

When the Phi-Accrual failure detector determines that a node has stopped responding to heartbeats, the connection pool trips the circuit breaker for that specific route.

This mechanism provides critical “fail-fast” semantics. Instead of thread pools saturating while waiting on TCP timeouts for a dead node, the LoomClient instantly rejects the request, protecting the upstream application from cascading failures.

While a route's circuit breaker is Open, LoomCache probes the node's health in the background with limited traffic (the Half-Open state). Once the node comes back online and stabilizes, the breaker closes and traffic flows through the primary connection pool routes again.
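The three states and their transitions can be sketched as a small per-route state machine. This is an illustration only: `RouteCircuitBreaker`, its method names, and the "N consecutive successful probes" rule for deciding that a node has stabilized are assumptions, not LoomCache's actual API or recovery policy.

```java
/** Hypothetical per-route breaker; names and policy are illustrative, not LoomCache's API. */
final class RouteCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requiredProbeSuccesses; // assumed criterion for "stabilized"
    private State state = State.CLOSED;
    private int consecutiveProbeSuccesses = 0;

    RouteCircuitBreaker(int requiredProbeSuccesses) {
        this.requiredProbeSuccesses = requiredProbeSuccesses;
    }

    synchronized State state() {
        return state;
    }

    /** Invoked when the failure detector suspects the node behind this route. */
    synchronized void trip() {
        state = State.OPEN;
        consecutiveProbeSuccesses = 0;
    }

    /** Client path: throws immediately instead of waiting on a TCP timeout. */
    synchronized void checkRequestAllowed() {
        if (state != State.CLOSED) {
            throw new IllegalStateException("route isolated (" + state + "): failing fast");
        }
    }

    /** Background path: report the outcome of one health probe on an isolated route. */
    synchronized void recordProbeResult(boolean healthy) {
        if (state == State.CLOSED) {
            return; // nothing to recover
        }
        if (!healthy) {
            trip(); // failed probe: back to OPEN, success streak reset
            return;
        }
        state = State.HALF_OPEN;
        if (++consecutiveProbeSuccesses >= requiredProbeSuccesses) {
            state = State.CLOSED; // node has stabilized: restore normal traffic
        }
    }

    public static void main(String[] args) {
        RouteCircuitBreaker breaker = new RouteCircuitBreaker(2);
        breaker.trip();
        breaker.recordProbeResult(true);
        System.out.println(breaker.state()); // HALF_OPEN
        breaker.recordProbeResult(true);
        System.out.println(breaker.state()); // CLOSED
    }
}
```

In this sketch, client requests are rejected in both the Open and Half-Open states, so only background probes reach a recovering node. The probe itself runs outside the breaker's lock, and only its result is reported, so a slow probe never blocks the fail-fast client path.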