Resilience & Fault Tolerance
In distributed systems, failures are guaranteed. Networks drop packets, JVMs pause for garbage collection, and entire availability zones go offline. LoomCache is designed with active and passive resilience mechanisms to survive the chaos.
Resilience Circuit Breaker
LoomCache automatically isolates failing partitions. Connection pools fail fast to prevent cascading timeouts across the cluster.
Closed: Traffic flows freely to the node.
Open: The node is isolated. Requests fail fast immediately.
Half-Open: Node health is tested with limited probe traffic.
Circuit Breakers
Phi-Accrual Failure Detector
A dynamic suspicion level (Φ), derived from heartbeat arrival statistics, instead of a static timeout
When the Phi-Accrual failure detector determines that a node has stopped responding to heartbeats, the connection pool trips the circuit breaker for that specific route.
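To make the idea of a Φ value concrete, here is a minimal sketch of a phi-accrual calculation. It uses the exponential approximation, Φ = elapsed / (mean interval × ln 10), which is a common simplification and not necessarily the model LoomCache uses internally. The class name `PhiAccrualSketch`, the sliding-window size, and the threshold passed to `isSuspect` are all assumptions made for illustration.

```java
/**
 * Illustrative phi-accrual detector (hypothetical, not LoomCache's implementation).
 * Assumes heartbeat inter-arrival times are roughly exponentially distributed.
 */
final class PhiAccrualSketch {
    private final double[] intervalsMs; // ring buffer of recent inter-arrival times
    private int count = 0;              // number of valid samples in the buffer
    private int next = 0;               // next slot to overwrite
    private long lastHeartbeatMs = -1;  // -1 means no heartbeat seen yet

    PhiAccrualSketch(int windowSize) {
        this.intervalsMs = new double[windowSize];
    }

    /** Records a heartbeat that arrived at the given timestamp (milliseconds). */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            intervalsMs[next] = nowMs - lastHeartbeatMs;
            next = (next + 1) % intervalsMs.length;
            if (count < intervalsMs.length) {
                count++;
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /** Mean of the sampled inter-arrival times, or 0 if there are no samples. */
    double meanIntervalMs() {
        if (count == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 0; i < count; i++) {
            sum += intervalsMs[i];
        }
        return sum / count;
    }

    /**
     * Phi = -log10(P(silence lasts this long while the node is alive)).
     * With P = exp(-elapsed / mean), this reduces to elapsed / (mean * ln 10).
     */
    double phi(long nowMs) {
        double mean = meanIntervalMs();
        if (mean <= 0.0) {
            return 0.0; // not enough data to suspect anything
        }
        double elapsed = nowMs - lastHeartbeatMs;
        return elapsed / (mean * Math.log(10.0));
    }

    /** The threshold is a tuning knob; the caller chooses it (assumption). */
    boolean isSuspect(long nowMs, double threshold) {
        return phi(nowMs) >= threshold;
    }
}
```

With heartbeats arriving every 1000 ms, one second of silence yields Φ ≈ 0.43, while 27 seconds of silence yields Φ ≈ 11.7. The suspicion level grows smoothly with the length of the silence and adapts to the observed heartbeat rate, which is the advantage over a fixed timeout.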
This mechanism provides “fail-fast” semantics. Instead of letting thread pools saturate while they wait on TCP timeouts for a dead node, the LoomClient rejects the request immediately, protecting the upstream application from cascading failures.
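The fail-fast behavior described above can be sketched as a per-route guard that is checked before any network call is attempted. This is an illustrative sketch only: `FailFastRouteSketch`, `RouteUnavailableException`, and the `Call` interface are hypothetical names, not part of the LoomClient API.

```java
/**
 * Illustrative per-route fail-fast guard (hypothetical, not the LoomClient API).
 */
final class FailFastRouteSketch {
    /** Thrown immediately when the route's breaker is open; no socket is touched. */
    static final class RouteUnavailableException extends RuntimeException {
        RouteUnavailableException(String route) {
            super("circuit open for route " + route);
        }
    }

    /** Stand-in for a network call made over this route. */
    interface Call<T> {
        T run();
    }

    private final String route;
    private volatile boolean open = false; // volatile so a trip is visible to all threads

    FailFastRouteSketch(String route) {
        this.route = route;
    }

    /** Would be invoked when the failure detector suspects the node behind this route. */
    void trip() {
        open = true;
    }

    /** Would be invoked once the node is considered healthy again. */
    void reset() {
        open = false;
    }

    boolean isOpen() {
        return open;
    }

    /** Rejects immediately while open; otherwise runs the call on the caller's thread. */
    <T> T execute(Call<T> call) {
        if (open) {
            throw new RouteUnavailableException(route);
        }
        return call.run();
    }
}
```

The point of the check is that a rejected request costs a single field read rather than a thread blocked until a TCP timeout expires.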
Connection Pool Recovery
While the circuit breaker is Open, LoomCache runs background health probes against the node (the Half-Open state). Once the node comes back online and stabilizes, the breaker closes and traffic flows through the primary connection pool routes again.
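One way to model this recovery path is a small state machine driven by probe results. The sketch below assumes that "stabilizes" means a fixed number of consecutive healthy probes; that rule, the class name `RecoveryBreakerSketch`, and its methods are assumptions for illustration and do not describe LoomCache's actual recovery criteria.

```java
/**
 * Illustrative Open -> Half-Open -> Closed recovery (hypothetical sketch).
 */
final class RecoveryBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requiredSuccesses; // assumed stabilization rule
    private State state = State.CLOSED;
    private int consecutiveSuccesses = 0;

    RecoveryBreakerSketch(int requiredSuccesses) {
        this.requiredSuccesses = requiredSuccesses;
    }

    /** Isolates the node; application traffic is rejected from now on. */
    synchronized void trip() {
        state = State.OPEN;
        consecutiveSuccesses = 0;
    }

    /** Records the outcome of one background health probe. */
    synchronized void onProbeResult(boolean healthy) {
        if (state == State.CLOSED) {
            return; // nothing to recover from
        }
        if (!healthy) {
            // Any failed probe sends the breaker back to fully open.
            state = State.OPEN;
            consecutiveSuccesses = 0;
            return;
        }
        state = State.HALF_OPEN;
        consecutiveSuccesses++;
        if (consecutiveSuccesses >= requiredSuccesses) {
            // The node looks stable: let normal traffic back through.
            state = State.CLOSED;
            consecutiveSuccesses = 0;
        }
    }

    /** Only a closed breaker admits regular application requests. */
    synchronized boolean allowsApplicationTraffic() {
        return state == State.CLOSED;
    }

    synchronized State state() {
        return state;
    }
}
```

Requiring several consecutive successes before closing avoids flapping when a node answers one probe and then fails again.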