Chaos Testing

LoomCache ships a self-contained Java chaos-testing framework in the server test suite. There is no external dependency or separate runtime — histories, checkers, and nemeses are all pure Java.

A run:

  1. Generates operations: concurrent clients drive reads/writes/CAS against the cluster (ChaosWorkload produces Register / Counter / Queue / Set / Mixed workloads; ChaosClient / ChaosRealClient drive them).
  2. Injects faults: ChaosNemesis is a sealed fault family covering symmetric partition, node isolation, message reorder, kill node / kill leader / pause node, clock skew, slow disk (message-send delay), memory pressure, and CPU contention, composable via ChaosNemesis.Combined.
  3. Records a history: ChaosHistory captures invocation and completion timestamps.
  4. Verifies consistency: a linearizability checker or the per-model ChaosChecker checkers validate the history.

The framework is built from these classes:

  • ChaosTestHarness — wires workload + nemesis + cluster into a single executable test.
  • ChaosCluster / ChaosRealCluster — both start real LoomCache clusters (Raft + TCP) with fault hooks.
  • ChaosClient / ChaosRealClient drive the generated operations against the cluster.
  • ChaosWorkload — op generators: Register / Counter / Queue / Set / Mixed workloads.
  • ChaosNemesis — sealed fault family (see Methodology for the fault list).
  • ChaosHistory / ChaosReport — history and reporting.
  • ChaosChecker and the linearizability checker — ChaosChecker exposes the per-model checkers; the linearizability checker runs a linearizability search over register / counter / queue / set.
  • model/Operation, CounterModel, QueueModel, LockModel, SetModel.
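
To make the sealed-family idea concrete, the following is a minimal Java 17+ sketch of how a fault family with a Combined composite could be modeled and exercised against a recording double. It is not LoomCache's code: FaultTarget, RecordingTarget, the record fields, and the inject method are hypothetical names chosen for illustration, and only the fault names mirror the ones listed on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: every type and method here is hypothetical, not LoomCache's API. */
final class NemesisSketch {

    /** Hypothetical fault hooks that a cluster (or a test double) exposes to a nemesis. */
    interface FaultTarget {
        void partition(List<Integer> sideA, List<Integer> sideB);
        void pauseNode(int nodeId);
        void healAll();
    }

    /** Sealed family: the compiler knows the complete set of permitted fault types. */
    sealed interface Nemesis permits SymmetricPartition, PauseNode, Combined {
        void inject(FaultTarget target);
    }

    record SymmetricPartition(List<Integer> sideA, List<Integer> sideB) implements Nemesis {
        public void inject(FaultTarget target) { target.partition(sideA, sideB); }
    }

    record PauseNode(int nodeId) implements Nemesis {
        public void inject(FaultTarget target) { target.pauseNode(nodeId); }
    }

    /** Composite fault: injects each part in list order. */
    record Combined(List<Nemesis> parts) implements Nemesis {
        public void inject(FaultTarget target) { parts.forEach(p -> p.inject(target)); }
    }

    /** Recording double: logs the calls instead of touching a real cluster. */
    static final class RecordingTarget implements FaultTarget {
        final List<String> calls = new ArrayList<>();
        public void partition(List<Integer> a, List<Integer> b) { calls.add("partition " + a + " | " + b); }
        public void pauseNode(int nodeId) { calls.add("pause " + nodeId); }
        public void healAll() { calls.add("heal"); }
    }

    /** Composes two faults, injects them, heals, and returns the recorded calls. */
    static List<String> demo() {
        RecordingTarget target = new RecordingTarget();
        Nemesis nemesis = new Combined(List.of(
                new SymmetricPartition(List.of(1), List.of(2, 3)),
                new PauseNode(2)));
        nemesis.inject(target);
        target.healAll();
        return target.calls;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because the interface is sealed, adding a new fault type forces an explicit change to the permits list, which keeps the fault catalogue and any exhaustive switch over it in sync.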

There is no per-scenario file tree under tests/. The harness is exercised by exactly two test classes, both in loom-server test sources:

  • ChaosFrameworkEnhancedTest (no chaos tag) — drives the framework primitives: linearizability register/counter/queue/set checks (positive, violation, and malformed-input cases), the per-model ChaosChecker checkers, every ChaosWorkload generator, ChaosHistory recording/concurrency, ChaosReport summary/latency/fault-timeline output, the ChaosNemesis fault types against a recording cluster double, and one end-to-end ChaosTestHarness run that starts a real 3-node LoomCache cluster.
  • tests/RealClusterLinearizabilityTest (@Tag("chaos")) — starts a real 3-node ChaosRealCluster and asserts register linearizability via the linearizability checker over concurrent leader-local / per-node map operations.
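
As an illustration of what recording a history under concurrency involves, here is a small Java 17+ sketch that wraps each client call with invocation and completion timestamps and stores the result in a thread-safe queue. HistorySketch, Entry, recordCall, and snapshot are hypothetical names and do not describe ChaosHistory's real API; the in-memory counter stands in for a cluster, and the 30-second wait is an assumption that simply mirrors the join ceiling mentioned on this page.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Illustrative sketch only: a hypothetical recorder, not LoomCache's ChaosHistory. */
final class HistorySketch {

    /** One completed operation with its invocation and completion timestamps. */
    record Entry(int clientId, String op, String result, long invokedAtNanos, long completedAtNanos) {}

    // Lock-free queue so concurrent clients can append without blocking each other.
    private final ConcurrentLinkedQueue<Entry> entries = new ConcurrentLinkedQueue<>();

    /** Runs the call and records when it was invoked and when it completed. */
    <T> T recordCall(int clientId, String op, Supplier<T> call) {
        long invoked = System.nanoTime();
        T result = call.get();
        long completed = System.nanoTime();
        entries.add(new Entry(clientId, op, String.valueOf(result), invoked, completed));
        return result;
    }

    /** Immutable copy of everything recorded so far. */
    List<Entry> snapshot() {
        return List.copyOf(entries);
    }

    /** Four concurrent clients each perform 25 increments on a shared in-memory counter. */
    static List<Entry> demo() {
        HistorySketch history = new HistorySketch();
        AtomicInteger counter = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int c = 0; c < 4; c++) {
            final int clientId = c;
            pool.execute(() -> {
                for (int i = 0; i < 25; i++) {
                    history.recordCall(clientId, "incrementAndGet", counter::incrementAndGet);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("clients did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for clients", e);
        }
        return history.snapshot();
    }

    public static void main(String[] args) {
        System.out.println("recorded entries: " + demo().size());
    }
}
```

The two timestamps per entry are what a checker needs later: two operations are concurrent exactly when their invocation-to-completion intervals overlap.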

ChaosNemesis is a sealed interface; all faults compose via ChaosNemesis.Combined:

  • Network: SymmetricPartition, IsolateNode, MessageReorder.
  • Process: KillNode, KillLeader, PauseNode.
  • Resource / timing: ClockSkew (simulates clock-skew effects by calling cluster.pauseNode(); the system clock is not actually changed), SlowDisk (message-send delay), MemoryPressure, CpuContention.

Scope and limits of the tests:

  • The real-cluster runs (ChaosTestHarness / ChaosRealCluster) exercise actual Raft and TCP via real LoomCache instances, which run the WAL and snapshot machinery as part of normal startup.
  • The framework-primitive tests feed hand-built histories straight into the checkers — they verify checker/history logic without touching the network stack.
  • The linearizability checker is a linearization search supporting the register, counter, queue, and set data types only (it rejects any other type) and carries its own internal model state. The search is bounded by MAX_SEARCH_STATES = 500_000; when that limit is reached the checker conservatively reports a violation rather than returning a heuristic pass.
  • The per-model ChaosChecker checkers cover the rest: register, counter, queue, set, and mutual-exclusion models (including fence-token and double-lock checks).
  • RealClusterLinearizabilityTest lives in loom-server test sources (not loom-integration-tests); it boots three real LoomCache nodes per test, reserving fork-scoped TCP ports to stay parallel-safe under the repository Maven defaults.
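
The bounded linearization search described above can be sketched for a single register as follows. This is a generic textbook-style backtracking search written for this page, not LoomCache's checker: the class, the Op record, the Verdict enum, and the check signature are all hypothetical, and the state budget is passed as a parameter rather than read from MAX_SEARCH_STATES. The one behavior it deliberately copies is that exhausting the budget is reported as a violation, never as a pass. It requires Java 17+.

```java
import java.util.List;

/** Illustrative sketch only: a generic bounded search, not LoomCache's checker. */
final class RegisterLinearizabilitySketch {

    enum Kind { READ, WRITE }
    enum Verdict { LINEARIZABLE, VIOLATION, VIOLATION_BUDGET_EXHAUSTED }

    /** One completed operation; for READ, value is the value the client observed. */
    record Op(Kind kind, int value, long invokedAt, long completedAt) {}

    static Op write(int value, long invokedAt, long completedAt) {
        return new Op(Kind.WRITE, value, invokedAt, completedAt);
    }

    static Op read(int observed, long invokedAt, long completedAt) {
        return new Op(Kind.READ, observed, invokedAt, completedAt);
    }

    private final List<Op> ops;
    private final long maxStates;
    private long visited;

    private RegisterLinearizabilitySketch(List<Op> ops, long maxStates) {
        this.ops = ops;
        this.maxStates = maxStates;
    }

    /** Searches for a sequential order that respects real time and register semantics. */
    static Verdict check(List<Op> ops, int initialValue, long maxStates) {
        RegisterLinearizabilitySketch s = new RegisterLinearizabilitySketch(ops, maxStates);
        if (s.search(new boolean[ops.size()], 0, initialValue)) return Verdict.LINEARIZABLE;
        // Running out of budget is reported as a violation, never as a pass.
        return s.visited > maxStates ? Verdict.VIOLATION_BUDGET_EXHAUSTED : Verdict.VIOLATION;
    }

    private boolean search(boolean[] done, int doneCount, int value) {
        if (++visited > maxStates) return false;       // budget exhausted
        if (doneCount == ops.size()) return true;      // every op has been placed
        for (int i = 0; i < ops.size(); i++) {
            if (done[i] || !canGoNext(i, done)) continue;
            Op op = ops.get(i);
            if (op.kind() == Kind.READ && op.value() != value) continue; // read must see current value
            int next = op.kind() == Kind.WRITE ? op.value() : value;
            done[i] = true;
            boolean ok = search(done, doneCount + 1, next);
            done[i] = false;                           // backtrack
            if (ok) return true;
            if (visited > maxStates) return false;     // stop unwinding work once over budget
        }
        return false;
    }

    /** An op may be placed next only if no unplaced op completed before it was invoked. */
    private boolean canGoNext(int i, boolean[] done) {
        long invoked = ops.get(i).invokedAt();
        for (int j = 0; j < ops.size(); j++) {
            if (j != i && !done[j] && ops.get(j).completedAt() < invoked) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Op> overlapping = List.of(write(1, 0, 10), read(1, 5, 15));
        List<Op> staleRead = List.of(write(1, 0, 10), read(0, 20, 30));
        System.out.println(check(overlapping, 0, 500_000));
        System.out.println(check(staleRead, 0, 500_000));
        System.out.println(check(overlapping, 0, 1));
    }
}
```

In main, the overlapping write and read linearize, the read that returns the initial value after the write has completed does not, and the third call shows the conservative outcome when the budget is too small to finish the search.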

Run only one Maven clean/test/verify command per checkout at a time. The clean phase deletes the shared target/ directories, so a concurrent build in the same checkout can lose its outputs mid-run; use a separate worktree for parallel chaos or evidence lanes.

# Framework-primitive + one end-to-end harness run (no chaos tag, runs by default):
./mvnw -pl loom-server -am test -Dtest=ChaosFrameworkEnhancedTest
# Real 3-node cluster register-linearizability run. RealClusterLinearizabilityTest is @Tag("chaos"),
# which the default unit lane excludes (ut.excludedGroups=benchmark,chaos,stress,slow), so opt in
# with -Dgroups=chaos while keeping benchmark, stress, and slow isolated:
./mvnw -pl loom-server -am test -Dgroups=chaos -Dut.excludedGroups=benchmark,stress,slow -Dtest=RealClusterLinearizabilityTest

The framework-primitive checks complete in seconds. The real-cluster runs (ChaosTestHarness inside ChaosFrameworkEnhancedTest, and RealClusterLinearizabilityTest) boot real LoomCache nodes and elect a Raft leader, so they take longer; per-client op joins use a 30 s ceiling.

ChaosReport emits a human-readable summary plus a full history trace for failing runs. Replay the trace through the linearizability checker to reproduce and debug locally.

For the broader correctness story (Raft invariants, durability, near-cache coherence), see the architecture overview.
