So, You Want to Learn More About Deterministic Simulation Testing?

I recently attended BugBash 2025, a software reliability conference organized by Antithesis in Washington, D.C. on April 3-4, 2025. The conference brought together industry experts such as Kyle Kingsbury, Ankush Desai, and Mitchell Hashimoto to discuss building reliable software, and deterministic simulation testing (DST) was a major focus of many sessions and discussions.

One of the highlights for me was having the chance to talk with the Antithesis team and meet some of the original creators of FoundationDB.

What is Deterministic Simulation Testing?

Note: For a deeper dive into this concept and its practical applications, check out my article "What if we embraced simulation-driven development?"

The best description of DST I've found is on FoundationDB's testing page:

The major goal of Simulation is to make sure that we find and diagnose issues in simulation rather than the real world. Simulation runs tens of thousands of simulations every night, each one simulating large numbers of component failures. Based on the volume of tests that we run and the increased intensity of the failures in our scenarios, we estimate that we have run the equivalent of roughly one trillion CPU-hours of simulation on FoundationDB.

Simulation is able to conduct a deterministic simulation of an entire FoundationDB cluster within a single-threaded process. Determinism is crucial in that it allows perfect repeatability of a simulated run, facilitating controlled experiments to home in on issues. The simulation steps through time, synchronized across the system, representing a larger amount of real time in a smaller amount of simulated time. In practice, our simulations usually have about a 10-1 factor of real-to-simulated time, which is advantageous for the efficiency of testing.

We use Simulation to simulate failures modes at the network, machine, and datacenter levels, including connection failures, degradation of machine performance, machine shutdowns or reboots, machines coming back from the dead, etc. We stress-test all of these failure modes, failing machines at very short intervals, inducing unusually severe loads, and delaying communications channels.

Simulation's success has surpassed our expectation and has been vital to our engineering team. It seems unlikely that we would have been able to build FoundationDB without this technology.

After years of operating many Apache-oriented distributed systems, I can confidently say that FoundationDB stands apart in its robustness: I've rarely been paged for it, which speaks volumes about its stability in production. At Clever Cloud, we've even used FoundationDB's simulation framework during our own application development by embedding Rust code inside FDB's simulation environment, so that our custom applications are exercised by the same simulated failures.
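To make the quoted description more concrete, here is a minimal sketch of the core ideas: a single-threaded event loop, a simulated clock, a seeded random number generator, and injected network faults. It is not FoundationDB's simulator or its API. Every name in it (`Simulation`, `run_scenario`, `drop_rate`) is hypothetical and chosen only for illustration.

```python
import heapq
import random


class Simulation:
    """Toy single-threaded event loop with a simulated clock (illustrative only)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # the only source of randomness
        self.now = 0.0                  # simulated time in seconds, not wall-clock time
        self._queue = []                # heap of (fire_time, sequence, callback)
        self._seq = 0                   # tie-breaker so equal times keep a stable order
        self.trace = []                 # record of everything that happened

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        # Jump straight from one event to the next; nothing ever sleeps.
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()
        return self.trace


def run_scenario(seed, messages=20, drop_rate=0.2):
    """Send messages over a simulated network that drops and delays them."""
    sim = Simulation(seed)

    def deliver(i):
        sim.trace.append((round(sim.now, 6), "delivered", i))

    def send(i):
        if sim.rng.random() < drop_rate:
            # Injected fault: the connection loses this message.
            sim.trace.append((round(sim.now, 6), "dropped", i))
            return
        # Injected fault: up to 5 simulated seconds of network latency.
        sim.schedule(sim.rng.uniform(0.001, 5.0), lambda: deliver(i))

    for i in range(messages):
        sim.schedule(i * 1.0, lambda i=i: send(i))
    return sim.run()


if __name__ == "__main__":
    first = run_scenario(seed=42)
    second = run_scenario(seed=42)
    print("same seed, same trace:", first == second)
```

Because all randomness flows through one seeded generator and time only advances when the loop pops the next event, running the same seed twice yields the same trace. That is the property that lets a failing seed be replayed and debugged. A real simulator also has to control every other source of nondeterminism (threads, I/O, system clocks), which this sketch ignores.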

TL;DR

If you only have limited time, here are the three must-watch videos that will give you the best introduction to deterministic simulation testing:

A curated feed of recent articles and blog posts about DST can be found at Planet DST.

Essential Reading

Foundations & Concepts

Language-Specific Implementations

Real-World Case Studies

Talks

Have I missed any important resources on Deterministic Simulation Testing? This field is rapidly evolving, and I'm always looking to expand this collection. If you know of any articles, talks, or tools related to DST that should be included here, please reach out! I'd love to hear about your experiences with deterministic testing as well.

Feel free to react to this article: you can reach me on Twitter or have a look at my website.