<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
      <title>Pierre Zemb&#x27;s Blog</title>
      <link>https://pierrezemb.fr</link>
      <description>Pierre Zemb personal blog</description>
      <generator>Zola</generator>
      <language>en</language>
      <atom:link href="https://pierrezemb.fr/rss.xml" rel="self" type="application/rss+xml"/>
      <lastBuildDate>Wed, 25 Feb 2026 00:00:00 +0000</lastBuildDate>
      <item>
          <title>Building Index-Backed Query Plans in DataFusion</title>
          <pubDate>Wed, 25 Feb 2026 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/datafusion-index-provider/</link>
          <guid>https://pierrezemb.fr/posts/datafusion-index-provider/</guid>
          <description xml:base="https://pierrezemb.fr/posts/datafusion-index-provider/">&lt;p&gt;When you build a system on top of a key-value store like FoundationDB, you eventually need secondary indexes. You create them, you maintain them, and then one day you need to query them. Not just scan a single index, but combine results from multiple indexes: intersect them for AND conditions, union them for OR conditions, and fetch the actual records at the end. That&#x27;s a query engine&#x27;s job. I didn&#x27;t want to write a query engine. But I had to learn how one thinks.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;posts&#x2F;thank-you-datafusion&#x2F;&quot;&gt;Last year&lt;&#x2F;a&gt;, I wrote about integrating DataFusion and mentioned a PoC library for index-backed queries. That PoC has grown into &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;datafusion-contrib&#x2F;datafusion-index-provider&quot;&gt;&lt;code&gt;datafusion-index-provider&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, a real library in &lt;code&gt;datafusion-contrib&lt;&#x2F;code&gt; running in production. Building it meant learning how to construct physical query plans by hand, assembling them from existing DataFusion operators instead of writing execution logic from scratch.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-postgresql-pattern&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-postgresql-pattern&quot; aria-label=&quot;Anchor link for: the-postgresql-pattern&quot;&gt;🔗&lt;&#x2F;a&gt;The PostgreSQL Pattern&lt;&#x2F;h2&gt;
&lt;p&gt;Databases with secondary indexes typically follow the same two-phase pattern. Take &lt;code&gt;SELECT * FROM employees WHERE age &amp;gt; 30&lt;&#x2F;code&gt;. PostgreSQL doesn&#x27;t scan every row. It walks the B-tree index on &lt;code&gt;age&lt;&#x2F;code&gt;, collecting &lt;strong&gt;TIDs&lt;&#x2F;strong&gt; (tuple identifiers) pointing to matching rows. Then it fetches the actual data using those TIDs. &lt;strong&gt;Find where, then fetch what.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For multi-index queries like &lt;code&gt;WHERE age &amp;lt; 25 OR department = &#x27;Sales&#x27;&lt;&#x2F;code&gt;, PostgreSQL uses &lt;strong&gt;bitmap scans&lt;&#x2F;strong&gt;: each Bitmap Index Scan produces a bitmap of matching TIDs, the bitmaps get combined (BitmapOr for union, BitmapAnd for intersection), and a single Bitmap Heap Scan fetches the results. No duplicates, no wasted reads.&lt;&#x2F;p&gt;
&lt;p&gt;This is the pattern I wanted to bring to DataFusion. But DataFusion&#x27;s existing index support (ParquetAccessPlan, zone maps) works at &lt;strong&gt;planning time&lt;&#x2F;strong&gt; on &lt;strong&gt;row groups&lt;&#x2F;strong&gt;. What I needed was OLTP-style indexes that resolve at &lt;strong&gt;execution time&lt;&#x2F;strong&gt; and produce &lt;strong&gt;specific row identifiers&lt;&#x2F;strong&gt;. The DataFusion community has &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;datafusion&#x2F;discussions&#x2F;9963#discussioncomment-6464175&quot;&gt;discussed both approaches&lt;&#x2F;a&gt;. &lt;code&gt;datafusion-index-provider&lt;&#x2F;code&gt; implements the OLTP path.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;primary-keys-as-the-universal-glue&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#primary-keys-as-the-universal-glue&quot; aria-label=&quot;Anchor link for: primary-keys-as-the-universal-glue&quot;&gt;🔗&lt;&#x2F;a&gt;Primary Keys as the Universal Glue&lt;&#x2F;h2&gt;
&lt;p&gt;In PostgreSQL, TIDs connect everything. In my system, that role falls to the &lt;strong&gt;primary key schema&lt;&#x2F;strong&gt;. Every index declares an &lt;code&gt;index_schema()&lt;&#x2F;code&gt; defining the columns that form the row&#x27;s primary key. Could be a single &lt;code&gt;id&lt;&#x2F;code&gt; column, could be a composite &lt;code&gt;(tenant_id, employee_id)&lt;&#x2F;code&gt;. Every index scan produces batches of these primary key columns, every join operates on them, every record fetch consumes them. Because every operator in the pipeline agrees on what a &quot;row identifier&quot; looks like, you can wire up standard DataFusion joins, unions, and aggregations without any custom glue.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;from-filters-to-execution-plans&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#from-filters-to-execution-plans&quot; aria-label=&quot;Anchor link for: from-filters-to-execution-plans&quot;&gt;🔗&lt;&#x2F;a&gt;From Filters to Execution Plans&lt;&#x2F;h2&gt;
&lt;p&gt;The library first converts SQL filters into an intermediate &lt;code&gt;IndexFilter&lt;&#x2F;code&gt; enum: &lt;code&gt;Single&lt;&#x2F;code&gt; (one index handles one filter), &lt;code&gt;And&lt;&#x2F;code&gt; (intersection), or &lt;code&gt;Or&lt;&#x2F;code&gt; (union). Then it recursively builds the physical plan from that intermediate representation.&lt;&#x2F;p&gt;
&lt;p&gt;The library introduces only two custom &lt;code&gt;ExecutionPlan&lt;&#x2F;code&gt; nodes: &lt;code&gt;IndexScanExec&lt;&#x2F;code&gt; (which wraps your index) and &lt;code&gt;RecordFetchExec&lt;&#x2F;code&gt; (which wraps your storage). No custom join logic, no custom dedup, no custom union. Everything in between is standard DataFusion operators wired together.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;single-index&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#single-index&quot; aria-label=&quot;Anchor link for: single-index&quot;&gt;🔗&lt;&#x2F;a&gt;Single Index&lt;&#x2F;h3&gt;
&lt;pre data-lang=&quot;sql&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;SELECT &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; employees &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; age &amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;29
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The simplest case needs just these two custom nodes wired together. &lt;code&gt;IndexScanExec&lt;&#x2F;code&gt; calls &lt;code&gt;index.scan(filters, limit)&lt;&#x2F;code&gt; and streams primary key batches. &lt;code&gt;RecordFetchExec&lt;&#x2F;code&gt; consumes those batches and calls a &lt;code&gt;RecordFetcher&lt;&#x2F;code&gt; to look up complete records.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        flowchart BT
    A[IndexScanExec] --&amp;gt; B[RecordFetchExec]
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;&lt;h3 id=&quot;and-intersection-through-joins&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#and-intersection-through-joins&quot; aria-label=&quot;Anchor link for: and-intersection-through-joins&quot;&gt;🔗&lt;&#x2F;a&gt;AND: Intersection Through Joins&lt;&#x2F;h3&gt;
&lt;pre data-lang=&quot;sql&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;SELECT &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; employees &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; age &amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;25 &lt;&#x2F;span&gt;&lt;span&gt;AND department = &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Engineering&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Both conditions must hold. Each index produces a separate stream of primary keys, and we need their intersection: only keys that appear in both streams. How do you compute an intersection of two streams? That&#x27;s exactly what an &lt;strong&gt;INNER JOIN&lt;&#x2F;strong&gt; does when both sides share the same key columns.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        flowchart BT
    A[IndexScanExec&amp;lt;br&amp;#x2F;&amp;gt;age index] --&amp;gt; C[HashJoinExec&amp;lt;br&amp;#x2F;&amp;gt;INNER on PK columns]
    B[IndexScanExec&amp;lt;br&amp;#x2F;&amp;gt;department index] --&amp;gt; C
    C --&amp;gt; P[ProjectionExec&amp;lt;br&amp;#x2F;&amp;gt;PK columns]
    P --&amp;gt; D[RecordFetchExec]
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;DataFusion ships several join operators; two of them fit an equi-join on primary key columns. &lt;strong&gt;HashJoinExec&lt;&#x2F;strong&gt; works in two phases: it reads the entire left (build) side into memory, constructs a hash table keyed on the join columns, then streams the right (probe) side through, looking up each row&#x27;s key in that table. Matches produce output rows. Memory cost is proportional to the build side, but the probe side streams through with no buffering. The library uses &lt;code&gt;PartitionMode::CollectLeft&lt;&#x2F;code&gt;, which collects the left input into a single partition before building the hash table.&lt;&#x2F;p&gt;
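&lt;p&gt;As a rough mental model of the build and probe phases, here is a plain Python sketch. It is not DataFusion code: the function name &lt;code&gt;hash_intersect&lt;&#x2F;code&gt; and the sample keys are made up for illustration, and the real operator works on Arrow record batches rather than Python tuples.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: plain Python, not DataFusion's HashJoinExec.
def hash_intersect(build_side, probe_side):
    """Yield primary keys present on both sides (inner join on the key)."""
    # Build phase: the whole left input is held in memory.
    table = set(build_side)
    # Probe phase: the right input is streamed one key at a time.
    for key in probe_side:
        if key in table:
            yield key


# Hypothetical composite primary keys: (tenant_id, employee_id).
age_index = [(1, 10), (1, 11), (2, 7)]
dept_index = iter([(2, 7), (1, 11), (3, 1)])
print(list(hash_intersect(age_index, dept_index)))  # [(2, 7), (1, 11)]
```

&lt;p&gt;Only the build side is materialized, which is why the smaller input should sit on the left when its size is known.&lt;&#x2F;p&gt;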
&lt;p&gt;&lt;strong&gt;SortMergeJoinExec&lt;&#x2F;strong&gt; takes a different approach. When both inputs are already sorted on the join keys, it walks both streams in lockstep, comparing keys as it goes. When keys match, it buffers rows sharing that key value and outputs all combinations. For unique primary keys (the common case with index scans), this means constant memory: one row buffered from each side at a time. No hash table, no bulk memory allocation, just two cursors advancing together.&lt;&#x2F;p&gt;
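&lt;p&gt;The two-cursor idea can be sketched in a few lines of Python. This is a simplified illustration that assumes both inputs are sorted ascending and contain unique keys; &lt;code&gt;merge_intersect&lt;&#x2F;code&gt; is a hypothetical name, and the real operator also handles duplicate keys and works batch by batch.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: assumes sorted inputs with unique keys.
def merge_intersect(left, right):
    """Intersect two sorted key streams while holding one key per side."""
    left, right = iter(left), iter(right)
    done = object()  # sentinel marking an exhausted input
    a, b = next(left, done), next(right, done)
    while a is not done and b is not done:
        if a == b:
            yield a
            a, b = next(left, done), next(right, done)
        elif b > a:
            a = next(left, done)   # left cursor is behind, advance it
        else:
            b = next(right, done)  # right cursor is behind, advance it


print(list(merge_intersect([1, 3, 5, 7], [2, 3, 4, 7, 9])))  # [3, 7]
```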
&lt;p&gt;How does the library choose? If both indexes report sorted output via &lt;code&gt;is_ordered()&lt;&#x2F;code&gt;, it picks SortMergeJoin. Otherwise, HashJoin. For ordered key-value stores like FoundationDB, indexes naturally return sorted keys, so SortMergeJoin is the common path in practice.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s a wrinkle after the join. An inner join on column &lt;code&gt;id&lt;&#x2F;code&gt; from both sides produces output with columns &lt;code&gt;(id_left, id_right)&lt;&#x2F;code&gt;, but downstream operators expect just &lt;code&gt;(id)&lt;&#x2F;code&gt;. A &lt;code&gt;ProjectionExec&lt;&#x2F;code&gt; after the join strips the duplicates back to the primary key schema. This matters because when three or more indexes are involved, the library builds a &lt;strong&gt;left-deep join tree&lt;&#x2F;strong&gt;: join the first two, project back to the primary key schema, then join that result with the third, and so on. Each join progressively narrows the result set, and the projection keeps the schema clean between steps.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;or-union-with-deduplication&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#or-union-with-deduplication&quot; aria-label=&quot;Anchor link for: or-union-with-deduplication&quot;&gt;🔗&lt;&#x2F;a&gt;OR: Union with Deduplication&lt;&#x2F;h3&gt;
&lt;pre data-lang=&quot;sql&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;SELECT &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; employees &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; age &amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;25 &lt;&#x2F;span&gt;&lt;span&gt;OR department = &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Sales&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A row matches if either condition is true, but a row satisfying both should appear exactly once. The plan needs to combine both index results and deduplicate before fetching.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        flowchart BT
    A[IndexScanExec&amp;lt;br&amp;#x2F;&amp;gt;age index] --&amp;gt; C[UnionExec]
    B[IndexScanExec&amp;lt;br&amp;#x2F;&amp;gt;department index] --&amp;gt; C
    C --&amp;gt; D[AggregateExec&amp;lt;br&amp;#x2F;&amp;gt;GROUP BY PK columns]
    D --&amp;gt; E[RecordFetchExec]
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;Each index scan feeds into DataFusion&#x27;s &lt;code&gt;UnionExec&lt;&#x2F;code&gt;, which concatenates streams with zero-copy partition pass-through. But a row matching both conditions appears twice, once from each index. The deduplication step uses DataFusion&#x27;s &lt;code&gt;AggregateExec&lt;&#x2F;code&gt; with a &lt;code&gt;GROUP BY&lt;&#x2F;code&gt; on all primary key columns. AggregateExec maintains a hash table mapping group key values to group indices. For pure dedup (no aggregate functions, just GROUP BY), it&#x27;s essentially a hash set of seen primary keys. When memory pressure exceeds limits, it spills groups to disk in Arrow IPC format and merges them back later.&lt;&#x2F;p&gt;
&lt;p&gt;Why not write a custom dedup node? Because &lt;code&gt;AggregateExec&lt;&#x2F;code&gt; already handles hash-based grouping, memory tracking against DataFusion&#x27;s memory pool, and spill-to-disk. Writing a custom dedup operator would mean reimplementing all of that. The library&#x27;s philosophy is to construct a query plan that DataFusion already knows how to execute, not to reinvent execution primitives.&lt;&#x2F;p&gt;
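&lt;p&gt;To make the union and dedup semantics concrete, here is a minimal Python sketch of what the two operators compute together. It is an in-memory illustration under the assumption that everything fits in RAM, so it shows none of the memory tracking or spilling described above; &lt;code&gt;union_dedup&lt;&#x2F;code&gt; is a hypothetical name.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: no memory accounting, no spill-to-disk.
from itertools import chain


def union_dedup(*scans):
    """Concatenate key streams and emit each primary key once."""
    seen = set()
    for key in chain(*scans):  # UnionExec analogue: concatenate the streams
        if key not in seen:    # AggregateExec analogue: GROUP BY, no aggregates
            seen.add(key)
            yield key


age_keys = [1, 2, 3]    # hypothetical keys from the age index
dept_keys = [3, 4, 1]   # hypothetical keys from the department index
print(list(union_dedup(age_keys, dept_keys)))  # [1, 2, 3, 4]
```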
&lt;h3 id=&quot;combining-and-and-or&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#combining-and-and-or&quot; aria-label=&quot;Anchor link for: combining-and-and-or&quot;&gt;🔗&lt;&#x2F;a&gt;Combining AND and OR&lt;&#x2F;h3&gt;
&lt;pre data-lang=&quot;sql&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;SELECT &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; employees
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; (age &amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30 &lt;&#x2F;span&gt;&lt;span&gt;AND department = &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Engineering&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;)
&lt;&#x2F;span&gt;&lt;span&gt;   OR (age &amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;25 &lt;&#x2F;span&gt;&lt;span&gt;AND department = &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Sales&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;IndexFilter&lt;&#x2F;code&gt; tree for this query is an &lt;code&gt;Or&lt;&#x2F;code&gt; of two &lt;code&gt;And&lt;&#x2F;code&gt; branches. Each &lt;code&gt;And&lt;&#x2F;code&gt; branch becomes a join subtree (two IndexScanExec nodes joined on primary key columns), and the two subtrees feed into a UnionExec + AggregateExec for deduplication, just like a simple OR.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        flowchart BT
    A1[IndexScanExec&amp;lt;br&amp;#x2F;&amp;gt;age &amp;gt; 30] --&amp;gt; J1[HashJoinExec&amp;lt;br&amp;#x2F;&amp;gt;INNER on PK]
    B1[IndexScanExec&amp;lt;br&amp;#x2F;&amp;gt;dept = Engineering] --&amp;gt; J1
    J1 --&amp;gt; P1[ProjectionExec]
    A2[IndexScanExec&amp;lt;br&amp;#x2F;&amp;gt;age &amp;lt; 25] --&amp;gt; J2[HashJoinExec&amp;lt;br&amp;#x2F;&amp;gt;INNER on PK]
    B2[IndexScanExec&amp;lt;br&amp;#x2F;&amp;gt;dept = Sales] --&amp;gt; J2
    J2 --&amp;gt; P2[ProjectionExec]
    P1 --&amp;gt; U[UnionExec]
    P2 --&amp;gt; U
    U --&amp;gt; AG[AggregateExec&amp;lt;br&amp;#x2F;&amp;gt;GROUP BY PK columns]
    AG --&amp;gt; RF[RecordFetchExec]
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;Joins for AND and union + dedup for OR compose naturally into nested plans. The library doesn&#x27;t need special handling for nested expressions. It recurses down the &lt;code&gt;IndexFilter&lt;&#x2F;code&gt; tree, builds the appropriate subtree for each node, and DataFusion executes the whole thing as one pipeline.&lt;&#x2F;p&gt;
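&lt;p&gt;The recursion is easier to see in a toy model. The Python sketch below mirrors the three &lt;code&gt;IndexFilter&lt;&#x2F;code&gt; variants with hypothetical classes and returns nested tuples instead of real &lt;code&gt;ExecutionPlan&lt;&#x2F;code&gt; nodes. It omits predicates and the choice between hash and sort-merge joins, so treat it as a picture of the shape of the algorithm, not as the library&#x27;s code.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: hypothetical stand-ins for the IndexFilter variants.
from dataclasses import dataclass


@dataclass
class Single:
    index: str  # name of the index that serves this filter


@dataclass
class And:
    children: list


@dataclass
class Or:
    children: list


def build(node):
    """Return a nested-tuple plan that produces primary keys for `node`."""
    if isinstance(node, Single):
        return ("IndexScanExec", node.index)
    plans = [build(child) for child in node.children]
    if isinstance(node, And):
        plan = plans[0]
        for right in plans[1:]:
            # Left-deep tree: join, then project back to the primary key schema.
            plan = ("ProjectionExec", ("JoinExec", plan, right))
        return plan
    if isinstance(node, Or):
        return ("AggregateExec", ("UnionExec", *plans))
    raise TypeError(f"unknown filter node: {node!r}")


def plan_query(filter_tree):
    """Wrap the key-producing plan in the final record fetch."""
    return ("RecordFetchExec", build(filter_tree))


tree = Or([
    And([Single("age"), Single("department")]),
    And([Single("age"), Single("department")]),
])
print(plan_query(tree))
```

&lt;p&gt;Every subtree returns the same primary key schema, which is what lets the &lt;code&gt;And&lt;&#x2F;code&gt; and &lt;code&gt;Or&lt;&#x2F;code&gt; cases nest without special handling.&lt;&#x2F;p&gt;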
&lt;h2 id=&quot;limitations-and-what-s-next&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#limitations-and-what-s-next&quot; aria-label=&quot;Anchor link for: limitations-and-what-s-next&quot;&gt;🔗&lt;&#x2F;a&gt;Limitations and What&#x27;s Next&lt;&#x2F;h2&gt;
&lt;p&gt;The filter analysis has one important simplification: if any part of an AND&#x2F;OR expression can&#x27;t be handled by an index, the entire expression falls back to a regular scan. Consider &lt;code&gt;WHERE age &amp;gt; 30 AND color = &#x27;blue&#x27;&lt;&#x2F;code&gt; with an index on &lt;code&gt;age&lt;&#x2F;code&gt; but not &lt;code&gt;color&lt;&#x2F;code&gt;. A smarter approach would use the age index then scan-filter for color, but mixing index-backed and scan-based execution paths complicates plan construction, especially when AND&#x2F;OR expressions are nested. For v1, the clean boundary keeps things correct. Partial index usage is on the roadmap, along with &lt;strong&gt;projection pushdown&lt;&#x2F;strong&gt; into the fetch phase and &lt;strong&gt;multi-partition execution&lt;&#x2F;strong&gt; for parallelism. Each is an opportunity for &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;datafusion-contrib&#x2F;datafusion-index-provider&quot;&gt;contribution&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;try-it&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#try-it&quot; aria-label=&quot;Anchor link for: try-it&quot;&gt;🔗&lt;&#x2F;a&gt;Try It&lt;&#x2F;h2&gt;
&lt;p&gt;If you&#x27;re building a system that needs secondary index queries over your own storage, give &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;datafusion-contrib&#x2F;datafusion-index-provider&quot;&gt;&lt;code&gt;datafusion-index-provider&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; a try. The &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;datafusion-contrib&#x2F;datafusion-index-provider&#x2F;tree&#x2F;main&#x2F;tests&#x2F;common&quot;&gt;tests directory&lt;&#x2F;a&gt; has reference implementations for both single-column and composite primary keys.&lt;&#x2F;p&gt;
&lt;p&gt;This library only works because DataFusion&#x27;s architecture is genuinely composable. Special thanks to &lt;strong&gt;Andrew Lamb&lt;&#x2F;strong&gt;, whose work on DataFusion and the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;datafusion&#x2F;issues&#x2F;6782&quot;&gt;architecture paper&lt;&#x2F;a&gt; has been instrumental. HashJoinExec, SortMergeJoinExec, AggregateExec, UnionExec, ProjectionExec: all ready to be wired up into whatever query plan your system needs. Do you have a custom storage layer that could benefit from secondary index queries?&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with DataFusion. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">rust</category>
          <category domain="tag">datafusion</category>
          <category domain="tag">sql</category>
          <category domain="tag">query-engine</category>
          <category domain="tag">databases</category>
          <category domain="tag">distributed-systems</category>
      </item>
      <item>
          <title>Simulating Leader Election on top of FoundationDB</title>
          <pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/simulating-leader-election-on-foundationdb/</link>
          <guid>https://pierrezemb.fr/posts/simulating-leader-election-on-foundationdb/</guid>
          <description xml:base="https://pierrezemb.fr/posts/simulating-leader-election-on-foundationdb/">&lt;p&gt;People are right to fear LLM-generated code. The models hallucinate APIs, miss edge cases, and produce code that looks correct but fails under pressure. For distributed systems, where bugs hide behind race conditions and network partitions, the stakes are even higher. A subtle leader election bug can cause split-brain, data corruption, or cascading failures across your cluster.&lt;&#x2F;p&gt;
&lt;p&gt;But what if you could give an LLM &lt;a href=&quot;&#x2F;posts&#x2F;llms-for-engineering&#x2F;#feedback-loops&quot;&gt;the right feedback loop&lt;&#x2F;a&gt;? Not just &quot;write me leader election&quot; but &quot;write me leader election, and here&#x27;s how I&#x27;ll prove it correct.&quot; That feedback loop is simulation. I decided to test this idea by building a leader election recipe for &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&quot;&gt;foundationdb-rs&lt;&#x2F;a&gt;, with invariants generated by Claude Code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-algorithm&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-algorithm&quot; aria-label=&quot;Anchor link for: the-algorithm&quot;&gt;🔗&lt;&#x2F;a&gt;The Algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;The recipe uses a ballot-based approach, similar to Raft&#x27;s term concept but simpler. The algorithm is based on &lt;a href=&quot;https:&#x2F;&#x2F;inria.hal.science&#x2F;hal-01775025v1&#x2F;document&quot;&gt;&quot;Leader Election Using NewSQL Database Systems&quot;&lt;&#x2F;a&gt; (Ismail et al., DAIS 2015), which uses a database as distributed shared memory. Our implementation differs by leveraging FDB&#x27;s &lt;a href=&quot;&#x2F;posts&#x2F;fdb-transaction-model-for-layer-engineers&#x2F;&quot;&gt;strictly serializable transactions&lt;&#x2F;a&gt; and versionstamps instead of timestamps. Leaders hold time-bounded leases. Ballots are monotonically increasing fencing tokens that prevent stale leaders from corrupting state. FDB&#x27;s serializable transactions handle mutual exclusion without explicit locking.&lt;&#x2F;p&gt;
&lt;p&gt;The key insight is storing the leader state at a &lt;strong&gt;single key&lt;&#x2F;strong&gt;, which makes every operation O(1) instead of an O(N) scan over candidates. Designs that derive the leader from the candidate list have to scan all candidates to determine who&#x27;s in charge: with N candidates, that&#x27;s N reads per query. Our approach stores the current leader explicitly, so checking who&#x27;s leader is a single read. Claiming leadership is a single write with a conflict check. No quorum is needed because FDB&#x27;s serializable transactions already provide the coordination guarantees we need.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        sequenceDiagram
    participant A as Process A
    participant FDB as FoundationDB
    participant B as Process B

    A-&amp;gt;&amp;gt;FDB: read leader key
    FDB--&amp;gt;&amp;gt;A: ballot=5, lease expired
    B-&amp;gt;&amp;gt;FDB: read leader key
    FDB--&amp;gt;&amp;gt;B: ballot=5, lease expired

    A-&amp;gt;&amp;gt;FDB: write ballot=6
    B-&amp;gt;&amp;gt;FDB: write ballot=6

    FDB--&amp;gt;&amp;gt;A: COMMIT OK
    FDB--&amp;gt;&amp;gt;B: CONFLICT!

    Note over A: Becomes Leader
    Note over B: Retries, sees ballot=6, backs off
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;The diagram shows what happens when two processes try to claim leadership simultaneously. Both read the same expired leader state. Both try to write ballot=6. FDB&#x27;s serializable transactions guarantee only one succeeds. The loser gets a conflict error, retries, sees ballot=6 with an active lease, and backs off. No split-brain possible. This is the fundamental safety property that the entire algorithm rests on.&lt;&#x2F;p&gt;
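&lt;p&gt;The race in the diagram can be reproduced with a toy model. The Python sketch below replaces FoundationDB with a single in-memory key guarded by a version counter, which is a deliberate simplification of how FDB detects conflicts. &lt;code&gt;ToyStore&lt;&#x2F;code&gt;, &lt;code&gt;Conflict&lt;&#x2F;code&gt; and &lt;code&gt;try_claim&lt;&#x2F;code&gt; are hypothetical names, not part of foundationdb-rs.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: a toy single-key store, not the FDB client API.
class Conflict(Exception):
    """Raised when the key changed between read and commit."""


class ToyStore:
    def __init__(self, ballot, lease_active):
        self.version = 0
        self.value = {"ballot": ballot, "leader": None, "lease_active": lease_active}

    def read(self):
        return self.version, dict(self.value)

    def commit(self, read_version, new_value):
        # Optimistic check: reject the write if someone committed in between.
        if read_version != self.version:
            raise Conflict()
        self.version += 1
        self.value = new_value


def try_claim(store, me, snapshot=None):
    """Claim leadership if the lease expired; return True on success."""
    if snapshot is None:
        snapshot = store.read()
    read_version, state = snapshot
    if state["lease_active"]:
        return False  # a leader holds a live lease: back off
    claim = {"ballot": state["ballot"] + 1, "leader": me, "lease_active": True}
    store.commit(read_version, claim)
    return True


store = ToyStore(ballot=5, lease_active=False)
snap_a, snap_b = store.read(), store.read()  # both see ballot=5, lease expired
print("A claims:", try_claim(store, "A", snap_a))  # A claims: True
try:
    try_claim(store, "B", snap_b)
except Conflict:
    # B retries with a fresh read, sees ballot=6 with an active lease.
    print("B retries:", try_claim(store, "B"))  # B retries: False
print(store.value["leader"], store.value["ballot"])  # A 6
```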
&lt;p&gt;When a process wants to become leader, it first registers as a candidate. The registration uses the &lt;code&gt;SetVersionstampedValue&lt;&#x2F;code&gt; atomic operation, so FDB itself assigns a versionstamp when the transaction commits, giving the process a globally ordered identity that no other process can have. The versionstamp never changes throughout the process&#x27;s lifetime, even as the process sends heartbeats and refreshes its registration. This immutability is important: the process&#x27;s identity is stable and can be used reliably in log replay during verification.&lt;&#x2F;p&gt;
&lt;p&gt;Then the process enters a loop: send heartbeats to prove liveness, try to claim leadership if the current lease expired, and optionally resign if graceful handoff is needed. Each leadership claim increments the ballot number. The new leader stores its versionstamp in the leader state, so followers can verify they&#x27;re talking to the actual leader. The full implementation lives in the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;tree&#x2F;main&#x2F;foundationdb&#x2F;src&#x2F;recipes&#x2F;leader_election&quot;&gt;foundationdb-rs recipes module&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-simulation-workload&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-simulation-workload&quot; aria-label=&quot;Anchor link for: the-simulation-workload&quot;&gt;🔗&lt;&#x2F;a&gt;The Simulation Workload&lt;&#x2F;h2&gt;
&lt;p&gt;Simulation on top of FoundationDB may feel redundant. FDB itself has been validated through &lt;a href=&quot;&#x2F;posts&#x2F;diving-into-foundationdb-simulation&#x2F;&quot;&gt;a trillion CPU-hours of simulation&lt;&#x2F;a&gt;. But FDB&#x27;s resiliency doesn&#x27;t mean your code is resilient. Your layer sits on top of FDB. Your transaction logic, your conflict handling, your retry behavior. How does your code respond when FDB is unhealthy?&lt;&#x2F;p&gt;
&lt;p&gt;Simulation introduces chaos to answer that question. Network partitions, process crashes, clock skew up to ±1 second. Same seed, same execution, same bugs. Deterministic replay.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Phase&lt;&#x2F;th&gt;&lt;th&gt;Who&lt;&#x2F;th&gt;&lt;th&gt;What&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Setup&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Client 0&lt;&#x2F;td&gt;&lt;td&gt;Initialize election, register all candidates&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Chaos&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;All clients&lt;&#x2F;td&gt;&lt;td&gt;Loop: heartbeat → try claim → maybe resign (10%)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Check&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Client 0&lt;&#x2F;td&gt;&lt;td&gt;Read logs, snapshot state, run 13 invariants&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;FDB&#x27;s simulator runs built-in chaos workloads alongside ours: &lt;strong&gt;RandomClogging&lt;&#x2F;strong&gt; injects network partitions, &lt;strong&gt;Attrition&lt;&#x2F;strong&gt; kills and reboots processes. Our workload adds clock skew simulation up to ±1 second.&lt;&#x2F;p&gt;
&lt;p&gt;Each operation logs its intent atomically in the same transaction, using the pattern from FDB&#x27;s own &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;main&#x2F;fdbserver&#x2F;workloads&#x2F;AtomicOps.actor.cpp&quot;&gt;AtomicOps workload&lt;&#x2F;a&gt;. The leader election mutation and its log entry are written in a single transaction, with the log entry going through the &lt;code&gt;SetVersionstampedKey&lt;&#x2F;code&gt; &lt;a href=&quot;&#x2F;posts&#x2F;fdb-transaction-model-for-layer-engineers&#x2F;#atomic-operations-writing-without-reading&quot;&gt;atomic operation&lt;&#x2F;a&gt;. If the transaction commits, both succeed. If it aborts, neither does. This gives us a proper write-ahead log without the complexity of two-phase commit. The log key uses a versionstamp prefix for true FDB commit ordering. No clock skew ambiguity.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;how-atomic-logging-works&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-atomic-logging-works&quot; aria-label=&quot;Anchor link for: how-atomic-logging-works&quot;&gt;🔗&lt;&#x2F;a&gt;How Atomic Logging Works&lt;&#x2F;h3&gt;
&lt;p&gt;Each operation writes a log entry in the same transaction as the operation itself. The log key uses a versionstamp prefix, so entries sort by true FDB commit order, not by wall clock (which can drift ±1 second under simulation).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;Log Key Structure:
&lt;&#x2F;span&gt;&lt;span&gt;┌─────────────────────┬───────────┬────────┐
&lt;&#x2F;span&gt;&lt;span&gt;│ versionstamp (10B)  │ client_id │ op_num │
&lt;&#x2F;span&gt;&lt;span&gt;└─────────────────────┴───────────┴────────┘
&lt;&#x2F;span&gt;&lt;span&gt;         ↑
&lt;&#x2F;span&gt;&lt;span&gt;   Assigned by FDB at commit time
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
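&lt;p&gt;Why a versionstamp prefix gives commit ordering can be checked with a few bytes. In the Python sketch below, the 10-byte versionstamp is faked locally as an 8-byte commit version plus a 2-byte batch order; in a real cluster FDB fills it in at commit time. The 4-byte big-endian widths for &lt;code&gt;client_id&lt;&#x2F;code&gt; and &lt;code&gt;op_num&lt;&#x2F;code&gt; are an assumption made for this example, as are the helper names.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: field widths after the versionstamp are assumed.
import struct


def fake_versionstamp(commit_version: int, batch_order: int = 0) -> bytes:
    # 8-byte commit version + 2-byte order within the commit batch, big-endian.
    return struct.pack(">QH", commit_version, batch_order)


def log_key(versionstamp: bytes, client_id: int, op_num: int) -> bytes:
    if len(versionstamp) != 10:
        raise ValueError("versionstamp must be 10 bytes")
    # Big-endian integers keep byte-wise key order equal to numeric order.
    return versionstamp + struct.pack(">II", client_id, op_num)


keys = [
    log_key(fake_versionstamp(7), client_id=2, op_num=0),
    log_key(fake_versionstamp(4), client_id=0, op_num=3),
    log_key(fake_versionstamp(5), client_id=1, op_num=1),
]
# Sorting the raw bytes recovers commit order, whatever the client ids are.
print([struct.unpack(">QHII", key)[0] for key in sorted(keys)])  # [4, 5, 7]
```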
&lt;p&gt;Here&#x27;s what a typical log might look like after chaos:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Versionstamp&lt;&#x2F;th&gt;&lt;th&gt;Client&lt;&#x2F;th&gt;&lt;th&gt;Op&lt;&#x2F;th&gt;&lt;th&gt;Result&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0x0001...&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;Register&lt;&#x2F;td&gt;&lt;td&gt;✓&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0x0002...&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;Register&lt;&#x2F;td&gt;&lt;td&gt;✓&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0x0003...&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;Register&lt;&#x2F;td&gt;&lt;td&gt;✓&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0x0004...&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;TryBecomeLeader&lt;&#x2F;td&gt;&lt;td&gt;✓ became leader, ballot=1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0x0005...&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;TryBecomeLeader&lt;&#x2F;td&gt;&lt;td&gt;✗ conflict&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0x0006...&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;Heartbeat&lt;&#x2F;td&gt;&lt;td&gt;✓&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0x0007...&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;TryBecomeLeader&lt;&#x2F;td&gt;&lt;td&gt;✓ became leader, ballot=2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0x0008...&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;Resign&lt;&#x2F;td&gt;&lt;td&gt;✓&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Client 0 became leader with ballot 1. Client 1&#x27;s claim failed (conflict). Client 2 later claimed ballot 2 when client 0&#x27;s lease expired, then resigned. The versionstamp ordering is ground truth: no matter how skewed each client&#x27;s clock was, the log shows exactly what committed and in what order.&lt;&#x2F;p&gt;
&lt;p&gt;During &lt;strong&gt;check&lt;&#x2F;strong&gt;, client 0 reads all logs and database state in a single snapshot, then runs the invariants. Same seed, same execution, same bugs. Deterministic replay.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-invariants&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-invariants&quot; aria-label=&quot;Anchor link for: the-invariants&quot;&gt;🔗&lt;&#x2F;a&gt;The Invariants&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s where Claude Code enters the story. I asked it to generate invariants for leader election validation. I didn&#x27;t give it a detailed specification. I pointed it at my &lt;a href=&quot;&#x2F;posts&#x2F;writing-rust-fdb-workloads-that-find-bugs&#x2F;&quot;&gt;previous post about designing workloads that find bugs&lt;&#x2F;a&gt; and said &quot;apply these patterns to leader election.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;It generated &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;blob&#x2F;main&#x2F;foundationdb-recipes-simulation&#x2F;README.md&quot;&gt;13 invariants&lt;&#x2F;a&gt;. Here are the seven most important ones:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Invariant&lt;&#x2F;th&gt;&lt;th&gt;What&lt;&#x2F;th&gt;&lt;th&gt;Why&lt;&#x2F;th&gt;&lt;th&gt;How&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;DualPathValidation&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Expected state from log replay must match actual FDB snapshot&lt;&#x2F;td&gt;&lt;td&gt;The keystone check. If logs say process A is leader with ballot 7, but FDB shows process B with ballot 6, something corrupted state&lt;&#x2F;td&gt;&lt;td&gt;Replay all logged operations in versionstamp order to compute expected leader&#x2F;ballot, then compare with actual database state&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;FencingTokenMonotonicity&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Each successful claim must have a strictly higher ballot than the previous&lt;&#x2F;td&gt;&lt;td&gt;When a network partition heals, an old leader might try to act with a stale ballot. This catches that write&lt;&#x2F;td&gt;&lt;td&gt;For each claim in the log, verify ballot &amp;gt; previous_ballot from the same transaction&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;OneValuePerBallot&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Each ballot number maps to exactly one client&lt;&#x2F;td&gt;&lt;td&gt;Two clients claiming the same ballot means either conflict detection failed or ballot increment is broken. Classic split-brain symptom&lt;&#x2F;td&gt;&lt;td&gt;Scan all claims, verify no two different clients ever claimed the same ballot number&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;LeaderIsCandidate&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Current leader must exist in the candidates registry&lt;&#x2F;td&gt;&lt;td&gt;Edge cases like crash-during-registration or eviction-while-claiming can leave orphaned leaders&lt;&#x2F;td&gt;&lt;td&gt;Read leader state, verify a matching candidate entry exists&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;NoOverlappingLeadership&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Every leadership claim has a globally unique versionstamp&lt;&#x2F;td&gt;&lt;td&gt;FDB serializes commits globally. Duplicate versionstamps mean either a logging bug or actual split-brain&lt;&#x2F;td&gt;&lt;td&gt;Collect all claim versionstamps, verify no duplicates&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GlobalBallotSuccession&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Each new leader must have ballot &amp;gt; previous leader&#x27;s ballot&lt;&#x2F;td&gt;&lt;td&gt;Catches state regression after partition heals. An old leader can&#x27;t &quot;go back&quot; to a stale ballot&lt;&#x2F;td&gt;&lt;td&gt;Track previous_ballot in log entries, verify new_ballot &amp;gt; previous_ballot for every transition&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;LeaseExpiryAfterClaim&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;Lease must expire after the claim timestamp&lt;&#x2F;td&gt;&lt;td&gt;Clock skew can cause a leader to claim with an already-expired lease. This catches incorrect lease calculation or extreme clock drift&lt;&#x2F;td&gt;&lt;td&gt;For each claim, verify lease_expiry_nanos &amp;gt; claim_timestamp_nanos&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
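&lt;p&gt;To make the table concrete, here is a minimal sketch of three of these checks (FencingTokenMonotonicity, GlobalBallotSuccession and OneValuePerBallot) written in Python rather than in the Rust of the actual workload. The &lt;code&gt;Claim&lt;&#x2F;code&gt; record, its field names and the integer versionstamps are assumptions made for illustration, not the log format used in foundationdb-rs.&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One successful leadership claim, as a hypothetical log entry."""
    versionstamp: int      # commit order assigned by FDB (simplified to an int)
    client: int
    previous_ballot: int   # ballot observed inside the claiming transaction
    ballot: int            # ballot written by the claim


def check_invariants(claims):
    """Return a list of violation messages; an empty list means every check passed."""
    violations = []
    owner_by_ballot = {}
    last_ballot = 0
    # Versionstamp order is the ground truth, not client clocks.
    for claim in sorted(claims, key=lambda c: c.versionstamp):
        # FencingTokenMonotonicity: beat the ballot read in the same transaction.
        if not claim.ballot > claim.previous_ballot:
            violations.append(f"fencing: ballot {claim.ballot} not above {claim.previous_ballot}")
        # GlobalBallotSuccession: beat the previous leader's ballot.
        if not claim.ballot > last_ballot:
            violations.append(f"succession: ballot {claim.ballot} not above {last_ballot}")
        # OneValuePerBallot: a ballot number belongs to exactly one client.
        owner = owner_by_ballot.setdefault(claim.ballot, claim.client)
        if owner != claim.client:
            violations.append(f"split-brain: ballot {claim.ballot} claimed by {owner} and {claim.client}")
        last_ballot = max(last_ballot, claim.ballot)
    return violations


healthy = [Claim(4, 0, 0, 1), Claim(7, 2, 1, 2)]
broken = healthy + [Claim(9, 1, 1, 2)]   # client 1 reuses ballot 2
print(check_invariants(healthy))
print(check_invariants(broken))
```

&lt;p&gt;The healthy log mirrors the table from the chaos run above. The broken one adds a second client claiming ballot 2, which trips both the succession and the split-brain checks.&lt;&#x2F;p&gt;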
&lt;h2 id=&quot;what-s-next&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-s-next&quot; aria-label=&quot;Anchor link for: what-s-next&quot;&gt;🔗&lt;&#x2F;a&gt;What&#x27;s Next&lt;&#x2F;h2&gt;
&lt;p&gt;The leader election recipe has just been merged into foundationdb-rs as experimental. I haven&#x27;t run it in production yet. The entire development was simulation-driven: write code, run simulation, fix what breaks, repeat. The simulation runs in CI on every PR.&lt;&#x2F;p&gt;
&lt;p&gt;Simulation gives us evidence that the code upholds its rules even when FDB is tortured: network partitions, clock skew, process crashes. The invariants verify that safety properties hold through all of it. Not a proof of correctness, but confidence earned through chaos.&lt;&#x2F;p&gt;
&lt;p&gt;The invariants that Claude Code generated encode patterns from prior simulation work: dual-path validation from AtomicOps, fencing tokens from the literature, lease checks tied to clock skew. What the LLM provided was speed: weeks of invariant development compressed into hours of review. The LLM proposes, simulation disposes.&lt;&#x2F;p&gt;
&lt;p&gt;If you have ideas for new tortures or invariants, the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;tree&#x2F;main&#x2F;foundationdb-recipes-simulation&quot;&gt;simulation code&lt;&#x2F;a&gt; is open. Pull requests welcome.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with questions or to share your experiences with leader election and simulation testing. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">foundationdb</category>
          <category domain="tag">rust</category>
          <category domain="tag">simulation</category>
          <category domain="tag">distributed-systems</category>
          <category domain="tag">consensus</category>
          <category domain="tag">leader-election</category>
      </item>
      <item>
          <title>FoundationDB&#x27;s Transaction Model for Layer Engineers</title>
          <pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/fdb-transaction-model-for-layer-engineers/</link>
          <guid>https://pierrezemb.fr/posts/fdb-transaction-model-for-layer-engineers/</guid>
          <description xml:base="https://pierrezemb.fr/posts/fdb-transaction-model-for-layer-engineers/">&lt;p&gt;FoundationDB gives you serializable transactions with external consistency, automatic sharding, and fault tolerance. But once your first layer hits production under real load, you start seeing transaction conflicts you don&#x27;t understand. The logic looks correct: read a key, check a condition, write the result. Under load, conflicts pile up and throughput collapses.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-occ-works&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-occ-works&quot; aria-label=&quot;Anchor link for: how-occ-works&quot;&gt;🔗&lt;&#x2F;a&gt;How OCC Works&lt;&#x2F;h2&gt;
&lt;p&gt;FoundationDB implements these guarantees using &lt;strong&gt;Optimistic Concurrency Control&lt;&#x2F;strong&gt; (OCC). Your transaction runs without holding any locks. It reads from a consistent snapshot, does its work, and at commit time the system checks whether anything you read was modified by another transaction since you started. If yes, your transaction is aborted and retried. If no, it commits atomically.&lt;&#x2F;p&gt;
&lt;p&gt;All writes are buffered locally in the client until commit. Nothing goes to the cluster while your transaction is running. At commit time, the client sends the buffered writes and the read&#x2F;write conflict sets to the Resolver in a single request. A read-only transaction that calls commit is mostly a no-op: the network thread checks there are no writes to send and skips the round-trip.&lt;&#x2F;p&gt;
&lt;p&gt;No locks means no waiting, but it also means your reads and writes play different roles in conflict detection. &lt;strong&gt;Your reads determine whether YOU can conflict. Your writes determine what OTHER transactions will conflict with.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A read-only transaction never conflicts. It observes a snapshot and goes away. A write-only transaction also never conflicts. It blindly sets keys and commits. Only transactions that both read and write can fail. When they do, your writes don&#x27;t cause your conflicts. Your reads do. The writes cause problems for future transactions, but your transaction was doomed the moment you issued reads on keys that someone else was modifying. Every time you add a read to a transaction, ask yourself: &lt;strong&gt;do I actually need to conflict on this?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;read-version-commit-version-and-the-window-of-vulnerability&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#read-version-commit-version-and-the-window-of-vulnerability&quot; aria-label=&quot;Anchor link for: read-version-commit-version-and-the-window-of-vulnerability&quot;&gt;🔗&lt;&#x2F;a&gt;Read Version, Commit Version, and the Window of Vulnerability&lt;&#x2F;h2&gt;
&lt;p&gt;When your transaction starts, it obtains a &lt;strong&gt;read version&lt;&#x2F;strong&gt; from the cluster. All your reads see a consistent snapshot frozen at that version. When you commit, your transaction gets a &lt;strong&gt;commit version&lt;&#x2F;strong&gt;, guaranteed to be higher. Between these two versions lies what I call &lt;strong&gt;the Window of Vulnerability&lt;&#x2F;strong&gt;: any key you read that was modified by another committed transaction within this window will cause your transaction to abort.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;read version                                          commit version
&lt;&#x2F;span&gt;&lt;span&gt;     │                                                      │
&lt;&#x2F;span&gt;&lt;span&gt;     ▼                                                      ▼
&lt;&#x2F;span&gt;&lt;span&gt;─────┼──────────────────────────────────────────────────────┼────── time
&lt;&#x2F;span&gt;&lt;span&gt;     │              Window of Vulnerability                 │
&lt;&#x2F;span&gt;&lt;span&gt;     │◄────────────────────────────────────────────────────►│
&lt;&#x2F;span&gt;&lt;span&gt;     │                                                      │
&lt;&#x2F;span&gt;&lt;span&gt;     │   your reads see          other transactions         │
&lt;&#x2F;span&gt;&lt;span&gt;     │   a frozen snapshot       may commit writes here     │
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The longer your transaction runs, the wider this window grows. FoundationDB enforces a strict &lt;strong&gt;5-second transaction limit&lt;&#x2F;strong&gt;, which at one million versions per second is exactly &lt;strong&gt;5 million versions&lt;&#x2F;strong&gt; (&lt;code&gt;MAX_WRITE_TRANSACTION_LIFE_VERSIONS = 5 * VERSIONS_PER_SECOND&lt;&#x2F;code&gt;). The Resolver keeps conflict history in memory only up to this age; a transaction whose read version is older than &lt;code&gt;currentVersion - 5,000,000&lt;&#x2F;code&gt; is rejected as &quot;transaction too old.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;A transaction that completes in 50 milliseconds has almost no exposure. A transaction that takes 4.5 seconds is exposed to every concurrent write on every key it read.&lt;&#x2F;p&gt;
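&lt;p&gt;The arithmetic is easy to check by hand. The sketch below is a back-of-the-envelope model in Python, not FoundationDB code: the two constants follow the knob quoted above, while the function names and the strict comparison at the boundary are assumptions.&lt;&#x2F;p&gt;

```python
VERSIONS_PER_SECOND = 1_000_000
MAX_WRITE_TRANSACTION_LIFE_VERSIONS = 5 * VERSIONS_PER_SECOND


def is_too_old(read_version, current_version):
    """True when the read version has fallen out of the 5-second window."""
    return current_version - read_version > MAX_WRITE_TRANSACTION_LIFE_VERSIONS


def window_in_versions(duration_seconds):
    """Approximate width of the window of vulnerability for a given duration."""
    return round(duration_seconds * VERSIONS_PER_SECOND)


print(window_in_versions(0.05))   # a 50 ms transaction
print(window_in_versions(4.5))    # a 4.5 s transaction
print(is_too_old(read_version=1_000_000, current_version=6_500_000))
```

&lt;p&gt;The 4.5-second transaction spans ninety times more versions than the 50-millisecond one, and every one of those versions is a chance for someone else to commit a write to a key it read.&lt;&#x2F;p&gt;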
&lt;p&gt;This is why long transactions are one of the most common sources of production trouble. More work means more time, wider window, more conflicts. The fix is parallelizing your reads so the transaction completes faster, splitting work into smaller transactions when full atomicity isn&#x27;t required, or using &lt;a href=&quot;&#x2F;posts&#x2F;understanding-fdb-record-layer-continuations&#x2F;&quot;&gt;continuations&lt;&#x2F;a&gt; to checkpoint progress across transaction boundaries.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-conflicts-actually-work&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-conflicts-actually-work&quot; aria-label=&quot;Anchor link for: how-conflicts-actually-work&quot;&gt;🔗&lt;&#x2F;a&gt;How Conflicts Actually Work&lt;&#x2F;h2&gt;
&lt;p&gt;Every read your transaction performs adds a &lt;strong&gt;read-conflict range&lt;&#x2F;strong&gt; to your transaction. Every write adds a &lt;strong&gt;write-conflict range&lt;&#x2F;strong&gt;. At commit time, the Resolver checks: does your read-conflict set intersect any committed write-conflict set since your read version? If yes, your transaction is aborted. The Resolver uses a version-aware skiplist to make this check efficient, pruning entire subtrees of committed writes that predate your read version.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt; Your Transaction                Resolver               Another Transaction
&lt;&#x2F;span&gt;&lt;span&gt;┌─────────────────┐                                    ┌─────────────────┐
&lt;&#x2F;span&gt;&lt;span&gt;│ get(key_A)      │─► read conflict: {key_A}           │                 │
&lt;&#x2F;span&gt;&lt;span&gt;│ get_range(B, D) │─► read conflict: {B..D}            │ set(key_C)      │─► write conflict: {key_C}
&lt;&#x2F;span&gt;&lt;span&gt;│ set(key_X)      │─► write conflict: {key_X}          │                 │
&lt;&#x2F;span&gt;&lt;span&gt;└─────────────────┘                                    └─────────────────┘
&lt;&#x2F;span&gt;&lt;span&gt;                              │
&lt;&#x2F;span&gt;&lt;span&gt;                     at commit time:
&lt;&#x2F;span&gt;&lt;span&gt;                     read conflicts ∩ write conflicts
&lt;&#x2F;span&gt;&lt;span&gt;                     from txns committed since read version?
&lt;&#x2F;span&gt;&lt;span&gt;                              │
&lt;&#x2F;span&gt;&lt;span&gt;                     key_C ∈ {B..D}? → YES → ABORT
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;get&lt;&#x2F;code&gt; and &lt;code&gt;get_range&lt;&#x2F;code&gt; create read conflicts. &lt;code&gt;set&lt;&#x2F;code&gt;, &lt;code&gt;clear&lt;&#x2F;code&gt;, and &lt;code&gt;clear_range&lt;&#x2F;code&gt; create write conflicts.&lt;&#x2F;p&gt;
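&lt;p&gt;The commit-time check can be modeled in a few lines. This is a toy in Python under loud assumptions: &lt;code&gt;ToyResolver&lt;&#x2F;code&gt; and &lt;code&gt;single_key&lt;&#x2F;code&gt; are hypothetical names, and it scans a plain list where the real Resolver uses its skiplist. Only the rule is the same: abort when a read-conflict range overlaps a write committed after the read version.&lt;&#x2F;p&gt;

```python
class ToyResolver:
    """Keeps committed write-conflict ranges tagged with their commit version."""

    def __init__(self):
        self.committed = []  # list of (commit_version, start_key, end_key)
        self.version = 0

    def try_commit(self, read_version, read_ranges, write_ranges):
        """Return the commit version, or None if the transaction must abort."""
        for commit_version, w_start, w_end in self.committed:
            if not commit_version > read_version:
                continue  # committed before our snapshot, already visible to us
            for r_start, r_end in read_ranges:
                # Half-open ranges overlap when each starts before the other ends.
                if w_end > r_start and r_end > w_start:
                    return None
        self.version += 1
        for w_start, w_end in write_ranges:
            self.committed.append((self.version, w_start, w_end))
        return self.version


def single_key(key):
    """Half-open range that covers exactly one key."""
    return (key, key + b"\x00")


resolver = ToyResolver()
read_version = resolver.version  # our transaction starts here

# Another transaction blindly writes key_C and commits first.
other = resolver.try_commit(resolver.version, [], [single_key(b"key_C")])

# We read key_A and the range [key_B, key_D), then write key_X.
ours = resolver.try_commit(
    read_version,
    [single_key(b"key_A"), (b"key_B", b"key_D")],
    [single_key(b"key_X")],
)
print(other, ours)
```

&lt;p&gt;The blind write commits because it has no read ranges to check. Ours aborts because &lt;code&gt;key_C&lt;&#x2F;code&gt; falls inside the range we read.&lt;&#x2F;p&gt;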
&lt;p&gt;The simplest conflict pattern is the &lt;strong&gt;hot key&lt;&#x2F;strong&gt;: a single key read and written by many concurrent transactions. A naive global counter, a &quot;last updated&quot; timestamp, a configuration value everyone checks. The read-modify-write creates a read conflict, and under concurrent updates, all but one of the competing transactions fail.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-phantom-conflict-problem&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-phantom-conflict-problem&quot; aria-label=&quot;Anchor link for: the-phantom-conflict-problem&quot;&gt;🔗&lt;&#x2F;a&gt;The Phantom Conflict Problem&lt;&#x2F;h2&gt;
&lt;p&gt;When you call &lt;code&gt;get(key)&lt;&#x2F;code&gt;, FDB adds that single key to your read conflict set. Straightforward. But when you call &lt;code&gt;get_range(start, end)&lt;&#x2F;code&gt;, FDB adds &lt;strong&gt;the entire range&lt;&#x2F;strong&gt; to your conflict set, not just the keys that happened to exist, not just the keys your code iterated over. The mathematical range from start to end, including every possible key that could exist within it. The SIGMOD 2021 paper calls this &lt;strong&gt;phantom read prevention&lt;&#x2F;strong&gt;: &quot;The read set is checked against the modified key ranges of concurrent committed transactions, which prevents phantom reads.&quot; &lt;strong&gt;You can conflict on keys you never saw.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;Your range read: get_range(&amp;quot;order&#x2F;user1&#x2F;&amp;quot;, &amp;quot;order&#x2F;user1&#x2F;\xff&amp;quot;)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;Keyspace:
&lt;&#x2F;span&gt;&lt;span&gt;  order&#x2F;user1&#x2F;001  ◄── exists, returned
&lt;&#x2F;span&gt;&lt;span&gt;  order&#x2F;user1&#x2F;002  ◄── exists, returned
&lt;&#x2F;span&gt;&lt;span&gt;  order&#x2F;user1&#x2F;003  ◄── exists, returned
&lt;&#x2F;span&gt;&lt;span&gt;  order&#x2F;user1&#x2F;004  ◄── DOES NOT EXIST YET
&lt;&#x2F;span&gt;&lt;span&gt;  ···
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;Read conflict range: [ &amp;quot;order&#x2F;user1&#x2F;&amp;quot; , &amp;quot;order&#x2F;user1&#x2F;\xff&amp;quot; )
&lt;&#x2F;span&gt;&lt;span&gt;                       ◄──────── covers EVERYTHING ────────►
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;Another transaction: set(&amp;quot;order&#x2F;user1&#x2F;004&amp;quot;, ...)
&lt;&#x2F;span&gt;&lt;span&gt;  → write conflict on &amp;quot;order&#x2F;user1&#x2F;004&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;  → inside your read conflict range
&lt;&#x2F;span&gt;&lt;span&gt;  → YOUR transaction aborts (you never saw this key)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Imagine you&#x27;re scanning a user&#x27;s orders to check if they have any pending shipments. Your range read returns 3 orders. You check each one, they&#x27;re all shipped, great. You decide to update a status flag. Meanwhile, another transaction inserts a brand new order for that same user. The key for that new order falls within your scanned range. Your transaction conflicts and aborts, even though you never touched that key, never saw it, and your business logic doesn&#x27;t care about it at all. Full table scans are the extreme version of this problem: the wider your range, the more phantom writes can abort you. The fix requires either narrowing your reads to touch less keyspace, or using snapshot reads with selective conflicts.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;snapshot-reads&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#snapshot-reads&quot; aria-label=&quot;Anchor link for: snapshot-reads&quot;&gt;🔗&lt;&#x2F;a&gt;Snapshot Reads&lt;&#x2F;h2&gt;
&lt;p&gt;A snapshot read returns the same data as a regular read from the same consistent snapshot, but it does not add any read conflicts to your transaction. The operation is &lt;code&gt;tr.snapshot().get(key)&lt;&#x2F;code&gt; or &lt;code&gt;tr.snapshot().get_range(start, end)&lt;&#x2F;code&gt;. The data you get back is identical. The only difference is what happens at commit time: the Resolver won&#x27;t check whether those keys changed.&lt;&#x2F;p&gt;
&lt;p&gt;When would you want this? Whenever you need to read data for your logic but don&#x27;t need the transaction to abort if that data changes concurrently. A common case is reading configuration or metadata that rarely changes, where acting on the value from your snapshot is acceptable even if it has been updated by the time you commit.&lt;&#x2F;p&gt;
&lt;p&gt;The trade-off is that you&#x27;re accepting your decision might be based on data that changed concurrently. This is safe for read-mostly metadata or filtering logic. It&#x27;s dangerous for business-critical checks like balance verification or uniqueness constraints. If your code path is &quot;read X, decide based on X, write Y&quot;, and the decision must hold at commit time, you need the read conflict.&lt;&#x2F;p&gt;
&lt;p&gt;But what if you need to read broadly and conflict narrowly? FDB exposes &lt;strong&gt;manual conflict APIs&lt;&#x2F;strong&gt; that complement snapshot reads. &lt;code&gt;add_read_conflict_key&lt;&#x2F;code&gt; and &lt;code&gt;add_read_conflict_range&lt;&#x2F;code&gt; let you inject read conflicts explicitly: you read without conflicts, then selectively add conflicts on exactly the keys you care about. On the write side, &lt;code&gt;add_write_conflict_key&lt;&#x2F;code&gt; and &lt;code&gt;add_write_conflict_range&lt;&#x2F;code&gt; let you inject write conflicts without actually writing data. This is useful for implementing locks or coordination primitives where your transaction claims a key to block others without storing anything there.&lt;&#x2F;p&gt;
&lt;p&gt;Go back to the phantom conflict problem: you need to scan a user&#x27;s orders to check for pending shipments, but you don&#x27;t want inserts of new orders to abort your transaction. With a regular &lt;code&gt;get_range&lt;&#x2F;code&gt;, any write within that range kills you. With a snapshot range read plus manual conflicts, you read the entire range via &lt;code&gt;tr.snapshot().get_range()&lt;&#x2F;code&gt; without adding any read conflicts, then call &lt;code&gt;add_read_conflict_key&lt;&#x2F;code&gt; only on the specific pending orders your logic depends on. If another transaction inserts a new order, your transaction doesn&#x27;t care. If another transaction modifies a pending order you&#x27;re acting on, your transaction correctly conflicts. You went from conflicting on the entire keyspace range to conflicting on exactly the keys that matter.&lt;&#x2F;p&gt;
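&lt;p&gt;Here is a small Python model of that difference. &lt;code&gt;ToyTransaction&lt;&#x2F;code&gt; is hypothetical and only tracks which read-conflict ranges would be sent at commit. The &lt;code&gt;snapshot=True&lt;&#x2F;code&gt; flag stands in for &lt;code&gt;tr.snapshot().get_range()&lt;&#x2F;code&gt;, and the order statuses are made up for the example.&lt;&#x2F;p&gt;

```python
class ToyTransaction:
    """Records the read-conflict ranges a transaction would send at commit."""

    def __init__(self, store):
        self.store = store            # dict of key to value, the frozen snapshot
        self.read_conflicts = []      # list of half-open (start, end) ranges

    def get_range(self, start, end, snapshot=False):
        rows = sorted((k, v) for k, v in self.store.items() if k >= start and end > k)
        if not snapshot:
            self.read_conflicts.append((start, end))  # the whole range, not just rows
        return rows

    def add_read_conflict_key(self, key):
        self.read_conflicts.append((key, key + b"\x00"))

    def conflicts_with(self, written_key):
        return any(written_key >= s and e > written_key for s, e in self.read_conflicts)


orders = {
    b"order/user1/001": b"shipped",
    b"order/user1/002": b"pending",
    b"order/user1/003": b"shipped",
}
start, end = b"order/user1/", b"order/user1/\xff"

regular = ToyTransaction(orders)
regular.get_range(start, end)

selective = ToyTransaction(orders)
for key, status in selective.get_range(start, end, snapshot=True):
    if status == b"pending":
        selective.add_read_conflict_key(key)   # conflict only on what we act on

new_order = b"order/user1/004"
print(regular.conflicts_with(new_order))               # insert of a new order
print(selective.conflicts_with(new_order))
print(selective.conflicts_with(b"order/user1/002"))    # change to a pending order
```

&lt;p&gt;The regular scan conflicts with the insert of order 004. The selective version ignores it and still conflicts when the pending order 002 changes.&lt;&#x2F;p&gt;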
&lt;h2 id=&quot;atomic-operations-writing-without-reading&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#atomic-operations-writing-without-reading&quot; aria-label=&quot;Anchor link for: atomic-operations-writing-without-reading&quot;&gt;🔗&lt;&#x2F;a&gt;Atomic Operations: Writing Without Reading&lt;&#x2F;h2&gt;
&lt;p&gt;When you need to increment a counter, the obvious approach is to read the current value, add one, and write the result back. This creates a read conflict on that key, and under concurrent updates, transactions start failing because they all race to write their incremented value. &lt;strong&gt;Atomic operations&lt;&#x2F;strong&gt; take a different approach: they send an instruction to the storage server (&quot;add this delta to whatever value is there&quot;) without your transaction ever knowing the current value. No read, no read conflict. Instead of &lt;code&gt;tr.get(key)&lt;&#x2F;code&gt; followed by &lt;code&gt;tr.set(key, value + 1)&lt;&#x2F;code&gt;, you call &lt;code&gt;tr.atomic_add(key, 1)&lt;&#x2F;code&gt; and concurrent updates all succeed.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;foundationdb.github.io&#x2F;fdb-record-layer&#x2F;&quot;&gt;Record Layer&lt;&#x2F;a&gt; exploits this for aggregate indexes. A &lt;code&gt;COUNT&lt;&#x2F;code&gt; index issues &lt;code&gt;atomic_add(count_key, 1)&lt;&#x2F;code&gt; on every record insertion and &lt;code&gt;atomic_add(count_key, -1)&lt;&#x2F;code&gt; on deletion. A &lt;code&gt;SUM&lt;&#x2F;code&gt; index adds the field&#x27;s value. &lt;code&gt;MAX_EVER&lt;&#x2F;code&gt; and &lt;code&gt;MIN_EVER&lt;&#x2F;code&gt; use &lt;code&gt;atomic_max&lt;&#x2F;code&gt; and &lt;code&gt;atomic_min&lt;&#x2F;code&gt;. Unlimited concurrent updates to the same aggregate, zero conflicts between writers.&lt;&#x2F;p&gt;
&lt;p&gt;But there&#x27;s a trap: if you read a key and also atomically modify it in the same transaction, you lose all the benefits. The FoundationDB documentation is explicit: &quot;If a transaction uses both an atomic operation and a strictly serializable read on the same key, the benefits of using the atomic operation (for both conflict checking and performance) are lost.&quot; The read already poisoned the transaction. The pattern only works when you genuinely don&#x27;t need to see the current value.&lt;&#x2F;p&gt;
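&lt;p&gt;The following Python toy contrasts the two approaches for ten writers that all start at the same read version. It models only the conflict rule, assumes a single attempt per writer with no retries, and uses hypothetical function names. It does not call FoundationDB.&lt;&#x2F;p&gt;

```python
def run_read_modify_write(writers):
    """Every writer reads the counter at the same read version, then tries to commit."""
    counter, counter_version = 0, 0
    committed, aborted = 0, 0
    snapshots = [(counter, counter_version) for _ in range(writers)]  # concurrent reads
    for seen_value, seen_version in snapshots:
        if counter_version > seen_version:
            aborted += 1          # the key changed after our read: read conflict
            continue
        counter = seen_value + 1
        counter_version += 1
        committed += 1
    return counter, committed, aborted


def run_atomic_add(writers):
    """Every writer ships an 'add 1' instruction; nobody reads, so nobody conflicts."""
    counter = 0
    mutations = [1 for _ in range(writers)]
    for delta in mutations:       # applied in commit order
        counter += delta
    return counter, writers, 0


print(run_read_modify_write(10))
print(run_atomic_add(10))
```

&lt;p&gt;Each tuple is (final counter, committed, aborted). In a real system the aborted writers would retry, so the read-modify-write counter eventually reaches the same value, but only after many wasted round trips.&lt;&#x2F;p&gt;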
&lt;h2 id=&quot;versionstamps-conflict-free-ordering&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#versionstamps-conflict-free-ordering&quot; aria-label=&quot;Anchor link for: versionstamps-conflict-free-ordering&quot;&gt;🔗&lt;&#x2F;a&gt;Versionstamps: Conflict-Free Ordering&lt;&#x2F;h2&gt;
&lt;p&gt;Generating sequential IDs the obvious way means reading the current maximum, incrementing it, and writing the new value. That&#x27;s a read-modify-write on a single key, which is exactly the conflict pattern we&#x27;ve been trying to avoid. Every concurrent transaction reads the same max ID, and all but one will abort.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Versionstamps&lt;&#x2F;strong&gt; solve this by deferring ID assignment to commit time. Instead of your transaction deciding what the next ID is, FoundationDB fills it in at the moment of commit. A versionstamp is a &lt;strong&gt;12-byte value&lt;&#x2F;strong&gt;: 8 bytes of commit version (assigned by the Sequencer), 2 bytes of batch ordering, and 2 bytes of user version. The result is globally unique and monotonically increasing across the entire cluster. You write a key containing a placeholder that FDB replaces with the actual versionstamp at commit. Your transaction doesn&#x27;t know the final key until it commits, but multiple concurrent appends generate different versionstamps and write to different keys. Zero conflicts. Keep in mind that versionstamps are themselves monotonic: keys that lead with one all land at the tail of the same range, so put a prefix such as a user or tenant identifier in front when you need to spread writes across shards. For more key design patterns, see &lt;a href=&quot;&#x2F;posts&#x2F;crafting-keys-in-fdb&#x2F;&quot;&gt;crafting keys in FoundationDB&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The Record Layer uses this for its &lt;code&gt;VERSION&lt;&#x2F;code&gt; index, which powers CloudKit&#x27;s sync protocol. Each record stores its commit version, and a secondary index maps versions to primary keys. When a mobile device syncs, it scans the version index starting from its last-known version. Writers don&#x27;t coordinate at all.&lt;&#x2F;p&gt;
&lt;p&gt;One limitation: you cannot read a versionstamped key within the same transaction that creates it. The final key doesn&#x27;t exist until commit. Versionstamps work beautifully for append-only structures where you write and walk away.&lt;&#x2F;p&gt;
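&lt;p&gt;To see why the layout gives ordering for free, here is a sketch that packs the 12 bytes by hand in Python. In a real transaction FDB fills these bytes in at commit; the values below are stand-ins. The &lt;code&gt;log&#x2F;&lt;&#x2F;code&gt; prefix and the helper names are assumptions, and big-endian packing is what makes byte order match numeric order.&lt;&#x2F;p&gt;

```python
import struct


def pack_versionstamp(commit_version, batch_order, user_version=0):
    """Pack 8 + 2 + 2 bytes, big-endian so the bytes sort like the numbers."""
    return struct.pack(">QHH", commit_version, batch_order, user_version)


def log_key(prefix, versionstamp):
    """Append-only key: a fixed prefix followed by the versionstamp."""
    return prefix + versionstamp


# Three commits, listed out of order; the values stand in for what FDB assigns.
stamps = [
    pack_versionstamp(2_000, 0),
    pack_versionstamp(1_000, 1),
    pack_versionstamp(1_000, 0),
]
keys = sorted(log_key(b"log/", s) for s in stamps)
decoded = [struct.unpack(">QHH", k[len(b"log/"):]) for k in keys]
print(len(stamps[0]))
print(decoded)
```

&lt;p&gt;Sorting the raw keys recovers commit order, with the batch-ordering bytes breaking ties between transactions that share a commit version.&lt;&#x2F;p&gt;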
&lt;h3 id=&quot;cross-cluster-ordering&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#cross-cluster-ordering&quot; aria-label=&quot;Anchor link for: cross-cluster-ordering&quot;&gt;🔗&lt;&#x2F;a&gt;Cross-Cluster Ordering&lt;&#x2F;h3&gt;
&lt;p&gt;Versionstamps work within a single cluster. But what happens when data moves between clusters? Versions assigned by different FoundationDB clusters are uncorrelated. This creates a problem when migrating data between clusters for load balancing or locality. A sync index based purely on versionstamps would break: updates committed after the move might sort before updates committed before the move.&lt;&#x2F;p&gt;
&lt;p&gt;The Record Layer solves this with an &lt;strong&gt;incarnation&lt;&#x2F;strong&gt; counter. Each user starts with incarnation 1, incremented every time their data moves to a different cluster. On every record update, the current incarnation is written to the record&#x27;s header. The VERSION sync index maps &lt;code&gt;(incarnation, version)&lt;&#x2F;code&gt; pairs to changed records, sorting first by incarnation, then by version. Updates after a move have a higher incarnation and correctly sort after pre-move updates, even if the new cluster&#x27;s version numbers are lower.&lt;&#x2F;p&gt;
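&lt;p&gt;A tiny Python example shows the effect of sorting on the pair. The version numbers and descriptions are invented, and &lt;code&gt;sync_index_key&lt;&#x2F;code&gt; is a hypothetical helper, not the Record Layer&#x27;s API.&lt;&#x2F;p&gt;

```python
def sync_index_key(incarnation, version):
    """Hypothetical index entry: ordered by incarnation first, then by version."""
    return (incarnation, version)


# (incarnation, version, description); cluster B's versions happen to be lower.
updates = [
    (1, 9_000_000, "update on cluster A"),
    (1, 9_500_000, "last update before the move"),
    (2, 120_000, "first update on cluster B"),
]

by_version_only = sorted(updates, key=lambda u: u[1])
by_incarnation = sorted(updates, key=lambda u: sync_index_key(u[0], u[1]))

print([u[2] for u in by_version_only])
print([u[2] for u in by_incarnation])
```

&lt;p&gt;Sorted by version alone, the post-move update jumps to the front. Sorted by the pair, it lands after everything written on the old cluster.&lt;&#x2F;p&gt;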
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;🔗&lt;&#x2F;a&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;The next time you see a conflict error, ask yourself: what did I read that I didn&#x27;t need to? The answer is usually hiding in a range read that could have been narrower, a read-modify-write that could have been an atomic operation, or a check that could have used a snapshot read.&lt;&#x2F;p&gt;
&lt;p&gt;None of these techniques require changing FoundationDB itself. They&#x27;re all about how you design your key schema and structure your transactions. As always, data modeling in ordered key-value stores is the hard part of the job. What&#x27;s the most surprising conflict you&#x27;ve debugged?&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with FDB transaction debugging. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">foundationdb</category>
          <category domain="tag">distributed-systems</category>
          <category domain="tag">database</category>
          <category domain="tag">transactions</category>
      </item>
      <item>
          <title>What I Tell Colleagues About Using LLMs for Engineering</title>
          <pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/llms-for-engineering/</link>
          <guid>https://pierrezemb.fr/posts/llms-for-engineering/</guid>
          <description xml:base="https://pierrezemb.fr/posts/llms-for-engineering/">&lt;p&gt;In a few months, I went from skeptic to heavy user. Claude Code is now part of my daily workflow, both for my personal projects and at Clever Cloud where I help teams adopt these tools. I keep having the same conversation: colleagues ask how I use it, what works, what doesn&#x27;t. This post captures what I tell them.&lt;&#x2F;p&gt;
&lt;p&gt;The shift matters because &lt;a href=&quot;https:&#x2F;&#x2F;world.hey.com&#x2F;joaoqalves&#x2F;when-software-becomes-fast-food-23147c9b&quot;&gt;code is becoming cheap&lt;&#x2F;a&gt;. What used to take hours now takes minutes. But this doesn&#x27;t diminish the craft.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;x.com&#x2F;tobi&#x2F;status&#x2F;2010438500609663110&quot;&gt;Tobi Lütke&lt;&#x2F;a&gt; got MRI data on a USB stick that required commercial Windows software to view. He asked Claude to build an HTML viewer instead. It looked better than the commercial tool. That&#x27;s the superpower: not just using software, but &lt;strong&gt;making&lt;&#x2F;strong&gt; software for your exact problem. As &lt;a href=&quot;https:&#x2F;&#x2F;antirez.com&#x2F;news&#x2F;158&quot;&gt;antirez wrote&lt;&#x2F;a&gt;, the fire that kept us coding until night was never about typing. It was about building. LLMs let us build more and better. The fun is still there, untouched.&lt;&#x2F;p&gt;
&lt;p&gt;The first months were honestly frustrating. Code that looked right but broke in subtle ways. APIs that no longer existed. Patterns that didn&#x27;t match my codebase. It took experimentation to find what actually works. These are the patterns that survived.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-becomes-reachable&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-becomes-reachable&quot; aria-label=&quot;Anchor link for: what-becomes-reachable&quot;&gt;🔗&lt;&#x2F;a&gt;What Becomes Reachable&lt;&#x2F;h2&gt;
&lt;p&gt;I keep hearing that LLMs unlock velocity. We can ship faster! While that may be true, I think it misses the main benefit. LLMs are about reaching work that would never get done otherwise.&lt;&#x2F;p&gt;
&lt;p&gt;Every engineering team has a backlog of things that matter but never happen: comprehensive doc tests, database migrations, dependency updates, technical debt. These tasks sit in dream lists because the return on investment is too low given the effort required. LLMs change that equation.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;moonpool&quot;&gt;moonpool&lt;&#x2F;a&gt; is my concrete example. Porting FoundationDB&#x27;s low-level internals to Rust was always a dream project. I had operated distributed systems for years and understood the concepts, but the sheer volume of translation work kept it out of reach. With Claude, I could throw multiple codebases at it for analysis, create recap files summarizing key patterns, and flesh out my own implementation plan in hours instead of days. The project exists because LLMs made it reachable.&lt;&#x2F;p&gt;
&lt;p&gt;This is the shift worth paying attention to: LLMs amplify expertise, they do not replace it. The knowledge of what to build and why remains the bottleneck. The execution barrier just got lower.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;plan-first-always&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#plan-first-always&quot; aria-label=&quot;Anchor link for: plan-first-always&quot;&gt;🔗&lt;&#x2F;a&gt;Plan First, Always&lt;&#x2F;h2&gt;
&lt;p&gt;Here is the paradox: when code becomes cheap, design becomes more valuable. Not less. You can now afford to spend time on architecture, discuss tradeoffs, commit to an approach before writing a single line of code. &lt;a href=&quot;&#x2F;posts&#x2F;specs-are-back&#x2F;&quot;&gt;Specs are coming back&lt;&#x2F;a&gt;, and the judgment to write good ones still requires years of building systems.&lt;&#x2F;p&gt;
&lt;p&gt;Every significant task now starts in Plan Mode with &lt;code&gt;ultrathink&lt;&#x2F;code&gt;. Boris Cherny &lt;a href=&quot;https:&#x2F;&#x2F;x.com&#x2F;bcherny&#x2F;status&#x2F;2007892431031988385&quot;&gt;says thinking is on by default now&lt;&#x2F;a&gt; and the command does not do much anymore, but old habits die hard. The practical goal is breaking work into chunks small enough that the AI can digest the context without hallucinating. This is not about limiting ambition. It is about matching task scope to context window.&lt;&#x2F;p&gt;
&lt;p&gt;For large tasks, I produce a &lt;strong&gt;spec file&lt;&#x2F;strong&gt; that Claude and I iterate on together. Claude Code has an &lt;code&gt;AskUserQuestion&lt;&#x2F;code&gt; tool that lets Claude ask clarifying questions mid-task. Combined with a spec file, this becomes powerful: Claude asks about edge cases, I refine the requirements, we converge on an approach before writing code. The collaboration happens in the spec, not scattered across conversation turns. As a bonus, the spec survives context compaction and remains the source of truth when Claude summarizes the conversation.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of generating a spec from assumptions, I tell Claude to clarify first. Here is an example prompt I use:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;ultrathink. Generate a spec.md for adding a new API endpoint to this codebase. Before writing anything, ask me about the endpoint&#x27;s purpose, request&#x2F;response schema, authentication requirements, and edge cases. Then produce a comprehensive spec covering motivation, technical design, error handling, and testing strategy.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The result is a spec that matches what I actually need, not what Claude guessed I might want. Fewer iterations, better alignment.&lt;&#x2F;p&gt;
&lt;p&gt;A plan is only as good as the context it is built on.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;context-is-everything&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#context-is-everything&quot; aria-label=&quot;Anchor link for: context-is-everything&quot;&gt;🔗&lt;&#x2F;a&gt;Context is Everything&lt;&#x2F;h2&gt;
&lt;p&gt;The output quality depends entirely on the context you provide. This sounds obvious, but the implications took me a while to internalize. I now create context files with domain knowledge, code patterns, and project summaries. Writing down the hidden coding style rules that exist only in your head is surprisingly valuable. The conventions you enforce in code review but never documented? Write them down. The LLM will follow them, and so will newcomers on your team. I am currently experimenting with Claude skills to make this context reusable across sessions.&lt;&#x2F;p&gt;
&lt;p&gt;The difference between &lt;a href=&quot;https:&#x2F;&#x2F;simonwillison.net&#x2F;2025&#x2F;Oct&#x2F;7&#x2F;vibe-engineering&#x2F;&quot;&gt;vibe coding and vibe engineering&lt;&#x2F;a&gt;, as Simon Willison puts it, is whether you stay accountable for what the LLM produces. Accountability requires understanding, and understanding requires context.&lt;&#x2F;p&gt;
&lt;p&gt;Without enough context framing the problem, Claude over-engineers. I have seen it add abstraction layers, configuration options, and patterns I never asked for. The cure is constraints: explicit context about what simplicity looks like in this codebase. The LLM can generate code faster than I ever could, but knowing what context matters is expertise that cannot be delegated.&lt;&#x2F;p&gt;
&lt;p&gt;Context works when it is accurate. Documentation often is not.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;clone-your-dependencies&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#clone-your-dependencies&quot; aria-label=&quot;Anchor link for: clone-your-dependencies&quot;&gt;🔗&lt;&#x2F;a&gt;Clone Your Dependencies&lt;&#x2F;h2&gt;
&lt;p&gt;MCP tools exist to fetch documentation, but I find git clone more powerful. I clone the dependencies I care about and &lt;strong&gt;check out the version I actually use&lt;&#x2F;strong&gt;. Claude browses the real source code, not cached docs or outdated training data. When I ask about an API, the answer comes from the actual implementation at the version pinned in my lock file. This simple habit prevents entire categories of frustrating debugging sessions where the model confidently generates code for an API that no longer exists.&lt;&#x2F;p&gt;
&lt;p&gt;This also works for &lt;strong&gt;understanding unfamiliar code&lt;&#x2F;strong&gt;. Clone a dependency, check out the version you use, and ask specific questions. The LLM handles breadth, you handle depth.&lt;&#x2F;p&gt;
&lt;p&gt;Good context helps Claude generate better code. But how does it know when the code is wrong?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;feedback-loops&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#feedback-loops&quot; aria-label=&quot;Anchor link for: feedback-loops&quot;&gt;🔗&lt;&#x2F;a&gt;Feedback Loops&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;x.com&#x2F;bcherny&#x2F;status&#x2F;2007179832300581177&quot;&gt;Boris Cherny&lt;&#x2F;a&gt;, creator of Claude Code, calls this the most important thing: &lt;strong&gt;give Claude a way to verify its work&lt;&#x2F;strong&gt;. If Claude has that feedback loop, it will 2-3x the quality of the final result. The pattern is simple: generate code, get feedback, fix, repeat. The faster and clearer the feedback, the better the results.&lt;&#x2F;p&gt;
&lt;p&gt;This is why Rust and Claude work so well together. The compiler gives &lt;strong&gt;actionable error messages&lt;&#x2F;strong&gt;. The type system catches bugs before runtime. Clippy suggests improvements. Claude reads the feedback and fixes issues immediately. The compiler output is isolated, textual, actionable. The model does not have to guess what went wrong. Any language or tool that provides clear, structured feedback enables this same cycle.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;TDD&lt;&#x2F;strong&gt; fits perfectly here. Tests are easy for you to read and verify, and they give fast feedback to the LLM. Write the test first, let Claude implement until it passes. You stay in control of the specification while delegating the implementation.&lt;&#x2F;p&gt;
&lt;p&gt;For software that needs to be correct, the feedback must be exhaustive. I maintain the &lt;a href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;foundationdb&quot;&gt;FoundationDB Rust crate&lt;&#x2F;a&gt;. Over 11 million downloads, used by real companies. The &lt;a href=&quot;&#x2F;posts&#x2F;providing-safety-fdb-rs&#x2F;&quot;&gt;binding tester&lt;&#x2F;a&gt; generates operation sequences and compares our implementation against the reference. We run the equivalent of &lt;strong&gt;219 days of continuous testing each month&lt;&#x2F;strong&gt; across our CI runners. When Claude contributes code, the binding tester tells it exactly where behavior diverges. This kind of feedback gives confidence to change things in a database driver that you would never touch otherwise.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;simulation-feedback-for-distributed-systems&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#simulation-feedback-for-distributed-systems&quot; aria-label=&quot;Anchor link for: simulation-feedback-for-distributed-systems&quot;&gt;🔗&lt;&#x2F;a&gt;Simulation: Feedback for Distributed Systems&lt;&#x2F;h3&gt;
&lt;p&gt;Compiler feedback catches syntax and types. Tests catch logic errors. But what about bugs that hide in timing and network partitions?&lt;&#x2F;p&gt;
&lt;p&gt;Distributed systems fail in ways that only manifest under specific conditions. &lt;a href=&quot;&#x2F;posts&#x2F;simulation-driven-development&#x2F;#the-tale-of-a-bug&quot;&gt;A network partition once disrupted a 70-node Hadoop cluster&lt;&#x2F;a&gt; and left it unable to restart due to corrupted state. That incident shaped how I think about testing. This is why I love FoundationDB: &lt;a href=&quot;&#x2F;posts&#x2F;diving-into-foundationdb-simulation&#x2F;&quot;&gt;after years of on-call, it has never woken me up&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Distributed systems need feedback loops that inject failures &lt;strong&gt;before&lt;&#x2F;strong&gt; production. This is what &lt;a href=&quot;&#x2F;posts&#x2F;simulation-driven-development&#x2F;&quot;&gt;deterministic simulation&lt;&#x2F;a&gt; provides. Same seed, same execution, same bugs. When every run is reproducible, the LLM can methodically explore the state space, find a failure, and debug it step by step.&lt;&#x2F;p&gt;
&lt;p&gt;In moonpool, Claude &lt;a href=&quot;&#x2F;posts&#x2F;testing-prevention-vs-discovery&#x2F;&quot;&gt;discovered a bug I did not know existed&lt;&#x2F;a&gt; through active exploration of edge cases I had not considered. &lt;a href=&quot;https:&#x2F;&#x2F;x.com&#x2F;mitsuhiko&#x2F;status&#x2F;2011048778896212251&quot;&gt;Armin Ronacher&lt;&#x2F;a&gt; recently noted that agents can now port entire codebases to new languages with all tests passing. The combination of simulation and LLMs makes this possible.&lt;&#x2F;p&gt;
&lt;p&gt;The worst bugs are the &lt;strong&gt;unknown unknowns&lt;&#x2F;strong&gt;. You cannot write a test for a bug you do not know exists. Simulation and state exploration are the cheat sheet. If the code survives exhaustive exploration of edge cases, failures, and adversarial conditions, you have strong evidence that it behaves correctly. It does not matter whether an LLM wrote it or you did.&lt;&#x2F;p&gt;
&lt;p&gt;What dream project has been sitting on your list, waiting for the execution barrier to drop?&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with LLM-assisted development. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">llm</category>
          <category domain="tag">software-engineering</category>
          <category domain="tag">rust</category>
          <category domain="tag">testing</category>
      </item>
      <item>
          <title>2025: A Year in Review</title>
          <pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/2025-year-in-review/</link>
          <guid>https://pierrezemb.fr/posts/2025-year-in-review/</guid>
          <description xml:base="https://pierrezemb.fr/posts/2025-year-in-review/">&lt;p&gt;2025 was the year I stopped managing and started shipping software again. After nearly two years of context-switching between fires and people issues, I returned to the keyboard. As the year closes, it feels like the right time to look back. It was about &lt;strong&gt;going deeper&lt;&#x2F;strong&gt;: into code, into writing, and into understanding why simulation testing is a superpower.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;back-in-engineering&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#back-in-engineering&quot; aria-label=&quot;Anchor link for: back-in-engineering&quot;&gt;🔗&lt;&#x2F;a&gt;Back in Engineering&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;the-transition&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-transition&quot; aria-label=&quot;Anchor link for: the-transition&quot;&gt;🔗&lt;&#x2F;a&gt;The Transition&lt;&#x2F;h3&gt;
&lt;p&gt;In January, I &lt;a href=&quot;&#x2F;posts&#x2F;back-engineering&#x2F;&quot;&gt;went back to engineering&lt;&#x2F;a&gt; after nearly two years in management. It felt like coming home.&lt;&#x2F;p&gt;
&lt;p&gt;But I will be honest: the transition was harder than I expected. There was real imposter syndrome. Had I lost my edge? Was I still the technical person I used to be? The hardest part was not the code itself but giving myself &lt;strong&gt;permission to focus&lt;&#x2F;strong&gt;. Having another engineering manager handle the team and another SRE team handle the fires while I dove into low-level work was the perfect setup. I am also &lt;strong&gt;grateful&lt;&#x2F;strong&gt; to the whole team for making that transition possible. But after nearly two years of context-switching between fires and people issues, sitting down to write code without interruption felt almost wrong. It took months to fully allow myself to focus without second thoughts.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;building-the-toolbox-behind-materia-the-long-way-around&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#building-the-toolbox-behind-materia-the-long-way-around&quot; aria-label=&quot;Anchor link for: building-the-toolbox-behind-materia-the-long-way-around&quot;&gt;🔗&lt;&#x2F;a&gt;Building the Toolbox Behind Materia, the Long Way Around&lt;&#x2F;h3&gt;
&lt;p&gt;Helping put &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;blog&#x2F;company&#x2F;2025&#x2F;06&#x2F;27&#x2F;why-we-finally-built-our-own-managed-kubernetes-etcd&#x2F;&quot;&gt;Clever Cloud&#x27;s etcd shim&lt;&#x2F;a&gt; into production was meaningful because of the long arc behind it.&lt;&#x2F;p&gt;
&lt;p&gt;At OVHcloud, I got paged too often when etcd hit its performance ceiling. We were adding hundreds of customers per etcd cluster, each with their own Kubernetes control plane. I &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=IrJyrGQ_R9c&quot;&gt;talked about this at KubeCon&lt;&#x2F;a&gt;. Spawning three etcd nodes per customer is not a viable approach at scale, whether in the cloud or on-premises. You need to share clusters across customers, but you cannot scale etcd horizontally because the whole keyspace must fit on every member and the recommended storage limit is &lt;strong&gt;8 GB&lt;&#x2F;strong&gt;. When you outgrow one cluster, you boot another, split your keys, and now you operate two, three, or many clusters. We had to balance customers manually across clusters. After &lt;a href=&quot;&#x2F;posts&#x2F;hbase-custom-data-balancing&#x2F;&quot;&gt;contributing a custom balancer to HBase&lt;&#x2F;a&gt; to solve exactly this problem, doing it by hand for etcd felt like going backward.&lt;&#x2F;p&gt;
&lt;p&gt;Then I discovered Apple&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;pierrez.github.io&#x2F;fdb-book&#x2F;the-record-layer&#x2F;what-is-record-layer.html&quot;&gt;FDB Record Layer&lt;&#x2F;a&gt;. It was an eye-opener: here was a way to &lt;strong&gt;virtualize database-like systems&lt;&#x2F;strong&gt; on top of FoundationDB, building any storage abstraction you want on a rock-solid distributed foundation. During France&#x27;s first lockdown, I &lt;a href=&quot;https:&#x2F;&#x2F;forums.foundationdb.org&#x2F;t&#x2F;a-foundationdb-layer-for-apiserver-as-an-alternative-to-etcd&#x2F;2697&quot;&gt;prototyped an etcd layer&lt;&#x2F;a&gt; using the Record Layer. The prototype worked, but more importantly, the Record Layer showed me what was important: &lt;strong&gt;a reusable toolbox to encapsulate FoundationDB knowledge&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I moved to Clever Cloud to build exactly that: serverless systems based on FoundationDB. We started building the toolbox in Rust, piece by piece, driven by what our layers actually needed. Our first layer was &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;product&#x2F;materia-kv&#x2F;&quot;&gt;Materia KV&lt;&#x2F;a&gt;, exposing the Redis protocol. That forced us to build the foundational primitives.&lt;&#x2F;p&gt;
&lt;p&gt;One highlight was building the query engine for Materia. I wrote &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;datafusion-contrib&#x2F;datafusion-index-provider&quot;&gt;datafusion-index-provider&lt;&#x2F;a&gt;, a library that extends Apache DataFusion with index-based query acceleration. I had a lot of fun digging into how a query plan might look when fetching indexes: a two-phase model where you first scan indexes to identify matching row IDs, then fetch complete records. The interesting part was combining &lt;strong&gt;AND&lt;&#x2F;strong&gt; and &lt;strong&gt;OR&lt;&#x2F;strong&gt; operations. AND predicates build a left-deep tree of joins to intersect row IDs across indexes. OR predicates use unions with deduplication to merge results without fetching the same record twice. The first time DataFusion, FoundationDB, and our indexes all connected and a SELECT query returned real data, &lt;a href=&quot;&#x2F;posts&#x2F;thank-you-datafusion&#x2F;&quot;&gt;I remembered why I write software&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Then came etcd, which required &lt;strong&gt;a lot&lt;&#x2F;strong&gt; more work: watches, leases, revision tracking. I &lt;a href=&quot;&#x2F;posts&#x2F;diving-into-kubernetes-watch-cache&#x2F;&quot;&gt;debugged the watch cache&lt;&#x2F;a&gt; along the way. We are not alone in this approach: &lt;a href=&quot;https:&#x2F;&#x2F;aws.amazon.com&#x2F;blogs&#x2F;containers&#x2F;under-the-hood-amazon-eks-ultra-scale-clusters&#x2F;&quot;&gt;AWS&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;blog&#x2F;products&#x2F;containers-kubernetes&#x2F;gke-65k-nodes-and-counting?hl=en&quot;&gt;GKE&lt;&#x2F;a&gt; also run custom storage layers for Kubernetes at scale. No more splitting clusters, no more manual data balancing, no more operational nightmares. FoundationDB handles the hard distributed systems parts. Years of frustration with etcd turned into an etcd-compatible API backing Kubernetes control planes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-simulation-year&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-simulation-year&quot; aria-label=&quot;Anchor link for: the-simulation-year&quot;&gt;🔗&lt;&#x2F;a&gt;The Simulation Year&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;the-awakening&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-awakening&quot; aria-label=&quot;Anchor link for: the-awakening&quot;&gt;🔗&lt;&#x2F;a&gt;The Awakening&lt;&#x2F;h3&gt;
&lt;p&gt;Then came &lt;a href=&quot;https:&#x2F;&#x2F;bugbash.antithesis.com&#x2F;&quot;&gt;BugBash 2025&lt;&#x2F;a&gt; in early April.&lt;&#x2F;p&gt;
&lt;p&gt;The conference in Washington D.C., organized by Antithesis, brought together people like Kyle Kingsbury, Ankush Desai, and Mitchell Hashimoto to discuss software reliability. The highlight was meeting some of the original FoundationDB creators. Hearing their war stories and seeing how deeply simulation shaped FDB&#x27;s legendary reliability reignited something in me. I had been using FDB&#x27;s simulation for years, but I had never fully internalized that &lt;strong&gt;&lt;a href=&quot;&#x2F;posts&#x2F;simulation-driven-development&#x2F;&quot;&gt;this could be how I write all software&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;BugBash reminded me of my painful on-call years at OVHcloud. Before etcd, I operated a massive HBase cluster: 255 machines, 2 million writes per second, 6 million reads per second. HBase was fragile under network issues, and every incident triggered region split inconsistencies. We ran hbck in brutal ways just to keep things running. HBase led me to FDB, a system &lt;strong&gt;built to handle network chaos&lt;&#x2F;strong&gt;, and that robustness comes from simulation.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;from-understanding-to-building&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#from-understanding-to-building&quot; aria-label=&quot;Anchor link for: from-understanding-to-building&quot;&gt;🔗&lt;&#x2F;a&gt;From Understanding to Building&lt;&#x2F;h3&gt;
&lt;p&gt;2025 was about turning that understanding into something real. At work, we added simulation workloads to critical software running on top of Materia. I &lt;a href=&quot;https:&#x2F;&#x2F;docs.google.com&#x2F;presentation&#x2F;d&#x2F;1xm4yNGnV2Oi8Lk3ZHEvg4aDMNEFieSmW06CkItCigSc&#x2F;edit?usp=sharing&quot;&gt;spoke at Devoxx France&lt;&#x2F;a&gt; about embracing simulation-driven development. But I wanted to go deeper: to understand how FoundationDB achieves its robustness. So I &lt;a href=&quot;&#x2F;posts&#x2F;diving-into-foundationdb-simulation&#x2F;&quot;&gt;dug into the implementation&lt;&#x2F;a&gt;, then started porting it to Rust.&lt;&#x2F;p&gt;
&lt;p&gt;That became &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;moonpool&quot;&gt;moonpool&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I split moonpool into three crates, each a foundational block for distributed systems. &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;moonpool-core&#x2F;latest&#x2F;moonpool_core&#x2F;&quot;&gt;moonpool-core&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; defines provider traits for time, networking, task spawning, and randomness, so the same code runs in production and simulation. &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;moonpool-sim&#x2F;latest&#x2F;moonpool_sim&#x2F;&quot;&gt;moonpool-sim&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; is the simulation engine: virtual time, event queues, and fault injection. &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;moonpool-transport&#x2F;latest&#x2F;moonpool_transport&#x2F;&quot;&gt;moonpool-transport&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; provides FDB-style RPC: connection management, checksummed wire formats, automatic reconnection. Claude accelerated this work significantly. I used it to translate C++ patterns into Rust. At some point, &lt;a href=&quot;&#x2F;posts&#x2F;testing-prevention-vs-discovery&#x2F;&quot;&gt;Claude started fixing bugs on its own&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;At the beginning of 2025, I had basic knowledge of deterministic simulation. By the end, I had built a framework. In 2026, I will start a &lt;strong&gt;blogpost series&lt;&#x2F;strong&gt; explaining moonpool&#x27;s internals. Maybe it becomes something others can use. Maybe it stays a hobby project. Either way, I am convinced: &lt;strong&gt;simulation is the future&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;sharing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#sharing&quot; aria-label=&quot;Anchor link for: sharing&quot;&gt;🔗&lt;&#x2F;a&gt;Sharing&lt;&#x2F;h2&gt;
&lt;p&gt;I set a goal to write one or two posts per month, and I published 20. You can trace my monthly focus just by looking at what I published. The results surprised me: traffic grew 2.5x according to Plausible. This was also the first year where people actually reached out to say &lt;strong&gt;thank you&lt;&#x2F;strong&gt; for sharing. I always assumed no one was reading.&lt;&#x2F;p&gt;
&lt;p&gt;The top posts by visitors: &lt;a href=&quot;&#x2F;posts&#x2F;nixos-good-bad-ugly&#x2F;&quot;&gt;NixOS: The Good, The Bad, and The Ugly&lt;&#x2F;a&gt;, &lt;a href=&quot;&#x2F;posts&#x2F;tokio-hidden-gems&#x2F;&quot;&gt;Unlocking Tokio&#x27;s Hidden Gems&lt;&#x2F;a&gt;, &lt;a href=&quot;&#x2F;posts&#x2F;distsys-resources&#x2F;&quot;&gt;Distributed Systems Resources&lt;&#x2F;a&gt;, and &lt;a href=&quot;&#x2F;posts&#x2F;simulation-driven-development&#x2F;&quot;&gt;What if we embraced simulation-driven development?&lt;&#x2F;a&gt;. Strangely enough, my most shared post was not about distributed systems or FoundationDB but about NixOS. I think people appreciated the honest take: the good, the bad, &lt;strong&gt;and&lt;&#x2F;strong&gt; the ugly. The Tokio post being #2 was also unexpected. Sometimes the posts you almost do not publish are the ones that resonate.&lt;&#x2F;p&gt;
&lt;p&gt;I love making presentations. In 2025, I gave two talks at Devoxx France: one about &lt;a href=&quot;https:&#x2F;&#x2F;docs.google.com&#x2F;presentation&#x2F;d&#x2F;1xm4yNGnV2Oi8Lk3ZHEvg4aDMNEFieSmW06CkItCigSc&#x2F;edit?usp=sharing&quot;&gt;simulation-driven development&lt;&#x2F;a&gt; and another about &lt;a href=&quot;https:&#x2F;&#x2F;docs.google.com&#x2F;presentation&#x2F;d&#x2F;1UbJ7drA_6hX7kLN2nV8IxOsAt1k8WOnfGrEfRlbIa7k&#x2F;edit?usp=sharing&quot;&gt;prototyping distributed systems with Maelstrom&lt;&#x2F;a&gt;. I also presented &lt;a href=&quot;https:&#x2F;&#x2F;docs.google.com&#x2F;presentation&#x2F;d&#x2F;13pCaWXNkITj5Sh4dKofILbxPg_Wb2BBedbbi2Mv4PoE&#x2F;edit?usp=sharing&quot;&gt;my fdb-rs journey&lt;&#x2F;a&gt; at &lt;a href=&quot;https:&#x2F;&#x2F;finistdevs.org&#x2F;&quot;&gt;FinistDevs&lt;&#x2F;a&gt;, which I help organize in Brest.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;foundationdb&quot;&gt;FoundationDB Rust crate&lt;&#x2F;a&gt; keeps growing: 11 million downloads, used in production by real companies. But I was not a great maintainer this year. Development followed Clever Cloud&#x27;s requirements. People asked for documentation about simulation testing and a roadmap, and I did not deliver. That is not a complaint about open source, just honesty. Being a solo maintainer without foundation backing means priorities get driven by the day job. Last week I finally wrote a roadmap and flushed my brain into GitHub issues so contributors can pick up work. I hope to do better in 2026.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-llms-changed-my-work&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-llms-changed-my-work&quot; aria-label=&quot;Anchor link for: how-llms-changed-my-work&quot;&gt;🔗&lt;&#x2F;a&gt;How LLMs Changed My Work&lt;&#x2F;h2&gt;
&lt;p&gt;I cannot write about 2025 without talking about LLMs. I spent a &lt;strong&gt;lot&lt;&#x2F;strong&gt; of time learning how to use them in my work.&lt;&#x2F;p&gt;
&lt;p&gt;For years, I had a weekly habit: two hours dedicated to reading codebases I depend on, understanding the internals of libraries, frameworks, and databases. Then I stopped. Life got busy, management took over, and diving into unfamiliar code took too long to justify. With LLMs, I picked the habit back up. What used to take hours now takes minutes. I can explore a codebase conversationally, asking questions, jumping to relevant sections, building mental models faster than ever. I learn more now than I did before.&lt;&#x2F;p&gt;
&lt;p&gt;They handle peripheral code well: glue code, boilerplate, scaffolding. Geoffrey Litt calls this &quot;&lt;a href=&quot;https:&#x2F;&#x2F;www.geoffreylitt.com&#x2F;2025&#x2F;10&#x2F;24&#x2F;code-like-a-surgeon&quot;&gt;coding like a surgeon&lt;&#x2F;a&gt;&quot;: delegate the prep work, focus on what matters. Moonpool is a good example. Claude translated C++ Flow patterns into Rust and generated test scaffolding. But knowing &lt;strong&gt;what&lt;&#x2F;strong&gt; to translate required years of operating distributed systems: why FDB uses interface swapping, why buggify has two-phase activation, why virtual time compresses hours into seconds. As João Alves wrote, when &quot;&lt;a href=&quot;https:&#x2F;&#x2F;world.hey.com&#x2F;joaoqalves&#x2F;when-software-becomes-fast-food-23147c9b&quot;&gt;software becomes fast food&lt;&#x2F;a&gt;&quot;, expertise becomes the scarce resource.&lt;&#x2F;p&gt;
&lt;p&gt;But what I did not expect is that working with LLMs forces me to write down invariants and hidden rules somewhere explicit. You need to write things down for the LLM to understand, and that documentation ends up being useful for humans too. &lt;strong&gt;Context is everything&lt;&#x2F;strong&gt;: given the right context, LLMs generate the right code. So I spent a lot of time (and tokens) generating project recaps and summaries to feed them. When working with libraries, I make local git clones so the LLM can browse the actual source code instead of relying on potentially outdated training data. I have been using Claude extensively, and I found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;github&#x2F;spec-kit&quot;&gt;spec-kit&lt;&#x2F;a&gt; helpful for framing my prompts. It is a toolkit for &quot;spec-driven development&quot; that helps you focus on product scenarios instead of what Simon Willison calls &quot;&lt;a href=&quot;https:&#x2F;&#x2F;simonwillison.net&#x2F;2025&#x2F;Oct&#x2F;7&#x2F;vibe-engineering&#x2F;&quot;&gt;vibe coding&lt;&#x2F;a&gt;&quot;. But &lt;a href=&quot;&#x2F;posts&#x2F;specs-are-back&#x2F;&quot;&gt;we are still missing the tools&lt;&#x2F;a&gt; to make this workflow seamless.&lt;&#x2F;p&gt;
&lt;p&gt;Some posts became unexpectedly useful as LLM context. My &lt;a href=&quot;&#x2F;posts&#x2F;practical-guide-to-application-metrics&#x2F;&quot;&gt;practical guide to application metrics&lt;&#x2F;a&gt; and my &lt;a href=&quot;&#x2F;posts&#x2F;writing-rust-fdb-workloads-that-find-bugs&#x2F;&quot;&gt;guidelines for FDB workloads&lt;&#x2F;a&gt; now live in project contexts. When I ask Claude to add instrumentation or write a simulation workload, it already knows my patterns.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;looking-ahead&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#looking-ahead&quot; aria-label=&quot;Anchor link for: looking-ahead&quot;&gt;🔗&lt;&#x2F;a&gt;Looking Ahead&lt;&#x2F;h2&gt;
&lt;p&gt;2025 reminded me why I love this work: building systems, learning in public, watching years of investment pay off.&lt;&#x2F;p&gt;
&lt;p&gt;For 2026, the habits stay: writing one or two posts per month, reading codebases with LLM assistance, speaking at conferences. I want to push moonpool toward something others can actually use. We have ambitious plans for Materia at Clever Cloud, and some of them should lead me to contribute to FoundationDB directly. I will keep helping organize &lt;a href=&quot;https:&#x2F;&#x2F;finistdevs.org&#x2F;&quot;&gt;FinistDevs&lt;&#x2F;a&gt; in Brest. And I will definitely be back at BugBash 2026.&lt;&#x2F;p&gt;
&lt;p&gt;The theme is the same as 2025: go deeper, share what I learn, build things that last. &lt;strong&gt;What did your 2025 look like?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out to share your own 2025 reflections. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">personal</category>
      </item>
      <item>
          <title>Specs Are Back, But We&#x27;re Missing the Tools</title>
          <pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/specs-are-back/</link>
          <guid>https://pierrezemb.fr/posts/specs-are-back/</guid>
          <description xml:base="https://pierrezemb.fr/posts/specs-are-back/">&lt;p&gt;I truly think LLMs are changing how we write software. For me, it&#x27;s been a massive productivity boost. I can ask Claude to read some piece of code and explain it to me, or make a quick PoC of something, or refactor stuff that would take me hours. I even used it to help me &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;moonpool&quot;&gt;port features from FoundationDB to Rust&lt;&#x2F;a&gt;, and it worked surprisingly well 🤯&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;world.hey.com&#x2F;joaoqalves&#x2F;when-software-becomes-fast-food-23147c9b&quot;&gt;João Alves made a great point recently&lt;&#x2F;a&gt;: code is becoming like fast food. Cheap, fast, everywhere. You ask an LLM to generate something, it compiles, the tests pass, and you ship it. For critical systems, &lt;a href=&quot;&#x2F;posts&#x2F;diving-into-foundationdb-simulation&#x2F;&quot;&gt;simulation testing&lt;&#x2F;a&gt; can validate that code actually survives production chaos.&lt;&#x2F;p&gt;
&lt;p&gt;But here&#x27;s what I keep running into: we&#x27;re still using &lt;strong&gt;natural language&lt;&#x2F;strong&gt; to prompt, correct, and guide LLMs. Vague instructions produce vague code. The bottleneck isn&#x27;t writing code anymore, it&#x27;s knowing &lt;strong&gt;what&lt;&#x2F;strong&gt; to write in the first place.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-specs-died-in-the-first-place&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#why-specs-died-in-the-first-place&quot; aria-label=&quot;Anchor link for: why-specs-died-in-the-first-place&quot;&gt;🔗&lt;&#x2F;a&gt;Why specs died in the first place&lt;&#x2F;h2&gt;
&lt;p&gt;Ask any engineering team &quot;where&#x27;s the spec for this service?&quot; and you&#x27;ll probably get one of three answers: blank stares, a link to some 3-year-old Google doc that&#x27;s completely outdated, or my personal favorite, &quot;the code is the spec.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;I think the problem was simple: &lt;strong&gt;specs had no feedback loop&lt;&#x2F;strong&gt;. Code compiles, tests pass, but specs? They just sit there. Nobody validates them, nobody updates them. Six months later, the spec has become archaeology, and new team members learn to ignore it because they can&#x27;t trust it anyway.&lt;&#x2F;p&gt;
&lt;p&gt;What changed is that LLMs can actually &lt;strong&gt;read&lt;&#x2F;strong&gt; specifications now. And suddenly, specs aren&#x27;t dead documents anymore. They&#x27;re instructions that can be executed. I&#x27;ve found two modes that actually work:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Generation&lt;&#x2F;strong&gt;: you give an LLM a structured spec, and it gives you an implementation&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Validation&lt;&#x2F;strong&gt;: you give an LLM some existing code and a spec, and ask &quot;does this implementation actually respect the specification?&quot;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;spec-kit-and-the-right-prompt-chain&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#spec-kit-and-the-right-prompt-chain&quot; aria-label=&quot;Anchor link for: spec-kit-and-the-right-prompt-chain&quot;&gt;🔗&lt;&#x2F;a&gt;spec-kit and the right prompt chain&lt;&#x2F;h2&gt;
&lt;p&gt;I tried &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;github&#x2F;spec-kit&quot;&gt;spec-kit&lt;&#x2F;a&gt; a while ago and found it pretty useful. What it does well is guide you through a structured chain of prompts: you start with a Constitution (your project principles), then you write Specifications (requirements with acceptance criteria), then Technical Plans, then Tasks, and finally Implementation.&lt;&#x2F;p&gt;
&lt;p&gt;It sounds obvious when I write it like that, but it&#x27;s surprisingly effective. This isn&#x27;t scattered TODO comments. It&#x27;s a queryable structure that builds context progressively, and the LLM can use all of it.&lt;&#x2F;p&gt;
&lt;p&gt;The generated code was actually good, because spec-kit forced me to build the right context first. And here&#x27;s what surprised me: the LLM kept challenging my vague requirements. Every time I wrote something like &quot;handle edge cases,&quot; it would ask &quot;what happens when X? what about Y?&quot; until the spec was actually implementable.&lt;&#x2F;p&gt;
&lt;p&gt;I think that&#x27;s the trick. &lt;strong&gt;Context is everything&lt;&#x2F;strong&gt;. Build the right context, and the LLM produces the right code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-limits-of-english&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-limits-of-english&quot; aria-label=&quot;Anchor link for: the-limits-of-english&quot;&gt;🔗&lt;&#x2F;a&gt;The limits of English&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s where I hit a wall though. English-based specs work great for user stories and acceptance criteria, the kind of stuff product managers care about. But for algorithms and system behavior? Natural language gets ambiguous really fast.&lt;&#x2F;p&gt;
&lt;p&gt;&quot;Handle concurrent access&quot; means different things to different people. &quot;Ensure consistency&quot; is even worse. When you&#x27;re designing distributed algorithms with subtle timing constraints, you need precision. English just doesn&#x27;t cut it.&lt;&#x2F;p&gt;
&lt;p&gt;I needed something more engineering-driven. Not formal verification for academic purposes, but practical precision that the whole team could read and reason about.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;finding-an-engineering-driven-approach&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#finding-an-engineering-driven-approach&quot; aria-label=&quot;Anchor link for: finding-an-engineering-driven-approach&quot;&gt;🔗&lt;&#x2F;a&gt;Finding an engineering-driven approach&lt;&#x2F;h2&gt;
&lt;p&gt;I started looking at formal methods. &lt;a href=&quot;https:&#x2F;&#x2F;lamport.azurewebsites.net&#x2F;video&#x2F;videos.html&quot;&gt;TLA+&lt;&#x2F;a&gt; is the classic choice, but the notation felt like another language to maintain. I didn&#x27;t want to be the only one on the team who could read the specs. I&#x27;ve been there before with other technologies, and it&#x27;s not a great place to be 😅&lt;&#x2F;p&gt;
&lt;p&gt;Then &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;alexmillerdb.bsky.social&#x2F;post&#x2F;3m6ptancmus2o&quot;&gt;a friend&lt;&#x2F;a&gt; suggested &lt;a href=&quot;https:&#x2F;&#x2F;fizzbee.io&quot;&gt;Fizzbee&lt;&#x2F;a&gt;. It&#x27;s based on Starlark, a Python dialect. Model checking without the TLA+ notation. The whole team can contribute.&lt;&#x2F;p&gt;
&lt;p&gt;Learning new languages with LLMs works well. The trick is to find or generate a spec of the language first, then ask for a tutorial tailored to your specific problem. I asked Claude to write a Starlark reference and a Fizzbee concepts recap. Now we share vocabulary, and the conversations are productive.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-we-re-still-missing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-we-re-still-missing&quot; aria-label=&quot;Anchor link for: what-we-re-still-missing&quot;&gt;🔗&lt;&#x2F;a&gt;What we&#x27;re still missing&lt;&#x2F;h2&gt;
&lt;p&gt;Fizzbee is great for what it does. For algorithms and concurrency, model checking feels like the Rust compiler but for higher-level design. It explores all possible states and finds bugs before any code exists. Here&#x27;s what an invariant looks like:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Safety: no duplicate completions
&lt;&#x2F;span&gt;&lt;span&gt;always assertion NoDuplicateCompletions:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(completed) == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;set&lt;&#x2F;span&gt;&lt;span&gt;(completed))
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Readable by anyone who knows Python.&lt;&#x2F;p&gt;
&lt;p&gt;But most software isn&#x27;t distributed algorithms. Most of what we build is about storing data in databases, sending messages to queues, calling other services, transforming inputs into outputs.&lt;&#x2F;p&gt;
&lt;p&gt;And we describe this behavior in a dozen different places: C4 diagrams for architecture, OpenAPI for HTTP endpoints, protobuf for message schemas, ADRs for decisions, markdown for everything else. No single notation captures the full picture.&lt;&#x2F;p&gt;
&lt;p&gt;I keep thinking about what a tool that fills this gap would need to be. It should be &lt;strong&gt;compact&lt;&#x2F;strong&gt;, short enough to fit in an LLM&#x27;s context window without eating thousands of tokens. It should work at both levels: service interactions (&quot;UserService stores in Postgres, publishes to Kafka&quot;) and function behavior (&quot;validateUser checks format, queries DB, returns DTO&quot;). It should be the &lt;strong&gt;common language&lt;&#x2F;strong&gt; that both the team and the LLM can read, write, and reason about.&lt;&#x2F;p&gt;
&lt;p&gt;Most importantly, it should be &lt;strong&gt;verifiable&lt;&#x2F;strong&gt;. Not just documentation that sits there, but something that can actually validate whether an implementation matches what we said it would do. The feedback loop that specs never had.&lt;&#x2F;p&gt;
&lt;p&gt;OpenAPI gets close for HTTP APIs. You can validate requests, generate clients, catch breaking changes. But for the rest? For business logic, for service contracts that aren&#x27;t just endpoints, for the behavior that actually matters? The tooling doesn&#x27;t exist yet.&lt;&#x2F;p&gt;
&lt;p&gt;As the old joke goes, &lt;a href=&quot;https:&#x2F;&#x2F;www.commitstrip.com&#x2F;en&#x2F;2016&#x2F;08&#x2F;25&#x2F;a-very-comprehensive-and-precise-spec&#x2F;&quot;&gt;a spec precise enough to generate code is just called code&lt;&#x2F;a&gt;. But there has to be something in between prose and implementation. And yes, I&#x27;m aware that proposing a new format makes me &lt;a href=&quot;https:&#x2F;&#x2F;xkcd.com&#x2F;927&#x2F;&quot;&gt;the 15th competing standard&lt;&#x2F;a&gt;. But if you&#x27;ve found something that fills this gap, or have ideas about what it should look like, I&#x27;d love to hear about it.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out to share your thoughts on spec tooling. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">software-engineering</category>
          <category domain="tag">llm</category>
          <category domain="tag">specifications</category>
          <category domain="tag">formal-methods</category>
          <category domain="tag">model-checking</category>
          <category domain="tag">fizzbee</category>
      </item>
      <item>
          <title>Designing Rust FDB Workloads That Actually Find Bugs</title>
          <pubDate>Tue, 09 Dec 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/writing-rust-fdb-workloads-that-find-bugs/</link>
          <guid>https://pierrezemb.fr/posts/writing-rust-fdb-workloads-that-find-bugs/</guid>
          <description xml:base="https://pierrezemb.fr/posts/writing-rust-fdb-workloads-that-find-bugs/">&lt;p&gt;After &lt;a href=&quot;&#x2F;posts&#x2F;diving-into-foundationdb-simulation&#x2F;&quot;&gt;one trillion CPU-hours of simulation testing&lt;&#x2F;a&gt;, FoundationDB has been stress-tested under conditions far worse than any production environment. Network partitions, disk failures, Byzantine faults. FDB handles them all. &lt;strong&gt;But what about your code?&lt;&#x2F;strong&gt; Your layer sits on top of FDB. Your indexes, your transaction logic, your retry handling. How do you know it survives chaos?&lt;&#x2F;p&gt;
&lt;p&gt;At Clever Cloud, we are building &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;materia&#x2F;&quot;&gt;Materia&lt;&#x2F;a&gt;, our serverless database product. The question haunted us: how do you ship layer code with the same confidence FDB has in its own? Our answer was to hack our way into FDB&#x27;s simulator using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;tree&#x2F;4ed057a&#x2F;foundationdb-simulation&quot;&gt;foundationdb-simulation&lt;&#x2F;a&gt;, a crate that compiles Rust to run inside FDB&#x27;s deterministic simulator. Rust is the only language besides Flow that can pull this off.&lt;&#x2F;p&gt;
&lt;p&gt;The first seed triggered &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;developer-guide.html#transactions-with-unknown-results&quot;&gt;&lt;code&gt;commit_unknown_result&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, one of the most feared edge cases for FDB layer developers. When a connection drops, the client can&#x27;t know if the transaction committed. Our atomic counters were incrementing twice. In production, this surfaces once every few months under heavy load and during failures. In simulation? &lt;strong&gt;Almost immediately.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This post won&#x27;t walk you through the code mechanics. The &lt;a href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;foundationdb-simulation&quot;&gt;foundationdb-simulation crate&lt;&#x2F;a&gt; and its &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;tree&#x2F;4ed057a&#x2F;foundationdb-simulation&quot;&gt;README&lt;&#x2F;a&gt; cover that. Instead, this teaches you how to &lt;strong&gt;design&lt;&#x2F;strong&gt; workloads that catch real bugs. Whether you&#x27;re a junior engineer or an LLM helping write tests, these principles will guide you.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-autonomous-testing-works&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#why-autonomous-testing-works&quot; aria-label=&quot;Anchor link for: why-autonomous-testing-works&quot;&gt;🔗&lt;&#x2F;a&gt;Why Autonomous Testing Works&lt;&#x2F;h2&gt;
&lt;p&gt;Traditional testing has you write specific tests for scenarios you imagined. But as Will Wilson put it at &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=eZ1mmqlq-mY&quot;&gt;Bug Bash 2025&lt;&#x2F;a&gt;: &lt;strong&gt;&quot;The most dangerous bugs occur in states you never imagined possible.&quot;&lt;&#x2F;strong&gt; The key insight of autonomous testing (what FDB&#x27;s simulation embodies) is that instead of writing tests, you write a &lt;strong&gt;test generator&lt;&#x2F;strong&gt;. If you ran it for infinite time, it would eventually produce all possible tests you could have written. You don&#x27;t have infinite time, so instead you get a probability distribution over all possible tests. And probability distributions are leaky: they cover cases you never would have thought to test.&lt;&#x2F;p&gt;
&lt;p&gt;This is why simulation finds bugs so fast. You&#x27;re not testing what you thought to test. You&#x27;re testing what the probability distribution happens to generate, which includes edge cases you&#x27;d never have written explicitly. Add fault injection (a probability distribution over all possible ways the world can conspire to screw you) and now you&#x27;re finding bugs that would take months or years to surface in production.&lt;&#x2F;p&gt;
&lt;p&gt;This is what got me interested in simulation in the first place: how do you test the things you see during on-call shifts? Those weird transient bugs at 3 AM, the race conditions that happen once a month, the edge cases you only discover when production is on fire. Simulation shifts that complexity from SRE time to SWE time. What was a 3 AM page becomes a daytime debugging session. What was a high-pressure incident becomes a reproducible test case you can bisect, rewind, and experiment with freely.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-sequential-luck-problem&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-sequential-luck-problem&quot; aria-label=&quot;Anchor link for: the-sequential-luck-problem&quot;&gt;🔗&lt;&#x2F;a&gt;The Sequential Luck Problem&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s why rare bugs are so hard to find: imagine a bug that requires three unlikely events in sequence. Each event has a 1&#x2F;1000 probability. The combined probability is 1&#x2F;1,000,000,000, so random testing needs roughly a billion tries to hit it. Research confirms this: &lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;conference&#x2F;osdi18&#x2F;presentation&#x2F;alquraan&quot;&gt;a study of network partition failures&lt;&#x2F;a&gt; found that 83% require 3+ events to manifest, 80% have catastrophic impact, and 21% cause permanent damage that persists after the partition heals. &lt;strong&gt;But here&#x27;s the good news for Rust workloads&lt;&#x2F;strong&gt;: you don&#x27;t solve this problem yourself. FDB&#x27;s simulation handles fault injection. BUGGIFY injects failures at arbitrary code points. Network partitions appear and disappear. Disks fail. Machines crash and restart. The simulator explores failure combinations that would take years to encounter in production.&lt;&#x2F;p&gt;
&lt;p&gt;Your job is different. You need to design operations that exercise interesting code paths. Not just reads and writes, but the edge cases your users will inevitably trigger. And you need to write invariants that CATCH bugs when simulation surfaces them. After a million injected faults, how do you prove your data is still correct? This division of labor is the key insight: FDB injects chaos, you verify correctness.&lt;&#x2F;p&gt;
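&lt;p&gt;To make the sequential luck arithmetic concrete, here is a small Python sketch. It assumes the three events are independent and reuses the 1&#x2F;1000 figure from the example above; the function names are made up for illustration.&lt;&#x2F;p&gt;

```python
from fractions import Fraction


def combined_probability(per_event, events):
    """Probability that `events` independent events all happen in one attempt."""
    return per_event ** events


def expected_attempts(probability):
    """Mean number of independent attempts until the first success (geometric distribution)."""
    return 1 / probability


p = combined_probability(Fraction(1, 1000), 3)
print(p)                     # 1/1000000000
print(expected_attempts(p))  # 1000000000
```

&lt;p&gt;Exact fractions keep the numbers readable. The takeaway is the same as in the text: each extra unlikely event multiplies the number of random tries you need by another factor of a thousand.&lt;&#x2F;p&gt;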
&lt;h2 id=&quot;designing-your-operation-alphabet&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#designing-your-operation-alphabet&quot; aria-label=&quot;Anchor link for: designing-your-operation-alphabet&quot;&gt;🔗&lt;&#x2F;a&gt;Designing Your Operation Alphabet&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;strong&gt;operation alphabet&lt;&#x2F;strong&gt; is the complete set of operations your workload can perform. This is where most workloads fail: they test happy paths with uniform distribution and miss the edge cases that break production. Think about three categories:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Normal operations&lt;&#x2F;strong&gt; with realistic weights. In production, maybe 80% of your traffic is reads, 15% is simple writes, 5% is complex updates. Your workload should reflect this, because bugs often hide in the interactions between operation types. A workload that runs 50% reads and 50% writes tests different code paths than one that runs 95% reads and 5% writes. Both might be valid, but they&#x27;ll find different bugs.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Adversarial inputs&lt;&#x2F;strong&gt; that customers will inevitably send. Empty strings. Maximum-length values. Null bytes in the middle of strings. Unicode edge cases. Boundary integers (0, -1, MAX_INT). Customers never respect your API specs, so model the chaos they create.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Nemesis operations&lt;&#x2F;strong&gt; that break things on purpose. Delete random data mid-test. Clear ranges that &quot;shouldn&#x27;t&quot; be cleared. Crash batch jobs mid-execution to test recovery. Run compaction every operation instead of daily. Create conflict storms where multiple clients hammer the same key. Approach the 10MB transaction limit. These operations stress your error handling and recovery paths. The rare operations are where bugs hide. That batch job running once a day in production? In simulation, you&#x27;ll hit its partial-failure edge case in minutes, but only if your operation alphabet includes it.&lt;&#x2F;p&gt;
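&lt;p&gt;Here is a rough, language-agnostic sketch of a weighted operation alphabet, written in Python for brevity. The operation names and weights are assumptions chosen for illustration, and &lt;code&gt;random.Random(seed)&lt;&#x2F;code&gt; only stands in for the simulator&#x27;s seeded randomness; a real workload would be Rust and would draw from the simulation context instead.&lt;&#x2F;p&gt;

```python
import random

# Hypothetical operation alphabet: (name, weight). The weights loosely follow
# the 80/15/5 split from the text, with a thin slice reserved for adversarial
# inputs and nemesis operations.
ALPHABET = [
    ("read", 78),
    ("simple_write", 14),
    ("complex_update", 5),
    ("adversarial_input", 2),
    ("nemesis_clear_range", 1),
]


def pick_operations(seed, count):
    """Draw `count` operation names from the alphabet using a seeded PRNG."""
    rng = random.Random(seed)  # stand-in for the simulator's seeded randomness
    names = [name for name, _ in ALPHABET]
    weights = [weight for _, weight in ALPHABET]
    return rng.choices(names, weights=weights, k=count)


ops = pick_operations(seed=42, count=10_000)
print({name: ops.count(name) for name, _ in ALPHABET})
```

&lt;p&gt;Even a 1% weight means the nemesis operation runs about a hundred times in ten thousand steps, which is the point: rare in production, routine in simulation.&lt;&#x2F;p&gt;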
&lt;h2 id=&quot;designing-invariants&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#designing-invariants&quot; aria-label=&quot;Anchor link for: designing-invariants&quot;&gt;🔗&lt;&#x2F;a&gt;Designing Invariants&lt;&#x2F;h2&gt;
&lt;p&gt;After simulation runs thousands of operations with injected faults, network partitions, and machine crashes, how do you know your data is still correct? Unlike FDB&#x27;s internal testing, Rust workloads can&#x27;t inject assertions at arbitrary code points. You verify correctness in the &lt;code&gt;check()&lt;&#x2F;code&gt; phase, after the chaos ends. The key question: &lt;strong&gt;&quot;After all this, how do I PROVE my data is still correct?&quot;&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;One critical tip: validate during &lt;code&gt;start()&lt;&#x2F;code&gt;, not just in &lt;code&gt;check()&lt;&#x2F;code&gt;.&lt;&#x2F;strong&gt; Don&#x27;t wait until the end to discover corruption. After each operation (or batch of operations), read back the data and verify it matches expectations. If you&#x27;re maintaining a counter, read it and check the bounds. If you&#x27;re building an index, query it immediately after insertion. Early validation catches bugs closer to their source, making debugging far easier. The &lt;code&gt;check()&lt;&#x2F;code&gt; phase is your final safety net, but continuous validation during execution is where you&#x27;ll catch most issues.&lt;&#x2F;p&gt;
&lt;p&gt;An invariant is just a property that must always hold, no matter what operations ran. If you&#x27;ve seen property-based testing, it&#x27;s the same idea: instead of &lt;code&gt;assertFalse(new User(GUEST).canUse(SAVED_CARD))&lt;&#x2F;code&gt;, you write &lt;code&gt;assertEquals(user.isAuthenticated(), user.canUse(SAVED_CARD))&lt;&#x2F;code&gt;. The first tests one case. The second tests a rule that holds for all cases.&lt;&#x2F;p&gt;
&lt;p&gt;Four patterns dominate invariant design:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Reference Models&lt;&#x2F;strong&gt; maintain an in-memory copy of expected state. Every operation updates both the database and the reference model. In &lt;code&gt;check()&lt;&#x2F;code&gt;, you compare them. If they diverge, something went wrong. Use &lt;code&gt;BTreeMap&lt;&#x2F;code&gt; (not &lt;code&gt;HashMap&lt;&#x2F;code&gt;) for deterministic iteration. This pattern works best for single-client workloads where you can track state locally.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Conservation Laws&lt;&#x2F;strong&gt; track quantities that must stay constant. Inventory transfers between warehouses shouldn&#x27;t change total inventory. Money transfers between accounts shouldn&#x27;t create or destroy money. Sum everything up and verify the conservation law holds. This pattern is elegant because it doesn&#x27;t require tracking individual operations, just the aggregate property.&lt;&#x2F;p&gt;
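&lt;p&gt;A minimal sketch of a conservation law check, in Python. The plain dictionary is an assumption standing in for the database, and the account names and amounts are invented; the only thing the sketch is meant to show is that the invariant looks at the aggregate, not at individual operations.&lt;&#x2F;p&gt;

```python
import random


def transfer(accounts, src, dst, amount):
    """Move `amount` from src to dst, refusing overdrafts; the total never changes."""
    if src == dst or amount > accounts[src]:
        return False
    accounts[src] -= amount
    accounts[dst] += amount
    return True


def total(accounts):
    """Sum of all balances: the quantity that must be conserved."""
    return sum(accounts.values())


rng = random.Random(1234)  # seeded, so every run replays the same transfers
accounts = {f"acct-{i}": 100 for i in range(5)}
expected_total = total(accounts)

for _ in range(1_000):
    src, dst = rng.choice(sorted(accounts)), rng.choice(sorted(accounts))
    transfer(accounts, src, dst, rng.randint(1, 150))

# The invariant: whichever transfers succeeded, money is neither created nor destroyed.
assert total(accounts) == expected_total
```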
&lt;p&gt;&lt;strong&gt;Structural Integrity&lt;&#x2F;strong&gt; verifies data structures remain valid. If you maintain a secondary index, verify every index entry points to an existing record and every record appears in the index exactly once. If you maintain a linked list in FDB, traverse it and confirm every node is reachable. The cycle validation pattern (creating a circular list where nodes point to each other) is a classic technique from &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;231f762&#x2F;fdbserver&#x2F;workloads&#x2F;Cycle.actor.cpp&quot;&gt;FDB&#x27;s own Cycle workload&lt;&#x2F;a&gt;. After chaos, traverse the cycle and verify you visit exactly N nodes.&lt;&#x2F;p&gt;
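&lt;p&gt;The cycle idea can be sketched in a few lines of Python. This is a simplified in-memory illustration, not the FDB workload itself: the dictionary stands in for keys stored in the database, and the helper names are hypothetical.&lt;&#x2F;p&gt;

```python
import random


def build_cycle(n):
    """Node i points to node (i + 1) % n, forming one ring of n nodes."""
    return {i: (i + 1) % n for i in range(n)}


def swap_step(ring, a):
    """Given the chain a, b, c, d, rewire it to a, c, b, d; the ring stays a single cycle."""
    b = ring[a]
    c = ring[b]
    d = ring[c]
    ring[a], ring[c], ring[b] = c, b, d


def cycle_length(ring, start=0):
    """Follow pointers from `start`; return the hop count on return, or None if we never come back."""
    node = start
    for hops in range(1, len(ring) + 1):
        node = ring[node]
        if node == start:
            return hops
    return None


rng = random.Random(99)  # seeded stand-in for the simulator's PRNG
ring = build_cycle(10)
for _ in range(500):
    swap_step(ring, rng.randrange(10))

# Invariant: after any number of swaps, the walk visits exactly N nodes.
assert cycle_length(ring) == 10
```

&lt;p&gt;If a swap were applied only partially, the walk would either come back too early or never return, and the check would fail.&lt;&#x2F;p&gt;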
&lt;p&gt;&lt;strong&gt;Operation Logging&lt;&#x2F;strong&gt; solves two problems at once: &lt;code&gt;maybe_committed&lt;&#x2F;code&gt; uncertainty and multi-client coordination. The trick from &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;231f762&#x2F;fdbserver&#x2F;workloads&#x2F;AtomicOps.actor.cpp&quot;&gt;FDB&#x27;s own AtomicOps workload&lt;&#x2F;a&gt;: &lt;strong&gt;log the intent alongside the operation in the same transaction&lt;&#x2F;strong&gt;. Write both your operation AND a log entry recording what you intended. Since they&#x27;re in the same transaction, they either both commit or neither does. No uncertainty. For multi-client workloads, each client logs under its own prefix (e.g., &lt;code&gt;log&#x2F;{client_id}&#x2F;&lt;&#x2F;code&gt;). In &lt;code&gt;check()&lt;&#x2F;code&gt;, client 0 reads all logs from all clients, replays them to compute expected state, and compares against actual state. If they diverge, something went wrong, and you&#x27;ll know exactly which operations succeeded. See the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;blob&#x2F;4ed057a&#x2F;foundationdb-simulation&#x2F;examples&#x2F;atomic&#x2F;lib.rs&quot;&gt;Rust atomic workload example&lt;&#x2F;a&gt; for a complete implementation.&lt;&#x2F;p&gt;
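&lt;p&gt;Here is a toy Python model of the operation logging idea. Everything in it is an assumption made for illustration: &lt;code&gt;FakeStore&lt;&#x2F;code&gt; is an in-memory stand-in rather than FDB&#x27;s API, clients run one after another instead of concurrently, and the commit outcomes are picked by a seeded PRNG. The linked Rust example remains the reference for a real workload.&lt;&#x2F;p&gt;

```python
import random


class FakeStore:
    """In-memory stand-in for a transactional store (not FDB's API)."""

    def __init__(self, rng):
        self.data = {}
        self.rng = rng

    def commit(self, writes):
        """Apply all writes atomically or not at all, then report what the client sees."""
        outcome = self.rng.choice(
            ["committed", "failed", "unknown_but_committed", "unknown_and_lost"]
        )
        if outcome in ("committed", "unknown_but_committed"):
            self.data.update(writes)
        return outcome


def run_client(store, client_id, attempts, rng):
    for op_id in range(attempts):
        delta = rng.randint(1, 9)
        counter = store.data.get("counter", 0)
        store.commit({
            "counter": counter + delta,
            f"log/{client_id}/{op_id}": delta,  # intent, written in the same transaction
        })
        # The outcome is ignored on purpose: the log records what actually landed.


def check(store):
    """Replay the logs of every client and compare with the actual counter."""
    logged = sum(v for k, v in store.data.items() if k.startswith("log/"))
    return logged == store.data.get("counter", 0)


rng = random.Random(2025)
store = FakeStore(rng)
for client_id in range(3):
    run_client(store, client_id, attempts=50, rng=rng)
assert check(store)
```

&lt;p&gt;Because the counter update and its log entry share one commit, the check holds even though the clients never learn which attempts succeeded.&lt;&#x2F;p&gt;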
&lt;h2 id=&quot;the-determinism-rules&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-determinism-rules&quot; aria-label=&quot;Anchor link for: the-determinism-rules&quot;&gt;🔗&lt;&#x2F;a&gt;The Determinism Rules&lt;&#x2F;h2&gt;
&lt;p&gt;FDB&#x27;s simulation is deterministic. Same seed, same execution path, same bugs. This is the superpower that lets you reproduce failures. But determinism is fragile. Break it, and you lose reproducibility. Five rules to remember:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;BTreeMap, not HashMap&lt;&#x2F;strong&gt;: HashMap iteration order is non-deterministic&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;context.rnd(), not rand::random()&lt;&#x2F;strong&gt;: All randomness must come from the seeded PRNG&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;context.now(), not SystemTime::now()&lt;&#x2F;strong&gt;: Use simulation time, not wall clock&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;db.run(), not manual retry loops&lt;&#x2F;strong&gt;: The framework handles retries and &lt;code&gt;maybe_committed&lt;&#x2F;code&gt; correctly&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;No tokio::spawn()&lt;&#x2F;strong&gt;: The simulation runs on a custom executor; spawning breaks it&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;If you take nothing else from this post, memorize these. Break any of them and your failures become unreproducible. You&#x27;ll see a bug once and never find it again.&lt;&#x2F;p&gt;
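&lt;p&gt;The first three rules can be illustrated with a small Python analogue. &lt;code&gt;SimContext&lt;&#x2F;code&gt; below is a hypothetical stand-in whose &lt;code&gt;rnd()&lt;&#x2F;code&gt; and &lt;code&gt;now()&lt;&#x2F;code&gt; methods only mirror the names used in the list; it is not the crate&#x27;s API. The sketch shows the property that matters: when all randomness, time, and iteration order come from the seed, two runs produce the same trace.&lt;&#x2F;p&gt;

```python
import random


class SimContext:
    """Hypothetical stand-in for a simulation context: seeded randomness, virtual time."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._now = 0.0

    def rnd(self):
        """Next pseudo-random integer, fully determined by the seed."""
        return self._rng.randrange(2**32)

    def now(self):
        """Virtual clock: advances by a seeded amount, never reads the wall clock."""
        self._now += self._rng.random()
        return self._now


def run_workload(seed, steps=5):
    """Return a trace of (time, key, value) tuples driven only by the context."""
    ctx = SimContext(seed)
    state = {}
    for _ in range(steps):
        state[f"key-{ctx.rnd() % 10}"] = ctx.rnd()
    # Python dicts already iterate in insertion order; sorting the keys is shown
    # here as the analogue of choosing BTreeMap over HashMap.
    return [(round(ctx.now(), 6), key, state[key]) for key in sorted(state)]


assert run_workload(seed=7) == run_workload(seed=7)
```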
&lt;h2 id=&quot;architecture-the-three-crate-pattern&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#architecture-the-three-crate-pattern&quot; aria-label=&quot;Anchor link for: architecture-the-three-crate-pattern&quot;&gt;🔗&lt;&#x2F;a&gt;Architecture: The Three-Crate Pattern&lt;&#x2F;h2&gt;
&lt;p&gt;Real production systems use tokio, gRPC, REST frameworks, all of which break simulation determinism. You can&#x27;t just drop your production binary into the simulator. The solution is separating your FDB operations into a simulation-friendly crate:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;my-project&#x2F;
&lt;&#x2F;span&gt;&lt;span&gt;├── my-fdb-service&#x2F;      # Core FDB operations - NO tokio
&lt;&#x2F;span&gt;&lt;span&gt;├── my-grpc-server&#x2F;      # Production layer (tokio + tonic)
&lt;&#x2F;span&gt;&lt;span&gt;└── my-fdb-workloads&#x2F;    # Simulation tests
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The service crate contains pure FDB transaction logic with no async runtime dependency. The server crate wraps it for production. The workloads crate tests the actual service logic under simulation chaos. This lets you test your real production code, not a reimplementation that might have different bugs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;common-pitfalls&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#common-pitfalls&quot; aria-label=&quot;Anchor link for: common-pitfalls&quot;&gt;🔗&lt;&#x2F;a&gt;Common Pitfalls&lt;&#x2F;h2&gt;
&lt;p&gt;Beyond the determinism rules above, these mistakes will bite you:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Running setup or check on all clients.&lt;&#x2F;strong&gt; The framework runs multiple clients concurrently. If every client initializes data in &lt;code&gt;setup()&lt;&#x2F;code&gt;, you get duplicate initialization. If every client validates in &lt;code&gt;check()&lt;&#x2F;code&gt;, you get inconsistent results. Use &lt;code&gt;if self.client_id == 0&lt;&#x2F;code&gt; to ensure only one client handles initialization and validation.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Forgetting maybe_committed.&lt;&#x2F;strong&gt; The &lt;code&gt;db.run()&lt;&#x2F;code&gt; closure receives a &lt;code&gt;maybe_committed&lt;&#x2F;code&gt; flag indicating the previous attempt might have succeeded. If you&#x27;re doing non-idempotent operations like atomic increments, you need either truly idempotent transactions or &lt;a href=&quot;&#x2F;posts&#x2F;automatic-txn-fdb-730&#x2F;&quot;&gt;automatic idempotency&lt;&#x2F;a&gt; in FDB 7.3+. Ignoring this flag means your workload might count operations twice.&lt;&#x2F;p&gt;
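&lt;p&gt;To see why this matters, here is a simplified Python model of a retry loop. It is a sketch under stated assumptions: the store is a plain dictionary, the attempt outcomes are scripted by hand, and the marker key &lt;code&gt;applied&#x2F;{op_id}&lt;&#x2F;code&gt; is a hypothetical way to make the increment idempotent, not something the framework provides.&lt;&#x2F;p&gt;

```python
def naive_increment(data, op_id, delta):
    """Non-idempotent: every committed attempt adds `delta` again."""
    updated = dict(data)
    updated["counter"] = updated.get("counter", 0) + delta
    return updated


def idempotent_increment(data, op_id, delta):
    """Writes a marker next to the counter, so a replayed attempt becomes a no-op."""
    marker = f"applied/{op_id}"
    if marker in data:
        return data
    updated = naive_increment(data, op_id, delta)
    updated[marker] = True
    return updated


def run_with_retries(step, outcomes, op_id="op-1", delta=5):
    """Retry `step` until an attempt is acknowledged.

    "ok"      : committed and acknowledged
    "unknown" : committed, but the client saw an error and retries
    "failed"  : not committed
    """
    data = {}
    for outcome in outcomes:
        attempt = step(data, op_id, delta)
        if outcome in ("ok", "unknown"):
            data = attempt  # the commit actually landed
        if outcome == "ok":
            break
    return data


script = ["unknown", "failed", "ok"]
print(run_with_retries(naive_increment, script)["counter"])       # 10
print(run_with_retries(idempotent_increment, script)["counter"])  # 5
```

&lt;p&gt;The naive version counts the operation twice because the first attempt landed without the client knowing. The marker turns the retry into a no-op.&lt;&#x2F;p&gt;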
&lt;p&gt;&lt;strong&gt;Storing SimDatabase between phases.&lt;&#x2F;strong&gt; Each phase (&lt;code&gt;setup&lt;&#x2F;code&gt;, &lt;code&gt;start&lt;&#x2F;code&gt;, &lt;code&gt;check&lt;&#x2F;code&gt;) gets a fresh database reference. Storing the old one leads to undefined behavior. Always use the &lt;code&gt;db&lt;&#x2F;code&gt; parameter passed to each method.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Wrapping FdbError in custom error types.&lt;&#x2F;strong&gt; The &lt;code&gt;db.run()&lt;&#x2F;code&gt; retry mechanism checks if errors are retryable via &lt;code&gt;FdbError::is_retryable()&lt;&#x2F;code&gt;. If you wrap &lt;code&gt;FdbError&lt;&#x2F;code&gt; in your own error type (like &lt;code&gt;anyhow::Error&lt;&#x2F;code&gt; or a custom enum), the retry logic can&#x27;t see the underlying error and won&#x27;t retry. Keep &lt;code&gt;FdbError&lt;&#x2F;code&gt; unwrapped in your transaction closures, or ensure your error type preserves retryability information.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Assuming setup is safe from failures.&lt;&#x2F;strong&gt; BUGGIFY is disabled during &lt;code&gt;setup()&lt;&#x2F;code&gt;, so you might think transactions can&#x27;t fail. But simulation randomizes FDB knobs, which can still cause transaction failures. Always use &lt;code&gt;db.run()&lt;&#x2F;code&gt; with retry logic even in setup, or wrap your setup in a retry loop.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-real-value&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-real-value&quot; aria-label=&quot;Anchor link for: the-real-value&quot;&gt;🔗&lt;&#x2F;a&gt;The Real Value&lt;&#x2F;h2&gt;
&lt;p&gt;That &lt;code&gt;commit_unknown_result&lt;&#x2F;code&gt; edge case appeared on our first simulation seed. In production, we&#x27;d still be hunting it months later. 30 minutes of simulation covers what would take 24 hours of chaos testing. But the real value of simulation testing isn&#x27;t just finding bugs, it&#x27;s &lt;strong&gt;forcing you to think about correctness.&lt;&#x2F;strong&gt; When you design a workload, you&#x27;re forced to ask: &quot;What happens when this retries during a partition?&quot; &quot;How do I verify correctness when transactions can commit in any order?&quot; &quot;What invariants must hold no matter what chaos occurs?&quot; Designing for chaos becomes natural. And if it survives simulation, it survives production.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with questions or to share your simulation workloads. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">foundationdb</category>
          <category domain="tag">rust</category>
          <category domain="tag">testing</category>
          <category domain="tag">simulation</category>
          <category domain="tag">deterministic</category>
          <category domain="tag">distributed-systems</category>
      </item>
      <item>
          <title>Diving into Kubernetes&#x27; Watch Cache</title>
          <pubDate>Wed, 12 Nov 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/diving-into-kubernetes-watch-cache/</link>
          <guid>https://pierrezemb.fr/posts/diving-into-kubernetes-watch-cache/</guid>
          <description xml:base="https://pierrezemb.fr/posts/diving-into-kubernetes-watch-cache/">&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;tags&#x2F;diving-into&#x2F;&quot;&gt;Diving Into&lt;&#x2F;a&gt; is a blogpost series where we dig into specific parts of a project&#x27;s codebase. In this episode, we dig into Kubernetes&#x27; watch cache implementation.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;While debugging an etcd-shim on FoundationDB, I kept hitting &lt;code&gt;&quot;Timeout: Too large resource version&quot;&lt;&#x2F;code&gt; errors. The cache was stuck at revision 3044, but clients requested 3047. Three seconds later: timeout. This led me into the watch cache internals: specifically the 3-second timeout in &lt;code&gt;waitUntilFreshAndBlock()&lt;&#x2F;code&gt; and how progress notifications solve the problem. Let&#x27;s dig into how it actually works.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;&#x2F;strong&gt; Yes, &lt;a href=&quot;https:&#x2F;&#x2F;clever.cloud&quot;&gt;Clever Cloud&lt;&#x2F;a&gt; runs an etcd-shim on top of FoundationDB for Kubernetes. Truth is, we&#x27;re not alone: &lt;a href=&quot;https:&#x2F;&#x2F;aws.amazon.com&#x2F;blogs&#x2F;containers&#x2F;under-the-hood-amazon-eks-ultra-scale-clusters&#x2F;&quot;&gt;AWS&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;blog&#x2F;products&#x2F;containers-kubernetes&#x2F;gke-65k-nodes-and-counting?hl=en&quot;&gt;GKE&lt;&#x2F;a&gt; have custom storage layers too. After &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=IrJyrGQ_R9c&quot;&gt;operating etcd at OVHcloud&lt;&#x2F;a&gt;, we chose a different path. I actually wrote a naive PoC during COVID (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;fdb-etcd&quot;&gt;fdb-etcd&lt;&#x2F;a&gt;) without testing it against a real apiserver 😅 it was mostly an excuse to discover &lt;a href=&quot;https:&#x2F;&#x2F;pierrez.github.io&#x2F;fdb-book&#x2F;the-record-layer&#x2F;what-is-record-layer.html&quot;&gt;the Record-Layer&lt;&#x2F;a&gt;. You can read more about the technical challenges in &lt;a href=&quot;https:&#x2F;&#x2F;forums.foundationdb.org&#x2F;t&#x2F;a-foundationdb-layer-for-apiserver-as-an-alternative-to-etcd&#x2F;2697&quot;&gt;this FoundationDB forum discussion&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;overview-of-the-watch-cache&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#overview-of-the-watch-cache&quot; aria-label=&quot;Anchor link for: overview-of-the-watch-cache&quot;&gt;🔗&lt;&#x2F;a&gt;Overview of the Watch Cache&lt;&#x2F;h2&gt;
&lt;p&gt;When I first looked at the watch cache implementation, I expected a single monolithic cache sitting between the apiserver and etcd. It took compiling my own apiserver with additional logging to realize the architecture is more interesting: &lt;strong&gt;each resource type gets its own independent Cacher instance&lt;&#x2F;strong&gt;. Pods have one. Services have another. Deployments get their own. Every resource type runs an isolated LIST+WATCH loop, maintaining its own in-memory cache.&lt;&#x2F;p&gt;
&lt;p&gt;As the &lt;a href=&quot;https:&#x2F;&#x2F;kubernetes.io&#x2F;blog&#x2F;2024&#x2F;08&#x2F;15&#x2F;consistent-read-from-cache-beta&#x2F;&quot;&gt;Kubernetes v1.31 blog post&lt;&#x2F;a&gt; explains, the consistent-reads-from-cache enhancement allows the API server to serve consistent read requests directly from the watch cache, significantly reducing the load on etcd and improving overall cluster performance.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;architecture&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#architecture&quot; aria-label=&quot;Anchor link for: architecture&quot;&gt;🔗&lt;&#x2F;a&gt;Architecture&lt;&#x2F;h2&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;Client Requests (kubectl, controllers)
&lt;&#x2F;span&gt;&lt;span&gt;          ↓
&lt;&#x2F;span&gt;&lt;span&gt;    Cacher (per resource)
&lt;&#x2F;span&gt;&lt;span&gt;          ↓ In-memory watch cache
&lt;&#x2F;span&gt;&lt;span&gt;          ↓ (on cache miss&#x2F;delegate)
&lt;&#x2F;span&gt;&lt;span&gt;    etcd3&#x2F;Store
&lt;&#x2F;span&gt;&lt;span&gt;          ↓
&lt;&#x2F;span&gt;&lt;span&gt;    etcd &#x2F; etcd-shim
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The main components:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;cacher&#x2F;cacher.go&quot;&gt;cacher.go&lt;&#x2F;a&gt; - The in-memory watch cache&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;etcd3&#x2F;store.go&quot;&gt;store.go&lt;&#x2F;a&gt; - Direct &lt;a href=&quot;&#x2F;posts&#x2F;notes-about-etcd&#x2F;&quot;&gt;etcd&lt;&#x2F;a&gt; communication layer&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;how-the-cache-gets-fed&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-the-cache-gets-fed&quot; aria-label=&quot;Anchor link for: how-the-cache-gets-fed&quot;&gt;🔗&lt;&#x2F;a&gt;How The Cache Gets Fed&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;initialization-the-list-phase&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#initialization-the-list-phase&quot; aria-label=&quot;Anchor link for: initialization-the-list-phase&quot;&gt;🔗&lt;&#x2F;a&gt;Initialization: The LIST Phase&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;Nothing works until the cache initializes&lt;&#x2F;strong&gt;. When a Cacher starts, every read for that resource blocks until initialization completes. This matters because initialization isn&#x27;t instant: it&#x27;s a paginated LIST operation fetching 10,000 items per page. For a large cluster with thousands of pods, this takes time.&lt;&#x2F;p&gt;
&lt;p&gt;Here&#x27;s the sequence: The Reflector pattern kicks off with a complete LIST operation. Each resource cache fetches all existing objects through paginated requests. Once the LIST completes, &lt;code&gt;watchCache.Replace()&lt;&#x2F;code&gt; populates the in-memory cache with these objects. The &lt;strong&gt;critical moment&lt;&#x2F;strong&gt; happens when the &lt;code&gt;SetOnReplace()&lt;&#x2F;code&gt; callback fires (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;cacher&#x2F;cacher.go#L468-L478&quot;&gt;cacher.go:468-478&lt;&#x2F;a&gt;), marking the cache as READY. Until that callback fires, every request for that resource waits.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;continuous-sync-the-watch-phase&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#continuous-sync-the-watch-phase&quot; aria-label=&quot;Anchor link for: continuous-sync-the-watch-phase&quot;&gt;🔗&lt;&#x2F;a&gt;Continuous Sync: The WATCH Phase&lt;&#x2F;h3&gt;
&lt;p&gt;After initialization, the real trick begins: the cache maintains synchronization through a Watch stream that starts at LIST revision + 1. This &lt;strong&gt;guarantees no events are missed&lt;&#x2F;strong&gt; between the LIST and WATCH operations. The watch picks up exactly where the list left off. Events flow from etcd through a buffered channel (capacity: 100 events) and are processed by the &lt;code&gt;dispatchEvents()&lt;&#x2F;code&gt; goroutine, which runs continuously, matching events to interested watchers.&lt;&#x2F;p&gt;
&lt;p&gt;This pattern depends on continuous event flow. When a resource goes quiet and events stop arriving, progress notifications become essential. See the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;client-go&#x2F;tools&#x2F;cache&#x2F;reflector.go&quot;&gt;Reflector source&lt;&#x2F;a&gt; for the complete pattern.&lt;&#x2F;p&gt;
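&lt;p&gt;To make the LIST-to-WATCH handoff concrete, here is a minimal sketch of the pattern. It is not the real Reflector or etcd client: &lt;code&gt;fakeStore&lt;&#x2F;code&gt;, &lt;code&gt;event&lt;&#x2F;code&gt;, &lt;code&gt;watchFrom()&lt;&#x2F;code&gt; and &lt;code&gt;syncCache()&lt;&#x2F;code&gt; are hypothetical names used for illustration, and the store is assumed to be an in-memory log where each write bumps the revision by one.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// event is a hypothetical change record: the revision at which a key changed.
type event struct {
	rev   uint64
	key   string
	value string
}

// fakeStore stands in for etcd: an append-only log of events.
// It is an illustration only, not the real etcd3 store.
type fakeStore struct {
	log []event
}

// put appends a change at the next revision and returns that revision.
func (s *fakeStore) put(key, value string) uint64 {
	rev := uint64(len(s.log)) + 1
	s.log = append(s.log, event{rev: rev, key: key, value: value})
	return rev
}

// list returns a snapshot of all keys plus the revision the snapshot reflects.
func (s *fakeStore) list() (map[string]string, uint64) {
	snapshot := make(map[string]string)
	for _, e := range s.log {
		snapshot[e.key] = e.value
	}
	return snapshot, uint64(len(s.log))
}

// watchFrom returns every event whose revision is at least startRev.
func (s *fakeStore) watchFrom(startRev uint64) []event {
	var out []event
	for _, e := range s.log {
		if e.rev >= startRev {
			out = append(out, e)
		}
	}
	return out
}

// syncCache runs LIST, lets concurrent writes happen, then replays the
// WATCH starting at LIST revision + 1 so nothing in between is skipped.
func syncCache(s *fakeStore, writesBetween func()) (map[string]string, uint64) {
	cache, rev := s.list()
	writesBetween()
	for _, e := range s.watchFrom(rev + 1) {
		cache[e.key] = e.value
		rev = e.rev
	}
	return cache, rev
}

func main() {
	s := new(fakeStore)
	s.put("pod-a", "v1")
	s.put("pod-b", "v1")
	cache, rev := syncCache(s, func() {
		s.put("pod-a", "v2") // lands after the LIST, before the WATCH
	})
	fmt.Println(cache["pod-a"], cache["pod-b"], rev)
}
```

&lt;p&gt;The write that lands between the LIST and the WATCH is not lost: because the watch starts at the LIST revision + 1, it is replayed into the cache as the first event.&lt;&#x2F;p&gt;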
&lt;h2 id=&quot;the-problem-timeout-too-large-resource-version&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-problem-timeout-too-large-resource-version&quot; aria-label=&quot;Anchor link for: the-problem-timeout-too-large-resource-version&quot;&gt;🔗&lt;&#x2F;a&gt;The Problem: &quot;Timeout: Too large resource version&quot;&lt;&#x2F;h2&gt;
&lt;p&gt;While debugging our etcd-shim, we kept hitting this error:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;Error getting keys: err=&amp;quot;Timeout: Too large resource version: 3047, current: 3044&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A client was requesting ResourceVersion 3047, but the cache only knew about revision 3044. The cache would wait... and time out after 3 seconds.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;understanding-cache-freshness&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#understanding-cache-freshness&quot; aria-label=&quot;Anchor link for: understanding-cache-freshness&quot;&gt;🔗&lt;&#x2F;a&gt;Understanding Cache Freshness&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;the-freshness-check&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-freshness-check&quot; aria-label=&quot;Anchor link for: the-freshness-check&quot;&gt;🔗&lt;&#x2F;a&gt;The Freshness Check&lt;&#x2F;h3&gt;
&lt;p&gt;When a client requests a consistent read at a specific ResourceVersion, Kubernetes needs to ensure the cache is &quot;fresh enough&quot; to serve that request. Here&#x27;s the check: is my current revision at least as high as the requested revision? If not, it calls &lt;code&gt;waitUntilFreshAndBlock()&lt;&#x2F;code&gt; with a 3-second timeout, waiting for Watch events to bring the cache up to date.&lt;&#x2F;p&gt;
&lt;p&gt;From &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;cacher&#x2F;cacher.go#L1257-L1261&quot;&gt;cacher.go:1257-1261&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;c&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;watchCache&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;notFresh&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;requestedWatchRV&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;c&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;watchCache&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;waitingUntilFresh&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Add&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;defer &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;c&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;watchCache&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;waitingUntilFresh&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Remove&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;c&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;watchCache&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;waitUntilFreshAndBlock&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;requestedWatchRV&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The actual timeout implementation (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;cacher&#x2F;watch_cache.go#L448-L488&quot;&gt;watch_cache.go:448-488&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;func &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w &lt;&#x2F;span&gt;&lt;span&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;watchCache&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;waitUntilFreshAndBlock&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx context&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;Context&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;resourceVersion &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;uint64&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;error &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;startTime &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;clock&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Now&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;defer func&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;resourceVersion &lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;metrics&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;WatchCacheReadWait&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;WithContext&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;WithLabelValues&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;groupResource&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Group&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;groupResource&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Resource&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Observe&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;clock&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Since&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;startTime&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Seconds&lt;&#x2F;span&gt;&lt;span&gt;())
&lt;&#x2F;span&gt;&lt;span&gt;        }
&lt;&#x2F;span&gt;&lt;span&gt;    }()
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; In case resourceVersion is 0, we accept arbitrarily stale result.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; As a result, the condition in the below for loop will never be
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; satisfied (w.resourceVersion is never negative), this call will
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; never hit the w.cond.Wait().
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; As a result - we can optimize the code by not firing the wakeup
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; function (and avoid starting a gorotuine), especially given that
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; resourceVersion=0 is the most common case.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;resourceVersion &lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;go func&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Wake us up when the time limit has expired.  The docs
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; promise that time.After (well, NewTimer, which it calls)
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; will wait *at least* the duration given. Since this go
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; routine starts sometime after we record the start time, and
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; it will wake up the loop below sometime after the broadcast,
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; we don&amp;#39;t need to worry about waking it up before the time
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; has expired accidentally.
&lt;&#x2F;span&gt;&lt;span&gt;            &amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;clock&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;After&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;blockTimeout&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;cond&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Broadcast&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;        }()
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RLock&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;span &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tracing&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;SpanFromContext&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;span&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;AddEvent&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;watchCache locked acquired&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;resourceVersion &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;resourceVersion &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;clock&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Since&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;startTime&lt;&#x2F;span&gt;&lt;span&gt;) &amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;blockTimeout &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Request that the client retry after &amp;#39;resourceVersionTooHighRetrySeconds&amp;#39; seconds.
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;storage&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;NewTooLargeResourceVersionError&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;resourceVersion&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;resourceVersion&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;resourceVersionTooHighRetrySeconds&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;        }
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;w&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;cond&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Wait&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;span&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;AddEvent&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;watchCache fresh enough&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the cache can&#x27;t catch up within those 3 seconds, the request times out.&lt;&#x2F;p&gt;
&lt;p&gt;If you&#x27;ve ever seen kubectl commands hang for exactly 3 seconds before returning data, this is why. The cache is waiting for events that will never come.&lt;&#x2F;p&gt;
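&lt;p&gt;The same wait-with-deadline idea can be reproduced in a few lines if you want to experiment with it. The sketch below is a stripped-down model, not the Kubernetes code: &lt;code&gt;miniCache&lt;&#x2F;code&gt;, &lt;code&gt;advance()&lt;&#x2F;code&gt;, &lt;code&gt;waitUntilFresh()&lt;&#x2F;code&gt; and the 50ms &lt;code&gt;demoTimeout&lt;&#x2F;code&gt; are hypothetical, and it assumes a plain mutex and &lt;code&gt;time.AfterFunc&lt;&#x2F;code&gt; instead of the read lock and injected clock used upstream.&lt;&#x2F;p&gt;

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// demoTimeout is a shortened, hypothetical stand-in for the 3-second timeout.
const demoTimeout = 50 * time.Millisecond

var errTooLargeRV = errors.New("too large resource version")

// miniCache is a hypothetical, stripped-down stand-in for the watch cache:
// a revision guarded by a condition variable.
type miniCache struct {
	cond *sync.Cond
	rv   uint64
}

func newMiniCache(rv uint64) *miniCache {
	c := new(miniCache)
	c.cond = sync.NewCond(new(sync.Mutex))
	c.rv = rv
	return c
}

// advance records a newer revision and wakes up every waiting reader.
func (c *miniCache) advance(rv uint64) {
	c.cond.L.Lock()
	if rv > c.rv {
		c.rv = rv
	}
	c.cond.L.Unlock()
	c.cond.Broadcast()
}

// waitUntilFresh blocks until the cache reaches want, or returns an error
// once timeout has elapsed without the cache catching up.
func (c *miniCache) waitUntilFresh(want uint64, timeout time.Duration) error {
	start := time.Now()
	// Wake the loop when the deadline passes so it can observe the timeout.
	// Broadcasting under the lock ensures the waiter is already parked.
	timer := time.AfterFunc(timeout, func() {
		c.cond.L.Lock()
		c.cond.Broadcast()
		c.cond.L.Unlock()
	})
	defer timer.Stop()

	c.cond.L.Lock()
	defer c.cond.L.Unlock()
	for want > c.rv {
		if time.Since(start) >= timeout {
			return errTooLargeRV
		}
		c.cond.Wait()
	}
	return nil
}

func main() {
	c := newMiniCache(3044)

	// Nothing advances the cache: the wait ends with an error.
	fmt.Println(c.waitUntilFresh(3047, demoTimeout))

	// The revision advances while a reader is waiting: the wait succeeds.
	time.AfterFunc(demoTimeout/5, func() { c.advance(3047) })
	fmt.Println(c.waitUntilFresh(3047, demoTimeout))
}
```

&lt;p&gt;The first call has nobody to wake it except the timer, which is exactly the situation a quiet resource ends up in.&lt;&#x2F;p&gt;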
&lt;h3 id=&quot;the-problem-with-quiet-resources&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-problem-with-quiet-resources&quot; aria-label=&quot;Anchor link for: the-problem-with-quiet-resources&quot;&gt;🔗&lt;&#x2F;a&gt;The Problem with Quiet Resources&lt;&#x2F;h3&gt;
&lt;p&gt;This is where things get tricky. For infrequently updated resources (namespaces, configmaps, etc.), the timeline looks like this:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Time&lt;&#x2F;th&gt;&lt;th&gt;Component&lt;&#x2F;th&gt;&lt;th&gt;Event&lt;&#x2F;th&gt;&lt;th&gt;Cache RV&lt;&#x2F;th&gt;&lt;th&gt;etcd RV&lt;&#x2F;th&gt;&lt;th&gt;Notes&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;T0&lt;&#x2F;td&gt;&lt;td&gt;Namespace cache&lt;&#x2F;td&gt;&lt;td&gt;Idle, no changes&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;No namespace changes for 5 minutes&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T1&lt;&#x2F;td&gt;&lt;td&gt;Pod&#x2F;Service caches&lt;&#x2F;td&gt;&lt;td&gt;Resources changing&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;Global etcd revision advances&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T2&lt;&#x2F;td&gt;&lt;td&gt;Namespace watch&lt;&#x2F;td&gt;&lt;td&gt;Receives nothing&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;No namespace events to process&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T3&lt;&#x2F;td&gt;&lt;td&gt;Namespace cache&lt;&#x2F;td&gt;&lt;td&gt;Still waiting&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;Cache stuck, unaware of global progress&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T4&lt;&#x2F;td&gt;&lt;td&gt;Client&lt;&#x2F;td&gt;&lt;td&gt;Lists pods successfully&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;Response includes current RV 3047&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T5&lt;&#x2F;td&gt;&lt;td&gt;Client&lt;&#x2F;td&gt;&lt;td&gt;Requests namespace read at RV ≥ 3047&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;Consistent read requirement&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T6&lt;&#x2F;td&gt;&lt;td&gt;Namespace cache&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;waitUntilFreshAndBlock()&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;&quot;I&#x27;m at 3044, need 3047... waiting&quot;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T7&lt;&#x2F;td&gt;&lt;td&gt;Namespace cache&lt;&#x2F;td&gt;&lt;td&gt;Timeout!&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;3 seconds elapsed, returns error&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The cache has no way to know if etcd has moved forward. Is the system healthy? Is something broken? It just sees... nothing.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;timeout-behavior-summary&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#timeout-behavior-summary&quot; aria-label=&quot;Anchor link for: timeout-behavior-summary&quot;&gt;🔗&lt;&#x2F;a&gt;Timeout Behavior Summary&lt;&#x2F;h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;&#x2F;th&gt;&lt;th&gt;Cache RV&lt;&#x2F;th&gt;&lt;th&gt;Requested RV&lt;&#x2F;th&gt;&lt;th&gt;Result&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Fresh cache&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;3045&lt;&#x2F;td&gt;&lt;td&gt;✓ Serve immediately&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Stale cache&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;⏱ Wait 3s → timeout&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;With progress&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;✓ RequestProgress → serve&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h2 id=&quot;progress-notifications-keeping-quiet-resources-fresh&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#progress-notifications-keeping-quiet-resources-fresh&quot; aria-label=&quot;Anchor link for: progress-notifications-keeping-quiet-resources-fresh&quot;&gt;🔗&lt;&#x2F;a&gt;Progress Notifications: Keeping Quiet Resources Fresh&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;what-are-progress-notifications&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-are-progress-notifications&quot; aria-label=&quot;Anchor link for: what-are-progress-notifications&quot;&gt;🔗&lt;&#x2F;a&gt;What Are Progress Notifications?&lt;&#x2F;h3&gt;
&lt;p&gt;Here&#x27;s the trick: progress notifications are &lt;strong&gt;empty Watch responses&lt;&#x2F;strong&gt; that only update the revision:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;WatchResponse &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Header&lt;&#x2F;span&gt;&lt;span&gt;: { &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Revision&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;3047 &lt;&#x2F;span&gt;&lt;span&gt;},  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Current etcd revision
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Events&lt;&#x2F;span&gt;&lt;span&gt;: []                     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; No actual data changes
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;They solve the quiet resource problem by telling the cache: &quot;etcd is now at revision X, even though your resource hasn&#x27;t changed.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;This is exactly what we had forgotten to implement in our etcd-shim. We handled regular Watch events perfectly, but didn&#x27;t support progress notifications. The result? Kubernetes&#x27; watch cache would time out waiting for revisions that would never arrive through normal events. Once we added &lt;code&gt;RequestProgress&lt;&#x2F;code&gt; support and started sending these empty progress responses, the timeouts disappeared.&lt;&#x2F;p&gt;
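&lt;p&gt;Here is a rough sketch of how a cache can consume such a response, under a deliberately simplified model: &lt;code&gt;watchResponse&lt;&#x2F;code&gt;, &lt;code&gt;change&lt;&#x2F;code&gt; and &lt;code&gt;resourceCache&lt;&#x2F;code&gt; are hypothetical types that only mirror the shape shown above, not the etcd or Kubernetes API.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// change is a hypothetical key/value update carried by a watch response.
type change struct {
	key   string
	value string
}

// watchResponse mirrors the shape shown above: a header revision plus events.
type watchResponse struct {
	revision uint64
	events   []change
}

// isProgressNotify reports whether the response carries a revision but no events.
func (r watchResponse) isProgressNotify() bool {
	return len(r.events) == 0
}

// resourceCache is a simplified per-resource cache: objects plus the
// latest revision the cache knows about.
type resourceCache struct {
	rv      uint64
	objects map[string]string
}

// apply folds a watch response into the cache. Regular responses update
// objects; every response, including an empty one, can move the revision.
func (c *resourceCache) apply(r watchResponse) {
	for _, e := range r.events {
		c.objects[e.key] = e.value
	}
	// Never move backwards if a response arrives with an older revision.
	if r.revision > c.rv {
		c.rv = r.revision
	}
}

// fresh reports whether the cache can serve a read at requestedRV.
func (c *resourceCache) fresh(requestedRV uint64) bool {
	return c.rv >= requestedRV
}

func main() {
	c := new(resourceCache)
	c.rv = 3044
	c.objects = map[string]string{"default": "Active"}

	// Quiet resource: the cache is behind the requested revision.
	fmt.Println(c.fresh(3047))

	// An empty response only carries the current store revision.
	progress := watchResponse{revision: 3047}
	fmt.Println(progress.isProgressNotify())
	c.apply(progress)

	// Same objects as before, but the cache is now fresh enough.
	fmt.Println(c.fresh(3047), len(c.objects))
}
```

&lt;p&gt;The objects never change in this example. Only the revision moves, which is all a waiting consistent read needs.&lt;&#x2F;p&gt;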
&lt;h3 id=&quot;two-mechanisms&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#two-mechanisms&quot; aria-label=&quot;Anchor link for: two-mechanisms&quot;&gt;🔗&lt;&#x2F;a&gt;Two Mechanisms&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;1-on-demand-requestwatchprogress&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#1-on-demand-requestwatchprogress&quot; aria-label=&quot;Anchor link for: 1-on-demand-requestwatchprogress&quot;&gt;🔗&lt;&#x2F;a&gt;1. On-Demand: RequestWatchProgress()&lt;&#x2F;h4&gt;
&lt;p&gt;When the cache needs to catch up, it can explicitly request a progress notification. See &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;etcd3&#x2F;store.go#L99-L103&quot;&gt;store.go:99-103&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;func &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s &lt;&#x2F;span&gt;&lt;span&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;store&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;RequestWatchProgress&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx context&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;Context&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;error &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;client&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RequestProgress&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;watchContext&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;))
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When called, etcd responds with a bookmark (also called a progress notification) containing the current revision. The cache at revision 3044 calls &lt;code&gt;RequestProgress()&lt;&#x2F;code&gt;, receives &lt;code&gt;{ Revision: 3047, Events: [] }&lt;&#x2F;code&gt;, and immediately updates its internal state to 3047.&lt;&#x2F;p&gt;
&lt;p&gt;The progress notification is detected in the watch stream (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;etcd3&#x2F;watcher.go#L401-L404&quot;&gt;watcher.go:401-404&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Handle progress notifications (bookmarks)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;wres&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;IsProgressNotify&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;wc&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;queueEvent&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;progressNotifyEvent&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;wres&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Header&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;GetRevision&lt;&#x2F;span&gt;&lt;span&gt;()))
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;metrics&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RecordEtcdBookmark&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;wc&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;watcher&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;groupResource&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;continue
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h4 id=&quot;2-proactive-periodic-progress-requests&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#2-proactive-periodic-progress-requests&quot; aria-label=&quot;Anchor link for: 2-proactive-periodic-progress-requests&quot;&gt;🔗&lt;&#x2F;a&gt;2. Proactive: Periodic Progress Requests&lt;&#x2F;h4&gt;
&lt;p&gt;Kubernetes also runs a background component called &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;cacher&#x2F;cacher.go#L425-L428&quot;&gt;progressRequester&lt;&#x2F;a&gt; that monitors quiet watches. This component detects when watches haven&#x27;t received events for a while and periodically calls &lt;code&gt;RequestProgress()&lt;&#x2F;code&gt; to ensure even completely idle resources stay fresh. This proactive approach prevents timeout errors before they happen.&lt;&#x2F;p&gt;
&lt;p&gt;The progress requester is initialized when the Cacher is created (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;release-1.34&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;cacher&#x2F;cacher.go#L425-L428&quot;&gt;cacher.go:425-428&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;progressRequester &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;progress&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;NewConditionalProgressRequester&lt;&#x2F;span&gt;&lt;span&gt;(
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;config&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Storage&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RequestWatchProgress&lt;&#x2F;span&gt;&lt;span&gt;,  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; The function to call
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;config&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Clock&lt;&#x2F;span&gt;&lt;span&gt;,
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;contextMetadata
&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
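&lt;p&gt;The idea behind the periodic requester can be sketched in a few lines of Go. This is a simplified illustration, not the Kubernetes implementation: the &lt;code&gt;progressMonitor&lt;&#x2F;code&gt; type and its &lt;code&gt;observeEvent&lt;&#x2F;code&gt; and &lt;code&gt;tick&lt;&#x2F;code&gt; methods are hypothetical names, and the interval that would drive &lt;code&gt;tick&lt;&#x2F;code&gt; is left to the caller.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;package main

import &quot;fmt&quot;

&#x2F;&#x2F; progressMonitor is a hypothetical, simplified take on a periodic progress requester.
type progressMonitor struct {
    requestProgress func() error &#x2F;&#x2F; e.g. a wrapper around the store&#x27;s RequestWatchProgress
    sawEvent        bool         &#x2F;&#x2F; set when the watch delivered something since the last tick
    requests        int          &#x2F;&#x2F; number of progress requests sent so far
}

&#x2F;&#x2F; observeEvent records that the watch is not quiet.
func (m *progressMonitor) observeEvent() { m.sawEvent = true }

&#x2F;&#x2F; tick is meant to be called on a fixed interval (for example from a time.Ticker loop).
func (m *progressMonitor) tick() error {
    quiet := !m.sawEvent
    m.sawEvent = false
    if !quiet {
        return nil &#x2F;&#x2F; real events already advance the cache revision
    }
    m.requests++
    return m.requestProgress()
}

func main() {
    m := &amp;amp;progressMonitor{requestProgress: func() error {
        fmt.Println(&quot;RequestProgress sent&quot;)
        return nil
    }}
    m.observeEvent()
    _ = m.tick() &#x2F;&#x2F; busy interval: nothing is sent
    _ = m.tick() &#x2F;&#x2F; quiet interval: progress is requested
    fmt.Println(&quot;requests:&quot;, m.requests)
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Driving &lt;code&gt;tick&lt;&#x2F;code&gt; from the outside keeps the sketch deterministic and easy to test; a real component would also need to stop when the watch is torn down.&lt;&#x2F;p&gt;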
&lt;h2 id=&quot;the-complete-flow&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-complete-flow&quot; aria-label=&quot;Anchor link for: the-complete-flow&quot;&gt;🔗&lt;&#x2F;a&gt;The Complete Flow&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;Timeline showing how progress notifications solve the timeout:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Time&lt;&#x2F;th&gt;&lt;th&gt;Component&lt;&#x2F;th&gt;&lt;th&gt;Action&lt;&#x2F;th&gt;&lt;th&gt;Cache RV&lt;&#x2F;th&gt;&lt;th&gt;etcd RV&lt;&#x2F;th&gt;&lt;th&gt;Details&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;T0&lt;&#x2F;td&gt;&lt;td&gt;Namespace watch&lt;&#x2F;td&gt;&lt;td&gt;Established&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;No namespace changes happening&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T1&lt;&#x2F;td&gt;&lt;td&gt;Pod resources&lt;&#x2F;td&gt;&lt;td&gt;Creates&#x2F;updates&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;Namespace watch: silent, cache stuck at 3044&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T2&lt;&#x2F;td&gt;&lt;td&gt;Client&lt;&#x2F;td&gt;&lt;td&gt;Requests namespace LIST at RV 3047&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;notFresh(3047)&lt;&#x2F;code&gt; → true, starts &lt;code&gt;waitUntilFreshAndBlock()&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T3&lt;&#x2F;td&gt;&lt;td&gt;progressRequester&lt;&#x2F;td&gt;&lt;td&gt;Detects quiet watch&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;Calls &lt;code&gt;RequestProgress()&lt;&#x2F;code&gt; on namespace watch stream&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T4&lt;&#x2F;td&gt;&lt;td&gt;etcd&lt;&#x2F;td&gt;&lt;td&gt;Sends progress notification&lt;&#x2F;td&gt;&lt;td&gt;3044&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;WatchResponse { Header: { Revision: 3047 }, Events: [] }&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T5&lt;&#x2F;td&gt;&lt;td&gt;Namespace cache&lt;&#x2F;td&gt;&lt;td&gt;Processes bookmark&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;Updates internal revision 3044 → 3047, signals waiters&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;T6&lt;&#x2F;td&gt;&lt;td&gt;Namespace cache&lt;&#x2F;td&gt;&lt;td&gt;Returns successfully&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;3047&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;waitUntilFreshAndBlock()&lt;&#x2F;code&gt; completes, request served from cache&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
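&lt;p&gt;The timeline above can be replayed with a toy model in Go. This is a sketch under stated assumptions, not the Kubernetes code: the &lt;code&gt;watchCache&lt;&#x2F;code&gt; type and its &lt;code&gt;processBookmark&lt;&#x2F;code&gt; and &lt;code&gt;waitUntilFresh&lt;&#x2F;code&gt; methods are hypothetical names, and only the revisions and the 3-second timeout come from this post.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;package main

import (
    &quot;errors&quot;
    &quot;fmt&quot;
    &quot;sync&quot;
    &quot;time&quot;
)

&#x2F;&#x2F; watchCache is a toy stand-in for a per-resource watch cache.
&#x2F;&#x2F; It only tracks the latest revision it has observed.
type watchCache struct {
    mu       sync.Mutex
    cond     *sync.Cond
    revision int64
}

func newWatchCache(rev int64) *watchCache {
    c := &amp;amp;watchCache{revision: rev}
    c.cond = sync.NewCond(&amp;amp;c.mu)
    return c
}

&#x2F;&#x2F; processBookmark applies a progress notification: no events, only a newer revision.
func (c *watchCache) processBookmark(rev int64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if rev &amp;gt; c.revision {
        c.revision = rev
        c.cond.Broadcast() &#x2F;&#x2F; wake readers blocked in waitUntilFresh
    }
}

&#x2F;&#x2F; waitUntilFresh blocks until the cache reaches rev or the timeout expires.
func (c *watchCache) waitUntilFresh(rev int64, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    &#x2F;&#x2F; Wake waiters at the deadline so they can notice that time is up.
    timer := time.AfterFunc(timeout, func() {
        c.mu.Lock()
        c.cond.Broadcast()
        c.mu.Unlock()
    })
    defer timer.Stop()

    c.mu.Lock()
    defer c.mu.Unlock()
    for c.revision &amp;lt; rev {
        if !time.Now().Before(deadline) {
            return errors.New(&quot;timed out waiting for a fresh cache&quot;)
        }
        c.cond.Wait()
    }
    return nil
}

func main() {
    cache := newWatchCache(3044) &#x2F;&#x2F; T0: cache and etcd both at 3044

    &#x2F;&#x2F; T3 to T5: a progress notification arrives a little later with revision 3047.
    go func() {
        time.Sleep(50 * time.Millisecond)
        cache.processBookmark(3047)
    }()

    &#x2F;&#x2F; T2: a consistent LIST at RV 3047 must wait for the cache to catch up.
    err := cache.waitUntilFresh(3047, 3*time.Second)
    fmt.Println(&quot;served from cache:&quot;, err == nil)
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Removing the goroutine that delivers the bookmark reproduces the failure mode: the reader waits out the full timeout and returns an error.&lt;&#x2F;p&gt;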
&lt;h2 id=&quot;key-takeaways&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#key-takeaways&quot; aria-label=&quot;Anchor link for: key-takeaways&quot;&gt;🔗&lt;&#x2F;a&gt;Key Takeaways&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s what you need to know: Kubernetes runs a separate watch cache for each resource type (pods, services, deployments, etc.), and each one maintains its own LIST+WATCH loop. When you request a consistent read, the cache performs a freshness check with a &lt;strong&gt;3-second timeout&lt;&#x2F;strong&gt; via &lt;code&gt;waitUntilFreshAndBlock()&lt;&#x2F;code&gt;. Without progress notifications, a consistent read against a quiet resource whose cache lags behind would block for the full 3 seconds before timing out.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Progress notifications&lt;&#x2F;strong&gt; solve the critical problem of quiet resources: those that don&#x27;t receive updates for extended periods. These empty Watch responses update the cache&#x27;s revision without transferring data. Kubernetes implements this through two mechanisms: &lt;strong&gt;on-demand&lt;&#x2F;strong&gt; (explicit RequestProgress calls when the cache needs to catch up) and &lt;strong&gt;proactive&lt;&#x2F;strong&gt; (periodic monitoring by the progressRequester component).&lt;&#x2F;p&gt;
&lt;p&gt;Without progress notifications, consistent reads must bypass the cache entirely and go directly to etcd, significantly increasing load on the storage layer. This is the difference between a responsive cluster and one where every kubectl command feels sluggish.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;related-posts&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#related-posts&quot; aria-label=&quot;Anchor link for: related-posts&quot;&gt;🔗&lt;&#x2F;a&gt;Related Posts&lt;&#x2F;h2&gt;
&lt;p&gt;If you enjoyed this deep dive into Kubernetes watch caching, you might also be interested in:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;&#x2F;posts&#x2F;notes-about-etcd&#x2F;&quot;&gt;Notes about ETCD&lt;&#x2F;a&gt; - An overview and collection of resources about etcd, the distributed key-value store that powers Kubernetes&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;&#x2F;posts&#x2F;diving-into-etcd-linearizable&#x2F;&quot;&gt;Diving into ETCD&#x27;s linearizable reads&lt;&#x2F;a&gt; - A deep dive into how etcd implements linearizable reads using Raft consensus&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with Kubernetes watch caching. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">diving-into</category>
          <category domain="tag">kubernetes</category>
          <category domain="tag">distributed-systems</category>
          <category domain="tag">etcd</category>
          <category domain="tag">caching</category>
      </item>
      <item>
          <title>Diving into FoundationDB&#x27;s Simulation Framework</title>
          <pubDate>Thu, 30 Oct 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/diving-into-foundationdb-simulation/</link>
          <guid>https://pierrezemb.fr/posts/diving-into-foundationdb-simulation/</guid>
          <description xml:base="https://pierrezemb.fr/posts/diving-into-foundationdb-simulation/">&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;tags&#x2F;diving-into&#x2F;&quot;&gt;Diving Into&lt;&#x2F;a&gt; is a blog post series where we dig into a specific part of a project&#x27;s codebase. In this episode, we will dig into the implementation behind FoundationDB&#x27;s simulation framework.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;After years of on-call shifts running FoundationDB at Clever Cloud, here&#x27;s what I&#x27;ve learned: &lt;strong&gt;I&#x27;ve never been woken up by FDB&lt;&#x2F;strong&gt;. Every production incident traced back to our code, our infrastructure, our mistakes. Never FDB itself. That kind of reliability doesn&#x27;t happen by accident.&lt;&#x2F;p&gt;
&lt;p&gt;The secret? &lt;strong&gt;Deterministic simulation testing&lt;&#x2F;strong&gt;. FoundationDB runs the real database software (not mocks, not stubs) in a discrete-event simulator alongside randomized workloads and aggressive fault injection. All sources of nondeterminism are abstracted: network, disk, time, and random number generation. Multiple FDB servers communicate through a simulated network in a single-threaded process. The simulator injects machine crashes, rack failures, network partitions, disk corruption, bit flips. Every failure mode you can imagine, happening in rapid succession, deterministically. Same seed, same execution path, same bugs, every single time.&lt;&#x2F;p&gt;
&lt;p&gt;After roughly &lt;strong&gt;one trillion CPU-hours of simulation testing&lt;&#x2F;strong&gt;, FoundationDB has been stress-tested under conditions far worse than any production environment will ever encounter. The development environment is deliberately harsher than production: network partitions every few seconds, machine crashes mid-transaction, disks randomly swapped between nodes on reboot. If your code survives the simulator, production is easy.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve written before about &lt;a href=&quot;&#x2F;posts&#x2F;notes-about-foundationdb&#x2F;&quot;&gt;FoundationDB&lt;&#x2F;a&gt;, &lt;a href=&quot;&#x2F;posts&#x2F;simulation-driven-development&#x2F;&quot;&gt;simulation-driven development&lt;&#x2F;a&gt;, and &lt;a href=&quot;&#x2F;posts&#x2F;testing-prevention-vs-discovery&#x2F;&quot;&gt;testing prevention vs discovery&lt;&#x2F;a&gt;. Those posts cover the concepts and benefits. This post is different: &lt;strong&gt;this is how FoundationDB actually implements deterministic simulation&lt;&#x2F;strong&gt;. Interface swapping, deterministic event loops, BUGGIFY chaos injection, Flow actors, and the architecture that makes it all work. We&#x27;re going deep into the implementation.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;&#x2F;images&#x2F;fdb-simulation-deep-dive&#x2F;simulator-architecture.jpeg&quot; alt=&quot;FoundationDB Simulator Architecture&quot; &#x2F;&gt;
  &lt;p&gt;&lt;em&gt;FoundationDB&#x27;s simulation architecture: the same FDB server code runs in both the simulator process (using simulated I&#x2F;O) and the real world (using real I&#x2F;O)&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;h2 id=&quot;the-trick-interface-swapping&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-trick-interface-swapping&quot; aria-label=&quot;Anchor link for: the-trick-interface-swapping&quot;&gt;🔗&lt;&#x2F;a&gt;The Trick: Interface Swapping&lt;&#x2F;h2&gt;
&lt;p&gt;The genius of FDB&#x27;s simulation is surprisingly simple: &lt;strong&gt;the same code runs in both production and simulation by swapping interface implementations&lt;&#x2F;strong&gt;. The global &lt;code&gt;g_network&lt;&#x2F;code&gt; pointer holds an &lt;code&gt;INetwork&lt;&#x2F;code&gt; interface. In production, this points to &lt;code&gt;Net2&lt;&#x2F;code&gt;, which creates real TCP connections using Boost.ASIO. In simulation, it points to &lt;code&gt;Sim2&lt;&#x2F;code&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbrpc&#x2F;sim2.actor.cpp&quot;&gt;sim2.actor.cpp&lt;&#x2F;a&gt;), which creates &lt;code&gt;Sim2Conn&lt;&#x2F;code&gt; objects (fake connections that write to in-memory buffers).&lt;&#x2F;p&gt;
&lt;p&gt;When code needs to send data, it gets a &lt;code&gt;Reference&amp;lt;IConnection&amp;gt;&lt;&#x2F;code&gt; from the network layer. In production, that&#x27;s a real socket. In simulation, it&#x27;s &lt;code&gt;Sim2Conn&lt;&#x2F;code&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbrpc&#x2F;sim2.actor.cpp&quot;&gt;sim2.actor.cpp&lt;&#x2F;a&gt;) with a &lt;code&gt;std::deque&amp;lt;uint8_t&amp;gt;&lt;&#x2F;code&gt; buffer. Network latency? The simulator adds &lt;code&gt;delay()&lt;&#x2F;code&gt; calls with values from &lt;code&gt;deterministicRandom()&lt;&#x2F;code&gt;. Packet loss? Just throw &lt;code&gt;connection_failed()&lt;&#x2F;code&gt;. Network partition? &lt;code&gt;Sim2Conn&lt;&#x2F;code&gt; checks &lt;code&gt;g_clogging.disconnected()&lt;&#x2F;code&gt; and refuses delivery. &lt;strong&gt;It&#x27;s all just memory operations with delays&lt;&#x2F;strong&gt;, running single-threaded and completely deterministic.&lt;&#x2F;p&gt;
&lt;p&gt;What makes this truly deterministic is &lt;code&gt;deterministicRandom()&lt;&#x2F;code&gt;, a seeded PRNG that replaces all randomness. Every network latency value, every backoff delay (like the &lt;code&gt;Peer&lt;&#x2F;code&gt;&#x27;s exponential reconnection timing), every process crash timing goes through the same deterministic stream. Same seed, same execution path, every single time. When a test fails after 1 trillion simulated operations, you can reproduce the exact failure by running with the same seed.&lt;&#x2F;p&gt;
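&lt;p&gt;Interface swapping combined with a seeded PRNG can be illustrated with a small Go sketch. It is not FoundationDB code: &lt;code&gt;Conn&lt;&#x2F;code&gt;, &lt;code&gt;simConn&lt;&#x2F;code&gt;, &lt;code&gt;runWorkload&lt;&#x2F;code&gt; and the 30% drop rate are hypothetical, chosen only to show that the workload sees an interface while the simulator controls every random decision.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;package main

import (
    &quot;fmt&quot;
    &quot;math&#x2F;rand&quot;
)

&#x2F;&#x2F; Conn is the interface application code talks to.
&#x2F;&#x2F; A production build would back it with a real socket.
type Conn interface {
    Send(data []byte) error
}

&#x2F;&#x2F; simConn is the simulated implementation: bytes land in an in-memory buffer.
type simConn struct {
    rng      *rand.Rand &#x2F;&#x2F; the single seeded source of randomness
    buffer   []byte
    dropRate float64 &#x2F;&#x2F; probability that a send fails (assumed value)
}

func (c *simConn) Send(data []byte) error {
    if c.rng.Float64() &amp;lt; c.dropRate {
        return fmt.Errorf(&quot;connection failed&quot;)
    }
    c.buffer = append(c.buffer, data...)
    return nil
}

&#x2F;&#x2F; runWorkload is the application code: it only sees the Conn interface.
func runWorkload(conn Conn, messages int) string {
    trace := &quot;&quot;
    for i := 0; i &amp;lt; messages; i++ {
        if err := conn.Send([]byte{byte(i)}); err != nil {
            trace += &quot;x&quot;
        } else {
            trace += &quot;.&quot;
        }
    }
    return trace
}

&#x2F;&#x2F; simulate runs the workload against a simulated connection built from a seed.
func simulate(seed int64) string {
    rng := rand.New(rand.NewSource(seed))
    return runWorkload(&amp;amp;simConn{rng: rng, dropRate: 0.3}, 20)
}

func main() {
    first := simulate(42)
    second := simulate(42)
    fmt.Println(first)
    fmt.Println(&quot;same seed, same trace:&quot;, first == second)
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because every failure decision flows through one seeded generator, rerunning with the same seed replays the same sequence of successes and failures.&lt;&#x2F;p&gt;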
&lt;h3 id=&quot;biasing-the-simulator-buggify&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#biasing-the-simulator-buggify&quot; aria-label=&quot;Anchor link for: biasing-the-simulator-buggify&quot;&gt;🔗&lt;&#x2F;a&gt;Biasing the Simulator: BUGGIFY&lt;&#x2F;h3&gt;
&lt;p&gt;Most deep bugs need a rare combination of events. A network partition &lt;strong&gt;and&lt;&#x2F;strong&gt; a slow disk &lt;strong&gt;and&lt;&#x2F;strong&gt; a coordinator crash happening at the exact same moment. The probability of all three aligning randomly? Astronomical. You&#x27;d burn CPU-centuries waiting.&lt;&#x2F;p&gt;
&lt;p&gt;FoundationDB solves this with &lt;code&gt;BUGGIFY&lt;&#x2F;code&gt;, spread throughout the codebase. Each simulation run enables only a random subset of &lt;code&gt;BUGGIFY&lt;&#x2F;code&gt; points, and an enabled point fires 25% of the time, deterministically, so every test explores a different corner of the state space (Alex Miller&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;transactional.blog&#x2F;simulation&#x2F;buggify&quot;&gt;excellent post on BUGGIFY&lt;&#x2F;a&gt; covers the implementation details).&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take timeout handling in data distribution as an example:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; DDShardTracker.actor.cpp (fdbserver&#x2F;DDShardTracker.actor.cpp:1508)
&lt;&#x2F;span&gt;&lt;span&gt;choose {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;when&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;wait&lt;&#x2F;span&gt;&lt;span&gt;(g_network-&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;isSimulated&lt;&#x2F;span&gt;&lt;span&gt;() &amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;BUGGIFY_WITH_PROB&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.01&lt;&#x2F;span&gt;&lt;span&gt;) ? &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Never&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;                                                          : &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;fetchTopKShardMetrics_impl&lt;&#x2F;span&gt;&lt;span&gt;(self, req))) {}
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;when&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;wait&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;delay&lt;&#x2F;span&gt;&lt;span&gt;(SERVER_KNOBS-&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;DD_SHARD_METRICS_TIMEOUT&lt;&#x2F;span&gt;&lt;span&gt;))) {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Timeout path
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;Never()&lt;&#x2F;code&gt; future never completes. Literally hangs forever. This happens only in simulation (&lt;code&gt;g_network-&amp;gt;isSimulated()&lt;&#x2F;code&gt;) and with 1% probability (&lt;code&gt;BUGGIFY_WITH_PROB(0.01)&lt;&#x2F;code&gt;). When it fires, the operation gets stuck, forcing the timeout branch to execute. Simple, elegant failure injection.&lt;&#x2F;p&gt;
&lt;p&gt;But here&#x27;s the trick: &lt;strong&gt;the timeout value itself is also buggified&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; ServerKnobs.cpp
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;init&lt;&#x2F;span&gt;&lt;span&gt;( DD_SHARD_METRICS_TIMEOUT, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;60.0 &lt;&#x2F;span&gt;&lt;span&gt;);  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Production: 60 seconds
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt;( randomize &amp;amp;&amp;amp; BUGGIFY ) DD_SHARD_METRICS_TIMEOUT = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.1&lt;&#x2F;span&gt;&lt;span&gt;;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Simulation: 0.1 seconds!
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Production timeout: 60 seconds. BUGGIFY timeout: 0.1 seconds (600x shorter). The shrinking timeout window means legitimate operations are far more likely to hit timeout paths. Even without &lt;code&gt;Never()&lt;&#x2F;code&gt; forcing a hang, simulated network delays and slow operations will trigger timeouts constantly. When &lt;code&gt;Never()&lt;&#x2F;code&gt; does fire, you get guaranteed timeout execution. Every knob marked &lt;code&gt;if (randomize &amp;amp;&amp;amp; BUGGIFY)&lt;&#x2F;code&gt; becomes a chaos variable. Timeouts shrink, cache sizes drop, I&#x2F;O patterns randomize.&lt;&#x2F;p&gt;
&lt;p&gt;This creates a &lt;strong&gt;combinatorial explosion&lt;&#x2F;strong&gt;. FoundationDB has hundreds of randomized knobs. Each BUGGIFY-enabled test run picks a different configuration: maybe connection monitors are 4x slower, but file I&#x2F;O is using 32KB blocks, and cache size is 1000 entries, and reconnection delays are doubled. The next run? Completely different knob values. Same code, thousands of different operating environments. After one trillion simulated operations across countless test runs, you&#x27;ve stress-tested your code under scenarios that would take decades to encounter in production.&lt;&#x2F;p&gt;
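&lt;p&gt;Seed-driven knob randomization can be mimicked in Go. The sketch below is an approximation, not the FoundationDB macro: the &lt;code&gt;buggifier&lt;&#x2F;code&gt; type is hypothetical, and it assumes a two-stage decision in which a site is first enabled for the run and then fires on evaluation, both with an assumed probability of 0.25.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;package main

import (
    &quot;fmt&quot;
    &quot;math&#x2F;rand&quot;
)

&#x2F;&#x2F; buggifier decides, per call site, whether fault injection is active in this
&#x2F;&#x2F; run and whether it fires on a given evaluation. Both 0.25 probabilities are
&#x2F;&#x2F; assumptions made for this sketch.
type buggifier struct {
    rng     *rand.Rand
    enabled map[string]bool &#x2F;&#x2F; decided once per site, per run
}

func newBuggifier(seed int64) *buggifier {
    return &amp;amp;buggifier{rng: rand.New(rand.NewSource(seed)), enabled: map[string]bool{}}
}

&#x2F;&#x2F; fire reports whether the named injection site should misbehave right now.
func (b *buggifier) fire(site string) bool {
    on, seen := b.enabled[site]
    if !seen {
        on = b.rng.Float64() &amp;lt; 0.25
        b.enabled[site] = on
    }
    return on &amp;amp;&amp;amp; b.rng.Float64() &amp;lt; 0.25
}

&#x2F;&#x2F; knobs holds tunables whose values may be randomized at startup.
type knobs struct {
    shardMetricsTimeout float64 &#x2F;&#x2F; seconds
}

func initKnobs(b *buggifier, randomize bool) knobs {
    k := knobs{shardMetricsTimeout: 60.0} &#x2F;&#x2F; production default from the post
    if randomize &amp;amp;&amp;amp; b.fire(&quot;DD_SHARD_METRICS_TIMEOUT&quot;) {
        k.shardMetricsTimeout = 0.1 &#x2F;&#x2F; buggified value from the post
    }
    return k
}

func main() {
    for seed := int64(0); seed &amp;lt; 5; seed++ {
        k := initKnobs(newBuggifier(seed), true)
        fmt.Printf(&quot;seed %d: timeout %.1fs\n&quot;, seed, k.shardMetricsTimeout)
    }
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each seed yields one fixed configuration, so a failing run can be replayed with exactly the knob values that triggered it.&lt;&#x2F;p&gt;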
&lt;h2 id=&quot;the-developer-workflow-simulation-as-ci-cd&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-developer-workflow-simulation-as-ci-cd&quot; aria-label=&quot;Anchor link for: the-developer-workflow-simulation-as-ci-cd&quot;&gt;🔗&lt;&#x2F;a&gt;The Developer Workflow: Simulation as CI&#x2F;CD&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s the FoundationDB developer experience: &lt;strong&gt;write code, run a few local simulation tests to catch obvious bugs, submit your merge request, then let the machines do the hard work&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Every pull request triggers &lt;strong&gt;hundreds of thousands of simulation tests&lt;&#x2F;strong&gt; running on hundreds of cores for hours before a human even begins code review. Different seeds explore different execution paths, different failure timings, different BUGGIFY configurations. Nightly testing runs tens of thousands more simulations, crawling through edge cases you&#x27;d never think to test manually.&lt;&#x2F;p&gt;
&lt;p&gt;In the early days when FoundationDB was still a company, they took this philosophy to its logical extreme: &lt;strong&gt;merge requests were automatically merged if simulation passed&lt;&#x2F;strong&gt;. No human approval needed. The simulation was so trusted that passing tests meant the code was production-ready. (You can hear more about FoundationDB&#x27;s early development culture on &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=C1nZzQqcPZw&amp;amp;list=PLh4UhOpNuTJO1S8xkfa3QmQzJemsUhuL8&amp;amp;index=6&quot;&gt;The BugBash Podcast&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;This changes how you think about distributed systems development. Instead of spending hours debugging race conditions or trying to mentally model all possible failure scenarios, you focus on building features. The simulation finds the edge cases. It discovers the bugs you&#x27;d never anticipate. It stress-tests your code under conditions that would take years to encounter in production.&lt;&#x2F;p&gt;
&lt;p&gt;The scale ramps up through the development cycle: thousands of seeds during merge request testing, tens of thousands in nightly runs, potentially millions during major release cycles. Each seed represents a completely different execution path through your code. By the time your change reaches production, it&#x27;s survived more chaos than most distributed systems see in their entire lifetime.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The confidence this gives developers is extraordinary&lt;&#x2F;strong&gt;: if your code survives hundreds of thousands of simulated disasters, production feels easy in comparison.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;flow-actors-and-cooperative-multitasking&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#flow-actors-and-cooperative-multitasking&quot; aria-label=&quot;Anchor link for: flow-actors-and-cooperative-multitasking&quot;&gt;🔗&lt;&#x2F;a&gt;Flow: Actors and Cooperative Multitasking&lt;&#x2F;h2&gt;
&lt;p&gt;FoundationDB doesn&#x27;t use traditional threads. It uses Flow, a custom actor model built on C++. Here&#x27;s a simple example:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;ACTOR Future&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;asyncAdd&lt;&#x2F;span&gt;&lt;span&gt;(Future&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;offset&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; value = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;wait&lt;&#x2F;span&gt;&lt;span&gt;(f);  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Suspend until f completes, then resume with its value
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; value + offset;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;ACTOR&lt;&#x2F;code&gt; keyword marks functions that can use &lt;code&gt;wait()&lt;&#x2F;code&gt;. When you call &lt;code&gt;wait(f)&lt;&#x2F;code&gt;, the actor &lt;strong&gt;suspends&lt;&#x2F;strong&gt;. It returns control to the event loop and resumes later when the &lt;code&gt;Future&lt;&#x2F;code&gt; completes, continuing with the result. No blocking. All asynchronous. Use the &lt;code&gt;state&lt;&#x2F;code&gt; keyword for variables that need to persist across multiple &lt;code&gt;wait()&lt;&#x2F;code&gt; calls.&lt;&#x2F;p&gt;
&lt;p&gt;If you know Rust&#x27;s async&#x2F;await, Flow is the same concept. &lt;code&gt;ACTOR&lt;&#x2F;code&gt; functions are like &lt;code&gt;async fn&lt;&#x2F;code&gt;, &lt;code&gt;wait()&lt;&#x2F;code&gt; is like &lt;code&gt;.await&lt;&#x2F;code&gt;, and &lt;code&gt;Future&amp;lt;T&amp;gt;&lt;&#x2F;code&gt; is like Rust&#x27;s &lt;code&gt;Future&lt;&#x2F;code&gt;. The difference? Flow was built in 2009 for C++, and its actor compiler rewrites &lt;code&gt;ACTOR&lt;&#x2F;code&gt; functions into plain C++ state machines before the regular compiler runs, rather than relying on language support.&lt;&#x2F;p&gt;
&lt;p&gt;The same Flow code runs in both production and simulation. An actor waiting for network I&#x2F;O gets a real socket in production, a simulated buffer in simulation. The code doesn&#x27;t know the difference. The Flow documentation at &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;flow.html&quot;&gt;apple.github.io&#x2F;foundationdb&#x2F;flow.html&lt;&#x2F;a&gt; covers the full programming model.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;single-threaded-time-travel-the-event-loop&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#single-threaded-time-travel-the-event-loop&quot; aria-label=&quot;Anchor link for: single-threaded-time-travel-the-event-loop&quot;&gt;🔗&lt;&#x2F;a&gt;Single-Threaded Time Travel: The Event Loop&lt;&#x2F;h2&gt;
&lt;p&gt;Hundreds of actors running concurrently. Coordinators electing leaders, transaction logs replicating commits, storage servers handling reads. All happening in &lt;strong&gt;one thread&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The trick is cooperative multitasking. Actors yield control with &lt;code&gt;wait()&lt;&#x2F;code&gt;. When all actors are waiting, the event loop can &lt;strong&gt;advance simulated time&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        flowchart TD
    Start([Event Loop]) --&amp;gt; CheckReady{Any actors&amp;lt;br&amp;#x2F;&amp;gt;ready to run?}
    CheckReady --&amp;gt;|Yes| RunActor[Run next ready actor&amp;lt;br&amp;#x2F;&amp;gt;until it hits wait]
    RunActor --&amp;gt; CheckReady
    CheckReady --&amp;gt;|No, all waiting| CheckPending{Any pending&amp;lt;br&amp;#x2F;&amp;gt;futures?}
    CheckPending --&amp;gt;|Yes| AdvanceTime[Advance simulated clock&amp;lt;br&amp;#x2F;&amp;gt;to next event]
    AdvanceTime --&amp;gt; CheckReady
    CheckPending --&amp;gt;|No| Done([Simulation complete])
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;Here&#x27;s the key insight: when all actors are blocked waiting on futures, the event loop finds the next scheduled event (the earliest timestamp) and &lt;strong&gt;jumps the simulated clock forward&lt;&#x2F;strong&gt; to that time. Then it wakes the actors waiting for that event and runs them until they &lt;code&gt;wait()&lt;&#x2F;code&gt; again.&lt;&#x2F;p&gt;
&lt;p&gt;Example: 100 storage servers each execute &lt;code&gt;wait(delay(deterministicRandom()-&amp;gt;random01() * 60.0))&lt;&#x2F;code&gt;. In wall-clock time, scheduling and firing all of them takes microseconds. In simulated time, the delays are spread across 60 seconds. The event loop processes them in timestamp order, advancing the clock as it goes. &lt;strong&gt;Almost no wall-clock time has passed. Close to 60 simulated seconds have passed.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This gives you:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Compressed time&lt;&#x2F;strong&gt;: Years of uptime in seconds of testing. &lt;code&gt;wait(delay(86400.0))&lt;&#x2F;code&gt; simulates 24 hours instantly.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Perfect determinism&lt;&#x2F;strong&gt;: Single-threaded execution means no thread-scheduling races. Same seed, same event ordering, exact same execution path.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility&lt;&#x2F;strong&gt;: Test fails after 1 trillion simulated operations? Run again with the same seed, get the exact same failure at the exact same point.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;No actor ever blocks the thread. They all cooperate, yielding control back to the event loop. This is the foundation that makes realistic cluster simulation possible.&lt;&#x2F;p&gt;
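&lt;p&gt;The clock jump is easier to see in code. Here is a minimal sketch in plain C++ (not Flow, and not FoundationDB&#x27;s actual &lt;code&gt;Sim2&lt;&#x2F;code&gt; implementation). A priority queue of timers stands in for pending futures, &lt;code&gt;std::mt19937&lt;&#x2F;code&gt; stands in for &lt;code&gt;deterministicRandom()&lt;&#x2F;code&gt;, and the names &lt;code&gt;Event&lt;&#x2F;code&gt;, &lt;code&gt;Later&lt;&#x2F;code&gt;, and &lt;code&gt;timers&lt;&#x2F;code&gt; are made up for illustration.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&#x2F;&#x2F; Illustrative sketch only: a toy event loop, not the real simulator.
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;queue&amp;gt;
#include &amp;lt;random&amp;gt;
#include &amp;lt;vector&amp;gt;

struct Event {
    double time;                       &#x2F;&#x2F; simulated timestamp at which the timer fires
    std::function&amp;lt;void()&amp;gt; wake;  &#x2F;&#x2F; continuation to run at that time
};

&#x2F;&#x2F; Orders the queue so the earliest timestamp is on top.
struct Later {
    bool operator()(const Event&amp;amp; a, const Event&amp;amp; b) const { return a.time &amp;gt; b.time; }
};

int main() {
    std::priority_queue&amp;lt;Event, std::vector&amp;lt;Event&amp;gt;, Later&amp;gt; timers;
    std::mt19937 rng(42);  &#x2F;&#x2F; fixed seed, so every run schedules the same timers
    std::uniform_real_distribution&amp;lt;double&amp;gt; delay(0.0, 60.0);
    double now = 0.0;      &#x2F;&#x2F; the simulated clock
    int woken = 0;

    &#x2F;&#x2F; 100 storage servers each wait on a random delay of up to 60 simulated seconds.
    for (int i = 0; i &amp;lt; 100; i++) {
        timers.push(Event{now + delay(rng), [&amp;amp;woken] { woken++; }});
    }

    &#x2F;&#x2F; Nothing is ready to run, so jump straight to the earliest timer, run it, repeat.
    while (!timers.empty()) {
        Event next = timers.top();
        timers.pop();
        now = next.time;  &#x2F;&#x2F; the clock jumps forward; no thread ever sleeps
        next.wake();
    }
    std::printf(&amp;quot;woke %d actors, simulated clock at %.1fs\n&amp;quot;, woken, now);
    return 0;
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The loop never calls a sleep function. Simulated time is just a number that gets overwritten with the next event&#x27;s timestamp, which is why close to a minute of simulated delay costs almost nothing in wall-clock time.&lt;&#x2F;p&gt;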
&lt;h2 id=&quot;building-the-simulated-cluster&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#building-the-simulated-cluster&quot; aria-label=&quot;Anchor link for: building-the-simulated-cluster&quot;&gt;🔗&lt;&#x2F;a&gt;Building the Simulated Cluster&lt;&#x2F;h2&gt;
&lt;p&gt;Now that we understand Flow actors and the event loop, let&#x27;s see what runs on it. SimulatedCluster &lt;strong&gt;builds an entire distributed cluster in memory&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;SimulatedCluster&lt;&#x2F;code&gt; starts by generating a random cluster configuration: 1-5 datacenters, 1-10+ machines per DC, different storage engines (memory, ssd, redwood-1), different replication modes (single, double, triple). Every test run gets a different topology.&lt;&#x2F;p&gt;
&lt;p&gt;The actor hierarchy looks like this: SimulatedCluster creates machine actors (&lt;code&gt;simulatedMachine&lt;&#x2F;code&gt;). Each machine actor creates process actors (&lt;code&gt;simulatedFDBDRebooter&lt;&#x2F;code&gt;). Each process actor runs &lt;strong&gt;actual fdbserver code&lt;&#x2F;strong&gt;. The machine actor sits in an infinite loop: wait for all processes to die, delay 10 simulated seconds, reboot.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The same fdbserver code that runs in production runs here&lt;&#x2F;strong&gt;. No mocks. No stubs. Real transaction logs writing to simulated disk. Real storage engines (RocksDB, Redwood). Real Paxos consensus. The only difference? &lt;code&gt;Sim2&lt;&#x2F;code&gt; network instead of &lt;code&gt;Net2&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;And of course, BUGGIFY shows up here too. Remember how BUGGIFY shrinks timeouts and injects failures? It also does something &lt;strong&gt;completely insane&lt;&#x2F;strong&gt; during machine reboots. When a machine reboots, the simulator can &lt;strong&gt;swap its disks&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; SimulatedCluster.actor.cpp - machine reboot
&lt;&#x2F;span&gt;&lt;span&gt;state &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;bool&lt;&#x2F;span&gt;&lt;span&gt; swap = killType == ISimulator::KillType::Reboot &amp;amp;&amp;amp;
&lt;&#x2F;span&gt;&lt;span&gt;                  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;BUGGIFY_WITH_PROB&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.75&lt;&#x2F;span&gt;&lt;span&gt;) &amp;amp;&amp;amp;
&lt;&#x2F;span&gt;&lt;span&gt;                  g_simulator-&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;canSwapToMachine&lt;&#x2F;span&gt;&lt;span&gt;(localities.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;zoneId&lt;&#x2F;span&gt;&lt;span&gt;());
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(swap) {
&lt;&#x2F;span&gt;&lt;span&gt;    availableFolders[localities.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;dcId&lt;&#x2F;span&gt;&lt;span&gt;()].&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;push_back&lt;&#x2F;span&gt;&lt;span&gt;(myFolders);  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Return my disks to pool
&lt;&#x2F;span&gt;&lt;span&gt;    myFolders = availableFolders[localities.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;dcId&lt;&#x2F;span&gt;&lt;span&gt;()][randomIndex];  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Get random disks from pool
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;75% of the time when BUGGIFY is enabled, a rebooting machine gets &lt;strong&gt;random disks from the datacenter pool&lt;&#x2F;strong&gt;. Maybe it gets its own disks back. Maybe it gets another machine&#x27;s disks with completely different data. Maybe it gets the disks from a machine that was destroyed 10 minutes ago. Your storage server just woke up with someone else&#x27;s data (or no data at all). Can the cluster handle this? Can it detect the mismatch and rebuild correctly?&lt;&#x2F;p&gt;
&lt;p&gt;For extra chaos, there&#x27;s also &lt;code&gt;RebootAndDelete&lt;&#x2F;code&gt; which gives the machine &lt;strong&gt;brand new empty folders&lt;&#x2F;strong&gt;. No data. Fresh disks. This tests the actual failure mode of replacing a dead drive or provisioning a new machine.&lt;&#x2F;p&gt;
&lt;p&gt;Read that again. During testing, FoundationDB &lt;strong&gt;randomly swaps or deletes storage server data on reboot&lt;&#x2F;strong&gt;. If your distributed database doesn&#x27;t assume storage servers occasionally come back with amnesia or someone else&#x27;s memories, you&#x27;re not testing the real world. Because surely, no one has ever accidentally mounted the wrong volume in a Kubernetes deployment, right?&lt;&#x2F;p&gt;
&lt;p&gt;What you get from all this:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real cluster behavior&lt;&#x2F;strong&gt;: Coordinators elect leaders, transaction logs replicate commits, storage servers handle reads&#x2F;writes, backup agents run&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Real failure modes&lt;&#x2F;strong&gt;: Process crashes, machine reboots, network partitions (via &lt;code&gt;g_clogging&lt;&#x2F;code&gt;), slow disks (via &lt;code&gt;AsyncFileNonDurable&lt;&#x2F;code&gt;), disk swaps, data loss&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Realistic topologies&lt;&#x2F;strong&gt;: Multi-region configurations, different storage engines, different replication modes, different machine counts&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;When you run a simulation test, SimulatedCluster boots this entire virtual cluster, lets it stabilize, runs workloads against it while injecting chaos, then validates correctness.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;workloads-stress-testing-under-chaos&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#workloads-stress-testing-under-chaos&quot; aria-label=&quot;Anchor link for: workloads-stress-testing-under-chaos&quot;&gt;🔗&lt;&#x2F;a&gt;Workloads: Stress Testing Under Chaos&lt;&#x2F;h2&gt;
&lt;p&gt;30 seconds. 2500 transactions per second. Concurrent machines swapping edges in a distributed data structure while chaos engines inject failures. Let&#x27;s see if the database survives.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;Simulation Overview
&lt;&#x2F;span&gt;&lt;span&gt;┌────────────┬──────────────────┬────────────────┬─────────────────┬────────────────┐
&lt;&#x2F;span&gt;&lt;span&gt;│ Seed       ┆ Replication      ┆ Simulated Time ┆ Real Time       ┆ Storage Engine │
&lt;&#x2F;span&gt;&lt;span&gt;╞════════════╪══════════════════╪════════════════╪═════════════════╪════════════════╡
&lt;&#x2F;span&gt;&lt;span&gt;│ 1876983470 ┆ triple           ┆ 5m 47s         ┆ 18s 891ms       ┆ ssd-2          │
&lt;&#x2F;span&gt;&lt;span&gt;└────────────┴──────────────────┴────────────────┴─────────────────┴────────────────┘
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;Timeline of Chaos Events
&lt;&#x2F;span&gt;&lt;span&gt;┌──────────┬────────────────────┬──────────────────────────────────────┐
&lt;&#x2F;span&gt;&lt;span&gt;│ Time (s) ┆ Event Type         ┆ Details                              │
&lt;&#x2F;span&gt;&lt;span&gt;╞══════════╪════════════════════╪══════════════════════════════════════╡
&lt;&#x2F;span&gt;&lt;span&gt;│ 87.234   ┆ Coordinator Change ┆ Triggering leader election           │
&lt;&#x2F;span&gt;&lt;span&gt;│ 92.156   ┆ Process Reboot     ┆ KillInstantly process at 10.0.4.2:3  │
&lt;&#x2F;span&gt;&lt;span&gt;│ 92.156   ┆ Process Reboot     ┆ KillInstantly process at 10.0.4.2:1  │
&lt;&#x2F;span&gt;&lt;span&gt;│ 95.871   ┆ Coordinator Change ┆ Triggering leader election           │
&lt;&#x2F;span&gt;&lt;span&gt;│ 103.445  ┆ Process Reboot     ┆ RebootAndDelete process at 10.0.2.1:4│
&lt;&#x2F;span&gt;&lt;span&gt;│ 103.445  ┆ Process Reboot     ┆ RebootAndDelete process at 10.0.2.1:2│
&lt;&#x2F;span&gt;&lt;span&gt;└──────────┴────────────────────┴──────────────────────────────────────┘
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;Chaos Summary
&lt;&#x2F;span&gt;&lt;span&gt;  Network Partitions: 187 events (max duration: 5.2s)
&lt;&#x2F;span&gt;&lt;span&gt;  Process Kills: 2 KillInstantly, 2 RebootAndDelete
&lt;&#x2F;span&gt;&lt;span&gt;  Coordinator Changes: 2
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;The cluster survived.&lt;&#x2F;strong&gt; 187 network partitions. 4 process kills. 2 coordinator changes. Nearly 6 minutes of simulated time (5m 47s) compressed into under 19 seconds of wall-clock time. Every transaction completed correctly. The cycle invariant never broke.&lt;&#x2F;p&gt;
&lt;p&gt;How did we unleash this chaos? Here&#x27;s the test configuration:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;&lt;span&gt;[configuration]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;buggify &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;true
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;minimumReplication &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;3
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;[[test]]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testTitle &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;CycleWithAttrition&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Cycle&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;transactionsPerSecond &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;2500.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;RandomClogging&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Attrition&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Rollback&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;what-just-happened&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-just-happened&quot; aria-label=&quot;Anchor link for: what-just-happened&quot;&gt;🔗&lt;&#x2F;a&gt;What Just Happened?&lt;&#x2F;h3&gt;
&lt;p&gt;Four concurrent workloads ran on the same simulated cluster for 30 seconds. &lt;strong&gt;Workloads&lt;&#x2F;strong&gt; are reusable scenario templates (180+ built-in in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;tree&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&quot;&gt;fdbserver&#x2F;workloads&#x2F;&lt;&#x2F;a&gt;) that either generate transactions or inject chaos.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The application workload&lt;&#x2F;strong&gt; we ran:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cycle&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;Cycle.actor.cpp&quot;&gt;Cycle.actor.cpp&lt;&#x2F;a&gt;): Hammered the database with 2500 transactions&#x2F;second, each one swapping edges in a distributed graph. Tests SERIALIZABLE isolation by maintaining a cycle invariant. If isolation breaks, the cycle splits or nodes vanish. We&#x27;ll dive deep into how this works below.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;The chaos workloads&lt;&#x2F;strong&gt; that tried to break it:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RandomClogging&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;RandomClogging.actor.cpp&quot;&gt;RandomClogging.actor.cpp&lt;&#x2F;a&gt;): Calls &lt;code&gt;g_simulator-&amp;gt;clogInterface(ip, duration)&lt;&#x2F;code&gt; to partition machines. Those &lt;strong&gt;187 network partitions&lt;&#x2F;strong&gt; we saw? This workload. Some lasted over 5 seconds.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Attrition&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;MachineAttrition.actor.cpp&quot;&gt;MachineAttrition.actor.cpp&lt;&#x2F;a&gt;): Calls &lt;code&gt;g_simulator-&amp;gt;killMachine()&lt;&#x2F;code&gt; and &lt;code&gt;g_simulator-&amp;gt;rebootMachine()&lt;&#x2F;code&gt;. The &lt;strong&gt;4 process kills&lt;&#x2F;strong&gt; (2 instant, 2 with deleted data)? This workload.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Rollback&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;Rollback.actor.cpp&quot;&gt;Rollback.actor.cpp&lt;&#x2F;a&gt;): Forces proxy-to-TLog failures, triggering coordinator recovery. The &lt;strong&gt;2 coordinator changes&lt;&#x2F;strong&gt;? This workload.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Workloads are composable. The TOML format lets you stack them: &lt;code&gt;[configuration]&lt;&#x2F;code&gt; sets global parameters (BUGGIFY, replication), each &lt;code&gt;[[test.workload]]&lt;&#x2F;code&gt; adds another concurrent workload. Want to test atomic operations under network partitions? Stack &lt;code&gt;AtomicOps&lt;&#x2F;code&gt; + &lt;code&gt;RandomClogging&lt;&#x2F;code&gt;. Want to test backup during machine failures? Combine &lt;code&gt;BackupToBlob&lt;&#x2F;code&gt; + &lt;code&gt;Attrition&lt;&#x2F;code&gt;. Test files live in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;tree&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;tests&quot;&gt;tests&#x2F;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;how-does-cycle-work&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-does-cycle-work&quot; aria-label=&quot;Anchor link for: how-does-cycle-work&quot;&gt;🔗&lt;&#x2F;a&gt;How Does Cycle Work?&lt;&#x2F;h3&gt;
&lt;p&gt;Remember that test we just ran? Let&#x27;s break down how the &lt;code&gt;Cycle&lt;&#x2F;code&gt; workload actually works. It creates a directed graph where every node points to exactly one other node, forming a single cycle: &lt;code&gt;0→1→2→...→N-1→0&lt;&#x2F;code&gt;. Then it runs 2500 transactions per second, each one randomly swapping edges in the graph. Meanwhile, chaos workloads kill machines, partition the network, and force coordinator changes. &lt;strong&gt;If SERIALIZABLE isolation works correctly, the cycle never breaks&lt;&#x2F;strong&gt;. You always have exactly N nodes in one ring, never split cycles or dangling pointers.&lt;&#x2F;p&gt;
&lt;p&gt;Every workload implements four phases (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;include&#x2F;fdbserver&#x2F;workloads&#x2F;workloads.actor.h&quot;&gt;workloads.actor.h&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;SETUP&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;Cycle.actor.cpp&quot;&gt;Cycle.actor.cpp&lt;&#x2F;a&gt;): Creates &lt;code&gt;nodeCount&lt;&#x2F;code&gt; nodes. Each key stores the index of the next node in the cycle. Key 0 → value 1, key 1 → value 2, ..., key N-1 → value 0.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;EXECUTION&lt;&#x2F;strong&gt;: Multiple concurrent &lt;code&gt;cycleClient&lt;&#x2F;code&gt; actors run this loop:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Pick random node &lt;code&gt;r&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Read three hops: &lt;code&gt;r→r2→r3→r4&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
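&lt;li&gt;Swap the two middle nodes by rewriting three edges: &lt;code&gt;r→r3&lt;&#x2F;code&gt;, &lt;code&gt;r3→r2&lt;&#x2F;code&gt;, and &lt;code&gt;r2→r4&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Commit&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;This transaction reads 3 keys and writes 3, turning &lt;code&gt;r→r2→r3→r4&lt;&#x2F;code&gt; into &lt;code&gt;r→r3→r2→r4&lt;&#x2F;code&gt;. Every node still has exactly one incoming and one outgoing edge, so the ring stays intact. If isolation breaks, two transactions can interleave on overlapping nodes and you could create cycles of the wrong length or lose nodes entirely.&lt;&#x2F;p&gt;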
&lt;p&gt;&lt;strong&gt;CHECK&lt;&#x2F;strong&gt;: One client reads the entire graph in a single transaction. Starting from node 0, follow pointers: 0→next→next→next. Count the hops. After exactly &lt;code&gt;nodeCount&lt;&#x2F;code&gt; hops, you must be back at node 0. If you get there earlier (cycle too short) or can&#x27;t get there (broken chain), the test fails. Also verifies transaction throughput met the expected rate.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;METRICS&lt;&#x2F;strong&gt;: Reports transactions completed, retry counts, latency percentiles.&lt;&#x2F;p&gt;
&lt;p&gt;This is the pattern all workloads follow: SETUP initializes data, EXECUTION generates load, CHECK verifies correctness, METRICS reports results. When you execute a test, SimulatedCluster boots the cluster, runs SETUP phases sequentially, then runs all EXECUTION phases concurrently (they&#x27;re Flow actors on the same event loop). After &lt;code&gt;testDuration&lt;&#x2F;code&gt;, CHECK phases verify correctness.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;This is what runs before every FoundationDB commit.&lt;&#x2F;strong&gt; Not once. Not a few times. Thousands of test runs with different seeds, different cluster configurations, different workload combinations. Application workloads generate realistic transactions. Chaos workloads inject failures. The CHECK phases verify that correctness survived the chaos. This is a big part of why FoundationDB so rarely fails in production. The simulator has already broken it in far more ways than production is likely to, and those bugs got fixed before shipping.&lt;&#x2F;p&gt;
&lt;p&gt;I generated that simulation output using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;fdb-sim-visualizer&quot;&gt;fdb-sim-visualizer&lt;&#x2F;a&gt;, a tool I wrote to parse simulation trace logs and understand what happened during testing.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;verifying-correctness-building-reliable-workloads&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#verifying-correctness-building-reliable-workloads&quot; aria-label=&quot;Anchor link for: verifying-correctness-building-reliable-workloads&quot;&gt;🔗&lt;&#x2F;a&gt;Verifying Correctness: Building Reliable Workloads&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;But here&#x27;s the hard part: proving correctness when everything is randomized.&lt;&#x2F;strong&gt; The cluster survived. Transactions completed. The cycle invariant never broke... or did it? When you&#x27;re running 2500 transactions per second with random edge swaps under 187 network partitions, how do you &lt;strong&gt;prove&lt;&#x2F;strong&gt; nothing went wrong? You can&#x27;t just check if the database &quot;looks okay.&quot; You need &lt;strong&gt;proof&lt;&#x2F;strong&gt; the invariants held.&lt;&#x2F;p&gt;
&lt;p&gt;FoundationDB&#x27;s approach: &lt;strong&gt;track during EXECUTION, verify in CHECK.&lt;&#x2F;strong&gt; Three patterns emerge across the codebase:&lt;&#x2F;p&gt;
&lt;h3 id=&quot;pattern-1-reference-implementation&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#pattern-1-reference-implementation&quot; aria-label=&quot;Anchor link for: pattern-1-reference-implementation&quot;&gt;🔗&lt;&#x2F;a&gt;Pattern 1: Reference Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;The challenge&lt;&#x2F;strong&gt;: How do you verify complex API behavior under chaos?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The solution&lt;&#x2F;strong&gt;: Run every operation twice. &lt;code&gt;ApiCorrectness&lt;&#x2F;code&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;ApiCorrectness.actor.cpp&quot;&gt;ApiCorrectness.actor.cpp&lt;&#x2F;a&gt;) mirrors all operations in a simple &lt;code&gt;MemoryKeyValueStore&lt;&#x2F;code&gt; (just a &lt;code&gt;std::map&amp;lt;Key, Value&amp;gt;&lt;&#x2F;code&gt;). Every &lt;code&gt;transaction-&amp;gt;set(k, v)&lt;&#x2F;code&gt; also executes &lt;code&gt;store.set(k, v)&lt;&#x2F;code&gt; in memory. The CHECK phase reads from FDB and compares with the memory model. Mismatch = bug found.&lt;&#x2F;p&gt;
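&lt;p&gt;Stripped of FoundationDB, the pattern looks roughly like this. It is a sketch under loud assumptions: &lt;code&gt;StoreUnderTest&lt;&#x2F;code&gt; is a hypothetical in-process stand-in for the database so the example compiles on its own, and none of these names come from &lt;code&gt;ApiCorrectness&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&#x2F;&#x2F; Illustrative sketch only: the reference-model pattern on a toy store.
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;random&amp;gt;
#include &amp;lt;string&amp;gt;

&#x2F;&#x2F; Hypothetical stand-in for the system under test (FoundationDB in the real workload).
struct StoreUnderTest {
    std::map&amp;lt;std::string, std::string&amp;gt; data;
    void set(const std::string&amp;amp; k, const std::string&amp;amp; v) { data[k] = v; }
    void clear(const std::string&amp;amp; k) { data.erase(k); }
};

int main() {
    StoreUnderTest db;
    std::map&amp;lt;std::string, std::string&amp;gt; model;  &#x2F;&#x2F; the simple, trusted reference
    std::mt19937 rng(1234);                          &#x2F;&#x2F; seeded, so failures replay

    &#x2F;&#x2F; EXECUTION: apply every random operation to both sides.
    for (int i = 0; i &amp;lt; 10000; i++) {
        std::string key = &amp;quot;k&amp;quot; + std::to_string(rng() % 50);
        if (rng() % 4 == 0) {
            db.clear(key);
            model.erase(key);
        } else {
            std::string value = std::to_string(rng());
            db.set(key, value);
            model[key] = value;
        }
    }

    &#x2F;&#x2F; CHECK: any difference between the two is a bug in the store (or in the test).
    bool ok = (db.data == model);
    std::puts(ok ? &amp;quot;store matches model&amp;quot; : &amp;quot;MISMATCH&amp;quot;);
    return ok ? 0 : 1;
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here the two sides trivially agree because both are maps. The pattern earns its keep when the left side is a real distributed database running under chaos and the right side stays a boring in-memory map.&lt;&#x2F;p&gt;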
&lt;h3 id=&quot;pattern-2-operation-logging&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#pattern-2-operation-logging&quot; aria-label=&quot;Anchor link for: pattern-2-operation-logging&quot;&gt;🔗&lt;&#x2F;a&gt;Pattern 2: Operation Logging&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;The challenge&lt;&#x2F;strong&gt;: How do you verify atomic operations executed in the right order?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The solution&lt;&#x2F;strong&gt;: Log everything. &lt;code&gt;AtomicOps&lt;&#x2F;code&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;AtomicOps.actor.cpp&quot;&gt;AtomicOps.actor.cpp&lt;&#x2F;a&gt;) logs every operation to a separate keyspace. During EXECUTION: &lt;code&gt;atomicOp(ops_key, value)&lt;&#x2F;code&gt; on real data, &lt;code&gt;set(log_key, value)&lt;&#x2F;code&gt; to track what happened. During CHECK: replay all logged operations, compute what the final state should be, compare with actual database state.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;pattern-3-invariant-tracking&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#pattern-3-invariant-tracking&quot; aria-label=&quot;Anchor link for: pattern-3-invariant-tracking&quot;&gt;🔗&lt;&#x2F;a&gt;Pattern 3: Invariant Tracking&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;The challenge&lt;&#x2F;strong&gt;: How do you prove SERIALIZABLE isolation worked during chaos?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The solution&lt;&#x2F;strong&gt;: Maintain a mathematical invariant that breaks if isolation fails. &lt;code&gt;Cycle&lt;&#x2F;code&gt; (from our test earlier) maintains &quot;exactly N nodes in one ring.&quot; During EXECUTION, random edge swaps must preserve the invariant. During CHECK, walk the graph: 0→next→next→next. After exactly N hops, you must be back at node 0. If you arrive earlier (cycle split) or can&#x27;t arrive (broken chain), isolation failed. The CHECK phase catches this immediately.&lt;&#x2F;p&gt;
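&lt;p&gt;Here is the invariant as a small plain C++ sketch, assuming an in-memory &lt;code&gt;std::vector&lt;&#x2F;code&gt; in place of the database. It runs single-threaded, so it does not test isolation at all. It only shows that a correctly applied swap (three pointer writes: &lt;code&gt;r→r3&lt;&#x2F;code&gt;, &lt;code&gt;r3→r2&lt;&#x2F;code&gt;, &lt;code&gt;r2→r4&lt;&#x2F;code&gt;) preserves the ring, and how the CHECK walk detects a break. The names &lt;code&gt;isSingleCycle&lt;&#x2F;code&gt; and &lt;code&gt;next&lt;&#x2F;code&gt; are mine, not from &lt;code&gt;Cycle.actor.cpp&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&#x2F;&#x2F; Illustrative sketch only: the cycle invariant on an in-memory graph.
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;random&amp;gt;
#include &amp;lt;vector&amp;gt;

&#x2F;&#x2F; next[i] is the node that i points to, like key i storing value next[i].
&#x2F;&#x2F; CHECK: walk from node 0; we must return to 0 after exactly next.size() hops.
bool isSingleCycle(const std::vector&amp;lt;int&amp;gt;&amp;amp; next) {
    int node = 0;
    for (std::size_t hop = 1; hop &amp;lt;= next.size(); hop++) {
        node = next[node];
        if (node == 0) return hop == next.size();  &#x2F;&#x2F; back early means the cycle split
    }
    return false;  &#x2F;&#x2F; never returned to node 0: broken chain
}

int main() {
    const int nodeCount = 10;
    std::vector&amp;lt;int&amp;gt; next(nodeCount);
    for (int i = 0; i &amp;lt; nodeCount; i++) next[i] = (i + 1) % nodeCount;  &#x2F;&#x2F; SETUP

    std::mt19937 rng(7);
    std::uniform_int_distribution&amp;lt;int&amp;gt; pick(0, nodeCount - 1);
    for (int i = 0; i &amp;lt; 1000; i++) {  &#x2F;&#x2F; EXECUTION, one swap at a time
        int r = pick(rng);
        int r2 = next[r], r3 = next[r2], r4 = next[r3];  &#x2F;&#x2F; three reads
        next[r] = r3;   &#x2F;&#x2F; r now points to r3
        next[r3] = r2;  &#x2F;&#x2F; r3 now points to r2
        next[r2] = r4;  &#x2F;&#x2F; r2 now points to r4
    }

    bool ok = isSingleCycle(next);
    std::puts(ok ? &amp;quot;cycle intact&amp;quot; : &amp;quot;cycle broken&amp;quot;);
    return ok ? 0 : 1;
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This prints &lt;code&gt;cycle intact&lt;&#x2F;code&gt;. Drop any one of the three writes and the walk either comes back to node 0 too early or never comes back, which is exactly the signal the real CHECK phase looks for.&lt;&#x2F;p&gt;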
&lt;h3 id=&quot;using-clientid-for-work-distribution&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#using-clientid-for-work-distribution&quot; aria-label=&quot;Anchor link for: using-clientid-for-work-distribution&quot;&gt;🔗&lt;&#x2F;a&gt;Using clientId for Work Distribution&lt;&#x2F;h3&gt;
&lt;p&gt;Every workload gets &lt;code&gt;clientId&lt;&#x2F;code&gt; (0, 1, 2...) and &lt;code&gt;clientCount&lt;&#x2F;code&gt; (total clients). Three patterns:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Client 0 only&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;AtomicOps.actor.cpp&quot;&gt;AtomicOps.actor.cpp&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(clientId != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Common for CHECK phases
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Partition keyspace&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;WatchAndWait.actor.cpp&quot;&gt;WatchAndWait.actor.cpp&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;uint64_t startNode = (nodeCount * clientId) &#x2F; clientCount;
&lt;&#x2F;span&gt;&lt;span&gt;uint64_t endNode = (nodeCount * (clientId + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)) &#x2F; clientCount;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; With nodeCount = 100, clientCount = 3, and endNode exclusive: Client 0: nodes 0-32, Client 1: nodes 33-65, Client 2: nodes 66-99
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Round-robin&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;fdbserver&#x2F;workloads&#x2F;Watches.actor.cpp&quot;&gt;Watches.actor.cpp&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(i % clientCount == clientId)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; With clientCount = 3: Client 0: keys 0,3,6,9... Client 1: keys 1,4,7,10...
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Use &lt;code&gt;clientId&lt;&#x2F;code&gt; to create concurrency (multiple clients hitting different keys) or coordinate work (one client checks, others generate load).&lt;&#x2F;p&gt;
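&lt;p&gt;If you want to check the partition arithmetic, here is a tiny standalone program. It assumes &lt;code&gt;nodeCount = 100&lt;&#x2F;code&gt; and &lt;code&gt;clientCount = 3&lt;&#x2F;code&gt; as example values, and it assumes the range is half-open (the next client starts at &lt;code&gt;endNode&lt;&#x2F;code&gt;), which is my reading of the formula rather than something this post has shown.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&#x2F;&#x2F; Illustrative sketch only: keyspace partitioning by clientId.
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdio&amp;gt;

int main() {
    &#x2F;&#x2F; Example values only; real tests take these from the test configuration.
    const std::uint64_t nodeCount = 100, clientCount = 3;
    for (std::uint64_t clientId = 0; clientId &amp;lt; clientCount; clientId++) {
        std::uint64_t startNode = (nodeCount * clientId) &#x2F; clientCount;
        std::uint64_t endNode = (nodeCount * (clientId + 1)) &#x2F; clientCount;
        &#x2F;&#x2F; Assumption: the range is half-open, [startNode, endNode).
        std::printf(&amp;quot;client %llu: nodes %llu-%llu\n&amp;quot;,
                    static_cast&amp;lt;unsigned long long&amp;gt;(clientId),
                    static_cast&amp;lt;unsigned long long&amp;gt;(startNode),
                    static_cast&amp;lt;unsigned long long&amp;gt;(endNode - 1));
    }
    return 0;
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It prints &lt;code&gt;client 0: nodes 0-32&lt;&#x2F;code&gt;, &lt;code&gt;client 1: nodes 33-65&lt;&#x2F;code&gt;, and &lt;code&gt;client 2: nodes 66-99&lt;&#x2F;code&gt;. Every node lands in exactly one range, and the last client absorbs the remainder when &lt;code&gt;nodeCount&lt;&#x2F;code&gt; does not divide evenly.&lt;&#x2F;p&gt;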
&lt;h3 id=&quot;randomize-everything&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#randomize-everything&quot; aria-label=&quot;Anchor link for: randomize-everything&quot;&gt;🔗&lt;&#x2F;a&gt;Randomize Everything&lt;&#x2F;h3&gt;
&lt;p&gt;The key to finding bugs: &lt;strong&gt;randomize every decision&lt;&#x2F;strong&gt;. Which keys to read? Random. How many operations per transaction? Random. Which atomic operation type? Random. Order of operations? Random. When to inject chaos? Random.&lt;&#x2F;p&gt;
&lt;p&gt;But use &lt;code&gt;deterministicRandom()&lt;&#x2F;code&gt; for all randomness. It&#x27;s a seeded PRNG. Same seed = same random choices = reproducible failures. When a test fails after 10 million operations, rerun with the same seed, get the exact same failure at the exact same point.&lt;&#x2F;p&gt;
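&lt;p&gt;The seed property is easy to demonstrate outside FoundationDB. In this sketch, &lt;code&gt;std::mt19937&lt;&#x2F;code&gt; stands in for &lt;code&gt;deterministicRandom()&lt;&#x2F;code&gt; and &lt;code&gt;decisions()&lt;&#x2F;code&gt; is a hypothetical helper that records the random choices one test run would make. The seed is the one from the simulation output above.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&#x2F;&#x2F; Illustrative sketch only: same seed, same decisions.
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;random&amp;gt;
#include &amp;lt;vector&amp;gt;

&#x2F;&#x2F; Hypothetical helper: the sequence of random choices made by one run.
std::vector&amp;lt;unsigned&amp;gt; decisions(unsigned seed, int count) {
    std::mt19937 rng(seed);
    std::vector&amp;lt;unsigned&amp;gt; out;
    for (int i = 0; i &amp;lt; count; i++) out.push_back(rng() % 100);
    return out;
}

int main() {
    bool sameSeed = decisions(1876983470u, 1000) == decisions(1876983470u, 1000);
    bool otherSeed = decisions(1876983470u, 1000) == decisions(42u, 1000);
    std::printf(&amp;quot;same seed identical: %d, different seed identical: %d\n&amp;quot;,
                sameSeed, otherSeed);
    return 0;
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It prints &lt;code&gt;same seed identical: 1, different seed identical: 0&lt;&#x2F;code&gt;. The catch is that this only holds if every random choice goes through the seeded generator. One call to an unseeded source and the replay diverges.&lt;&#x2F;p&gt;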
&lt;h3 id=&quot;pattern-selection-guide&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#pattern-selection-guide&quot; aria-label=&quot;Anchor link for: pattern-selection-guide&quot;&gt;🔗&lt;&#x2F;a&gt;Pattern Selection Guide&lt;&#x2F;h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Testing&lt;&#x2F;th&gt;&lt;th&gt;Use Pattern&lt;&#x2F;th&gt;&lt;th&gt;Example Workload&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;API correctness&lt;&#x2F;td&gt;&lt;td&gt;Reference implementation&lt;&#x2F;td&gt;&lt;td&gt;ApiCorrectness&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Atomic operations&lt;&#x2F;td&gt;&lt;td&gt;Operation logging&lt;&#x2F;td&gt;&lt;td&gt;AtomicOps&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;ACID guarantees&lt;&#x2F;td&gt;&lt;td&gt;Invariant tracking&lt;&#x2F;td&gt;&lt;td&gt;Cycle&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Backup&#x2F;restore&lt;&#x2F;td&gt;&lt;td&gt;Absence checking&lt;&#x2F;td&gt;&lt;td&gt;BackupCorrectness&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Chaos workloads (&lt;code&gt;RandomClogging&lt;&#x2F;code&gt;, &lt;code&gt;Attrition&lt;&#x2F;code&gt;, &lt;code&gt;Rollback&lt;&#x2F;code&gt;) don&#x27;t need a real CHECK phase: their only job is to inject failures, so their check just returns &lt;code&gt;true&lt;&#x2F;code&gt;. Application workloads verify that correctness survived the chaos.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;writing-workloads-in-rust&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#writing-workloads-in-rust&quot; aria-label=&quot;Anchor link for: writing-workloads-in-rust&quot;&gt;🔗&lt;&#x2F;a&gt;Writing Workloads in Rust&lt;&#x2F;h2&gt;
&lt;p&gt;Remember those chaos workloads hammering the Cycle test? &lt;code&gt;RandomClogging&lt;&#x2F;code&gt;, &lt;code&gt;Attrition&lt;&#x2F;code&gt;, &lt;code&gt;Rollback&lt;&#x2F;code&gt;. All written in C++ Flow. But you can write workloads in &lt;strong&gt;Rust&lt;&#x2F;strong&gt; and compile them directly into the simulator. At Clever Cloud, we open-sourced &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;tree&#x2F;main&#x2F;foundationdb-simulation&quot;&gt;foundationdb-simulation&lt;&#x2F;a&gt;, which lets you implement the &lt;code&gt;RustWorkload&lt;&#x2F;code&gt; trait with &lt;code&gt;setup()&lt;&#x2F;code&gt;, &lt;code&gt;start()&lt;&#x2F;code&gt;, and &lt;code&gt;check()&lt;&#x2F;code&gt; methods using Rust&#x27;s async&#x2F;await:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;#[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;async_trait&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;impl &lt;&#x2F;span&gt;&lt;span&gt;RustWorkload &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;MyWorkload {
&lt;&#x2F;span&gt;&lt;span&gt;    async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;setup&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;mut &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;: Database, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;_ctx&lt;&#x2F;span&gt;&lt;span&gt;: Context) -&amp;gt; Result&amp;lt;()&amp;gt; {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Initialize test data
&lt;&#x2F;span&gt;&lt;span&gt;        db.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;run&lt;&#x2F;span&gt;&lt;span&gt;(|&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tx&lt;&#x2F;span&gt;&lt;span&gt;, _| async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;move &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            tx.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;set&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;key&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;value&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;            Ok(())
&lt;&#x2F;span&gt;&lt;span&gt;        }).await
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;start&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;mut &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;: Database, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;: Context) -&amp;gt; Result&amp;lt;()&amp;gt; {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Generate load under simulation
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;_ in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;..ctx.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;get_option&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;nodeCount&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;            db.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;run&lt;&#x2F;span&gt;&lt;span&gt;(|&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tx&lt;&#x2F;span&gt;&lt;span&gt;, _| async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;move &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; value = tx.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;key&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span&gt;).await?;
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Your workload logic here
&lt;&#x2F;span&gt;&lt;span&gt;                Ok(())
&lt;&#x2F;span&gt;&lt;span&gt;            }).await?;
&lt;&#x2F;span&gt;&lt;span&gt;        }
&lt;&#x2F;span&gt;&lt;span&gt;        Ok(())
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;check&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;mut &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;: Database, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;_ctx&lt;&#x2F;span&gt;&lt;span&gt;: Context) -&amp;gt; Result&amp;lt;()&amp;gt; {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Verify correctness after chaos
&lt;&#x2F;span&gt;&lt;span&gt;        Ok(())
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Your Rust code compiles to a shared library, FDB&#x27;s &lt;code&gt;ExternalWorkload&lt;&#x2F;code&gt; loads it at runtime via FFI, and your Rust async functions run on the same Flow event loop as the C++ cluster. The FFI boundary is managed by the &lt;code&gt;foundationdb-simulation&lt;&#x2F;code&gt; crate, which handles marshaling between Flow&#x27;s event loop and Rust futures. Same determinism, same reproducibility, same chaos injection. But you&#x27;re writing &lt;code&gt;async fn&lt;&#x2F;code&gt; instead of &lt;code&gt;ACTOR Future&amp;lt;Void&amp;gt;&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Here&#x27;s a &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;blob&#x2F;main&#x2F;foundationdb-simulation&#x2F;examples&#x2F;atomic&#x2F;lib.rs&quot;&gt;complete example workload&lt;&#x2F;a&gt; testing atomic operations in ~100 lines of Rust.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;simulation-at-clever-cloud-building-materia&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#simulation-at-clever-cloud-building-materia&quot; aria-label=&quot;Anchor link for: simulation-at-clever-cloud-building-materia&quot;&gt;🔗&lt;&#x2F;a&gt;Simulation at Clever Cloud: Building Materia&lt;&#x2F;h3&gt;
&lt;p&gt;At Clever Cloud, we use simulation to build &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;blog&#x2F;features&#x2F;2024&#x2F;06&#x2F;11&#x2F;materia-kv-our-easy-to-use-serverless-key-value-database-is-available-to-all&#x2F;&quot;&gt;Materia&lt;&#x2F;a&gt;, our line of serverless database products. I&#x27;m the lead engineer behind Materia and the main maintainer of &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&quot;&gt;foundationdb-rs&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We built a Rust SDK on top of FDB, similar to Apple&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;FoundationDB&#x2F;fdb-record-layer&quot;&gt;Record Layer&lt;&#x2F;a&gt;. It provides structured records, secondary indexes, query planning, and multi-tenant isolation. The result: a distributed transactional database built on FoundationDB&#x27;s guarantees.&lt;&#x2F;p&gt;
&lt;p&gt;FoundationDB is a hidden technology: you don&#x27;t use it directly, you build layers on top. But here&#x27;s the trick: &lt;strong&gt;patterns like index design, quota management, and schema management should be written once and consumed, not reimplemented in every product&lt;&#x2F;strong&gt;. Our SDK abstracts these patterns. Need secondary indexes? The SDK handles keyspace layout, index updates, and query planning. Need multi-tenant isolation? The SDK provides it. Need quota enforcement or permission management? &lt;strong&gt;We built a common control plane that works across all products&lt;&#x2F;strong&gt; built on the SDK. Write the hard distributed systems logic once, simulate it until it&#x27;s bulletproof, then reuse it everywhere.&lt;&#x2F;p&gt;
&lt;p&gt;Every merge request runs simulation tests in CI. We test SDK scenarios (indexing, query planning) and full product workloads under chaos: multi-tenant isolation, concurrent queries, network partitions, machine crashes. The bugs we catch vary by layer. At the SDK level, simulation catches nasty bugs like duplicated indexes during &lt;code&gt;maybe_committed&lt;&#x2F;code&gt; transactions. At the product level, it catches simpler errors like accidentally blocking FDB&#x27;s retry logic or breaking atomicity.&lt;&#x2F;p&gt;
&lt;p&gt;But the real value isn&#x27;t just bug detection. &lt;strong&gt;Instead of writing hundreds of unit tests, we write workloads that fuzz our code under chaos.&lt;&#x2F;strong&gt; One workload with randomized operations and deterministic chaos replaces dozens of hand-crafted test cases. When engineers write workloads for their features, they&#x27;re forced to think: &quot;What happens when this retries during a partition?&quot; &quot;How do I verify correctness when transactions can commit in any order?&quot; &lt;strong&gt;Designing for chaos&lt;&#x2F;strong&gt; becomes natural. The act of writing simulation workloads improves the design itself.&lt;&#x2F;p&gt;
&lt;p&gt;The confidence this gives a small team is extraordinary. When you can prove your code survives hundreds of network partitions and machine crashes before shipping, you sleep better at night. Our latest layer, an etcd-compatible API for managed Kubernetes, was built from the ground up with simulation in mind. We&#x27;re even &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;issues&#x2F;12343&quot;&gt;contributing features back to FoundationDB&lt;&#x2F;a&gt; to better support layers like ours.&lt;&#x2F;p&gt;
&lt;p&gt;If it survives simulation, it survives production.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;running-simulations-yourself&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#running-simulations-yourself&quot; aria-label=&quot;Anchor link for: running-simulations-yourself&quot;&gt;🔗&lt;&#x2F;a&gt;Running Simulations Yourself&lt;&#x2F;h2&gt;
&lt;p&gt;Think you can break FoundationDB? You don&#x27;t need to build from source or set up a cluster. Download a prebuilt &lt;code&gt;fdbserver&lt;&#x2F;code&gt; binary from the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;releases&quot;&gt;releases page&lt;&#x2F;a&gt;, create a test file, and unleash chaos:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Download fdbserver (Linux example, adjust for your platform)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;wget&lt;&#x2F;span&gt;&lt;span&gt; https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;releases&#x2F;download&#x2F;7.3.27&#x2F;fdbserver.x86_64
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;chmod&lt;&#x2F;span&gt;&lt;span&gt; +x fdbserver.x86_64
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Create the folder for traces
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;mkdir&lt;&#x2F;span&gt;&lt;span&gt; events
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Run a simulation test with JSON trace output
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;.&#x2F;fdbserver.x86_64 -r&lt;&#x2F;span&gt;&lt;span&gt; simulation&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; -f&lt;&#x2F;span&gt;&lt;span&gt; Attritions.toml&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; --trace-format&lt;&#x2F;span&gt;&lt;span&gt; json&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; -L&lt;&#x2F;span&gt;&lt;span&gt; .&#x2F;events&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; --logsize&lt;&#x2F;span&gt;&lt;span&gt; 1GiB
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here are two test files to get you started. Save either one under the name shown and pass it to &lt;code&gt;-f&lt;&#x2F;code&gt; in the command above.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Attritions.toml&lt;&#x2F;strong&gt; - Network partitions + machine crashes + database reconfigurations (the NemesisTest shown earlier):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;&lt;span&gt;[configuration]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;buggify &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;true
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;minimumReplication &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;3
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;[[test]]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testTitle &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;NemesisTest&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;ReadWrite&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;transactionsPerSecond &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1000.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;RandomClogging&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Network partitions
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;swizzle &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Unclog in reversed order
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Attrition&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Machine crashes
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Rollback&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Proxy-to-TLog errors
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;ChangeConfig&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Database reconfigurations
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;coordinators &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;auto&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;DiskFailureCycle.toml&lt;&#x2F;strong&gt; - Disk failures + bit flips during the Cycle workload:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;&lt;span&gt;[configuration]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;minimumReplication &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;3
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;minimumRegions &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;3
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;[[test]]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testTitle &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;DiskFailureCycle&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Cycle&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;transactionsPerSecond &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;2500.0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    [[test.workload]]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testName &lt;&#x2F;span&gt;&lt;span&gt;= &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;DiskFailureInjection&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;testDuration &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;120.0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;stallInterval &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;5.0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;stallPeriod &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;5.0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;throttlePeriod &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;30.0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;corruptFile &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;true
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;percentBitFlips &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;10
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The simulation generates JSON trace logs in &lt;code&gt;.&#x2F;events&#x2F;&lt;&#x2F;code&gt;. Parse them with &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;fdb-sim-visualizer&quot;&gt;fdb-sim-visualizer&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For more test examples, check FoundationDB&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;tree&#x2F;dfbb0ea72ce01ba87148ef67cf216200e8b249cd&#x2F;tests&quot;&gt;tests&#x2F;&lt;&#x2F;a&gt; directory. Hundreds of workload combinations testing every corner of the system.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-i-ve-never-been-woken-up-by-fdb&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#why-i-ve-never-been-woken-up-by-fdb&quot; aria-label=&quot;Anchor link for: why-i-ve-never-been-woken-up-by-fdb&quot;&gt;🔗&lt;&#x2F;a&gt;Why I&#x27;ve Never Been Woken Up by FDB&lt;&#x2F;h2&gt;
&lt;p&gt;After years of on-call and one trillion CPU-hours of simulation, I&#x27;ve never been woken up by FoundationDB. Now you know why.&lt;&#x2F;p&gt;
&lt;p&gt;Interface swapping lets the same code run in both production and simulation. Flow actors enable single-threaded determinism. The event loop compresses years into seconds. BUGGIFY injects chaos into every corner of the codebase. SimulatedCluster builds entire distributed systems in memory. Workloads generate realistic transactions while chaos engines try to break everything. And deterministic randomness guarantees every bug can be reproduced, diagnosed, and fixed before shipping.&lt;&#x2F;p&gt;
&lt;p&gt;The simulator has already broken FoundationDB in every possible way. Network partitions during coordinator elections. Machine crashes mid-transaction. Disks swapped between nodes on reboot. Bit flips. Slow I&#x2F;O. Every edge case, every race condition, every distributed systems nightmare. Found, fixed, and verified before production ever sees it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Want to try breaking FoundationDB yourself?&lt;&#x2F;strong&gt; Grab a test config from above, run the simulator, inject chaos, and see if you can find a bug that survived one trillion CPU-hours. If you do, the FDB team would love to hear about it.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your simulation testing experiences or FDB workloads. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">foundationdb</category>
          <category domain="tag">testing</category>
          <category domain="tag">simulation</category>
          <category domain="tag">deterministic</category>
          <category domain="tag">distributed-systems</category>
          <category domain="tag">diving-into</category>
      </item>
      <item>
          <title>From Arc to Box: One Deref Bound to Rule Them All</title>
          <pubDate>Thu, 02 Oct 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/from-arc-to-box-one-deref-bound/</link>
          <guid>https://pierrezemb.fr/posts/from-arc-to-box-one-deref-bound/</guid>
          <description xml:base="https://pierrezemb.fr/posts/from-arc-to-box-one-deref-bound/">&lt;p&gt;While working on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&quot;&gt;FoundationDB-rs&lt;&#x2F;a&gt;, I hit a design problem that seemed like it would require complex trait gymnastics. I had two transaction types with identical APIs but different ownership semantics, and I needed functions to accept both. The solution turned out to be embarrassingly simple. It was already implemented.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-problem-two-transaction-types-one-api&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-problem-two-transaction-types-one-api&quot; aria-label=&quot;Anchor link for: the-problem-two-transaction-types-one-api&quot;&gt;🔗&lt;&#x2F;a&gt;The Problem: Two Transaction Types, One API&lt;&#x2F;h2&gt;
&lt;p&gt;FoundationDB-rs has two transaction types that do exactly the same thing but with different ownership models:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;pub struct &lt;&#x2F;span&gt;&lt;span&gt;Transaction {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;inner&lt;&#x2F;span&gt;&lt;span&gt;: NonNull&amp;lt;fdb_sys::FDBTransaction&amp;gt;,
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;metrics&lt;&#x2F;span&gt;&lt;span&gt;: Option&amp;lt;TransactionMetrics&amp;gt;,
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;pub struct &lt;&#x2F;span&gt;&lt;span&gt;RetryableTransaction {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;inner&lt;&#x2F;span&gt;&lt;span&gt;: Arc&amp;lt;Transaction&amp;gt;,  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Arc needed for retry loop ownership
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Why two types?&lt;&#x2F;strong&gt; FoundationDB requires retry loops for handling conflicts and retryable errors. A plain &lt;code&gt;Transaction&lt;&#x2F;code&gt; fits when you&#x27;re managing retries manually or doing single-shot operations. The &lt;code&gt;RetryableTransaction&lt;&#x2F;code&gt; wraps it in an &lt;code&gt;Arc&lt;&#x2F;code&gt; so the automatic retry machinery in &lt;code&gt;Database::run()&lt;&#x2F;code&gt; can clone references across async boundaries and exponential backoff delays.&lt;&#x2F;p&gt;
&lt;p&gt;The challenge: users need to write code that works with both. Real FoundationDB applications mix both patterns depending on the use case.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-obvious-solutions-didn-t-work&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-obvious-solutions-didn-t-work&quot; aria-label=&quot;Anchor link for: the-obvious-solutions-didn-t-work&quot;&gt;🔗&lt;&#x2F;a&gt;The Obvious Solutions Didn&#x27;t Work&lt;&#x2F;h2&gt;
&lt;p&gt;My first instinct was creating a trait. But FoundationDB-rs operates directly on raw C pointers (&lt;code&gt;NonNull&amp;lt;fdb_sys::FDBTransaction&amp;gt;&lt;&#x2F;code&gt;) with custom &lt;code&gt;Future&lt;&#x2F;code&gt; implementations that handle FFI complexity and error mapping. Writing a trait with async methods that return these custom futures means associated types, lifetime bounds, and complex error handling. The resulting trait becomes painful to use and understand.&lt;&#x2F;p&gt;
&lt;p&gt;I considered an enum wrapper:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;enum &lt;&#x2F;span&gt;&lt;span&gt;AnyTransaction&amp;lt;&amp;#39;a&amp;gt; {
&lt;&#x2F;span&gt;&lt;span&gt;    Regular(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;&amp;#39;a&lt;&#x2F;span&gt;&lt;span&gt; Transaction),
&lt;&#x2F;span&gt;&lt;span&gt;    Retryable(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;&amp;#39;a&lt;&#x2F;span&gt;&lt;span&gt; RetryableTransaction),
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But this felt wrong. Users would need to match everywhere, and it adds runtime overhead for what should be a compile-time decision. Plus it doesn&#x27;t feel natural to use.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-accidental-solution&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-accidental-solution&quot; aria-label=&quot;Anchor link for: the-accidental-solution&quot;&gt;🔗&lt;&#x2F;a&gt;The Accidental Solution&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;code&gt;RetryableTransaction&lt;&#x2F;code&gt; already had this implementation for convenience:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;impl &lt;&#x2F;span&gt;&lt;span&gt;Deref &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;RetryableTransaction {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span&gt;Target = Transaction;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;deref&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &amp;amp;Transaction {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.inner.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;deref&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I&#x27;d added this so users could call transaction methods directly on &lt;code&gt;RetryableTransaction&lt;&#x2F;code&gt; instances. But this &lt;strong&gt;accidentally solved the entire design problem.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Functions can accept both types through a simple generic bound:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;perform_operations&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;T&amp;gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tx&lt;&#x2F;span&gt;&lt;span&gt;: &amp;amp;T) -&amp;gt; FdbResult&amp;lt;()&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;where
&lt;&#x2F;span&gt;&lt;span&gt;    T: Deref&amp;lt;Target = Transaction&amp;gt;,
&lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    tx.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;set&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;key&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;value&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; value = tx.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;key&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span&gt;).await?;
&lt;&#x2F;span&gt;&lt;span&gt;    tx.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;clear_range&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;start&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;end&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;    Ok(())
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now the same function works seamlessly with both transaction types:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Direct transaction usage
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; tx = db.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;create_transaction&lt;&#x2F;span&gt;&lt;span&gt;()?;
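&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;perform_operations&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&amp;amp;tx).await?;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; &amp;amp;Transaction implements Deref&amp;lt;Target = Transaction&amp;gt;, so pass a reference to it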
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Automatic retry loop usage
&lt;&#x2F;span&gt;&lt;span&gt;db.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;run&lt;&#x2F;span&gt;&lt;span&gt;(|&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rtx&lt;&#x2F;span&gt;&lt;span&gt;| async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;move &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;perform_operations&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;rtx).await?;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Same function, no changes needed!
&lt;&#x2F;span&gt;&lt;span&gt;    Ok(())
&lt;&#x2F;span&gt;&lt;span&gt;}).await?;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The compiler handles everything through deref coercion. All methods of &lt;code&gt;Transaction&lt;&#x2F;code&gt; remain directly accessible on both types, and there&#x27;s zero runtime overhead.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-pattern-arc-deref-universal-apis&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-pattern-arc-deref-universal-apis&quot; aria-label=&quot;Anchor link for: the-pattern-arc-deref-universal-apis&quot;&gt;🔗&lt;&#x2F;a&gt;The Pattern: Arc&amp;lt;T&amp;gt; + Deref = Universal APIs&lt;&#x2F;h2&gt;
&lt;p&gt;This pattern works whenever you have a type &lt;code&gt;T&lt;&#x2F;code&gt; and a wrapper containing &lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;&#x2F;code&gt; (or &lt;code&gt;Box&amp;lt;T&amp;gt;&lt;&#x2F;code&gt;, &lt;code&gt;Rc&amp;lt;T&amp;gt;&lt;&#x2F;code&gt;, etc.). As long as the wrapper implements &lt;code&gt;Deref&amp;lt;Target = T&amp;gt;&lt;&#x2F;code&gt;, you can write generic functions that accept both:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Any function with this signature accepts:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; - &amp;amp;&amp;amp;T, because a plain &amp;amp;T already implements Deref&amp;lt;Target = T&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; - &amp;amp;WrapperType where WrapperType: Deref&amp;lt;Target = T&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; - &amp;amp;Arc&amp;lt;T&amp;gt;, &amp;amp;Box&amp;lt;T&amp;gt;, &amp;amp;Rc&amp;lt;T&amp;gt; (stdlib types already implement Deref)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;use_any_version&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;D&amp;gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;val&lt;&#x2F;span&gt;&lt;span&gt;: &amp;amp;D)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;where 
&lt;&#x2F;span&gt;&lt;span&gt;    D: Deref&amp;lt;Target = T&amp;gt;,
&lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    val.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;some_method&lt;&#x2F;span&gt;&lt;span&gt;();  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; All methods of T available through deref coercion
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The key insight: when you&#x27;re designing APIs that need to work with both &lt;code&gt;T&lt;&#x2F;code&gt; and &lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;&#x2F;code&gt;, don&#x27;t reach for traits or enums. The standard library already solved this. &lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;&#x2F;code&gt; implements &lt;code&gt;Deref&amp;lt;Target = T&amp;gt;&lt;&#x2F;code&gt;, and your custom wrapper types should do the same.&lt;&#x2F;p&gt;
&lt;p&gt;Once you implement &lt;code&gt;Deref&lt;&#x2F;code&gt;, any function that accepts &lt;code&gt;&amp;amp;D where D: Deref&amp;lt;Target = T&amp;gt;&lt;&#x2F;code&gt; works with your wrapper type, with any smart pointer containing your type, and with a plain reference to your type, because &lt;code&gt;&amp;amp;T&lt;&#x2F;code&gt; itself implements &lt;code&gt;Deref&amp;lt;Target = T&amp;gt;&lt;&#x2F;code&gt;. The compiler resolves method calls through auto-deref, and you get a zero-cost abstraction that feels completely natural to use.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with Deref patterns in Rust. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">rust</category>
          <category domain="tag">foundationdb</category>
          <category domain="tag">programming</category>
          <category domain="tag">metaprogramming</category>
      </item>
      <item>
          <title>A Practical Guide to Application Metrics: Where to Put Your Instrumentation</title>
          <pubDate>Wed, 24 Sep 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/practical-guide-to-application-metrics/</link>
          <guid>https://pierrezemb.fr/posts/practical-guide-to-application-metrics/</guid>
          <description xml:base="https://pierrezemb.fr/posts/practical-guide-to-application-metrics/">&lt;p&gt;I keep having the same conversation with junior developers. They&#x27;re building their first production service, and they ask: &quot;Where should I put metrics in my application?&quot; Then, inevitably: &quot;What should I actually measure?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;After mentoring dozens of engineers and running distributed systems for years, I&#x27;ve learned these aren&#x27;t just beginner questions. Even experienced developers struggle with metrics placement because most of us learned observability as an afterthought, not as a core design principle.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve been on both sides: deploying services with no metrics and scrambling at 3 AM to understand what broke, and also building comprehensive monitoring that caught issues before users noticed. The difference isn&#x27;t just about sleep quality; it&#x27;s about building systems you can actually operate with confidence.&lt;&#x2F;p&gt;
&lt;p&gt;This post gives you a practical framework for where to instrument your applications. No theory, just patterns I&#x27;ve learned from years of production incidents.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-five-essential-metric-types&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-five-essential-metric-types&quot; aria-label=&quot;Anchor link for: the-five-essential-metric-types&quot;&gt;🔗&lt;&#x2F;a&gt;The Five Essential Metric Types&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick note on naming:&lt;&#x2F;strong&gt; Throughout this post, I use dots (&lt;code&gt;.&lt;&#x2F;code&gt;) as metric separators like &lt;code&gt;api.requests.total&lt;&#x2F;code&gt;. This works perfectly for us because we&#x27;re heavy &lt;a href=&quot;https:&#x2F;&#x2F;warp10.io&#x2F;&quot;&gt;Warp 10&lt;&#x2F;a&gt; users, and Warp 10 handles dots beautifully. If you&#x27;re using Prometheus or other systems that prefer underscores, just replace the dots with underscores (&lt;code&gt;api_requests_total&lt;&#x2F;code&gt;). The patterns remain the same!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;All useful application metrics fall into five categories. Understanding these helps you decide what to instrument and where:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;1. Operational Counters&lt;&#x2F;strong&gt; track discrete events in your system. Every time something happens (a request arrives, a job finishes, an error occurs), you increment a counter. The most critical insight here is measuring both success and failure paths. Most developers remember to count successful operations but forget the errors, leaving them blind when things break. Examples include &lt;code&gt;api.requests.total&lt;&#x2F;code&gt;, &lt;code&gt;db.queries.executed&lt;&#x2F;code&gt;, &lt;code&gt;auth.failures.count&lt;&#x2F;code&gt;, &lt;code&gt;payments.declined.count&lt;&#x2F;code&gt;, &lt;code&gt;jobs.started&lt;&#x2F;code&gt;, and &lt;code&gt;cache.evictions&lt;&#x2F;code&gt;. Always include labels like &lt;code&gt;method&lt;&#x2F;code&gt;, &lt;code&gt;endpoint&lt;&#x2F;code&gt;, &lt;code&gt;error_type&lt;&#x2F;code&gt; to provide context.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;2. Resource Utilization&lt;&#x2F;strong&gt; answers &quot;how much of X am I using right now?&quot; These are your early warning system for capacity problems. Track current values with gauges, cumulative usage with counters. The key is monitoring resources before they&#x27;re completely exhausted. A connection pool might support 100 connections, but if 95 are active, you&#x27;re in trouble. Monitor &lt;code&gt;memory.used.bytes&lt;&#x2F;code&gt;, &lt;code&gt;db.connections.active&lt;&#x2F;code&gt;, &lt;code&gt;cache.size.entries&lt;&#x2F;code&gt;, &lt;code&gt;thread_pool.active_threads&lt;&#x2F;code&gt;, and &lt;code&gt;disk.space.available.bytes&lt;&#x2F;code&gt;. Watch for patterns like steadily increasing memory usage or connection counts approaching pool limits.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;3. Performance and Latency&lt;&#x2F;strong&gt; shows how fast (or slow) things are running. Users feel latency immediately, making these often your most-watched dashboards. Always include units in metric names (&lt;code&gt;.ms&lt;&#x2F;code&gt;, &lt;code&gt;.seconds&lt;&#x2F;code&gt;, &lt;code&gt;.bytes&lt;&#x2F;code&gt;) to make dashboards self-documenting. Track &lt;code&gt;api.response_time.ms&lt;&#x2F;code&gt;, &lt;code&gt;db.query.duration.ms&lt;&#x2F;code&gt;, &lt;code&gt;jobs.processing_time.seconds&lt;&#x2F;code&gt;, and &lt;code&gt;external_api.call.duration.ms&lt;&#x2F;code&gt;. Monitor percentiles (p50, p95, p99), not just averages: a 100ms average can hide a 5-second p99, which means one request in a hundred is painfully slow.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;4. Data Volume and Throughput&lt;&#x2F;strong&gt; tracks data flow through your system. These metrics are crucial for capacity planning and spotting bottlenecks before they cause user-visible problems. Monitor both input and output rates to understand processing efficiency. Focus on &lt;code&gt;queue.messages.consumed&lt;&#x2F;code&gt;, &lt;code&gt;network.bytes.sent&lt;&#x2F;code&gt;, &lt;code&gt;database.rows.processed&lt;&#x2F;code&gt;, &lt;code&gt;file_processor.files.completed&lt;&#x2F;code&gt;, and &lt;code&gt;batch_processor.records.per_batch&lt;&#x2F;code&gt;. Compare input vs output rates to identify accumulating backlogs.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;5. Business Logic&lt;&#x2F;strong&gt; captures domain-specific metrics that relate to your actual business value. These are often the most valuable metrics for understanding how your application is really being used and whether technical problems are affecting business outcomes. Track &lt;code&gt;orders.placed&lt;&#x2F;code&gt;, &lt;code&gt;users.registered&lt;&#x2F;code&gt;, &lt;code&gt;searches.executed&lt;&#x2F;code&gt;, &lt;code&gt;documents.uploaded&lt;&#x2F;code&gt;, and &lt;code&gt;subscriptions.activated&lt;&#x2F;code&gt;. Don&#x27;t underestimate these: they&#x27;re what your executives care about and often reveal problems that technical metrics miss.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Type&lt;&#x2F;th&gt;&lt;th&gt;Examples&lt;&#x2F;th&gt;&lt;th&gt;Key Insight&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Operational&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;api.requests.total&lt;&#x2F;code&gt;, &lt;code&gt;auth.failures.count&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Track success AND failure paths&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Resource&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;memory.used.bytes&lt;&#x2F;code&gt;, &lt;code&gt;db.connections.active&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Early warning for capacity issues&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Performance&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;api.response_time.ms&lt;&#x2F;code&gt;, &lt;code&gt;db.query.duration.ms&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Users feel latency immediately&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Throughput&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;queue.messages.consumed&lt;&#x2F;code&gt;, &lt;code&gt;network.bytes.sent&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Understand data flow patterns&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Business&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;orders.placed&lt;&#x2F;code&gt;, &lt;code&gt;users.registered&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;What executives actually care about&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h2 id=&quot;where-to-instrument-a-component-guide&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#where-to-instrument-a-component-guide&quot; aria-label=&quot;Anchor link for: where-to-instrument-a-component-guide&quot;&gt;🔗&lt;&#x2F;a&gt;Where to Instrument: A Component Guide&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;api-endpoints-and-http-requests&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#api-endpoints-and-http-requests&quot; aria-label=&quot;Anchor link for: api-endpoints-and-http-requests&quot;&gt;🔗&lt;&#x2F;a&gt;API Endpoints and HTTP Requests&lt;&#x2F;h3&gt;
&lt;p&gt;Your application&#x27;s front door deserves comprehensive monitoring. Every HTTP request tells a story from arrival to completion, and you want to capture that entire narrative, not just the happy path. Track &lt;code&gt;api.requests.total&lt;&#x2F;code&gt; with labels for &lt;code&gt;method&lt;&#x2F;code&gt;, &lt;code&gt;endpoint&lt;&#x2F;code&gt;, and &lt;code&gt;status_code&lt;&#x2F;code&gt; to understand usage patterns. Monitor &lt;code&gt;api.response_time.ms&lt;&#x2F;code&gt; to show user experience, and &lt;code&gt;api.errors.count&lt;&#x2F;code&gt; with &lt;code&gt;error_type&lt;&#x2F;code&gt; labels to reveal reliability issues. Include &lt;code&gt;auth.failures.count&lt;&#x2F;code&gt; with &lt;code&gt;failure_reason&lt;&#x2F;code&gt; to catch security problems, and &lt;code&gt;api.concurrent_requests&lt;&#x2F;code&gt; to identify when you&#x27;re approaching capacity limits. The common mistake is only instrumenting successful requests; the real value comes from measuring what happens when things go wrong: network timeouts, validation errors, service dependencies failing.&lt;&#x2F;p&gt;
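&lt;p&gt;As a minimal sketch of counting both paths, assuming plain process-wide atomics instead of a real metrics library: the names &lt;code&gt;REQUESTS_TOTAL&lt;&#x2F;code&gt;, &lt;code&gt;ERRORS_TOTAL&lt;&#x2F;code&gt;, &lt;code&gt;record_request&lt;&#x2F;code&gt; and &lt;code&gt;snapshot&lt;&#x2F;code&gt; are hypothetical, and labels such as &lt;code&gt;method&lt;&#x2F;code&gt; or &lt;code&gt;endpoint&lt;&#x2F;code&gt; are left out for brevity.&lt;&#x2F;p&gt;

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical process-wide counters; a real service would register these
// with its metrics library and attach labels such as method and endpoint.
static REQUESTS_TOTAL: AtomicU64 = AtomicU64::new(0);
static ERRORS_TOTAL: AtomicU64 = AtomicU64::new(0);

/// Records one finished request. Every request counts toward the total,
/// and failed requests additionally count as errors.
fn record_request(succeeded: bool) {
    REQUESTS_TOTAL.fetch_add(1, Ordering::Relaxed);
    if !succeeded {
        ERRORS_TOTAL.fetch_add(1, Ordering::Relaxed);
    }
}

/// Returns (total requests, failed requests) as currently recorded.
fn snapshot() -> (u64, u64) {
    (
        REQUESTS_TOTAL.load(Ordering::Relaxed),
        ERRORS_TOTAL.load(Ordering::Relaxed),
    )
}
```

&lt;p&gt;The point of the sketch is that the error counter is updated on the same code path as the total, so a failed request can never go uncounted.&lt;&#x2F;p&gt;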
&lt;h3 id=&quot;database-layer&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#database-layer&quot; aria-label=&quot;Anchor link for: database-layer&quot;&gt;🔗&lt;&#x2F;a&gt;Database Layer&lt;&#x2F;h3&gt;
&lt;p&gt;Database calls are often your biggest bottleneck and cause more production incidents than any other component. For connection management, track &lt;code&gt;db.connections.active&lt;&#x2F;code&gt; (critical for pool management), &lt;code&gt;db.connections.idle&lt;&#x2F;code&gt; (available connections), and &lt;code&gt;db.connections.wait_time.ms&lt;&#x2F;code&gt; (time threads wait for connections). Monitor query performance with &lt;code&gt;db.queries.executed&lt;&#x2F;code&gt; (including &lt;code&gt;operation_type&lt;&#x2F;code&gt; and &lt;code&gt;table&lt;&#x2F;code&gt; labels), &lt;code&gt;db.query.duration.ms&lt;&#x2F;code&gt; (with percentile tracking), &lt;code&gt;db.slow_queries.count&lt;&#x2F;code&gt; (queries exceeding thresholds), and &lt;code&gt;db.query.rows_affected&lt;&#x2F;code&gt; (rows returned or modified). For error monitoring, track &lt;code&gt;db.errors.count&lt;&#x2F;code&gt; by &lt;code&gt;error_type&lt;&#x2F;code&gt; (timeout, deadlock, constraint violation) and &lt;code&gt;db.connection_errors.count&lt;&#x2F;code&gt; for connection failures. Connection pool exhaustion is a classic way to kill your entire application: if you support 100 connections and 95 are active, you&#x27;re in danger. Start alerting when you hit 85% utilization.&lt;&#x2F;p&gt;
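&lt;p&gt;The 85% rule can be sketched as a small check over &lt;code&gt;db.connections.active&lt;&#x2F;code&gt; and the pool size. The function name &lt;code&gt;pool_needs_alert&lt;&#x2F;code&gt; and the constant are assumptions made for illustration, not part of any particular pool library.&lt;&#x2F;p&gt;

```rust
/// Utilization threshold, in percent, at which to start alerting.
const ALERT_THRESHOLD_PERCENT: u64 = 85;

/// Returns true once active connections reach 85% of the pool size.
/// Integer arithmetic avoids floating-point rounding at the boundary.
fn pool_needs_alert(active: u64, max: u64) -> bool {
    if max == 0 {
        // An unconfigured pool has no meaningful utilization.
        return false;
    }
    active * 100 >= max * ALERT_THRESHOLD_PERCENT
}
```

&lt;p&gt;With a pool of 100 connections, 84 active connections stay quiet while 85 or more trigger the alert.&lt;&#x2F;p&gt;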
&lt;h3 id=&quot;message-queues-and-background-processing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#message-queues-and-background-processing&quot; aria-label=&quot;Anchor link for: message-queues-and-background-processing&quot;&gt;🔗&lt;&#x2F;a&gt;Message Queues and Background Processing&lt;&#x2F;h3&gt;
&lt;p&gt;Message queues often hide subtle bugs that manifest as slowly growing delays or stuck processing. Track data flow in both directions to catch issues early. For producers, monitor &lt;code&gt;queue.messages.produced&lt;&#x2F;code&gt; (with &lt;code&gt;topic&lt;&#x2F;code&gt; and &lt;code&gt;producer_id&lt;&#x2F;code&gt; labels), &lt;code&gt;queue.messages.failed&lt;&#x2F;code&gt; (with detailed &lt;code&gt;error_type&lt;&#x2F;code&gt; labels), &lt;code&gt;queue.producer.wait_time.ms&lt;&#x2F;code&gt; (time waiting for producer availability), and &lt;code&gt;queue.batch_size&lt;&#x2F;code&gt; (messages sent per batch). For consumers, track &lt;code&gt;queue.messages.consumed&lt;&#x2F;code&gt; (successfully processed), &lt;code&gt;queue.processing.time.ms&lt;&#x2F;code&gt; (per-message duration), &lt;code&gt;queue.processing.errors&lt;&#x2F;code&gt; (failures with &lt;code&gt;error_type&lt;&#x2F;code&gt; and recovery action), &lt;code&gt;jobs.queue.depth&lt;&#x2F;code&gt; (messages waiting), and &lt;code&gt;consumer.lag.ms&lt;&#x2F;code&gt; (how far behind real-time). Background jobs need additional metrics: &lt;code&gt;jobs.started&lt;&#x2F;code&gt;, &lt;code&gt;jobs.completed&lt;&#x2F;code&gt;, &lt;code&gt;jobs.failed&lt;&#x2F;code&gt; (with failure reason), &lt;code&gt;jobs.retry.count&lt;&#x2F;code&gt;, and &lt;code&gt;jobs.execution_time.seconds&lt;&#x2F;code&gt;. Growing queue depth usually means you&#x27;re processing jobs slower than they&#x27;re being created, leading to increasing delays and eventual system overload. Consumer lag helps you understand if you&#x27;re keeping up with real-time processing needs.&lt;&#x2F;p&gt;
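&lt;p&gt;To make the queue depth reasoning concrete, here is a rough sketch that compares what was produced and consumed over one interval. The function names are hypothetical, and the inputs stand in for deltas of &lt;code&gt;queue.messages.produced&lt;&#x2F;code&gt; and &lt;code&gt;queue.messages.consumed&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;

```rust
/// Queue depth at the end of an interval, given the depth at its start and
/// the number of messages produced and consumed during it.
fn depth_after_interval(depth: u64, produced: u64, consumed: u64) -> u64 {
    (depth + produced).saturating_sub(consumed)
}

/// True when producers outpaced consumers during the interval, which is
/// the condition that makes queue depth grow.
fn is_falling_behind(produced: u64, consumed: u64) -> bool {
    produced > consumed
}
```

&lt;p&gt;A single interval that falls behind is rarely a problem; several in a row is the slowly growing delay described above.&lt;&#x2F;p&gt;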
&lt;h3 id=&quot;caching-and-locks&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#caching-and-locks&quot; aria-label=&quot;Anchor link for: caching-and-locks&quot;&gt;🔗&lt;&#x2F;a&gt;Caching and Locks&lt;&#x2F;h3&gt;
&lt;p&gt;For cache performance, track &lt;code&gt;cache.requests.total&lt;&#x2F;code&gt; (with &lt;code&gt;operation&lt;&#x2F;code&gt; labels for get, set, delete), &lt;code&gt;cache.hits&lt;&#x2F;code&gt; and &lt;code&gt;cache.misses&lt;&#x2F;code&gt; (for calculating hit ratio), &lt;code&gt;cache.size.entries&lt;&#x2F;code&gt; (current cached items), &lt;code&gt;cache.size.bytes&lt;&#x2F;code&gt; (memory usage), &lt;code&gt;cache.evictions&lt;&#x2F;code&gt; (items removed with &lt;code&gt;eviction_reason&lt;&#x2F;code&gt;), and &lt;code&gt;cache.operation.duration.ms&lt;&#x2F;code&gt; (time for operations). Hit ratio below 80% usually indicates problems: either you&#x27;re caching the wrong things, cache TTL is too short, or your working set exceeds cache capacity.&lt;&#x2F;p&gt;
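&lt;p&gt;The hit ratio is derived from &lt;code&gt;cache.hits&lt;&#x2F;code&gt; and &lt;code&gt;cache.misses&lt;&#x2F;code&gt;. A minimal sketch, with hypothetical function names and the 80% rule of thumb hard-coded:&lt;&#x2F;p&gt;

```rust
/// Cache hit ratio between 0.0 and 1.0, or 0.0 when there were no lookups.
fn hit_ratio(hits: u64, misses: u64) -> f64 {
    let total = hits + misses;
    if total == 0 {
        return 0.0;
    }
    hits as f64 / total as f64
}

/// Flags a cache whose hit ratio fell below the 80% rule of thumb.
/// A cache with no lookups at all is not flagged.
fn hit_ratio_is_low(hits: u64, misses: u64) -> bool {
    let total = hits + misses;
    // Equivalent to hits / total being under 0.80, without floating point.
    total * 80 > hits * 100
}
```

&lt;p&gt;Computing the ratio from two counters, rather than storing it as a gauge, keeps it correct over any time window you choose to query.&lt;&#x2F;p&gt;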
&lt;p&gt;For lock and synchronization, monitor &lt;code&gt;locks.acquire.duration.ms&lt;&#x2F;code&gt; (time from requesting to getting lock), &lt;code&gt;locks.held.duration.ms&lt;&#x2F;code&gt; (how long locks are held), &lt;code&gt;locks.contention.count&lt;&#x2F;code&gt; (threads waiting), and &lt;code&gt;locks.timeouts.count&lt;&#x2F;code&gt; (failed acquisitions within timeout). Lock contention can kill your entire application, but it stays invisible without metrics. I&#x27;ve debugged more performance issues with lock metrics than almost any other single type. High acquisition times mean contention; long hold times suggest you&#x27;re doing too much work while holding the lock.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;real-world-instrumentation-patterns&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#real-world-instrumentation-patterns&quot; aria-label=&quot;Anchor link for: real-world-instrumentation-patterns&quot;&gt;🔗&lt;&#x2F;a&gt;Real-World Instrumentation Patterns&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;the-request-lifecycle-pattern&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-request-lifecycle-pattern&quot; aria-label=&quot;Anchor link for: the-request-lifecycle-pattern&quot;&gt;🔗&lt;&#x2F;a&gt;The Request Lifecycle Pattern&lt;&#x2F;h3&gt;
&lt;p&gt;For every user-facing operation, track the complete journey from entry to exit. This means instrumenting not just the success path, but every branch your code can take. Increment &lt;code&gt;api.requests.received&lt;&#x2F;code&gt; the moment a request hits your service, track &lt;code&gt;auth.attempts.count&lt;&#x2F;code&gt; and &lt;code&gt;auth.failures.count&lt;&#x2F;code&gt; separately to show both volume and failure rate, monitor &lt;code&gt;authorization.decisions.count&lt;&#x2F;code&gt; with labels for &lt;code&gt;granted&lt;&#x2F;code&gt; vs &lt;code&gt;denied&lt;&#x2F;code&gt;, measure &lt;code&gt;business_logic.duration.ms&lt;&#x2F;code&gt; to isolate your application logic performance, and record final &lt;code&gt;response.status_code&lt;&#x2F;code&gt; distribution to understand your error patterns. Most developers instrument the happy path but forget edge cases. A request that fails authentication never reaches your business logic, but it still uses resources and affects user experience.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-resource-exhaustion-pattern&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-resource-exhaustion-pattern&quot; aria-label=&quot;Anchor link for: the-resource-exhaustion-pattern&quot;&gt;🔗&lt;&#x2F;a&gt;The Resource Exhaustion Pattern&lt;&#x2F;h3&gt;
&lt;p&gt;Systems fail when they run out of resources. The trick is measuring resources before they&#x27;re completely exhausted, giving you time to react. For connection pools, track &lt;code&gt;db.connections.active&lt;&#x2F;code&gt; vs &lt;code&gt;db.connections.max&lt;&#x2F;code&gt; and don&#x27;t wait until you hit 100% utilization: start alerting at 85%. Monitor &lt;code&gt;db.connections.wait_time.ms&lt;&#x2F;code&gt; because long waits indicate you&#x27;re close to exhaustion even if you haven&#x27;t hit the limit. For memory pressure, monitor both &lt;code&gt;memory.heap.used.bytes&lt;&#x2F;code&gt; and &lt;code&gt;gc.frequency.per_minute&lt;&#x2F;code&gt; since high GC frequency often predicts memory pressure before OutOfMemory errors occur. Track &lt;code&gt;memory.allocation.rate.bytes_per_second&lt;&#x2F;code&gt; to understand if your allocation rate is sustainable. For queue management, a growing &lt;code&gt;jobs.queue.depth&lt;&#x2F;code&gt; indicates you&#x27;re processing work slower than it arrives, eventually leading to timeouts and system overload. Track &lt;code&gt;queue.processing.rate.per_second&lt;&#x2F;code&gt; and &lt;code&gt;queue.arrival.rate.per_second&lt;&#x2F;code&gt;: the relationship between these rates tells you if you&#x27;re keeping up. For disk space, track &lt;code&gt;disk.available.bytes&lt;&#x2F;code&gt; and &lt;code&gt;disk.usage.rate.bytes_per_hour&lt;&#x2F;code&gt;. Linear growth can be predicted and prevented, while sudden spikes indicate immediate problems.&lt;&#x2F;p&gt;
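&lt;p&gt;For the disk case, linear growth turns into a simple forecast. The sketch below assumes the two inputs come from &lt;code&gt;disk.available.bytes&lt;&#x2F;code&gt; and &lt;code&gt;disk.usage.rate.bytes_per_hour&lt;&#x2F;code&gt;; the function name &lt;code&gt;hours_until_full&lt;&#x2F;code&gt; is hypothetical.&lt;&#x2F;p&gt;

```rust
/// Hours until the disk is full, assuming usage keeps growing linearly.
/// Returns infinity when usage is flat or shrinking.
fn hours_until_full(available_bytes: u64, usage_rate_bytes_per_hour: f64) -> f64 {
    if usage_rate_bytes_per_hour > 0.0 {
        available_bytes as f64 / usage_rate_bytes_per_hour
    } else {
        f64::INFINITY
    }
}
```

&lt;p&gt;Alerting on the forecast rather than on the raw free space gives the reaction time this pattern is about. It only holds for steady growth, so sudden spikes still need their own alert.&lt;&#x2F;p&gt;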
&lt;h3 id=&quot;the-business-context-pattern&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-business-context-pattern&quot; aria-label=&quot;Anchor link for: the-business-context-pattern&quot;&gt;🔗&lt;&#x2F;a&gt;The Business Context Pattern&lt;&#x2F;h3&gt;
&lt;p&gt;Technical metrics tell you &lt;em&gt;what&lt;&#x2F;em&gt; is happening; business metrics tell you &lt;em&gt;why&lt;&#x2F;em&gt; it matters. Always pair technical instrumentation with business context to understand the real impact of technical problems. Track &lt;code&gt;api.errors.count&lt;&#x2F;code&gt; alongside &lt;code&gt;orders.lost.count&lt;&#x2F;code&gt; to understand how technical problems affect sales, monitor &lt;code&gt;payment_service.response_time.ms&lt;&#x2F;code&gt; alongside &lt;code&gt;checkout.abandonment.rate&lt;&#x2F;code&gt; to see if slow payments drive users away, and measure &lt;code&gt;search.response_time.ms&lt;&#x2F;code&gt; alongside &lt;code&gt;search.result_clicks.count&lt;&#x2F;code&gt; to understand if slow search reduces engagement. For user experience correlation, pair &lt;code&gt;cache.misses.count&lt;&#x2F;code&gt; with &lt;code&gt;page.load.time.ms&lt;&#x2F;code&gt; to quantify cache performance impact, track &lt;code&gt;db.slow_queries.count&lt;&#x2F;code&gt; alongside &lt;code&gt;user.session.duration.minutes&lt;&#x2F;code&gt; to see if database performance affects user retention, and monitor &lt;code&gt;auth.failures.count&lt;&#x2F;code&gt; with &lt;code&gt;support.tickets.count&lt;&#x2F;code&gt; to predict support load from technical issues. For capacity planning, correlate &lt;code&gt;server.cpu.usage.percent&lt;&#x2F;code&gt; with &lt;code&gt;concurrent.users.count&lt;&#x2F;code&gt; to understand scaling requirements, track &lt;code&gt;memory.usage.bytes&lt;&#x2F;code&gt; alongside &lt;code&gt;active.sessions.count&lt;&#x2F;code&gt; to predict memory needs, and monitor &lt;code&gt;network.bandwidth.used.mbps&lt;&#x2F;code&gt; with &lt;code&gt;file.uploads.count&lt;&#x2F;code&gt; to plan infrastructure scaling. This pairing helps you understand the business impact of technical problems and prioritize fixes based on actual user and revenue impact.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-error-classification-pattern&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-error-classification-pattern&quot; aria-label=&quot;Anchor link for: the-error-classification-pattern&quot;&gt;🔗&lt;&#x2F;a&gt;The Error Classification Pattern&lt;&#x2F;h3&gt;
&lt;p&gt;Not all errors are created equal. Classify errors by their impact and actionability to build appropriate response strategies. User errors (4xx) like &lt;code&gt;auth.invalid_credentials&lt;&#x2F;code&gt;, &lt;code&gt;validation.missing_field&lt;&#x2F;code&gt;, or &lt;code&gt;resource.not_found&lt;&#x2F;code&gt; are usually not your fault, but track patterns to identify UX issues. High rates might indicate confusing interfaces or inadequate client-side validation; alert on unusual spikes that might indicate attacks or system confusion. System errors (5xx) like &lt;code&gt;db.connection_timeout&lt;&#x2F;code&gt;, &lt;code&gt;service.unavailable&lt;&#x2F;code&gt;, or &lt;code&gt;memory.exhausted&lt;&#x2F;code&gt; are your responsibility to fix immediately. They&#x27;re always actionable and usually indicate infrastructure or code problems that should trigger immediate alerts and investigation. External dependency errors like &lt;code&gt;payment_gateway.timeout&lt;&#x2F;code&gt;, &lt;code&gt;third_party_api.rate_limited&lt;&#x2F;code&gt;, or &lt;code&gt;cdn.unavailable&lt;&#x2F;code&gt; are outside your direct control but affect users. They require fallback strategies and user communication, and help predict when to escalate with external providers. Distinguish transient errors (&lt;code&gt;network.timeout&lt;&#x2F;code&gt;, &lt;code&gt;rate_limit.exceeded&lt;&#x2F;code&gt; that often resolve themselves) from persistent errors (&lt;code&gt;config.invalid&lt;&#x2F;code&gt;, &lt;code&gt;database.schema_mismatch&lt;&#x2F;code&gt; that require immediate intervention). Each category needs different alerting strategies, escalation procedures, and response timeframes: user errors might warrant daily review, system errors need immediate alerts, external errors require monitoring trends and fallback activation.&lt;&#x2F;p&gt;
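&lt;p&gt;One way to encode these categories is a pair of enums and a mapping between them. This is only a sketch: the type and function names are hypothetical, and classifying by HTTP status code plus a dependency flag is a simplifying assumption.&lt;&#x2F;p&gt;

```rust
/// Error categories from the text above; the names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorClass {
    User,     // 4xx, e.g. auth.invalid_credentials
    System,   // 5xx, e.g. db.connection_timeout
    External, // dependency failures, e.g. payment_gateway.timeout
}

/// How each category is handled operationally.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AlertAction {
    DailyReview,
    ImmediateAlert,
    WatchTrendAndFallback,
}

/// Classifies a failed request. Intended for failures only: any failure
/// that is neither external nor a 4xx is treated as a system error.
fn classify(status_code: u16, from_dependency: bool) -> ErrorClass {
    if from_dependency {
        return ErrorClass::External;
    }
    match status_code {
        400..=499 => ErrorClass::User,
        _ => ErrorClass::System,
    }
}

/// Maps an error category to the response strategy described above.
fn alert_action(class: ErrorClass) -> AlertAction {
    match class {
        ErrorClass::User => AlertAction::DailyReview,
        ErrorClass::System => AlertAction::ImmediateAlert,
        ErrorClass::External => AlertAction::WatchTrendAndFallback,
    }
}
```

&lt;p&gt;Because both &lt;code&gt;match&lt;&#x2F;code&gt; expressions are exhaustive, adding a new category forces a decision about how it should be alerted on.&lt;&#x2F;p&gt;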
&lt;h2 id=&quot;best-practices-for-production-metrics&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#best-practices-for-production-metrics&quot; aria-label=&quot;Anchor link for: best-practices-for-production-metrics&quot;&gt;🔗&lt;&#x2F;a&gt;Best Practices for Production Metrics&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;naming-conventions-that-scale&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#naming-conventions-that-scale&quot; aria-label=&quot;Anchor link for: naming-conventions-that-scale&quot;&gt;🔗&lt;&#x2F;a&gt;Naming Conventions That Scale&lt;&#x2F;h3&gt;
&lt;p&gt;Consistent naming prevents the confusion that kills metrics adoption. Use a clear hierarchy: &lt;code&gt;&amp;lt;system&amp;gt;.&amp;lt;component&amp;gt;.&amp;lt;operation&amp;gt;.&amp;lt;metric_type&amp;gt;&lt;&#x2F;code&gt;. Examples include &lt;code&gt;api.auth.requests.count&lt;&#x2F;code&gt;, &lt;code&gt;db.user_queries.duration.ms&lt;&#x2F;code&gt;, &lt;code&gt;cache.metadata.hits.total&lt;&#x2F;code&gt;, and &lt;code&gt;queue.order_processing.messages.consumed&lt;&#x2F;code&gt;. Standardize your suffixes: &lt;code&gt;.count&#x2F;.total&lt;&#x2F;code&gt; for event counters, &lt;code&gt;.current&#x2F;.active&lt;&#x2F;code&gt; for current gauge values, &lt;code&gt;.duration&#x2F;.ms&#x2F;.seconds&lt;&#x2F;code&gt; for time measurements, &lt;code&gt;.bytes&#x2F;.mb&#x2F;.gb&lt;&#x2F;code&gt; for data volume, &lt;code&gt;.errors&#x2F;.failures&lt;&#x2F;code&gt; for error counters, and &lt;code&gt;.ratio&#x2F;.rate&lt;&#x2F;code&gt; for ratios and rates. Avoid mixing naming styles (&lt;code&gt;requestCount&lt;&#x2F;code&gt; vs &lt;code&gt;request_total&lt;&#x2F;code&gt;), ambiguous units (&lt;code&gt;response_time&lt;&#x2F;code&gt; without units), and inconsistent hierarchies (&lt;code&gt;api_requests&lt;&#x2F;code&gt; vs &lt;code&gt;requests.api&lt;&#x2F;code&gt;). This consistency pays off during 3 AM troubleshooting when you don&#x27;t want to waste mental energy remembering naming schemes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;critical-mistakes-to-avoid&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#critical-mistakes-to-avoid&quot; aria-label=&quot;Anchor link for: critical-mistakes-to-avoid&quot;&gt;🔗&lt;&#x2F;a&gt;Critical Mistakes to Avoid&lt;&#x2F;h3&gt;
&lt;p&gt;The biggest mistake is using unbounded values as labels. Don&#x27;t tag metrics with user IDs, session tokens, IP addresses, or other unlimited values; your metrics system will eventually explode from too many unique series. Use &lt;code&gt;api.requests{user_type=&quot;premium&quot;, region=&quot;us-west&quot;}&lt;&#x2F;code&gt; instead of &lt;code&gt;api.requests{user_id=&quot;12345&quot;, session=&quot;abc123xyz&quot;}&lt;&#x2F;code&gt;. Always instrument failure cases, not just success paths. Track both &lt;code&gt;payments.succeeded&lt;&#x2F;code&gt; AND &lt;code&gt;payments.failed&lt;&#x2F;code&gt; with error type labels, monitor &lt;code&gt;auth.attempts&lt;&#x2F;code&gt; alongside &lt;code&gt;auth.failures&lt;&#x2F;code&gt; to understand failure rates, and count &lt;code&gt;file.uploads.completed&lt;&#x2F;code&gt; and &lt;code&gt;file.uploads.failed&lt;&#x2F;code&gt; to see processing reliability. Every metric has a cost in storage, network bandwidth, and cognitive load. If you can&#x27;t explain why a metric matters for operations or business decisions, skip it. Ask yourself: &quot;Would this metric help me during an incident?&quot; Metrics that stop updating can be worse than no metrics at all. Always include heartbeat or health check metrics to verify your instrumentation is working; track &lt;code&gt;metrics.last_updated.timestamp&lt;&#x2F;code&gt; to detect collection failures.&lt;&#x2F;p&gt;
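&lt;p&gt;As a rough sketch of the bounded-label idea, an unbounded value can be collapsed into a handful of ranges before it is used as a label. The function names and bucket edges below are illustrative assumptions, not part of any metrics library:&lt;&#x2F;p&gt;

```rust
/// Collapse an unbounded order amount (in cents) into a small, fixed set of
/// label values. The bucket edges are illustrative assumptions.
fn amount_range(amount_cents: u64) -> String {
    let label = match amount_cents {
        0..=999 => "under_10",
        1_000..=9_999 => "10_to_100",
        10_000..=99_999 => "100_to_1000",
        _ => "over_1000",
    };
    label.to_string()
}

/// Upper bound on the number of time series for one metric: the product of
/// the number of distinct values each label can take.
fn max_series(distinct_values_per_label: [u64; 2]) -> u64 {
    distinct_values_per_label.iter().product()
}
```

&lt;p&gt;With four amount ranges and, say, three user types, &lt;code&gt;max_series([4, 3])&lt;&#x2F;code&gt; tops out at 12 series no matter how many customers you have, whereas a &lt;code&gt;user_id&lt;&#x2F;code&gt; label grows with your user base.&lt;&#x2F;p&gt;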
&lt;h2 id=&quot;a-note-on-histograms-or-why-they-re-not-here&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#a-note-on-histograms-or-why-they-re-not-here&quot; aria-label=&quot;Anchor link for: a-note-on-histograms-or-why-they-re-not-here&quot;&gt;🔗&lt;&#x2F;a&gt;A Note on Histograms (Or: Why They&#x27;re Not Here)&lt;&#x2F;h2&gt;
&lt;p&gt;I know what some of you are thinking: &quot;Where are the histograms?&quot; After all, this is a comprehensive guide to application metrics, and histograms are everywhere in monitoring discussions. Well, I deliberately left them out, and here&#x27;s why.&lt;&#x2F;p&gt;
&lt;p&gt;Prometheus histograms are fundamentally broken in ways that make them more dangerous than useful. The core problem is what I call the bucket pre-configuration paradox: you must define bucket boundaries before you know your data distribution. As LinuxCzar eloquently put it in his &lt;a href=&quot;https:&#x2F;&#x2F;linuxczar.net&#x2F;blog&#x2F;2017&#x2F;06&#x2F;15&#x2F;prometheus-histogram-2&#x2F;&quot;&gt;&quot;tale of woe&quot;&lt;&#x2F;a&gt;, this creates an impossible choice between accuracy (many buckets) and operability (few buckets). Get it wrong, and you either lose precision or crash your Prometheus server with &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;prometheus&#x2F;prometheus&#x2F;discussions&#x2F;10598&quot;&gt;cardinality explosion&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;But the problems run deeper. You &lt;a href=&quot;https:&#x2F;&#x2F;www.solarwinds.com&#x2F;blog&#x2F;why-percentiles-dont-work-the-way-you-think&quot;&gt;mathematically cannot aggregate percentiles&lt;&#x2F;a&gt; across instances because the underlying event data is lost. The linear interpolation algorithm produces &lt;a href=&quot;https:&#x2F;&#x2F;prometheus.io&#x2F;docs&#x2F;practices&#x2F;histograms&#x2F;&quot;&gt;significant estimation errors&lt;&#x2F;a&gt;, and Prometheus&#x27;s scraping architecture introduces &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;prometheus&#x2F;prometheus&#x2F;issues&#x2F;1887&quot;&gt;data corruption&lt;&#x2F;a&gt; where histogram buckets update inconsistently. The &lt;a href=&quot;https:&#x2F;&#x2F;chronosphere.io&#x2F;learn&#x2F;histograms-for-complex-systems&#x2F;&quot;&gt;operational burden&lt;&#x2F;a&gt; never ends: every performance improvement potentially invalidates your bucket choices, forcing constant manual reconfiguration.&lt;&#x2F;p&gt;
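&lt;p&gt;To make the aggregation problem concrete, here is a small worked example with made-up latency samples from two hypothetical instances. It compares the average of the two per-instance p90 values with the p90 computed over all raw samples, using the nearest-rank method:&lt;&#x2F;p&gt;

```rust
/// Index of the 90th percentile in a sorted list of `n` samples
/// (nearest-rank method). Assumes `n` is at least 1.
fn p90_index(n: usize) -> usize {
    // ceil(0.9 * n) - 1, written in integer arithmetic
    (9 * n + 9) / 10 - 1
}

/// Returns (average of per-instance p90s, p90 over all raw samples).
fn compare_p90() -> (u64, u64) {
    // Hypothetical latency samples in milliseconds.
    let mut fast: [u64; 10] = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19];
    let mut slow: [u64; 10] = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190];
    fast.sort_unstable();
    slow.sort_unstable();

    // What you get when each instance reports only its own percentile.
    let p90_fast = fast[p90_index(fast.len())];
    let p90_slow = slow[p90_index(slow.len())];
    let averaged = (p90_fast + p90_slow) / 2;

    // What you get when the raw events are still available.
    let mut merged = fast.to_vec();
    merged.extend(slow);
    merged.sort_unstable();
    let true_p90 = merged[p90_index(merged.len())];

    (averaged, true_p90)
}
```

&lt;p&gt;In this toy data set the averaged figure is 99 ms while the p90 over the merged samples is 170 ms. Once only the per-instance percentiles are kept, the second number cannot be recovered.&lt;&#x2F;p&gt;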
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;catwell.info&quot;&gt;Pierre Chapuis&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;catwell.info&#x2F;post&#x2F;3lzxliivegs2k&quot;&gt;pointed out&lt;&#x2F;a&gt; the root cause I missed: Prometheus implements an outdated 2005 algorithm from Cormode et al. for histogram summaries and quantiles. There are much better algorithms available now, including improved versions from the same authors. Check out &lt;a href=&quot;https:&#x2F;&#x2F;cs.uwaterloo.ca&#x2F;~kdaudjee&#x2F;Daudjee_Sketches.pdf&quot;&gt;this paper&lt;&#x2F;a&gt; for a good overview of modern sketch algorithms. The estimation errors and operational problems I described are symptoms of using this old algorithm.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Instead, I prefer the combination of simple counters and gauges paired with distributed tracing. Trace-derived global metrics give you actual data distributions without guessing bucket boundaries, eliminate the aggregation problem by preserving request context, and adapt automatically as your system evolves. You get better insights with less operational overhead, which seems like a better deal to me.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;quick-reference-metrics-by-component&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#quick-reference-metrics-by-component&quot; aria-label=&quot;Anchor link for: quick-reference-metrics-by-component&quot;&gt;🔗&lt;&#x2F;a&gt;Quick Reference: Metrics by Component&lt;&#x2F;h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Component&lt;&#x2F;th&gt;&lt;th&gt;Essential Metrics&lt;&#x2F;th&gt;&lt;th&gt;Purpose&lt;&#x2F;th&gt;&lt;th&gt;Key Labels&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;API Endpoints&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;api.requests.total&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;api.response_time.ms&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;api.errors.count&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;auth.failures.count&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Track every request lifecycle&lt;br&gt;Monitor user-facing performance&lt;br&gt;Catch errors before users complain&lt;br&gt;Security monitoring&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;method&lt;&#x2F;code&gt;, &lt;code&gt;endpoint&lt;&#x2F;code&gt;, &lt;code&gt;status_code&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;endpoint&lt;&#x2F;code&gt;, &lt;code&gt;user_type&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;error_type&lt;&#x2F;code&gt;, &lt;code&gt;endpoint&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;failure_reason&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Database&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;db.connections.active&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;db.queries.executed&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;db.query.duration.ms&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;db.errors.count&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;db.slow_queries.count&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Prevent connection exhaustion&lt;br&gt;Track database usage patterns&lt;br&gt;Identify performance bottlenecks&lt;br&gt;Monitor database health&lt;br&gt;Catch expensive queries&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;pool_name&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;operation_type&lt;&#x2F;code&gt;, &lt;code&gt;table&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;operation_type&lt;&#x2F;code&gt;, &lt;code&gt;table&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;error_type&lt;&#x2F;code&gt;, &lt;code&gt;operation&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;table&lt;&#x2F;code&gt;, &lt;code&gt;query_type&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Message Queues&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;queue.messages.produced&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;queue.messages.consumed&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;queue.processing.time.ms&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;queue.processing.errors&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;jobs.queue.depth&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Track producer health&lt;br&gt;Monitor consumer throughput&lt;br&gt;Identify processing bottlenecks&lt;br&gt;Catch processing failures&lt;br&gt;Detect backlog buildup&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;topic&lt;&#x2F;code&gt;, &lt;code&gt;producer_id&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;topic&lt;&#x2F;code&gt;, &lt;code&gt;consumer_group&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;topic&lt;&#x2F;code&gt;, &lt;code&gt;message_type&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;error_type&lt;&#x2F;code&gt;, &lt;code&gt;topic&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;queue_name&lt;&#x2F;code&gt;, &lt;code&gt;priority&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Cache&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;cache.requests.total&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;cache.hits&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;cache.misses&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;cache.size.entries&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;cache.evictions&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Monitor cache usage&lt;br&gt;Track cache effectiveness&lt;br&gt;Identify cache problems&lt;br&gt;Monitor memory usage&lt;br&gt;Understand eviction patterns&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;cache_type&lt;&#x2F;code&gt;, &lt;code&gt;operation&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;cache_type&lt;&#x2F;code&gt;, &lt;code&gt;key_prefix&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;cache_type&lt;&#x2F;code&gt;, &lt;code&gt;miss_reason&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;cache_type&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;cache_type&lt;&#x2F;code&gt;, &lt;code&gt;eviction_reason&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Locks&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;locks.acquire.duration.ms&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;locks.held.duration.ms&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;locks.contention.count&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Detect lock contention&lt;br&gt;Find locks held too long&lt;br&gt;Monitor thread blocking&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;lock_name&lt;&#x2F;code&gt;, &lt;code&gt;thread_type&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;lock_name&lt;&#x2F;code&gt;, &lt;code&gt;operation&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;lock_name&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Business&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;orders.placed&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;users.login&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;payments.processed&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;feature.usage.count&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;workflow.state_changes&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Business KPI tracking&lt;br&gt;User activity monitoring&lt;br&gt;Revenue stream health&lt;br&gt;Feature adoption metrics&lt;br&gt;Process flow monitoring&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;user_type&lt;&#x2F;code&gt;, &lt;code&gt;order_value_range&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;user_type&lt;&#x2F;code&gt;, &lt;code&gt;login_method&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;payment_method&lt;&#x2F;code&gt;, &lt;code&gt;amount_range&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;feature_name&lt;&#x2F;code&gt;, &lt;code&gt;user_segment&lt;&#x2F;code&gt;&lt;br&gt;&lt;code&gt;workflow_name&lt;&#x2F;code&gt;, &lt;code&gt;from_state&lt;&#x2F;code&gt;, &lt;code&gt;to_state&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;🔗&lt;&#x2F;a&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Effective metrics instrumentation isn&#x27;t about collecting everything; it&#x27;s about collecting the right things in the right places. Start with the five essential metric types, instrument your critical components, and build from there.&lt;&#x2F;p&gt;
&lt;p&gt;Think about metrics as part of your application design, not an afterthought. When writing code, ask yourself: &quot;How will I know if this is working correctly in production?&quot; The answer guides your instrumentation decisions.&lt;&#x2F;p&gt;
&lt;p&gt;The best observability system helps you sleep better at night. If your metrics aren&#x27;t giving you confidence in your system&#x27;s health, you&#x27;re measuring the wrong things.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with application metrics. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">observability</category>
          <category domain="tag">metrics</category>
          <category domain="tag">monitoring</category>
          <category domain="tag">distributed-systems</category>
      </item>
      <item>
          <title>Testing: prevention vs discovery</title>
          <pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/testing-prevention-vs-discovery/</link>
          <guid>https://pierrezemb.fr/posts/testing-prevention-vs-discovery/</guid>
          <description xml:base="https://pierrezemb.fr/posts/testing-prevention-vs-discovery/">&lt;p&gt;While working on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;moonpool&quot;&gt;moonpool&lt;&#x2F;a&gt;, my hobby project for studying and backporting FoundationDB&#x27;s low-level engineering concepts (actor model, deterministic simulation, network fault injection), Claude Code did something remarkable: on its own, it found a bug I didn&#x27;t know existed. Not through traditional testing, but through active exploration of edge cases I hadn&#x27;t considered.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;testing-prevention-vs-discovery&#x2F;claude-moonpool.png&quot; alt=&quot;Claude Code autonomously debugging moonpool&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;Claude identified a faulty seed triggering an edge case, debugged it locally using deterministic replay, and added it to the test suite. All by itself. 🤯 &lt;strong&gt;This wasn&#x27;t prevention but discovery.&lt;&#x2F;strong&gt; It&#x27;s time to shift our testing paradigm from preventing regressions to actively discovering unknown bugs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;building-for-discovery&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#building-for-discovery&quot; aria-label=&quot;Anchor link for: building-for-discovery&quot;&gt;🔗&lt;&#x2F;a&gt;Building for Discovery&lt;&#x2F;h2&gt;
&lt;p&gt;The difference between prevention and discovery isn&#x27;t just philosophical; it requires a completely different system design. Moonpool was built from day one around three principles that enable active bug discovery:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Deterministic simulation&lt;&#x2F;strong&gt;: Every execution is completely reproducible. Given the same seed, the system makes identical decisions every time. This changes debugging from &quot;I can&#x27;t reproduce this&quot; to &quot;let me replay exactly what happened.&quot; More importantly, it lets LLMs explore the state space step by step without getting lost in non-deterministic noise.&lt;&#x2F;p&gt;
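&lt;p&gt;A minimal sketch of the idea, not moonpool&#x27;s actual implementation: if every decision is drawn from a generator seeded once, replaying a seed replays the run. The &lt;code&gt;NetworkEvent&lt;&#x2F;code&gt; type, the &lt;code&gt;simulate&lt;&#x2F;code&gt; function, and the event probabilities are assumptions made for illustration:&lt;&#x2F;p&gt;

```rust
/// Advance a 64-bit linear congruential generator (Knuth's MMIX constants).
/// Any seedable PRNG works; what matters is that it is the only source of
/// randomness in the simulation.
fn next_state(state: u64) -> u64 {
    state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NetworkEvent {
    Deliver,
    Delay,
    Disconnect,
}

/// Run a toy simulation of eight network decisions, all derived from the seed.
fn simulate(seed: u64) -> [NetworkEvent; 8] {
    let mut state = seed;
    let mut trace = [NetworkEvent::Deliver; 8];
    for slot in trace.iter_mut() {
        state = next_state(state);
        // Use the high bits: the low bits of an LCG are weakly random.
        *slot = match (state >> 33) % 10 {
            0 => NetworkEvent::Disconnect,
            1..=3 => NetworkEvent::Delay,
            _ => NetworkEvent::Deliver,
        };
    }
    trace
}
```

&lt;p&gt;Calling &lt;code&gt;simulate(42)&lt;&#x2F;code&gt; twice yields the same trace, which is what makes a failing seed worth saving in a test suite.&lt;&#x2F;p&gt;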
&lt;p&gt;&lt;strong&gt;Controlled failure injection&lt;&#x2F;strong&gt;: Built-in mechanisms intentionally introduce failures in controlled, reproducible ways. This includes timed failures like network delays and disconnects, plus &lt;a href=&quot;https:&#x2F;&#x2F;transactional.blog&#x2F;simulation&#x2F;buggify&quot;&gt;&quot;buggify&quot; mechanisms&lt;&#x2F;a&gt; that inject faulty internal state at strategic points in the code. Each buggify point is either enabled or disabled for an entire simulation run, creating consistent failure scenarios instead of random chaos. Instead of waiting for production to reveal edge cases, we force the system to encounter dangerous, bug-finding behaviors during development.&lt;&#x2F;p&gt;
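&lt;p&gt;One way to get the &quot;enabled or disabled for the whole run&quot; behaviour is to make activation a pure function of the run seed and a site identifier. The sketch below assumes numeric site identifiers and a one-in-four activation rate; both are hypothetical, and it is not how moonpool or FoundationDB implement buggify:&lt;&#x2F;p&gt;

```rust
/// SplitMix64 finalizer, used here only as a cheap deterministic hash.
fn mix(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Decide whether a fault-injection site is active for this run.
/// The answer depends only on (run_seed, site_id), so it cannot flip
/// mid-run, and replaying a seed enables the same sites.
fn site_enabled(run_seed: u64, site_id: u64) -> bool {
    // Hypothetical policy: roughly one site in four is active per run.
    let site_hash = mix(site_id.wrapping_add(0x9E37_79B9_7F4A_7C15));
    mix(run_seed ^ site_hash) % 4 == 0
}

/// Example call site: hand back a short read when the site is active.
fn read_len(run_seed: u64, requested: usize) -> usize {
    const SHORT_READ_SITE: u64 = 1; // hypothetical site identifier
    if site_enabled(run_seed, SHORT_READ_SITE) {
        requested / 2
    } else {
        requested
    }
}
```

&lt;p&gt;Because nothing is stored, a site that is active for a given seed stays active for every call in that run, and a different seed produces a different set of active sites.&lt;&#x2F;p&gt;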
&lt;p&gt;&lt;strong&gt;Observability through sometimes assertions&lt;&#x2F;strong&gt;: Borrowed from &lt;a href=&quot;https:&#x2F;&#x2F;antithesis.com&#x2F;docs&#x2F;best_practices&#x2F;sometimes_assertions&#x2F;&quot;&gt;Antithesis&lt;&#x2F;a&gt;, these verify we&#x27;re actually discovering the edge cases we think we&#x27;re testing. Here&#x27;s what they look like:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Verify that server binds sometimes fail during chaos testing
&lt;&#x2F;span&gt;&lt;span&gt;sometimes_assert!(
&lt;&#x2F;span&gt;&lt;span&gt;    server_bind_fails,
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.bind_result.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;is_err&lt;&#x2F;span&gt;&lt;span&gt;(),
&lt;&#x2F;span&gt;&lt;span&gt;    &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Server bind should sometimes fail during chaos testing&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Ensure message queues sometimes approach capacity under load
&lt;&#x2F;span&gt;&lt;span&gt;sometimes_assert!(
&lt;&#x2F;span&gt;&lt;span&gt;    peer_queue_near_capacity,
&lt;&#x2F;span&gt;&lt;span&gt;    state.send_queue.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;() &amp;gt;= (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.config.max_queue_size as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;f64 &lt;&#x2F;span&gt;&lt;span&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.8&lt;&#x2F;span&gt;&lt;span&gt;) as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;usize&lt;&#x2F;span&gt;&lt;span&gt;,
&lt;&#x2F;span&gt;&lt;span&gt;    &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Message queue should sometimes approach capacity limit&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Traditional code coverage only tells you &quot;this line was reached.&quot; Sometimes assertions verify &quot;this interesting scenario actually happened.&quot; If a sometimes assertion never triggers across thousands of test runs, you know you&#x27;re not discovering the edge cases that matter.&lt;&#x2F;p&gt;
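&lt;p&gt;To show the mechanics, here is a stripped-down sketch of how such a check could be tracked: one hit counter per site, incremented when the condition holds, and inspected after the whole campaign. The &lt;code&gt;sometimes!&lt;&#x2F;code&gt; macro, the counter names, and the toy workload are assumptions for illustration; they are not moonpool&#x27;s &lt;code&gt;sometimes_assert!&lt;&#x2F;code&gt; or the Antithesis SDK:&lt;&#x2F;p&gt;

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One hit counter per assertion site (hypothetical names).
static SERVER_BIND_FAILS: AtomicU64 = AtomicU64::new(0);
static QUEUE_NEAR_CAPACITY: AtomicU64 = AtomicU64::new(0);

/// Record that an interesting condition was observed.
/// Unlike `assert!`, a false condition is not a failure on any single run.
macro_rules! sometimes {
    ($counter:ident, $condition:expr) => {
        if $condition {
            $counter.fetch_add(1, Ordering::Relaxed);
        }
    };
}

/// Toy campaign: the bind fails on some runs, but the queue never fills.
fn run_campaign(runs: u64) {
    for run in 0..runs {
        let bind_failed = run % 7 == 3;
        let queue_len = run % 50; // capacity is 100 in this toy setup
        sometimes!(SERVER_BIND_FAILS, bind_failed);
        sometimes!(QUEUE_NEAR_CAPACITY, queue_len >= 80);
    }
}

/// After the campaign, report which sites were reached at least once.
fn sites_reached() -> (bool, bool) {
    (
        SERVER_BIND_FAILS.load(Ordering::Relaxed) > 0,
        QUEUE_NEAR_CAPACITY.load(Ordering::Relaxed) > 0,
    )
}
```

&lt;p&gt;After &lt;code&gt;run_campaign(1_000)&lt;&#x2F;code&gt;, the bind-failure site has been reached but the near-capacity site has not. That second result is the useful one: it says the workload never pushes the queue hard enough, so any bug hiding there is still undiscovered.&lt;&#x2F;p&gt;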
&lt;p&gt;These three elements shift testing from prevention to discovery. Instead of developers writing tests for scenarios they already know about, the system forces them to hit failure modes they haven&#x27;t thought of. For Claude, this meant it could explore the state space step by step, understanding not just what the code does, but what breaks it.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-chaos-environment&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-chaos-environment&quot; aria-label=&quot;Anchor link for: the-chaos-environment&quot;&gt;🔗&lt;&#x2F;a&gt;The Chaos Environment&lt;&#x2F;h2&gt;
&lt;p&gt;Moonpool is currently limited to simulating TCP connections through its Peer abstraction, but even this narrow scope creates a surprisingly rich failure environment. Here&#x27;s what the chaos testing configuration looks like (borrowed from TigerBeetle&#x27;s approach):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;impl &lt;&#x2F;span&gt;&lt;span&gt;NetworkRandomizationRanges {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F;&#x2F; Create chaos testing ranges with connection cutting enabled for distributed systems testing
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;pub fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;chaos_testing&lt;&#x2F;span&gt;&lt;span&gt;() -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;Self &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;Self &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            bind_base_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;200&lt;&#x2F;span&gt;&lt;span&gt;,                       &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 10-200µs
&lt;&#x2F;span&gt;&lt;span&gt;            bind_jitter_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span&gt;,                     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 10-100µs
&lt;&#x2F;span&gt;&lt;span&gt;            accept_base_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1000&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;10000&lt;&#x2F;span&gt;&lt;span&gt;,                 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 1-10ms in µs
&lt;&#x2F;span&gt;&lt;span&gt;            accept_jitter_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1000&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;15000&lt;&#x2F;span&gt;&lt;span&gt;,               &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 1-15ms in µs
&lt;&#x2F;span&gt;&lt;span&gt;            connect_base_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1000&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;50000&lt;&#x2F;span&gt;&lt;span&gt;,                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 1-50ms in µs
&lt;&#x2F;span&gt;&lt;span&gt;            connect_jitter_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;5000&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;100000&lt;&#x2F;span&gt;&lt;span&gt;,             &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 5-100ms in µs
&lt;&#x2F;span&gt;&lt;span&gt;            read_base_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span&gt;,                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 5-100µs
&lt;&#x2F;span&gt;&lt;span&gt;            read_jitter_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;200&lt;&#x2F;span&gt;&lt;span&gt;,                     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 10-200µs
&lt;&#x2F;span&gt;&lt;span&gt;            write_base_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1000&lt;&#x2F;span&gt;&lt;span&gt;,                     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 50-1000µs
&lt;&#x2F;span&gt;&lt;span&gt;            write_jitter_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;2000&lt;&#x2F;span&gt;&lt;span&gt;,                  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 100-2000µs
&lt;&#x2F;span&gt;&lt;span&gt;            clogging_probability_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.1&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.3&lt;&#x2F;span&gt;&lt;span&gt;,           &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 10-30% chance of temporary network congestion
&lt;&#x2F;span&gt;&lt;span&gt;            clogging_base_duration_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;50000&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;300000&lt;&#x2F;span&gt;&lt;span&gt;,    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 50-300ms congestion duration in µs
&lt;&#x2F;span&gt;&lt;span&gt;            clogging_jitter_duration_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;100000&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;400000&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 100-400ms additional congestion variance in µs
&lt;&#x2F;span&gt;&lt;span&gt;            cutting_probability_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.10&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.20&lt;&#x2F;span&gt;&lt;span&gt;,          &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 10-20% cutting chance per tick
&lt;&#x2F;span&gt;&lt;span&gt;            cutting_reconnect_base_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;200000&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;800000&lt;&#x2F;span&gt;&lt;span&gt;,   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 200-800ms in µs
&lt;&#x2F;span&gt;&lt;span&gt;            cutting_reconnect_jitter_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;100000&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;500000&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 100-500ms in µs
&lt;&#x2F;span&gt;&lt;span&gt;            cutting_max_cuts_range: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;,                   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 1-2 cuts per connection max (exclusive upper bound)
&lt;&#x2F;span&gt;&lt;span&gt;        }
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Even with just TCP simulation, this creates a hostile environment where connections randomly fail, messages get delayed, and network operations experience unpredictable latencies. Each seed represents a different combination of timing and probability, creating unique failure scenarios.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-even-simple-network-code-needs-chaos&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#why-even-simple-network-code-needs-chaos&quot; aria-label=&quot;Anchor link for: why-even-simple-network-code-needs-chaos&quot;&gt;🔗&lt;&#x2F;a&gt;Why Even Simple Network Code Needs Chaos&lt;&#x2F;h3&gt;
&lt;p&gt;You might think testing a simple peer implementation with fault injection is overkill, but production experience and research show otherwise. &lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;system&#x2F;files&#x2F;osdi18-alquraan.pdf&quot;&gt;&quot;An Analysis of Network-Partitioning Failures in Cloud Systems&quot;&lt;&#x2F;a&gt; (OSDI &#x27;18) studied real-world failures and found:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;80%&lt;&#x2F;strong&gt; of network partition failures have catastrophic impact&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;27%&lt;&#x2F;strong&gt; lead to data loss (the most common consequence)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;90%&lt;&#x2F;strong&gt; of these failures are silent&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;21%&lt;&#x2F;strong&gt; cause permanent damage that persists even after the partition heals&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;83%&lt;&#x2F;strong&gt; need three additional events to manifest&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;That last point is crucial: it is exactly the kind of complex interaction that deterministic simulation with fault injection helps uncover.&lt;&#x2F;p&gt;
&lt;p&gt;My peer implementation only does simple ping-pong communication, yet it still took some work to make it robust enough to pass all the checks and assertions. It&#x27;s enough complexity for Claude to discover edge cases in connection handling, retry logic, and recovery mechanisms.&lt;&#x2F;p&gt;
&lt;p&gt;The breakthrough wasn&#x27;t that Claude wrote perfect code but that &lt;strong&gt;Claude could discover and explore failure scenarios I hadn&#x27;t thought to test, then use deterministic replay to debug and fix what went wrong.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-paradigm-shift&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-paradigm-shift&quot; aria-label=&quot;Anchor link for: the-paradigm-shift&quot;&gt;🔗&lt;&#x2F;a&gt;The Paradigm Shift&lt;&#x2F;h2&gt;
&lt;p&gt;The difference between prevention and discovery completely changes how we think about software quality. &lt;strong&gt;Prevention testing asks &quot;did we break what used to work?&quot; Discovery testing asks &quot;what else is broken that we haven&#x27;t found yet?&quot;&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This shift creates a powerful feedback loop for young engineers and LLMs alike. Both developers and LLMs learn what production failure really looks like, not the sanitized version we imagine. When Claude can explore failure scenarios step by step and immediately see the results through sometimes assertions, it becomes a discovery partner that finds edge cases human intuition misses.&lt;&#x2F;p&gt;
&lt;p&gt;This isn&#x27;t theoretical. It&#x27;s working in my hobby project today. Moonpool is definitely hobby-grade, but if a side project can enable LLM-assisted bug discovery, imagine what&#x27;s possible with production systems designed from the ground up for deterministic testing.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&quot;&gt;FoundationDB&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;tigerbeetle.com&#x2F;&quot;&gt;TigerBeetle&lt;&#x2F;a&gt;, and &lt;a href=&quot;https:&#x2F;&#x2F;antithesis.com&#x2F;&quot;&gt;Antithesis&lt;&#x2F;a&gt; communities have been practicing discovery-oriented testing for years. FoundationDB&#x27;s legendary reliability comes from exactly this approach: deterministic simulation that actively hunts for bugs rather than just preventing regressions. After operating FoundationDB in production for three years, I can confirm it&#x27;s by far the most robust and predictable distributed system I&#x27;ve encountered. Everything behaves exactly as documented, with none of the usual distributed systems surprises. I&#x27;ve written more about these ideas in my posts on &lt;a href=&quot;&#x2F;posts&#x2F;simulation-driven-development&#x2F;&quot;&gt;simulation-driven development&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;posts&#x2F;notes-about-foundationdb&#x2F;&quot;&gt;notes about FoundationDB&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What&#x27;s new is that LLMs can now participate in this process.&lt;&#x2F;strong&gt; Through deterministic simulation and sometimes assertions, we&#x27;re not just telling the LLM &quot;write good code&quot; but showing it exactly what production failure looks like. If you&#x27;re curious about production-grade implementations of these ideas, check out &lt;a href=&quot;https:&#x2F;&#x2F;antithesis.com&#x2F;&quot;&gt;Antithesis&lt;&#x2F;a&gt;: its best hidden feature is that it works on any existing system without requiring a rewrite.&lt;&#x2F;p&gt;
&lt;p&gt;The tools exist. The techniques are proven. &lt;strong&gt;Testing must evolve from prevention to discovery.&lt;&#x2F;strong&gt; The future isn&#x27;t about writing better test cases but about building systems that actively reveal their own bugs.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with deterministic testing. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">testing</category>
          <category domain="tag">simulation</category>
          <category domain="tag">deterministic</category>
          <category domain="tag">llm</category>
          <category domain="tag">foundationdb</category>
      </item>
      <item>
          <title>Shipped vs. Operated, or How Many Bash Scripts Does It Take?</title>
          <pubDate>Mon, 18 Aug 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/shipped-vs-operated/</link>
          <guid>https://pierrezemb.fr/posts/shipped-vs-operated/</guid>
          <description xml:base="https://pierrezemb.fr/posts/shipped-vs-operated/">&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Summary:&lt;&#x2F;strong&gt; The difference between shipped and operated software is the difference between something you can run and forget, and something that demands ongoing, hands-on care. Choosing the former protects your team’s focus and sanity.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;the-shipped-vs-operated-spectrum&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-shipped-vs-operated-spectrum&quot; aria-label=&quot;Anchor link for: the-shipped-vs-operated-spectrum&quot;&gt;🔗&lt;&#x2F;a&gt;The Shipped vs. Operated Spectrum&lt;&#x2F;h2&gt;
&lt;p&gt;Some technologies arrive as complete systems: you deploy them, give them minimal care, and they quietly do their job. Others arrive like complex machines: powerful, but demanding regular attention and maintenance. That’s the difference between &lt;em&gt;shipped&lt;&#x2F;em&gt; and &lt;em&gt;operated&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The distinction isn’t just about features; it’s about the level of operational effort the system will demand over its lifetime. &lt;strong&gt;Operated&lt;&#x2F;strong&gt; technologies require continuous human care to stay healthy. They age, drift, and accumulate operational quirks. They often have sharp edges you only discover at 2 a.m., and when something goes wrong, you need people who already know the failure modes by heart. Think of a self-managed &lt;strong&gt;HBase&lt;&#x2F;strong&gt; or a ZooKeeper ensemble that you &lt;em&gt;really&lt;&#x2F;em&gt; hope never splits brain.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Shipped&lt;&#x2F;strong&gt; technologies are built to reduce that constant overhead. They can still fail, but they tend to fail in ways that are predictable, recoverable, and not existential. You can learn them as you go. Your outages will be frustrating, but they won’t demand a dedicated handler on payroll. &lt;strong&gt;FoundationDB&lt;&#x2F;strong&gt; is a good example: it’s not magic, but its operational surface area is small enough to fit in a single human brain.&lt;&#x2F;p&gt;
&lt;p&gt;For contrast, I’ve also spent years with the other kind: &lt;strong&gt;HBase&lt;&#x2F;strong&gt; clusters spread over 250+ nodes, &lt;strong&gt;Ceph&lt;&#x2F;strong&gt;, &lt;strong&gt;Kafka&lt;&#x2F;strong&gt; and &lt;strong&gt;ZooKeeper&lt;&#x2F;strong&gt; in various configurations, &lt;strong&gt;Pulsar&lt;&#x2F;strong&gt;, &lt;strong&gt;Warp10&lt;&#x2F;strong&gt;, &lt;strong&gt;etcd&lt;&#x2F;strong&gt;, &lt;strong&gt;Kubernetes&lt;&#x2F;strong&gt;, &lt;strong&gt;Flink&lt;&#x2F;strong&gt;, and &lt;strong&gt;RabbitMQ&lt;&#x2F;strong&gt;, each with its own set of operational “adventures.”&lt;&#x2F;p&gt;
&lt;h2 id=&quot;identifying-operated-systems&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#identifying-operated-systems&quot; aria-label=&quot;Anchor link for: identifying-operated-systems&quot;&gt;🔗&lt;&#x2F;a&gt;Identifying Operated Systems&lt;&#x2F;h2&gt;
&lt;p&gt;Some systems live in both worlds depending on how you use them. &lt;strong&gt;PostgreSQL&lt;&#x2F;strong&gt; in standalone mode is usually shipped: it’s simple to run, predictable, and rarely causes surprises. But under certain conditions, like fighting vacuum performance at scale or running it in HA mode under sustained heavy load, it shifts into operated territory. The difference isn’t in the codebase, but in the demands your use case puts on it.&lt;&#x2F;p&gt;
&lt;p&gt;A quick way to tell which camp your system belongs to is the &lt;strong&gt;Bash Script Test&lt;&#x2F;strong&gt;: ask how many bash scripts or home-grown tools are required to survive an on-call shift. If the answer includes a collection of automation to clean up data, shuffle it between nodes, or probe the cluster’s health, you’re probably in operated territory. I’ve been there: running &lt;code&gt;hbck&lt;&#x2F;code&gt; and manually moving regions in &lt;strong&gt;HBase&lt;&#x2F;strong&gt;, shuffling partitions around in &lt;strong&gt;Kafka&lt;&#x2F;strong&gt; to balance load, or triggering repairs in &lt;strong&gt;Ceph&lt;&#x2F;strong&gt; after scrub errors. Many distributed systems quietly rely on these manual interventions, often run weekly, to stay healthy, and that’s an operational cost you can’t ignore.&lt;&#x2F;p&gt;
&lt;p&gt;By contrast, we have &lt;strong&gt;no&lt;&#x2F;strong&gt; such scripts for &lt;strong&gt;FoundationDB&lt;&#x2F;strong&gt;, and that’s exactly why it feels shipped.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-strategic-cost-of-operations&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-strategic-cost-of-operations&quot; aria-label=&quot;Anchor link for: the-strategic-cost-of-operations&quot;&gt;🔗&lt;&#x2F;a&gt;The Strategic Cost of Operations&lt;&#x2F;h2&gt;
&lt;p&gt;Each operated system consumes a slice of your team’s focus. Add too many, and you’ll spend more time keeping the lights on than moving forward. The more you can choose robust, low-maintenance software, the more space you keep for actually building new things.&lt;&#x2F;p&gt;
&lt;p&gt;I’m not a fan of Kubernetes from an operational perspective. But it does something important for end users: it gives them a standard way to write software that reacts to the state of the infrastructure through &lt;a href=&quot;https:&#x2F;&#x2F;kubernetes.io&#x2F;docs&#x2F;concepts&#x2F;extend-kubernetes&#x2F;operator&#x2F;&quot;&gt;Operators&lt;&#x2F;a&gt;. Operators turn that into continuous automation, with a reconciliation loop that keeps drifting systems aligned with the desired state. It’s a way to bake SRE knowledge into code, so even complex systems can be run and handed over without months of hand-holding.&lt;&#x2F;p&gt;
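&lt;p&gt;As a rough illustration of that reconciliation loop, here is a minimal Python sketch. It does not use the Kubernetes API: the state is a hypothetical dictionary mapping a resource name to a replica count, and &lt;code&gt;reconcile&lt;&#x2F;code&gt;, &lt;code&gt;apply_actions&lt;&#x2F;code&gt;, and &lt;code&gt;control_loop&lt;&#x2F;code&gt; are names invented for this example.&lt;&#x2F;p&gt;

```python
def reconcile(desired, observed):
    """Return the actions needed to move `observed` toward `desired`.

    Both arguments are hypothetical dicts mapping a resource name to a replica count.
    """
    actions = []
    for name in sorted(desired):
        if name not in observed:
            actions.append(("create", name, desired[name]))
        elif observed[name] != desired[name]:
            actions.append(("scale", name, desired[name]))
    for name in sorted(observed):
        if name not in desired:
            actions.append(("delete", name, None))
    return actions


def apply_actions(observed, actions):
    """Apply actions to a copy of the observed state (stands in for real API calls)."""
    state = dict(observed)
    for verb, name, replicas in actions:
        if verb == "delete":
            del state[name]
        else:
            state[name] = replicas
    return state


def control_loop(desired, observed, max_rounds=10):
    """Reconcile repeatedly until no action is left or max_rounds is reached."""
    for _ in range(max_rounds):
        actions = reconcile(desired, observed)
        if not actions:
            break
        observed = apply_actions(observed, actions)
    return observed
```

&lt;p&gt;In this sketch, &lt;code&gt;reconcile&lt;&#x2F;code&gt; only compares desired and observed state, so it can be rerun after a partial failure and will simply compute whatever work is still left.&lt;&#x2F;p&gt;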
&lt;p&gt;The stakes are only going to get higher as LLMs become a common tool for software engineers. We’ll inevitably build more advanced and complex systems, but that complexity doesn’t disappear; it gets pushed to the people on call. LLMs are good at fixing failures that are reproducible and deterministic, because they can alter the system freely, but most on-call incidents aren’t like that. The only way to keep operational load sustainable is to change how we design and test: building for robustness from the start, and using techniques like &lt;a href=&quot;&#x2F;posts&#x2F;simulation-driven-development&#x2F;&quot;&gt;simulation-driven development&lt;&#x2F;a&gt; to expose failure modes before they reach production.&lt;&#x2F;p&gt;
&lt;p&gt;If you can, choose the system you can deploy and leave alone, not the complex machine that demands your weekends.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with shipped&#x2F;operated software. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">distributed-systems</category>
          <category domain="tag">operation</category>
      </item>
      <item>
          <title>Two Podcast Episodes on Topics Developers Rarely Talk About</title>
          <pubDate>Mon, 11 Aug 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/debugging-and-correctness-podcasts/</link>
          <guid>https://pierrezemb.fr/posts/debugging-and-correctness-podcasts/</guid>
          <description xml:base="https://pierrezemb.fr/posts/debugging-and-correctness-podcasts/">&lt;p&gt;I was listening to a couple of podcasts the other day and stumbled across two episodes that were so compelling I had to stop my chores and listen. They dive into topics that most developers barely think about, not because they’re unimportant, but because they live in the hard corners of engineering:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;catastrophic data corruption,&lt;&#x2F;li&gt;
&lt;li&gt;correctness work done before a single line is shipped.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The first is &lt;a href=&quot;https:&#x2F;&#x2F;oxide-and-friends.transistor.fm&#x2F;episodes&#x2F;adventures-in-data-corruption&quot;&gt;Adventures in Data Corruption&lt;&#x2F;a&gt; from &lt;em&gt;Oxide and Friends&lt;&#x2F;em&gt;. Two years ago, the Oxide team ran into data corruption during what should have been a routine network transfer. The debugging journey that followed went from packet traces to CPU speculation quirks, peeling back the stack layer by layer, hardware, kernel, network, application, asking hard questions at each step. What I love here is the combination of clear storytelling and the rapid-fire hypotheses: they make an assumption, test it, discard it, and immediately move to the next, pulling you along in the investigation until the root cause finally clicks into place.&lt;&#x2F;p&gt;
&lt;p&gt;The second is &lt;a href=&quot;https:&#x2F;&#x2F;x.com&#x2F;AntithesisHQ&#x2F;status&#x2F;1953097721205710918&quot;&gt;Scaling Correctness: Marc Brooker on a Decade of Formal Methods at AWS&lt;&#x2F;a&gt; from &lt;em&gt;The BugBash Podcast&lt;&#x2F;em&gt; by Antithesis. Marc Brooker, who has spent nearly 17 years building core AWS services like S3 and Lambda, shares the company’s decade-long journey with formal methods, from heavyweight tools like TLA+ to the &lt;em&gt;lightweight&lt;&#x2F;em&gt; approaches that any team can adopt, like &lt;a href=&quot;&#x2F;tags&#x2F;simulation&quot;&gt;simulation-based testing&lt;&#x2F;a&gt;. At AWS, they’ve learned that investing in correctness up front not only improves reliability but actually speeds up delivery. They also touch on deterministic simulation testing, the challenge of verifying UIs and control planes, and the role AI might play in the future of verification.&lt;&#x2F;p&gt;
&lt;p&gt;I’ve been paged way too many times for metastable failures, data corruption, network meltdowns, or NTP drift in production. These days, I’d rather tackle the correctness part &lt;em&gt;before&lt;&#x2F;em&gt; those alarms go off. Every new layer I build is designed to be simulated to explore failure modes in a controlled environment before they can hurt real users.&lt;&#x2F;p&gt;
&lt;p&gt;But when things fall apart anyway, and spoiler alert, &lt;strong&gt;they will&lt;&#x2F;strong&gt;, developers have the opportunity to truly understand their software. Being responsible for the systems you build means you’re the one getting paged, and it’s in those moments of crisis that the sharpest debugging skills are forged.&lt;&#x2F;p&gt;
&lt;p&gt;So don’t just bookmark them. Put them at the top of your queue. Listen. And maybe, the next time your system misbehaves, you’ll be ready.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with debugging and correctness. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">distributed-systems</category>
          <category domain="tag">debugging</category>
          <category domain="tag">correctness</category>
          <category domain="tag">podcasts</category>
          <category domain="tag">simulation</category>
      </item>
      <item>
          <title>Three Years of Nix and NixOS: The Good, the Bad, and the Ugly</title>
          <pubDate>Wed, 02 Jul 2025 00:37:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/nixos-good-bad-ugly/</link>
          <guid>https://pierrezemb.fr/posts/nixos-good-bad-ugly/</guid>
          <description xml:base="https://pierrezemb.fr/posts/nixos-good-bad-ugly/">&lt;p&gt;For years, I was a serial distro-hopper, working my way through Ubuntu, Arch, Gentoo, Exherbo, Void Linux, Fedora, Pop!_OS, and Manjaro. Every few months, a new Linux distribution would catch my eye, and I’d spend a weekend migrating my setup, hoping to find the perfect fit. That cycle broke three years ago when I switched to NixOS. It has since become the foundation for all my Linux machines, not because it’s perfect, but because it fundamentally changes the contract between the user and the operating system.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s important to distinguish between &lt;strong&gt;Nix&lt;&#x2F;strong&gt;, the powerful package manager that can run on any Linux distro (and even macOS), and &lt;strong&gt;NixOS&lt;&#x2F;strong&gt;, the full immutable operating system built around it. This post is a review of my three years with both—the good, the bad, and the ugly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-good&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-good&quot; aria-label=&quot;Anchor link for: the-good&quot;&gt;🔗&lt;&#x2F;a&gt;The Good&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;declarative-and-atomic-system-management-on-nixos&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#declarative-and-atomic-system-management-on-nixos&quot; aria-label=&quot;Anchor link for: declarative-and-atomic-system-management-on-nixos&quot;&gt;🔗&lt;&#x2F;a&gt;Declarative and Atomic System Management on NixOS&lt;&#x2F;h3&gt;
&lt;p&gt;The core promise of NixOS is that your entire system is configured from a set of files, which you can store in a Git repository. Every change is a commit, giving you a complete, auditable history of your system&#x27;s state. This makes setting up a new machine trivial: I clone my repository, run one command, and my entire setup is replicated perfectly. No more manually copying dotfiles or running install scripts.&lt;&#x2F;p&gt;
&lt;p&gt;This declarative approach also makes the system incredibly robust. I once broke a laptop running Exherbo right before an on-call shift, and it was a nightmare to fix. With NixOS, that fear is gone. Every &lt;code&gt;nixos-rebuild switch&lt;&#x2F;code&gt; creates a new &quot;generation&quot; of the system. If an update breaks something, you simply reboot and select the previous generation from the boot menu. This atomic update mechanism makes you fearless about making and testing changes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;system-crafting-as-a-first-class-citizen&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#system-crafting-as-a-first-class-citizen&quot; aria-label=&quot;Anchor link for: system-crafting-as-a-first-class-citizen&quot;&gt;🔗&lt;&#x2F;a&gt;System Crafting as a First-Class Citizen&lt;&#x2F;h3&gt;
&lt;p&gt;On NixOS, customizing your system is not an afterthought—it&#x27;s a core feature. While the Nix package manager gives you fine-grained control over packages, NixOS uses this power to make deep system modifications simple. For example, building a custom ISO with your SSH keys pre-installed is just a few lines of configuration. This philosophy extends to packages: you can use pre-built binaries for most things, but easily build a package from source with your own patches when you need to.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;sandboxed-development-environments&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#sandboxed-development-environments&quot; aria-label=&quot;Anchor link for: sandboxed-development-environments&quot;&gt;🔗&lt;&#x2F;a&gt;Sandboxed Development Environments&lt;&#x2F;h3&gt;
&lt;p&gt;A powerful feature of &lt;strong&gt;Nix&lt;&#x2F;strong&gt; (the package manager) is the ability to define per-project development environments using a &lt;code&gt;flake.nix&lt;&#x2F;code&gt; file. When you enter the project directory, &lt;code&gt;direnv&lt;&#x2F;code&gt; can automatically load a shell with all the specific tools and libraries you need for that project—a specific version of Rust, Node.js, or any other dependency. This completely solves the problem of conflicting dependencies between projects. Each project is perfectly isolated, and you can be sure that you and your colleagues are using the exact same environment.&lt;&#x2F;p&gt;
&lt;p&gt;My favorite tip is to add &lt;code&gt;if has nix; then use nix; fi&lt;&#x2F;code&gt; to the &lt;code&gt;.envrc&lt;&#x2F;code&gt; file, so the environment is only loaded for team members who have Nix installed, avoiding errors for everyone else.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;built-in-vm-based-testing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#built-in-vm-based-testing&quot; aria-label=&quot;Anchor link for: built-in-vm-based-testing&quot;&gt;🔗&lt;&#x2F;a&gt;Built-in VM-Based Testing&lt;&#x2F;h3&gt;
&lt;p&gt;A great, underrated &lt;strong&gt;NixOS&lt;&#x2F;strong&gt; feature is the built-in testing framework. You can write tests that spin up lightweight virtual machines with their own configurations to test your setup. I saw this firsthand when I recently packaged &lt;code&gt;fdbserver&lt;&#x2F;code&gt;. It took me about 30 minutes to get a test running that spins up a full FoundationDB cluster across multiple VMs. The setup is still basic—it doesn&#x27;t even use systemd—but it was more than enough to validate the packaging. You can see the test definition &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;overlay&#x2F;blob&#x2F;main&#x2F;tests&#x2F;cluster.nix&quot;&gt;here&lt;&#x2F;a&gt;. Being able to build that kind of complex integration test so quickly is something I&#x27;ve only found in NixOS.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-bad&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-bad&quot; aria-label=&quot;Anchor link for: the-bad&quot;&gt;🔗&lt;&#x2F;a&gt;The Bad&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;the-friction-of-simple-changes-on-nixos&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-friction-of-simple-changes-on-nixos&quot; aria-label=&quot;Anchor link for: the-friction-of-simple-changes-on-nixos&quot;&gt;🔗&lt;&#x2F;a&gt;The Friction of Simple Changes on NixOS&lt;&#x2F;h3&gt;
&lt;p&gt;On a normal system, if you want to add a shell alias, you edit &lt;code&gt;.bashrc&lt;&#x2F;code&gt; and you&#x27;re done. In NixOS, there are no quick edits. You have to find the right option in your configuration, add the line, and then rebuild your system. This is great for keeping your configuration tracked, but it adds a lot of friction to simple tasks.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;a-steep-and-isolated-learning-curve&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#a-steep-and-isolated-learning-curve&quot; aria-label=&quot;Anchor link for: a-steep-and-isolated-learning-curve&quot;&gt;🔗&lt;&#x2F;a&gt;A Steep and Isolated Learning Curve&lt;&#x2F;h3&gt;
&lt;p&gt;Learning the Nix ecosystem is a big commitment. The ideas are very different from other Linux systems, so your existing knowledge doesn&#x27;t help much. You have to learn the Nix language, how derivations work, and now Flakes. It takes a few months before you feel productive.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;incompatibility-with-the-wider-ecosystem&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#incompatibility-with-the-wider-ecosystem&quot; aria-label=&quot;Anchor link for: incompatibility-with-the-wider-ecosystem&quot;&gt;🔗&lt;&#x2F;a&gt;Incompatibility with the Wider Ecosystem&lt;&#x2F;h3&gt;
&lt;p&gt;Because NixOS doesn&#x27;t use the standard Filesystem Hierarchy Standard (FHS), you can&#x27;t just download a pre-compiled binary and expect it to work. It will fail to run because it can&#x27;t find its shared libraries in places like &lt;code&gt;&#x2F;lib&lt;&#x2F;code&gt; or &lt;code&gt;&#x2F;usr&#x2F;lib&lt;&#x2F;code&gt;. The Nix way to solve this is to use &lt;code&gt;patchelf&lt;&#x2F;code&gt; to modify the binary and tell it where to find its dependencies inside the &lt;code&gt;&#x2F;nix&#x2F;store&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A similar problem occurs with &quot;impure&quot; build tools. For example, the standard Protobuf plugin for Gradle tries to download the &lt;code&gt;protoc&lt;&#x2F;code&gt; compiler during the build. To make this work in a pure Nix environment, you have to disable this feature and instead provide &lt;code&gt;protoc&lt;&#x2F;code&gt; through the Nix derivation.&lt;&#x2F;p&gt;
&lt;p&gt;While these tools provide a solution, they are another hurdle to overcome. For a deep dive on patching binaries, Sander van der Burg&#x27;s post on &lt;a href=&quot;https:&#x2F;&#x2F;sandervanderburg.blogspot.com&#x2F;2015&#x2F;10&#x2F;deploying-prebuilt-binary-software-with.html&quot;&gt;deploying prebuilt binaries with Nix&lt;&#x2F;a&gt; is an excellent resource.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;handling-hardcoded-build-environments&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#handling-hardcoded-build-environments&quot; aria-label=&quot;Anchor link for: handling-hardcoded-build-environments&quot;&gt;🔗&lt;&#x2F;a&gt;Handling Hardcoded Build Environments&lt;&#x2F;h3&gt;
&lt;p&gt;Sometimes, you can&#x27;t override impure behavior. Certain libraries, particularly in the cryptography space, might have build scripts that are hardcoded to look for dependencies in standard locations like &lt;code&gt;&#x2F;usr&#x2F;lib&lt;&#x2F;code&gt;. In these cases, your only option is to fall back on &lt;a href=&quot;https:&#x2F;&#x2F;ryantm.github.io&#x2F;nixpkgs&#x2F;builders&#x2F;special&#x2F;fhs-environments&#x2F;&quot;&gt;&lt;code&gt;buildFHSUserEnv&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; (named &lt;code&gt;buildFHSEnv&lt;&#x2F;code&gt; in recent nixpkgs releases) to create a sandboxed environment that simulates a traditional filesystem. It&#x27;s a powerful tool, but it feels like a workaround and highlights the gap between the pure world of Nix and how many other tools work.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-ugly&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-ugly&quot; aria-label=&quot;Anchor link for: the-ugly&quot;&gt;🔗&lt;&#x2F;a&gt;The Ugly&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;the-nix-language-barrier&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-nix-language-barrier&quot; aria-label=&quot;Anchor link for: the-nix-language-barrier&quot;&gt;🔗&lt;&#x2F;a&gt;The Nix Language Barrier&lt;&#x2F;h3&gt;
&lt;p&gt;The Nix language itself is the hardest part. It’s a functional language that feels very different from most programming languages. Simple things can be hard to figure out, and you often have to look up how to do basic operations.&lt;&#x2F;p&gt;
&lt;p&gt;LLMs have made this much easier. Before they were widely available, I spent countless hours searching for similar packages on GitHub to figure out how to solve a specific problem. Now, you can ask for a code snippet and get something that works. But needing an AI to help with basic packaging shows how hard the language is to learn.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;🔗&lt;&#x2F;a&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;So, what&#x27;s the verdict? The scales may seem evenly balanced between praise and frustration, yet I wouldn&#x27;t switch away from NixOS. The learning curve is a mountain, and the daily friction can be grating. But the payoff—the absolute, ironclad guarantee of reproducibility—is a superpower.&lt;&#x2F;p&gt;
&lt;p&gt;As someone who builds and tests complex distributed systems, I spend my days fighting entropy. NixOS provides a sane foundation where the environment is a solved problem. The fear of a broken update before an on-call shift is gone. The hours spent debugging &quot;works on my machine&quot; issues have vanished. Setting up a new machine is a 15-minute, one-command affair.&lt;&#x2F;p&gt;
&lt;p&gt;NixOS demands a significant upfront investment for long-term peace of mind. It trades short-term convenience for long-term stability and control. It&#x27;s not for everyone, but if you&#x27;re a developer or systems engineer who sees your OS as a critical part of your toolkit—one that should be as reliable and version-controlled as your code—then the tough road of NixOS is absolutely worth it.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;a-gentler-start-try-nix-first&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#a-gentler-start-try-nix-first&quot; aria-label=&quot;Anchor link for: a-gentler-start-try-nix-first&quot;&gt;🔗&lt;&#x2F;a&gt;A Gentler Start: Try Nix First&lt;&#x2F;h3&gt;
&lt;p&gt;If this article makes you curious but wary of diving headfirst into a full OS migration, there’s good news: you don’t have to. You can get a taste of Nix’s power on your existing macOS or Linux setup.&lt;&#x2F;p&gt;
&lt;p&gt;By installing just the Nix package manager, you can start creating reproducible development environments using &lt;code&gt;nix-shell&lt;&#x2F;code&gt; or Nix Flakes. This lets you manage project-specific dependencies without conflicts and share a consistent setup with your team. It&#x27;s a fantastic way to learn the Nix language and experience its benefits in a familiar environment before committing to NixOS.&lt;&#x2F;p&gt;
&lt;p&gt;I’ve found it incredibly useful to have dependencies managed the same way between Linux and macOS. This website, for example, is built using the same Flake to pull Zola, and it works identically on my Linux laptop and my Mac.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with NixOS. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">nixos</category>
          <category domain="tag">nix</category>
          <category domain="tag">linux</category>
          <category domain="tag">devops</category>
      </item>
      <item>
          <title>Thank You, DataFusion: Queries in Rust, Without the Pain</title>
          <pubDate>Wed, 04 Jun 2025 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/thank-you-datafusion/</link>
          <guid>https://pierrezemb.fr/posts/thank-you-datafusion/</guid>
          <description xml:base="https://pierrezemb.fr/posts/thank-you-datafusion/">&lt;h2 id=&quot;that-yatta-moment-rebooted&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#that-yatta-moment-rebooted&quot; aria-label=&quot;Anchor link for: that-yatta-moment-rebooted&quot;&gt;🔗&lt;&#x2F;a&gt;That “YATTA!” Moment, Rebooted&lt;&#x2F;h2&gt;
&lt;p&gt;At work, we just merged our first successful data retrieval using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;datafusion&quot;&gt;DataFusion&lt;&#x2F;a&gt;: a real SQL query, over real data, flowing through a system we built. And I’ll be honest: I haven’t had a “YATTA!” moment like this in years. This wasn&#x27;t just a feature shipped; it felt like unlocking a new superpower for our entire system, a complex vision finally materializing.&lt;&#x2F;p&gt;
&lt;p&gt;Not a silent nod. Not “huh, that works.” A &lt;em&gt;real&lt;&#x2F;em&gt;, physical, joyful reaction. The kind that makes you want to run a lap around the office (or, in my remote-first case, the living room).&lt;&#x2F;p&gt;
&lt;p&gt;Because plugging a query engine into your software isn’t supposed to feel this smooth. It&#x27;s usually a battle. But this one did. This one felt like an invitation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;you-don-t-just-add-a-query-engine&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#you-don-t-just-add-a-query-engine&quot; aria-label=&quot;Anchor link for: you-don-t-just-add-a-query-engine&quot;&gt;🔗&lt;&#x2F;a&gt;You Don’t Just Add a Query Engine&lt;&#x2F;h2&gt;
&lt;p&gt;Adding a query engine to a codebase isn’t something you do lightly. It’s a foundational piece of infrastructure, the kind of decision that usually ends in regret, or at least a &lt;em&gt;lot&lt;&#x2F;em&gt; of rewriting. Most engines assume they own the world: they want to dictate your storage, your execution model, your schema, your optimizer, often forcing you to contort your application around their idiosyncrasies. It&#x27;s a path often paved with impedance mismatches, performance bottlenecks, and the haunting feeling that you’ve just bolted an opinionated, unyielding black box onto your carefully crafted system.&lt;&#x2F;p&gt;
&lt;p&gt;But then there’s DataFusion. A SQL engine written in Rust, and — against all odds — one you can actually &lt;em&gt;use&lt;&#x2F;em&gt;. Drop-in? Not quite. But close enough to be kind of magical, offering a set of powerful, composable tools rather than a rigid framework.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;i-ve-been-watching-from-day-one&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#i-ve-been-watching-from-day-one&quot; aria-label=&quot;Anchor link for: i-ve-been-watching-from-day-one&quot;&gt;🔗&lt;&#x2F;a&gt;I’ve Been Watching From Day One&lt;&#x2F;h2&gt;
&lt;p&gt;I’ve been following DataFusion since it was a weekend project. I still remember the early blog posts, the prototypes, the potential. And more importantly, I read &lt;a href=&quot;https:&#x2F;&#x2F;andygrove.io&#x2F;how-query-engines-work&#x2F;&quot;&gt;Andy Grove’s book &lt;em&gt;How Query Engines Work&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;. That book unlocked it for me.&lt;&#x2F;p&gt;
&lt;p&gt;It demystified concepts like logical plans, physical plans, and execution trees — enough to give me the confidence to experiment. I first played with Apache Calcite, then circled back to DataFusion. Eventually, I contributed a small example: a custom &lt;code&gt;TableProvider&lt;&#x2F;code&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;datafusion&#x2F;issues&#x2F;1864&quot;&gt;added to DataFusion in this issue&lt;&#x2F;a&gt; to demonstrate how to integrate custom datasources.&lt;&#x2F;p&gt;
&lt;p&gt;And then... it only took me &lt;strong&gt;three years&lt;&#x2F;strong&gt; to actually write the code that &lt;em&gt;used&lt;&#x2F;em&gt; it. Why so long? Well, let&#x27;s just say a gazillion other things, the never-ending sagas of on-call, and a &lt;a href=&quot;&#x2F;posts&#x2F;back-engineering&quot;&gt;brief-but-eventful detour into management&lt;&#x2F;a&gt; kept my dance card impressively full. But hey, it still felt amazing when it finally clicked.&lt;&#x2F;p&gt;
&lt;p&gt;More recently, I was genuinely happy to see that &lt;strong&gt;Andrew Lamb&lt;&#x2F;strong&gt; co-authored an &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;datafusion&#x2F;issues&#x2F;6782&quot;&gt;academic paper describing DataFusion’s architecture&lt;&#x2F;a&gt;. There’s something really validating about seeing a project you’ve followed for years get formalized in research — it’s a sign that the internals are solid and the ideas are worth sharing. And they are.&lt;&#x2F;p&gt;
&lt;p&gt;That moment was big. Because here was a Rust-native query engine where I could plug in &lt;em&gt;my own data&lt;&#x2F;em&gt;, and get &lt;em&gt;real queries&lt;&#x2F;em&gt; back. No layers of JVM glue, no corroded abstractions. Just composable, hackable Rust.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;modular-composable-respectful&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#modular-composable-respectful&quot; aria-label=&quot;Anchor link for: modular-composable-respectful&quot;&gt;🔗&lt;&#x2F;a&gt;Modular, Composable, Respectful&lt;&#x2F;h2&gt;
&lt;p&gt;What I love about DataFusion is that it doesn’t try to control your application. It’s a query engine that knows it’s a library — not a database.&lt;&#x2F;p&gt;
&lt;p&gt;It lets you:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Plug in your own data sources&lt;&#x2F;li&gt;
&lt;li&gt;Register logical tables dynamically&lt;&#x2F;li&gt;
&lt;li&gt;Push down filters, projections, even partitions&lt;&#x2F;li&gt;
&lt;li&gt;Swap in or extend physical execution nodes&lt;&#x2F;li&gt;
&lt;li&gt;Keep your own runtime, threading, and lifecycle&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;And all that without feeling like you’re stepping into “internal” code. It’s all open, cleanly layered, and welcoming.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;my-goal-join-indexes-without-going-insane&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#my-goal-join-indexes-without-going-insane&quot; aria-label=&quot;Anchor link for: my-goal-join-indexes-without-going-insane&quot;&gt;🔗&lt;&#x2F;a&gt;My Goal: Join Indexes Without Going Insane&lt;&#x2F;h2&gt;
&lt;p&gt;From the beginning, my goal was never to just scan data — it was to &lt;strong&gt;query it properly&lt;&#x2F;strong&gt;, with indexes, joins, and all the things a real engine should do. I never had any intention of writing a join execution engine myself. That’s not the kind of wheel I want to reinvent.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s no secret that at work, we&#x27;re building a system on top of FoundationDB that draws inspiration from Apple&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;foundationdb.github.io&#x2F;fdb-record-layer&#x2F;&quot;&gt;FDB Record Layer&lt;&#x2F;a&gt; (you can learn more about its concepts in &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=SvoUHHM9IKU&quot;&gt;this talk&lt;&#x2F;a&gt;). We offer &lt;a href=&quot;https:&#x2F;&#x2F;foundationdb.github.io&#x2F;fdb-record-layer&#x2F;GettingStarted.html&quot;&gt;a similar programmatic API for constructing queries&lt;&#x2F;a&gt;, which naturally leads to similar requirements. For example, developers need to express sophisticated data retrieval logic, much like this FDB Record Layer example for querying orders:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;RecordQuery&lt;&#x2F;span&gt;&lt;span&gt; query = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;RecordQuery&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;newBuilder&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;setRecordType&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Order&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;)
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;setFilter&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Query&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;and&lt;&#x2F;span&gt;&lt;span&gt;(
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Query&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;field&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;price&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;lessThan&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Query&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;field&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;flower&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;matches&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Query&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;field&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;type&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;equalsValue&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;FlowerType&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;ROSE&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;name&lt;&#x2F;span&gt;&lt;span&gt;()))))
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;build&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The challenge then becomes translating such programmatic queries into efficient, index-backed scans and, crucially, leveraging a robust engine for complex operations like joins—without rebuilding that engine from scratch.&lt;&#x2F;p&gt;
&lt;p&gt;What I wanted was the ability to:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Fetch rows efficiently through custom index-backed scans&lt;&#x2F;li&gt;
&lt;li&gt;Join them using &lt;code&gt;HashJoinExec&lt;&#x2F;code&gt; or &lt;code&gt;SortMergeJoinExec&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Let the planner and execution engine figure out the hard parts&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This vision is what spurred me to start working on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;datafusion-contrib&#x2F;datafusion-index-provider&quot;&gt;&lt;code&gt;datafusion-index-provider&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, a library hosted in the &lt;code&gt;datafusion-contrib&lt;&#x2F;code&gt; GitHub organization — part of the growing ecosystem around DataFusion. At the time of writing, I’ve built a PoC — you can find it &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;datafusion-index-provider&#x2F;tree&#x2F;init-v2&quot;&gt;on this branch&lt;&#x2F;a&gt; — and I’m integrating it into our internal stack before opening a proper PR upstream.&lt;&#x2F;p&gt;
&lt;p&gt;The architecture makes it feel possible. The abstractions are ready. And I still don’t have to write a join engine. Victory.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-joy-of-real-libraries&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-joy-of-real-libraries&quot; aria-label=&quot;Anchor link for: the-joy-of-real-libraries&quot;&gt;🔗&lt;&#x2F;a&gt;The Joy of Real Libraries&lt;&#x2F;h2&gt;
&lt;p&gt;There’s a special joy in finding a library that &lt;em&gt;slots in&lt;&#x2F;em&gt; — that doesn’t just solve a problem, but fits the shape of your system. DataFusion was that for me.&lt;&#x2F;p&gt;
&lt;p&gt;It didn’t just let me query data; it gave me a better way to think about the data I already had, and how I wanted to work with it. Instead of manually stitching together filters and projections, I could describe my intent, and let the engine handle the rest.&lt;&#x2F;p&gt;
&lt;p&gt;What’s even more exciting is that this isn’t happening in a vacuum.&lt;&#x2F;p&gt;
&lt;p&gt;We’re seeing a quiet shift in how query engines are built and used. Projects like &lt;a href=&quot;https:&#x2F;&#x2F;duckdb.org&#x2F;&quot;&gt;DuckDB&lt;&#x2F;a&gt; have shown just how powerful it is to have &lt;strong&gt;SQL as a library&lt;&#x2F;strong&gt;, not a service. No server to deploy. No socket to connect to. Just an API, embedded right in your code.&lt;&#x2F;p&gt;
&lt;p&gt;DataFusion follows that same philosophy — Rust-native, embeddable, and unapologetically library-first.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;to-the-datafusion-team-thank-you&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#to-the-datafusion-team-thank-you&quot; aria-label=&quot;Anchor link for: to-the-datafusion-team-thank-you&quot;&gt;🔗&lt;&#x2F;a&gt;To the DataFusion Team: Thank You&lt;&#x2F;h2&gt;
&lt;p&gt;To Andy Grove, to all the contributors, to everyone filing issues and refining abstractions: thank you. Your work is enabling a new generation of Rust systems to think like databases — without becoming one.&lt;&#x2F;p&gt;
&lt;p&gt;I don’t know if you realize how rare that is. I just know it changed what I thought was possible in my software.&lt;&#x2F;p&gt;
&lt;p&gt;And I’m having a lot more fun because of it.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with DataFusion. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">rust</category>
          <category domain="tag">datafusion</category>
          <category domain="tag">sql</category>
          <category domain="tag">query-engine</category>
          <category domain="tag">databases</category>
      </item>
      <item>
          <title>Bypassing FoundationDB&#x27;s Transaction Limits with Record Layer Continuations</title>
          <pubDate>Tue, 03 Jun 2025 00:30:00 +0200</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/understanding-fdb-record-layer-continuations/</link>
          <guid>https://pierrezemb.fr/posts/understanding-fdb-record-layer-continuations/</guid>
          <description xml:base="https://pierrezemb.fr/posts/understanding-fdb-record-layer-continuations/">&lt;h2 id=&quot;introducing-the-foundationdb-record-layer&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#introducing-the-foundationdb-record-layer&quot; aria-label=&quot;Anchor link for: introducing-the-foundationdb-record-layer&quot;&gt;🔗&lt;&#x2F;a&gt;Introducing the FoundationDB Record Layer&lt;&#x2F;h2&gt;
&lt;p&gt;Before we dive into the specifics of handling large operations with continuations (the main topic of this post), let&#x27;s briefly introduce the &lt;a href=&quot;https:&#x2F;&#x2F;foundationdb.github.io&#x2F;fdb-record-layer&#x2F;index.html&quot;&gt;&lt;strong&gt;FoundationDB Record Layer&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;. It&#x27;s a powerful open-source library built atop FoundationDB that brings a structured, record-oriented data model to FDB&#x27;s highly scalable key-value store. Think of it as adding schema management, rich indexing capabilities, and a sophisticated query engine, making it easier to build complex applications.&lt;&#x2F;p&gt;
&lt;p&gt;The Record Layer is versatile and has been adopted for demanding use-cases, most notably by Apple as the core of CloudKit, powering services for millions of users. It allows developers to define their data models using Protocol Buffers and then query them in a flexible manner.&lt;&#x2F;p&gt;
&lt;p&gt;For instance, you can express queries like finding all &#x27;Order&#x27; records for roses costing less than $50 with a declarative API (example in Java):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;RecordQuery&lt;&#x2F;span&gt;&lt;span&gt; query = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;RecordQuery&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;newBuilder&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;setRecordType&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Order&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;)
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;setFilter&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Query&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;and&lt;&#x2F;span&gt;&lt;span&gt;(
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Query&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;field&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;price&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;lessThan&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Query&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;field&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;flower&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;matches&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Query&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;field&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;type&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;equalsValue&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;FlowerType&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;ROSE&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;name&lt;&#x2F;span&gt;&lt;span&gt;()))))
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;build&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To get started and explore its capabilities further, the official &lt;a href=&quot;https:&#x2F;&#x2F;foundationdb.github.io&#x2F;fdb-record-layer&#x2F;GettingStarted.html&quot;&gt;Getting Started Guide&lt;&#x2F;a&gt; is an excellent resource. You can also watch these talks for a deeper understanding:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=SvoUHHM9IKU&quot;&gt;Using FoundationDB and the FDB Record Layer to Build CloudKit - Scott Gray, Apple&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=HLE8chgw6LI&quot;&gt;FoundationDB Record Layer: Open Source Structured Storage on FoundationDB - Nicholas Schiefer, Apple&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;For a detailed academic perspective on its design and how CloudKit uses it, refer to the &lt;a href=&quot;https:&#x2F;&#x2F;www.foundationdb.org&#x2F;files&#x2F;record-layer-paper.pdf&quot;&gt;SIGMOD&#x27;19 paper: FoundationDB Record Layer: A Multi-Tenant Structured Datastore&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;the-challenge-fdb-s-transaction-constraints&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-challenge-fdb-s-transaction-constraints&quot; aria-label=&quot;Anchor link for: the-challenge-fdb-s-transaction-constraints&quot;&gt;🔗&lt;&#x2F;a&gt;The Challenge: FDB&#x27;s Transaction Constraints&lt;&#x2F;h2&gt;
&lt;p&gt;FoundationDB (FDB) imposes strict constraints on its transactions: they must complete within 5 seconds and are limited to 10 MB of affected data, which counts the keys and values a transaction writes plus the keys and ranges it reads. These constraints are fundamental to FDB&#x27;s design, ensuring high performance and serializable isolation. However, they pose a significant challenge for operations that must process large datasets or run complex queries that cannot finish within these tight boundaries, such as full table scans, large analytical queries, or bulk data exports.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;strong&gt;FoundationDB Record Layer&lt;&#x2F;strong&gt; addresses this challenge through a mechanism known as &lt;strong&gt;continuations&lt;&#x2F;strong&gt;. Continuations allow a single logical operation to be broken down into a sequence of smaller, independent FDB transactions. Each transaction processes a segment of the total workload and, if more work remains, yields a &lt;strong&gt;continuation token&lt;&#x2F;strong&gt;. This opaque token encapsulates the state required to resume the operation precisely where the previous transaction left off.&lt;&#x2F;p&gt;
&lt;p&gt;This article delves into the technical details of Record Layer continuations, exploring how they function and how to leverage them effectively to build robust, scalable applications on FDB.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bridging-transactions-the-role-of-continuations&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#bridging-transactions-the-role-of-continuations&quot; aria-label=&quot;Anchor link for: bridging-transactions-the-role-of-continuations&quot;&gt;🔗&lt;&#x2F;a&gt;Bridging Transactions: The Role of Continuations&lt;&#x2F;h2&gt;
&lt;p&gt;Consider a query to retrieve all records matching a specific filter from a large dataset. Executing this as a single FDB transaction would likely violate the 5-second or 10MB limit. The Record Layer employs continuations to serialize this operation across multiple transactions:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Initial Request:&lt;&#x2F;strong&gt; The application initiates a query against the Record Layer.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Segmented Execution:&lt;&#x2F;strong&gt; The Record Layer plans the query and executes it with built-in scan limiters. It processes records until a predefined limit (e.g., row count, time duration, or byte size) is reached, or until it nears FDB&#x27;s intrinsic transaction limits.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;State Serialization:&lt;&#x2F;strong&gt; Before the current FDB transaction commits, if the logical operation is incomplete, the Record Layer serializes the execution state of the query plan into a continuation token.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Partial Result &amp;amp; Token:&lt;&#x2F;strong&gt; The application receives the processed segment of data and the continuation token. The FDB transaction for this segment commits successfully.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Resumption:&lt;&#x2F;strong&gt; To fetch the next segment, the application submits a new request, providing the previously received continuation token.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;State Deserialization &amp;amp; Continued Execution:&lt;&#x2F;strong&gt; The Record Layer deserializes the token, restores the query plan&#x27;s state, and resumes execution from the exact point it paused. This typically involves adjusting scan boundaries (e.g., starting a key-range scan from the key after the last one processed).&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;This cycle repeats until the entire logical operation is complete. The continuation token acts as the critical link, enabling a series of short, FDB-compliant transactions to collectively achieve the effect of a single, long-running operation without violating FDB&#x27;s core constraints.&lt;&#x2F;p&gt;
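&lt;p&gt;The cycle above can be sketched with a small, self-contained Java simulation. This is not the Record Layer API: &lt;code&gt;ContinuationLoopSketch&lt;&#x2F;code&gt;, &lt;code&gt;Page&lt;&#x2F;code&gt;, &lt;code&gt;scanOnce&lt;&#x2F;code&gt; and &lt;code&gt;scanAll&lt;&#x2F;code&gt; are hypothetical names, the &quot;store&quot; is a sorted in-memory array, and the token is assumed to hold only the last key returned. Each call to &lt;code&gt;scanOnce&lt;&#x2F;code&gt; stands in for one short transaction.&lt;&#x2F;p&gt;

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical in-memory stand-in for a continuation-driven scan.
// None of these names come from the Record Layer API.
final class ContinuationLoopSketch {

    // One page of results plus the token needed to resume (null when done).
    static final class Page {
        final int[] keys;
        final byte[] continuation;

        Page(int[] keys, byte[] continuation) {
            this.keys = keys;
            this.continuation = continuation;
        }
    }

    // Simulates one short transaction: read at most `limit` keys that come
    // strictly after the key stored in `continuation` (null means "from the start").
    static Page scanOnce(int[] sortedKeys, byte[] continuation, int limit) {
        if (1 > limit) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        int start = 0;
        if (continuation != null) {
            int lastKey = ByteBuffer.wrap(continuation).getInt();
            int idx = Arrays.binarySearch(sortedKeys, lastKey);
            // Found: resume right after it. Not found: resume at its insertion point.
            start = idx >= 0 ? idx + 1 : -idx - 1;
        }
        int end = Math.min(start + limit, sortedKeys.length);
        int[] page = Arrays.copyOfRange(sortedKeys, start, end);
        byte[] next = null;
        if (end != sortedKeys.length) {
            // More work remains: remember the last key handed to the caller.
            next = ByteBuffer.allocate(Integer.BYTES).putInt(page[page.length - 1]).array();
        }
        return new Page(page, next);
    }

    // Drives the loop: each iteration stands for a new request in a new transaction.
    static int[] scanAll(int[] sortedKeys, int limit) {
        int[] all = new int[0];
        byte[] continuation = null;
        do {
            Page page = scanOnce(sortedKeys, continuation, limit);
            int oldLength = all.length;
            all = Arrays.copyOf(all, oldLength + page.keys.length);
            System.arraycopy(page.keys, 0, all, oldLength, page.keys.length);
            continuation = page.continuation; // stored and passed back verbatim
        } while (continuation != null);
        return all;
    }

    public static void main(String[] args) {
        int[] keys = {1, 3, 5, 7, 9, 11, 13};
        System.out.println(Arrays.toString(scanAll(keys, 3)));
    }
}
```

&lt;p&gt;Resuming by key position rather than by array offset means the sketch still lands on the right spot if the last returned key has been deleted in the meantime, which is roughly why a real token records scan boundaries.&lt;&#x2F;p&gt;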
&lt;h2 id=&quot;dissecting-the-continuation-token&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#dissecting-the-continuation-token&quot; aria-label=&quot;Anchor link for: dissecting-the-continuation-token&quot;&gt;🔗&lt;&#x2F;a&gt;Dissecting the Continuation Token&lt;&#x2F;h2&gt;
&lt;p&gt;While the continuation token is &lt;strong&gt;opaque&lt;&#x2F;strong&gt; to the application (it&#x27;s a &lt;code&gt;byte[]&lt;&#x2F;code&gt; that should not be introspected or modified), it internally contains structured information vital for resuming query execution. The exact format is an implementation detail of the Record Layer and can evolve, but conceptually, it must capture:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scan Boundaries:&lt;&#x2F;strong&gt; The key (or keys, for multi-dimensional indexes) where the next scan segment should begin. This ensures no data is missed or re-processed unnecessarily.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Query Plan State:&lt;&#x2F;strong&gt; For complex query plans involving joins, filters, aggregations, or in-memory sorting, the token may need to store intermediate state specific to those operators. For instance, a &lt;code&gt;UnionPlan&lt;&#x2F;code&gt; or &lt;code&gt;IntersectionPlan&lt;&#x2F;code&gt; might need to remember which child plan was active and its respective continuation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Scan Limiter State:&lt;&#x2F;strong&gt; Information about accumulated counts or sizes if the scan was paused due to application-defined limits rather than FDB limits.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Version Information:&lt;&#x2F;strong&gt; To ensure compatibility if the token format changes across Record Layer versions.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The opacity of the token is a deliberate design choice. It decouples the application from the internal mechanics of the Record Layer, allowing the latter to evolve its continuation strategies (e.g., for efficiency or new features) without breaking client applications. The application&#x27;s responsibility is solely to store and return this token verbatim.&lt;&#x2F;p&gt;
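&lt;p&gt;As a rough illustration of what such a token could carry, here is a minimal Java sketch that assumes a made-up layout: one version byte followed by the last key as a 4-byte integer. The real format is internal to the Record Layer and richer than this; &lt;code&gt;OpaqueTokenSketch&lt;&#x2F;code&gt;, &lt;code&gt;encode&lt;&#x2F;code&gt; and &lt;code&gt;decodeLastKey&lt;&#x2F;code&gt; are hypothetical names.&lt;&#x2F;p&gt;

```java
import java.nio.ByteBuffer;

// Hypothetical token layout for illustration only; not the Record Layer format.
final class OpaqueTokenSketch {
    static final byte FORMAT_VERSION = 1;

    // Layout assumed for this sketch: 1 version byte, then the 4-byte last key.
    static byte[] encode(int lastKey) {
        return ByteBuffer.allocate(1 + Integer.BYTES)
                .put(FORMAT_VERSION)
                .putInt(lastKey)
                .array();
    }

    // Only the library side decodes the token; unknown versions are rejected.
    static int decodeLastKey(byte[] token) {
        ByteBuffer buffer = ByteBuffer.wrap(token);
        byte version = buffer.get();
        if (version != FORMAT_VERSION) {
            throw new IllegalArgumentException("unsupported continuation version: " + version);
        }
        return buffer.getInt();
    }

    public static void main(String[] args) {
        byte[] token = encode(42);     // produced by the library side
        byte[] stored = token.clone(); // the application just keeps the bytes
        System.out.println(decodeLastKey(stored));
    }
}
```

&lt;p&gt;The application side never looks inside the byte array, so the layout can change behind the version byte without touching client code.&lt;&#x2F;p&gt;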
&lt;h2 id=&quot;resuming-query-execution-via-continuations&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#resuming-query-execution-via-continuations&quot; aria-label=&quot;Anchor link for: resuming-query-execution-via-continuations&quot;&gt;🔗&lt;&#x2F;a&gt;Resuming Query Execution via Continuations&lt;&#x2F;h2&gt;
&lt;p&gt;When a continuation token is provided to a &lt;code&gt;RecordCursor&lt;&#x2F;code&gt; (the Record Layer&#x27;s abstraction for iterating over query results), the underlying &lt;code&gt;RecordQueryPlan&lt;&#x2F;code&gt; uses it to reconstruct its state.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Plan Identification:&lt;&#x2F;strong&gt; The token typically identifies the specific query plan or sub-plan it pertains to.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;State Restoration:&lt;&#x2F;strong&gt; Each operator in the query plan (e.g., &lt;code&gt;IndexScanPlan&lt;&#x2F;code&gt;, &lt;code&gt;FilterPlan&lt;&#x2F;code&gt;, &lt;code&gt;SortPlan&lt;&#x2F;code&gt;) that can be stateful across transaction boundaries implements logic to initialize itself from the continuation. For an &lt;code&gt;IndexScanPlan&lt;&#x2F;code&gt;, this primarily means setting the &lt;code&gt;ScanComparisons&lt;&#x2F;code&gt; for the next range read. For a &lt;code&gt;UnionPlan&lt;&#x2F;code&gt;, it might mean restoring the continuation for one of its child plans and indicating which child to resume.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Execution Resumption:&lt;&#x2F;strong&gt; Once the plan&#x27;s state is restored, the &lt;code&gt;RecordCursor&lt;&#x2F;code&gt; can proceed to fetch the next batch of records. The execution effectively &quot;jumps&quot; to the point encoded in the continuation.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;This mechanism allows the Record Layer to transparently manage the complexities of distributed, stateful iteration over potentially vast datasets, all while adhering to FDB&#x27;s transactional model.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implications-of-non-atomicity&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#implications-of-non-atomicity&quot; aria-label=&quot;Anchor link for: implications-of-non-atomicity&quot;&gt;🔗&lt;&#x2F;a&gt;Implications of Non-Atomicity&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s important to understand a key implication of this multi-transaction approach: while each individual FDB transaction executed as part of a continued operation is atomic and isolated (typically providing serializable isolation), the overall logical operation spanning multiple continuations is &lt;strong&gt;not atomic&lt;&#x2F;strong&gt; in the same way. Mutations to the data by other concurrent transactions can occur &lt;em&gt;between&lt;&#x2F;em&gt; the FDB transactions of a continued scan. As a result, a long-running operation that uses continuations doesn&#x27;t see the entire dataset at a single, frozen moment in time. Instead, it might see some data that was present or changed &lt;em&gt;after&lt;&#x2F;em&gt; the operation began but &lt;em&gt;before&lt;&#x2F;em&gt; it completed. This is a natural consequence of breaking the work into smaller pieces to fit within FDB&#x27;s transaction limits. Applications should be aware of this behavior, particularly if they need all the data to reflect its state from one specific instant.&lt;&#x2F;p&gt;
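&lt;p&gt;To make this concrete, the following Java sketch simulates a two-page scan with a write landing between the two transactions. It is an in-memory model, not Record Layer code: &lt;code&gt;NonAtomicScanSketch&lt;&#x2F;code&gt;, &lt;code&gt;readPage&lt;&#x2F;code&gt; and &lt;code&gt;scanAcrossTwoTransactions&lt;&#x2F;code&gt; are hypothetical names, and the store is assumed to be a sorted array of integer keys.&lt;&#x2F;p&gt;

```java
import java.util.Arrays;

// Hypothetical model of a continued scan racing with a concurrent writer.
final class NonAtomicScanSketch {

    // Returns up to `limit` keys strictly greater than `afterKey` from a sorted store.
    static int[] readPage(int[] sortedStore, int afterKey, int limit) {
        return Arrays.stream(sortedStore).filter(k -> k > afterKey).limit(limit).toArray();
    }

    // Page 1 is read from the store as it was `before` the concurrent write,
    // the remainder from the store as it is `after` it. Assumes `before` is non-empty.
    static int[] scanAcrossTwoTransactions(int[] before, int[] after, int pageSize) {
        int[] page1 = readPage(before, Integer.MIN_VALUE, pageSize);
        int resumeAfter = page1[page1.length - 1]; // what the continuation would encode
        int[] page2 = readPage(after, resumeAfter, Integer.MAX_VALUE);
        int[] seen = Arrays.copyOf(page1, page1.length + page2.length);
        System.arraycopy(page2, 0, seen, page1.length, page2.length);
        return seen;
    }

    public static void main(String[] args) {
        int[] before = {10, 20, 30, 40};
        int[] after = {5, 10, 20, 30, 35, 40}; // another writer inserted 5 and 35
        System.out.println(Arrays.toString(scanAcrossTwoTransactions(before, after, 2)));
    }
}
```

&lt;p&gt;In this run the scan returns 35, which did not exist when it started, and never sees 5, which was inserted behind the cursor. The result matches neither the state before the write nor the state after it.&lt;&#x2F;p&gt;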
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;🔗&lt;&#x2F;a&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;The Record Layer&#x27;s continuation feature is a powerful tool for handling large datasets and complex queries in FoundationDB, but it&#x27;s important to understand the implications of non-atomicity. By breaking operations into smaller, FDB-compliant transactions, the Record Layer provides a flexible and scalable solution while maintaining the core principles of FDB&#x27;s transactional model.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your thoughts. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">foundationdb</category>
          <category domain="tag">record-layer</category>
          <category domain="tag">java</category>
          <category domain="tag">database</category>
          <category domain="tag">continuation</category>
          <category domain="tag">pagination</category>
          <category domain="tag">distributed-systems</category>
      </item>
      <item>
          <title>Unlocking Tokio&#x27;s Hidden Gems: Determinism, Paused Time, and Local Execution</title>
          <pubDate>Sun, 18 May 2025 18:13:02 +0200</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/tokio-hidden-gems/</link>
          <guid>https://pierrezemb.fr/posts/tokio-hidden-gems/</guid>
          <description xml:base="https://pierrezemb.fr/posts/tokio-hidden-gems/">&lt;p&gt;Tokio is the powerhouse of asynchronous Rust, celebrated for its blazing speed and robust concurrency primitives. Many of us interact with its core components daily—&lt;code&gt;spawn&lt;&#x2F;code&gt;, &lt;code&gt;select!&lt;&#x2F;code&gt;, &lt;code&gt;async fn&lt;&#x2F;code&gt;, and the rich ecosystem of I&#x2F;O utilities. But beyond these well-trodden paths lie some incredibly potent, albeit less-publicized, features that can dramatically elevate your testing strategies, offer more nuanced task management, and grant you surgical control over your runtime&#x27;s execution.&lt;&#x2F;p&gt;
&lt;p&gt;Today, let&#x27;s pull back the curtain on a few of these invaluable tools: current-thread runtimes for embracing single-threaded flexibility with &lt;code&gt;!Send&lt;&#x2F;code&gt; types, seeded runtimes for taming non-determinism, and the paused clock for mastering time in your tests.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;effortless-send-futures-with-current-thread-runtimes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#effortless-send-futures-with-current-thread-runtimes&quot; aria-label=&quot;Anchor link for: effortless-send-futures-with-current-thread-runtimes&quot;&gt;🔗&lt;&#x2F;a&gt;Effortless &lt;code&gt;!Send&lt;&#x2F;code&gt; Futures with Current-Thread Runtimes&lt;&#x2F;h2&gt;
&lt;p&gt;While Tokio&#x27;s multi-threaded scheduler is a marvel for CPU-bound and parallel I&#x2F;O tasks, there are scenarios where a single-threaded execution model is simpler or even necessary. This is particularly true when dealing with types that are not &lt;code&gt;Send&lt;&#x2F;code&gt; (i.e., cannot be safely transferred across threads), such as &lt;code&gt;Rc&amp;lt;T&amp;gt;&lt;&#x2F;code&gt; or &lt;code&gt;RefCell&amp;lt;T&amp;gt;&lt;&#x2F;code&gt;, or when you want to avoid the overhead and complexity of synchronization primitives like &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;T&amp;gt;&amp;gt;&lt;&#x2F;code&gt; for state shared only within a single thread of execution.&lt;&#x2F;p&gt;
&lt;p&gt;Tokio&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;runtime&#x2F;struct.Builder.html#method.new_current_thread&quot;&gt;&lt;code&gt;Builder::new_current_thread()&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; followed by &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;runtime&#x2F;struct.Builder.html#method.build_local&quot;&gt;&lt;code&gt;build_local()&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; (part of the same &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;runtime&#x2F;struct.Builder.html&quot;&gt;&lt;code&gt;Builder&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; API) provides a streamlined way to create a runtime that executes tasks on the thread that created it. This setup inherently supports spawning &lt;code&gt;!Send&lt;&#x2F;code&gt; futures using &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;task&#x2F;fn.spawn_local.html&quot;&gt;&lt;code&gt;tokio::task::spawn_local&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; without needing to manually manage a &lt;code&gt;LocalSet&lt;&#x2F;code&gt; for basic cases. This approach aligns well with ongoing discussions in the Tokio community aimed at simplifying &lt;code&gt;!Send&lt;&#x2F;code&gt; task management.&lt;&#x2F;p&gt;
&lt;p&gt;This &lt;code&gt;build_local()&lt;&#x2F;code&gt; method not only simplifies handling &lt;code&gt;!Send&lt;&#x2F;code&gt; types today but also reflects the direction Tokio is heading. The Tokio team is exploring ways to make this even more integrated and ergonomic through a proposed &lt;strong&gt;&lt;code&gt;LocalRuntime&lt;&#x2F;code&gt;&lt;&#x2F;strong&gt; type (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tokio-rs&#x2F;tokio&#x2F;issues&#x2F;6739&quot;&gt;#6739&lt;&#x2F;a&gt;). The vision for &lt;code&gt;LocalRuntime&lt;&#x2F;code&gt; is a runtime that is inherently &lt;code&gt;!Send&lt;&#x2F;code&gt; (making &lt;code&gt;!Send&lt;&#x2F;code&gt; task management seamless within its context), where &lt;code&gt;tokio::spawn&lt;&#x2F;code&gt; and &lt;code&gt;tokio::task::spawn_local&lt;&#x2F;code&gt; would effectively behave identically.&lt;&#x2F;p&gt;
&lt;p&gt;This proposed enhancement is linked to a discussion about potentially deprecating the existing &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;task&#x2F;struct.LocalSet.html&quot;&gt;&lt;code&gt;tokio::task::LocalSet&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tokio-rs&#x2F;tokio&#x2F;issues&#x2F;6741&quot;&gt;#6741&lt;&#x2F;a&gt;). While &lt;code&gt;LocalSet&lt;&#x2F;code&gt; currently offers fine-grained control for running &lt;code&gt;!Send&lt;&#x2F;code&gt; tasks (e.g., within specific parts of larger, multi-threaded applications), it comes with complexities, performance overhead, and integration challenges that &lt;code&gt;LocalRuntime&lt;&#x2F;code&gt; aims to resolve.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;So, what&#x27;s the takeaway for you?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;For most scenarios requiring &lt;code&gt;!Send&lt;&#x2F;code&gt; tasks on a single thread&lt;&#x2F;strong&gt; (like entire applications, test suites, or dedicated utility threads): Using &lt;code&gt;Builder::new_current_thread().build_local()&lt;&#x2F;code&gt; is the recommended, simpler, and more future-proof path. It embodies the principles of the proposed &lt;code&gt;LocalRuntime&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;If you need to embed &lt;code&gt;!Send&lt;&#x2F;code&gt; task execution within a specific scope of a larger, multi-threaded application&lt;&#x2F;strong&gt;: &lt;code&gt;LocalSet&lt;&#x2F;code&gt; is the current tool. However, be mindful of its potential deprecation and associated complexities. For new projects, evaluate if a dedicated thread using a &lt;code&gt;build_local()&lt;&#x2F;code&gt; runtime (or a future &lt;code&gt;LocalRuntime&lt;&#x2F;code&gt;) might offer a cleaner solution.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Essentially, Tokio is moving towards making single-threaded &lt;code&gt;!Send&lt;&#x2F;code&gt; execution more straightforward and deeply integrated. The &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;runtime&#x2F;struct.Builder.html#method.build_local&quot;&gt;&lt;code&gt;build_local()&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; method is a current gem that aligns you with this forward-looking approach.&lt;&#x2F;p&gt;
&lt;p&gt;Here&#x27;s how you typically set one up (the &lt;code&gt;build_local()&lt;&#x2F;code&gt; way):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;use &lt;&#x2F;span&gt;&lt;span&gt;tokio::runtime::Builder;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let mut&lt;&#x2F;span&gt;&lt;span&gt; rt = Builder::new_current_thread()
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;enable_all&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Enable I&#x2F;O, time, etc.
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;build_local&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;mut &lt;&#x2F;span&gt;&lt;span&gt;Default::default()) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Builds a runtime on the current thread
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;unwrap&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; The runtime itself is the &amp;#39;LocalSet&amp;#39; in this context
&lt;&#x2F;span&gt;&lt;span&gt;    rt.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;block_on&lt;&#x2F;span&gt;&lt;span&gt;(async {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Spawn !Send futures here using tokio::task::spawn_local(...)
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; For example:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; rc_value = std::rc::Rc::new(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;        tokio::task::spawn_local(async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;move &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;RC value: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, *rc_value);
&lt;&#x2F;span&gt;&lt;span&gt;        }).await.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;unwrap&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Running !Send futures on a current-thread runtime!&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;    });
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This approach simplifies designs where tasks don&#x27;t need to cross thread boundaries, allowing for more straightforward state management.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;taming-non-determinism-seeded-runtimes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#taming-non-determinism-seeded-runtimes&quot; aria-label=&quot;Anchor link for: taming-non-determinism-seeded-runtimes&quot;&gt;🔗&lt;&#x2F;a&gt;Taming Non-Determinism: Seeded Runtimes&lt;&#x2F;h2&gt;
&lt;p&gt;One of the challenges in testing concurrent systems is non-determinism. When multiple futures are ready to make progress simultaneously, such as in a &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;macro.select.html&quot;&gt;&lt;code&gt;tokio::select!&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; macro, the order in which they are polled can vary between runs. This can make reproducing and debugging race conditions or specific interleavings tricky.&lt;&#x2F;p&gt;
&lt;p&gt;Tokio offers a solution: &lt;strong&gt;seeded runtimes&lt;&#x2F;strong&gt;. By providing a specific &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;runtime&#x2F;struct.Builder.html#method.rng_seed&quot;&gt;&lt;code&gt;RngSeed&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; when building the runtime, you can make certain scheduler behaviors deterministic. This is particularly useful for &lt;code&gt;select!&lt;&#x2F;code&gt; statements involving multiple futures that become ready around the same time. Note that &lt;code&gt;RngSeed&lt;&#x2F;code&gt; is currently gated behind the &lt;code&gt;tokio_unstable&lt;&#x2F;code&gt; cfg flag.&lt;&#x2F;p&gt;
&lt;p&gt;Consider this example, which demonstrates how a seed can influence which future &#x27;wins&#x27; a &lt;code&gt;select!&lt;&#x2F;code&gt; race:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;use &lt;&#x2F;span&gt;&lt;span&gt;tokio::runtime::{Builder, RngSeed};
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;use &lt;&#x2F;span&gt;&lt;span&gt;tokio::time::{sleep, Duration};
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Example function to show deterministic select!
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;demo_deterministic_select&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Try changing this seed to see the select! behavior change (but consistently per seed).
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; seed = RngSeed::from_bytes(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;b&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;my_fixed_seed_001&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; e.g., let seed = RngSeed::from_bytes(b&amp;quot;another_seed_002&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let mut&lt;&#x2F;span&gt;&lt;span&gt; rt = Builder::new_current_thread()
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;enable_time&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Pausing the clock is crucial here to ensure both tasks become ready 
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; at the *exact same logical time* after we call `tokio::time::advance`.
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; This makes the seed&amp;#39;s role in tie-breaking very clear.
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;start_paused&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;rng_seed&lt;&#x2F;span&gt;&lt;span&gt;(seed)     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Apply the seed for deterministic polling order
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;build_local&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;mut &lt;&#x2F;span&gt;&lt;span&gt;Default::default())
&lt;&#x2F;span&gt;&lt;span&gt;        .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;unwrap&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Now, let&amp;#39;s run some tasks and see select! in action.
&lt;&#x2F;span&gt;&lt;span&gt;    rt.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;block_on&lt;&#x2F;span&gt;&lt;span&gt;(async {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; task_a = async {
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;sleep&lt;&#x2F;span&gt;&lt;span&gt;(Duration::from_millis(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span&gt;)).await;
&lt;&#x2F;span&gt;&lt;span&gt;            println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Task A finished.&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;            &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Result from A&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;        };
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; task_b = async {
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;sleep&lt;&#x2F;span&gt;&lt;span&gt;(Duration::from_millis(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span&gt;)).await;
&lt;&#x2F;span&gt;&lt;span&gt;            println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Task B finished.&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;            &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Result from B&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;        };
&lt;&#x2F;span&gt;&lt;span&gt;
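&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Async blocks are lazy: both sleeps are only registered once select! first polls them.
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; This call moves the paused clock forward; auto-advance later wakes both sleeps at the same instant.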
&lt;&#x2F;span&gt;&lt;span&gt;        tokio::time::advance(Duration::from_millis(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span&gt;)).await;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; With the same seed, the select! macro will consistently pick the same
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; branch if both are ready. Change the seed to see if the other branch gets picked.
&lt;&#x2F;span&gt;&lt;span&gt;        tokio::select! {
&lt;&#x2F;span&gt;&lt;span&gt;            res_a = task_a =&amp;gt; {
&lt;&#x2F;span&gt;&lt;span&gt;                println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Select chose Task A, result: &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, res_a);
&lt;&#x2F;span&gt;&lt;span&gt;            }
&lt;&#x2F;span&gt;&lt;span&gt;            res_b = task_b =&amp;gt; {
&lt;&#x2F;span&gt;&lt;span&gt;                println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Select chose Task B, result: &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, res_b);
&lt;&#x2F;span&gt;&lt;span&gt;            }
&lt;&#x2F;span&gt;&lt;span&gt;        }
&lt;&#x2F;span&gt;&lt;span&gt;    });
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;demo_deterministic_select&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;mastering-time-paused-clock-and-auto-advancement&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#mastering-time-paused-clock-and-auto-advancement&quot; aria-label=&quot;Anchor link for: mastering-time-paused-clock-and-auto-advancement&quot;&gt;🔗&lt;&#x2F;a&gt;Mastering Time: Paused Clock and Auto-Advancement&lt;&#x2F;h2&gt;
&lt;p&gt;Testing time-dependent behavior (timeouts, retries, scheduled tasks) can be slow and flaky. Waiting for real seconds or minutes to pass during tests is inefficient. Tokio&#x27;s time facilities can be &lt;strong&gt;paused&lt;&#x2F;strong&gt; and &lt;strong&gt;manually advanced&lt;&#x2F;strong&gt;, giving you precise control over the flow of time within your tests.&lt;&#x2F;p&gt;
&lt;p&gt;When you initialize a runtime with &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;tokio&#x2F;latest&#x2F;tokio&#x2F;runtime&#x2F;struct.Builder.html#method.start_paused&quot;&gt;&lt;code&gt;start_paused(true)&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, the runtime&#x27;s clock will not advance automatically based on wall-clock time. Instead, you use &lt;code&gt;tokio::time::advance(Duration)&lt;&#x2F;code&gt; to move time forward explicitly.&lt;&#x2F;p&gt;
&lt;p&gt;What&#x27;s particularly neat is Tokio&#x27;s &lt;strong&gt;auto-advance&lt;&#x2F;strong&gt; feature when the runtime is paused and idle. This works because Tokio&#x27;s runtime separates the &lt;strong&gt;executor&lt;&#x2F;strong&gt; (which polls your async code until it&#x27;s blocked) from the &lt;strong&gt;reactor&lt;&#x2F;strong&gt; (which wakes tasks based on I&#x2F;O or timer events). If all tasks are sleeping, the executor is idle. The reactor can then identify the next scheduled timer, allowing Tokio to automatically advance its clock to that point. This prevents tests from hanging indefinitely while still allowing for controlled time progression.&lt;&#x2F;p&gt;
&lt;p&gt;Here&#x27;s an example illustrating this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;use &lt;&#x2F;span&gt;&lt;span&gt;tokio::time::{Duration, Instant, sleep};
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;#[tokio::test(start_paused = true)]
&lt;&#x2F;span&gt;&lt;span&gt;async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;auto_advance_kicks_in_when_idle_example&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; start = Instant::now();
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Sleep for 5 seconds. Since the runtime is paused, this would normally hang.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; However, if no other tasks are active, Tokio auto-advances time.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;sleep&lt;&#x2F;span&gt;&lt;span&gt;(Duration::from_secs(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;)).await;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; elapsed = start.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;elapsed&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; This will be exactly 5 seconds (simulated time)
&lt;&#x2F;span&gt;&lt;span&gt;    assert_eq!(elapsed, Duration::from_secs(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;));
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Elapsed (simulated): &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;{:?}&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, elapsed);
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this scenario, &lt;code&gt;sleep(Duration::from_secs(5)).await&lt;&#x2F;code&gt; doesn&#x27;t cause your test to wait for 5 real seconds. Because the clock is paused and this &lt;code&gt;sleep&lt;&#x2F;code&gt; is the only pending timed event, Tokio advances its internal clock by 5 seconds, allowing the sleep to complete almost instantaneously in real time. This makes testing timeouts, scheduled events, and other time-sensitive logic fast and reliable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;🔗&lt;&#x2F;a&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Tokio offers more than just speed; it&#x27;s a powerful toolkit. Features like current-thread runtimes for &lt;code&gt;!Send&lt;&#x2F;code&gt; tasks, seeded runtimes for deterministic tests, and a controllable clock for time-based logic help build robust and debuggable async Rust applications. These &#x27;hidden gems&#x27; allow you to confidently handle complex concurrency and testing. So, explore Tokio&#x27;s depth—the right tool for your challenge might be closer than you think.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your thoughts. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">rust</category>
          <category domain="tag">tokio</category>
          <category domain="tag">async</category>
          <category domain="tag">testing</category>
          <category domain="tag">concurrency</category>
          <category domain="tag">deterministic</category>
      </item>
      <item>
          <title>What if we embraced simulation-driven development?</title>
          <pubDate>Fri, 18 Apr 2025 11:12:12 +0200</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/simulation-driven-development/</link>
          <guid>https://pierrezemb.fr/posts/simulation-driven-development/</guid>
          <description xml:base="https://pierrezemb.fr/posts/simulation-driven-development/">&lt;p&gt;This article has been translated from my original French presentation at the upcoming Devoxx France 2025, titled &quot;&lt;a href=&quot;https:&#x2F;&#x2F;docs.google.com&#x2F;presentation&#x2F;d&#x2F;1xm4yNGnV2Oi8Lk3ZHEvg4aDMNEFieSmW06CkItCigSc&#x2F;edit?usp=sharing&quot;&gt;What if we embraced simulation-driven development?&lt;&#x2F;a&gt;&quot;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-tale-of-a-bug&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-tale-of-a-bug&quot; aria-label=&quot;Anchor link for: the-tale-of-a-bug&quot;&gt;🔗&lt;&#x2F;a&gt;The Tale of a Bug&lt;&#x2F;h2&gt;
&lt;p&gt;As a software engineer, my responsibilities include debugging distributed systems during on-call shifts. My tendency to attract peculiar issues during these shifts earned me the nickname &quot;Black Cat&quot;. Let me share a particularly memorable incident:&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;strong&gt;network partition&lt;&#x2F;strong&gt; completely disrupted a 70+ node Apache Hadoop cluster. The system was in disarray, with nodes confused about &lt;strong&gt;block replication&lt;&#x2F;strong&gt; and &lt;strong&gt;management&lt;&#x2F;strong&gt;. After the network issue was resolved, we decided to &lt;strong&gt;restart the cluster&lt;&#x2F;strong&gt;...&lt;&#x2F;p&gt;
&lt;p&gt;But it wouldn&#x27;t come back online.&lt;&#x2F;p&gt;
&lt;p&gt;The reason? The system was encountering a &lt;code&gt;NullPointerException&lt;&#x2F;code&gt; during startup due to its faulty state. The cluster was too slow to restart properly because of how severely degraded it had become after the network partition. This bug had actually been fixed in newer versions of &lt;strong&gt;HDFS&lt;&#x2F;strong&gt;, but we were running an older release.&lt;&#x2F;p&gt;
&lt;p&gt;The solution required &lt;strong&gt;patching the Hadoop codebase&lt;&#x2F;strong&gt; by &lt;strong&gt;backporting the fix&lt;&#x2F;strong&gt;, &lt;strong&gt;recompiling&lt;&#x2F;strong&gt;, and &lt;strong&gt;distributing the new jar&lt;&#x2F;strong&gt; across all nodes—not exactly what you want to be doing during an active incident. Rolling out patches to a distributed system while it&#x27;s already &quot;on fire&quot; is rarely recommended, but we had no choice.&lt;&#x2F;p&gt;
&lt;p&gt;This is exactly the kind of code that was written without production conditions in mind: the bug appeared at the worst possible moment, during recovery, when the system was most vulnerable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-development-production-gap&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-development-production-gap&quot; aria-label=&quot;Anchor link for: the-development-production-gap&quot;&gt;🔗&lt;&#x2F;a&gt;The Development-Production Gap&lt;&#x2F;h2&gt;
&lt;p&gt;This incident highlights a fundamental truth in software engineering: &lt;strong&gt;production environments are vastly different from development environments&lt;&#x2F;strong&gt;. The gap between them is comparable to the difference between passing a written driving test and actually driving on a busy highway during rush hour.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        flowchart LR
    S[&amp;quot;Your System&amp;quot;] 
    U[&amp;quot;Your Users&amp;quot;]
    W[&amp;quot;The World&amp;quot;]
    
    U --&amp;gt; S
    W --&amp;gt; S
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;In development, everything is &lt;strong&gt;controlled&lt;&#x2F;strong&gt;, &lt;strong&gt;clean&lt;&#x2F;strong&gt;, and &lt;strong&gt;predictable&lt;&#x2F;strong&gt;. In production:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Users do &lt;strong&gt;unexpected things&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Systems operate under &lt;strong&gt;pressure&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Components fail in &lt;strong&gt;complex ways&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Edge cases&lt;&#x2F;strong&gt; occur regularly&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Being on-call forces you to confront this reality. The pager is an unforgiving teacher, but is there a better way to instill a production mindset without throwing engineers into the deep end of incident response?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-testing-problem&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-testing-problem&quot; aria-label=&quot;Anchor link for: the-testing-problem&quot;&gt;🔗&lt;&#x2F;a&gt;The Testing Problem&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s consider a standard e-commerce API with multiple dimensions of variability:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;User Types: Guest, Logged-in, Premium, Business (4)&lt;&#x2F;li&gt;
&lt;li&gt;Payment Methods: Credit Card, PayPal, Apple Pay, Gift Card, Bank Transfer (5)&lt;&#x2F;li&gt;
&lt;li&gt;Delivery Options: Standard, Express, In-Store Pickup, Same-Day (4)&lt;&#x2F;li&gt;
&lt;li&gt;Promotions: Yes, No, Expired (3)&lt;&#x2F;li&gt;
&lt;li&gt;Inventory Status: In Stock, Low Stock, Out of Stock, Preorder (4)&lt;&#x2F;li&gt;
&lt;li&gt;Currency: USD, EUR, GBP, JPY (4)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Testing all possible combinations requires 4×5×4×3×4×4 = 3,840 unique test cases—and that&#x27;s just for the happy path! Add error conditions, network failures, and other edge cases, and this number explodes exponentially.&lt;&#x2F;p&gt;
&lt;p&gt;This is why comprehensive end-to-end testing is so difficult. Every new feature multiplies the complexity, and bugs often hide in rare combinations of conditions that we never thought to test.&lt;&#x2F;p&gt;
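&lt;p&gt;The multiplication above is easy to check with a few lines of Rust. This is only a minimal sketch: &lt;code&gt;total_combinations&lt;&#x2F;code&gt; is a hypothetical helper, and the three failure modes are an assumption added for illustration, not figures from a real system.&lt;&#x2F;p&gt;

```rust
/// Test cases needed to cover every combination of six independent dimensions.
fn total_combinations(option_counts: [u64; 6]) -> u64 {
    option_counts.iter().product()
}

fn main() {
    // Option counts from the list above: user types, payment methods,
    // delivery options, promotions, inventory statuses, currencies.
    let happy_path = total_combinations([4, 5, 4, 3, 4, 4]);
    println!("happy-path combinations: {happy_path}");

    // Hypothetical extra dimension (an assumption, not from the article):
    // each call either succeeds, times out, or fails with a network error.
    let failure_modes = 3;
    println!("with failure modes: {}", happy_path * failure_modes);
}
```

&lt;p&gt;Running it prints 3840 for the happy path and 11520 once that single three-valued failure dimension is added, which is how one extra dimension multiplies the whole test matrix.&lt;&#x2F;p&gt;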
&lt;h2 id=&quot;the-world-is-harsh&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-world-is-harsh&quot; aria-label=&quot;Anchor link for: the-world-is-harsh&quot;&gt;🔗&lt;&#x2F;a&gt;The World Is Harsh&lt;&#x2F;h2&gt;
&lt;p&gt;Meanwhile, the real world is even more chaotic than our test cases. Research papers like &quot;&lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;system&#x2F;files&#x2F;osdi18-alquraan.pdf&quot;&gt;An Analysis of Network-Partitioning Failures in Cloud Systems&lt;&#x2F;a&gt;&quot; (OSDI &#x27;18) and &quot;&lt;a href=&quot;https:&#x2F;&#x2F;sigops.org&#x2F;s&#x2F;conferences&#x2F;hotos&#x2F;2021&#x2F;papers&#x2F;hotos21-s11-bronson.pdf&quot;&gt;Metastable Failures in Distributed Systems&lt;&#x2F;a&gt;&quot; (HotOS &#x27;21) document just how complex failure modes can be in production.&lt;&#x2F;p&gt;
&lt;p&gt;According to a &lt;a href=&quot;https:&#x2F;&#x2F;qconlondon.com&#x2F;london-2015&#x2F;system&#x2F;files&#x2F;keynotes-slides&#x2F;2015-03%20QCon%20(john%20wilkes).pdf&quot;&gt;presentation by John Wilkes (Google) at QCon London 2015&lt;&#x2F;a&gt;, a 2,000-machine service will experience more than 10 machine crashes per day, and this is considered normal. When you operate at scale, failures become constant background noise rather than exceptional events.&lt;&#x2F;p&gt;
&lt;p&gt;And yes, your &lt;strong&gt;microservices architecture&lt;&#x2F;strong&gt; is absolutely a distributed system susceptible to these issues.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;sre-vs-swe-perspectives&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#sre-vs-swe-perspectives&quot; aria-label=&quot;Anchor link for: sre-vs-swe-perspectives&quot;&gt;🔗&lt;&#x2F;a&gt;SRE vs. SWE Perspectives&lt;&#x2F;h2&gt;
&lt;p&gt;There&#x27;s often a gap between the Software Engineer (SWE) perspective and the Site Reliability Engineer (SRE) perspective:&lt;&#x2F;p&gt;
&lt;p&gt;SWEs tend to focus on:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Development environments (which are completely different from production)&lt;&#x2F;li&gt;
&lt;li&gt;Feature implementations&lt;&#x2F;li&gt;
&lt;li&gt;Code that passes tests (but may not account for real-world complexity)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;SREs worry about:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;System interactions in production under pressure&lt;&#x2F;li&gt;
&lt;li&gt;Complex, unpredictable failure modes&lt;&#x2F;li&gt;
&lt;li&gt;Recovery mechanisms when things are already broken&lt;&#x2F;li&gt;
&lt;li&gt;Being paged at 3 AM to fix critical issues alone&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The question then becomes: &lt;strong&gt;How can we help developers gain a better understanding of production realities without subjecting them to the trial-by-fire of on-call rotations?&lt;&#x2F;strong&gt; How might we bridge this gap between development and operations, creating environments where engineers can experience production-like conditions safely, learn from failures, and build more resilient systems from the beginning?&lt;&#x2F;p&gt;
&lt;p&gt;We need to test not just our expected use cases, but the &lt;strong&gt;&quot;worst&quot; versions of both our users and the world&lt;&#x2F;strong&gt;. How do we accomplish this comprehensively?&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        flowchart LR
    S[&amp;quot;Your System&amp;quot;] 
    U[&amp;quot;Your worst Users&amp;quot;]
    W[&amp;quot;The worst World&amp;quot;]
    
    U --&amp;gt; S
    W --&amp;gt; S
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;&lt;h2 id=&quot;deterministic-simulation-testing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#deterministic-simulation-testing&quot; aria-label=&quot;Anchor link for: deterministic-simulation-testing&quot;&gt;🔗&lt;&#x2F;a&gt;Deterministic Simulation Testing&lt;&#x2F;h2&gt;
&lt;p&gt;The solution lies in a strategy that&#x27;s both robust and practical: &lt;strong&gt;Deterministic Simulation Testing (DST)&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For effective testing of complex distributed systems, we need an approach that satisfies three key requirements:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fast and debuggable testing&lt;&#x2F;strong&gt; → We achieve this with a single-threaded approach that uses a deterministic event loop, making issues perfectly reproducible&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Testing the entire system at once&lt;&#x2F;strong&gt; → By packaging everything into a single binary with simulated network interactions, we can test complex distributed behaviors without actual network infrastructure&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Robust against unknown issues&lt;&#x2F;strong&gt; → Through randomized testing with controlled entropy injection, we discover edge cases that we wouldn&#x27;t think to test explicitly&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;These three elements work together to create a powerful testing methodology that&#x27;s both practical to implement and effective at finding real-world issues.&lt;&#x2F;p&gt;
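&lt;p&gt;To make the first and third requirements more concrete, here is a minimal sketch of a single-threaded event loop whose only source of entropy is a seeded &lt;code&gt;Random&lt;&#x2F;code&gt;. Everything in it (the &lt;code&gt;DeterministicLoop&lt;&#x2F;code&gt; class, the event names, the 20% crash rate) is a made-up illustration, not the API of any real framework:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

&#x2F;&#x2F; Hypothetical sketch: names and probabilities are illustrative assumptions.
public class DeterministicLoop {
    &#x2F;&#x2F; An event scheduled at a virtual time; seq breaks ties deterministically.
    static final class Event {
        final long time;
        final long seq;
        final String name;
        Event(long time, long seq, String name) {
            this.time = time;
            this.seq = seq;
            this.name = name;
        }
    }

    &#x2F;&#x2F; Runs the simulation for a fixed number of events and returns its trace.
    static List&lt;String&gt; run(long seed, int steps) {
        Random random = new Random(seed); &#x2F;&#x2F; the only source of entropy
        PriorityQueue&lt;Event&gt; queue = new PriorityQueue&lt;&gt;(
            Comparator.comparingLong((Event e) -&gt; e.time).thenComparingLong(e -&gt; e.seq));
        List&lt;String&gt; trace = new ArrayList&lt;&gt;();
        long seq = 0;
        queue.add(new Event(0, seq++, &quot;start&quot;));
        while (!queue.isEmpty() &amp;&amp; trace.size() &lt; steps) {
            Event event = queue.poll();
            long now = event.time; &#x2F;&#x2F; virtual clock: jump straight to the next event
            trace.add(now + &quot;:&quot; + event.name);
            &#x2F;&#x2F; Schedule a follow-up after a random virtual delay, sometimes a crash.
            long delay = 1 + random.nextInt(100);
            String next = random.nextDouble() &lt; 0.2 ? &quot;crash&quot; : &quot;request&quot;;
            queue.add(new Event(now + delay, seq++, next));
        }
        return trace;
    }

    public static void main(String[] args) {
        &#x2F;&#x2F; The same seed always yields the same trace.
        System.out.println(run(42L, 5).equals(run(42L, 5))); &#x2F;&#x2F; prints &quot;true&quot;
    }
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because time is a plain counter and every random decision flows from one seed, a run that fails can be replayed exactly by passing the same seed again.&lt;&#x2F;p&gt;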
&lt;p&gt;Let&#x27;s see how we can simulate both our users and the world.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-to-simulate&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-to-simulate&quot; aria-label=&quot;Anchor link for: how-to-simulate&quot;&gt;🔗&lt;&#x2F;a&gt;How to simulate?&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;simulating-users-randomized-input-and-property-based-testing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#simulating-users-randomized-input-and-property-based-testing&quot; aria-label=&quot;Anchor link for: simulating-users-randomized-input-and-property-based-testing&quot;&gt;🔗&lt;&#x2F;a&gt;Simulating Users: Randomized Input and Property-Based Testing&lt;&#x2F;h3&gt;
&lt;p&gt;Instead of writing thousands of individual test cases, we can use &lt;strong&gt;property-based testing&lt;&#x2F;strong&gt; to generate randomized inputs and verify system properties. The technique is well established for unit tests but still relatively new for integration tests:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;enum &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;UserType &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;GUEST&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;, LOGGED_IN, PREMIUM, BUSINESS }
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;enum &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;PaymentMethod &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;CARD&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;, PAYPAL, APPLE_PAY, GIFT_CARD, BANK_TRANSFER }
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Random&lt;&#x2F;span&gt;&lt;span&gt; rand = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Random&lt;&#x2F;span&gt;&lt;span&gt;(); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; random seed
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;UserType&lt;&#x2F;span&gt;&lt;span&gt; user = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pickRandom&lt;&#x2F;span&gt;&lt;span&gt;(rand, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;UserType&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;values&lt;&#x2F;span&gt;&lt;span&gt;());
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;PaymentMethod&lt;&#x2F;span&gt;&lt;span&gt; paymentMethod = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pickRandom&lt;&#x2F;span&gt;&lt;span&gt;(rand, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;PaymentMethod&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;values&lt;&#x2F;span&gt;&lt;span&gt;());
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Rather than hardcoding test cases like:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;assertFalse&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;User&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;GUEST&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;canUse&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;SAVED_CARD&lt;&#x2F;span&gt;&lt;span&gt;));
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can write property-based assertions:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;assertEquals&lt;&#x2F;span&gt;&lt;span&gt;(user.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;isAuthenticated&lt;&#x2F;span&gt;&lt;span&gt;(), user.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;canUse&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;SAVED_CARD&lt;&#x2F;span&gt;&lt;span&gt;));
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This approach is implemented in libraries like:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Python: &lt;strong&gt;Hypothesis&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Java: &lt;strong&gt;jqwik&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Rust: &lt;strong&gt;proptest&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
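&lt;p&gt;Without reaching for one of these libraries, the core idea can be sketched with a plain seeded &lt;code&gt;Random&lt;&#x2F;code&gt;. The snippet below is only an illustration: &lt;code&gt;SavedCardProperty&lt;&#x2F;code&gt;, &lt;code&gt;isAuthenticated&lt;&#x2F;code&gt;, &lt;code&gt;canUseSavedCard&lt;&#x2F;code&gt; and &lt;code&gt;check&lt;&#x2F;code&gt; are hypothetical stand-ins for a real system under test, and a real library would add input shrinking and richer generators on top:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;import java.util.Random;

&#x2F;&#x2F; Hypothetical sketch: the business rules below are assumptions for illustration.
public class SavedCardProperty {
    enum UserType { GUEST, LOGGED_IN, PREMIUM, BUSINESS }

    &#x2F;&#x2F; Stand-ins for the system under test.
    static boolean isAuthenticated(UserType user) { return user != UserType.GUEST; }
    static boolean canUseSavedCard(UserType user) { return user != UserType.GUEST; }

    static &lt;T&gt; T pickRandom(Random rand, T[] values) {
        return values[rand.nextInt(values.length)];
    }

    &#x2F;&#x2F; Returns -1 if the property held every time, else the failing iteration.
    static int check(long seed, int iterations) {
        Random rand = new Random(seed);
        for (int i = 0; i &lt; iterations; i++) {
            UserType user = pickRandom(rand, UserType.values());
            &#x2F;&#x2F; Property: only authenticated users can use a saved card.
            if (isAuthenticated(user) != canUseSavedCard(user)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long seed = System.nanoTime(); &#x2F;&#x2F; log the seed so a failure can be replayed
        System.out.println(&quot;seed=&quot; + seed + &quot; failingIteration=&quot; + check(seed, 1000));
    }
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Logging the seed is the important design choice here: a failure found with a random seed can be reproduced later by calling &lt;code&gt;check&lt;&#x2F;code&gt; with that same value.&lt;&#x2F;p&gt;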
&lt;h3 id=&quot;simulating-the-world-injecting-chaos&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#simulating-the-world-injecting-chaos&quot; aria-label=&quot;Anchor link for: simulating-the-world-injecting-chaos&quot;&gt;🔗&lt;&#x2F;a&gt;Simulating the World: Injecting Chaos&lt;&#x2F;h3&gt;
&lt;p&gt;We also need to simulate the chaotic nature of production environments by injecting failures into:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Time (delays, timeouts, retries, race conditions)&lt;&#x2F;li&gt;
&lt;li&gt;Network (latency, failure, disconnection)&lt;&#x2F;li&gt;
&lt;li&gt;Infrastructure (disk full, service crash, replica lag)&lt;&#x2F;li&gt;
&lt;li&gt;External dependencies (slow APIs, rate limiting)&lt;&#x2F;li&gt;
&lt;li&gt;Load (varying numbers of concurrent users)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;It&#x27;s important to note that implementing full deterministic simulation requires control over every aspect of your system, from task scheduling to I&#x2F;O operations. This is significantly easier if your system is built with simulation in mind from day one. Some languages offer advantages in this area—for example, Rust&#x27;s ecosystem makes it relatively straightforward to implement custom virtual threading executors compared to modifying the JVM.&lt;&#x2F;p&gt;
&lt;p&gt;For existing codebases where a full rewrite isn&#x27;t practical, you can still benefit from simulation testing by adding layers of indirection. Even simple mocks like the HTTP client example below can help you discover how your system behaves under various failure conditions:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;HttpClientMock &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;private final &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Random &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;random &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Random&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;(); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; random seed
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;String &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;String &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;url&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;throws &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;InterruptedException &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Simulate random chance of returning an error
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;(random.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;nextDouble&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;() &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0.2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;) {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;HTTP 500 Internal Server Error&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;        }
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt; delay &lt;&#x2F;span&gt;&lt;span&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt; random.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;nextInt&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;500&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Simulate 0–499ms latency
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Thread&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;sleep&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;(delay);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;HTTP 200 OK&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;    }
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
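&lt;p&gt;The mock above draws from an unseeded &lt;code&gt;Random&lt;&#x2F;code&gt; and really sleeps, so a failing run is slow and hard to reproduce. One possible refinement, sketched here under assumptions, is to pass the seed in and advance a virtual clock instead of sleeping. The names &lt;code&gt;SeededHttpClientMock&lt;&#x2F;code&gt;, &lt;code&gt;elapsedMs&lt;&#x2F;code&gt; and &lt;code&gt;replay&lt;&#x2F;code&gt;, as well as the URL, are hypothetical:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;import java.util.Random;

&#x2F;&#x2F; Hypothetical sketch: a seeded variant of the mock with simulated time.
public class SeededHttpClientMock {
    private final Random random;
    private long virtualClockMs = 0; &#x2F;&#x2F; simulated time, no real sleeping

    SeededHttpClientMock(long seed) {
        this.random = new Random(seed);
    }

    String get(String url) {
        &#x2F;&#x2F; Same 20% error rate as the original mock.
        if (random.nextDouble() &lt; 0.2) {
            return &quot;HTTP 500 Internal Server Error&quot;;
        }
        virtualClockMs += random.nextInt(500); &#x2F;&#x2F; 0 to 499 ms of simulated latency
        return &quot;HTTP 200 OK&quot;;
    }

    long elapsedMs() {
        return virtualClockMs;
    }

    &#x2F;&#x2F; Issues a fixed number of calls and returns the responses plus elapsed time.
    static String replay(long seed, int calls) {
        SeededHttpClientMock client = new SeededHttpClientMock(seed);
        StringBuilder log = new StringBuilder();
        for (int i = 0; i &lt; calls; i++) {
            log.append(client.get(&quot;http:&#x2F;&#x2F;example.invalid&#x2F;health&quot;)).append(&#x27;\n&#x27;);
        }
        return log.append(client.elapsedMs()).toString();
    }

    public static void main(String[] args) {
        &#x2F;&#x2F; Replaying a seed reproduces the exact same sequence of responses.
        System.out.println(replay(99L, 20).equals(replay(99L, 20))); &#x2F;&#x2F; prints &quot;true&quot;
    }
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With this shape, a test that fails for a given seed can be rerun locally with that seed and will see the same errors and latencies in the same order.&lt;&#x2F;p&gt;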
&lt;h2 id=&quot;who-uses-dst&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#who-uses-dst&quot; aria-label=&quot;Anchor link for: who-uses-dst&quot;&gt;🔗&lt;&#x2F;a&gt;Who Uses DST?&lt;&#x2F;h2&gt;
&lt;p&gt;DST is still far from mainstream, but the list of companies and projects using it keeps growing:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Clever Cloud&lt;&#x2F;li&gt;
&lt;li&gt;TigerBeetle&lt;&#x2F;li&gt;
&lt;li&gt;Resonate&lt;&#x2F;li&gt;
&lt;li&gt;RisingWave&lt;&#x2F;li&gt;
&lt;li&gt;Sync @ Dropbox&lt;&#x2F;li&gt;
&lt;li&gt;sled.rs&lt;&#x2F;li&gt;
&lt;li&gt;Kafka’s KRaft&lt;&#x2F;li&gt;
&lt;li&gt;Astradot&lt;&#x2F;li&gt;
&lt;li&gt;Polar Signals&lt;&#x2F;li&gt;
&lt;li&gt;AWS&lt;&#x2F;li&gt;
&lt;li&gt;Antithesis&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;dst-at-clever-cloud&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#dst-at-clever-cloud&quot; aria-label=&quot;Anchor link for: dst-at-clever-cloud&quot;&gt;🔗&lt;&#x2F;a&gt;DST at Clever Cloud&lt;&#x2F;h3&gt;
&lt;p&gt;At Clever Cloud, we&#x27;re implementing a multi-tenant, multi-model distributed database heavily relying on FoundationDB. While we haven&#x27;t developed our own deterministic simulation testing framework yet, we leverage &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.foundationdb.org&#x2F;&quot;&gt;FoundationDB&lt;&#x2F;a&gt;&#x27;s built-in simulation by injecting custom workloads.&lt;&#x2F;strong&gt; This approach is core to developing our first serverless product, &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;blog&#x2F;features&#x2F;2024&#x2F;06&#x2F;11&#x2F;materia-kv-our-easy-to-use-serverless-key-value-database-is-available-to-all&#x2F;&quot;&gt;Materia KV&lt;&#x2F;a&gt;. The simulations FoundationDB provides include:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Random network partitions&lt;&#x2F;li&gt;
&lt;li&gt;Machine reboots&lt;&#x2F;li&gt;
&lt;li&gt;Concurrent chaos events, like shuffling the actual data disk between 2 nodes&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Our simulation-driven development workflow runs simulations:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;In CI loops&lt;&#x2F;li&gt;
&lt;li&gt;Continuously in the cloud&lt;&#x2F;li&gt;
&lt;li&gt;With 30 minutes of simulation equating to roughly 24 hours of chaos testing&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;When we find a faulty seed, we can replay it locally, providing a superpower for debugging complex distributed systems issues.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;benefits-for-developer-education&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#benefits-for-developer-education&quot; aria-label=&quot;Anchor link for: benefits-for-developer-education&quot;&gt;🔗&lt;&#x2F;a&gt;Benefits for Developer Education&lt;&#x2F;h3&gt;
&lt;p&gt;Deterministic simulation testing doesn&#x27;t just help find bugs—it helps developers grow. By working with simulated but realistic failure scenarios, developers build intuition for how distributed systems behave under stress without having to experience painful on-call incidents.&lt;&#x2F;p&gt;
&lt;p&gt;Moreover, deterministic simulation testing has instilled a &lt;strong&gt;deep trust in our software&lt;&#x2F;strong&gt;, as it is tested under conditions even more challenging than those encountered in production. This confidence has been crucial for us.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;🔗&lt;&#x2F;a&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;The gap between development and production is real and significant. Traditional testing approaches can&#x27;t scale to cover all the possible combinations of user behavior and world events that our systems will encounter.&lt;&#x2F;p&gt;
&lt;p&gt;Deterministic simulation testing offers a powerful alternative that allows us to test complex distributed systems more thoroughly, find bugs before they impact users, and train developers to build more resilient systems.&lt;&#x2F;p&gt;
&lt;p&gt;By embracing simulation-driven development, we can build software that better handles the chaotic reality of production environments—and maybe reduce those 3 AM pages that give engineers like me unfortunate nicknames.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Want to learn more? Check out my &lt;a href=&quot;&#x2F;posts&#x2F;learn-about-dst&#x2F;&quot;&gt;curated list of resources on deterministic simulation testing&lt;&#x2F;a&gt;, which includes articles, talks, and implementation examples.&lt;&#x2F;p&gt;
&lt;p&gt;Feel free to reach out with any questions or to share your experiences with simulation testing. You can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; or through my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">distributed</category>
          <category domain="tag">testing</category>
          <category domain="tag">reliability</category>
          <category domain="tag">simulation</category>
          <category domain="tag">deterministic</category>
      </item>
      <item>
          <title>So, You Want to Learn More About Deterministic Simulation Testing?</title>
          <pubDate>Fri, 11 Apr 2025 00:00:00 +0200</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/learn-about-dst/</link>
          <guid>https://pierrezemb.fr/posts/learn-about-dst/</guid>
          <description xml:base="https://pierrezemb.fr/posts/learn-about-dst/">&lt;p&gt;I recently attended &lt;a href=&quot;https:&#x2F;&#x2F;bugbash.antithesis.com&#x2F;&quot;&gt;BugBash 2025&lt;&#x2F;a&gt;, a software reliability conference organized by &lt;a href=&quot;https:&#x2F;&#x2F;antithesis.com&quot;&gt;Antithesis&lt;&#x2F;a&gt; in Washington, D.C. on April 3-4, 2025. The conference brought together industry experts like Kyle Kingsbury, Ankush Desai, and Mitchell Hashimoto to discuss various aspects of building reliable software, with deterministic simulation testing a significant focus of many sessions and discussions.&lt;&#x2F;p&gt;
&lt;p&gt;One of the highlights for me was having the chance to talk with the Antithesis team and meet some of the original creators of FoundationDB.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-deterministic-simulation-testing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-is-deterministic-simulation-testing&quot; aria-label=&quot;Anchor link for: what-is-deterministic-simulation-testing&quot;&gt;🔗&lt;&#x2F;a&gt;What is Deterministic Simulation Testing?&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;&#x2F;strong&gt; For a deeper dive into this concept and its practical applications, check out my article on &lt;a href=&quot;&#x2F;posts&#x2F;simulation-driven-development&#x2F;&quot;&gt;What if we embraced simulation-driven development?&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The best description of DST I&#x27;ve found is on &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;testing.html&quot;&gt;FoundationDB&#x27;s testing page&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The major goal of Simulation is to make sure that we find and diagnose issues in simulation rather than the real world. Simulation runs tens of thousands of simulations every night, each one simulating large numbers of component failures. Based on the volume of tests that we run and the increased intensity of the failures in our scenarios, we estimate that we have run the equivalent of roughly one trillion CPU-hours of simulation on FoundationDB.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Simulation is able to conduct a deterministic simulation of an entire FoundationDB cluster within a single-threaded process. Determinism is crucial in that it allows perfect repeatability of a simulated run, facilitating controlled experiments to home in on issues. The simulation steps through time, synchronized across the system, representing a larger amount of real time in a smaller amount of simulated time. In practice, our simulations usually have about a 10-1 factor of real-to-simulated time, which is advantageous for the efficiency of testing.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We use Simulation to simulate failures modes at the network, machine, and datacenter levels, including connection failures, degradation of machine performance, machine shutdowns or reboots, machines coming back from the dead, etc. We stress-test all of these failure modes, failing machines at very short intervals, inducing unusually severe loads, and delaying communications channels.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Simulation&#x27;s success has surpassed our expectation and has been vital to our engineering team. It seems unlikely that we would have been able to build FoundationDB without this technology.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;After years of operating many Apache-oriented distributed systems, I can confidently say that FoundationDB stands apart in its remarkable robustness—I&#x27;ve rarely been paged for it, which speaks volumes about its stability in production. At &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;&quot;&gt;Clever Cloud&lt;&#x2F;a&gt;, we&#x27;ve even leveraged FoundationDB&#x27;s simulation framework during our application development by &lt;a href=&quot;&#x2F;posts&#x2F;providing-safety-fdb-rs&#x2F;#user-safety&quot;&gt;embedding Rust code inside FDB&#x27;s simulation environment&lt;&#x2F;a&gt;, allowing us to inherit the same reliability guarantees for our custom applications.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tl-dr&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#tl-dr&quot; aria-label=&quot;Anchor link for: tl-dr&quot;&gt;🔗&lt;&#x2F;a&gt;TL;DR&lt;&#x2F;h2&gt;
&lt;p&gt;If you only have limited time, here are the four must-watch videos that will give you the best introduction to deterministic simulation testing:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=4fFDFbi3toc&quot;&gt;Will Wilson: Testing Distributed Systems with Deterministic Simulation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=fFSPwJFXVlw&quot;&gt;Will Wilson: Autonomous Testing and the Future of Software Development&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=m3HwXlQPCEU&quot;&gt;Will Wilson: Testing a Single-Node, Single Threaded, Distributed System Written in 1985&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=eZ1mmqlq-mY&quot;&gt;Will Wilson: Let&#x27;s all write good software&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;(Yes, it seems Will Wilson has a monopoly on great introductory talks on the topic. Having had the chance to meet him, I can personally vouch that he is not a deterministic algorithm for generating insightful presentations, though the sheer quality of his talks might make you wonder.)&lt;&#x2F;p&gt;
&lt;p&gt;A curated feed of recent articles and blog posts about DST can be found at &lt;a href=&quot;https:&#x2F;&#x2F;deterministic-simulation-testing.github.io&#x2F;planet-dst&#x2F;&quot;&gt;Planet DST&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;essential-reading&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#essential-reading&quot; aria-label=&quot;Anchor link for: essential-reading&quot;&gt;🔗&lt;&#x2F;a&gt;Essential Reading&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;foundations-concepts&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#foundations-concepts&quot; aria-label=&quot;Anchor link for: foundations-concepts&quot;&gt;🔗&lt;&#x2F;a&gt;Foundations &amp;amp; Concepts&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.cockroachlabs.com&#x2F;blog&#x2F;demonic-nondeterminism&#x2F;&quot;&gt;CockroachLabs: Demonic Nondeterminism&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;transactional.blog&#x2F;simulation&#x2F;buggify&quot;&gt;Alex Miller: BUGGIFY&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.tigerbeetle.com&#x2F;concepts&#x2F;safety&#x2F;#software-reliability&quot;&gt;TigerBeetle: Building Reliable Systems&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;journal.resonatehq.io&#x2F;p&#x2F;deterministic-simulation-testing&quot;&gt;Dominik Tornow: Deterministic Simulation Testing&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;poorlydefinedbehaviour.github.io&#x2F;posts&#x2F;deterministic_simulation_testing&#x2F;&quot;&gt;Poorly Defined Behaviour: Deterministic Simulation Testing&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;notes.eatonphil.com&#x2F;2024-08-20-deterministic-simulation-testing.html&quot;&gt;Phil Eaton: What&#x27;s the big deal about Deterministic Simulation Testing?&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;queue.acm.org&#x2F;detail.cfm?ref=rss&amp;amp;id=3712057&quot;&gt;AWS: Systems Correctness Practices&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;language-specific-implementations&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#language-specific-implementations&quot; aria-label=&quot;Anchor link for: language-specific-implementations&quot;&gt;🔗&lt;&#x2F;a&gt;Language-Specific Implementations&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;turmoil&#x2F;latest&#x2F;turmoil&#x2F;&quot;&gt;Turmoil: Network Simulation Framework for Rust&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;madsim&#x2F;latest&#x2F;madsim&#x2F;&quot;&gt;MadSim: Deterministic Simulation Testing Library for Rust&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;sled.rs&#x2F;simulation.html&quot;&gt;Sled: Simulation Testing&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;s2.dev&#x2F;blog&#x2F;dst&quot;&gt;S2: Deterministic simulation testing for async Rust&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.polarsignals.com&#x2F;blog&#x2F;posts&#x2F;2024&#x2F;05&#x2F;28&#x2F;mostly-dst-in-go&quot;&gt;Polar Signals: Mostly-DST in Go&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;real-world-case-studies&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#real-world-case-studies&quot; aria-label=&quot;Anchor link for: real-world-case-studies&quot;&gt;🔗&lt;&#x2F;a&gt;Real-World Case Studies&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;tigerbeetle.com&#x2F;blog&#x2F;2022-11-23-a-friendly-abstraction-over-iouring-and-kqueue&#x2F;&quot;&gt;TigerBeetle: A Friendly Abstraction Over io_uring and kqueue&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;dropbox.tech&#x2F;infrastructure&#x2F;-testing-our-new-sync-engine&quot;&gt;Dropbox: Testing Our New Sync Engine&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;tigerbeetle.com&#x2F;blog&#x2F;2023-07-11-we-put-a-distributed-database-in-the-browser&#x2F;&quot;&gt;TigerBeetle: We Put a Distributed Database in the Browser&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;antithesis.com&#x2F;solutions&#x2F;case_studies&#x2F;&quot;&gt;Antithesis: Case Studies&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;risingwave.com&#x2F;blog&#x2F;deterministic-simulation-a-new-era-of-distributed-system-testing&#x2F;&quot;&gt;RisingWave: A New Era of Distributed System Testing&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;risingwave.com&#x2F;blog&#x2F;applying-deterministic-simulation-the-risingwave-story-part-2-of-2&#x2F;&quot;&gt;RisingWave: The RisingWave Story&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.warpstream.com&#x2F;blog&#x2F;deterministic-simulation-testing-for-our-entire-saas&quot;&gt;WarpStream: DST for Our Entire SaaS&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;antithesis.com&#x2F;blog&#x2F;sdtalk&#x2F;&quot;&gt;Antithesis: How Antithesis finds bugs (with help from the Super Mario Bros.)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;talks&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#talks&quot; aria-label=&quot;Anchor link for: talks&quot;&gt;🔗&lt;&#x2F;a&gt;Talks&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=IaB8jvjW0kk&quot;&gt;Ben Collins: FoundationDB Testing: Past &amp;amp; Present&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=rvHd4Y76-fs&quot;&gt;Marc Brooker: AWS re:Invent 2024 - Try again: The tools and techniques behind resilient systems (ARC403)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=6y8Ga3oogLY&quot;&gt;TigerBeetle: Episode 064: Two In One, New Request Protocol and VOPR Tutorial&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosdem.org&#x2F;2025&#x2F;schedule&#x2F;event&#x2F;fosdem-2025-4279-squashing-the-heisenbug-with-deterministic-simulation-testing&#x2F;&quot;&gt;FOSDEM 2025: Squashing the Heisenbug with Deterministic Simulation Testing&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=024c8OwR4JM&quot;&gt;BugBash 2025: Lawrie Green - How to succeed in software testing without really trying&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Have I missed any important resources on Deterministic Simulation Testing? This field is rapidly evolving, and I&#x27;m always looking to expand this collection. If you know of any articles, talks, or tools related to DST that should be included here, please reach out! I&#x27;d love to hear about your experiences with deterministic testing as well.&lt;&#x2F;p&gt;
&lt;p&gt;Please feel free to react to this article. You can reach me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; or have a look at my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">distributed</category>
          <category domain="tag">testing</category>
          <category domain="tag">reliability</category>
          <category domain="tag">simulation</category>
          <category domain="tag">deterministic</category>
      </item>
      <item>
          <title>Key design tip: reverse number scanning in ordered key-value stores</title>
          <pubDate>Thu, 27 Mar 2025 05:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/reverse-number-scanning/</link>
          <guid>https://pierrezemb.fr/posts/reverse-number-scanning/</guid>
          <description xml:base="https://pierrezemb.fr/posts/reverse-number-scanning/">&lt;p&gt;Ordered key-value stores like HBase, FoundationDB or RocksDB store keys in lexicographical order. When getting the latest version or most recent events, this ordering often requires scanning through all values in reverse order. While this works, it can become a performance bottleneck, especially in distributed systems. Let&#x27;s explore a simple yet powerful optimization technique that I&#x27;ve been using recently 🚀&lt;&#x2F;p&gt;
&lt;h2 id=&quot;key-design-in-key-value-stores&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#key-design-in-key-value-stores&quot; aria-label=&quot;Anchor link for: key-design-in-key-value-stores&quot;&gt;🔗&lt;&#x2F;a&gt;Key design in Key-value stores&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s look at this using a tuple structure of &lt;code&gt;(key, number)&lt;&#x2F;code&gt;. This could represent a document version, a timestamp, or any numeric identifier:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 1)
&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 2)
&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;my-key-2&amp;quot;, 1)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In ordered key-value stores, keys are stored in &lt;code&gt;lexicographical order&lt;&#x2F;code&gt;. This works well when you want to scan from lowest to highest values, but becomes inefficient when you need the opposite order. For example, to find the highest number for a key, you need to scan through all values:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 1)
&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 2)
&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 3)
&lt;&#x2F;span&gt;&lt;span&gt;...
&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 99)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You could scan in reverse mode, but you would then also lose the ascending order of the first element of the tuple (the &quot;my-key-1&quot; prefix).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;reverse-number-scanning&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#reverse-number-scanning&quot; aria-label=&quot;Anchor link for: reverse-number-scanning&quot;&gt;🔗&lt;&#x2F;a&gt;Reverse Number Scanning&lt;&#x2F;h2&gt;
&lt;p&gt;By reversing the numbers using a simple subtraction from the maximum possible value (e.g., &lt;code&gt;Long.MAX_VALUE&lt;&#x2F;code&gt; in Java), we can optimize the scanning process:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;long&lt;&#x2F;span&gt;&lt;span&gt; reversedNumber = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Long&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;MAX_VALUE &lt;&#x2F;span&gt;&lt;span&gt;- number;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This transforms our data into:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 9223372036854775804) &#x2F;&#x2F; number 3
&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 9223372036854775805) &#x2F;&#x2F; number 2
&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;my-key-1&amp;quot;, 9223372036854775806) &#x2F;&#x2F; number 1
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now, the highest number (which appears first in the reversed order) can be found efficiently, allowing us to stop after finding the first match.&lt;&#x2F;p&gt;
&lt;p&gt;This technique is particularly useful in systems dealing with time-series data, versioned documents, or any scenario requiring efficient retrieval of the most recent or highest-valued items.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;number 1: 9223372036854775806
&lt;&#x2F;span&gt;&lt;span&gt;number 2: 9223372036854775805
&lt;&#x2F;span&gt;&lt;span&gt;number 3: 9223372036854775804
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Reversing back is straightforward
&lt;&#x2F;span&gt;&lt;span&gt;Long.MAX_VALUE - 9223372036854775806 = 1
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
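&lt;p&gt;The same idea can be sketched end to end in Python. This is a minimal illustration, not tied to any particular store: it assumes keys are encoded as a prefix, a zero byte, and an 8-byte big-endian integer, and the &lt;code&gt;encode_key&lt;&#x2F;code&gt; and &lt;code&gt;decode_number&lt;&#x2F;code&gt; helpers are hypothetical names. Sorting the encoded keys stands in for the ordered store:&lt;&#x2F;p&gt;

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def encode_key(prefix: str, number: int) -> bytes:
    """Hypothetical helper: prefix, a zero-byte separator, then the reversed number."""
    reversed_number = LONG_MAX - number
    # Big-endian encoding keeps byte-wise (lexicographical) order equal to
    # numeric order for non-negative values.
    return prefix.encode("utf-8") + b"\x00" + struct.pack(">Q", reversed_number)


def decode_number(key: bytes) -> int:
    """Recover the original number from the last 8 bytes of the key."""
    (reversed_number,) = struct.unpack(">Q", key[-8:])
    return LONG_MAX - reversed_number


# Simulate an ordered key-value store by sorting the encoded keys.
keys = sorted(
    encode_key(prefix, n)
    for prefix, n in [("my-key-1", 1), ("my-key-1", 2), ("my-key-1", 3), ("my-key-2", 1)]
)

# A forward scan on the "my-key-1" prefix can stop at the first key it meets.
first = next(k for k in keys if k.startswith(b"my-key-1\x00"))
latest = decode_number(first)
```

&lt;p&gt;Here &lt;code&gt;latest&lt;&#x2F;code&gt; holds the highest number stored for &lt;code&gt;my-key-1&lt;&#x2F;code&gt;, while the prefixes themselves stay in ascending order.&lt;&#x2F;p&gt;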
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article. I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; or &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;pierrezemb.fr&quot;&gt;Bluesky&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">performance</category>
          <category domain="tag">optimization</category>
          <category domain="tag">storage</category>
          <category domain="tag">distributed</category>
      </item>
      <item>
          <title>Debugging FoundationDB&#x27;s Data Distributor</title>
          <pubDate>Fri, 07 Mar 2025 00:00:00 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/fdb-data-distributor/</link>
          <guid>https://pierrezemb.fr/posts/fdb-data-distributor/</guid>
          <description xml:base="https://pierrezemb.fr/posts/fdb-data-distributor/">&lt;p&gt;FoundationDB is a powerful, distributed database designed to handle massive workloads with high consistency guarantees. At its core, the &lt;strong&gt;Data Distributor&lt;&#x2F;strong&gt; plays a critical role in determining how shards are distributed across the cluster to maintain load balance and resilience.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we dive into the &lt;strong&gt;Data Distributor&#x27;s&lt;&#x2F;strong&gt; internals, along with practical lessons we learned during an outage.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-the-data-distributor&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-is-the-data-distributor&quot; aria-label=&quot;Anchor link for: what-is-the-data-distributor&quot;&gt;🔗&lt;&#x2F;a&gt;What is the Data Distributor?&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;strong&gt;Data Distributor (DD)&lt;&#x2F;strong&gt; is &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;architecture.html&quot;&gt;a subsystem&lt;&#x2F;a&gt; responsible for efficiently placing and relocating shards (ranges of keys) in a FoundationDB cluster. Its key goals are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Balancing load across servers&lt;&#x2F;li&gt;
&lt;li&gt;Handling failures by redistributing data&lt;&#x2F;li&gt;
&lt;li&gt;Ensuring optimal data placement for performance and reliability&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;data-distributor-wording&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#data-distributor-wording&quot; aria-label=&quot;Anchor link for: data-distributor-wording&quot;&gt;🔗&lt;&#x2F;a&gt;Data Distributor wording&lt;&#x2F;h2&gt;
&lt;p&gt;The architecture and behavior of the &lt;strong&gt;Data Distributor&lt;&#x2F;strong&gt; are documented in the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;release-7.3&#x2F;design&#x2F;data-distributor-internals.md&quot;&gt;official design document&lt;&#x2F;a&gt;, which introduces the following concepts:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Machine&lt;&#x2F;strong&gt;: A failure domain in FoundationDB, often considered equivalent to a rack.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Shard&lt;&#x2F;strong&gt;: A range of key-values—essentially a contiguous block of the database keyspace.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Server Team&lt;&#x2F;strong&gt;: A group of &lt;code&gt;k&lt;&#x2F;code&gt; processes (where &lt;code&gt;k&lt;&#x2F;code&gt; is the replication factor) hosting the same shard.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Machine Team&lt;&#x2F;strong&gt;: A collection of &lt;code&gt;k&lt;&#x2F;code&gt; machines, ensuring fault isolation for redundancy.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The term &quot;machine&quot; in FoundationDB’s documentation &lt;strong&gt;often translates better as &quot;rack&quot;&lt;&#x2F;strong&gt; in practice. Thinking in racks makes the Machine Team&#x27;s role clearer: it ensures fault isolation by storing copies of data in different racks.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;debug-dd-with-status-json&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#debug-dd-with-status-json&quot; aria-label=&quot;Anchor link for: debug-dd-with-status-json&quot;&gt;🔗&lt;&#x2F;a&gt;Debug DD with &lt;code&gt;status json&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Your first entry point should be the &lt;code&gt;team_trackers&lt;&#x2F;code&gt; key in the &lt;code&gt;status json&lt;&#x2F;code&gt; output. The JSON should contain enough information for basic monitoring:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;json&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-json &quot;&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;team_trackers&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: [
&lt;&#x2F;span&gt;&lt;span&gt;  {
&lt;&#x2F;span&gt;&lt;span&gt;    &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;primary&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;,
&lt;&#x2F;span&gt;&lt;span&gt;    &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;unhealthy_servers&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;,
&lt;&#x2F;span&gt;&lt;span&gt;    &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;state&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: {
&lt;&#x2F;span&gt;&lt;span&gt;      &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;healthy&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;,
&lt;&#x2F;span&gt;&lt;span&gt;      &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;name&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;healthy_rebalancing&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;  }
&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
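&lt;p&gt;For basic monitoring, this fragment can be checked with a few lines of Python. The sketch below is only an illustration: it works on a sample that mirrors the snippet above (wrapped in braces so that it is valid JSON), and &lt;code&gt;unhealthy_trackers&lt;&#x2F;code&gt; is a hypothetical helper, not something shipped with FoundationDB:&lt;&#x2F;p&gt;

```python
import json

# Sample mirroring the snippet above, wrapped in braces to be valid JSON.
SAMPLE = """
{
  "team_trackers": [
    {
      "primary": true,
      "unhealthy_servers": 0,
      "state": {"healthy": true, "name": "healthy_rebalancing"}
    }
  ]
}
"""


def unhealthy_trackers(fragment: dict) -> list:
    """Return the trackers reporting unhealthy servers or a non-healthy state."""
    return [
        tracker
        for tracker in fragment.get("team_trackers", [])
        if tracker["unhealthy_servers"] > 0 or not tracker["state"]["healthy"]
    ]


fragment = json.loads(SAMPLE)
problems = unhealthy_trackers(fragment)
# One (primary, state name) pair per tracker, handy for a dashboard or an alert.
summary = [(t["primary"], t["state"]["name"]) for t in fragment["team_trackers"]]
```

&lt;p&gt;An empty &lt;code&gt;problems&lt;&#x2F;code&gt; list means every tracker in the fragment reports a healthy state.&lt;&#x2F;p&gt;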
&lt;h2 id=&quot;debug-dd-with-trace-events&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#debug-dd-with-trace-events&quot; aria-label=&quot;Anchor link for: debug-dd-with-trace-events&quot;&gt;🔗&lt;&#x2F;a&gt;Debug DD with Trace events&lt;&#x2F;h2&gt;
&lt;p&gt;FoundationDB provides a robust tracing system where each process generates detailed events in either XML or JSON format. To troubleshoot the &lt;strong&gt;Data Distributor&lt;&#x2F;strong&gt;, you first need to locate the process that has been elected to run it. From there, trace events can be analyzed to understand shard movements, priorities, and failures.&lt;&#x2F;p&gt;
&lt;p&gt;One particularly important attribute in these events is the &lt;code&gt;Priority&lt;&#x2F;code&gt; field. This field determines the precedence of shard placement or redistribution tasks:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;init&lt;&#x2F;span&gt;&lt;span&gt;( PRIORITY_RECOVER_MOVE, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;110 &lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;init&lt;&#x2F;span&gt;&lt;span&gt;( PRIORITY_REBALANCE_UNDERUTILIZED_TEAM, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;120 &lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;init&lt;&#x2F;span&gt;&lt;span&gt;( PRIORITY_REBALANCE_OVERUTILIZED_TEAM, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;122 &lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;init&lt;&#x2F;span&gt;&lt;span&gt;( PRIORITY_TEAM_UNHEALTHY, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;700&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;init&lt;&#x2F;span&gt;&lt;span&gt;( PRIORITY_SPLIT_SHARD, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;950 &lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A full list of defined priorities can be found in the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;release-7.3&#x2F;fdbclient&#x2F;ServerKnobs.cpp#L155-L173&quot;&gt;Knobs file&lt;&#x2F;a&gt;, providing useful insights into how tasks are scheduled.&lt;&#x2F;p&gt;
&lt;p&gt;EDIT: Yes, &lt;code&gt;SPLIT_SHARD&lt;&#x2F;code&gt; has a higher priority! See &lt;a href=&quot;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;alexmillerdb.bsky.social&#x2F;post&#x2F;3ljsqqvfslc24&quot;&gt;https:&#x2F;&#x2F;bsky.app&#x2F;profile&#x2F;alexmillerdb.bsky.social&#x2F;post&#x2F;3ljsqqvfslc24&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
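&lt;p&gt;When reading traces, it can help to translate the numeric &lt;code&gt;Priority&lt;&#x2F;code&gt; back to its knob name. The lookup below is a small sketch that only covers the five values quoted above (the Knobs file has the full list), and &lt;code&gt;priority_name&lt;&#x2F;code&gt; is a hypothetical helper:&lt;&#x2F;p&gt;

```python
# Values copied from the excerpt above; the Knobs file defines more of them.
PRIORITIES = {
    110: "PRIORITY_RECOVER_MOVE",
    120: "PRIORITY_REBALANCE_UNDERUTILIZED_TEAM",
    122: "PRIORITY_REBALANCE_OVERUTILIZED_TEAM",
    700: "PRIORITY_TEAM_UNHEALTHY",
    950: "PRIORITY_SPLIT_SHARD",
}


def priority_name(event: dict) -> str:
    """Hypothetical helper: map a trace event's Priority field to a knob name."""
    value = int(event["Priority"])  # the JSON examples carry the value as a string
    return PRIORITIES.get(value, f"unknown priority {value}")


rebalance = priority_name({"Type": "RelocateShard", "Priority": "120"})
split = priority_name({"Type": "ServerTeamPriorityChange", "Priority": "950"})
```

&lt;p&gt;Any value missing from the table is reported as unknown rather than guessed, which is a reminder to extend the table from the Knobs file.&lt;&#x2F;p&gt;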
&lt;h3 id=&quot;serverteaminfo-event&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#serverteaminfo-event&quot; aria-label=&quot;Anchor link for: serverteaminfo-event&quot;&gt;🔗&lt;&#x2F;a&gt;&lt;code&gt;ServerTeamInfo&lt;&#x2F;code&gt; Event&lt;&#x2F;h3&gt;
&lt;p&gt;Understanding the state of server teams is essential since the Data Distributor schedules data movements based on real-time metrics. The &lt;code&gt;fdbcli&lt;&#x2F;code&gt; command &lt;code&gt;triggerddteaminfolog&lt;&#x2F;code&gt; triggers informative logs by invoking &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;release-7.3&#x2F;fdbserver&#x2F;DDTeamCollection.actor.cpp#L3425&quot;&gt;printSnapshotTeamsInfo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;json&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-json &quot;&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Type&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;ServerTeamInfo&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Priority&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;709&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Healthy&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;TeamSize&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;MemberIDs&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;5a69... 5fc1... 8718...&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;LoadBytes&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;1135157527&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;MinAvailableSpaceRatio&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;0.94108&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;serverteamprioritychange-event&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#serverteamprioritychange-event&quot; aria-label=&quot;Anchor link for: serverteamprioritychange-event&quot;&gt;🔗&lt;&#x2F;a&gt;&lt;code&gt;ServerTeamPriorityChange&lt;&#x2F;code&gt; Event&lt;&#x2F;h3&gt;
&lt;p&gt;This event is logged when server team priorities change, often indicating server failures or rebalancing actions.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;json&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-json &quot;&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Type&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;ServerTeamPriorityChange&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Priority&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;950&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;TeamID&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;e9b362decbafbd81&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;relocateshard-event&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#relocateshard-event&quot; aria-label=&quot;Anchor link for: relocateshard-event&quot;&gt;🔗&lt;&#x2F;a&gt;&lt;code&gt;RelocateShard&lt;&#x2F;code&gt; Event&lt;&#x2F;h3&gt;
&lt;p&gt;This event tracks shard movement between teams to maintain balance.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;json&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-json &quot;&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Type&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;RelocateShard&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Priority&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;120&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; PRIORITY_REBALANCE_UNDERUTILIZED_TEAM
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;RelocationID&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;3f1290654949771e&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Again, the most useful field is the priority, which indicates why the shard is being relocated.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;valleyfiller-and-mountainchopper-mechanisms&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#valleyfiller-and-mountainchopper-mechanisms&quot; aria-label=&quot;Anchor link for: valleyfiller-and-mountainchopper-mechanisms&quot;&gt;🔗&lt;&#x2F;a&gt;&quot;ValleyFiller&quot; and &quot;MountainChopper&quot; Mechanisms&lt;&#x2F;h3&gt;
&lt;p&gt;To optimize shard placement, FoundationDB employs two balancing strategies:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ValleyFiller&lt;&#x2F;strong&gt;: Fills underutilized servers (the &lt;strong&gt;valleys&lt;&#x2F;strong&gt;) with data to balance the load.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;MountainChopper&lt;&#x2F;strong&gt;: Redistributes shards from overutilized servers (the &lt;strong&gt;mountains&lt;&#x2F;strong&gt;) to spread the load evenly.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Both events include a &lt;code&gt;SourceTeam&lt;&#x2F;code&gt; and a &lt;code&gt;DestTeam&lt;&#x2F;code&gt; field that you can look up in other traces:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;json&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-json &quot;&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Type&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;BgDDValleyFiller&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;QueuedRelocations&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;SourceTeam&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;TeamID 95819f0d3d7ea40d&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;DestTeam&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;TeamID 0817e6fe3135e6f6&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;ShardBytes&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;398281250&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre data-lang=&quot;json&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-json &quot;&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Type&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;BgDDMountainChopper&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;QueuedRelocations&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;SourceTeam&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;TeamID 95819f0d3d7ea40d&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;DestTeam&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;TeamID e17dcecd86547e09&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;ShardBytes&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;308000000&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
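&lt;p&gt;To get a feeling for how much data the two mechanisms move, the events can be aggregated per source team. The following sketch assumes JSON traces with one event per line, reuses the sample values shown above, and introduces a hypothetical &lt;code&gt;bytes_moved_per_source&lt;&#x2F;code&gt; helper:&lt;&#x2F;p&gt;

```python
import json
from collections import defaultdict

# Sample lines shaped like the events above (assumed: one JSON object per line).
TRACE_LINES = [
    '{"Type": "BgDDValleyFiller", "QueuedRelocations": "0", '
    '"SourceTeam": "TeamID 95819f0d3d7ea40d", '
    '"DestTeam": "TeamID 0817e6fe3135e6f6", "ShardBytes": "398281250"}',
    '{"Type": "BgDDMountainChopper", "QueuedRelocations": "0", '
    '"SourceTeam": "TeamID 95819f0d3d7ea40d", '
    '"DestTeam": "TeamID e17dcecd86547e09", "ShardBytes": "308000000"}',
    '{"Type": "RelocateShard", "Priority": "120", "RelocationID": "3f1290654949771e"}',
]

REBALANCE_TYPES = {"BgDDValleyFiller", "BgDDMountainChopper"}


def bytes_moved_per_source(lines):
    """Sum ShardBytes per SourceTeam over ValleyFiller and MountainChopper events."""
    totals = defaultdict(int)
    for line in lines:
        event = json.loads(line)
        # Other event types, such as RelocateShard, are skipped.
        if event.get("Type") in REBALANCE_TYPES:
            totals[event["SourceTeam"]] += int(event["ShardBytes"])
    return dict(totals)


totals = bytes_moved_per_source(TRACE_LINES)
```

&lt;p&gt;In this sample, both events share the same source team, so &lt;code&gt;totals&lt;&#x2F;code&gt; contains a single entry with the sum of the two &lt;code&gt;ShardBytes&lt;&#x2F;code&gt; values.&lt;&#x2F;p&gt;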
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article. I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">foundationdb</category>
          <category domain="tag">debugging</category>
          <category domain="tag">distributed</category>
          <category domain="tag">database</category>
          <category domain="tag">storage</category>
      </item>
      <item>
          <title>Ensuring Safety in FoundationDB&#x27;s Rust Crate</title>
          <pubDate>Tue, 11 Feb 2025 00:00:00 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/providing-safety-fdb-rs/</link>
          <guid>https://pierrezemb.fr/posts/providing-safety-fdb-rs/</guid>
          <description xml:base="https://pierrezemb.fr/posts/providing-safety-fdb-rs/">&lt;p&gt;As we approach 5 million downloads of the &lt;a href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;foundationdb&quot;&gt;FoundationDB Rust crate&lt;&#x2F;a&gt; (4,998,185 at the time of writing), I wanted to share some insights into how I ensure the safety of the crate. Being the primary maintainer of a database driver comes with responsibility, but I sleep well at night knowing that we have robust safety measures in place.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;crate-overview&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#crate-overview&quot; aria-label=&quot;Anchor link for: crate-overview&quot;&gt;🔗&lt;&#x2F;a&gt;Crate Overview&lt;&#x2F;h2&gt;
&lt;p&gt;The Rust crate, &lt;code&gt;foundationdb-rs&lt;&#x2F;code&gt;, provides bindings to interact with FoundationDB&#x27;s C API (&lt;code&gt;libfdb&lt;&#x2F;code&gt;). It has around 13k lines of code and is used by companies (like Clever Cloud) and projects (such as Apache OpenDAL, SurrealDB). Having experienced numerous outages and issues with drivers and distributed systems, I understand the importance of safety. To ensure the safety of the crate, we need to focus on three layers:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The underlying client, &lt;code&gt;libfdb&lt;&#x2F;code&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;The crate itself,&lt;&#x2F;li&gt;
&lt;li&gt;The code that uses the crate.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Let&#x27;s dig into each of these areas.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;libfdb-safety&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#libfdb-safety&quot; aria-label=&quot;Anchor link for: libfdb-safety&quot;&gt;🔗&lt;&#x2F;a&gt;libfdb Safety&lt;&#x2F;h2&gt;
&lt;p&gt;This is the simplest part. &lt;code&gt;libfdb&lt;&#x2F;code&gt;&#x27;s safety is guaranteed by FoundationDB&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;testing.html&quot;&gt;simulation framework&lt;&#x2F;a&gt;. Therefore, we can consider it safe.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;classic-testing-suite&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#classic-testing-suite&quot; aria-label=&quot;Anchor link for: classic-testing-suite&quot;&gt;🔗&lt;&#x2F;a&gt;Classic testing suite&lt;&#x2F;h3&gt;
&lt;p&gt;Since we are using a C library, we need to use FFI (Foreign Function Interface) and unsafe code blocks. With around 130 unsafe blocks, we must be extra careful when calling C code, ensuring all preconditions are met. Naturally, we conduct extensive testing, but most importantly, we run tests in high-variety environments:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;On multiple operating systems (Ubuntu, macOS)&lt;&#x2F;li&gt;
&lt;li&gt;On multiple FoundationDB versions (from FDB 6.1 to 7.3)&lt;&#x2F;li&gt;
&lt;li&gt;On multiple Rust compiler versions (Minimum Supported Rust Version or MSRV, stable, beta, nightly)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The most useful tests are run on the nightly Rust compiler, as we can catch &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;issues&#x2F;90&quot;&gt;new behaviors in the Rust compiler early&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;While these testing practices provide significant coverage, the most powerful tool we utilize comes from FoundationDB’s maintainers: the &lt;code&gt;BindingTester&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-bindingtester&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-bindingtester&quot; aria-label=&quot;Anchor link for: the-bindingtester&quot;&gt;🔗&lt;&#x2F;a&gt;The BindingTester&lt;&#x2F;h3&gt;
&lt;p&gt;FoundationDB is renowned for its &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;testing.html&quot;&gt;simulation and testing&lt;&#x2F;a&gt; frameworks, and bindings are no exception. The maintainers developed the BindingTester, a cross-language validation suite ensuring that all bindings behave correctly and consistently across different languages.&lt;&#x2F;p&gt;
&lt;p&gt;The BindingTester uses &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;main&#x2F;bindings&#x2F;bindingtester&#x2F;spec&#x2F;bindingApiTester.md&quot;&gt;a stack-based machine&lt;&#x2F;a&gt; to queue operations for FoundationDB. A program then reads the stack and performs the operations. These operations are run twice: once in the target environment and once against a reference implementation. Any differences are reported by the BindingTester.&lt;&#x2F;p&gt;
&lt;p&gt;It looks like this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;.&#x2F;bindings&#x2F;bindingtester&#x2F;bindingtester.py --num-ops&lt;&#x2F;span&gt;&lt;span&gt; 1000&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; --api-version&lt;&#x2F;span&gt;&lt;span&gt; 730&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; --test-name&lt;&#x2F;span&gt;&lt;span&gt; api&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; --compare&lt;&#x2F;span&gt;&lt;span&gt; python rust
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Creating&lt;&#x2F;span&gt;&lt;span&gt; test at API version 730
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Generating&lt;&#x2F;span&gt;&lt;span&gt; api test at seed 3208032894 with 1000 op(s) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;and&lt;&#x2F;span&gt;&lt;span&gt; 1 concurrent tester(s)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Inserting Rust tests
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Inserting&lt;&#x2F;span&gt;&lt;span&gt; test into database...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Running&lt;&#x2F;span&gt;&lt;span&gt; tester &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;&#x2F;home&#x2F;runner&#x2F;work&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;target&#x2F;debug&#x2F;bindingtester test_spec 730&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Reading&lt;&#x2F;span&gt;&lt;span&gt; results from &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;tester_output&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;workspace&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Reading&lt;&#x2F;span&gt;&lt;span&gt; results from &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;tester_output&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;stack&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Inserting Python tests
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Inserting&lt;&#x2F;span&gt;&lt;span&gt; test into database...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Running&lt;&#x2F;span&gt;&lt;span&gt; tester &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;python &#x2F;home&#x2F;runner&#x2F;work&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;target&#x2F;foundationdb_build&#x2F;foundationdb&#x2F;bindings&#x2F;bindingtester&#x2F;..&#x2F;python&#x2F;tests&#x2F;tester.py test_spec 730&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Reading&lt;&#x2F;span&gt;&lt;span&gt; results from &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;tester_output&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;workspace&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Reading&lt;&#x2F;span&gt;&lt;span&gt; results from &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;tester_output&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;stack&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Comparing the results
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Comparing&lt;&#x2F;span&gt;&lt;span&gt; results from &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;tester_output&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;workspace&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Comparing&lt;&#x2F;span&gt;&lt;span&gt; results from &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;tester_output&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;stack&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Test&lt;&#x2F;span&gt;&lt;span&gt; with seed 3208032894 and concurrency 1 had 0 incorrect result(s) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;and&lt;&#x2F;span&gt;&lt;span&gt; 0 error(s) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;at&lt;&#x2F;span&gt;&lt;span&gt; API version 730
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Completed&lt;&#x2F;span&gt;&lt;span&gt; api test with random seed 3208032894 and 1000 operations
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The great advantage of this method is that the tests are seeded, meaning the operations are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;randomly selected to cover all binding usages,&lt;&#x2F;li&gt;
&lt;li&gt;deterministic, so a failing seed can be replayed locally.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
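&lt;p&gt;The two properties above can be illustrated with a toy example. This is only a sketch under stated assumptions, not the BindingTester itself: &lt;code&gt;Op&lt;&#x2F;code&gt;, &lt;code&gt;generate_ops&lt;&#x2F;code&gt;, &lt;code&gt;run_reference&lt;&#x2F;code&gt;, &lt;code&gt;run_candidate&lt;&#x2F;code&gt; and &lt;code&gt;seeds_agree&lt;&#x2F;code&gt; are hypothetical names, and the three-instruction stack machine is far simpler than the real specification. It shows how a seed fully determines the operation stream, and how two implementations can be compared on that stream:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;use std::collections::VecDeque;

&#x2F;&#x2F;&#x2F; Operations of a toy stack machine (hypothetical, far simpler than the
&#x2F;&#x2F;&#x2F; real BindingTester instruction set).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Push(i64),
    Add,
    Pop,
}

&#x2F;&#x2F;&#x2F; Deterministic generator: the same seed always yields the same operations.
fn generate_ops(seed: u64, count: usize) -&amp;gt; Vec&amp;lt;Op&amp;gt; {
    let mut state = seed;
    (0..count)
        .map(|_| {
            &#x2F;&#x2F; Linear congruential step (MMIX constants, as published by Knuth).
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            match (state &amp;gt;&amp;gt; 33) % 3 {
                0 =&amp;gt; Op::Push(((state &amp;gt;&amp;gt; 40) % 100) as i64),
                1 =&amp;gt; Op::Add,
                _ =&amp;gt; Op::Pop,
            }
        })
        .collect()
}

&#x2F;&#x2F;&#x2F; Reference implementation: a plain Vec used as a stack.
fn run_reference(ops: &amp;amp;[Op]) -&amp;gt; Vec&amp;lt;i64&amp;gt; {
    let mut stack: Vec&amp;lt;i64&amp;gt; = Vec::new();
    for op in ops {
        match *op {
            Op::Push(value) =&amp;gt; stack.push(value),
            &#x2F;&#x2F; Add is a no-op when fewer than two values are available.
            Op::Add if stack.len() &amp;gt;= 2 =&amp;gt; {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Add =&amp;gt; {}
            Op::Pop =&amp;gt; {
                stack.pop();
            }
        }
    }
    stack
}

&#x2F;&#x2F;&#x2F; Stand-in for the binding under test: same semantics, different data
&#x2F;&#x2F;&#x2F; structure (a deque whose front is the top of the stack).
fn run_candidate(ops: &amp;amp;[Op]) -&amp;gt; Vec&amp;lt;i64&amp;gt; {
    let mut deque: VecDeque&amp;lt;i64&amp;gt; = VecDeque::new();
    for op in ops {
        match *op {
            Op::Push(value) =&amp;gt; deque.push_front(value),
            Op::Add if deque.len() &amp;gt;= 2 =&amp;gt; {
                let b = deque.pop_front().unwrap();
                let a = deque.pop_front().unwrap();
                deque.push_front(a + b);
            }
            Op::Add =&amp;gt; {}
            Op::Pop =&amp;gt; {
                deque.pop_front();
            }
        }
    }
    &#x2F;&#x2F; Reverse so the result is ordered bottom-to-top like the reference.
    deque.into_iter().rev().collect()
}

&#x2F;&#x2F;&#x2F; Replays one seed against both implementations and reports agreement.
fn seeds_agree(seed: u64, count: usize) -&amp;gt; bool {
    let ops = generate_ops(seed, count);
    run_reference(&amp;amp;ops) == run_candidate(&amp;amp;ops)
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Calling &lt;code&gt;seeds_agree&lt;&#x2F;code&gt; with a given seed and operation count replays the same stream against both implementations, so any mismatch can be reproduced by reusing that seed.&lt;&#x2F;p&gt;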
&lt;p&gt;Combined with code coverage, this gives us a good idea of what has been tested (though code coverage may vary).&lt;&#x2F;p&gt;
&lt;p&gt;We run the &lt;code&gt;BindingTester&lt;&#x2F;code&gt; &lt;strong&gt;every hour&lt;&#x2F;strong&gt; on GitHub Actions, amounting to &lt;strong&gt;around 219 days of continuous testing each month&lt;&#x2F;strong&gt; (316,335 minutes of correctness testing last month according to GitHub).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;user-safety&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#user-safety&quot; aria-label=&quot;Anchor link for: user-safety&quot;&gt;🔗&lt;&#x2F;a&gt;User Safety&lt;&#x2F;h2&gt;
&lt;p&gt;Thanks to &lt;code&gt;libfdb&lt;&#x2F;code&gt; and the &lt;code&gt;BindingTester&lt;&#x2F;code&gt;, we can ensure that the library is quite safe. But what about the user&#x27;s code? How can we help users ensure their code can handle all of FoundationDB&#x27;s caveats, such as &lt;a href=&quot;&#x2F;posts&#x2F;automatic-txn-fdb-730&#x2F;#transactions-with-unknown-results&quot;&gt;commit_unknown_result&lt;&#x2F;a&gt;? We added a great feature: the ability to include Rust code &lt;strong&gt;within FDB&#x27;s simulation framework&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We can implement a Rust workload with the following trait:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;pub trait &lt;&#x2F;span&gt;&lt;span&gt;RustWorkload {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;description&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; String;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;setup&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;&amp;#39;static mut &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;: SimDatabase, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;done&lt;&#x2F;span&gt;&lt;span&gt;: Promise);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;start&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;&amp;#39;static mut &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;: SimDatabase, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;done&lt;&#x2F;span&gt;&lt;span&gt;: Promise);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;check&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;&amp;#39;static mut &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;: SimDatabase, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;done&lt;&#x2F;span&gt;&lt;span&gt;: Promise);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;get_metrics&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; Vec&amp;lt;Metric&amp;gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;get_check_timeout&lt;&#x2F;span&gt;&lt;span&gt;(&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;f64&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The workload can then be run inside the simulation while faults are injected:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;fdbserver -r&lt;&#x2F;span&gt;&lt;span&gt; simulation&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; -f&lt;&#x2F;span&gt;&lt;span&gt; &#x2F;root&#x2F;atomic.toml&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; -b&lt;&#x2F;span&gt;&lt;span&gt; on&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt; --trace-format&lt;&#x2F;span&gt;&lt;span&gt; json
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Choosing a random seed
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Random&lt;&#x2F;span&gt;&lt;span&gt; seed is 394378360...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Then, everything is derived from the seed, including:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# * cluster topology,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# * cluster configuration,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# * timing to inject faults,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# * operations to run
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# * ...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Datacenter&lt;&#x2F;span&gt;&lt;span&gt; 0: 3&#x2F;12 machines, 1&#x2F;1 coordinators
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Datacenter&lt;&#x2F;span&gt;&lt;span&gt; 1: 3&#x2F;12 machines, 0&#x2F;1 coordinators
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Datacenter&lt;&#x2F;span&gt;&lt;span&gt; 2: 3&#x2F;12 machines, 0&#x2F;1 coordinators
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Datacenter&lt;&#x2F;span&gt;&lt;span&gt; 3: 3&#x2F;12 machines, 0&#x2F;1 coordinators
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# Starting the Atomic workload
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Run&lt;&#x2F;span&gt;&lt;span&gt; test:AtomicWorkload start
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;AtomicWorkload&lt;&#x2F;span&gt;&lt;span&gt; complete
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;checking&lt;&#x2F;span&gt;&lt;span&gt; test (AtomicWorkload)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt; test clients passed; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt; test clients failed
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Run&lt;&#x2F;span&gt;&lt;span&gt; test:AtomicWorkload Done.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt; tests passed; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt; tests failed.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Unseed:&lt;&#x2F;span&gt;&lt;span&gt; 66324
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Elapsed:&lt;&#x2F;span&gt;&lt;span&gt; 405.055622 simsec, 30.342000 real seconds
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This has been a &lt;strong&gt;key enabler&lt;&#x2F;strong&gt; for us in developing and operating &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;materia&#x2F;&quot;&gt;Materia, Clever Cloud&#x27;s serverless database offer&lt;&#x2F;a&gt;, as we can use the same simulation framework as FDB&#x27;s core engineers, this time for layer engineering 🤯&lt;&#x2F;p&gt;
&lt;h2 id=&quot;closing-words&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#closing-words&quot; aria-label=&quot;Anchor link for: closing-words&quot;&gt;🔗&lt;&#x2F;a&gt;Closing words&lt;&#x2F;h2&gt;
&lt;p&gt;As with any open-source project, there is always more to accomplish, but I am quite satisfied with the current level of safety provided by the crate. I would like to express my gratitude to the FoundationDB community for developing the BindingTester, and to the former contributors to the crate.&lt;&#x2F;p&gt;
&lt;p&gt;I also would like to encourage everyone to explore the simulation framework. Integrating Rust code within this framework has allowed us to harness the full potential of simulation without the need to build our own, and it has forever changed my perspective on testing and software engineering.&lt;&#x2F;p&gt;
&lt;p&gt;Future blog posts will very likely focus on simulation, so feel free to explore the &lt;a href=&quot;&#x2F;tags&#x2F;simulation&#x2F;&quot;&gt;simulation tag&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">foundationdb</category>
          <category domain="tag">rust</category>
          <category domain="tag">testing</category>
          <category domain="tag">database</category>
          <category domain="tag">distributed</category>
      </item>
      <item>
          <title>Back in engineering!</title>
          <pubDate>Wed, 15 Jan 2025 00:37:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/back-engineering/</link>
          <guid>https://pierrezemb.fr/posts/back-engineering/</guid>
          <description xml:base="https://pierrezemb.fr/posts/back-engineering/">&lt;p&gt;Time flies—it’s already 2025! Looking back, 2024 was an incredibly fast-paced year for me professionally as an Engineering Manager. This year, I’ve decided to take a new direction and return to a more engineering-focused role.&lt;&#x2F;p&gt;
&lt;p&gt;I moved into a management position in early 2023. It was a time when my company was growing fast (from 20-ish to 60-ish), and we needed coordination to ship things in parallel, but also to &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;blog&#x2F;company&#x2F;2023&#x2F;12&#x2F;21&#x2F;our-journey-to-a-better-clever-cloud&#x2F;&quot;&gt;migrate customers (and ourselves!) to new datacenters&lt;&#x2F;a&gt;. With my on-call experience, I helped our CTO, Steven, bootstrap two data-oriented teams and later led the Materia team, shaping its technology.&lt;&#x2F;p&gt;
&lt;p&gt;During this time, we successfully launched &lt;a href=&quot;https:&#x2F;&#x2F;www.clever-cloud.com&#x2F;blog&#x2F;features&#x2F;2024&#x2F;06&#x2F;11&#x2F;materia-kv-our-easy-to-use-serverless-key-value-database-is-available-to-all&#x2F;&quot;&gt;Materia KV in its Alpha version&lt;&#x2F;a&gt; and built strong internal trust in FoundationDB. Our largest cluster effortlessly handles hundreds of thousands of writes per second, and &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;tree&#x2F;main&#x2F;foundationdb-simulation&quot;&gt;the ability to simulate Rust code within FDB’s simulation framework&lt;&#x2F;a&gt; has significantly boosted our developers&#x27; confidence.&lt;&#x2F;p&gt;
&lt;p&gt;Management is a rewarding challenge, like unlocking a new skill tree to walk through. Resources like &lt;a href=&quot;https:&#x2F;&#x2F;www.oreilly.com&#x2F;library&#x2F;view&#x2F;the-managers-path&#x2F;9781491973882&#x2F;&quot;&gt;The Manager&#x27;s Path&lt;&#x2F;a&gt; or &lt;a href=&quot;https:&#x2F;&#x2F;www.engmanagement.dev&quot;&gt;Engineering Management for the Rest of Us&lt;&#x2F;a&gt; can give you a head start and are worth reading for anyone. Blogs like the one by &lt;a href=&quot;https:&#x2F;&#x2F;charity.wtf&#x2F;tag&#x2F;management&#x2F;page&#x2F;2&#x2F;&quot;&gt;Charity Majors&lt;&#x2F;a&gt; are also useful to read, especially &lt;a href=&quot;https:&#x2F;&#x2F;charity.wtf&#x2F;2017&#x2F;05&#x2F;11&#x2F;the-engineer-manager-pendulum&#x2F;&quot;&gt;The Engineer&#x2F;Manager Pendulum&lt;&#x2F;a&gt; and its &lt;a href=&quot;https:&#x2F;&#x2F;charity.wtf&#x2F;2019&#x2F;01&#x2F;04&#x2F;engineering-management-the-pendulum-or-the-ladder&#x2F;&quot;&gt;follow-up&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;After nearly two years, I feel it&#x27;s the right time to return to core engineering. I just miss it. From the start, my manager and I had an understanding that I could transition back whenever I wanted to, and that time is now. I&#x27;ve found an excellent manager to lead the team, giving me the space to focus on the technical side of Materia. I should also have more time to work on open source, and I&#x27;m looking forward to it.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">personal</category>
      </item>
      <item>
          <title>Redwood’s memory tuning in FoundationDB</title>
          <pubDate>Mon, 22 Apr 2024 00:37:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/redwood-memory-tuning/</link>
          <guid>https://pierrezemb.fr/posts/redwood-memory-tuning/</guid>
          <description xml:base="https://pierrezemb.fr/posts/redwood-memory-tuning/">&lt;p&gt;While FoundationDB gives you sub-millisecond transaction latency without any knob tuning, we had to increase Redwood&#x27;s memory usage a bit under certain workloads. The following configuration has been tested on clusters from 7.1 to 7.3.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;btree-page-cache&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#btree-page-cache&quot; aria-label=&quot;Anchor link for: btree-page-cache&quot;&gt;🔗&lt;&#x2F;a&gt;BTree page cache&lt;&#x2F;h2&gt;
&lt;p&gt;We discovered the issue when we saw a performance decrease on our cluster storing time-series data. The cluster was reporting high disk busyness, causing outages:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;10.0.3.23:4501 ( 65% cpu; 61% machine; 0.010 Gbps; 93% disk IO; 7.5 GB &#x2F; 7.4 GB RAM  )
&lt;&#x2F;span&gt;&lt;span&gt;10.0.3.24:4501 ( 61% cpu; 61% machine; 0.010 Gbps; 87% disk IO; 9.7 GB &#x2F; 7.4 GB RAM  )
&lt;&#x2F;span&gt;&lt;span&gt;10.0.3.25:4501 ( 69% cpu; 61% machine; 0.010 Gbps; 93% disk IO; 5.4 GB &#x2F; 7.4 GB RAM  )
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This was our first «we need to dig into this» moment with FDB. We couldn’t find the root cause, so we asked the community. It turned out we had a classic page-cache issue, which was spotted by &lt;a href=&quot;https:&#x2F;&#x2F;forums.foundationdb.org&#x2F;u&#x2F;markus.pilman&#x2F;summary&quot;&gt;Markus Pilman&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;forums.foundationdb.org&#x2F;u&#x2F;wmd&#x2F;summary&quot;&gt;William Dowling&lt;&#x2F;a&gt;. While the trace files are pretty verbose, they contain a lot of useful information, such as:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;&amp;quot;PagerCacheHit&amp;quot;: &amp;quot;39852&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;PagerCacheMiss&amp;quot;: &amp;quot;25903&amp;quot;,
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Yep, that’s a roughly 40% cache-miss ratio over 5s 😱 This is why the disk was so busy: it was spending its time moving pages back and forth. We needed to bump the memory, but by how much? The general recommendation that worked for us is to target around 1-2% of the &lt;code&gt;kvstore_used_bytes&lt;&#x2F;code&gt; metric. As we have around 1TiB of data per StorageServer, we can add the following config key:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;cache_memory = 10GiB
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Which fixed our cache-miss issue 🎉&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;&amp;quot;PagerCacheHit&amp;quot;: &amp;quot;51968&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;PagerCacheMiss&amp;quot;: &amp;quot;432&amp;quot;,
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
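&lt;p&gt;The ratios quoted above and the 1-2% sizing rule can be checked with a few lines of arithmetic. This is a minimal sketch: &lt;code&gt;miss_ratio&lt;&#x2F;code&gt; and &lt;code&gt;cache_memory_bounds_gib&lt;&#x2F;code&gt; are hypothetical helper names, not part of any FoundationDB tooling:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&#x2F;&#x2F;&#x2F; Fraction of page reads that missed the cache, between 0.0 and 1.0.
fn miss_ratio(hits: u64, misses: u64) -&amp;gt; f64 {
    let total = hits + misses;
    if total == 0 {
        return 0.0;
    }
    misses as f64 &#x2F; total as f64
}

&#x2F;&#x2F;&#x2F; Lower and upper bound, in GiB, for cache_memory using the 1-2% rule.
fn cache_memory_bounds_gib(kvstore_used_bytes: u64) -&amp;gt; (f64, f64) {
    const GIB: f64 = 1024.0 * 1024.0 * 1024.0;
    let used_gib = kvstore_used_bytes as f64 &#x2F; GIB;
    (used_gib * 0.01, used_gib * 0.02)
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With the counters shown above, &lt;code&gt;miss_ratio(39852, 25903)&lt;&#x2F;code&gt; is about 0.39 before the change and &lt;code&gt;miss_ratio(51968, 432)&lt;&#x2F;code&gt; is below 0.01 after it. For 1TiB of data, the bounds are about 10 to 20 GiB.&lt;&#x2F;p&gt;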
&lt;p&gt; &lt;&#x2F;p&gt;
&lt;h2 id=&quot;byte-sample-memory-usage&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#byte-sample-memory-usage&quot; aria-label=&quot;Anchor link for: byte-sample-memory-usage&quot;&gt;🔗&lt;&#x2F;a&gt;Byte Sample memory usage&lt;&#x2F;h2&gt;
&lt;p&gt;But our problems were not over, as we were still seeing some OOMs 😭 Because this cluster stores time-series data, each StorageServer holds around 1TiB of data. As we stored more and more data, we saw more and more OOM errors in our &lt;code&gt;fdbmonitor&lt;&#x2F;code&gt; logs. Something was growing linearly with our usage and needed tuning. This time, we had help from &lt;a href=&quot;https:&#x2F;&#x2F;forums.foundationdb.org&#x2F;u&#x2F;SteavedHams&#x2F;summary&quot;&gt;Steve Atherton&lt;&#x2F;a&gt;, who pointed us toward the &lt;a href=&quot;https:&#x2F;&#x2F;forums.foundationdb.org&#x2F;t&#x2F;foundationdb-7-1-24-the-memory-usage-after-clean-startup-of-fdbserver-process-is-too-high&#x2F;3863&#x2F;8?u=pierrez&quot;&gt;Byte Sample&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is a data structure that storage servers have called the Byte Sample which stores a deterministic random sample of keys. This data is persisted on disk in the storage engine and is loaded immediately upon storage server startup. Unfortunately, its size is not tracked or reported, but grows linearly with KV size and I suspect yours is somewhere around 4GB-6GB based on the memory usage I’ve seen for smaller storage KV sizes.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;So, we need to add around 4GB more memory, but there is no dedicated config key for the Byte Sample. It has to be accounted for in the global &lt;code&gt;memory&lt;&#x2F;code&gt; parameter. Let’s compute the right value!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-global-memory-formula&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-global-memory-formula&quot; aria-label=&quot;Anchor link for: the-global-memory-formula&quot;&gt;🔗&lt;&#x2F;a&gt;The global memory formula&lt;&#x2F;h2&gt;
&lt;p&gt;By testing things on our clusters, we ended up with this formula:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;# Default is 2
&lt;&#x2F;span&gt;&lt;span&gt;cache_memory = (1-2% of kvstore_used_bytes)GiB
&lt;&#x2F;span&gt;&lt;span&gt;# Default is 8
&lt;&#x2F;span&gt;&lt;span&gt;memory = (8 + cache_memory + 4-6GB per TB of kvstore_used_bytes)GiB
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Which fixed all our memory issues with FoundationDB 🎉 And to be fair, these are the only things we needed to tune on our clusters, which is quite impressive 👀&lt;&#x2F;p&gt;
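&lt;p&gt;As a worked example, the formula can be written as a small function. This is a sketch under assumptions: &lt;code&gt;memory_gib&lt;&#x2F;code&gt; and &lt;code&gt;BASE_MEMORY_GIB&lt;&#x2F;code&gt; are hypothetical names, and GB&#x2F;TB are treated as interchangeable with GiB&#x2F;TiB, which is an approximation:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&#x2F;&#x2F;&#x2F; Default value of the memory parameter in GiB (the 8 in the formula).
const BASE_MEMORY_GIB: f64 = 8.0;

&#x2F;&#x2F;&#x2F; Suggested value for the global memory parameter, in GiB.
&#x2F;&#x2F;&#x2F;
&#x2F;&#x2F;&#x2F; byte_sample_gib_per_tib is the 4-6 estimate from the formula; this
&#x2F;&#x2F;&#x2F; sketch does not distinguish GB&#x2F;TB from GiB&#x2F;TiB.
fn memory_gib(
    kvstore_used_tib: f64,
    cache_memory_gib: f64,
    byte_sample_gib_per_tib: f64,
) -&amp;gt; f64 {
    BASE_MEMORY_GIB + cache_memory_gib + byte_sample_gib_per_tib * kvstore_used_tib
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For a StorageServer holding 1TiB with &lt;code&gt;cache_memory = 10GiB&lt;&#x2F;code&gt;, this gives between 22 and 24 GiB for &lt;code&gt;memory&lt;&#x2F;code&gt;, depending on the Byte Sample estimate.&lt;&#x2F;p&gt;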
&lt;h2 id=&quot;special-thanks&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#special-thanks&quot; aria-label=&quot;Anchor link for: special-thanks&quot;&gt;🔗&lt;&#x2F;a&gt;Special thanks&lt;&#x2F;h2&gt;
&lt;p&gt;I would like to thank Markus, William and Steve from the FoundationDB community for their help 🤝&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article; I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">foundationdb</category>
          <category domain="tag">performance</category>
          <category domain="tag">storage</category>
          <category domain="tag">database</category>
          <category domain="tag">tuning</category>
      </item>
      <item>
          <title>True idempotent transactions with FoundationDB 7.3</title>
          <pubDate>Tue, 12 Mar 2024 00:37:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/automatic-txn-fdb-730/</link>
          <guid>https://pierrezemb.fr/posts/automatic-txn-fdb-730/</guid>
          <description xml:base="https://pierrezemb.fr/posts/automatic-txn-fdb-730/">&lt;p&gt;I have been working with &lt;a href=&quot;https:&#x2F;&#x2F;foundationdb.org&quot;&gt;FoundationDB&lt;&#x2F;a&gt; for several years now, and the upcoming version fixes one of the most evil and painful caveats you can deal with when writing layers: &lt;code&gt;commit_unknown_result&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;transactions-with-unknown-results&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#transactions-with-unknown-results&quot; aria-label=&quot;Anchor link for: transactions-with-unknown-results&quot;&gt;🔗&lt;&#x2F;a&gt;Transactions with unknown results&lt;&#x2F;h2&gt;
&lt;p&gt;When you start writing code with FDB, you may be under the assumption that, given the database’s robustness, you will not experience strange behavior under certain failure scenarios. It turns out there is one scenario that you can run into, and it is briefly explained in the official &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;developer-guide.html#transactions-with-unknown-results&quot;&gt;documentation&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;As with other client&#x2F;server databases, in some failure scenarios a client may be unable to determine whether a transaction succeeded. In these cases, commit() will raise a &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;api-error-codes.html#developer-guide-error-codes&quot;&gt;&lt;code&gt;commit_unknown_result&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; exception. The on_error() function treats this exception as retriable, so retry loops that don’t check for &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;api-error-codes.html#developer-guide-error-codes&quot;&gt;&lt;code&gt;commit_unknown_result&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; could execute the transaction twice. In these cases, you must consider the idempotency of the transaction.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;While writing idempotent retry loops is often possible, sometimes it is not, for example when using atomic operations to keep track of statistics.&lt;&#x2F;p&gt;
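&lt;p&gt;To make the problem concrete, here is a toy model of a retry loop around an atomic increment. It is only a sketch and does not use the FoundationDB API: &lt;code&gt;ToyDb&lt;&#x2F;code&gt;, &lt;code&gt;CommitOutcome&lt;&#x2F;code&gt; and &lt;code&gt;increment_with_retry&lt;&#x2F;code&gt; are hypothetical names, and the lost acknowledgement is simulated with a flag:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&#x2F;&#x2F;&#x2F; Outcome of a commit as seen by the client.
#[derive(Debug, PartialEq)]
enum CommitOutcome {
    Committed,
    UnknownResult,
}

&#x2F;&#x2F;&#x2F; Toy stand-in for a database holding a single counter (not the FDB API).
struct ToyDb {
    counter: i64,
    &#x2F;&#x2F;&#x2F; When true, the next commit lands but its acknowledgement is lost.
    lose_next_ack: bool,
}

impl ToyDb {
    &#x2F;&#x2F;&#x2F; Applies an atomic add and reports what the client observes.
    fn atomic_add(&amp;amp;mut self, delta: i64) -&amp;gt; CommitOutcome {
        &#x2F;&#x2F; The write is applied in both cases.
        self.counter += delta;
        if self.lose_next_ack {
            self.lose_next_ack = false;
            CommitOutcome::UnknownResult
        } else {
            CommitOutcome::Committed
        }
    }
}

&#x2F;&#x2F;&#x2F; Naive retry loop that treats an unknown result as retriable.
&#x2F;&#x2F;&#x2F; Returns the number of attempts made for one logical increment.
fn increment_with_retry(db: &amp;amp;mut ToyDb) -&amp;gt; u32 {
    let mut attempts = 0;
    loop {
        attempts += 1;
        if db.atomic_add(1) == CommitOutcome::Committed {
            return attempts;
        }
    }
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Starting from a counter at 0 with &lt;code&gt;lose_next_ack&lt;&#x2F;code&gt; set, a single call to &lt;code&gt;increment_with_retry&lt;&#x2F;code&gt; makes two attempts and leaves the counter at 2: the increment was applied twice, and nothing in the loop can detect it.&lt;&#x2F;p&gt;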
&lt;blockquote&gt;
&lt;p&gt;Is this problem worth fixing? It seems like a real edge case 🤔&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;It truly depends on whether it is acceptable for your transaction to be committed twice. In most cases, it is not, but sometimes developers are not aware of this behavior, leading to errors. This is one of the reasons why we built and open-sourced a way to embed Rust code within FoundationDB’s simulation framework. Using the simulation crate, your layer can be tested like FDB, and I can assure you: you &lt;strong&gt;will see&lt;&#x2F;strong&gt; those transactions in simulation 🙈.&lt;&#x2F;p&gt;
&lt;p&gt;This behavior has given my colleagues headaches, as we tried to bypass correctness and validation code in simulation when a transaction’s state is unknown, and who could blame us? Validating the correctness of your code is hard when certain transactions (for example, one that could clean everything) are “maybe committed”. Fortunately, the community has released a workaround for this: &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;release-7.3&#x2F;documentation&#x2F;sphinx&#x2F;source&#x2F;automatic-idempotency.rst&quot;&gt;&lt;code&gt;automatic idempotency&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;automatic-idempotency&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#automatic-idempotency&quot; aria-label=&quot;Anchor link for: automatic-idempotency&quot;&gt;🔗&lt;&#x2F;a&gt;Automatic idempotency&lt;&#x2F;h2&gt;
&lt;p&gt;The documentation is fairly explicit:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Use the automatic_idempotency transaction option to prevent commits from failing with &lt;code&gt;commit_unknown_result&lt;&#x2F;code&gt; at a small performance cost.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The option appeared in FoundationDB 7.3, and could solve our issue. I decided to give it a try and modify the foundationdb-simulation crate example. The example uses an atomic increment under simulation. Before 7.3, during validation, we had to write &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb-rs&#x2F;foundationdb-rs&#x2F;blob&#x2F;98136cbea1c9b8d40ea9a599438ce0fa8d0297c0&#x2F;foundationdb-simulation&#x2F;examples&#x2F;atomic&#x2F;workload.rs#L99C1-L99C94&quot;&gt;some code&lt;&#x2F;a&gt; that looks like this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; We don&amp;#39;t know how many maybe_committed transactions have succeeded,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; so we are checking the possible range
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.success_count &amp;lt;= count
&lt;&#x2F;span&gt;&lt;span&gt;   &amp;amp;&amp;amp; count &amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.expected_count + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.maybe_committed_count {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As I was adding 7.3 support in the crate, I decided to update the example and try the new option:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Enable idempotent txn
&lt;&#x2F;span&gt;&lt;span&gt; trx.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;set_option&lt;&#x2F;span&gt;&lt;span&gt;(TransactionOption::AutomaticIdempotency)?;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the behavior is correct, I can simplify my consistency checks:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.success_count == count {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.context.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;trace&lt;&#x2F;span&gt;&lt;span&gt;(
&lt;&#x2F;span&gt;&lt;span&gt;        Severity::Info,
&lt;&#x2F;span&gt;&lt;span&gt;        &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;Atomic count match&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;        details![],
&lt;&#x2F;span&gt;&lt;span&gt;     );
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I’ve been running hundreds of seeds on my machine and everything works great: no more maybe-committed transactions! Now that 7.3 support is merged in the Rust bindings, we will be able to expand our testing thanks to our simulation farm. I&#x27;m also looking forward to measuring the performance impact of the feature, even if I&#x27;m pretty sure that it will outperform any workaround implemented at the layer level.&lt;&#x2F;p&gt;
&lt;p&gt;This is truly a very useful feature and I hope this option will be turned on by default in the next major release. The initial PR can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;pull&#x2F;8398&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">distributed</category>
          <category domain="tag">transactions</category>
          <category domain="tag">foundationdb</category>
          <category domain="tag">storage</category>
      </item>
      <item>
          <title>The unseen treasures of Infrastructure Engineering: Academic Papers</title>
          <pubDate>Mon, 22 Jan 2024 15:37:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/academic-conferences/</link>
          <guid>https://pierrezemb.fr/posts/academic-conferences/</guid>
          <description xml:base="https://pierrezemb.fr/posts/academic-conferences/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;papers.png&quot; alt=&quot;Academic paper created with AI&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I really like using RSS feeds. My Feedly account has more than 190 feeds, all neatly organized by categories. They help me keep up with new ideas and interesting blog posts about engineering. But there&#x27;s another source of information I&#x27;ve been using for a long time that not many people know about: &lt;strong&gt;academic papers&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;You can discover details about infrastructure that you might not find in regular blog posts. Academic papers, unlike typical blog content, often &lt;strong&gt;dive deeper&lt;&#x2F;strong&gt; into specific aspects of infrastructure. They provide more in-depth information, uncovering details that are not commonly discussed. So, if you&#x27;re interested in gaining a more comprehensive understanding of infrastructure-related topics, exploring academic papers can be really worthwhile.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Sounds a bit too academic, doesn&#x27;t it? 🤔&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I don&#x27;t think so! It&#x27;s true that academic research can sometimes seem distant from everyday industry needs, but following both academic and industry tracks is beneficial. R&amp;amp;D from academia often leads to new ideas and technologies that eventually find their way into practical use.&lt;&#x2F;p&gt;
&lt;p&gt;Moreover, numerous academic conferences feature an &lt;strong&gt;&quot;industry track&quot;&lt;&#x2F;strong&gt; that is essential to monitor.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Aren&#x27;t they too complex to read? 🤔&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;If you don&#x27;t get everything right away, that&#x27;s okay. Reading these smart papers might be a bit hard, but it&#x27;s a skill that gets better with practice. And who knows, maybe you&#x27;ll be inspired to write your own paper someday! 😉&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&#x27;m intrigued! Where should I start?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Here&#x27;s a short list of my go-to academic papers and conferences that you can follow for infrastructure engineering. Please note that many conferences exist on other subjects, such as security.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-usenix-community&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-usenix-community&quot; aria-label=&quot;Anchor link for: the-usenix-community&quot;&gt;🔗&lt;&#x2F;a&gt;The USENIX community&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;osdi&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#osdi&quot; aria-label=&quot;Anchor link for: osdi&quot;&gt;🔗&lt;&#x2F;a&gt;OSDI&lt;&#x2F;h3&gt;
&lt;p&gt;As part of the USENIX Association, the &lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;conferences&quot;&gt;Symposium on Operating Systems Design and Implementation&lt;&#x2F;a&gt; (OSDI) is an annual computer science conference that you shouldn&#x27;t miss. You can catch most of the sessions online along with some useful slides.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;nsdi&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#nsdi&quot; aria-label=&quot;Anchor link for: nsdi&quot;&gt;🔗&lt;&#x2F;a&gt;NSDI&lt;&#x2F;h3&gt;
&lt;p&gt;In a similar fashion, the &lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;conferences&quot;&gt;Symposium on Networked Systems Design and Implementation&lt;&#x2F;a&gt; (NSDI) focuses on the design principles, implementation, and practical evaluation of networked and distributed systems.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;usenix-atc&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#usenix-atc&quot; aria-label=&quot;Anchor link for: usenix-atc&quot;&gt;🔗&lt;&#x2F;a&gt;USENIX ATC&lt;&#x2F;h3&gt;
&lt;p&gt;The USENIX &lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;conferences&quot;&gt;Annual Technical Conference&lt;&#x2F;a&gt; is another classic to follow.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-acm-family&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-acm-family&quot; aria-label=&quot;Anchor link for: the-acm-family&quot;&gt;🔗&lt;&#x2F;a&gt;The ACM family&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;sigmod&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#sigmod&quot; aria-label=&quot;Anchor link for: sigmod&quot;&gt;🔗&lt;&#x2F;a&gt;SIGMOD&lt;&#x2F;h3&gt;
&lt;p&gt;SIGMOD, or the &lt;a href=&quot;https:&#x2F;&#x2F;sigmod.org&#x2F;&quot;&gt;Special Interest Group on Management of Data&lt;&#x2F;a&gt;, is an essential conference under the ACM umbrella, focusing on the management and organization of data.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;damon&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#damon&quot; aria-label=&quot;Anchor link for: damon&quot;&gt;🔗&lt;&#x2F;a&gt;DaMoN&lt;&#x2F;h3&gt;
&lt;p&gt;Held in conjunction with ACM SIGMOD&#x2F;PODS, the &lt;a href=&quot;https:&#x2F;&#x2F;damon-db.org&#x2F;&quot;&gt;Data Management on New Hardware&lt;&#x2F;a&gt; (DaMoN) workshop is also worth following.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;socc&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#socc&quot; aria-label=&quot;Anchor link for: socc&quot;&gt;🔗&lt;&#x2F;a&gt;SoCC&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;acmsocc.org&#x2F;2023&#x2F;&quot;&gt;Symposium on Cloud Computing&lt;&#x2F;a&gt;, or SoCC for short, belongs to ACM. It has a bit less content, as videos are not published, but you should keep it on your watchlist.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;sosp&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#sosp&quot; aria-label=&quot;Anchor link for: sosp&quot;&gt;🔗&lt;&#x2F;a&gt;SOSP&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a href=&quot;http:&#x2F;&#x2F;sosp.org&#x2F;&quot;&gt;Symposium on Operating Systems Principles&lt;&#x2F;a&gt; is another noteworthy conference in the ACM family. It&#x27;s a top-tier venue for discussing operating systems research. Stay tuned for updates on the latest breakthroughs and innovative ideas.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;others&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#others&quot; aria-label=&quot;Anchor link for: others&quot;&gt;🔗&lt;&#x2F;a&gt;Others&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;vldb&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#vldb&quot; aria-label=&quot;Anchor link for: vldb&quot;&gt;🔗&lt;&#x2F;a&gt;VLDB&lt;&#x2F;h3&gt;
&lt;p&gt;Not belonging to USENIX or ACM, the &lt;a href=&quot;https:&#x2F;&#x2F;vldb.org&#x2F;&quot;&gt;Very Large Data Bases&lt;&#x2F;a&gt; (VLDB) conference is a key event in the database community. It provides a platform for researchers and professionals to exchange ideas on managing and analyzing large-scale datasets.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;cidr&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#cidr&quot; aria-label=&quot;Anchor link for: cidr&quot;&gt;🔗&lt;&#x2F;a&gt;CIDR&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;www.cidrdb.org&quot;&gt;Conference on Innovative Data Systems Research&lt;&#x2F;a&gt; (CIDR) is a systems-oriented conference, complementary in its mission to the mainstream database conferences like SIGMOD and VLDB, emphasizing the systems architecture perspective.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cool-papers-examples&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#cool-papers-examples&quot; aria-label=&quot;Anchor link for: cool-papers-examples&quot;&gt;🔗&lt;&#x2F;a&gt;Cool paper examples&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;This is a nice list, but how about some paper examples that &lt;strong&gt;you&lt;&#x2F;strong&gt; like?🤔&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Sure! Here&#x27;s a quick list of infrastructure-related papers:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;conference&#x2F;atc23&#x2F;presentation&#x2F;brooker&quot;&gt;On-demand Container Loading in AWS Lambda&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.cidrdb.org&#x2F;cidr2024&#x2F;papers&#x2F;p63-helland.pdf&quot;&gt;Scalable OLTP in the Cloud: What&#x27;s the BIG DEAL?&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.confluent.io&#x2F;blog&#x2F;cloud-native-kafka-kora-vldb-award&#x2F;&quot;&gt;Kora: A Cloud-Native Event Streaming Platform For Kafka&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=YdxvOPenjWI&quot;&gt;Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.foundationdb.org&#x2F;blog&#x2F;fdb-paper&#x2F;&quot;&gt;FoundationDB: A Distributed, Unbundled, Transactional Key Value Store&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;conference&#x2F;osdi20&#x2F;presentation&#x2F;balakrishnan&quot;&gt;Virtual consensus with Delos&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;I&#x27;m also trying to organize them in my &lt;a href=&quot;https:&#x2F;&#x2F;www.zotero.org&#x2F;pierre.zemb&#x2F;library&quot;&gt;Zotero library&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">research</category>
          <category domain="tag">learning</category>
          <category domain="tag">engineering</category>
          <category domain="tag">papers</category>
      </item>
      <item>
          <title>Best resources to learn about data and distributed systems</title>
          <pubDate>Mon, 17 Jan 2022 01:37:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/distsys-resources/</link>
          <guid>https://pierrezemb.fr/posts/distsys-resources/</guid>
          <description xml:base="https://pierrezemb.fr/posts/distsys-resources/">&lt;p&gt;Learning distributed systems is tough. You need to go through a lot of academic papers, concepts, and code reviews before you can see the big picture. Thankfully, there are a lot of resources out there that can help you get started. Here&#x27;s a list of resources I used to learn distributed systems. I will keep this blogpost up-to-date with books, conferences, and so on.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;distsys-resources&#x2F;books.jpeg&quot; alt=&quot;&#x2F;posts&#x2F;distsys-resources&#x2F;books.jpeg&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A distributed system is one in which the failure of a computer you didn&#x27;t even know existed can render your own computer unusable.&lt;&#x2F;p&gt;
&lt;p&gt;-Lamport, 1987&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;reading-books&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#reading-books&quot; aria-label=&quot;Anchor link for: reading-books&quot;&gt;🔗&lt;&#x2F;a&gt;Reading 📚&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;designing-data-intensive-applications&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#designing-data-intensive-applications&quot; aria-label=&quot;Anchor link for: designing-data-intensive-applications&quot;&gt;🔗&lt;&#x2F;a&gt;Designing Data-Intensive Applications&lt;&#x2F;h3&gt;
&lt;p&gt;Let&#x27;s start with one of my favorite books, &lt;a href=&quot;https:&#x2F;&#x2F;dataintensive.net&#x2F;&quot;&gt;Designing Data-Intensive Applications&lt;&#x2F;a&gt;, written by &lt;a href=&quot;https:&#x2F;&#x2F;martin.kleppmann.com&#x2F;&quot;&gt;Martin Kleppmann&lt;&#x2F;a&gt;. This is by far the most practical book you will ever find about distributed systems. It covers:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Data models, query languages and encoding,&lt;&#x2F;li&gt;
&lt;li&gt;Replication, partitioning, the associated troubles, consistency, consensus,&lt;&#x2F;li&gt;
&lt;li&gt;Batch and stream processing.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;NoSQL… Big Data… Scalability… CAP Theorem… Eventual Consistency… Sharding…&lt;&#x2F;p&gt;
&lt;p&gt;Nice buzzwords, but how does the stuff actually work?&lt;&#x2F;p&gt;
&lt;p&gt;As software engineers, we need to build applications that are reliable, scalable and maintainable in the long run. We need to understand the range of available tools and their trade-offs. For that, we have to dig deeper than buzzwords.&lt;&#x2F;p&gt;
&lt;p&gt;This book will help you navigate the diverse and fast-changing landscape of technologies for storing and processing data. We compare a broad variety of tools and approaches, so that you can see the strengths and weaknesses of each, and decide what’s best for your application.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;database-internals&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#database-internals&quot; aria-label=&quot;Anchor link for: database-internals&quot;&gt;🔗&lt;&#x2F;a&gt;Database Internals&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.databass.dev&#x2F;&quot;&gt;Database Internals&lt;&#x2F;a&gt;, written by &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;ifesdjeen&quot;&gt;Alex Petrov&lt;&#x2F;a&gt;, is a fantastic book for anyone wondering how a database works. I recommend reading it after &lt;code&gt;Designing Data-Intensive Applications&lt;&#x2F;code&gt;, as it goes into more detail than Martin&#x27;s book.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Have you ever wanted to learn more about Databases but did not know where to start? This is a book just for you.&lt;&#x2F;p&gt;
&lt;p&gt;We can treat databases and other infrastructure components as black boxes, but it doesn’t have to be that way. Sometimes we have to take a closer look at what’s going on because of performance issues. Sometimes databases misbehave, and we need to find out what exactly is going on. Some of us want to work in infrastructure and develop databases. This book’s main intention is to introduce you to the cornerstone concepts and help you understand how databases work.&lt;&#x2F;p&gt;
&lt;p&gt;The book consists of two parts: Storage Engines and Distributed Systems since that’s where most of the differences between the vast majority of databases is coming from.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;distributed-systems&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#distributed-systems&quot; aria-label=&quot;Anchor link for: distributed-systems&quot;&gt;🔗&lt;&#x2F;a&gt;Distributed Systems&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.distributed-systems.net&#x2F;index.php&#x2F;me&#x2F;&quot;&gt;Maarten van Steen&lt;&#x2F;a&gt; wrote a book called &lt;a href=&quot;https:&#x2F;&#x2F;www.distributed-systems.net&#x2F;&quot;&gt;Distributed Systems 3rd edition&lt;&#x2F;a&gt;. It is a nice book, and you can get a digital copy for free.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Distributed systems are like 3D brain teasers: easy to disassemble; hard to put together.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;understanding-distributed-systems&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#understanding-distributed-systems&quot; aria-label=&quot;Anchor link for: understanding-distributed-systems&quot;&gt;🔗&lt;&#x2F;a&gt;Understanding Distributed Systems&lt;&#x2F;h3&gt;
&lt;p&gt;If you are not a backend engineer but still curious about distributed systems, I highly recommend &lt;a href=&quot;https:&#x2F;&#x2F;understandingdistributed.systems&#x2F;&quot;&gt;Understanding Distributed Systems&lt;&#x2F;a&gt;. &lt;a href=&quot;https:&#x2F;&#x2F;robertovitillo.com&#x2F;&quot;&gt;Roberto Vitillo&lt;&#x2F;a&gt; does an amazing job of making the subject accessible.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want to learn how to build scalable and fault-tolerant cloud applications?&lt;&#x2F;p&gt;
&lt;p&gt;This book will teach you the core principles of distributed systems so that you don’t have to spend countless hours trying to understand how everything fits together.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;the-internals-of-postgresql&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-internals-of-postgresql&quot; aria-label=&quot;Anchor link for: the-internals-of-postgresql&quot;&gt;🔗&lt;&#x2F;a&gt;The Internals of PostgreSQL&lt;&#x2F;h3&gt;
&lt;p&gt;PostgreSQL has been getting a lot of love and traction in recent years, and &lt;a href=&quot;https:&#x2F;&#x2F;www.interdb.jp&#x2F;&quot;&gt;Hironobu Suzuki&lt;&#x2F;a&gt; wrote a terrific book about it: &lt;a href=&quot;https:&#x2F;&#x2F;www.interdb.jp&#x2F;pg&#x2F;index.html&quot;&gt;The Internals of PostgreSQL&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;PostgreSQL is a well-designed open-source multi-purpose relational database system which is widely used throughout the world. It is one huge system with the integrated subsystems, each of which has a particular complex feature and works with each other cooperatively. Although understanding of the internal mechanism is crucial for both administration and integration using PostgreSQL, its hugeness and complexity prevent it. The main purposes of this document are to explain how each subsystem works, and to provide the whole picture of PostgreSQL.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;jepsen-blog&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#jepsen-blog&quot; aria-label=&quot;Anchor link for: jepsen-blog&quot;&gt;🔗&lt;&#x2F;a&gt;Jepsen blog&lt;&#x2F;h3&gt;
&lt;p&gt;We often use databases as a source of truth, but they are also pieces of software with bugs in them. Kyle Kingsbury is the most famous database-breaker with &lt;a href=&quot;http:&#x2F;&#x2F;jepsen.io&#x2F;&quot;&gt;Jepsen&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc. We maintain an open source software library for systems testing, as well as blog posts and conference talks exploring particular systems’ failure modes. In each analysis we explore whether the system lives up to its documentation’s claims, file new bugs, and suggest recommendations for operators.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;You will find analyses of many databases, such as CockroachDB, etcd, Kafka, MongoDB, and so on.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;aphyr-distsys-class-notes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#aphyr-distsys-class-notes&quot; aria-label=&quot;Anchor link for: aphyr-distsys-class-notes&quot;&gt;🔗&lt;&#x2F;a&gt;Aphyr distsys class notes&lt;&#x2F;h3&gt;
&lt;p&gt;Following Jepsen, here&#x27;s a great bonus: Kyle is also teaching distributed systems, and his notes are &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;aphyr&#x2F;distsys-class#an-introduction-to-distributed-systems&quot;&gt;available&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;distributed-systems-for-fun-and-profit&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#distributed-systems-for-fun-and-profit&quot; aria-label=&quot;Anchor link for: distributed-systems-for-fun-and-profit&quot;&gt;🔗&lt;&#x2F;a&gt;Distributed systems for fun and profit&lt;&#x2F;h3&gt;
&lt;p&gt;Despite being free, &lt;a href=&quot;http:&#x2F;&#x2F;book.mixu.net&#x2F;distsys&#x2F;&quot;&gt;Distributed systems for fun and profit&lt;&#x2F;a&gt; is an awesome book. The author, &lt;a href=&quot;http:&#x2F;&#x2F;mixu.net&#x2F;&quot;&gt;Mikito Takada&lt;&#x2F;a&gt;, has done terrific work making distributed systems accessible.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;I wanted a text that would bring together the ideas behind many of the more recent distributed systems - systems such as Amazon&#x27;s Dynamo, Google&#x27;s BigTable and MapReduce, Apache&#x27;s Hadoop and so on.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;In this text I&#x27;ve tried to provide a more accessible introduction to distributed systems. To me, that means two things: introducing the key concepts that you will need in order to have a good time reading more serious texts, and providing a narrative that covers things in enough detail that you get a gist of what&#x27;s going on without getting stuck on details.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;translucent-databases&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#translucent-databases&quot; aria-label=&quot;Anchor link for: translucent-databases&quot;&gt;🔗&lt;&#x2F;a&gt;Translucent Databases&lt;&#x2F;h3&gt;
&lt;p&gt;I really like the pitch of the book:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Do you have personal information in your database?&lt;&#x2F;p&gt;
&lt;p&gt;Do you keep files on your customers, your employees, or anyone else?&lt;&#x2F;p&gt;
&lt;p&gt;Do you need to worry about European laws restricting the information you keep?&lt;&#x2F;p&gt;
&lt;p&gt;Do you keep copies of credit card numbers, social security numbers, or other information that might be useful to identity thieves or insurance fraudsters?&lt;&#x2F;p&gt;
&lt;p&gt;Do you deal with medical records or personal secrets?&lt;&#x2F;p&gt;
&lt;p&gt;Most database administrators have some of these worries. Some have all of them. That&#x27;s why database security is so important.&lt;&#x2F;p&gt;
&lt;p&gt;This new book, Translucent Databases, describes a different attitude toward protecting the information.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;http:&#x2F;&#x2F;wayner.org&#x2F;node&#x2F;46&quot;&gt;Translucent Databases&lt;&#x2F;a&gt; is a short book focused on how to store sensitive data. You will find several dozen interesting case studies on how to efficiently and privately store sensitive data. A must-have.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-art-of-postgresql&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-art-of-postgresql&quot; aria-label=&quot;Anchor link for: the-art-of-postgresql&quot;&gt;🔗&lt;&#x2F;a&gt;The Art of PostgreSQL&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;theartofpostgresql.com&#x2F;&quot;&gt;The Art of PostgreSQL&lt;&#x2F;a&gt; is all about showing the power of both SQL and PostgreSQL. It explains the hows and whys of using Postgres&#x27;s many features, and how you, as a developer, can take advantage of them. A brilliant book that should be read by every developer.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;This book is for developers, covering advanced SQL techniques for data processing. Learn how to get exactly the result set you need in your application’s code!&lt;&#x2F;p&gt;
&lt;p&gt;Learn advanced SQL with practical examples and datasets that help you get the most of the book! Every query solves a practical use case and is given in context.&lt;&#x2F;p&gt;
&lt;p&gt;The book covers (de-)normalisation with simple practical examples to dive into this seemingly complex topic, including Caching and Indexing Strategy.&lt;&#x2F;p&gt;
&lt;p&gt;Writing efficient SQL is easier than it looks, and begins with database modeling and writing clear code. The book teaches you how to write fast queries!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;readings-in-database-systems&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#readings-in-database-systems&quot; aria-label=&quot;Anchor link for: readings-in-database-systems&quot;&gt;🔗&lt;&#x2F;a&gt;Readings in Database Systems&lt;&#x2F;h3&gt;
&lt;p&gt;Another free book, &lt;a href=&quot;http:&#x2F;&#x2F;www.redbook.io&#x2F;&quot;&gt;Readings in Database Systems&lt;&#x2F;a&gt; is a great read if you are looking for a short, opinionated review of subjects like architecture, engines, analytics, and so on.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Readings in Database Systems (commonly known as the &quot;Red Book&quot;) has offered readers an opinionated take on both classic and cutting-edge research in the field of data management since 1988. Here, we present the Fifth Edition of the Red Book — the first in over ten years.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;watching-tv&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#watching-tv&quot; aria-label=&quot;Anchor link for: watching-tv&quot;&gt;🔗&lt;&#x2F;a&gt;Watching 📺&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;cmu-database-group&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#cmu-database-group&quot; aria-label=&quot;Anchor link for: cmu-database-group&quot;&gt;🔗&lt;&#x2F;a&gt;CMU Database Group&lt;&#x2F;h3&gt;
&lt;p&gt;The Database Group at Carnegie Mellon University has been publishing a lot of content, including:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;playlist?list=PLSE8ODhjZXjZaHA6QcxDfJ0SIWBzQFKEG&quot;&gt;Intro to Database Systems lecture&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;playlist?list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O&quot;&gt;Advanced Database Systems lecture&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These are the best lectures about databases, in my opinion.&lt;&#x2F;p&gt;
&lt;p&gt;I also recommend their Quarantine database talks playlists:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;the &quot;Quarantine Database Tech Talks&quot; is a on-line seminar series at Carnegie Mellon University with leading developers and researchers of database systems. Each speaker will present the implementation details of their respective systems and examples of the technical challenges that they faced when working with real-world customers.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;playlist?list=PLSE8ODhjZXjbeqnfuvp30VrI7VXiFuOXS&quot;&gt;Vaccination Database Tech Talks First Dose&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;playlist?list=PLSE8ODhjZXjbDOFN4U4-Uv95-N8sgzs5D&quot;&gt;Vaccination Database Tech Talks Second Dose&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;distributed-systems-lecture-series&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#distributed-systems-lecture-series&quot; aria-label=&quot;Anchor link for: distributed-systems-lecture-series&quot;&gt;🔗&lt;&#x2F;a&gt;Distributed Systems lecture series&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;martin.kleppmann.com&#x2F;&quot;&gt;Martin Kleppmann&lt;&#x2F;a&gt; (author of &lt;code&gt;Designing Data-Intensive Applications&lt;&#x2F;code&gt;) published an &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB&quot;&gt;8-lecture series on distributed systems&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;This video is part of an 8-lecture series on distributed systems, given as part of the undergraduate computer science course at the University of Cambridge.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;academic-conferences&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#academic-conferences&quot; aria-label=&quot;Anchor link for: academic-conferences&quot;&gt;🔗&lt;&#x2F;a&gt;Academic conferences&lt;&#x2F;h3&gt;
&lt;p&gt;Keeping track of the academic world is not easy, but thankfully, we can follow several data-related academic conferences, including:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http:&#x2F;&#x2F;cidrdb.org&quot;&gt;CIDR&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;sigmod.org&#x2F;&quot;&gt;SIGMOD&#x2F;PODS&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;vldb.org&quot;&gt;VLDB&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;papoc-workshop.github.io&#x2F;2022&#x2F;&quot;&gt;PaPoC&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;industrial-conference&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#industrial-conference&quot; aria-label=&quot;Anchor link for: industrial-conference&quot;&gt;🔗&lt;&#x2F;a&gt;Industrial conference&lt;&#x2F;h3&gt;
&lt;p&gt;There are not many database-focused conferences, but you may be interested in talks from:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;hydraconf.com&#x2F;&quot;&gt;HydraConf&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.hytradboi.com&#x2F;&quot;&gt;HYTRADBOI&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;distsys-reading-group-sessions&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#distsys-reading-group-sessions&quot; aria-label=&quot;Anchor link for: distsys-reading-group-sessions&quot;&gt;🔗&lt;&#x2F;a&gt;DistSys Reading Group sessions&lt;&#x2F;h3&gt;
&lt;p&gt;If you are looking for explanations about a distributed systems paper, you may be interested in the &lt;a href=&quot;http:&#x2F;&#x2F;charap.co&#x2F;category&#x2F;reading-group&#x2F;&quot;&gt;DistSys Reading Group&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Every week we present and discuss one distributed systems paper. We try to focus on relatively new papers, although we occasionally break this rule for some important older publications. The main objective of this group is to share knowledge through the discussion. Our participants come from academia and industry and often carry a unique perspective and expertise on the subject matter.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Every session can be found on their &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;channel&#x2F;UCMKIroHVXvMQRIBhENE6RhQ&quot;&gt;YouTube channel&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;coding-adult-computer&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#coding-adult-computer&quot; aria-label=&quot;Anchor link for: coding-adult-computer&quot;&gt;🔗&lt;&#x2F;a&gt;Coding 🧑‍💻&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;maelstrom&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#maelstrom&quot; aria-label=&quot;Anchor link for: maelstrom&quot;&gt;🔗&lt;&#x2F;a&gt;Maelstrom&lt;&#x2F;h3&gt;
&lt;p&gt;Ever wanted to develop your own toy distributed system? Fear no more, you can use &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jepsen-io&#x2F;maelstrom&quot;&gt;Maelstrom&lt;&#x2F;a&gt; for that!&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Maelstrom is a workbench for learning distributed systems by writing your own. It uses the Jepsen testing library to test toy implementations of distributed systems. Maelstrom provides standardized tests for things like &quot;a commutative set&quot; or &quot;a transactional key-value store&quot;, and lets you learn by writing implementations which those test suites can exercise. It&#x27;s used as a part of a distributed systems workshop by Jepsen.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Maelstrom provides a range of tests for different kinds of distributed systems, built on top of a simple JSON protocol via STDIN and STDOUT. Users write servers in any language. Maelstrom runs those servers, sends them requests, routes messages via a simulated network, and checks that clients observe expected behavior. You want to write Plumtree in Bash? Byzantine Paxos in Intercal? Maelstrom is for you.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;pingcap-s-talent-plan&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#pingcap-s-talent-plan&quot; aria-label=&quot;Anchor link for: pingcap-s-talent-plan&quot;&gt;🔗&lt;&#x2F;a&gt;PingCAP&#x27;s Talent Plan&lt;&#x2F;h3&gt;
&lt;p&gt;PingCAP is the company behind the TiDB&#x2F;TiKV stack, a recent distributed database. They developed their own &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pingcap&#x2F;talent-plan&quot;&gt;open source training program&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Talent Plan is an open source training program initiated by PingCAP. It aims to create or combine some open source learning materials for people interested in open source, distributed systems, Rust, Golang, and other infrastructure knowledge. As such, it provides a series of courses focused on open source collaboration, rust programming, distributed database and systems.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I went through the Raft project in Rust and I learned a lot!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;patterns-of-distributed-systems&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#patterns-of-distributed-systems&quot; aria-label=&quot;Anchor link for: patterns-of-distributed-systems&quot;&gt;🔗&lt;&#x2F;a&gt;Patterns of Distributed Systems&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;unmeshjoshi&quot;&gt;Unmesh Joshi&lt;&#x2F;a&gt; is writing an ongoing series called &lt;a href=&quot;https:&#x2F;&#x2F;martinfowler.com&#x2F;articles&#x2F;patterns-of-distributed-systems&#x2F;&quot;&gt;Patterns of Distributed Systems&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Distributed systems provide a particular challenge to program. They often require us to have multiple copies of data, which need to keep synchronized. Yet we cannot rely on processing nodes working reliably, and network delays can easily lead to inconsistencies. Despite this, many organizations rely on a range of core distributed software handling data storage, messaging, system management, and compute capability. These systems face common problems which they solve with similar solutions. This article recognizes and develops these solutions as patterns, with which we can build up an understanding of how to better understand, communicate and teach distributed system design.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;reading-lists-eyes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#reading-lists-eyes&quot; aria-label=&quot;Anchor link for: reading-lists-eyes&quot;&gt;🔗&lt;&#x2F;a&gt;Reading lists 👀&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;dan-creswell-s-reading-list&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#dan-creswell-s-reading-list&quot; aria-label=&quot;Anchor link for: dan-creswell-s-reading-list&quot;&gt;🔗&lt;&#x2F;a&gt;Dan Creswell&#x27;s reading List&lt;&#x2F;h3&gt;
&lt;p&gt;If you want more content, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dancres&quot;&gt;Dan Creswell&lt;&#x2F;a&gt; has a nice &lt;a href=&quot;https:&#x2F;&#x2F;dancres.github.io&#x2F;Pages&#x2F;&quot;&gt;Distributed Systems Reading List&lt;&#x2F;a&gt; 🚀&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, you can find me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">learning</category>
          <category domain="tag">distributed</category>
          <category domain="tag">education</category>
          <category domain="tag">database</category>
      </item>
      <item>
          <title>Crafting row keys in FoundationDB</title>
          <pubDate>Sun, 21 Feb 2021 00:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/crafting-keys-in-fdb/</link>
          <guid>https://pierrezemb.fr/posts/crafting-keys-in-fdb/</guid>
          <description xml:base="https://pierrezemb.fr/posts/crafting-keys-in-fdb/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;fdb-white.jpg&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;While working &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Clikengo&#x2F;foundationdb-rs&#x2F;issues&#x2F;27&quot;&gt;on my latest contribution around FoundationDB and Rust&lt;&#x2F;a&gt;, I had the chance to dig a bit into how FoundationDB&#x27;s bindings offer helpers to generate keys. Their approach is interesting enough to deserve a blog post 😎&lt;&#x2F;p&gt;
&lt;h2 id=&quot;row-key&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#row-key&quot; aria-label=&quot;Anchor link for: row-key&quot;&gt;🔗&lt;&#x2F;a&gt;Row key?&lt;&#x2F;h2&gt;
&lt;p&gt;When you are using a key&#x2F;value store, the design of the &lt;code&gt;row key&lt;&#x2F;code&gt; is extremely important, as this will define how well:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;your scans will be optimized,&lt;&#x2F;li&gt;
&lt;li&gt;your puts will be spread,&lt;&#x2F;li&gt;
&lt;li&gt;you will avoid &lt;code&gt;hot-spotting&lt;&#x2F;code&gt; a shard&#x2F;region.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;If you need more information on &lt;code&gt;row keys&lt;&#x2F;code&gt;, I recommend going through these links before moving on:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;bigtable&#x2F;docs&#x2F;schema-design&quot;&gt;&quot;Designing your schema&quot; BigTable documentation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#rowkey.design&quot;&gt;&quot;Rowkey Design&quot; HBase documentation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;hand-crafting-row-keys&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#hand-crafting-row-keys&quot; aria-label=&quot;Anchor link for: hand-crafting-row-keys&quot;&gt;🔗&lt;&#x2F;a&gt;Hand-crafting row keys&lt;&#x2F;h2&gt;
&lt;p&gt;Most of the time, you will need to craft the &lt;code&gt;row key&lt;&#x2F;code&gt; &quot;by hand&quot;, like this in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;senx&#x2F;warp10-platform&#x2F;blob&#x2F;879734d7f63791b487f3e535cd79ac4c23e99377&#x2F;warp10&#x2F;src&#x2F;main&#x2F;java&#x2F;io&#x2F;warp10&#x2F;continuum&#x2F;store&#x2F;Store.java#L1215-L1222&quot;&gt;an HBase app&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Prefix + classId + labelsId + timestamp
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 128 bits
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;byte[]&lt;&#x2F;span&gt;&lt;span&gt; rowkey = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;new byte&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Constants&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;HBASE_RAW_DATA_KEY_PREFIX&lt;&#x2F;span&gt;&lt;span&gt;.length + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;8 &lt;&#x2F;span&gt;&lt;span&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;8 &lt;&#x2F;span&gt;&lt;span&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;];
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;System&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;arraycopy&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Constants&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;HBASE_RAW_DATA_KEY_PREFIX&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, rowkey, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Constants&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;HBASE_RAW_DATA_KEY_PREFIX&lt;&#x2F;span&gt;&lt;span&gt;.length);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Copy classId&#x2F;labelsId
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;System&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;arraycopy&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Longs&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;toByteArray&lt;&#x2F;span&gt;&lt;span&gt;(msg.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;getClassId&lt;&#x2F;span&gt;&lt;span&gt;()), &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, rowkey, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Constants&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;HBASE_RAW_DATA_KEY_PREFIX&lt;&#x2F;span&gt;&lt;span&gt;.length, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;System&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;arraycopy&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Longs&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;toByteArray&lt;&#x2F;span&gt;&lt;span&gt;(msg.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;getLabelsId&lt;&#x2F;span&gt;&lt;span&gt;()), &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, rowkey, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Constants&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;HBASE_RAW_DATA_KEY_PREFIX&lt;&#x2F;span&gt;&lt;span&gt;.length + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or maybe you will wrap things in a function &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pingcap&#x2F;tidb&#x2F;blob&#x2F;ef57bdbbb04f60a8be744060a99207e08a37514a&#x2F;tablecodec&#x2F;tablecodec.go#L80-L86&quot;&gt;like this in Go&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; EncodeRowKey encodes the table id and record handle into a kv.Key
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;func &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;EncodeRowKey&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tableID &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;int64&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;encodedHandle &lt;&#x2F;span&gt;&lt;span&gt;[]&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;byte&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;kv&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;Key &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;buf &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;make&lt;&#x2F;span&gt;&lt;span&gt;([]&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;byte&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;prefixLen&lt;&#x2F;span&gt;&lt;span&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;encodedHandle&lt;&#x2F;span&gt;&lt;span&gt;))
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;buf &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;appendTableRecordPrefix&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;buf&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tableID&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;buf &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;append&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;buf&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;encodedHandle&lt;&#x2F;span&gt;&lt;span&gt;...)
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;buf
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each time, you need to wrap the complexity of converting your objects to a row key by creating a buffer and writing into it.&lt;&#x2F;p&gt;
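&lt;p&gt;The same hand-crafting can be sketched in Rust with only the standard library. This is a minimal illustration, not the code of any real project: the &lt;code&gt;PREFIX&lt;&#x2F;code&gt; value, the &lt;code&gt;encode_row_key&lt;&#x2F;code&gt; name and the unsigned field types are assumptions made for the example.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Hypothetical layout: prefix + class id + labels id + timestamp.
&#x2F;&#x2F; The prefix value is an assumption made for this sketch.
const PREFIX: &amp;amp;[u8] = b&amp;quot;R&amp;quot;;

&#x2F;&#x2F; Build a row key by hand. Every field is written as 8 big-endian bytes,
&#x2F;&#x2F; so keys sort by class id, then labels id, then timestamp.
fn encode_row_key(class_id: u64, labels_id: u64, timestamp: u64) -&amp;gt; Vec&amp;lt;u8&amp;gt; {
    let mut key = Vec::with_capacity(PREFIX.len() + 3 * 8);
    key.extend_from_slice(PREFIX);
    key.extend_from_slice(&amp;amp;class_id.to_be_bytes());
    key.extend_from_slice(&amp;amp;labels_id.to_be_bytes());
    key.extend_from_slice(&amp;amp;timestamp.to_be_bytes());
    key
}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because every field is fixed-width and big-endian, &lt;code&gt;encode_row_key(1, 2, 3)&lt;&#x2F;code&gt; sorts before &lt;code&gt;encode_row_key(1, 2, 4)&lt;&#x2F;code&gt;, which is the property range scans depend on.&lt;&#x2F;p&gt;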
&lt;p&gt;In our Java example, there is an interesting comment:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Prefix + classId + labelsId + timestamp
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we replace a few characters, we are not far from:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; (Prefix, classId, labelsId, timestamp)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This looks like a &lt;code&gt;Tuple&lt;&#x2F;code&gt; (a collection of values of different types), and this is the abstraction FoundationDB uses to create keys 😍&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fdb-s-abstractions-and-helpers&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#fdb-s-abstractions-and-helpers&quot; aria-label=&quot;Anchor link for: fdb-s-abstractions-and-helpers&quot;&gt;🔗&lt;&#x2F;a&gt;FDB&#x27;s abstractions and helpers&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;tuple&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#tuple&quot; aria-label=&quot;Anchor link for: tuple&quot;&gt;🔗&lt;&#x2F;a&gt;Tuple&lt;&#x2F;h3&gt;
&lt;p&gt;Instead of crafting bytes by hand, we are &lt;code&gt;packing&lt;&#x2F;code&gt; a Tuple:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; create a Tuple&amp;lt;String, i64&amp;gt; with (&amp;quot;tenant-1&amp;quot;, 42)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; tuple = (String::from(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant-1&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;), &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; and compute a row-key from the Tuple
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; row_key = foundationdb::tuple::pack::&amp;lt;(String, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;)&amp;gt;(&amp;amp;tuple);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The generated row key will be readable from any binding, as its construction is standardized. Let&#x27;s print it:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; and print it in hexadecimal
&lt;&#x2F;span&gt;&lt;span&gt;println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;{:#04X?}&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, row_key);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre data-lang=&quot;txt&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-txt &quot;&gt;&lt;code class=&quot;language-txt&quot; data-lang=&quot;txt&quot;&gt;&lt;span&gt;&#x2F;&#x2F; can be verified with https:&#x2F;&#x2F;www.utf8-chartable.de&#x2F;unicode-utf8-table.pl
&lt;&#x2F;span&gt;&lt;span&gt;[
&lt;&#x2F;span&gt;&lt;span&gt;    0x02,
&lt;&#x2F;span&gt;&lt;span&gt;    0x74, &#x2F;&#x2F; t
&lt;&#x2F;span&gt;&lt;span&gt;    0x65, &#x2F;&#x2F; e 
&lt;&#x2F;span&gt;&lt;span&gt;    0x6E, &#x2F;&#x2F; n
&lt;&#x2F;span&gt;&lt;span&gt;    0x61, &#x2F;&#x2F; a
&lt;&#x2F;span&gt;&lt;span&gt;    0x6E, &#x2F;&#x2F; n
&lt;&#x2F;span&gt;&lt;span&gt;    0x74, &#x2F;&#x2F; t
&lt;&#x2F;span&gt;&lt;span&gt;    0x2D, &#x2F;&#x2F; -
&lt;&#x2F;span&gt;&lt;span&gt;    0x31, &#x2F;&#x2F; 1
&lt;&#x2F;span&gt;&lt;span&gt;    0x00, 
&lt;&#x2F;span&gt;&lt;span&gt;    0x15,
&lt;&#x2F;span&gt;&lt;span&gt;    0x2A, &#x2F;&#x2F; 42
&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As you can see, &lt;code&gt;pack&lt;&#x2F;code&gt; added some extra bytes. They are used to recognize the type of the next element, a bit like when you are encoding&#x2F;decoding a wire protocol. You can find the relevant documentation &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;master&#x2F;design&#x2F;tuple.md&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
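&lt;p&gt;To make those extra bytes less magical, here is a simplified sketch of the encoding for two element types only: UTF-8 strings and non-negative integers. It is not the bindings&#x27; implementation, and the function names &lt;code&gt;pack_str&lt;&#x2F;code&gt;, &lt;code&gt;pack_uint&lt;&#x2F;code&gt; and &lt;code&gt;pack_tenant_key&lt;&#x2F;code&gt; are made up for the example.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Simplified sketch: only strings and non-negative integers are handled.
fn pack_str(s: &amp;amp;str, out: &amp;amp;mut Vec&amp;lt;u8&amp;gt;) {
    out.push(0x02); &#x2F;&#x2F; type code: unicode string
    for &amp;amp;b in s.as_bytes() {
        out.push(b);
        if b == 0x00 {
            out.push(0xFF); &#x2F;&#x2F; escape embedded NUL bytes
        }
    }
    out.push(0x00); &#x2F;&#x2F; string terminator
}

fn pack_uint(n: u64, out: &amp;amp;mut Vec&amp;lt;u8&amp;gt;) {
    &#x2F;&#x2F; Keep only the significant big-endian bytes.
    let bytes = n.to_be_bytes();
    let skip = bytes.iter().take_while(|&amp;amp;&amp;amp;b| b == 0).count();
    &#x2F;&#x2F; 0x14 encodes zero; 0x15..=0x1C encode a positive integer on 1..=8 bytes.
    out.push(0x14 + (8 - skip) as u8);
    out.extend_from_slice(&amp;amp;bytes[skip..]);
}

&#x2F;&#x2F; Pack a (string, integer) pair, element after element.
fn pack_tenant_key(tenant: &amp;amp;str, n: u64) -&amp;gt; Vec&amp;lt;u8&amp;gt; {
    let mut out = Vec::new();
    pack_str(tenant, &amp;amp;mut out);
    pack_uint(n, &amp;amp;mut out);
    out
}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Calling &lt;code&gt;pack_tenant_key(&quot;tenant-1&quot;, 42)&lt;&#x2F;code&gt; yields the same twelve bytes as the dump above: &lt;code&gt;0x02&lt;&#x2F;code&gt; announces a string, &lt;code&gt;0x00&lt;&#x2F;code&gt; ends it, and &lt;code&gt;0x15&lt;&#x2F;code&gt; announces a one-byte positive integer.&lt;&#x2F;p&gt;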
&lt;p&gt;Having this kind of standard means that we can easily decompose&#x2F;&lt;code&gt;unpack&lt;&#x2F;code&gt; it:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; retrieve the user and the magic number in a Tuple (String, i64)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; from_row_key = foundationdb::tuple::unpack::&amp;lt;(String, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;)&amp;gt;(&amp;amp;row_key)?;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;user=&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;&amp;#39;, magic_number=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, from_row_key.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, from_row_key.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; user=&amp;#39;tenant-1&amp;#39;, magic_number=42
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now that we have seen &lt;code&gt;Tuples&lt;&#x2F;code&gt;, let&#x27;s dig into the next abstraction: &lt;code&gt;subspaces&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;subspace&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#subspace&quot; aria-label=&quot;Anchor link for: subspace&quot;&gt;🔗&lt;&#x2F;a&gt;Subspace&lt;&#x2F;h3&gt;
&lt;p&gt;When working with key-value stores, we often play with what we call &lt;code&gt;keyspaces&lt;&#x2F;code&gt;, dedicating a portion of the key to a specific usage, like this for example:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;txt&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-txt &quot;&gt;&lt;code class=&quot;language-txt&quot; data-lang=&quot;txt&quot;&gt;&lt;span&gt;&#x2F;users&#x2F;tenant-1&#x2F;...
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;users&#x2F;tenant-2&#x2F;...
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;users&#x2F;tenant-3&#x2F;...
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, &lt;code&gt;&#x2F;users&#x2F;tenant-1&#x2F;&lt;&#x2F;code&gt; can be viewed as a prefix under which we will put all the relevant keys. Instead of passing a simple prefix around, FoundationDB offers a dedicated structure called a &lt;code&gt;Subspace&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A Subspace represents a well-defined region of keyspace in a FoundationDB database&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;It provides a convenient way to use FoundationDB tuples to define namespaces for different categories of data. The namespace is specified by a prefix tuple which is prepended to all tuples packed by the subspace. When unpacking a key with the subspace, the prefix tuple will be removed from the result.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;As you can see, the &lt;code&gt;Subspace&lt;&#x2F;code&gt; relies heavily on FoundationDB&#x27;s tuples, as we can &lt;code&gt;pack&lt;&#x2F;code&gt; and &lt;code&gt;unpack&lt;&#x2F;code&gt; keys with it.&lt;&#x2F;p&gt;
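&lt;p&gt;The prefix behaviour described in the quote can be sketched without the bindings. &lt;code&gt;PrefixSpace&lt;&#x2F;code&gt; below is a hypothetical stand-in, not the &lt;code&gt;Subspace&lt;&#x2F;code&gt; type of any binding, and it works on keys that are already packed:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Toy stand-in for a subspace: a raw byte prefix shared by every key it produces.
struct PrefixSpace {
    prefix: Vec&amp;lt;u8&amp;gt;,
}

impl PrefixSpace {
    &#x2F;&#x2F; Prepend the prefix to an already-packed key.
    fn pack(&amp;amp;self, packed_key: &amp;amp;[u8]) -&amp;gt; Vec&amp;lt;u8&amp;gt; {
        let mut key = self.prefix.clone();
        key.extend_from_slice(packed_key);
        key
    }

    &#x2F;&#x2F; Remove the prefix, or return None when the key belongs to another keyspace.
    fn unpack(&amp;amp;self, key: &amp;amp;[u8]) -&amp;gt; Option&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt; {
        key.strip_prefix(self.prefix.as_slice())
            .map(|rest| rest.to_vec())
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With a prefix of &lt;code&gt;[0x15, 0x12]&lt;&#x2F;code&gt;, packing &lt;code&gt;[0x02, 0x61, 0x00]&lt;&#x2F;code&gt; gives &lt;code&gt;[0x15, 0x12, 0x02, 0x61, 0x00]&lt;&#x2F;code&gt;, and unpacking that key returns the original bytes.&lt;&#x2F;p&gt;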
&lt;blockquote&gt;
&lt;p&gt;As a best practice, API clients should use at least one subspace for application data.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Well, now that we have the tools to handle keyspaces easily, there is little reason left to craft keys by hand 🙃 Let&#x27;s create a subspace!&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; create a subspace from the Tuple (&amp;quot;tenant-1&amp;quot;, 42)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; subspace = Subspace::from((String::from(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant-1&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;), &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span&gt;));
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; let&amp;#39;s print the range
&lt;&#x2F;span&gt;&lt;span&gt;println!(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;start: {:#04X?}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt; end: {:#04X?}&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, subspace.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;range&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, subspace.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;range&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can observe this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;txt&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-txt &quot;&gt;&lt;code class=&quot;language-txt&quot; data-lang=&quot;txt&quot;&gt;&lt;span&gt;&#x2F;&#x2F; can be verified with https:&#x2F;&#x2F;www.utf8-chartable.de&#x2F;unicode-utf8-table.pl
&lt;&#x2F;span&gt;&lt;span&gt;start: [
&lt;&#x2F;span&gt;&lt;span&gt;    0x02,
&lt;&#x2F;span&gt;&lt;span&gt;    0x74, &#x2F;&#x2F; t
&lt;&#x2F;span&gt;&lt;span&gt;    0x65, &#x2F;&#x2F; e 
&lt;&#x2F;span&gt;&lt;span&gt;    0x6E, &#x2F;&#x2F; n
&lt;&#x2F;span&gt;&lt;span&gt;    0x61, &#x2F;&#x2F; a
&lt;&#x2F;span&gt;&lt;span&gt;    0x6E, &#x2F;&#x2F; n
&lt;&#x2F;span&gt;&lt;span&gt;    0x74, &#x2F;&#x2F; t
&lt;&#x2F;span&gt;&lt;span&gt;    0x2D, &#x2F;&#x2F; -
&lt;&#x2F;span&gt;&lt;span&gt;    0x31, &#x2F;&#x2F; 1
&lt;&#x2F;span&gt;&lt;span&gt;    0x00, 
&lt;&#x2F;span&gt;&lt;span&gt;    0x15,
&lt;&#x2F;span&gt;&lt;span&gt;    0x2A, &#x2F;&#x2F; 42
&lt;&#x2F;span&gt;&lt;span&gt;    0x00,
&lt;&#x2F;span&gt;&lt;span&gt;    0x00, &#x2F;&#x2F; smallest possible byte
&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;end: [
&lt;&#x2F;span&gt;&lt;span&gt;    0x02,
&lt;&#x2F;span&gt;&lt;span&gt;    0x74, &#x2F;&#x2F; t
&lt;&#x2F;span&gt;&lt;span&gt;    0x65, &#x2F;&#x2F; e 
&lt;&#x2F;span&gt;&lt;span&gt;    0x6E, &#x2F;&#x2F; n
&lt;&#x2F;span&gt;&lt;span&gt;    0x61, &#x2F;&#x2F; a
&lt;&#x2F;span&gt;&lt;span&gt;    0x6E, &#x2F;&#x2F; n
&lt;&#x2F;span&gt;&lt;span&gt;    0x74, &#x2F;&#x2F; t
&lt;&#x2F;span&gt;&lt;span&gt;    0x2D, &#x2F;&#x2F; -
&lt;&#x2F;span&gt;&lt;span&gt;    0x31, &#x2F;&#x2F; 1
&lt;&#x2F;span&gt;&lt;span&gt;    0x00, 
&lt;&#x2F;span&gt;&lt;span&gt;    0x15,
&lt;&#x2F;span&gt;&lt;span&gt;    0x2A, &#x2F;&#x2F; 42
&lt;&#x2F;span&gt;&lt;span&gt;    0x00,
&lt;&#x2F;span&gt;&lt;span&gt;    0xFF, &#x2F;&#x2F; biggest possible byte
&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Which makes sense: if we take &lt;code&gt;(&quot;tenant-1&quot;, 42)&lt;&#x2F;code&gt; as a prefix, then the range for this subspace will be between &lt;code&gt;(&quot;tenant-1&quot;, 42, 0x00)&lt;&#x2F;code&gt; and &lt;code&gt;(&quot;tenant-1&quot;, 42, 0xFF)&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;directory&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#directory&quot; aria-label=&quot;Anchor link for: directory&quot;&gt;🔗&lt;&#x2F;a&gt;Directory&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we know our way around &lt;code&gt;Tuples&lt;&#x2F;code&gt; and &lt;code&gt;Subspaces&lt;&#x2F;code&gt;, we can talk about what I&#x27;m working on, which is the &lt;code&gt;Directory&lt;&#x2F;code&gt;. Let&#x27;s have a look at the relevant &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;developer-guide.html#directories&quot;&gt;documentation&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;FoundationDB provides directories (available in each language binding) as a tool for managing related subspaces.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Directories are a recommended approach for administering applications. Each application should create or open at least one directory to manage its subspaces.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Okay, let&#x27;s see the API (in Go, as I&#x27;m working on the Rust API):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;subspace&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;directory&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;CreateOrOpen&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;, []&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;string&lt;&#x2F;span&gt;&lt;span&gt;{&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;application&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;my-app&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant-42&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;}, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;log&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Fatal&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;fmt&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Printf&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;%+v&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;subspace&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Bytes&lt;&#x2F;span&gt;&lt;span&gt;())
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; [21 18]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can see that we have a shorter subspace! The &lt;code&gt;directory&lt;&#x2F;code&gt; allows you to generate an integer that will be bound to a path, like here &lt;code&gt;&quot;application&quot;, &quot;my-app&quot;, &quot;tenant&quot;, &quot;tenant-42&quot;&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;There are two advantages to this:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;shorter keys,&lt;&#x2F;li&gt;
&lt;li&gt;cheap metadata operations like &lt;code&gt;List&lt;&#x2F;code&gt; or &lt;code&gt;Move&lt;&#x2F;code&gt;:&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; list all tenants in &amp;quot;application&amp;quot;, &amp;quot;my-app&amp;quot;, &amp;quot;tenant&amp;quot;:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tenants&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;directory&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;List&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;, []&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;string&lt;&#x2F;span&gt;&lt;span&gt;{&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;application&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;my-app&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;})
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;log&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Fatal&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;fmt&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Printf&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;%+v&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tenants&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; [tenant-42]
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; renaming &amp;#39;tenant-42&amp;#39; to &amp;#39;tenant-142&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; This will NOT move the data, only the metadata is modified
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;directorySubspace&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;directory&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Move&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;db&lt;&#x2F;span&gt;&lt;span&gt;, 
&lt;&#x2F;span&gt;&lt;span&gt; []&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;string&lt;&#x2F;span&gt;&lt;span&gt;{&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;application&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;my-app&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant-42&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;},  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; old path
&lt;&#x2F;span&gt;&lt;span&gt; []&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;string&lt;&#x2F;span&gt;&lt;span&gt;{&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;application&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;my-app&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;tenant-142&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;}) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; new path
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;log&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Fatal&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;fmt&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Printf&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;%+v&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;directorySubspace&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Bytes&lt;&#x2F;span&gt;&lt;span&gt;())
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; still [21 18]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The returned object is actually a &lt;code&gt;DirectorySubspace&lt;&#x2F;code&gt;, which implements both &lt;code&gt;Directory&lt;&#x2F;code&gt; and &lt;code&gt;Subspace&lt;&#x2F;code&gt;, so you can use it to create as many nested directories and subspaces as you want 👌&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are wondering how this integer is generated, I recommend going through this awesome blog post on &lt;a href=&quot;https:&#x2F;&#x2F;activesphere.com&#x2F;blog&#x2F;2018&#x2F;08&#x2F;05&#x2F;high-contention-allocator&quot;&gt;how the high contention allocator works in FoundationDB&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
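&lt;p&gt;To make the path-to-prefix binding concrete, here is a minimal in-memory sketch. It is not the FoundationDB directory layer: &lt;code&gt;toyDirectory&lt;&#x2F;code&gt;, &lt;code&gt;createOrOpen&lt;&#x2F;code&gt; and &lt;code&gt;move&lt;&#x2F;code&gt; are hypothetical names, and a plain counter stands in for the real allocator. It only illustrates why a move touches the metadata and leaves the prefix, and therefore the data keys, alone.&lt;&#x2F;p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// toyDirectory is a hypothetical in-memory model of the directory idea:
// every path is bound to a short integer prefix, and data keys only
// embed that prefix, never the path itself.
type toyDirectory struct {
	prefixes map[string]int // path -> allocated prefix (the metadata)
	next     int            // naive counter standing in for the real allocator
}

func newToyDirectory() toyDirectory {
	return toyDirectory{prefixes: map[string]int{}, next: 1}
}

// pathKey flattens a path; a toy encoding, not the tuple layer.
func pathKey(path []string) string { return strings.Join(path, "/") }

// createOrOpen returns the prefix bound to path, allocating one on first use.
func (d *toyDirectory) createOrOpen(path []string) int {
	k := pathKey(path)
	if p, ok := d.prefixes[k]; ok {
		return p
	}
	p := d.next
	d.next++
	d.prefixes[k] = p
	return p
}

// move rebinds an existing prefix to a new path. Only the metadata map
// changes, so keys built from the prefix stay valid.
func (d *toyDirectory) move(oldPath, newPath []string) (int, error) {
	p, ok := d.prefixes[pathKey(oldPath)]
	if !ok {
		return 0, fmt.Errorf("directory %v does not exist", oldPath)
	}
	if _, taken := d.prefixes[pathKey(newPath)]; taken {
		return 0, fmt.Errorf("directory %v already exists", newPath)
	}
	delete(d.prefixes, pathKey(oldPath))
	d.prefixes[pathKey(newPath)] = p
	return p, nil
}

func main() {
	d := newToyDirectory()
	before := d.createOrOpen([]string{"application", "my-app", "tenant", "tenant-42"})
	after, err := d.move(
		[]string{"application", "my-app", "tenant", "tenant-42"},
		[]string{"application", "my-app", "tenant", "tenant-142"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(before == after) // prints "true": same prefix after the rename
}
```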
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article. I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">storage</category>
          <category domain="tag">foundationdb</category>
          <category domain="tag">distributed</category>
      </item>
      <item>
          <title>Notes about ETCD</title>
          <pubDate>Mon, 11 Jan 2021 00:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/notes-about-etcd/</link>
          <guid>https://pierrezemb.fr/posts/notes-about-etcd/</guid>
          <description xml:base="https://pierrezemb.fr/posts/notes-about-etcd/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-etcd&#x2F;images&#x2F;etcd.png&quot; alt=&quot;etcd image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;tags&#x2F;notes&#x2F;&quot;&gt;Notes About&lt;&#x2F;a&gt; is a blog post series in which you will find a lot of &lt;strong&gt;links, videos, quotes, podcasts to click on&lt;&#x2F;strong&gt; about a specific topic. Today we will discover ETCD.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;overview-of-etcd&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#overview-of-etcd&quot; aria-label=&quot;Anchor link for: overview-of-etcd&quot;&gt;🔗&lt;&#x2F;a&gt;Overview of ETCD&lt;&#x2F;h2&gt;
&lt;p&gt;As stated in the &lt;a href=&quot;https:&#x2F;&#x2F;etcd.io&#x2F;&quot;&gt;official documentation&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;history&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#history&quot; aria-label=&quot;Anchor link for: history&quot;&gt;🔗&lt;&#x2F;a&gt;History&lt;&#x2F;h2&gt;
&lt;p&gt;ETCD was initially developed by CoreOS:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;CoreOS built etcd to solve the problem of shared configuration and service discovery.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;July 23, 2013 - announcement&lt;&#x2F;li&gt;
&lt;li&gt;December 27, 2013 - etcd 0.2.0 - new API, new modules and tons of improvements&lt;&#x2F;li&gt;
&lt;li&gt;February 07, 2014 - etcd 0.3.0 - Improved Cluster Discovery, API Enhancements and Windows Support&lt;&#x2F;li&gt;
&lt;li&gt;January 28, 2015 - etcd 2.0 - First Major Stable Release&lt;&#x2F;li&gt;
&lt;li&gt;June 30, 2016 - etcd3 - A New Version of etcd from CoreOS&lt;&#x2F;li&gt;
&lt;li&gt;June 09, 2017 - etcd 3.2 - etcd 3.2 now with massive watch scaling and easy locks&lt;&#x2F;li&gt;
&lt;li&gt;February 01, 2018 - etcd 3.3 - Announcing etcd 3.3, with improvements to stability, performance, and more&lt;&#x2F;li&gt;
&lt;li&gt;August 30, 2019 - etcd 3.4 - Better Storage Backend, concurrent Read, Improved Raft Voting Process, Raft Learner Member&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;overall-architecture&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#overall-architecture&quot; aria-label=&quot;Anchor link for: overall-architecture&quot;&gt;🔗&lt;&#x2F;a&gt;Overall architecture&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;The etcd key-value store is a distributed system intended for use as a coordination primitive. Like Zookeeper and Consul, etcd stores a small volume of infrequently-updated state (by default, up to 8 GB) in a key-value map, and offers strict-serializable reads, writes and micro-transactions across the entire datastore, plus coordination primitives like locks, watches, and leader election. Many distributed systems, such as Kubernetes and OpenStack, use etcd to store cluster metadata, to coordinate consistent views over data, to choose leaders, and so on.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;ETCD is:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;using &lt;a href=&quot;&#x2F;posts&#x2F;notes-about-raft&#x2F;&quot;&gt;the raft consensus algorithm&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;a single Raft group,&lt;&#x2F;li&gt;
&lt;li&gt;using &lt;a href=&quot;https:&#x2F;&#x2F;grpc.io&#x2F;&quot;&gt;gRPC&lt;&#x2F;a&gt; for communication,&lt;&#x2F;li&gt;
&lt;li&gt;using a self-made WAL implementation,&lt;&#x2F;li&gt;
&lt;li&gt;storing key-values into bbolt,&lt;&#x2F;li&gt;
&lt;li&gt;optimized for consistency over latency in normal situations and consistency over availability in the case of a partition (&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;PACELC_theorem&quot;&gt;in terms of the PACELC theorem&lt;&#x2F;a&gt;).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;consensus-raft&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#consensus-raft&quot; aria-label=&quot;Anchor link for: consensus-raft&quot;&gt;🔗&lt;&#x2F;a&gt;Consensus? Raft?&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;Raft is a consensus algorithm for managing a replicated log.&lt;&#x2F;li&gt;
&lt;li&gt;Consensus involves multiple servers agreeing on values.&lt;&#x2F;li&gt;
&lt;li&gt;Two common consensus algorithms are Paxos and Raft.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;A common alternative to Paxos&#x2F;Raft is a non-consensus (aka peer-to-peer) replication protocol.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Raft separates the key elements of consensus, such as leader election, log replication, and safety&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;ETCD contains several raft optimizations:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Read Index,&lt;&#x2F;li&gt;
&lt;li&gt;Follower reads,&lt;&#x2F;li&gt;
&lt;li&gt;Transfer leader,&lt;&#x2F;li&gt;
&lt;li&gt;Learner role,&lt;&#x2F;li&gt;
&lt;li&gt;Client-side load-balancing.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;exposed-api&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#exposed-api&quot; aria-label=&quot;Anchor link for: exposed-api&quot;&gt;🔗&lt;&#x2F;a&gt;Exposed API&lt;&#x2F;h3&gt;
&lt;p&gt;ETCD is exposing several APIs through different gRPC services:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Put(key, value),&lt;&#x2F;li&gt;
&lt;li&gt;Delete(key, Optional(keyRangeEnd)),&lt;&#x2F;li&gt;
&lt;li&gt;Get(key, Optional(keyRangeEnd)),&lt;&#x2F;li&gt;
&lt;li&gt;Watch(key, Optional(keyRangeEnd)),&lt;&#x2F;li&gt;
&lt;li&gt;Transaction(if&#x2F;then&#x2F;else ops),&lt;&#x2F;li&gt;
&lt;li&gt;Compact(revision),&lt;&#x2F;li&gt;
&lt;li&gt;Lease:
&lt;ul&gt;
&lt;li&gt;Grant,&lt;&#x2F;li&gt;
&lt;li&gt;Revoke,&lt;&#x2F;li&gt;
&lt;li&gt;KeepAlive&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Keys and values are plain bytes, and keys are kept in order.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;transactions&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#transactions&quot; aria-label=&quot;Anchor link for: transactions&quot;&gt;🔗&lt;&#x2F;a&gt;Transactions&lt;&#x2F;h3&gt;
&lt;pre data-lang=&quot;proto&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-proto &quot;&gt;&lt;code class=&quot;language-proto&quot; data-lang=&quot;proto&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; From google paxosdb paper:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Our implementation hinges around a powerful primitive which we call MultiOp. All other database
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; operations except for iteration are implemented as a single call to MultiOp. A MultiOp is applied atomically
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; and consists of three components:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 1. A list of tests called guard. Each test in guard checks a single entry in the database. It may check
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; for the absence or presence of a value, or compare with a given value. Two different tests in the guard
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; may apply to the same or different entries in the database. All tests in the guard are applied and
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; MultiOp returns the results. If all tests are true, MultiOp executes t op (see item 2 below), otherwise
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; it executes f op (see item 3 below).
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 2. A list of database operations called t op. Each operation in the list is either an insert, delete, or
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; lookup operation, and applies to a single database entry. Two different operations in the list may apply
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; to the same or different entries in the database. These operations are executed
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; if guard evaluates to
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; true.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; 3. A list of database operations called f op. Like t op, but executed if guard evaluates to false.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;message &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;TxnRequest &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; compare is a list of predicates representing a conjunction of terms.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; If the comparisons succeed, then the success requests will be processed in order,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; and the response will contain their respective responses in order.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; If the comparisons fail, then the failure requests will be processed in order,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; and the response will contain their respective responses in order.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;repeated &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Compare compare &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; success is a list of requests which will be applied when compare evaluates to true.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;repeated &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RequestOp success &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; failure is a list of requests which will be applied when compare evaluates to false.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;repeated &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RequestOp failure &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
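&lt;p&gt;The compare &#x2F; success &#x2F; failure shape can be sketched with a toy in-memory store. This is only an illustration under simplifying assumptions: &lt;code&gt;guard&lt;&#x2F;code&gt;, &lt;code&gt;write&lt;&#x2F;code&gt;, &lt;code&gt;txn&lt;&#x2F;code&gt; and &lt;code&gt;apply&lt;&#x2F;code&gt; are hypothetical names, guards only test equality, operations are only writes, and a missing key compares as the empty string. It is single-threaded, so atomicity comes for free here, which is not the case in etcd.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// guard is a hypothetical comparison: it passes when store[key] == want.
type guard struct {
	key  string
	want string
}

// write is a hypothetical operation: it sets store[key] = value.
type write struct {
	key   string
	value string
}

// txn mirrors the shape of TxnRequest: a list of guards, then one branch
// for success and one branch for failure.
type txn struct {
	compare []guard
	success []write
	failure []write
}

// apply evaluates the guards as a conjunction, then runs exactly one
// branch, in order. It reports whether the success branch was taken.
func apply(store map[string]string, t txn) bool {
	succeeded := true
	for _, g := range t.compare {
		if store[g.key] != g.want {
			succeeded = false
			break
		}
	}
	branch := t.failure
	if succeeded {
		branch = t.success
	}
	for _, w := range branch {
		store[w.key] = w.value
	}
	return succeeded
}

func main() {
	store := map[string]string{"lock": "free"}
	acquire := txn{
		compare: []guard{{key: "lock", want: "free"}},
		success: []write{{key: "lock", value: "held-by-a"}},
	}
	fmt.Println(apply(store, acquire), store["lock"]) // true held-by-a
	fmt.Println(apply(store, acquire), store["lock"]) // false held-by-a
}
```

&lt;p&gt;The second call fails its guard because the first one already changed the value, which is the compare-and-swap pattern that locks are typically built on.&lt;&#x2F;p&gt;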
&lt;h3 id=&quot;versioned-data&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#versioned-data&quot; aria-label=&quot;Anchor link for: versioned-data&quot;&gt;🔗&lt;&#x2F;a&gt;Versioned data&lt;&#x2F;h3&gt;
&lt;p&gt;Each change to the keyspace is tagged with a revision, a store-wide counter that is incremented on every modification. Each key also has a version: it starts at 1 when the key is created and is incremented each time the key is updated.&lt;&#x2F;p&gt;
&lt;p&gt;To keep the history of the keyspace from growing forever, one can call the &lt;code&gt;Compact&lt;&#x2F;code&gt; gRPC method:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Compacting the keyspace history drops all information about keys superseded prior to a given keyspace revision&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
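&lt;p&gt;A toy model helps to see how versioned reads and compaction interact. The sketch below assumes a store-wide revision counter plus a per-key version counter; &lt;code&gt;toyStore&lt;&#x2F;code&gt;, &lt;code&gt;put&lt;&#x2F;code&gt;, &lt;code&gt;get&lt;&#x2F;code&gt; and &lt;code&gt;compact&lt;&#x2F;code&gt; are hypothetical names, deletions are left out, and reading a compacted revision simply finds nothing instead of returning an error.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// record is one stored value with the counters attached to it.
type record struct {
	revision int // store-wide revision of the write
	version  int // per-key counter, 1 when the key is created
	value    string
}

// toyStore is a hypothetical model: one revision counter for the whole
// store and, for every key, the ordered history of its records.
type toyStore struct {
	revision int
	history  map[string][]record
}

func newToyStore() toyStore {
	return toyStore{history: map[string][]record{}}
}

// put bumps the store revision and appends a new record for key.
func (s *toyStore) put(key, value string) record {
	s.revision++
	h := s.history[key]
	r := record{revision: s.revision, version: 1, value: value}
	if n := len(h); n != 0 {
		r.version = h[n-1].version + 1
	}
	s.history[key] = append(h, r)
	return r
}

// get returns the newest record of key written at or before rev.
func (s *toyStore) get(key string, rev int) (record, bool) {
	h := s.history[key]
	for i := len(h) - 1; i >= 0; i-- {
		if rev >= h[i].revision {
			return h[i], true
		}
	}
	return record{}, false
}

// compact drops the records superseded before rev: for each key it keeps
// the newest record at or before rev and everything written after it.
func (s *toyStore) compact(rev int) {
	for key, h := range s.history {
		keep := 0
		for i, r := range h {
			if rev >= r.revision {
				keep = i
			}
		}
		s.history[key] = h[keep:]
	}
}

func main() {
	s := newToyStore()
	s.put("a", "1")
	s.put("b", "x")
	latest := s.put("a", "2")
	fmt.Println(latest.revision, latest.version) // 3 2

	old, _ := s.get("a", 2)
	fmt.Println(old.value) // 1

	s.compact(3)
	_, found := s.get("a", 2)
	fmt.Println(found) // false
}
```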
&lt;h3 id=&quot;lease&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#lease&quot; aria-label=&quot;Anchor link for: lease&quot;&gt;🔗&lt;&#x2F;a&gt;Lease&lt;&#x2F;h3&gt;
&lt;pre data-lang=&quot;proto&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-proto &quot;&gt;&lt;code class=&quot;language-proto&quot; data-lang=&quot;proto&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; this message represents a Lease
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;message &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Lease &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; TTL is the advisory time-to-live in seconds. Expired lease will return -1.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  int64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;TTL &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; ID is the requested ID for the lease. If ID is set to 0, the lessor chooses an ID.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  int64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ID &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  int64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;insert_timestamp &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;watches&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#watches&quot; aria-label=&quot;Anchor link for: watches&quot;&gt;🔗&lt;&#x2F;a&gt;Watches&lt;&#x2F;h3&gt;
&lt;pre data-lang=&quot;proto&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-proto &quot;&gt;&lt;code class=&quot;language-proto&quot; data-lang=&quot;proto&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;message &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Watch &lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; key is the key to register for watching.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  bytes &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;key &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; range_end is the end of the range [key, range_end) to watch. If range_end is not given,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; only the key argument is watched. If range_end is equal to &amp;#39;\0&amp;#39;, all keys greater than
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; or equal to the key argument are watched.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; If the range_end is one bit larger than the given key,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; then all keys with the prefix (the given key) will be watched.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  bytes &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;range_end &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; If watch_id is provided and non-zero, it will be assigned to this watcher.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Since creating a watcher in etcd is not a synchronous operation,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; this can be used to ensure that ordering is correct when creating multiple
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; watchers on the same stream. Creating a watcher with an ID already in
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; use on the stream will cause an error to be returned.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;  int64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;watch_id &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#eff1f5;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
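&lt;p&gt;The prefix rule from the comment above is easy to get wrong by hand, so here is a small sketch of how a &lt;code&gt;range_end&lt;&#x2F;code&gt; for a prefix could be computed. &lt;code&gt;prefixRangeEnd&lt;&#x2F;code&gt; is a hypothetical helper written for this post; in real code, the official Go client has a &lt;code&gt;WithPrefix&lt;&#x2F;code&gt; option that serves the same purpose.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// prefixRangeEnd returns the smallest key greater than every key that
// starts with prefix, so that [prefix, end) covers exactly that prefix.
// It increments the last byte that is not 0xff and drops what follows.
// When no such byte exists (empty prefix or only 0xff bytes), it returns
// "\x00", the special value meaning "all keys greater than or equal to key".
func prefixRangeEnd(prefix []byte) []byte {
	end := make([]byte, len(prefix))
	copy(end, prefix)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xff {
			end[i]++
			return end[:i+1]
		}
	}
	return []byte{0}
}

func main() {
	fmt.Printf("%q\n", prefixRangeEnd([]byte("foo")))    // "fop"
	fmt.Printf("%q\n", prefixRangeEnd([]byte("a\xff"))) // "b"
}
```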
&lt;h3 id=&quot;linearizable-reads&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#linearizable-reads&quot; aria-label=&quot;Anchor link for: linearizable-reads&quot;&gt;🔗&lt;&#x2F;a&gt;Linearizable reads&lt;&#x2F;h3&gt;
&lt;p&gt;Section 8 of the raft paper explains the issue:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;ETCD implements &lt;code&gt;ReadIndex&lt;&#x2F;code&gt; reads (more info in &lt;a href=&quot;&#x2F;posts&#x2F;diving-into-etcd-linearizable&#x2F;&quot;&gt;Diving into ETCD’s linearizable reads&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;how-etcd-is-using-bbolt&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-etcd-is-using-bbolt&quot; aria-label=&quot;Anchor link for: how-etcd-is-using-bbolt&quot;&gt;🔗&lt;&#x2F;a&gt;How ETCD is using bbolt&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;bbolt&quot;&gt;bbolt&lt;&#x2F;a&gt; is the underlying key-value store used in etcd. &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.14&#x2F;mvcc&#x2F;kvstore_txn.go#L214&quot;&gt;A bucket called &lt;code&gt;key&lt;&#x2F;code&gt; is used to store data, and the key is the revision&lt;&#x2F;a&gt;. Then, to find the revisions of a given key, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.14&#x2F;mvcc&#x2F;index.go#L68&quot;&gt;a B-Tree is used&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Bolt allows only one read-write transaction at a time but allows as many read-only transactions as you want at a time.&lt;&#x2F;li&gt;
&lt;li&gt;Each transaction has a consistent view of the data as it existed when the transaction started.&lt;&#x2F;li&gt;
&lt;li&gt;Bolt uses a B+tree internally and only a single file. Both approaches have trade-offs.&lt;&#x2F;li&gt;
&lt;li&gt;If you require a high random write throughput (&amp;gt;10,000 w&#x2F;sec) or you need to use spinning disks then LevelDB could be a good choice. If your application is read-heavy or does a lot of range scans then Bolt could be a good choice.&lt;&#x2F;li&gt;
&lt;li&gt;Try to avoid long running read transactions. Bolt uses copy-on-write so old pages cannot be reclaimed while an old transaction is using them.&lt;&#x2F;li&gt;
&lt;li&gt;Bolt uses a memory-mapped file so the underlying operating system handles the caching of the data. Typically, the OS will cache as much of the file as it can in memory and will release memory as needed to other processes. This means that Bolt can show very high memory usage when working with large databases.&lt;&#x2F;li&gt;
&lt;li&gt;Etcd implements multi-version-concurrency-control (MVCC) on top of Boltdb&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;blockquote&gt;
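&lt;p&gt;To make the layout described above more concrete (a bucket keyed by revision, plus an in-memory index from user keys to revisions), here is a minimal sketch. It is a toy model and not etcd&#x27;s code: the &lt;code&gt;store&lt;&#x2F;code&gt; type and its methods are hypothetical, plain maps stand in for both bbolt and the B-Tree, and a revision is reduced to a single integer.&lt;&#x2F;p&gt;

```go
package main

// revision is a simplified stand-in for an etcd revision.
type revision int64

// store is a toy model of the layout described above.
type store struct {
	current revision
	backend map[revision]string   // plays the role of the bbolt bucket named key
	index   map[string][]revision // plays the role of the in-memory B-Tree
}

func newStore() *store {
	s := new(store)
	s.backend = make(map[revision]string)
	s.index = make(map[string][]revision)
	return s
}

// put stores the value under a fresh revision and records
// that revision in the index entry of the user key.
func (s *store) put(key, value string) revision {
	s.current++
	s.backend[s.current] = value
	s.index[key] = append(s.index[key], s.current)
	return s.current
}

// get resolves the user key to its latest revision, then reads the backend.
func (s *store) get(key string) (string, bool) {
	revs := s.index[key]
	if len(revs) == 0 {
		return "", false
	}
	return s.backend[revs[len(revs)-1]], true
}

// history returns every value written for key, oldest first.
func (s *store) history(key string) []string {
	var out []string
	for _, rev := range s.index[key] {
		out = append(out, s.backend[rev])
	}
	return out
}
```

&lt;p&gt;Because every write lands under a new revision, older values stay readable until they are compacted, which is what makes MVCC possible on top of a plain key-value store.&lt;&#x2F;p&gt;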
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;issues&#x2F;12169#issuecomment-673292122&quot;&gt;From a GitHub issue&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that the underlying &lt;code&gt;bbolt&lt;&#x2F;code&gt; mmap its file in memory. For better performance, usually it is a good idea to ensure the physical memory available to etcd is larger than its data size.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;etcd-in-k8s&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#etcd-in-k8s&quot; aria-label=&quot;Anchor link for: etcd-in-k8s&quot;&gt;🔗&lt;&#x2F;a&gt;ETCD in K8S&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;blob&#x2F;master&#x2F;staging&#x2F;src&#x2F;k8s.io&#x2F;apiserver&#x2F;pkg&#x2F;storage&#x2F;interfaces.go#L159&quot;&gt;The interface can be found here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Create uses TTL and Txn&lt;&#x2F;li&gt;
&lt;li&gt;Get uses KV.Get&lt;&#x2F;li&gt;
&lt;li&gt;Delete uses Get, then a for loop around a Txn&lt;&#x2F;li&gt;
&lt;li&gt;GuaranteedUpdate uses Txn&lt;&#x2F;li&gt;
&lt;li&gt;List uses Get&lt;&#x2F;li&gt;
&lt;li&gt;Watch uses Watch with a channel&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
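&lt;p&gt;The Txn-based operations in the list above follow an optimistic read-modify-write pattern: read the key, compute the new value, and write it only if the key has not changed in the meantime. Here is a minimal sketch of that pattern. It is not the Kubernetes implementation: the &lt;code&gt;kv&lt;&#x2F;code&gt; type, &lt;code&gt;putIfUnchanged&lt;&#x2F;code&gt; and &lt;code&gt;guaranteedUpdate&lt;&#x2F;code&gt; are hypothetical names, and a single-threaded in-memory map stands in for etcd.&lt;&#x2F;p&gt;

```go
package main

import "errors"

// entry is a value plus the revision of its last modification.
type entry struct {
	value       string
	modRevision int64
}

// kv is a hypothetical, single-threaded stand-in for etcd.
type kv struct {
	revision int64
	data     map[string]entry
}

func newKV() *kv {
	s := new(kv)
	s.data = make(map[string]entry)
	return s
}

// get returns the entry; a missing key has modRevision 0.
func (s *kv) get(key string) entry { return s.data[key] }

// putIfUnchanged mimics a Txn of the form
// "if ModRevision(key) == expected then Put(key, value)".
func (s *kv) putIfUnchanged(key, value string, expected int64) bool {
	if s.data[key].modRevision != expected {
		return false
	}
	s.revision++
	s.data[key] = entry{value: value, modRevision: s.revision}
	return true
}

// guaranteedUpdate retries the read-modify-write cycle until the
// compare succeeds or maxRetries attempts have been made.
func (s *kv) guaranteedUpdate(key string, update func(old string) string, maxRetries int) error {
	for attempt := 0; attempt != maxRetries; attempt++ {
		current := s.get(key)
		if s.putIfUnchanged(key, update(current.value), current.modRevision) {
			return nil
		}
	}
	return errors.New("guaranteedUpdate: too many conflicts")
}
```

&lt;p&gt;If another writer modifies the key between the read and the write, the compare fails and the loop starts again from a fresh read, so the update function always ends up applied to the latest value.&lt;&#x2F;p&gt;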
&lt;h2 id=&quot;jepsen&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#jepsen&quot; aria-label=&quot;Anchor link for: jepsen&quot;&gt;🔗&lt;&#x2F;a&gt;Jepsen&lt;&#x2F;h2&gt;
&lt;p&gt;The Jepsen team tested &lt;a href=&quot;https:&#x2F;&#x2F;jepsen.io&#x2F;analyses&#x2F;etcd-3.4.3&quot;&gt;etcd-3.4.3&lt;&#x2F;a&gt;. Here are some quotes:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our tests, etcd 3.4.3 lived up to its claims for key-value operations: we observed nothing but strict-serializable consistency for reads, writes, and even multi-key transactions, during process pauses, crashes, clock skew, network partitions, and membership changes.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Watches appear correct, at least over single keys. So long as compaction does not destroy historical data while a watch isn’t running, watches appear to deliver every update to a key in order.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;However, etcd locks (like all distributed locks) do not provide mutual exclusion. Multiple processes can hold an etcd lock concurrently, even in healthy clusters with perfectly synchronized clocks.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;If you use etcd locks, consider whether those locks are used to ensure safety, or simply to improve performance by probabilistically limiting concurrency. It’s fine to use etcd locks for performance, but using them for safety might be risky.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;operation-notes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#operation-notes&quot; aria-label=&quot;Anchor link for: operation-notes&quot;&gt;🔗&lt;&#x2F;a&gt;Operation notes&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;deployements-tips&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#deployements-tips&quot; aria-label=&quot;Anchor link for: deployements-tips&quot;&gt;🔗&lt;&#x2F;a&gt;Deployment tips&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;etcd.io&#x2F;docs&#x2F;v3.4.0&#x2F;faq&#x2F;&quot;&gt;From the official documentation&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Since etcd writes data to disk, SSD is highly recommended.
To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a configurable storage size quota set to 2GB by default.
To avoid swapping or running out of memory, the machine should have at least as much RAM to cover the quota.
8GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;defrag&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#defrag&quot; aria-label=&quot;Anchor link for: defrag&quot;&gt;🔗&lt;&#x2F;a&gt;Defrag&lt;&#x2F;h3&gt;
&lt;blockquote&gt;
&lt;p&gt;After compacting the keyspace, the backend database may exhibit internal fragmentation.
Defragmentation is issued on a per-member so that cluster-wide latency spikes may be avoided.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Defrag is basically &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;2b79442d8e9fc54b1ac27e7e230ac0e4c132a054&#x2F;mvcc&#x2F;backend&#x2F;backend.go#L349&quot;&gt;dumping the bbolt tree on disk and reopening it&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;snapshot&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#snapshot&quot; aria-label=&quot;Anchor link for: snapshot&quot;&gt;🔗&lt;&#x2F;a&gt;Snapshot&lt;&#x2F;h3&gt;
&lt;p&gt;An ETCD snapshot is related to Raft&#x27;s snapshot:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
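&lt;p&gt;The quote above can be illustrated with a small sketch: the whole current state is written to a snapshot, then the log up to that point is discarded. This is a toy model under stated assumptions, not etcd&#x27;s or Raft&#x27;s actual code: &lt;code&gt;node&lt;&#x2F;code&gt;, &lt;code&gt;command&lt;&#x2F;code&gt; and their methods are hypothetical, and everything lives in memory instead of on stable storage.&lt;&#x2F;p&gt;

```go
package main

// command is a hypothetical log entry: set key to value.
type command struct {
	key, value string
}

// node is a toy replicated state machine with a log and a snapshot.
type node struct {
	snapshot      map[string]string // state as of snapshotIndex
	snapshotIndex int               // number of log entries folded into the snapshot
	log           []command         // entries after snapshotIndex
}

func newNode() *node {
	n := new(node)
	n.snapshot = make(map[string]string)
	return n
}

// propose appends a command to the log.
func (n *node) propose(c command) { n.log = append(n.log, c) }

// state rebuilds the current state: the snapshot plus a replay of the remaining log.
func (n *node) state() map[string]string {
	out := make(map[string]string, len(n.snapshot))
	for k, v := range n.snapshot {
		out[k] = v
	}
	for _, c := range n.log {
		out[c.key] = c.value
	}
	return out
}

// takeSnapshot captures the entire current state,
// then discards the log up to that point.
func (n *node) takeSnapshot() {
	n.snapshot = n.state()
	n.snapshotIndex += len(n.log)
	n.log = nil
}
```

&lt;p&gt;The state returned by &lt;code&gt;state()&lt;&#x2F;code&gt; is the same before and after &lt;code&gt;takeSnapshot()&lt;&#x2F;code&gt;; only the amount of log that has to be kept and replayed changes.&lt;&#x2F;p&gt;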
&lt;p&gt;A snapshot can be saved using &lt;code&gt;etcdctl&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;etcdctl&lt;&#x2F;span&gt;&lt;span&gt; snapshot save backup.db
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;lease-1&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#lease-1&quot; aria-label=&quot;Anchor link for: lease-1&quot;&gt;🔗&lt;&#x2F;a&gt;Lease&lt;&#x2F;h3&gt;
&lt;p&gt;Be careful with leases during a leader change, as this can &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kubernetes&#x2F;kubernetes&#x2F;issues&#x2F;65497&quot;&gt;create some issues&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new leader extends timeouts automatically for all leases. This mechanism ensures no lease expires due to server side unavailability.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;war-stories&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#war-stories&quot; aria-label=&quot;Anchor link for: war-stories&quot;&gt;🔗&lt;&#x2F;a&gt;War stories&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;blog.cloudflare.com&#x2F;a-byzantine-failure-in-the-real-world&#x2F;&quot;&gt;An analysis of the Cloudflare API availability incident on 2020-11-02&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;grafana.com&#x2F;blog&#x2F;2020&#x2F;04&#x2F;07&#x2F;how-a-production-outage-in-grafana-clouds-hosted-prometheus-service-was-caused-by-a-bad-etcd-client-setup&#x2F;&quot;&gt;How a production outage in Grafana Cloud&#x27;s Hosted Prometheus service was caused by a bad etcd client setup&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;issues&#x2F;11884&quot;&gt;Random performance issue on etcd 3.4&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2004.00372.pdf&quot;&gt;Impact of etcd deployment on Kubernetes, Istio, and application performance&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
          <category domain="tag">distributed</category>
          <category domain="tag">etcd</category>
          <category domain="tag">storage</category>
          <category domain="tag">consensus</category>
          <category domain="tag">notes</category>
      </item>
      <item>
          <title>10 years of programming and counting 🚀</title>
          <pubDate>Wed, 30 Sep 2020 00:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/ten-years-programming/</link>
          <guid>https://pierrezemb.fr/posts/ten-years-programming/</guid>
          <description xml:base="https://pierrezemb.fr/posts/ten-years-programming/">&lt;p&gt;I’ve just realized that I’ve spent the last decade programming 🤯 While 2020 feels like a strange year, I thought it would be nice to write down a retrospective of the last 10 years 🗓&lt;&#x2F;p&gt;
&lt;h2 id=&quot;learning-to-program-man-computer&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#learning-to-program-man-computer&quot; aria-label=&quot;Anchor link for: learning-to-program-man-computer&quot;&gt;🔗&lt;&#x2F;a&gt;Learning to program 👨🏻‍💻&lt;&#x2F;h2&gt;
&lt;p&gt;I wrote my first &lt;em&gt;Hello, world&lt;&#x2F;em&gt; program somewhere around September 2010, when I started engineering school to do some electronics, but the C language got me. I spent 6 months struggling to understand pointers and memory. I remember spending nights trying to find a memory leak with valgrind. Of course there were multiple mistakes, but it felt good to dig that far.&lt;&#x2F;p&gt;
&lt;p&gt;I also discovered Linux around that time, and spent many nights playing with Linux commands. I started my Linux journey with CentOS and then Ubuntu 11.04. I think this started the loop I’m (still!) stuck in:
&lt;code&gt;for {tryNewDistro()}&lt;&#x2F;code&gt;
I’m pretty sure that if I wanted to go away from distributed systems, I would try to land a job around operating systems. So many things to learn 🤩&lt;&#x2F;p&gt;
&lt;p&gt;After learning C, we started to learn web-based technologies like HTML&#x2F;CSS&#x2F;JS&#x2F;PHP. I remember struggling to generate a calendar with PHP 🐘 I learned about APIs the week after the project 😅 I remember digging into cookies and network calls from popular websites to see how they were using them.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;java-and-hadoop-elephant&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#java-and-hadoop-elephant&quot; aria-label=&quot;Anchor link for: java-and-hadoop-elephant&quot;&gt;🔗&lt;&#x2F;a&gt;Java and Hadoop 🐘&lt;&#x2F;h2&gt;
&lt;p&gt;I had the chance to land a part-time internship during the third year (out of five) of my engineering school. I joined the Systems team @ Arkea, a French bank.
I remember spending a lot of time with my coworkers, learning things from them, from Hadoop to mainframes and Linux. It was my first time grasping the work around “system programming”.&lt;&#x2F;p&gt;
&lt;p&gt;My first task was to write an installer for a Java app on Windows, but my tutor tried to push me further. He saw my interest in some specific layers of their perimeter, such as Hadoop and Kafka, and gave me a chance to work directly on those: a small API that could load old monitoring data stored in HDFS and expose it back into the “real-time” visualization tool. I also used Kafka and even deployed a small HBase cluster for testing.&lt;&#x2F;p&gt;
&lt;p&gt;I can&#x27;t thank my tutor enough for giving me this chance, and for allowing me to discover what would become my focus: distributed systems.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;let-s-meet-other-people-wave&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#let-s-meet-other-people-wave&quot; aria-label=&quot;Anchor link for: let-s-meet-other-people-wave&quot;&gt;🔗&lt;&#x2F;a&gt;Let’s meet other people 👋&lt;&#x2F;h2&gt;
&lt;p&gt;Around the same time, I discovered tech meetups and conferences. At that time, Google I&#x2F;O was a major event with people jumping from a plane and streaming it through Google Glass. I found out there was a group of people watching the livestream together. And this is how I discovered my local GDG&#x2F;JUG 🥳 I learned so many things by watching local talks, even if it was difficult to grasp everything at first. I remember taking 📝 about what I didn’t understand, to learn about it later.&lt;&#x2F;p&gt;
&lt;p&gt;I also met amazing people who are now friends and&#x2F;or mentors. I remember feeling humbled to be able to learn from them.&lt;&#x2F;p&gt;
&lt;p&gt;I also discovered more global tech conferences. As a birthday 🎁, I asked to go to Devoxx France and DotScale in 2014. It was awesome 😎&lt;&#x2F;p&gt;
&lt;p&gt;After watching so many talks, I wanted to give some. I started small, giving talks at my engineering school, then moved to the JUG itself. I learned &lt;strong&gt;a lot&lt;&#x2F;strong&gt; by making a lot of mistakes, but I’m pretty happy with how things turned out, as I’m now speaking at tech conferences as part of my current work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;ten-years-programming&#x2F;first-talk.jpg&quot; alt=&quot;etcd image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I also started to be involved in events and organizations such as:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The JUG&#x2F;GDG&lt;&#x2F;li&gt;
&lt;li&gt;A coworking place&lt;&#x2F;li&gt;
&lt;li&gt;Startup Weekend&lt;&#x2F;li&gt;
&lt;li&gt;Devoxx4kids&lt;&#x2F;li&gt;
&lt;li&gt;DevFest du bout du monde&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;learning-big-data-floppy-disk&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#learning-big-data-floppy-disk&quot; aria-label=&quot;Anchor link for: learning-big-data-floppy-disk&quot;&gt;🔗&lt;&#x2F;a&gt;Learning big data 💾&lt;&#x2F;h2&gt;
&lt;p&gt;After my graduation and a(nother) part-time internship at OVH, I started working on something called Metrics Data Platform. It is the platform used massively internally to store, query and alert on timeseries data. We avoided the Borgmon approach (deploying a Prometheus-like database for every team); instead, we created a single platform to ingest all of OVHcloud’s datapoints using a big-data approach. Here are the key points of Metrics:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;multi-tenant&lt;&#x2F;strong&gt;: as we said before, a single metrics cluster is handling all telemetry, from servers to applications and smart data centers from OVHcloud.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;scalable&lt;&#x2F;strong&gt;: today we are receiving around 1.8 million datapoints per second 🙈 for about 450 million timeseries 🙉. During European daytime, we are reading around 4.5 million datapoints per second thanks to Grafana’s auto-refresh mode 🙊&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;multi-protocol support&lt;&#x2F;strong&gt;: we didn&#x27;t want to impose our infrastructure choices on our users, so we wrote proxies that translate known protocols to our query language, so users can query and push data using OpenTSDB, Prometheus, InfluxDB and so on.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;based on open source&lt;&#x2F;strong&gt;: we are using Warp10 as the core of our infrastructure, with Kafka and HBase. Alerting was built with Apache Flink. We open-sourced a lot of software, from agents to our proxies. We also gave many talks about what we learnt.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;I had the chance to build Metrics from the ground up. I started working on the management layer and proxies. Then I wanted to learn operations, so I learned it by deploying Hadoop clusters 🤯 It took me a while to be able to start doing on-calls. I cannot count how many nights I was up, trying to fix some buggy software, or yelling at HBase for an inconsistent &lt;code&gt;hbck&lt;&#x2F;code&gt;, or trying to find a way to handle the side effects of losing multiple racks.&lt;&#x2F;p&gt;
&lt;p&gt;Our work was highly technical, and I loved it:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;We optimized a lot of things, from HBase to our Go-based proxies. &lt;code&gt;optimize HBase&#x27;s data balancer&lt;&#x2F;code&gt; or &lt;code&gt;fix issues with Go’s gc&lt;&#x2F;code&gt; was almost a normal task to do&lt;&#x2F;li&gt;
&lt;li&gt;We saw Metrics’s growth, from hundreds to millions of datapoints 😎 We saw systems breaking at scale, causing us to rewrite software or change architecture. Production became the final test.&lt;&#x2F;li&gt;
&lt;li&gt;Every piece of software we developed had a &lt;code&gt;keep it simple, yet scalable&lt;&#x2F;code&gt; policy, and doing on-calls was a good way to ensure software quality. We all learned it the hard way I guess 🤣&lt;&#x2F;li&gt;
&lt;li&gt;We were only 4 to 6 people handling ~800 servers, 3 Hadoop clusters, and thousands of lines of Java&#x2F;Go&#x2F;Rust&#x2F;Ansible code.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As always, things were not always magical, and I struggled more times than I can count. I learned that personal struggles are harder than technical ones, as you can always drill down into your tech problems by reading the code. The team was amazing 🚀, and we were helping each other a lot 🤝&lt;&#x2F;p&gt;
&lt;h2 id=&quot;searching-for-planets-telescope-ringed-planet&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#searching-for-planets-telescope-ringed-planet&quot; aria-label=&quot;Anchor link for: searching-for-planets-telescope-ringed-planet&quot;&gt;🔗&lt;&#x2F;a&gt;Searching for planets 🔭 🪐&lt;&#x2F;h2&gt;
&lt;p&gt;When I started working on Metrics, we did a lot of internal onboarding. At its core, Metrics uses Warp10, which comes with its own language to analyze timeseries. This provides powerful query capabilities, but as it is stack-based, getting started was difficult. I needed a project to dive into timeseries analysis.&lt;&#x2F;p&gt;
&lt;p&gt;I love astronomy 🔭, but there’s too much ☁️ (not the servers) in my city. I decided to look for astronomical timeseries. It turns out there are a lot, but one use case caught my interest: the search for exoplanets. Almost everything from NASA is open data, so we decided to create &lt;a href=&quot;https:&#x2F;&#x2F;helloexo.world&#x2F;&quot;&gt;HelloExoWorld&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We imported the &lt;strong&gt;25TB dataset into a Warp10 instance&lt;&#x2F;strong&gt; and started writing some WarpScript to search for transits. We wrote a &lt;a href=&quot;https:&#x2F;&#x2F;helloexoworld.github.io&#x2F;hew-hands-on&#x2F;&quot;&gt;hands-on about it&lt;&#x2F;a&gt;. We also ran several labs at French conferences like Devoxx and many others.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;io-timeout-construction&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#io-timeout-construction&quot; aria-label=&quot;Anchor link for: io-timeout-construction&quot;&gt;🔗&lt;&#x2F;a&gt;IO timeout 🚧&lt;&#x2F;h2&gt;
&lt;p&gt;Around 2018, OVHcloud started Managed Kubernetes, a free K8S control-plane. With this product we saw more developers coming to OVHcloud. We started thinking about how we could help them. Running stateful systems is &lt;strong&gt;hard&lt;&#x2F;strong&gt;, so maybe we could offer them some databases or queues in an as-a-service fashion. We started to design such products from our Metrics experience. We started the IO Vision to offer &lt;code&gt;popular Storage APIs in front of a scalable storage&lt;&#x2F;code&gt;. Does it sound familiar? 😇 I had a lot of fun working on that vision as a Technical Leader.&lt;&#x2F;p&gt;
&lt;p&gt;We started with queuing with ioStream. We wanted something that was:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Multi-tenant&lt;&#x2F;li&gt;
&lt;li&gt;Multi-protocol&lt;&#x2F;li&gt;
&lt;li&gt;Geo-replicated natively&lt;&#x2F;li&gt;
&lt;li&gt;Less operational burden at scale than Kafka&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We built ioStream around Apache Pulsar, and opened the beta around September 2019. At the same time, we were working on Kafka support as a proxy in Rust. Writing software capable of translating Kafka’s TCP frames to Pulsar with a state machine was &lt;strong&gt;fun and challenging work&lt;&#x2F;strong&gt;. Rust is really a nice language to write such software.&lt;&#x2F;p&gt;
&lt;p&gt;Then we worked with Apache Pulsar’s PMC to introduce a Kafka protocol handler on Pulsar brokers. I had the chance to work closely with two PMC members; it was an amazing experience for me 🚀 You can read about our collaboration &lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;blog&#x2F;announcing-kafka-on-pulsar-bring-native-kafka-protocol-support-to-apache-pulsar&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, as stated by the official communication, the project has been shut down:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;However, the limited success of the beta service and other strategic focuses,
&lt;&#x2F;span&gt;&lt;span&gt;have resulted in us taking the very difficult decision to close it.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I learned a lot of things, both technically and on the product side, especially considering the fact that it was shut down.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;today&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#today&quot; aria-label=&quot;Anchor link for: today&quot;&gt;🔗&lt;&#x2F;a&gt;Today&lt;&#x2F;h2&gt;
&lt;p&gt;After ioStream’s shutdown, most of the team moved on to create a new LBaaS. I helped them write an operator to schedule HAProxy containers on a Kubernetes cluster. It was a nice introduction to operators.&lt;&#x2F;p&gt;
&lt;p&gt;Then I decided to join the Managed Kubernetes ☸️ team. This is my current team, where I’m having a lot of fun working around ETCD.&lt;&#x2F;p&gt;
&lt;p&gt;I really hope the next 10 years will be as fun as the last 10 years 😇&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">personal</category>
      </item>
      <item>
          <title>Announcing Record-Store, a new (experimental) place for your data</title>
          <pubDate>Wed, 23 Sep 2020 10:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/announcing-record-store/</link>
          <guid>https://pierrezemb.fr/posts/announcing-record-store/</guid>
          <description xml:base="https://pierrezemb.fr/posts/announcing-record-store/">&lt;p&gt;TL;DR: I&#x27;m really happy to announce my latest open-source project called Record-Store 🚀 Please check it out on &lt;a href=&quot;https:&#x2F;&#x2F;pierrez.github.io&#x2F;record-store&quot;&gt;https:&#x2F;&#x2F;pierrez.github.io&#x2F;record-store&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what&quot; aria-label=&quot;Anchor link for: what&quot;&gt;🔗&lt;&#x2F;a&gt;What?&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;code&gt;Record-Store&lt;&#x2F;code&gt; is a &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;layer-concept.html&quot;&gt;layer&lt;&#x2F;a&gt; running on top of &lt;a href=&quot;https:&#x2F;&#x2F;foundationdb.org&quot;&gt;FoundationDB&lt;&#x2F;a&gt;. It provides abstractions to create, load and delete customer-defined data called &lt;code&gt;records&lt;&#x2F;code&gt;, which are held in a &lt;code&gt;RecordSpace&lt;&#x2F;code&gt;. We would like to have this kind of flow for developers:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Open a RecordSpace, for example &lt;code&gt;prod&#x2F;users&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Create a protobuf definition which will be used as the schema&lt;&#x2F;li&gt;
&lt;li&gt;Upsert the schema&lt;&#x2F;li&gt;
&lt;li&gt;Push records&lt;&#x2F;li&gt;
&lt;li&gt;Query records&lt;&#x2F;li&gt;
&lt;li&gt;Delete records&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;You need another &lt;code&gt;KeySpace&lt;&#x2F;code&gt; to store another type of data, or maybe a &lt;code&gt;KeySpace&lt;&#x2F;code&gt; dedicated to the production environment? Just create it and you are good to go!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;features&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#features&quot; aria-label=&quot;Anchor link for: features&quot;&gt;🔗&lt;&#x2F;a&gt;Features&lt;&#x2F;h2&gt;
&lt;p&gt;It is currently an experiment, but it already has some strong features:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-tenant&lt;&#x2F;strong&gt; A &lt;code&gt;tenant&lt;&#x2F;code&gt; can create as many &lt;code&gt;RecordSpace&lt;&#x2F;code&gt; as it wants, and we can have many &lt;code&gt;tenants&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Standard API&lt;&#x2F;strong&gt; We are exposing the record-store with standard technologies:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;grpc.io&quot;&gt;gRPC&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;very experimental&lt;&#x2F;em&gt; &lt;a href=&quot;https:&#x2F;&#x2F;graphql.org&quot;&gt;GraphQL&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scalable&lt;&#x2F;strong&gt; We are based on the same tech behind &lt;a href=&quot;https:&#x2F;&#x2F;www.foundationdb.org&#x2F;files&#x2F;record-layer-paper.pdf&quot;&gt;CloudKit&lt;&#x2F;a&gt;, called the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb&#x2F;fdb-record-layer&#x2F;&quot;&gt;Record Layer&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Transactional&lt;&#x2F;strong&gt; We are running on top of &lt;a href=&quot;https:&#x2F;&#x2F;www.foundationdb.org&#x2F;&quot;&gt;FoundationDB&lt;&#x2F;a&gt;. FoundationDB gives you the power of ACID transactions in a distributed database.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Encrypted&lt;&#x2F;strong&gt; Data are encrypted by default.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-model&lt;&#x2F;strong&gt; For each &lt;code&gt;RecordSpace&lt;&#x2F;code&gt;, you can define a &lt;code&gt;schema&lt;&#x2F;code&gt;, which is in fact just a &lt;code&gt;Protobuf&lt;&#x2F;code&gt; definition. You need to store some &lt;code&gt;users&lt;&#x2F;code&gt;, or a more complicated structure? If you can represent it as &lt;a href=&quot;https:&#x2F;&#x2F;developers.google.com&#x2F;protocol-buffers&quot;&gt;Protobuf&lt;&#x2F;a&gt;, you are good to go!&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Index-defined queries&lt;&#x2F;strong&gt; Your query capabilities are defined by the indexes you put on your schema.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Secured&lt;&#x2F;strong&gt; We are using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;CleverCloud&#x2F;biscuit&quot;&gt;Biscuit&lt;&#x2F;a&gt;, a mix of &lt;code&gt;JWT&lt;&#x2F;code&gt; and &lt;code&gt;Macaroons&lt;&#x2F;code&gt; to ensure auth{entication, orization}.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;why&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#why&quot; aria-label=&quot;Anchor link for: why&quot;&gt;🔗&lt;&#x2F;a&gt;Why?&lt;&#x2F;h2&gt;
&lt;p&gt;Lately, I have been playing a lot with my &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;fdb-etcd&quot;&gt;ETCD-Layer&lt;&#x2F;a&gt; that is using the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundationdb&#x2F;fdb-record-layer&#x2F;&quot;&gt;Record-Layer&lt;&#x2F;a&gt;. Thanks to it, I was able to bootstrap my ETCD-layer very quickly, but I was not using a tenth of the capabilities of this library. So I decided to go deeper. &lt;strong&gt;What would a gRPC abstraction of the Record-Layer look like?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The name of this project itself is a tribute to the Record Layer as we are exposing the layer within a gRPC interface.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;try-it-out&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#try-it-out&quot; aria-label=&quot;Anchor link for: try-it-out&quot;&gt;🔗&lt;&#x2F;a&gt;Try it out&lt;&#x2F;h2&gt;
&lt;p&gt;Record-Store is open sourced under Apache License V2 in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;record-store&quot;&gt;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;record-store&lt;&#x2F;a&gt; and the documentation can be found at &lt;a href=&quot;https:&#x2F;&#x2F;pierrez.github.io&#x2F;record-store&quot;&gt;https:&#x2F;&#x2F;pierrez.github.io&#x2F;record-store&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">storage</category>
          <category domain="tag">distributed</category>
          <category domain="tag">opensource</category>
          <category domain="tag">foundationdb</category>
      </item>
      <item>
          <title>Diving into ETCD&#x27;s linearizable reads</title>
          <pubDate>Fri, 18 Sep 2020 05:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/diving-into-etcd-linearizable/</link>
          <guid>https://pierrezemb.fr/posts/diving-into-etcd-linearizable/</guid>
          <description xml:base="https://pierrezemb.fr/posts/diving-into-etcd-linearizable/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;diving-into-etcd-linearizable&#x2F;etcd.png&quot; alt=&quot;etcd image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;tags&#x2F;diving-into&#x2F;&quot;&gt;Diving Into&lt;&#x2F;a&gt; is a blog post series where we dig into a specific part of a project&#x27;s codebase. In this episode, we will dig into the implementation behind ETCD&#x27;s linearizable reads.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-is-etcd&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-is-etcd&quot; aria-label=&quot;Anchor link for: what-is-etcd&quot;&gt;🔗&lt;&#x2F;a&gt;What is ETCD?&lt;&#x2F;h2&gt;
&lt;p&gt;From &lt;a href=&quot;https:&#x2F;&#x2F;etcd.io&#x2F;&quot;&gt;the official website&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;ETCD is well-known to be Kubernetes&#x27;s datastore, and a CNCF incubating project.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;linea-what&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#linea-what&quot; aria-label=&quot;Anchor link for: linea-what&quot;&gt;🔗&lt;&#x2F;a&gt;Linea-what?&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;jepsen.io&#x2F;consistency&#x2F;models&#x2F;linearizable&quot;&gt;Let&#x27;s quote Kyle Kingsbury, a.k.a &quot;Aphyr&quot;&lt;&#x2F;a&gt;, for this one:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Linearizability is one of the strongest single-object consistency models, and implies that every operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations: e.g., if operation A completes before operation B begins, then B should logically take effect after A.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;why&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#why&quot; aria-label=&quot;Anchor link for: why&quot;&gt;🔗&lt;&#x2F;a&gt;Why?&lt;&#x2F;h2&gt;
&lt;p&gt;ETCD uses &lt;a href=&quot;https:&#x2F;&#x2F;raft.github.io&#x2F;&quot;&gt;Raft&lt;&#x2F;a&gt;, a consensus algorithm, at its core. As always, the devil is in the details, especially when things go wrong. Here&#x27;s an example:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;node1&lt;&#x2F;code&gt; is &lt;code&gt;leader&lt;&#x2F;code&gt; and heartbeating properly to &lt;code&gt;node2&lt;&#x2F;code&gt; and &lt;code&gt;node3&lt;&#x2F;code&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;a network partition occurs, and &lt;code&gt;node1&lt;&#x2F;code&gt; is isolated from the others.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;From this point on, what happens depends on timeouts and settings. In the (near) future, all nodes will go into &lt;strong&gt;election mode&lt;&#x2F;strong&gt;, and &lt;code&gt;node2&lt;&#x2F;code&gt; and &lt;code&gt;node3&lt;&#x2F;code&gt; will be able to form a quorum. This can lead to the following situation:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;node1&lt;&#x2F;code&gt; still thinks it is the leader, because the heartbeat timeouts and retries have not been reached yet, so it can serve reads 😱&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;node2&lt;&#x2F;code&gt; and &lt;code&gt;node3&lt;&#x2F;code&gt; have elected a new leader and are working again, accepting writes.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This situation violates linearizable reads, as reads going through &lt;code&gt;node1&lt;&#x2F;code&gt; will not see the latest updates accepted by the current leader.&lt;&#x2F;p&gt;
&lt;p&gt;How can we solve this? One way is to use &lt;code&gt;ReadIndex&lt;&#x2F;code&gt;!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;readindex&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#readindex&quot; aria-label=&quot;Anchor link for: readindex&quot;&gt;🔗&lt;&#x2F;a&gt;ReadIndex&lt;&#x2F;h2&gt;
&lt;p&gt;The basic idea is to confirm whether the &lt;strong&gt;leader is still the true leader&lt;&#x2F;strong&gt; by sending a message to the followers. If a majority of the cluster answers, the leader can safely serve the reads. Let&#x27;s dive into the implementation!&lt;&#x2F;p&gt;
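&lt;p&gt;Before reading the real code, here is a toy model of that majority check. It is only a sketch and not etcd&#x27;s code: &lt;code&gt;hasQuorum&lt;&#x2F;code&gt; and &lt;code&gt;confirmLeadership&lt;&#x2F;code&gt; are hypothetical names, and reachability is assumed to be a plain map instead of real heartbeat responses.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// hasQuorum reports whether acks form a strict majority of clusterSize.
func hasQuorum(acks, clusterSize int) bool {
	return acks > clusterSize/2
}

// confirmLeadership is a hypothetical stand-in for the ReadIndex round trip:
// the leader counts its own ack, then one ack per follower it can still reach.
func confirmLeadership(reachable map[string]bool) bool {
	acks := 1 // the leader acknowledges its own request
	for _, ok := range reachable {
		if ok {
			acks++
		}
	}
	// The cluster is made of the followers plus the leader itself.
	return hasQuorum(acks, len(reachable)+1)
}

func main() {
	healthy := map[string]bool{"node2": true, "node3": true}
	isolated := map[string]bool{"node2": false, "node3": false}

	fmt.Println(confirmLeadership(healthy))  // true: 3 acks out of 3
	fmt.Println(confirmLeadership(isolated)) // false: 1 ack out of 3
}
```

&lt;p&gt;In the partition scenario above, the isolated &lt;code&gt;node1&lt;&#x2F;code&gt; only ever collects its own ack, so the check fails and the stale read is never served.&lt;&#x2F;p&gt;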
&lt;p&gt;All the etcd code below comes from the latest release at the time of writing, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;releases&#x2F;tag&#x2F;v3.4.13&quot;&gt;v3.4.13&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.13&#x2F;etcdserver&#x2F;v3_server.go#L114-L120&quot;&gt;Let&#x27;s take a Range operation&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Serializable &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;linearizableReadNotify&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;trace&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Step&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;agreement among raft nodes before linearized reading&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;)
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err &lt;&#x2F;span&gt;&lt;span&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err
&lt;&#x2F;span&gt;&lt;span&gt;  }
&lt;&#x2F;span&gt;&lt;span&gt; }
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;func &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s &lt;&#x2F;span&gt;&lt;span&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;EtcdServer&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;linearizableReadNotify&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx context&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;Context&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;error &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readMu&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RLock&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;nc &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readNotifier
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readMu&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RUnlock&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; signal linearizable loop for current notify if it hasn&amp;#39;t been already
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;select &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readwaitc &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt;{}{}:
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;default&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt; }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; wait for read state notification
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;select &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;nc&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;c&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;nc&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;err
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Done&lt;&#x2F;span&gt;&lt;span&gt;():
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Err&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;done&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ErrStopped
&lt;&#x2F;span&gt;&lt;span&gt; }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.13&#x2F;etcdserver&#x2F;v3_server.go#L773-L793&quot;&gt;linearizableReadNotify&lt;&#x2F;a&gt;, we are waiting for a signal. &lt;code&gt;readwaitc&lt;&#x2F;code&gt; is consumed by another goroutine called &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.13&#x2F;etcdserver&#x2F;v3_server.go#L672-L771&quot;&gt;linearizableReadLoop&lt;&#x2F;a&gt;. This goroutine will call:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;func &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;n &lt;&#x2F;span&gt;&lt;span&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;node&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;ReadIndex&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx context&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;Context&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rctx &lt;&#x2F;span&gt;&lt;span&gt;[]&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;byte&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;error &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;n&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;step&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Message&lt;&#x2F;span&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Type&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;MsgReadIndex&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;: []&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entry&lt;&#x2F;span&gt;&lt;span&gt;{{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rctx&lt;&#x2F;span&gt;&lt;span&gt;}}})
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This creates a &lt;code&gt;MsgReadIndex&lt;&#x2F;code&gt; message that is handled in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.13&#x2F;raft&#x2F;raft.go#L994&quot;&gt;stepLeader&lt;&#x2F;a&gt;, which sends the message to the followers, like this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;MsgReadIndex&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; If more than the local vote is needed, go through a full broadcast,
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; otherwise optimize.
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;prs&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;IsSingleton&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; PZ: omitting some code here
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;switch &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readOnly&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;option &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadOnlySafe&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readOnly&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;addRequest&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;raftLog&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;committed&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; The local node automatically acks the request.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readOnly&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;recvAck&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;id&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;].&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;bcastHeartbeatWithCtx&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;].&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadOnlyLeaseBased&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ri &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;raftLog&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;committed
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;From &lt;&#x2F;span&gt;&lt;span&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;None &lt;&#x2F;span&gt;&lt;span&gt;|| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;From &lt;&#x2F;span&gt;&lt;span&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;id &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; from local member
&lt;&#x2F;span&gt;&lt;span&gt;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readStates &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;append&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readStates&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadState&lt;&#x2F;span&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Index&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ri&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RequestCtx&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;].&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span&gt;})
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;send&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Message&lt;&#x2F;span&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;To&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;From&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Type&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;MsgReadIndexResp&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Index&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ri&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;})
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;   }
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So, the &lt;code&gt;leader&lt;&#x2F;code&gt; sends a heartbeat in &lt;code&gt;ReadOnlySafe&lt;&#x2F;code&gt; mode. It turns out there are two modes:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span&gt;(
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; ReadOnlySafe guarantees the linearizability of the read only request by
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; communicating with the quorum. It is the default and suggested option.
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadOnlySafe &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;ReadOnlyOption &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;iota
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; ReadOnlyLeaseBased ensures linearizability of the read only request by
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; relying on the leader lease. It can be affected by clock drift.
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; If the clock drift is unbounded, leader might keep the lease longer than it
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; should (clock can move backward&#x2F;pause without any bound). ReadIndex is not safe
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; in that case.
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadOnlyLeaseBased
&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Responses from the followers are handled here:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;MsgHeartbeatResp&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; PZ: omitting some code here
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rss &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readOnly&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;advance&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;_&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rs &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;range &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rss &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;req &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rs&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;req
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;req&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;From &lt;&#x2F;span&gt;&lt;span&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;None &lt;&#x2F;span&gt;&lt;span&gt;|| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;req&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;From &lt;&#x2F;span&gt;&lt;span&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;id &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; from local member
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readStates &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;append&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readStates&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadState&lt;&#x2F;span&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Index&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rs&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;index&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RequestCtx&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;req&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;].&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span&gt;})
&lt;&#x2F;span&gt;&lt;span&gt;   } &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;send&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Message&lt;&#x2F;span&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;To&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;req&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;From&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Type&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;MsgReadIndexResp&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Index&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rs&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;index&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;req&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Entries&lt;&#x2F;span&gt;&lt;span&gt;})
&lt;&#x2F;span&gt;&lt;span&gt;   }
&lt;&#x2F;span&gt;&lt;span&gt;  }
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
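&lt;p&gt;The &lt;code&gt;advance&lt;&#x2F;code&gt; call above is where queued reads get released. As a rough mental model only (this is not etcd&#x27;s &lt;code&gt;readOnly&lt;&#x2F;code&gt; type, and &lt;code&gt;pendingReads&lt;&#x2F;code&gt;, &lt;code&gt;add&lt;&#x2F;code&gt; and &lt;code&gt;release&lt;&#x2F;code&gt; are hypothetical names), the sketch below assumes that once a request context is acknowledged by a quorum, that request and every request queued before it can be answered:&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// pendingReads is a toy queue of in-flight read requests, keyed by the
// context string that travels with the heartbeat.
type pendingReads struct {
	queue []string          // request contexts, oldest first
	index map[string]uint64 // commit index recorded when the request arrived
}

func newPendingReads() *pendingReads {
	p := new(pendingReads)
	p.index = make(map[string]uint64)
	return p
}

// add registers a read request together with the commit index seen at that time.
func (p *pendingReads) add(ctx string, commit uint64) {
	if _, seen := p.index[ctx]; seen {
		return // already queued
	}
	p.queue = append(p.queue, ctx)
	p.index[ctx] = commit
}

// release is meant to be called once ctx has been acknowledged by a quorum.
// It returns the read indexes of ctx and of every request queued before it,
// and removes them from the queue.
func (p *pendingReads) release(ctx string) []uint64 {
	for i, c := range p.queue {
		if c != ctx {
			continue
		}
		var ready []uint64
		for _, done := range p.queue[:i+1] {
			ready = append(ready, p.index[done])
			delete(p.index, done)
		}
		p.queue = p.queue[i+1:]
		return ready
	}
	return nil // unknown context: nothing to release
}

func main() {
	p := newPendingReads()
	p.add("read-1", 10)
	p.add("read-2", 12)
	p.add("read-3", 12)

	fmt.Println(p.release("read-2")) // [10 12]
	fmt.Println(len(p.queue))        // 1
}
```

&lt;p&gt;Releasing older requests along with the acknowledged one is safe in this model because the quorum answered after those older requests were registered, so leadership was still valid for them too.&lt;&#x2F;p&gt;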
&lt;p&gt;The confirmed read index is stored in a &lt;code&gt;ReadState&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; ReadState provides state for read only query.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; It&amp;#39;s caller&amp;#39;s responsibility to call ReadIndex first before getting
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; this state from ready, it&amp;#39;s also caller&amp;#39;s duty to differentiate if this
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; state is what it requests through RequestCtx, eg. given a unique id as
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; RequestCtx
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span&gt;ReadState &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;struct &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Index      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;uint64
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;RequestCtx &lt;&#x2F;span&gt;&lt;span&gt;[]&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;byte
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now that the state has been updated, we need to unblock our &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.13&#x2F;etcdserver&#x2F;v3_server.go#L672-L771&quot;&gt;linearizableReadLoop&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;timeout &lt;&#x2F;span&gt;&lt;span&gt;&amp;amp;&amp;amp; !&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;done &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;select &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rs &lt;&#x2F;span&gt;&lt;span&gt;= &amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readStateC&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Cool, another channel! Turns out, &lt;code&gt;readStateC&lt;&#x2F;code&gt; is updated in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.13&#x2F;etcdserver&#x2F;raft.go#L162&quot;&gt;one of the main goroutines&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; start prepares and starts raftNode in a new goroutine. It is no longer safe
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; to modify the fields after it has been started.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;func &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r &lt;&#x2F;span&gt;&lt;span&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;raftNode&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8fa1b3;&quot;&gt;start&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rh &lt;&#x2F;span&gt;&lt;span&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;raftReadyHandler&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;internalTimeout &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;time&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Second
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;go func&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;defer &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;onStop&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;islead &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;false
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;select &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ticker&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;C&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;tick&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rd &lt;&#x2F;span&gt;&lt;span&gt;:= &amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Ready&lt;&#x2F;span&gt;&lt;span&gt;():
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; PZ: omitting some code here
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rd&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadStates&lt;&#x2F;span&gt;&lt;span&gt;) != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;select &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;readStateC &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rd&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadStates&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#96b5b4;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rd&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ReadStates&lt;&#x2F;span&gt;&lt;span&gt;)-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;]:
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Perfect, now &lt;code&gt;readStateC&lt;&#x2F;code&gt; is notified, and we can continue on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.13&#x2F;etcdserver&#x2F;v3_server.go#L672-L771&quot;&gt;linearizableReadLoop&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ai &lt;&#x2F;span&gt;&lt;span&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;getAppliedIndex&lt;&#x2F;span&gt;&lt;span&gt;(); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ai &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rs&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Index &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;select &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;applyWait&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Wait&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;rs&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Index&lt;&#x2F;span&gt;&lt;span&gt;):
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;stopping&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return
&lt;&#x2F;span&gt;&lt;span&gt;   }
&lt;&#x2F;span&gt;&lt;span&gt;  }
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; unblock all l-reads requested at indices before rs.Index
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;nr&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;notify&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first part is a safety measure: if the applied index is lower than the index stored in &lt;code&gt;ReadState&lt;&#x2F;code&gt;, we wait until it catches up. Then we finally unblock all pending reads 🤩&lt;&#x2F;p&gt;
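&lt;p&gt;To make this gating rule concrete, here is a minimal, self-contained sketch. It is not etcd&#x27;s code: &lt;code&gt;readGate&lt;&#x2F;code&gt;, &lt;code&gt;pendingRead&lt;&#x2F;code&gt; and their methods are hypothetical names, and etcd blocks on channels such as &lt;code&gt;applyWait&lt;&#x2F;code&gt; instead of returning slices. The only idea kept from the excerpt above is that a read confirmed at a given index can be served once the applied index has reached it.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// pendingRead is a hypothetical record for a linearizable read that is
// waiting for the state machine to catch up to readIndex.
type pendingRead struct {
	id        string
	readIndex uint64
}

// readGate is an illustrative helper, not an etcd type.
type readGate struct {
	applied uint64
	waiting []pendingRead
}

// add registers a read. It reports true when the read can be served right away.
func (g *readGate) add(r pendingRead) bool {
	if g.applied >= r.readIndex {
		return true
	}
	g.waiting = append(g.waiting, r)
	return false
}

// advance records a new applied index and returns the ids of the reads
// that are now safe to serve.
func (g *readGate) advance(applied uint64) []string {
	if applied > g.applied {
		g.applied = applied
	}
	var ready []string
	var still []pendingRead
	for _, r := range g.waiting {
		if g.applied >= r.readIndex {
			ready = append(ready, r.id)
		} else {
			still = append(still, r)
		}
	}
	g.waiting = still
	return ready
}

func main() {
	g := readGate{applied: 5}
	fmt.Println(g.add(pendingRead{id: "a", readIndex: 4})) // index 4 is already applied
	fmt.Println(g.add(pendingRead{id: "b", readIndex: 7})) // index 7 is not applied yet
	fmt.Println(g.advance(6))
	fmt.Println(g.advance(7))
}
```

&lt;p&gt;In this sketch the read &lt;code&gt;b&lt;&#x2F;code&gt; stays parked until &lt;code&gt;advance&lt;&#x2F;code&gt; is called with an index of at least 7.&lt;&#x2F;p&gt;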
&lt;h2 id=&quot;one-more-thing-follower-read&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#one-more-thing-follower-read&quot; aria-label=&quot;Anchor link for: one-more-thing-follower-read&quot;&gt;🔗&lt;&#x2F;a&gt;One more thing: Follower read&lt;&#x2F;h2&gt;
&lt;p&gt;We went through &lt;code&gt;stepLeader&lt;&#x2F;code&gt; a lot, but there is something interesting in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;blob&#x2F;v3.4.13&#x2F;raft&#x2F;raft.go#L1320&quot;&gt;&lt;code&gt;stepFollower&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;case &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;pb&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;MsgReadIndex&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;lead &lt;&#x2F;span&gt;&lt;span&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;None &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;logger&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Infof&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;%x&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt; no leader at term &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;%d&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;; dropping index reading msg&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;id&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Term&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;nil
&lt;&#x2F;span&gt;&lt;span&gt;  }
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;To &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;lead
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;r&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;send&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This means that a follower can forward a &lt;code&gt;MsgReadIndex&lt;&#x2F;code&gt; message to the leader to perform the same kind of checks as a leader. This small feature is in fact enabling &lt;strong&gt;follower-reads&lt;&#x2F;strong&gt; on ETCD 🤩 That is why you can see &lt;code&gt;Range&lt;&#x2F;code&gt; requests from a &lt;code&gt;follower&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;operational-tips&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#operational-tips&quot; aria-label=&quot;Anchor link for: operational-tips&quot;&gt;🔗&lt;&#x2F;a&gt;Operational tips&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;If you are running etcd &amp;lt;= 3.4, make sure &lt;strong&gt;logger=zap&lt;&#x2F;strong&gt; is set. That way, you will be able to see some tracing logs, and I truly hope you will not witness this one:&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre data-lang=&quot;json&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-json &quot;&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;level&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;info&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;ts&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;2020-08-12T08:24:56.181Z&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;caller&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;traceutil&#x2F;trace.go:145&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;msg&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;trace[677217921] range&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;detail&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;{range_begin:&#x2F;...redacted...; range_end:; response_count:1; response_revision:2725080604; }&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;duration&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;1.553047811s&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;start&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;2020-08-12T08:24:54.628Z&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;end&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;2020-08-12T08:24:56.181Z&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;steps&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: [
&lt;&#x2F;span&gt;&lt;span&gt;    &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;trace[677217921] &amp;#39;agreement among raft nodes before linearized reading&amp;#39;  (duration: 1.534322015s)&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot; 
&lt;&#x2F;span&gt;&lt;span&gt;  ]
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;ul&gt;
&lt;li&gt;There is &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;issues&#x2F;11884&quot;&gt;a random performance issue on etcd 3.4&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;There are some metrics that you can watch for ReadIndex issues:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;etcd_server_read_indexes_failed_total&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;etcd_server_slow_read_indexes_total&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I&#x27;m also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">distributed</category>
          <category domain="tag">etcd</category>
          <category domain="tag">raft</category>
          <category domain="tag">consensus</category>
          <category domain="tag">storage</category>
          <category domain="tag">diving-into</category>
      </item>
      <item>
          <title>Notes about Raft&#x27;s paper</title>
          <pubDate>Thu, 30 Jul 2020 07:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/notes-about-raft/</link>
          <guid>https://pierrezemb.fr/posts/notes-about-raft/</guid>
          <description xml:base="https://pierrezemb.fr/posts/notes-about-raft/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-raft&#x2F;raft.png&quot; alt=&quot;raft_image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;tags&#x2F;notes&#x2F;&quot;&gt;Notes About&lt;&#x2F;a&gt; is a blog post series in which you will find a lot of &lt;strong&gt;links, videos, quotes, podcasts to click on&lt;&#x2F;strong&gt; about a specific topic. Today we will discover Raft&#x27;s paper called &#x27;In Search of an Understandable Consensus Algorithm&#x27;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;As I&#x27;m digging into ETCD, I needed to refresh my memory about Raft. I started by reading the paper located &lt;a href=&quot;https:&#x2F;&#x2F;raft.github.io&#x2F;raft.pdf&quot;&gt;here&lt;&#x2F;a&gt; and I&#x27;m also playing with the amazing &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pingcap&#x2F;talent-plan&#x2F;tree&#x2F;master&#x2F;courses&#x2F;dss&#x2F;raft&quot;&gt;Raft labs made by PingCAP&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;These labs are derived from the &lt;a href=&quot;http:&#x2F;&#x2F;nil.csail.mit.edu&#x2F;6.824&#x2F;2018&#x2F;labs&#x2F;lab-raft.html&quot;&gt;lab2:raft&lt;&#x2F;a&gt; and &lt;a href=&quot;http:&#x2F;&#x2F;nil.csail.mit.edu&#x2F;6.824&#x2F;2018&#x2F;labs&#x2F;lab-kvraft.html&quot;&gt;lab3:kvraft&lt;&#x2F;a&gt; from the famous &lt;a href=&quot;http:&#x2F;&#x2F;nil.csail.mit.edu&#x2F;6.824&#x2F;2018&#x2F;index.html&quot;&gt;MIT 6.824&lt;&#x2F;a&gt; course but rewritten in Rust.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;abstract&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#abstract&quot; aria-label=&quot;Anchor link for: abstract&quot;&gt;🔗&lt;&#x2F;a&gt;Abstract&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;introduction&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#introduction&quot; aria-label=&quot;Anchor link for: introduction&quot;&gt;🔗&lt;&#x2F;a&gt;Introduction&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Paxos has dominated the discussion of consensus algorithms over the last decade.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Our approach was unusual in that our primary goal was &lt;strong&gt;understandability&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;replicated-state-machines&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#replicated-state-machines&quot; aria-label=&quot;Anchor link for: replicated-state-machines&quot;&gt;🔗&lt;&#x2F;a&gt;Replicated state machines&lt;&#x2F;h2&gt;
&lt;p&gt;The main idea is to compute identical copies of the same state (e.g. &lt;code&gt;x:3, y:9&lt;&#x2F;code&gt;) on several machines, so that the state survives the failure of some of them. Most of the time, an ordered &lt;code&gt;wal&lt;&#x2F;code&gt; (write-ahead log) is used in the implementation, to hold the mutations (e.g. &lt;code&gt;x:4&lt;&#x2F;code&gt;). Keeping the replicated log consistent is the job of the consensus algorithm, here Raft.&lt;&#x2F;p&gt;
&lt;p&gt;Raft creates a true split between:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;the consensus module,&lt;&#x2F;li&gt;
&lt;li&gt;the wal,&lt;&#x2F;li&gt;
&lt;li&gt;the state machine.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;img src=&quot;&#x2F;images&#x2F;notes-about-raft&#x2F;fig_1.png&quot; alt=&quot;fig1&quot; class=&quot;center&quot;&gt;
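&lt;p&gt;As a rough illustration of the replicated state machine idea, the sketch below replays the same ordered log on two replicas and ends up with the same state on both. The &lt;code&gt;mutation&lt;&#x2F;code&gt; type and the &lt;code&gt;apply&lt;&#x2F;code&gt; function are names made up for this example; a real system would also persist the log and handle many more kinds of commands.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// mutation is one log entry: set key to value (for example x:4).
type mutation struct {
	key   string
	value int
}

// apply replays an ordered log on top of an initial state and returns the
// resulting state. The initial map is copied, not modified.
func apply(initial map[string]int, log []mutation) map[string]int {
	state := make(map[string]int, len(initial))
	for k, v := range initial {
		state[k] = v
	}
	for _, m := range log {
		state[m.key] = m.value
	}
	return state
}

func main() {
	initial := map[string]int{"x": 3, "y": 9}
	log := []mutation{{"x", 4}, {"y", 1}, {"x", 7}}

	// Two replicas replay the same log in the same order.
	replicaA := apply(initial, log)
	replicaB := apply(initial, log)
	fmt.Println(replicaA, replicaB)
}
```

&lt;p&gt;The state machine itself is deterministic, so the hard part is agreeing on the content and order of the log, which is the consensus module&#x27;s job.&lt;&#x2F;p&gt;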
&lt;h2 id=&quot;what-s-wrong-with-paxos&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-s-wrong-with-paxos&quot; aria-label=&quot;Anchor link for: what-s-wrong-with-paxos&quot;&gt;🔗&lt;&#x2F;a&gt;What’s wrong with Paxos?&lt;&#x2F;h2&gt;
&lt;p&gt;The paper lists the drawbacks of Paxos:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;difficult to understand, and &lt;a href=&quot;https:&#x2F;&#x2F;www.microsoft.com&#x2F;en-us&#x2F;research&#x2F;uploads&#x2F;prod&#x2F;2016&#x2F;12&#x2F;The-Part-Time-Parliament.pdf&quot;&gt;I can&#x27;t blame them&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;many details are missing from the paper to implement &lt;code&gt;Multi-Paxos&lt;&#x2F;code&gt; as the paper is mainly describing &lt;code&gt;single-decree Paxos&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. The following comment from the &lt;a href=&quot;https:&#x2F;&#x2F;static.googleusercontent.com&#x2F;media&#x2F;research.google.com&#x2F;en&#x2F;&#x2F;archive&#x2F;chubby-osdi06.pdf&quot;&gt;Chubby&lt;&#x2F;a&gt; implementers is typical:&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system... the final system will be based on an unproven protocol [4].&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;designing-for-understandability&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#designing-for-understandability&quot; aria-label=&quot;Anchor link for: designing-for-understandability&quot;&gt;🔗&lt;&#x2F;a&gt;Designing for understandability&lt;&#x2F;h2&gt;
&lt;p&gt;Besides all the other goals of Raft:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;a complete and practical foundation for system building,&lt;&#x2F;li&gt;
&lt;li&gt;must be safe under all conditions and available under typical operating conditions,&lt;&#x2F;li&gt;
&lt;li&gt;must be efficient for common operations,&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;understandability&lt;&#x2F;strong&gt; was the most difficult challenge:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;the-raft-consensus-algorithm&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-raft-consensus-algorithm&quot; aria-label=&quot;Anchor link for: the-raft-consensus-algorithm&quot;&gt;🔗&lt;&#x2F;a&gt;The Raft consensus algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;Raft relies heavily on the &lt;code&gt;leader&lt;&#x2F;code&gt; pattern:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Thanks to this pattern, Raft splits the consensus problem into three subproblems:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Leader election&lt;&#x2F;li&gt;
&lt;li&gt;Log replication&lt;&#x2F;li&gt;
&lt;li&gt;Safety&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;raft-basics&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#raft-basics&quot; aria-label=&quot;Anchor link for: raft-basics&quot;&gt;🔗&lt;&#x2F;a&gt;Raft basics&lt;&#x2F;h3&gt;
&lt;p&gt;Each server can be in one of the three states:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Leader&lt;&#x2F;strong&gt; handles all client requests,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Follower&lt;&#x2F;strong&gt; is a passive member: it issues no requests on its own but simply responds to requests from leaders and candidates,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Candidate&lt;&#x2F;strong&gt; is used to elect a new leader.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;A leader is elected through an &lt;code&gt;election&lt;&#x2F;code&gt;: each term (an interval of time of arbitrary length, identified by a number) begins with an election, in which one or more candidates attempt to become leader. If a candidate wins the election, then it serves as leader for the rest of the term. In the case of a split vote, the term will end with no leader; a new term (with a new election) will begin.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Terms act as a logical clock [14] in Raft.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
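&lt;p&gt;The term rules quoted above can be sketched in a few lines. This is only an illustration: &lt;code&gt;server&lt;&#x2F;code&gt;, &lt;code&gt;nodeState&lt;&#x2F;code&gt; and &lt;code&gt;observeTerm&lt;&#x2F;code&gt; are hypothetical names, and the sketch covers only the term comparison, not the rest of the RPC handling.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

type nodeState string

const (
	follower  nodeState = "follower"
	candidate nodeState = "candidate"
	leader    nodeState = "leader"
)

// server holds only the fields needed to illustrate the term rules.
type server struct {
	currentTerm uint64
	state       nodeState
}

// observeTerm applies the term rules to an incoming request carrying msgTerm.
// It returns false when the request is stale and must be rejected.
func (s *server) observeTerm(msgTerm uint64) bool {
	if s.currentTerm > msgTerm {
		return false // stale term: reject the request
	}
	if msgTerm > s.currentTerm {
		// Our term is out of date: adopt the larger one and step down.
		s.currentTerm = msgTerm
		s.state = follower
	}
	return true
}

func main() {
	s := server{currentTerm: 3, state: leader}
	fmt.Println(s.observeTerm(2), s.currentTerm, s.state) // request from an older term
	fmt.Println(s.observeTerm(5), s.currentTerm, s.state) // request from a newer term
}
```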
&lt;p&gt;&lt;code&gt;RPC&lt;&#x2F;code&gt; is used for communications:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RequestVote RPCs&lt;&#x2F;strong&gt; are initiated by candidates during elections,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;AppendEntries RPCs&lt;&#x2F;strong&gt; are initiated by leaders to replicate log entries and to provide a form of heartbeat.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;leader-election&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#leader-election&quot; aria-label=&quot;Anchor link for: leader-election&quot;&gt;🔗&lt;&#x2F;a&gt;Leader election&lt;&#x2F;h3&gt;
&lt;p&gt;A good visualization is available &lt;a href=&quot;http:&#x2F;&#x2F;thesecretlivesofdata.com&#x2F;raft&#x2F;#election&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The key points of the election are that:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;nodes vote for themselves,&lt;&#x2F;li&gt;
&lt;li&gt;the term number is used to recover from failure,&lt;&#x2F;li&gt;
&lt;li&gt;election timeouts are randomized.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;(a) it wins the election,&lt;&#x2F;li&gt;
&lt;li&gt;(b) another server establishes itself as leader,&lt;&#x2F;li&gt;
&lt;li&gt;(c) a period of time goes by with no winner.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
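&lt;p&gt;Here is a small sketch of how a randomized election timeout could be drawn. The bounds come from the example interval quoted above; the function and constant names are made up for this example, and a real implementation would reset a timer with this value each time a heartbeat is received.&lt;&#x2F;p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Bounds taken from the example interval quoted above (150-300ms).
const (
	minElectionTimeout = 150 * time.Millisecond
	maxElectionTimeout = 300 * time.Millisecond
)

// randomElectionTimeout picks a timeout in [minElectionTimeout, maxElectionTimeout).
func randomElectionTimeout() time.Duration {
	span := int64(maxElectionTimeout - minElectionTimeout)
	return minElectionTimeout + time.Duration(rand.Int63n(span))
}

func main() {
	// Each node draws its own timeout, so they rarely expire at the same time.
	for _, node := range []string{"A", "B", "C"} {
		fmt.Printf("node %s waits %v before starting an election\n", node, randomElectionTimeout())
	}
}
```

&lt;p&gt;Because each node waits for a different duration, one of them usually times out first, becomes candidate and collects votes before the others start their own election.&lt;&#x2F;p&gt;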
&lt;h3 id=&quot;log-replication&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#log-replication&quot; aria-label=&quot;Anchor link for: log-replication&quot;&gt;🔗&lt;&#x2F;a&gt;Log replication&lt;&#x2F;h3&gt;
&lt;p&gt;A good visualization is available &lt;a href=&quot;http:&#x2F;&#x2F;thesecretlivesofdata.com&#x2F;raft&#x2F;#replication&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The term numbers in log entries are used to detect inconsistencies between logs&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Raft builds several safety checks into the log itself:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This consistency check keeps the logs coherent even when leaders fail. When a follower&#x27;s log diverges, the leader repairs it:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
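&lt;p&gt;The consistency check and the repair step quoted above can be sketched from the follower&#x27;s side. This is a simplified illustration, not a full Raft implementation: the &lt;code&gt;Entry&lt;&#x2F;code&gt; type and the &lt;code&gt;append_entries&lt;&#x2F;code&gt; function are hypothetical names, and terms, persistence and commit handling are left out.&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    term: int
    command: str


def append_entries(log, prev_index, prev_term, entries):
    """Follower-side handling of AppendEntries; indexes are 1-based, 0 means 'empty log'.

    Returns False when the consistency check fails, True once the entries are stored.
    """
    # Consistency check: the follower must hold the entry that precedes the new ones.
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False

    for offset, entry in enumerate(entries):
        position = prev_index + offset  # 0-based slot for this entry
        if position >= len(log):
            log.append(entry)
        elif log[position].term != entry.term:
            # Conflict: drop this entry and everything after it, then take the leader's.
            del log[position:]
            log.append(entry)
    return True


follower = [Entry(1, "x=1"), Entry(1, "y=2"), Entry(2, "y=9")]
# The leader's entry at index 3 has term 3, so the follower's term-2 entry is replaced.
accepted = append_entries(follower, 2, 1, [Entry(3, "y=3")])
```

&lt;p&gt;When the call returns false, the leader retries with an earlier &lt;code&gt;prev_index&lt;&#x2F;code&gt; until it reaches the point where both logs agree.&lt;&#x2F;p&gt;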
&lt;h2 id=&quot;safety&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#safety&quot; aria-label=&quot;Anchor link for: safety&quot;&gt;🔗&lt;&#x2F;a&gt;Safety&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;leader-election-1&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#leader-election-1&quot; aria-label=&quot;Anchor link for: leader-election-1&quot;&gt;🔗&lt;&#x2F;a&gt;Leader election&lt;&#x2F;h3&gt;
&lt;p&gt;Raft guarantees that every committed entry from previous terms is present on each new leader from the moment it is elected. As a result, log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
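&lt;p&gt;The comparison described in the quote fits in a few lines. The function name and its parameters are hypothetical; a voter would grant its vote only when this returns true for the candidate:&lt;&#x2F;p&gt;

```python
def is_at_least_as_up_to_date(candidate_last_term, candidate_last_index,
                              voter_last_term, voter_last_index):
    """True when the candidate's log is at least as up-to-date as the voter's."""
    if candidate_last_term != voter_last_term:
        # Different last terms: the later term wins, whatever the lengths are.
        return candidate_last_term > voter_last_term
    # Same last term: the longer log wins.
    return candidate_last_index >= voter_last_index
```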
&lt;h3 id=&quot;committing-entries-from-previous-terms&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#committing-entries-from-previous-terms&quot; aria-label=&quot;Anchor link for: committing-entries-from-previous-terms&quot;&gt;🔗&lt;&#x2F;a&gt;Committing entries from previous terms&lt;&#x2F;h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
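&lt;p&gt;This rule exists because an entry from a previous term can still be overwritten by a future leader, even after it has been stored on a majority of servers. Such entries are committed indirectly, once an entry from the leader&#x27;s current term is committed.&lt;&#x2F;p&gt;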
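&lt;p&gt;A sketch of how a leader could advance its commit index under this rule. The function and its arguments are hypothetical, loosely following the &lt;code&gt;matchIndex&lt;&#x2F;code&gt; bookkeeping described in the paper:&lt;&#x2F;p&gt;

```python
def advance_commit_index(log_terms, match_index, commit_index, current_term):
    """Return the new commit index for a leader.

    log_terms[i] is the term of the entry at 1-based index i + 1.
    match_index holds, for every server including the leader, the highest
    index known to be replicated there.
    """
    majority = len(match_index) // 2 + 1
    for n in range(len(log_terms), commit_index, -1):
        replicas = sum(1 for m in match_index if m >= n)
        # Only an entry from the leader's current term may be committed by counting.
        if replicas >= majority and log_terms[n - 1] == current_term:
            return n
    return commit_index


# Term-4 leader: the term-2 entry at index 2 sits on 3 of 5 servers, yet stays uncommitted.
stuck = advance_commit_index([1, 2], [2, 2, 2, 1, 1], 1, 4)
# Once a term-4 entry reaches a majority, everything before it is committed too.
moved = advance_commit_index([1, 2, 4], [3, 3, 3, 1, 1], 1, 4)
```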
&lt;h3 id=&quot;follower-and-candidate-crashes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#follower-and-candidate-crashes&quot; aria-label=&quot;Anchor link for: follower-and-candidate-crashes&quot;&gt;🔗&lt;&#x2F;a&gt;Follower and candidate crashes&lt;&#x2F;h3&gt;
&lt;blockquote&gt;
&lt;p&gt;If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;cluster-membership-changes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#cluster-membership-changes&quot; aria-label=&quot;Anchor link for: cluster-membership-changes&quot;&gt;🔗&lt;&#x2F;a&gt;Cluster membership changes&lt;&#x2F;h2&gt;
&lt;p&gt;This section presents how to change the cluster configuration (the set of servers participating in the consensus algorithm). Raft implements a two-phase approach:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Log entries are replicated to all servers in both configurations,&lt;&#x2F;li&gt;
&lt;li&gt;Any server from either configuration may serve as leader,&lt;&#x2F;li&gt;
&lt;li&gt;Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;blockquote&gt;
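&lt;p&gt;The third rule is the one that is easy to get wrong: a majority of the union is not enough. A small sketch, with hypothetical function names and server ids:&lt;&#x2F;p&gt;

```python
def has_majority(acks, config):
    """True when more than half of the servers in config acknowledged."""
    return 2 * len(acks.intersection(config)) > len(config)


def joint_agreement(acks, old_config, new_config):
    """Joint consensus needs separate majorities in the old and the new configuration."""
    return has_majority(acks, old_config) and has_majority(acks, new_config)


old_config = {"a", "b", "c"}
new_config = {"c", "d", "e"}
# Three of five servers answered, but only one of them belongs to the new configuration.
only_old = joint_agreement({"a", "b", "c"}, old_config, new_config)
# Two out of three in each configuration: agreement is reached.
both = joint_agreement({"a", "c", "d"}, old_config, new_config)
```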
&lt;h2 id=&quot;log-compaction&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#log-compaction&quot; aria-label=&quot;Anchor link for: log-compaction&quot;&gt;🔗&lt;&#x2F;a&gt;Log compaction&lt;&#x2F;h2&gt;
&lt;p&gt;Because the log holds every command, it grows without bound and must be compacted. Raft uses snapshots, as described here:&lt;&#x2F;p&gt;
&lt;img src=&quot;&#x2F;images&#x2F;notes-about-raft&#x2F;fig_3.png&quot; alt=&quot;fig3&quot; class=&quot;center&quot;&gt;
&lt;blockquote&gt;
&lt;p&gt;the leader must occasionally send snapshots to followers that lag behind.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This is useful for a slow follower or for a new server joining the cluster.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;client-interaction&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#client-interaction&quot; aria-label=&quot;Anchor link for: client-interaction&quot;&gt;🔗&lt;&#x2F;a&gt;Client interaction&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly-chosen server. If the client’s first choice is not the leader, that server will reject the client’s request and supply information about the most recent leader it has heard from.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">distributed</category>
          <category domain="tag">consensus</category>
          <category domain="tag">raft</category>
          <category domain="tag">algorithms</category>
          <category domain="tag">notes</category>
      </item>
      <item>
          <title>Announcing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pulsar</title>
          <pubDate>Tue, 24 Mar 2020 10:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/announcing-kop/</link>
          <guid>https://pierrezemb.fr/posts/announcing-kop/</guid>
          <description xml:base="https://pierrezemb.fr/posts/announcing-kop/">&lt;blockquote&gt;
&lt;p&gt;This is a repost from &lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;blog&#x2F;announcing-kafka-on-pulsar-bring-native-kafka-protocol-support-to-apache-pulsar&#x2F;&quot; title=&quot;Permalink to announcing KoP&quot;&gt;OVHcloud&#x27;s official blogpost&lt;&#x2F;a&gt;; please read it there to support my company. Thanks &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;LostInBrittany&#x2F;&quot;&gt;Horacio Gonzalez&lt;&#x2F;a&gt; for the awesome drawings!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This post has been published on both the StreamNative and OVHcloud blogs and was co-authored by &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;sijieg&quot;&gt;Sijie Guo&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;Jia_Zhai&quot;&gt;Jia Zhai&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Pierre Zemb&lt;&#x2F;a&gt;. Thanks &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;LostInBrittany&quot;&gt;Horacio Gonzalez&lt;&#x2F;a&gt; for the illustrations!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;announcing-kop&#x2F;kop-1.png&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We are excited to announce that StreamNative and OVHcloud are open-sourcing &quot;Kafka on Pulsar&quot; (KoP). KoP brings the native Apache Kafka protocol support to Apache Pulsar by introducing a Kafka protocol handler on Pulsar brokers. By adding the KoP protocol handler to your existing Pulsar cluster, you can now migrate your existing Kafka applications and services to Pulsar without modifying the code. This enables Kafka applications to leverage Pulsar&#x27;s powerful features, such as:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Streamlined operations with enterprise-grade multi-tenancy&lt;&#x2F;li&gt;
&lt;li&gt;Simplified operations with a rebalance-free architecture&lt;&#x2F;li&gt;
&lt;li&gt;Infinite event stream retention with Apache BookKeeper and tiered storage&lt;&#x2F;li&gt;
&lt;li&gt;Serverless event processing with Pulsar Functions&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;what-is-apache-pulsar&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-is-apache-pulsar&quot; aria-label=&quot;Anchor link for: what-is-apache-pulsar&quot;&gt;🔗&lt;&#x2F;a&gt;What is Apache Pulsar?&lt;&#x2F;h2&gt;
&lt;p&gt;Apache Pulsar is an event streaming platform designed from the ground up to be cloud-native, deploying a multi-layer and segment-centric architecture. The architecture separates serving and storage into different layers, making the system container-friendly. The cloud-native architecture provides scalability, availability and resiliency and enables companies to expand their offerings with real-time data-enabled solutions. Pulsar has gained wide adoption since it was open-sourced in 2016 and was designated an Apache Top-Level project in 2018.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-need-behind-kop&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-need-behind-kop&quot; aria-label=&quot;Anchor link for: the-need-behind-kop&quot;&gt;🔗&lt;&#x2F;a&gt;The need behind KoP&lt;&#x2F;h2&gt;
&lt;p&gt;Pulsar provides a unified messaging model for both queueing and streaming workloads. Pulsar implemented its own protobuf-based binary protocol to provide high performance and low latency. This choice of protobuf makes it convenient to implement Pulsar &lt;a href=&quot;https:&#x2F;&#x2F;pulsar.apache.org&#x2F;docs&#x2F;en&#x2F;client-libraries&#x2F;&quot;&gt;clients&lt;&#x2F;a&gt;, and the project already supports Java, Go, Python and C++ alongside &lt;a href=&quot;https:&#x2F;&#x2F;pulsar.apache.org&#x2F;docs&#x2F;en&#x2F;client-libraries&#x2F;#thirdparty-clients&quot;&gt;third-party clients&lt;&#x2F;a&gt; provided by the community. However, existing applications written using other messaging protocols had to be rewritten to adopt Pulsar&#x27;s new unified messaging protocol.&lt;&#x2F;p&gt;
&lt;p&gt;To address this, the Pulsar community developed applications to facilitate the migration to Pulsar from other messaging systems. For example, Pulsar provides a &lt;a href=&quot;https:&#x2F;&#x2F;pulsar.apache.org&#x2F;docs&#x2F;en&#x2F;adaptors-kafka&quot;&gt;Kafka wrapper&lt;&#x2F;a&gt; on the Kafka Java API, which allows existing applications that already use the Kafka Java client to switch from Kafka to Pulsar &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=Cy9ev9nAZpI&quot;&gt;without code change&lt;&#x2F;a&gt;. Pulsar also has a rich connector ecosystem, connecting Pulsar with other data systems. Yet, there was still a strong demand from those looking to switch other Kafka applications to Pulsar.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;streamnative-and-ovhcloud-s-collaboration&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#streamnative-and-ovhcloud-s-collaboration&quot; aria-label=&quot;Anchor link for: streamnative-and-ovhcloud-s-collaboration&quot;&gt;🔗&lt;&#x2F;a&gt;StreamNative and OVHcloud&#x27;s collaboration&lt;&#x2F;h2&gt;
&lt;p&gt;StreamNative was receiving a lot of inbound requests for help migrating from other messaging systems to Pulsar and recognized the need to support other messaging protocols (such as AMQP and Kafka) natively on Pulsar. StreamNative began working on introducing a general protocol handler framework in Pulsar that would allow developers using other messaging protocols to use Pulsar.&lt;&#x2F;p&gt;
&lt;p&gt;Internally, OVHcloud had been running Apache Kafka for years, but despite their experience operating multiple clusters with millions of messages per second on Kafka, there were painful operational challenges. For example, putting thousands of topics from thousands of users into a single cluster was difficult without multi-tenancy.&lt;&#x2F;p&gt;
&lt;p&gt;As a result, OVHcloud decided to shift and build the foundation of their topic-as-a-service product, called ioStream, on Pulsar instead of Kafka. Pulsar&#x27;s multi-tenancy and the overall architecture with Apache BookKeeper simplified operations compared to Kafka.&lt;&#x2F;p&gt;
&lt;p&gt;After spawning the first region, OVHcloud decided to build a proof-of-concept proxy capable of translating the Kafka protocol to Pulsar on the fly. During this process, OVHcloud discovered that StreamNative was working on bringing the Kafka protocol natively to Pulsar, and they joined forces to develop KoP.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;announcing-kop&#x2F;kop-2.png&quot; alt=&quot;kop image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;KoP was developed to provide a streamlined and comprehensive solution leveraging Pulsar and BookKeeper&#x27;s event stream storage infrastructure and Pulsar&#x27;s pluggable protocol handler framework. KoP is implemented as a protocol handler plugin with protocol name &quot;kafka&quot;. It can be installed and configured to run as part of Pulsar brokers.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-distributed-log&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-distributed-log&quot; aria-label=&quot;Anchor link for: the-distributed-log&quot;&gt;🔗&lt;&#x2F;a&gt;The distributed log&lt;&#x2F;h2&gt;
&lt;p&gt;Both Pulsar and Kafka share a very similar data model around the &lt;strong&gt;log&lt;&#x2F;strong&gt; for both pub&#x2F;sub messaging and event streaming: both are built on top of a distributed log. Kafka implements the distributed log in a partition-based architecture, where a distributed log (a partition in Kafka) is stored on a designated set of brokers, while Pulsar deploys a &lt;strong&gt;segment&lt;&#x2F;strong&gt;-based architecture to implement its distributed log by leveraging Apache BookKeeper as its scale-out segment storage layer. Pulsar&#x27;s &lt;em&gt;segment&lt;&#x2F;em&gt;-based architecture provides benefits such as rebalance-free operation, instant scalability, and infinite event stream storage. You can learn more about the key differences between Pulsar and Kafka in &lt;a href=&quot;https:&#x2F;&#x2F;www.splunk.com&#x2F;en_us&#x2F;blog&#x2F;it&#x2F;comparing-pulsar-and-kafka-how-a-segment-based-architecture-delivers-better-performance-scalability-and-resilience.html&quot;&gt;this Splunk blog&lt;&#x2F;a&gt; and in &lt;a href=&quot;http:&#x2F;&#x2F;bookkeeper.apache.org&#x2F;distributedlog&#x2F;technical-review&#x2F;2016&#x2F;09&#x2F;19&#x2F;kafka-vs-distributedlog.html&quot;&gt;this blog from the BookKeeper project&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Since both of the systems are built on a similar data model, a distributed log, it is very simple to implement a Kafka-compatible protocol handler by leveraging Pulsar&#x27;s distributed log storage and its pluggable protocol handler framework (introduced in the 2.5.0 release).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementations&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#implementations&quot; aria-label=&quot;Anchor link for: implementations&quot;&gt;🔗&lt;&#x2F;a&gt;Implementations&lt;&#x2F;h2&gt;
&lt;p&gt;We started the implementation by comparing the Pulsar and Kafka protocols, and found a lot of similarities between them. Both protocols are composed of the following operations:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Topic Lookup&lt;&#x2F;strong&gt;: All the clients connect to any broker to look up the metadata (i.e. the owner broker) of the topics. After fetching the metadata, the clients establish persistent TCP connections to the owner brokers.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Produce&lt;&#x2F;strong&gt;: The clients talk to the &lt;strong&gt;owner&lt;&#x2F;strong&gt; broker of a topic partition to append the messages to a distributed log.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Consume&lt;&#x2F;strong&gt;: The clients talk to the &lt;strong&gt;owner&lt;&#x2F;strong&gt; broker of a topic partition to read the messages from a distributed log.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Offset&lt;&#x2F;strong&gt;: The messages produced to a topic partition are assigned with an offset. The offset in Pulsar is called MessageId. Consumers can use &lt;strong&gt;offsets&lt;&#x2F;strong&gt; to seek to a given position within the log to read messages.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Consumption State&lt;&#x2F;strong&gt;: Both systems maintain the consumption state for consumers within a subscription (or a consumer group in Kafka). The consumption state is stored in the &lt;code&gt;__consumer_offsets&lt;&#x2F;code&gt; topic in Kafka, and as cursors in Pulsar.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As you can see, these are all the primitive operations provided by a scale-out distributed log storage such as Apache BookKeeper. The core capabilities of Pulsar are implemented on top of Apache BookKeeper. Thus it is pretty easy and straightforward to implement the Kafka concepts by using the existing components that Pulsar has developed on BookKeeper.&lt;br&gt;
The following figure illustrates how we add the Kafka protocol support within Pulsar. We are introducing a new &lt;strong&gt;Protocol Handler&lt;&#x2F;strong&gt; which implements the Kafka wire protocol by leveraging the existing components that Pulsar already has (such as topic discovery, the distributed log library ManagedLedger, cursors, etc.).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;announcing-kop&#x2F;kop-3.png&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;topics&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#topics&quot; aria-label=&quot;Anchor link for: topics&quot;&gt;🔗&lt;&#x2F;a&gt;Topics&lt;&#x2F;h3&gt;
&lt;p&gt;In Kafka, all the topics are stored in one flat namespace. But in Pulsar, topics are organized in hierarchical multi-tenant namespaces. We introduce a setting &lt;em&gt;kafkaNamespace&lt;&#x2F;em&gt; in the broker configuration that lets the administrator map Kafka topics to Pulsar topics.&lt;&#x2F;p&gt;
&lt;p&gt;In order to let Kafka users leverage the multi-tenancy feature of Apache Pulsar, a Kafka user can specify a Pulsar tenant and namespace as its SASL username when authenticating a Kafka client with the SASL authentication mechanism.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;message-id-and-offset&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#message-id-and-offset&quot; aria-label=&quot;Anchor link for: message-id-and-offset&quot;&gt;🔗&lt;&#x2F;a&gt;Message ID and offset&lt;&#x2F;h3&gt;
&lt;p&gt;In Kafka, each message is assigned an offset once it is successfully produced to a topic partition. In Pulsar, each message is assigned a &lt;code&gt;MessageID&lt;&#x2F;code&gt;. The message id consists of 3 components: &lt;em&gt;ledger-id&lt;&#x2F;em&gt;, &lt;em&gt;entry-id&lt;&#x2F;em&gt;, and &lt;em&gt;batch-index&lt;&#x2F;em&gt;. We use the same approach as the Pulsar-Kafka wrapper to convert a Pulsar MessageID to an offset and vice versa.&lt;&#x2F;p&gt;
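&lt;p&gt;To give an idea of what such a conversion can look like, here is a simplified sketch that packs the ledger id and the entry id into a single integer. The bit width, the function names and the choice to ignore the batch index are assumptions made for illustration; they are not taken from the KoP source code.&lt;&#x2F;p&gt;

```python
# Assumed layout for illustration: the low 28 bits hold the entry id,
# the remaining high bits hold the ledger id. The batch index is ignored here.
ENTRY_ID_BITS = 28
ENTRY_ID_SPAN = 2 ** ENTRY_ID_BITS


def message_id_to_offset(ledger_id, entry_id):
    """Pack (ledger_id, entry_id) into one integer usable as a Kafka offset."""
    if entry_id not in range(ENTRY_ID_SPAN):
        raise ValueError("entry_id does not fit in the assumed bit width")
    return ledger_id * ENTRY_ID_SPAN + entry_id


def offset_to_message_id(offset):
    """Inverse of message_id_to_offset: returns (ledger_id, entry_id)."""
    return divmod(offset, ENTRY_ID_SPAN)


offset = message_id_to_offset(12, 345)
ledger_id, entry_id = offset_to_message_id(offset)
```

&lt;p&gt;With this kind of packing, offsets are not contiguous, but they keep their order as long as later ledgers get larger ids.&lt;&#x2F;p&gt;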
&lt;h3 id=&quot;messages&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#messages&quot; aria-label=&quot;Anchor link for: messages&quot;&gt;🔗&lt;&#x2F;a&gt;Messages&lt;&#x2F;h3&gt;
&lt;p&gt;Both a Kafka message and a Pulsar message have key, value, timestamp, and headers (note: this is called &#x27;properties&#x27; in Pulsar). We convert these fields automatically between Kafka messages and Pulsar messages.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;topic-lookup&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#topic-lookup&quot; aria-label=&quot;Anchor link for: topic-lookup&quot;&gt;🔗&lt;&#x2F;a&gt;Topic lookup&lt;&#x2F;h3&gt;
&lt;p&gt;We use the same topic lookup approach for the Kafka request handler as the Pulsar request handler. The request handler does topic discovery to look up all the ownerships for the requested topic partitions and responds with the ownership information as part of Kafka TopicMetadata back to Kafka clients.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;produce-messages&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#produce-messages&quot; aria-label=&quot;Anchor link for: produce-messages&quot;&gt;🔗&lt;&#x2F;a&gt;Produce Messages&lt;&#x2F;h3&gt;
&lt;p&gt;When the Kafka request handler receives produced messages from a Kafka client, it converts Kafka messages to Pulsar messages by mapping the fields (i.e. key, value, timestamp and headers) one by one, and uses the ManagedLedger append API to append those converted Pulsar messages to BookKeeper. Converting Kafka messages to Pulsar messages allows existing Pulsar applications to consume messages produced by Kafka clients.&lt;&#x2F;p&gt;
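&lt;p&gt;The field-by-field mapping can be pictured as follows. Both classes are hypothetical stand-ins, not the real Kafka or Pulsar client types, and decoding header values as UTF-8 strings is an assumption made to keep the sketch short:&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class KafkaRecord:  # hypothetical stand-in, not the Kafka client class
    key: bytes
    value: bytes
    timestamp_ms: int
    headers: list = field(default_factory=list)  # (name, bytes) pairs


@dataclass
class PulsarMessage:  # hypothetical stand-in, not the Pulsar client class
    key: bytes
    payload: bytes
    event_time_ms: int
    properties: dict = field(default_factory=dict)  # name to string value


def kafka_to_pulsar(record):
    """Map key, value, timestamp and headers one by one."""
    # Assumption: header values are UTF-8 text, so they fit string properties.
    properties = {name: value.decode("utf-8") for name, value in record.headers}
    return PulsarMessage(record.key, record.value, record.timestamp_ms, properties)


record = KafkaRecord(b"user-1", b"hello", 1585041867000, [("source", b"kafka")])
message = kafka_to_pulsar(record)
```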
&lt;h3 id=&quot;consume-messages&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#consume-messages&quot; aria-label=&quot;Anchor link for: consume-messages&quot;&gt;🔗&lt;&#x2F;a&gt;Consume Messages&lt;&#x2F;h3&gt;
&lt;p&gt;When the Kafka request handler receives a consumer request from a Kafka client, it opens a non-durable cursor to read the entries starting from the requested offset. The Kafka request handler converts the Pulsar messages back to Kafka messages to allow existing Kafka applications to consume the messages produced by Pulsar clients.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;group-coordinator-offsets-management&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#group-coordinator-offsets-management&quot; aria-label=&quot;Anchor link for: group-coordinator-offsets-management&quot;&gt;🔗&lt;&#x2F;a&gt;Group coordinator &amp;amp; offsets management&lt;&#x2F;h3&gt;
&lt;p&gt;The most challenging part is to implement the group coordinator and offsets management, because Pulsar doesn&#x27;t have a centralized group coordinator for assigning partitions to consumers of a consumer group and managing offsets for each consumer group. In Pulsar, the partition assignment is managed by the broker on a per-partition basis, and the offset management is done by the owner broker of that partition, which stores the acknowledgements in cursors.&lt;&#x2F;p&gt;
&lt;p&gt;It is difficult to align the Pulsar model with the Kafka model. Hence, for the sake of providing full compatibility with Kafka clients, we implemented the Kafka group coordinator by storing the coordinator group changes and offsets in a system topic called &lt;em&gt;public&#x2F;kafka&#x2F;__offsets&lt;&#x2F;em&gt; in Pulsar.&lt;&#x2F;p&gt;
&lt;p&gt;This allows us to bridge the gap between Pulsar and Kafka and allows people to use existing Pulsar tools and policies to manage subscriptions and monitor Kafka consumers. We add a background thread in the implemented group coordinator to periodically sync offset updates from the system topic to Pulsar cursors. Hence a Kafka consumer group is effectively treated as a Pulsar subscription, and all existing Pulsar tooling can be used for managing Kafka consumer groups as well.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bridge-two-popular-messaging-ecosystems&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#bridge-two-popular-messaging-ecosystems&quot; aria-label=&quot;Anchor link for: bridge-two-popular-messaging-ecosystems&quot;&gt;🔗&lt;&#x2F;a&gt;Bridge two popular messaging ecosystems&lt;&#x2F;h2&gt;
&lt;p&gt;At both companies, we value customer success. We believe that providing a native Kafka protocol on Apache Pulsar will reduce the barriers for people adopting Pulsar to achieve their business success. By integrating two popular event streaming ecosystems, KoP unlocks new use cases. Customers can leverage advantages from each ecosystem and build a truly unified event streaming platform with Apache Pulsar to accelerate the development of real-time applications and services.&lt;&#x2F;p&gt;
&lt;p&gt;With KoP, a log collector can continue collecting log data from its sources and producing messages to Apache Pulsar using existing Kafka integrations. The downstream applications can use Pulsar Functions to process the events arriving in the system to do serverless event streaming.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;try-it-out&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#try-it-out&quot; aria-label=&quot;Anchor link for: try-it-out&quot;&gt;🔗&lt;&#x2F;a&gt;Try it out&lt;&#x2F;h2&gt;
&lt;p&gt;KoP is open sourced under Apache License V2 in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;streamnative&#x2F;kop&quot;&gt;https:&#x2F;&#x2F;github.com&#x2F;streamnative&#x2F;kop&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We are looking forward to your issues and PRs. You can also &lt;a href=&quot;https:&#x2F;&#x2F;apache-pulsar.herokuapp.com&#x2F;&quot;&gt;join the #kop channel in Pulsar Slack&lt;&#x2F;a&gt; to discuss all things about Kafka-on-Pulsar.&lt;&#x2F;p&gt;
&lt;p&gt;StreamNative and OVHcloud are also hosting a webinar about KoP on March 31. If you are interested in learning more details about KoP, &lt;a href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;webinar&#x2F;register&#x2F;6515842602644&#x2F;WN_l_i-3ekDSg6PwPFn7tqRvA&quot;&gt;please sign up&lt;&#x2F;a&gt;. Looking forward to meeting you online.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;announcing-kop&#x2F;kop-4.png&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;thanks&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#thanks&quot; aria-label=&quot;Anchor link for: thanks&quot;&gt;🔗&lt;&#x2F;a&gt;Thanks&lt;&#x2F;h2&gt;
&lt;p&gt;The KoP project was originally initiated by StreamNative. The OVHcloud team joined the project to collaborate on the development of the KoP project. Many thanks to Pierre Zemb and Steven Le Roux from OVHcloud for their contributions to this project!&lt;&#x2F;p&gt;
</description>
          <category domain="tag">messaging</category>
          <category domain="tag">distributed</category>
          <category domain="tag">kafka</category>
          <category domain="tag">pulsar</category>
          <category domain="tag">opensource</category>
      </item>
      <item>
          <title>Contributing to Apache HBase: custom data balancing</title>
          <pubDate>Fri, 14 Feb 2020 10:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/hbase-custom-data-balancing/</link>
          <guid>https://pierrezemb.fr/posts/hbase-custom-data-balancing/</guid>
          <description xml:base="https://pierrezemb.fr/posts/hbase-custom-data-balancing/">&lt;blockquote&gt;
&lt;p&gt;This is a repost from &lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;blog&#x2F;contributing-to-apache-hbase-custom-data-balancing&#x2F;&quot; title=&quot;Permalink to Contributing to Apache HBase: custom data balancing&quot;&gt;OVHcloud&#x27;s official blogpost&lt;&#x2F;a&gt;; please read it there to support my company. Thanks &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;LostInBrittany&#x2F;&quot;&gt;Horacio Gonzalez&lt;&#x2F;a&gt; for the awesome drawings!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In today&#x27;s blogpost, we&#x27;re going to take a look at our upstream
contribution to Apache HBase&#x27;s stochastic load balancer, based on our
experience of running HBase clusters to support OVHcloud&#x27;s monitoring.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;hbase-custom-data-balancing&#x2F;hbase-ovh-1.jpeg&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-context&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-context&quot; aria-label=&quot;Anchor link for: the-context&quot;&gt;🔗&lt;&#x2F;a&gt;The context&lt;&#x2F;h2&gt;
&lt;p&gt;Have you ever wondered how:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;we generate the graphs for your OVHcloud server or web hosting package?&lt;&#x2F;li&gt;
&lt;li&gt;our internal teams monitor their own servers and applications?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;All internal teams are constantly gathering telemetry and monitoring data&lt;&#x2F;strong&gt; and sending them to a &lt;strong&gt;dedicated team,&lt;&#x2F;strong&gt; who are responsible for &lt;strong&gt;handling all the metrics and logs generated by OVHcloud&#x27;s infrastructure&lt;&#x2F;strong&gt;: the Observability team.&lt;&#x2F;p&gt;
&lt;p&gt;We tried a lot of different &lt;strong&gt;Time Series databases&lt;&#x2F;strong&gt;, and eventually chose &lt;a href=&quot;https:&#x2F;&#x2F;warp10.io&#x2F;&quot;&gt;Warp10&lt;&#x2F;a&gt; to handle our workloads. &lt;strong&gt;Warp10&lt;&#x2F;strong&gt; can be integrated with the various &lt;strong&gt;big-data solutions&lt;&#x2F;strong&gt; provided by the &lt;a href=&quot;https:&#x2F;&#x2F;www.apache.org&#x2F;&quot;&gt;Apache Foundation.&lt;&#x2F;a&gt; In our case, we use &lt;a href=&quot;http:&#x2F;&#x2F;hbase.apache.org&#x2F;&quot;&gt;Apache HBase&lt;&#x2F;a&gt; as the long-term storage datastore for our metrics.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;http:&#x2F;&#x2F;hbase.apache.org&#x2F;&quot;&gt;Apache HBase&lt;&#x2F;a&gt;, a datastore built on top of &lt;a href=&quot;http:&#x2F;&#x2F;hadoop.apache.org&#x2F;&quot;&gt;Apache Hadoop&lt;&#x2F;a&gt;, provides &lt;strong&gt;an elastic, distributed, key-ordered map.&lt;&#x2F;strong&gt; As such, one of the key features of Apache HBase for us is the ability to &lt;strong&gt;scan&lt;&#x2F;strong&gt;, i.e. retrieve a range of keys. Thanks to this feature, we can fetch &lt;strong&gt;thousands of datapoints in an optimised way&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We have our own dedicated clusters, the biggest of which has more than 270 nodes to spread our workloads:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;between 1.6 and 2 million writes per second, 24&#x2F;7&lt;&#x2F;li&gt;
&lt;li&gt;between 4 and 6 million reads per second&lt;&#x2F;li&gt;
&lt;li&gt;around 300TB of telemetry, stored within Apache HBase&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As you can probably imagine, storing 300TB of data in 270 nodes comes with some challenges regarding data distribution, as &lt;strong&gt;every&lt;&#x2F;strong&gt; &lt;strong&gt;bit is hot data, and should be accessible at any time&lt;&#x2F;strong&gt;. Let&#x27;s dive in!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-does-balancing-work-in-apache-hbase&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-does-balancing-work-in-apache-hbase&quot; aria-label=&quot;Anchor link for: how-does-balancing-work-in-apache-hbase&quot;&gt;🔗&lt;&#x2F;a&gt;How does balancing work in Apache HBase?&lt;&#x2F;h2&gt;
&lt;p&gt;Before diving into our contribution, let&#x27;s take a look at how the balancer works. In Apache HBase, data is split into shards called &lt;code&gt;Regions&lt;&#x2F;code&gt;, and distributed across &lt;code&gt;RegionServers&lt;&#x2F;code&gt;. As data comes in, regions are split and their number increases. This is where the &lt;code&gt;Balancer&lt;&#x2F;code&gt; comes in. It will &lt;strong&gt;move regions&lt;&#x2F;strong&gt; to avoid hotspotting a single &lt;code&gt;RegionServer&lt;&#x2F;code&gt; and effectively distribute the load.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;hbase-custom-data-balancing&#x2F;hbase-ovh-2.jpeg&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The actual implementation, called &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;master&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;master&#x2F;balancer&#x2F;StochasticLoadBalancer.java&quot;&gt;StochasticLoadBalancer&lt;&#x2F;a&gt;, uses &lt;strong&gt;a cost-based approach:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;It first computes the &lt;strong&gt;overall cost&lt;&#x2F;strong&gt; of the cluster, by looping through &lt;code&gt;cost functions&lt;&#x2F;code&gt;. Every cost function &lt;strong&gt;returns a number between 0 and 1 inclusive&lt;&#x2F;strong&gt;, where 0 is the lowest cost and best solution, and 1 is the highest possible cost and worst solution. Apache HBase comes with several cost functions, which measure things like region load, table load, data locality, or the number of regions per RegionServer. The computed costs are &lt;strong&gt;scaled by their respective coefficients, defined in the configuration&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Now that the initial cost is computed, we can try to &lt;code&gt;Mutate&lt;&#x2F;code&gt; our cluster. For this, the Balancer creates a random &lt;code&gt;nextAction&lt;&#x2F;code&gt;, which could be something like &lt;strong&gt;swapping two regions&lt;&#x2F;strong&gt;, or &lt;strong&gt;moving one region to another RegionServer&lt;&#x2F;strong&gt;. The action is &lt;strong&gt;applied virtually&lt;&#x2F;strong&gt;, and then the &lt;strong&gt;new cost is calculated&lt;&#x2F;strong&gt;. If the new cost is lower than our previous one, the action is stored. If not, it is skipped. This operation is repeated &lt;code&gt;thousands of times&lt;&#x2F;code&gt;, hence the name &lt;code&gt;Stochastic&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;At the end, &lt;strong&gt;the list of valid actions is applied to the actual cluster.&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
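&lt;p&gt;To make the three steps above more concrete, here is a minimal Python sketch of the same loop. It is not the HBase implementation: the function names, the toy &lt;code&gt;region_count_skew&lt;&#x2F;code&gt; cost function and the fixed number of steps are all assumptions made for illustration, and only the &quot;move one region&quot; action is modelled.&lt;&#x2F;p&gt;

```python
import random


def region_count_skew(cluster):
    """Toy cost function in [0, 1]: 0 when every server holds the same number of regions."""
    counts = [len(regions) for regions in cluster.values()]
    total = sum(counts)
    if total == 0:
        return 0.0
    return (max(counts) - min(counts)) / total


def total_cost(cluster, weighted_functions):
    """Sum every cost function's result, scaled by its coefficient."""
    return sum(weight * fn(cluster) for fn, weight in weighted_functions)


def random_move(cluster, rng):
    """Pick a random region and a random destination server (hypothetical action)."""
    sources = [server for server, regions in cluster.items() if regions]
    src = rng.choice(sources)
    dst = rng.choice([server for server in cluster if server != src])
    return rng.choice(sorted(cluster[src])), src, dst


def balance(cluster, weighted_functions, steps=2000, seed=42):
    """Return (accepted moves, final cost); `cluster` itself is left untouched."""
    rng = random.Random(seed)
    # Work on a virtual copy so that nothing is applied to the real cluster.
    virtual = {server: set(regions) for server, regions in cluster.items()}
    best = total_cost(virtual, weighted_functions)
    plan = []
    for _ in range(steps):
        region, src, dst = random_move(virtual, rng)
        virtual[src].remove(region)
        virtual[dst].add(region)
        cost = total_cost(virtual, weighted_functions)
        if best > cost:
            # Strictly lower cost: the action is stored.
            best = cost
            plan.append((region, src, dst))
        else:
            # Otherwise the action is skipped and the virtual move is undone.
            virtual[dst].remove(region)
            virtual[src].add(region)
    return plan, best
```

&lt;p&gt;Calling &lt;code&gt;balance(cluster, [(region_count_skew, 1.0)])&lt;&#x2F;code&gt; on a dictionary mapping server names to sets of regions returns the accepted moves and the final cost, while leaving &lt;code&gt;cluster&lt;&#x2F;code&gt; untouched. This mirrors the idea of applying actions virtually before touching the real cluster.&lt;&#x2F;p&gt;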
&lt;h2 id=&quot;what-was-not-working-for-us&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-was-not-working-for-us&quot; aria-label=&quot;Anchor link for: what-was-not-working-for-us&quot;&gt;🔗&lt;&#x2F;a&gt;What was not working for us?&lt;&#x2F;h2&gt;
&lt;p&gt;We found out that &lt;strong&gt;for our specific use case&lt;&#x2F;strong&gt;, which involved:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Single table&lt;&#x2F;li&gt;
&lt;li&gt;Dedicated Apache HBase and Apache Hadoop, &lt;strong&gt;tailored for our requirements&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Good key distribution&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;the number of regions per RegionServer was the real limit for us&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Even if the balancing strategy seems simple, &lt;strong&gt;we do think that being able to run an Apache HBase cluster on heterogeneous hardware is vital&lt;&#x2F;strong&gt;, especially in cloud environments, because you &lt;strong&gt;may not be able to buy the same server specs again in the future.&lt;&#x2F;strong&gt;
In our earlier example, our cluster grew from 80 to ~250 machines in
four years. Throughout that time, we bought new dedicated server
models, and even tested some special internal ones.&lt;&#x2F;p&gt;
&lt;p&gt;We ended up with different groups of hardware: &lt;strong&gt;some servers can handle only 180 regions, whereas the biggest can handle more than 900&lt;&#x2F;strong&gt;. Because of this disparity, we had to disable the Load Balancer to avoid the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;master&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;master&#x2F;balancer&#x2F;StochasticLoadBalancer.java#L1194&quot;&gt;RegionCountSkewCostFunction&lt;&#x2F;a&gt;, which would try to bring all RegionServers to the same number of regions.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;hbase-custom-data-balancing&#x2F;hbase-ovh-3.jpeg&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Two years ago we developed some internal tools, which are responsible
for load balancing regions across RegionServers. The tooling worked
really well for our use case, simplifying the day-to-day operation of
our cluster.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Open source is in the DNA of OVHcloud&lt;&#x2F;strong&gt;, and that means that we build our tools on open source software, but also that we &lt;strong&gt;contribute&lt;&#x2F;strong&gt;
and give back to the community. When we talked with others, we saw that
we weren&#x27;t the only ones concerned by the heterogeneous cluster problem.
We decided to rewrite our tooling to make it more general, and to &lt;strong&gt;contribute&lt;&#x2F;strong&gt; it &lt;strong&gt;directly upstream&lt;&#x2F;strong&gt; to the HBase project.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;our-contributions&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#our-contributions&quot; aria-label=&quot;Anchor link for: our-contributions&quot;&gt;🔗&lt;&#x2F;a&gt;Our contributions&lt;&#x2F;h2&gt;
&lt;p&gt;The first contribution was pretty simple: the list of cost functions was a &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;8cb531f207b9f9f51ab1509655ae59701b66ac37&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;master&#x2F;balancer&#x2F;StochasticLoadBalancer.java#L199-L213&quot;&gt;constant&lt;&#x2F;a&gt;. We &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;commit&#x2F;836f26976e1ad8b35d778c563067ed0614c026e9&quot;&gt;added the possibility to load custom cost functions&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The second contribution was about &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;commit&#x2F;42d535a57a75b58f585b48df9af9c966e6c7e46a&quot;&gt;adding an optional costFunction to balance regions according to a capacity rule&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-does-it-works&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-does-it-works&quot; aria-label=&quot;Anchor link for: how-does-it-works&quot;&gt;🔗&lt;&#x2F;a&gt;How does it work?&lt;&#x2F;h2&gt;
&lt;p&gt;The balancer will load a file containing lines of rules. &lt;strong&gt;A rule is composed of a regexp for hostname, and a limit.&lt;&#x2F;strong&gt; For example, we could have:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;rs[0-9] 200
&lt;&#x2F;span&gt;&lt;span&gt;rs1[0-9] 50
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;RegionServers with &lt;strong&gt;hostnames matching the first rule will have a limit of 200&lt;&#x2F;strong&gt;, and &lt;strong&gt;those matching the second rule a limit of 50&lt;&#x2F;strong&gt;. If there&#x27;s no match, a default is set.&lt;&#x2F;p&gt;
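&lt;p&gt;As an illustration of how such a rules file could be interpreted, here is a small Python sketch. It is not the code that was merged into HBase: the function names are made up, the default of 100 is a placeholder (the post does not state the real value), and matching the regexp against the whole hostname is an assumption.&lt;&#x2F;p&gt;

```python
import re

DEFAULT_LIMIT = 100  # hypothetical default, used when no rule matches


def parse_rules(text):
    """Turn 'regexp limit' lines into (compiled pattern, limit) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        pattern, limit = line.split()
        rules.append((re.compile(pattern), int(limit)))
    return rules


def limit_for(hostname, rules, default=DEFAULT_LIMIT):
    """Return the limit of the first rule whose regexp matches the whole hostname."""
    for pattern, limit in rules:
        if pattern.fullmatch(hostname):
            return limit
    return default


RULES = parse_rules("rs[0-9] 200\nrs1[0-9] 50")
```

&lt;p&gt;With these two rules, &lt;code&gt;limit_for(&quot;rs3&quot;, RULES)&lt;&#x2F;code&gt; resolves to the first rule and &lt;code&gt;limit_for(&quot;rs12&quot;, RULES)&lt;&#x2F;code&gt; to the second one, while any other hostname falls back to the default.&lt;&#x2F;p&gt;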
&lt;p&gt;Thanks to these rules, we have two key pieces of information:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;max number of regions for this cluster&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;the &lt;strong&gt;rules for each server&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The &lt;code&gt;HeterogeneousRegionCountCostFunction&lt;&#x2F;code&gt; will try to &lt;strong&gt;balance regions according to the capacity of each RegionServer.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take an example... Imagine that we have 20 RS:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;10 RS, named &lt;code&gt;rs0&lt;&#x2F;code&gt; to &lt;code&gt;rs9&lt;&#x2F;code&gt;, loaded with 60 regions each, which can each handle 200 regions.&lt;&#x2F;li&gt;
&lt;li&gt;10 RS, named &lt;code&gt;rs10&lt;&#x2F;code&gt; to &lt;code&gt;rs19&lt;&#x2F;code&gt;, loaded with 60 regions each, which can each handle 50 regions.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;So, based on the following rules:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;rs[0-9] 200
&lt;&#x2F;span&gt;&lt;span&gt;rs1[0-9] 50
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;... we can see that the &lt;strong&gt;second group is overloaded&lt;&#x2F;strong&gt;, whereas the first group has plenty of space.&lt;&#x2F;p&gt;
&lt;p&gt;We know that we can handle a maximum of &lt;strong&gt;2,500 regions&lt;&#x2F;strong&gt; (200×10 + 50×10), and we currently have &lt;strong&gt;1,200 regions&lt;&#x2F;strong&gt; (60×20). As such, the &lt;code&gt;HeterogeneousRegionCountCostFunction&lt;&#x2F;code&gt; will understand that the cluster is &lt;strong&gt;48% full&lt;&#x2F;strong&gt; (1200&#x2F;2500). Based on this information, we will then &lt;strong&gt;try to put all the RegionServers at ~48% of the load, according to the rules.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;hbase-custom-data-balancing&#x2F;hbase-ovh-4.jpeg&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
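&lt;p&gt;The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch of the reasoning, not the actual &lt;code&gt;HeterogeneousRegionCountCostFunction&lt;&#x2F;code&gt; (which produces a cost for the balancer rather than explicit targets); the &lt;code&gt;target_regions&lt;&#x2F;code&gt; helper and the rounding are assumptions.&lt;&#x2F;p&gt;

```python
def target_regions(limits, current_regions):
    """Scale every server's limit by the cluster-wide fill ratio."""
    capacity = sum(limits.values())
    fill_ratio = current_regions / capacity
    targets = {host: round(limit * fill_ratio) for host, limit in limits.items()}
    return fill_ratio, targets


# rs0..rs9 can hold 200 regions each, rs10..rs19 can hold 50 regions each.
limits = {f"rs{i}": 200 for i in range(10)}
limits.update({f"rs{i}": 50 for i in range(10, 20)})

# 20 servers currently loaded with 60 regions each.
fill_ratio, targets = target_regions(limits, current_regions=60 * 20)
```

&lt;p&gt;At a fill ratio of 48%, each server of the first group would aim for about 96 regions (200 × 0.48) and each server of the second group for about 24 (50 × 0.48), which still adds up to the 1,200 regions of the cluster.&lt;&#x2F;p&gt;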
&lt;h2 id=&quot;where-to-next&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#where-to-next&quot; aria-label=&quot;Anchor link for: where-to-next&quot;&gt;🔗&lt;&#x2F;a&gt;Where to next?&lt;&#x2F;h2&gt;
&lt;p&gt;Thanks to Apache HBase&#x27;s contributors, our patches are now &lt;strong&gt;merged&lt;&#x2F;strong&gt; into the master branch. As soon as Apache HBase maintainers publish a new release, we will deploy and use it at scale. This &lt;strong&gt;will allow more automation on our side, and ease operations for the Observability Team.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Contributing was an awesome journey. What I love most about open
source is the ability to contribute back, and build stronger
software. We &lt;strong&gt;had an opinion&lt;&#x2F;strong&gt; about how a particular issue should be addressed, but &lt;strong&gt;the discussions with the community helped us to refine it&lt;&#x2F;strong&gt;. We spoke with &lt;strong&gt;engineers from other companies, who were struggling with Apache HBase&#x27;s cloud deployments, just as we were&lt;&#x2F;strong&gt;, and thanks to those exchanges, &lt;strong&gt;our contribution became more and more relevant.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">distributed</category>
          <category domain="tag">hbase</category>
          <category domain="tag">performance</category>
          <category domain="tag">opensource</category>
      </item>
      <item>
          <title>Notes about FoundationDB</title>
          <pubDate>Thu, 30 Jan 2020 10:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/notes-about-foundationdb/</link>
          <guid>https://pierrezemb.fr/posts/notes-about-foundationdb/</guid>
          <description xml:base="https://pierrezemb.fr/posts/notes-about-foundationdb/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;fdb-white.jpg&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;tags&#x2F;notes&#x2F;&quot;&gt;Notes About&lt;&#x2F;a&gt; is a blog post series in which you will find a lot of &lt;strong&gt;links, videos, quotes, podcasts to click on&lt;&#x2F;strong&gt; about a specific topic. Today we will discover FoundationDB.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;overview-of-foundationdb&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#overview-of-foundationdb&quot; aria-label=&quot;Anchor link for: overview-of-foundationdb&quot;&gt;🔗&lt;&#x2F;a&gt;Overview of FoundationDB&lt;&#x2F;h2&gt;
&lt;p&gt;As stated in the &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;index.html&quot;&gt;official documentation&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers. It organizes data as an ordered key-value store and employs ACID transactions for all operations. It is especially well-suited for read&#x2F;write workloads but also has excellent performance for write-intensive workloads.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;It has strong key points:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Multi-model data store&lt;&#x2F;li&gt;
&lt;li&gt;Easily scalable and fault tolerant&lt;&#x2F;li&gt;
&lt;li&gt;Industry-leading performance&lt;&#x2F;li&gt;
&lt;li&gt;Open source.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In database terms, it provides:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;jepsen.io&#x2F;consistency&#x2F;models&#x2F;strict-serializable&quot;&gt;strict serializability&lt;&#x2F;a&gt; (operations appear to have occurred in some order),&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;spanner&#x2F;docs&#x2F;true-time-external-consistency&quot;&gt;external consistency&lt;&#x2F;a&gt; (for any two transactions, T1 and T2, if T2 starts to commit after T1 finishes committing, then the timestamp for T2 is greater than the timestamp for T1).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;the-story&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-story&quot; aria-label=&quot;Anchor link for: the-story&quot;&gt;🔗&lt;&#x2F;a&gt;The story&lt;&#x2F;h2&gt;
&lt;p&gt;FoundationDB started as a company in 2009, and was &lt;a href=&quot;https:&#x2F;&#x2F;techcrunch.com&#x2F;2015&#x2F;03&#x2F;24&#x2F;apple-acquires-durable-database-company-foundationdb&#x2F;&quot;&gt;acquired by Apple in 2015&lt;&#x2F;a&gt;. It &lt;a href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=9259986&quot;&gt;was bad publicity for the database, as the downloads were removed.&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;On April 19, 2018, Apple &lt;a href=&quot;https:&#x2F;&#x2F;www.foundationdb.org&#x2F;blog&#x2F;foundationdb-is-open-source&#x2F;&quot;&gt;open sourced the software, releasing it under the Apache 2.0 license&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tooling-before-coding&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#tooling-before-coding&quot; aria-label=&quot;Anchor link for: tooling-before-coding&quot;&gt;🔗&lt;&#x2F;a&gt;Tooling before coding&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;flow&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#flow&quot; aria-label=&quot;Anchor link for: flow&quot;&gt;🔗&lt;&#x2F;a&gt;Flow&lt;&#x2F;h3&gt;
&lt;p&gt;From the &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;engineering.html&quot;&gt;Engineering page&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;FoundationDB began with ambitious goals for both high performance per node and scalability. We knew that to achieve these goals we would face serious engineering challenges that would require tool breakthroughs. We’d need efficient asynchronous communicating processes like in Erlang or the Async in .NET, but we’d also need the raw speed, I&#x2F;O efficiency, and control of C++. To meet these challenges, we developed several new tools, the most important of which is &lt;strong&gt;Flow&lt;&#x2F;strong&gt;, a new programming language that brings actor-based concurrency to C++11.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Flow is more of a &lt;strong&gt;stateful distributed system framework&lt;&#x2F;strong&gt; than an asynchronous library. It takes a number of highly opinionated stances on how the overall distributed system should be written, and isn’t trying to be a widely reusable building block.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Flow adds about 10 keywords to C++11 and is technically a trans-compiler: the Flow compiler reads Flow code and compiles it down to raw C++11, which is then compiled to a native binary with a traditional toolchain.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Flow was developed before FDB, as stated in this &lt;a href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=5319163&quot;&gt;2013 post&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;FoundationDB founder here. Flow sounds crazy. What hubris to think that you need a new programming language for your project? Three years later: Best decision we ever made.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We knew this was going to be a long project so we invested heavily in tools at the beginning. The first two weeks of FoundationDB were building this new programming language to give us the speed of C++ with high level tools for actor-model concurrency. But, the real magic is how Flow enables us to use our real code to do deterministic simulations of a cluster in a single thread. We have a white paper upcoming on this.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We&#x27;ve had quite a bit of interest in Flow over the years and I&#x27;ve given several talks on it at meetups&#x2F;conferences. We&#x27;ve always thought about open-sourcing it... It&#x27;s not as elegant as some other actor-model languages like Scala or Erlang (see: C++) but it&#x27;s nice and fast at run-time and really helps productivity vs. writing callbacks, etc.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;(Fun fact: We&#x27;ve only ever found two bugs in Flow. After the first, we decided that we never wanted a bug again in our programming language. So, we built a program in Python that generates random Flow code and independently-executes it to validate Flow&#x27;s behavior. This fuzz tester found one more bug, and we&#x27;ve never found another.)&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;A very good overview of Flow is available &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;flow.html&quot;&gt;here&lt;&#x2F;a&gt; and some details &lt;a href=&quot;https:&#x2F;&#x2F;forums.foundationdb.org&#x2F;t&#x2F;why-was-flow-developed&#x2F;1711&#x2F;3&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;simulation-driven-development&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#simulation-driven-development&quot; aria-label=&quot;Anchor link for: simulation-driven-development&quot;&gt;🔗&lt;&#x2F;a&gt;Simulation-Driven development&lt;&#x2F;h3&gt;
&lt;p&gt;One of Flow’s most important jobs is enabling &lt;strong&gt;Simulation&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;We wanted FoundationDB to survive failures of machines, networks, disks, clocks, racks, data centers, file systems, etc., so we created a simulation framework closely tied to Flow. By replacing physical interfaces with shims, replacing the main epoll-based run loop with a time-based simulation, and running multiple logical processes as concurrent Flow Actors, Simulation is able to conduct a deterministic simulation of an entire FoundationDB cluster within a single-thread! Even better, we are able to execute this simulation in a deterministic way, enabling us to reproduce problems and add instrumentation ex post facto. This incredible capability enabled us to build FoundationDB exclusively in simulation for the first 18 months and ensure exceptional fault tolerance long before it sent its first real network packet. For a database with as strong a contract as the FoundationDB, testing is crucial, and over the years we have run the equivalent of a trillion CPU-hours of simulated stress testing.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;A good overview of the simulation can be found &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;testing.html&quot;&gt;here&lt;&#x2F;a&gt;. You can also have a look at this awesome talk!&lt;&#x2F;p&gt;
&lt;div &gt;
    &lt;iframe
        src=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;embed&#x2F;4fFDFbi3toc&quot;
        webkitallowfullscreen
        mozallowfullscreen
        allowfullscreen&gt;
    &lt;&#x2F;iframe&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;Simulation has been made possible by combining:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Single-threaded pseudo-concurrency,&lt;&#x2F;li&gt;
&lt;li&gt;Simulated implementation of all external communication,&lt;&#x2F;li&gt;
&lt;li&gt;Determinism.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here&#x27;s an example of a &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;master&#x2F;tests&#x2F;slow&#x2F;SwizzledCycleTest.txt&quot;&gt;testfile&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;testTitle=SwizzledCycleTest
&lt;&#x2F;span&gt;&lt;span&gt;    testName=Cycle
&lt;&#x2F;span&gt;&lt;span&gt;    transactionsPerSecond=5000.0
&lt;&#x2F;span&gt;&lt;span&gt;    testDuration=30.0
&lt;&#x2F;span&gt;&lt;span&gt;    expectedRate=0.01
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    testName=RandomClogging
&lt;&#x2F;span&gt;&lt;span&gt;    testDuration=30.0
&lt;&#x2F;span&gt;&lt;span&gt;    swizzle = 1
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    testName=Attrition
&lt;&#x2F;span&gt;&lt;span&gt;    machinesToKill=10
&lt;&#x2F;span&gt;&lt;span&gt;    machinesToLeave=3
&lt;&#x2F;span&gt;&lt;span&gt;    reboot=true
&lt;&#x2F;span&gt;&lt;span&gt;    testDuration=30.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    testName=Attrition
&lt;&#x2F;span&gt;&lt;span&gt;    machinesToKill=10
&lt;&#x2F;span&gt;&lt;span&gt;    machinesToLeave=3
&lt;&#x2F;span&gt;&lt;span&gt;    reboot=true
&lt;&#x2F;span&gt;&lt;span&gt;    testDuration=30.0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    testName=ChangeConfig
&lt;&#x2F;span&gt;&lt;span&gt;    maxDelayBeforeChange=30.0
&lt;&#x2F;span&gt;&lt;span&gt;    coordinators=auto
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The test is split into two parts:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The goal&lt;&#x2F;strong&gt;: here, running transactions that point to one another at thousands of transactions per second, with only a small expected success rate (&lt;code&gt;expectedRate=0.01&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What will be done to try to prevent the test from succeeding&lt;&#x2F;strong&gt;. In this example it will &lt;strong&gt;at the same time&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;do random clogging, which means that &lt;strong&gt;network connections will be stopped&lt;&#x2F;strong&gt; (preventing actors from sending and receiving packets). The swizzle flag means that a subset of network connections will be stopped and brought back in reverse order, 😳&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;power off&#x2F;reboot machines&lt;&#x2F;strong&gt; (attrition) pseudo-randomly while keeping a minimum of three machines, 🤯&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;change configuration&lt;&#x2F;strong&gt;, which means a coordination change through multi-Paxos for the whole cluster. 😱&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Keep in mind that all these failures will appear &lt;strong&gt;at the same time!&lt;&#x2F;strong&gt; Do you think that your current &lt;strong&gt;datastore goes through the same tests on a daily basis?&lt;&#x2F;strong&gt; &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;etcd-io&#x2F;etcd&#x2F;pull&#x2F;11308&quot;&gt;I think not&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Applications written using the FoundationDB simulator have a hierarchy: &lt;code&gt;DataCenter -&amp;gt; Machine -&amp;gt; Process -&amp;gt; Interface&lt;&#x2F;code&gt;. &lt;strong&gt;Each of these can be killed&#x2F;frozen&#x2F;nuked&lt;&#x2F;strong&gt;. Even faulty admin commands fired by some DevOps are tested!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;known-limitations&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#known-limitations&quot; aria-label=&quot;Anchor link for: known-limitations&quot;&gt;🔗&lt;&#x2F;a&gt;Known limitations&lt;&#x2F;h3&gt;
&lt;p&gt;Limitations are well described in the &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;known-limitations.html&quot;&gt;official documentation&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;recap&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#recap&quot; aria-label=&quot;Anchor link for: recap&quot;&gt;🔗&lt;&#x2F;a&gt;Recap&lt;&#x2F;h3&gt;
&lt;p&gt;An awesome recap is available on the &lt;a href=&quot;https:&#x2F;&#x2F;softwareengineeringdaily.com&#x2F;2019&#x2F;07&#x2F;01&#x2F;foundationdb-with-ryan-worl&#x2F;&quot;&gt;Software Engineering Daily podcast&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;FoundationDB is tested in a very rigorous way using what&#x27;s called &lt;strong&gt;a deterministic simulation&lt;&#x2F;strong&gt;. The reason they needed a new programming language to do this, is that to get a deterministic simulation, you have to make something that is deterministic. It&#x27;s kind of obvious, but it&#x27;s hard to do.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;For example, if your process interacts with the network, or disks, or clocks, it&#x27;s not deterministic. If you have multiple threads, not deterministic. So, they needed a way to write a concurrent program that could talk with networks and disks and that type of thing. They needed a way to write a concurrent program that does all of those things that you would think are non-deterministic in a deterministic way.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;So, all FoundationDB processes, and FoundationDB, it&#x27;s basically all written in Flow except a very small amount of it from the SQLite B-tree. The reason why that was useful is that when you use Flow, you get all of these higher level abstraction that let what you do what feels to you like asynchronous stuff, but under the hood, it&#x27;s all implemented using callbacks in C++, which you can make deterministic by running it in a single thread. So, there&#x27;s a scheduler that just calls these callbacks one after another and it&#x27;s very crazy looking C++ code, like you wouldn&#x27;t want to read it, but it&#x27;s because of Flow they were able to implement that deterministic simulation.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;the-architecture&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-architecture&quot; aria-label=&quot;Anchor link for: the-architecture&quot;&gt;🔗&lt;&#x2F;a&gt;The Architecture&lt;&#x2F;h2&gt;
&lt;p&gt;According to the &lt;a href=&quot;https:&#x2F;&#x2F;apple.github.io&#x2F;foundationdb&#x2F;administration.html#fdbmonitor-and-fdbserver&quot;&gt;fdbmonitor and fdbserver&lt;&#x2F;a&gt; documentation:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The core FoundationDB server process is &lt;code&gt;fdbserver&lt;&#x2F;code&gt;. Each &lt;code&gt;fdbserver&lt;&#x2F;code&gt; process uses up to one full CPU core, so a production FoundationDB cluster will usually run N such processes on an N-core system.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;To make configuring, starting, stopping, and restarting fdbserver processes easy, FoundationDB also comes with a singleton daemon process, &lt;code&gt;fdbmonitor&lt;&#x2F;code&gt;, which is started automatically on boot. &lt;code&gt;fdbmonitor&lt;&#x2F;code&gt; reads the &lt;code&gt;foundationdb.conf&lt;&#x2F;code&gt; file and starts the configured set of fdbserver processes. It is also responsible for starting backup-agent.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The whole architecture is designed to automatically:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;load-balance data and traffic,&lt;&#x2F;li&gt;
&lt;li&gt;self-heal.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;microservices&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#microservices&quot; aria-label=&quot;Anchor link for: microservices&quot;&gt;🔗&lt;&#x2F;a&gt;Microservices&lt;&#x2F;h3&gt;
&lt;p&gt;A typical FDB cluster is composed of different actors, which are described &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;blob&#x2F;master&#x2F;documentation&#x2F;sphinx&#x2F;source&#x2F;kv-architecture.rst&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The most important role in FDB is the &lt;code&gt;Coordinator&lt;&#x2F;code&gt;. It uses &lt;code&gt;Paxos&lt;&#x2F;code&gt; to manage membership on a quorum to do writes. The &lt;code&gt;Coordinator&lt;&#x2F;code&gt; is mostly used to elect some peers and during recovery. You can view it as a ZooKeeper-like stack.&lt;&#x2F;p&gt;
&lt;p&gt;The Coordinator starts by electing a &lt;code&gt;Cluster Controller&lt;&#x2F;code&gt;, which provides administrative information about the cluster (for example, &quot;I have 4 storage processes&quot;). Every process needs to register with the &lt;code&gt;Cluster Controller&lt;&#x2F;code&gt;, which then assigns roles to them. It is also the one that heartbeats all the processes.&lt;&#x2F;p&gt;
&lt;p&gt;Then a &lt;code&gt;Master&lt;&#x2F;code&gt; is elected. The &lt;code&gt;Master&lt;&#x2F;code&gt; process is responsible for the &lt;code&gt;data distribution&lt;&#x2F;code&gt; algorithms. Fun fact: the mapping between keys and storage servers is stored within FDB itself, which means you can actually move data by running transactions like any other application. It is also the one providing &lt;code&gt;read versions&lt;&#x2F;code&gt; and &lt;code&gt;version numbers&lt;&#x2F;code&gt; internally, and it also acts as the &lt;code&gt;RateKeeper&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;The Proxies&lt;&#x2F;code&gt; are responsible for providing read versions, committing transactions, and tracking the storage servers responsible for each range of keys.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;The Transaction Resolvers&lt;&#x2F;code&gt; are responsible for determining conflicts between transactions. A transaction conflicts if it reads a key that has been written between the transaction’s read version and commit version. The resolver does this by holding the last 5 seconds of committed writes in memory, and comparing a new transaction’s reads against this set of commits.&lt;&#x2F;p&gt;
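&lt;p&gt;The conflict rule described above can be sketched in a few lines of Python. This is a toy model and not FoundationDB code: the &lt;code&gt;Resolver&lt;&#x2F;code&gt; class, its method names and the retention window expressed as a number of versions are all assumptions, and it tracks single keys where a real resolver would work on key ranges.&lt;&#x2F;p&gt;

```python
class Resolver:
    """Toy conflict checker keeping recently committed writes in memory."""

    def __init__(self, window):
        self.window = window   # how many versions of history are kept (hypothetical unit)
        self.last_write = {}   # key -> commit version of its most recent write

    def resolve(self, read_version, commit_version, read_keys, write_keys):
        """Return True and record the writes if the transaction may commit."""
        # A snapshot older than the retained history cannot be checked safely.
        if commit_version - read_version > self.window:
            return False
        for key in read_keys:
            # Conflict: the key was written after this transaction's read version.
            if self.last_write.get(key, 0) > read_version:
                return False
        for key in write_keys:
            self.last_write[key] = commit_version
        # Forget writes that fell out of the retention window.
        oldest = commit_version - self.window
        self.last_write = {k: v for k, v in self.last_write.items() if v >= oldest}
        return True
```

&lt;p&gt;A transaction that read a key at version 5 and tries to commit after another transaction wrote that key at version 10 is rejected, whereas the same read at version 10 or later goes through. Keeping only a bounded window of writes is what lets the whole check stay in memory.&lt;&#x2F;p&gt;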
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;architecture.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;read-and-write-path&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#read-and-write-path&quot; aria-label=&quot;Anchor link for: read-and-write-path&quot;&gt;🔗&lt;&#x2F;a&gt;Read and Write Path&lt;&#x2F;h3&gt;
&lt;div &gt;
    &lt;iframe
        src=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;embed&#x2F;EMwhsGsxfPU&quot;
        webkitallowfullscreen
        mozallowfullscreen
        allowfullscreen&gt;
    &lt;&#x2F;iframe&gt;
&lt;&#x2F;div&gt;&lt;h4 id=&quot;read-path&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#read-path&quot; aria-label=&quot;Anchor link for: read-path&quot;&gt;🔗&lt;&#x2F;a&gt;Read Path&lt;&#x2F;h4&gt;
&lt;ol&gt;
&lt;li&gt;Retrieve a consistent read version for the transaction&lt;&#x2F;li&gt;
&lt;li&gt;Do reads from a consistent MVCC snapshot at that read version on the storage node&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h4 id=&quot;write-path&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#write-path&quot; aria-label=&quot;Anchor link for: write-path&quot;&gt;🔗&lt;&#x2F;a&gt;Write Path&lt;&#x2F;h4&gt;
&lt;ol&gt;
&lt;li&gt;The client sends a bundle to the &lt;code&gt;proxy&lt;&#x2F;code&gt; containing:
&lt;ul&gt;
&lt;li&gt;the read version for the transaction&lt;&#x2F;li&gt;
&lt;li&gt;every key that was read&lt;&#x2F;li&gt;
&lt;li&gt;every mutation that you want to do&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;The proxy assigns a &lt;code&gt;Commit version&lt;&#x2F;code&gt; to a batch of transactions. The &lt;code&gt;Commit version&lt;&#x2F;code&gt; is generated by the &lt;code&gt;Master&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;The proxy sends the batch to the resolvers, which are sharded by key range. They check whether the keys you read have been changed between your &lt;code&gt;Read version&lt;&#x2F;code&gt; and your &lt;code&gt;Commit version&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;The transaction is made durable within the &lt;code&gt;Transaction Logs&lt;&#x2F;code&gt; by &lt;code&gt;fsync&lt;&#x2F;code&gt;ing the data. Before the data is even written to disk, it is forwarded to the &lt;code&gt;storage servers&lt;&#x2F;code&gt; responsible for that mutation. Internally, the &lt;code&gt;Transaction Logs&lt;&#x2F;code&gt; create &lt;strong&gt;a stream per &lt;code&gt;Storage Server&lt;&#x2F;code&gt;&lt;&#x2F;strong&gt;. Once the &lt;code&gt;storage servers&lt;&#x2F;code&gt; have made the mutation durable, they pop it from the log. This generally happens roughly 6 seconds after the mutation was originally committed to the log.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Storage servers&lt;&#x2F;code&gt; lazily update data on disk from the &lt;code&gt;Transaction Logs&lt;&#x2F;code&gt;. They keep new writes in memory.&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;code&gt;Transaction Logs&lt;&#x2F;code&gt; respond OK to the proxy, and then the proxy replies OK to the client.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;You can find more diagrams about transactions &lt;a href=&quot;https:&#x2F;&#x2F;forums.foundationdb.org&#x2F;t&#x2F;technical-overview-of-the-database&#x2F;135&#x2F;3&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;recovery&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#recovery&quot; aria-label=&quot;Anchor link for: recovery&quot;&gt;🔗&lt;&#x2F;a&gt;Recovery&lt;&#x2F;h3&gt;
&lt;p&gt;The recovery process is detailed at around the 25-minute mark.&lt;&#x2F;p&gt;
&lt;p&gt;When a process fails (except for storage servers), the system tries to create a new &lt;code&gt;generation&lt;&#x2F;code&gt;: a new &lt;code&gt;Master&lt;&#x2F;code&gt;, &lt;code&gt;proxies&lt;&#x2F;code&gt;, &lt;code&gt;resolvers&lt;&#x2F;code&gt; and &lt;code&gt;transaction logs&lt;&#x2F;code&gt;. The new master gets a read version from the transaction logs and commits with &lt;code&gt;Paxos&lt;&#x2F;code&gt; the fact that, starting from that &lt;code&gt;Read version&lt;&#x2F;code&gt;, the new generation is the one in charge.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;Storage servers&lt;&#x2F;code&gt; replicate data when failures occur.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-5-second-transaction-limit&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-5-second-transaction-limit&quot; aria-label=&quot;Anchor link for: the-5-second-transaction-limit&quot;&gt;🔗&lt;&#x2F;a&gt;The 5-second transaction limit&lt;&#x2F;h3&gt;
&lt;p&gt;FoundationDB currently does not support transactions running for over five seconds. More details at around the 16-minute mark, but the &lt;code&gt;tl;dr&lt;&#x2F;code&gt; is:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Storage servers cache the latest reads in memory,&lt;&#x2F;li&gt;
&lt;li&gt;Resolvers cache the last 5 seconds of transactions.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;ratekeeper&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#ratekeeper&quot; aria-label=&quot;Anchor link for: ratekeeper&quot;&gt;🔗&lt;&#x2F;a&gt;Ratekeeper&lt;&#x2F;h3&gt;
&lt;p&gt;More details at around the 31-minute mark, but the &lt;code&gt;tl;dr&lt;&#x2F;code&gt; is that when the system is saturated, retrieving the &lt;code&gt;Read version&lt;&#x2F;code&gt; is slowed down.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;storage&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#storage&quot; aria-label=&quot;Anchor link for: storage&quot;&gt;🔗&lt;&#x2F;a&gt;Storage&lt;&#x2F;h3&gt;
&lt;p&gt;A lot of information is available in this talk:&lt;&#x2F;p&gt;
&lt;div&gt;
    &lt;iframe
        src=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;embed&#x2F;nlus1Z7TVTI&quot;
        webkitallowfullscreen
        mozallowfullscreen
        allowfullscreen&gt;
    &lt;&#x2F;iframe&gt;
&lt;&#x2F;div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;memory&lt;&#x2F;code&gt; is optimized for small databases. Data is stored in memory and logged to disk. In this storage engine, all data must be resident in memory at all times, and all reads are satisfied from memory.&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;code&gt;SSD&lt;&#x2F;code&gt; storage engine is based on the SQLite B-tree.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Redwood&lt;&#x2F;code&gt; will be a new storage engine based on a versioned B+tree.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;developer-experience&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#developer-experience&quot; aria-label=&quot;Anchor link for: developer-experience&quot;&gt;🔗&lt;&#x2F;a&gt;Developer experience&lt;&#x2F;h2&gt;
&lt;p&gt;FoundationDB’s keys are ordered, making &lt;code&gt;tuples&lt;&#x2F;code&gt; a particularly useful tool for data modeling. FoundationDB provides a &lt;strong&gt;tuple layer&lt;&#x2F;strong&gt; (available in each language binding) that encodes tuples into keys. This layer lets you store data using a tuple like &lt;code&gt;(state, county)&lt;&#x2F;code&gt; as a key. Later, you can perform reads using a prefix like &lt;code&gt;(state,)&lt;&#x2F;code&gt;. The layer works by preserving the natural ordering of the tuples.&lt;&#x2F;p&gt;
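&lt;p&gt;The prefix read works because of ordering alone, which a small Python sketch can show without a cluster. This is a stand-in, not the FoundationDB bindings: a sorted list of plain Python tuples plays the role of the ordered keyspace, the function name &lt;code&gt;prefix_scan&lt;&#x2F;code&gt; is hypothetical, and the byte encoding done by the real tuple layer is not shown.&lt;&#x2F;p&gt;

```python
import bisect
from itertools import islice, takewhile


def prefix_scan(sorted_keys, prefix):
    """Return every key in sorted_keys that starts with the given tuple prefix."""

    def has_prefix(key):
        return key[:len(prefix)] == prefix

    # Ordering puts (state,) right before every (state, county) key,
    # so one binary search finds where the matching run begins.
    start = bisect.bisect_left(sorted_keys, prefix)
    return list(takewhile(has_prefix, islice(sorted_keys, start, None)))


keys = sorted([("NY", "Kings"), ("CA", "Marin"), ("AZ", "Pima"), ("CA", "Alameda")])
print(prefix_scan(keys, ("CA",)))
```

&lt;p&gt;All keys sharing a prefix sit next to each other, so the scan stops at the first key that no longer matches instead of visiting the whole keyspace.&lt;&#x2F;p&gt;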
&lt;p&gt;Everything is wrapped into a transaction in FDB.&lt;&#x2F;p&gt;
&lt;p&gt;You can have a nice overview by reading the README of &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;richardartoul&#x2F;tsdb-layer&#x2F;blob&#x2F;master&#x2F;README.md&quot;&gt;tsdb-layer&lt;&#x2F;a&gt;, an experiment combining Time Series and FoundationDB: Millions of writes&#x2F;s and 10x compression in under 2,000 lines of Go.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fdb-one-more-things-layers&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#fdb-one-more-things-layers&quot; aria-label=&quot;Anchor link for: fdb-one-more-things-layers&quot;&gt;🔗&lt;&#x2F;a&gt;FDB One more thing: Layers&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;concept-of-layers&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#concept-of-layers&quot; aria-label=&quot;Anchor link for: concept-of-layers&quot;&gt;🔗&lt;&#x2F;a&gt;Concept of layers&lt;&#x2F;h3&gt;
&lt;div&gt;
    &lt;iframe
        src=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;embed&#x2F;HLE8chgw6LI&quot;
        webkitallowfullscreen
        mozallowfullscreen
        allowfullscreen&gt;
    &lt;&#x2F;iframe&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;FDB solves many distributed-systems problems, but you still need things like &lt;strong&gt;security, multi-tenancy, query optimizations, schema, indexing&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;extract-layer-1.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Layers are designed to develop features &lt;strong&gt;above FDB.&lt;&#x2F;strong&gt; The record-layer provided by Apple is a good starting point to build things above it, as it provides &lt;strong&gt;a structured schema, indexes, and an (async) query planner.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;extract-layer-2.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;extract-layer-3.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;apple-s-record-layer&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#apple-s-record-layer&quot; aria-label=&quot;Anchor link for: apple-s-record-layer&quot;&gt;🔗&lt;&#x2F;a&gt;Apple&#x27;s Record Layer&lt;&#x2F;h3&gt;
&lt;p&gt;The paper is available here: &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1901.04452.pdf&quot;&gt;FoundationDB Record Layer: A Multi-Tenant Structured Datastore&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;div&gt;
    &lt;iframe
        src=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;embed&#x2F;SvoUHHM9IKU&quot;
        webkitallowfullscreen
        mozallowfullscreen
        allowfullscreen&gt;
    &lt;&#x2F;iframe&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;The Record Layer was designed to solve CloudKit&#x27;s problem.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;record-extract-1.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;The Record Layer allows multi-tenancy with schemas on top of FDB.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;record-extract-2.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;record-extract-3.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;The Record Layer provides stateless compute.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;record-extract-4.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;And streaming queries!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;record-extract-5.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;kubernetes-operators&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#kubernetes-operators&quot; aria-label=&quot;Anchor link for: kubernetes-operators&quot;&gt;🔗&lt;&#x2F;a&gt;Kubernetes Operators&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;overview-of-the-operator&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#overview-of-the-operator&quot; aria-label=&quot;Anchor link for: overview-of-the-operator&quot;&gt;🔗&lt;&#x2F;a&gt;Overview of the operator&lt;&#x2F;h3&gt;
&lt;div&gt;
    &lt;iframe
        src=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;embed&#x2F;A3U8M8pt3Ks&quot;
        webkitallowfullscreen
        mozallowfullscreen
        allowfullscreen&gt;
    &lt;&#x2F;iframe&gt;
&lt;&#x2F;div&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;operator-extract-1.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;operator-extract-2.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Upgrades are done by &lt;strong&gt;bumping all processes at once&lt;&#x2F;strong&gt; 😱&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;operator-extract-3.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;notes-about-foundationdb&#x2F;operator-extract-4.png&quot; alt=&quot;fdb image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;combining-chaos-mesh-and-the-operator&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#combining-chaos-mesh-and-the-operator&quot; aria-label=&quot;Anchor link for: combining-chaos-mesh-and-the-operator&quot;&gt;🔗&lt;&#x2F;a&gt;Combining chaos-mesh and the operator&lt;&#x2F;h3&gt;
&lt;p&gt;I played a bit with the operator by combining:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;FoundationDB&#x2F;fdb-kubernetes-operator&quot;&gt;FoundationDB&#x2F;fdb-kubernetes-operator&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pingcap&#x2F;go-ycsb&quot;&gt;pingcap&#x2F;go-ycsb&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pingcap&#x2F;chaos-mesh&quot;&gt;pingcap&#x2F;chaos-mesh&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;fdb-prometheus-exporter&#x2F;&quot;&gt;PierreZ&#x2F;fdb-prometheus-exporter&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The experiment is available &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;fdb-k8s-chaos&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;roadmap&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#roadmap&quot; aria-label=&quot;Anchor link for: roadmap&quot;&gt;🔗&lt;&#x2F;a&gt;Roadmap&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apple&#x2F;foundationdb&#x2F;wiki&#x2F;FoundationDB-Release-7.0-Planning&quot;&gt;FoundationDB Release 7.0 Planning&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">distributed</category>
          <category domain="tag">foundationdb</category>
          <category domain="tag">storage</category>
          <category domain="tag">database</category>
          <category domain="tag">notes</category>
      </item>
      <item>
          <title>Diving into Kafka&#x27;s Protocol</title>
          <pubDate>Sun, 08 Dec 2019 15:00:00 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/diving-into-kafka-protocol/</link>
          <guid>https://pierrezemb.fr/posts/diving-into-kafka-protocol/</guid>
          <description xml:base="https://pierrezemb.fr/posts/diving-into-kafka-protocol/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;diving-into-kafka-protocol&#x2F;apache-kafka.png&quot; alt=&quot;kafka image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;tags&#x2F;diving-into&#x2F;&quot;&gt;Diving Into&lt;&#x2F;a&gt; is a blog post series where we dig into a specific part of a project&#x27;s codebase. In this episode, we will dig into Kafka&#x27;s protocol.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-protocol-reference&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-protocol-reference&quot; aria-label=&quot;Anchor link for: the-protocol-reference&quot;&gt;🔗&lt;&#x2F;a&gt;The protocol reference&lt;&#x2F;h2&gt;
&lt;p&gt;For the last few months, I have worked a lot around Kafka&#x27;s protocol, first by creating a fully async Kafka to Pulsar Proxy in Rust, and now by contributing directly to &lt;a href=&quot;https:&#x2F;&#x2F;www.slideshare.net&#x2F;streamnative&#x2F;2-kafkaonpulsarjia&quot;&gt;KoP (Kafka On Pulsar)&lt;&#x2F;a&gt;. The full Kafka Protocol documentation is available &lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html&quot;&gt;here&lt;&#x2F;a&gt;, but it does not offer a global view of what is happening for a classic Producer and Consumer exchange. Let&#x27;s dive in!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;common-handshake&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#common-handshake&quot; aria-label=&quot;Anchor link for: common-handshake&quot;&gt;🔗&lt;&#x2F;a&gt;Common handshake&lt;&#x2F;h3&gt;
&lt;p&gt;After a client has established the TCP connection, there are a few common requests and responses that are almost always present.&lt;&#x2F;p&gt;
&lt;p&gt;The common handshake can be divided into three parts:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Being able to understand each other. For this, we use &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_ApiVersions&quot;&gt;API_VERSIONS&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; to know which versions of which TCP frames can be used,&lt;&#x2F;li&gt;
&lt;li&gt;Establish Auth using &lt;strong&gt;SASL&lt;&#x2F;strong&gt; if needed, thanks to &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_SaslHandshake&quot;&gt;SASL_HANDSHAKE&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; and &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_SaslAuthenticate&quot;&gt;SASL_AUTHENTICATE&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;Retrieve the topology of the cluster using &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_Metadata&quot;&gt;METADATA&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
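&lt;p&gt;To give an idea of what such a TCP frame looks like on the wire, here is a small Python sketch that encodes an API_VERSIONS request. It assumes version 0 of the request (whose body is empty) and the classic request header made of API key, API version, correlation id and client id. The function name and the &lt;code&gt;demo&lt;&#x2F;code&gt; client id are made up for the example.&lt;&#x2F;p&gt;

```python
import struct

API_VERSIONS_KEY = 18  # numeric API key of ApiVersions in the Kafka protocol


def encode_api_versions_request(correlation_id, client_id):
    """Build a size-prefixed ApiVersions v0 request frame (empty body)."""
    client = client_id.encode("utf-8")
    # Header: api_key (int16), api_version (int16), correlation_id (int32),
    # then client_id as an int16 length followed by its UTF-8 bytes.
    header = struct.pack("!hhih", API_VERSIONS_KEY, 0, correlation_id, len(client))
    payload = header + client
    # Every Kafka request is prefixed by its size as a big-endian int32.
    return struct.pack("!i", len(payload)) + payload


frame = encode_api_versions_request(correlation_id=1, client_id="demo")
print(frame.hex())
```

&lt;p&gt;The broker echoes the correlation id in its response, which is how a client matches responses to requests on a shared connection.&lt;&#x2F;p&gt;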
&lt;blockquote&gt;
&lt;p&gt;All exchanges are between a Kafka 2.0 cluster and client.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;All the following diagrams are generated with &lt;a href=&quot;https:&#x2F;&#x2F;mermaidjs.github.io&#x2F;#&#x2F;&quot;&gt;MermaidJS&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        sequenceDiagram

    Note left of KafkaClient: I&amp;#x27;m speaking Kafka &amp;lt;br&amp;#x2F;&amp;gt; 2.3, but can the &amp;lt;br&amp;#x2F;&amp;gt; broker understand &amp;lt;br&amp;#x2F;&amp;gt; me?

    KafkaClient -&amp;gt;&amp;gt;+ Broker0: API_VERSIONS request

    Note right of Broker0: I can handle these &amp;lt;br&amp;#x2F;&amp;gt; structures in these &amp;lt;br&amp;#x2F;&amp;gt;versions: ...
    Broker0 -&amp;gt;&amp;gt;- KafkaClient: 

    Note left of KafkaClient: Thanks!&amp;lt;br&amp;#x2F;&amp;gt; I see you can handle &amp;lt;br&amp;#x2F;&amp;gt; SASL, let&amp;#x27;s auth! &amp;lt;br&amp;#x2F;&amp;gt; can you handle &amp;lt;br&amp;#x2F;&amp;gt; SASL_PLAIN?
    KafkaClient -&amp;gt;&amp;gt;+ Broker0: SASL_HANDSHAKE request

    Note right of Broker0: Yes I can handle &amp;lt;br&amp;#x2F;&amp;gt; SASL_PLAIN &amp;lt;br&amp;#x2F;&amp;gt; among others
    Broker0 -&amp;gt;&amp;gt;- KafkaClient: 

    Note left of KafkaClient: Awesome, here are &amp;lt;br&amp;#x2F;&amp;gt; my credentials!
    KafkaClient -&amp;gt;&amp;gt;+ Broker0: SASL_AUTHENTICATE request

    Note right of Broker0: Checking...
    Note right of Broker0: You are &amp;lt;br&amp;#x2F;&amp;gt;authenticated!
    Broker0 -&amp;gt;&amp;gt;- KafkaClient: 

    Note left of KafkaClient: Cool! &amp;lt;br&amp;#x2F;&amp;gt; Can you give &amp;lt;br&amp;#x2F;&amp;gt; the cluster topology?&amp;lt;br&amp;#x2F;&amp;gt; I want to &amp;lt;br&amp;#x2F;&amp;gt; use &amp;#x27;my-topic&amp;#x27;
    KafkaClient -&amp;gt;&amp;gt;+ Broker0: METADATA request

    Note right of Broker0: There is one topic &amp;lt;br&amp;#x2F;&amp;gt; with one partition&amp;lt;br&amp;#x2F;&amp;gt; called &amp;#x27;my-topic&amp;#x27;&amp;lt;br&amp;#x2F;&amp;gt;The partition&amp;#x27;s leader &amp;lt;br&amp;#x2F;&amp;gt; is Broker0
    Broker0 -&amp;gt;&amp;gt;- KafkaClient: 

Note left of KafkaClient: That is you, I don&amp;#x27;t &amp;lt;br&amp;#x2F;&amp;gt; need to handshake &amp;lt;br&amp;#x2F;&amp;gt; again with &amp;lt;br&amp;#x2F;&amp;gt; another broker
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;&lt;h3 id=&quot;producing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#producing&quot; aria-label=&quot;Anchor link for: producing&quot;&gt;🔗&lt;&#x2F;a&gt;Producing&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_Produce&quot;&gt;PRODUCE&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; API is used to send message sets to the server. For efficiency it allows sending message sets intended for many topic partitions in a single request.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        sequenceDiagram

    Note over KafkaClient,Broker0: ...handshaking, see above...

    loop pull msg
        Note left of KafkaClient: I have a batch &amp;lt;br&amp;#x2F;&amp;gt; containing one &amp;lt;br&amp;#x2F;&amp;gt; message for the &amp;lt;br&amp;#x2F;&amp;gt; partition-0 &amp;lt;br&amp;#x2F;&amp;gt; of &amp;#x27;my-topic&amp;#x27;
        KafkaClient -&amp;gt;&amp;gt;+ Broker0: PRODUCE request

        Note right of Broker0: Processing...&amp;lt;br&amp;#x2F;&amp;gt;
        Note right of Broker0: Done!
        Broker0 -&amp;gt;&amp;gt;- KafkaClient: 
        
        Note left of KafkaClient: Thanks
    end
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;&lt;h3 id=&quot;consuming&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#consuming&quot; aria-label=&quot;Anchor link for: consuming&quot;&gt;🔗&lt;&#x2F;a&gt;Consuming&lt;&#x2F;h3&gt;
&lt;p&gt;Consuming is more complicated than producing. You can learn more in &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=maJulQ4ABNY&quot;&gt;The Magical Group Coordination Protocol of Apache Kafka&lt;&#x2F;a&gt; By Gwen Shapira, Principal Data Architect @ Confluent and also in the &lt;a href=&quot;https:&#x2F;&#x2F;cwiki.apache.org&#x2F;confluence&#x2F;display&#x2F;KAFKA&#x2F;Kafka+Client-side+Assignment+Proposal&quot;&gt;Kafka Client-side Assignment Proposal&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Consuming can be divided into three parts:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Coordinate the consumers to assign them partitions, using:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_FindCoordinator&quot;&gt;FIND_COORDINATOR&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_JoinGroup&quot;&gt;JOIN_GROUP&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_SyncGroup&quot;&gt;SYNC_GROUP&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;,&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Fetch messages using:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_OffsetFetch&quot;&gt;OFFSET_FETCH&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_ListOffsets&quot;&gt;LIST_OFFSETS&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_Fetch&quot;&gt;FETCH&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_OffsetCommit&quot;&gt;OFFSET_COMMIT&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;,&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Send proof of liveness to the coordinator using &lt;strong&gt;&lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;protocol.html#The_Messages_Heartbeat&quot;&gt;HEARTBEAT&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
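&lt;p&gt;Between JOIN_GROUP and SYNC_GROUP, the consumer elected as group leader computes the partition assignment on the client side. As a rough idea of what that computation can look like, here is a simplified round-robin sketch in Python. It is not Kafka&#x27;s actual assignor code, and the function and member names are hypothetical.&lt;&#x2F;p&gt;

```python
def assign_round_robin(members, partitions):
    """Deal partitions to group members one at a time, in sorted order."""
    ordered_members = sorted(members)
    assignment = {member: [] for member in ordered_members}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered_members[index % len(ordered_members)]
        assignment[owner].append(partition)
    return assignment


# With a single consumer, it ends up owning every partition of the topic.
print(assign_round_robin(["consumer-1"], [0]))
print(assign_round_robin(["consumer-2", "consumer-1"], [0, 1, 2]))
```

&lt;p&gt;Sorting both inputs keeps the result deterministic, so the same membership always produces the same assignment regardless of the order in which consumers joined.&lt;&#x2F;p&gt;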
&lt;p&gt;For the sake of the explanation, we now have another broker, Broker1, which holds the coordinator for topic &#x27;my-topic&#x27;. In real life, it would be the same.&lt;&#x2F;p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;pre class=&quot;mermaid&quot;&gt;
        sequenceDiagram

    Note over KafkaClient,Broker0: ...handshaking, see above...

    Note left of KafkaClient: Who is the &amp;lt;br&amp;#x2F;&amp;gt; coordinator for&amp;lt;br&amp;#x2F;&amp;gt; &amp;#x27;my-topic&amp;#x27;?
    KafkaClient -&amp;gt;&amp;gt;+ Broker0: FIND_COORDINATOR request

    Note right of Broker0: It is Broker1!
    Broker0 -&amp;gt;&amp;gt;- KafkaClient: 

    Note left of KafkaClient: OK, let&amp;#x27;s connect&amp;lt;br&amp;#x2F;&amp;gt; to Broker1
    Note over KafkaClient,Broker1: ...handshaking, see above...

    Note left of KafkaClient: Hi, I want to join a &amp;lt;br&amp;#x2F;&amp;gt; consumption group &amp;lt;br&amp;#x2F;&amp;gt;for &amp;#x27;my-topic&amp;#x27;
    KafkaClient -&amp;gt;&amp;gt;+ Broker1: JOIN_GROUP request

    Note right of Broker1: Welcome! I will be &amp;lt;br&amp;#x2F;&amp;gt; waiting a bit for any &amp;lt;br&amp;#x2F;&amp;gt;of your friends.
    Note right of Broker1: You are now leader. &amp;lt;br&amp;#x2F;&amp;gt;Your group contains &amp;lt;br&amp;#x2F;&amp;gt; only one member.&amp;lt;br&amp;#x2F;&amp;gt; You now  need to &amp;lt;br&amp;#x2F;&amp;gt; assign partitions to &amp;lt;br&amp;#x2F;&amp;gt; them. 
    Broker1 -&amp;gt;&amp;gt;- KafkaClient: 

    Note left of KafkaClient: Computing &amp;lt;br&amp;#x2F;&amp;gt;the assignment...
    Note left of KafkaClient: Done! I will be &amp;lt;br&amp;#x2F;&amp;gt; in charge of handling &amp;lt;br&amp;#x2F;&amp;gt; partition-0 of &amp;lt;br&amp;#x2F;&amp;gt;&amp;#x27;my-topic&amp;#x27;
    KafkaClient -&amp;gt;&amp;gt;+ Broker1: SYNC_GROUP request

    Note right of Broker1: Thanks, I will &amp;lt;br&amp;#x2F;&amp;gt;broadcast the &amp;lt;br&amp;#x2F;&amp;gt;assignments to &amp;lt;br&amp;#x2F;&amp;gt;everyone
    Broker1 -&amp;gt;&amp;gt;- KafkaClient: 

    Note left of KafkaClient: Can I get the &amp;lt;br&amp;#x2F;&amp;gt; committed offsets &amp;lt;br&amp;#x2F;&amp;gt; for partition-0&amp;lt;br&amp;#x2F;&amp;gt;for my consumer&amp;lt;br&amp;#x2F;&amp;gt;group?
    KafkaClient -&amp;gt;&amp;gt;+ Broker1: OFFSET_FETCH request

    Note right of Broker1: Found no &amp;lt;br&amp;#x2F;&amp;gt;committed offset&amp;lt;br&amp;#x2F;&amp;gt; for partition-0
    Broker1 -&amp;gt;&amp;gt;- KafkaClient: 

    Note left of KafkaClient: Thanks, I will now &amp;lt;br&amp;#x2F;&amp;gt;connect to Broker0

    Note over KafkaClient,Broker0: ...handshaking again...

    opt if new consumer-group
        Note left of KafkaClient: Can you give me&amp;lt;br&amp;#x2F;&amp;gt; the earliest position&amp;lt;br&amp;#x2F;&amp;gt; for partition-0?
        KafkaClient -&amp;gt;&amp;gt;+ Broker0: LIST_OFFSETS request
        
        Note right of Broker0: Here&amp;#x27;s the earliest &amp;lt;br&amp;#x2F;&amp;gt; position: ...
        Broker0 -&amp;gt;&amp;gt;- KafkaClient: 
    end 
    loop pull msg

        opt Consume
            Note left of KafkaClient: Can you give me&amp;lt;br&amp;#x2F;&amp;gt; some messages &amp;lt;br&amp;#x2F;&amp;gt; starting  at offset X?
            KafkaClient -&amp;gt;&amp;gt;+ Broker0: FETCH request

            Note right of Broker0: Here are some records...
            Broker0 -&amp;gt;&amp;gt;- KafkaClient: 

            Note left of KafkaClient: Processing...
            Note left of KafkaClient: Can you commit &amp;lt;br&amp;#x2F;&amp;gt;offset X?
            KafkaClient -&amp;gt;&amp;gt;+ Broker1: OFFSET_COMMIT request

            Note right of Broker1: Committing...
            Note right of Broker1: Done!
            Broker1 -&amp;gt;&amp;gt;- KafkaClient: 
        end

        Note left of KafkaClient: I need to send &amp;lt;br&amp;#x2F;&amp;gt; some liveness proof &amp;lt;br&amp;#x2F;&amp;gt; to the coordinator
        opt Healthcheck
            Note left of KafkaClient: I am still alive!  
            KafkaClient -&amp;gt;&amp;gt;+ Broker1: HEARTBEAT request
            Note right of Broker1: I hear you
            Broker1 -&amp;gt;&amp;gt;- KafkaClient: 
        end
    end
    &lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">messaging</category>
          <category domain="tag">distributed</category>
          <category domain="tag">kafka</category>
          <category domain="tag">networking</category>
          <category domain="tag">diving-into</category>
      </item>
      <item>
          <title>Diving into Hbase&#x27;s MemStore</title>
          <pubDate>Sun, 17 Nov 2019 10:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/diving-into-hbase-memstore/</link>
          <guid>https://pierrezemb.fr/posts/diving-into-hbase-memstore/</guid>
          <description xml:base="https://pierrezemb.fr/posts/diving-into-hbase-memstore/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;hbase-data-model&#x2F;hbase.jpg&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;tags&#x2F;diving-into&#x2F;&quot;&gt;Diving Into&lt;&#x2F;a&gt; is a blog post series where we dig into a specific part of a project&#x27;s codebase. In this episode, we will dig into the implementation behind Hbase&#x27;s MemStore.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;code&gt;tl;dr:&lt;&#x2F;code&gt; Hbase is using the &lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;javase&#x2F;8&#x2F;docs&#x2F;api&#x2F;java&#x2F;util&#x2F;concurrent&#x2F;ConcurrentSkipListMap.html&quot;&gt;ConcurrentSkipListMap&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-the-memstore&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-is-the-memstore&quot; aria-label=&quot;Anchor link for: what-is-the-memstore&quot;&gt;🔗&lt;&#x2F;a&gt;What is the MemStore?&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;memtable&lt;&#x2F;code&gt; from the official &lt;a href=&quot;https:&#x2F;&#x2F;research.google.com&#x2F;archive&#x2F;bigtable-osdi06.pdf&quot;&gt;BigTable paper&lt;&#x2F;a&gt; is the equivalent of the &lt;code&gt;MemStore&lt;&#x2F;code&gt; in Hbase.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;As rows are &lt;strong&gt;sorted lexicographically&lt;&#x2F;strong&gt; in Hbase, when data comes in, you need some kind of &lt;strong&gt;in-memory buffer&lt;&#x2F;strong&gt; to order those keys. This is where the &lt;code&gt;MemStore&lt;&#x2F;code&gt; comes in. It absorbs the recent write (or put, in Hbase semantics) operations. Everything else is stored as immutable files called &lt;code&gt;HFile&lt;&#x2F;code&gt; in HDFS. There is one &lt;code&gt;MemStore&lt;&#x2F;code&gt; per &lt;code&gt;column family&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
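&lt;p&gt;The idea of a sorted write buffer that gets flushed into an immutable file can be sketched in a few lines of Python. This is only a conceptual model with hypothetical names: a sorted list maintained with &lt;code&gt;bisect&lt;&#x2F;code&gt; stands in for the concurrent skip list, and a tuple stands in for an HFile. Timestamps, column families and concurrency are ignored.&lt;&#x2F;p&gt;

```python
import bisect


class MemStoreSketch:
    """Toy write buffer that keeps row keys in lexicographic order."""

    def __init__(self):
        self._keys = []    # row keys, always sorted
        self._values = {}  # latest value per row key

    def put(self, row_key, value):
        """Absorb a write, inserting new keys at their sorted position."""
        if row_key not in self._values:
            bisect.insort(self._keys, row_key)
        self._values[row_key] = value

    def flush(self):
        """Return the buffer as an immutable sorted snapshot and reset it."""
        snapshot = tuple((key, self._values[key]) for key in self._keys)
        self._keys, self._values = [], {}
        return snapshot


store = MemStoreSketch()
store.put("row-b", "2")
store.put("row-a", "1")
print(store.flush())
```

&lt;p&gt;Because the buffer is already ordered, a flush is a sequential write of sorted entries, which is what makes the on-disk files cheap to produce and to merge later.&lt;&#x2F;p&gt;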
&lt;p&gt;Let&#x27;s dig into how the MemStore internally works in Hbase 1.X.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hbase-1&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#hbase-1&quot; aria-label=&quot;Anchor link for: hbase-1&quot;&gt;🔗&lt;&#x2F;a&gt;Hbase 1&lt;&#x2F;h2&gt;
&lt;p&gt;All code extracts in this section are taken from the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;tree&#x2F;rel&#x2F;1.4.9&quot;&gt;rel&#x2F;1.4.9&lt;&#x2F;a&gt; tag.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;in-memory-storage&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#in-memory-storage&quot; aria-label=&quot;Anchor link for: in-memory-storage&quot;&gt;🔗&lt;&#x2F;a&gt;In-memory storage&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;1.4.9&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;regionserver&#x2F;MemStore.java#L35&quot;&gt;MemStore interface&lt;&#x2F;a&gt; gives us insight into how it works internally.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;**
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   * Write an update
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   * &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;@param &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;cell
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   * &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;@return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt; approximate size of the passed cell.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   *&#x2F;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;long add(final Cell cell);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;-- &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;1.4.9&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;regionserver&#x2F;MemStore.java#L68-L73&quot;&gt;add function on the MemStore&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The implementation is provided by &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;1.4.9&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;regionserver&#x2F;DefaultMemStore.java&quot;&gt;DefaultMemStore&lt;&#x2F;a&gt;. &lt;code&gt;add&lt;&#x2F;code&gt; is wrapped by several functions, but in the end, we arrive here:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;private boolean &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;addToCellSet&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;Cell&lt;&#x2F;span&gt;&lt;span&gt; e) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;boolean&lt;&#x2F;span&gt;&lt;span&gt; b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span&gt;.activeSection.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;getCellSkipListSet&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;add&lt;&#x2F;span&gt;&lt;span&gt;(e);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;-- &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;1.4.9&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;regionserver&#x2F;DefaultMemStore.java#L202-L213&quot;&gt;addToCellSet on the DefaultMemStore&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;1.4.9&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;regionserver&#x2F;CellSkipListSet.java#L33-L48&quot;&gt;CellSkipListSet class&lt;&#x2F;a&gt; is built on top of &lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;javase&#x2F;8&#x2F;docs&#x2F;api&#x2F;java&#x2F;util&#x2F;concurrent&#x2F;ConcurrentSkipListMap.html&quot;&gt;ConcurrentSkipListMap&lt;&#x2F;a&gt;, which provides two nice features:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;concurrency&lt;&#x2F;li&gt;
&lt;li&gt;sorted elements&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
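&lt;p&gt;To make those two properties concrete, here is a minimal sketch using only JDK classes. It is not HBase code: the &lt;code&gt;MemStoreSketch&lt;&#x2F;code&gt; class, its methods and the plain string row keys are assumptions made up for this illustration, whereas HBase stores &lt;code&gt;Cell&lt;&#x2F;code&gt; objects ordered by a dedicated comparator.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span&gt;import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

&#x2F;&#x2F; Hypothetical stand-in for a MemStore: row keys kept sorted in a skip list.
public class MemStoreSketch {
    private final ConcurrentSkipListMap&amp;lt;String, String&amp;gt; cells = new ConcurrentSkipListMap&amp;lt;&amp;gt;();

    &#x2F;&#x2F; Safe to call from several threads without external locking.
    public void put(String rowKey, String value) {
        cells.put(rowKey, value);
    }

    &#x2F;&#x2F; Iteration follows the natural (lexicographic) order of the keys.
    public List&amp;lt;String&amp;gt; sortedRowKeys() {
        return new ArrayList&amp;lt;&amp;gt;(cells.keySet());
    }

    public static void main(String[] args) throws InterruptedException {
        MemStoreSketch store = new MemStoreSketch();
        &#x2F;&#x2F; Two concurrent writers inserting keys out of order.
        Thread t1 = new Thread(() -&amp;gt; store.put(&amp;quot;row3&amp;quot;, &amp;quot;c&amp;quot;));
        Thread t2 = new Thread(() -&amp;gt; store.put(&amp;quot;row1&amp;quot;, &amp;quot;a&amp;quot;));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        store.put(&amp;quot;row2&amp;quot;, &amp;quot;b&amp;quot;);
        System.out.println(store.sortedRowKeys()); &#x2F;&#x2F; [row1, row2, row3]
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Whatever the insertion order or the writing thread, a reader always iterates over the keys in sorted order, which is what makes a later sequential flush to disk cheap.&lt;&#x2F;p&gt;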
&lt;h3 id=&quot;flush-on-hdfs&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#flush-on-hdfs&quot; aria-label=&quot;Anchor link for: flush-on-hdfs&quot;&gt;🔗&lt;&#x2F;a&gt;Flush on HDFS&lt;&#x2F;h3&gt;
&lt;p&gt;As we have seen above, the &lt;code&gt;MemStore&lt;&#x2F;code&gt; absorbs all the puts. When asked to flush, the current memstore is &lt;strong&gt;moved to a snapshot and is cleared&lt;&#x2F;strong&gt;. Flushed files are called &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;2.1.2&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;io&#x2F;hfile&#x2F;HFile.java&quot;&gt;HFiles&lt;&#x2F;a&gt; and they are similar to the &lt;code&gt;SSTables&lt;&#x2F;code&gt; introduced by the official &lt;a href=&quot;https:&#x2F;&#x2F;research.google.com&#x2F;archive&#x2F;bigtable-osdi06.pdf&quot;&gt;BigTable paper&lt;&#x2F;a&gt;. HFiles are flushed to the Hadoop Distributed File System, &lt;code&gt;HDFS&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you want deeper insight into SSTables, I recommend reading &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facebook&#x2F;rocksdb&#x2F;wiki&#x2F;Rocksdb-BlockBasedTable-Format&quot;&gt;Table Format from the awesome RocksDB wiki&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;compaction&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#compaction&quot; aria-label=&quot;Anchor link for: compaction&quot;&gt;🔗&lt;&#x2F;a&gt;Compaction&lt;&#x2F;h3&gt;
&lt;p&gt;Compactions only run on HFiles. It means that &lt;strong&gt;if hot data is continuously updated, we are overusing memory due to duplicate entries per row per MemStore&lt;&#x2F;strong&gt;. Accordion aims to solve this problem through &lt;em&gt;in-memory compactions&lt;&#x2F;em&gt;. Let&#x27;s have a look at HBase 2.X!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hbase-2&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#hbase-2&quot; aria-label=&quot;Anchor link for: hbase-2&quot;&gt;🔗&lt;&#x2F;a&gt;HBase 2&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;storing-data&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#storing-data&quot; aria-label=&quot;Anchor link for: storing-data&quot;&gt;🔗&lt;&#x2F;a&gt;storing data&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;All code extracts from here on are taken from the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;tree&#x2F;rel&#x2F;2.1.2&quot;&gt;rel&#x2F;2.1.2&lt;&#x2F;a&gt; tag.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Did the &lt;code&gt;MemStore&lt;&#x2F;code&gt; interface change?&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;**
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   * Write an update
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   * &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;@param &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;cell
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   * &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;@param &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;memstoreSizing&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt; The delta in memstore size will be passed back via this.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   *        This will include both data size and heap overhead delta.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;   *&#x2F;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;  void add(final Cell cell, MemStoreSizing memstoreSizing);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;-- &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;2.1.2&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;regionserver&#x2F;MemStore.java#L67-L73&quot;&gt;add function in MemStore interface&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The signature changed a bit: instead of returning a long, &lt;code&gt;add&lt;&#x2F;code&gt; now takes a &lt;code&gt;MemStoreSizing&lt;&#x2F;code&gt; object through which the size delta is passed back. Moving on.&lt;&#x2F;p&gt;
&lt;p&gt;The new structure implementing MemStore is called &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;2.1.2&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;regionserver&#x2F;AbstractMemStore.java#L42&quot;&gt;AbstractMemStore&lt;&#x2F;a&gt;. Again, we have some layers, where AbstractMemStore writes to a &lt;code&gt;MutableSegment&lt;&#x2F;code&gt;, which itself wraps &lt;code&gt;Segment&lt;&#x2F;code&gt;. If you dig far enough, you will find that data is stored in the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;hbase&#x2F;blob&#x2F;rel&#x2F;2.1.2&#x2F;hbase-server&#x2F;src&#x2F;main&#x2F;java&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hbase&#x2F;regionserver&#x2F;CellSet.java#L35-L51&quot;&gt;CellSet class&lt;&#x2F;a&gt;, which is also built on top of &lt;strong&gt;ConcurrentSkipListMap&lt;&#x2F;strong&gt;!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;in-memory-compactions&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#in-memory-compactions&quot; aria-label=&quot;Anchor link for: in-memory-compactions&quot;&gt;🔗&lt;&#x2F;a&gt;in-memory Compactions&lt;&#x2F;h3&gt;
&lt;p&gt;HBase 2.0 introduces a big change to the original memstore called Accordion, the codename for in-memory compactions. An awesome blog post is available here: &lt;a href=&quot;https:&#x2F;&#x2F;blogs.apache.org&#x2F;hbase&#x2F;entry&#x2F;accordion-hbase-breathes-with-in&quot;&gt;Accordion: HBase Breathes with In-Memory Compaction&lt;&#x2F;a&gt;, and the &lt;a href=&quot;https:&#x2F;&#x2F;issues.apache.org&#x2F;jira&#x2F;secure&#x2F;attachment&#x2F;12709471&#x2F;HBaseIn-MemoryMemstoreCompactionDesignDocument.pdf&quot;&gt;design document&lt;&#x2F;a&gt; is also available.&lt;&#x2F;p&gt;
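&lt;p&gt;As a rough illustration of the idea, and not the actual Accordion implementation, the sketch below shows why repeated updates to a hot row pile up in a sorted in-memory structure, and how a compaction pass can keep only the newest version of each row. The &lt;code&gt;InMemoryCompactionSketch&lt;&#x2F;code&gt; and &lt;code&gt;VersionedKey&lt;&#x2F;code&gt; names, as well as the &lt;code&gt;compact&lt;&#x2F;code&gt; helper, are hypothetical.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span&gt;import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class InMemoryCompactionSketch {
    &#x2F;&#x2F; Hypothetical key: row plus timestamp, so every update is a new entry.
    public static final class VersionedKey implements Comparable&amp;lt;VersionedKey&amp;gt; {
        final String row;
        final long timestamp;

        public VersionedKey(String row, long timestamp) {
            this.row = row;
            this.timestamp = timestamp;
        }

        &#x2F;&#x2F; Order by row, then newest timestamp first within a row.
        @Override
        public int compareTo(VersionedKey other) {
            int byRow = row.compareTo(other.row);
            return byRow != 0 ? byRow : Long.compare(other.timestamp, timestamp);
        }
    }

    &#x2F;&#x2F; Keep only the first (newest) version seen for each row.
    public static Map&amp;lt;String, String&amp;gt; compact(ConcurrentSkipListMap&amp;lt;VersionedKey, String&amp;gt; segment) {
        Map&amp;lt;String, String&amp;gt; latest = new TreeMap&amp;lt;&amp;gt;();
        for (Map.Entry&amp;lt;VersionedKey, String&amp;gt; entry : segment.entrySet()) {
            latest.putIfAbsent(entry.getKey().row, entry.getValue());
        }
        return latest;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap&amp;lt;VersionedKey, String&amp;gt; segment = new ConcurrentSkipListMap&amp;lt;&amp;gt;();
        segment.put(new VersionedKey(&amp;quot;row1&amp;quot;, 1L), &amp;quot;v1&amp;quot;);
        segment.put(new VersionedKey(&amp;quot;row1&amp;quot;, 2L), &amp;quot;v2&amp;quot;);
        segment.put(new VersionedKey(&amp;quot;row1&amp;quot;, 3L), &amp;quot;v3&amp;quot;);
        segment.put(new VersionedKey(&amp;quot;row2&amp;quot;, 1L), &amp;quot;x&amp;quot;);
        System.out.println(segment.size());   &#x2F;&#x2F; 4 entries held in memory
        System.out.println(compact(segment)); &#x2F;&#x2F; {row1=v3, row2=x}
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Three of the four entries belong to the same hot row. Without an in-memory pass, all of them would stay in RAM until the next flush.&lt;&#x2F;p&gt;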
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I&#x27;m also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">storage</category>
          <category domain="tag">distributed</category>
          <category domain="tag">hbase</category>
          <category domain="tag">performance</category>
          <category domain="tag">diving-into</category>
      </item>
      <item>
          <title>What can be gleaned about GFS successor codenamed Colossus?</title>
          <pubDate>Sun, 04 Aug 2019 15:07:11 +0200</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/colossus-google/</link>
          <guid>https://pierrezemb.fr/posts/colossus-google/</guid>
<description xml:base="https://pierrezemb.fr/posts/colossus-google/">&lt;p&gt;In the last few months, there have been numerous blog posts about the end of the Hadoop era. It is true that:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.theregister.co.uk&#x2F;2019&#x2F;06&#x2F;06&#x2F;cloudera_ceo_quits_customers_delay_purchase_orders_due_to_roadmap_uncertainty_after_hortonworks_merger&#x2F;&quot;&gt;The health of Hadoop-based companies is publicly bad&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Hadoop is getting bad publicity, with headlines like &lt;a href=&quot;https:&#x2F;&#x2F;techwireasia.com&#x2F;2019&#x2F;07&#x2F;what-does-the-death-of-hadoop-mean-for-big-data&#x2F;&quot;&gt;&#x27;What does the death of Hadoop mean for big data?&#x27;&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Hadoop, as a distributed system, &lt;strong&gt;is hard to operate, but can be essential for some types of workloads&lt;&#x2F;strong&gt;. As Hadoop is based on GFS, we can wonder how GFS evolved inside Google.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hadoop-s-story&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#hadoop-s-story&quot; aria-label=&quot;Anchor link for: hadoop-s-story&quot;&gt;🔗&lt;&#x2F;a&gt;Hadoop&#x27;s story&lt;&#x2F;h2&gt;
&lt;p&gt;Hadoop is based on a Google paper called &lt;a href=&quot;https:&#x2F;&#x2F;static.googleusercontent.com&#x2F;media&#x2F;research.google.com&#x2F;en&#x2F;&#x2F;archive&#x2F;gfs-sosp2003.pdf&quot;&gt;The Google File System&lt;&#x2F;a&gt;, published in 2003. There are some key elements in this paper:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;It was designed to be deployed with &lt;a href=&quot;https:&#x2F;&#x2F;ai.google&#x2F;research&#x2F;pubs&#x2F;pub43438&quot;&gt;Borg&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;to &quot;&lt;a href=&quot;https:&#x2F;&#x2F;queue.acm.org&#x2F;detail.cfm?id=1594206&quot;&gt;simplify the overall design problem&lt;&#x2F;a&gt;&quot;, they:
&lt;ul&gt;
&lt;li&gt;implemented a single master architecture&lt;&#x2F;li&gt;
&lt;li&gt;dropped the idea of a full POSIX-compliant file system&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Metadata is stored in RAM on the master,&lt;&#x2F;li&gt;
&lt;li&gt;Data is stored on chunkservers,&lt;&#x2F;li&gt;
&lt;li&gt;There is no YARN or Map&#x2F;Reduce or any kind of compute capabilities.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;is-hadoop-still-revelant&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#is-hadoop-still-revelant&quot; aria-label=&quot;Anchor link for: is-hadoop-still-revelant&quot;&gt;🔗&lt;&#x2F;a&gt;Is Hadoop still relevant?&lt;&#x2F;h2&gt;
&lt;p&gt;Google with GFS and the rest of the world with Hadoop hit some issues:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A single (metadata) machine is not large enough for a large FS,&lt;&#x2F;li&gt;
&lt;li&gt;Single bottleneck for metadata operations,&lt;&#x2F;li&gt;
&lt;li&gt;Not appropriate for latency sensitive applications,&lt;&#x2F;li&gt;
&lt;li&gt;Fault tolerant, but not HA,&lt;&#x2F;li&gt;
&lt;li&gt;Unpredictable performance,&lt;&#x2F;li&gt;
&lt;li&gt;Replication&#x27;s cost,&lt;&#x2F;li&gt;
&lt;li&gt;HDFS Write-path pipelining,&lt;&#x2F;li&gt;
&lt;li&gt;Fixed-size blocks,&lt;&#x2F;li&gt;
&lt;li&gt;Cost of operations,&lt;&#x2F;li&gt;
&lt;li&gt;...&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Despite all the issues, Hadoop is still relevant for some use cases, such as Map&#x2F;Reduce, or if you need HBase as a main datastore. There are stories available online about the scalability of Hadoop:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;blog.twitter.com&#x2F;engineering&#x2F;en_us&#x2F;topics&#x2F;infrastructure&#x2F;2017&#x2F;the-infrastructure-behind-twitter-scale.html&quot;&gt;Twitter has multiple clusters storing over 500 PB (2017)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;whereas Google preferred to keep GFS &lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;files&#x2F;storage_architecture_and_challenges.pdf&quot;&gt;&quot;Scaled to approximately 50M files, 10P&quot; to avoid the &quot;added management overhead&quot; brought by the scaling.&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Nowadays, Hadoop is mostly used for Business Intelligence or to create a datalake, but at first, GFS was designed to provide a distributed file-system on top of commodity servers.&lt;&#x2F;p&gt;
&lt;p&gt;Google&#x27;s developers were&#x2F;are deploying applications into &quot;containers&quot;, meaning that &lt;strong&gt;any process could be spawned somewhere in the cloud&lt;&#x2F;strong&gt;. Developers are used to working with the file-system abstraction, which provides a layer of durability and security. To mimic that process, they developed GFS, so that &lt;strong&gt;processes don&#x27;t need to worry about replication&lt;&#x2F;strong&gt; (like Bigtable&#x2F;HBase).&lt;&#x2F;p&gt;
&lt;p&gt;This is a promise that, I think, was forgotten. In a world where Kubernetes &lt;em&gt;seems&lt;&#x2F;em&gt; to be the standard, &lt;strong&gt;the need for a global distributed file-system is now higher than before&lt;&#x2F;strong&gt;. By providing a &quot;file-system&quot; abstraction for applications deployed in Kubernetes, we may be solving many problems Kubernetes adopters are hitting, such as:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;How can I retrieve that particular file for my applications deployed on the other side of the Kubernetes cluster?&lt;&#x2F;li&gt;
&lt;li&gt;Should I be moving that persistent volume over my slow network?&lt;&#x2F;li&gt;
&lt;li&gt;What happens when &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dgraph-io&#x2F;dgraph&#x2F;issues&#x2F;2698&quot;&gt;Kubernetes kills an alpha pod in the middle of retrieving a snapshot&lt;&#x2F;a&gt;?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;well-let-s-put-hadoop-in-kubernetes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#well-let-s-put-hadoop-in-kubernetes&quot; aria-label=&quot;Anchor link for: well-let-s-put-hadoop-in-kubernetes&quot;&gt;🔗&lt;&#x2F;a&gt;Well, let&#x27;s put Hadoop in Kubernetes&lt;&#x2F;h2&gt;
&lt;p&gt;Putting a distributed system inside Kubernetes is currently an unpleasant experience because of the current tooling:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Helm does not help me express my needs as a distributed-system operator. Even worse, the official &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;helm&#x2F;charts&#x2F;tree&#x2F;master&#x2F;stable&#x2F;hadoop&quot;&gt;Helm chart for Hadoop is limited to YARN and Map&#x2F;Reduce and &quot;Data should be read from cloud based datastores such as Google Cloud Storage, S3 or Swift.&quot;&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Kubernetes Operators have no access to key metrics, so they cannot watch over your applications correctly. They only provide a good &quot;day-zero to day-two&quot; experience,&lt;&#x2F;li&gt;
&lt;li&gt;Google seems to &lt;a href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=16971959&quot;&gt;not be using the Operators design internally&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.ibm.com&#x2F;cloud&#x2F;blog&#x2F;new-builders&#x2F;database-deep-dives-couchdb&quot;&gt;CouchDB developers&lt;&#x2F;a&gt; are saying that:
&lt;ul&gt;
&lt;li&gt;&quot;For certain workloads, the technology isn’t quite there yet&quot;&lt;&#x2F;li&gt;
&lt;li&gt;&quot;In certain scenarios that are getting smaller and smaller, both Kubernetes and Docker get in the way of that. At that point, CouchDB gets slow, or you get timeout errors, that you can’t explain.&quot;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;how-gfs-evolved-within-google&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-gfs-evolved-within-google&quot; aria-label=&quot;Anchor link for: how-gfs-evolved-within-google&quot;&gt;🔗&lt;&#x2F;a&gt;How GFS evolved within Google&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;technical-overview&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#technical-overview&quot; aria-label=&quot;Anchor link for: technical-overview&quot;&gt;🔗&lt;&#x2F;a&gt;Technical overview&lt;&#x2F;h3&gt;
&lt;p&gt;As GFS&#x27;s paper was published in 2003, we can ask ourselves if GFS has evolved. And it did! The sad part is that there is only a little information about this project, codenamed &lt;code&gt;Colossus&lt;&#x2F;code&gt;. There are no papers and not much information available. Here&#x27;s what can be found online:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;From &lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;files&#x2F;storage_architecture_and_challenges.pdf&quot;&gt;Storage Architecture and Challenges(2010)&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;They moved from full replication to &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Reed%E2%80%93Solomon_error_correction&quot;&gt;Reed-Solomon&lt;&#x2F;a&gt;. This feature is actually available in &lt;a href=&quot;https:&#x2F;&#x2F;hadoop.apache.org&#x2F;docs&#x2F;r3.0.0&#x2F;hadoop-project-dist&#x2F;hadoop-hdfs&#x2F;HDFSErasureCoding.html&quot;&gt;Hadoop 3&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;replication is handled by the client, instead of the pipelining,&lt;&#x2F;li&gt;
&lt;li&gt;the metadata layer is automatically sharded. We can find more information about that in the next resource!&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;From &lt;a href=&quot;http:&#x2F;&#x2F;www.pdsw.org&#x2F;pdsw-discs17&#x2F;slides&#x2F;PDSW-DISCS-Google-Keynote.pdf&quot;&gt;Cluster-Level Storage @ Google(2017)&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;GFS master replaced by Colossus&lt;&#x2F;li&gt;
&lt;li&gt;GFS chunkserver replaced by D&lt;&#x2F;li&gt;
&lt;li&gt;Colossus rebalances old, cold data&lt;&#x2F;li&gt;
&lt;li&gt;distributes newly written data evenly across disks&lt;&#x2F;li&gt;
&lt;li&gt;Metadata is stored in BigTable. Each Bigtable row corresponds to a single file.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The &quot;all in RAM&quot; GFS master design was a severe single point of failure, so getting rid of it was a priority. They didn&#x27;t have a lot of options for a scalable and rock-solid datastore &lt;strong&gt;besides BigTable&lt;&#x2F;strong&gt;. When you think about it, a key&#x2F;value datastore is a great replacement for a distributed file-system master:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;automatic sharding of regions,&lt;&#x2F;li&gt;
&lt;li&gt;scan capabilities for files in the same &quot;directory&quot;,&lt;&#x2F;li&gt;
&lt;li&gt;lexical ordering,&lt;&#x2F;li&gt;
&lt;li&gt;...&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
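&lt;p&gt;To illustrate why lexical ordering and scans fit a file-system master so well, here is a small sketch where a sorted in-memory map stands in for the key&#x2F;value store. This is only an assumption-laden toy, not how Colossus or Bigtable are implemented: the &lt;code&gt;DirectoryScanSketch&lt;&#x2F;code&gt; class, its &lt;code&gt;create&lt;&#x2F;code&gt; and &lt;code&gt;list&lt;&#x2F;code&gt; methods and the path-as-key layout are all hypothetical.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span&gt;import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public class DirectoryScanSketch {
    &#x2F;&#x2F; Hypothetical metadata table: one entry per file, keyed by its full path.
    private final ConcurrentSkipListMap&amp;lt;String, String&amp;gt; files = new ConcurrentSkipListMap&amp;lt;&amp;gt;();

    public void create(String path, String metadata) {
        files.put(path, metadata);
    }

    &#x2F;&#x2F; Keys sharing a prefix are contiguous, so listing everything under
    &#x2F;&#x2F; a directory is a single range scan over [prefix, prefix + max char).
    public List&amp;lt;String&amp;gt; list(String directory) {
        String prefix = directory.endsWith(&amp;quot;&#x2F;&amp;quot;) ? directory : directory + &amp;quot;&#x2F;&amp;quot;;
        return new ArrayList&amp;lt;&amp;gt;(files.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }

    public static void main(String[] args) {
        DirectoryScanSketch master = new DirectoryScanSketch();
        master.create(&amp;quot;&#x2F;logs&#x2F;b&amp;quot;, &amp;quot;chunks=2&amp;quot;);
        master.create(&amp;quot;&#x2F;tmp&#x2F;x&amp;quot;, &amp;quot;chunks=1&amp;quot;);
        master.create(&amp;quot;&#x2F;logs&#x2F;a&amp;quot;, &amp;quot;chunks=5&amp;quot;);
        master.create(&amp;quot;&#x2F;logs2&#x2F;z&amp;quot;, &amp;quot;chunks=1&amp;quot;);
        System.out.println(master.list(&amp;quot;&#x2F;logs&amp;quot;)); &#x2F;&#x2F; [&#x2F;logs&#x2F;a, &#x2F;logs&#x2F;b]
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In a real sharded store, the same range scan would simply touch one or a few adjacent shards instead of a single in-memory map.&lt;&#x2F;p&gt;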
&lt;p&gt;The funny part is that they now need a Colossus for Colossus. The only thing saving them is that the metametametadata (the metadata of the metadata of the metadata) can be held in Chubby.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;From &lt;a href=&quot;https:&#x2F;&#x2F;queue.acm.org&#x2F;detail.cfm?id=1594206&quot;&gt;GFS: Evolution on Fast-forward(2009)&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;they moved to 1MB chunks, as the limitations of the master disappeared. This also allows Colossus to support latency-sensitive applications,&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;From &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cockroachdb&#x2F;cockroach&#x2F;issues&#x2F;243#issuecomment-91575792&quot;&gt;a Github comment on Colossus&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;File reconstruction from Reed-Solomon was performed on both client-side and server-side&lt;&#x2F;li&gt;
&lt;li&gt;on-the-fly recovery of data is greatly enhanced by this data layout (Reed-Solomon)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;From a &lt;a href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=20135927&quot;&gt;Hacker News comment&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Colossus and D are two separate things.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;From &lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;blog&#x2F;products&#x2F;storage-data-transfer&#x2F;a-peek-behind-colossus-googles-file-system&quot;&gt;Colossus under the hood: a peek into Google’s scalable storage system&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Colossus&#x27;s Control Plane is a scalable metadata service, which consists of many Curators. Clients talk directly to curators for control operations, such as file creation, and can scale horizontally.&lt;&#x2F;li&gt;
&lt;li&gt;Background storage managers called Custodians handle tasks like disk space balancing and RAID reconstruction.&lt;&#x2F;li&gt;
&lt;li&gt;Applications need to specify I&#x2F;O, availability, and durability requirements&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;What is that &quot;D&quot;?&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;From &lt;a href=&quot;https:&#x2F;&#x2F;landing.google.com&#x2F;sre&#x2F;sre-book&#x2F;chapters&#x2F;production-environment&#x2F;&quot;&gt;The Production Environment at Google, from the Viewpoint of an SRE&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;D stands for &lt;em&gt;Disk&lt;&#x2F;em&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;D is a fileserver running on almost all machines in a cluster.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;From &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@jerub&#x2F;the-production-environment-at-google-8a1aaece3767&quot;&gt;The Production Environment at Google&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;D is more of a block server than a file server&lt;&#x2F;li&gt;
&lt;li&gt;It provides nothing apart from checksums.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;deployments&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#deployments&quot; aria-label=&quot;Anchor link for: deployments&quot;&gt;🔗&lt;&#x2F;a&gt;Deployments&lt;&#x2F;h3&gt;
&lt;!-- I think the team that&#x27;s pushing the forefront of something k8s-like for persistency&#x2F;durability is... the Colossus&#x2F;D team at Google, who have been running storage servers managed by Borg for almost a decade now :) Problem is, it&#x27;s not k8s. But could tell us what that roadmap is.  --&gt;
&lt;p&gt;How is everything deployed? Using Borg!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;migration&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#migration&quot; aria-label=&quot;Anchor link for: migration&quot;&gt;🔗&lt;&#x2F;a&gt;Migration&lt;&#x2F;h3&gt;
&lt;p&gt;The migration process is described in the now free &lt;a href=&quot;https:&#x2F;&#x2F;static.googleusercontent.com&#x2F;media&#x2F;sre.google&#x2F;en&#x2F;&#x2F;static&#x2F;pdf&#x2F;case-studies-infrastructure-change-management.pdf&quot;&gt;Case Studies in Infrastructure Change Management&lt;&#x2F;a&gt; book.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;is-there-an-open-source-effort-to-create-a-colossus-like-dfs&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#is-there-an-open-source-effort-to-create-a-colossus-like-dfs&quot; aria-label=&quot;Anchor link for: is-there-an-open-source-effort-to-create-a-colossus-like-dfs&quot;&gt;🔗&lt;&#x2F;a&gt;Is there an open-source effort to create a Colossus-like DFS?&lt;&#x2F;h2&gt;
&lt;p&gt;I did not find any pointers to an open-source version of Colossus, besides some work done for &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;baidu&#x2F;bfs&quot;&gt;The Baidu File System&lt;&#x2F;a&gt;, in which the Nameserver is implemented as a Raft group.&lt;&#x2F;p&gt;
&lt;p&gt;There is &lt;a href=&quot;https:&#x2F;&#x2F;www.slideshare.net&#x2F;HadoopSummit&#x2F;scaling-hdfs-to-manage-billions-of-files-with-distributed-storage-schemes&quot;&gt;some work to add Colossus&#x27;s features to Hadoop&lt;&#x2F;a&gt;, but given the bad publicity Hadoop has now, I don&#x27;t think there will be a lot of money to power those efforts.&lt;&#x2F;p&gt;
&lt;p&gt;I do think that rewriting a distributed file-system based on Colossus would be a huge benefit for the community:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Reimplementing D may be easy. My current question is &lt;strong&gt;how far can we use a modern FS such as OpenZFS&lt;&#x2F;strong&gt; to facilitate the work? FS capabilities such as &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;zfsonlinux&#x2F;zfs&#x2F;wiki&#x2F;Checksums&quot;&gt;OpenZFS checksums&lt;&#x2F;a&gt; seem pretty interesting.&lt;&#x2F;li&gt;
&lt;li&gt;To resolve the distributed master issue, we could use &lt;a href=&quot;https:&#x2F;&#x2F;tikv.org&#x2F;&quot;&gt;TiKV&lt;&#x2F;a&gt; as a building block to provide a &quot;BigTable experience&quot; without the need for a distributed file-system underneath.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;But remember:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Like crypto, do not roll your own DFS!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">storage</category>
          <category domain="tag">distributed</category>
          <category domain="tag">google</category>
          <category domain="tag">filesystem</category>
          <category domain="tag">hadoop</category>
      </item>
      <item>
          <title>Playing with TTL in HBase</title>
          <pubDate>Mon, 27 May 2019 22:07:11 +0200</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/ttl-hbase/</link>
          <guid>https://pierrezemb.fr/posts/ttl-hbase/</guid>
          <description xml:base="https://pierrezemb.fr/posts/ttl-hbase/">&lt;header class=&quot;row text-center header&quot;&gt;
   &lt;img src=&quot;&#x2F;images&#x2F;hbase-data-model&#x2F;hbase.jpg&quot; alt=&quot;HBase Image&quot; class=&quot;text-center&quot;&gt;
&lt;&#x2F;header&gt;
&lt;p&gt;Among all the features provided by HBase, there is one that is pretty handy for dealing with your data&#x27;s lifecycle: the fact that every cell version can have a &lt;strong&gt;Time to Live&lt;&#x2F;strong&gt; or TTL. Let&#x27;s dive into the feature!&lt;&#x2F;p&gt;
&lt;h1 id=&quot;time-to-live-ttl&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#time-to-live-ttl&quot; aria-label=&quot;Anchor link for: time-to-live-ttl&quot;&gt;🔗&lt;&#x2F;a&gt;Time To Live (TTL)&lt;&#x2F;h1&gt;
&lt;p&gt;Let&#x27;s read the doc first!&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;ColumnFamilies can set a TTL length in seconds, and &lt;strong&gt;HBase will automatically delete rows once the expiration time is reached&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#ttl&quot;&gt;HBase Book: Time To Live (TTL)&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s play with it! You can easily start a standalone HBase by following &lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#quickstart&quot;&gt;the HBase Book&lt;&#x2F;a&gt;. Once your standalone cluster is running, we can get started:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;.&#x2F;bin&#x2F;hbase&lt;&#x2F;span&gt;&lt;span&gt; shell
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:001:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; create &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;, {&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;NAME&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39; =&amp;gt; &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;cf1&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;TTL&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39; =&amp;gt; 30} &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# 30 sec
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now that our test_table is created, we can &lt;code&gt;put&lt;&#x2F;code&gt; some data into it:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:002:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; put &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;row123&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;cf1:desc&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;, &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;TTL Demo&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And you can &lt;code&gt;get&lt;&#x2F;code&gt; it with:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:003:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; get &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;row123&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;cf1:desc&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;COLUMN&lt;&#x2F;span&gt;&lt;span&gt;                             CELL
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;cf1:desc&lt;&#x2F;span&gt;&lt;span&gt;                          timestamp=1558366581134, value=TTL Demo
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt; row(s) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;in&lt;&#x2F;span&gt;&lt;span&gt; 0.0080 seconds
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here&#x27;s our row! But if you wait a bit, it will &lt;strong&gt;disappear&lt;&#x2F;strong&gt; thanks to the TTL:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:004:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; get &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;row123&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;cf1:desc&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;COLUMN&lt;&#x2F;span&gt;&lt;span&gt;                             CELL
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt; row(s) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;in&lt;&#x2F;span&gt;&lt;span&gt; 0.0220 seconds
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It has been filtered from the result, but the data is still here.  You can trigger a &lt;strong&gt;raw&lt;&#x2F;strong&gt; scan to check:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:002:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; scan &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;, {RAW =&amp;gt; true}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ROW&lt;&#x2F;span&gt;&lt;span&gt;                                COLUMN+CELL
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;row123&lt;&#x2F;span&gt;&lt;span&gt;                            column=cf1:desc, timestamp=1558366581134, value=TTL Demo
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt; row(s) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;in&lt;&#x2F;span&gt;&lt;span&gt; 0.3280 seconds
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It will only be removed when a &lt;strong&gt;major compaction&lt;&#x2F;strong&gt; occurs. Since we are just experimenting, we can:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;force the memstore to be &lt;strong&gt;flushed as HFiles&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;force the &lt;strong&gt;compaction&lt;&#x2F;strong&gt;:&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;div class=&quot;bs-callout bs-callout-info&quot;&gt;
You may have heard about &lt;b&gt;&lt;a target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blogs.apache.org&#x2F;hbase&#x2F;entry&#x2F;accordion-hbase-breathes-with-in&quot;&gt;Accordion&lt;&#x2F;a&gt;&lt;&#x2F;b&gt;, the new feature in HBase 2. If you are playing with HBase 2, you can enable it by following &lt;a target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#inmemory_compaction&quot;&gt;this link&lt;&#x2F;a&gt; and run &lt;b&gt;compactions directly in the MemStores.&lt;&#x2F;b&gt;
&lt;&#x2F;div&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:014:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; flush &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Took&lt;&#x2F;span&gt;&lt;span&gt; 0.4456 seconds    
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:015:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; compact &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Took&lt;&#x2F;span&gt;&lt;span&gt; 0.0468 seconds
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# wait a bit
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:016:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; scan &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;, {RAW =&amp;gt; true}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ROW&lt;&#x2F;span&gt;&lt;span&gt;                            COLUMN+CELL
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt; row(s)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Took&lt;&#x2F;span&gt;&lt;span&gt; 0.0060 seconds
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h1 id=&quot;how-does-it-works&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#how-does-it-works&quot; aria-label=&quot;Anchor link for: how-does-it-works&quot;&gt;🔗&lt;&#x2F;a&gt;How does it work?&lt;&#x2F;h1&gt;
&lt;p&gt;As always, the truth is held by the documentation:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A {row, column, version} tuple exactly specifies a cell in HBase. It’s possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;While rows and column keys are expressed as bytes, &lt;strong&gt;the version is specified using a long integer&lt;&#x2F;strong&gt;. Typically &lt;strong&gt;this long contains time instances&lt;&#x2F;strong&gt; such as those returned by java.util.Date.getTime() or &lt;strong&gt;System.currentTimeMillis()&lt;&#x2F;strong&gt;,&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#versions&quot;&gt;HBase Book: Versions&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;You may have noticed it in the earlier output: there is a &lt;strong&gt;timestamp associated&lt;&#x2F;strong&gt; with the version of the cell:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:003:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; get &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;row123&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;cf1:desc&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;COLUMN&lt;&#x2F;span&gt;&lt;span&gt;                             CELL
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;cf1:desc&lt;&#x2F;span&gt;&lt;span&gt;                          timestamp=1558366581134, value=TTL Demo
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;#                           here  ^^^^^^^^^^^^^^^^^^^^^^^ 
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;HBase used &lt;code&gt;System.currentTimeMillis()&lt;&#x2F;code&gt; at ingest time to set it. During scans and compactions, as time went by, &lt;strong&gt;more than TTL seconds elapsed between the cell version and the current time, so the row was discarded&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Now the real question is: &lt;strong&gt;can you set it yourself and be a real Time Lord&lt;&#x2F;strong&gt; (of HBase)?&lt;&#x2F;p&gt;
&lt;p&gt;The answer is &lt;em&gt;yes!&lt;&#x2F;em&gt; There is also a warning &lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#_explicit_version_example&quot;&gt;a bit further down:&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Caution:&lt;&#x2F;em&gt; the version timestamp is used internally by HBase for things like &lt;strong&gt;time-to-live calculations&lt;&#x2F;strong&gt;. It’s usually best to avoid setting this timestamp yourself. Prefer using a separate timestamp attribute of the row, or have the timestamp as a part of the row key, or both.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Let&#x27;s try it:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;date&lt;&#x2F;span&gt;&lt;span&gt; +%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;s -d &lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;+2 min&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;1558472441  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;# don&amp;#39;t forget to append 3 zeroes, as the time needs to be in milliseconds!
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;.&#x2F;bin&#x2F;hbase&lt;&#x2F;span&gt;&lt;span&gt; shell
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:001:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; put &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;row1234&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;,&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;cf1:desc&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;, &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;timestamp Demo&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;, 1558472441000  
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;hbase&lt;&#x2F;span&gt;&lt;span&gt;(main)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;:044:0&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; scan &amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;test_table&lt;&#x2F;span&gt;&lt;span&gt;&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;ROW&lt;&#x2F;span&gt;&lt;span&gt;                            COLUMN+CELL
&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;row1234&lt;&#x2F;span&gt;&lt;span&gt;                       column=cf1:desc, timestamp=1558473315, value=timestamp Demo
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt; row(s)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;Took&lt;&#x2F;span&gt;&lt;span&gt; 0.0031 seconds
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice the timestamp at the end of the &lt;code&gt;put&lt;&#x2F;code&gt; command? It &lt;strong&gt;sets the desired timestamp as the cell version&lt;&#x2F;strong&gt;, which means that &lt;strong&gt;your application can control when your version will be removed, even with a TTL on your column family.&lt;&#x2F;strong&gt; You just need to compute a timestamp like this:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;ts = now - ttlCF + desiredTTL&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
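&lt;p&gt;As a rough illustration of this formula, here is a minimal Java sketch. The class and method names (&lt;code&gt;TtlTimestamp&lt;&#x2F;code&gt;, &lt;code&gt;versionFor&lt;&#x2F;code&gt;, &lt;code&gt;expiresAt&lt;&#x2F;code&gt;) are hypothetical and are not part of HBase. The only assumption is the one visible in the shell session above: cell versions are in milliseconds, while the column family TTL is in seconds.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;public final class TtlTimestamp {

    // ts = now - ttlCF + desiredTTL, with both TTLs converted from seconds to milliseconds.
    public static long versionFor(long nowMillis, long cfTtlSeconds, long desiredTtlSeconds) {
        return nowMillis - cfTtlSeconds * 1000L + desiredTtlSeconds * 1000L;
    }

    // A cell is filtered out once the column family TTL has elapsed since its version.
    public static long expiresAt(long versionMillis, long cfTtlSeconds) {
        return versionMillis + cfTtlSeconds * 1000L;
    }

    public static void main(String[] args) {
        long now = 1558472441000L;            // hypothetical current time in milliseconds
        long ts = versionFor(now, 30L, 120L); // column family TTL: 30 s, desired lifetime: 120 s
        System.out.println(ts);                // 1558472531000
        System.out.println(expiresAt(ts, 30L)); // 1558472561000, i.e. now + 120 s
    }
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With a column family TTL of 30 seconds and a desired lifetime of 120 seconds, the computed version lies 90 seconds in the future, so the cell expires exactly 120 seconds from now. The caution from the HBase book quoted above still applies.&lt;&#x2F;p&gt;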
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article, I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">hbase</category>
          <category domain="tag">storage</category>
          <category domain="tag">expiration</category>
      </item>
      <item>
          <title>Handling OVH&#x27;s alerts with Apache Flink</title>
          <pubDate>Sun, 03 Feb 2019 15:37:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/ovh-alerts-flink/</link>
          <guid>https://pierrezemb.fr/posts/ovh-alerts-flink/</guid>
          <description xml:base="https://pierrezemb.fr/posts/ovh-alerts-flink/">&lt;p&gt;This is a repost from &lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;blog&#x2F;handling-ovhs-alerts-with-apache-flink&#x2F;&quot; title=&quot;Permalink to Handling OVH&amp;#39;s alerts with Apache Flink&quot;&gt;OVH&#x27;s official blog post&lt;&#x2F;a&gt;. Thanks to &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;LostInBrittany&#x2F;&quot;&gt;Horacio Gonzalez&lt;&#x2F;a&gt; for the awesome drawings!&lt;&#x2F;p&gt;
&lt;h1 id=&quot;handling-ovh-s-alerts-with-apache-flink&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#handling-ovh-s-alerts-with-apache-flink&quot; aria-label=&quot;Anchor link for: handling-ovh-s-alerts-with-apache-flink&quot;&gt;🔗&lt;&#x2F;a&gt;Handling OVH&#x27;s alerts with Apache Flink&lt;&#x2F;h1&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;blog&#x2F;wp-content&#x2F;uploads&#x2F;2019&#x2F;01&#x2F;001-1.png?x70472&quot; alt=&quot;OVH &amp;amp; Apache Flink&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;OVH relies extensively on &lt;strong&gt;metrics&lt;&#x2F;strong&gt; to effectively monitor its entire stack. Whether they are &lt;strong&gt;low-level&lt;&#x2F;strong&gt; or &lt;strong&gt;business&lt;&#x2F;strong&gt;-centric, they allow teams to gain &lt;strong&gt;insight&lt;&#x2F;strong&gt; into how our services are operating on a daily basis. The need to store &lt;strong&gt;millions of datapoints per second&lt;&#x2F;strong&gt; led us to create a dedicated team to build and operate a product to handle that load: &lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;data-platforms&#x2F;metrics&#x2F;&quot;&gt;&lt;strong&gt;Metrics Data Platform&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;. By relying on &lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;&quot;&gt;&lt;strong&gt;Apache HBase&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;&quot;&gt;&lt;strong&gt;Apache Kafka&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;www.warp10.io&#x2F;&quot;&gt;&lt;strong&gt;Warp 10&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;, we succeeded in creating a fully distributed platform that is handling all our metrics… and yours!&lt;&#x2F;p&gt;
&lt;p&gt;After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: &lt;strong&gt;alerting.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;meet-omni-our-alerting-layer&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#meet-omni-our-alerting-layer&quot; aria-label=&quot;Anchor link for: meet-omni-our-alerting-layer&quot;&gt;🔗&lt;&#x2F;a&gt;Meet OMNI, our alerting layer&lt;&#x2F;h2&gt;
&lt;p&gt;OMNI is our code name for a &lt;strong&gt;fully distributed&lt;&#x2F;strong&gt;, &lt;strong&gt;as-code&lt;&#x2F;strong&gt;, &lt;strong&gt;alerting&lt;&#x2F;strong&gt; system that we developed on top of Metrics. It is split into two components:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The management part&lt;&#x2F;strong&gt;, which takes the alert definitions stored in a Git repository and represents them as continuous queries,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;The query executor&lt;&#x2F;strong&gt;, scheduling your queries in a distributed way.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The query executor is pushing the query results into Kafka, ready to be handled! We now need to perform all the tasks that an alerting system does:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Handling alerts &lt;strong&gt;deduplication&lt;&#x2F;strong&gt; and &lt;strong&gt;grouping&lt;&#x2F;strong&gt;, to avoid &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Alarm_fatigue&quot;&gt;alert fatigue.&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Handling &lt;strong&gt;escalation&lt;&#x2F;strong&gt; steps, &lt;strong&gt;acknowledgement&lt;&#x2F;strong&gt; or &lt;strong&gt;snooze&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Notifying&lt;&#x2F;strong&gt; the end user through different &lt;strong&gt;channels&lt;&#x2F;strong&gt;: SMS, mail, push notifications, …&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;To handle that, we looked at open-source projects such as &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;prometheus&#x2F;alertmanager&quot;&gt;Prometheus AlertManager&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;engineering.linkedin.com&#x2F;blog&#x2F;2017&#x2F;06&#x2F;open-sourcing-iris-and-oncall&quot;&gt;LinkedIn Iris&lt;&#x2F;a&gt;, and we discovered the &lt;em&gt;hidden&lt;&#x2F;em&gt; truth:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Handling alerts as streams of data,&lt;br &#x2F;&gt;
moving from one operator to another.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;We embraced it, and decided to leverage &lt;a href=&quot;https:&#x2F;&#x2F;flink.apache.org&#x2F;&quot;&gt;Apache Flink&lt;&#x2F;a&gt; to create &lt;strong&gt;Beacon&lt;&#x2F;strong&gt;. In the next section we are going to describe the architecture of Beacon, and how we built and operate it.&lt;&#x2F;p&gt;
&lt;p&gt;If you want more information on Apache Flink, we suggest reading the introductory article on the official website: &lt;a href=&quot;https:&#x2F;&#x2F;flink.apache.org&#x2F;flink-architecture.html&quot;&gt;What is Apache Flink?&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;beacon-architecture&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#beacon-architecture&quot; aria-label=&quot;Anchor link for: beacon-architecture&quot;&gt;🔗&lt;&#x2F;a&gt;&lt;strong&gt;Beacon architecture&lt;&#x2F;strong&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;At its core, Beacon is reading events from &lt;strong&gt;Kafka&lt;&#x2F;strong&gt;. Everything is represented as a &lt;strong&gt;message&lt;&#x2F;strong&gt;, from alerts to aggregation rules, snooze orders and so on. The pipeline is divided into two branches:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;One that is running the &lt;strong&gt;aggregations&lt;&#x2F;strong&gt;, and triggering notifications based on customer&#x27;s rules.&lt;&#x2F;li&gt;
&lt;li&gt;One that is handling the &lt;strong&gt;escalation steps&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Then everything is merged to &lt;strong&gt;generate&lt;&#x2F;strong&gt; &lt;strong&gt;a&lt;&#x2F;strong&gt; &lt;strong&gt;notification&lt;&#x2F;strong&gt;, which is going to be forwarded to the right person. A notification message is pushed into Kafka, where it will be consumed by another component called &lt;strong&gt;beacon-notifier.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;handling-states&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#handling-states&quot; aria-label=&quot;Anchor link for: handling-states&quot;&gt;🔗&lt;&#x2F;a&gt;Handling States&lt;&#x2F;h2&gt;
&lt;p&gt;If you are new to streaming architecture, I recommend reading &lt;a href=&quot;https:&#x2F;&#x2F;ci.apache.org&#x2F;projects&#x2F;flink&#x2F;flink-docs-release-1.7&#x2F;concepts&#x2F;programming-model.html&quot;&gt;Dataflow Programming Model&lt;&#x2F;a&gt; from the official Flink documentation.&lt;&#x2F;p&gt;
&lt;p&gt;Everything is merged into a DataStream, &lt;strong&gt;partitioned&lt;&#x2F;strong&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;r&#x2F;?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.7%2Fdev%2Fstream%2Fstate%2Fstate.html%23keyed-state&quot;&gt;keyed by&lt;&#x2F;a&gt; in the Flink API) by users. Here&#x27;s an example:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;final &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;DataStream&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; alertStream =
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Partitioning Stream per AlertIdentifier
&lt;&#x2F;span&gt;&lt;span&gt;      cleanedAlertsStream.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;keyBy&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Applying a Map Operation which is setting since when an alert is triggered
&lt;&#x2F;span&gt;&lt;span&gt;      .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;map&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;SetSinceOnSelector&lt;&#x2F;span&gt;&lt;span&gt;())
&lt;&#x2F;span&gt;&lt;span&gt;      .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;name&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;setting-since-on-selector&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;uid&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;setting-since-on-selector&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;)
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Partitioning again Stream per AlertIdentifier
&lt;&#x2F;span&gt;&lt;span&gt;      .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;keyBy&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#d08770;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Applying another Map Operation which is setting State and Trend
&lt;&#x2F;span&gt;&lt;span&gt;      .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;map&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ebcb8b;&quot;&gt;SetStateAndTrend&lt;&#x2F;span&gt;&lt;span&gt;())
&lt;&#x2F;span&gt;&lt;span&gt;      .&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;name&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;setting-state&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;).&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bf616a;&quot;&gt;uid&lt;&#x2F;span&gt;&lt;span&gt;(&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;setting-state&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the example above, we are chaining two keyed operations:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SetSinceOnSelector&lt;&#x2F;strong&gt;, which sets &lt;strong&gt;since&lt;&#x2F;strong&gt; when the alert has been triggered&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;SetStateAndTrend&lt;&#x2F;strong&gt;, which sets the &lt;strong&gt;state&lt;&#x2F;strong&gt; (ONGOING, RECOVERY or OK) and the &lt;strong&gt;trend&lt;&#x2F;strong&gt; (do we have more or fewer metrics in error).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
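&lt;p&gt;To give an intuition of what a keyed operation such as the first one does, here is a minimal sketch in plain Java. It does not use the Flink API, and it is not the actual &lt;code&gt;SetSinceOnSelector&lt;&#x2F;code&gt; code, which is not shown in this post. The &lt;code&gt;SinceTracker&lt;&#x2F;code&gt; class, its &lt;code&gt;update&lt;&#x2F;code&gt; method and the alert identifiers are hypothetical, and a simple map stands in for the per-key state that Flink manages and checkpoints for you:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;java&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-java &quot;&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;import java.util.HashMap;
import java.util.Map;

public final class SinceTracker {

    // Hypothetical stand-in for keyed state: one "since" timestamp per alert identifier.
    private final Map&amp;lt;String, Long&amp;gt; sinceByAlert = new HashMap&amp;lt;&amp;gt;();

    // Returns the time at which the alert started firing, or -1 once it is no longer firing.
    public long update(String alertId, boolean firing, long eventTimeMillis) {
        if (!firing) {
            sinceByAlert.remove(alertId); // clear the state for this key only
            return -1L;
        }
        // Keep the first timestamp seen for this key, ignore later ones.
        return sinceByAlert.computeIfAbsent(alertId, id -&amp;gt; eventTimeMillis);
    }

    public static void main(String[] args) {
        SinceTracker tracker = new SinceTracker();
        System.out.println(tracker.update("cpu-high", true, 1000L));  // 1000
        System.out.println(tracker.update("cpu-high", true, 2000L));  // 1000
        System.out.println(tracker.update("disk-full", true, 2500L)); // 2500
        System.out.println(tracker.update("cpu-high", false, 3000L)); // -1
        System.out.println(tracker.update("cpu-high", true, 4000L));  // 4000
    }
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each alert identifier only ever sees its own entry, which is the property that partitioning the stream by key provides in a distributed setting.&lt;&#x2F;p&gt;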
&lt;p&gt;Each of these classes is under 120 lines of code because Flink is &lt;strong&gt;handling all the difficulties&lt;&#x2F;strong&gt;. Most of the pipeline is &lt;strong&gt;only composed&lt;&#x2F;strong&gt; of &lt;strong&gt;classic transformations&lt;&#x2F;strong&gt; such as &lt;a href=&quot;https:&#x2F;&#x2F;ci.apache.org&#x2F;projects&#x2F;flink&#x2F;flink-docs-release-1.7&#x2F;dev&#x2F;stream&#x2F;operators&#x2F;&quot;&gt;Map, FlatMap, Reduce&lt;&#x2F;a&gt;, including their &lt;a href=&quot;https:&#x2F;&#x2F;ci.apache.org&#x2F;projects&#x2F;flink&#x2F;flink-docs-stable&#x2F;dev&#x2F;api_concepts.html#rich-functions&quot;&gt;Rich&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;ci.apache.org&#x2F;projects&#x2F;flink&#x2F;flink-docs-stable&#x2F;dev&#x2F;stream&#x2F;state&#x2F;state.html#using-managed-keyed-state&quot;&gt;Keyed&lt;&#x2F;a&gt; versions. We have a few &lt;a href=&quot;https:&#x2F;&#x2F;ci.apache.org&#x2F;projects&#x2F;flink&#x2F;flink-docs-release-1.7&#x2F;dev&#x2F;stream&#x2F;operators&#x2F;process_function.html&quot;&gt;Process Functions&lt;&#x2F;a&gt;, which are &lt;strong&gt;very handy&lt;&#x2F;strong&gt; to develop, for example, the escalation timer.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;integration-tests&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#integration-tests&quot; aria-label=&quot;Anchor link for: integration-tests&quot;&gt;🔗&lt;&#x2F;a&gt;Integration tests&lt;&#x2F;h2&gt;
&lt;p&gt;As the number of classes was growing, we needed to test our pipeline. Because it is only wired to Kafka, we wrapped the consumer and producer to create what we call &lt;strong&gt;scenari&lt;&#x2F;strong&gt;: a series of integration tests running different scenarios.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;queryable-state&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#queryable-state&quot; aria-label=&quot;Anchor link for: queryable-state&quot;&gt;🔗&lt;&#x2F;a&gt;Queryable state&lt;&#x2F;h2&gt;
&lt;p&gt;One killer feature of Apache Flink is the &lt;strong&gt;capability of &lt;a href=&quot;https:&#x2F;&#x2F;ci.apache.org&#x2F;projects&#x2F;flink&#x2F;flink-docs-release-1.7&#x2F;dev&#x2F;stream&#x2F;state&#x2F;queryable_state.html&quot;&gt;querying the internal state&lt;&#x2F;a&gt; of an operator&lt;&#x2F;strong&gt;. Even though it is a beta feature, it allows us to get the current state of the different parts of the job:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;which escalation step we are on&lt;&#x2F;li&gt;
&lt;li&gt;whether it is snoozed or &lt;em&gt;ack&lt;&#x2F;em&gt;-ed&lt;&#x2F;li&gt;
&lt;li&gt;which alert is ongoing&lt;&#x2F;li&gt;
&lt;li&gt;and so on.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;blog&#x2F;wp-content&#x2F;uploads&#x2F;2019&#x2F;01&#x2F;004-1.png?x70472&quot; alt=&quot;Queryable state overview&quot; &#x2F;&gt;Queryable state overview&lt;&#x2F;p&gt;
&lt;p&gt;Thanks to this, we easily developed an &lt;strong&gt;API&lt;&#x2F;strong&gt; over the queryable state, which powers our &lt;strong&gt;alerting view&lt;&#x2F;strong&gt; in &lt;a href=&quot;https:&#x2F;&#x2F;studio.metrics.ovh.net&#x2F;&quot;&gt;Metrics Studio&lt;&#x2F;a&gt;, our codename for the Web UI of the Metrics Data Platform.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;apache-flink-deployment&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#apache-flink-deployment&quot; aria-label=&quot;Anchor link for: apache-flink-deployment&quot;&gt;🔗&lt;&#x2F;a&gt;Apache Flink deployment&lt;&#x2F;h3&gt;
&lt;p&gt;We deployed the latest version of Flink (&lt;strong&gt;1.7.1&lt;&#x2F;strong&gt; at the time of writing) directly on bare metal servers with a dedicated ZooKeeper cluster using Ansible. Operating Flink has been a really nice surprise for us, with &lt;strong&gt;clear documentation and configuration&lt;&#x2F;strong&gt;, and an &lt;strong&gt;impressive resilience&lt;&#x2F;strong&gt;. We are capable of &lt;strong&gt;rebooting&lt;&#x2F;strong&gt; the whole Flink cluster, and the job &lt;strong&gt;restarts from its last saved state&lt;&#x2F;strong&gt;, like nothing happened.&lt;&#x2F;p&gt;
&lt;p&gt;We are using &lt;strong&gt;RocksDB&lt;&#x2F;strong&gt; as a state backend, backed by OpenStack &lt;strong&gt;Swift storage&lt;&#x2F;strong&gt; provided by OVH Public Cloud.&lt;&#x2F;p&gt;
&lt;p&gt;For monitoring, we are relying on &lt;a href=&quot;https:&#x2F;&#x2F;ci.apache.org&#x2F;projects&#x2F;flink&#x2F;flink-docs-stable&#x2F;monitoring&#x2F;metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter&quot;&gt;Prometheus Exporter&lt;&#x2F;a&gt; with &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ovh&#x2F;beamium&quot;&gt;Beamium&lt;&#x2F;a&gt; to gain &lt;strong&gt;observability&lt;&#x2F;strong&gt; over the job&#x27;s health.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;in-short-we-love-apache-flink&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#in-short-we-love-apache-flink&quot; aria-label=&quot;Anchor link for: in-short-we-love-apache-flink&quot;&gt;🔗&lt;&#x2F;a&gt;In short, we love Apache Flink&lt;&#x2F;h3&gt;
&lt;p&gt;If you are used to working with stream-related software, you may have realized that we did not use any rocket science or tricks. We may be relying on the basic streaming features offered by Apache Flink, but they allowed us to tackle many business and scalability problems with ease.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;blog&#x2F;wp-content&#x2F;uploads&#x2F;2019&#x2F;01&#x2F;0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png?x70472&quot; alt=&quot;Apache Flink&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As such, we highly recommend that every developer have a look at Apache Flink. I encourage you to go through &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;r&#x2F;?url=https%3A%2F%2Ftraining.da-platform.com%2F&quot;&gt;Apache Flink Training&lt;&#x2F;a&gt;, written by Data Artisans. Furthermore, the community has put a lot of effort into making Apache Flink easy to deploy on Kubernetes, so you can easily try Flink using our Managed Kubernetes!&lt;&#x2F;p&gt;
</description>
          <category domain="tag">streaming</category>
          <category domain="tag">flink</category>
          <category domain="tag">monitoring</category>
          <category domain="tag">distributed</category>
      </item>
      <item>
          <title>What are ACID transactions?</title>
          <pubDate>Sun, 03 Feb 2019 00:00:00 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/acid-transactions/</link>
          <guid>https://pierrezemb.fr/posts/acid-transactions/</guid>
          <description xml:base="https://pierrezemb.fr/posts/acid-transactions/">&lt;h1 id=&quot;transaction&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#transaction&quot; aria-label=&quot;Anchor link for: transaction&quot;&gt;🔗&lt;&#x2F;a&gt;Transaction?&lt;&#x2F;h1&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;&amp;quot;Programming should be about transforming data&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;--- Programming Elixir 1.3 by Dave Thomas&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;As developers, we often interact with data, whether it comes from an API or a messaging consumer. To store it, we created software called &lt;strong&gt;relational database management systems&lt;&#x2F;strong&gt;, or &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Relational_database_management_system&quot;&gt;RDBMS&lt;&#x2F;a&gt;. Thanks to them, we can develop applications pretty easily, &lt;strong&gt;without the need to implement our own storage solution&lt;&#x2F;strong&gt;. Interacting with &lt;a href=&quot;https:&#x2F;&#x2F;www.mysql.com&#x2F;&quot;&gt;MySQL&lt;&#x2F;a&gt; or &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;&quot;&gt;PostgreSQL&lt;&#x2F;a&gt; has now become a &lt;strong&gt;commodity&lt;&#x2F;strong&gt;. Handling a database is not that easy though, because anything can happen, from failures to concurrency issues:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;How can we interact with &lt;strong&gt;datastores that can fail?&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;What happens if two users are &lt;strong&gt;updating a value at the same time?&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As database users, we use &lt;code&gt;transactions&lt;&#x2F;code&gt; to answer these questions. For a developer, a transaction is a &lt;strong&gt;single unit of logic or work&lt;&#x2F;strong&gt;, sometimes made up of multiple operations. It is mainly an &lt;strong&gt;abstraction&lt;&#x2F;strong&gt; that we use to &lt;strong&gt;hide underlying problems&lt;&#x2F;strong&gt;, such as concurrency or hardware faults.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;ACID&lt;&#x2F;code&gt; appears in a paper published in 1983 called &lt;a href=&quot;https:&#x2F;&#x2F;sites.fas.harvard.edu&#x2F;~cs265&#x2F;papers&#x2F;haerder-1983.pdf&quot;&gt;&quot;Principles of transaction-oriented database recovery&quot;&lt;&#x2F;a&gt; written by &lt;em&gt;Theo Haerder&lt;&#x2F;em&gt; and &lt;em&gt;Andreas Reuter&lt;&#x2F;em&gt;. This paper introduces a terminology for the properties of a transaction:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A&lt;&#x2F;strong&gt;tomicity, &lt;strong&gt;C&lt;&#x2F;strong&gt;onsistency, &lt;strong&gt;I&lt;&#x2F;strong&gt;solation, &lt;strong&gt;D&lt;&#x2F;strong&gt;urability&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;atomic&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#atomic&quot; aria-label=&quot;Anchor link for: atomic&quot;&gt;🔗&lt;&#x2F;a&gt;Atomic&lt;&#x2F;h2&gt;
&lt;p&gt;As you may have guessed, &lt;code&gt;atomic&lt;&#x2F;code&gt; describes something that &lt;strong&gt;cannot be split&lt;&#x2F;strong&gt;. In the database transaction world, it means for example that if a transaction with several writes is &lt;strong&gt;started and fails&lt;&#x2F;strong&gt; at some point, &lt;strong&gt;none of the writes will be committed&lt;&#x2F;strong&gt;. As stated by many, the word &lt;code&gt;atomicity&lt;&#x2F;code&gt; could be reworded as &lt;code&gt;abortability&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
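&lt;p&gt;To make abortability concrete, here is a minimal sketch using the &lt;code&gt;sqlite3&lt;&#x2F;code&gt; module from the Python standard library. The &lt;code&gt;accounts&lt;&#x2F;code&gt; table, the account names and the &lt;code&gt;transfer&lt;&#x2F;code&gt; helper are made up for the illustration. The only point is that when a transaction fails between two writes, neither write is kept.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import sqlite3

# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount, fail=False):
    """Move `amount` from src to dst as one transaction (two writes)."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail:
                raise RuntimeError("simulated failure between the two writes")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
    except RuntimeError:
        pass  # the transaction was aborted: the first write is gone too


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


transfer(conn, "alice", "bob", 30, fail=True)
after_failure = balances(conn)   # nothing changed

transfer(conn, "alice", "bob", 30)
after_success = balances(conn)   # both writes were applied together
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After the failed transfer, the debit is not visible: the database is back in the state it had before the transaction started.&lt;&#x2F;p&gt;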
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;consistency&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#consistency&quot; aria-label=&quot;Anchor link for: consistency&quot;&gt;🔗&lt;&#x2F;a&gt;Consistency&lt;&#x2F;h2&gt;
&lt;p&gt;You will hear about &lt;code&gt;consistency&lt;&#x2F;code&gt; a lot in this series. Unfortunately, this word is used in many different contexts. In the ACID definition, it refers to the fact that a transaction will &lt;strong&gt;bring the database from one valid state to another.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;isolation&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#isolation&quot; aria-label=&quot;Anchor link for: isolation&quot;&gt;🔗&lt;&#x2F;a&gt;Isolation&lt;&#x2F;h2&gt;
&lt;p&gt;Think back to your database. Were you the only user on it? I don&#x27;t think so. There were probably concurrent transactions running at the same time as yours. &lt;strong&gt;Isolation while keeping good performance is the most difficult item on the list.&lt;&#x2F;strong&gt; There is a lot of literature about it, and we will only scratch the surface. There are different transaction isolation levels, depending on the guarantees provided.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;isolation-by-the-theory&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#isolation-by-the-theory&quot; aria-label=&quot;Anchor link for: isolation-by-the-theory&quot;&gt;🔗&lt;&#x2F;a&gt;Isolation by the theory&lt;&#x2F;h3&gt;
&lt;p&gt;The SQL standard defines four isolation levels: &lt;code&gt;Serializable&lt;&#x2F;code&gt;, &lt;code&gt;Repeatable Read&lt;&#x2F;code&gt;, &lt;code&gt;Read Committed&lt;&#x2F;code&gt; and &lt;code&gt;Read Uncommitted&lt;&#x2F;code&gt;. The strongest isolation is &lt;code&gt;Serializable&lt;&#x2F;code&gt;, where transactions behave &lt;strong&gt;as if they were not run in parallel&lt;&#x2F;strong&gt;. As you may have guessed, it is also the slowest. &lt;strong&gt;Weaker isolation levels trade anomalies for speed&lt;&#x2F;strong&gt;, which can be summed up like this:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Isolation level&lt;&#x2F;th&gt;&lt;th&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Isolation_(database_systems)#Dirty_reads&quot;&gt;dirty reads&lt;&#x2F;a&gt;&lt;&#x2F;th&gt;&lt;th&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Isolation_%28database_systems%29#Non-repeatable_reads&quot;&gt;Non-repeatable reads&lt;&#x2F;a&gt;&lt;&#x2F;th&gt;&lt;th&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Isolation_(database_systems)#Phantom_reads&quot;&gt;Phantom reads&lt;&#x2F;a&gt;&lt;&#x2F;th&gt;&lt;th&gt;Performance&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Serializable&lt;&#x2F;td&gt;&lt;td&gt;😎&lt;&#x2F;td&gt;&lt;td&gt;😎&lt;&#x2F;td&gt;&lt;td&gt;😎&lt;&#x2F;td&gt;&lt;td&gt;👍&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Repeatable Read&lt;&#x2F;td&gt;&lt;td&gt;😎&lt;&#x2F;td&gt;&lt;td&gt;😎&lt;&#x2F;td&gt;&lt;td&gt;😱&lt;&#x2F;td&gt;&lt;td&gt;👍👍&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Read Committed&lt;&#x2F;td&gt;&lt;td&gt;😎&lt;&#x2F;td&gt;&lt;td&gt;😱&lt;&#x2F;td&gt;&lt;td&gt;😱&lt;&#x2F;td&gt;&lt;td&gt;👍👍👍&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Read Uncommitted&lt;&#x2F;td&gt;&lt;td&gt;😱&lt;&#x2F;td&gt;&lt;td&gt;😱&lt;&#x2F;td&gt;&lt;td&gt;😱&lt;&#x2F;td&gt;&lt;td&gt;👍👍👍👍&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;blockquote&gt;
&lt;p&gt;I encourage you to click on all the links within the table to &lt;strong&gt;see everything that could go wrong in a weak database!&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;isolation-in-real-databases&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#isolation-in-real-databases&quot; aria-label=&quot;Anchor link for: isolation-in-real-databases&quot;&gt;🔗&lt;&#x2F;a&gt;Isolation in Real Databases&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we have seen some theory, let&#x27;s have a look at a well-known database: PostgreSQL. What kind of isolation does PostgreSQL offer?&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;PostgreSQL provides a rich set of tools for developers to manage concurrent access to data. Internally, data consistency is maintained by using a multiversion model (&lt;strong&gt;Multiversion Concurrency Control, MVCC&lt;&#x2F;strong&gt;).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;--- &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;mvcc-intro.html&quot;&gt;Concurrency Control introduction&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Wait, what? What is MVCC? Well, it turns out that another type of isolation came after the SQL standard: &lt;strong&gt;Snapshot Isolation&lt;&#x2F;strong&gt;. Instead of locking a row for reading when somebody starts working on it, it ensures that &lt;strong&gt;any transaction will see a version of the data that corresponds to the start of the query&lt;&#x2F;strong&gt;. As it provides a good balance between &lt;strong&gt;performance and consistency&lt;&#x2F;strong&gt;, it became &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;List_of_databases_using_MVCC&quot;&gt;a standard used by the industry&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;durability&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#durability&quot; aria-label=&quot;Anchor link for: durability&quot;&gt;🔗&lt;&#x2F;a&gt;Durability&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;code&gt;Durability&lt;&#x2F;code&gt; ensures that your database is a &lt;strong&gt;safe place&lt;&#x2F;strong&gt; where data can be stored without fear of losing it. If a transaction has committed successfully, any written data will not be forgotten.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;that-s-it&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#that-s-it&quot; aria-label=&quot;Anchor link for: that-s-it&quot;&gt;🔗&lt;&#x2F;a&gt;That&#x27;s it?&lt;&#x2F;h1&gt;
&lt;p&gt;&lt;strong&gt;All these properties may seem obvious to you&lt;&#x2F;strong&gt;, but each item involves a lot of engineering and research. And this is only valid for a single machine: &lt;strong&gt;the distributed transaction field&lt;&#x2F;strong&gt; is even more complicated, but we will get to it in another blog post!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article. I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">transactions</category>
          <category domain="tag">sql</category>
          <category domain="tag">storage</category>
      </item>
      <item>
          <title>Hbase Data Model</title>
          <pubDate>Sun, 27 Jan 2019 20:24:27 +0100</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/hbase-data-model/</link>
          <guid>https://pierrezemb.fr/posts/hbase-data-model/</guid>
          <description xml:base="https://pierrezemb.fr/posts/hbase-data-model/">&lt;h2 id=&quot;hbase&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#hbase&quot; aria-label=&quot;Anchor link for: hbase&quot;&gt;🔗&lt;&#x2F;a&gt;HBase?&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;hbase-data-model&#x2F;hbase.jpg&quot; alt=&quot;hbase image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;&quot;&gt;Apache HBase™&lt;&#x2F;a&gt; is a type of &quot;NoSQL&quot; database. &quot;NoSQL&quot; is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language. Technically speaking, HBase is really more a &quot;Data Store&quot; than &quot;Data Base&quot; because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;-- &lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#arch.overview.nosql&quot;&gt;HBase architecture overview&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hbase-data-model&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#hbase-data-model&quot; aria-label=&quot;Anchor link for: hbase-data-model&quot;&gt;🔗&lt;&#x2F;a&gt;HBase data model&lt;&#x2F;h2&gt;
&lt;p&gt;The data model is simple. It is like a multi-dimensional map:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Elements are stored as &lt;strong&gt;rows&lt;&#x2F;strong&gt; in a &lt;strong&gt;table&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Each table has only &lt;strong&gt;one index, the row key&lt;&#x2F;strong&gt;. There are no secondary indices.&lt;&#x2F;li&gt;
&lt;li&gt;Rows are &lt;strong&gt;sorted lexicographically by row key&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;A range of rows is called a &lt;strong&gt;region&lt;&#x2F;strong&gt;. It is similar to a shard.&lt;&#x2F;li&gt;
&lt;li&gt;A row in HBase consists of a &lt;strong&gt;row key&lt;&#x2F;strong&gt; and &lt;strong&gt;one or more columns&lt;&#x2F;strong&gt;, which hold the cells.&lt;&#x2F;li&gt;
&lt;li&gt;Values are stored in what we call a &lt;strong&gt;cell&lt;&#x2F;strong&gt; and are versioned with a timestamp.&lt;&#x2F;li&gt;
&lt;li&gt;A column is identified by a &lt;strong&gt;Column Family&lt;&#x2F;strong&gt; and a &lt;strong&gt;Column Qualifier&lt;&#x2F;strong&gt;. Long story short, a Column Family is kind of like a column in classic SQL, and a qualifier is a sub-structure inside a Column Family. A Column Family is &lt;strong&gt;static&lt;&#x2F;strong&gt;: you need to define it when creating the table, whereas Column Qualifiers can be created on the fly.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Not as easy as you thought? Here&#x27;s an example! Let&#x27;s say that we&#x27;re trying to &lt;strong&gt;save the whole internet&lt;&#x2F;strong&gt;. To do this, we need to store the content of each page and version it. We can use &lt;strong&gt;the page address as the row key&lt;&#x2F;strong&gt; and store the contents in a &lt;strong&gt;column family called &quot;contents&quot;&lt;&#x2F;strong&gt;. Nowadays, website &lt;strong&gt;contents can be anything&lt;&#x2F;strong&gt;, from an HTML file to a binary such as a PDF. To handle that, we can create as many &lt;strong&gt;qualifiers&lt;&#x2F;strong&gt; as we want, such as &quot;content:html&quot; or &quot;content:video&quot;.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;json&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-json &quot;&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;fr.pierrezemb.www&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: {          &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Row key
&lt;&#x2F;span&gt;&lt;span&gt;    &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;contents&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: {                 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Column family
&lt;&#x2F;span&gt;&lt;span&gt;      &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;content:html&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: {           &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Column qualifier
&lt;&#x2F;span&gt;&lt;span&gt;        &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;2017-01-01&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;:             &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; A timestamp
&lt;&#x2F;span&gt;&lt;span&gt;          &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;&amp;lt;html&amp;gt;...&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;,            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; The actual value
&lt;&#x2F;span&gt;&lt;span&gt;        &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;2016-01-01&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;:             &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Another timestamp
&lt;&#x2F;span&gt;&lt;span&gt;          &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;&amp;lt;html&amp;gt;...&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;             &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Another cell
&lt;&#x2F;span&gt;&lt;span&gt;      },
&lt;&#x2F;span&gt;&lt;span&gt;      &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;content:pdf&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: {            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; Another Column qualifier
&lt;&#x2F;span&gt;&lt;span&gt;        &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;2015-01-01&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;: 
&lt;&#x2F;span&gt;&lt;span&gt;          &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#a3be8c;&quot;&gt;&amp;lt;pdf&amp;gt;...&lt;&#x2F;span&gt;&lt;span&gt;&amp;quot;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#65737e;&quot;&gt;&#x2F;&#x2F; my website may only have contained a PDF in 2015
&lt;&#x2F;span&gt;&lt;span&gt;      }
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;  }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
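&lt;p&gt;The same structure can be sketched with nested Python dictionaries, which makes the &quot;multi-dimensional map&quot; idea easy to play with. This is only a mental model: real HBase stores bytes and uses numeric timestamps, and the &lt;code&gt;get_latest&lt;&#x2F;code&gt; helper below is hypothetical, not part of any HBase client.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# row key, then column family, then column qualifier, then timestamp, then value
table = {
    "fr.pierrezemb.www": {
        "contents": {
            "content:html": {
                "2017-01-01": "html, 2017 version",
                "2016-01-01": "html, 2016 version",
            },
            "content:pdf": {
                "2015-01-01": "pdf, 2015 version",
            },
        }
    }
}


def get_latest(table, row_key, family, qualifier):
    """Return (timestamp, value) for the most recent version of a cell."""
    versions = table[row_key][family][qualifier]
    latest = max(versions)  # ISO dates sort chronologically as plain strings
    return latest, versions[latest]


latest_html = get_latest(table, "fr.pierrezemb.www", "contents", "content:html")
latest_pdf = get_latest(table, "fr.pierrezemb.www", "contents", "content:pdf")
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;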
&lt;h2 id=&quot;key-design&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#key-design&quot; aria-label=&quot;Anchor link for: key-design&quot;&gt;🔗&lt;&#x2F;a&gt;Key design&lt;&#x2F;h2&gt;
&lt;p&gt;HBase is most efficient when a query fetches a &lt;strong&gt;single row key&lt;&#x2F;strong&gt; or scans a &lt;strong&gt;row range&lt;&#x2F;strong&gt;, i.e. a block of contiguous data, because rows are &lt;strong&gt;sorted lexicographically by row key&lt;&#x2F;strong&gt;. For example, the rows for my websites &lt;code&gt;fr.pierrezemb.www&lt;&#x2F;code&gt; and &lt;code&gt;org.pierrezemb.www&lt;&#x2F;code&gt; would not be &quot;near&quot; each other.&lt;&#x2F;p&gt;
&lt;p&gt;As such, the &lt;strong&gt;key design&lt;&#x2F;strong&gt; is really important:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;If your data is too spread out, range scans have to touch many regions and you will have poor performance.&lt;&#x2F;li&gt;
&lt;li&gt;If your data is too collocated, a few regions receive most of the traffic and you will also have poor performance.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As stated by the official &lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#rowkey.design&quot;&gt;documentation&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Hotspotting occurs when a &lt;strong&gt;large amount of client traffic is directed at one node, or only a few nodes, of a cluster&lt;&#x2F;strong&gt;. This traffic may represent reads, writes, or other operations. The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;As you may have guessed, this is why we are using the &lt;strong&gt;reversed address&lt;&#x2F;strong&gt; in my example: &lt;code&gt;www&lt;&#x2F;code&gt; is too generic a prefix, and we would have hotspotted the poor region holding data for &lt;code&gt;www&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
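&lt;p&gt;Here is a small sketch of that key design in Python. The &lt;code&gt;row_key&lt;&#x2F;code&gt; helper and the list of hostnames are made up for the illustration. Sorting the keys the way HBase sorts rows shows that pages of the same domain end up next to each other, instead of everything piling up under &lt;code&gt;www&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def row_key(hostname):
    """Reverse the labels of a hostname: www.pierrezemb.fr becomes fr.pierrezemb.www."""
    return ".".join(reversed(hostname.split(".")))


# Hypothetical hostnames, only here to show the resulting order.
hosts = [
    "www.pierrezemb.fr",
    "blog.pierrezemb.fr",
    "www.pierrezemb.org",
    "www.example.fr",
]

# Plain string sorting stands in for the lexicographic order of row keys.
keys = sorted(row_key(host) for host in hosts)
for key in keys:
    print(key)
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The two &lt;code&gt;fr.pierrezemb&lt;&#x2F;code&gt; keys are adjacent, so reading a whole site is a short range scan, while traffic for unrelated domains falls in other parts of the key space.&lt;&#x2F;p&gt;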
&lt;p&gt;If you are curious about HBase schema design, you should have a look at &lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;bigtable&#x2F;docs&#x2F;schema-design&quot;&gt;Designing Your BigTable Schema&lt;&#x2F;a&gt;, as BigTable is kind of the proprietary version of HBase.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;be-warned&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#be-warned&quot; aria-label=&quot;Anchor link for: be-warned&quot;&gt;🔗&lt;&#x2F;a&gt;Be warned&lt;&#x2F;h2&gt;
&lt;p&gt;I have been working with HBase for the past three years, &lt;strong&gt;including operations and on-call duty.&lt;&#x2F;strong&gt; It is a really nice data store, but it diverges from a classical RDBMS. Here are some warnings extracted from the well-written documentation:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;HBase is really more a &quot;Data Store&quot; than &quot;Data Base&quot; because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc. However, HBase has many features which supports both linear and modular scaling.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;-- &lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#arch.overview.nosql&quot;&gt;NoSQL?&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand&#x2F;million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;-- &lt;a href=&quot;https:&#x2F;&#x2F;hbase.apache.org&#x2F;book.html#arch.overview.when&quot;&gt;When Should I Use HBase?&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Thank you&lt;&#x2F;strong&gt; for reading my post! Feel free to react to this article. I am also available on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt; if needed.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">distributed</category>
          <category domain="tag">hbase</category>
          <category domain="tag">storage</category>
          <category domain="tag">design</category>
      </item>
      <item>
          <title>Introducing HelloExoWorld: The quest to discover exoplanets with Warp10 and Tensorflow</title>
          <pubDate>Wed, 11 Oct 2017 10:23:11 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow/</link>
          <guid>https://pierrezemb.fr/posts/introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow/</guid>
          <description xml:base="https://pierrezemb.fr/posts/introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow/">&lt;p&gt;&lt;strong&gt;Update 2019:&lt;&#x2F;strong&gt; this is a repost on my own blog. The original article can be read on &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;helloexoworld&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow-e50f6e669915&quot;&gt;Medium&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;1.jpeg&quot; alt=&quot;image&quot; &#x2F;&gt;
&lt;em&gt;Artist’s impression of the super-Earth exoplanet LHS 1140b By &lt;a href=&quot;https:&#x2F;&#x2F;www.eso.org&#x2F;public&#x2F;images&#x2F;eso1712a&#x2F;&quot;&gt;ESO&#x2F;spaceengine.org&lt;&#x2F;a&gt; — &lt;a href=&quot;http:&#x2F;&#x2F;creativecommons.org&#x2F;licenses&#x2F;by&#x2F;4.0&quot;&gt;CC BY 4.0&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;My passion for programming came kind of late: I typed my first line of code at engineering school. It then became a &lt;strong&gt;passion&lt;&#x2F;strong&gt;, something I’m willing to do at work, in my free time, at night or on the weekend. But before discovering C and other languages, I had another passion: &lt;strong&gt;astronomy&lt;&#x2F;strong&gt;. Every summer, I took part in the &lt;a href=&quot;https:&#x2F;&#x2F;www.afastronomie.fr&#x2F;les-nuits-des-etoiles&quot;&gt;&lt;strong&gt;Nuit des Etoiles&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;, a &lt;strong&gt;global French event&lt;&#x2F;strong&gt; organized by numerous astronomy clubs, offering several hundred free activity sites (between 300 and 500 depending on the year) for the general public.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;2.png&quot; alt=&quot;image&quot; &#x2F;&gt;
&lt;em&gt;As you can see, I was &lt;strong&gt;kind of young at the time&lt;&#x2F;strong&gt;!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The sad truth is that I didn’t do any astronomy during my studies. Now, &lt;strong&gt;I want to get back to it and look at the sky again&lt;&#x2F;strong&gt;. There were two obstacles:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The price of equipment&lt;&#x2F;li&gt;
&lt;li&gt;The local weather&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;I was looking for something that would unite my two passions: computing and astronomy&lt;&#x2F;strong&gt;. So I started googling:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;3.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I found a lot of amazing projects using Raspberry Pis, but I didn’t find anything that would &lt;strong&gt;keep me motivated&lt;&#x2F;strong&gt; over time. So I started trying other, more work-related keywords, such as &lt;em&gt;&lt;strong&gt;time series&lt;&#x2F;strong&gt;&lt;&#x2F;em&gt; or &lt;em&gt;&lt;strong&gt;analytics&lt;&#x2F;strong&gt;&lt;&#x2F;em&gt;. I found many papers related to astrophysics, and two keywords kept coming back: &lt;strong&gt;exoplanet detection&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-is-an-exoplanet-and-how-to-detect-it&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-is-an-exoplanet-and-how-to-detect-it&quot; aria-label=&quot;Anchor link for: what-is-an-exoplanet-and-how-to-detect-it&quot;&gt;🔗&lt;&#x2F;a&gt;What is an exoplanet and how to detect it?&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s quote our good old friend &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Exoplanet&quot;&gt;&lt;strong&gt;Wikipedia&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;An exoplanet or extrasolar planet is a planet outside of our solar system that orbits a star.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Do you know how many exoplanets have been discovered? &lt;a href=&quot;https:&#x2F;&#x2F;exoplanetarchive.ipac.caltech.edu&#x2F;&quot;&gt;&lt;strong&gt;3,529 confirmed planets&lt;&#x2F;strong&gt; as of 10&#x2F;09&#x2F;2017&lt;&#x2F;a&gt;. I was amazed by the number of them. I started digging into the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Methods_of_detecting_exoplanets&quot;&gt;&lt;strong&gt;detection methods&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;. It turns out there is one heavily used method, called &lt;strong&gt;the transit method&lt;&#x2F;strong&gt;. It’s like an eclipse: when the exoplanet passes in front of the star, the measured brightness of the star dips during the transit, as shown below:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;4.gif&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Animation illustrating how a dip in the observed brightness of a star may indicate the presence of an exoplanet. &lt;em&gt;&lt;strong&gt;Credits: NASA’s Goddard Space Flight Center&lt;&#x2F;strong&gt;&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To recap, exoplanet detection using the transit method is in reality a &lt;strong&gt;time series analysis problem&lt;&#x2F;strong&gt;. As I’m starting to be familiar with that type of analytics thanks to my current work at OVH in &lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;data-platforms&#x2F;metrics&#x2F;&quot;&gt;&lt;strong&gt;Metrics Data Platform&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;, I wanted to give it a try.&lt;&#x2F;p&gt;
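&lt;p&gt;As a toy illustration of why this is a time series problem, here is a sketch in plain Python. The brightness values are synthetic and made up for the example, and real pipelines are far more involved (noise removal, search for periodic dips). It only shows the basic idea: compare each sample to a baseline and look for the dip.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;from statistics import median

# Synthetic light curve: relative brightness of a star sampled at regular intervals.
flux = [1.00, 1.01, 0.99, 1.00, 0.97, 0.96, 0.97, 1.00, 1.01, 0.99]

# The median is a simple estimate of the brightness outside of any transit.
baseline = median(flux)

# Index of the faintest sample, and how far below the baseline it is.
deepest = min(range(len(flux)), key=lambda i: flux[i])
depth = baseline - flux[deepest]

print(f"baseline={baseline:.3f} deepest sample={deepest} depth={depth:.3f}")
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;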
&lt;h3 id=&quot;kepler-k2-mission&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#kepler-k2-mission&quot; aria-label=&quot;Anchor link for: kepler-k2-mission&quot;&gt;🔗&lt;&#x2F;a&gt;Kepler&#x2F;K2 mission&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;5.jpeg&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Image Credit: NASA Ames&#x2F;W. Stenzel&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Kepler is a &lt;strong&gt;space observatory&lt;&#x2F;strong&gt; launched by NASA in March 2009 to &lt;strong&gt;discover Earth-sized planets orbiting other stars&lt;&#x2F;strong&gt;. &lt;a href=&quot;https:&#x2F;&#x2F;www.nasa.gov&#x2F;feature&#x2F;ames&#x2F;nasas-k2-mission-the-kepler-space-telescopes-second-chance-to-shine&quot;&gt;The loss of a second of the four reaction wheels in May 2013&lt;&#x2F;a&gt; put an end to the original mission. Fortunately, scientists decided to create an &lt;strong&gt;entirely community-driven mission&lt;&#x2F;strong&gt; called K2, to &lt;strong&gt;reuse the Kepler spacecraft and its assets&lt;&#x2F;strong&gt;. Furthermore, the community is also encouraged to exploit the mission’s unique &lt;strong&gt;open&lt;&#x2F;strong&gt; data archive. Every image taken by the satellite can be &lt;strong&gt;downloaded and analyzed by anyone&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;More information about the telescope itself can be found &lt;a href=&quot;https:&#x2F;&#x2F;keplerscience.arc.nasa.gov&#x2F;the-kepler-space-telescope.html&quot;&gt;&lt;strong&gt;here&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;where-i-m-going&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#where-i-m-going&quot; aria-label=&quot;Anchor link for: where-i-m-going&quot;&gt;🔗&lt;&#x2F;a&gt;Where I’m going&lt;&#x2F;h3&gt;
&lt;p&gt;The goal of my project is to see if &lt;strong&gt;I can contribute to the exoplanet search&lt;&#x2F;strong&gt; using new tools such as &lt;a href=&quot;http:&#x2F;&#x2F;www.warp10.io&quot;&gt;&lt;strong&gt;Warp10&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;tensorflow.org&#x2F;&quot;&gt;&lt;strong&gt;TensorFlow&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;. Using &lt;strong&gt;Deep Learning to search for anomalies could be much more effective&lt;&#x2F;strong&gt; than writing WarpScript, because it is the &lt;strong&gt;neural network&#x27;s job to learn&lt;&#x2F;strong&gt; by itself &lt;strong&gt;how&lt;&#x2F;strong&gt; to detect the exoplanets.&lt;&#x2F;p&gt;
&lt;p&gt;As I’m currently following &lt;a href=&quot;https:&#x2F;&#x2F;www.coursera.org&#x2F;learn&#x2F;neural-networks-deep-learning&quot;&gt;&lt;strong&gt;Andrew Ng’s courses about Deep Learning&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;, it is also a great opportunity for me to play with &lt;strong&gt;TensorFlow&lt;&#x2F;strong&gt; in a personal project. The project can be divided into several steps:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Import&lt;&#x2F;strong&gt; the data&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Analyze&lt;&#x2F;strong&gt; the data using WarpScript&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Build&lt;&#x2F;strong&gt; a neural network to search for exoplanets&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Let&#x27;s see how the import was done!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;importing-kepler-and-k2-dataset&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#importing-kepler-and-k2-dataset&quot; aria-label=&quot;Anchor link for: importing-kepler-and-k2-dataset&quot;&gt;🔗&lt;&#x2F;a&gt;Importing Kepler and K2 dataset&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;step-0-find-the-data&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#step-0-find-the-data&quot; aria-label=&quot;Anchor link for: step-0-find-the-data&quot;&gt;🔗&lt;&#x2F;a&gt;Step 0: Find the data&lt;&#x2F;h4&gt;
&lt;p&gt;As mentioned previously, the data is available from the Mikulski Archive for Space Telescopes, or &lt;a href=&quot;https:&#x2F;&#x2F;archive.stsci.edu&#x2F;&quot;&gt;MAST&lt;&#x2F;a&gt;. It’s a &lt;strong&gt;NASA-funded project&lt;&#x2F;strong&gt; to support and provide the astronomical community with a variety of astronomical data archives. Both the Kepler and K2 datasets are &lt;strong&gt;available&lt;&#x2F;strong&gt; through &lt;strong&gt;campaigns&lt;&#x2F;strong&gt;. Each campaign has a collection of tar files, which contain the associated FITS files. &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;FITS&quot;&gt;&lt;strong&gt;FITS&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; is an &lt;strong&gt;open format&lt;&#x2F;strong&gt; for images that can also &lt;strong&gt;contain scientific data&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;6.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;FITS file representation.&lt;&#x2F;em&gt; &lt;a href=&quot;https:&#x2F;&#x2F;keplerscience.arc.nasa.gov&#x2F;k2-observing.html&quot;&gt;&lt;em&gt;Image Credit: KEPLER &amp;amp; K2 Science Center&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;step-1-etl-extract-transform-and-load-into-warp10&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#step-1-etl-extract-transform-and-load-into-warp10&quot; aria-label=&quot;Anchor link for: step-1-etl-extract-transform-and-load-into-warp10&quot;&gt;🔗&lt;&#x2F;a&gt;Step 1: ETL (Extract, Transform and Load) into Warp10&lt;&#x2F;h4&gt;
&lt;p&gt;To speed up acquisition, I developed &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;kepler-lens&quot;&gt;&lt;strong&gt;kepler-lens&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; to &lt;strong&gt;automatically&lt;&#x2F;strong&gt; &lt;strong&gt;download Kepler&#x2F;K2 datasets and extract the needed time series&lt;&#x2F;strong&gt; into a CSV format. &lt;strong&gt;Kepler-lens&lt;&#x2F;strong&gt; uses two awesome libraries:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;KeplerGO&#x2F;PyKE&quot;&gt;&lt;strong&gt;pyKe&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; to export the data from the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;FITS&quot;&gt;&lt;strong&gt;FITS&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; files to CSV (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;KeplerGO&#x2F;PyKE&#x2F;pull&#x2F;69&quot;&gt;&lt;strong&gt;#PR69&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;KeplerGO&#x2F;PyKE&#x2F;pull&#x2F;76&quot;&gt;&lt;strong&gt;#PR76&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;  have been merged).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dfm&#x2F;kplr&quot;&gt;&lt;strong&gt;kplr&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; is used to &lt;strong&gt;tag&lt;&#x2F;strong&gt; the dataset. With it, I can easily &lt;strong&gt;find stars&lt;&#x2F;strong&gt; with &lt;strong&gt;confirmed&lt;&#x2F;strong&gt; exoplanets or &lt;strong&gt;candidates&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Then &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;kepler2warp10&quot;&gt;&lt;strong&gt;Kepler2Warp10&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; is used to &lt;strong&gt;push the CSV files generated by kepler-lens to Warp10&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To ease the import, an &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PierreZ&#x2F;kepler2warp10-ansible&quot;&gt;&lt;strong&gt;Ansible role&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; was written to spread the work across multiple small &lt;strong&gt;virtual machines&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;550k distinct stars&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;around &lt;strong&gt;50k datapoints per star&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;That&#x27;s around &lt;strong&gt;27.5 billion measurements&lt;&#x2F;strong&gt; (300GB of LevelDB files), imported on a &lt;strong&gt;standalone&lt;&#x2F;strong&gt; instance. The Warp10 instance is &lt;strong&gt;self-hosted&lt;&#x2F;strong&gt; on a dedicated &lt;a href=&quot;https:&#x2F;&#x2F;www.kimsufi.com&#x2F;&quot;&gt;&lt;strong&gt;Kimsufi&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; server at OVH. Here are the full specifications for the curious ones:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;7.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Now that the data are &lt;strong&gt;available&lt;&#x2F;strong&gt;, we are ready to &lt;strong&gt;dive into the dataset&lt;&#x2F;strong&gt; and &lt;strong&gt;look for exoplanets&lt;&#x2F;strong&gt;! Let&#x27;s use WarpScript!&lt;&#x2F;p&gt;
&lt;h3&gt;Let&#x27;s see a transit using WarpScript&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;8.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;WarpScript logo&lt;&#x2F;p&gt;
&lt;p&gt;For those who don’t know WarpScript, I recommend reading my previous blogpost “&lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@PierreZ&#x2F;engage-maximum-warp-speed-in-time-series-analysis-with-warpscript-c97a9f4a0016&quot;&gt;&lt;strong&gt;Engage maximum warp speed in time series analysis with WarpScript&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;”.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s first plot the data! We are going to take a well-known star called &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Kepler-11&quot;&gt;&lt;strong&gt;Kepler-11&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;. It has (at least) 6 confirmed exoplanets. Let&#x27;s write our first WarpScript:&lt;&#x2F;p&gt;
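&lt;p&gt;A minimal sketch of what such a script can look like, assuming a read token stored in &lt;code&gt;$token&lt;&#x2F;code&gt;, the star identifier stored in &lt;code&gt;$starId&lt;&#x2F;code&gt;, and hypothetical class and label names (&lt;code&gt;kepler.sap.flux&lt;&#x2F;code&gt; and &lt;code&gt;KEPLERID&lt;&#x2F;code&gt;) that depend on how the CSV files were imported. The time range is also an assumption, chosen to cover the Kepler observations:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;&#x2F;&#x2F; Hypothetical class name and label: adapt them to your own import
&lt;&#x2F;span&gt;&lt;span&gt;[ $token &amp;#39;kepler.sap.flux&amp;#39; { &amp;#39;KEPLERID&amp;#39; $starId }
&lt;&#x2F;span&gt;&lt;span&gt;  &amp;#39;2009-05-01T00:00:00.000Z&amp;#39; &amp;#39;2013-06-01T00:00:00.000Z&amp;#39; ] FETCH
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;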
&lt;p&gt;The &lt;a href=&quot;http:&#x2F;&#x2F;www.warp10.io&#x2F;reference&#x2F;functions&#x2F;function_FETCH&quot;&gt;FETCH&lt;&#x2F;a&gt; function retrieves &lt;strong&gt;raw datapoints&lt;&#x2F;strong&gt; from Warp10. Let’s plot the result of our script:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;9.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Mmmmh, the straight lines represent &lt;strong&gt;empty periods with no datapoints&lt;&#x2F;strong&gt;; they separate the &lt;strong&gt;different observations&lt;&#x2F;strong&gt;. &lt;strong&gt;Let&#x27;s divide the data&lt;&#x2F;strong&gt; and generate &lt;strong&gt;one time series per observation&lt;&#x2F;strong&gt; using &lt;a href=&quot;http:&#x2F;&#x2F;www.warp10.io&#x2F;reference&#x2F;functions&#x2F;function_TIMESPLIT&#x2F;&quot;&gt;TIMESPLIT&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
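&lt;p&gt;As a sketch, assuming the list returned by FETCH is on top of the stack and holds a single Geo Time Series. The quiet period, the minimum number of values and the label name (&lt;code&gt;record&lt;&#x2F;code&gt;) are illustrative choices, not necessarily the values used for the plots here:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;&#x2F;&#x2F; FETCH returns a list: keep its single Geo Time Series
&lt;&#x2F;span&gt;&lt;span&gt;0 GET
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Split when no datapoint is seen for 6 hours (quiet period),
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; keep splits with at least 100 values, and number them in a &amp;#39;record&amp;#39; label
&lt;&#x2F;span&gt;&lt;span&gt;6 h 100 &amp;#39;record&amp;#39; TIMESPLIT
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Keep only the first observation
&lt;&#x2F;span&gt;&lt;span&gt;0 GET
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;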
&lt;p&gt;To ease the display, 0 GET is used to &lt;strong&gt;get only the first observation&lt;&#x2F;strong&gt;. Let&#x27;s see the result:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;10.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Much better. Do you see the dropouts? &lt;strong&gt;Those are transiting exoplanets!&lt;&#x2F;strong&gt; Now we’ll need to &lt;strong&gt;write a WarpScript to automatically detect transits.&lt;&#x2F;strong&gt; But that was enough for today, so we’ll cover this &lt;strong&gt;in the next blogpost!&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Thank you for reading! Feel free to &lt;strong&gt;comment&lt;&#x2F;strong&gt; and to &lt;strong&gt;subscribe&lt;&#x2F;strong&gt; to the &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;helloexoworld&quot;&gt;twitter account&lt;&#x2F;a&gt;!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;introducing-helloexoworld-the-quest-to-discover-exoplanets-with-warp10-and-tensorflow&#x2F;11.jpeg&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Artist’s impression of the ultracool dwarf star TRAPPIST-1 from close to one of its planets&lt;&#x2F;strong&gt;. Image Credit: By &lt;a href=&quot;http:&#x2F;&#x2F;www.eso.org&#x2F;public&#x2F;images&#x2F;eso1615b&#x2F;&quot;&gt;ESO&#x2F;M. Kornmesser&lt;&#x2F;a&gt; — &lt;a href=&quot;https:&#x2F;&#x2F;creativecommons.org&#x2F;licenses&#x2F;by-sa&#x2F;4.0&quot;&gt;CC BY-SA 4.0&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
</description>
          <category domain="tag">space</category>
          <category domain="tag">timeseries</category>
          <category domain="tag">analytics</category>
          <category domain="tag">machinelearning</category>
      </item>
      <item>
          <title>Engage maximum warp speed in time series analysis with WarpScript</title>
          <pubDate>Sun, 08 Oct 2017 20:43:05 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/engage-maximum-warp-speed-in-time-series-analysis-with-warpscript/</link>
          <guid>https://pierrezemb.fr/posts/engage-maximum-warp-speed-in-time-series-analysis-with-warpscript/</guid>
          <description xml:base="https://pierrezemb.fr/posts/engage-maximum-warp-speed-in-time-series-analysis-with-warpscript/">&lt;p&gt;&lt;strong&gt;Update 2019:&lt;&#x2F;strong&gt; this is a repost on my own blog. The original article can be read on &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@PierreZ&#x2F;engage-maximum-warp-speed-in-time-series-analysis-with-warpscript-c97a9f4a0016&quot;&gt;Medium&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;engage-maximum-warp-speed-in-time-series-analysis-with-warpscript&#x2F;1.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We, at &lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;data-platforms&#x2F;metrics&#x2F;&quot;&gt;Metrics Data Platform&lt;&#x2F;a&gt;, work every day with the &lt;a href=&quot;http:&#x2F;&#x2F;www.warp10.io&#x2F;&quot;&gt;Warp10 Platform&lt;&#x2F;a&gt;, an open source time series database. You may not know it because it’s not as famous as &lt;a href=&quot;https:&#x2F;&#x2F;prometheus.io&#x2F;&quot;&gt;Prometheus&lt;&#x2F;a&gt; or &lt;a href=&quot;https:&#x2F;&#x2F;docs.influxdata.com&#x2F;influxdb&#x2F;&quot;&gt;InfluxDB&lt;&#x2F;a&gt;, but Warp10 is the most &lt;strong&gt;powerful and generic solution&lt;&#x2F;strong&gt; to store and analyze sensor data. It’s the &lt;strong&gt;core&lt;&#x2F;strong&gt; of Metrics, and many internal teams at OVH use the &lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;data-platforms&#x2F;metrics&#x2F;&quot;&gt;Metrics Data Platform&lt;&#x2F;a&gt; to monitor their infrastructure. As a result, we handle a pretty nice amount of traffic 24&#x2F;7&#x2F;365, as you can see below:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;engage-maximum-warp-speed-in-time-series-analysis-with-warpscript&#x2F;6.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Not only does Warp10 allow us to reach unbelievable scalability, it also comes with its own language called &lt;strong&gt;WarpScript&lt;&#x2F;strong&gt;, to manipulate time series and perform heavy analysis on them. Before digging into the need for a new language, let’s talk a bit about the need for time series analysis.&lt;&#x2F;p&gt;
&lt;h3&gt;What is a time series?&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;A time series, or sensor data, is simply a sequence of measurements over time&lt;&#x2F;strong&gt;. The definition is quite generic, because many things can be represented as a time series:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;the evolution of the stock exchange or a bank account&lt;&#x2F;li&gt;
&lt;li&gt;the number of calls on a webserver&lt;&#x2F;li&gt;
&lt;li&gt;the fuel consumption of a car&lt;&#x2F;li&gt;
&lt;li&gt;the time to insert a value into a database&lt;&#x2F;li&gt;
&lt;li&gt;the time a customer is taking to register on your website&lt;&#x2F;li&gt;
&lt;li&gt;the heart rate of a person measured through a smartwatch&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;From a historical point of view, time series appeared shortly after the creation of the Web, to &lt;strong&gt;help engineers monitor the networks&lt;&#x2F;strong&gt;. Their use quickly expanded to monitoring servers as well. With the right monitoring system, you can have &lt;strong&gt;insights&lt;&#x2F;strong&gt; and &lt;strong&gt;KPIs&lt;&#x2F;strong&gt; about your service:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Analysis of long-term trends&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;How fast is my database growing?&lt;&#x2F;li&gt;
&lt;li&gt;How fast is my number of active user accounts growing?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Comparison over time&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Do my queries run faster with the new version of my library? Is my site slower than last week?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Alerts&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Trigger alerts based on advanced queries&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Displaying data through dashboards&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Dashboards help answer basic questions on the service, and in particular the 4 indispensable metrics: &lt;strong&gt;latency, traffic, errors and service saturation&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;The ability to run retrospectives&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Our latency is doubling, what’s going on?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3&gt;Time series are complicated to handle&lt;&#x2F;h3&gt;
&lt;p&gt;Storage, retrieval and analysis of time series cannot easily be done through standard relational databases. Generally, highly scalable databases are used to cope with the volume. For example, the &lt;strong&gt;300,000 sensors on board an Airbus A380 can generate an average of 16 TB of data per flight&lt;&#x2F;strong&gt;. On a smaller scale, &lt;strong&gt;a single sensor that measures every second generates 31.5 million values per year&lt;&#x2F;strong&gt; (60 × 60 × 24 × 365 ≈ 31.5 million). Handling time series at scale is difficult, because you run into advanced distributed systems issues, such as:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ingestion scalability&lt;&#x2F;strong&gt;, i.e. how to absorb all the datapoints 24&#x2F;7&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;query scalability&lt;&#x2F;strong&gt;, i.e. how to query in a reasonable amount of time&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;delete capability&lt;&#x2F;strong&gt;, i.e. how to handle deletes without stopping ingestion and query&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Frustration with existing open source monitoring tools like &lt;strong&gt;Nagios&lt;&#x2F;strong&gt; and &lt;strong&gt;Ganglia&lt;&#x2F;strong&gt; is why the giants created their own tools: &lt;strong&gt;Google has Borgmon&lt;&#x2F;strong&gt; and &lt;strong&gt;Facebook has&lt;&#x2F;strong&gt; &lt;a href=&quot;http:&#x2F;&#x2F;www.vldb.org&#x2F;pvldb&#x2F;vol8&#x2F;p1816-teller.pdf&quot;&gt;&lt;strong&gt;Gorilla&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;, just to name two. They are closed source, but the idea of treating time-series data as a data source for generating alerts is now accessible to everyone, thanks to the &lt;strong&gt;former Googlers who decided to rewrite Borgmon&lt;&#x2F;strong&gt; outside Google.&lt;&#x2F;p&gt;
&lt;h3&gt;Why another time series database?&lt;&#x2F;h3&gt;
&lt;p&gt;The time series ecosystem is now bigger than ever. Here’s a short list of what you can find to handle time series data:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;InfluxDB.&lt;&#x2F;li&gt;
&lt;li&gt;Prometheus.&lt;&#x2F;li&gt;
&lt;li&gt;Riak TS.&lt;&#x2F;li&gt;
&lt;li&gt;OpenTSDB.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Then there’s &lt;strong&gt;Warp10&lt;&#x2F;strong&gt;. The difference is quite simple: Warp10 is &lt;strong&gt;a platform&lt;&#x2F;strong&gt;, whereas all the time series databases listed above are &lt;strong&gt;stores&lt;&#x2F;strong&gt;. This is a game changer, for multiple reasons.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;security-first-design&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#security-first-design&quot; aria-label=&quot;Anchor link for: security-first-design&quot;&gt;🔗&lt;&#x2F;a&gt;Security-first design&lt;&#x2F;h4&gt;
&lt;p&gt;Security is mandatory for accessing data and sharing job results, but most of the above databases do not handle access control by default. With Warp10, security is handled with crypto tokens similar to &lt;a href=&quot;https:&#x2F;&#x2F;research.google.com&#x2F;pubs&#x2F;pub41892.html&quot;&gt;Macaroons&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;high-level-analysis-capabilities&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#high-level-analysis-capabilities&quot; aria-label=&quot;Anchor link for: high-level-analysis-capabilities&quot;&gt;🔗&lt;&#x2F;a&gt;High level analysis capabilities&lt;&#x2F;h4&gt;
&lt;p&gt;With a classical time series database, &lt;strong&gt;high level analysis must be done elsewhere&lt;&#x2F;strong&gt;, with R, Spark, Flink, Python, or whatever language or framework you want to use. With Warp10, you can just &lt;strong&gt;submit your script&lt;&#x2F;strong&gt; and &lt;em&gt;voilà&lt;&#x2F;em&gt;!&lt;&#x2F;p&gt;
&lt;h4 id=&quot;server-side-calculation&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#server-side-calculation&quot; aria-label=&quot;Anchor link for: server-side-calculation&quot;&gt;🔗&lt;&#x2F;a&gt;Server-side calculation&lt;&#x2F;h4&gt;
&lt;p&gt;Algorithms are resource heavy. Whether they’re using CPU, RAM, disk or network, you’ll hit &lt;strong&gt;limitations&lt;&#x2F;strong&gt; on your personal computer. Can you really aggregate and analyze one year of data from thousands of sensors on your laptop? Maybe, but what if you’re submitting the job from a mobile phone? To be &lt;strong&gt;scalable&lt;&#x2F;strong&gt;, analysis must be done &lt;strong&gt;server-side&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;h3&gt;Meet WarpScript&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;engage-maximum-warp-speed-in-time-series-analysis-with-warpscript&#x2F;2.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The Warp10 folks created WarpScript, an &lt;strong&gt;extensible&lt;&#x2F;strong&gt; &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Stack-oriented_programming_language&quot;&gt;&lt;strong&gt;stack oriented programming language&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; which offers more than &lt;strong&gt;800 functions&lt;&#x2F;strong&gt; and &lt;strong&gt;several high level frameworks&lt;&#x2F;strong&gt; to ease and speed up your data analysis. Simply &lt;strong&gt;create scripts&lt;&#x2F;strong&gt; containing your data analysis code and &lt;strong&gt;submit them to the platform&lt;&#x2F;strong&gt;; they will &lt;strong&gt;execute close to where the data resides&lt;&#x2F;strong&gt; and you will get the result of that analysis as a &lt;strong&gt;JSON object&lt;&#x2F;strong&gt; that you can &lt;strong&gt;integrate into your application&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Yes, you’ll be able to run that &lt;strong&gt;awesome query that is fetching millions of datapoints&lt;&#x2F;strong&gt; and only get the result. You need all the data, or just the timestamp of a weird datapoint? &lt;strong&gt;The result of the script is simply what’s left on the stack&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;dataflow-language&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#dataflow-language&quot; aria-label=&quot;Anchor link for: dataflow-language&quot;&gt;🔗&lt;&#x2F;a&gt;Dataflow language&lt;&#x2F;h4&gt;
&lt;p&gt;WarpScript is really easy to code, &lt;strong&gt;because of the stack design&lt;&#x2F;strong&gt;. You &lt;strong&gt;push elements onto the stack and consume them&lt;&#x2F;strong&gt;. Coding becomes logical: first you &lt;strong&gt;fetch&lt;&#x2F;strong&gt; your points, then &lt;strong&gt;apply some downsampling&lt;&#x2F;strong&gt;, and then &lt;strong&gt;aggregate&lt;&#x2F;strong&gt;. These 3 steps are translated into &lt;strong&gt;3 lines of WarpScript&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;FETCH&lt;&#x2F;strong&gt; will push the needed Geo Time Series into the stack&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;BUCKETIZE&lt;&#x2F;strong&gt; will take the Geo Time Series from the stack, apply some downsampling, and push the result into the stack&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;REDUCE&lt;&#x2F;strong&gt; will take the Geo Time Series from the stack, aggregate them, and push them back into the stack&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Debugging has never been that easy: just use the keyword &lt;strong&gt;STOP&lt;&#x2F;strong&gt; to see the stack at any moment.&lt;&#x2F;p&gt;
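&lt;p&gt;For instance, here is a sketch assuming a read token in &lt;code&gt;$token&lt;&#x2F;code&gt; and a hypothetical &lt;code&gt;temperature&lt;&#x2F;code&gt; class. Placing STOP after the first step halts the script there, so the stack you get back holds only the fetched series and the downsampling below it does not run:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;[ $token &amp;#39;temperature&amp;#39; {} NOW 1 h ] FETCH
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Execution halts here: the stack contains the raw fetched series
&lt;&#x2F;span&gt;&lt;span&gt;STOP
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Not executed as long as the STOP above is present
&lt;&#x2F;span&gt;&lt;span&gt;[ SWAP bucketizer.max 0 1 m 0 ] BUCKETIZE
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;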
&lt;h4 id=&quot;rich-programming-capabilities&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#rich-programming-capabilities&quot; aria-label=&quot;Anchor link for: rich-programming-capabilities&quot;&gt;🔗&lt;&#x2F;a&gt;Rich programming capabilities&lt;&#x2F;h4&gt;
&lt;p&gt;WarpScript comes with more than &lt;strong&gt;800 functions&lt;&#x2F;strong&gt;, ready to use. Things like &lt;strong&gt;pattern and outlier detection, rolling average, FFT, IDWT&lt;&#x2F;strong&gt; are built-in.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;geo-fencing-capabilities&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#geo-fencing-capabilities&quot; aria-label=&quot;Anchor link for: geo-fencing-capabilities&quot;&gt;🔗&lt;&#x2F;a&gt;Geo-Fencing capabilities&lt;&#x2F;h4&gt;
&lt;p&gt;Both &lt;strong&gt;space&lt;&#x2F;strong&gt; (location) and &lt;strong&gt;time&lt;&#x2F;strong&gt; are considered &lt;strong&gt;first class citizens&lt;&#x2F;strong&gt;. Complex searches like “&lt;strong&gt;find all the sensors active during last Monday in the perimeter delimited by this geo-fencing polygon&lt;&#x2F;strong&gt;” can be done without involving expensive joins between separate time series for the same source.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;unified-language&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#unified-language&quot; aria-label=&quot;Anchor link for: unified-language&quot;&gt;🔗&lt;&#x2F;a&gt;Unified Language&lt;&#x2F;h4&gt;
&lt;p&gt;WarpScript can be used in &lt;strong&gt;batch&lt;&#x2F;strong&gt; mode, or in &lt;strong&gt;real-time&lt;&#x2F;strong&gt;, because you need both of them in the real world.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;geez-give-me-an-example&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#geez-give-me-an-example&quot; aria-label=&quot;Anchor link for: geez-give-me-an-example&quot;&gt;🔗&lt;&#x2F;a&gt;Geez, give me an example&lt;&#x2F;h3&gt;
&lt;p&gt;Here’s an example of a simple but advanced query:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;&#x2F;&#x2F; Fetch all values from the last hour
&lt;&#x2F;span&gt;&lt;span&gt;[ $token &amp;#39;temperature&amp;#39; {} NOW 1 h ] FETCH
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Get max value for each minute
&lt;&#x2F;span&gt;&lt;span&gt;[ SWAP bucketizer.max 0 1 m 0 ] BUCKETIZE
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Round to nearest long
&lt;&#x2F;span&gt;&lt;span&gt;[ SWAP mapper.round 0 0 0 ] MAP
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Reduce the data by keeping the max, grouping by &amp;#39;buildingID&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;[ SWAP [ &amp;#39;buildingID&amp;#39; ] reducer.max ] REDUCE
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Have you guessed the goal? The result will &lt;strong&gt;display, for the last hour, the temperature of the hottest room per buildingID&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-about-a-more-complex-example&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-about-a-more-complex-example&quot; aria-label=&quot;Anchor link for: what-about-a-more-complex-example&quot;&gt;🔗&lt;&#x2F;a&gt;What about a more complex example?&lt;&#x2F;h3&gt;
&lt;p&gt;You’re still here? Good, let’s have a more complex example. Let’s say that I want to do some pattern recognition. Here’s a cosine with an increasing amplitude:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;engage-maximum-warp-speed-in-time-series-analysis-with-warpscript&#x2F;3.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I want to &lt;strong&gt;detect the green part&lt;&#x2F;strong&gt; of the time series, because I know that my service is crashing when I have that kind of load. With WarpScript, it takes only &lt;strong&gt;2 function calls&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PATTERNS&lt;&#x2F;strong&gt; is generating a list of motifs.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;PATTERNDETECTION&lt;&#x2F;strong&gt; is running the list of motifs on all the time series you have.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here’s the code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#2b303b;color:#c0c5ce;&quot;&gt;&lt;code&gt;&lt;span&gt;&#x2F;&#x2F; defining some variables  
&lt;&#x2F;span&gt;&lt;span&gt;32 &amp;#39;windowSize&amp;#39; STORE  
&lt;&#x2F;span&gt;&lt;span&gt;8 &amp;#39;patternLength&amp;#39; STORE  
&lt;&#x2F;span&gt;&lt;span&gt;16 &amp;#39;quantizationScale&amp;#39; STORE  
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Generate patterns   
&lt;&#x2F;span&gt;&lt;span&gt;$pattern.to.detect 0 GET   
&lt;&#x2F;span&gt;&lt;span&gt;$windowSize $patternLength $quantizationScale PATTERNS  
&lt;&#x2F;span&gt;&lt;span&gt;VALUES &amp;#39;patterns&amp;#39; STORE  
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;&#x2F;&#x2F; Running the patterns through a list of GTS (Geo Time Series)  
&lt;&#x2F;span&gt;&lt;span&gt;$list.of.gts $patterns   
&lt;&#x2F;span&gt;&lt;span&gt;$windowSize $patternLength $quantizationScale  PATTERNDETECTION
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here’s the result:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;engage-maximum-warp-speed-in-time-series-analysis-with-warpscript&#x2F;4.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As you can see, &lt;strong&gt;PATTERNDETECTION&lt;&#x2F;strong&gt; is working even with the increasing amplitude! You can explore this example by yourself using &lt;a href=&quot;https:&#x2F;&#x2F;home.cityzendata.net&#x2F;quantum&#x2F;preview&#x2F;#&#x2F;plot&#x2F;TkVXR1RTICdjb3MnIFJFTkFNRQoxIDEwODAKPCUgRFVQICdpJyBTVE9SRSBEVVAgMiAqIFBJICogMzYwIC8gQ09TICRpICogTmFOIE5hTiBOYU4gNCBST0xMIEFERFZBTFVFICU+IEZPUgoKWyBTV0FQIGJ1Y2tldGl6ZXIubGFzdCAxMDgwIDEgMCBdIEJVQ0tFVElaRSAnY29zJyBTVE9SRQoKTkVXR1RTICdwYXR0ZXJuLnRvLmRldGVjdCcgUkVOQU1FCjIwMCAzNzAKPCUgIERVUCAnaScgU1RPUkUgRFVQIDIgKiBQSSAqIDM2MCAvIENPUyAkaSAqIE5hTiBOYU4gTmFOIDQgUk9MTCBBRERWQUxVRSAlPiBGT1IKClsgU1dBUCBidWNrZXRpemVyLmxhc3QgMjE2MCAxIDAgXSBCVUNLRVRJWkUgJ3BhdHRlcm4udG8uZGV0ZWN0JyBTVE9SRQoKLy8gQ3JlYXRlIFBhdHRlcm4KMzIgJ3dpbmRvd1NpemUnIFNUT1JFCjggJ3BhdHRlcm5MZW5ndGgnIFNUT1JFCjE2ICdxdWFudGl6YXRpb25TY2FsZScgU1RPUkUKCiRwYXR0ZXJuLnRvLmRldGVjdCAwIEdFVCAkd2luZG93U2l6ZSAkcGF0dGVybkxlbmd0aCAkcXVhbnRpemF0aW9uU2NhbGUgUEFUVEVSTlMgVkFMVUVTICdwYXR0ZXJucycgU1RPUkUKCiRjb3MgJHBhdHRlcm5zICR3aW5kb3dTaXplICRwYXR0ZXJuTGVuZ3RoICRxdWFudGl6YXRpb25TY2FsZSAgUEFUVEVSTkRFVEVDVElPTiAnY29zLmRldGVjdGlvbicgUkVOQU1FICdjb3MuZGV0ZWN0aW9uJyBTVE9SRQoKJGNvcy5kZXRlY3Rpb24KLy8gTGV0J3MgY3JlYXRlIGEgZ3RzIGZvciBlYWNoIHRyaXAKMTAgICAgICAgLy8gIFF1aWV0IHBlcmlvZAo1ICAgICAgICAgLy8gTWluIG51bWJlciBvZiB2YWx1ZXMKJ3N1YlBhdHRlcm4nICAvLyBMYWJlbApUSU1FU1BMSVQKCiRjb3M=&#x2F;eyJ1cmwiOiJodHRwczovL3dhcnAuY2l0eXplbmRhdGEubmV0L2FwaS92MCIsImhlYWRlck5hbWUiOiJYLUNpdHl6ZW5EYXRhIn0=&quot;&gt;Quantum&lt;&#x2F;a&gt;, the official web-based IDE for WarpScript. &lt;strong&gt;You need to switch the X-axis scale to Timestamp in order to see the curve&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Thanks for reading! Here’s a list of additional resources about time series and Warp10:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.ovh.com&#x2F;fr&#x2F;data-platforms&#x2F;metrics&#x2F;&quot;&gt;Metrics Data Platform&lt;&#x2F;a&gt;, our product&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;http:&#x2F;&#x2F;warp10.io&#x2F;&quot;&gt;Warp10 official documentation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;http:&#x2F;&#x2F;tour.warp10.io&#x2F;&quot;&gt;Warp10 tour&lt;&#x2F;a&gt;, similar to “The Go Tour”&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=mNkfBR9KofY&quot;&gt;Presentation of the Warp 10 Time Series Platform at the 42 US school in Fremont&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;groups.google.com&#x2F;forum&#x2F;#!forum&#x2F;warp10-users&quot;&gt;Warp10 Google Groups&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
          <category domain="tag">database</category>
          <category domain="tag">timeseries</category>
          <category domain="tag">analytics</category>
          <category domain="tag">performance</category>
      </item>
      <item>
          <title>Event-driven architecture 101</title>
          <pubDate>Fri, 13 May 2016 17:19:23 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/eventdriven-architecture-101/</link>
          <guid>https://pierrezemb.fr/posts/eventdriven-architecture-101/</guid>
          <description xml:base="https://pierrezemb.fr/posts/eventdriven-architecture-101/">&lt;p&gt;&lt;strong&gt;Update 2019:&lt;&#x2F;strong&gt; this is a repost on my own blog. The original article can be read on &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@PierreZ&#x2F;event-driven-architecture-101-d8e13cc4c656&quot;&gt;Medium&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;eventdriven-architecture-101&#x2F;1.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Do your own cover on &lt;a href=&quot;http:&#x2F;&#x2F;dev.to&#x2F;rly&quot;&gt;http:&#x2F;&#x2F;dev.to&#x2F;rly&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;I’m still a student, so my point of view could be far from reality, be gentle ;)&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;tl;dr: Message queues are cool. Use them at the core of your architecture.&lt;&#x2F;em&gt;&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’m currently playing a lot with &lt;a href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;&quot;&gt;Kafka&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;flink.apache.org&#x2F;&quot;&gt;Flink&lt;&#x2F;a&gt; at work. I also discovered &lt;a href=&quot;http:&#x2F;&#x2F;vertx.io&#x2F;&quot;&gt;Vert.x&lt;&#x2F;a&gt; at my local JUG. All three have a word in common: &lt;strong&gt;events&lt;&#x2F;strong&gt;. Event-driven architecture is not something that I learned at school, and I think that’s a shame. It’s really powerful and useful, especially in a world where we speak more and more about “serverless” and “micro services” stuff. So here’s my attempt at a big summary.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;the-unix-philosophy&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-unix-philosophy&quot; aria-label=&quot;Anchor link for: the-unix-philosophy&quot;&gt;🔗&lt;&#x2F;a&gt;The Unix philosophy&lt;&#x2F;h1&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;eventdriven-architecture-101&#x2F;2.gif&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’m a huge fan of GNU&#x2F;Linux. I just love my terminal. It was difficult at the beginning, but now I consider myself fluent with it. My favorite feature? &lt;strong&gt;Pipes, or |&lt;&#x2F;strong&gt;. For those who don’t know, it’s the ability to pass the output of one command to another command. For example, to count how many files you have in a folder, you’ll find yourself doing something like this:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;list files&lt;&#x2F;strong&gt; in a folder&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;manipulate&#x2F;filter&lt;&#x2F;strong&gt; this list so that one line corresponds to one file; things like folders are omitted&lt;&#x2F;li&gt;
&lt;li&gt;And then &lt;strong&gt;count&lt;&#x2F;strong&gt; the lines!&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In the UNIX world, it should give you something like “&lt;strong&gt;&lt;em&gt;ls -l | grep ^- | wc -l&lt;&#x2F;em&gt;&lt;&#x2F;strong&gt;”. It might look cryptic. For me, it just feels logical. &lt;strong&gt;3 operations mapped into 3 commands.&lt;&#x2F;strong&gt; You declare a set of commands that, in the end, give you the result. It’s simple and also very fast (in fact, you can find funny articles like this one: &lt;a href=&quot;http:&#x2F;&#x2F;aadrake.com&#x2F;command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html&quot;&gt;Command-line tools can be 235x faster than your Hadoop cluster&lt;&#x2F;a&gt;). This is only possible thanks to the &lt;strong&gt;UNIX philosophy&lt;&#x2F;strong&gt;, nicely described by Doug McIlroy, Elliot Pinson and Berk Tague in 1978:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.&lt;&#x2F;p&gt;
&lt;p&gt;Expect the output of every program to become the input to another, as yet unknown, program.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
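&lt;p&gt;To make the “three operations, three commands” idea concrete outside the shell, here is a minimal Python sketch of the same pipeline. The function names (list_entries, keep_regular_files, count, count_files) are hypothetical and chosen only for illustration. Each stage does one thing and feeds the next, just like a pipe.&lt;&#x2F;p&gt;

```python
import os
import tempfile


def list_entries(folder):
    """Stage 1 (like `ls`): yield directory entries, skipping dotfiles."""
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.name.startswith("."):
                yield entry


def keep_regular_files(entries):
    """Stage 2 (like `grep ^-`): keep regular files, drop folders and symlinks."""
    for entry in entries:
        if entry.is_file(follow_symlinks=False):
            yield entry


def count(items):
    """Stage 3 (like `wc -l`): count whatever the previous stage produced."""
    return sum(1 for _ in items)


def count_files(folder):
    # Compose the three stages, the way the shell composes commands with |.
    return count(keep_regular_files(list_entries(folder)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        for name in ("a.txt", "b.txt"):
            open(os.path.join(folder, name), "w").close()
        os.mkdir(os.path.join(folder, "subfolder"))
        print(count_files(folder))  # two files; the folder is not counted
```

&lt;p&gt;None of the stages knows about the others, so any of them can be swapped or reused, which is the property the quote above describes.&lt;&#x2F;p&gt;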
&lt;p&gt;Why should I care? It’s 2016, not 1978! Well…&lt;&#x2F;p&gt;
&lt;h1 id=&quot;back-in-2016&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#back-in-2016&quot; aria-label=&quot;Anchor link for: back-in-2016&quot;&gt;🔗&lt;&#x2F;a&gt;Back in 2016&lt;&#x2F;h1&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;eventdriven-architecture-101&#x2F;3.gif&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The cloud changed everything in terms of software engineering. &lt;strong&gt;We can now deploy applications without thinking about the underlying server&lt;&#x2F;strong&gt;. How cool is that? Let’s take a step back. Now that you can easily deploy a huge application, what can be accomplished? Well, if I can deploy one app with ease, &lt;strong&gt;why should I deploy only one huge app?&lt;&#x2F;strong&gt; Why can’t I deploy multiple applications instead of one? &lt;strong&gt;Let’s call these applications micro services&lt;&#x2F;strong&gt; because we are in 2016.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;eventdriven-architecture-101&#x2F;4.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;OK, so now I’m applying the first rule of the UNIX philosophy, because I have multiple programs that each do one job. But what about the second rule? &lt;strong&gt;How can they communicate? How can we simulate UNIX pipes?&lt;&#x2F;strong&gt; Before answering, let’s answer another question first: &lt;strong&gt;What do we really need to send through our network?&lt;&#x2F;strong&gt; Don’t forget the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fallacies_of_distributed_computing&quot;&gt;&lt;strong&gt;Fallacies of distributed computing&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;…&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Let’s take an example. We are a new startup, and we are building our platform. We’ll certainly need to handle our customers. Let’s say that for each new customer, &lt;strong&gt;we need to perform two actions&lt;&#x2F;strong&gt;: add the customer to our database, and then to our mailing list. &lt;strong&gt;A simple and classical way would be to just call two functions&lt;&#x2F;strong&gt; (whether in the same application or not), and then tell the customer: “You’re successfully registered”. Like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;eventdriven-architecture-101&#x2F;5.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Classic approach&lt;&#x2F;p&gt;
&lt;p&gt;Is there another approach? Let’s use an &lt;strong&gt;event-based architecture&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;h1 id=&quot;let-s-talk-events&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#let-s-talk-events&quot; aria-label=&quot;Anchor link for: let-s-talk-events&quot;&gt;🔗&lt;&#x2F;a&gt;&lt;strong&gt;Let’s talk events&lt;&#x2F;strong&gt;&lt;&#x2F;h1&gt;
&lt;p&gt;Let’s ask Google, what’s an event?&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;a thing that happens, especially one of importance.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Well, handling a new customer is a thing that happens (hopefully). For this, we’ll be using a &lt;strong&gt;message queue system, or broker&lt;&#x2F;strong&gt;. It’s a &lt;strong&gt;middleware&lt;&#x2F;strong&gt; that will &lt;strong&gt;receive events and make them available to another application or group of applications.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;eventdriven-architecture-101&#x2F;6.gif&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Message queue architecture with 2 producers and 4 consumers&lt;&#x2F;p&gt;
&lt;p&gt;So let’s rethink our architecture. Pay attention to the words: our Register page will &lt;strong&gt;produce&lt;&#x2F;strong&gt; an event that will contain all the information about our client. This event will be &lt;strong&gt;queued&lt;&#x2F;strong&gt;, waiting to be &lt;strong&gt;consumed&lt;&#x2F;strong&gt; by the associated micro services.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;eventdriven-architecture-101&#x2F;7.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Simple event-driven architecture&lt;&#x2F;p&gt;
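&lt;p&gt;The produce, queue, consume vocabulary can be sketched with a toy in-memory broker. This is only an illustration under simplifying assumptions: the Broker class and its subscribe, publish and deliver methods are hypothetical names, everything runs in one process, and nothing is persisted, unlike a real broker such as Kafka.&lt;&#x2F;p&gt;

```python
from collections import deque


class Broker:
    """Toy broker: every subscribed consumer gets its own queue of events."""

    def __init__(self):
        self._consumers = {}  # consumer name -> (handler, pending events)

    def subscribe(self, name, handler):
        self._consumers[name] = (handler, deque())

    def publish(self, event):
        # The event is queued once per consumer; the producer does not
        # know who will consume it, or when.
        for _, pending in self._consumers.values():
            pending.append(event)

    def deliver(self):
        # Each consumer drains its own queue, independently of the others.
        for handler, pending in self._consumers.values():
            while pending:
                handler(pending.popleft())


database = []
mailing_list = []

broker = Broker()
broker.subscribe("database", database.append)
broker.subscribe("mailing-list", lambda event: mailing_list.append(event["email"]))

# The register page only produces the event and moves on.
broker.publish({"name": "Ada", "email": "ada@example.com"})

# Later, the consumers process whatever is waiting for them.
broker.deliver()
print(database)
print(mailing_list)
```

&lt;p&gt;Adding a third action is one more subscribe call. The producer code does not change.&lt;&#x2F;p&gt;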
&lt;p&gt;We didn’t change much, but we enabled many things here:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplicity&lt;&#x2F;strong&gt;. Remember the first rule! “Make each program do one thing well”. This way, the &lt;strong&gt;code base for each app will be simple&lt;&#x2F;strong&gt; &lt;strong&gt;as hell&lt;&#x2F;strong&gt;, and you’ll be able to easily replace a piece of software if needed.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Modularity&lt;&#x2F;strong&gt;. You need to add another action to the event, for example CreateProfile? Easy, &lt;strong&gt;just plug another app into the same queue&lt;&#x2F;strong&gt;. You need to test a new version of your program? Easy, &lt;strong&gt;just plug it into the same queue&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;&#x2F;strong&gt;. One of your micro services is taking too much time? &lt;strong&gt;Just start a new instance of it&lt;&#x2F;strong&gt;. Huge traffic? Add new instances. With this approach, you can start really small and become giant.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Big-data friendly.&lt;&#x2F;strong&gt; This type of architecture is often used to handle a lot of data. With platforms like &lt;a href=&quot;http:&#x2F;&#x2F;flink.apache.org&quot;&gt;Apache Flink&lt;&#x2F;a&gt;, you can do some &lt;strong&gt;stream processing directly&lt;&#x2F;strong&gt;. &lt;a href=&quot;https:&#x2F;&#x2F;ci.apache.org&#x2F;projects&#x2F;flink&#x2F;flink-docs-master&#x2F;apis&#x2F;streaming&#x2F;index.html#example-program&quot;&gt;Look how easy it is&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Polyglotism.&lt;&#x2F;strong&gt; Most messaging systems offer client libraries for many languages. &lt;strong&gt;This way, you can use whatever language you want&lt;&#x2F;strong&gt;. But beware: &lt;em&gt;with great power comes great responsibility&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h1 id=&quot;what-about-serverless&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-about-serverless&quot; aria-label=&quot;Anchor link for: what-about-serverless&quot;&gt;🔗&lt;&#x2F;a&gt;&lt;strong&gt;What about serverless?&lt;&#x2F;strong&gt;&lt;&#x2F;h1&gt;
&lt;p&gt;Serverless is the “new” buzzword. Ignited by Amazon with its product &lt;a href=&quot;https:&#x2F;&#x2F;aws.amazon.com&#x2F;lambda&#x2F;&quot;&gt;AWS Lambda&lt;&#x2F;a&gt; and quickly followed by &lt;a href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;functions&#x2F;docs&quot;&gt;Google&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;azure.microsoft.com&#x2F;en-us&#x2F;services&#x2F;functions&#x2F;&quot;&gt;Microsoft&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;new-console.ng.bluemix.net&#x2F;openwhisk&#x2F;&quot;&gt;IBM&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;www.iron.io&#x2F;introducing-aws-lambda-support&quot;&gt;Iron.io&lt;&#x2F;a&gt;, its goal is to &lt;strong&gt;offer developers a new way of building apps&lt;&#x2F;strong&gt;. Instead of writing apps, &lt;strong&gt;you’ll just write a function that responds to an event&lt;&#x2F;strong&gt;, and you’ll pay only for the time it’s running. It’s an interesting point of view, because you’ll be &lt;strong&gt;deploying an architecture built only using events&lt;&#x2F;strong&gt;. I must admit that I haven’t tried it yet, but I think &lt;strong&gt;it’s a great idea to force developers to split their apps and really think about events&lt;&#x2F;strong&gt;, even though you could build the same thing with any cloud provider.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;additional-links-and-talks-about-this-topic&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#additional-links-and-talks-about-this-topic&quot; aria-label=&quot;Anchor link for: additional-links-and-talks-about-this-topic&quot;&gt;🔗&lt;&#x2F;a&gt;Additional links and talks about this topic&lt;&#x2F;h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http:&#x2F;&#x2F;www.confluent.io&#x2F;blog&#x2F;apache-kafka-samza-and-the-unix-philosophy-of-distributed-data&quot;&gt;Apache Kafka, Samza, and the Unix Philosophy of Distributed Data&lt;&#x2F;a&gt; by &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;u&#x2F;13be457aed12&quot;&gt;Martin Kleppmann&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;http:&#x2F;&#x2F;blog.cloudera.com&#x2F;blog&#x2F;2014&#x2F;09&#x2F;apache-kafka-for-beginners&#x2F;&quot;&gt;Apache Kafka for Beginners&lt;&#x2F;a&gt; by Cloudera Engineering Blog&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.voxxed.com&#x2F;blog&#x2F;2016&#x2F;04&#x2F;introduction-apache-kafka&#x2F;&quot;&gt;Introduction to Apache Kafka&lt;&#x2F;a&gt; by Guglielmo Iozza&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;http:&#x2F;&#x2F;dataartisans.github.io&#x2F;flink-training&#x2F;&quot;&gt;Apache Flink Training&lt;&#x2F;a&gt; by data-artisans&lt;&#x2F;li&gt;
&lt;li&gt;Meetup LeboncoinTech — AMQP 101 by &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;u&#x2F;58ea5a89aaae&quot;&gt;Quentin ADAM&lt;&#x2F;a&gt; (in French, sorry)&lt;&#x2F;li&gt;
&lt;li&gt;vert.x 3 — be reactive on the JVM but not only in Java by Clement Escoffier&#x2F;Paulo Lopes DEVOXX 2015&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Please feel free to react to this article: you can reach me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, or have a look at my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">architecture</category>
          <category domain="tag">messaging</category>
          <category domain="tag">distributed</category>
          <category domain="tag">design</category>
      </item>
      <item>
          <title>Let’s talk about containers</title>
          <pubDate>Mon, 04 Jan 2016 18:52:19 +0000</pubDate>
          <author>Pierre Zemb</author>
          <link>https://pierrezemb.fr/posts/lets-talk-about-containers/</link>
          <guid>https://pierrezemb.fr/posts/lets-talk-about-containers/</guid>
          <description xml:base="https://pierrezemb.fr/posts/lets-talk-about-containers/">&lt;p&gt;&lt;strong&gt;update 2019:&lt;&#x2F;strong&gt; this is a repost on my own blog. original article can be read on &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@pierrez&#x2F;let-s-talk-about-containers-1f11ee68c470&quot;&gt;medium&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;English is not my first language, so the whole story may have some mistakes… corrections and fixes will be greatly appreciated. I’m also still a student, so my point of view could be far from “production ready”, be gentle ;-)&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In the last two years, one technology has become really hyped. It was the holy grail for easy deployments and easy application management. Let’s talk about containers.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;write-once-run-everywhere&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#write-once-run-everywhere&quot; aria-label=&quot;Anchor link for: write-once-run-everywhere&quot;&gt;🔗&lt;&#x2F;a&gt;“Write once, run everywhere”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;lets-talk-about-containers&#x2F;1.jpeg&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;When I first heard about containers, I was doing a part-time internship at a French bank as a developer in an Ops team. I was working around &lt;a href=&quot;https:&#x2F;&#x2F;hadoop.apache.org&#x2F;&quot;&gt;Hadoop&lt;&#x2F;a&gt; and monitoring systems, and I was wondering “How should I properly deploy my work?”. It was a Java app, running on the official Java version provided by my company. &lt;strong&gt;I couldn’t just give it to my colleagues&lt;&#x2F;strong&gt; &lt;strong&gt;and let them do some voodoo stuff because they are the Ops team&lt;&#x2F;strong&gt;. I remember telling myself, “Fortunately, all the features that I need are in this official Java version, I don’t need the latest JRE. I just need to bundle everything into a jar and I’m done”. But what if they weren’t? What if I had to explain to my colleagues that I needed the new JRE for a really small app written by an intern? Or that I needed another non-standard library at runtime?&lt;&#x2F;p&gt;
&lt;p&gt;The important thing at the time was that &lt;strong&gt;I could deploy it on any other server that had Java, because everything was bundled into that big fat jar file&lt;&#x2F;strong&gt;. After all, “&lt;strong&gt;write once, run everywhere&lt;&#x2F;strong&gt;” was the slogan created by Sun Microsystems to illustrate the cross-platform benefits of the Java language. That is a real convenience, and it is the first thing that struck me about Docker.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;docker-hype&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#docker-hype&quot; aria-label=&quot;Anchor link for: docker-hype&quot;&gt;🔗&lt;&#x2F;a&gt;Docker hype&lt;&#x2F;h3&gt;
&lt;p&gt;I will always remember my chat with my colleagues about it. I was like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;lets-talk-about-containers&#x2F;2.jpeg&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;and-they-were-more-like&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#and-they-were-more-like&quot; aria-label=&quot;Anchor link for: and-they-were-more-like&quot;&gt;🔗&lt;&#x2F;a&gt;And they were more like&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;lets-talk-about-containers&#x2F;3.jpeg&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Ops have known about containers since the dawn of time, so why such hype now? I think that “write once, run everywhere” is the true slogan of Docker, because you can run Docker containers in any environment that has Docker. &lt;strong&gt;You want to try the latest datastore&#x2F;SaaS app that you found on Hacker News or Reddit? There’s a Dockerfile for that&lt;&#x2F;strong&gt;. And that is super cool. So everyone started to get interested in Docker, myself included. But the real benefit is that many huge companies like Google admit that containers are how they deploy apps. &lt;strong&gt;They don’t care what type of application they are deploying or where it’s running, it’s just running somewhere.&lt;&#x2F;strong&gt; That’s all that matters. By unifying the packaging, you can automate and deliver whatever you want somewhere. Do you really care if it’s on a specific machine? No, you don’t. That’s a powerful way to think of infrastructure as a pool of compute and storage power rather than individual machines.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;let-s-create-a-container&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#let-s-create-a-container&quot; aria-label=&quot;Anchor link for: let-s-create-a-container&quot;&gt;🔗&lt;&#x2F;a&gt;Let’s create a container&lt;&#x2F;h3&gt;
&lt;p&gt;It’s no secret: I love &lt;a href=&quot;https:&#x2F;&#x2F;golang.org&#x2F;&quot;&gt;Go&lt;&#x2F;a&gt;. It’s in my opinion a very nice programming language &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@PierreZ&#x2F;why-you-really-should-give-golang-a-try-6b577092d725&quot;&gt;that you should really try&lt;&#x2F;a&gt;. So let’s say that I’m creating a Go app, and then shipping it with Docker. So I’ll use the official Docker image, right? &lt;strong&gt;Then I end up with a 700MB container to ship a 10MB app&lt;&#x2F;strong&gt;… I thought that containers were supposed to be small… Why? Because it’s based on a full OS, with the Go compiler and so on. To run a single binary, there’s no need to have the whole Go compiler stack.&lt;&#x2F;p&gt;
&lt;p&gt;That was really bothering me. At this point, if the container is holding everything, why not use a VM? Why do we need to bundle Ubuntu into the container? From an outside point of view, running a container in interactive mode looks much like running a virtual machine, right? &lt;strong&gt;At the time of writing, Docker’s official image for Ubuntu had been pulled more than 36,000,000 times&lt;&#x2F;strong&gt;. That’s huge! And disturbing. Do you really need, for example, “ls, chmod, chown, sudo” in a container?&lt;&#x2F;p&gt;
&lt;p&gt;There is another huge impact of having a full distribution in a container: security. &lt;strong&gt;You now have to watch for CVEs (Common Vulnerabilities and Exposures) not only on the packages in your host distribution, but also on those in your container&lt;&#x2F;strong&gt;! After all, based on this &lt;a href=&quot;https:&#x2F;&#x2F;docs.google.com&#x2F;presentation&#x2F;d&#x2F;1toUKgqLyy1b-pZlDgxONLduiLmt2yaLR0GliBB7b3L0&#x2F;pub?start=false&amp;amp;loop=false#slide=id.ge614ec624_2_70&quot;&gt;presentation&lt;&#x2F;a&gt;, 66.6% of analyzed images on Quay.io are vulnerable to &lt;a href=&quot;https:&#x2F;&#x2F;community.qualys.com&#x2F;blogs&#x2F;laws-of-vulnerabilities&#x2F;2015&#x2F;01&#x2F;27&#x2F;the-ghost-vulnerability&quot;&gt;Ghost&lt;&#x2F;a&gt;, and 80% to &lt;a href=&quot;http:&#x2F;&#x2F;heartbleed.com&#x2F;&quot;&gt;Heartbleed&lt;&#x2F;a&gt;. That is quite scary… So adding this nightmare doesn’t seem like the solution.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;so-what-should-i-put-into-my-container&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#so-what-should-i-put-into-my-container&quot; aria-label=&quot;Anchor link for: so-what-should-i-put-into-my-container&quot;&gt;🔗&lt;&#x2F;a&gt;So what should I put into my container?&lt;&#x2F;h3&gt;
&lt;p&gt;I looked around the internet a lot and saw things like &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gliderlabs&#x2F;docker-alpine&quot;&gt;docker-alpine&lt;&#x2F;a&gt; or &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;phusion&#x2F;baseimage-docker&quot;&gt;baseimage-docker&lt;&#x2F;a&gt;, which are cool, but in fact the answer was on Docker’s website… Here’s the &lt;a href=&quot;https:&#x2F;&#x2F;www.docker.com&#x2F;what-docker&quot;&gt;official sentence&lt;&#x2F;a&gt; that explains the difference between containers and virtual machines:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Containers include the application and all of its dependencies, but share the kernel with other containers.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This specific sentence triggered something in my head. When you execute a program on your UNIX system, the system creates a special environment for that program. This environment contains everything needed for the system to run the program as if no other program were running on the system. It’s exactly the same! &lt;strong&gt;So a container should be thought of not as a virtual machine, but as a UNIX process!&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The application and its dependencies make up the image&lt;&#x2F;li&gt;
&lt;li&gt;Runtime configuration such as tokens or passwords is passed in at startup, for example through environment variables&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
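&lt;p&gt;As a minimal sketch of the second point, here is how an app could read its runtime configuration from environment variables instead of baking it into the image. The variable names APP_TOKEN and APP_PORT and the load_settings function are hypothetical, chosen only for this example.&lt;&#x2F;p&gt;

```python
import os


def load_settings(environ=os.environ):
    """Read runtime configuration from environment variables."""
    return {
        # APP_TOKEN and APP_PORT are hypothetical names used for illustration.
        "token": environ["APP_TOKEN"],  # required: fail fast if it is missing
        "port": int(environ.get("APP_PORT", "8080")),  # optional, with a default
    }


if __name__ == "__main__":
    # Stands in for what `docker run -e APP_TOKEN=secret ...` would provide.
    settings = load_settings({"APP_TOKEN": "secret"})
    print(settings["port"])
```

&lt;p&gt;The same image can then run in any environment, and only the variables passed at startup change.&lt;&#x2F;p&gt;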
&lt;h3 id=&quot;static-compilation&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#static-compilation&quot; aria-label=&quot;Anchor link for: static-compilation&quot;&gt;🔗&lt;&#x2F;a&gt;Static compilation&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;lets-talk-about-containers&#x2F;4.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Meet Go&lt;&#x2F;p&gt;
&lt;p&gt;Here’s an interesting fact: Go, the open-source programming language pushed by Google, &lt;strong&gt;supports statically linked apps&lt;&#x2F;strong&gt;. What a coincidence! That means that such a statically linked app talks directly to the kernel. &lt;strong&gt;Our Docker image can be empty&lt;&#x2F;strong&gt;, except for the binary and needed files like configuration. There’s a strange image on Docker Hub that you might have seen, which is called “scratch”:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can use Docker’s reserved, minimal image, scratch, as a starting point for building containers. Using the scratch “image” signals to the build process that you want the next command in the Dockerfile to be the first filesystem layer in your image. While scratch appears in Docker’s repository on the hub, you can’t pull it, run it, or tag any image with the name scratch. Instead, you can refer to it in your Dockerfile.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;That means that our Dockerfile now looks like this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;dockerfile&quot; style=&quot;background-color:#2b303b;color:#c0c5ce;&quot; class=&quot;language-dockerfile &quot;&gt;&lt;code class=&quot;language-dockerfile&quot; data-lang=&quot;dockerfile&quot;&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; scratch  
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#b48ead;&quot;&gt;ADD &lt;&#x2F;span&gt;&lt;span&gt;hello &#x2F;  
&lt;&#x2F;span&gt;&lt;span&gt;CMD [&quot;&#x2F;hello&quot;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So now, I finally have (I think) the right abstraction for a container! &lt;strong&gt;We have a container containing only our app&lt;&#x2F;strong&gt;. Can we go even further? The most interesting thing that I learned from (quickly) reading &lt;a href=&quot;https:&#x2F;&#x2F;static.googleusercontent.com&#x2F;media&#x2F;research.google.com&#x2F;en&#x2F;&#x2F;pubs&#x2F;archive&#x2F;43438.pdf&quot;&gt;&lt;em&gt;Large-scale cluster management at Google with Borg&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; is this:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Borg programs are statically linked to reduce dependencies on their runtime environment, and structured as packages of binaries and data files, whose installation is orchestrated by Borg.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Here’s the (final) answer! By truly coming back to the UNIX process point of view, we can abstract containers as UNIX processes. But we still need to handle them. So &lt;strong&gt;the role of Docker would be more like an operating system builder&lt;&#x2F;strong&gt; (nice name found by &lt;a href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;u&#x2F;58ea5a89aaae&quot;&gt;Quentin ADAM&lt;&#x2F;a&gt;). In conclusion, I think that Docker’s true success was to show developers that they can sandbox their apps easily, and now it’s our job to build better software and learn new design patterns.&lt;&#x2F;p&gt;
&lt;p&gt;Please feel free to react to this article: you can reach me on &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PierreZ&quot;&gt;Twitter&lt;&#x2F;a&gt;, or visit my &lt;a href=&quot;https:&#x2F;&#x2F;pierrezemb.fr&quot;&gt;website&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
          <category domain="tag">containers</category>
          <category domain="tag">docker</category>
          <category domain="tag">security</category>
          <category domain="tag">infrastructure</category>
      </item>
    </channel>
</rss>
