Diving Into is a blog post series where we dig into specific parts of a project's codebase. In this episode: Kubernetes' watch cache implementation.
While debugging an etcd-shim on FoundationDB, I kept hitting "Timeout: Too large resource version" errors. The cache was stuck at revision 3044, but clients requested 3047. Three seconds later: timeout. This led me into the watch cache internals: specifically the 3-second timeout in waitUntilFreshAndBlock() and how progress notifications solve the problem. Let's dig into how it actually works.
Note: Yes, Clever Cloud runs an etcd-shim on top of FoundationDB for Kubernetes. Truth is, we're not alone: AWS and GKE have custom storage layers too. After operating etcd at OVHcloud, we chose a different path. I actually wrote a naive PoC during COVID (fdb-etcd) without testing it against a real apiserver 😅. It was mostly an excuse to discover the Record-Layer. You can read more about the technical challenges in this FoundationDB forum discussion.
🔗Overview of the Watch Cache
When I first looked at the watch cache implementation, I expected a single monolithic cache sitting between the apiserver and etcd. It took compiling my own apiserver with additional logging to realize the architecture is more interesting: each resource type gets its own independent Cacher instance. Pods have one. Services have another. Deployments get their own. Every resource group runs an isolated LIST+WATCH loop, maintaining its own in-memory cache.
As the Kubernetes 1.34 blog post explains, this enhancement allows the API server to serve consistent read requests directly from the watch cache, significantly reducing the load on etcd and improving overall cluster performance.
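To picture the layout, here's a tiny, purely illustrative sketch (the Cacher struct and newCacher constructor are invented for this example, not the real apiserver types): one independent cache per resource, each with its own revision.

```go
package main

import "fmt"

// Cacher stands in for the apiserver's per-resource cache: each instance has
// its own in-memory store, its own LIST+WATCH loop, and its own revision.
type Cacher struct {
	resource        string
	resourceVersion uint64
}

// newCacher is a made-up constructor for this illustration.
func newCacher(resource string) *Cacher { return &Cacher{resource: resource} }

func main() {
	// One cache per resource type: heavy pod churn advances only the pods
	// cache, while the namespaces cache can sit at an older revision.
	cachers := map[string]*Cacher{
		"pods":       newCacher("pods"),
		"services":   newCacher("services"),
		"namespaces": newCacher("namespaces"),
	}
	cachers["pods"].resourceVersion = 3047
	cachers["namespaces"].resourceVersion = 3044
	for name, c := range cachers {
		fmt.Printf("%-10s cache at revision %d\n", name, c.resourceVersion)
	}
}
```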
🔗Architecture
```
Client Requests (kubectl, controllers)
        ↓
Cacher (per resource): in-memory watch cache
        ↓ (on cache miss, delegates to)
etcd3/Store
        ↓
etcd / etcd-shim
```
The main components:
- Cacher: the per-resource front end that serves reads and watches from memory
- watchCache: the in-memory store behind each Cacher, holding the objects and the revision the cache has caught up to
- Reflector: the LIST+WATCH loop that feeds the watchCache
- etcd3/Store: the storage layer the Cacher delegates to on a cache miss
- progressRequester: the background helper that asks etcd for progress notifications when the cache needs to catch up
🔗How The Cache Gets Fed
🔗Initialization: The LIST Phase
Nothing works until the cache initializes. When a Cacher starts, every read for that resource blocks until initialization completes. This matters because initialization isn't instant: it's a paginated LIST operation fetching 10,000 items per page. For a large cluster with thousands of pods, this takes time.
Here's the sequence: The Reflector pattern kicks off with a complete LIST operation. Each resource cache fetches all existing objects through paginated requests. Once the LIST completes, watchCache.Replace() populates the in-memory cache with these objects. The critical moment happens when the SetOnReplace() callback fires (cacher.go:468-478), marking the cache as READY. Until that callback fires, every request for that resource waits.
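To make that READY gating concrete, here's a deliberately simplified sketch of the pattern, not the real cacher code (the readyCache type and its methods are invented for illustration): reads block on a condition variable until an initial Replace() has populated the store.

```go
package main

import (
	"fmt"
	"sync"
)

// readyCache is a toy illustration of the gating behavior described above:
// every read blocks until an initial Replace() has populated the cache.
type readyCache struct {
	mu    sync.Mutex
	cond  *sync.Cond
	ready bool
	items map[string]string
}

func newReadyCache() *readyCache {
	c := &readyCache{items: map[string]string{}}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Replace plays the role of watchCache.Replace(): it installs the result of
// the initial LIST and flips the cache to READY, waking any blocked readers.
func (c *readyCache) Replace(listed map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = listed
	c.ready = true
	c.cond.Broadcast()
}

// Get blocks until the cache is READY, mirroring how every request for a
// resource waits for the SetOnReplace() callback to fire.
func (c *readyCache) Get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	for !c.ready {
		c.cond.Wait()
	}
	return c.items[key]
}

func main() {
	c := newReadyCache()
	go c.Replace(map[string]string{"default/nginx": "Running"}) // initial LIST completes
	fmt.Println(c.Get("default/nginx"))                         // blocks until Replace fires
}
```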
🔗Continuous Sync: The WATCH Phase
After initialization, the real trick begins: the cache maintains synchronization through a Watch stream that starts at LIST revision + 1. This guarantees no events are missed between the LIST and WATCH operations. The watch picks up exactly where the list left off. Events flow from etcd through a buffered channel (capacity: 100 events) and are processed by the dispatchEvents() goroutine, which runs continuously, matching events to interested watchers.
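Here is a stripped-down sketch of that hand-off, under the assumption of a generic storage client (the fakeStorage type and event struct are invented for illustration, not the real cacher types): the watch starts at listRV+1 and events flow through a small buffered channel into a dispatch loop.

```go
package main

import "fmt"

// event is a stand-in for a watchCacheEvent.
type event struct {
	rv  uint64
	key string
}

// fakeStorage is a hypothetical storage backend: List returns a snapshot
// revision, Watch streams events starting at a given revision.
type fakeStorage struct{}

func (fakeStorage) List() (items []string, rv uint64) {
	return []string{"default/nginx"}, 3044
}

func (fakeStorage) Watch(fromRV uint64) <-chan event {
	ch := make(chan event, 1)
	go func() {
		defer close(ch)
		ch <- event{rv: fromRV, key: "default/nginx"} // first event right after the LIST
	}()
	return ch
}

func main() {
	s := fakeStorage{}
	_, listRV := s.List()

	// Watch starts at listRV+1 so no revision can fall in the gap between
	// the LIST and the WATCH.
	incoming := make(chan event, 100) // like the cacher's buffered channel
	go func() {
		defer close(incoming)
		for ev := range s.Watch(listRV + 1) {
			incoming <- ev
		}
	}()

	// dispatchEvents-style loop: consume events, update the cache's revision,
	// and (in the real cacher) fan events out to interested watchers.
	cacheRV := listRV
	for ev := range incoming {
		cacheRV = ev.rv
		fmt.Printf("applied %s, cache now at revision %d\n", ev.key, cacheRV)
	}
}
```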
This pattern depends on continuous event flow. When events stop arriving, when resources go quiet, that's when progress notifications become essential. See Reflector documentation for the complete pattern.
🔗The Problem: "Timeout: Too large resource version"
While debugging our etcd-shim, we kept hitting this error:
Error getting keys: err="Timeout: Too large resource version: 3047, current: 3044"
A client was requesting ResourceVersion 3047, but the cache only knew about revision 3044. The cache would wait... and time out after 3 seconds.
🔗Understanding Cache Freshness
🔗The Freshness Check
When a client requests a consistent read at a specific ResourceVersion, Kubernetes needs to ensure the cache is "fresh enough" to serve that request. Here's the check: is my current revision at least as high as the requested revision? If not, it calls waitUntilFreshAndBlock() with a 3-second timeout, waiting for Watch events to bring the cache up to date.
From cacher.go:1257-1261:
```go
if c.watchCache.notFresh(requestedWatchRV) {
	c.watchCache.waitingUntilFresh.Add()
	defer c.watchCache.waitingUntilFresh.Remove()
}
err := c.watchCache.waitUntilFreshAndBlock(ctx, requestedWatchRV)
```
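For context, notFresh() boils down to comparing the requested revision against the one the cache has caught up to. The exact body in watch_cache.go may differ slightly; conceptually it is just:

```go
// Simplified sketch: the cache is "not fresh" when the client asks for a
// revision newer than anything the cache has seen so far.
func (w *watchCache) notFresh(resourceVersion uint64) bool {
	w.RLock()
	defer w.RUnlock()
	return resourceVersion > w.resourceVersion
}
```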
The actual timeout implementation (watch_cache.go:448-488):
```go
func (w *watchCache) waitUntilFreshAndBlock(ctx context.Context, resourceVersion uint64) error {
	startTime := w.clock.Now()
	defer func() {
		if resourceVersion > 0 {
			metrics.WatchCacheReadWait.WithContext(ctx).WithLabelValues(w.groupResource.Group, w.groupResource.Resource).Observe(w.clock.Since(startTime).Seconds())
		}
	}()

	// In case resourceVersion is 0, we accept arbitrarily stale result.
	// As a result, the condition in the below for loop will never be
	// satisfied (w.resourceVersion is never negative), this call will
	// never hit the w.cond.Wait().
	// As a result - we can optimize the code by not firing the wakeup
	// function (and avoid starting a gorotuine), especially given that
	// resourceVersion=0 is the most common case.
	if resourceVersion > 0 {
		go func() {
			// Wake us up when the time limit has expired. The docs
			// promise that time.After (well, NewTimer, which it calls)
			// will wait *at least* the duration given. Since this go
			// routine starts sometime after we record the start time, and
			// it will wake up the loop below sometime after the broadcast,
			// we don't need to worry about waking it up before the time
			// has expired accidentally.
			<-w.clock.After(blockTimeout)
			w.cond.Broadcast()
		}()
	}

	w.RLock()
	span := tracing.SpanFromContext(ctx)
	span.AddEvent("watchCache locked acquired")
	for w.resourceVersion < resourceVersion {
		if w.clock.Since(startTime) >= blockTimeout {
			// Request that the client retry after 'resourceVersionTooHighRetrySeconds' seconds.
			return storage.NewTooLargeResourceVersionError(resourceVersion, w.resourceVersion, resourceVersionTooHighRetrySeconds)
		}
		w.cond.Wait()
	}
	span.AddEvent("watchCache fresh enough")
	return nil
}
```
If the cache can't catch up within those 3 seconds, the request times out.
If you've ever seen kubectl commands hang for exactly 3 seconds before returning data, this is why. The cache is waiting for events that will never come.
🔗The Problem with Quiet Resources
This is where things get tricky. For infrequently-updated resources (namespaces, configmaps, etc.):
| Time | Component | Event | Cache RV | etcd RV | Notes |
|---|---|---|---|---|---|
| T0 | Namespace cache | Idle, no changes | 3044 | 3044 | No namespace changes for 5 minutes |
| T1 | Pod/Service caches | Resources changing | - | 3047 | Global etcd revision advances |
| T2 | Namespace watch | Receives nothing | 3044 | 3047 | No namespace events to process |
| T3 | Namespace cache | Still waiting | 3044 | 3047 | Cache stuck, unaware of global progress |
| T4 | Client | Lists pods successfully | - | 3047 | Response includes current RV 3047 |
| T5 | Client | Requests namespace read at RV ≥ 3047 | - | 3047 | Consistent read requirement |
| T6 | Namespace cache | waitUntilFreshAndBlock() | 3044 | 3047 | "I'm at 3044, need 3047... waiting" |
| T7 | Namespace cache | Timeout! | 3044 | 3047 | 3 seconds elapsed, returns error |
The cache has no way to know if etcd has moved forward. Is the system healthy? Is something broken? It just sees... nothing.
🔗Timeout Behavior Summary
| Scenario | Cache RV | Requested RV | Result |
|---|---|---|---|
| Fresh cache | 3047 | 3045 | ✓ Serve immediately |
| Stale cache | 3044 | 3047 | ⏱ Wait 3s → timeout |
| With progress | 3044 | 3047 | ✓ RequestProgress → serve |
🔗Progress Notifications: Keeping Quiet Resources Fresh
🔗What Are Progress Notifications?
Here's the trick: progress notifications are empty Watch responses that only update the revision:
```
WatchResponse {
    Header: { Revision: 3047 },  // Current etcd revision
    Events: []                   // No actual data changes
}
```
They solve the quiet resource problem by telling the cache: "etcd is now at revision X, even though your resource hasn't changed."
This is exactly what we had forgotten to implement in our etcd-shim. We handled regular Watch events perfectly, but didn't support progress notifications. The result? Kubernetes' watch cache would time out waiting for revisions that would never arrive through normal events. Once we added RequestProgress support and started sending these empty bookmark responses, the timeouts disappeared.
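From the client side, the mechanism is easy to observe with etcd's own Go client. The sketch below (endpoint and key prefix are placeholders, and it assumes an etcd or shim is reachable) opens a watch on a quiet prefix, then asks the server for a progress notification with RequestProgress():

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Watch a prefix that sees no traffic, like a quiet resource.
	wch := cli.Watch(ctx, "/registry/namespaces/", clientv3.WithPrefix())

	// Ask the server to send a progress notification on this watch stream:
	// an empty WatchResponse that only carries the current revision.
	if err := cli.RequestProgress(ctx); err != nil {
		panic(err)
	}

	for wresp := range wch {
		if wresp.IsProgressNotify() {
			fmt.Printf("progress notification: revision %d, %d events\n",
				wresp.Header.Revision, len(wresp.Events))
			return
		}
	}
}
```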
🔗Two Mechanisms
🔗1. On-Demand: RequestWatchProgress()
When the cache needs to catch up, it can explicitly request a progress notification. See store.go:99-103:
```go
func (s *store) RequestWatchProgress(ctx context.Context) error {
	return s.client.RequestProgress(s.watchContext(ctx))
}
```
When called, etcd responds with a bookmark (also called a progress notification) containing the current revision. The cache at revision 3044 calls RequestProgress(), receives { Revision: 3047, Events: [] }, and immediately updates its internal state to 3047.
The progress notification is detected in the watch stream (watcher.go:401-404):
```go
// Handle progress notifications (bookmarks)
if wres.IsProgressNotify() {
	wc.queueEvent(progressNotifyEvent(wres.Header.GetRevision()))
	metrics.RecordEtcdBookmark(wc.watcher.groupResource)
	continue
}
```
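Inside the cache, applying that event is conceptually just "bump the revision and wake anyone blocked in waitUntilFreshAndBlock()". Here's a simplified sketch of that behavior (an illustrative reduction, not the actual method in watch_cache.go):

```go
// Simplified sketch of how a progress notification lands in the cache:
// no objects change, only the revision moves forward, and any goroutine
// blocked in waitUntilFreshAndBlock() is woken up by the broadcast.
func (w *watchCache) updateResourceVersion(rv uint64) {
	w.Lock()
	defer w.Unlock()
	if rv > w.resourceVersion {
		w.resourceVersion = rv
		w.cond.Broadcast() // wakes the `for w.resourceVersion < resourceVersion` loop
	}
}
```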
🔗2. Proactive: Periodic Progress Requests
Kubernetes also runs a background component called progressRequester. Rather than blindly polling, it keys off the waitingUntilFresh counter we saw earlier: while consistent reads are blocked waiting for the cache to catch up, it periodically calls RequestProgress() so that even a completely idle resource gets a fresh revision. This resolves the wait well before the 3-second timeout fires.
The progress requester is initialized when the Cacher is created (cacher.go:425-428):
```go
progressRequester := progress.NewConditionalProgressRequester(
	config.Storage.RequestWatchProgress, // The function to call
	config.Clock,
	contextMetadata,
)
```
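A minimal sketch of what such a conditional requester does, assuming a requestWatchProgress function and a waiting counter analogous to waitingUntilFresh (names, struct, and the 100ms period are simplifications, not the real kube-apiserver code):

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// progressRequester is a simplified analogue of the conditional progress
// requester: while readers are waiting for the cache to catch up, it
// periodically asks storage for a progress notification.
type progressRequester struct {
	waiting              atomic.Int64                    // incremented/decremented like waitingUntilFresh
	requestWatchProgress func(ctx context.Context) error // e.g. store.RequestWatchProgress
}

func (p *progressRequester) Run(ctx context.Context) {
	ticker := time.NewTicker(100 * time.Millisecond) // assumed period for the sketch
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Only bother etcd when someone is actually blocked on freshness.
			if p.waiting.Load() > 0 {
				if err := p.requestWatchProgress(ctx); err != nil {
					fmt.Println("progress request failed:", err)
				}
			}
		}
	}
}

func main() {
	p := &progressRequester{
		requestWatchProgress: func(ctx context.Context) error {
			fmt.Println("RequestProgress sent to etcd")
			return nil
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()

	p.waiting.Add(1) // a consistent read is waiting for a newer revision
	p.Run(ctx)
}
```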
🔗The Complete Flow
Timeline showing how progress notifications solve the timeout:
| Time | Component | Action | Cache RV | etcd RV | Details |
|---|---|---|---|---|---|
| T0 | Namespace watch | Established | 3044 | 3044 | No namespace changes happening |
| T1 | Pod resources | Creates/updates | 3044 | 3047 | Namespace watch: silent, cache stuck at 3044 |
| T2 | Client | Requests namespace LIST at RV 3047 | 3044 | 3047 | notFresh(3047) → true, starts waitUntilFreshAndBlock() |
| T3 | progressRequester | Detects quiet watch | 3044 | 3047 | Calls RequestProgress() on namespace watch stream |
| T4 | etcd | Sends progress notification | 3044 | 3047 | WatchResponse { Header: { Revision: 3047 }, Events: [] } |
| T5 | Namespace cache | Processes bookmark | 3047 | 3047 | Updates internal revision 3044 → 3047, signals waiters |
| T6 | Namespace cache | Returns successfully | 3047 | 3047 | waitUntilFreshAndBlock() completes, request served from cache |
🔗Key Takeaways
Here's what you need to know: Kubernetes runs a separate watch cache for each resource type (pods, services, deployments, etc.), and each one maintains its own LIST+WATCH loop. When you request a consistent read, the cache performs a freshness check via waitUntilFreshAndBlock(), waiting up to 3 seconds for Watch events to bring it up to date. For quiet resources, no such events ever arrive on their own, so without help that wait ends in a 3-second hang followed by a "Too large resource version" error.
Progress notifications solve the critical problem of quiet resources: those that don't receive updates for extended periods. These empty Watch responses update the cache's revision without transferring data. Kubernetes implements this through two mechanisms: on-demand (explicit RequestProgress calls when the cache needs to catch up) and proactive (periodic monitoring by the progressRequester component).
Without progress notifications, consistent reads must bypass the cache entirely and go directly to etcd, significantly increasing load on the storage layer. This is the difference between a responsive cluster and one where every kubectl command feels sluggish.
🔗Related Posts
If you enjoyed this deep dive into Kubernetes watch caching, you might also be interested in:
- Notes about ETCD - An overview and collection of resources about etcd, the distributed key-value store that powers Kubernetes
- Diving into ETCD's linearizable reads - A deep dive into how etcd implements linearizable reads using Raft consensus
Feel free to reach out with any questions or to share your experiences with Kubernetes watch caching. You can find me on Twitter, Bluesky or through my website.