# EEVDF: How the Earliest Eligible Virtual Deadline First Scheduler Replaced CFS in Linux 6.6

*A technical deep-dive into the first wholesale rewrite of the Linux fair-class scheduler in sixteen years.*

## 1. Virtual Deadline Computation and the Eligibility Check

At the heart of EEVDF is a two-step selection rule that is considerably more principled than the heuristics CFS relied on: among all tasks that are currently *eligible*, pick the one with the earliest *virtual deadline*. The algorithm, first described in Ion Stoica and Hussein Abdel-Wahab's 1995 paper "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation," was adapted for Linux by Peter Zijlstra in a patch series first posted in March 2023 and merged in the 6.6 merge window.

The first half of the rule is the eligibility check. Each task accumulates a quantity called *lag*, defined as the difference between the service the task was entitled to (based on its weight share of the CPU) and the service it has actually received, measured in virtual time. A task whose lag is greater than or equal to zero has received no more than its fair share and is therefore eligible to run; a task whose lag has gone negative has been over-served and is temporarily *ineligible*, meaning the scheduler will not pick it until its lag climbs back to zero as virtual time advances. This check is what gives EEVDF its proportional-share guarantee: a task that runs ahead of its entitlement drops out of consideration until its peers have caught up.

The second half is the deadline. For every eligible task, the scheduler computes a virtual deadline `VD = V_e + r/w`, where `V_e` is the task's eligible virtual time, `r` is the requested slice length, and `w` is the task's weight. Short-slice tasks thus have closer virtual deadlines than long-slice tasks of the same weight, and the scheduler always picks the eligible task with the smallest `VD`.

The kernel documentation at docs.kernel.org/scheduler/sched-eevdf.html notes that this reduces to earliest-deadline-first scheduling over the *left half* of the runqueue's augmented red-black tree: specifically, over the subtree containing every entity whose virtual runtime is less than or equal to the weighted average virtual runtime maintained on the runqueue. The runqueue keeps min-deadline information augmented into the tree so that the "earliest eligible" lookup takes O(log n) time without a scan, which matters for runqueues containing thousands of entities.
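The two-step rule can be sketched in a few lines of Python. This is an illustrative model, not the kernel's code: the `Entity` class, the `eligible` and `pick_eevdf` helpers, and the example values are all hypothetical. It scans a plain list rather than walking an augmented tree, it treats an entity's current `vruntime` as `V_e`, and it scales `r/w` by 1024 so that a nice-0 weight counts as 1.

```python
from dataclasses import dataclass

NICE_0_WEIGHT = 1024  # assumed convention: load weight of a nice-0 task


@dataclass
class Entity:
    name: str
    weight: int     # w
    vruntime: int   # virtual time consumed so far, used here as V_e
    slice_ns: int   # requested slice r

    @property
    def virtual_deadline(self) -> int:
        # VD = V_e + r/w, with weights normalised so that nice 0 counts as 1
        return self.vruntime + self.slice_ns * NICE_0_WEIGHT // self.weight


def eligible(entity, runqueue) -> bool:
    # lag >= 0 is equivalent to vruntime <= weighted average vruntime.
    # Cross-multiplying keeps the comparison in exact integer arithmetic.
    total_w = sum(e.weight for e in runqueue)
    weighted = sum(e.weight * e.vruntime for e in runqueue)
    return entity.vruntime * total_w <= weighted


def pick_eevdf(runqueue):
    # Step 1: drop over-served entities. Step 2: earliest virtual deadline.
    candidates = [e for e in runqueue if eligible(e, runqueue)]
    return min(candidates, key=lambda e: e.virtual_deadline)


rq = [
    Entity("A", 1024, 1_000_000, 3_000_000),  # eligible, far deadline
    Entity("B", 1024, 1_500_000, 500_000),    # eligible, near deadline
    Entity("C", 1024, 4_000_000, 500_000),    # over-served, ineligible
]
print(pick_eevdf(rq).name)
```

In this example `B` wins even though `A` has the smaller `vruntime`, because `B` asked for a shorter slice; `C` is not considered at all until the average virtual time catches up with it.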

## 2. The Request Model: Slices and Requests vs. CFS Timeslices

The most important conceptual shift EEVDF imposes is that a task no longer runs "until a heuristic decides to preempt it"; instead it runs a *request* of its own chosen length. CFS, in contrast, computed a per-entity timeslice by dividing a tunable scheduling period (`sched_latency_ns`, typically 6–24 ms depending on CPU count) across the runnable tasks according to their weights, then forced preemption when either that slice was consumed or a wakeup satisfied the byzantine `wakeup_preempt_entity` heuristic. The resulting slice was entirely a function of how many other tasks happened to be runnable at that instant, which is why CFS exhibited such strongly load-dependent latency: adding a single extra CPU-bound task changed the slice of every other task on the runqueue.

EEVDF inverts that relationship. Each task carries its own `slice` attribute (its *request size*), and when that request is exhausted, a new request with a fresh, later virtual deadline is generated; until then the task either runs to the end of its slice or is preempted by an eligible task with an even earlier deadline. Crucially, the weighted fairness property is preserved *without* any global period tunable, because the virtual-deadline math `V_e + r/w` already encodes the trade-off: a task that asks for a short slice earns an earlier deadline and therefore higher scheduling priority *at the cost of* more frequent context switches and smaller chunks of CPU service, while a task that asks for a long slice gets pushed further into the future but runs in bigger, more cache-friendly bursts. This is what Peter Zijlstra meant in his cover letter when he said EEVDF "completely reworks the base scheduler, placement, preemption, picking -- everything. The only thing they have in common is that they're both a virtual time based scheduler."

The Linux Magazine "A Fair Slice" piece (issue 301, 2025) summarised the effect well: with EEVDF, responsiveness is something tasks *opt into* by choosing a shorter request, not something the kernel guesses at with period math.

One subtlety worth highlighting is how the request model interacts with sleepers. To prevent a task from gaming the system by sleeping briefly so that its lag is reset to zero, Zijlstra's implementation keeps sleeping tasks on the runqueue in a "deferred dequeue" state and lets negative lag decay only as virtual runtime advances. A task that was massively over-served and then slept for a microsecond therefore comes back with its negative lag almost entirely intact, and is correctly held ineligible until it has "paid back" its debt.
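The short-slice versus long-slice trade-off can be seen in a toy simulation. The sketch below is a simplified model under stated assumptions, not kernel behaviour: the `Task` class and `simulate` function are hypothetical, every pick runs one whole request with no mid-slice preemption, nobody sleeps, and both tasks have the same weight.

```python
from dataclasses import dataclass

NICE_0_WEIGHT = 1024  # assumed convention: load weight of a nice-0 task


@dataclass
class Task:
    name: str
    slice_us: int                # request size r, in microseconds
    weight: int = NICE_0_WEIGHT
    vruntime: int = 0            # virtual service received so far
    deadline: int = 0            # virtual deadline of the current request
    ran_us: int = 0              # real CPU time received

    def new_request(self) -> None:
        # VD = V_e + r/w, with weights normalised so that nice 0 counts as 1
        self.deadline = self.vruntime + self.slice_us * NICE_0_WEIGHT // self.weight


def simulate(tasks, picks):
    """Make `picks` scheduling decisions; each pick runs one whole request."""
    for t in tasks:
        t.new_request()
    order = []
    for _ in range(picks):
        total_w = sum(t.weight for t in tasks)
        weighted = sum(t.weight * t.vruntime for t in tasks)
        # Eligible means vruntime <= weighted average (compared exactly)
        eligible = [t for t in tasks if t.vruntime * total_w <= weighted]
        cur = min(eligible, key=lambda t: t.deadline)
        cur.ran_us += cur.slice_us
        cur.vruntime += cur.slice_us * NICE_0_WEIGHT // cur.weight
        cur.new_request()        # request exhausted: issue the next one
        order.append(cur.name)
    return order


short, long_ = Task("short", 1000), Task("long", 4000)
print(simulate([short, long_], picks=11))
print(short.ran_us, long_.ran_us)
```

In this model the short-slice task is picked four times for every pick of the long-slice task, yet their CPU time never drifts apart by more than one long request: the short task gets the CPU sooner and more often, the long task gets it in larger uninterrupted bursts, and neither gains share.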

## 3. `latency_nice` and the `sched_setattr()` Interface

Because a task's requested slice size now directly determines how quickly it can be scheduled, EEVDF needed a way for userspace to express latency preferences. The design that emerged, after a lively debate on lore.kernel.org about whether the kernel should expose a slice length directly or abstract it behind a latency hint, had two prongs.

The first prong is a latency hint. Under the latency-nice proposal, the existing `sched_setattr(2)` system call, which already carried `sched_nice`, `sched_priority`, the deadline-class fields, and the `sched_util_min`/`sched_util_max` clamps, gains a companion field: `sched_latency_nice`, a signed value in the conventional Linux nice range of [-20, 19], where lower (more negative) values indicate "I care about wakeup latency, please give me a shorter request and an earlier virtual deadline," and higher values indicate the opposite. Internally the kernel translates `latency_nice` into a per-`sched_entity` latency offset that biases the computed virtual deadline: a low `latency_nice` shortens the request and pulls the deadline earlier, so the task is preferred among eligible candidates and, after Zijlstra's "completing EEVDF" patch series (covered in LWN article 969062 from 2024), can even *preempt* a currently running task whose virtual deadline is later. Without that preemption change, a latency-sensitive task that woke up would still have had to wait for the running task's slice to drain before getting the CPU, which defeated the point of the whole interface.

The second prong is the slice itself. The patch series "sched: EEVDF and latency-nice and/or slice-attr" on lore.kernel.org (message ID `20230531115839.089944915@infradead.org`, posted by Zijlstra with Vincent Guittot) added the *option* of exposing the slice length directly via `sched_attr::sched_runtime`, which for `SCHED_NORMAL` tasks is reinterpreted as the requested base slice length. That lets power users bypass the `latency_nice` translation entirely and say, for example, "give me 500 µs requests". Zijlstra himself described this capability as the "primary motivator for finally finishing these patches," because EEVDF fundamentally supports per-task slice sizes and it would have been a waste to hide that behind only a nice-like dial. Of the two prongs, the slice attribute is the one to rely on in mainline: as of 6.12, the upstream `struct sched_attr` has no `sched_latency_nice` field, so the latency-nice dial exists only on kernels that carry those patches.

A companion `cpu.latency.nice` knob for the CPU cgroup controller was proposed alongside the per-task hint so that container orchestrators can set group-level latency priorities, with the group value acting as a clamp on tasks inside it. Together, these interfaces retire, or at least subsume, the long-standing "latency nice" patch series that had been circulating for years as an attempt to bolt latency hints onto CFS's heuristic-driven preemption logic, which was precisely the kind of "icky heuristics code" Zijlstra said he wanted to delete.
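As a rough illustration of the `sched_runtime` path, the sketch below packs a `struct sched_attr` by hand and passes it to the raw `sched_setattr` system call. Several things here are assumptions rather than facts from the sources above: the helper names `pack_sched_attr` and `request_slice` are hypothetical, the 56-byte layout covers the fields up to `sched_util_max`, only the x86_64 and aarch64 syscall numbers are listed, and the call only changes the slice on a kernel recent enough to interpret `sched_runtime` for `SCHED_NORMAL` tasks. The kernel may also clamp the requested value to its own bounds.

```python
import ctypes
import os
import platform
import struct

SCHED_NORMAL = 0  # also known as SCHED_OTHER

# Assumed layout of struct sched_attr up to sched_util_max (56 bytes):
# u32 size, u32 policy, u64 flags, s32 nice, u32 priority,
# u64 runtime, u64 deadline, u64 period, u32 util_min, u32 util_max
SCHED_ATTR_FMT = "=IIQiIQQQII"

# Syscall numbers for sched_setattr; only these two ABIs are covered here.
NR_SCHED_SETATTR = {"x86_64": 314, "aarch64": 274}


def pack_sched_attr(slice_ns: int, nice: int = 0) -> bytes:
    """Build a sched_attr asking for SCHED_NORMAL with a custom slice."""
    size = struct.calcsize(SCHED_ATTR_FMT)
    return struct.pack(
        SCHED_ATTR_FMT,
        size, SCHED_NORMAL, 0, nice, 0, slice_ns, 0, 0, 0, 0,
    )


def request_slice(pid: int, slice_ns: int, nice: int = 0) -> None:
    """Ask for slice_ns-long requests (pid 0 means the calling thread).

    sched_nice is applied as well, so pass the task's current nice value
    if it should stay unchanged.
    """
    nr = NR_SCHED_SETATTR[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    buf = ctypes.create_string_buffer(pack_sched_attr(slice_ns, nice))
    rc = libc.syscall(ctypes.c_long(nr), ctypes.c_int(pid), buf, ctypes.c_uint(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


# The "give me 500 µs requests" case from the text, as a packed buffer:
blob = pack_sched_attr(500_000)
print(len(blob), struct.unpack(SCHED_ATTR_FMT, blob)[5])
```

Calling `request_slice(0, 500_000)` would then submit that request for the current thread. Packing the structure separately from issuing the syscall keeps the layout easy to inspect and test without touching the scheduler.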

## 4. Comparison to CFS vruntime Accounting and Why EEVDF Gives Better Latency Guarantees

CFS tracks fairness through a single per-entity counter called `vruntime`: the wall-clock time a task has spent on the CPU, scaled by the inverse of its weight, so that a nice-0 task accumulates vruntime at one unit per nanosecond while a nice-19 task accumulates it much faster and a nice -20 task much slower. CFS picks the task with the *smallest* `vruntime` on the runqueue, which is sufficient to guarantee proportional fairness *in the limit* but provides no bound on how long any individual task has to wait before running again.

The worst-case wakeup latency in CFS depends on two things: the global `sched_latency` / `sched_min_granularity` tunables (how long a slice can be before the scheduler forces preemption) and the `sched_wakeup_granularity` tunable (how much vruntime advantage a waking task must have before it can preempt a currently running one). Neither can be set per task, and both were, in practice, tuned to compromise values that worked tolerably for most workloads but left latency-sensitive and throughput-sensitive tasks fighting each other.

Worse, CFS had no notion of *eligibility*: a task that had been starved for a long time would accumulate a small vruntime and be picked next, but a task that had been briefly over-served had no mechanism to be held back other than the fact that its vruntime was now larger than its peers'. This made it possible for a task that had just burned through a long slice to immediately win the next scheduling decision if the other contenders' vruntimes had drifted, producing unfair bursts.

EEVDF replaces all of that with the lag/virtual-deadline framework described above. Lag gives a symmetric, bounded notion of "how far ahead or behind your entitlement are you"; the eligibility check ensures that nobody is picked while over-served; and the virtual deadline gives a *two-dimensional* priority, weight *and* requested slice, that is finally expressive enough to say "I care about latency more than I care about throughput" without lying about weight. Zijlstra argued on lore.kernel.org that this yields a much tighter lag bound than CFS ever achieved, and Jonathan Corbet's LWN coverage (article 925371) noted that initial benchmarks showed "EEVDF schedules a lot more consistently than CFS and has a bunch of latency wins", with smaller variance being one of the most consistently observed effects.

The upshot is that EEVDF's latency guarantees are derived from the algorithm itself, not from a set of tunables layered on top, which is why the merge stripped out a swathe of CFS's tunables rather than porting them: `sched_latency_ns`, `sched_wakeup_granularity_ns`, and a handful of others were removed, while `sched_min_granularity_ns` survives only in renamed form as `base_slice_ns`.
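The vruntime scaling at the start of this section is simple arithmetic, sketched below. The helper names `vruntime_delta` and `cfs_pick` are hypothetical; the three weights are the nice -20, 0, and 19 entries of the kernel's nice-to-weight table, and 1024 is the nice-0 reference weight.

```python
NICE_0_WEIGHT = 1024

# Three entries from the kernel's nice-to-weight table
NICE_TO_WEIGHT = {-20: 88761, 0: 1024, 19: 15}


def vruntime_delta(delta_exec_ns: int, nice: int) -> int:
    # Wall-clock runtime scaled by the inverse of the task's weight
    return delta_exec_ns * NICE_0_WEIGHT // NICE_TO_WEIGHT[nice]


def cfs_pick(vruntimes: dict) -> str:
    # CFS's whole selection rule: smallest vruntime wins, with no
    # eligibility check and no per-task deadline
    return min(vruntimes, key=vruntimes.get)


# Virtual time charged for 1 ms of real CPU time at each nice level
for nice in (-20, 0, 19):
    print(nice, vruntime_delta(1_000_000, nice))

print(cfs_pick({"batch": 5_000_000, "interactive": 3_000_000}))
```

One millisecond on the CPU costs a nice-0 task exactly 1,000,000 units of virtual time, a nice-19 task roughly 68 times as much, and a nice -20 task only about 1.2% as much. That single scaled counter is all CFS has to work with, which is why it cannot distinguish "needs the CPU soon" from "deserves a larger share".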

## 5. Tuning Tradeoffs, Regressions, and Workload Impact

No scheduler rewrite of this scale lands without regressions, and the EEVDF merge message was unusually frank about it: Ingo Molnar's scheduler pull request for 6.6, as merged by Linus Torvalds, included the explicit warning that "EEVDF inevitably changes workload scheduling in a binary fashion, hopefully for the better in the overwhelming majority of cases, but in some cases it won't, especially in adversarial loads that got lucky with the previous code, such as some variants of hackbench." That prediction turned out to be accurate. The most frequently reported regressions clustered around microbenchmarks (hackbench, schbench in certain modes, some sysbench OLTP configurations) that happened to benefit from CFS's specific preemption-granularity heuristics: workloads where the "wrong" choice of when to preempt happened to align with a particular cache-warmth pattern.

Kernel developers including Mel Gorman, Vincent Guittot, and Shrikanth Hegde tracked these down through successive point releases, and LWN's "Completing the EEVDF scheduler" (article 969062) documented the fixes that landed between 6.6 and 6.12, notably around how lag is managed across task migrations between CPUs, how cgroups interact with the new accounting, and the aforementioned "preempt on earlier deadline" change that closed a latency hole for short-slice tasks waking while a long-slice task was holding the CPU. Linux 6.12, released in late 2024, is generally considered the point at which the EEVDF transition was *complete*: the per-task slice attribute became settable from userspace in a form considered stable, the deferred-dequeue handling for sleeping tasks was finalised, and most of the merged-but-disabled pieces from 6.6 were enabled by default.

On the workload side, the clearest wins have been where Zijlstra predicted: database and analytics workloads on high-core-count machines (Phoronix's benchmarks on a 128-core EPYC 9754 "Bergamo" showed substantial improvements between 6.5 and 6.6 for PostgreSQL, ClickBench, and several OLTP suites), container and CI runner workloads with many small bursty tasks, and interactive/desktop workloads where the reduced variance shows up as visibly smoother responsiveness under load. The losses have mostly been in tightly tuned HPC and messaging benchmarks that had been implicitly calibrated against CFS's specific behaviour, and in a small number of workloads where the new eligibility rule produced slightly lower throughput in exchange for the tighter latency bound, which is precisely the tradeoff the `sched_runtime` slice interface (and the latency-nice hint, where available) exists to let users make.

The practical tuning guidance that emerged is simple:

- Leave the scheduler alone for ordinary workloads. EEVDF exposes far fewer knobs than CFS by design, and the defaults are the result of genuine algorithmic reasoning rather than averaged-out heuristics.
- For threads that really do need sub-millisecond wakeup response, request a shorter slice through `sched_runtime` in `sched_setattr(2)`; on kernels that carry the latency-nice patches, a negative `latency_nice` or the `cpu.latency.nice` cgroup file serves the same purpose.
- Override the default slice only when you have a specific, measured reason to do so.

After almost two years in the mainline kernel and the completion of the transition in 6.12, EEVDF has largely delivered on its promise: a better-defined scheduling policy, tighter latency bounds, fewer tunables, and a principled way for userspace to express latency-versus-throughput preferences, at the cost of a handful of narrowly scoped regressions that continue to be ironed out release by release.

---

## Sources

- Jonathan Corbet, ["An EEVDF CPU scheduler for Linux"](https://lwn.net/Articles/925371/), LWN.net, March 9, 2023
- Jonathan Corbet, ["Completing the EEVDF scheduler"](https://lwn.net/Articles/969062/), LWN.net, 2024
- ["EEVDF Scheduler"](https://docs.kernel.org/scheduler/sched-eevdf.html), Linux Kernel Documentation
- Peter Zijlstra, ["[PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr"](https://lore.kernel.org/lkml/20230531115839.089944915@infradead.org/), lore.kernel.org, May 31, 2023
- Peter Zijlstra, ["[PATCH 00/10] sched: EEVDF using latency-nice"](https://lore.kernel.org/lkml/20230321160458.GB2273492@hirez.programming.kicks-ass.net/t/), lore.kernel.org, March 2023
- Michael Larabel, ["EEVDF Scheduler Merged For Linux 6.6"](https://www.phoronix.com/news/Linux-6.6-EEVDF-Merged), Phoronix
- ["A Fair Slice"](https://www.linux-magazine.com/Issues/2025/301/EEVDF), Linux Magazine issue 301, 2025
- ["Linux 6.12: Scheduler now expandable and EEVDF conversion complete"](https://www.heise.de/en/news/Linux-6-12-Scheduler-now-expandable-and-EEVDF-conversion-complete-9949941.html), heise online
- Ion Stoica and Hussein Abdel-Wahab, "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation," 1995
- [Linux 6.6 Kernel Newbies page](https://kernelnewbies.org/Linux_6.6)
