The Linux 6.8 Scheduler: State of the Art

A technical survey of EEVDF, sched_ext, cgroup v2 CPU control, and latency behavior in Linux 6.8 — with a look at what changed from 6.6 and where the scheduler is heading. Raw markdown.

1. EEVDF: The Quiet Revolution that Replaced CFS

For roughly sixteen years the Completely Fair Scheduler (CFS), introduced by Ingo Molnár in 2.6.23, was the default fair-class scheduler in Linux. Kernel 6.6, released in October 2023, ended that era by merging Peter Zijlstra's long-gestating EEVDF (Earliest Eligible Virtual Deadline First) implementation. By the time kernel 6.8 shipped in March 2024, EEVDF had absorbed two rounds of stabilization fixes and was the unambiguous default for SCHED_NORMAL and SCHED_BATCH tasks. Structurally, EEVDF keeps much of CFS's machinery — tasks are still organized per-CPU in a red-black tree keyed by a virtual runtime, and the per-runqueue weighted average cfs_rq->avg_vruntime is still maintained — but the selection rule is different. Instead of simply picking the leftmost task (smallest vruntime), EEVDF picks the eligible task with the earliest virtual deadline. A task is eligible when its accumulated vruntime is less than or equal to the weighted mean V of the runqueue; the quantity vlag = V - vruntime is the task's lag, literally the service it is owed (positive) or has been loaned (negative). Lag is conserved across sleep and wakeup: when a task sleeps, its lag is renormalized against the queue's forward-moving V so it does not accumulate unbounded credit, a correctness property that older CFS heuristics like GENTLE_FAIR_SLEEPERS only approximated. Each runnable task has a "request" — a slice of CPU it asks for before being preempted — whose size is derived from sysctl_sched_base_slice (replacing the old sched_latency_ns / sched_min_granularity_ns pair) divided across the task's proportional weight. The virtual deadline is eligible_time + request / weight. Smaller requests yield earlier deadlines, which is how the new latency_nice attribute, exposed via sched_setattr(2) with SCHED_FLAG_LATENCY_NICE, buys lower wakeup latency at the cost of throughput: a task with latency_nice = -20 gets a deadline offset that lets it preempt neighbors sooner, whereas latency_nice = 19 pushes its deadline out and lets background work coalesce. The knobs in /sys/kernel/debug/sched/ have shrunk correspondingly: base_slice_ns is the main tunable, sched_wakeup_granularity_ns is gone, and features exposes boolean toggles like PLACE_LAG, PLACE_DEADLINE_INITIAL, RUN_TO_PARITY, and NEXT_BUDDY, each gating a specific EEVDF placement or preemption heuristic.

2. sched_ext: A BPF Escape Hatch for Scheduling Policy

Running in parallel with EEVDF's stabilization, the sched_ext ("SCX") framework, shepherded by Tejun Heo and David Vernet, matured on the linux-sched-ext tree throughout the 6.6–6.8 cycle. In the 6.8 timeframe sched_ext was not yet in mainline — it would be merged in 6.12 — but it was already shipping in Ubuntu HWE backports and in distributions like CachyOS, and the patchset was close enough to mainline that its ABI is worth describing as part of the "state of the 6.8 scheduler" conversation. Sched_ext adds a new scheduling class, ext_sched_class, positioned between idle_sched_class and fair_sched_class. When no BPF scheduler is attached it is a no-op; when a user loads a struct_ops BPF program of type sched_ext_ops, that program receives callbacks for the core scheduling events — enqueue, dequeue, dispatch, running, stopping, select_cpu, update_idle, and cgroup hooks — and is free to implement whatever policy it wants. The framework provides "DSQs" (dispatch queues) as primitives: the BPF scheduler can create any number of global or per-CPU DSQs, move tasks into them with scx_bpf_dispatch(), and consume from them with scx_bpf_consume(). A watchdog enforces forward progress — if a task goes unrunnable for longer than the configured timeout the kernel ejects the BPF scheduler and falls back to EEVDF, making misbehavior survivable in a way that out-of-tree scheduler patches have never been. The out-of-tree scx repository ships several reference schedulers. scx_rusty is a multi-domain, user-space-assisted scheduler written partly in Rust that does load balancing between NUMA-aware domains from userspace while the BPF component handles the hot-path dispatch. scx_lavd ("Latency-criticality Aware Virtual Deadline") targets desktop and gaming workloads by dynamically detecting interactive tasks via wakeup frequency and runtime history and giving them shorter virtual deadlines. scx_layered lets an administrator declaratively partition tasks into "layers" by cgroup, comm, or nice, and attach different policies to each. scx_central centralizes all dispatch decisions on one CPU to minimize cross-CPU cache contention for latency-sensitive workloads. Use cases range from mobile/laptop battery optimization, to gaming (Steam Deck–style interactive preemption), to datacenter experimentation where a fleet can A/B a scheduler change without a kernel rebuild. The broader significance is architectural: sched_ext converts scheduler policy into userspace-debuggable, hot-swappable code, drastically lowering the cost of experimentation.

3. The cgroup v2 CPU controller under EEVDF

The cgroup v2 cpu controller presents a small set of interface files and expects the hierarchical-weight model of the unified hierarchy. cpu.weight takes an integer from 1 to 10000, default 100, representing proportional share; internally it is mapped through sched_prio_to_weight[] and used as the weight of a group-level sched_entity in the parent cfs_rq. Under EEVDF this has a crisper semantic than under classic CFS: a group's vruntime advances at a rate inversely proportional to its weight, and the group entity has its own eligibility time and deadline, so the EEVDF selection at the group level reproduces the "pick the earliest-deadline eligible child" rule recursively down the hierarchy. The legacy cpu.weight.nice file exposes the same weight through the -20..+19 nice scale for scripts ported from cgroup v1. cpu.max is the hard bandwidth limiter, a pair "quota period" in microseconds (default max 100000), implemented via the bandwidth control machinery: each period, the group is refilled with quota microseconds of runtime, and when it exhausts that budget its cfs_rq is throttled — removed from the parent's runnable tree until the next period tick refills it. cpu.max.burst lets a group accumulate unused quota up to a cap, absorbing short spikes without raising the steady-state limit. cpu.idle, introduced in 5.15 and refined through 6.x, flips the group into SCHED_IDLE semantics: every task in the group inherits idle-class treatment and only runs when no non-idle SCHED_NORMAL task is eligible on the CPU, making it a clean way to implement "best effort" background groups without reaching for nice levels. Under EEVDF, cpu.idle groups have weight 3 (the lowest non-zero) and a very large latency offset, so their deadlines always land after any normal task's. cpu.stat reports usage_usec, nr_periods, nr_throttled, throttled_usec and, since 6.6, a nr_bursts / burst_usec pair when burst is configured. PSI (cpu.pressure) is cgroup-aware and tracks the some/full stall times the group's tasks experience. An important 6.8-era detail is that the group-level bandwidth throttle is evaluated before EEVDF selection, so a throttled group simply disappears from the deadline tree; there is no interaction where a group exceeds its quota because its deadline happened to be early. Conversely, cpu.weight and latency_nice are orthogonal: weight controls share, latency_nice controls deadline offset, and both are respected simultaneously throughout the cgroup hierarchy.

4. Latency under contention: bounded lag, deterministic preemption

The most observable difference between CFS and EEVDF shows up in tail latency under contention. CFS's wakeup preemption was governed by sched_wakeup_granularity_ns and a heuristic comparison against the currently running task's vruntime; in practice the kernel often deferred preemption of a recently started task for up to a millisecond to avoid thrashing, which under heavy fan-out (many short wakeups arriving while a batch job was running) translated to p99 wakeup latencies in the multi-millisecond range on desktop kernels. EEVDF replaces that heuristic with a deadline comparison: on wakeup, the new task's deadline is computed, and if it is earlier than the running task's deadline, it preempts. Because latency_nice directly offsets the deadline, there is now a single number that a workload can set to buy preemption priority, with no hidden interaction with the runqueue size. The theoretical guarantee EEVDF inherits from the 1995 Stoica et al. paper is bounded lag: at any instant, every runnable task's lag is bounded by its request size, which in Linux means at most one base_slice_ns (6 ms by default, sometimes tuned down to 750 µs on latency-sensitive systems). Empirically, benchmarks published during the 6.6–6.8 cycle (Phoronix, Canonical's QA reports, Meta's production rollouts) show rough throughput parity with CFS on kernbench-style workloads and meaningful improvements — on the order of 20–40% reductions in p99 — on schbench and hackbench at moderate overcommit. The RUN_TO_PARITY sched_feat, enabled by default, specifically addresses pathological over-preemption: when multiple tasks have very close deadlines (e.g., at wakeup storms), the currently running task is allowed to exhaust its request rather than being immediately yanked, trading a tiny amount of short-term fairness for a large reduction in context-switch count. Preemptibility itself is still governed by the kernel's preempt model (voluntary, full, lazy, none), and 6.8 is the release where the "lazy preempt" work by Thomas Gleixner began to stabilize behind CONFIG_PREEMPT_LAZY, aiming to give PREEMPT_DYNAMIC kernels the throughput of voluntary preempt with the latency of full preempt by deferring preemption of SCHED_OTHER until the next return-to-userspace rather than at the next safe point inside the kernel. SCHED_DEADLINE tasks and RT tasks continue to preempt EEVDF unconditionally; the interaction with sched_ext, when a BPF scheduler is eventually attached, is that EEVDF is bypassed entirely for SCHED_NORMAL tasks, but RT and DL classes still dominate above it.

5. What actually changed between 6.6 and 6.8

Kernel 6.6 landed the initial EEVDF merge — the headline commit is Peter Zijlstra's "sched/fair: Implement an EEVDF-like scheduling policy" (147f3efaa241) — and 6.7 and 6.8 were primarily stabilization and correctness releases for it. Zijlstra's EEVDF follow-up series in 6.7 fixed several placement bugs: place_entity() was reworked to correctly handle the "joining" case when a task wakes onto a queue whose V has advanced far past its vruntime, preventing the new task from either starving or being starved due to enormous lag on entry. The PLACE_LAG sched_feat, in particular, preserves wakeup lag across sleep events so that a task that voluntarily yielded does not gain a deadline advantage when it returns. 6.8 brought further fixes: the pick_eevdf() core function was simplified and its walk of the augmented RB tree bounded more tightly, fixing a quadratic worst case discovered under stress-ng --cpu-sched. The base_slice_ns debugfs knob gained range validation. A long-standing issue where SCHED_BATCH tasks could be preempted by SCHED_NORMAL siblings with only marginally earlier deadlines was addressed by giving SCHED_BATCH a fixed latency offset equivalent to latency_nice = 19, so batch work reliably coalesces. On the cgroup side, 6.8 improved cpu.max.burst accounting so that burst credit is not lost across CPU hotplug events. Load-balancer interactions with EEVDF were also tuned: update_curr() now updates the group hierarchy's deadline more lazily rather than on every tick, cutting roughly a few percent off the tick-time cost in deeply nested cgroup trees, and a large portion of the load balancer was refactored into the sched_balance_* namespace as part of Ingo Molnár's "scheduler naming cleanup" patchset that reached mainline in 6.8. Regressions were modest but real: a handful of HFT-style workloads reported higher p99.9 on 6.8 versus 6.5, traced to the augmented-RB rebalancing cost at very high context-switch rates, and RUN_TO_PARITY occasionally delayed a SCHED_IDLE-to-SCHED_NORMAL promotion by a few hundred microseconds, which could be mitigated by disabling the feature bit. The PSI subsystem gained more granular IRQ pressure tracking, useful for detecting softirq-induced stalls separately from scheduler stalls. Finally, 6.8 is the release where the scheduler's sysctl tunables under /proc/sys/kernel/sched_* are largely deprecated in favor of /sys/kernel/debug/sched/, which is a minor API churn for tuning scripts but reflects the understanding that these knobs are debug aids, not supported configuration.

6. Outlook: where scheduling is heading

With EEVDF stable in 6.8, the frontier of Linux scheduling has moved decisively toward two things: pluggability and heterogeneity. Pluggability arrived a few releases later in the form of sched_ext, whose 6.12 merge formalized what was already an active out-of-tree ecosystem in the 6.8 timeframe; its long-term implication is that the upstream fair scheduler can focus on being a safe, predictable default while specialist policies — gaming, HFT, bulk batch, energy-aware — live as BPF modules maintained by the people who actually care about those workloads. Heterogeneity is the harder problem. The rise of big.LITTLE on Arm, P-core/E-core on Intel Alder Lake and successors, and asymmetric AMD hybrid parts means the scheduler must route tasks to the right kind of core, respecting both throughput targets and energy cost. The Energy-Aware Scheduler (EAS) has been upstream since 5.0 but only works on SMP systems with a published energy model; 6.8 extended EAS awareness to Intel Thread Director through the Hardware Feedback Interface (HFI) driver, exposing per-core performance/efficiency hints to cpufreq_schedutil and the load balancer. Expect that integration to deepen — the sched_asym_cpucapacity static key, which today gates asymmetric-specific fast paths, will likely become a runtime choice driven by HFI in future releases. A third thread is the "DL server" work by Daniel Bristot de Oliveira, aiming to protect SCHED_NORMAL from SCHED_DEADLINE starvation by running fair tasks inside a synthetic deadline reservation; the patches were under review during the 6.8 cycle and landed shortly after. Lazy preemption, merged experimentally in 6.8, promises to unify the fragmented PREEMPT_* configurations into a single kernel binary whose latency behavior is tunable at boot. And finally, the sched_ext ecosystem is driving a reconsideration of what the fair class itself should look like: if scx_lavd consistently wins on desktop latency, its heuristics may feed back into EEVDF as new sched_feat bits, a virtuous cycle the out-of-tree CFS patchsets of the 2010s never quite achieved. The net picture as of 6.8 is a scheduler that is mathematically cleaner than CFS, operationally tunable in ways that make latency a first-class knob rather than a folklore tweak, and structurally ready to grow a pluggable policy layer on top. The next few releases will decide whether that promise translates into the kind of measurable, workload-specific wins that justify the sixteen-year upheaval of replacing CFS.