# The Linux 6.8 Scheduler: State of the Art

## 1. EEVDF: The Quiet Revolution that Replaced CFS

For roughly sixteen years the Completely Fair Scheduler (CFS), introduced by Ingo Molnar in 2.6.23, was the default fair-class scheduler in Linux. Kernel 6.6, released in October 2023, ended that era by merging Peter Zijlstra's long-gestating EEVDF (Earliest Eligible Virtual Deadline First) implementation. By the time kernel 6.8 shipped in March 2024, EEVDF had absorbed two rounds of stabilization fixes and was the unambiguous default for `SCHED_NORMAL` and `SCHED_BATCH` tasks. Structurally, EEVDF keeps much of CFS's machinery — tasks are still organized per-CPU in a red-black tree keyed by a virtual runtime, and the per-runqueue weighted average `cfs_rq->avg_vruntime` is still maintained — but the *selection* rule is different. Instead of simply picking the leftmost task (smallest vruntime), EEVDF picks the eligible task with the earliest virtual deadline. A task is *eligible* when its accumulated vruntime is less than or equal to the weighted mean `V` of the runqueue; the quantity `vlag = V - vruntime` is the task's lag, literally the service it is owed (positive) or has been loaned (negative). Lag is conserved across sleep and wakeup: when a task sleeps, its lag is renormalized against the queue's forward-moving `V` so it does not accumulate unbounded credit, a correctness property that older CFS heuristics like `GENTLE_FAIR_SLEEPERS` only approximated. Each runnable task has a "request" — a slice of CPU it asks for before being preempted — whose size is derived from `sysctl_sched_base_slice` (replacing the old `sched_latency_ns` / `sched_min_granularity_ns` pair) divided across the task's proportional weight. The virtual deadline is `eligible_time + request / weight`. Smaller requests yield earlier deadlines, which is how the new `latency_nice` attribute, exposed via `sched_setattr(2)` with `SCHED_FLAG_LATENCY_NICE`, buys lower wakeup latency at the cost of throughput: a task with `latency_nice = -20` gets a deadline offset that lets it preempt neighbors sooner, whereas `latency_nice = 19` pushes its deadline out and lets background work coalesce. The knobs in `/sys/kernel/debug/sched/` have shrunk correspondingly: `base_slice_ns` is the main tunable, `sched_wakeup_granularity_ns` is gone, and `features` exposes boolean toggles like `PLACE_LAG`, `PLACE_DEADLINE_INITIAL`, `RUN_TO_PARITY`, and `NEXT_BUDDY`, each gating a specific EEVDF placement or preemption heuristic.

## 2. sched_ext: A BPF Escape Hatch for Scheduling Policy

Running in parallel with EEVDF's stabilization, the sched_ext ("SCX") framework, shepherded by Tejun Heo and David Vernet, matured on the linux-sched-ext tree throughout the 6.6–6.8 cycle. In the 6.8 timeframe sched_ext was not yet in mainline — it would be merged in 6.12 — but it was already shipping in Ubuntu HWE backports and in distributions like CachyOS, and the patchset was close enough to mainline that its ABI is worth describing as part of the "state of the 6.8 scheduler" conversation. Sched_ext adds a new scheduling class, `ext_sched_class`, positioned between `idle_sched_class` and `fair_sched_class`. When no BPF scheduler is attached it is a no-op; when a user loads a `struct_ops` BPF program of type `sched_ext_ops`, that program receives callbacks for the core scheduling events — `enqueue`, `dequeue`, `dispatch`, `running`, `stopping`, `select_cpu`, `update_idle`, and cgroup hooks — and is free to implement whatever policy it wants. The framework provides "DSQs" (dispatch queues) as primitives: the BPF scheduler can create any number of global or per-CPU DSQs, move tasks into them with `scx_bpf_dispatch()`, and consume from them with `scx_bpf_consume()`. A watchdog enforces forward progress — if a task goes unrunnable for longer than the configured timeout the kernel ejects the BPF scheduler and falls back to EEVDF, making misbehavior survivable in a way that out-of-tree scheduler patches have never been. The out-of-tree `scx` repository ships several reference schedulers. `scx_rusty` is a multi-domain, user-space-assisted scheduler written partly in Rust that does load balancing between NUMA-aware domains from userspace while the BPF component handles the hot-path dispatch. `scx_lavd` ("Latency-criticality Aware Virtual Deadline") targets desktop and gaming workloads by dynamically detecting interactive tasks via wakeup frequency and runtime history and giving them shorter virtual deadlines. `scx_layered` lets an administrator declaratively partition tasks into "layers" by cgroup, comm, or nice, and attach different policies to each. `scx_central` centralizes all dispatch decisions on one CPU to minimize cross-CPU cache contention for latency-sensitive workloads. Use cases range from mobile/laptop battery optimization, to gaming (Steam Deck–style interactive preemption), to datacenter experimentation where a fleet can A/B a scheduler change without a kernel rebuild. The broader significance is architectural: sched_ext converts scheduler policy into userspace-debuggable, hot-swappable code, drastically lowering the cost of experimentation.

## 3. The cgroup v2 CPU controller under EEVDF

The cgroup v2 `cpu` controller presents a small set of interface files and expects the hierarchical-weight model of the unified hierarchy. `cpu.weight` takes an integer from 1 to 10000, default 100, representing proportional share; internally it is mapped through `sched_prio_to_weight[]` and used as the weight of a group-level `sched_entity` in the parent `cfs_rq`. Under EEVDF this has a crisper semantic than under classic CFS: a group's vruntime advances at a rate inversely proportional to its weight, and the group entity has its own eligibility time and deadline, so the EEVDF selection at the group level reproduces the "pick the earliest-deadline eligible child" rule recursively down the hierarchy. The legacy `cpu.weight.nice` file exposes the same weight through the -20..+19 nice scale for scripts ported from cgroup v1. `cpu.max` is the hard bandwidth limiter, a pair "quota period" in microseconds (default `max 100000`), implemented via the `bandwidth control` machinery: each period, the group is refilled with `quota` microseconds of runtime, and when it exhausts that budget its `cfs_rq` is *throttled* — removed from the parent's runnable tree until the next period tick refills it. `cpu.max.burst` lets a group accumulate unused quota up to a cap, absorbing short spikes without raising the steady-state limit. `cpu.idle`, introduced in 5.15 and refined through 6.x, flips the group into `SCHED_IDLE` semantics: every task in the group inherits idle-class treatment and only runs when no non-idle `SCHED_NORMAL` task is eligible on the CPU, making it a clean way to implement "best effort" background groups without reaching for nice levels. Under EEVDF, `cpu.idle` groups have weight 3 (the lowest non-zero) and a very large latency offset, so their deadlines always land after any normal task's. `cpu.stat` reports `usage_usec`, `nr_periods`, `nr_throttled`, `throttled_usec` and, since 6.6, a `nr_bursts` / `burst_usec` pair when burst is configured. PSI (`cpu.pressure`) is cgroup-aware and tracks the `some`/`full` stall times the group's tasks experience. An important 6.8-era detail is that the group-level bandwidth throttle is evaluated before EEVDF selection, so a throttled group simply disappears from the deadline tree; there is no interaction where a group exceeds its quota because its deadline happened to be early. Conversely, `cpu.weight` and `latency_nice` are orthogonal: weight controls *share*, latency_nice controls *deadline offset*, and both are respected simultaneously throughout the cgroup hierarchy.

## 4. Latency under contention: bounded lag, deterministic preemption

The most observable difference between CFS and EEVDF shows up in tail latency under contention. CFS's wakeup preemption was governed by `sched_wakeup_granularity_ns` and a heuristic comparison against the currently running task's vruntime; in practice the kernel often deferred preemption of a recently started task for up to a millisecond to avoid thrashing, which under heavy fan-out (many short wakeups arriving while a batch job was running) translated to p99 wakeup latencies in the multi-millisecond range on desktop kernels. EEVDF replaces that heuristic with a deadline comparison: on wakeup, the new task's deadline is computed, and if it is earlier than the running task's deadline, it preempts. Because `latency_nice` directly offsets the deadline, there is now a single number that a workload can set to buy preemption priority, with no hidden interaction with the runqueue size. The theoretical guarantee EEVDF inherits from the 1995 Stoica et al. paper is bounded lag: at any instant, every runnable task's lag is bounded by its request size, which in Linux means at most one `base_slice_ns` (6 ms by default, sometimes tuned down to 750 μs on latency-sensitive systems). Empirically, benchmarks published during the 6.6–6.8 cycle (Phoronix, Canonical's QA reports, Meta's production rollouts) show rough throughput parity with CFS on kernbench-style workloads and meaningful improvements — on the order of 20–40% reductions in p99 — on schbench and hackbench at moderate overcommit. The `RUN_TO_PARITY` sched_feat, enabled by default, specifically addresses pathological over-preemption: when multiple tasks have very close deadlines (e.g., at wakeup storms), the currently running task is allowed to exhaust its request rather than being immediately yanked, trading a tiny amount of short-term fairness for a large reduction in context-switch count. Preemptibility itself is still governed by the kernel's preempt model (voluntary, full, lazy, none), and 6.8 is the release where the "lazy preempt" work by Thomas Gleixner began to stabilize behind `CONFIG_PREEMPT_LAZY`, aiming to give `PREEMPT_DYNAMIC` kernels the throughput of voluntary preempt with the latency of full preempt by deferring preemption of `SCHED_OTHER` until the next return-to-userspace rather than at the next safe point inside the kernel. `SCHED_DEADLINE` tasks and RT tasks continue to preempt EEVDF unconditionally; the interaction with sched_ext, when a BPF scheduler is eventually attached, is that EEVDF is bypassed entirely for `SCHED_NORMAL` tasks, but RT and DL classes still dominate above it.

## 5. What actually changed between 6.6 and 6.8

Kernel 6.6 landed the initial EEVDF merge — the headline commit is Peter Zijlstra's "sched/fair: Implement an EEVDF-like scheduling policy" (147f3efaa241) — and 6.7 and 6.8 were primarily stabilization and correctness releases for it. Zijlstra's EEVDF follow-up series in 6.7 fixed several placement bugs: `place_entity()` was reworked to correctly handle the "joining" case when a task wakes onto a queue whose `V` has advanced far past its vruntime, preventing the new task from either starving or being starved due to enormous lag on entry. The `PLACE_LAG` sched_feat, in particular, preserves wakeup lag across sleep events so that a task that voluntarily yielded does not gain a deadline advantage when it returns. 6.8 brought further fixes: the `pick_eevdf()` core function was simplified and its walk of the augmented RB tree bounded more tightly, fixing a quadratic worst case discovered under `stress-ng --cpu-sched`. The `base_slice_ns` debugfs knob gained range validation. A long-standing issue where `SCHED_BATCH` tasks could be preempted by `SCHED_NORMAL` siblings with only marginally earlier deadlines was addressed by giving `SCHED_BATCH` a fixed latency offset equivalent to `latency_nice = 19`, so batch work reliably coalesces. On the cgroup side, 6.8 improved `cpu.max.burst` accounting so that burst credit is not lost across CPU hotplug events. Load-balancer interactions with EEVDF were also tuned: `update_curr()` now updates the group hierarchy's deadline more lazily rather than on every tick, cutting roughly a few percent off the tick-time cost in deeply nested cgroup trees, and a large portion of the load balancer was refactored into the `sched_balance_*` namespace as part of Ingo Molnar's "scheduler naming cleanup" patchset that reached mainline in 6.8. Regressions were modest but real: a handful of HFT-style workloads reported higher p99.9 on 6.8 versus 6.5, traced to the augmented-RB rebalancing cost at very high context-switch rates, and `RUN_TO_PARITY` occasionally delayed a `SCHED_IDLE`-to-`SCHED_NORMAL` promotion by a few hundred microseconds, which could be mitigated by disabling the feature bit. The PSI subsystem gained more granular IRQ pressure tracking, useful for detecting softirq-induced stalls separately from scheduler stalls. Finally, 6.8 is the release where the scheduler's sysctl tunables under `/proc/sys/kernel/sched_*` are largely deprecated in favor of `/sys/kernel/debug/sched/`, which is a minor API churn for tuning scripts but reflects the understanding that these knobs are debug aids, not supported configuration.

## 6. Outlook: where scheduling is heading

With EEVDF stable in 6.8, the frontier of Linux scheduling has moved decisively toward two things: pluggability and heterogeneity. Pluggability arrived a few releases later in the form of sched_ext, whose 6.12 merge formalized what was already an active out-of-tree ecosystem in the 6.8 timeframe; its long-term implication is that the upstream fair scheduler can focus on being a safe, predictable default while specialist policies — gaming, HFT, bulk batch, energy-aware — live as BPF modules maintained by the people who actually care about those workloads. Heterogeneity is the harder problem. The rise of big.LITTLE on Arm, P-core/E-core on Intel Alder Lake and successors, and asymmetric AMD hybrid parts means the scheduler must route tasks to the right kind of core, respecting both throughput targets and energy cost. The Energy-Aware Scheduler (EAS) has been upstream since 5.0 but only works on SMP systems with a published energy model; 6.8 extended EAS awareness to Intel Thread Director through the Hardware Feedback Interface (HFI) driver, exposing per-core performance/efficiency hints to `cpufreq_schedutil` and the load balancer. Expect that integration to deepen — the `sched_asym_cpucapacity` static key, which today gates asymmetric-specific fast paths, will likely become a runtime choice driven by HFI in future releases. A third thread is the "DL server" work by Daniel Bristot de Oliveira, aiming to protect `SCHED_NORMAL` from `SCHED_DEADLINE` starvation by running fair tasks inside a synthetic deadline reservation; the patches were under review during the 6.8 cycle and landed shortly after. Lazy preemption, merged experimentally in 6.8, promises to unify the fragmented `PREEMPT_*` configurations into a single kernel binary whose latency behavior is tunable at boot. And finally, the sched_ext ecosystem is driving a reconsideration of what the fair class itself should look like: if `scx_lavd` consistently wins on desktop latency, its heuristics may feed back into EEVDF as new `sched_feat` bits, a virtuous cycle the out-of-tree CFS patchsets of the 2010s never quite achieved. The net picture as of 6.8 is a scheduler that is mathematically cleaner than CFS, operationally tunable in ways that make latency a first-class knob rather than a folklore tweak, and structurally ready to grow a pluggable policy layer on top. The next few releases will decide whether that promise translates into the kind of measurable, workload-specific wins that justify the sixteen-year upheaval of replacing CFS.
