Cerebras switchover & the 153K-token context overflow
Milady bot · Eliza runtime · 2026-05-11 · debug + fix plan for nubs & shaw

3/4 · simple tests passed via Cerebras
2.2M · tokens spent on one failed gh-list turn
131K · Cerebras gpt-oss-120b context limit (vs Claude 200K)

TL;DR
What we did
What broke (the 153K incident)
Why 153K — anatomy of a runaway loop
Eliza's built-in safeguards
Where they fell short
Other bugs uncovered
Production-grade fix plan
Current bot config
Verification & test plan
Appendix · evidence

1 · TL;DR

The switch to Cerebras works for ~80% of traffic. Simple chat, tool-bearing one-shots, and short tool chains all return clean answers via gpt-oss-120b, in 12 to 91 seconds during the smoke tests (§10.1). Verified: PONG, 17×23=391, service.env has 18 lines.

The remaining ~20% — multi-step planner loops — overflow Cerebras's 131K-token context window. A single user message like “list my 3 most recently merged PRs” spawned a 13-iteration planner loop. By the 13th iteration the prompt had grown to 153,432 tokens and Cerebras returned context_length_exceeded. Total spend: 2.2M tokens / $1.11 on one failed turn.

Eliza already has a compaction system (packages/core/src/runtime/model-input-budget.ts + runtime/limits.ts + runtime/planner-loop.ts) with sensible defaults (128K window, 10K reserve, keep 4 recent steps verbatim). It fired, but it didn't fire hard enough for Cerebras's tighter context plus the bot's particular failure-loop pattern. §6 lists the four gaps; the concrete fixes are in §8.

The Cerebras key works perfectly. The problem is not Cerebras and not the key — it’s the gap between Eliza’s assumption (≥128K context, <128K typical prompt) and what the planner actually produces when a task fails its first 2–3 tool calls.

2 · What we did

Goal: disable Claude inference, route all bot inference to Cerebras using key csk-8c9hf68jfm6h95....

Verified Cerebras key + four available models via GET /v1/models: gpt-oss-120b, qwen-3-235b-a22b-instruct-2507, zai-glm-4.7, llama3.1-8b.
Confirmed direct API works for chat + tool-calls + structured outputs.
Called POST /api/provider/switch with {provider:"openai", apiKey:"csk-..."} — Eliza already detects Cerebras via OPENAI_BASE_URL=https://api.cerebras.ai/v1 + MILADY_PROVIDER=cerebras.
Set OPENAI_*_MODEL=gpt-oss-120b across all four slots (NANO/SMALL/MEDIUM/LARGE) in both service.env and milady.json (top-level env + nested env.vars).
Linked the npm-installed @elizaos/plugin-openai and @elizaos/core to local source so the Cerebras-mode codepath actually runs (npm alpha.537 predates the Cerebras support).
Restarted milady.service; ran a 4-test smoke suite via Discord.

3 · What broke (the 153K incident)

Test prompt:

@remilio nubilio list my 3 most recently MERGED PRs on elizaOS/eliza
                  using gh cli, just number + title each line

Bot reply ~4 minutes later:

Something flaked on my end, please try again.

Trajectory metrics for this single user message (trajectories/.../tj-b8556917146694.json — 17 MB JSON):

Metric	Value	Note
Planner iterations	13	Bot replanned 13 times trying to land the task
Tool calls executed	10	3 failed (gh CLI not directly available; sub-agent spawn took over)
Evaluator failures	6	Evaluator said "CONTINUE" rather than letting the loop finish
Total prompt tokens	2,205,684	Cumulative across all 13 planner calls
Cache reads	1,751,296	~79% cached — that's good, but the uncached portion still ballooned
Total completion tokens	11,229	~200:1 input-to-output ratio
Cost	$1.11	For one Discord question that produced "flaked"
Total latency	142,146 ms	142 seconds
Final decision	`CONTINUE`	Never reached FINISH; killed by `terminal_only_continuations` limit
Status	`errored`	—

The actual exception, captured via a custom fetch wrapper I added to eliza/plugins/plugin-openai/providers/openai.ts during the debug session:

[CEREBRAS-ERR] 400 Bad Request url=https://api.cerebras.ai/v1/chat/completions
RESP: {"message":"Please reduce the length of the messages or completion. Current length is 153432 while limit is 131000","type":"invalid_request_error","param":"messages","code":"context_length_exceeded"}

4 · Why 153K — anatomy of a runaway loop

One Discord message → 13 planner iterations. Each iteration is its own complete model call with the full trajectory so far embedded in the prompt. The growth pattern:

Iteration	What the planner sees	Approx. prompt size
1	User message + system prompt + tools schema + context providers	~10–20K tokens
2	+ step 1 thought + tool call + tool result	~25K
3	+ step 2 (a sub-agent spawn — large output)	~50K
…	(each step appends sub-agent PTY output, evaluator JSON, etc.)	—
13	13 full step transcripts + canonical context + tools + system	153,432 tokens — overflow

What kept the loop going for 13 iterations rather than stopping at step 2 or 3:

The first tool call (a sub-agent spawn for gh) failed with a credentials-shape mismatch. The bot's planner correctly thought "ok let me try a different tool".
The evaluator kept returning decision: CONTINUE instead of FINISH(success=false), so the planner kept replanning rather than reporting failure to the user.
Each replan added the previous failure-transcript verbatim to the prompt, growing it monotonically.
Eliza's compaction triggered (more on this below) but its compactionKeepSteps: 4 default kept the four most-recent step transcripts verbatim — and those four steps happened to include a 50 KB sub-agent PTY dump.
Eventually the prompt + the four kept-verbatim steps together exceeded Cerebras's 131K limit. The bot retried twice more, reaching 3 terminal-only continuations against a limit of 2, before TrajectoryLimitExceeded finally killed the loop.

5 · Eliza's built-in safeguards (the good news)

To Shaw's question — yes, this is supposed to be handled by Eliza already. Three subsystems are in place:

5.1 · Token-budget estimation

eliza/packages/core/src/runtime/model-input-budget.ts

export const DEFAULT_CONTEXT_WINDOW_TOKENS = 128_000;
export const DEFAULT_COMPACTION_RESERVE_TOKENS = 10_000;

export function buildModelInputBudget(args: { ... }): ModelInputBudget {
  // estimatedInputTokens = ceil(chars / 3.5)
  // compactionThresholdTokens = contextWindowTokens - reserveTokens   // = 118K default
  // shouldCompact = estimatedInputTokens >= compactionThresholdTokens
}

Before every planner model call, the loop estimates how many tokens the next request will use. If the estimate reaches or exceeds the compaction threshold, compaction fires.
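
The arithmetic in the excerpt can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the Eliza source: `sketchModelInputBudget` and `BudgetSketch` are hypothetical names, and only the two defaults and the chars / 3.5 heuristic come from the excerpt above.

```typescript
// Hypothetical shape; the real ModelInputBudget may carry more fields.
interface BudgetSketch {
  estimatedInputTokens: number;
  compactionThresholdTokens: number;
  shouldCompact: boolean;
}

// Mirrors the three commented lines in the excerpt:
// estimate = ceil(chars / 3.5), threshold = window - reserve.
function sketchModelInputBudget(
  promptChars: number,
  contextWindowTokens = 128_000,
  reserveTokens = 10_000,
): BudgetSketch {
  const estimatedInputTokens = Math.ceil(promptChars / 3.5);
  const compactionThresholdTokens = contextWindowTokens - reserveTokens;
  return {
    estimatedInputTokens,
    compactionThresholdTokens,
    shouldCompact: estimatedInputTokens >= compactionThresholdTokens,
  };
}
```

With the defaults, a 413,000-character prompt estimates to 118,000 tokens, which sits exactly on the threshold, so `shouldCompact` is true.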

5.2 · Planner-loop compaction

eliza/packages/core/src/runtime/planner-loop.ts:825

if (modelInputBudget.shouldCompact && params.config.compactionEnabled) {
  const compacted = await maybeCompactPlannerTrajectory({
    trajectory, budget, config, recorder, trajectoryId, parentStageId, iteration, logger,
  });
  if (compacted) { /* re-render messages with the summary substituted */ }
}

The compactor takes old planner steps, summarizes them via summarizePlannerStep(), and replaces them in the trajectory with a single text block ("Compacted prior planner trajectory steps because estimated input approached the model context window. compacted_steps: N, kept_recent_steps_verbatim: 4, …"). The four newest steps stay verbatim so the planner has high-fidelity recent context.
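
A rough sketch of that split-and-summarize step follows. All names here (`PlannerStepSketch`, `compactTrajectorySketch`) are hypothetical, and the default `summarize` callback is a stand-in for `summarizePlannerStep()`, whose real behaviour is not shown in this report.

```typescript
interface PlannerStepSketch {
  iteration: number;
  transcript: string;
}

interface CompactionSketchResult {
  summaryBlock: string | null; // null when nothing was compacted
  keptVerbatim: PlannerStepSketch[];
}

function compactTrajectorySketch(
  steps: PlannerStepSketch[],
  keepSteps = 4,
  // Placeholder summarizer: the first 80 characters of each older step.
  summarize: (step: PlannerStepSketch) => string = (s) =>
    `step ${s.iteration}: ${s.transcript.slice(0, 80)}`,
): CompactionSketchResult {
  if (steps.length <= keepSteps) {
    return { summaryBlock: null, keptVerbatim: steps };
  }
  const splitAt = steps.length - keepSteps;
  const older = steps.slice(0, splitAt);
  const keptVerbatim = steps.slice(splitAt);
  const summaryBlock = [
    `compacted_steps: ${older.length}, kept_recent_steps_verbatim: ${keptVerbatim.length}`,
    ...older.map(summarize),
  ].join("\n");
  return { summaryBlock, keptVerbatim };
}
```

Nothing in this sketch bounds the size of the kept steps: a very large transcript among the newest four passes through untouched.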

5.3 · Trajectory limit guards

eliza/packages/core/src/runtime/limits.ts

export const DEFAULT_CHAINING_LOOP_CONFIG: ChainingLoopConfig = {
  maxToolCalls: 16,
  maxRepeatedFailures: 2,
  maxTerminalOnlyContinuations: 2,   // <-- this is what stopped our 13-iter loop
  contextWindowTokens: 128_000,
  compactionReserveTokens: 10_000,
  compactionEnabled: true,
  compactionKeepSteps: 4,
};

Three hard limits cap any single user message: max tool calls (16), max repeated failures (2), max terminal-only continuations (2). Hitting any one throws TrajectoryLimitExceeded and the bot returns a "structured failure" reply.
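
The check can be pictured as a comparison of three counters against the configured caps. This is a sketch only: the counter shape, the function name, and the `max_tool_calls` / `max_repeated_failures` kind strings are hypothetical (only `terminal_only_continuations` appears in the trajectory), and the strict `>` comparison is an assumption inferred from the loop dying at 3 continuations against a limit of 2.

```typescript
type LimitKind =
  | "max_tool_calls"
  | "max_repeated_failures"
  | "terminal_only_continuations";

interface LoopCounters {
  toolCalls: number;
  repeatedFailures: number;
  terminalOnlyContinuations: number;
}

interface LoopLimits {
  maxToolCalls: number;
  maxRepeatedFailures: number;
  maxTerminalOnlyContinuations: number;
}

// Values copied from DEFAULT_CHAINING_LOOP_CONFIG above.
const DEFAULT_LIMITS: LoopLimits = {
  maxToolCalls: 16,
  maxRepeatedFailures: 2,
  maxTerminalOnlyContinuations: 2,
};

// Returns the first limit that has been exceeded, or null if the loop may continue.
function firstExceededLimit(
  counters: LoopCounters,
  limits: LoopLimits = DEFAULT_LIMITS,
): LimitKind | null {
  if (counters.toolCalls > limits.maxToolCalls) return "max_tool_calls";
  if (counters.repeatedFailures > limits.maxRepeatedFailures) return "max_repeated_failures";
  if (counters.terminalOnlyContinuations > limits.maxTerminalOnlyContinuations) {
    return "terminal_only_continuations";
  }
  return null;
}
```

Fed the incident's numbers (10 tool calls, 3 terminal-only continuations, and assuming the repeated-failure counter stayed at or below 2), this returns `"terminal_only_continuations"`. Note that none of the three counters looks at token spend.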

6 · Where the safeguards fell short

Gap 1 · context window is hard-coded at 128K

The default of 128_000 assumes ≥128K context. gpt-oss-120b accepts 131K, which is only 3K above the assumed window (13K above the 118K compaction threshold). With Claude Opus 4.7 (200K) there was 72K of headroom. Every model picks up these defaults regardless of its real ceiling.

Fix: resolve contextWindowTokens per-model. Eliza already maintains a price table at runtime/cost-table.ts; extending it with contextWindow per model id is a 50-line PR.
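
One possible shape for that lookup is sketched below. The table and function names are hypothetical and the layout of cost-table.ts is not shown in this report. The only entry filled in is the 131,000 limit reported in the Cerebras 400 response; other models would need their real limits added from provider documentation.

```typescript
// Hypothetical per-model table; only the value observed in the incident is listed.
const CONTEXT_WINDOW_TOKENS_BY_MODEL = new Map<string, number>([
  ["gpt-oss-120b", 131_000], // limit quoted in the context_length_exceeded error
]);

// Same value as DEFAULT_CONTEXT_WINDOW_TOKENS, used when the model is unknown.
const FALLBACK_CONTEXT_WINDOW_TOKENS = 128_000;

// An explicit caller-supplied window wins, then the table, then the fallback.
function resolveContextWindowTokens(modelId?: string, explicit?: number): number {
  if (explicit !== undefined) return explicit;
  const known = modelId === undefined ? undefined : CONTEXT_WINDOW_TOKENS_BY_MODEL.get(modelId);
  return known ?? FALLBACK_CONTEXT_WINDOW_TOKENS;
}
```

Keeping the explicit argument first preserves today's behaviour for callers that already pass a window.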

Gap 2 · estimator vs. actual tokenization

The estimator uses chars / 3.5. That's accurate for English prose but underestimates tool/JSON-heavy content (tool schemas, JSON tool results, base64 attachments). Our trajectory's "estimated 118K" was actually 153K on the wire: the real count came in about 30% above the estimate.

Fix: use the provider-returned prompt_tokens from the previous call as the new estimate floor, or wire a real tokenizer (tiktoken for OpenAI/Cerebras, anthropic-tokenizer for Claude).
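
The first option (a floor taken from the previous call's reported usage) could look roughly like this. The function and field names are hypothetical, and the `compactedSince` flag is an added assumption: once the trajectory has been compacted, the previous call's count no longer bounds the next prompt from below, so the floor is skipped.

```typescript
interface PreviousCallUsage {
  promptTokens: number; // prompt_tokens reported by the provider for the last call
  compactedSince: boolean; // true if the trajectory was compacted after that call
}

function estimateNextPromptTokens(
  nextPromptChars: number,
  previous?: PreviousCallUsage,
): number {
  // Existing heuristic, kept as the baseline.
  const charEstimate = Math.ceil(nextPromptChars / 3.5);
  if (!previous || previous.compactedSince) return charEstimate;
  // The prompt only grows between calls unless it was compacted,
  // so the last real count is a safe lower bound.
  return Math.max(charEstimate, previous.promptTokens);
}
```

For example, a 350,000-character prompt estimates to 100,000 tokens by characters alone, but to 130,000 if the previous call reported 130,000 prompt tokens and no compaction has happened since.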

Gap 3 · keep-4-steps-verbatim can blow the budget by itself

compactionKeepSteps: 4 keeps the four newest steps in full. When a step contains a sub-agent PTY dump or a multi-thousand-line file read, those four steps alone can be 80–100K tokens — leaving the compactor with no room to add the summary.

Fix: per-step size cap. If the four kept steps exceed (budget × 60%), additionally truncate each kept step to a head + tail with an [… truncated N chars …] marker.
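
A minimal sketch of the head + tail truncation, assuming kept steps are plain strings. `truncateMiddle` and `capKeptSteps` are hypothetical names, the 0.6 fraction is the one quoted above, and the default head and tail sizes (14,000 and 7,000 characters, roughly 4K and 2K tokens at chars / 3.5) are illustrative rather than tuned values.

```typescript
// Keep the first headChars and last tailChars characters; mark what was dropped.
function truncateMiddle(text: string, headChars: number, tailChars: number): string {
  if (text.length <= headChars + tailChars) return text;
  const removed = text.length - headChars - tailChars;
  const head = text.slice(0, headChars);
  const tail = text.slice(text.length - tailChars);
  return `${head}[… truncated ${removed} chars …]${tail}`;
}

// Truncate every kept step only when together they exceed the allowed share of the budget.
function capKeptSteps(
  keptSteps: string[],
  budgetTokens: number,
  maxFraction = 0.6,
  headChars = 14_000,
  tailChars = 7_000,
): string[] {
  const estimateTokens = (s: string) => Math.ceil(s.length / 3.5);
  const total = keptSteps.reduce((sum, s) => sum + estimateTokens(s), 0);
  if (total <= budgetTokens * maxFraction) return keptSteps;
  return keptSteps.map((s) => truncateMiddle(s, headChars, tailChars));
}
```

Keeping both ends matters for tool output: the head usually carries the command and its first lines, while the tail carries the exit status or final error.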

Gap 4 · CONTINUE-loop has no token-cost guard

Today the loop stops on maxTerminalOnlyContinuations: 2 (count of evaluator-said-CONTINUE-but-no-tool-fired) but has no cumulative-token guard. The failed turn spent 2.2M tokens / $1.11 before it died. A runaway loop could in principle spend $10+ before the count-based limit catches it.

Fix: add maxTrajectoryPromptTokens (default ~500K) to ChainingLoopConfig and abort early when exceeded.
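
A sketch of such a guard, assuming the loop can supply the per-stage prompt token counts. The error class and function name are hypothetical stand-ins (the real code would presumably throw TrajectoryLimitExceeded); the `trajectory_token_budget` kind is the one proposed in §8.

```typescript
// Hypothetical stand-in for TrajectoryLimitExceeded with a token-budget kind.
class TrajectoryTokenBudgetError extends Error {
  readonly kind = "trajectory_token_budget";
  readonly spentPromptTokens: number;
  readonly maxPromptTokens: number;

  constructor(spentPromptTokens: number, maxPromptTokens: number) {
    super(`prompt tokens ${spentPromptTokens} exceeded trajectory budget ${maxPromptTokens}`);
    this.name = "TrajectoryTokenBudgetError";
    this.spentPromptTokens = spentPromptTokens;
    this.maxPromptTokens = maxPromptTokens;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Call after each planner stage; returns the running total or throws.
function assertWithinTrajectoryBudget(
  stagePromptTokens: readonly number[],
  maxTrajectoryPromptTokens = 500_000,
): number {
  const spent = stagePromptTokens.reduce((sum, n) => sum + n, 0);
  if (spent > maxTrajectoryPromptTokens) {
    throw new TrajectoryTokenBudgetError(spent, maxTrajectoryPromptTokens);
  }
  return spent;
}
```

With a 500K cap, the failed turn would have been cut off well before its 2,205,684 cumulative prompt tokens.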

7 · Other bugs uncovered during the switch

7.1 · npm-published plugin-openai @ alpha.537 has no Cerebras support

The Cerebras-specific functions isCerebrasMode(), sanitizeFunctionNameForCerebras(), normalizeSchemaForCerebras() and the deterministic local-embedding fallback all live in the local eliza/ source tree but are absent from the published npm tarball (@elizaos/plugin-openai @ alpha.537, ditto for @elizaos/core). Verified:

$ grep -c "sanitizeFunctionNameForCerebras" \
    node_modules/.bun/@elizaos+plugin-openai@<alpha.537 version>+.../dist/node/index.node.js
0
$ grep -c "sanitizeFunctionNameForCerebras" \
    eliza/plugins/plugin-openai/dist/node/index.node.js
16

Switched the npm symlink to the local build (ln -sf $(realpath eliza/plugins/plugin-openai) node_modules/@elizaos/plugin-openai, same for @elizaos/core) so the Cerebras codepath actually runs.

Fix: publish the Cerebras-aware @elizaos/plugin-openai and @elizaos/core to npm under the alpha dist-tag, or switch milady to MILADY_ELIZA_SOURCE=local mode permanently.

7.2 · `/api/provider/switch` rewrites OPENAI_*_MODEL to OpenAI defaults

eliza/packages/agent/src/api/provider-switch-config.ts:410-414 contains a PROVIDER_DEFAULT_MODELS map:

openai: {
  smallKey: "OPENAI_SMALL_MODEL", smallVal: "gpt-5-mini",
  largeKey: "OPENAI_LARGE_MODEL", largeVal: "gpt-5.5",
}

When the switch handler fires with provider:"openai" it writes those values into milady.json > env > OPENAI_LARGE_MODEL = "gpt-5.5" (top-level env, not env.vars). That value overrides the systemd-supplied OPENAI_LARGE_MODEL=qwen-... at runtime hydration time. Took 90 minutes of debugging to spot because process.env showed the right value but the runtime had already mutated it.

Fix: the switch route should accept a {smallModel, largeModel} body OR detect Cerebras-style base URLs and not apply OpenAI defaults. A guard: if OPENAI_BASE_URL already points at a non-openai.com host, do not stamp OpenAI default model ids.
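
The host guard could be as small as the sketch below. The helper body and `shouldStampOpenAiDefaults` are hypothetical; the only rule encoded is the one stated above (a base URL whose host is not under openai.com means the defaults are left alone). Treating an unparseable URL as "not third-party" is an assumption made here, and the opposite choice is equally defensible.

```typescript
// True when OPENAI_BASE_URL points at a host outside openai.com.
function isThirdPartyOpenAiBase(baseUrl: string | undefined): boolean {
  if (!baseUrl) return false; // unset means the stock OpenAI endpoint
  try {
    const host = new URL(baseUrl).hostname.toLowerCase();
    return !(host === "openai.com" || host.endsWith(".openai.com"));
  } catch {
    return false; // unparseable URL: assumed not third-party (see lead-in)
  }
}

// Defaults are stamped only when no explicit models were sent and the base is OpenAI's own.
function shouldStampOpenAiDefaults(
  baseUrl: string | undefined,
  explicitModels?: { smallModel?: string; largeModel?: string },
): boolean {
  if (explicitModels?.smallModel || explicitModels?.largeModel) return false;
  return !isThirdPartyOpenAiBase(baseUrl);
}
```

For the live config, `shouldStampOpenAiDefaults("https://api.cerebras.ai/v1")` is false, so the gpt-5.5 / gpt-5-mini values would never be written.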

7.3 · TEXT_EMBEDDING router falls back to compiling llama.cpp at runtime

Bot log:

[router] Provider local-ai failed for TEXT_EMBEDDING; trying fallback provider (Command npm run -s cmake-js-llama -- compile --log-level warn --config Release --arch=x64 --out localBuilds/linux-x64-release-b9101 … exited with code 1)

The local-AI plugin (node-llama-cpp) is registered as a TEXT_EMBEDDING provider at MAX_SAFE_INTEGER priority. When it's called, it attempts to compile llama.cpp via cmake-js-llama every single time. That compile fails on this VPS (missing cmake toolchain), so the call falls through to the next provider in the chain — burning real time on every message that needs embeddings.

Eliza already has a clean Cerebras-mode fallback in eliza/plugins/plugin-openai/models/embedding.ts (shouldUseLocalEmbeddingFallback: when isCerebrasMode() && no explicit embedding endpoint → return a deterministic locally-hashed vector). But it never gets called because local-ai outranks it.

Fix: either de-prioritize local-ai when its compile-on-demand path is unhealthy, OR make plugin-openai register at higher priority in Cerebras mode, OR add a one-time compile-or-disable health check at boot so a broken local-ai doesn't pretend to be available.

7.4 · "Trajectory limit exceeded" produces an opaque user reply

The user sees "Something flaked on my end, please try again." That's good ops-hygiene (no internal leak) but it means the user has no signal to not retry the same query, which will hit the same wall and burn another $1.11.

Fix: when TrajectoryLimitExceeded fires on terminal_only_continuations, surface a slightly more diagnostic reply (e.g., "I tried 3 tool routes and none completed — this looks like a configuration gap, not a transient. Don't retry; tell shaw to look at trajectory tj-xxx.")

7.5 · gpt-oss-120b reasoning-token waste on short answers

When asked to reply with just "PONG", gpt-oss-120b spent ~45 completion tokens (most of them reasoning) — vs qwen-3-235b-a22b-instruct-2507 which spent 3. For the chatty bot path, qwen is cheaper. For tool-planning where reasoning helps, gpt-oss-120b is better. Worth slotting differently:

Slot	Today	Suggested for Cerebras	Why
TEXT_NANO	gpt-oss-120b	llama3.1-8b	Tokenization, classification — no reasoning needed
TEXT_SMALL	gpt-oss-120b	qwen-3-235b-...	Chat replies, no chain-of-thought waste
TEXT_MEDIUM	gpt-oss-120b	qwen-3-235b-...	Same
TEXT_LARGE	gpt-oss-120b	gpt-oss-120b	Planner uses reasoning effectively

8 · Production-grade fix plan

Concrete, ordered by ratio of payoff to effort.

P0 · Per-model context windows + safety margin (1–2 hours)

Extend runtime/cost-table.ts from a price-only table to a capability table that includes contextWindow per model id. Add a small lookup in buildModelInputBudget that reads the active model's window when the caller doesn't pass one. Raise the default safety reserve from 10K to 15K to give compaction more room. Effect: every model self-tunes to its own ceiling; no more 131K-vs-128K landmines.

P0 · Cumulative-spend abort guard (30 minutes)

Add maxTrajectoryPromptTokens: 500_000 to ChainingLoopConfig. After each planner stage, sum totalPromptTokens across stages; if > max, abort with TrajectoryLimitExceeded(kind: "trajectory_token_budget"). Effect: cost-per-incident bounded at known dollars instead of "however much it takes."

P1 · Real tokenizer for the estimator (1 hour)

Replace estimateTokensFromChars(chars) = ceil(chars / 3.5) with provider-aware tokenization. tiktoken via @dqbd/tiktoken for OpenAI-shape providers (Cerebras's gpt-oss-120b tokenizes similarly to o200k_base). Keep the chars/3.5 path as a fallback when no tokenizer is registered. Effect: estimator stops being 30% low; compaction fires at the right moment.

P1 · Keep-step size cap (1 hour)

In maybeCompactPlannerTrajectory, after slicing off the four kept-verbatim newest steps, sum their estimated token sizes. If the sum exceeds 0.5 × compactionThresholdTokens, additionally truncate each kept step to a head + tail (e.g., 4K head + 2K tail) with a [… truncated N chars …] middle. Effect: a single pathologically-large tool output can't single-handedly blow the budget.

P1 · Provider-switch route: stop stamping default OPENAI models (15 minutes)

In provider-switch-config.ts, guard PROVIDER_DEFAULT_MODELS.openai behind !isCerebrasOrThirdPartyBase(env.OPENAI_BASE_URL). If we're already pointed at Cerebras/Groq/Together/OpenRouter, leave the model names alone. Effect: "/api/provider/switch with provider=openai" stops fighting the user's already-configured Cerebras models.

P1 · Local-AI plugin: boot-time health check (1 hour)

When the local-ai plugin's TEXT_EMBEDDING handler is registered, run the compile path once at boot. If it fails, register at priority -Infinity instead of MAX_SAFE_INTEGER. Effect: broken local-AI installs don't shadow working remote providers.
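
In outline, the boot-time probe decides the priority once and the result is reused for every later request. The function name and the `probeCompile` callback are hypothetical; how the local-ai plugin actually exposes its compile step is not shown in this report.

```typescript
// Runs the compile path once; a failure demotes the provider instead of hiding it.
async function resolveEmbeddingPriority(
  probeCompile: () => Promise<void>,
): Promise<number> {
  try {
    await probeCompile();
    return Number.MAX_SAFE_INTEGER; // healthy: keep today's top priority
  } catch {
    return Number.NEGATIVE_INFINITY; // broken: sort below every other provider
  }
}
```

A caller would await this during plugin registration, for example `const priority = await resolveEmbeddingPriority(() => compileLlamaOnce())`, where `compileLlamaOnce` is a placeholder for the real compile entry point.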

P2 · Publish Cerebras-aware @elizaos/* to npm (Shaw)

The local source has every fix needed. Publishing @elizaos/core and @elizaos/plugin-openai with the next alpha bump removes the workspace-link workaround. Effect: any milady user can switch to Cerebras with three env vars, no node_modules surgery.

P2 · TrajectoryLimitExceeded user reply with diagnostic hook (30 minutes)

When the limit fires, attach the trajectoryId to the failure reply, and emit a HOOK_TRAJECTORY_LIMIT event so monitoring picks it up. Effect: users who hit the same failure repeatedly can be pointed straight at their trajectory.
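
A sketch of the reply side only (event emission is left out). `LimitFailureSketch` and `buildLimitReply` are hypothetical names, and the wording is adapted from the suggestion in §7.4 rather than taken from any existing template.

```typescript
interface LimitFailureSketch {
  kind: string; // e.g. "terminal_only_continuations"
  trajectoryId: string;
  attemptedToolRoutes: number;
}

// Picks a more diagnostic reply for the non-transient case and always attaches the trajectory id.
function buildLimitReply(failure: LimitFailureSketch): string {
  const reference = `Reference: trajectory ${failure.trajectoryId}.`;
  if (failure.kind === "terminal_only_continuations") {
    return (
      `I tried ${failure.attemptedToolRoutes} tool routes and none completed. ` +
      `This looks like a configuration gap, not a transient failure, so retrying will not help. ` +
      reference
    );
  }
  return `Something flaked on my end, please try again. ${reference}`;
}
```

The default branch keeps the current generic wording, so only the limit kinds known to be non-transient change what the user sees.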

9 · Current bot configuration

This is what's live on the VPS right now.

9.1 · service.env (systemd-supplied)

OPENAI_BASE_URL=https://api.cerebras.ai/v1
MILADY_PROVIDER=cerebras
OPENAI_LARGE_MODEL=gpt-oss-120b
OPENAI_SMALL_MODEL=gpt-oss-120b
OPENAI_MEDIUM_MODEL=gpt-oss-120b
OPENAI_NANO_MODEL=gpt-oss-120b

9.2 · milady.json (runtime config)

"env": {
  "OPENAI_LARGE_MODEL": "gpt-oss-120b",
  "OPENAI_SMALL_MODEL": "gpt-oss-120b",
  "OPENAI_MEDIUM_MODEL": "gpt-oss-120b",
  "OPENAI_NANO_MODEL": "gpt-oss-120b",
  "OPENAI_BASE_URL": "https://api.cerebras.ai/v1",
  "MILADY_PROVIDER": "cerebras",
  "vars": { /* mirror of the above + OPENAI_API_KEY: "vault://OPENAI_API_KEY" */ }
},
"serviceRouting": {
  "llmText": { "backend": "openai", "transport": "direct", "primaryModel": "gpt-oss-120b" }
}

9.3 · vault.json

OPENAI_API_KEY entry stores the Cerebras key as csk-…, encrypted with the bot's vault passphrase. Anthropic key is still in the vault but unreferenced from the active routing path.

9.4 · node_modules linkage

node_modules/@elizaos/core         -> ../../eliza/packages/core
node_modules/@elizaos/plugin-openai -> ../../eliza/plugins/plugin-openai

10 · Verification & test plan

10.1 · What we already know works

Test	Sent at	Replied at	Result
`ping — reply with just: PONG`	15:44:24Z	15:44:36Z	`PONG` · pass · 12 s
`count lines in /home/milady/.milady/service.env`	15:45:35Z	15:47:06Z	`18 lines` (correct) · pass · 91 s
`what is 17 * 23?`	15:46:15Z	15:46:52Z	`391` · pass · 37 s
`list 3 most recently MERGED PRs on elizaOS/eliza`	15:46:56Z	15:51:05Z	`Something flaked` · fail · 4 min · 153K-token overflow

10.2 · Tests that should run after the P0/P1 fixes ship

Smoke (15 prompts): the 14 prompts from the earlier 27-prompt battle test, plus the gh-list prompt that failed today.
Stress (5 prompts known to involve sub-agent spawn): mobile-build smoke, PR rebase, multi-file edit, gh CLI table, swarm-history query. Each should land in <90 s and <500K cumulative tokens.
Failure-injection: force a tool to always fail, send a request that triggers it. Verify the loop aborts on maxRepeatedFailures within 3 iterations, not 13.
Model-swap: repeat the above with OPENAI_LARGE_MODEL=qwen-3-235b-a22b-instruct-2507 to confirm per-model context detection picks up qwen's window correctly.

11 · Appendix — evidence files

Failing trajectory (17 MB JSON)

/home/milady/.milady/trajectories/6f110aa9-c169-0e10-8a4f-b4cca439be25/tj-b8556917146694.json
metrics.totalPromptTokens:      2,205,684
metrics.totalCompletionTokens:     11,229
metrics.totalCacheReadTokens:   1,751,296
metrics.totalCostUsd:                1.1118
metrics.plannerIterations:             13
metrics.toolCallsExecuted:             10
metrics.toolCallFailures:               3
metrics.evaluatorFailures:              6
metrics.finalDecision:           CONTINUE
status:                          errored

Bot service log: the 400 Bad Request from Cerebras

[CEREBRAS-ERR] 400 Bad Request url=https://api.cerebras.ai/v1/chat/completions
  RESP: {"message":"Please reduce the length of the messages or completion.
                   Current length is 153432 while limit is 131000",
         "type":"invalid_request_error",
         "param":"messages",
         "code":"context_length_exceeded"}

Eliza source files referenced

eliza/packages/core/src/runtime/model-input-budget.ts · token estimation + compaction-threshold computation
eliza/packages/core/src/runtime/limits.ts · ChainingLoopConfig, TrajectoryLimitExceeded
eliza/packages/core/src/runtime/planner-loop.ts:825 · compaction trigger
eliza/packages/core/src/runtime/planner-loop.ts:1011 · maybeCompactBeforeNextModelCall
eliza/packages/core/src/runtime/cost-table.ts · candidate for the new contextWindow field
eliza/plugins/plugin-openai/utils/config.ts · isCerebrasMode, model-slot getters
eliza/plugins/plugin-openai/models/embedding.ts · deterministic local-embedding fallback for Cerebras mode
eliza/packages/agent/src/api/provider-switch-config.ts:400-428 · PROVIDER_DEFAULT_MODELS table (the OPENAI default-stamping bug)
eliza/packages/app-core/src/services/local-inference/router-handler.ts · MAX_SAFE_INTEGER-priority router (where local-ai outranks plugin-openai for embeddings)

What the /api/provider/switch call actually wrote

POST /api/provider/switch
{
  "provider": "openai",
  "apiKey":   "csk-...",
  "primaryModel": "qwen-3-235b-a22b-instruct-2507",
  "useLocalEmbeddings": true
}

→ milady.json mutated:
  env.OPENAI_API_KEY        = "vault://OPENAI_API_KEY"
  env.OPENAI_LARGE_MODEL    = "gpt-5.5"            // ← from PROVIDER_DEFAULT_MODELS
  env.OPENAI_SMALL_MODEL    = "gpt-5-mini"         // ← from PROVIDER_DEFAULT_MODELS
  agents.defaults.subscriptionProvider = null
  serviceRouting.llmText    = { backend:"openai", transport:"direct",
                                primaryModel:"qwen-3-235b-a22b-instruct-2507" }