The switch to Cerebras works for ~80% of traffic. Simple chat, tool-bearing one-shots, and short tool chains all return clean answers via gpt-oss-120b, in 12 to 91 seconds across the smoke tests. Verified: PONG, 17×23=391, service.env has 18 lines.
The remaining ~20% — multi-step planner loops — overflow Cerebras's 131K-token context window. A single user message like “list my 3 most recently merged PRs” spawned a 13-iteration planner loop. By the 13th iteration the prompt had grown to 153,432 tokens and Cerebras returned context_length_exceeded. Total spend: 2.2M tokens / $1.11 on one failed turn.
Eliza already has a compaction system (packages/core/src/runtime/model-input-budget.ts + runtime/limits.ts + runtime/planner-loop.ts) with sensible defaults (128K window, 10K reserve, keep 4 recent steps verbatim). It fired, but it didn't fire hard enough for Cerebras's tighter context plus the bot's particular failure-loop pattern. Three concrete fixes are listed in §8.
The Cerebras key works perfectly. The problem is not Cerebras and not the key — it’s the gap between Eliza’s assumption (≥128K context, <128K typical prompt) and what the planner actually produces when a task fails its first 2–3 tool calls.
Goal: disable Claude inference, route all bot inference to Cerebras using key csk-8c9hf68jfm6h95....
- GET /v1/models: gpt-oss-120b, qwen-3-235b-a22b-instruct-2507, zai-glm-4.7, llama3.1-8b.
- POST /api/provider/switch with {provider:"openai", apiKey:"csk-..."}. Eliza already detects Cerebras via OPENAI_BASE_URL=https://api.cerebras.ai/v1 + MILADY_PROVIDER=cerebras.
- Set OPENAI_*_MODEL=gpt-oss-120b across all four slots (NANO/SMALL/MEDIUM/LARGE) in both service.env and milady.json (top-level env + nested env.vars).
- Symlinked @elizaos/plugin-openai and @elizaos/core to local source so the Cerebras-mode codepath actually runs (npm alpha.537 predates the Cerebras support).
- Restarted milady.service; ran a 4-test smoke suite via Discord.

Test prompt:
@remilio nubilio list my 3 most recently MERGED PRs on elizaOS/eliza
using gh cli, just number + title each line
Bot reply ~4 minutes later: "Something flaked on my end, please try again."
Trajectory metrics for this single user message (trajectories/.../tj-b8556917146694.json — 17 MB JSON):
| Metric | Value | Note |
|---|---|---|
| Planner iterations | 13 | Bot replanned 13 times trying to land the task |
| Tool calls executed | 10 | 3 failed (gh CLI not directly available; sub-agent spawn took over) |
| Evaluator failures | 6 | Evaluator said "CONTINUE" rather than letting the loop finish |
| Total prompt tokens | 2,205,684 | Cumulative across all 13 planner calls |
| Cache reads | 1,751,296 | ~79% cached — that's good, but the uncached portion still ballooned |
| Total completion tokens | 11,229 | ~200:1 input-to-output ratio |
| Cost | $1.11 | For one Discord question that produced "flaked" |
| Total latency | 142,146 ms | 142 seconds |
| Final decision | CONTINUE | Never reached FINISH; killed by terminal_only_continuations limit |
| Status | errored | — |
The actual exception, captured via a custom fetch wrapper I added to eliza/plugins/plugin-openai/providers/openai.ts during the debug session:
One Discord message → 13 planner iterations. Each iteration is its own complete model call with the full trajectory so far embedded in the prompt. The growth pattern:
| Iteration | What the planner sees | Approx. prompt size |
|---|---|---|
| 1 | User message + system prompt + tools schema + context providers | ~10–20K tokens |
| 2 | + step 1 thought + tool call + tool result | ~25K |
| 3 | + step 2 (a sub-agent spawn — large output) | ~50K |
| … | (each step appends sub-agent PTY output, evaluator JSON, etc.) | — |
| 13 | 13 full step transcripts + canonical context + tools + system | 153,432 tokens — overflow |
What kept the loop going for 13 iterations rather than stopping at step 2 or 3:
- The evaluator kept returning decision: CONTINUE instead of FINISH(success=false), so the planner kept replanning rather than reporting failure to the user.
- The compactionKeepSteps: 4 default kept the four most-recent step transcripts verbatim, and those four steps happened to include a 50 KB sub-agent PTY dump.
- TrajectoryLimitExceeded finally killed the loop.

To Shaw's question: yes, this is supposed to be handled by Eliza already. Three subsystems are in place:
eliza/packages/core/src/runtime/model-input-budget.ts
export const DEFAULT_CONTEXT_WINDOW_TOKENS = 128_000;
export const DEFAULT_COMPACTION_RESERVE_TOKENS = 10_000;
export function buildModelInputBudget(args: { ... }): ModelInputBudget {
// estimatedInputTokens = ceil(chars / 3.5)
// compactionThresholdTokens = contextWindowTokens - reserveTokens // = 118K default
// shouldCompact = estimatedInputTokens >= compactionThresholdTokens
}
Before every planner model call, the loop estimates how many tokens the next request will use. If the estimate exceeds the compaction threshold, compaction fires.
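The arithmetic behind that check can be sketched as follows. The constants and the three formulas are taken from the excerpt above; the function name and its simplified signature are assumptions made for illustration, not the real buildModelInputBudget.

```typescript
// Constants and formulas mirror the excerpt above.
// The function shape is a simplified, hypothetical stand-in.
const DEFAULT_CONTEXT_WINDOW_TOKENS = 128_000;
const DEFAULT_COMPACTION_RESERVE_TOKENS = 10_000;

interface BudgetSketch {
  estimatedInputTokens: number;
  compactionThresholdTokens: number;
  shouldCompact: boolean;
}

function sketchModelInputBudget(
  promptChars: number,
  contextWindowTokens: number = DEFAULT_CONTEXT_WINDOW_TOKENS,
  reserveTokens: number = DEFAULT_COMPACTION_RESERVE_TOKENS,
): BudgetSketch {
  // Character-based estimate: one token per 3.5 characters, rounded up.
  const estimatedInputTokens = Math.ceil(promptChars / 3.5);
  // Compaction fires once the estimate reaches window minus reserve.
  const compactionThresholdTokens = contextWindowTokens - reserveTokens;
  return {
    estimatedInputTokens,
    compactionThresholdTokens,
    shouldCompact: estimatedInputTokens >= compactionThresholdTokens,
  };
}

// 350,000 chars estimate to 100,000 tokens: below the 118,000 threshold.
console.log(sketchModelInputBudget(350_000));
// 420,000 chars estimate to 120,000 tokens: compaction should fire.
console.log(sketchModelInputBudget(420_000));
```

Note that the decision depends entirely on the estimate, so any systematic undercount in chars / 3.5 delays compaction by the same margin.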
eliza/packages/core/src/runtime/planner-loop.ts:825
if (modelInputBudget.shouldCompact && params.config.compactionEnabled) {
const compacted = await maybeCompactPlannerTrajectory({
trajectory, budget, config, recorder, trajectoryId, parentStageId, iteration, logger,
});
if (compacted) { /* re-render messages with the summary substituted */ }
}
The compactor takes old planner steps, summarizes them via summarizePlannerStep(), and replaces them in the trajectory with a single text block ("Compacted prior planner trajectory steps because estimated input approached the model context window. compacted_steps: N, kept_recent_steps_verbatim: 4, …"). The four newest steps stay verbatim so the planner has high-fidelity recent context.
eliza/packages/core/src/runtime/limits.ts
export const DEFAULT_CHAINING_LOOP_CONFIG: ChainingLoopConfig = {
maxToolCalls: 16,
maxRepeatedFailures: 2,
maxTerminalOnlyContinuations: 2, // <-- this is what stopped our 13-iter loop
contextWindowTokens: 128_000,
compactionReserveTokens: 10_000,
compactionEnabled: true,
compactionKeepSteps: 4,
};
Three hard limits cap any single user message: max tool calls (16), max repeated failures (2), max terminal-only continuations (2). Hitting any one throws TrajectoryLimitExceeded and the bot returns a "structured failure" reply.
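A minimal sketch of how three count-based caps like these could be evaluated together. The limit values come from the config above and terminal_only_continuations appears in the trajectory metrics; the counter shape, the other two kind strings, and the strict greater-than comparison are assumptions, and the real loop may count differently.

```typescript
interface LoopLimits {
  maxToolCalls: number;
  maxRepeatedFailures: number;
  maxTerminalOnlyContinuations: number;
}

// Hypothetical counter shape; the real loop tracks these internally.
interface LoopCounters {
  toolCalls: number;
  repeatedFailures: number;
  terminalOnlyContinuations: number;
}

// Values copied from DEFAULT_CHAINING_LOOP_CONFIG above.
const DEFAULT_LIMITS: LoopLimits = {
  maxToolCalls: 16,
  maxRepeatedFailures: 2,
  maxTerminalOnlyContinuations: 2,
};

type LimitKind = "tool_calls" | "repeated_failures" | "terminal_only_continuations";

// Returns the first limit that has been exceeded, or null if the loop may continue.
// Assumption: a limit trips when the counter goes strictly above its maximum.
function firstExceededLimit(
  counters: LoopCounters,
  limits: LoopLimits = DEFAULT_LIMITS,
): LimitKind | null {
  if (counters.toolCalls > limits.maxToolCalls) return "tool_calls";
  if (counters.repeatedFailures > limits.maxRepeatedFailures) return "repeated_failures";
  if (counters.terminalOnlyContinuations > limits.maxTerminalOnlyContinuations) {
    return "terminal_only_continuations";
  }
  return null;
}

console.log(firstExceededLimit({ toolCalls: 10, repeatedFailures: 1, terminalOnlyContinuations: 3 }));
```

None of the three counters looks at tokens or cost, which is why a loop can stay within all of them while its prompt keeps growing.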
The default of 128_000 assumes ≥128K context. gpt-oss-120b has 131K, leaving only 3K of headroom above the assumed window (13K above the 118K compaction threshold). With Claude Opus 4.7 (200K), there was 72K of headroom. Every model picks up these defaults regardless of its real ceiling.
The estimator uses chars / 3.5. That's accurate for English prose but underestimates tool/JSON-heavy content (tool schemas, JSON tool results, base64 attachments). Our trajectory's "estimated 118K" was actually 153K on the wire: the real count was about 30% above the estimate.
compactionKeepSteps: 4 keeps the four newest steps in full. When a step contains a sub-agent PTY dump or a multi-thousand-line file read, those four steps alone can be 80–100K tokens — leaving the compactor with no room to add the summary.
Today the loop stops on maxTerminalOnlyContinuations: 2 (count of evaluator-said-CONTINUE-but-no-tool-fired) but has no cumulative-token guard. The failed turn spent 2.2M tokens / $1.11 before it died. A runaway loop could in principle spend $10+ before the count-based limit catches it.
The Cerebras-specific functions isCerebrasMode(), sanitizeFunctionNameForCerebras(), normalizeSchemaForCerebras() and the deterministic local-embedding fallback all live in the local eliza/ source tree but are absent from the published npm tarball (@elizaos/plugin-openai at alpha.537, ditto for @elizaos/core). Verified:
$ grep -c "sanitizeFunctionNameForCerebras" \
node_modules/.bun/@elizaos+plugin-openai@...alpha.537+.../dist/node/index.node.js
0
$ grep -c "sanitizeFunctionNameForCerebras" \
eliza/plugins/plugin-openai/dist/node/index.node.js
16
Switched the npm symlink to the local build (ln -sf $(realpath eliza/plugins/plugin-openai) node_modules/@elizaos/plugin-openai, same for @elizaos/core) so the Cerebras codepath actually runs.
/api/provider/switch rewrites OPENAI_*_MODEL to OpenAI defaults

eliza/packages/agent/src/api/provider-switch-config.ts:410-414 contains a PROVIDER_DEFAULT_MODELS map:
openai: {
smallKey: "OPENAI_SMALL_MODEL", smallVal: "gpt-5-mini",
largeKey: "OPENAI_LARGE_MODEL", largeVal: "gpt-5.5",
}
When the switch handler fires with provider:"openai" it writes those values into milady.json > env > OPENAI_LARGE_MODEL = "gpt-5.5" (top-level env, not env.vars). That value overrides the systemd-supplied OPENAI_LARGE_MODEL=qwen-... at runtime hydration time. Took 90 minutes of debugging to spot because process.env showed the right value but the runtime had already mutated it.
Bot log:
The local-AI plugin (node-llama-cpp) is registered as a TEXT_EMBEDDING provider at MAX_SAFE_INTEGER priority. When it's called, it attempts to compile llama.cpp via cmake-js-llama every single time. That compile fails on this VPS (missing cmake toolchain), so the call falls through to the next provider in the chain — burning real time on every message that needs embeddings.
Eliza already has a clean Cerebras-mode fallback in eliza/plugins/plugin-openai/models/embedding.ts (shouldUseLocalEmbeddingFallback: when isCerebrasMode() && no explicit embedding endpoint → return a deterministic locally-hashed vector). But it never gets called because local-ai outranks it.
The user sees "Something flaked on my end, please try again." That's good ops-hygiene (no internal leak) but it means the user has no signal to not retry the same query, which will hit the same wall and burn another $1.11.
When asked to reply with just "PONG", gpt-oss-120b spent ~45 completion tokens (most of them reasoning) — vs qwen-3-235b-a22b-instruct-2507 which spent 3. For the chatty bot path, qwen is cheaper. For tool-planning where reasoning helps, gpt-oss-120b is better. Worth slotting differently:
| Slot | Today | Suggested for Cerebras | Why |
|---|---|---|---|
| TEXT_NANO | gpt-oss-120b | llama3.1-8b | Tokenization, classification — no reasoning needed |
| TEXT_SMALL | gpt-oss-120b | qwen-3-235b-... | Chat replies, no chain-of-thought waste |
| TEXT_MEDIUM | gpt-oss-120b | qwen-3-235b-... | Same |
| TEXT_LARGE | gpt-oss-120b | gpt-oss-120b | Planner uses reasoning effectively |
Concrete, ordered by ratio of payoff to effort.
Extend runtime/cost-table.ts from a price-only table to a capability table that includes contextWindow per model id. Add a small lookup in buildModelInputBudget that reads the active model's window when the caller doesn't pass one. Raise the default safety reserve from 10K to 15K to give compaction more room. Effect: every model self-tunes to its own ceiling; no more 131K-vs-128K landmines.
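One way the lookup might look, as a sketch only. The table shape, constant names and both function names are hypothetical; the single entry uses the 131,000 limit that Cerebras reported in the error for this incident, and windows for other models would have to come from each provider's documentation.

```typescript
// Hypothetical shape: the recommendation only says the table should gain
// a contextWindow per model id.
interface ModelCapability {
  contextWindowTokens: number;
}

// 131_000 is the limit from the context_length_exceeded error in this report.
const MODEL_CAPABILITIES: Record<string, ModelCapability> = {
  "gpt-oss-120b": { contextWindowTokens: 131_000 },
};

const FALLBACK_CONTEXT_WINDOW_TOKENS = 128_000; // today's default
const PROPOSED_RESERVE_TOKENS = 15_000; // the raised reserve proposed above

// An explicit window from the caller wins; otherwise use the model's own
// window, and only then fall back to the global default.
function resolveContextWindow(modelId: string, explicitWindow?: number): number {
  if (explicitWindow !== undefined) return explicitWindow;
  return MODEL_CAPABILITIES[modelId]?.contextWindowTokens ?? FALLBACK_CONTEXT_WINDOW_TOKENS;
}

function compactionThresholdFor(modelId: string, explicitWindow?: number): number {
  return resolveContextWindow(modelId, explicitWindow) - PROPOSED_RESERVE_TOKENS;
}

console.log(compactionThresholdFor("gpt-oss-120b")); // 116000
console.log(compactionThresholdFor("some-unlisted-model")); // 113000
```

Keeping the explicit-argument override first means existing callers that already pass a window see no behaviour change.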
Add maxTrajectoryPromptTokens: 500_000 to ChainingLoopConfig. After each planner stage, sum totalPromptTokens across stages; if > max, abort with TrajectoryLimitExceeded(kind: "trajectory_token_budget"). Effect: cost-per-incident bounded at known dollars instead of "however much it takes."
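A rough sketch of such a guard, assuming each planner stage records its prompt-token usage. The StageUsage shape, the class name and the function name are hypothetical; only the 500_000 default and the "trajectory_token_budget" kind come from the recommendation above.

```typescript
// Hypothetical per-stage usage record.
interface StageUsage {
  promptTokens: number;
}

// Stand-in for TrajectoryLimitExceeded(kind: "trajectory_token_budget").
class TrajectoryTokenBudgetExceeded extends Error {
  readonly kind = "trajectory_token_budget";
  readonly spentPromptTokens: number;

  constructor(spent: number, max: number) {
    super(`trajectory used ${spent} prompt tokens, budget is ${max}`);
    this.name = "TrajectoryTokenBudgetExceeded";
    this.spentPromptTokens = spent;
  }
}

// Sums prompt tokens across all stages so far and throws once the
// cumulative total goes above the budget. Returns the total otherwise.
function assertWithinTokenBudget(
  stages: readonly StageUsage[],
  maxTrajectoryPromptTokens: number = 500_000,
): number {
  const spent = stages.reduce((sum, stage) => sum + stage.promptTokens, 0);
  if (spent > maxTrajectoryPromptTokens) {
    throw new TrajectoryTokenBudgetExceeded(spent, maxTrajectoryPromptTokens);
  }
  return spent;
}

const stages: StageUsage[] = [
  { promptTokens: 100_000 },
  { promptTokens: 150_000 },
  { promptTokens: 200_000 },
];
console.log(assertWithinTokenBudget(stages)); // 450000, still under budget
```

Called after every stage, a check like this would have stopped the failed turn well before it reached 2.2M prompt tokens.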
Replace estimateTokensFromChars(chars) = ceil(chars / 3.5) with provider-aware tokenization. Use tiktoken via @dqbd/tiktoken for OpenAI-shape providers (Cerebras's gpt-oss-120b tokenizes similarly to o200k_base). Keep the chars/3.5 path as a fallback when no tokenizer is registered. Effect: the estimator stops undercounting by roughly 30%; compaction fires at the right moment.
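The registry-with-fallback idea could be sketched like this. Everything here is hypothetical naming: the Tokenizer type, registerTokenizer and estimateTokens are illustrative, and a real tokenizer (for example one backed by tiktoken) would be plugged in where the word-counting stand-in is registered below.

```typescript
// A tokenizer here is any function that returns a token count for a string.
type Tokenizer = (text: string) => number;

const tokenizers = new Map<string, Tokenizer>();

function registerTokenizer(providerId: string, tokenizer: Tokenizer): void {
  tokenizers.set(providerId, tokenizer);
}

// The existing heuristic, kept as the fallback path.
function estimateTokensFromChars(chars: number): number {
  return Math.ceil(chars / 3.5);
}

// Uses the provider's tokenizer when one is registered and working;
// otherwise falls back to the character-based estimate.
function estimateTokens(text: string, providerId?: string): number {
  const tokenizer = providerId ? tokenizers.get(providerId) : undefined;
  if (tokenizer) {
    try {
      return tokenizer(text);
    } catch {
      // A failing tokenizer must not break the planner loop; fall through.
    }
  }
  return estimateTokensFromChars(text.length);
}

// Stand-in tokenizer for demonstration only: counts whitespace-separated words.
registerTokenizer("demo", (text) => text.split(/\s+/).filter(Boolean).length);

console.log(estimateTokens("a".repeat(35))); // 10, via the chars / 3.5 fallback
console.log(estimateTokens("list my merged PRs", "demo")); // 4, via the registered tokenizer
```

Swallowing tokenizer errors is a deliberate choice in this sketch: a slightly wrong estimate is cheaper than a crashed planner stage.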
In maybeCompactPlannerTrajectory, after slicing off the four kept-verbatim newest steps, sum their estimated token sizes. If the sum exceeds 0.5 × compactionThresholdTokens, additionally truncate each kept step to a head + tail (e.g., 4K head + 2K tail) with a [… truncated N chars …] middle. Effect: a single pathologically-large tool output can't single-handedly blow the budget.
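A sketch of the head-plus-tail truncation, under two assumptions: steps are treated as plain strings, and the 4K/2K figures are read as characters to match the marker's "N chars" wording. Function names are hypothetical and the real compactor operates on richer step objects.

```typescript
// Same character-based estimate the budget code uses today.
const estimateStepTokens = (text: string): number => Math.ceil(text.length / 3.5);

// Keeps the first headChars and last tailChars characters and replaces
// the middle with a marker stating how much was removed.
function truncateMiddle(text: string, headChars: number, tailChars: number): string {
  if (text.length <= headChars + tailChars) return text;
  const removed = text.length - headChars - tailChars;
  const head = text.slice(0, headChars);
  // slice(-0) would return the whole string, so guard the zero case.
  const tail = tailChars > 0 ? text.slice(-tailChars) : "";
  return `${head}\n[… truncated ${removed} chars …]\n${tail}`;
}

// Truncates the kept-verbatim steps only when together they exceed
// half of the compaction threshold; otherwise returns them unchanged.
function clampKeptSteps(
  steps: readonly string[],
  compactionThresholdTokens: number,
  headChars: number = 4_000,
  tailChars: number = 2_000,
): string[] {
  const total = steps.reduce((sum, step) => sum + estimateStepTokens(step), 0);
  if (total <= 0.5 * compactionThresholdTokens) return [...steps];
  return steps.map((step) => truncateMiddle(step, headChars, tailChars));
}

console.log(truncateMiddle("abcdefghij", 3, 2));
// abc
// [… truncated 5 chars …]
// ij
```

Keeping both ends matters for tool output: the head usually carries the command and its first results, the tail carries the exit status or final error.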
In provider-switch-config.ts guard PROVIDER_DEFAULT_MODELS.openai behind !isCerebrasOrThirdPartyBase(env.OPENAI_BASE_URL). If we're already pointed at Cerebras/Groq/together/openrouter, leave the model names alone. Effect: "/api/provider/switch with provider=openai" stops fighting the user's already-configured Cerebras models.
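A possible shape for that guard. The helper name is taken from the recommendation above, but its logic is an assumption: this sketch treats any explicit base URL whose host is not api.openai.com as third-party, instead of listing individual providers. applyOpenAiDefaults is a hypothetical stand-in for the part of the switch handler that stamps the defaults.

```typescript
// Assumed logic: no override means the stock OpenAI endpoint; any other
// host counts as Cerebras or another OpenAI-compatible third party.
function isCerebrasOrThirdPartyBase(baseUrl: string | undefined): boolean {
  if (!baseUrl) return false;
  try {
    return new URL(baseUrl).hostname.toLowerCase() !== "api.openai.com";
  } catch {
    // An unparseable override is still an override: leave the models alone.
    return true;
  }
}

// Default values as quoted from PROVIDER_DEFAULT_MODELS in this report.
const OPENAI_DEFAULTS = {
  OPENAI_SMALL_MODEL: "gpt-5-mini",
  OPENAI_LARGE_MODEL: "gpt-5.5",
} as const;

type Env = Record<string, string | undefined>;

// Returns a copy of env with the OpenAI defaults stamped in, unless the
// base URL already points somewhere other than OpenAI.
function applyOpenAiDefaults(env: Env): Env {
  if (isCerebrasOrThirdPartyBase(env["OPENAI_BASE_URL"])) return { ...env };
  return { ...env, ...OPENAI_DEFAULTS };
}

const cerebrasEnv: Env = {
  OPENAI_BASE_URL: "https://api.cerebras.ai/v1",
  OPENAI_LARGE_MODEL: "gpt-oss-120b",
};
console.log(applyOpenAiDefaults(cerebrasEnv)["OPENAI_LARGE_MODEL"]); // gpt-oss-120b
console.log(applyOpenAiDefaults({})["OPENAI_LARGE_MODEL"]); // gpt-5.5
```

A host-based check avoids maintaining a provider allowlist, at the cost of also skipping defaults for OpenAI-hosted proxies on custom domains.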
When the local-ai plugin's TEXT_EMBEDDING handler is registered, run the compile path once at boot. If it fails, register at priority -Infinity instead of MAX_SAFE_INTEGER. Effect: broken local-AI installs don't shadow working remote providers.
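The probe-once idea might be sketched as below. All names are hypothetical and the check function stands in for whatever actually triggers the llama.cpp compile; only the two priority values come from the text above.

```typescript
const PRIORITY_PREFERRED = Number.MAX_SAFE_INTEGER;
const PRIORITY_LAST_RESORT = Number.NEGATIVE_INFINITY;

type ProbeResult = { ok: true } | { ok: false; reason: string };

// Wraps an expensive check so it runs a single time; later callers
// receive the cached outcome instead of re-running the compile.
function createBootProbe(check: () => Promise<void>): () => Promise<ProbeResult> {
  let cached: Promise<ProbeResult> | undefined;
  return () => {
    if (!cached) {
      cached = check().then(
        (): ProbeResult => ({ ok: true }),
        (err: unknown): ProbeResult => ({
          ok: false,
          reason: err instanceof Error ? err.message : String(err),
        }),
      );
    }
    return cached;
  };
}

// A working local backend keeps top priority; a broken one drops below
// every other registered provider.
function embeddingPriorityFor(result: ProbeResult): number {
  return result.ok ? PRIORITY_PREFERRED : PRIORITY_LAST_RESORT;
}

// Usage: the failing check simulates a missing cmake toolchain.
const probe = createBootProbe(async () => {
  throw new Error("cmake toolchain missing");
});
probe().then((result) => console.log(embeddingPriorityFor(result))); // -Infinity
```

Because the failure is captured as a value instead of a rejected promise, the registration code can branch on it without a try/catch at every call site.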
The local source has every fix needed. Publishing @elizaos/core and @elizaos/plugin-openai with the next alpha bump removes the workspace-link workaround. Effect: any milady user can switch to Cerebras with three env vars, no node_modules surgery.
When the limit fires, attach the trajectoryId to the failure reply, and emit a HOOK_TRAJECTORY_LIMIT event so monitoring picks it up. Effect: a report from a user who keeps hitting the same failure can be traced directly to its trajectory.
This is what's live on the VPS right now.
OPENAI_BASE_URL=https://api.cerebras.ai/v1
MILADY_PROVIDER=cerebras
OPENAI_LARGE_MODEL=gpt-oss-120b
OPENAI_SMALL_MODEL=gpt-oss-120b
OPENAI_MEDIUM_MODEL=gpt-oss-120b
OPENAI_NANO_MODEL=gpt-oss-120b
"env": {
"OPENAI_LARGE_MODEL": "gpt-oss-120b",
"OPENAI_SMALL_MODEL": "gpt-oss-120b",
"OPENAI_MEDIUM_MODEL": "gpt-oss-120b",
"OPENAI_NANO_MODEL": "gpt-oss-120b",
"OPENAI_BASE_URL": "https://api.cerebras.ai/v1",
"MILADY_PROVIDER": "cerebras",
"vars": { /* mirror of the above + OPENAI_API_KEY: "vault://OPENAI_API_KEY" */ }
}
"serviceRouting": {
"llmText": { "backend": "openai", "transport": "direct", "primaryModel": "gpt-oss-120b" }
}
The vault's OPENAI_API_KEY entry stores the Cerebras key as csk-…, encrypted with the bot's vault passphrase. The Anthropic key is still in the vault but unreferenced from the active routing path.
node_modules/@elizaos/core -> ../../eliza/packages/core
node_modules/@elizaos/plugin-openai -> ../../eliza/plugins/plugin-openai
| Test | Sent at | Replied at | Result |
|---|---|---|---|
| ping — reply with just: PONG | 15:44:24Z | 15:44:36Z | PONG · pass · 12 s |
| count lines in /home/milady/.milady/service.env | 15:45:35Z | 15:47:06Z | 18 lines (correct) · pass · 91 s |
| what is 17 * 23? | 15:46:15Z | 15:46:52Z | 391 · pass · 37 s |
| list 3 most recently MERGED PRs on elizaOS/eliza | 15:46:56Z | 15:51:05Z | Something flaked · fail · 4 min · 153K-token overflow |
- maxRepeatedFailures within 3 iterations, not 13.
- OPENAI_LARGE_MODEL=qwen-3-235b-a22b-instruct-2507 to confirm per-model context detection picks up qwen's window correctly.

/home/milady/.milady/trajectories/6f110aa9-c169-0e10-8a4f-b4cca439be25/tj-b8556917146694.json
metrics.totalPromptTokens: 2,205,684
metrics.totalCompletionTokens: 11,229
metrics.totalCacheReadTokens: 1,751,296
metrics.totalCostUsd: 1.1118
metrics.plannerIterations: 13
metrics.toolCallsExecuted: 10
metrics.toolCallFailures: 3
metrics.evaluatorFailures: 6
metrics.finalDecision: CONTINUE
status: errored
[CEREBRAS-ERR] 400 Bad Request url=https://api.cerebras.ai/v1/chat/completions
RESP: {"message":"Please reduce the length of the messages or completion.
Current length is 153432 while limit is 131000",
"type":"invalid_request_error",
"param":"messages",
"code":"context_length_exceeded"}
- ChainingLoopConfig, TrajectoryLimitExceeded
- maybeCompactBeforeNextModelCall
- contextWindow field
- isCerebrasMode, model-slot getters
- PROVIDER_DEFAULT_MODELS table (the OPENAI default-stamping bug)

What the /api/provider/switch call actually wrote

POST /api/provider/switch
{
"provider": "openai",
"apiKey": "csk-...",
"primaryModel": "qwen-3-235b-a22b-instruct-2507",
"useLocalEmbeddings": true
}
→ milady.json mutated:
env.OPENAI_API_KEY = "vault://OPENAI_API_KEY"
env.OPENAI_LARGE_MODEL = "gpt-5.5" // ← from PROVIDER_DEFAULT_MODELS
env.OPENAI_SMALL_MODEL = "gpt-5-mini" // ← from PROVIDER_DEFAULT_MODELS
agents.defaults.subscriptionProvider = null
serviceRouting.llmText = { backend:"openai", transport:"direct",
primaryModel:"qwen-3-235b-a22b-instruct-2507" }