Memory & Context River

The AI that actually
remembers.

Cerebral OS sends only the right context. Three stages: one at document upload, one at every message, one as the context window is assembled. The LLM sees only what matters. Nothing else.

58%
Token reduction vs naive context
36→9
Candidates: before vs after curation
91%
Token savings on 10-step reasoning chains
The Pipeline

Three steps.
Only the right context reaches the model.

Most platforms dump everything into the context window and hope the LLM figures it out. Cerebral OS runs a deliberate pipeline. The result: the model only sees what matters.

01 · Upload time
Knowledge Librarian
Runs once per document

Documents are processed once at upload. Every future query hits the exact answer — not a chunk that happens to contain related words.

Processed once. Precise forever.
Runs once per document — never again
Every chunk knows what question it answers
02 · Runtime
Context Curator
Runs every message

Irrelevant context is removed before the model runs. The LLM never sees noise — only what matters for this specific message.

Runs before every LLM call, every time
36 candidates → 9 in production
03 · Assembly
Context River
Assembles the context window

The LLM only sees what matters. Nothing else. Token budget enforced by priority — critical SOPs always in, noise always out.

58% token reduction in production
MANDATORY → HIGH → MEDIUM → LOW priority
Full manifest: what was included, what was dropped
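The three stages can be sketched in miniature. This is a toy model with hypothetical names and naive keyword-overlap relevance, not the production API — it only shows how the stages hand off to each other:

```python
def librarian(document: str) -> list[dict]:
    """Upload time: split a document into chunks, each annotated with
    the question it answers. Runs once per document."""
    return [{"text": part, "question": f"What does '{part[:20]}' cover?"}
            for part in document.split("\n\n")]

def curator(candidates: list[dict], message: str) -> list[dict]:
    """Runtime: drop candidates irrelevant to this specific message.
    A naive keyword overlap stands in for the real relevance model."""
    words = set(message.lower().split())
    return [c for c in candidates
            if words & set(c["text"].lower().split())]

def context_river(kept: list[dict], budget: int) -> str:
    """Assembly: pack curated chunks into the context window until the
    token budget (approximated here as a word count) runs out."""
    window, used = [], 0
    for chunk in kept:
        cost = len(chunk["text"].split())
        if used + cost > budget:
            break
        window.append(chunk["text"])
        used += cost
    return "\n".join(window)
```

On a two-chunk toy document, only the setup chunk survives curation and reaches the window; the unrelated chunk never touches the model.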
Real Numbers

Same question.
Completely different result.

This is a real production comparison — same Cerebral, same question, same document. Before the memory pipeline refactor vs after.

Before — Naive context assembly
"How long does setup take?"
Candidates assembled: 36
Tokens used: 5,123
Curation: None
Grounding: None
"Setup typically takes some time depending on your configuration. Most users find the process straightforward with proper planning and resources."
Generic. No numbers. No grounding.
After — Three-stage memory pipeline
"How long does setup take?"
Candidates assembled: 9
Tokens used: 2,140
Curator kept / dropped: 8 kept · 5 dropped
Grounding: Injected · cites docs
"Initial setup takes about 20 minutes for a standard configuration. Enterprise deployments with custom SOPs and integrations typically complete in 2-4 hours. Based on your plan, you have access to our guided setup flow which most customers complete in under 30 minutes."
Specific. Grounded. Cites real numbers from your docs.
Triple-Vector Search

Three embeddings.
One precision retrieval.

Standard RAG uses a single content embedding. Cerebral OS stores three — and weights them by how questions are actually asked.

Content embedding weight: 0.3
Raw text similarity. Standard RAG stops here. Useful for literal keyword matches but misses semantic intent. We include it but don't over-rely on it.
Question embedding weight: 0.5 — highest
The Librarian extracts "what question does this chunk answer?" at upload time. At runtime, the user's question matches against other questions — not content. This is why a chunk buried on page 47 about pricing outranks everything else when someone asks about cost.
Summary embedding weight: 0.2
A one-sentence summary of each chunk captures concept-level meaning. Catches cases where neither the raw content nor the question phrasing matches directly, but the topic does.
Weighted scoring formula
-- Combined relevance score
-- (NULL distances COALESCE to 1.0:
--  worst score, backward compatible)
  (1 - COALESCE(content_embedding <=> query, 1.0)) * 0.3
+ (1 - COALESCE(question_embedding <=> query, 1.0)) * 0.5
+ (1 - COALESCE(summary_embedding <=> query, 1.0)) * 0.2
What this means in practice
"Tell me about pricing" → matches question "What is the cost?" on chunk 47 → ranks 1st. Before triple-vector, chunk 47 ranked 89th behind generic overview sections.
Related chunk expansion
When a chunk is selected, its related chunks (mapped by the Librarian) are automatically pulled in — so the ROI section and comparison section come with the pricing section, without a second search.
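The weighted combination above can be expressed as a small scoring function. A sketch assuming cosine similarities already normalized to [0, 1]; missing embeddings fall back to the worst score, mirroring the COALESCE behavior:

```python
# Weights from the scoring formula: question embedding dominates.
WEIGHTS = {"content": 0.3, "question": 0.5, "summary": 0.2}

def combined_score(similarities: dict) -> float:
    """similarities maps embedding kind -> cosine similarity to the
    query, or None when that embedding is missing. Missing embeddings
    contribute 0.0 (the backward-compatible worst case)."""
    return sum(w * (similarities.get(kind) or 0.0)
               for kind, w in WEIGHTS.items())
```

With these weights, a chunk whose extracted question closely matches the user's question (0.9) outranks a chunk with higher raw-content similarity (0.6) — the "page 47 pricing chunk" effect.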
Context River

Token budget that
knows what matters.

The Context River assembles curated candidates into the LLM context window using a strict priority system. Critical SOPs always load. Knowledge fills the middle. Conversation history fills the rest. When the budget runs out, the least important drops first — never the SOP, never the policy.

Assembly priority — loads top to bottom, drops bottom first
MANDATORY
SOPs · Scripts · Workflow state
sop · script · workflow_state
Always included · No token limit
HIGH
Catalog · Policy · Available actions · External data
catalog · policy · available_actions · external
Included first within budget
MEDIUM
Training · Customer memory · Knowledge chunks
training · customer · general
Included if budget remains
LOW
Conversation history · General memories
conversation · general
Filled last · Oldest dropped first
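The drop order above can be sketched as a priority loop. Hypothetical structures — only the tier names and the "MANDATORY ignores the budget" rule come from the table:

```python
PRIORITY = ["MANDATORY", "HIGH", "MEDIUM", "LOW"]

def assemble(items: list, budget: int):
    """Load top priority first; MANDATORY is always included regardless
    of budget. Returns (included, dropped) -- the full manifest."""
    included, dropped, used = [], [], 0
    for tier in PRIORITY:
        for item in (i for i in items if i["priority"] == tier):
            if tier == "MANDATORY" or used + item["tokens"] <= budget:
                included.append(item)
                used += item["tokens"]
            else:
                dropped.append(item)
    return included, dropped
```

Even with a zero budget, the SOP still loads; history and knowledge are the first to go when space runs out.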
Token budgets per channel
Chat
32K input · 2K output
History: 25% · Knowledge: 50% · External: 25%
Email
48K input · 4K output
History: 35% · Knowledge: 45% · External: 20%
SMS
8K input · 500 output
History: 20% · Knowledge: 40% · External: 40%
Voice
8K input · 800 output
History: 30% · Knowledge: 40% · External: 30%
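Deriving a section's absolute budget from the channel table is simple arithmetic. A sketch that assumes the percentages apply to the input budget (the table above does not state this explicitly):

```python
# Channel input budgets and section splits, transcribed from the table.
CHANNELS = {
    "chat":  {"input": 32_000, "history": 0.25, "knowledge": 0.50, "external": 0.25},
    "email": {"input": 48_000, "history": 0.35, "knowledge": 0.45, "external": 0.20},
    "sms":   {"input": 8_000,  "history": 0.20, "knowledge": 0.40, "external": 0.40},
    "voice": {"input": 8_000,  "history": 0.30, "knowledge": 0.40, "external": 0.30},
}

def section_budget(channel: str, section: str) -> int:
    """Absolute token budget for one section of one channel."""
    cfg = CHANNELS[channel]
    return round(cfg["input"] * cfg[section])
```

So chat gets a 16K-token knowledge budget, while SMS external data gets 3.2K.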
Token Economics

Better answers.
Lower cost. Not a trade-off.

The memory pipeline doesn't just improve response quality — it measurably reduces LLM token spend. These are real numbers from production, not estimates.

91%
Token reduction on a 10-step reasoning chain. 140,000 tokens naive → 12,500 tokens with Context River.
58%
Average token reduction per message in production. 5,123 tokens → 2,140 tokens on the exact same question.
0%
Generic answers. Every response grounded in your actual knowledge — specific numbers, real context, cited sources. Not training data guesses.
Compression by context layer
Layer · Naive tokens · Cerebral OS · Reduction
Short-term chat · 4,000 · 600 · 85%
Long-term memory · 8,000 · 400 · 95%
Procedural SOP · 2,000 · 250 · 87.5%
Total runtime · 14,000 · 1,250 · 91.1%
Why the savings compound
Naive orchestration has O(n²) token growth — each step replays the full prior context. Cerebral's memory architecture transmits only state deltas (~50 tokens per step) instead of the full conversation (~4,000+ tokens). Over a 10-step reasoning chain, this is the difference between 140,000 tokens and 12,500.
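The 10-step arithmetic checks out directly from the table's per-step totals:

```python
# Per-step token costs from the compression table above.
NAIVE_PER_STEP = 14_000       # full context replayed at every step
COMPRESSED_PER_STEP = 1_250   # compressed layers: 600 + 400 + 250
STEPS = 10

naive_total = NAIVE_PER_STEP * STEPS
compressed_total = COMPRESSED_PER_STEP * STEPS
reduction = 1 - compressed_total / naive_total
```

140,000 naive tokens against 12,500 compressed — a 91.1% reduction across the chain.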
Customer Isolation

Every customer.
Their own memory space.

Memory isolation is enforced at the database level — not application logic. Customer A's conversation history, preferences, and order context are never visible when Customer B is talking to the same Cerebral. Scope filtering is built into every query.

Three memory scopes
system — SOPs, policies, training docs visible to all customers. customer — isolated per individual. visitor — anonymous session scope.
Guaranteed isolation
Customer A's memory is never accessible when Customer B is in conversation. Isolation is enforced at the database level on every query — not application logic.
Metadata boost
When the incoming message contains an order_id, email, or phone number that matches stored memory metadata, that memory's relevance score gets a boost — so the right customer context surfaces automatically.
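The boost logic can be sketched as a scoring adjustment. The field names match the description above, but the 1.5× multiplier is an assumed illustrative value, not the production figure:

```python
def boosted_score(base_score: float, memory_meta: dict, message_meta: dict) -> float:
    """If the incoming message carries an identifier (order_id, email,
    phone) matching the stored memory's metadata, boost its relevance.
    The 1.5x factor is hypothetical."""
    KEYS = ("order_id", "email", "phone")
    match = any(message_meta.get(k) and message_meta.get(k) == memory_meta.get(k)
                for k in KEYS)
    return base_score * 1.5 if match else base_score
```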
Scope filter — every query
-- Every query is scope-filtered
AND (
  scope = 'system'
  OR (scope = 'visitor'
    AND customer_id IS NULL)
  OR (scope = 'customer'
    AND customer_id = $N)
)
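The real filter runs in SQL on every query; a toy simulation of the same predicate makes the isolation guarantee easy to verify:

```python
def visible(memory: dict, current_customer_id) -> bool:
    """Mirror of the SQL scope filter: system memories are visible to
    everyone, visitor memories have no customer_id, customer memories
    only match their own customer."""
    scope = memory["scope"]
    if scope == "system":
        return True
    if scope == "visitor":
        return memory["customer_id"] is None
    if scope == "customer":
        return memory["customer_id"] == current_customer_id
    return False
```

Customer A's memory is visible in Customer A's conversation and invisible in Customer B's — the same predicate the database enforces.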

Memory that
actually works.

58% token reduction. 91% savings on reasoning chains. Better answers from the same knowledge base. See it in production.

Book a Demo See the full platform →