Memory & Context River

The AI that actually
remembers.

Cerebral OS sends only the right context. Three stages: one at document upload, one at every message, one as the context window is assembled. The LLM sees only what matters. Nothing else.

58%
Token reduction vs naive context
36→9
Candidates: before vs after curation
91%
Token savings on 10-step reasoning chains
The Pipeline

Three steps.
Only the right context reaches the model.

Most platforms dump everything into the context window and hope the LLM figures it out. Cerebral OS runs a deliberate pipeline. The result: the model only sees what matters.

01 · Upload time
Knowledge Librarian
Runs once per document

Documents are processed once at upload. Every future query hits the exact answer — not a chunk that happens to contain related words.

Processed once. Precise forever.
Runs once per document — never again
Every chunk knows what question it answers
02 · Runtime
Context Curator
Runs every message

Irrelevant context is removed before the model runs. The LLM never sees noise — only what matters for this specific message.

Runs before every LLM call, every time
36 candidates → 9 in production
03 · Assembly
Context River
Assembles the context window

The LLM only sees what matters. Nothing else. Token budget enforced by priority — critical SOPs always in, noise always out.

58% token reduction in production
MANDATORY → HIGH → MEDIUM → LOW priority
Full manifest: what was included, what was dropped
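The three stages can be sketched in miniature. This is a toy model with hypothetical names and naive keyword-overlap relevance, not the production API — it only shows how the stages hand off to each other:

```python
def librarian(document: str) -> list[dict]:
    """Upload time: split a document into chunks, each annotated with
    the question it answers. Runs once per document."""
    return [{"text": part, "question": f"What does '{part[:20]}' cover?"}
            for part in document.split("\n\n")]

def curator(candidates: list[dict], message: str) -> list[dict]:
    """Runtime: drop candidates irrelevant to this specific message.
    A naive keyword overlap stands in for the real relevance model."""
    words = set(message.lower().split())
    return [c for c in candidates
            if words & set(c["text"].lower().split())]

def context_river(kept: list[dict], budget: int) -> str:
    """Assembly: pack curated chunks into the context window until the
    token budget (approximated here as a word count) runs out."""
    window, used = [], 0
    for chunk in kept:
        cost = len(chunk["text"].split())
        if used + cost > budget:
            break
        window.append(chunk["text"])
        used += cost
    return "\n".join(window)
```

On a two-chunk toy document, only the setup chunk survives curation and reaches the window; the unrelated chunk never touches the model.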
Real Numbers

Same question.
Completely different result.

This is a real production comparison — same Cerebral, same question, same document. Before the memory pipeline refactor vs after.

Before — Naive context assembly
"How long does setup take?"
Candidates assembled: 36
Tokens used: 5,123
Curation: None
Grounding: None
"Setup typically takes some time depending on your configuration. Most users find the process straightforward with proper planning and resources."
Generic. No numbers. No grounding.
After — Three-stage memory pipeline
"How long does setup take?"
Candidates assembled: 9
Tokens used: 2,140
Curator kept / dropped: 8 kept · 5 dropped
Grounding: Injected · cites docs
"Initial setup takes about 20 minutes for a standard configuration. Enterprise deployments with custom SOPs and integrations typically complete in 2-4 hours. Based on your plan, you have access to our guided setup flow which most customers complete in under 30 minutes."
Specific. Grounded. Cites real numbers from your docs.
Triple-Vector Search

Three embeddings.
One precision retrieval.

Standard RAG uses a single content embedding. Cerebral OS stores three — and weights them by how questions are actually asked.

Content embedding weight: 0.3
Raw text similarity. Standard RAG stops here. Useful for literal keyword matches but misses semantic intent. We include it but don't over-rely on it.
Question embedding weight: 0.5 — highest
The Librarian extracts "what question does this chunk answer?" at upload time. At runtime, the user's question matches against other questions — not content. This is why a chunk buried on page 47 about pricing outranks everything else when someone asks about cost.
Summary embedding weight: 0.2
A one-sentence summary of each chunk captures concept-level meaning. Catches cases where neither the raw content nor the question phrasing matches directly, but the topic does.
Weighted scoring formula
-- Combined relevance score
-- (NULL distances COALESCE to 1.0:
--  worst score, backward compatible)
  (1 - COALESCE(content_embedding <=> query, 1.0)) * 0.3
+ (1 - COALESCE(question_embedding <=> query, 1.0)) * 0.5
+ (1 - COALESCE(summary_embedding <=> query, 1.0)) * 0.2
What this means in practice
"Tell me about pricing" → matches question "What is the cost?" on chunk 47 → ranks 1st. Before triple-vector, chunk 47 ranked 89th behind generic overview sections.
Related chunk expansion
When a chunk is selected, its related chunks (mapped by the Librarian) are automatically pulled in — so the ROI section and comparison section come with the pricing section, without a second search.
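The weighted combination above can be expressed as a small scoring function. A sketch assuming cosine similarities already normalized to [0, 1]; missing embeddings fall back to the worst score, mirroring the COALESCE behavior:

```python
# Weights from the scoring formula: question embedding dominates.
WEIGHTS = {"content": 0.3, "question": 0.5, "summary": 0.2}

def combined_score(similarities: dict) -> float:
    """similarities maps embedding kind -> cosine similarity to the
    query, or None when that embedding is missing. Missing embeddings
    contribute 0.0 (the backward-compatible worst case)."""
    return sum(w * (similarities.get(kind) or 0.0)
               for kind, w in WEIGHTS.items())
```

With these weights, a chunk whose extracted question closely matches the user's question (0.9) outranks a chunk with higher raw-content similarity (0.6) — the "page 47 pricing chunk" effect.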
Context River

Token budget that
knows what matters.

The Context River assembles curated candidates into the LLM context window using a strict priority system. Critical SOPs always load. Knowledge fills the middle. Conversation history fills the rest. When the budget runs out, the least important drops first — never the SOP, never the policy.

Assembly priority — loads top to bottom, drops bottom first
MANDATORY
SOPs · Scripts · Workflow state
sop · script · workflow_state
Always included · No token limit
HIGH
Catalog · Policy · Available actions · External data
catalog · policy · available_actions · external
Included first within budget
MEDIUM
Training · Customer memory · Knowledge chunks
training · customer · general
Included if budget remains
LOW
Conversation history · General memories
conversation · general
Filled last · Oldest dropped first
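The drop order above can be sketched as a priority loop. Hypothetical structures — only the tier names and the "MANDATORY ignores the budget" rule come from the table:

```python
PRIORITY = ["MANDATORY", "HIGH", "MEDIUM", "LOW"]

def assemble(items: list, budget: int):
    """Load top priority first; MANDATORY is always included regardless
    of budget. Returns (included, dropped) -- the full manifest."""
    included, dropped, used = [], [], 0
    for tier in PRIORITY:
        for item in (i for i in items if i["priority"] == tier):
            if tier == "MANDATORY" or used + item["tokens"] <= budget:
                included.append(item)
                used += item["tokens"]
            else:
                dropped.append(item)
    return included, dropped
```

Even with a zero budget, the SOP still loads; history and knowledge are the first to go when space runs out.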
Token budgets per channel
Chat
32K input · 2K output
History: 25% · Knowledge: 50% · External: 25%
Email
48K input · 4K output
History: 35% · Knowledge: 45% · External: 20%
SMS
8K input · 500 output
History: 20% · Knowledge: 40% · External: 40%
Voice
8K input · 800 output
History: 30% · Knowledge: 40% · External: 30%
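Deriving a section's absolute budget from the channel table is simple arithmetic. A sketch that assumes the percentages apply to the input budget (the table above does not state this explicitly):

```python
# Channel input budgets and section splits, transcribed from the table.
CHANNELS = {
    "chat":  {"input": 32_000, "history": 0.25, "knowledge": 0.50, "external": 0.25},
    "email": {"input": 48_000, "history": 0.35, "knowledge": 0.45, "external": 0.20},
    "sms":   {"input": 8_000,  "history": 0.20, "knowledge": 0.40, "external": 0.40},
    "voice": {"input": 8_000,  "history": 0.30, "knowledge": 0.40, "external": 0.30},
}

def section_budget(channel: str, section: str) -> int:
    """Absolute token budget for one section of one channel."""
    cfg = CHANNELS[channel]
    return round(cfg["input"] * cfg[section])
```

So chat gets a 16K-token knowledge budget, while SMS external data gets 3.2K.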
Token Economics

Better answers.
Lower cost. Not a trade-off.

The memory pipeline doesn't just improve response quality — it measurably reduces LLM token spend. These are real numbers from production, not estimates.

91%
Token reduction on a 10-step reasoning chain. 140,000 tokens naive → 12,500 tokens with Context River.
58%
Average token reduction per message in production. 5,123 tokens → 2,140 tokens on the exact same question.
0%
Generic answers. Every response grounded in your actual knowledge — specific numbers, real context, cited sources. Not training data guesses.
Compression by context layer
Layer · Naive tokens · Cerebral OS · Reduction
Short-term chat · 4,000 · 600 · 85%
Long-term memory · 8,000 · 400 · 95%
Procedural SOP · 2,000 · 250 · 87.5%
Total runtime · 14,000 · 1,250 · 91.1%
Why the savings compound
Naive orchestration has O(n²) token growth — each step replays the full prior context. Cerebral's memory architecture transmits only state deltas (~50 tokens per step) instead of the full conversation (~4,000+ tokens). Over a 10-step reasoning chain, this is the difference between 140,000 tokens and 12,500.
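The 10-step arithmetic checks out directly from the table's per-step totals:

```python
# Per-step token costs from the compression table above.
NAIVE_PER_STEP = 14_000       # full context replayed at every step
COMPRESSED_PER_STEP = 1_250   # compressed layers: 600 + 400 + 250
STEPS = 10

naive_total = NAIVE_PER_STEP * STEPS
compressed_total = COMPRESSED_PER_STEP * STEPS
reduction = 1 - compressed_total / naive_total
```

140,000 naive tokens against 12,500 compressed — a 91.1% reduction across the chain.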
Customer Isolation

Every customer.
Their own memory space.

Memory isolation is enforced at the database level — not application logic. Customer A's conversation history, preferences, and order context are never visible when Customer B is talking to the same Cerebral. Scope filtering is built into every query.

Three memory scopes
system — SOPs, policies, training docs visible to all customers. customer — isolated per individual. visitor — anonymous session scope.
Guaranteed isolation
Customer A's memory is never accessible when Customer B is in conversation. Isolation is enforced at the database level on every query — not application logic.
Metadata boost
When the incoming message contains an order_id, email, or phone number that matches stored memory metadata, that memory's relevance score gets a boost — so the right customer context surfaces automatically.
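The boost logic can be sketched as a scoring adjustment. The field names match the description above, but the 1.5× multiplier is an assumed illustrative value, not the production figure:

```python
def boosted_score(base_score: float, memory_meta: dict, message_meta: dict) -> float:
    """If the incoming message carries an identifier (order_id, email,
    phone) matching the stored memory's metadata, boost its relevance.
    The 1.5x factor is hypothetical."""
    KEYS = ("order_id", "email", "phone")
    match = any(message_meta.get(k) and message_meta.get(k) == memory_meta.get(k)
                for k in KEYS)
    return base_score * 1.5 if match else base_score
```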
Scope filter — every query
-- Every query is scope-filtered
AND (
  scope = 'system'
  OR (scope = 'visitor'
    AND customer_id IS NULL)
  OR (scope = 'customer'
    AND customer_id = $N)
)
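The real filter runs in SQL on every query; a toy simulation of the same predicate makes the isolation guarantee easy to verify:

```python
def visible(memory: dict, current_customer_id) -> bool:
    """Mirror of the SQL scope filter: system memories are visible to
    everyone, visitor memories have no customer_id, customer memories
    only match their own customer."""
    scope = memory["scope"]
    if scope == "system":
        return True
    if scope == "visitor":
        return memory["customer_id"] is None
    if scope == "customer":
        return memory["customer_id"] == current_customer_id
    return False
```

Customer A's memory is visible in Customer A's conversation and invisible in Customer B's — the same predicate the database enforces.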

Memory that
actually works.

58% token reduction. 91% savings on reasoning chains. Better answers from the same knowledge base. See it in production.

Book a Demo See the full platform →