Send 221 tokens instead of 1,259.
NocturnusAI sits between your agent and the LLM. It extracts facts, reasons about what's relevant, and returns only what changed.
# Send raw conversation turns
curl localhost:9300/context \
  -d '{
    "turns": [
      "User: login broken after Okta cutover",
      "Tool crm: account=acme tier=enterprise"
    ],
    "sessionId": "ticket-42"
  }'

# NocturnusAI returns only the delta
✓ briefingDelta: "Login failing post-Okta cutover. Acme is enterprise tier."
~221 tokens instead of ~1,259
Measured on live APIs. Every number is real.
A 15-turn product support conversation. Token counts come straight from usage.input_tokens — not a model, not an estimate.
Full methodology + run it yourself →
Start in your stack.
Pick your framework. Copy the install. Run bench.py against your own workload. Each example ships with the problem, exact before/after token counts, a copy-paste install, and runnable code.
The context layer is happening. The bill already did.
Three things hit production at the same time — and they all compound. A context server is no longer a nice-to-have; it's the layer the agent stack has been missing.
Agent bills scale superlinearly, not linearly.
A $500 POC becomes $847k/month without changing a single line of code — just by adding users. Most of those tokens are the model re-reading what it already knows. That's a structural problem, not a tuning problem.
Similarity is not truth.
When an agent decides based on the current state of a record, a clause, or an inventory level, “the most semantically similar document” is the wrong abstraction. Vector retrieval cannot fix operational hallucinations. Inference over facts can.
Every prompt token is an attention cost.
Long-running agents get slower in proportion to their context. Smaller context = faster response. We see 3× to 10× latency improvements on agents that used to drag as threads grew.
There is going to be a context infrastructure layer in the agent stack — the same way there is a vector database layer, an orchestration layer, and a model layer. Within 18 months it will be obvious it was always there.
Model providers won't build this; they get paid by the token. Agent frameworks won't build this; they need to stay provider-neutral. It's going to be its own layer. Read the launch post →
Reasoning is how you get 82–90% fewer tokens.
Vector stores retrieve everything that looks vaguely related — so your context window fills with noise. NocturnusAI uses logical inference to keep only what must be true. That's the difference between ~1,259 tokens and ~221 (measured on Claude Opus 4).
Extracts facts, not chunks
The LLM pulls structured predicates from raw turns. No embeddings, no chunks, no similarity matching. Just clean facts: customer_tier(acme, enterprise).
Infers what's relevant
Backward chaining finds only facts reachable from your agent's current goals. Everything else stays out of the context window. This is why compression reaches 82%, not 30%.
Returns only the delta
Each turn retrieves only stored facts. By turn 10, your model sees 272 tokens instead of the full 1,607-token thread replay. By turn 15: 331 vs 2,460.
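The goal-driven idea above can be sketched in a few lines. This is a toy, hypothetical illustration of backward chaining over stored facts — not NocturnusAI's actual engine — with made-up predicates (`route_to_priority_queue`, `weather`) to show how facts unreachable from the goal never enter the context window:

```python
# Toy sketch of goal-driven backward chaining. Hypothetical standalone
# code -- not NocturnusAI's implementation.

FACTS = [
    ("customer_tier", ("acme", "enterprise")),
    ("contract_value", ("acme", "2000000")),
    ("weather", ("london", "rainy")),  # never reachable from the goal below
]

# route_to_priority_queue(?c) :- customer_tier(?c, enterprise)
RULES = [
    (("route_to_priority_queue", ("?c",)),
     [("customer_tier", ("?c", "enterprise"))]),
]

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def unify(goal, atom, env):
    """Match two atoms; return an extended binding env, or None."""
    if goal[0] != atom[0] or len(goal[1]) != len(atom[1]):
        return None
    env = dict(env)
    for a, b in zip(goal[1], atom[1]):
        a = env.get(a, a)
        if is_var(a):
            env[a] = b
        elif a != b:
            return None
    return env

def prove(goal, env=None, used=frozenset()):
    """Yield (bindings, facts_used) for every proof of the goal."""
    env = env or {}
    for fact in FACTS:                # ground facts prove goals directly
        e = unify(goal, fact, env)
        if e is not None:
            yield e, used | {fact}
    for head, body in RULES:          # rules chain backward to subgoals
        e = unify(goal, head, env)
        if e is not None:
            yield from prove_all(body, e, used)

def prove_all(goals, env, used):
    if not goals:
        yield env, used
        return
    first = (goals[0][0], tuple(env.get(t, t) for t in goals[0][1]))
    for e, u in prove(first, env, used):
        yield from prove_all(goals[1:], e, u)

# Only facts touched by a proof of the goal enter the context window.
relevant = set()
for _, used in prove(("route_to_priority_queue", ("?c",))):
    relevant |= used
print(sorted(relevant))
# → [('customer_tier', ('acme', 'enterprise'))]
```

The contract value and the weather fact are stored but never surfaced, because no proof of the goal touches them — that is the filtering step that keeps irrelevant facts out of the prompt.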
Why not Mem0, Zep, or Letta?
They store and retrieve. NocturnusAI reasons and compresses.
| | NocturnusAI | Mem0 | Zep | Letta |
|---|---|---|---|---|
| Context strategy | Goal-driven inference | Similarity search | Graph retrieval | Agent-managed |
| Compression method | briefingDelta (only what changed) | — | Summarization | Compaction |
| Works without LLM | Yes — inference is deterministic | No | No | No |
| Self-hosted complexity | 1 container, 1 port | 3 containers | Graph DB required | DB required |
| MCP tools | 16 tools | 9 tools | Experimental | Consumer only |
| Framework integrations | 8 | 4 | 2 | 2 |
$2,400/month instead of $13,600.
At 1,000 requests/hour. Claude Opus 4, $15/1M input tokens. Measured, not estimated.
How we calculated this
Calculation: 1,000 req/hr × 720 hr/mo × 1,259 tokens × $15/1M ≈ $13,597. With NocturnusAI: the same volume × 221 tokens ≈ $2,387. Measured on Claude Opus 4 over a 15-turn benchmark. The 82% reduction is (1,259 − 221) / 1,259 = 82.4%. Gemini 2.0 Flash shows 90%.
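The arithmetic can be checked in a few lines — a sketch using only the request rate, token counts, and price quoted above, not a billing tool:

```python
# Reproduce the cost calculation with the numbers quoted above.
REQS_PER_HOUR = 1_000
HOURS_PER_MONTH = 720
USD_PER_M_INPUT_TOKENS = 15.0     # Claude Opus 4 input pricing, per the text

def monthly_cost(tokens_per_request: int) -> float:
    total_tokens = REQS_PER_HOUR * HOURS_PER_MONTH * tokens_per_request
    return total_tokens / 1_000_000 * USD_PER_M_INPUT_TOKENS

baseline = monthly_cost(1_259)    # naive full-history replay
delta = monthly_cost(221)         # briefingDelta only
print(f"${baseline:,.0f} vs ${delta:,.0f}")               # → $13,597 vs $2,387
print(f"{(1_259 - 221) / 1_259:.1%} fewer input tokens")  # → 82.4% ...
```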
221 tokens, not 1,259.
Per turn. Every turn. The dollars take care of themselves.
Based on a typical long-running agent workload. See the math →
Running in 10 seconds. No API keys.
The logic engine, fact storage, inference, and context ranking all work instantly. Add an LLM later for natural-language extraction.
docker run -d --name nocturnusai -p 9300:9300 ghcr.io/auctalis/nocturnusai:latest
curl -s -X POST localhost:9300/tell -H "Content-Type: application/json" -H "X-Tenant-ID: default" -d '{"predicate":"likes","args":["alice","logic"]}'
curl -s -X POST localhost:9300/ask -H "Content-Type: application/json" -H "X-Tenant-ID: default" -d '{"predicate":"likes","args":["alice","?what"]}'
# → likes(alice, logic) — deterministic, sub-millisecond, zero token cost

# Store facts
curl -s -X POST http://localhost:9300/tell \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"predicate":"customer_tier","args":["acme_corp","enterprise"]}'

curl -s -X POST http://localhost:9300/tell \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"predicate":"contract_value","args":["acme_corp","2000000"]}'

# Query them back
curl -s -X POST http://localhost:9300/ask \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"predicate":"customer_tier","args":["acme_corp","?tier"]}'
customer_tier(acme_corp, enterprise)
Sub-millisecond. Deterministic. The same query always returns the same answer. No token cost.
curl -s -X POST http://localhost:9300/memory/context \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"goals": [{"predicate":"customer_tier","args":["acme_corp","?tier"]}],
"maxFacts": 10,
"sessionId": "ticket-42"
}' | python3 -m json.tool

{
  "windowSize": 1,
  "goalDriven": true,
  "facts": [
    {"predicate": "customer_tier", "args": ["acme_corp", "enterprise"], "salience": 0.65}
  ]
}

Backward chaining finds only facts reachable from your goals. Feed this to your LLM instead of the whole thread.
Needs an LLM for natural-language extraction — run one locally with Ollama (ollama pull granite3.3:8b) or add -e ANTHROPIC_API_KEY=sk-ant-... to your Docker command. Setup guide →

curl -s -X POST http://localhost:9300/context \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"turns": [
"User: We cannot log in after the Okta cutover.",
"Tool crm_lookup: account=acme tier=enterprise"
],
"scope": "ticket-4821",
"sessionId": "ticket-4821"
}' | python3 -m json.tool

## Login Issue
Login is currently failing following the Okta cutover. Acme is on the enterprise tier.
Two messy messages became two clean sentences. On the next turn, only the delta comes back. Full context workflow →
What your model sees (turn 2+)
14 failed SAML assertions at 09:12 UTC. Issuer mismatch after IdP migration.
~221 tokens avg · 5.7× fewer
Without NocturnusAI
User: We can't log in after the Okta cutover.
Tool crm_lookup: account=acme tier=enterprise
Tool auth_audit: 14 failed SAML assertions...
Tool auth_audit: issuer mismatch after IdP...
+ internal notes, retries, system events...
~1,259 tokens avg · naive full-history replay (see benchmark)
Steps 1–2 work instantly with zero dependencies. Step 3 adds LLM extraction — Ollama locally (free) or any cloud provider.
Exact output phrasing varies by model. Cloud LLMs (Anthropic, OpenAI) are faster but require an API key.
One call per turn. The server handles the rest.
Extracts
The LLM pulls structured facts from your raw turns and stores them under the conversation scope.
Remembers
Prior turns are fed back automatically so references like "the account" resolve correctly across turns.
Returns the delta
Each call returns briefingDelta — a short natural-language summary of only what's new since the last turn.
// Each turn — one call
r = POST /context(turns, scope, sessionId)

// Feed the delta to your model
messages = [
  system(r.briefingDelta),
  user(next_question),
]

// Done with the conversation? Clean up.
DELETE /scope/ticket-4821
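In Python, that per-turn loop might look like the sketch below. It is a hypothetical, stdlib-only client: endpoint, headers, and the briefingDelta field follow the examples in this page, but helper names (`context_request`, `messages_for_model`) are made up for illustration.

```python
# Hypothetical per-turn client loop for POST /context, mirroring the
# pseudocode above. Stdlib only; assumes a server on localhost:9300.
import json
import urllib.request

BASE_URL = "http://localhost:9300"

def context_request(turns, scope, session_id):
    """Build the one-call-per-turn POST /context request."""
    payload = json.dumps(
        {"turns": turns, "scope": scope, "sessionId": session_id}
    ).encode()
    return urllib.request.Request(
        f"{BASE_URL}/context",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json",
                 "X-Tenant-ID": "default"},
    )

def messages_for_model(briefing_delta, next_question):
    """Feed the model only the delta, never the full thread."""
    return [
        {"role": "system", "content": briefing_delta},
        {"role": "user", "content": next_question},
    ]

# With a running server, each turn becomes:
#   resp = urllib.request.urlopen(
#       context_request(new_turns, "ticket-4821", "ticket-4821"))
#   delta = json.load(resp)["briefingDelta"]
#   messages = messages_for_model(delta, user_question)
# ...and when the conversation ends: DELETE /scope/ticket-4821.
```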
One server between your agent and the model.
Self-hosted. Open source. Not a memory store — a reasoning engine.
Star on GitHub
Drop into your existing stack.
Native tools for the frameworks you already use. One import, one line.
LangChain · 7 tools + agent example · pip install nocturnusai[langchain]
CrewAI · 5 tools + Storage backend · pip install nocturnusai[crewai]
OpenAI Agents · 5 function tools · pip install nocturnusai[openai-agents]
MCP · 16 tools, 5 IDE configs · Claude, Cursor, Windsurf, VS Code
Vercel AI SDK · Next.js streamText wrapper · npm install nocturnusai-sdk
AutoGen · LangGraph · Tools + Memory + Checkpointer · pip install nocturnusai[autogen]

Full integration docs → · All integrations include get_nocturnusai_tools(client) for one-line setup.
Python & TypeScript. Typed. Zero dependencies (TS).
from nocturnusai import SyncNocturnusAIClient
with SyncNocturnusAIClient("http://localhost:9300") as c:
c.tell("parent", ["alice", "bob"])
c.tell("parent", ["bob", "charlie"])
c.teach(
head={"predicate": "grandparent", "args": ["?x", "?z"]},
body=[
{"predicate": "parent", "args": ["?x", "?y"]},
{"predicate": "parent", "args": ["?y", "?z"]},
],
)
results = c.ask("grandparent", ["?who", "charlie"])
# [Atom(predicate='grandparent', args=['alice', 'charlie'])]

import { NocturnusAIClient } from 'nocturnusai-sdk';
const c = new NocturnusAIClient({
baseUrl: 'http://localhost:9300',
});
await c.tell('parent', ['alice', 'bob']);
await c.tell('parent', ['bob', 'charlie']);
await c.teach(
{ predicate: 'grandparent', args: ['?x', '?z'] },
[
{ predicate: 'parent', args: ['?x', '?y'] },
{ predicate: 'parent', args: ['?y', '?z'] },
],
);
const results = await c.ask('grandparent', ['?who', 'charlie']);
// [{ predicate: 'grandparent', args: ['alice', 'charlie'] }]
48 methods (Python) · 46 methods (TypeScript) · Full async support · Typed returns · Retry with backoff
SDK reference →
Field notes from the context layer.
$600 Billion Says You'll Keep Wasting Tokens
The hyperscaler capex buildout only pays off if token volume grows. Your agent's context replay is the business model.
Read post →

Stop Sending the Model What It Already Knows
Nocturnus is live. The token bill is now optional.

Read post →

Start here.
764 TESTS · ACID TRANSACTIONS · OPEN SOURCE · SINGLE CONTAINER
Docs
Full context workflow, API reference, SDKs, MCP, CLI.
Features
Memory lifecycle, three-layer model, goal-driven context, scope management.
GitHub
Source, issues, Docker images, native binaries.
Blog
Launch essays, benchmarks, and field notes from the team.
Questions, partnerships, or licensing?