CONTEXT ENGINEERING ENGINE

Send 221 tokens instead of 1,259.

NocturnusAI sits between your agent and the LLM. It extracts facts, reasons about what's relevant, and returns only what changed.

$2,400/mo instead of $13,600/mo
82–90% fewer tokens · source available · single container
POST /context
# Send raw conversation turns
$ curl localhost:9300/context \
  -d '{
    "turns": [
      "User: login broken after Okta cutover",
      "Tool crm: account=acme tier=enterprise"
    ],
    "sessionId": "ticket-42"
  }'

# NocturnusAI returns only the delta
 briefingDelta: "Login failing post-Okta
  cutover. Acme is enterprise tier."

  ~221 tokens instead of ~1,259
Benchmark

Measured on live APIs. Every number is real.

15-turn product support conversation. Token counts from usage.input_tokens. Not a model. Not an estimate.

Claude Opus 4 — 5.7× fewer tokens · 82% reduction
Naive: 2,460 tokens · NocturnusAI: 331 tokens

Gemini 2.0 Flash — 10.0× fewer tokens · 90% reduction
Naive: 7,084 tokens · NocturnusAI: 330 tokens

15 turns · live API · source available

Full methodology + run it yourself →



Why Now

The context layer is happening. The bill already did.

Three things hit production at the same time — and they all compound. A context server is no longer a nice-to-have; it's the layer the agent stack has been missing.

Cost

Agent bills scale exponentially, not linearly.

A $500 POC becomes $847k/month without changing a single line of code — just by adding users. Most of those tokens are the model re-reading what it already knows. That's a structural problem, not a tuning problem.
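The compounding is easy to see with a back-of-envelope model: under full-history replay, a conversation's total input tokens grow quadratically with its turn count, and the bill multiplies that by user count. A sketch with illustrative numbers (not benchmark data):

```python
# Full-history replay: turn k re-sends all k prior turns.
# Total input tokens for an n-turn thread: t + 2t + ... + nt = t * n * (n + 1) / 2.
def replay_tokens(turns, tokens_per_turn=100):
    return tokens_per_turn * turns * (turns + 1) // 2

print(replay_tokens(10))  # → 5500
print(replay_tokens(50))  # → 127500 — 5× the turns, ~23× the tokens
```

Quadratic growth per conversation, times a linearly growing user base: that is why the bill outruns the code.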

Correctness

Similarity is not truth.

When an agent decides based on the current state of a record, a clause, or an inventory level, “the most semantically similar document” is the wrong abstraction. Vector retrieval cannot fix operational hallucinations. Inference over facts can.

Latency

Every prompt token is an attention cost.

Long-running agents get slower in proportion to their context. Smaller context = faster response. We see 3× to 10× latency improvements on agents that used to drag as threads grew.
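One reason shorter prompts help: self-attention work grows roughly with the square of prompt length, so prefill cost shrinks more than proportionally when the context shrinks. A rough, illustrative comparison using the benchmark's per-turn averages (real wall-clock gains depend on hardware and caching, hence the more modest 3–10× observed):

```python
# Self-attention over a prompt of n tokens does ~O(n^2) work, so relative
# prefill cost can be compared by squaring prompt sizes.
def relative_attention_cost(tokens):
    return tokens ** 2

naive, compressed = 1_259, 221  # per-turn averages from the benchmark
ratio = relative_attention_cost(naive) / relative_attention_cost(compressed)
print(f"~{ratio:.0f}x less attention work")  # observed wall-clock gains: 3–10×
```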

There is going to be a context infrastructure layer in the agent stack — the same way there is a vector database layer, an orchestration layer, and a model layer. Within 18 months it will be obvious it was always there.

Model providers won't build this; they get paid by the token. Agent frameworks won't build this; they need to stay provider-neutral. It's going to be its own layer. Read the launch post →


How It Compresses

Reasoning is how you get 82–90% fewer tokens.

Vector stores retrieve everything that looks vaguely related — so your context window fills with noise. NocturnusAI uses logical inference to keep only what must be true. That's the difference between ~1,259 tokens and ~221 (measured on Claude Opus 4).

Step 1

Extracts facts, not chunks

The LLM pulls structured predicates from raw turns. No embeddings, no chunks, no similarity matching. Just clean facts: customer_tier(acme, enterprise).
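To make "facts, not chunks" concrete: the extraction target is a flat predicate, not an embedded text blob. A deterministic toy parser for the tool-output line from the example above — in the real system the LLM does this extraction, and `tool_line_to_facts` is a hypothetical helper that only shows the target shape:

```python
# Target shape: predicate(args...) — e.g. customer_tier(acme, enterprise).
def tool_line_to_facts(line):
    # "Tool crm: account=acme tier=enterprise" → [("customer_tier", ...)]
    _, payload = line.split(": ", 1)
    kv = dict(pair.split("=") for pair in payload.split())
    facts = []
    if "account" in kv and "tier" in kv:
        facts.append(("customer_tier", (kv["account"], kv["tier"])))
    return facts

print(tool_line_to_facts("Tool crm: account=acme tier=enterprise"))
# → [('customer_tier', ('acme', 'enterprise'))]
```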

Step 2

Infers what's relevant

Backward chaining finds only facts reachable from your agent's current goals. Everything else stays out of the context window. This is why compression reaches 82%, not 30%.
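A one-step sketch of the idea in plain Python — illustrative only; the real engine also chains through rules, not just stored facts. Given goal patterns with `?variables`, keep only the facts that unify with some goal:

```python
# Facts are (predicate, args) tuples; goal args may contain "?var" placeholders.
facts = [
    ("customer_tier", ("acme", "enterprise")),
    ("contract_value", ("acme", "2000000")),
    ("login_status", ("acme", "failing")),
    ("office_city", ("acme", "berlin")),  # unreachable from the goals below
]

def matches(goal, fact):
    gpred, gargs = goal
    fpred, fargs = fact
    if gpred != fpred or len(gargs) != len(fargs):
        return False
    # A "?var" matches anything; constants must match exactly.
    return all(g.startswith("?") or g == f for g, f in zip(gargs, fargs))

def relevant(goals, facts):
    # Keep only facts reachable from the agent's current goals.
    return [f for f in facts if any(matches(g, f) for g in goals)]

goals = [("customer_tier", ("acme", "?tier")), ("login_status", ("acme", "?s"))]
print(relevant(goals, facts))
# → [('customer_tier', ('acme', 'enterprise')), ('login_status', ('acme', 'failing'))]
```

Everything that doesn't unify — here, the contract value and the office city — simply never enters the window.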

Step 3

Returns only the delta

Each turn retrieves only stored facts. By turn 10, your model sees 272 tokens instead of the full 1,607-token thread replay. By turn 15: 331 vs 2,460.
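The widening gap follows directly from how the two strategies grow. A toy model with made-up per-turn token counts (not the benchmark numbers):

```python
# Naive replay: each call re-sends all prior turns. Delta: only what's new.
turn_tokens = [160, 140, 150, 170, 155]  # hypothetical tokens added per turn

naive_inputs = []
total = 0
for t in turn_tokens:
    total += t
    naive_inputs.append(total)  # full-history replay grows every turn

delta_inputs = turn_tokens      # a delta stays near the per-turn size

print(naive_inputs)                          # → [160, 300, 450, 620, 775]
print(sum(naive_inputs), sum(delta_inputs))  # → 2305 775
```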

No LLM required for inference
1 container — no Postgres, Redis, or vector DB
ACID transactions with truth maintenance
Fork/diff/merge for hypothetical reasoning

Why not Mem0, Zep, or Letta?

They store and retrieve. NocturnusAI reasons and compresses.

Compared (NocturnusAI · Mem0 · Zep · Letta):
Context strategy: Goal-driven inference · Similarity search · Graph retrieval · Agent-managed
Compression method: briefingDelta (only what changed) · Summarization · Compaction · –
Works without LLM: Yes — inference is deterministic · No · No · No
Self-hosted complexity: 1 container, 1 port · 3 containers · Graph DB required · DB required
MCP tools: 16 tools · 9 tools · Experimental · Consumer only
Framework integrations: 8 · 4 · 2 · 2

How the logic engine compresses context →


The Cost

$2,400/month instead of $13,600.

At 1,000 requests/hour. Claude Opus 4, $15/1M input tokens. Measured, not estimated.

221 tokens per turn with NocturnusAI
1,259 tokens per turn naive average
82% reduction · Claude Opus 4 average

How we calculated this

Without NocturnusAI: a 15-turn product support conversation averages ~1,259 input tokens per LLM call when replaying the full thread (user messages, tool outputs, system events). At 1,000 requests/hour with Claude Opus 4 ($15/1M input tokens), that's $13,600/month.
With NocturnusAI: only stored facts are retrieved per turn — averaging ~221 tokens. Same scenario: $2,400/month. The logic engine runs locally with zero token cost; only the initial LLM extraction step uses tokens.

Calculation: 1,000 req/hr × 720 hr/mo × 1,259 tokens × $15/1M = $13,596. With NocturnusAI: same × 221 tokens = $2,387. Measured on Claude Opus 4 over a 15-turn benchmark. The 82% reduction is (1,259 − 221) / 1,259 = 82.4%. Gemini 2.0 Flash shows 90%.
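The same arithmetic, runnable (figures match the quoted $13,596 and $2,387 to within a dollar of rounding the per-turn averages):

```python
# Monthly cost at 1,000 req/hr with Claude Opus 4 input pricing.
REQ_PER_MONTH = 1_000 * 720        # 1,000 req/hr × 720 hr/mo
PRICE_PER_TOKEN = 15 / 1_000_000   # $15 per 1M input tokens

def monthly_cost(tokens_per_turn):
    return REQ_PER_MONTH * tokens_per_turn * PRICE_PER_TOKEN

naive = monthly_cost(1_259)        # ≈ $13,597
compressed = monthly_cost(221)     # ≈ $2,387
reduction = (1_259 - 221) / 1_259  # ≈ 82.4%
print(f"${naive:,.0f} vs ${compressed:,.0f} · {reduction:.1%} fewer tokens")
```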


The Token Math

221 tokens, not 1,259.

Per turn. Every turn. The dollars take care of themselves.

Naive agent: 1,259 tokens per turn
NocturnusAI: 221 tokens per turn
You save: 1,038 tokens per turn — 82% reduction

Based on a typical long-running agent workload. See the math →


Try It

Running in 10 seconds. No API keys.

The logic engine, fact storage, inference, and context ranking all work instantly. Add an LLM later for natural-language extraction.

Fastest path — 3 lines, no signup
docker run -d --name nocturnusai -p 9300:9300 ghcr.io/auctalis/nocturnusai:latest
curl -s -X POST localhost:9300/tell -H "Content-Type: application/json" -H "X-Tenant-ID: default" -d '{"predicate":"likes","args":["alice","logic"]}'
curl -s -X POST localhost:9300/ask  -H "Content-Type: application/json" -H "X-Tenant-ID: default" -d '{"predicate":"likes","args":["alice","?what"]}'
# → likes(alice, logic)  — deterministic, sub-millisecond, zero token cost
1 Store facts and run inference — instant, no LLM
# Store facts
curl -s -X POST http://localhost:9300/tell \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"predicate":"customer_tier","args":["acme_corp","enterprise"]}'

curl -s -X POST http://localhost:9300/tell \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"predicate":"contract_value","args":["acme_corp","2000000"]}'

# Query them back
curl -s -X POST http://localhost:9300/ask \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"predicate":"customer_tier","args":["acme_corp","?tier"]}'
customer_tier(acme_corp, enterprise)

Sub-millisecond. Deterministic. The same query always returns the same answer. No token cost.

2 Get a salience-ranked context window — still no LLM
Goal-driven context for your agent's next step
curl -s -X POST http://localhost:9300/memory/context \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "goals": [{"predicate":"customer_tier","args":["acme_corp","?tier"]}],
    "maxFacts": 10,
    "sessionId": "ticket-42"
  }' | python3 -m json.tool
{ "windowSize": 1, "goalDriven": true,
  "facts": [{"predicate":"customer_tier", "args":["acme_corp","enterprise"], "salience":0.65}] }

Backward chaining finds only facts reachable from your goals. Feed this to your LLM instead of the whole thread.

3 Turn compression — add an LLM for natural-language extraction
Requires an LLM. Restart with Ollama (ollama pull granite3.3:8b) or add -e ANTHROPIC_API_KEY=sk-ant-... to your Docker command. Setup guide →
Send raw conversation turns
curl -s -X POST http://localhost:9300/context \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "turns": [
      "User: We cannot log in after the Okta cutover.",
      "Tool crm_lookup: account=acme tier=enterprise"
    ],
    "scope": "ticket-4821",
    "sessionId": "ticket-4821"
  }' | python3 -m json.tool
## Login Issue
Login is currently failing following the Okta cutover.
Acme is on the enterprise tier.

Two messy messages became two clean sentences. On the next turn, only the delta comes back. Full context workflow →

What your model sees (turn 2+)

14 failed SAML assertions at 09:12 UTC.
Issuer mismatch after IdP migration.

~221 tokens avg · 5.7× fewer

Without NocturnusAI

User: We can't log in after the Okta cutover.
Tool crm_lookup: account=acme tier=enterprise
Tool auth_audit: 14 failed SAML assertions...
Tool auth_audit: issuer mismatch after IdP...
+ internal notes, retries, system events...

~1,259 tokens avg · naive full-history replay (see benchmark)

Steps 1–2 work instantly with zero dependencies. Step 3 adds LLM extraction — Ollama locally (free) or any cloud provider.
Exact output phrasing varies by model. Cloud LLMs (Anthropic, OpenAI) are faster but require an API key.


How It Works

One call per turn. The server handles the rest.

Extracts

The LLM pulls structured facts from your raw turns and stores them under the conversation scope.

Remembers

Prior turns are fed back automatically so references like "the account" resolve correctly across turns.

Returns the delta

Each call returns briefingDelta — a short natural-language summary of only what's new since the last turn.

Your code
// Each turn — one call
r = POST /context(turns, scope, sessionId)

// Feed the delta to your model
messages = [
  system(r.briefingDelta),
  user(next_question),
]

// Done with the conversation? Clean up.
DELETE /scope/ticket-4821

Where It Fits

One server between your agent and the model.

Your Agent — sends raw turns →
NocturnusAI — Extract · Store · Infer · Rank · Diff →
LLM API — receives only the delta

Self-hosted. Open source. Not a memory store — a reasoning engine.



SDKs

Python & TypeScript. Typed. Zero dependencies (TS).

pip install nocturnusai
from nocturnusai import SyncNocturnusAIClient

with SyncNocturnusAIClient("http://localhost:9300") as c:
    c.tell("parent", ["alice", "bob"])
    c.tell("parent", ["bob", "charlie"])
    c.teach(
        head={"predicate": "grandparent", "args": ["?x", "?z"]},
        body=[
            {"predicate": "parent", "args": ["?x", "?y"]},
            {"predicate": "parent", "args": ["?y", "?z"]},
        ],
    )
    results = c.ask("grandparent", ["?who", "charlie"])
    # [Atom(predicate='grandparent', args=['alice', 'charlie'])]
npm install nocturnusai-sdk
import { NocturnusAIClient } from 'nocturnusai-sdk';

const c = new NocturnusAIClient({
  baseUrl: 'http://localhost:9300',
});

await c.tell('parent', ['alice', 'bob']);
await c.tell('parent', ['bob', 'charlie']);
await c.teach(
  { predicate: 'grandparent', args: ['?x', '?z'] },
  [
    { predicate: 'parent', args: ['?x', '?y'] },
    { predicate: 'parent', args: ['?y', '?z'] },
  ],
);
const results = await c.ask('grandparent', ['?who', 'charlie']);
// [{ predicate: 'grandparent', args: ['alice', 'charlie'] }]

48 methods (Python) · 46 methods (TypeScript) · Full async support · Typed returns · Retry with backoff
SDK reference →

Start here.

764 TESTS · ACID TRANSACTIONS · OPEN SOURCE · SINGLE CONTAINER

Docs

Full context workflow, API reference, SDKs, MCP, CLI.

Features

Memory lifecycle, three-layer model, goal-driven context, scope management.

GitHub

Source, issues, Docker images, native binaries.

Blog

Launch essays, benchmarks, and field notes from the team.

Questions, partnerships, or licensing?

dev@nocturnus.ai · licensing@nocturnus.ai