Yak — Personal LLM Memory System Design
Dual-layer memory architecture with temporal decay, 9B-driven warmth, and versioned distillations. Built in Go with PostgreSQL + pgvector.
Overview
Yak is a personal memory system for LLMs that combines:
- Raw storage — Full conversation exchanges
- Distillation layer — LLM-generated summaries with confidence scoring
- Semantic retrieval — Vector search with temporal decay
Dual-Layer Architecture
┌─────────────────────────────────────────────────────────────┐
│ Raw Memory Layer │
│ Stores: "User: ...\nAssistant: ..." (full fidelity) │
│ Purpose: Source of truth, audit trail, re-distillation │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Distillation Layer │
│ Stores: 1-4 factual sentences + confidence + tier + embed │
│ Purpose: Efficient semantic search, RAG injection │
└─────────────────────────────────────────────────────────────┘
Temporal Decay
Older memories are penalised during retrieval using exponential decay:
score = cosine_distance × e^(ln(2)/half_life × age_days)
| Parameter | Default | Effect |
|---|---|---|
half_life_days | 30 | Older memories decay exponentially |
core_threshold_multiplier | 2.0× | Core memories always surface |
decay_enabled | true | Set to 0 to disable |
Rationale: Recency matters. A preference from 6 months ago is less relevant than one from last week.
Importance Tiers
| Tier | Threshold | Examples |
|---|---|---|
| core | 2.0× looser | Name, location, occupation, family |
| medium | 1.0× (baseline) | Preferences, habits, recurring topics |
| low | 0.5× stricter | Ephemeral context, one-off mentions |
How it works: Core memories use a 2× higher cosine distance threshold during recall, ensuring identity facts always surface.
Confidence Scoring
LLM outputs distillation + confidence (0–1):
{
"summary": "User prefers coffee over tea.",
"confidence": 0.87
}
| Confidence | Action |
|---|---|
| ≥ 0.8 | Accept, store normally |
| 0.6–0.8 | Accept, flag for later review |
| < 0.6 | Flag for manual review, don’t auto-deploy |
Why? Prevents hallucinated or weak distillations from polluting memory.
Memory Reinforcement Learning (9B-Driven)
Since cost is not a constraint for personal infra, all warmth updates use 9B evaluation instead of heuristics.
1. Per-Turn 9B Relevance Evaluation
After every conversation turn, the 9B worker evaluates all recalled memories:
Flow:
After assistant responds:
Input: [recalled_memories, full_conversation_turn]
For each recalled memory:
9B evaluates: "Was memory X useful in this context? Score 0–1"
If score > 0.7: warmth += 0.1
If score < 0.3: warmth -= 0.2
If 0.3–0.7: no change (neutral)
Latency: ~2s per turn (acceptable for personal use)
Why per-turn? Immediate feedback loop. Memories that are actually useful rise faster. Misfires are penalised in real-time.
2. 9B-Driven Warmth (Not Heuristics)
All warmth signals come from 9B evaluation, not heuristics:
| Signal | Heuristic (Old) | 9B Evaluation (New) |
|---|---|---|
| Memory recalled | warmth += 0.05 | 9B: "Was this relevant?" (0–1) |
| User references it | warmth += 0.1 | 9B: "Did user use this?" (0–1) |
| User says “thanks” | warmth += 0.15 | 9B: "Was this helpful?" (0–1) |
Why? LLM understands context. Heuristics are brittle (e.g., user says “thanks” for unrelated reason).
3. Near-Duplicate Handling (Vector-Only)
Skip keyword keys entirely. Use vector similarity only — pgvector is fast enough.
Detection Flow:
For each new distillation:
1. Compute embedding (768-dim)
2. pgvector cosine search (limit=5, threshold=0.85)
3. If matches found: VERSION (don't merge)
4. If no matches: CREATE NEW
Versioned Distillations:
CREATE TABLE distillation_versions (
id UUID PRIMARY KEY,
distillation_id UUID FK → distillations,
content TEXT,
created_at TIMESTAMPTZ,
merged_from UUID[] -- parent distillation IDs
);
Merge Strategy:
IF similarity > 0.85:
- Create new distillation_version
- 9B consolidates: old_content + new_content → unified
- Store unified content
- Keep old versions for audit
ELSE:
- Create new distillation (normal flow)
Why versioned? Append creates bloat, replace loses history. Version + 9B consolidation preserves both.
4. 9B Conflict Detection
On every new memory, 9B checks for contradictions with existing memories:
Input: [new_distillation, existing_similar_distillations]
9B evaluates:
"Do these contradict? Score 0–1"
If conflict > 0.7:
- Flag for review
- Don't auto-merge
- Alert user
Why? 9B detects semantic contradictions that heuristics miss.
5. Gravity (Intrinsic Importance)
Each memory has an intrinsic “gravity” that acts as a base multiplier:
gravity = tier_multiplier × user_weight
core → 2.0
medium → 1.0
low → 0.5
final_score = cosine_distance × decay × (1 + warmth) × gravity
Why: Core facts (name, location) always surface. Preferences (medium) surface when relevant. Ephemeral (low) fade faster.
Final Recall Score
score = cosine_similarity × e^(ln(2)/half_life × age_days)
× (1 + warmth)
× gravity
× tier_bonus
| Component | Range | Purpose |
|---|---|---|
cosine_similarity | 0–1 | Semantic match |
decay | 0–1 | Temporal recency |
1 + warmth | 1–2 | Usage frequency |
gravity | 0.5–2.0 | Intrinsic importance |
tier_bonus | 1.0–2.0 | Core memory boost |
Core Principles
1. No Prompt Mutations from Memory Injection
Memories are injected as a distinct, isolated block — never interpolated into existing template sections.
Why: Prompt mutations cause template corruption, inconsistent injection points, and make debugging difficult.
Pattern:
<system_prompt>
...original template...
</system_prompt>
<memories>
{{ recalled_distillations }}
</memories>
<user_query>
...
</user_query>
Rules:
- Memory block always injected at fixed position (before user query, after system prompt)
- Use XML/Markdown delimiters to isolate memory content
- Never interpolate memories into template variables
- If memory injection fails, prompt still works (graceful degradation)
2. Inference-Last Ethos
Prioritise performant, deterministic tools before using the LLM layer.
Retrieval Pipeline:
- Keyword filters — Exact match, user_id scoping, tier filtering
- Vector search — pgvector cosine similarity with decay
- LLM summarisation — Only if recall results need synthesis
Storage Pipeline:
- Keyword delete detection —
/forgetor “forget this” (no LLM) - Near-duplicate check — Vector similarity only (version, don’t merge)
- LLM distillation — Async, only for new memories
Why:
- Reduces latency (keyword search < 1ms vs. LLM 100s of ms)
- Reduces cost (no LLM calls for simple operations)
- More predictable behaviour (deterministic filters)
- LLM reserved for reasoning tasks, not retrieval
Explicit Delete Detection
User-initiated deletion via explicit command:
Command:
/forget <memory_id>
/forget <exact phrase>
forget this memory
delete memory: <id>
Flow:
1. Parse command for /forget keyword
2. If memory_id provided → DELETE by ID
3. If phrase provided → SEARCH → CONFIRM → DELETE
4. Cascade delete: raw memory + distillation + all versions + embeddings
5. Log deletion to audit trail (soft delete option)
Why explicit? Prevents accidental deletion from LLM hallucinations. User must explicitly request.
Error Handling & Discord Webhooks
Explicit error alerts with configurable webhook destination.
Config:
errors:
enabled: true
webhook_url_env: "YAK_ERROR_WEBHOOK" # From env, not config file
dedup_window_sec: 300 # 5 minutes
alert_on:
- distillation_failed
- embedding_failed
- db_connection_lost
- queue_overflow
Alert Format:
🚨 Yak Error: distillation_failed
Memory ID: abc-123
Error: LLM timeout after 30s
Retry: 2/3
Time: 2026-06-01T09:15:00Z
Levels:
warn— Logged, no alert (e.g., low confidence)error— Logged + webhook (e.g., LLM failure)critical— Logged + webhook + escalate to SMS/pager (e.g., DB down)
Deduplication:
- Same error type within 5 min window = 1 alert
- After dedup window expires, alert again
- Prevents webhook spam during outages
Metrics
Prometheus /metrics endpoint for observability.
Endpoints:
GET /metrics → Prometheus format
Metrics:
| Metric | Type | Description |
|---|---|---|
yak_requests_total | counter | Total API requests |
yak_distillations_created | counter | Distillations stored |
yak_distillations_versioned | counter | Near-duplicates detected (versioned) |
yak_memories_deleted | counter | Deletions |
yak_recall_latency_ms | histogram | Search latency |
yak_distillation_latency_ms | histogram | LLM latency |
yak_queue_size | gauge | Pending jobs |
yak_warmth_updates | counter | 9B warmth evaluation jobs |
yak_conflicts_detected | counter | Contradiction alerts |
yak_errors_total | counter | Errors by type |
Example query:
# Average recall latency
histogram_quantile(0.95, yak_recall_latency_ms_bucket)
# Version rate (near-duplicates)
rate(yak_distillations_versioned[5m])
# Conflict rate
rate(yak_conflicts_detected[5m])
# Error rate
rate(yak_errors_total[5m])
CLI
Thin API wrapper for manual operations. Supports local and remote instances.
Commands:
yak --help
# Server
yak serve # Start HTTP API
yak serve --port 8000 # Custom port
# Memory ops
yak memory list --user user-123 # List memories
yak memory show <id> # Show details
yak memory delete <id> # Delete by ID
yak memory search "coffee" --limit 10 # Search
# Distillation ops
yak distill list --user user-123 # List distillations
yak distill version <id1> <id2> # Manual version (merge)
yak distill re-embed <id> # Re-compute embedding
# Warmth ops
yak warmth show <id> # Show warmth score
yak warmth reset <id> # Reset to 0.5
yak warmth bulk-reset # Reset all (debug)
# Admin
yak metrics # Show metrics
yak queue status # Show queue
yak health # Health check
# Remote support
yak --remote https://yak.example.com memory list
yak --api-key xxx memory delete <id>
Why thin wrapper? Reuses API logic, supports remote instances, --json output for scripting.
Background Jobs (Real-Time + Batch)
Per-Turn 9B Relevance Evaluation
Every conversation turn:
After assistant responds:
Input: [recalled_memories, full_conversation_turn]
For each recalled memory:
9B evaluates: "Was memory X useful? Score 0–1"
If score > 0.7: warmth += 0.1
If score < 0.3: warmth -= 0.2
Latency: ~2s per turn (acceptable for personal use)
Nightly Batch (Optional)
02:00 local time:
- Re-evaluate all memories from last 7 days
- Catch any edge cases missed in per-turn evaluation
- Adjust warmth for memories that were recalled but not used
Data Model
memories table
id UUID PRIMARY KEY
content TEXT -- "User: ...\nAssistant: ..."
user_id TEXT -- Multi-user scoping
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ
distillations table
id UUID PRIMARY KEY
memory_id UUID FK → memories
content TEXT -- 1-4 factual sentences
confidence FLOAT -- 0–1, from LLM
tier TEXT -- core/medium/low
warmth FLOAT DEFAULT 0.5 -- 0–1, 9B evaluation
embedding vector(768) -- nomic-embed-text
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ -- updated on warmth changes
distillation_versions table
id UUID PRIMARY KEY
distillation_id UUID FK → distillations
content TEXT -- Consolidated content
created_at TIMESTAMPTZ
merged_from UUID[] -- Parent distillation IDs
Indexes:
idx_distillations_embedding— IVFFlat for fast vector searchidx_distillations_tier— Filter by importanceidx_distillations_warmth— Sort by warmth for recallidx_memories_user_id— Multi-user scoping
API Design
Create Memory
POST /api/memories/
{
"content": "User: I love coffee\nAssistant: Great choice!",
"user_id": "user-123"
}
→ 201 Created
{
"id": "uuid",
"content": "...",
"user_id": "user-123",
"created_at": "2026-06-01T..."
}
Search Distillations
POST /api/distillations/search
{
"query": "What's my favorite drink?",
"user_id": "user-123",
"limit": 5,
"threshold": 0.7
}
→ 200 OK
[
{
"id": "uuid",
"content": "User prefers coffee over tea.",
"confidence": 0.87,
"tier": "medium",
"warmth": 0.65,
"score": 0.82,
"is_core": false
}
]
Update Warmth (9B evaluation)
POST /api/warmth/evaluate
{
"memory_id": "uuid",
"context": "full_conversation_turn",
"evaluator": "9b-worker"
}
→ 200 OK
{
"memory_id": "uuid",
"relevance_score": 0.85,
"warmth_change": 0.1,
"new_warmth": 0.75
}
Deployment
- Binary: Single Go binary (no dependencies)
- Database: PostgreSQL 16+ with pgvector
- LLM: llamacpp at
hydrogen:8082 - Auth: API key header (
X-API-Key)
Future Considerations
- Conflict resolution — What happens when new memory contradicts old?
- Cold storage — Archive old distillations to separate table
- Warmth decay tuning — Adjust cool-down rate based on usage patterns
- Remote CLI — Support
--remotefor multi-instance deployments