Building Yak: A Personal Memory System for LLMs
Introduction
Yak is a personal memory system for LLMs — a thematic sibling to Sherpa (prompt guard). Where Sherpa filters what goes in, Yak remembers what matters.
Built on hydrogen (my personal infra), Yak combines raw conversation storage with LLM-generated distillations, semantic retrieval via PostgreSQL + pgvector, and a 9B-driven warmth system that learns what’s actually useful.
This is the implementation plan and architectural deep-dive.
The Problem: LLMs Have No Memory
LLMs are stateless. Every conversation starts from scratch unless you:
- Dump the entire history (expensive, noisy)
- Use vector search (better, but still blind to recency and importance)
- Hardcode context (brittle, doesn’t scale)
Yak solves this with a dual-layer architecture:
┌─────────────────────────────────────────────────────────────┐
│ Raw Memory Layer │
│ Stores: "User: ...\nAssistant: ..." (full fidelity) │
│ Purpose: Source of truth, audit trail, re-distillation │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Distillation Layer │
│ Stores: 1-4 factual sentences + confidence + tier + embed │
│ Purpose: Efficient semantic search, RAG injection │
└─────────────────────────────────────────────────────────────┘
The raw layer preserves everything. The distillation layer is what actually gets retrieved and injected into prompts.
Important: Distillation is async. A just-stored memory isn’t immediately recallable until the background worker generates the distillation and embedding.
Design Principles
1. Personal Infra First
No SaaS dependencies. Yak runs on hydrogen:
- PostgreSQL + pgvector for storage and retrieval
- llamacpp at hydrogen:8082 for 9B inference
- Single Go binary for low memory footprint
2. Inference-Last Ethos
Prioritize deterministic tools before the LLM layer:
Retrieval Pipeline:
- Keyword filters (exact match, user_id scoping, tier filtering)
- Vector search (pgvector cosine similarity with temporal decay)
- LLM synthesis (only if recall results need summarization)
Storage Pipeline:
- Keyword delete detection (
/forgetcommand — no LLM) - Near-duplicate check (vector similarity only)
- LLM distillation (async, only for new memories)
This reduces latency (keyword < 1ms vs. LLM 100s of ms) and cost (no LLM calls for simple operations).
3. Temporal Decay + Gravity (Simplified Scoring)
Older memories are penalized during retrieval. The original design stacked five multiplicative factors (decay, warmth, gravity, tier_bonus), which is hard to reason about and lets one term silently dominate. Revised approach:
# Option A: Simplified multiplicative (folds tier into gravity)
score = cosine_similarity × e^(ln(2)/half_life × age_days)
× (1 + warmth)
× gravity
# Option B: Additive boosts (more interpretable)
score = cosine_similarity × e^(ln(2)/half_life × age_days)
+ warmth_bonus
+ gravity_bonus
# Option C: Log-space (prevents runaway scaling)
log_score = log(cosine_similarity)
+ log(decay)
+ log(1 + warmth)
+ log(gravity)
Gravity now encodes tier:
- core = 2.0
- medium = 1.0
- low = 0.5
This means your name (core) always surfaces, preferences (medium) surface when relevant, and ephemeral context (low) fades faster.
4. 9B-Driven Warmth (Batched, Symmetric)
All warmth signals come from 9B evaluation, not brittle heuristics:
| Signal | Heuristic (Old) | 9B Evaluation (New) |
|---|---|---|
| Memory recalled | warmth += 0.05 | 9B: "Was this relevant?" (0–1) |
| User references it | warmth += 0.1 | 9B: "Did user use this?" (0–1) |
| User says “thanks” | warmth += 0.15 | 9B: "Was this helpful?" (0–1) |
Batched evaluation (cost fix):
After assistant responds:
# Single 9B call with all recalled memories
Input: [conversation_turn, recalled_memory_1, recalled_memory_2, ...]
Output: [{memory_id, usefulness_score: 0–1}, ...]
For each memory:
If score > 0.7: warmth += 0.1
If score < 0.3: warmth -= 0.1 # Symmetric (was -0.2)
If 0.3–0.7: no change (neutral)
# Optional: decay toward 0.5 if not recalled
if not recalled: warmth += (0.5 - warmth) * 0.05
Latency: ~2s per turn (single call, not N calls)
Changes from original:
- Symmetric steps: ±0.1 (was +0.1/−0.2, which eroded occasionally-useful memories)
- Decay toward 0.5: Prevents permanent inflation/deflation
- Batched: One 9B call per turn, not one per memory
5. No Prompt Mutations from Memory Injection
Memories are injected as a distinct, isolated block — never interpolated into existing template sections:
<system_prompt>
...original template...
</system_prompt>
<memories>
{{ recalledDistillations }}
</memories>
<user_query>
...
</user_query>
This prevents template corruption and makes debugging predictable.
Data Model
memories — Raw Storage
CREATE TABLE memories (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
user_id TEXT NOT NULL DEFAULT '',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Stores full conversation exchanges: "User: ...\nAssistant: ...".
distillations — LLM Summaries
CREATE TABLE distillations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
memory_id UUID NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
content TEXT NOT NULL,
confidence FLOAT NOT NULL DEFAULT 0.7,
tier TEXT NOT NULL DEFAULT 'medium' CHECK (tier IN ('core', 'medium', 'low')),
warmth FLOAT NOT NULL DEFAULT 0.5 CHECK (warmth >= 0 AND warmth <= 1),
embedding vector(768),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
1–4 factual sentences with:
- Confidence: 0–1 score from LLM
- Tier: core/medium/low (intrinsic importance)
- Warmth: 0–1 (usage frequency, 9B-driven)
- Embedding: 768-dim vector (nomic-embed-text)
distillation_versions — Near-Duplicate Handling
CREATE TABLE distillation_versions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
distillation_id UUID NOT NULL REFERENCES distillations(id) ON DELETE CASCADE,
content TEXT NOT NULL,
merged_from UUID[] NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
When similarity > 0.85, version instead of merge. 9B consolidates old + new content into unified version.
conflict_alerts — Contradiction Detection
CREATE TABLE conflict_alerts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
memory_id UUID NOT NULL REFERENCES memories(id),
similar_id UUID NOT NULL REFERENCES distillations(id),
conflict_score FLOAT NOT NULL,
explanation TEXT,
resolved BOOLEAN NOT NULL DEFAULT false,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
resolved_at TIMESTAMPTZ,
resolved_by TEXT
);
9B checks for semantic contradictions on every new memory.
audit_log — Deletion Audit Trail
CREATE TABLE audit_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
action TEXT NOT NULL,
entity_type TEXT NOT NULL,
entity_id UUID NOT NULL,
user_id TEXT NOT NULL,
details JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Implementation Roadmap
Phase 1: Foundation (Week 1)
#1 Project Scaffold — Go setup, CI, linting, Makefile, TDD test infrastructure
#2 Database Schema — PostgreSQL + pgvector migrations, connection pool
#6 Async Queue — In-memory worker pool for async distillation (no Redis)
#9 Discord Webhooks — Error alerts with deduplication (5-min window)
#10 Metrics — Prometheus /metrics endpoint
Phase 2: Core (Week 2)
#3 LLM Client — llamacpp integration for distillation + embeddings
#4 Memory API — CRUD endpoints, /forget command with confirmation
#5 Distillation API — Search with decay, recall with RAG injection
Phase 3: Advanced (Week 3)
#7 Warmth System — Per-turn 9B batched relevance evaluation
#8 Near-Duplicate Detection — Vector similarity + versioned distillations
#12 Conflict Detection — 9B contradiction detection with alerts
Phase 4: Polish (Week 4)
#11 CLI — Thin API wrapper for manual ops (yak memory list, yak warmth reset, etc.)
Dependency Graph
Phase 1: Foundation
├── #1 Project Scaffold (blocks everything)
├── #2 Database Schema (blocks #3, #4, #5)
├── #6 Async Queue (blocks #4 async distillation)
├── #9 Discord Webhooks (blocks #7, #12 alerts)
└── #10 Metrics (nice to have)
Phase 2: Core
├── #3 LLM Client (depends on #1, #2; blocks #4, #5)
├── #4 Memory API (depends on #2, #3, #6)
└── #5 Distillation API (depends on #2, #3, #4)
Phase 3: Advanced
├── #7 Warmth System (depends on #2, #3, #5, #9)
├── #8 Near-Duplicate Detection (depends on #2, #3)
└── #12 Conflict Detection (depends on #2, #3, #9)
Phase 4: Polish
└── #11 CLI (depends on #4, #5)
Critical path: #1 → #2 → #3 → #4 → #5 (MVP: store + recall)
Key Technical Decisions
Why Go?
- Single binary deployment (no runtime dependencies)
- Low memory footprint (important for personal infra)
- Fast compilation (TDD iteration)
- Great stdlib for HTTP, JSON, SQL
Why Chi Router?
- Minimal, no framework bloat
- Standard
net/httpcompatibility - No reflection, no magic
Why pgvector?
- Vector search in PostgreSQL (no separate service)
- At personal scale (<100k rows), HNSW beats IVFFlat on recall
- IVFFlat needs periodic rebuilding; HNSW is incremental
Revised: Use HNSW for personal use, IVFFlat only if dataset grows beyond 100k rows.
Why In-Memory Queue?
- No Redis dependency
- Sufficient for personal use (single instance)
- Worker pool pattern (bounded concurrency)
Why Versioned Distillations?
- Append creates bloat
- Replace loses history
- Version + 9B consolidation preserves both
Why 9B for Warmth?
LLM understands context. Heuristics are brittle:
- User says “thanks” for unrelated reason → false positive
- 9B evaluates: “Was this actually helpful?” → real signal
Backend Contention (Highest Risk)
Embeddings, distillation, warmth, and conflict detection all hit hydrogen:8082, the same instance serving aux inference. Under load they thrash and compete.
Mitigation:
- Give embeddings a dedicated endpoint (separate llamacpp instance)
- Load-test the shared instance before relying on it
- Queue long-running LLM tasks (distillation, conflict detection) to off-peak hours
Risk Assessment
| Risk | Impact | Mitigation |
|---|---|---|
| Backend contention (hydrogen:8082) | Critical | Dedicated embedding endpoint, load-test first |
| Retrieval scoring complexity | High | Simplify formula (fold tier into gravity, additive boosts) |
| Warmth erosion | Medium | Symmetric steps (±0.1), decay toward 0.5 |
| /forget safety | Medium | Add preview/confirmation before delete |
| pgvector performance | Medium | Use HNSW at personal scale, tune IVFFlat lists |
| Async recall latency | Low | Document that fresh memories aren’t immediately recallable |
Testing Strategy
TDD Workflow (Mandatory):
- Write failing test first (Red)
- Write minimum code to pass (Green)
- Refactor (Green)
Coverage Target: > 80% line coverage
Race Detection: go test -race ./... must pass
Test Types:
- Unit tests (config, utils, LLM client mocks)
- Integration tests (migrations, API endpoints with testcontainers)
- E2E tests (full pipeline: store → distill → recall)
API Design
Create Memory
POST /api/memories/
{
"content": "User: I love coffee\nAssistant: Great choice!",
"user_id": "user-123"
}
→ 201 Created (returns immediately, async distillation — memory not yet recallable)
Search Distillations
POST /api/distillations/search
{
"query": "What's my favorite drink?",
"user_id": "user-123",
"limit": 5,
"threshold": 0.7
}
→ Returns ranked distillations with decay + warmth + gravity scores
Recall (RAG Injection)
POST /api/recall
{
"query": "Should I get espresso?",
"user_id": "user-123"
}
→ Returns distilled memories formatted for system prompt injection
Delete Memory (with confirmation)
POST /api/memories/delete
{
"memory_id": "uuid",
"confirm": true # Must be explicit
}
Or interactive:
POST /api/memories/preview-delete
{
"memory_id": "uuid"
}
→ Returns distillation content for confirmation
Configuration
server:
host: "0.0.0.0"
port: 8000
api_key: "your-secret-key" # X-API-Key header auth
database:
dsn: "postgres://yak:***@localhost:5432/yak?sslmode=disable"
llm:
base_url: "http://hydrogen:8082" # Main inference
model: "llama-3.1-8b"
embedding:
base_url: "http://hydrogen:8082" # Dedicated endpoint if available
model: "nomic-embed-text"
queue:
workers: 4
max_queue_size: 100
decay:
half_life_days: 30
errors:
enabled: true
webhook_url_env: "YAK_ERROR_WEBHOOK"
dedup_window_sec: 300
warmth:
symmetric_steps: true # ±0.1 instead of +0.1/-0.2
decay_to_neutral: true # Drift toward 0.5
Commands
make build # Compile the binary
make run # Run the server
make test # Run tests (-short)
make test-race # Run tests with race detector
make lint # Run golangci-lint
make fmt # Format code with gofumpt
make hooks # Install git hooks via lefthook
make migrate-up # Run pending migrations
make migrate-down # Roll back one migration
Looking Ahead
Yak is still in early implementation. The schema is drafted, the design is documented, but the actual Go packages are empty shells.
Next steps:
- Implement Phase 1 (scaffold, schema, queue, webhooks, metrics)
- Build the core (LLM client, memory API, distillation API)
- Add the advanced features (warmth, near-duplicates, conflicts)
- Polish with CLI
The goal: a memory system that’s actually useful, not just another vector database wrapper.
License
Private — personal use only.