Building Yak: A Personal Memory System for LLMs

June 1, 2026

Introduction

Yak is a personal memory system for LLMs — a thematic sibling to Sherpa (prompt guard). Where Sherpa filters what goes in, Yak remembers what matters.

Built on hydrogen (my personal infra), Yak combines raw conversation storage with LLM-generated distillations, semantic retrieval via PostgreSQL + pgvector, and a 9B-driven warmth system that learns what’s actually useful.

This is the implementation plan and architectural deep-dive.

The Problem: LLMs Have No Memory

LLMs are stateless. Every conversation starts from scratch unless you:

Dump the entire history (expensive, noisy)
Use vector search (better, but still blind to recency and importance)
Hardcode context (brittle, doesn’t scale)

Yak solves this with a dual-layer architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Raw Memory Layer                        │
│  Stores: "User: ...\nAssistant: ..." (full fidelity)        │
│  Purpose: Source of truth, audit trail, re-distillation     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Distillation Layer                         │
│  Stores: 1-4 factual sentences + confidence + tier + embed  │
│  Purpose: Efficient semantic search, RAG injection          │
└─────────────────────────────────────────────────────────────┘

The raw layer preserves everything. The distillation layer is what actually gets retrieved and injected into prompts.

Important: Distillation is async. A just-stored memory isn’t immediately recallable until the background worker generates the distillation and embedding.

Design Principles

1. Personal Infra First

No SaaS dependencies. Yak runs on hydrogen:

PostgreSQL + pgvector for storage and retrieval
llamacpp at hydrogen:8082 for 9B inference
Single Go binary for low memory footprint

2. Inference-Last Ethos

Prioritize deterministic tools before the LLM layer:

Retrieval Pipeline:

Keyword filters (exact match, user_id scoping, tier filtering)
Vector search (pgvector cosine similarity with temporal decay)
LLM synthesis (only if recall results need summarization)

Storage Pipeline:

Keyword delete detection (/forget command — no LLM)
Near-duplicate check (vector similarity only)
LLM distillation (async, only for new memories)

This reduces latency (keyword < 1ms vs. LLM 100s of ms) and cost (no LLM calls for simple operations).

3. Temporal Decay + Gravity (Simplified Scoring)

Older memories are penalized during retrieval. The original design stacked five multiplicative factors (decay, warmth, gravity, tier_bonus), which is hard to reason about and lets one term silently dominate. Revised approach:

# Option A: Simplified multiplicative (folds tier into gravity)
score = cosine_similarity × e^(ln(2)/half_life × age_days)
        × (1 + warmth)
        × gravity

# Option B: Additive boosts (more interpretable)
score = cosine_similarity × e^(ln(2)/half_life × age_days)
        + warmth_bonus
        + gravity_bonus

# Option C: Log-space (prevents runaway scaling)
log_score = log(cosine_similarity)
            + log(decay)
            + log(1 + warmth)
            + log(gravity)

Gravity now encodes tier:

core = 2.0
medium = 1.0
low = 0.5

This means your name (core) always surfaces, preferences (medium) surface when relevant, and ephemeral context (low) fades faster.

4. 9B-Driven Warmth (Batched, Symmetric)

All warmth signals come from 9B evaluation, not brittle heuristics:

Signal	Heuristic (Old)	9B Evaluation (New)
Memory recalled	`warmth += 0.05`	`9B: "Was this relevant?" (0–1)`
User references it	`warmth += 0.1`	`9B: "Did user use this?" (0–1)`
User says “thanks”	`warmth += 0.15`	`9B: "Was this helpful?" (0–1)`

Batched evaluation (cost fix):

After assistant responds:
  # Single 9B call with all recalled memories
  Input: [conversation_turn, recalled_memory_1, recalled_memory_2, ...]
  Output: [{memory_id, usefulness_score: 0–1}, ...]
  
  For each memory:
    If score > 0.7: warmth += 0.1
    If score < 0.3: warmth -= 0.1  # Symmetric (was -0.2)
    If 0.3–0.7: no change (neutral)
    
    # Optional: decay toward 0.5 if not recalled
    if not recalled: warmth += (0.5 - warmth) * 0.05
  
  Latency: ~2s per turn (single call, not N calls)

Changes from original:

Symmetric steps: ±0.1 (was +0.1/−0.2, which eroded occasionally-useful memories)
Decay toward 0.5: Prevents permanent inflation/deflation
Batched: One 9B call per turn, not one per memory

5. No Prompt Mutations from Memory Injection

Memories are injected as a distinct, isolated block — never interpolated into existing template sections:

<system_prompt>
...original template...
</system_prompt>

<memories>
{{ recalledDistillations }}
</memories>

<user_query>
...
</user_query>

This prevents template corruption and makes debugging predictable.

Data Model

`memories` — Raw Storage

CREATE TABLE memories (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content     TEXT NOT NULL,
    user_id     TEXT NOT NULL DEFAULT '',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Stores full conversation exchanges: "User: ...\nAssistant: ...".

`distillations` — LLM Summaries

CREATE TABLE distillations (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    memory_id   UUID NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
    content     TEXT NOT NULL,
    confidence  FLOAT NOT NULL DEFAULT 0.7,
    tier        TEXT NOT NULL DEFAULT 'medium' CHECK (tier IN ('core', 'medium', 'low')),
    warmth      FLOAT NOT NULL DEFAULT 0.5 CHECK (warmth >= 0 AND warmth <= 1),
    embedding   vector(768),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

1–4 factual sentences with:

Confidence: 0–1 score from LLM
Tier: core/medium/low (intrinsic importance)
Warmth: 0–1 (usage frequency, 9B-driven)
Embedding: 768-dim vector (nomic-embed-text)

`distillation_versions` — Near-Duplicate Handling

CREATE TABLE distillation_versions (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    distillation_id UUID NOT NULL REFERENCES distillations(id) ON DELETE CASCADE,
    content         TEXT NOT NULL,
    merged_from     UUID[] NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

When similarity > 0.85, version instead of merge. 9B consolidates old + new content into unified version.

`conflict_alerts` — Contradiction Detection

CREATE TABLE conflict_alerts (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    memory_id       UUID NOT NULL REFERENCES memories(id),
    similar_id      UUID NOT NULL REFERENCES distillations(id),
    conflict_score  FLOAT NOT NULL,
    explanation     TEXT,
    resolved        BOOLEAN NOT NULL DEFAULT false,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    resolved_at     TIMESTAMPTZ,
    resolved_by     TEXT
);

9B checks for semantic contradictions on every new memory.

`audit_log` — Deletion Audit Trail

CREATE TABLE audit_log (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    action      TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    entity_id   UUID NOT NULL,
    user_id     TEXT NOT NULL,
    details     JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Implementation Roadmap

Phase 1: Foundation (Week 1)

#1 Project Scaffold — Go setup, CI, linting, Makefile, TDD test infrastructure

#2 Database Schema — PostgreSQL + pgvector migrations, connection pool

#6 Async Queue — In-memory worker pool for async distillation (no Redis)

#9 Discord Webhooks — Error alerts with deduplication (5-min window)

#10 Metrics — Prometheus /metrics endpoint

Phase 2: Core (Week 2)

#3 LLM Client — llamacpp integration for distillation + embeddings

#4 Memory API — CRUD endpoints, /forget command with confirmation

#5 Distillation API — Search with decay, recall with RAG injection

Phase 3: Advanced (Week 3)

#7 Warmth System — Per-turn 9B batched relevance evaluation

#8 Near-Duplicate Detection — Vector similarity + versioned distillations

#12 Conflict Detection — 9B contradiction detection with alerts

Phase 4: Polish (Week 4)

#11 CLI — Thin API wrapper for manual ops (yak memory list, yak warmth reset, etc.)

Dependency Graph

Phase 1: Foundation
├── #1 Project Scaffold (blocks everything)
├── #2 Database Schema (blocks #3, #4, #5)
├── #6 Async Queue (blocks #4 async distillation)
├── #9 Discord Webhooks (blocks #7, #12 alerts)
└── #10 Metrics (nice to have)

Phase 2: Core
├── #3 LLM Client (depends on #1, #2; blocks #4, #5)
├── #4 Memory API (depends on #2, #3, #6)
└── #5 Distillation API (depends on #2, #3, #4)

Phase 3: Advanced
├── #7 Warmth System (depends on #2, #3, #5, #9)
├── #8 Near-Duplicate Detection (depends on #2, #3)
└── #12 Conflict Detection (depends on #2, #3, #9)

Phase 4: Polish
└── #11 CLI (depends on #4, #5)

Critical path: #1 → #2 → #3 → #4 → #5 (MVP: store + recall)

Key Technical Decisions

Why Go?

Single binary deployment (no runtime dependencies)
Low memory footprint (important for personal infra)
Fast compilation (TDD iteration)
Great stdlib for HTTP, JSON, SQL

Why Chi Router?

Minimal, no framework bloat
Standard net/http compatibility
No reflection, no magic

Why pgvector?

Vector search in PostgreSQL (no separate service)
At personal scale (<100k rows), HNSW beats IVFFlat on recall
IVFFlat needs periodic rebuilding; HNSW is incremental

Revised: Use HNSW for personal use, IVFFlat only if dataset grows beyond 100k rows.

Why In-Memory Queue?

No Redis dependency
Sufficient for personal use (single instance)
Worker pool pattern (bounded concurrency)

Why Versioned Distillations?

Append creates bloat
Replace loses history
Version + 9B consolidation preserves both

Why 9B for Warmth?

LLM understands context. Heuristics are brittle:

User says “thanks” for unrelated reason → false positive
9B evaluates: “Was this actually helpful?” → real signal

Backend Contention (Highest Risk)

Embeddings, distillation, warmth, and conflict detection all hit hydrogen:8082, the same instance serving aux inference. Under load they thrash and compete.

Mitigation:

Give embeddings a dedicated endpoint (separate llamacpp instance)
Load-test the shared instance before relying on it
Queue long-running LLM tasks (distillation, conflict detection) to off-peak hours

Risk Assessment

Risk	Impact	Mitigation
Backend contention (hydrogen:8082)	Critical	Dedicated embedding endpoint, load-test first
Retrieval scoring complexity	High	Simplify formula (fold tier into gravity, additive boosts)
Warmth erosion	Medium	Symmetric steps (±0.1), decay toward 0.5
/forget safety	Medium	Add preview/confirmation before delete
pgvector performance	Medium	Use HNSW at personal scale, tune IVFFlat lists
Async recall latency	Low	Document that fresh memories aren’t immediately recallable

Testing Strategy

TDD Workflow (Mandatory):

Write failing test first (Red)
Write minimum code to pass (Green)
Refactor (Green)

Coverage Target: > 80% line coverage

Race Detection: go test -race ./... must pass

Test Types:

Unit tests (config, utils, LLM client mocks)
Integration tests (migrations, API endpoints with testcontainers)
E2E tests (full pipeline: store → distill → recall)

API Design

Create Memory

POST /api/memories/
{
  "content": "User: I love coffee\nAssistant: Great choice!",
  "user_id": "user-123"
}

→ 201 Created (returns immediately, async distillation — memory not yet recallable)

Search Distillations

POST /api/distillations/search
{
  "query": "What's my favorite drink?",
  "user_id": "user-123",
  "limit": 5,
  "threshold": 0.7
}

→ Returns ranked distillations with decay + warmth + gravity scores

Recall (RAG Injection)

POST /api/recall
{
  "query": "Should I get espresso?",
  "user_id": "user-123"
}

→ Returns distilled memories formatted for system prompt injection

Delete Memory (with confirmation)

POST /api/memories/delete
{
  "memory_id": "uuid",
  "confirm": true  # Must be explicit
}

Or interactive:

POST /api/memories/preview-delete
{
  "memory_id": "uuid"
}
→ Returns distillation content for confirmation

Configuration

server:
  host: "0.0.0.0"
  port: 8000
  api_key: "your-secret-key"  # X-API-Key header auth

database:
  dsn: "postgres://yak:***@localhost:5432/yak?sslmode=disable"

llm:
  base_url: "http://hydrogen:8082"  # Main inference
  model: "llama-3.1-8b"
  
embedding:
  base_url: "http://hydrogen:8082"  # Dedicated endpoint if available
  model: "nomic-embed-text"

queue:
  workers: 4
  max_queue_size: 100

decay:
  half_life_days: 30

errors:
  enabled: true
  webhook_url_env: "YAK_ERROR_WEBHOOK"
  dedup_window_sec: 300

warmth:
  symmetric_steps: true  # ±0.1 instead of +0.1/-0.2
  decay_to_neutral: true  # Drift toward 0.5

Commands

make build       # Compile the binary
make run         # Run the server
make test        # Run tests (-short)
make test-race   # Run tests with race detector
make lint        # Run golangci-lint
make fmt         # Format code with gofumpt
make hooks       # Install git hooks via lefthook
make migrate-up  # Run pending migrations
make migrate-down # Roll back one migration

Looking Ahead

Yak is still in early implementation. The schema is drafted, the design is documented, but the actual Go packages are empty shells.

Next steps:

Implement Phase 1 (scaffold, schema, queue, webhooks, metrics)
Build the core (LLM client, memory API, distillation API)
Add the advanced features (warmth, near-duplicates, conflicts)
Polish with CLI

The goal: a memory system that’s actually useful, not just another vector database wrapper.

License

Private — personal use only.