Yak — Personal LLM Memory System Design

June 1, 2026

Dual-layer memory architecture with temporal decay, 9B-driven warmth, and versioned distillations. Built in Go with PostgreSQL + pgvector.

Overview

Yak is a personal memory system for LLMs that combines:

Raw storage — Full conversation exchanges
Distillation layer — LLM-generated summaries with confidence scoring
Semantic retrieval — Vector search with temporal decay

Dual-Layer Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Raw Memory Layer                        │
│  Stores: "User: ...\nAssistant: ..." (full fidelity)        │
│  Purpose: Source of truth, audit trail, re-distillation     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Distillation Layer                         │
│  Stores: 1-4 factual sentences + confidence + tier + embed  │
│  Purpose: Efficient semantic search, RAG injection          │
└─────────────────────────────────────────────────────────────┘

Temporal Decay

Older memories are penalised during retrieval using exponential decay:

score = cosine_distance × e^(ln(2)/half_life × age_days)

Parameter	Default	Effect
`half_life_days`	30	Older memories decay exponentially
`core_threshold_multiplier`	2.0×	Core memories always surface
`decay_enabled`	true	Set to 0 to disable

Rationale: Recency matters. A preference from 6 months ago is less relevant than one from last week.

Importance Tiers

Tier	Threshold	Examples
core	2.0× looser	Name, location, occupation, family
medium	1.0× (baseline)	Preferences, habits, recurring topics
low	0.5× stricter	Ephemeral context, one-off mentions

How it works: Core memories use a 2× higher cosine distance threshold during recall, ensuring identity facts always surface.

Confidence Scoring

LLM outputs distillation + confidence (0–1):

{
  "summary": "User prefers coffee over tea.",
  "confidence": 0.87
}

Confidence	Action
≥ 0.8	Accept, store normally
0.6–0.8	Accept, flag for later review
< 0.6	Flag for manual review, don’t auto-deploy

Why? Prevents hallucinated or weak distillations from polluting memory.

Memory Reinforcement Learning (9B-Driven)

Since cost is not a constraint for personal infra, all warmth updates use 9B evaluation instead of heuristics.

1. Per-Turn 9B Relevance Evaluation

After every conversation turn, the 9B worker evaluates all recalled memories:

Flow:

After assistant responds:
  Input: [recalled_memories, full_conversation_turn]
  
  For each recalled memory:
    9B evaluates: "Was memory X useful in this context? Score 0–1"
    
  If score > 0.7: warmth += 0.1
  If score < 0.3: warmth -= 0.2
  If 0.3–0.7: no change (neutral)
  
  Latency: ~2s per turn (acceptable for personal use)

Why per-turn? Immediate feedback loop. Memories that are actually useful rise faster. Misfires are penalised in real-time.

2. 9B-Driven Warmth (Not Heuristics)

All warmth signals come from 9B evaluation, not heuristics:

Signal	Heuristic (Old)	9B Evaluation (New)
Memory recalled	`warmth += 0.05`	`9B: "Was this relevant?" (0–1)`
User references it	`warmth += 0.1`	`9B: "Did user use this?" (0–1)`
User says “thanks”	`warmth += 0.15`	`9B: "Was this helpful?" (0–1)`

Why? LLM understands context. Heuristics are brittle (e.g., user says “thanks” for unrelated reason).

3. Near-Duplicate Handling (Vector-Only)

Skip keyword keys entirely. Use vector similarity only — pgvector is fast enough.

Detection Flow:

For each new distillation:
  1. Compute embedding (768-dim)
  2. pgvector cosine search (limit=5, threshold=0.85)
  3. If matches found: VERSION (don't merge)
  4. If no matches: CREATE NEW

Versioned Distillations:

CREATE TABLE distillation_versions (
  id          UUID PRIMARY KEY,
  distillation_id UUID FK → distillations,
  content     TEXT,
  created_at  TIMESTAMPTZ,
  merged_from UUID[]  -- parent distillation IDs
);

Merge Strategy:

IF similarity > 0.85:
  - Create new distillation_version
  - 9B consolidates: old_content + new_content → unified
  - Store unified content
  - Keep old versions for audit
ELSE:
  - Create new distillation (normal flow)

Why versioned? Append creates bloat, replace loses history. Version + 9B consolidation preserves both.

4. 9B Conflict Detection

On every new memory, 9B checks for contradictions with existing memories:

Input: [new_distillation, existing_similar_distillations]
9B evaluates:
  "Do these contradict? Score 0–1"
  
If conflict > 0.7:
  - Flag for review
  - Don't auto-merge
  - Alert user

Why? 9B detects semantic contradictions that heuristics miss.

5. Gravity (Intrinsic Importance)

Each memory has an intrinsic “gravity” that acts as a base multiplier:

gravity = tier_multiplier × user_weight
core    → 2.0
medium  → 1.0
low     → 0.5

final_score = cosine_distance × decay × (1 + warmth) × gravity

Why: Core facts (name, location) always surface. Preferences (medium) surface when relevant. Ephemeral (low) fade faster.

Final Recall Score

score = cosine_similarity × e^(ln(2)/half_life × age_days)
        × (1 + warmth)
        × gravity
        × tier_bonus

Component	Range	Purpose
`cosine_similarity`	0–1	Semantic match
`decay`	0–1	Temporal recency
`1 + warmth`	1–2	Usage frequency
`gravity`	0.5–2.0	Intrinsic importance
`tier_bonus`	1.0–2.0	Core memory boost

Core Principles

1. No Prompt Mutations from Memory Injection

Memories are injected as a distinct, isolated block — never interpolated into existing template sections.

Why: Prompt mutations cause template corruption, inconsistent injection points, and make debugging difficult.

Pattern:

<system_prompt>
...original template...
</system_prompt>

<memories>
{{ recalled_distillations }}
</memories>

<user_query>
...
</user_query>

Rules:

Memory block always injected at fixed position (before user query, after system prompt)
Use XML/Markdown delimiters to isolate memory content
Never interpolate memories into template variables
If memory injection fails, prompt still works (graceful degradation)

2. Inference-Last Ethos

Prioritise performant, deterministic tools before using the LLM layer.

Retrieval Pipeline:

Keyword filters — Exact match, user_id scoping, tier filtering
Vector search — pgvector cosine similarity with decay
LLM summarisation — Only if recall results need synthesis

Storage Pipeline:

Keyword delete detection — /forget or “forget this” (no LLM)
Near-duplicate check — Vector similarity only (version, don’t merge)
LLM distillation — Async, only for new memories

Why:

Reduces latency (keyword search < 1ms vs. LLM 100s of ms)
Reduces cost (no LLM calls for simple operations)
More predictable behaviour (deterministic filters)
LLM reserved for reasoning tasks, not retrieval

Explicit Delete Detection

User-initiated deletion via explicit command:

Command:

/forget <memory_id>
/forget <exact phrase>
forget this memory
delete memory: <id>

Flow:

1. Parse command for /forget keyword
2. If memory_id provided → DELETE by ID
3. If phrase provided → SEARCH → CONFIRM → DELETE
4. Cascade delete: raw memory + distillation + all versions + embeddings
5. Log deletion to audit trail (soft delete option)

Why explicit? Prevents accidental deletion from LLM hallucinations. User must explicitly request.

Error Handling & Discord Webhooks

Explicit error alerts with configurable webhook destination.

Config:

errors:
  enabled: true
  webhook_url_env: "YAK_ERROR_WEBHOOK"  # From env, not config file
  dedup_window_sec: 300  # 5 minutes
  alert_on:
    - distillation_failed
    - embedding_failed
    - db_connection_lost
    - queue_overflow

Alert Format:

🚨 Yak Error: distillation_failed
Memory ID: abc-123
Error: LLM timeout after 30s
Retry: 2/3
Time: 2026-06-01T09:15:00Z

Levels:

warn — Logged, no alert (e.g., low confidence)
error — Logged + webhook (e.g., LLM failure)
critical — Logged + webhook + escalate to SMS/pager (e.g., DB down)

Deduplication:

Same error type within 5 min window = 1 alert
After dedup window expires, alert again
Prevents webhook spam during outages

Metrics

Prometheus /metrics endpoint for observability.

Endpoints:

GET /metrics → Prometheus format

Metrics:

Metric	Type	Description
`yak_requests_total`	counter	Total API requests
`yak_distillations_created`	counter	Distillations stored
`yak_distillations_versioned`	counter	Near-duplicates detected (versioned)
`yak_memories_deleted`	counter	Deletions
`yak_recall_latency_ms`	histogram	Search latency
`yak_distillation_latency_ms`	histogram	LLM latency
`yak_queue_size`	gauge	Pending jobs
`yak_warmth_updates`	counter	9B warmth evaluation jobs
`yak_conflicts_detected`	counter	Contradiction alerts
`yak_errors_total`	counter	Errors by type

Example query:

# Average recall latency
histogram_quantile(0.95, yak_recall_latency_ms_bucket)

# Version rate (near-duplicates)
rate(yak_distillations_versioned[5m])

# Conflict rate
rate(yak_conflicts_detected[5m])

# Error rate
rate(yak_errors_total[5m])

CLI

Thin API wrapper for manual operations. Supports local and remote instances.

Commands:

yak --help

# Server
yak serve              # Start HTTP API
yak serve --port 8000  # Custom port

# Memory ops
yak memory list --user user-123        # List memories
yak memory show <id>                   # Show details
yak memory delete <id>                 # Delete by ID
yak memory search "coffee" --limit 10  # Search

# Distillation ops
yak distill list --user user-123       # List distillations
yak distill version <id1> <id2>        # Manual version (merge)
yak distill re-embed <id>              # Re-compute embedding

# Warmth ops
yak warmth show <id>                   # Show warmth score
yak warmth reset <id>                  # Reset to 0.5
yak warmth bulk-reset                  # Reset all (debug)

# Admin
yak metrics                            # Show metrics
yak queue status                       # Show queue
yak health                             # Health check

# Remote support
yak --remote https://yak.example.com memory list
yak --api-key xxx memory delete <id>

Why thin wrapper? Reuses API logic, supports remote instances, --json output for scripting.

Background Jobs (Real-Time + Batch)

Per-Turn 9B Relevance Evaluation

Every conversation turn:

After assistant responds:
  Input: [recalled_memories, full_conversation_turn]
  
  For each recalled memory:
    9B evaluates: "Was memory X useful? Score 0–1"
    
  If score > 0.7: warmth += 0.1
  If score < 0.3: warmth -= 0.2
  
  Latency: ~2s per turn (acceptable for personal use)

Nightly Batch (Optional)

02:00 local time:

Re-evaluate all memories from last 7 days
Catch any edge cases missed in per-turn evaluation
Adjust warmth for memories that were recalled but not used

Data Model

`memories` table

id          UUID PRIMARY KEY
content     TEXT          -- "User: ...\nAssistant: ..."
user_id     TEXT          -- Multi-user scoping
created_at  TIMESTAMPTZ
updated_at  TIMESTAMPTZ

`distillations` table

id          UUID PRIMARY KEY
memory_id   UUID FK → memories
content     TEXT          -- 1-4 factual sentences
confidence  FLOAT         -- 0–1, from LLM
tier        TEXT          -- core/medium/low
warmth      FLOAT DEFAULT 0.5  -- 0–1, 9B evaluation
embedding   vector(768)   -- nomic-embed-text
created_at  TIMESTAMPTZ
updated_at  TIMESTAMPTZ   -- updated on warmth changes

`distillation_versions` table

id          UUID PRIMARY KEY
distillation_id UUID FK → distillations
content     TEXT          -- Consolidated content
created_at  TIMESTAMPTZ
merged_from UUID[]        -- Parent distillation IDs

Indexes:

idx_distillations_embedding — IVFFlat for fast vector search
idx_distillations_tier — Filter by importance
idx_distillations_warmth — Sort by warmth for recall
idx_memories_user_id — Multi-user scoping

API Design

Create Memory

POST /api/memories/
{
  "content": "User: I love coffee\nAssistant: Great choice!",
  "user_id": "user-123"
}

→ 201 Created
{
  "id": "uuid",
  "content": "...",
  "user_id": "user-123",
  "created_at": "2026-06-01T..."
}

Search Distillations

POST /api/distillations/search
{
  "query": "What's my favorite drink?",
  "user_id": "user-123",
  "limit": 5,
  "threshold": 0.7
}

→ 200 OK
[
  {
    "id": "uuid",
    "content": "User prefers coffee over tea.",
    "confidence": 0.87,
    "tier": "medium",
    "warmth": 0.65,
    "score": 0.82,
    "is_core": false
  }
]

Update Warmth (9B evaluation)

POST /api/warmth/evaluate
{
  "memory_id": "uuid",
  "context": "full_conversation_turn",
  "evaluator": "9b-worker"
}

→ 200 OK
{
  "memory_id": "uuid",
  "relevance_score": 0.85,
  "warmth_change": 0.1,
  "new_warmth": 0.75
}

Deployment

Binary: Single Go binary (no dependencies)
Database: PostgreSQL 16+ with pgvector
LLM: llamacpp at hydrogen:8082
Auth: API key header (X-API-Key)

Future Considerations

Conflict resolution — What happens when new memory contradicts old?
Cold storage — Archive old distillations to separate table
Warmth decay tuning — Adjust cool-down rate based on usage patterns
Remote CLI — Support --remote for multi-instance deployments