Frank's Reports

Yak — Personal LLM Memory System Design

June 1, 2026

Dual-layer memory architecture with temporal decay, 9B-driven warmth, and versioned distillations. Built in Go with PostgreSQL + pgvector.

Overview

Yak is a personal memory system for LLMs that combines:

  1. Raw storage — Full conversation exchanges
  2. Distillation layer — LLM-generated summaries with confidence scoring
  3. Semantic retrieval — Vector search with temporal decay

Dual-Layer Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Raw Memory Layer                        │
│  Stores: "User: ...\nAssistant: ..." (full fidelity)        │
│  Purpose: Source of truth, audit trail, re-distillation     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Distillation Layer                         │
│  Stores: 1-4 factual sentences + confidence + tier + embed  │
│  Purpose: Efficient semantic search, RAG injection          │
└─────────────────────────────────────────────────────────────┘

Temporal Decay

Older memories are penalised during retrieval using exponential decay:

score = cosine_distance × e^(ln(2)/half_life × age_days)

Parameter                  Default  Effect
half_life_days             30       Age in days at which the distance penalty doubles
core_threshold_multiplier  2.0×     Core memories always surface
decay_enabled              true     Set to false to disable

Rationale: Recency matters. A preference from 6 months ago is less relevant than one from last week.
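
A minimal Go sketch of the decay formula above. The function name, signature, and the guard for a non-positive half-life are assumptions made for illustration, not part of the design.

```go
package main

import (
	"fmt"
	"math"
)

// decayedDistance applies the age penalty from the formula above:
// cosine_distance × e^(ln(2)/half_life × age_days).
// Lower values rank better, so older memories are pushed down.
func decayedDistance(cosineDistance, ageDays, halfLifeDays float64, decayEnabled bool) float64 {
	if !decayEnabled || halfLifeDays <= 0 {
		return cosineDistance // decay disabled: the raw distance is used
	}
	return cosineDistance * math.Exp(math.Ln2/halfLifeDays*ageDays)
}

func main() {
	// With a 30-day half-life the penalty doubles every 30 days.
	for _, age := range []float64{0, 30, 60} {
		fmt.Printf("age=%v days -> distance %.3f\n", age, decayedDistance(0.2, age, 30, true))
	}
}
```

Because the score is a distance, the exponent is positive here: the penalty grows with age instead of shrinking towards zero.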


Importance Tiers

Tier    Threshold        Examples
core    2.0× looser      Name, location, occupation, family
medium  1.0× (baseline)  Preferences, habits, recurring topics
low     0.5× stricter    Ephemeral context, one-off mentions

How it works: Core memories use a 2× higher cosine distance threshold during recall, ensuring identity facts always surface.
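
The tier check can be sketched in Go as follows. The baseline threshold of 0.3 and the fallback to the medium tier for unknown values are assumptions for illustration only.

```go
package main

import "fmt"

// tierMultiplier maps an importance tier to its threshold multiplier,
// using the values from the tier table. Unknown tiers fall back to the
// medium baseline (an assumption of this sketch).
func tierMultiplier(tier string) float64 {
	switch tier {
	case "core":
		return 2.0
	case "low":
		return 0.5
	default:
		return 1.0
	}
}

// passesThreshold reports whether a memory at the given cosine distance is
// recalled. baseThreshold is a hypothetical maximum distance for the
// medium tier.
func passesThreshold(distance, baseThreshold float64, tier string) bool {
	return distance <= baseThreshold*tierMultiplier(tier)
}

func main() {
	const base = 0.3 // illustrative baseline, not a value from the design
	for _, tier := range []string{"core", "medium", "low"} {
		fmt.Printf("%-6s distance 0.4 recalled: %v\n", tier, passesThreshold(0.4, base, tier))
	}
}
```

At the same distance, a core memory passes where a medium or low one is filtered out.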


Confidence Scoring

LLM outputs distillation + confidence (0–1):

{
  "summary": "User prefers coffee over tea.",
  "confidence": 0.87
}

Confidence  Action
≥ 0.8       Accept, store normally
0.6–0.8     Accept, flag for later review
< 0.6       Flag for manual review, don’t auto-deploy

Why? Prevents hallucinated or weak distillations from polluting memory.
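
A hedged Go sketch of parsing the LLM output and mapping confidence to an action. The action labels and the out-of-range check are assumptions; the thresholds come from the table above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Distillation mirrors the JSON the LLM is asked to return.
type Distillation struct {
	Summary    string  `json:"summary"`
	Confidence float64 `json:"confidence"`
}

// classify maps a confidence score to an action. The returned labels are
// hypothetical names chosen for this sketch.
func classify(confidence float64) string {
	switch {
	case confidence >= 0.8:
		return "accept"
	case confidence >= 0.6:
		return "accept_flagged"
	default:
		return "manual_review"
	}
}

// parseDistillation decodes the LLM output and rejects scores outside 0–1.
func parseDistillation(raw []byte) (Distillation, error) {
	var d Distillation
	if err := json.Unmarshal(raw, &d); err != nil {
		return d, fmt.Errorf("decode distillation: %w", err)
	}
	if d.Confidence < 0 || d.Confidence > 1 {
		return d, fmt.Errorf("confidence %v outside 0-1", d.Confidence)
	}
	return d, nil
}

func main() {
	d, err := parseDistillation([]byte(`{"summary": "User prefers coffee over tea.", "confidence": 0.87}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d.Summary, "->", classify(d.Confidence))
}
```

Validating the range before classifying keeps a malformed model response from being stored as a high-confidence fact.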


Memory Reinforcement Learning (9B-Driven)

Since cost is not a constraint for personal infra, all warmth updates use 9B evaluation instead of heuristics.

1. Per-Turn 9B Relevance Evaluation

After every conversation turn, the 9B worker evaluates all recalled memories:

Flow:

After assistant responds:
  Input: [recalled_memories, full_conversation_turn]
  
  For each recalled memory:
    9B evaluates: "Was memory X useful in this context? Score 0–1"
    
  If score > 0.7: warmth += 0.1
  If score < 0.3: warmth -= 0.2
  If 0.3–0.7: no change (neutral)
  
  Latency: ~2s per turn (acceptable for personal use)

Why per-turn? Immediate feedback loop. Memories that are actually useful rise faster. Misfires are penalised in real-time.
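
The update rule can be sketched in Go as below. Clamping to 0–1 is an assumption based on the range the data model gives for warmth; the flow above does not state what happens at the bounds.

```go
package main

import "fmt"

// updateWarmth applies the per-turn rule: a relevance score above 0.7 adds
// 0.1, a score below 0.3 subtracts 0.2, and anything in between leaves
// warmth unchanged. The result is clamped to 0–1 (assumed, see above).
func updateWarmth(warmth, relevance float64) float64 {
	switch {
	case relevance > 0.7:
		warmth += 0.1
	case relevance < 0.3:
		warmth -= 0.2
	}
	if warmth < 0 {
		return 0
	}
	if warmth > 1 {
		return 1
	}
	return warmth
}

func main() {
	// Relevance scores would come from the 9B worker; these are made up.
	for _, relevance := range []float64{0.85, 0.5, 0.1} {
		fmt.Printf("relevance %.2f: warmth 0.50 -> %.2f\n", relevance, updateWarmth(0.5, relevance))
	}
}
```

The penalty is twice the reward, so a memory that misfires once needs two useful recalls to recover.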

2. 9B-Driven Warmth (Not Heuristics)

All warmth signals come from 9B evaluation, not heuristics:

Signal              Heuristic (Old)  9B Evaluation (New)
Memory recalled     warmth += 0.05   9B: "Was this relevant?" (0–1)
User references it  warmth += 0.1    9B: "Did user use this?" (0–1)
User says “thanks”  warmth += 0.15   9B: "Was this helpful?" (0–1)

Why? LLM understands context. Heuristics are brittle (e.g., user says “thanks” for unrelated reason).

3. Near-Duplicate Handling (Vector-Only)

Skip keyword keys entirely. Use vector similarity only — pgvector is fast enough.

Detection Flow:

For each new distillation:
  1. Compute embedding (768-dim)
  2. pgvector cosine search (limit=5, threshold=0.85)
  3. If matches found: VERSION (don't merge)
  4. If no matches: CREATE NEW

Versioned Distillations:

CREATE TABLE distillation_versions (
  id          UUID PRIMARY KEY,
  distillation_id UUID REFERENCES distillations(id),
  content     TEXT,
  created_at  TIMESTAMPTZ,
  merged_from UUID[]  -- parent distillation IDs
);

Merge Strategy:

IF similarity > 0.85:
  - Create new distillation_version
  - 9B consolidates: old_content + new_content → unified
  - Store unified content
  - Keep old versions for audit
ELSE:
  - Create new distillation (normal flow)

Why versioned? Append creates bloat, replace loses history. Version + 9B consolidation preserves both.
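
In Yak the near-duplicate lookup is a pgvector query. The Go sketch below swaps that for an in-memory scan over toy 3-dimensional vectors, purely to show the VERSION vs CREATE NEW decision; function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between a and b.
// It returns 0 for mismatched lengths or zero vectors.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// decide returns "VERSION" when any existing embedding exceeds the
// similarity threshold, and "CREATE NEW" otherwise.
func decide(candidate []float64, existing [][]float64, threshold float64) string {
	for _, e := range existing {
		if cosineSimilarity(candidate, e) > threshold {
			return "VERSION"
		}
	}
	return "CREATE NEW"
}

func main() {
	existing := [][]float64{{1, 0, 0}, {0, 1, 0}}
	fmt.Println(decide([]float64{0.9, 0.1, 0}, existing, 0.85))
	fmt.Println(decide([]float64{0, 0, 1}, existing, 0.85))
}
```

The 0.85 threshold is a similarity, not a distance. pgvector's cosine operator returns a distance, so a real query would presumably compare against 1 − 0.85.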

4. 9B Conflict Detection

On every new memory, 9B checks for contradictions with existing memories:

Input: [new_distillation, existing_similar_distillations]
9B evaluates:
  "Do these contradict? Score 0–1"
  
If conflict > 0.7:
  - Flag for review
  - Don't auto-merge
  - Alert user

Why? 9B detects semantic contradictions that heuristics miss.
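
A sketch of the conflict gate in Go. ConflictScorer is a hypothetical function type standing in for the 9B call, and the stub in main exists only so the example runs offline.

```go
package main

import "fmt"

// ConflictScorer is a hypothetical function type over the 9B worker: it
// returns a contradiction score in 0–1 for a pair of distillations.
type ConflictScorer func(newText, existingText string) (float64, error)

// findConflicts returns the existing distillations whose contradiction
// score against newText exceeds the threshold (0.7 in the flow above).
func findConflicts(score ConflictScorer, newText string, existing []string, threshold float64) ([]string, error) {
	var conflicts []string
	for _, e := range existing {
		s, err := score(newText, e)
		if err != nil {
			return nil, fmt.Errorf("score conflict: %w", err)
		}
		if s > threshold {
			conflicts = append(conflicts, e)
		}
	}
	return conflicts, nil
}

func main() {
	// stubScorer stands in for the model call; its scores are made up.
	stubScorer := func(newText, existingText string) (float64, error) {
		if existingText == "User lives in Berlin." {
			return 0.9, nil
		}
		return 0.1, nil
	}
	existing := []string{"User lives in Berlin.", "User prefers coffee over tea."}
	conflicts, err := findConflicts(stubScorer, "User lives in Lisbon.", existing, 0.7)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if len(conflicts) > 0 {
		fmt.Println("flag for review, skip auto-merge:", conflicts)
	}
}
```

Passing the scorer as a function keeps the threshold logic testable without a running model.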

5. Gravity (Intrinsic Importance)

Each memory has an intrinsic “gravity” that acts as a base multiplier:

gravity = tier_multiplier × user_weight
core    → 2.0
medium  → 1.0
low     → 0.5

final_score = cosine_similarity × decay × (1 + warmth) × gravity

Why: Core facts (name, location) always surface. Preferences (medium) surface when relevant. Ephemeral (low) fade faster.


Final Recall Score

score = cosine_similarity × e^(-ln(2)/half_life × age_days)
        × (1 + warmth)
        × gravity
        × tier_bonus

Component          Range    Purpose
cosine_similarity  0–1      Semantic match
decay              0–1      Temporal recency
1 + warmth         1–2      Usage frequency
gravity            0.5–2.0  Intrinsic importance
tier_bonus         1.0–2.0  Core memory boost
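
A Go sketch of the combined score. It assumes the decay term is a half-life decay that stays within 0–1, as the component table states, which means a negative exponent. The struct and function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// Candidate holds the inputs to the recall score for one distillation.
type Candidate struct {
	CosineSimilarity float64 // 0–1
	AgeDays          float64
	Warmth           float64 // 0–1
	Gravity          float64 // 0.5–2.0
	TierBonus        float64 // 1.0–2.0
}

// recallScore multiplies the components. Decay halves every halfLifeDays;
// a non-positive half-life disables decay (an assumption of this sketch).
func recallScore(c Candidate, halfLifeDays float64) float64 {
	decay := 1.0
	if halfLifeDays > 0 {
		decay = math.Exp(-math.Ln2 / halfLifeDays * c.AgeDays)
	}
	return c.CosineSimilarity * decay * (1 + c.Warmth) * c.Gravity * c.TierBonus
}

func main() {
	fresh := Candidate{CosineSimilarity: 0.8, AgeDays: 0, Warmth: 0.5, Gravity: 1.0, TierBonus: 1.0}
	old := fresh
	old.AgeDays = 30
	fmt.Printf("fresh: %.3f\n", recallScore(fresh, 30))
	fmt.Printf("30 days old: %.3f\n", recallScore(old, 30))
}
```

Higher is better here, the opposite of the distance-based form in the Temporal Decay section.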

Core Principles

1. No Prompt Mutations from Memory Injection

Memories are injected as a distinct, isolated block — never interpolated into existing template sections.

Why: Prompt mutations cause template corruption, inconsistent injection points, and make debugging difficult.

Pattern:

<system_prompt>
...original template...
</system_prompt>

<memories>
{{ recalled_distillations }}
</memories>

<user_query>
...
</user_query>

Rules:

  • Memory block always injected at fixed position (before user query, after system prompt)
  • Use XML/Markdown delimiters to isolate memory content
  • Never interpolate memories into template variables
  • If memory injection fails, prompt still works (graceful degradation)
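
These rules can be sketched in Go as a prompt builder. The function name is illustrative, and the sketch assumes memory text never contains the closing delimiter; escaping is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// buildPrompt assembles the prompt with memories in their own block, placed
// after the system prompt and before the user query. With no memories the
// block is omitted, so a failed recall degrades to the plain prompt.
func buildPrompt(systemPrompt string, memories []string, userQuery string) string {
	var b strings.Builder
	b.WriteString("<system_prompt>\n" + systemPrompt + "\n</system_prompt>\n\n")
	if len(memories) > 0 {
		b.WriteString("<memories>\n" + strings.Join(memories, "\n") + "\n</memories>\n\n")
	}
	b.WriteString("<user_query>\n" + userQuery + "\n</user_query>\n")
	return b.String()
}

func main() {
	memories := []string{"User prefers coffee over tea."}
	fmt.Print(buildPrompt("You are a helpful assistant.", memories, "What should I drink?"))
}
```

The template string is never rewritten; memories only ever appear between their own delimiters.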

2. Inference-Last Ethos

Prioritise performant, deterministic tools before using the LLM layer.

Retrieval Pipeline:

  1. Keyword filters — Exact match, user_id scoping, tier filtering
  2. Vector search — pgvector cosine similarity with decay
  3. LLM summarisation — Only if recall results need synthesis

Storage Pipeline:

  1. Keyword delete detection: /forget or “forget this” (no LLM)
  2. Near-duplicate check — Vector similarity only (version, don’t merge)
  3. LLM distillation — Async, only for new memories

Why:

  • Reduces latency (keyword search < 1ms vs. LLM 100s of ms)
  • Reduces cost (no LLM calls for simple operations)
  • More predictable behaviour (deterministic filters)
  • LLM reserved for reasoning tasks, not retrieval

Explicit Delete Detection

User-initiated deletion via explicit command:

Command:

/forget <memory_id>
/forget <exact phrase>
forget this memory
delete memory: <id>

Flow:

1. Parse input for an explicit command (/forget, "forget this memory", "delete memory:")
2. If memory_id provided → DELETE by ID
3. If phrase provided → SEARCH → CONFIRM → DELETE
4. Cascade delete: raw memory + distillation + all versions + embeddings
5. Log deletion to audit trail (soft delete option)

Why explicit? Prevents accidental deletion from LLM hallucinations. User must explicitly request.
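
A possible Go parser for the explicit commands, with no LLM involved. It treats a UUID argument as a memory ID, following the data model, and anything else as a phrase. The "forget this memory" form is left out because it needs conversational context to resolve. Type and function names are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidPattern matches a canonical UUID, the ID type used in the data model.
var uuidPattern = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// ForgetRequest describes what the user asked to delete.
type ForgetRequest struct {
	ByID   string // set when the argument is a memory ID
	Phrase string // set for free text: search, confirm, then delete
}

// parseForget recognises "/forget <arg>" and "delete memory: <id>".
// It returns false for anything else, so ordinary chat never deletes.
func parseForget(input string) (ForgetRequest, bool) {
	input = strings.TrimSpace(input)
	if strings.HasPrefix(input, "delete memory:") {
		id := strings.TrimSpace(strings.TrimPrefix(input, "delete memory:"))
		return ForgetRequest{ByID: id}, id != ""
	}
	if strings.HasPrefix(input, "/forget ") {
		arg := strings.TrimSpace(strings.TrimPrefix(input, "/forget "))
		if uuidPattern.MatchString(arg) {
			return ForgetRequest{ByID: arg}, true
		}
		return ForgetRequest{Phrase: arg}, true
	}
	return ForgetRequest{}, false
}

func main() {
	for _, in := range []string{
		"/forget 123e4567-e89b-12d3-a456-426614174000",
		"/forget my old address",
		"please forget what I said",
	} {
		req, ok := parseForget(in)
		fmt.Printf("%q -> ok=%v id=%q phrase=%q\n", in, ok, req.ByID, req.Phrase)
	}
}
```

Only exact command prefixes trigger a delete, so a sentence that merely mentions forgetting is ignored.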


Error Handling & Discord Webhooks

Explicit error alerts with configurable webhook destination.

Config:

errors:
  enabled: true
  webhook_url_env: "YAK_ERROR_WEBHOOK"  # From env, not config file
  dedup_window_sec: 300  # 5 minutes
  alert_on:
    - distillation_failed
    - embedding_failed
    - db_connection_lost
    - queue_overflow

Alert Format:

🚨 Yak Error: distillation_failed
Memory ID: abc-123
Error: LLM timeout after 30s
Retry: 2/3
Time: 2026-06-01T09:15:00Z

Levels:

  • warn — Logged, no alert (e.g., low confidence)
  • error — Logged + webhook (e.g., LLM failure)
  • critical — Logged + webhook + escalate to SMS/pager (e.g., DB down)

Deduplication:

  • Same error type within 5 min window = 1 alert
  • After dedup window expires, alert again
  • Prevents webhook spam during outages
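
A minimal Go sketch of the dedup window, keyed by error type. Timestamps are passed in as Unix seconds so the logic is easy to test; the type and method names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Deduper suppresses repeat alerts of the same error type inside a window.
type Deduper struct {
	mu        sync.Mutex
	windowSec int64
	lastSent  map[string]int64 // error type -> Unix time of last alert sent
}

// NewDeduper creates a Deduper; windowSec corresponds to dedup_window_sec.
func NewDeduper(windowSec int64) *Deduper {
	return &Deduper{windowSec: windowSec, lastSent: make(map[string]int64)}
}

// ShouldAlert reports whether an alert for errType should be sent at
// nowUnix, and records the send time when it returns true.
func (d *Deduper) ShouldAlert(errType string, nowUnix int64) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if prev, seen := d.lastSent[errType]; seen && nowUnix-prev < d.windowSec {
		return false
	}
	d.lastSent[errType] = nowUnix
	return true
}

func main() {
	d := NewDeduper(300) // 5 minutes
	now := time.Now().Unix()
	fmt.Println(d.ShouldAlert("distillation_failed", now))     // first alert
	fmt.Println(d.ShouldAlert("distillation_failed", now+120)) // inside window
	fmt.Println(d.ShouldAlert("embedding_failed", now+120))    // different type
	fmt.Println(d.ShouldAlert("distillation_failed", now+360)) // window expired
}
```

Suppressed alerts do not extend the window, so a persistent outage still produces one alert every five minutes.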

Metrics

Prometheus /metrics endpoint for observability.

Endpoints:

GET /metrics → Prometheus format

Metrics:

Metric                       Type       Description
yak_requests_total           counter    Total API requests
yak_distillations_created    counter    Distillations stored
yak_distillations_versioned  counter    Near-duplicates detected (versioned)
yak_memories_deleted         counter    Deletions
yak_recall_latency_ms        histogram  Search latency
yak_distillation_latency_ms  histogram  LLM latency
yak_queue_size               gauge      Pending jobs
yak_warmth_updates           counter    9B warmth evaluation jobs
yak_conflicts_detected       counter    Contradiction alerts
yak_errors_total             counter    Errors by type

Example queries:

# p95 recall latency over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(yak_recall_latency_ms_bucket[5m])))

# Version rate (near-duplicates)
rate(yak_distillations_versioned[5m])

# Conflict rate
rate(yak_conflicts_detected[5m])

# Error rate
rate(yak_errors_total[5m])

CLI

Thin API wrapper for manual operations. Supports local and remote instances.

Commands:

yak --help

# Server
yak serve              # Start HTTP API
yak serve --port 8000  # Custom port

# Memory ops
yak memory list --user user-123        # List memories
yak memory show <id>                   # Show details
yak memory delete <id>                 # Delete by ID
yak memory search "coffee" --limit 10  # Search

# Distillation ops
yak distill list --user user-123       # List distillations
yak distill version <id1> <id2>        # Manual version (merge)
yak distill re-embed <id>              # Re-compute embedding

# Warmth ops
yak warmth show <id>                   # Show warmth score
yak warmth reset <id>                  # Reset to 0.5
yak warmth bulk-reset                  # Reset all (debug)

# Admin
yak metrics                            # Show metrics
yak queue status                       # Show queue
yak health                             # Health check

# Remote support
yak --remote https://yak.example.com memory list
yak --api-key xxx memory delete <id>

Why thin wrapper? Reuses API logic, supports remote instances, --json output for scripting.


Background Jobs (Real-Time + Batch)

Per-Turn 9B Relevance Evaluation

Every conversation turn:

After assistant responds:
  Input: [recalled_memories, full_conversation_turn]
  
  For each recalled memory:
    9B evaluates: "Was memory X useful? Score 0–1"
    
  If score > 0.7: warmth += 0.1
  If score < 0.3: warmth -= 0.2
  If 0.3–0.7: no change (neutral)
  
  Latency: ~2s per turn (acceptable for personal use)

Nightly Batch (Optional)

02:00 local time:

  • Re-evaluate all memories from last 7 days
  • Catch any edge cases missed in per-turn evaluation
  • Adjust warmth for memories that were recalled but not used

Data Model

memories table

id          UUID PRIMARY KEY
content     TEXT          -- "User: ...\nAssistant: ..."
user_id     TEXT          -- Multi-user scoping
created_at  TIMESTAMPTZ
updated_at  TIMESTAMPTZ

distillations table

id          UUID PRIMARY KEY
memory_id   UUID REFERENCES memories(id)
content     TEXT          -- 1-4 factual sentences
confidence  FLOAT         -- 0–1, from LLM
tier        TEXT          -- core/medium/low
warmth      FLOAT DEFAULT 0.5  -- 0–1, 9B evaluation
embedding   vector(768)   -- nomic-embed-text
created_at  TIMESTAMPTZ
updated_at  TIMESTAMPTZ   -- updated on warmth changes

distillation_versions table

id          UUID PRIMARY KEY
distillation_id UUID REFERENCES distillations(id)
content     TEXT          -- Consolidated content
created_at  TIMESTAMPTZ
merged_from UUID[]        -- Parent distillation IDs

Indexes:

  • idx_distillations_embedding — IVFFlat for fast vector search
  • idx_distillations_tier — Filter by importance
  • idx_distillations_warmth — Sort by warmth for recall
  • idx_memories_user_id — Multi-user scoping

API Design

Create Memory

POST /api/memories/
{
  "content": "User: I love coffee\nAssistant: Great choice!",
  "user_id": "user-123"
}

→ 201 Created
{
  "id": "uuid",
  "content": "...",
  "user_id": "user-123",
  "created_at": "2026-06-01T..."
}

Search Distillations

POST /api/distillations/search
{
  "query": "What's my favorite drink?",
  "user_id": "user-123",
  "limit": 5,
  "threshold": 0.7
}

→ 200 OK
[
  {
    "id": "uuid",
    "content": "User prefers coffee over tea.",
    "confidence": 0.87,
    "tier": "medium",
    "warmth": 0.65,
    "score": 0.82,
    "is_core": false
  }
]

Update Warmth (9B evaluation)

POST /api/warmth/evaluate
{
  "memory_id": "uuid",
  "context": "full_conversation_turn",
  "evaluator": "9b-worker"
}

→ 200 OK
{
  "memory_id": "uuid",
  "relevance_score": 0.85,
  "warmth_change": 0.1,
  "new_warmth": 0.75
}
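
For callers written in Go, the search request and response bodies could be modelled as below. Field names follow the JSON above; the type names and the decode helper are assumptions, and no HTTP call is made.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SearchRequest mirrors the body of POST /api/distillations/search.
type SearchRequest struct {
	Query     string  `json:"query"`
	UserID    string  `json:"user_id"`
	Limit     int     `json:"limit"`
	Threshold float64 `json:"threshold"`
}

// SearchResult mirrors one element of the 200 OK response array.
type SearchResult struct {
	ID         string  `json:"id"`
	Content    string  `json:"content"`
	Confidence float64 `json:"confidence"`
	Tier       string  `json:"tier"`
	Warmth     float64 `json:"warmth"`
	Score      float64 `json:"score"`
	IsCore     bool    `json:"is_core"`
}

// decodeSearchResults parses a search response body.
func decodeSearchResults(body []byte) ([]SearchResult, error) {
	var results []SearchResult
	if err := json.Unmarshal(body, &results); err != nil {
		return nil, fmt.Errorf("decode search results: %w", err)
	}
	return results, nil
}

func main() {
	req := SearchRequest{Query: "What's my favorite drink?", UserID: "user-123", Limit: 5, Threshold: 0.7}
	payload, err := json.Marshal(req)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("request body:", string(payload))

	body := []byte(`[{"id":"uuid","content":"User prefers coffee over tea.","confidence":0.87,"tier":"medium","warmth":0.65,"score":0.82,"is_core":false}]`)
	results, err := decodeSearchResults(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range results {
		fmt.Printf("%s (tier=%s, score=%.2f)\n", r.Content, r.Tier, r.Score)
	}
}
```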

Deployment

  • Binary: Single Go binary (no dependencies)
  • Database: PostgreSQL 16+ with pgvector
  • LLM: llama.cpp at hydrogen:8082
  • Auth: API key header (X-API-Key)
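
The X-API-Key check could be written as standard net/http middleware, as in this sketch. The /health path and the statusFor helper are hypothetical and exist only to exercise the middleware in memory.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireAPIKey rejects requests whose X-API-Key header is missing or does
// not match want. The constant-time comparison avoids leaking key contents
// through response timing.
func requireAPIKey(want string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get("X-API-Key")
		if got == "" || subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one in-memory request through the middleware and returns
// the HTTP status. No network listener is opened.
func statusFor(wantKey, sentKey string) int {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	if sentKey != "" {
		req.Header.Set("X-API-Key", sentKey)
	}
	rec := httptest.NewRecorder()
	requireAPIKey(wantKey, inner).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("correct key:", statusFor("secret", "secret"))
	fmt.Println("wrong key:  ", statusFor("secret", "wrong"))
	fmt.Println("no key:     ", statusFor("secret", ""))
}
```

Rejecting an empty header explicitly means a server started with an empty key does not accept unauthenticated requests.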

Future Considerations

  1. Conflict resolution — What happens when new memory contradicts old?
  2. Cold storage — Archive old distillations to separate table
  3. Warmth decay tuning — Adjust cool-down rate based on usage patterns
  4. Remote CLI — Support --remote for multi-instance deployments