How I Caught a Blind Spot That Would've Sunk MindCap
This one's a fun story. I built a system to map human curiosity, and it only understood programmers.
I caught it this week while running MindCap's keyword extraction through a validation notebook against 56,000+ real browsing history records. The numbers told a clear story: this wasn't a tuning problem, it was a structural one. And once I saw it, the fix became obvious.
Quick Context
MindCap is a privacy-focused browser extension that maps how your mind wanders online — not URLs or content, but the patterns: depth, spread, coherence, branching factor. Chrome extension (Plasmo + Dexie) on the front end, FastAPI + Supabase on the back end, Claude API for behavioral pattern detection. Full URLs never leave the browser.
The core extraction pipeline was working — but "working" and "working well across domains" turned out to be very different things.
The Validation Notebook
Passing unit tests and clean TypeScript compilation only tell you so much. I wanted real numbers, so I sat down with a Jupyter notebook, loaded 56,000+ rows of crawler-enriched Firefox history, ported the entire extraction pipeline from TypeScript to Python, and ran it against actual browsing data.
The results:
| Category | Percentage |
|---|---|
| unknown (fallback) | 51.3% |
| technology | 11.8% |
| education | 9.0% |
| social | 7.8% |
| shopping | 6.8% |
| reference | 5.1% |
| entertainment | 3.0% |
| science | 0.7% |
| finance | 0.7% |
| health | 0.4% |
| food | 0.4% |
More than half my data was unclassifiable. And the classification source distribution told the same story:
| Source | Percentage |
|---|---|
| fallback | 51.3% |
| keyword match | 25.3% |
| domain lookup | 12.0% |
| url pattern | 6.6% |
| domain pattern | 4.8% |
The fallback bucket — "I have no idea what this is" — was the single largest classification source. And 95% of all records were flagged as needing review.
I'd set an 80% accuracy bar before moving forward. The system was at 49%. That gap was too wide to be a calibration issue — something fundamental was off. Time to dig in.
The Real Problem Wasn't the Categories
The 15-category taxonomy was part of it — "Technology" covering everything from React tutorials to SaaS pricing pages isn't useful. But the root cause ran deeper, into the keyword extraction itself. Three things jumped out:
The tech keyword whitelist. keywords.ts has a 65-term TECH_KEYWORDS set — words like "kubernetes", "react", "docker", "typescript" that are protected from being filtered during extraction. If you browse a React tutorial, those keywords survive. They get matched against category keyword lists. The system works.
But if you browse an article about quantum mechanics? "Quantum" and "mechanics" aren't in TECH_KEYWORDS. They get treated like generic noise. The system extracts nothing useful from "Introduction to Quantum Mechanics."
Same for endocrinology. Same for chord progressions. Same for jurisprudence, fermentation, macroeconomics, and every other domain of human knowledge that isn't software engineering.
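To make the failure mode concrete, here is a stripped-down sketch of that kind of whitelist-plus-category filter. It is not the real keywords.ts logic, and the word lists are placeholders, but it shows why a non-tech title comes back empty:

```typescript
// Simplified sketch of the old behavior (not the actual keywords.ts).
// A token survives extraction only if it is protected or matches a category term.
const TECH_KEYWORDS = new Set(["kubernetes", "react", "docker", "typescript"]);
const CATEGORY_KEYWORDS = new Set(["tutorial", "news", "shopping", "course"]);

function extractKeywords(title: string): string[] {
  const tokens = title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return tokens.filter((t) => TECH_KEYWORDS.has(t) || CATEGORY_KEYWORDS.has(t));
}

extractKeywords("Introduction to React Hooks");       // ["react"]
extractKeywords("Introduction to Quantum Mechanics"); // [] (nothing useful survives)
```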
Classic case of building for yourself first and generalizing second — except I'd skipped the second part.
The shallow content extraction. The content script had access to the full DOM but was only extracting keywords from meta tags, headings, and the first 500 characters of article text. That's two or three sentences. A 3,000-word medical article about autoimmune disorders would get keywords from its introduction — and the introduction is almost always generic. The specific terminology that tells you what the article is actually about lives in the body text, which the extraction pipeline never saw.
The extension already had a getMainText() function that grabs the full article text — it was being used for readability scoring. The keyword extractor just hadn't been wired up to it yet. Easy fix once I spotted it.
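Roughly, the wiring fix looks like this in the content script. getMainText() is real (it already feeds readability scoring); the surrounding function, constant, and selectors are a sketch rather than the actual tracker.ts code:

```typescript
// Sketch: feed the keyword extractor the same article text used for readability.
declare function getMainText(): string; // already exists in the extension

const CONTENT_EXTRACTION_LIMIT = 5000; // previously only ~500 characters were used

function collectKeywordSource(): string {
  const meta =
    document.querySelector('meta[name="description"]')?.getAttribute("content") ?? "";
  const headings = Array.from(document.querySelectorAll("h1, h2, h3"))
    .map((h) => h.textContent ?? "")
    .join(" ");
  const body = getMainText().slice(0, CONTENT_EXTRACTION_LIMIT); // full article body, capped
  return [document.title, meta, headings, body].join(" ");
}
```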
The wrong side of the wire. The server was running its own domain classification — calling the Claude Haiku API at ~$0.0001/call to classify domains into the same broad categories, caching results in a Supabase domain_cache table, and sending corrections back to the extension. A whole server→extension feedback loop to classify "github.com" as "technology."
But the extension had more information than the server ever saw: page title, URL path, content keywords, engagement signals. The classification was happening on the wrong side of the wire — paying for API calls and adding latency when the better data was already on the client.
The Redesign: Intent + Emergent Topics
Once the diagnosis was clear, the redesign came together quickly. The 15-category taxonomy, the tech-centric keyword whitelist, the server-side domain classification — all symptoms of the same mistake: trying to enumerate knowledge domains instead of detecting behavior.
The replacement is a two-layer system:
Layer 1: Intent (5 behavioral modes). Instead of asking "what is this content about?" (which requires knowing every possible topic), ask "what is the user doing?"
- Learning — tutorials, courses, documentation, guides
- Researching — comparisons, reviews, wikis, deep dives
- Working — dashboards, tools, email, project management
- Consuming — news, entertainment, social media, streaming
- Transacting — shopping, booking, pricing, checkout
Intent is detected from behavioral signals — words like "tutorial", "how to", "review", "buy" — not subject-specific keywords. This is what makes it domain-agnostic. "How to bake sourdough" and "How to deploy Kubernetes" both signal learning. The same detection logic works for a med student, a jazz musician, and a software engineer.
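A minimal sketch of what that matching can look like. The five intents come straight from the list above; the signal vocabularies and the function shape are illustrative, not the real intent-data.ts contents:

```typescript
// Sketch of behavioral-signal matching: domain-agnostic words that describe
// what the user is doing, not what the page is about.
type Intent = "learning" | "researching" | "working" | "consuming" | "transacting";

const INTENT_SIGNALS: Record<Intent, string[]> = {
  learning: ["tutorial", "how to", "course", "guide", "documentation"],
  researching: ["review", "comparison", "vs", "wiki", "deep dive"],
  working: ["dashboard", "inbox", "settings", "admin", "board"],
  consuming: ["news", "trailer", "episode", "stream", "feed"],
  transacting: ["buy", "pricing", "checkout", "cart", "booking"],
};

function matchIntentFromTitle(title: string): Intent | null {
  const t = title.toLowerCase();
  for (const [intent, signals] of Object.entries(INTENT_SIGNALS) as [Intent, string[]][]) {
    if (signals.some((s) => t.includes(s))) return intent;
  }
  return null;
}

matchIntentFromTitle("How to bake sourdough");    // "learning"
matchIntentFromTitle("How to deploy Kubernetes"); // "learning"
```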
The detection runs through six layers in priority order: title keywords → URL patterns → content type hints → domain lookup → domain/TLD patterns → fallback. Higher layers are content-aware (higher confidence), lower layers are generic defaults (lower confidence). Content signals always override domain defaults — so youtube.com/watch?v=tutorial returns learning, not consuming.
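And a sketch of the priority cascade itself, with made-up confidence values and only a few of the six layers filled in, to show how a content signal beats a domain default:

```typescript
// Sketch of the layered detector: content-aware checks run first and win over
// generic domain defaults. Layers, patterns, and confidences are illustrative.
type Intent = "learning" | "researching" | "working" | "consuming" | "transacting";

interface IntentResult {
  intent: Intent;
  confidence: number;
  source: "title" | "url" | "domain" | "fallback";
}

const DOMAIN_DEFAULTS: Record<string, Intent> = {
  "youtube.com": "consuming",
  "github.com": "working",
};

function detectIntent(url: URL, title: string): IntentResult {
  // Layer 1: title keywords (content-aware, highest confidence)
  if (/tutorial|how to|course|guide/i.test(title)) {
    return { intent: "learning", confidence: 0.9, source: "title" };
  }
  // Layer 2: URL patterns
  if (/\/(checkout|cart)\b/.test(url.pathname)) {
    return { intent: "transacting", confidence: 0.8, source: "url" };
  }
  // Layers 4-5: domain lookup and domain/TLD patterns (generic defaults)
  const domainDefault = DOMAIN_DEFAULTS[url.hostname.replace(/^www\./, "")];
  if (domainDefault) {
    return { intent: domainDefault, confidence: 0.5, source: "domain" };
  }
  // Layer 6: fallback (illustrative default)
  return { intent: "consuming", confidence: 0.2, source: "fallback" };
}

// A tutorial video on YouTube resolves to "learning", not the domain default.
detectIntent(new URL("https://youtube.com/watch?v=abc"), "Kubernetes Tutorial for Beginners");
// => { intent: "learning", confidence: 0.9, source: "title" }
```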
Layer 2: Emergent Topics (from keywords). No predefined taxonomy. Topics come from the keywords themselves, extracted through a broadened pipeline. Instead of 15 categories, you get whatever the content is actually about: "quantum mechanics", "chord progressions", "autoimmune disorders", "sourdough starter."
The fix for keyword extraction was straightforward but involved:
- Rename `TECH_KEYWORDS` → `PROTECTED_KEYWORDS` and add terms across science, medicine, music, finance, law, cooking, math, and humanities
- Broaden `COMPOUND_EXCEPTIONS` so "quantum + mechanics" and "blood + pressure" get the same treatment as "machine + learning"
- Expand `KNOWN_COMPOUND_TERMS` beyond tech compounds
- Extract from 5,000 characters of article text instead of 500
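For a sense of the shape of that change, a partial sketch of the broadened data; the real keywords.ts has far more terms and also distinguishes `COMPOUND_EXCEPTIONS` from `KNOWN_COMPOUND_TERMS`, so treat these entries as placeholders:

```typescript
// Sketch of the broadened keyword data; the terms shown here are illustrative only.
const PROTECTED_KEYWORDS = new Set([
  // formerly TECH_KEYWORDS
  "kubernetes", "react", "docker", "typescript",
  // plus terms from other knowledge domains
  "quantum", "endocrinology", "jurisprudence", "fermentation",
  "macroeconomics", "counterpoint", "sourdough",
]);

// Multi-word terms that should survive extraction as a single keyword.
const KNOWN_COMPOUND_TERMS = new Set([
  "machine learning",
  "quantum mechanics",
  "blood pressure",
  "chord progression",
]);
```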
Server-side classification goes away entirely. The 6-layer intent detector in the extension does what domain_classifier.py did, but better and for free. No more Claude Haiku API calls. No more correction loop. The extension is the authority on intent — it has behavioral signals the server never sees.
What This Actually Looked Like in Practice
This wasn't a clean "delete old code, write new code" refactor. It was a two-day design session — mostly reading, mostly thinking, mostly tracing the data flow to understand exactly where assumptions broke down.
The plan ended up spanning five phases:
1. New intent system — two new files (`intent-data.ts`, `intent-detector.ts`) that exist alongside the old system. Nothing breaks yet.
2. Broadened keyword extraction — edit `keywords.ts` to be domain-agnostic, expand content extraction depth in `tracker.ts`.
3. Rewire consumers — switch `background.ts`, `db.ts`, `session-detector.ts`, `url-parsers.ts`, and `sync.ts` from the old category system to the new intent system. Add Jaccard distance for measuring topic spread along the visit tree, replacing the old "count unique categories" metric (see the sketch after this list).
4. Delete old code — remove `category-data.ts` (555 lines) and `topic-categorizer.ts` (475 lines). Over 1,000 lines gone.
5. Simplify backend — delete `domain_classifier.py`, delete the domains API router, remove classification from the sync flow. The server becomes a storage layer that trusts what the extension sends.
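The Jaccard distance mentioned in phase 3 is a simple set measure over keyword sets; here is a minimal sketch of how it applies to a parent/child pair in the visit tree (names and example values are illustrative):

```typescript
// Jaccard distance between two keyword sets: 0 means identical, 1 means disjoint.
function jaccardDistance(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  for (const item of a) if (b.has(item)) intersection++;
  const union = a.size + b.size - intersection;
  return 1 - intersection / union;
}

// A child visit that shares few keywords with its parent signals a wide topic jump.
const parentKeywords = new Set(["sourdough", "fermentation", "starter"]);
const childKeywords = new Set(["sourdough", "hydration", "baking"]);
jaccardDistance(parentKeywords, childKeywords); // 0.8: related, but the topic has drifted
```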
One key design decision: topic clustering (grouping visits into topic nodes for visualization) now lives at the visualization layer, computed on demand from raw visit data — not pre-computed and stored on session records. This means improving the clustering algorithm later won't require migrating every stored session. The raw data (parent/child visit relationships + keywords) is always available to recompute.
The Meta-Lesson
This is a pattern worth naming: when your test data mirrors your own usage, your system works great — for you. My test cases were tech-heavy, my keyword lists were tech-focused, and "technology" was by far the most detailed category. The system worked perfectly for someone who browses like me. A music theory student's "chord progression" and a medical researcher's "endocrinology" were invisible to it.
That's why I wrote the validation notebook in the first place — intuition is useful for generating hypotheses, but 56,000 rows of real data is what actually tells you where you stand. The gap between 49% and my 80% target wasn't discouraging; it was clarifying. It pointed directly at what needed to change.
What's Next
The plan is written, in two versions: a concise printable reference to work alongside, and a full implementation guide with mechanism explanations covering why each change works the way it does. I'm hand-coding the implementation myself, working through each phase with Claude Code as a co-pilot, because I want to deeply understand every line of this system.
Phase 1 (the new intent detector) and Phase 2 (broadened keyword extraction) are independent — I can work on them in any order. Phase 3 (rewiring consumers) depends on both. Phases 4 and 5 are cleanup.
Building a curiosity mapper that was blind to most curiosity is a good bit of irony. But that's what validation is for — it turns assumptions into data, and data tells you what to build next.
MindCap is a personal project — a browser extension that maps curiosity patterns. Built with Plasmo, Dexie.js, FastAPI, Supabase, and Claude. You can follow the development on this blog.