
Building the Fix: 700 Lines That Replaced 1,000

Feb 9, 2026

Last week I wrote about catching a blind spot in MindCap's categorization system — 51% unknown classifications, a tech-only keyword whitelist, and a taxonomy that tried to enumerate human knowledge instead of detecting behavior. Today I built the replacement. Two out of five phases complete, and the new system is already simpler, more capable, and half the size.


The Satisfaction of a Better Question

The old category system: 555 lines of hand-mapped data in category-data.ts (domain-to-category lookups, subreddit mappings, keyword clusters for fifteen categories) plus 475 lines of orchestration in topic-categorizer.ts running five layers of classification logic. Over a thousand lines of code that produced the right answer less than half the time.

The new intent system: 163 lines for the data, 147 for the detector. 310 lines total. It handles more cases, works across any field of knowledge, and it's easier to read.

The difference isn't cleverer code. It's a better question. "What is this content about?" requires mapping every possible topic in human knowledge. "What is the user doing?" has five answers: learning, researching, working, consuming, transacting. "How to bake sourdough" and "How to deploy Kubernetes" are both learning. The signal is in the verb, not the noun.
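In code, the whole answer space collapses to one small union type. A sketch, with names assumed from this post rather than taken from the source:

// The five behavioral intents, plus a fallback (type names are my assumption)
type Intent =
  | "learning"
  | "researching"
  | "working"
  | "consuming"
  | "transacting"
  | "unknown"

interface IntentResult {
  intent: Intent
  confidence: number // 0 to 1; 0.85 for a strong title match, 0.20 for the fallback
}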


The Intent Detector

intent-detector.ts runs through six layers, stopping at the first confident match:

  1. Title keywords — behavioral words like "tutorial", "review", "buy", "dashboard". If the title says "React Tutorial for Beginners", that's learning at 0.85 confidence before we even look at the URL.
  2. URL patterns — structural signals like /docs/, /cart, /checkout, /dashboard. A URL containing /pricing is transacting regardless of domain.
  3. Content type hints — for ambiguous domains only. YouTube could be learning or consuming. If the content script detects "how-to" page structure, that tips it to learning. A video page with no tutorial signals stays consuming.
  4. Domain lookup — amazon.com is always transacting. Netflix is always consuming. The easy cases.
  5. Domain patterns — .edu defaults to learning. Subdomains starting with docs. default to learning; shop. defaults to transacting.
  6. Fallback — unknown at 0.20 confidence with a review flag.

Higher layers override lower ones. Content always beats containers. A coding tutorial on YouTube and an educational article on Reddit both return learning, not consuming. The behavioral signal in the title trumps the domain's general purpose.
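As a sketch, the cascade looks something like this. The helpers, keyword-to-intent mappings, and confidence floor are my stand-ins, not the actual internals of intent-detector.ts, and layers 3 through 5 are elided:

const CONFIDENCE_FLOOR = 0.5 // assumed cutoff for a "confident match"

// Layer 1: behavioral words in the title (illustrative mappings)
function matchTitleKeywords(title: string): IntentResult | null {
  const t = title.toLowerCase()
  if (/\b(tutorial|how.to|course|guide)\b/.test(t))
    return { intent: "learning", confidence: 0.85 }
  if (/\b(review|comparison)\b/.test(t))
    return { intent: "researching", confidence: 0.8 }
  if (/\b(buy|deal)\b/.test(t))
    return { intent: "transacting", confidence: 0.8 }
  if (/\bdashboard\b/.test(t))
    return { intent: "working", confidence: 0.8 }
  return null
}

// Layer 2: structural URL signals
function matchUrlPatterns(pathname: string): IntentResult | null {
  if (/^\/docs\//.test(pathname))
    return { intent: "learning", confidence: 0.8 }
  if (/\/(cart|checkout|pricing)\b/.test(pathname))
    return { intent: "transacting", confidence: 0.8 }
  return null
}

// Layers 3–5 (content-type hints, domain lookup, domain patterns) follow the
// same shape and are omitted here; layer 6 is the fallback.
function detectIntent(title: string, url: URL): IntentResult {
  const layers = [
    () => matchTitleKeywords(title),
    () => matchUrlPatterns(url.pathname),
  ]
  for (const layer of layers) {
    const result = layer()
    if (result && result.confidence >= CONFIDENCE_FLOOR) return result
  }
  return { intent: "unknown", confidence: 0.2 } // flagged for manual review
}

Run against "React Tutorial for Beginners" on youtube.com, layer 1 answers learning at 0.85 and the domain never gets a vote, which is exactly the override behavior described above.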


Broadening the Keyword Pipeline

The old TECH_KEYWORDS whitelist — 65 terms that got special protection during extraction — became PROTECTED_KEYWORDS, covering every major field. The full list is long; an illustrative slice (my sample terms, not the real entries) looks like this:
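// Illustrative slice only; the real PROTECTED_KEYWORDS set is much larger
const PROTECTED_KEYWORDS = new Set([
  "kubernetes", "typescript", "react",       // tech: the old whitelist's territory
  "biology", "chemistry", "cardiology",      // science and medicine
  "economics", "philosophy", "linguistics",  // social sciences and humanities
  "photography", "sourdough", "woodworking", // arts and practical skills
])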

I also built a keyword alias system (keyword-aliases.ts) that normalizes abbreviations: "bio" becomes "biology", "econ" becomes "economics", "cardio" becomes "cardiology". This handles how people actually title content. Nobody writes "Introduction to Biological Sciences" — they write "Intro to Bio."
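A plausible shape for that table (the real keyword-aliases.ts may organize it differently, and the map is far longer):

// Hypothetical excerpt of keyword-aliases.ts
const KEYWORD_ALIASES: Record<string, string> = {
  bio: "biology",
  econ: "economics",
  cardio: "cardiology",
}

// Normalize one extracted keyword; unknown terms pass through unchanged
function normalizeKeyword(raw: string): string {
  const lower = raw.toLowerCase()
  return KEYWORD_ALIASES[lower] ?? lower
}

// normalizeKeyword("Bio") → "biology": "Intro to Bio" now lands on the protected term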


The Small Fix That Felt Big

The change I'm most satisfied with today is the smallest one.

The content script (tracker.ts) extracts keywords from the page and runs content analysis — readability scoring, sentiment analysis. Both need the page's main text content. Both were calling getMainText() independently.

getMainText() isn't cheap. It tries to find an <article> or <main> element. If it can't, it clones the entire document body, strips out scripts, stylesheets, navigation, headers, footers, sidebars, and menus, then returns the remaining text. A full DOM clone and traversal. The old code ran it twice — once for keywords, once for analysis — both on page load.
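In outline, the work looks something like this; a simplified sketch of the behavior described above, not the extension's exact code:

// Cheap path first, then the expensive clone-and-strip fallback
function getMainText(): string {
  // A semantic container already wraps the main content
  const semantic = document.querySelector("article, main")?.textContent
  if (semantic?.trim()) return semantic

  // No semantic container: clone the whole body, strip the chrome, keep the rest
  const clone = document.body.cloneNode(true) as HTMLElement
  clone
    .querySelectorAll("script, style, link, nav, header, footer, aside, menu")
    .forEach((el) => el.remove())
  return clone.textContent ?? ""
}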

The fix:

// One DOM traversal, shared by both consumers
const mainText = getMainText()
contentKeywords = extractContentKeywords(mainText)
contentAnalysis = analyzeContent(mainText)

Three lines. One DOM traversal instead of two. Both functions now accept an optional text parameter. It won't show up in any feature list, but content scripts are guests in someone else's browser. They should be polite.
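One way to express that optional parameter is a default argument, sketched here with placeholder bodies since the real functions do more:

// Callers that already hold the text pass it in; bare calls still pay for one traversal
function extractContentKeywords(text: string = getMainText()): string[] {
  return Array.from(new Set(text.toLowerCase().match(/[a-z][a-z-]{3,}/g) ?? [])) // placeholder
}

function analyzeContent(text: string = getMainText()): { wordCount: number } {
  return { wordCount: text.split(/\s+/).filter(Boolean).length } // placeholder metric
}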


The Plan for Phase 3

Phase 3 is where everything gets connected. Five files need surgery:

  1. db.ts — Schema changes. topicCategory becomes intent. categoryConfidence becomes intentConfidence. Six stale session fields get removed. A Dexie v6 migration re-extracts keywords and detects intent for every existing visit.
  2. url-parsers.ts — getCategoryFromParsedUrl() becomes getIntentFromParsedUrl(). The 40-line function with subreddit lookups becomes a 20-line switch statement. Reddit returns null now — the subreddit name is a topic, not an intent. r/learnprogramming and r/AskHistorians are both learning, regardless of subject.
  3. background.ts — categorizeVisit() becomes detectIntent(). Same treatment for the re-categorization logic that fires when engagement data arrives.
  4. session-detector.ts — The biggest change. The old spread metric counted unique categories (coarse, 0–4 integers). The new one measures Jaccard distance on keywords along actual navigation paths (smooth, 0–4 continuous; see the sketch after this list). A user who reads five pages about quantum mechanics and then five about cooking gets high spread. Ten pages about quantum mechanics subtopics get low spread. Same page count, different spread — because the keywords changed.
  5. sync.ts — The sync payload sends intent fields and visit relationships. Parent references use positional array indices, not local database IDs. Visit[3]'s parent being Visit[1] is expressed as parent_visit_index: 1. Simple, portable, meaningful on the server.
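For the spread metric in item 4, here is a sketch of Jaccard distance over keyword sets, under my own names and scaling; session-detector.ts will own the real version:

// Jaccard distance between the keyword sets of two adjacent visits on a path
function jaccardDistance(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0
  let shared = 0
  for (const k of a) if (b.has(k)) shared++
  return 1 - shared / (a.size + b.size - shared) // 0 = same topic, 1 = fully disjoint
}

// Average hop-to-hop distance along a navigation path, scaled onto the old 0–4 range
function keywordSpread(path: Array<Set<string>>): number {
  if (path.length < 2) return 0
  let total = 0
  for (let i = 1; i < path.length; i++) {
    total += jaccardDistance(path[i - 1], path[i])
  }
  return 4 * (total / (path.length - 1))
}

Quantum-mechanics subtopics share most keywords from hop to hop, so each distance sits near zero; the jump to cooking is a hop near 1.0 that a category counter could never register.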

One structural decision I'm glad I made: calculateTopicChain() gets deleted entirely. It pre-computed flat arrays of topics and categories for session records. The new system doesn't need them — the raw visit tree (parent/child relationships + keywords) can reconstruct topic flows on demand. Deleting pre-computed data that goes stale every time you improve your algorithms is the right call.


Working With Claude Code

I'm hand-coding this refactor, but Claude Code is my co-pilot. The workflow that's emerged over twelve sessions: I describe what I want to build and why, Claude reads the relevant files, we discuss the approach, I make the decisions, and then we code it together. Sometimes Claude writes the first draft and I edit. Sometimes I describe the change and Claude makes it.

The planning phase is where it's most valuable. "Read these five files and tell me everywhere that references topicCategory" — accurate answer in seconds. The Phase 3 plan document, with exact line numbers, field name mapping tables, and dependency ordering, was generated from Claude reading every file that would be touched and synthesizing the changes. That kind of cross-file analysis is tedious by hand and trivial for an AI that can hold six files in context simultaneously.

The trap is letting it think for you. When Claude suggested nesting extractContentKeywords() inside analyzeContent() to share the text, I said no — they're different concerns and should stay separate functions. The right fix was sharing the input, not merging the operations. That's a judgment call that requires understanding the system's architecture, not just its code.


What's Next

I can't run full validation until Phase 3 is done — the new code needs to be wired in to produce real results. But the architecture gives me confidence. Six layers of fallback in the intent detector. Five thousand characters of article text instead of 500. Protected keywords covering every major field, not just programming.

The intent question is also fundamentally easier to answer than the category question. "Is this person learning something?" has clear signals: tutorial, guide, documentation, how-to, course, lesson. "Is this content about science?" requires knowing what science is, what its subfields are, what its vocabulary looks like. Intent detection sidesteps the entire taxonomy problem.

The goal is to complete all five Phase 3 tasks in a single session. I'll report back with real numbers.

MindCap is a personal project — a browser extension that maps curiosity patterns. Built with Plasmo, Dexie.js, FastAPI, Supabase, and Claude. Session 12 of development. You can follow the development on this blog.