Dev Diary: Content-First Categorization
Rethinking Signal Priority
MindCap's topic categorization was treating domains as definitive signals—github.com meant technology, youtube.com meant entertainment. But domains are containers, not categories. YouTube hosts programming tutorials alongside gaming streams. GitHub has game mods next to enterprise software. Wikipedia covers everything from quantum physics to celebrity gossip. The domain tells you where content lives, not what it's about.
I spent today exploring what happens when you flip the categorization layers—prioritizing content signals over domain signals.
The Reversed Layer Order
The original 5-layer system evaluated domain lookups first, only falling back to keyword analysis when domains were unknown. The new order:
- Keyword classification (2+ matches) — What's the page actually about?
- Content type refinement — Is it a tutorial video or gaming stream?
- URL patterns — Reddit subreddit mappings, YouTube categories
- Domain patterns — TLDs like .edu, subdomains like docs.*
- Domain lookup — Generic domain category as last resort
- Single keyword match — Weak signal, but better than nothing
- Fallback — Unknown
Domain lookup moved from Layer 1 with 0.85 confidence down to Layer 5 with 0.55 confidence, always flagged for review. The hypothesis: content-specific signals should produce more accurate categorizations for multi-category platforms.
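To make the reversed order concrete, here is a minimal sketch of a first-match-wins layer chain. It is not MindCap's actual code: the type names, the `classify` function, the layer labels, the 0.05 step per extra keyword match, and the 0.4 single-keyword confidence are all assumptions. Only the layer order, the 0.85-0.95 keyword range, the 0.55 domain confidence, and the `needsCategoryReview` flag come from this post.

```typescript
interface Classification {
  category: string;
  confidence: number;
  layer: string;                // hypothetical label for the layer that decided
  needsCategoryReview: boolean;
}

interface PageSignals {
  keywordMatches: { [category: string]: number }; // category -> matched keyword count
  domainCategory?: string;                        // result of a generic domain lookup
}

type Layer = (page: PageSignals) => Classification | null;

// Return the category with the most keyword matches, if it reaches `min`.
function topKeywordCategory(page: PageSignals, min: number): [string, number] | null {
  let best: [string, number] | null = null;
  for (const category of Object.keys(page.keywordMatches)) {
    const count = page.keywordMatches[category];
    if (count >= min && (best === null || count > best[1])) best = [category, count];
  }
  return best;
}

const layers: Layer[] = [
  // Layer 1: keyword classification (2+ matches).
  (page) => {
    const best = topKeywordCategory(page, 2);
    if (!best) return null;
    // 0.85 at two matches, capped at 0.95; the step size is an assumption.
    const confidence = Math.min(0.95, 0.85 + 0.05 * (best[1] - 2));
    return { category: best[0], confidence, layer: "keywords", needsCategoryReview: false };
  },
  // Layers 2-4 (content type, URL patterns, domain patterns) omitted for brevity.
  // Layer 5: generic domain lookup, always flagged for review.
  (page) =>
    page.domainCategory
      ? { category: page.domainCategory, confidence: 0.55, layer: "domain-lookup", needsCategoryReview: true }
      : null,
  // Layer 6: single keyword match (the 0.4 confidence is a placeholder).
  (page) => {
    const best = topKeywordCategory(page, 1);
    return best
      ? { category: best[0], confidence: 0.4, layer: "single-keyword", needsCategoryReview: true }
      : null;
  },
];

function classify(page: PageSignals): Classification {
  for (const layer of layers) {
    const result = layer(page);
    if (result) return result; // first layer with an answer wins
  }
  // Layer 7: fallback.
  return { category: "unknown", confidence: 0, layer: "fallback", needsCategoryReview: true };
}
```

With this shape, a YouTube page whose title matches two technology keywords is classified by the keyword layer before the domain lookup ever gets a chance to call it entertainment.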
Cleaning Up Keyword Extraction
A surprising amount of time went into one function: isUrlGarbage().
When you extract keywords from URLs like https://12ft.io/proxy?q=https%3A%2F%2Fwww.amazon.com%2Fmidwest, you get garbage like "2fmidwest" and "christman2fdp". These are the remains of URL-encoded characters (%2F = /): naive tokenization strips the % and leaves the hex pair 2f glued to the neighboring word.
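A small sketch shows how this kind of garbage arises. The tokenizer below is a hypothetical stand-in (lowercase, then split on anything that is not a letter or digit), not the extension's real extraction code.

```typescript
// Hypothetical naive tokenizer: lowercases, then splits on any run of
// characters that are not letters or digits.
function naiveTokens(url: string): string[] {
  return url
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

const proxiedUrl = "https://12ft.io/proxy?q=https%3A%2F%2Fwww.amazon.com%2Fmidwest";
const naiveResult = naiveTokens(proxiedUrl);

// The "%" is treated as a separator, so each encoded "/" leaves its hex pair
// attached to the following word: the result contains "2fwww" and "2fmidwest".
console.log(naiveResult);
```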
The fix involved multiple heuristics:
- Detecting hex pairs embedded in words (2fmidwest, christman2fdp)
- Filtering common TLDs (com, org, io)
- Checking digit-to-letter ratios
- Pattern matching for tracking params (utm_, fbclid, gclid)
// Patterns that catch URL-encoded garbage
/^[0-9a-f]{2}[a-z]/i, // "2fmidwest" from %2F
/[0-9a-f]{2}[a-z]+[0-9a-f]{2}/i // "word2fword"
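Put together, the four heuristics could look roughly like the sketch below. This is a reconstruction under assumptions, not the real isUrlGarbage(): the name `isUrlGarbageSketch`, the six-character minimum, and the 0.5 digit-ratio threshold are made up for illustration. It also deviates from the patterns above in one respect: the hex pair must start with a digit, because with the `i` flag a letters-only pair such as "de" would also match ordinary words like "december".

```typescript
// TLDs named in the post; a real list would likely be longer.
const COMMON_TLDS = new Set(["com", "org", "io"]);

// Tracking parameters named in the post.
const TRACKING_PARAM = /^(utm_|fbclid|gclid)/;

// Hex pair that starts with a digit (2f, 3a, 5b, ...) followed by a letter.
const HEX_PREFIX = /^[0-9][0-9a-f][a-z]/;       // "2fmidwest" from %2F
// Same hex pair sandwiched between letters.
const HEX_EMBEDDED = /[a-z][0-9][0-9a-f][a-z]/; // "christman2fdp"

// Fraction of characters in the token that are digits.
function digitRatio(token: string): number {
  if (token.length === 0) return 0;
  const digits = token.replace(/[^0-9]/g, "").length;
  return digits / token.length;
}

function isUrlGarbageSketch(token: string): boolean {
  const t = token.toLowerCase();
  if (COMMON_TLDS.has(t)) return true;
  if (TRACKING_PARAM.test(t)) return true;
  if (HEX_PREFIX.test(t) || HEX_EMBEDDED.test(t)) return true;
  // Longer tokens that are mostly digits are treated as IDs, not keywords.
  // Both thresholds are assumptions.
  return t.length >= 6 && digitRatio(t) > 0.5;
}
```

The digit-first rule is still only a heuristic: it will reject some legitimate tokens (for example "2fa"), so the thresholds and patterns need checking against real browsing data.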
Other Refinements
- Adult category: Added for comprehensive behavior tracking (15 categories total now)
- Reddit fallback: Unknown subreddits now default to "social" instead of "unknown"
- Database migration: Version 5 re-extracts keywords for all existing visits using the improved filtering
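The Reddit fallback from the list above could be sketched as follows. The table entries, the category names, and the function name `categorizeRedditUrl` are illustrative assumptions; the post only establishes that unknown subreddits now map to "social".

```typescript
// Illustrative entries only; the real subreddit table is not shown in this post.
const SUBREDDIT_CATEGORIES: { [subreddit: string]: string } = {
  programming: "technology",
  gaming: "gaming",
};

// Matches reddit.com (with or without a subdomain) and captures the subreddit name.
const SUBREDDIT_URL = /^https?:\/\/([^/]+\.)?reddit\.com\/r\/([^/?#]+)/i;

// Returns a category for subreddit URLs, or null so other layers can decide.
function categorizeRedditUrl(url: string): string | null {
  const match = url.match(SUBREDDIT_URL);
  if (!match) return null;
  const subreddit = match[2].toLowerCase();
  // hasOwnProperty avoids accidental hits on inherited keys such as "constructor".
  const known = Object.prototype.hasOwnProperty.call(SUBREDDIT_CATEGORIES, subreddit);
  return known ? SUBREDDIT_CATEGORIES[subreddit] : "social";
}
```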
The Migration Pattern
Dexie makes schema migrations elegant:
this.version(5).stores({...}).upgrade(async tx => {
  // Dynamic import to avoid circular dependency
  const { extractKeywords, extractKeywordsFromUrl } = await import('./keywords')
  const visits = await tx.table('visits').toArray()
  for (const visit of visits) {
    // Merge title and URL keywords, dropping duplicates
    const newKeywords = [...new Set([
      ...extractKeywords(visit.title || ''),
      ...extractKeywordsFromUrl(visit.url)
    ])]
    await tx.table('visits').update(visit.id, { keywords: newKeywords })
  }
})
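Key gotcha: the keyword helpers are loaded with a dynamic import inside the upgrade callback, because a static import at the top of the file would create a circular dependency. The upgrade runs the next time the extension loads, and every existing visit gets its keywords re-extracted.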
What I Learned
Confidence scoring design: Each categorization layer now has a tuned confidence value that reflects its signal strength. The strongest layer (keywords with 2+ matches) gets 0.85-0.95, while the weak domain-lookup layer gets 0.55 and is always flagged for review. The needsCategoryReview flag tells the server which classifications are shaky.
URL fragment detection: Encoded URL fragments like %2F become 2f after tokenization and fuse with neighboring words, creating garbage keywords. The trick is detecting hex pairs at the start or end of a token, hex pairs embedded mid-token, and high digit ratios in longer strings.
Multi-file consistency: When the categorization logic changes, the Python validation notebook needs the same updates. Keeping the parallel TypeScript and Python implementations in sync requires discipline.
What's Next
- Rebuild and test the reversed layer order against real browsing data
- Validate categorization accuracy improvements in the Jupyter notebook
- Track how often each layer produces the final classification
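For the last item, a per-layer tally is probably enough to start with. The helper below is a hypothetical sketch; it assumes each classification records the name of the layer that produced it, which the post does not confirm.

```typescript
// Hypothetical helper: given the deciding layer for each classified visit,
// return the share of classifications each layer produced (0 to 1).
function layerShares(layersUsed: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const layer of layersUsed) {
    counts.set(layer, (counts.get(layer) ?? 0) + 1);
  }
  const shares = new Map<string, number>();
  counts.forEach((count, layer) => {
    shares.set(layer, count / layersUsed.length);
  });
  return shares;
}
```

If the domain-lookup layer still produces most of the final classifications, the earlier content layers are not firing often enough and the keyword lists need work.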