I spend a lot of time online. For years I've tried various productivity tools—screen time trackers, website blockers, focus apps. They all have the same approach: make you feel bad about where you spend your time, then help you avoid those places.
But some of my best learning happens in rabbit holes. I'll start reading about a JavaScript framework, end up on a Wikipedia article about type theory, detour through a blog post about compiler design, and emerge three hours later with a deeper understanding of programming. That's not a distraction—that's curiosity in action.
I wanted to build something different. Not a tool that judges my browsing, but one that maps it. Shows me the shape of my curiosity. Helps me understand how I explore, not just where I go.
The Fundamental Problem: YouTube Is Everything
After building the Topic Registry (which tracks what subjects you're spending time on), I hit a wall. The registry could tell me "you spent 2 hours on youtube.com today," but that's useless. YouTube could be:
- A programming tutorial (education)
- Cat videos (entertainment)
- Lo-fi music for focus (music/background)
- A journalism documentary (news)
The same problem exists everywhere. Is reddit.com social media or news? Depends entirely on which subreddit. Is github.com work or personal learning? Depends on what you're looking at.
Domain-level categorization is fundamentally broken for the modern web.
I spent days going in circles. AI for everything? Too expensive and slow. Simple domain lookups? Misses the nuance. Page content analysis? Invasive and complex.
The Breakthrough: Layered Confidence Scoring
What clicked was realizing that categorization doesn't have to be perfect—it has to be honest about uncertainty.
Some visits are easy: github.com is clearly technology (95% confidence). Some are ambiguous: a YouTube video titled "react hooks tutorial" is probably technology, but maybe it's satire or critique (65% confidence). And some are genuinely unknown until you check (20% confidence—flag for review).
The insight: Build a 5-layer categorization system where each layer has different confidence levels. Start with the cheapest, fastest methods. Only escalate to expensive AI analysis when needed.
Here's how it works:
Layer 1: Domain Lookup (95% Confidence)
Simple, fast, accurate. A static mapping of unambiguous domains:
// Static table of domains that map unambiguously to one category
const DOMAIN_MAP: Record<string, { category: string; confidence: number }> = {
  'github.com': { category: 'technology', confidence: 0.95 },
  'stackoverflow.com': { category: 'technology', confidence: 0.95 },
  'espn.com': { category: 'sports', confidence: 0.95 },
  'allrecipes.com': { category: 'food', confidence: 0.95 },
}
No ambiguity. Instant classification. No API calls.
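To keep the later snippets concrete, here's the result shape the layers share, plus the Layer 1 lookup itself. This is a sketch; the exact field names in MindCap may differ.

// The shape every layer returns (assumed from the examples in this post)
interface CategoryResult {
  category: string
  confidence: number
}

// Layer 1 is a plain map access: no parsing, no network
function lookupDomain(hostname: string): CategoryResult | null {
  return DOMAIN_MAP[hostname] ?? null
}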
Layer 2: Domain Patterns (80% Confidence)
TLD and subdomain rules that apply broadly:
- Anything ending in .edu → education
- Subdomains like docs.* or api.* → reference
- Domains with news in the name → news
Still fast, no external calls, but slightly less confident.
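Here's a minimal sketch of what those rules look like in code. The rule set is illustrative, not MindCap's full list:

// Layer 2: broad TLD and subdomain heuristics
function matchDomainPattern(hostname: string): CategoryResult | null {
  if (hostname.endsWith('.edu')) {
    return { category: 'education', confidence: 0.8 }
  }
  if (hostname.startsWith('docs.') || hostname.startsWith('api.')) {
    return { category: 'reference', confidence: 0.8 }
  }
  if (hostname.includes('news')) {
    return { category: 'news', confidence: 0.8 }
  }
  return null // no pattern matched; try the next layer
}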
Layer 3: URL Pattern Extraction (70-85% Confidence)
This is where it gets interesting. Platform-specific parsers that understand URL structure:
// Reddit: extract the subreddit
// reddit.com/r/programming → "programming" → technology
function parseReddit(url: string): CategoryResult | null {
  const match = url.match(/\/r\/([^\/]+)/)
  if (match) {
    const subreddit = match[1].toLowerCase()
    return categorizeSubreddit(subreddit) // mapping sketched below
  }
  return null // not a subreddit URL; fall through to the next layer
}
// YouTube: extract the video ID for a title lookup
// youtube.com/watch?v=xyz → video ID → page title → categorize
function parseYouTube(url: string): { videoId: string; needsTitleLookup: true } | null {
  const match = url.match(/[?&]v=([^&]+)/)
  if (match) {
    return { videoId: match[1], needsTitleLookup: true }
  }
  return null // not a watch URL
}
For Reddit, I built a subreddit mapping: r/programming → technology, r/movies → entertainment, r/worldnews → news. For YouTube, I extract the video ID and look up the title from page metadata.
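The mapping behind categorizeSubreddit is just another static table. A sketch with a few illustrative entries (the 0.85 confidence is an assumed value within this layer's 70-85% range):

const SUBREDDIT_MAP: Record<string, string> = {
  programming: 'technology',
  movies: 'entertainment',
  worldnews: 'news',
  // ... many more
}

function categorizeSubreddit(subreddit: string): CategoryResult | null {
  const category = SUBREDDIT_MAP[subreddit]
  return category ? { category, confidence: 0.85 } : null
}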
Layer 4: Keyword Matching (50-90% Confidence)
If the URL doesn't tell us enough, look at what we extracted from the page: title, meta description, headings, content keywords. Match against category keyword lists:
const CATEGORY_KEYWORDS: Record<string, string[]> = {
  technology: ['programming', 'code', 'software', 'javascript', 'python'],
  science: ['research', 'study', 'quantum', 'biology', 'physics'],
  news: ['breaking', 'report', 'updates', 'latest', 'headline'],
  // ... 11 more categories
}

function matchKeywords(text: string): CategoryResult | null {
  const words = text.toLowerCase().split(/\s+/)
  const scores: Record<string, number> = {}
  for (const [category, keywords] of Object.entries(CATEGORY_KEYWORDS)) {
    const matches = keywords.filter(kw => words.includes(kw)).length
    if (matches > 0) {
      scores[category] = matches / keywords.length
    }
  }
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1])
  if (ranked.length === 0) return null // nothing matched; fall through to Layer 5
  return { category: ranked[0][0], confidence: ranked[0][1] }
}
Confidence varies based on match quality. If the title is "Python Tutorial for Beginners," that's high confidence (90%). If it's "10 Things You Didn't Know," that's low confidence (50%).
Layer 5: Fallback (20% Confidence)
When all else fails, mark it "unknown" and flag it for server-side AI review using Claude. The server runs categorization on unknowns in batches, then returns corrections on the next sync.
This hybrid approach means the extension is fast (no waiting for AI), but gets smarter over time as the server teaches it.
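Putting it together, the categorizer is essentially a cascade: try each layer in cost order and take the first answer. A minimal sketch reusing the functions above (the real engine dispatches URL parsers by hostname, and parseYouTube is omitted because it needs an async title lookup first):

function categorize(url: string, pageText: string): CategoryResult {
  const hostname = new URL(url).hostname

  const result =
    lookupDomain(hostname) ??                                       // Layer 1
    matchDomainPattern(hostname) ??                                 // Layer 2
    (hostname.endsWith('reddit.com') ? parseReddit(url) : null) ??  // Layer 3
    matchKeywords(pageText)                                         // Layer 4

  // Layer 5: mark unknown; flagged for server-side AI review on the next sync
  return result ?? { category: 'unknown', confidence: 0.2 }
}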
The 14-Category Taxonomy
After a lot of iteration, I settled on exactly 14 categories. Not too granular (nobody needs 50 categories), not too broad (lumping gaming into entertainment loses signal):
technology science news entertainment
gaming shopping education health
finance travel food sports
social reference
Each category has clear boundaries and matches how I actually think about my browsing. When MindCap tells me "you spent 3 hours on technology this week," that means something. When it says "gaming doubled from last week," that's actionable.
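In TypeScript, the taxonomy can be a closed union type, so the compiler rejects typos anywhere a category gets assigned. A sketch:

type Category =
  | 'technology' | 'science' | 'news' | 'entertainment'
  | 'gaming' | 'shopping' | 'education' | 'health'
  | 'finance' | 'travel' | 'food' | 'sports'
  | 'social' | 'reference'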
What This Enables: Pattern Detection
All this infrastructure was building toward one thing: pattern detection. Now that every page visit has a category, I can detect behavioral patterns that raw time tracking can't see.
Here are 9 patterns I'm implementing:
| Pattern | What It Detects |
|---|---|
| recurring_interest | Topics you keep returning to |
| growing_interest | Interests accelerating week-over-week |
| paused_exploration | Topics that have gone quiet |
| passive_browsing | High time + low engagement |
| unanswered_question | Repeated searches without resolution |
| rabbit_hole | Maps curiosity flow and branching |
| temporal_pattern | Behavior tied to time of day/week |
| learning_style | Content type preferences |
| exploration_curiosity | Tentative interest, peeking without diving |
For example, growing_interest compares your weekly time in each category over the past 12 weeks. If "science" was 5% of your browsing two months ago and 20% now, that's worth surfacing. If "social" dropped from 30% to 5%, maybe you've been more focused lately.
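A sketch of that comparison (the window sizes and the 2x threshold are assumptions, not MindCap's tuned values):

// weeklyShare[i][category] = fraction of week i's browsing time in that category
function isGrowingInterest(
  weeklyShare: Record<string, number>[], // oldest week first, 12 weeks
  category: string,
): boolean {
  if (weeklyShare.length < 12) return false
  const avg = (weeks: Record<string, number>[]) =>
    weeks.reduce((sum, w) => sum + (w[category] ?? 0), 0) / weeks.length
  const earlier = avg(weeklyShare.slice(0, 4))
  const recent = avg(weeklyShare.slice(-4))
  // e.g. science going from 5% to 20% of browsing clears the 2x bar
  return recent > 0.05 && recent >= 2 * earlier
}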
The one I care most about is rabbit_hole. Instead of warning "you went down a rabbit hole," it maps the journey:
Example output: "You started in technology, branched into science, touched on history, and ended up in philosophy. High spread, medium coherence—a wandering journey." That's not judgment. That's a map.
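Here's a sketch of how a journey summary like that could be derived from a categorized visit sequence. The spread and coherence definitions are my illustration, not necessarily MindCap's exact metrics:

function summarizeJourney(visits: { category: string }[]) {
  // Collapse consecutive visits in the same category into single steps
  const path: string[] = []
  for (const v of visits) {
    if (path[path.length - 1] !== v.category) path.push(v.category)
  }
  // Spread: how many distinct categories the journey touched
  const spread = new Set(path).size
  // Coherence: fraction of steps that returned to an already-visited category
  const revisits = path.filter((c, i) => path.slice(0, i).includes(c)).length
  const coherence = path.length > 1 ? revisits / (path.length - 1) : 1
  return { path, spread, coherence }
}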
Architecture: Client-First with Server Validation
The system works entirely client-side for speed, with server corrections for accuracy.
In the extension (TypeScript):
- category-data.ts — 100+ domain mappings, subreddit mappings, keyword lists
- url-parsers.ts — Platform-specific parsers for YouTube, Reddit, GitHub, Wikipedia, etc.
- topic-categorizer.ts — The 5-layer categorization engine
- background.ts — Integrates categorization into visit capture
- db.ts — IndexedDB schema with confidence fields
On the server (Python/FastAPI):
- topic_registry.py — Aggregates topics per user, now with category-level stats
- claude.py — AI classification aligned to the same 14 categories
- sync.py — Returns category corrections when the server knows better
The extension categorizes instantly. The server validates in the background. Over time, the extension gets smarter as it learns from corrections. It's about 1,500 lines of new code, carefully designed to work together.
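The correction side is simple in principle: when the server's AI is more confident than the local guess, the extension overwrites it. A sketch of applying a sync response (the record shape and names are assumptions):

// Assumed: a Dexie-style wrapper exported from db.ts
declare const db: {
  visits: {
    get(id: string): Promise<{ category: string; confidence: number } | undefined>
    update(id: string, changes: object): Promise<number>
  }
}

interface CategoryCorrection {
  visitId: string
  category: string
  confidence: number // server-side AI confidence
}

async function applyCorrections(corrections: CategoryCorrection[]) {
  for (const c of corrections) {
    const visit = await db.visits.get(c.visitId)
    // Only overwrite when the server is more confident than the local layer was
    if (visit && c.confidence > visit.confidence) {
      await db.visits.update(c.visitId, {
        category: c.category,
        confidence: c.confidence,
      })
    }
  }
}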
What I Learned Building This
This project forced me to think about attention and behavior. Some notes:
Rabbit Holes Aren't Bad
They're a form of exploration. The question isn't "how do I avoid rabbit holes?" but "what kind of rabbit holes do I go down, and are they serving me?" A rabbit hole through Wikipedia's list of cognitive biases? Probably useful. A rabbit hole through celebrity gossip? Maybe not.
Engagement ≠ Value
I can scroll Reddit for 2 hours with zero engagement (passive consumption) or spend 30 minutes deep in documentation with high engagement (active learning). Time alone doesn't tell the story. MindCap tracks both.
Curiosity Has a Shape
Some people go deep on one topic (coherent deep dive). Others wander across many topics but stay loosely connected (wandering journey). Others bounce randomly (tangent hopper). None of these is wrong—they're just different exploration styles.
Categories Are a Simplification, But a Useful One
Real interests don't fit in boxes. But tracking time across 14 categories gives you a rough map of where your attention goes, which is better than no map at all.
What's Next
Tomorrow I start implementing the pattern detector. One pattern at a time, so I understand each one. By the end, I'll have a system that can look at someone's browsing and surface what they're curious about, how their interests are evolving, and the shape of their exploration.
Not a productivity guilt machine. Not a simple time tracker. A curiosity mapper.
Today the pieces fit together. The confidence scoring solves the multi-category domain problem elegantly. The client-server hybrid gives speed and accuracy. The category aggregates set up pattern detection perfectly. I'm building something that doesn't exist yet.