
Mapping Curiosity: Building MindCap's Categorization Engine

January 27, 2026
A tool that celebrates rabbit holes instead of shaming them.

I spend a lot of time online. For years I've tried various productivity tools—screen time trackers, website blockers, focus apps. They all have the same approach: make you feel bad about where you spend your time, then help you avoid those places.

But some of my best learning happens in rabbit holes. I'll start reading about a JavaScript framework, end up on a Wikipedia article about type theory, detour through a blog post about compiler design, and emerge three hours later with a deeper understanding of programming. That's not a distraction—that's curiosity in action.

I wanted to build something different. Not a tool that judges my browsing, but one that maps it. Shows me the shape of my curiosity. Helps me understand how I explore, not just where I go.

The Fundamental Problem: YouTube Is Everything

After building the Topic Registry (which tracks what subjects you're spending time on), I hit a wall. The registry could tell me "you spent 2 hours on youtube.com today," but that's useless. Depending on the video, YouTube could be entertainment, education, technology, news, or gaming.

The same problem exists everywhere. Is reddit.com social media or news? Depends entirely on which subreddit. Is github.com work or personal learning? Depends on what you're looking at.

Domain-level categorization is fundamentally broken for the modern web.

I spent days going in circles. AI for everything? Too expensive and slow. Simple domain lookups? Misses the nuance. Page content analysis? Invasive and complex.

The Breakthrough: Layered Confidence Scoring

What clicked was realizing that categorization doesn't have to be perfect—it has to be honest about uncertainty.

Some visits are easy: github.com is clearly technology (95% confidence). Some are ambiguous: a YouTube video titled "react hooks tutorial" is probably technology, but maybe it's satire or critique (65% confidence). And some are genuinely unknown until you check (20% confidence—flag for review).

The insight: Build a 5-layer categorization system where each layer has different confidence levels. Start with the cheapest, fastest methods. Only escalate to expensive AI analysis when needed.

Here's how it works:

Layer 1: Domain Lookup (95% Confidence)

Simple, fast, accurate. A static mapping of unambiguous domains:

```typescript
const DOMAIN_MAP = {
  'github.com': { category: 'technology', confidence: 0.95 },
  'stackoverflow.com': { category: 'technology', confidence: 0.95 },
  'espn.com': { category: 'sports', confidence: 0.95 },
  'allrecipes.com': { category: 'food', confidence: 0.95 },
}
```

No ambiguity. Instant classification. No API calls.
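A minimal sketch of what the Layer 1 entry point might look like (the `lookupDomain` name and the hostname normalization are my assumptions, not MindCap's actual code): parse the URL, strip a leading `www.`, and try an exact map lookup.

```typescript
type CategoryResult = { category: string; confidence: number }

const DOMAIN_MAP: Record<string, CategoryResult> = {
  'github.com': { category: 'technology', confidence: 0.95 },
  'espn.com': { category: 'sports', confidence: 0.95 },
}

// Hypothetical Layer 1 entry point: normalize the hostname
// (drop a leading "www.") and do an exact lookup in the map.
function lookupDomain(url: string): CategoryResult | null {
  const host = new URL(url).hostname.replace(/^www\./, '')
  return DOMAIN_MAP[host] ?? null
}
```

Returning `null` on a miss lets the caller fall through to the next layer instead of guessing.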

Layer 2: Domain Patterns (80% Confidence)

TLD and subdomain rules that apply broadly, like treating .edu domains as education.

Still fast, no external calls, but slightly less confident.
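Layer 2 could be sketched as a small rule table (the specific rules and the `matchDomainPattern` name here are illustrative assumptions, not MindCap's actual list):

```typescript
type CategoryResult = { category: string; confidence: number }

// Illustrative pattern rules: match on TLD or subdomain. The exact
// rules are assumptions for the sketch, not MindCap's real table.
const DOMAIN_PATTERNS: Array<{ test: RegExp; category: string }> = [
  { test: /\.edu$/, category: 'education' },
  { test: /\.gov$/, category: 'reference' },
  { test: /^docs\./, category: 'technology' },
]

function matchDomainPattern(hostname: string): CategoryResult | null {
  for (const rule of DOMAIN_PATTERNS) {
    if (rule.test.test(hostname)) {
      // Broad rules get the flat 80% confidence for this layer.
      return { category: rule.category, confidence: 0.8 }
    }
  }
  return null // no rule matched; fall through to Layer 3
}
```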

Layer 3: URL Pattern Extraction (70-85% Confidence)

This is where it gets interesting. Platform-specific parsers that understand URL structure:

```typescript
// Reddit: extract subreddit
// reddit.com/r/programming → "programming" → technology
function parseReddit(url: string): CategoryResult | null {
  const match = url.match(/\/r\/([^\/]+)/)
  if (match) {
    const subreddit = match[1].toLowerCase()
    return categorizeSubreddit(subreddit)
  }
  return null // not a subreddit URL; fall through to the next layer
}

// YouTube: extract video ID for title lookup
// youtube.com/watch?v=xyz → extract video ID → get title → categorize
function parseYouTube(url: string): { videoId: string; needsTitleLookup: true } | null {
  const match = url.match(/[?&]v=([^&]+)/)
  if (match) {
    return { videoId: match[1], needsTitleLookup: true }
  }
  return null // no video ID in the URL; fall through
}
```

For Reddit, I built a subreddit mapping: r/programming → technology, r/movies → entertainment, r/worldnews → news. For YouTube, I extract the video ID and look up the title from page metadata.
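The subreddit mappings named above might be wired up like this (the map contents come from the post; the shape of `categorizeSubreddit` is my sketch):

```typescript
type CategoryResult = { category: string; confidence: number }

// Subreddit → category examples from the post; a real map
// would cover many more communities.
const SUBREDDIT_MAP: Record<string, string> = {
  programming: 'technology',
  movies: 'entertainment',
  worldnews: 'news',
}

function categorizeSubreddit(subreddit: string): CategoryResult {
  const category = SUBREDDIT_MAP[subreddit]
  return category
    ? { category, confidence: 0.85 } // known community: high end of Layer 3
    : { category: 'unknown', confidence: 0.2 } // unmapped: Layer 5 fallback
}
```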

Layer 4: Keyword Matching (50-90% Confidence)

If the URL doesn't tell us enough, look at what we extracted from the page: title, meta description, headings, content keywords. Match against category keyword lists:

```typescript
const CATEGORY_KEYWORDS = {
  technology: ['programming', 'code', 'software', 'javascript', 'python'],
  science: ['research', 'study', 'quantum', 'biology', 'physics'],
  news: ['breaking', 'report', 'updates', 'latest', 'headline'],
  // ... 11 more categories
}

function matchKeywords(text: string): CategoryResult | null {
  const words = text.toLowerCase().split(/\s+/)
  const scores: Record<string, number> = {}

  for (const [category, keywords] of Object.entries(CATEGORY_KEYWORDS)) {
    const matches = keywords.filter(kw => words.includes(kw)).length
    if (matches > 0) {
      scores[category] = matches / keywords.length
    }
  }

  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1])
  if (ranked.length === 0) return null // nothing matched; fall through to Layer 5

  const [category, confidence] = ranked[0]
  return { category, confidence }
}
```

Confidence varies based on match quality. If the title is "Python Tutorial for Beginners," that's high confidence (90%). If it's "10 Things You Didn't Know," that's low confidence (50%).

Layer 5: Fallback (20% Confidence)

When all else fails, mark it "unknown" and flag it for server-side AI review using Claude. The server runs categorization on unknowns in batches, then returns corrections on the next sync.

This hybrid approach means the extension is fast (no waiting for AI), but gets smarter over time as the server teaches it.
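Putting the five layers together, the dispatcher is conceptually simple (this is a sketch under my assumptions about the interfaces, not the actual MindCap code): try each layer in order of cost, take the first answer, and fall through to the 20%-confidence "unknown" that gets queued for server review.

```typescript
type CategoryResult = { category: string; confidence: number }
type Layer = (url: string, pageText: string) => CategoryResult | null

// Hypothetical dispatcher: run the cheap layers first and return the
// first non-null answer. If every layer misses, emit the Layer 5
// fallback, which the server later re-categorizes with AI.
function categorize(url: string, pageText: string, layers: Layer[]): CategoryResult {
  for (const layer of layers) {
    const result = layer(url, pageText)
    if (result) return result
  }
  return { category: 'unknown', confidence: 0.2 }
}
```

The ordering is the whole trick: a `github.com` visit never pays for keyword matching, and only genuinely ambiguous visits ever reach the server.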

The 14-Category Taxonomy

After a lot of iteration, I settled on exactly 14 categories. Not too granular (nobody needs 50 categories), not too broad (lumping gaming into entertainment loses signal):

```text
technology   science    news       entertainment
gaming       shopping   education  health
finance      travel     food       sports
social       reference
```

Each category has clear boundaries and matches how I actually think about my browsing. When MindCap tells me "you spent 3 hours on technology this week," that means something. When it says "gaming doubled from last week," that's actionable.
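In TypeScript, a taxonomy this small can be a closed union type, so the compiler rejects any category string outside the fourteen (how MindCap actually encodes it is not stated; this is one natural way):

```typescript
// The 14 categories as a closed union: any other string is a type error.
type Category =
  | 'technology' | 'science' | 'news' | 'entertainment'
  | 'gaming' | 'shopping' | 'education' | 'health'
  | 'finance' | 'travel' | 'food' | 'sports'
  | 'social' | 'reference'

// Runtime list for iteration (aggregation, pattern detection, UI).
const ALL_CATEGORIES: Category[] = [
  'technology', 'science', 'news', 'entertainment',
  'gaming', 'shopping', 'education', 'health',
  'finance', 'travel', 'food', 'sports',
  'social', 'reference',
]
```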

What This Enables: Pattern Detection

All this infrastructure was building toward one thing: pattern detection. Now that every page visit has a category, I can detect behavioral patterns that raw time tracking can't see.

Here are 9 patterns I'm implementing:

| Pattern | What it detects |
| --- | --- |
| recurring_interest | Topics you keep returning to |
| growing_interest | Interests accelerating week-over-week |
| paused_exploration | Topics that have gone quiet |
| passive_browsing | High time + low engagement |
| unanswered_question | Repeated searches without resolution |
| rabbit_hole | Maps curiosity flow and branching |
| temporal_pattern | Behavior tied to time of day/week |
| learning_style | Content type preferences |
| exploration_curiosity | Tentative interest, peeking without diving |

For example, growing_interest compares your weekly time in each category over the past 12 weeks. If "science" was 5% of your browsing two months ago and 20% now, that's worth surfacing. If "social" dropped from 30% to 5%, maybe you've been more focused lately.
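A minimal sketch of that comparison, assuming the input is one category's weekly share of total browsing time, oldest week first (the function name and the 2x threshold are illustrative, not MindCap's actual tuning):

```typescript
// growing_interest sketch: compare the average share in the older
// half of the window against the recent half. Flag the category if
// the recent share is at least `factor` times the earlier one.
function isGrowingInterest(weeklyShares: number[], factor = 2): boolean {
  if (weeklyShares.length < 4) return false // too little history to call a trend
  const half = Math.floor(weeklyShares.length / 2)
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length
  const early = mean(weeklyShares.slice(0, half))
  const recent = mean(weeklyShares.slice(-half))
  return early > 0 && recent / early >= factor
}
```

On the post's example, a category going from a 5% share to a 20% share over the window clears the threshold easily.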

The one I care most about is rabbit_hole. Instead of warning "you went down a rabbit hole," it maps the journey:

Example output: "You started in technology, branched into science, touched on history, and ended up in philosophy. High spread, medium coherence—a wandering journey." That's not judgment. That's a map.
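One simple way to compute "spread" and "coherence" from a journey's category sequence (both definitions here are my assumptions for illustration; the post doesn't specify MindCap's metrics): spread is the count of distinct categories touched, and coherence is the fraction of steps that stayed in the same category.

```typescript
// Illustrative rabbit-hole summary. spread: distinct categories in
// the journey. coherence: share of consecutive steps that stayed
// within one category (1.0 = never left, 0.0 = changed every step).
function summarizeJourney(categories: string[]) {
  const spread = new Set(categories).size
  let sameCategorySteps = 0
  for (let i = 1; i < categories.length; i++) {
    if (categories[i] === categories[i - 1]) sameCategorySteps++
  }
  const coherence =
    categories.length > 1 ? sameCategorySteps / (categories.length - 1) : 1
  return { spread, coherence }
}
```

The technology → science → history → philosophy journey from the example scores a spread of 4 with zero same-category steps: a wandering journey.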

Architecture: Client-First with Server Validation

The system works entirely client-side for speed, with server corrections for accuracy.

In the extension (TypeScript): Layers 1 through 4 run locally on every page visit, so classification is instant and never waits on the network.

On the server (Python/FastAPI): low-confidence "unknown" visits are batched, re-categorized with Claude, and the corrections come back on the next sync.

The extension categorizes instantly. The server validates in the background. Over time, the extension gets smarter as it learns from corrections. It's about 1,500 lines of new code, carefully designed to work together.

What I Learned Building This

This project forced me to think about attention and behavior. Some notes:

Rabbit Holes Aren't Bad

They're a form of exploration. The question isn't "how do I avoid rabbit holes?" but "what kind of rabbit holes do I go down, and are they serving me?" A rabbit hole through Wikipedia's list of cognitive biases? Probably useful. A rabbit hole through celebrity gossip? Maybe not.

Engagement ≠ Value

I can scroll Reddit for 2 hours with zero engagement (passive consumption) or spend 30 minutes deep in documentation with high engagement (active learning). Time alone doesn't tell the story. MindCap tracks both.

Curiosity Has a Shape

Some people go deep on one topic (coherent deep dive). Others wander across many topics but stay loosely connected (wandering journey). Others bounce randomly (tangent hopper). None of these is wrong—they're just different exploration styles.

Categories Are a Simplification, But a Useful One

Real interests don't fit in boxes. But tracking time across 14 categories gives you a rough map of where your attention goes, which is better than no map at all.

What's Next

Tomorrow I start implementing the pattern detector. One pattern at a time, so I understand each one. By the end, I'll have a system that can look at someone's browsing and surface what they're curious about, how their interests are evolving, the shape of their exploration.

Not a productivity guilt machine. Not a simple time tracker. A curiosity mapper.

Today the pieces fit together. The confidence scoring solves the multi-category domain problem elegantly. The client-server hybrid gives speed and accuracy. The category aggregates set up pattern detection perfectly. I'm building something that doesn't exist yet.

Jen Kim

Developer, Claude Whisperer. Building tools for curiosity, creativity, and chaos.