
Firefox History Crawler: Digital Archaeology for the Chronically Curious

January 2025 · 6 min read

"What did I even do on the internet last year?"

It started with a question I couldn't shake: what have I been looking at online for the past few years?

Not surveillance—more like digital archaeology. I wanted to understand my own browsing patterns. What rabbit holes did I fall into? What articles did I read at 3am that I've forgotten? What was that YouTube video about medieval baking that I've been trying to find for months?

Turns out, Firefox keeps good records. Your browser remembers everything. So I built a tool to dig through it.

The Problem: 56,000 URLs and No Context

Firefox stores your browsing history in a SQLite database called places.sqlite. It's just sitting there in your profile folder, quietly accumulating years of digital breadcrumbs.

My database had over 56,000 URLs. But the database only stores raw URLs—no titles, no descriptions, no context about what you were looking at.
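
If you want to peek at your own, the extraction is one short SQLite query against the moz_places table. A minimal sketch (the profile folder name below is a placeholder; yours will differ, and so will the path on macOS or Windows):

python
import shutil
import sqlite3
from pathlib import Path

# Linux default location; the folder name is a placeholder -- every
# Firefox install gets its own randomized prefix.
profile = Path.home() / ".mozilla/firefox/xxxxxxxx.default-release"

# Work on a copy: Firefox locks the live database while it's running.
shutil.copy2(profile / "places.sqlite", "places_copy.sqlite")

conn = sqlite3.connect("places_copy.sqlite")
rows = conn.execute(
    "SELECT url, visit_count, last_visit_date "
    "FROM moz_places WHERE url LIKE 'http%'"
).fetchall()
conn.close()

print(f"Loaded {len(rows):,} URLs")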

"Was https://example.com/article/12345 the life-changing essay about productivity, or was it a recipe for sourdough that I never made?"

I needed to crawl all those URLs to get the content: title tags, meta descriptions, Open Graph data. And while I was at it, social media metadata too—YouTube video IDs, Twitter usernames, Reddit posts.

🦊 The Firefox History Crawler was born: a Python script that extracts your entire Firefox history, crawls every URL to fetch metadata, and outputs a beautifully enriched dataset you can actually use.

Building an Async Web Crawler (That's Polite)

The first version was slow. Crawling 56,000 URLs one at a time would take days.

asyncio and aiohttp solved that—Python's async capabilities let me crawl 100 URLs simultaneously. But I didn't want to accidentally DDoS anyone's server.
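
The shape of it is simple: one shared session, plus a semaphore that caps how many requests are in flight at once. A stripped-down sketch (the function names are mine, not the script's):

python
import asyncio
import aiohttp

async def fetch(session, sem, url):
    # The semaphore is what caps concurrency at 100.
    async with sem:
        try:
            async with session.get(url) as resp:
                return url, await resp.text()
        except Exception:
            return url, None  # dead links are normal in years-old history

async def crawl(urls, max_concurrent=100):
    sem = asyncio.Semaphore(max_concurrent)
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

results = asyncio.run(crawl(["https://example.com", "https://example.org"]))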

Being a Good Internet Citizen

The crawler includes several politeness features:

python
CONFIG = {
    'delay_per_domain': 0.5,      # Seconds between requests to same domain
    'max_concurrent': 100,        # Parallel connections (be nice!)
    'max_retries': 3,             # Attempts before giving up
    'timeout': 5,                 # Seconds to wait for slow servers
    'checkpoint_interval': 500,   # Save progress every N URLs
}

With these settings, crawling 56,000 URLs takes about 1.5 hours. Fast enough to be useful, slow enough to be respectful.
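
The delay_per_domain setting does most of the politeness work: before each request, check when that domain was last hit and sleep out the remainder. Roughly like this (again a sketch, not the script verbatim):

python
import asyncio
import time
from urllib.parse import urlparse

last_hit = {}       # domain -> monotonic time of its last request
domain_locks = {}   # domain -> lock, so concurrent tasks don't race

async def respect_delay(url, delay=0.5):
    """Keep requests to the same domain at least `delay` seconds apart."""
    domain = urlparse(url).netloc
    lock = domain_locks.setdefault(domain, asyncio.Lock())
    async with lock:
        wait = delay - (time.monotonic() - last_hit.get(domain, 0.0))
        if wait > 0:
            await asyncio.sleep(wait)
        last_hit[domain] = time.monotonic()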

Extracting the Good Stuff

For each URL, the crawler fetches the page and pulls out the title tag, the meta description, and any Open Graph data.
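
With BeautifulSoup and the lxml parser (both in the dependency list below), that part is a short routine. A sketch of the idea:

python
from bs4 import BeautifulSoup

def extract_metadata(html):
    soup = BeautifulSoup(html, "lxml")
    meta = {}
    if soup.title and soup.title.string:
        meta["title"] = soup.title.string.strip()
    desc = soup.find("meta", attrs={"name": "description"})
    if desc and desc.get("content"):
        meta["description"] = desc["content"].strip()
    # Open Graph tags: og:title, og:description, og:image, ...
    for tag in soup.find_all("meta", attrs={"property": lambda p: p and p.startswith("og:")}):
        if tag.get("content"):
            meta[tag["property"]] = tag["content"]
    return meta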

Social Media Parsing

The script recognizes URLs from major platforms and extracts structured data:
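
A simplified sketch for the three platforms mentioned earlier (real URLs come in more shapes than these regexes cover):

python
import re

# Simplified patterns -- real URLs have many more variants (youtu.be,
# mobile subdomains, extra query parameters, and so on).
PATTERNS = {
    "youtube_video_id": re.compile(r"youtube\.com/watch\?v=([\w-]{11})"),
    "twitter_username": re.compile(r"//(?:www\.)?(?:twitter|x)\.com/(\w{1,15})(?:[/?]|$)"),
    "reddit_post_id":   re.compile(r"reddit\.com/r/\w+/comments/(\w+)"),
}

def parse_social(url):
    for field, pattern in PATTERNS.items():
        match = pattern.search(url)
        if match:
            return {field: match.group(1)}
    return {}

print(parse_social("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
# -> {'youtube_video_id': 'dQw4w9WgXcQ'}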

This means instead of just knowing you visited youtube.com/watch?v=dQw4w9WgXcQ, you know it was video dQw4w9WgXcQ—which you can look up, analyze, or discover you got Rickrolled 47 times in 2024.

The Joy of Checkpoints

Long-running scripts need checkpoints. I learned this when my laptop went to sleep after an hour of crawling. All progress lost.

Now the crawler saves its state every 500 URLs. If it crashes, your internet dies, or you need to restart—just run it again and it picks up where it left off.
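
The mechanics are simple: periodically dump everything collected so far to disk, and on startup skip anything already in the file. A sketch of that shape (the file name and format here are my assumptions):

python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # name is my assumption, not the script's

def load_checkpoint():
    """Return everything crawled so far, or an empty dict on a fresh run."""
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def save_checkpoint(results):
    # Write to a temp file, then rename over the old checkpoint, so a
    # crash mid-write can't leave a corrupt file behind.
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps(results))
    tmp.replace(CHECKPOINT)

results = load_checkpoint()
urls = ["https://example.com", "https://example.org"]
todo = [u for u in urls if u not in results]  # resume where we left off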

text
============================================================
FIREFOX HISTORY EXTRACTION PIPELINE
============================================================
Output folder: crawl_output/

Loading Firefox history...
  Loaded 56,847 URLs from Firefox history
  Combined 1,234 duplicate URLs (visit counts summed)
  Final URL count: 55,613 unique URLs

Crawling URLs: 100%|████████████████| 55613/55613 [1:42:30<00:00]

[Checkpoint saved every 500 URLs, don't worry!]
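
That "combined duplicates" line is one small pandas step: group by URL and sum the visit counts. Something like:

python
import pandas as pd

history = pd.DataFrame({
    "url": ["https://a.com", "https://a.com", "https://b.com"],
    "visit_count": [3, 2, 7],
})

# Collapse duplicate URLs; their visit counts add up.
deduped = history.groupby("url", as_index=False)["visit_count"].sum()
print(len(history), "->", len(deduped), "rows")  # 3 -> 2 rows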

What I Learned From My Own Data

Running this on my own history was illuminating.

What I found:

More practically, I was finally able to find that medieval baking video. It was from Tasting History. Of course it was.

The Technical Bits (For the Curious)

If you want to run this yourself, the setup is pretty simple:

bash
# Install dependencies
pip install aiohttp pandas beautifulsoup4 lxml tqdm

# Run the crawler (close Firefox first, or don't - it handles both)
python firefox_history_crawler.py

You'll get several output files:

Limitations (Keeping It Real)

Not everything works perfectly:

What's Next?

This crawler was the starting point for a bigger project: MindCap, a browser extension that tracks attention patterns in real-time. The Firefox History Crawler gave me historical data to analyze; MindCap will give me ongoing insights.

If you want to understand your own browsing patterns, or just want to find that article you read at 3am six months ago, give the crawler a try. It's open source and polite to servers.

Try it yourself

Download the script and explore your own browsing history.

🦊 View Project Page
Jen Kim

Developer, Claude Whisperer. Building tools for curiosity, creativity, and chaos.
