It started with a question I couldn't shake: what have I been looking at online for the past few years?
Not surveillance—more like digital archaeology. I wanted to understand my own browsing patterns. What rabbit holes did I fall into? What articles did I read at 3am that I've forgotten? What was that YouTube video about medieval baking that I've been trying to find for months?
Turns out, Firefox keeps good records. Your browser remembers everything. So I built a tool to dig through it.
The Problem: 56,000 URLs and No Context
Firefox stores your browsing history in a SQLite database called places.sqlite. It's just sitting there in your profile folder, quietly accumulating years of digital breadcrumbs.
My database had over 56,000 URLs. But on its own that table is little more than raw URLs, visit counts, and timestamps: no descriptions, no page content, no real context about what you were actually looking at.
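If you just want to peek at the raw data before crawling anything, the query side is tiny. Here's a minimal sketch of reading Firefox's `moz_places` table with the standard library; the profile path is a placeholder, and copying the database first is one way around the lock Firefox holds while it's running (the actual script's extraction step may differ):

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Placeholder: point this at your own profile's places.sqlite.
PLACES = Path("/path/to/your/firefox/profile/places.sqlite")

# Firefox locks the live database, so work on a copy.
with tempfile.TemporaryDirectory() as tmp:
    db_copy = Path(tmp) / "places.sqlite"
    shutil.copy2(PLACES, db_copy)

    con = sqlite3.connect(db_copy)
    rows = con.execute(
        """
        SELECT url, title, visit_count, last_visit_date
        FROM moz_places
        WHERE visit_count > 0
        ORDER BY visit_count DESC
        """
    ).fetchall()
    con.close()

print(f"{len(rows)} URLs in history")
for url, title, visits, _ in rows[:10]:
    # title is whatever Firefox recorded; the crawl fills in the rest
    print(visits, url, title or "(no title)")
```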
"Was https://example.com/article/12345 the life-changing essay about productivity, or was it a recipe for sourdough that I never made?"
I needed to crawl all those URLs to get the content: title tags, meta descriptions, Open Graph data. And while I was at it, social media metadata too—YouTube video IDs, Twitter usernames, Reddit posts.
🦊 The Firefox History Crawler was born: a Python script that extracts your entire Firefox history, crawls every URL to fetch metadata, and outputs a beautifully enriched dataset you can actually use.
Building an Async Web Crawler (That's Polite)
The first version was slow. Crawling 56,000 URLs one at a time would take days.
asyncio and aiohttp solved that—Python's async capabilities let me crawl 100 URLs simultaneously. But I didn't want to accidentally DDoS anyone's server.
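The core pattern is small: one shared semaphore caps how many requests are in flight at once, and asyncio.gather drives the rest. A stripped-down sketch (the function names are illustrative, not the script's actual internals):

```python
import asyncio
import aiohttp

MAX_CONCURRENT = 100  # how many requests may be in flight at once

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str | None:
    # The semaphore is what keeps "100 at a time" from becoming "all 56,000 at once".
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None

async def crawl(urls: list[str]) -> list[str | None]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, url) for url in urls))

# pages = asyncio.run(crawl(["https://example.com", "https://www.wikipedia.org"]))
```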
Being a Good Internet Citizen
The crawler includes several politeness features (there's a sketch of how they plug in after the config below):
- Per-domain delays — 0.5 seconds between requests to the same domain
- Max 3 connections per host — don't overwhelm any single server
- Honest user agent — we identify ourselves clearly
- Graceful retries — 3 attempts before giving up, with backoff
```python
CONFIG = {
    'delay_per_domain': 0.5,     # Seconds between requests to same domain
    'max_concurrent': 100,       # Parallel connections (be nice!)
    'max_retries': 3,            # Attempts before giving up
    'timeout': 5,                # Seconds to wait for slow servers
    'checkpoint_interval': 500,  # Save progress every N URLs
}
```
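Here's roughly one way those settings get applied per request. This is a sketch of the idea rather than the script's exact code: the helper name `polite_get` and the user-agent string are mine, and it assumes the `CONFIG` dict above is in scope.

```python
import asyncio
import time
from urllib.parse import urlparse

import aiohttp

last_hit: dict[str, float] = {}             # domain -> time of the last request
domain_locks: dict[str, asyncio.Lock] = {}  # one lock per domain

async def polite_get(session: aiohttp.ClientSession, url: str) -> str | None:
    domain = urlparse(url).netloc
    lock = domain_locks.setdefault(domain, asyncio.Lock())

    for attempt in range(CONFIG['max_retries']):
        # Space out requests to the same domain by delay_per_domain seconds.
        async with lock:
            wait = CONFIG['delay_per_domain'] - (time.monotonic() - last_hit.get(domain, 0.0))
            if wait > 0:
                await asyncio.sleep(wait)
            last_hit[domain] = time.monotonic()
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=CONFIG['timeout'])) as resp:
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            await asyncio.sleep(2 ** attempt)  # back off before retrying
    return None

# The per-host connection cap and the honest user agent live on the session:
# aiohttp.ClientSession(
#     connector=aiohttp.TCPConnector(limit_per_host=3),
#     headers={"User-Agent": "firefox-history-crawler (personal research project)"},
# )
```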
With these settings, crawling 56,000 URLs takes about 1.5 hours. Fast enough to be useful, slow enough to be respectful.
Extracting the Good Stuff
For each URL, the crawler extracts the following (sketched in code right after the list):
- Basic metadata — title, description, word count
- Open Graph data — og:title, og:description, og:image
- Content snippets — the first ~15,000 characters of actual text
- Platform-specific IDs — this is where it gets fun
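Before getting to the fun part, the basic extraction boils down to a handful of BeautifulSoup lookups per page. A minimal sketch; the field names here are mine, not necessarily the script's:

```python
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    def meta(**attrs) -> str | None:
        # Find a <meta> tag by its attributes and return its content, if any.
        tag = soup.find("meta", attrs=attrs)
        return tag.get("content") if tag else None

    text = soup.get_text(separator=" ", strip=True)
    return {
        "title": soup.title.string.strip() if soup.title and soup.title.string else None,
        "description": meta(name="description"),
        "og_title": meta(property="og:title"),
        "og_description": meta(property="og:description"),
        "og_image": meta(property="og:image"),
        "word_count": len(text.split()),
        "snippet": text[:15000],  # first ~15,000 characters of visible text
    }
```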
Social Media Parsing
The script recognizes URLs from major platforms and extracts structured data (a sample parser follows the list):
- YouTube — video ID, channel, playlist, shorts
- Twitter/X — tweet ID, username
- Reddit — post ID, subreddit, username
- GitHub — repo, issue/PR number, file path
- Instagram, TikTok, LinkedIn, Twitch, Spotify...
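Most of this is pattern matching on the URL itself, no extra crawling required. Here's roughly what the YouTube and Reddit cases might look like; the patterns are illustrative and don't cover every URL shape the real script handles:

```python
import re
from urllib.parse import urlparse, parse_qs

def parse_platform(url: str) -> dict:
    parts = urlparse(url)
    host, path = parts.netloc.lower(), parts.path

    if "youtube.com" in host:
        video_id = parse_qs(parts.query).get("v", [None])[0]
        if video_id:
            return {"platform": "youtube", "video_id": video_id}
        m = re.match(r"^/shorts/([\w-]+)", path)
        if m:
            return {"platform": "youtube", "video_id": m.group(1), "shorts": True}
    elif host == "youtu.be":
        return {"platform": "youtube", "video_id": path.lstrip("/")}
    elif "reddit.com" in host:
        m = re.match(r"^/r/([^/]+)/comments/([^/]+)", path)
        if m:
            return {"platform": "reddit", "subreddit": m.group(1), "post_id": m.group(2)}

    return {"platform": None}

print(parse_platform("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
# {'platform': 'youtube', 'video_id': 'dQw4w9WgXcQ'}
```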
This means instead of just knowing you visited youtube.com/watch?v=dQw4w9WgXcQ, you know it was video dQw4w9WgXcQ—which you can look up, analyze, or discover you got Rickrolled 47 times in 2024.
The Joy of Checkpoints
Long-running scripts need checkpoints. I learned this when my laptop went to sleep after an hour of crawling. All progress lost.
Now the crawler saves its state every 500 URLs. If it crashes, your internet dies, or you need to restart—just run it again and it picks up where it left off.
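The mechanics are unglamorous: dump everything gathered so far to disk every N URLs, and skip anything already on disk when the script starts. A sketch, assuming the checkpoint is just a CSV snapshot and showing the loop synchronously for brevity (the real script is async and its format may differ):

```python
import os
from typing import Callable

import pandas as pd

CHECKPOINT = "crawl_output/checkpoint.csv"  # illustrative path

def crawl_with_checkpoints(urls: list[str],
                           crawl_one: Callable[[str], dict],
                           interval: int = 500) -> dict[str, dict]:
    os.makedirs("crawl_output", exist_ok=True)

    # Resume: anything already in the checkpoint file is skipped this run.
    results: dict[str, dict] = {}
    if os.path.exists(CHECKPOINT):
        df = pd.read_csv(CHECKPOINT)
        results = {row["url"]: row.to_dict() for _, row in df.iterrows()}

    for i, url in enumerate(urls, start=1):
        if url in results:
            continue
        results[url] = {"url": url, **crawl_one(url)}
        if i % interval == 0:
            pd.DataFrame(list(results.values())).to_csv(CHECKPOINT, index=False)

    pd.DataFrame(list(results.values())).to_csv(CHECKPOINT, index=False)
    return results
```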
```
============================================================
FIREFOX HISTORY EXTRACTION PIPELINE
============================================================
Output folder: crawl_output/
Loading Firefox history...
Loaded 56,847 URLs from Firefox history
Combined 1,234 duplicate URLs (visit counts summed)
Final URL count: 55,613 unique URLs
Crawling URLs: 100%|████████████████| 55613/55613 [1:42:30<00:00]
[Checkpoint saved every 500 URLs, don't worry!]
```
What I Learned From My Own Data
Running this on my own history was illuminating.
What I found:
- I visited Stack Overflow 2,847 times (no surprises there)
- There are 400+ YouTube videos I watched but can't remember at all
- I have a concerning number of "I'll read this later" articles that I never read
- My 3am Wikipedia binges have a distinct pattern: start with something reasonable, end on "List of people who have lived in airports"
More practically, I was finally able to find that medieval baking video. It was from Tasting History. Of course it was.
The Technical Bits (For the Curious)
If you want to run this yourself, the setup is pretty simple:
```bash
# Install dependencies
pip install aiohttp pandas beautifulsoup4 lxml tqdm

# Run the crawler (close Firefox first, or don't - it handles both)
python firefox_history_crawler.py
```
You'll get several output files (loading example below):
- firefox_history.csv — full dataset, human-readable
- firefox_history_clean.csv — only successful crawls
- firefox_history_errors.csv — the ones that got away (Cloudflare, login walls, etc.)
- firefox_history.pkl — pickle format for fast reloading in Python
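The pickle is the fastest way back into the data for analysis. For example (the column name url is my assumption about the schema):

```python
import pandas as pd

df = pd.read_pickle("crawl_output/firefox_history.pkl")  # full enriched dataset

# A couple of the questions from earlier, as one-liners:
top_domains = df["url"].str.extract(r"https?://([^/]+)")[0].value_counts().head(20)
youtube = df[df["url"].str.contains("youtube.com/watch", na=False)]

print(top_domains)
print(f"{len(youtube)} YouTube videos in the history")
```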
Limitations (Keeping It Real)
Not everything works perfectly:
- Some sites fight back — Cloudflare bot detection, login walls, aggressive rate limiting. We try 3 times then move on.
- Social media is shy — Many platforms require authentication to see content. We get what we can from public pages.
- Memory grows — 56K URLs with content ≈ 500MB-1GB RAM. Your laptop can probably handle it, but maybe close some browser tabs first. (The irony is not lost on me.)
- macOS only (for now) — Auto-detection of the Firefox profile works on macOS. Linux/Windows users need to set the path manually; the default locations are sketched below.
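If you're on Linux or Windows, the profile folder lives in a predictable default location, so something like this can stand in for the auto-detection (these are the standard Firefox paths; the exact profile folder name varies per machine, and snap/flatpak installs keep it elsewhere):

```python
import platform
from pathlib import Path

def find_places_sqlite() -> Path | None:
    system = platform.system()
    if system == "Darwin":
        base = Path.home() / "Library/Application Support/Firefox/Profiles"
    elif system == "Linux":
        base = Path.home() / ".mozilla/firefox"
    elif system == "Windows":
        base = Path.home() / "AppData/Roaming/Mozilla/Firefox/Profiles"
    else:
        return None
    if not base.exists():
        return None
    # Pick the most recently modified profile that actually has a places.sqlite.
    candidates = sorted(base.glob("*/places.sqlite"),
                        key=lambda p: p.stat().st_mtime, reverse=True)
    return candidates[0] if candidates else None

print(find_places_sqlite())
```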
What's Next?
This crawler was the starting point for a bigger project: MindCap, a browser extension that tracks attention patterns in real-time. The Firefox History Crawler gave me historical data to analyze; MindCap will give me ongoing insights.
If you want to understand your own browsing patterns, or just want to find that article you read at 3am six months ago, give the crawler a try. It's open source and polite to servers.