Doing NLP in a Browser Tab, With No Help From the Cloud

Here is a sentence that should be easy: figure out what a web page is about.

If I were allowed to send the page to a language model, it would be easy — one API call, a clean list of topics back, done. But MindCap is a browser extension that’s supposed to respect a simple promise: your browsing doesn’t have to leave your machine. No server round-trip, no API key, no “we’ll just send this to an LLM real quick.” Whatever understands the page has to run right there, in the tab, in the dozens of milliseconds before you navigate away.

That single constraint — no cloud — turns “figure out what a page is about” into a surprisingly deep engineering problem. This is the diary of that problem: what it takes to extract topics from a page without an LLM, and the slow realization that I was rebuilding, by hand and one rule at a time, a thin slice of what a language model already knows for free.

What “No Cloud” Actually Costs

Three constraints stack up, and each one removes a tool I’d reach for by default:

It runs on every page. The content script loads on every navigation you make. So whatever NLP I do is paid constantly, not in a batch job overnight.
There’s a bundle budget. Code shipped into every page should be small. A heavy library is a tax on every site you visit.
No API call. The whole point. That rules out the thing that would make this trivial.

Inside that box, the only moves left are: lightweight rule-based parsing, a small NLP library that runs in JavaScript, and a pile of hand-written knowledge about language. All three turned out to be necessary, and all three fought me.

Here’s the whole pipeline, with the hard parts called out — the 472KB library and its broken lazy-load, the eleven hand-maintained word lists feeding three different stages, and the fact that all of it runs in real time on every page:

[Figure: the whole pipeline, in the tab. Three constraints up top, three stages across, the 472KB compromise library feeding the fuse step (its lazy-load broke), and eleven hand-maintained word lists quietly feeding all three stages.]

The Shredder

The old keyword pipeline was pure rule-based extraction: split text on non-letters, drop stop-words, filter generics, rank what’s left. Fast, deterministic, debuggable, tiny. And completely blind to the fact that some adjacent words are one idea.
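A minimal sketch of that kind of extractor makes the blind spot concrete. The stop-word set and the length-based ranking below are illustrative stand-ins, not MindCap's actual lists or scoring:

```typescript
// Hypothetical stop-word list; a real one would be far longer.
const STOP_WORDS = new Set(["the", "a", "an", "of", "and", "in", "to"]);

// Split on anything that is not a letter, drop stop-words and duplicates,
// then rank by word length (a stand-in for the old scoring).
function extractKeywords(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w));
  const unique = Array.from(new Set(words));
  return unique.sort((a, b) => b.length - a.length);
}

// Every word comes back as its own keyword. Nothing in the pipeline can
// know that "machine" and "learning" were adjacent in the source text.
const shredded = extractKeywords("A machine learning tutorial");
```

Once the text is split into a bag of words, adjacency is gone, so no later ranking step can recover the phrase.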

"former Federal Reserve chair Alan Greenspan" came out as greenspan, reserve, former, alan, chair — five fragments, ranked by how long each word was. The topic of the page — a person, an institution — had been put through a wood chipper. “New York City” became three unrelated nouns. An LLM would never make this mistake; my regex made it every single time.

So I did the obvious thing: I added an NLP layer. compromise, a JavaScript library that does noun-phrase chunking and named-entity recognition right in the browser — no server, which is exactly the point. Now I could pull "machine learning" and "google deepmind" out as intact tokens.

I wired it in, reloaded the extension, dumped the keywords, and got back… single words. Every phrase shredded, exactly like before. The library worked; everything around it was still built for the old world. That’s the rest of this post.

When the Merge Eats the Meaning

The NLP layer was working perfectly. The merge was throwing its output away.

The merge logic was “rule-based keywords first, NLP phrases appended, then cap the list.” Reasonable on its face. But the rule-based extractor returns ~190 single words per article. The cap was 12. So the 12 slots filled with single words before a single NLP phrase got a turn, and the slice at the end guillotined every phrase off the bottom.

The fix is one line of intent: lead with the phrases. Multi-word terms go first, then single words fill whatever’s left. Suddenly "former federal reserve chair alan greenspan" survived all the way into the database.
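Here is a sketch of that ordering, assuming the two extractors hand back plain string arrays. The function name and signature are hypothetical; only the cap of 12 comes from the pipeline described above:

```typescript
// Merge so multi-word phrases claim slots before single words do.
function mergeKeywords(ruleBased: string[], nlpPhrases: string[], cap = 12): string[] {
  const isPhrase = (term: string) => term.trim().indexOf(" ") !== -1;
  const all = nlpPhrases.concat(ruleBased);
  // Phrases first, then single words fill whatever is left.
  const ordered = all.filter(isPhrase).concat(all.filter((t) => !isPhrase(t)));

  const seen = new Set<string>();
  const merged: string[] = [];
  for (const term of ordered) {
    if (merged.length >= cap) break;
    const key = term.trim().toLowerCase();
    if (key.length === 0 || seen.has(key)) continue;
    seen.add(key);
    merged.push(key);
  }
  return merged;
}
```

With ~190 single words and a handful of phrases going in, the phrases now occupy the first slots and the cap trims single words instead.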

I felt good about this for about a day.

Your Own Scoring, Optimized Against You

Then I actually read the scoring function — the thing that ranks keywords once they’re merged. I’d written it a long time ago, for the single-word world. Reading it now, with phrases flowing through, three things jumped out:

The length bonus saturated at ten characters. A keyword’s score got a small boost for being longer, capping out at +1.0 once it hit 10 characters. "machine learning" is 16 characters, so it hits the cap. So does "javascript", at exactly 10. The phrase I’d fought to preserve scored identically to a common single word. The whole point of keeping it intact earned it nothing.

Cross-source confirmation had gone half-blind. The single biggest signal — “this term shows up in the title AND the URL AND the body, it must matter” — was a +3.0 bonus. But it only fired on exact token matches. Post-NLP, the title had "machine learning", the URL had "machine-learning", the body had "machinelearning". Three spellings of one idea, zero matches, no bonus. The signal meant to reward agreement now missed precisely the multi-word topics I cared most about.

The minimum-length filter dropped short entities. A three-character floor, measured in characters, silently deleted "AI", "UN", "GO". The filter conflated “too short to be a real word” with “short string.”

None of these were new bugs. They were old code doing exactly what I told it to — for a problem I no longer had. Adding the library exposed that my own ranking was still living in 2015.

I fixed them. Phrases now earn an explicit bonus. Cross-source matching normalizes spelling before comparing. The length floor only applies to single words. Small changes, but they’re the difference between a scorer that tolerates phrases and one that rewards them.
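The first two fixes can be sketched roughly like this. The +1.0 length cap and the +3.0 cross-source bonus are the figures quoted above; the linear ramp, the 1.5 phrase bonus, and every name here are assumptions for illustration, not the extension's real scorer:

```typescript
// Collapse spelling variants to one comparable form, so that
// "machine learning", "machine-learning" and "machinelearning" all match.
function normalizeTerm(term: string): string {
  return term.toLowerCase().replace(/[^a-z0-9]+/g, "");
}

interface Sources {
  title: string[];
  url: string[];
  body: string[];
}

function scoreTerm(term: string, sources: Sources): number {
  const isPhrase = term.trim().split(/\s+/).length > 1;

  // Length bonus: saturates at +1.0 once the term reaches 10 characters
  // (the linear ramp below the cap is an assumption).
  const lengthBonus = Math.min(term.length, 10) / 10;

  // Explicit phrase bonus, so a phrase no longer ties with a long single word.
  const phraseBonus = isPhrase ? 1.5 : 0;

  // Cross-source confirmation compares normalized forms, not exact tokens.
  const key = normalizeTerm(term);
  const inAllSources = [sources.title, sources.url, sources.body].every((terms) =>
    terms.some((t) => normalizeTerm(t) === key)
  );
  const crossSourceBonus = inAllSources ? 3.0 : 0;

  return lengthBonus + phraseBonus + crossSourceBonus;
}
```

Stripping separators entirely is a blunt normalization: it would also merge unrelated strings that happen to concatenate the same way, so a real implementation may want something more conservative.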

The Bundle Budget Bites Back

Remember the second constraint — code shipped into every page should be small. compromise is about 472KB. Shipping that into every single page you visit felt wrong, so I made it lazy: import("compromise") only when needed, fetched as a separate chunk on first capture. This is the textbook fix for bundle weight.

It compiled. It built clean. The per-page bundle dropped from 524KB to 56KB. Beautiful.

And it silently produced single-word keywords again.

[MindCap] compromise IMPORT FAILED: Error: Cannot find module 'gCeck'
    at newRequire (tracker.efa474d3.js)

gCeck is a mangled chunk reference. The bundler’s dynamic-import machinery doesn’t work inside a content script — a known, ugly corner of the browser-extension world. This is the kind of thing that doesn’t exist when you have a server: on a backend, you import what you want and the bundle size of a Python process is nobody’s problem. In a browser tab, the delivery mechanism itself is a constraint, and the standard escape hatch for it is quietly broken.

My try/catch caught the failure and degraded to empty NLP, which is correct behavior and also why I didn’t notice for a while. It failed exactly the way it was designed to fail. Quietly.

I reverted to a static import. The 472KB is back, parsed once per page in maybe 15–40 milliseconds, off the critical path — an acceptable tax for staying local. The “optimization” had been trading an invisible-to-the-user cost for a silent-correctness bug. Not a trade worth making.
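One way to keep a graceful fallback from becoming an invisible one is to return the degraded state instead of swallowing it. This is a sketch under assumptions: the types and function name are hypothetical, and the loader is injected so the failure path can be exercised. It does not make dynamic import work inside a content script:

```typescript
type PhraseExtractor = (text: string) => string[];

type LoadResult =
  | { status: "ok"; extract: PhraseExtractor }
  | { status: "degraded"; extract: PhraseExtractor; reason: string };

// `loader` stands in for whatever performs the lazy import and wraps the
// library into a PhraseExtractor.
async function loadPhraseExtractor(
  loader: () => Promise<PhraseExtractor>
): Promise<LoadResult> {
  try {
    return { status: "ok", extract: await loader() };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    // Still fall back to "no phrases", but hand the reason to the caller so
    // it can be counted, logged, or shown instead of quietly ignored.
    return { status: "degraded", extract: () => [], reason };
  }
}
```

A caller that checks `status` can, for example, tag stored keywords as produced without NLP, which would have made the single-word regression visible in the data rather than only in the console.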

The lesson I keep relearning: the build passing tells you almost nothing.

Boost, or Fuse?

The deepest change was the smallest diff. MindCap had a list, KNOWN_COMPOUND_TERMS, of phrases like “machine learning” and “data science.” The old code used it to find the component words in the keyword list and give each one a scoring boost. "machine" got +2, "learning" got +2, and they stayed two separate entries that happened to rank near each other.

But now compromise was also producing "machine learning" as a single token. Two systems, two different notions of the same idea, and they didn’t agree on what a compound even was — one fused, one boosted-but-split.

So I moved the compound list out of scoring entirely. It now fuses adjacent compound words into one token before anything gets ranked — the same shape compromise produces. A curated compound and an NLP-discovered phrase now enter the ranker as the same kind of thing and get scored the same way. The +2 boost is gone; a phrase earns its rank by being a phrase.

It also closed a gap I hadn’t noticed: URLs never went through the NLP layer, so /machine-learning-tutorial always stayed shredded. Fusing fixes that too.
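A minimal sketch of the fuse step, assuming two-word compounds only. KNOWN_COMPOUND_TERMS is the list named above, but its contents here are a two-entry excerpt, and the helper names are hypothetical:

```typescript
// Excerpt for illustration; the real list is longer.
const KNOWN_COMPOUND_TERMS = new Set(["machine learning", "data science"]);

// Walk the token stream and fuse adjacent pairs that form a known compound.
function fuseCompounds(tokens: string[]): string[] {
  const fused: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    const pair = i + 1 < tokens.length ? `${tokens[i]} ${tokens[i + 1]}` : "";
    if (pair !== "" && KNOWN_COMPOUND_TERMS.has(pair)) {
      fused.push(pair); // one token, same shape as an NLP-discovered phrase
      i += 2;
    } else {
      fused.push(tokens[i]);
      i += 1;
    }
  }
  return fused;
}

// URL slugs get the same treatment once they are split into words.
function tokensFromUrlPath(path: string): string[] {
  return path.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

const fromUrl = fuseCompounds(tokensFromUrlPath("/machine-learning-tutorial"));
// fromUrl is ["machine learning", "tutorial"]
```

Because fusing happens on tokens before ranking, it does not care whether those tokens came from the title, the body, or a URL path.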

Am I Just Doing NLP by Hand?

Somewhere in the middle of maintaining a list of stop-verbs to stop the extractor from thinking “father wants” was a topic, I stopped and asked myself the uncomfortable question: am I wasting my time? Should this just be an AI-powered app?

It’s a fair question, and the honest framing is uncomfortable: I have, at last count, eleven hand-maintained word lists — stop-words, generics, protected terms, compounds, pronouns, trailing verbs. Each one is me, by hand, teaching a machine a rule about English that a language model already knows for free. “Father wants” is not a topic because wants is a verb. An LLM knows that. My code knows it only because I added the word wants to a set. Every one of those lists is a tiny manual reconstruction of competence I’m refusing to buy from the cloud — that’s the actual price of the no-cloud promise, paid in vocabulary files.

The honest answer I landed on: some of this is a treadmill, and some of it isn’t. Clean article text, the behavioral signal of how you actually move through pages, a labeled dataset to test against — those are durable no matter what produces the keywords. The hand-tuned grammar lists are the part that an LLM, or a real part-of-speech tagger, would erase.

But I’m not going to decide that by vibes. I built a validation notebook that mirrors the extension’s pipeline exactly (same Readability text, same phrase extraction, the scoring function ported line for line) and runs it over roughly 56,000 pages of my own browsing history. There’s an accuracy gate: rate the keywords good / okay / bad, and if it clears 80%, the rule-based foundation is good enough to ship. If it doesn’t, the gate tells me where it fails, which decides whether the answer is a better tagger, on-device transformers, or an LLM.
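The gate itself is simple arithmetic. In this sketch, counting both "good" and "okay" as acceptable and treating exactly 80% as a pass are my assumptions; only the three labels and the 80% threshold come from the description above:

```typescript
type Rating = "good" | "okay" | "bad";

// Assumption: "good" and "okay" both count as acceptable.
function passesAccuracyGate(
  ratings: Rating[],
  threshold = 0.8
): { accuracy: number; passed: boolean } {
  if (ratings.length === 0) return { accuracy: 0, passed: false };
  const acceptable = ratings.filter((r) => r !== "bad").length;
  const accuracy = acceptable / ratings.length;
  return { accuracy, passed: accuracy >= threshold };
}
```

Grouping the "bad" ratings by page type or failure kind is what would turn a failed gate into a decision about which replacement to try.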

Decide with numbers, not with the urge to rewrite everything. The refactor can wait for the data.

The Pipeline in Numbers

Aspect	Before	After
`"machine learning"`	`["machine", "learning"]`	`"machine learning"`
Multi-word phrases in output	none	lead the list
Content-script bundle	56KB	524KB (reverted the “fix”)
Compound handling	boost split words at scoring	fuse to one token before ranking
Cross-source matching	exact token	normalized form
Hand-maintained word lists	11	11 (for now)
Validation corpus	—	56,392 pages

That last “for now” is the one I’m watching.

What I’m Reading

Less reading this stretch, more re-reading my own code — which turns out to be its own genre. Every refactor is a letter from a past version of yourself who was very sure about something.

The one outside thread: I keep circling the idea that the keyword layer is the wrong altitude to be hand-crafting at all. The clustering experiment I set up — embed every page, let communities emerge from the geometry instead of from keyword rules — is the bet that meaning lives in the space between documents, not in the words I can extract from any one of them. “Machine” and “learning” being close in that space matters more than whether I glued them into one string. That’s the next thing to find out.