You might have noticed that there's a new link in my navigation bar: TIL. It's a personal search engine, heavily inspired by thesephist's tool called Monocle.
I maintain a daily-ish updated "learnings" page on my blog, but I decided that I'd outgrown it.
Sure, it could've been more effective to use Notion or something like that. But amidst all the hype about embeddings and "semantic" search, I thought it would be a good exercise to go back to basics and learn a little bit about lexical search.
TIL is built on similar principles to thesephist's Monocle. To scope the project down to its essence and avoid getting caught in analysis paralysis, here's what I wrote down in TIL's README.md:
The stack, like all of my personal projects, is absurdly simple.
My TILs are written in Markdown using Obsidian. A simple Python script translates these Markdown files into HTML, then runs the tokenizer and indexer. Another script subsequently pushes the index and the TIL content to GitHub.
Broadly speaking, the "indexing" preprocessing work can be split into two components: the "cleaner" and the "indexer".
The cleaner cleans, tokenizes, and stems the text corpus. TIL chooses to tokenize by word (by space, essentially), and uses an extremely simple, handcrafted stemming algorithm.
The indexer constructs the reverse index and assigns each document a score (conditional on the search term) based on the TF-IDF (term frequency / inverse document frequency) metric. Simple!
From end-to-end, the cleaner and indexer work together as follows:
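In sketch form - simplified Python, not the actual TIL code:

```python
from collections import defaultdict

def clean(text: str) -> list[str]:
    # The cleaner: lowercase, strip punctuation, split on whitespace.
    # (Stemming and prefix tokenization, described below, slot in here too.)
    return [w.strip(".,!?\"'()") for w in text.lower().split()]

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    # The indexer: map each token back to the documents that contain it.
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in clean(text):
            index[token].add(doc_id)
    return index
```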
Most of the design decision points had to do with the cleaner - which I didn't expect before beginning this project. It makes sense now, though: what insights you can extract from your data (the indexer) will ultimately be driven by how you represent your data (the cleaner). It's the classic case of feature engineering making a big difference when your statistical methodology toolbox is limited.
The first design decision was token size: what should be my "units" of search? Should they be words, characters, or a mix of the two? I broke down the tradeoffs.
If tokens are character-based, then the upside is that I extract more "information" per document. The downside is a horrible signal-to-noise ratio (~5%). Naturally, the more granular the units of information, the noisier the data will be - especially because a single character usually doesn't mean anything in a semantic sense.
If tokens are words, then the upside is that I get fantastic signal-to-noise (~100%). Every word means something. The downside is that the onus is placed on the user to be precise with his search terms.
If tokens are a mix of the two, then the upside is that I get a great signal-to-noise ratio (~70%). My guess is that if the average word is 5 characters, then a substring only starts to carry meaning at around 3 or 4 characters. Let's just call it 3.5 characters, and guess a 70% signal-to-noise ratio. The other upside is that the user doesn't have to be as precise with his search terms as with word-tokenizing. On the other hand, the downside is that the number of key terms scales quadratically with the number of characters (n + C(n,2) ~ O(n^2)).
Given the tradeoffs of each tokenizing approach, I went with the hybrid version - but with a simplifying twist. Instead of getting every pair of start-end points in a word, TIL just extracts every substring that starts at the beginning of the word. For example, for cat, the cleaner outputs [c, ca, cat].
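As a sketch (my own illustration), that prefix extraction is a one-liner:

```python
def prefix_tokens(word: str) -> list[str]:
    # Every substring anchored at the start of the word:
    # prefix_tokens("cat") -> ["c", "ca", "cat"]
    return [word[:i] for i in range(1, len(word) + 1)]
```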
This simplification eliminates the quadratic scaling and makes it O(n), with n being the number of characters in the word. Furthermore, you get a much better signal-to-noise ratio (probably closer to ~85%), while still allowing users to be less precise.
While this approach yields a token corpus that's probably 3 to 4 times bigger than with the word-tokenization approach, I'm willing to make this tradeoff because my text corpus size just isn't that big for the moment.
A stemmer is what reduces word variants to their root word. For example, "linked" is a variant of "link", and "programmer" is a variant of "program". Essentially, it's a many-to-one mapping from words to "keywords".
I found that a good stemmer isn't critical to a good-enough search experience. In fact, consistency between the indexer and searcher is far more important. I'll explain what this means in the next section, but in a nutshell: because my indexer is written in Go, the stemmer used in the indexer wasn't exactly the same as the one used in the front-end. I quickly discovered how frustrating this inconsistency was.
Furthermore, an unexpected benefit of my chosen tokenization approach was that a good stemmer was almost unnecessary. Stemmers typically modify word endings: "programmer" -> "programm", or "linked" -> "link". Since TIL considers every substring that starts from the beginning of the word, TIL applies stemming automatically. Even if the stemmer maps "linked" -> "linke" (as opposed to "linked" -> "link"), it doesn't matter, since "link" is already considered a token in the document.
As a result, TIL uses an extremely simple stemmer:
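It boils down to chopping off a handful of common suffixes. A sketch of the idea:

```python
# Illustrative suffix list - a sketch of the idea, not the exact production list.
SUFFIXES = ("ing", "ed", "er", "s")

def stem(word: str) -> str:
    # Chop off a common English suffix, if the remainder is still long
    # enough to be a meaningful root: "linked" -> "link", "programmer" -> "programm".
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```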
Yep. That's it. Almost a shame to call it a stemmer.
The indexer builds the reverse index from the documents. Rather than going from document to keywords, the indexer constructs the inverse mapping: keyword to documents.
The indexer involved fewer design decisions, which tracks: choosing a model is less complicated when you don't have a data behemoth to work with. Design decisions that are further downstream are much easier to fudge and experiment with.
A simple hash map is used for the reverse index data structure. For a dataset of this size, it wasn't worth it to use something more complicated like B-Trees.
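As a sketch (again my illustration, not the actual TIL code), the reverse index is just nested hash maps - token -> document -> term count - ready for scoring at build-time:

```python
from collections import defaultdict

def build_reverse_index(docs: dict[str, list[str]]) -> dict[str, dict[str, int]]:
    # docs maps a document id to its cleaned, stemmed token list.
    index: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token][doc_id] += 1
    return index
```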
Each document has a score conditional on the search term. For example, take "animal" as the search term. Then a document that mentions "animal" only once should have a lower score than a document that uses it five times.
But should these scores be computed at build-time or run-time? I chose build-time; I saw zero upside in computing them at run-time. If I prioritize time-to-first-result, why would I compute at run-time what is perfectly known at build-time?
TIL uses something called TF-IDF scoring, which is a very popular and simple method for ranking documents. Term frequency (TF) is how often a term occurs in a document, and document frequency is how many documents contain that term. The intuition behind inverse document frequency (IDF) is that the more documents contain a term, the less important it probably is (within that text corpus), so it serves as a weighting factor for the term frequency.
In other words, TF-IDF is the product of term frequency and inverse document frequency: TF-IDF(t, d) = TF(t, d) * log(N / DF(t)), where N is the number of documents in the corpus.
A short example. Take three documents:
- Document A: "I like dogs and you do too."
- Document B: "Dogs suck."
- Document C: "Cats are great."
- TF("dog", DocumentA) = 1/7, and
- IDF("dog") = log(3/2) ~= 0.4, so
- TF-IDF("dog", DocumentA) = 1/7 * 0.4 = 0.06.
- TF("dog", DocumentC) = 0/3, and
- IDF("dog") = log(3/2) ~= 0.4, so
- TF-IDF("dog", DocumentA) = 0 * 0.4 = 0.
- TF-IDF("cat", DocumentA) = 0/7 * log(3/1) = 0, and
- TF-IDF("cat", DocumentC) = 1/3 * log(3/1) ~= 0.37.
The higher the score, the more important the term is to that document, relative to the rest of the corpus.
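To make the arithmetic concrete, here's the same example as a runnable sketch (skipping stemming, so the raw tokens "dogs" and "cats" stand in for their stems):

```python
import math

docs = {
    "A": "i like dogs and you do too".split(),
    "B": "dogs suck".split(),
    "C": "cats are great".split(),
}

def tf(term: str, doc: str) -> float:
    # Term frequency: occurrences of the term, normalized by document length.
    return docs[doc].count(term) / len(docs[doc])

def idf(term: str) -> float:
    # Inverse document frequency: log of (total docs / docs containing the term).
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / containing)

def tf_idf(term: str, doc: str) -> float:
    return tf(term, doc) * idf(term)

print(round(tf_idf("dogs", "A"), 2))  # 1/7 * log(3/2) ~= 0.06
print(round(tf_idf("cats", "C"), 2))  # 1/3 * log(3)   ~= 0.37
```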
It's a little rich to call this a design decision because no trade-offs were really necessary.
It's really just bitmap intersection. For each search term, construct a bitmap marking which documents contain the term. Intersect the bitmaps constructed from each search term. Then take each surviving document's TF-IDF scores for each term and multiply them to get a holistic score for each document. Done.
TIL does this by taking the term associated with the fewest documents and intersecting from there. The key observation is that the length of the intersected bitmap is at most the length of the smallest bitmap among the intersection's components.
With multi-word queries, how do you generate a score for each document? TIL multiplies a document's scores for each term together. If document A has a score of 0.5 for the term "cat" and 0.25 for "dog", then document A's score for the query "dog cat" is 0.5 * 0.25 = 0.125.
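Putting it together, a sketch of the whole search step (my illustration, with Python sets standing in for bitmaps):

```python
import math

def search(terms: list[str], index: dict[str, dict[str, float]]) -> dict[str, float]:
    # index maps term -> {doc_id: precomputed TF-IDF score}.
    if not terms:
        return {}
    # Start from the term with the fewest matching documents, so every
    # intersection stays bounded by the smallest candidate set.
    postings = sorted((index.get(term, {}) for term in terms), key=len)
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    # Multiply the per-term scores into one holistic score per document.
    return {doc: math.prod(index[term][doc] for term in terms) for doc in candidates}
```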
The entire site is static, which means that all the text search magic occurs on the client-side.
Two JSON files, called indexing.json and store.json, are loaded when the site is loaded. indexing.json contains the inverse index, and store.json contains document-specific information.
During search, indexing.json does all of the heavy lifting, but when you click on a specific document, store.json is the main actor.
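Conceptually, the two files look something like this (a simplified sketch, not the exact schemas):

```python
# indexing.json: token -> {document id -> precomputed TF-IDF score}
indexing = {
    "lexic": {"til-001": 0.42, "til-017": 0.17},
}

# store.json: document id -> display metadata and content
store = {
    "til-001": {"title": "Lexical search basics", "html": "<p>...</p>"},
}
```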
This project was another chance for me to explore the order in which web-things render, and how to make my websites feel "snappier" in general.
I've always had a lot of trouble with Cumulative Layout Shift (CLS). CLS is a web metric designed by Google that measures how much the page layout shifts as content loads.
My main concern with CLS for TIL was with images. Because text renders before images, once the image loads, the text would snap downwards and create a horrible reading experience. I would rather watch an image load slowly than be reading a block of text only to be greeted by an image milliseconds later.
The solution was to add dimensions to the image tag. The browser then leaves space for your image, eliminating the snapping effect. For example, say your image is 300x600 (300px wide, 600px tall). If you put width="300" and height="600" into your img tag attributes, the browser will leave a 300x600 blank area for your image to render into without disturbing the surrounding text elements.
However, you probably don't want a 300x600 sized image. What you really want is an image that is scaled correctly and has a specific width or height. That's where aspect-ratio comes in handy.
The aspect-ratio CSS property takes your specified width (or height) and applies the given ratio to produce a correctly scaled image. With the 300x600 example, say you want the image to be 200px wide. Then width="200" together with aspect-ratio: 1 / 2; will give you a 200px x 400px image.
Like I said before, the workflow is pretty simple:
- I add a page to my Obsidian workspace.
- I run a script that compiles the Obsidian pages into HTML and then indexes the HTML.
- I run another script that pushes the new indexing.json and store.json to the website.
My favorite way of taking notes is screenshots. That meant that the workflow had to seamlessly integrate with my existing method of taking screenshots like a maniac. Thankfully, Obsidian has fantastic copy-and-paste-image functionality; all my script had to do was recognize Markdown image syntax, copy the image to the TIL website directory, and modify the image source path to point back to the copied image in the TIL directory. All quick work with a little bit of regex and bash commands.
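A sketch of that image-rewriting pass (the regex and paths are illustrative, not the exact script):

```python
import re
import shutil
from pathlib import Path

MD_IMAGE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<src>[^)]+)\)")

def rewrite_images(markdown: str, vault: Path, site_images: Path) -> str:
    # Copy each referenced image into the TIL site directory and point
    # the Markdown source back at the copy.
    def relocate(match: re.Match) -> str:
        src = vault / match.group("src")
        shutil.copy(src, site_images / src.name)
        return f"![{match.group('alt')}](/til/images/{src.name})"
    return MD_IMAGE.sub(relocate, markdown)
```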
I don't have any plans to support other Markdown styles. I could be convinced to support parsing Markdown code blocks - but my maniacal love for screenshots runs too deep and screenshots provide formatted text out of the box anyway.
For now, I don't have any plans to extend TIL. I could see myself adding semantic search in the future, but for now, I see little reason to. My personal opinion is that semantic search is better suited to emotional content than to raw knowledge retrieval. And since I never intended TIL to hold emotional content like my journal entries or my favorite flavors - I'm shelving semantic search for TIL.
After every project I do, I ask myself: "How are you better than you were when you started this project?" If I could package up this TIL experience into just two things that I learned, they would be:
- Lexical search algorithms and data structures
- The critical rendering path for HTML
Sayonara, RoboCop.