Building Fast AI: Why Embedding Models Beat LLMs for Real-Time Search
In 2016 our team built a premium domain search tool designed to find similar aftermarket domains instantly as you type. We used fastText embeddings with Spotify's Annoy indexing to power semantic search at <10 ms.
Unlike traditional domain search tools that simply check availability, ours took a stab at understanding user intent. When a user searches "instantdomainsearch.com," our index finds semantically similar domains:
- domaininstant.com
- fastdomainsearch.com
- searchdomain.com
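Here is a minimal sketch of that original pipeline, using pre-trained fastText vectors and an Annoy index; the file path, vector dimension, and candidate list are illustrative, not our production data:

```python
# Minimal sketch of the original stack: fastText vectors + an Annoy index.
# Paths, dimensions, and the candidate list are illustrative.
from gensim.models.fasttext import load_facebook_vectors
from annoy import AnnoyIndex

vectors = load_facebook_vectors("cc.en.300.bin")  # pre-trained fastText vectors

candidates = ["domaininstant.com", "fastdomainsearch.com", "searchdomain.com"]
index = AnnoyIndex(300, "angular")  # 300-dim fastText vectors, cosine-style metric
for i, name in enumerate(candidates):
    # fastText's subword n-grams let us embed unseen tokens like "fastdomainsearch"
    index.add_item(i, vectors[name.removesuffix(".com")])
index.build(10)  # 10 trees

query_vec = vectors["instantdomainsearch"]
for i in index.get_nns_by_vector(query_vec, 3):
    print(candidates[i])
```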
The Pitfalls of fastText: Domain-Specific Context
fastText learned its semantic understanding from billions of words in Common Crawl and Wikipedia. But domain names represent businesses and brands, which often use words differently than general web text does.
Take "mint" as an example below. While fastText recognizes both meanings, it gives much higher similarity scores to plant/herb terms. In the domain and business world, "mint" is more commonly associated with money (e.g. Mint.com). This pattern repeated across many terms (e.g. Cloud, Spark).
Related Word | fastText Similarity to "mint" | Context |
---|---|---|
peppermint | 0.64 (high) | Plant/herb |
herb | 0.49 | Plant/herb |
money | 0.37 | Money/finance |
finance | 0.29 | Money/finance |
budget | 0.27 (low) | Money/finance |
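You can reproduce this comparison with gensim and the pre-trained fastText vectors (file name illustrative):

```python
# Compare "mint" against plant and finance terms using pre-trained fastText vectors.
from gensim.models.fasttext import load_facebook_vectors

vectors = load_facebook_vectors("cc.en.300.bin")  # Common Crawl fastText vectors

for word in ["peppermint", "herb", "money", "finance", "budget"]:
    # Cosine similarity between the two word vectors
    print(f"mint ~ {word}: {vectors.similarity('mint', word):.2f}")
```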
We realized we weren't dealing with a limitation of fastText, but rather that domain names have evolved their own semantic meaning.
This led to a challenge: How could we give our model domain-specific context?
Why Not Just Use AI?
Modern LLMs understand context brilliantly. When we tested GPT-4 for similar domain suggestions, it clearly grasped the meanings users expected.
With a small prompt to give domain-specific context, it understood that "cloud" also means cloud computing, "mint" means finance, and "spark" means data analytics. It suggested domains that matched what buyers actually want, not just textual variations.
But there was a critical blocker: latency. Our fastText system returns results in <10 ms. GPT-4 averaged 1.7 seconds per request, more than 100× slower. For a tool where users expect results as they type, waiting more than a second per keystroke would destroy the user experience.
Enter Embedding Models: From Words to Concepts
This led us to an idea: instead of running LLMs in real time, we could use LLMs offline to train our own fast embedding model. That way, instead of carrying general web context, the model would be specific to our use case: domains.
Here are the practical differences:
fastText approach (general semantic understanding):
- Uses 2 million word vectors trained on Common Crawl (2017)
- Understands how words relate based on their usage in web text
- "searchengine" = semantic meaning from millions of sentences
- Excellent at general language understanding
Our enhanced approach (domain-specific semantics):
- Treats each word and phrase as a business concept
- "searchengine" = "A platform for web search and discovery"
- Adds domain-specific context that fastText never saw
- Builds on fastText's foundation with modern techniques
To build this we used sentence-transformers, a Python library for modern embedding models designed for semantic search, which we could fine-tune for our specific use case. The AI landscape has evolved significantly since 2017, making it much easier to build specialized models. We used GPT-4 to batch-generate millions of domain-specific training examples, then fine-tuned these newer models to understand domains the way domainers do.
The implementation required five steps: generate training data, calculate similarities, structure for training, fine-tune the model, and optimize for production speed.
Step 1: Using GPT-4 to Generate Domain Descriptions
We discovered that GPT-4 excels at expanding single words into rich contextual descriptions. Given just a domain name, it could infer potential use cases, related industries, and semantic connections.
Domain | GPT-4 Generated Description | Keywords Extracted |
---|---|---|
torontopancakes.com | A delightful name that combines the charm of Toronto with the love for pancakes, perfect for breakfast lovers and city explorers. | food, breakfast, pancakes, maple syrup, Canada, Toronto, brunch, treats, sweet, delicious, local, community |
matchupitchutours.com | A travel-friendly name inspired by Machu Picchu, offering guided tours and unforgettable adventures in the Andes. | travel, tours, adventure, Peru, Machu Picchu, history, culture, nature, explore, hiking, mountains, guide |
frankenbergerschokolade.com | A rich, indulgent name blending the charm of German heritage with the sweetness of chocolate, perfect for gourmet treats and confections. | chocolate, gourmet, sweets, German, heritage, desserts, artisan, cocoa, indulgence, handmade, treats, premium |
Generating descriptions for a portion of our inventory provided us with labeled training data that captured domain semantics.
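Here is a simplified sketch of this generation step with the OpenAI Python client; the prompt wording and output handling are illustrative, not our exact batch pipeline:

```python
# Sketch: ask GPT-4 for a short description and keywords for a domain name.
# Prompt wording and output parsing are simplified for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_domain(domain: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You describe domain names for a domain marketplace. "
                        "Return a one-sentence description followed by a "
                        "comma-separated keyword list."},
            {"role": "user", "content": f"Describe the domain {domain}."},
        ],
    )
    return response.choices[0].message.content

print(describe_domain("torontopancakes.com"))
```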
Step 2: Transform Descriptions into Similarity Scores
With descriptions for each domain, we could measure semantic similarity using existing embedding models. This gave us ground truth data about which domains should be considered related:
Domain Pair | Similarity Score | Relationship Type |
---|---|---|
google.com ↔ searchengine.com | 0.68588746 | Same category |
cloudeous.com ↔ cloudivio.com | 0.82812774 | Similar concept |
omnigenics.com ↔ omniparent.com | 0.66224474 | Shared prefix only |
xaute.com ↔ paveup.com | 0.50881153 | Unrelated |
These similarity scores became our training targets. They told our model which domains should be close or far in the embedding space.
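The scoring itself is a cosine similarity between description embeddings. A minimal sketch with sentence-transformers, assuming the descriptions from Step 1 and an off-the-shelf model such as all-mpnet-base-v2 (the descriptions here are shortened placeholders):

```python
# Score domain pairs by the cosine similarity of their GPT-4 descriptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # off-the-shelf scoring model

descriptions = {
    "google.com": "A platform for web search and discovery.",         # placeholder
    "searchengine.com": "A service for finding information online.",  # placeholder
}

embeddings = model.encode(list(descriptions.values()), normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"google.com ~ searchengine.com: {score:.4f}")
```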
Step 3: Create Training Triplets for Fine-tuning
Triplet loss training requires data in sets of three: an anchor domain, a positive match (similar), and a negative match (dissimilar). We generated thousands of these triplets from our similarity calculations:
Anchor Domain | Positive Match (Similar) | Negative Match (Dissimilar) |
---|---|---|
google.com | searchengine.com (0.6858) | flowershop.com (0.2754) |
uber.com | rideshare.com (0.7139) | cookbook.net (0.2201) |
shopify.com | ecommerce.com (0.6064) | weatherapp.org (0.2656) |
This structure forces the model to learn relative similarities rather than absolute classifications.
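A sketch of how triplets can be assembled from the pairwise scores; the threshold values below are illustrative, not exact cut-offs:

```python
# Build (anchor, positive, negative) triplets from pairwise similarity scores.
# Thresholds are illustrative.
POSITIVE_MIN = 0.6   # pairs scoring above this count as similar
NEGATIVE_MAX = 0.3   # pairs scoring below this count as dissimilar

def build_triplets(scores: dict[tuple[str, str], float]) -> list[tuple[str, str, str]]:
    triplets = []
    anchors = {a for a, _ in scores}
    for anchor in anchors:
        positives = [b for (a, b), s in scores.items() if a == anchor and s >= POSITIVE_MIN]
        negatives = [b for (a, b), s in scores.items() if a == anchor and s <= NEGATIVE_MAX]
        for pos in positives:
            for neg in negatives:
                triplets.append((anchor, pos, neg))
    return triplets

scores = {
    ("google.com", "searchengine.com"): 0.6858,
    ("google.com", "flowershop.com"): 0.2754,
}
print(build_triplets(scores))  # [('google.com', 'searchengine.com', 'flowershop.com')]
```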
Step 4: Train Our Domain-Specific Embedding Model
We selected all-MiniLM-L6-v2 as our base model, with all-mpnet-base-v2 serving as the initial embedding provider. It balances performance with size. Using triplet loss, we trained it to mimic GPT-4's semantic understanding:
Training Metric | Value |
---|---|
Base model | all-MiniLM-L6-v2 |
Training samples | 500,000 |
Epochs | 25 |
Final correlation with GPT-4 | ~0.87 |
The 0.87 correlation meant our lightweight model captured most of GPT-4's semantic understanding while running orders of magnitude faster.
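The fine-tuning loop itself is compact with sentence-transformers. A condensed sketch, assuming triplets shaped like the Step 3 table; the hyperparameters and save path are illustrative:

```python
# Fine-tune all-MiniLM-L6-v2 on (anchor, positive, negative) triplets.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Triplets come from Step 3; two shown here for illustration.
train_examples = [
    InputExample(texts=["google.com", "searchengine.com", "flowershop.com"]),
    InputExample(texts=["uber.com", "rideshare.com", "cookbook.net"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=25,
    warmup_steps=100,  # illustrative
)
model.save("domain-embeddings-v1")  # illustrative output path
```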
Step 5: Optimize from ~500 ms to 10 ms
The first version of this new stack was implemented in Python to test our hypotheses. We used common libraries such as usearch for HNSW and SQLite to map results back to domains. This showed we were heading in the right direction, with most queries coming back in roughly 500 ms. But training a good model only solved half the problem: we still needed sub-50 ms latency!
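A rough sketch of that Python prototype, assuming the fine-tuned model from Step 4; the model path, database schema, and candidate list are illustrative:

```python
# Prototype: fine-tuned embeddings + usearch HNSW index + SQLite id -> domain mapping.
import sqlite3
from sentence_transformers import SentenceTransformer
from usearch.index import Index

model = SentenceTransformer("domain-embeddings-v1")  # illustrative model path
db = sqlite3.connect("domains.db")                   # illustrative database
db.execute("CREATE TABLE IF NOT EXISTS domains (id INTEGER PRIMARY KEY, name TEXT)")

names = ["domaininstant.com", "fastdomainsearch.com", "searchdomain.com"]
vectors = model.encode(names, normalize_embeddings=True)

index = Index(ndim=vectors.shape[1], metric="cos")  # HNSW index
for i, name in enumerate(names):
    db.execute("INSERT OR REPLACE INTO domains VALUES (?, ?)", (i, name))
    index.add(i, vectors[i])

query = model.encode("instantdomainsearch.com", normalize_embeddings=True)
matches = index.search(query, 3)
for key in matches.keys:
    print(db.execute("SELECT name FROM domains WHERE id = ?", (int(key),)).fetchone()[0])
```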
Most of our tech stack is already written in Rust. So we decided to move these newer models into Rust as well. To avoid some hefty dependencies and because Torch is not the fastest for inference, we chose ONNX Runtime.
- Model format: ONNX (3× faster than PyTorch)
- Runtime: Rust + ort (minimal overhead, excellent CPU performance)
- Index: HNSW for a limited set of ~25M aftermarket domains
- Hardware: CPU-only (no GPU coordination overhead)
- Optimizations: quantized model exported with Hugging Face Optimum
Result: 95th-percentile latency of 10 ms, within our 25 ms target.
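The export and quantization step is short with Hugging Face Optimum. A sketch assuming the fine-tuned model from Step 4; the paths and quantization config are illustrative, and the Rust + ort service then loads the resulting ONNX file:

```python
# Export the fine-tuned model to ONNX and apply dynamic int8 quantization for CPU.
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export to ONNX (model path is illustrative).
ort_model = ORTModelForFeatureExtraction.from_pretrained("domain-embeddings-v1", export=True)
ort_model.save_pretrained("domain-embeddings-onnx")

# Dynamic quantization tuned for modern x86 CPUs (config is illustrative).
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="domain-embeddings-quantized", quantization_config=qconfig)
```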
AI in Development vs. Production
"AI‑powered" often becomes shorthand for integrating an LLM into every request path, but placement is a design choice with distinct trade‑offs.
- LLM in development (offline, distilled into a smaller model): less context at request time, but much faster responses
- LLM in production (in the live request path): richer context, but slower responses
Our instant search proves how far the first approach can go: we distilled GPT‑4's trillion‑parameter knowledge into a 22.7M‑parameter embedding model, returning semantic results in <10 ms (ignoring the distance between the user and our servers).
On the other hand, our Business Name Generator shows where a live LLM pulls ahead: brand ideation needs fresh, expansive reasoning, so we trade speed for depth and deliver suggestions that truly need trillion+ parameter context.
This isn't to say LLMs in development are better than LLMs in production. It's about matching context depth to latency cost.
We are planning to roll out our new model over the next month.