Building Fast AI: Why Embedding Models Beat LLMs for Real-Time Search
In 2016 our team built a premium domain search tool designed to find similar aftermarket domains instantly as you type. We used fastText embeddings with Spotify's Annoy indexing to power semantic search at <10 ms.
Unlike traditional domain search tools that simply check availability, ours took a stab at understanding user intent. When a user searches "instantdomainsearch.com," our index finds semantically similar domains:
- domaininstant.com
- fastdomainsearch.com
- searchdomain.com
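Here is a minimal sketch of that original pipeline, using pre-trained fastText vectors and an Annoy index; the file path, vector dimension, and candidate list are illustrative, not our production data:

```python
# Minimal sketch of the original stack: fastText vectors + an Annoy index.
# Paths, dimensions, and the candidate list are illustrative.
from gensim.models.fasttext import load_facebook_vectors
from annoy import AnnoyIndex

vectors = load_facebook_vectors("cc.en.300.bin")  # pre-trained fastText vectors

candidates = ["domaininstant.com", "fastdomainsearch.com", "searchdomain.com"]
index = AnnoyIndex(300, "angular")  # 300-dim fastText vectors, cosine-style metric
for i, name in enumerate(candidates):
    # fastText's subword n-grams let us embed unseen tokens like "fastdomainsearch"
    index.add_item(i, vectors[name.removesuffix(".com")])
index.build(10)  # 10 trees

query_vec = vectors["instantdomainsearch"]
for i in index.get_nns_by_vector(query_vec, 3):
    print(candidates[i])
```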
The Pitfalls of fastText: Domain-Specific Context
fastText learned its semantic understanding from billions of words in Common Crawl and Wikipedia. But domain names represent businesses and brands, which often use words differently than general web text does.
Take "mint" as an example below. While fastText recognizes both meanings, it gives much higher similarity scores to plant/herb terms. In the domain and business world, "mint" is more commonly associated with money (e.g. Mint.com). This pattern repeated across many terms (e.g. Cloud, Spark).
Related Word | fastText Similarity to "mint" | Context |
---|---|---|
peppermint | 0.64 (high) | Plant/herb |
herb | 0.49 | Plant/herb |
money | 0.37 | Money/finance |
finance | 0.29 | Money/finance |
budget | 0.27 (low) | Money/finance |
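You can reproduce this comparison with gensim and the pre-trained fastText vectors (file name illustrative):

```python
# Compare "mint" against plant and finance terms using pre-trained fastText vectors.
from gensim.models.fasttext import load_facebook_vectors

vectors = load_facebook_vectors("cc.en.300.bin")  # Common Crawl fastText vectors

for word in ["peppermint", "herb", "money", "finance", "budget"]:
    # Cosine similarity between the two word vectors
    print(f"mint ~ {word}: {vectors.similarity('mint', word):.2f}")
```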
We realized we weren't dealing with a limitation of fastText, but rather that domain names have evolved their own semantic meaning.
This led to a challenge: How could we give our model domain-specific context?
Why Not Just Use AI?
Modern LLMs understand context brilliantly. When we tested GPT-4 for similar domain suggestions, it clearly grasped the meanings users expected.
With a small prompt to give domain-specific context, it understood that "cloud" also means cloud computing, "mint" means finance, and "spark" means data analytics. It suggested domains that matched what buyers actually want, not just textual variations.
But there was a critical blocker: latency. Our fastText system returns results in <10 ms. GPT-4 averaged 1.7 seconds per request, more than 100× slower. For a tool where users expect results as they type, waiting more than a second per keystroke would destroy the user experience.
Enter Embedding Models: From Words to Concepts
This led us to an idea: instead of running LLMs in real time, we could use LLMs offline to train our own fast embedding model. That way, instead of carrying general web context, the model would be specific to our use case: domains.
Here are the practical differences:
fastText approach (general semantic understanding):
- Uses 2 million word vectors trained on Common Crawl (2017)
- Understands how words relate based on their usage in web text
- "searchengine" = semantic meaning from millions of sentences
- Excellent at general language understanding
Our enhanced approach (domain-specific semantics):
- Treats each word and phrase as a business concept
- "searchengine" = "A platform for web search and discovery"
- Adds domain-specific context that fastText never saw
- Builds on fastText's foundation with modern techniques
To build this we used sentence-transformers, a Python library for modern embedding models designed for semantic search, which we could fine-tune for our specific use case. The AI landscape has evolved significantly since 2017, making it much easier to build specialized models. We used GPT-4 to batch-generate millions of domain-specific training examples, then fine-tuned these newer models to understand domains the way domainers do.
The implementation required five steps: generate training data, calculate similarities, structure for training, fine-tune the model, and optimize for production speed.
Step 1: Using GPT-4 to Generate Domain Descriptions
We discovered that GPT-4 excels at expanding single words into rich contextual descriptions. Given just a domain name, it could infer potential use cases, related industries, and semantic connections.
Domain | GPT-4 Generated Description | Keywords Extracted |
---|---|---|
torontopancakes.com | A delightful name that combines the charm of Toronto with the love for pancakes, perfect for breakfast lovers and city explorers. | food, breakfast, pancakes, maple syrup, Canada, Toronto, brunch, treats, sweet, delicious, local, community |
matchupitchutours.com | A travel-friendly name inspired by Machu Picchu, offering guided tours and unforgettable adventures in the Andes. | travel, tours, adventure, Peru, Machu Picchu, history, culture, nature, explore, hiking, mountains, guide |
frankenbergerschokolade.com | A rich, indulgent name blending the charm of German heritage with the sweetness of chocolate, perfect for gourmet treats and confections. | chocolate, gourmet, sweets, German, heritage, desserts, artisan, cocoa, indulgence, handmade, treats, premium |
Generating descriptions for a portion of our inventory provided us with labeled training data that captured domain semantics.
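Here is a simplified sketch of this generation step with the OpenAI Python client; the prompt wording and output handling are illustrative, not our exact batch pipeline:

```python
# Sketch: ask GPT-4 for a short description and keywords for a domain name.
# Prompt wording and output parsing are simplified for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_domain(domain: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You describe domain names for a domain marketplace. "
                        "Return a one-sentence description followed by a "
                        "comma-separated keyword list."},
            {"role": "user", "content": f"Describe the domain {domain}."},
        ],
    )
    return response.choices[0].message.content

print(describe_domain("torontopancakes.com"))
```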
Step 2: Transform Descriptions into Similarity Scores
With descriptions for each domain, we could measure semantic similarity using existing embedding models. This gave us ground truth data about which domains should be considered related:
Domain Pair | Similarity Score | Relationship Type |
---|---|---|
google.com ↔ searchengine.com | 0.68588746 | Same category |
cloudeous.com ↔ cloudivio.com | 0.82812774 | Similar concept |
omnigenics.com ↔ omniparent.com | 0.66224474 | Shared prefix only |
xaute.com ↔ paveup.com | 0.50881153 | Unrelated |
These similarity scores became our training targets. They told our model which domains should be close or far in the embedding space.
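The scoring itself is a cosine similarity between description embeddings. A minimal sketch with sentence-transformers, assuming the descriptions from Step 1 and an off-the-shelf model such as all-mpnet-base-v2 (the descriptions here are shortened placeholders):

```python
# Score domain pairs by the cosine similarity of their GPT-4 descriptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # off-the-shelf scoring model

descriptions = {
    "google.com": "A platform for web search and discovery.",         # placeholder
    "searchengine.com": "A service for finding information online.",  # placeholder
}

embeddings = model.encode(list(descriptions.values()), normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"google.com ~ searchengine.com: {score:.4f}")
```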
Step 3: Create Training Triplets for Fine-tuning
Triplet loss training requires data in sets of three: an anchor domain, a positive match (similar), and a negative match (dissimilar). We generated thousands of these triplets from our similarity calculations:
Anchor Domain | Positive Match (Similar) | Negative Match (Dissimilar) |
---|---|---|
google.com | searchengine.com (0.6858) | flowershop.com (0.2754) |
uber.com | rideshare.com (0.7139) | cookbook.net (0.2201) |
shopify.com | ecommerce.com (0.6064) | weatherapp.org (0.2656) |
This structure forces the model to learn relative similarities rather than absolute classifications.
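A sketch of how triplets can be assembled from the pairwise scores; the threshold values below are illustrative, not exact cut-offs:

```python
# Build (anchor, positive, negative) triplets from pairwise similarity scores.
# Thresholds are illustrative.
POSITIVE_MIN = 0.6   # pairs scoring above this count as similar
NEGATIVE_MAX = 0.3   # pairs scoring below this count as dissimilar

def build_triplets(scores: dict[tuple[str, str], float]) -> list[tuple[str, str, str]]:
    triplets = []
    anchors = {a for a, _ in scores}
    for anchor in anchors:
        positives = [b for (a, b), s in scores.items() if a == anchor and s >= POSITIVE_MIN]
        negatives = [b for (a, b), s in scores.items() if a == anchor and s <= NEGATIVE_MAX]
        for pos in positives:
            for neg in negatives:
                triplets.append((anchor, pos, neg))
    return triplets

scores = {
    ("google.com", "searchengine.com"): 0.6858,
    ("google.com", "flowershop.com"): 0.2754,
}
print(build_triplets(scores))  # [('google.com', 'searchengine.com', 'flowershop.com')]
```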
Step 4: Train Our Domain-Specific Embedding Model
We selected all-MiniLM-L6-v2 as our base model, with all-mpnet-base-v2 serving as the initial embedding provider. It balances performance with size. Using triplet loss, we trained it to mimic GPT-4's semantic understanding:
Training Metric | Value |
---|---|
Base model | all-MiniLM-L6-v2 |
Training samples | 500,000 |
Epochs | 25 |
Final correlation with GPT-4 | ~0.87 |
The 0.87 correlation meant our lightweight model captured most of GPT-4's semantic understanding while running orders of magnitude faster.
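The fine-tuning loop itself is compact with sentence-transformers. A condensed sketch, assuming triplets shaped like the Step 3 table; the hyperparameters and save path are illustrative:

```python
# Fine-tune all-MiniLM-L6-v2 on (anchor, positive, negative) triplets.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Triplets come from Step 3; two shown here for illustration.
train_examples = [
    InputExample(texts=["google.com", "searchengine.com", "flowershop.com"]),
    InputExample(texts=["uber.com", "rideshare.com", "cookbook.net"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=25,
    warmup_steps=100,  # illustrative
)
model.save("domain-embeddings-v1")  # illustrative output path
```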
Step 5: Optimize from ~500 ms to 10 ms
The first version of this new stack was implemented in Python to test our hypotheses. We used common libraries such as usearch for HNSW and SQLite to map results back to domains. This showed we were heading in the right direction, with most queries coming back in roughly 500 ms. But training a good model only solved half the problem: we still needed sub-50 ms latency!
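A rough sketch of that Python prototype, assuming the fine-tuned model from Step 4; the model path, database schema, and candidate list are illustrative:

```python
# Prototype: fine-tuned embeddings + usearch HNSW index + SQLite id -> domain mapping.
import sqlite3
from sentence_transformers import SentenceTransformer
from usearch.index import Index

model = SentenceTransformer("domain-embeddings-v1")  # illustrative model path
db = sqlite3.connect("domains.db")                   # illustrative database
db.execute("CREATE TABLE IF NOT EXISTS domains (id INTEGER PRIMARY KEY, name TEXT)")

names = ["domaininstant.com", "fastdomainsearch.com", "searchdomain.com"]
vectors = model.encode(names, normalize_embeddings=True)

index = Index(ndim=vectors.shape[1], metric="cos")  # HNSW index
for i, name in enumerate(names):
    db.execute("INSERT OR REPLACE INTO domains VALUES (?, ?)", (i, name))
    index.add(i, vectors[i])

query = model.encode("instantdomainsearch.com", normalize_embeddings=True)
matches = index.search(query, 3)
for key in matches.keys:
    print(db.execute("SELECT name FROM domains WHERE id = ?", (int(key),)).fetchone()[0])
```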
Most of our tech stack is already written in Rust. So we decided to move these newer models into Rust as well. To avoid some hefty dependencies and because Torch is not the fastest for inference, we chose ONNX Runtime.
- Model format: ONNX (3× faster than PyTorch)
- Runtime: Rust + ort (minimal overhead, excellent CPU performance)
- Index: HNSW for a limited set of ~25M aftermarket domains
- Hardware: CPU-only (no GPU coordination overhead)
- Optimizations: quantized model exported with Hugging Face Optimum
Result: 95th-percentile latency of 10 ms, within our 25 ms target.
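The export and quantization step is short with Hugging Face Optimum. A sketch assuming the fine-tuned model from Step 4; the paths and quantization config are illustrative, and the Rust + ort service then loads the resulting ONNX file:

```python
# Export the fine-tuned model to ONNX and apply dynamic int8 quantization for CPU.
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export to ONNX (model path is illustrative).
ort_model = ORTModelForFeatureExtraction.from_pretrained("domain-embeddings-v1", export=True)
ort_model.save_pretrained("domain-embeddings-onnx")

# Dynamic quantization tuned for modern x86 CPUs (config is illustrative).
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="domain-embeddings-quantized", quantization_config=qconfig)
```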
AI in Development vs. Production
"AI‑powered" often becomes shorthand for integrating an LLM into every request path, but placement is a design choice with distinct trade‑offs.
- LLM in development (offline, distilled into a smaller model): less context at request time, but much faster responses
- LLM in production (in the live request path): richer context, but slower responses
Our instant search proves how far the first approach can go: we distilled GPT‑4's trillion‑parameter knowledge into a 22.7M‑parameter embedding model, returning semantic results in <10 ms (ignoring the distance between the user and our servers).
On the other hand, our Business Name Generator shows where a live LLM pulls ahead: brand ideation needs fresh, expansive reasoning, so we trade speed for depth and deliver suggestions that truly need trillion+ parameter context.
This isn't to say LLMs in development are better than LLMs in production. It's about matching context depth to latency cost.
We are planning to roll out our new model over the next month.