
How To Use fastText For Instant Translations


We are always exploring ways to generate great domain names. Almost every English word has a .com domain associated with it, and many popular word pairs—like HelloWorld.com—are also taken. One thing you can try is to translate a word that is important to your business into other languages. For example, fast translates to vite in French, rápida or rápido in Spanish, or veloce in Italian. Though these are not English words, the languages share roots, and they might work for your business. Google Translate is an incredible product, but it’s tedious to use its interface to see what a word looks like in other languages. We set out to build a tool that can translate an English word into several languages—instantly.

The study of machine translation dates back to the 1950s and is still evolving. We wanted to translate from English to several languages instantly, on commodity hardware, without GPUs or other expensive equipment. Domain names should be short, so we focused on individual words instead of phrases. We have some experience using word vectors to help find domains for sale, and saw that Facebook Research published a dataset that aligns word vectors from one language to another.

Wait, what is a word vector?

Word vectors are generated by a neural network that learns how words are related from a large body of text, like a web crawl or Wikipedia. Allison Parrish, an assistant professor at NYU, shared one of the best interactive walkthroughs of word vectors I’ve seen. Go check it out. Allison uses colors to illustrate how a word might be mapped to numeric values for Red, Green, and Blue. The PNG and JPG standards, for example, each support 24-bit color: 0 to 255 for each of R, G, and B. Red is represented as 255,0,0, green as 0,255,0, and blue as 0,0,255. You learn to recognize something like 250,0,0 in code as a very red color.
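
To make the idea concrete, here is a toy sketch in Python. The values are made up for illustration; real word vectors are learned from text rather than assigned by hand.

# Toy example: map color words to (R, G, B) coordinates.
# These values are illustrative, not real word vectors.
colors = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "brick": (250, 20, 20),  # numerically close to "red"
}
# Words with similar meanings get similar coordinates, so "brick"
# sits much nearer to "red" than to "blue".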

fastText

In 2016, Facebook Research released fastText along with a number of pre-trained models that map millions of words to a numeric representation. These models map each word to a vector of 300 32-bit floats. This captures a lot of detail and nuance, but is much harder for a human to reason about. The fastText word vector for the word red looks like:

red -0.0331 -0.0465 0.1046 0.0201 0.0665 -0.0394 0.0515 -0.0598 0.0905 0.0738 0.0871 0.0062 0.0002 -0.0135 0.1012 -0.0092 -0.1063 -0.0967 0.0297 0.0790 -0.0429 -0.0470 -0.0926 -0.0227 -0.0240 -0.0768 0.0174 -0.0628 -0.0714 0.0413 0.0072 0.0746 0.0332 0.0780 0.0248 0.0083 -0.0807 -0.0272 0.0805 -0.0736 -0.0323 -0.0140 0.0081 0.0639 -0.0186 -0.0961 0.0240 -0.0159 0.0252 0.0425 0.0403 -0.1151 -0.0582 0.0228 0.0503 0.0262 0.0092 -0.0314 -0.0031 0.0238 0.0023 0.0231 0.1031 0.0147 0.0032 0.0197 -0.0749 0.0452 -0.0060 0.0173 -0.0828 0.0347 0.0330 -0.0970 0.0665 -0.0090 -0.0148 -0.0379 -0.0735 -0.0456 0.0362 -0.0038 -0.0989 0.0229 -0.0710 0.0076 -0.0314 0.0331 0.0470 -0.0968 -0.0182 0.0503 -0.0603 0.0900 0.0617 0.0198 0.0360 0.0885 -0.0665 0.0382 0.0162 -0.0352 -0.0643 0.0298 -0.0647 -0.0815 0.0507 0.0307 -0.0312 -0.0265 -0.0255 -0.0556 0.0302 0.0085 -0.0142 0.0116 0.0497 -0.0091 -0.0327 -0.0533 0.0853 -0.0028 0.0138 0.0235 0.0288 0.0766 -0.0008 0.0410 -0.0574 0.0001 0.0378 0.0842 0.0237 0.0557 -0.0578 -0.0145 -0.0006 -0.1553 -0.0657 0.0826 -0.0335 0.1468 0.0287 -0.0240 -0.0060 0.1243 -0.0685 -0.0024 -0.0419 0.0122 0.0002 -0.1673 -0.1169 -0.0371 -0.0072 -0.0133 -0.0355 0.0781 0.0487 -0.0785 0.1488 0.0351 -0.1184 -0.0185 0.0348 0.0116 -0.0598 -0.0082 0.1296 -0.0158 -0.0234 -0.0796 -0.0322 -0.0004 -0.0170 0.0290 -0.0135 -0.0658 0.0224 0.0262 -0.0747 -0.0174 -0.0673 0.0018 -0.0009 -0.0170 -0.0229 0.0128 0.0414 0.0009 0.0807 -0.0990 0.1185 0.0776 -0.1242 -0.0860 -0.0464 0.0127 -0.0994 0.0284 0.0295 -0.0607 0.0268 0.0738 0.0820 -0.0623 -0.1275 -0.0181 -0.0645 0.0423 -0.0196 0.0610 -0.0459 -0.0614 0.1134 0.0480 -0.0723 -0.0421 -0.0073 -0.0136 -0.0843 -0.0286 -0.0247 0.0456 -0.0090 -0.0546 -0.0464 0.0170 0.0580 -0.0434 0.0340 0.0199 0.0258 -0.0641 0.0110 -0.1129 0.0479 -0.0298 -0.0738 0.0475 -0.0210 0.0199 -0.0134 -0.0297 -0.0400 0.0186 0.0519 0.0505 0.0018 -0.0292 0.0482 0.0071 -0.0222 -0.0302 0.0711 -0.0198 0.0230 -0.0573 0.1053 0.0609 0.0517 0.0693 -0.0668 -0.0047 -0.0557 -0.0430 -0.0130 0.0693 -0.0305 -0.1101 -0.0303 -0.0511 -0.0628 0.0036 0.0101 -0.0206 0.1078 0.0520 -0.0476 0.0408 -0.0027 -0.0753 0.0087 0.0203 0.1821 -0.0566 0.0721 0.0880 -0.0955 -0.1142 -0.0118 -0.0209 0.0230 0.0313 -0.0339 -0.0700 0.0841 -0.0484 -0.0148 -0.0190
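
To get a feel for the format, here is a minimal sketch that reads the first word vector from a locally downloaded copy of the English aligned vectors (the filename matches the download URLs used below):

# Minimal sketch: parse the first word vector in a fastText .vec file.
# Assumes wiki.en.align.vec has been downloaded to the working directory.
with open("wiki.en.align.vec", encoding="utf-8") as f:
    f.readline()  # the first line is a header: word count and dimension
    tokens = f.readline().rstrip().split(" ")
    word, vector = tokens[0], [float(x) for x in tokens[1:]]
    print(word, len(vector))  # the vector has 300 components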

Aligned word vectors

Later, Facebook Research released aligned word vectors for 44 languages. These act like a Rosetta Stone and provide a way to organize words from all of these languages in the same vector space. To bring it back to three dimensions, we can picture how red in English and rouge in French would land on nearby points in the same space:

[Figure: the RGB color space depicted as a cube.]

Since we are looking at one word without context, the translation may be ambiguous in languages with feminine and masculine forms. In practice, the vectors will not align perfectly with one another, so we need to consider approximate matches. Since we are working in vector space, we can measure the Euclidean distance from a word like red to its closest matches in other languages.
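
Here is a minimal sketch of that idea, using made-up three-dimensional stand-ins for the real 300-dimensional aligned vectors:

import math

def euclidean(a, b):
    # Straight-line distance between two points in vector space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up 3-dimensional stand-ins for real aligned vectors
red_en = [0.92, 0.10, 0.05]
rouge_fr = [0.90, 0.12, 0.07]
bleu_fr = [0.05, 0.08, 0.95]

print(euclidean(red_en, rouge_fr))  # small distance: likely a translation
print(euclidean(red_en, bleu_fr))   # large distance: unrelated word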

Instant Distance

Computers are fast, and can brute-force their way through millions of Euclidean distance calculations relatively quickly. But not instantly. We need a way to build an index of these points to find a point’s neighbors in space. Many machine learning or artificial intelligence techniques depend on navigating vector spaces efficiently, and there are a variety of approaches and implementations available. We wanted a pure Rust implementation, but did not find any that were production ready. So we built and released instant-distance, a fast and pure-Rust implementation of Hierarchical Navigable Small World graphs with Python bindings.
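
Here is a minimal sketch of the Python bindings, using the same calls as the full example below; the point count, dimension, and default Config are arbitrary choices for illustration:

import random
import instant_distance

# Build a small index over random 300-dimensional points,
# each labeled with a string value
points = [[random.random() for _ in range(300)] for _ in range(1024)]
values = [f"point-{i}" for i in range(1024)]
hnsw = instant_distance.HnswMap.build(points, values, instant_distance.Config())

# Querying fills a Search object; iterating it yields
# neighbors ordered by distance
search = instant_distance.Search()
hnsw.search([random.random() for _ in range(300)], search)
for result in list(search)[:3]:
    print(result.value)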

Putting it together

To build a simple translation tool, we will start by downloading the word vector data published by fastText. Then, we’ll index the word vectors with Instant Distance. Once the index is finished building, we store the resulting dataset on the filesystem alongside a mapping from word to vector in the form of a JSON file.

LANGS = ("en", "fr", "it")
LANG_REPLACE = "$$lang"
WORD_MAP_PATH = f"./data/{'_'.join(LANGS)}.json"
BUILT_IDX_PATH = f"./data/{'_'.join(LANGS)}.idx"
DL_TEMPLATE = f"https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.{LANG_REPLACE}.align.vec"
points = []
values = []
word_map = {}
async with aiohttp.ClientSession() as session:
for lang in LANGS:
# Construct a url for each language
url = DL_TEMPLATE.replace(LANG_REPLACE, lang)
# Ensure the directory and files exist
os.makedirs(os.path.dirname(BUILT_IDX_PATH), exist_ok=True)
lineno = 0
async with session.get(url) as resp:
while True:
lineno += 1
line = await resp.content.readline()
if not line:
# EOF
break
linestr = line.decode("utf-8")
tokens = linestr.split(" ")
# The first token is the word and the rest
# are the embedding
value = tokens[0]
embedding = [float(p) for p in tokens[1:]]
# We only go from english to the other two languages
if lang == "en":
word_map[value] = embedding
else:
# Don't index words that exist in english
# to improve the quality of the results.
if value in word_map:
continue
# We track values here to build the instant-distance index
# Every value is prepended with 2 character language code.
# This allows us to determine language output later.
values.append(lang + value)
points.append(embedding)
# Build the instant-distance index and dump it out to a file with .idx suffix
print("Building index... (this will take a while)")
hnsw = instant_distance.HnswMap.build(points, values, instant_distance.Config())
hnsw.dump(BUILT_IDX_PATH)
# Store the mapping from string to embedding in a .json file
with open(WORD_MAP_PATH, "w") as f:
json.dump(word_map, f)

Finally, using these tools we can convert an input word into its word vector and use Instant Distance to find its nearest neighbors. Since the word vectors have all been aligned, the closest word vectors in other languages should be very similar, if not direct translations.

import json

import instant_distance

# WORD_MAP_PATH and BUILT_IDX_PATH are the paths defined above
word = "hello"  # the word to translate

with open(WORD_MAP_PATH, "r") as f:
    word_map = json.load(f)

# Get an embedding for the given word
embedding = word_map.get(word)
if not embedding:
    print(f"Word not recognized: {word}")
    exit(1)

hnsw = instant_distance.HnswMap.load(BUILT_IDX_PATH)
search = instant_distance.Search()
hnsw.search(embedding, search)

# Print the results. We know that the first two characters of the
# value are the language code from when we built the index.
for result in list(search)[:10]:
    print(f"Language: {result.value[:2]}, Translation: {result.value[2:]}")

For example, here are the results of translating the English word "hello":

Language: fr, Translation: bonjours
Language: fr, Translation: bonsoir
Language: fr, Translation: salutations
Language: it, Translation: buongiorno
Language: it, Translation: buonanotte
Language: fr, Translation: rebonjour
Language: it, Translation: auguri
Language: fr, Translation: bonjour,
Language: it, Translation: buonasera
Language: it, Translation: chiamatemi

Try it out!

You can check out the full example over at Instant Distance on GitHub. If you have any questions, please feel free to open an issue on GitHub!