How To Use fastText For Instant Translations
We are always exploring ways to generate great domain names. Almost every English word has a .com domain associated with it, and many popular word pairs—like HelloWorld.com—are also taken. One thing you can try is to translate a word that is important to your business into other languages. For example, fast translates to vite in French, rápida or rápido in Spanish, or veloce in Italian. Though these are not English words, the languages share roots, and they might work for your business. Google Translate is an incredible product, but it’s tedious to use its interface to see what a word looks like in other languages. We set out to build a tool that can translate an English word into several languages—instantly.
The study of machine translation dates back to the 1950s and is still evolving. We want to translate from English to several languages instantly, and we need to do it on commodity hardware, without GPUs or other expensive equipment. Domain names should be short, so we wanted to focus on individual words instead of phrases. We have some experience using word vectors to help find domains for sale, and saw that Facebook Research published a dataset that aligns word vectors from one language to another.
Wait, what is a word vector?
Word vectors are generated using a neural network that learns how words are related from a large body of text, like a web crawl or Wikipedia. Allison Parrish, an assistant professor at NYU, shared one of the best interactive walkthroughs of word vectors I’ve seen. Go check it out. Allison uses colors to illustrate how a word might be mapped to numeric values for Red, Green, and Blue. The PNG and JPG standards, for example, each support 24-bit color: 0 to 255 for each of R, G, and B. Red is represented as 255,0,0, green as 0,255,0, and blue as 0,0,255. You learn to recognize something like 250,0,0 in code as a very red color.
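As a toy sketch of that idea (the triples below are colors, not real word vectors), we can treat a few color words as 3-dimensional vectors and ask which named color an arbitrary triple sits closest to:

# A toy "vector space" of color words, following the RGB analogy above
colors = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
}

def closest_color(rgb):
    # Pick the named color with the smallest squared distance to rgb
    return min(colors, key=lambda name: sum((a - b) ** 2 for a, b in zip(colors[name], rgb)))

print(closest_color((250, 0, 0)))  # "red": 250,0,0 reads as a very red color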
fastText
In 2016, Facebook Research released fastText along with a number of pre-trained models that map millions of words to a numeric representation. These models map each word to a vector of 300 32-bit floats. This captures a lot of detail and nuance, but is much harder for a human to reason about. The fastText word vector for the word red looks like:
red -0.0331 -0.0465 0.1046 0.0201 0.0665 -0.0394 0.0515 -0.0598 0.0905 0.0738 0.0871 0.0062 0.0002 -0.0135 0.1012 -0.0092 -0.1063 -0.0967 0.0297 0.0790 -0.0429 -0.0470 -0.0926 -0.0227 -0.0240 -0.0768 0.0174 -0.0628 -0.0714 0.0413 0.0072 0.0746 0.0332 0.0780 0.0248 0.0083 -0.0807 -0.0272 0.0805 -0.0736 -0.0323 -0.0140 0.0081 0.0639 -0.0186 -0.0961 0.0240 -0.0159 0.0252 0.0425 0.0403 -0.1151 -0.0582 0.0228 0.0503 0.0262 0.0092 -0.0314 -0.0031 0.0238 0.0023 0.0231 0.1031 0.0147 0.0032 0.0197 -0.0749 0.0452 -0.0060 0.0173 -0.0828 0.0347 0.0330 -0.0970 0.0665 -0.0090 -0.0148 -0.0379 -0.0735 -0.0456 0.0362 -0.0038 -0.0989 0.0229 -0.0710 0.0076 -0.0314 0.0331 0.0470 -0.0968 -0.0182 0.0503 -0.0603 0.0900 0.0617 0.0198 0.0360 0.0885 -0.0665 0.0382 0.0162 -0.0352 -0.0643 0.0298 -0.0647 -0.0815 0.0507 0.0307 -0.0312 -0.0265 -0.0255 -0.0556 0.0302 0.0085 -0.0142 0.0116 0.0497 -0.0091 -0.0327 -0.0533 0.0853 -0.0028 0.0138 0.0235 0.0288 0.0766 -0.0008 0.0410 -0.0574 0.0001 0.0378 0.0842 0.0237 0.0557 -0.0578 -0.0145 -0.0006 -0.1553 -0.0657 0.0826 -0.0335 0.1468 0.0287 -0.0240 -0.0060 0.1243 -0.0685 -0.0024 -0.0419 0.0122 0.0002 -0.1673 -0.1169 -0.0371 -0.0072 -0.0133 -0.0355 0.0781 0.0487 -0.0785 0.1488 0.0351 -0.1184 -0.0185 0.0348 0.0116 -0.0598 -0.0082 0.1296 -0.0158 -0.0234 -0.0796 -0.0322 -0.0004 -0.0170 0.0290 -0.0135 -0.0658 0.0224 0.0262 -0.0747 -0.0174 -0.0673 0.0018 -0.0009 -0.0170 -0.0229 0.0128 0.0414 0.0009 0.0807 -0.0990 0.1185 0.0776 -0.1242 -0.0860 -0.0464 0.0127 -0.0994 0.0284 0.0295 -0.0607 0.0268 0.0738 0.0820 -0.0623 -0.1275 -0.0181 -0.0645 0.0423 -0.0196 0.0610 -0.0459 -0.0614 0.1134 0.0480 -0.0723 -0.0421 -0.0073 -0.0136 -0.0843 -0.0286 -0.0247 0.0456 -0.0090 -0.0546 -0.0464 0.0170 0.0580 -0.0434 0.0340 0.0199 0.0258 -0.0641 0.0110 -0.1129 0.0479 -0.0298 -0.0738 0.0475 -0.0210 0.0199 -0.0134 -0.0297 -0.0400 0.0186 0.0519 0.0505 0.0018 -0.0292 0.0482 0.0071 -0.0222 -0.0302 0.0711 -0.0198 0.0230 -0.0573 0.1053 0.0609 0.0517 0.0693 -0.0668 -0.0047 -0.0557 -0.0430 -0.0130 0.0693 -0.0305 -0.1101 -0.0303 -0.0511 -0.0628 0.0036 0.0101 -0.0206 0.1078 0.0520 -0.0476 0.0408 -0.0027 -0.0753 0.0087 0.0203 0.1821 -0.0566 0.0721 0.0880 -0.0955 -0.1142 -0.0118 -0.0209 0.0230 0.0313 -0.0339 -0.0700 0.0841 -0.0484 -0.0148 -0.0190
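Each line of the published .vec files has the same shape as the sample above: the word, then its vector components, all separated by spaces. A minimal parser for one line might look like this (the truncated three-component input is just for illustration; real lines carry 300 components):

def parse_vec_line(line):
    # Split "red -0.0331 -0.0465 ..." into the word and its components
    tokens = line.rstrip().split(" ")
    word = tokens[0]
    embedding = [float(t) for t in tokens[1:]]
    return word, embedding

word, embedding = parse_vec_line("red -0.0331 -0.0465 0.1046")
print(word, embedding)  # red [-0.0331, -0.0465, 0.1046]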
Aligned word vectors
Later, Facebook Research released aligned word vectors for 44 languages. These act like a Rosetta Stone, providing a way to organize words from all of these languages in the same vector space. To bring it back to three dimensions, imagine how red in English, rouge in French, and rosso in Italian would all sit close together in that shared space.
Since we are looking at one word without context, the translation may be ambiguous in languages with feminine and masculine forms. And in practice, the vectors will never align perfectly across languages, so we need to consider approximate matches. Since we are working in vector space, we can measure the Euclidean distance from a word like red to its closest matches in the other languages.
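For example, the aligned vector for the English red should sit very close to the vector for the French rouge. The numbers below are hypothetical four-component stand-ins for the real 300-dimensional vectors, but they show the comparison we want to make:

import math

def euclidean(a, b):
    # Square root of the summed squared component differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 4-component stand-ins for 300-dimensional aligned vectors
red_en = [-0.0331, -0.0465, 0.1046, 0.0201]
rouge_fr = [-0.0310, -0.0480, 0.1010, 0.0225]
vert_fr = [0.0830, 0.0120, -0.0440, 0.0905]

print(euclidean(red_en, rouge_fr))  # small distance: a likely translation
print(euclidean(red_en, vert_fr))   # larger distance: a different word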
Instant Distance
Computers are fast and can brute-force their way through millions of Euclidean distance calculations relatively quickly. But not instantly. We need an index over these points that lets us find a point’s neighbors in space without scanning the whole dataset. Many machine learning and artificial intelligence techniques depend on navigating vector spaces efficiently, and there are a variety of approaches and implementations available. We wanted a pure Rust implementation, but did not find any that were production ready. So we built and released instant-distance, a fast, pure-Rust implementation of Hierarchical Navigable Small World (HNSW) graphs, with Python bindings.
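To see what such an index buys us, here is the brute-force baseline it replaces: one distance computation per stored vector. This sketch uses randomly generated stand-in data rather than real word vectors; an HNSW index like instant-distance navigates a small-world graph instead and skips the vast majority of these comparisons:

import math
import random

# Stand-in data: 10,000 random 300-dimensional vectors. Real aligned
# fastText vocabularies run into the millions of words.
DIMENSIONS = 300
dataset = [[random.random() for _ in range(DIMENSIONS)] for _ in range(10_000)]
query = [random.random() for _ in range(DIMENSIONS)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Brute force: compute the distance to every stored vector and keep the
# smallest. Fine at this scale, far too slow to be "instant" over millions.
nearest = min(range(len(dataset)), key=lambda i: euclidean(query, dataset[i]))
print(nearest, euclidean(query, dataset[nearest]))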
Putting it together
To build a simple translation tool, we start by downloading the word vector data published by fastText. Then we index the word vectors with Instant Distance. Once the index is built, we store it on the filesystem alongside a JSON file that maps each English word to its vector.
LANGS = ("en", "fr", "it")LANG_REPLACE = "$$lang"WORD_MAP_PATH = f"./data/{'_'.join(LANGS)}.json"BUILT_IDX_PATH = f"./data/{'_'.join(LANGS)}.idx"DL_TEMPLATE = f"https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.{LANG_REPLACE}.align.vec"points = []values = []word_map = {}async with aiohttp.ClientSession() as session:for lang in LANGS:# Construct a url for each languageurl = DL_TEMPLATE.replace(LANG_REPLACE, lang)# Ensure the directory and files existos.makedirs(os.path.dirname(BUILT_IDX_PATH), exist_ok=True)lineno = 0async with session.get(url) as resp:while True:lineno += 1line = await resp.content.readline()if not line:# EOFbreaklinestr = line.decode("utf-8")tokens = linestr.split(" ")# The first token is the word and the rest# are the embeddingvalue = tokens[0]embedding = [float(p) for p in tokens[1:]]# We only go from english to the other two languagesif lang == "en":word_map[value] = embeddingelse:# Don't index words that exist in english# to improve the quality of the results.if value in word_map:continue# We track values here to build the instant-distance index# Every value is prepended with 2 character language code.# This allows us to determine language output later.values.append(lang + value)points.append(embedding)# Build the instant-distance index and dump it out to a file with .idx suffixprint("Building index... (this will take a while)")hnsw = instant_distance.HnswMap.build(points, values, instant_distance.Config())hnsw.dump(BUILT_IDX_PATH)# Store the mapping from string to embedding in a .json filewith open(WORD_MAP_PATH, "w") as f:json.dump(word_map, f)
Finally, using these tools, we can convert an input word into its word vector and use Instant Distance to find the input's nearest neighbors. Since the word vectors are aligned across languages, the closest vectors in other languages should be very similar in meaning, if not direct translations.
with open(WORD_MAP_PATH, "r") as f:word_map = json.load(f)# Get an embedding for the given wordembedding = word_map.get(word)if not embedding:print(f"Word not recognized: {word}")exit(1)hnsw = instant_distance.HnswMap.load(BUILT_IDX_PATH)search = instant_distance.Search()hnsw.search(embedding, search)# Print the resultsfor result in list(search)[:10]:# We know that the first two characters of the value is the language code# from when we built the index.print(f"Language: {result.value[:2]}, Translation: {result.value[2:]}")
For example, here are the results of translating the English word "hello":
Language: fr, Translation: bonjours
Language: fr, Translation: bonsoir
Language: fr, Translation: salutations
Language: it, Translation: buongiorno
Language: it, Translation: buonanotte
Language: fr, Translation: rebonjour
Language: it, Translation: auguri
Language: fr, Translation: bonjour,
Language: it, Translation: buonasera
Language: it, Translation: chiamatemi
Try it out!
You can check out the full example in the Instant Distance repository on GitHub. If you have any questions, please feel free to open an issue!