Instant Word Segmentation With Rust
To suggest and generate new domain name ideas, we need to be able to see what words make up a domain name. Word segmentation is a natural language processing technique usually used to find words in strings written in languages with no natural word boundary, like Chinese or Japanese. We use this technique to find the words “Instant Domain Search” in a domain like instantdomainsearch.com. The Python wordsegment library solves this problem, but we needed to port it to Rust to make it fast enough for use on Instant Domain Search.
Wordsegment is based on Peter Norvig's word segmentation example in Beautiful Data. The core idea is to iterate over all possible segmentations of a string and score each one by the probability of its appearing in real-world text. To do this we need two components:
- A way to estimate the probability of a sentence occurring in the real world.
- An efficient way to score all possible segmentations of the input string.
Let's discuss how both of these work.
Scoring a sentence
The simplest way of scoring segmentations is to score single words based on the chance they appear in real-world English language data; Norvig used a Google dataset called the Google Trillion Web Corpus (2006). The Web 1T collection is publicly available for research and educational purposes only, and this is what I initially used for our port.
To get the score for a single word appearing, you divide the number of matches of that word found in the corpus by the total number of word occurrences in the corpus. For example, the word “instant” occurred in the Web 1T collection some 28 million times. The corpus we’re using contains some 588 billion words in total (since it doesn’t contain the long tail of words that appear infrequently), so we’ll score the probability of “instant” appearing at 0.000049. We also need to score words that don’t appear in the corpus; for details, read Norvig’s chapter.
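To make this concrete, here is a minimal Rust sketch of both scores. The counts used below are the approximate figures quoted above, and the unseen-word penalty follows the shape of Norvig's heuristic; instant-segment's exact smoothing may differ.

```rust
/// Log10 score of a single word: its corpus count divided by the total
/// number of tokens in the corpus.
fn unigram_score(count: f64, total: f64) -> f64 {
    (count / total).log10()
}

/// Penalty for a word that never appears in the corpus, in the shape of
/// Norvig's heuristic: longer unknown words score progressively worse.
fn unseen_score(word_len: u32, total: f64) -> f64 {
    (10.0 / (total * 10f64.powi(word_len as i32))).log10()
}
```

With roughly 28 million occurrences of "instant" out of 588 billion tokens, `unigram_score` lands near -4.3 (a probability of about 0.00005), and an 11-letter unseen string like "thisisatest" scores around -22.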
The next step is scoring a sequence of words. We could simply multiply the probabilities of all the words in the given input sentence, but multiplying a long series of small numbers loses precision and can underflow to zero. Instead, we take the base-10 logarithm of each value and sum the results. For “this is a test”, this works out to -2.3 + -2.1 + -1.8 + -3.6 = -9.8. In contrast, applying the formula for words that are not found in the corpus gives us -22 for “thisisatest”.
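In code, the sequence score is just a sum over per-word log probabilities; the values in the example are the illustrative scores from the text, not real corpus numbers.

```rust
/// Score a candidate segmentation by summing the per-word log10
/// probabilities. Summing logs is equivalent to multiplying the raw
/// probabilities, but keeps the values in a comfortable numeric range.
fn sequence_score(word_log_probs: &[f64]) -> f64 {
    word_log_probs.iter().sum()
}
```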
To further refine the results, we (like Norvig) also consider bigrams: the conditional probability that two words occur next to each other. For each word not at the start of the input string, instead of considering the simple probability of the word appearing, we consider the probability of a word appearing after the previous word. With the original corpus, we’ll score “is” after “this” at -3.2 rather than using -2.1, allowing us to use contextual information to refine our scores.
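A sketch of that lookup, with tiny illustrative score tables (the real library derives these from the corpus counts):

```rust
use std::collections::HashMap;

/// Log10 score for `word` given the previous word, if any. When we have a
/// bigram score for (prev, word), use the conditional probability;
/// otherwise fall back to the word's unigram score.
fn context_score(
    word: &str,
    prev: Option<&str>,
    unigrams: &HashMap<&str, f64>,
    bigrams: &HashMap<(&str, &str), f64>,
) -> f64 {
    prev.and_then(|p| bigrams.get(&(p, word)))
        .copied()
        .unwrap_or_else(|| unigrams[word])
}
```

With the scores from the example above, "is" after "this" would draw the bigram score of -3.2, while "is" at the start of the input would fall back to its unigram score of -2.1.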
For short strings we want to enumerate all possible segmentations to find the optimally scored separation. However, since the number of segmentations grows exponentially with the length of the input string, this becomes infeasible for longer strings. At the same time, very long words are increasingly unlikely, so our first step is to cap the maximum word length at 24 characters. We then split the input string at each of the first 24 character boundaries and add the score for the first word to the score for the remaining words.
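The candidate splits of the first word can be enumerated like this (a sketch; byte indexing is safe here because the library only accepts ASCII input):

```rust
const MAX_WORD_LEN: usize = 24;

/// All (first_word, rest) splits of the input, with the first word capped
/// at the maximum word length. The segmentation algorithm scores the first
/// word and combines it with the best score for the remaining string.
fn splits(input: &str) -> Vec<(&str, &str)> {
    (1..=input.len().min(MAX_WORD_LEN))
        .map(|i| input.split_at(i))
        .collect()
}
```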
While Norvig used recursion and we used that approach in our initial versions of the Rust port, we’ve recently changed our implementation based on Wolf Garbe’s article Fast Word Segmentation of Noisy Text. We now use a variant of the triangular matrix algorithm discussed there. In our implementation, we use a dynamically-sized array (Rust’s Vec) to track the best score at each segmentation position. For example, for “thisisatest”, the candidate for position 6 (after “is”) would have a length of 2 and the score for “this is”. Using this approach bounds the stack size, and I find the nested loops easier to reason about than the bounded recursion approach. We adapted the triangular matrix approach to allow the use of bigram scores.
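Here is a simplified sketch of that iterative approach, using unigram scores only (the real implementation also folds in bigram scores and a penalty for unseen words). `best[i]` records the best score for the first `i` bytes of the input along with the length of the last word in that segmentation, so the winning split can be recovered by walking backwards from the end.

```rust
use std::collections::HashMap;

const MAX_WORD_LEN: usize = 24;

/// Simplified iterative segmentation. Byte indexing is safe because the
/// input is ASCII-only. The score table is an illustrative stand-in for
/// real corpus-derived scores.
fn segment<'a>(input: &'a str, scores: &HashMap<&str, f64>) -> Vec<&'a str> {
    // best[i]: (best score for input[..i], length of the last word)
    let mut best: Vec<Option<(f64, usize)>> = vec![None; input.len() + 1];
    best[0] = Some((0.0, 0));
    for end in 1..=input.len() {
        for start in end.saturating_sub(MAX_WORD_LEN)..end {
            let Some((prefix, _)) = best[start] else { continue };
            let Some(&word) = scores.get(&input[start..end]) else { continue };
            let candidate = prefix + word;
            if best[end].map_or(true, |(score, _)| candidate > score) {
                best[end] = Some((candidate, end - start));
            }
        }
    }
    // Walk back from the end, collecting the words of the best segmentation.
    let mut words = Vec::new();
    let mut pos = input.len();
    while pos > 0 {
        let (_, len) = best[pos].expect("no segmentation found");
        words.push(&input[pos - len..pos]);
        pos -= len;
    }
    words.reverse();
    words
}
```

With the example scores from earlier, "this is a test" (-9.8) beats the single unsegmented word "thisisatest" (-22), so the loop settles on the four-word split.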
On my machine, the Python wordsegment library is capable of segmenting “thisisatest” once every 500µs. The latest version of instant-segment can perform the same task in 4µs, 95x faster while doing the same work. Let me explain some of the techniques we used.
- The initial port was only 1.7x faster than the Python code. It used the recursive approach to enumerating segmentations, as in wordsegment and Norvig’s code, employing a hashmap to cache partial scores. For example, it’s faster to calculate scores for both “thisis a test” and “this is a test” if you can reuse the score for “a test”.
- Avoiding some unnecessary allocations in the core loop and switching to the ahash algorithm for the cache improved performance to 4.9x Python.
- Using the smartstring crate to avoid heap allocations for short strings is a good fit for this problem, since words are generally short. This improved performance to 9.0x Python.
- Inlining the core loop that yields segmentation positions got us to 11x Python.
- Avoiding repeated allocations for intermediate results helped us get to 12x Python.
- Rust strings are encoded in UTF-8, which means string slicing needs to validate string positions. However, instant-segment rejects strings containing non-ASCII input; using a sprinkling of unsafe code to bypass Unicode validation improved performance to 15x.
- Updating ahash to a newer version got us to 18x.
- Changing from f64::powf() to f64::powi() helped get us to 19x.
- Reuse of intermediate allocations (in the case where you’re doing multiple segmentations in a row) got performance to 25x faster than Python.
- And finally, the adoption of the triangular matrix approach got us to 95x.
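As one concrete illustration of the unsafe-slicing item above (a sketch, not the library's actual code): once the input has been validated as ASCII up front, every byte index is a valid character boundary, so slicing can skip the UTF-8 boundary check.

```rust
/// Slice an ASCII-only string without UTF-8 boundary checks.
///
/// Safety: sound only because the caller has already validated the input
/// as pure ASCII (as instant-segment does by rejecting anything else), so
/// every byte index falls on a character boundary.
fn ascii_slice(input: &str, start: usize, end: usize) -> &str {
    debug_assert!(input.is_ascii() && start <= end && end <= input.len());
    unsafe { input.get_unchecked(start..end) }
}
```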
We’ve open sourced this work, and the latest version of instant-segment is available on crates.io. This year, we’ve added Python bindings using the wonderful PyO3 and updated the test data set distributed as part of our repository to be based on more recent and more freely licensed data sets from Google and the SCOWL project, meaning it’s easier than ever to get started with instant-segment. We’d love to get your feedback on the project and to hear how you are using it!