From 2017 to 2020, I studied computer science at a small public university in the US deep south. Though the program I attended was brand new, in an area far from America's tech hubs, it proved to be a tremendous opportunity. During that time I fell in love with algorithm design and joined my school's competitive programming team to practice the skill. In 2018, I had the joy of representing us at our regional qualifiers for the International Collegiate Programming Contest. I was proud of our team's performance, but in the end one problem stalled us for hours and kept us from placing high enough to proceed. It was a deceptively simple problem, and one that has haunted me since: given a string, what is the longest repeating substring in it? The problem is trivial to solve in O(n^2) time and space, but the competition's problem setters had clearly anticipated that, and the largest test data set included would easily exceed the available time or memory at that scale. After the competition I pored over what information I could find about the problem online and learned that linear time suffix tree construction was the generally accepted solution, but I didn't understand the implementation at a glance and, busy with pursuing my degree and later my career, I kept putting off wrapping my head around it.
It's later now, and I demand satisfaction.
Ukkonen's Algorithm
This section is not a guide to implementing Ukkonen's algorithm. I'll share the insights about the suffix tree construction problem that I had to grasp before I could implement it and know how to apply it, but I ultimately decided that describing the full implementation didn't serve the purpose of this article. For an accessible, thorough description, I recommend this stackoverflow thread. I'll be talking a lot about suffix trees: these are a compressed form of trie in which each edge is labeled with a substring, such that traversing the path of edges from the root to any leaf spells out a suffix found somewhere in the inserted text. Having this sort of traversable tree of substrings lets us infer a lot of information about the text, which I will discuss later.
It's always useful to consider a naive solution to a problem to identify its shortcomings and barriers to greater performance. First, note that inserting a suffix into the tree, given a position to insert at, can always be done in constant time by representing each edge's substring as two values: a starting character index and an ending character index. Therefore, the time to insert a suffix without being given a position is entirely tied to the time needed to find the insertion position. To construct a suffix tree, we must insert every single suffix of the string into the tree. My first instinct was to insert, for every position i, the suffix starting at i, working from the end of the text toward the beginning. But then each insert has to search through the existing edges for the longest match with the beginning of the new suffix to find its insert position, which is an O(n) operation, resulting in O(n^2) overall. We will see that, by starting from the beginning of the text instead, we can leverage some assumptions about the state of the tree that let us find every insert position in O(1) amortized time.
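To make the naive approach concrete, here is a sketch in Python (my own illustration with invented names, not code from the contest or this project). Note how attaching a new edge is O(1), just two indices into the text; all of the cost is in the walk to find where to attach it.

```python
# Naive O(n^2) suffix tree construction. Edges are stored as
# [start, end, child] with start/end indexing into the text, so creating
# an edge is O(1); the expense is the O(n) walk to the insert position.
def naive_suffix_tree(text):
    s = text + "\x00"          # unique terminator: every suffix ends at a leaf
    root = {}                  # node = dict: first char -> [start, end, child]
    for i in range(len(s)):    # insert suffix s[i:], one O(n) walk each
        node, j = root, i
        while True:
            edge = node.get(s[j])
            if edge is None:
                node[s[j]] = [j, None, {}]   # O(1): end=None means "to end of text"
                break
            start, end, child = edge
            stop = len(s) if end is None else end
            k = start
            while k < stop and s[j] == s[k]:  # match along the edge label
                j += 1
                k += 1
            if k == stop:
                node = child                  # consumed the whole edge, keep walking
            else:
                # mismatch mid-edge: split the edge at position k
                mid = {s[k]: [k, end, child], s[j]: [j, None, {}]}
                node[s[start]] = [start, k, mid]
                break
    return root, s
```

The unique terminator guarantees that no suffix is a prefix of another, so every suffix gets its own leaf.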
Ukkonen's algorithm was created by Finnish computer scientist Esko Ukkonen in 1995, over 20 years after Donald Knuth speculated that a linear time/space suffix tree construction method was impossible. Despite being a noteworthy enough accomplishment to merit a Wikipedia article, you may note that the algorithm described there is not linear time at all, and that all of the optimizations Ukkonen identified to achieve O(n) complexity are completely absent. Thankfully Ukkonen's original paper is readily available online, as are a handful of videos from researchers in the field of bioinformatics where suffix trees have found a particular niche.
The first thing to note about Ukkonen's algorithm is that many insertions can be performed at once through implicit edge ends. If we reserve a special value for an edge's end position that means "the current end of the string," then every leaf edge automatically grows as each new character arrives, and any update that does not require branching off a node can be treated as a free, implicit operation.
Second, we aren't actually inserting any complete suffixes until the end of the tree construction. Rather, at each step we are inserting the suffixes of a prefix of the text; only at the final step does this leave us with the full set of suffixes. Processing the text one character at a time, we need only perform O(1) implicit insertion operations, or insertions at the root, until a split is needed: we extend along the path of matching edges until it is no longer possible to continue matching. By keeping count of how many character matches we make, we know, when a split finally becomes necessary, that we will need to make additional splits equal to the number of characters matched before the split, and we can be sure that the paths to the edges to split already exist in the tree.
Knowing that the path to split on exists does not by itself save us the cost of finding the split point, since we might have to traverse edges up to n deep to reach it. This is where the last notable optimization in Ukkonen's algorithm comes in: whenever multiple splits are performed while resolving a mismatch in the current path (not counting adding a new suffix at the root), we draw a link, called a suffix link, from each node we split on to the next, forming a chain of links between nodes. The result is that if we need to split at a node that has already been linked, we have an O(1) hop to the next node that needs splitting. Since the count of characters matched before we start splitting is at most n over the whole construction, and we now have a way to perform each insertion, including the matching, in amortized constant time, the time complexity of the full construction is O(n).
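Since this article intentionally stops short of a full walkthrough, here is only a condensed sketch of how the three ideas above (the shared "end of string" value for leaf edges, the pending-suffix counter, and suffix links) fit together in code. This is my own compact rendering of the structure described in the stackoverflow thread mentioned earlier, not a reference implementation:

```python
class Node:
    __slots__ = ("start", "end", "children", "link")
    def __init__(self, start, end):
        self.start = start    # edge into this node is labeled s[start:end]
        self.end = end        # None = "current end of string" (implicit leaf end)
        self.children = {}
        self.link = None      # suffix link

def ukkonen(text):
    s = text + "\x00"                    # unique terminator
    root = Node(-1, -1)
    active_node, active_edge, active_length = root, 0, 0
    remainder = 0                        # pending suffixes not yet explicitly inserted
    pos = [-1]                           # shared end: all leaf edges grow for free

    def edge_len(n):
        return (pos[0] + 1 if n.end is None else n.end) - n.start

    for i, c in enumerate(s):
        pos[0] = i                       # implicitly extend every leaf edge at once
        remainder += 1
        last_internal = None             # for chaining suffix links within this step
        while remainder > 0:
            if active_length == 0:
                active_edge = i
            nxt = active_node.children.get(s[active_edge])
            if nxt is None:
                active_node.children[s[active_edge]] = Node(i, None)
                if last_internal is not None:
                    last_internal.link = active_node
                    last_internal = None
            else:
                if active_length >= edge_len(nxt):     # skip/count: hop whole edges
                    active_edge += edge_len(nxt)
                    active_length -= edge_len(nxt)
                    active_node = nxt
                    continue
                if s[nxt.start + active_length] == c:  # already in tree: implicit, stop
                    active_length += 1
                    if last_internal is not None:
                        last_internal.link = active_node
                    break
                # mismatch mid-edge: split, then chain a suffix link to the new node
                split = Node(nxt.start, nxt.start + active_length)
                active_node.children[s[active_edge]] = split
                split.children[c] = Node(i, None)
                nxt.start += active_length
                split.children[s[nxt.start]] = nxt
                if last_internal is not None:
                    last_internal.link = split
                last_internal = split
            remainder -= 1
            if active_node is root and active_length > 0:
                active_length -= 1
                active_edge = i - remainder + 1
            else:
                active_node = active_node.link or root  # O(1) hop via suffix link
    return root, s
```

The `continue` in the walk-down branch is the skip/count trick: we never compare characters inside an edge we already know is on the path.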
Great. So we have a linear time suffix tree construction. Who cares? How do we use it?
Finding and Scoring Long String Trends in Hitchhiker's Guide to the Galaxy
We'll start with the originally stated problem: what is the longest repeating substring in a text? To make sure this performs acceptably even for large blocks of text, I wanted to choose something longer than a simple passage. I chose Hitchhiker's Guide to the Galaxy: it is sufficiently long that O(n^2) complexity would noticeably suffer, it is one of the few novels I have actually read in the time since being cursed with this problem, and Douglas Adams's comedic writing style often hinges on repetition with a twist, which seemed like it would lend itself well to this analysis.
To find the longest repeating substring, we search the suffix tree for the deepest internal node that still branches into multiple additional suffixes. After scoring the depths of our nodes in O(n) time, we can at last learn that the longest repeating substring in Hitchhiker's Guide is:
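As a sketch of that search rule: the answer ends at the deepest point in the suffix structure that still has at least two suffixes (leaves) beneath it. To keep the example self-contained I build a plain uncompressed suffix trie here (quadratic, my own stand-in, not the linear-time tree from the previous section); the search itself is the same either way.

```python
# Longest repeating substring = deepest node with >= 2 leaves beneath it.
# Uses an uncompressed suffix trie (O(n^2) to build) purely for brevity.
def longest_repeat(text):
    s = text + "\x00"                  # unique terminator
    root = {}
    for i in range(len(s)):            # insert every suffix of the text
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    best = [""]
    def dfs(node, path):
        if not node:                   # empty dict: a leaf, i.e. one suffix
            return 1
        count = sum(dfs(child, path + c) for c, child in node.items())
        if count >= 2 and len(path) > len(best[0]):
            best[0] = path             # >= 2 suffixes pass through this node
        return count
    dfs(root, "")
    return best[0]
```

On the compressed tree the same walk runs in O(n), since the node count is linear and string depth falls out of the (start, end) edge labels.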
liquid that was almost, but not quite, entirely unlike tea. The
Incredible. Riveting. This sequence actually occurs twice, in rather short succession. At this point it might be fun to see what some of the other contenders are, but we run into a problem: if we simply take the top K strings in the tree, we actually end up with a bunch of substrings of our longest match. My first thought was that this could be solved by ignoring paths through nodes that are linked to from the node of a maximal string match, but that ends up ruling out legitimate potential matches as well. There's probably a graceful solution to this, but at this point in the endeavor I was getting impatient, so I settled for taking the top thousand or so contenders and filtering out substrings of already-accepted maximal strings until settling on the top 10 unique substrings. The rest of the top 10:
muttered to himself. "I beg your pardon?" said the old man
"No no Marvin," lilted Trillian, "that's just
There was a terrible ghastly silence. The
having tremendous difficulty with my lifestyle,"
Great Hyperlobic Omni-Cognate Neutron Wrangler
The Hitch Hiker's Guide to the Galaxy s
Slartibartfast coughed politely. "E
of Life, the Universe and Everything!"
I seem to be having tremendous difficulty
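The stopgap filter just described can be sketched like this (function and variable names are my own invention for illustration, not the code that produced these results):

```python
# Take a generous pool of candidates sorted longest-first and keep only
# strings that are not substrings of an already-accepted result.
def top_unique(candidates, k=10):
    accepted = []
    for cand in candidates:
        if any(cand in longer for longer in accepted):
            continue                   # substring of a maximal match: skip it
        accepted.append(cand)
        if len(accepted) == k:
            break
    return accepted
```

Note that this only removes exact substrings, so two results that merely overlap can both survive the filter.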
On one hand, I expected to actually find some longer matches. On the other, most of these make perfect sense coming off a fresh read. My source for the text of Hitchhiker was a bit odd, with some text broken up unexpectedly or containing the occasional typo. To minimize any weirdness, I stripped out all characters that weren't alphanumeric or spaces, collapsed consecutive spaces, and converted everything to lowercase. This actually had more impact on the outcome than I expected; some lines, including the title drop and the iconic question, dropped out of the top 10 entirely in favor of others (case was preserved when the results were extracted):
liquid that was almost but not quite entirely unlike tea The
muttered to himselfI beg your pardon said the old man
having tremendous difficulty with my lifestyle
Great Hyperlobic OmniCognate Neutron Wrangler
I seem to be having tremendous difficulty
made out of the rib cage of a stegosaurus
from a small planet in the vicinity of
hyperintelligent pandimensional beings
No no Marvin lilted Trillian thats just
Resistance is useless bellowed the guard
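For reference, the cleanup pass described above is only a few lines. A minimal sketch with Python's re module, assuming those three steps (strip non-alphanumerics, collapse whitespace, lowercase) are all there is to it:

```python
import re

# Normalize text before building the tree: drop everything that isn't a
# letter, digit, or whitespace, collapse whitespace runs, then lowercase.
def normalize(text):
    text = re.sub(r"[^0-9A-Za-z\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.lower()
```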
Arthur's tremendous difficulty ended up making the list twice with this edit. Unfortunately it seems my approach of filtering out overlapping strings wasn't perfect, but I suppose we could do a lot worse than letting one slip through. That would be the end of this story, but when I presented my initial findings to my friend and colleague Michael, he suggested that we could multiply the number of times a string occurs by its length to get a score of the prevalence of recurring long substrings in a text. We bounced thoughts around for a bit on how to calculate this, and settled on 2^len * occurrences^2, as this heavily weights string length first, giving some impact to the number of repeats while mitigating extreme counts to prevent over-scoring a string like "the". Calculating these scores is trivial, since the number of occurrences of a node's substring is equal to the number of leaves beneath that node. Does scoring in this way change the top 10 compared to just sorting by length? Well, if we print the top 10 items with their scores...
liquid that was almost but not quite entirely unlike tea the --- 18446744073709552000
muttered to himselfi beg your pardon said the old man --- 36028797018963970
having tremendous difficulty with my lifestyle --- 562949953421312
great hyperlobic omnicognate neutron wrangler --- 281474976710656
made out of the rib cage of a stegosaurus --- 8796093022208
hyperintelligent pandimensional beings --- 8796093022208
i seem to be having tremendous difficulty --- 8796093022208
from a small planet in the vicinity of --- 8796093022208
no no marvin lilted trillian thats just --- 4398046511104
resistance is useless bellowed the guard --- 4398046511104
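The scoring pass can be sketched in a few lines (again over an uncompressed trie stand-in of my own, for brevity; on the real tree you would score internal nodes by string depth the same way):

```python
# Score every repeated substring by 2^len * occurrences^2, where the
# occurrence count of a node's substring is the number of leaves below it.
def best_scored(text):
    s = text + "\x00"
    root = {}
    for i in range(len(s)):            # build the suffix trie
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    best = ["", 0]
    def dfs(node, path):
        if not node:
            return 1                   # leaf: one occurrence
        count = sum(dfs(child, path + c) for c, child in node.items())
        if count >= 2 and path:        # only repeated, non-empty substrings
            score = 2 ** len(path) * count ** 2
            if score > best[1]:
                best[0], best[1] = path, score
        return count
    dfs(root, "")
    return tuple(best)
```

The exponential length term is what keeps a very common short string like "the" from swamping the list.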
No matter what, the tea line continues to blow the rest out of the water, and the list remains unchanged. It is funny to see that four of the results have the exact same score, as do the last two. Looking at the lengths of these substrings, the first is a whopping 62 characters without punctuation, but after that they tend to cluster in the 40s, and it's rare for these recurring substrings to actually occur more than twice, so it ends up being very likely for two of the same length to have the same score. In the end this didn't yield any of the interesting results I had hoped for, but it is satisfying to see how a metric like this plays out in practice.
And with that, my years-long algorithm feud is done. My impression of Ukkonen's algorithm was always that it seemed straightforward but had some trick I just couldn't quite grasp; after fighting through an implementation, I've come to see it as deceptively complicated. Even with explanations that seem to convey all of its optimizations in hand, it proved very easy to go about it completely wrong without considering odd edge cases, and all told I've sunk about a week of free time into this now. I would be interested to know whether this sort of length-recurrence scoring yields different results on other data, but for now, that's my analysis of a single text. Thanks for reading, and I'll leave you with a parting insight that I only learned after doing all of this and reviewing the literature again:
If you think you need a suffix tree, you might just need a suffix array. They can be made more space efficient than suffix trees, support the same functionality, have a wider variety of algorithms for construction with different benefits and difficulty of implementation, and have almost entirely supplanted suffix tree usage in practice.
Oops. C'est la vie.
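For the curious, the suffix array version of the original problem really is short. A toy sketch (with a deliberately lazy O(n^2 log n) sort for the array itself; real constructions such as SA-IS run in linear time):

```python
# Longest repeating substring via a suffix array: sort all suffixes, then
# the answer is the longest common prefix of some adjacent pair.
def longest_repeat_sa(s):
    order = sorted(range(len(s)), key=lambda i: s[i:])  # toy suffix array
    best = ""
    for a, b in zip(order, order[1:]):
        length = 0                     # LCP of two adjacent sorted suffixes
        while (a + length < len(s) and b + length < len(s)
               and s[a + length] == s[b + length]):
            length += 1
        if length > len(best):
            best = s[a:a + length]
    return best
```

Adjacent pairs suffice because sorted order puts the two most-similar suffixes next to each other.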