Suffix tree
Idea: | store all suffixes of a string in a (compressed) trie |
---|---|
Advantages: | fast search for all occurrences of a pattern; many problems on strings turn into simple problems on trees |
Disadvantages: | very space-consuming; using a suffix array may be a better alternative |
Trie of all suffixes
The idea of a suffix tree is simple: Take all the suffixes of a string and store them in a trie.
For example, MISSISSIPPI$ has 12 (nonempty) suffixes: $, I$, PI$, PPI$, IPPI$, …, ISSISSIPPI$, MISSISSIPPI$.
If we build a trie consisting of all these strings, we get:
At the end of the string we append a special symbol $ which does not appear anywhere else in it. This ensures that every suffix ends in a leaf (rather than in an internal node).
Note that every root-to-leaf path corresponds to a single suffix, and every path starting at the root corresponds to some substring of the given string.
For example, we highlighted path "IS
" in the figure above. The subtree below it contains two leaves: 4 and 1, so there are two suffixes which start with "IS". These correspond
to the 2 occurrences of "IS
" in MISSISSIPPI
, starting at positions 1 and 4.
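To make this concrete, here is a small Python sketch of the conceptual trie described above (the function names are ours, chosen for illustration): the uncompressed trie is stored as nested dictionaries, and we list the occurrences of a substring by collecting the leaves below the node we walk to.

```python
def build_suffix_trie(s):
    """Uncompressed trie of all suffixes; s must end with a '$' that occurs nowhere else."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:-1]:
            node = node.setdefault(ch, {})
        node[s[-1]] = i          # the '$' edge leads to a leaf storing the suffix start
    return root

def leaves(node):
    """Starting positions of all suffixes in the subtree of `node`."""
    for child in node.values():
        if isinstance(child, int):
            yield child          # a leaf
        else:
            yield from leaves(child)

def occurrences(root, pattern):
    """All positions where `pattern` occurs: walk down from the root, then collect leaves."""
    node = root
    for ch in pattern:
        if not isinstance(node, dict) or ch not in node:
            return []
        node = node[ch]
    return sorted(leaves(node)) if isinstance(node, dict) else [node]

print(occurrences(build_suffix_trie("MISSISSIPPI$"), "IS"))   # [1, 4]
```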
Compact representation
What we described so far is a conceptual view of a suffix tree (not a real representation). The obvious problem is that such a trie would have \(\Theta(n^2)\) nodes – it would take \(\Theta(n^2)\) space and couldn’t be constructed in time better than \(\Omega(n^2)\). This is too much to be of any practical use.
We show, however, that it is possible to represent such a trie in only \(O(n)\) memory, and it can even be constructed in \(O(n)\) time (Weiner 1973; McCreight 1976; Ukkonen 1995; see References).
First, we contract all paths which are not branching. Whenever we see a path \(v_1-v_2-\cdots-v_k\) such that \(v_{i+1}\) is the only child of \(v_i\), we represent this path by a single edge. For our "MISSISSIPPI" example, we get:
This way, each internal node is at least binary and therefore there are only \(O(n)\) nodes. (One can prove by induction that the number of internal nodes is less than the number of leaves.) However, the edge labels still consist of \(\Theta(n^2)\) symbols in total.
What shall we do? Instead of making copies of different substrings as edge labels, we just use indices into the string \(S\). For example, instead of the label SSI, which is the substring \(S[2\ldots 5)\), we just store the indices \([2,5)\). Instead of the label SSIPPI$ \(=S[5,12)\), we store just \([5,12)\).
This way, each edge takes \(O(1)\) space and there are \(O(n)\) nodes and edges, so the trie of all suffixes can be represented in \(O(n)\) memory.
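Here is a minimal sketch of this compact representation (ours, not a production implementation): each edge stores a half-open interval \([start, end)\) into \(S\) instead of a copy of the substring. For simplicity the tree is built by the naive quadratic insertion of one suffix at a time; the linear-time constructions cited in the References are considerably more involved.

```python
class Node:
    def __init__(self, suffix_start=None):
        self.children = {}                # first character of edge label -> (start, end, child)
        self.suffix_start = suffix_start  # set on leaves: where the suffix begins

def build_suffix_tree(s):
    """Naive O(n^2) construction of the compacted tree; s must end with a unique '$'."""
    root = Node()
    for i in range(len(s)):               # insert suffix s[i:]
        node, j = root, i
        while True:
            if s[j] not in node.children:              # no edge starts with s[j]: new leaf
                node.children[s[j]] = (j, len(s), Node(suffix_start=i))
                break
            start, end, child = node.children[s[j]]
            k = 0                                       # match along the edge label
            while k < end - start and s[start + k] == s[j + k]:
                k += 1
            if k == end - start:                        # whole edge label consumed, descend
                node, j = child, j + k
                continue
            mid = Node()                                # mismatch inside the edge: split at offset k
            node.children[s[start]] = (start, start + k, mid)
            mid.children[s[start + k]] = (start + k, end, child)
            mid.children[s[j + k]] = (j + k, len(s), Node(suffix_start=i))
            break
    return root

S = "MISSISSIPPI$"
tree = build_suffix_tree(S)
for start, end, _ in sorted(tree.children.values()):
    print((start, end), S[start:end])     # the root's edges, e.g. (1, 5) for label 'ISSI'
```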
Generalized suffix tree
Similarly, if we have multiple texts (documents), we can append a unique symbol at the end of each one and create a single trie with all their suffixes. This is called a generalized suffix tree.
Here is a toy example – a suffix tree for the strings "ABAA", "BABA", and "BBAB". However, you can imagine a huge suffix tree consisting of all articles in Wikipedia (each article being a single string) or a suffix tree with the genomes of multiple species (each string being the DNA of one species).
Every leaf stores the position where its suffix starts and which string it comes from (represented here by color: red/green/blue).
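As a sketch (ours, again using the nested-dictionary trie rather than the compact representation), one way to build such a generalized structure is to give each string its own terminator token, so that every leaf remembers both the string it comes from and the starting position.

```python
def build_generalized_suffix_trie(strings):
    """Uncompressed trie of all suffixes of all strings, with one terminator per string."""
    root = {}
    for sid, s in enumerate(strings):
        end = ("$", sid)                     # unique end-of-string token for string sid
        for i in range(len(s) + 1):          # all suffixes, including the terminator alone
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
            node[end] = (sid, i)             # leaf: (which string, where the suffix starts)
    return root

trie = build_generalized_suffix_trie(["ABAA", "BABA", "BBAB"])
print(sorted(k for k in trie if isinstance(k, str)))   # ['A', 'B'] -- character edges leaving the root
```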
Applications
The suffix tree is a very powerful tool in stringology. It encodes a lot of the structure of the input text(s), and problems on strings are translated into problems on trees.
Text search
When searching for a string, we simply start at the root and trace the path spelled out by the searched string. The leaves in the subtree below the resulting node correspond to all occurrences of the searched string.
For example, let's search string "BA
" in the generalized suffix tree above. Start at the root and follow the edges "B", "A". There are 4 leaves in that subtree: red 1, green 0 and 2, and blue 1,
which corresponds exactly with the 4 occurrences in the strings "ABAA", "BA BA", and "BBAB".
If we add a pointer from every internal node to the first leaf in its subtree and connect all leaves in a linked list, we can jump directly to the leaves and list all occurrences one by one. Thus, finding all occurrences of a pattern \(P\) can be done in \(O(|P| + \#occurrences)\) time.
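A sketch of this search on the nested-dictionary trie follows (the builder is repeated from the previous sketch so the example runs on its own). Note that this version simply walks the whole subtree, so it does not achieve the \(O(|P| + \#occurrences)\) bound of the compact tree with the extra pointers.

```python
def build_generalized_suffix_trie(strings):   # same builder as in the previous sketch
    root = {}
    for sid, s in enumerate(strings):
        end = ("$", sid)
        for i in range(len(s) + 1):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
            node[end] = (sid, i)
    return root

def collect_leaves(node):
    for key, child in node.items():
        if isinstance(key, tuple):            # a terminator edge leads to a leaf
            yield child
        else:
            yield from collect_leaves(child)

def find(root, pattern):
    node = root
    for ch in pattern:                        # trace the path spelled by the pattern
        if ch not in node:
            return []
        node = node[ch]
    return sorted(collect_leaves(node))       # all occurrences as (string id, position)

trie = build_generalized_suffix_trie(["ABAA", "BABA", "BBAB"])
print(find(trie, "BA"))                       # [(0, 1), (1, 0), (1, 2), (2, 1)]
```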
n-grams
By an n-gram, we mean a sequence of n letters. For example, there are \(2^3 = 8\) different tri-grams consisting only of the letters A and B. Which of them occur in the strings "ABAA", "BABA", "BBAB"?
One way to find out is by traversing the strings while maintaining a set of tri-grams we have seen.
Alternatively, all the tri-grams are visible in the generalized suffix tree of those strings – imagine we cut the tree at depth 3. We can read off the answer: ABA, BAA, BAB, and BBA.
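Both approaches are easy to sketch (ours, for illustration): the direct scan keeps a set of tri-grams seen so far, and the read-off walks a trie of all suffixes for exactly three character levels. For the read-off we can ignore terminators and leaf labels, since only the shape of the top three levels matters.

```python
strings = ["ABAA", "BABA", "BBAB"]

# Direct scan: maintain the set of tri-grams seen so far.
print(sorted({s[i:i + 3] for s in strings for i in range(len(s) - 2)}))
# ['ABA', 'BAA', 'BAB', 'BBA']

def trie_of_suffixes(strings):
    """Bare trie of all suffixes (no terminators or leaf labels needed here)."""
    root = {}
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
    return root

def ngrams(node, n, prefix=""):
    """All strings of length n spelled by a path of n character edges from `node`."""
    if n == 0:
        yield prefix
        return
    for ch, child in node.items():
        yield from ngrams(child, n - 1, prefix + ch)

# "Cut the tree at depth 3" and read off the answer.
print(sorted(ngrams(trie_of_suffixes(strings), 3)))
# ['ABA', 'BAA', 'BAB', 'BBA']
```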
Longest substring with multiple occurrences
What is the longest substring in "MISSISSIPPI" which has at least two occurrences?
Answer: "ISSI".
How do we find out in general?
Note that before suffix trees, this used to be a long-standing open problem. There is a trivial \(O(n^3)\) algorithm and a simple \(O(n^2)\) algorithm using dynamic programming, but can we do better? Some researchers even conjectured that no linear-time algorithm could exist.
Yet, with suffix trees, the problem is dead simple: the longest substring with multiple occurrences corresponds to the deepest internal node (a node with at least two leaves in its subtree), where depth is measured in characters spelled from the root (string depth).
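As a sketch on the conceptual (uncompressed) trie, with the builder repeated from the first sketch: we look for the deepest node whose subtree contains at least two leaves. On the compact tree one would do the same using string depth.

```python
def build_suffix_trie(s):                    # same builder as in the first sketch
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:-1]:
            node = node.setdefault(ch, {})
        node[s[-1]] = i
    return root

def longest_repeated(s):
    """Deepest trie node with at least two leaves below it; its path is the answer."""
    best = [""]

    def dfs(node, path):
        count = 0                            # number of leaves in this subtree
        for ch, child in node.items():
            if isinstance(child, int):
                count += 1                   # a leaf
            else:
                count += dfs(child, path + ch)
        if count >= 2 and len(path) > len(best[0]):
            best[0] = path
        return count

    dfs(build_suffix_trie(s), "")
    return best[0]

print(longest_repeated("MISSISSIPPI$"))      # ISSI
```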
References
- E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995.
- P. Weiner. Linear pattern matching algorithms. In Proc. 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1-11, 1973.
- E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976.
- R. Giegerich and S. Kurtz. From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica, 19(3):331-353, 1997.