*Note : Suffix arrays are very much different than suffix trees as far as implementation and space usage is concerned. However They both serves the same functions. Suffix array implementation takes less memory and have less time complexity in building as compared to suffix trees.*

Suffix array : It is an array which consists of indices of all the suffixes of a string in sorted order. e.g. for string mayank, suffix array will consist of elements :

| 3 | 1 | 5 | 0 | 4 | 2 |

it says that suffix starting from position 3 (0 -indexing) is smallest, then comes suffix starting from index 1 and so on. It can be seen as :

0 -> ank (index 4)

1 -> ayank (index 1)

2 -> k (index 5)

3 -> mayank (index 0)

4 -> nk (index 4)

5 -> yank (index 2)

It has various applications in molecular biology where we have find longest common prefix of every suffix for huge amount of data.

*Building of suffix array :*

We have to consider every suffix and then sort them up. As string comparison will take at most O(n) of time for 2 strings, the time complexity is O(n^2 * log(n)). However this is a bit costly.

*But here we did not exploit the property that each suffix is a suffix of single string. So, we can do better.*

O(n * (log(n))^2) Solution :

(** note**) In this method of implementing the suffix array, we will sort the suffixes by 1st element, then by 2 elements, then by 4 elements, …. . . ., then by 2i elements. If we use this property then we can sort ‘n’ suffixes in log(n) iterations. As sorting will take O(n*log(n)) time. The complexity of this implementation will be O(n * (log(n)) ^ 2).

If we use radix sort here then the timecomplexity will be O(n * log(n)) with a very low constant hidden in Big – Oh notation.

*but how much time would it take to compare ‘2x’ elements after ‘x’ elements are compared. (see note). If it is not O(n), then this implementation can take more time which we claimed.*

*But this string comparison will take O(1) time.*

*Proof : Let String[i] be T[i .. n], and Compare(S[i], S[j], x) (it may be equal to ‘=’, ‘<‘, or ‘>’) be the comparison of S[i] and S[j] based on their first ‘x’ symbols. The main idea is that if we know the comparison results considering first ‘x’ symbols, we can find the result of Compare(S[i], S[j], 2x) in O(1). How? There are 2 cases:*

+ Compare(S[i], S[j], x) is not ‘=’, then the result of Compare(S[i], S[j], 2x) is the same as Compare(S[i], S[j], x).

+ Compare(S[i], S[j], x) is ‘=’, then we should compare last x characters of these two suffixes (since the first x symbols are same), i.e. the result is equal to Compare(T[i+x .. n], T[j+x .. n], x), which is equal to Compare(S[i+x], S[j+x], x). and since we know the result of this, we can find the result of Compare(S[i], S[j], 2x) in O(1).

Example :

In mayank, sorting by 1st element will lead to

0 -> **a**yank (index 1) (0 , 5)

0 -> **a**nk (index 3) (0 , 4)

2 -> **k **(index 5) (2 , -1)

3 -> **m**ayank (index 0) (3 , 0)

4 -> **n**k (index 4) (4 , 2)

5 -> **y**ank (index 2) (5 , 0)

We will give same rank to suffixes whose 1st letter are same. For example (ayank and ank, both have rank 0). We will make a tuple which will contain rank of the first half and rank of the second half.

Sorting the suffixes by first 2 letters, we get.

0 -> **an**k (index 4) (0 , 2)

1 -> **ay**ank (index 1) (1 , 0)

2 -> **k** (index 5) (2 , -1)

3 -> **ma**yank (index 0) (3 , 5)

4 -> **nk** (index 4) (4 , -1)

5 -> **ya**nk (index 2) (5 , 4 )

Sorting the suffixes by first 4 letters, we get.

0 -> **ank** (index 4) (0 , -1)

1 -> **ayan**k (index 1) (1 , 2)

2 -> **k** (index 5) (2 , -1)

3 -> **maya**nk (index 0) (3 , 4)

4 -> **nk** (index 4) (4 , -1)

5 -> **yank** (index 2) (5 , -1 )

and so on. Eventually, this will sort all the suffixes of a string.

Implementation : http://pastebin.com/G0nY2ZxP

I have documented the whole code which can be understood easily.

Further links : http://codeforces.com/blog/entry/4025

http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays

http://web.stanford.edu/class/cs97si/suffix-array.pdf