Suffix Arrays

Note : Suffix arrays are very much different than suffix trees as far as implementation and space usage is concerned. However They both serves the same functions. Suffix array implementation takes less memory and have less time complexity in building as compared to suffix trees.

Suffix array : It is an array which consists of indices of all the suffixes of a string in sorted order. e.g. for string mayank, suffix array will consist of elements :

| 3 | 1 | 5 | 0 | 4 | 2 |

it says that suffix starting from position 3 (0 -indexing) is smallest, then comes suffix starting from index 1 and so on. It can be seen as :

0 -> ank (index 4)

1 -> ayank (index 1)

2 -> k (index 5)

3 -> mayank (index 0)

4 -> nk (index 4)

5 -> yank (index 2)

It has various applications in molecular biology where we have find longest common prefix of every suffix for huge amount of data.

Building of suffix array :

We have to consider every suffix and then sort them up. As string comparison will take at most O(n) of time for 2 strings, the time complexity is O(n^2 * log(n)). However this is a bit costly.

But here we did not exploit the property that each suffix is a suffix of single string. So, we can do better.

O(n * (log(n))^2) Solution :

(note) In this method of implementing the suffix array, we will sort the suffixes by 1st element, then by 2 elements, then by 4 elements, ….   .  . ., then by 2i elements. If we use this property then we can sort ‘n’ suffixes in log(n) iterations. As sorting will take O(n*log(n)) time. The complexity of this implementation will be O(n * (log(n)) ^ 2).

If we use radix sort here then the timecomplexity will be O(n * log(n)) with a very low constant hidden in Big – Oh notation.

but how much time would it take to compare ‘2x’ elements after ‘x’ elements are compared. (see note). If it is not O(n), then this implementation can take more time which we claimed.

But this string comparison will take O(1) time.

Proof : Let String[i] be T[i .. n], and Compare(S[i], S[j], x) (it may be equal to ‘=’, ‘<‘, or ‘>’) be the comparison of S[i] and S[j] based on their first ‘x’ symbols. The main idea is that if we know the comparison results considering first ‘x’ symbols, we can find the result of Compare(S[i], S[j], 2x) in O(1). How? There are 2 cases:

+ Compare(S[i], S[j], x) is not ‘=’, then the result of Compare(S[i], S[j], 2x) is the same as Compare(S[i], S[j], x).
+ Compare(S[i], S[j], x) is ‘=’, then we should compare last x characters of these two suffixes (since the first x symbols are same), i.e. the result is equal to Compare(T[i+x .. n], T[j+x .. n], x), which is equal to Compare(S[i+x], S[j+x], x). and since we know the result of this, we can find the result of Compare(S[i], S[j], 2x) in O(1).

Example :

In mayank, sorting by 1st element will lead to

0 -> ayank (index 1)                    (0 , 5)

0 -> ank (index 3)                        (0 , 4)

2 -> k (index 5)                             (2 , -1)

3 -> mayank (index 0)                 (3 , 0)

4 -> nk (index 4)                           (4 , 2)

5 -> yank (index 2)                       (5 , 0)

We will give same rank to suffixes whose 1st letter are same. For example (ayank and ank, both have rank 0). We will make a tuple which will contain rank of the first half and rank of the second half.

Sorting the suffixes by first 2 letters, we get.

0 -> ank (index 4)                               (0 , 2)

1 -> ayank (index 1)                           (1 , 0)

2 -> k (index 5)                                    (2 , -1)

3 -> mayank (index 0)                       (3 , 5)

4 -> nk (index 4)                                  (4 , -1)

5 -> yank (index 2)                              (5 , 4 )

Sorting the suffixes by first 4 letters, we get.

0 -> ank (index 4)                               (0 , -1)

1 -> ayank (index 1)                           (1 , 2)

2 -> k (index 5)                                    (2 , -1)

3 -> mayank (index 0)                       (3 , 4)

4 -> nk (index 4)                                  (4 , -1)

5 -> yank (index 2)                              (5 , -1 )

and so on. Eventually, this will sort all the suffixes of a string.

Implementation : http://pastebin.com/G0nY2ZxP

I have documented the whole code which can be understood easily.

Further links : http://codeforces.com/blog/entry/4025

http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays

http://web.stanford.edu/class/cs97si/suffix-array.pdf

Lazy Propagation in Segment Trees

This is in continuation with my previous blog on segment trees. As we have seen segment trees can be a very powerful DS for solving problems of range min,max,sum etc. But when the updates come very frequently or in large range like almost the entire array, then the update function would end up with O(n) ,so to overcome this problem, we update only when it’s required which is also called Lazy Propagation.

Here the building tree will remain the same, but we will also declare other array that contains info. whether to update the particular node or not. We will update the node only when it is queried or updated once more.

So the new query method is

private int getSum(int startIndex, int endIndex, int queryStart, int queryEnd, int current)
    {
         if(updates[current]>0) //If the current node needs to be updated, first update it
        {
                treeNode[current] += updates[current]*(endIndex - startIndex + 1) ;
                if(startIndex != endIndex) //Also set its children for update
                {
                        updates[getLeftChild(current)] += updates[current] ;
                        updates[getRightChild(current)] += updates[current] ;
                }
                updates[current] = 0 ;
        }
        if (queryStart <= startIndex && queryEnd >= endIndex ) //If it is leaf node
        {
            return treeNode[current];
        }
        if (endIndex < queryStart || startIndex > queryEnd) //out of range
        {
            return 0;
        }
        int mid = (startIndex + (endIndex - startIndex) / 2);
        return  getSum(startIndex, mid, queryStart, queryEnd, getLeftChild(current)) 
                 + getSum( mid + 1, endIndex, queryStart, queryEnd, getRightChild(current)); //return the result
    }
 
    public int query(int queryStart, int queryEnd)
    {
        if(queryStart < 0 || queryEnd > treeNode.length)  //query is out of range
        {
            return -1;
        }
        return getSum(STARTINDEX, ENDINDEX, queryStart, queryEnd, ROOT);
    }
 

Here the updates[] I am referring to is the array that contains amount by which that particular node needs to be updated.
Also the now since we are doing lazy propagation,I will call update method on a range of elements to exploit its full use.The new update method is

private void updateTree(int startIndex, int endIndex, int updateStart,int updateEnd, int update, int current)
    {
        if(updates[current]>0) //If current node needs to be updated,update it first
        {
                treeNode[current] += updates[current]*(endIndex - startIndex + 1) ;
                if(startIndex != endIndex)
                {
                        updates[getLeftChild(current)] += updates[current] ; //Also set update for its children
                        updates[getRightChild(current)] += updates[current] ;
                }
                updates[current] = 0 ;
        }
        if ( updateStart < startIndex || updateEnd > endIndex) //out of range
        {
            return;
        }
 
        if(startIndex >= updateStart && endIndex <= updateEnd)// Node in range totally
        {
                treeNode[current] += update*(endIndex-startIndex  + 1) ;  //update the current node
                if(startIndex != endIndex)
                {
                        updates[getLeftChild(current)] += update ;  //set update for its children
                        updates[getRightChild(current)] += update ;
                }
                return ;
        }
        int mid = (startIndex + (endIndex - startIndex) / 2); //If its not totally in range
        updateTree(startIndex, mid, updateStart,updateEnd, update, getLeftChild(current)); //Call recusively on its children
        updateTree(mid+1, endIndex, updateStart,updateEnd, update, getRightChild(current));
        treeNode[current] = treeNode[getLeftChild(current)] + treeNode[getLeftChild(current)] ; //Also update current node accordingly
       
    }
 
    public void update(int update, int updateStart,int updateEnd, int[] elements)
    {
        int updatediff = update - elements[updateStart]  ; 
        for(int i=updateStart;i<updateEnd;i++)
        elements[i] = update; //update the element array
        updateTree(STARTINDEX, ENDINDEX, updateStart,updateEnd, updatediff, ROOT); // call the update method
    }

As you can see through lazy propagation it update a node only when a query/update is called on that node.So it will only update the required node at once keep the rest of the nodes for updating next time.In this way the update order reduces to O(1) which can be very helpful when a large no of updates are fired before having any query.

Feel free to type in any suggestions/corrections in comments

Keep Learning!

Parth Panchal

Segment Trees

As some of you might be knowing, Segment Trees are special type of data structure for query and updating of intervals of array in logarithmic time.Basically segment trees store data of specific intervals of array in tree form. The root contains the whole array, it’s left child contain data of start index to middle index and right child contain data of middle index +1 to end index and so on. So the leaf nodes contain data of specific element of array.

Now the data of element I have been mentioning can be anything like the sum of elements or the max element or the min element etc. I have written the code in java for data containing sum of elements. A segment tree for an array of n elements uses O(n log n) storage and can be built in O(n log n) time. Segment trees support searching for all the intervals that contain a query point in O(log n + k), k being the number of queried intervals.

The segment tree is stored in a form of heap so left child of node i is 2*i and right child of node i is 2*i+1. To build the segment tree for sum of particular range problem, I have recursively called the method for child nodes. The moment it is the leaf node, it will be the base case of the recursion so the parent will get the value of sum of its childrens values. In this way the whole tree will be created from bottom up.


private int[] treeNode;
private int maxsize;
private int height;

private final int STARTINDEX = 0;
private final int ENDINDEX;
private final int ROOT = 0;

public SegmentTree(int elements[])
{
  height = (int)(Math.ceil(Math.log(elements.length) / Math.log(2))); //height of segment tree is O(log(n))
  maxsize = 2 * (int) Math.pow(2, height) - 1;  //setting maxsize
  treeNode = new int[maxsize];
  ENDINDEX = elements.length - 1; //setting global variable to size of array given
  constructSegmentTree(elements, STARTINDEX, ENDINDEX, ROOT);  // calling method to construct tree from the array
}

private int getLeftChild(int pos){
   return 2 * pos + 1;
}

private int rightchild(int pos){
   return 2 * pos + 2;
}

private int constructSegmentTree(int[] elements, int startIndex, int endIndex, int current)
{
if (startIndex == endIndex) //base case or leaf node
{
   treeNode[current] = elements[startIndex];
   return treeNode[current];
}
int mid = (startIndex + (endIndex - startIndex) / 2);
treeNode[current] = constructSegmentTree(elements, startIndex, mid, getLeftChild(current))
+ constructSegmentTree(elements, mid + 1, endIndex, rightchild(current));
return treeNode[current];  // calling it recusively and setting the current node's value to sum of its children
}

Here’s the method to get result of query.

private int getSum(int startIndex, int endIndex, int queryStart, int queryEnd, int current)
{
if (queryStart &lt;= startIndex &amp;&amp; queryEnd &gt;= endIndex )  // base case
   return treeNode[current];

if (endIndex &lt; queryStart || startIndex &gt; queryEnd)  // current node is out of range
   return 0;

int mid = (startIndex + (endIndex - startIndex) / 2);
return getSum(startIndex, mid, queryStart, queryEnd, getLeftChild(current))
+ getSum( mid + 1, endIndex, queryStart, queryEnd, rightchild(current));  //recursively calling the query method and getting the result
}

public int query(int queryStart, int queryEnd)
{
if(queryStart &lt; 0 || queryEnd &gt; treeNode.length)  // if the query is out of range
  return -1;

return getSum(STARTINDEX, ENDINDEX, queryStart, queryEnd, ROOT);
}

Here’s the code for updating the tree

private void updateTree(int startIndex, int endIndex, int updatePos, int update, int current)
{
if ( updatePos &lt; startIndex || updatePos &gt; endIndex) //update pos out of range
   return;
treeNode[current] = treeNode[current] + update;  // if current node comes under the range to update, update it first and then call the method on its children
if (startIndex != endIndex)
{
  int mid = (startIndex + (endIndex - startIndex) / 2);
  updateTree(startIndex, mid, updatePos, update, getLeftChild(current));
  updateTree(mid+1, endIndex, updatePos, update, rightchild(current));
}
}

public void update(int update, int updatePos, int[] elements)  // This method first calculates the diff to be added in each of the required nodes and then calls the method on the root of the tree
{
   int updatediff = update - elements[updatePos] ;
   elements[updatePos] = update;  //the elements of the array are updated first
   updateTree(STARTINDEX, ENDINDEX, updatePos, updatediff, ROOT);  
}

This is the test run

 public static void main(String args[])
	    {
	        int[] elements = {1,2,3,4,5,6,7,8,9};
	        SegmentTree segmentTree = new SegmentTree(elements); //creating the segment tree
	        int sum = segmentTree.query(1, 4);  //querying for sum of elements in range 1-4
	 
	        System.out.println("the sum is " + sum);
	        segmentTree.update(3, 1,elements); // updating the tree
	        sum = segmentTree.query(1, 4); //getting the updated result
	        System.out.println("the sum is " + sum);	
	    }  	

And the output is

the sum is 14
the sum is 15

Segment tree stores cumulative values of all intervals of the array and each interval can be accessed in logarithmic time, segment tree can be very helpful for problems like range min or max or range sum which have large amount of queries.
Also if there is large amount of updates given in the problem the segment tree may not survive, for that lazy propagation comes in handy which I will discuss in my next blog.
Here is the link to 60 problems relating to segment trees,try them to get a hold of the topic
problems on segment trees

By the way,this is my first blog so feel free to write any suggestions in comments.

Keep learning!

Parth Panchal