A segment tree is a data structure in the form of a binary tree that allows to store intervals. It can sometimes be mistaken as an interval tree that has a different characteristic.

To build a segment tree the endpoints of intervals are projected to form new endpoints of elementary segments:

A binary tree is then built with each leaf representing either an elementary segment or an endpoint:

Each interval can then be described as the sum of its sub segments. We therefore store in every node the information, if the segment it represents is part of an interval. But only if the segment of the parent of this node is not part of this certain interval. Otherwise we would create redundant data: if the segment represented by a certain node is contained in an interval, of course also the segments represented by the children of this node are contained in the interval.

An implementation of the segment tree as a Node.js module can be found in this github repository: interval-query. It offers query methods to find all intervals in a certain range or at a certain point and also the calculation of all overlapping intervals in a set of intervals. Besides the tree also an alternative implementation is given in the form of a sequential algorithm that just traverses the set of intervals in search for matches. In certain cases this can be faster than building up a tree. See below for detailed analysis.

The coding is not aggressively optimized for performance: the buildTree() method as well as the query methods use recursion. Also the nodes of the tree are modeled as objects and therefore the tree is not implemented in an array. This leaves some space for optimization.

First the tree has to be build up. The following diagramm shows the time required to build a tree dependent on the number of intervals.

All tests were carried out on the following machine:

The time to build the tree with 1000 intervals takes around 10 ms and rises linear to approx. 450 ms for 20000 intervals. For intervals with floats the distribution of intervals in the number range does not affect the time to build the tree.

We see that for both algorithms, tree and sequential, the queries are faster for a lower density value. My current assumption is that this is more a side effect: if there is a lower number of overlapping intervals and we query for a certain point in the number range the number of hits or resulting intervals is also lower. The processing of these result amount adds to our complete processing time.

If we limit the influence of this side effect by choosing a density value of 0.01 and compare the two algorithms we get the following result:

Quering a point in a set of 1000 intervals is 3.5 times faster with the tree than with the sequential algorithm and the ratio is increasing with growing number of intervals.

The performance of the sequential algorithm is similar to the test result with d=0.01: it degrades for larger number of intervals. The tree does not degrade at all here. Even with a test set of 20000 intervals we could run more than 60000 queries per second. How can this be explained? The costly part of the tree traversal is when we hit a node with many intervals stored in it. These intervals then have to be copied to our result array of intervals. If we have relatively small intervals (compared to the number range) and less overlapping between the intervals, then each interval can be represented by a small number of segments (nodes) in the tree and therefore the number of intervals per segment keeps on a low level.

The applicability of the segment tree as a data structure to store intervals depends on a number of factors:

To build a segment tree the endpoints of intervals are projected to form new endpoints of elementary segments:

(1, 5) (2, 3) (4, 7) --------------------- (1, 2, 3, 4, 5, 7)

A binary tree is then built with each leaf representing either an elementary segment or an endpoint:

Source: Wikipedia |

#### Implementation in JavaScript

An implementation of the segment tree as a Node.js module can be found in this github repository: interval-query. It offers query methods to find all intervals in a certain range or at a certain point and also the calculation of all overlapping intervals in a set of intervals. Besides the tree also an alternative implementation is given in the form of a sequential algorithm that just traverses the set of intervals in search for matches. In certain cases this can be faster than building up a tree. See below for detailed analysis.

The coding is not aggressively optimized for performance: the buildTree() method as well as the query methods use recursion. Also the nodes of the tree are modeled as objects and therefore the tree is not implemented in an array. This leaves some space for optimization.

#### Performance analysis

First the tree has to be build up. The following diagramm shows the time required to build a tree dependent on the number of intervals.

All tests were carried out on the following machine:

Node.js (v0.6.11), Linux (3.1.9 x86_64), P8600 (Core2Duo@2.4GHz 4GB RAM)

The time to build the tree with 1000 intervals takes around 10 ms and rises linear to approx. 450 ms for 20000 intervals. For intervals with floats the distribution of intervals in the number range does not affect the time to build the tree.

##### Query methods

For the query methods the distribution of intervals in the number range has an impact on the performance of the tree. To measure this dependency we introduce the facter**d**(density) that describes the ratio between number of intervals to number range. An equal distributed set of 1000 intervals in a number range of 0 - 1000 would have the factor d = 1.We see that for both algorithms, tree and sequential, the queries are faster for a lower density value. My current assumption is that this is more a side effect: if there is a lower number of overlapping intervals and we query for a certain point in the number range the number of hits or resulting intervals is also lower. The processing of these result amount adds to our complete processing time.

If we limit the influence of this side effect by choosing a density value of 0.01 and compare the two algorithms we get the following result:

Quering a point in a set of 1000 intervals is 3.5 times faster with the tree than with the sequential algorithm and the ratio is increasing with growing number of intervals.

##### Further queries

To illustrate the influence of the quality of the interval set one step further, another test was done with test data of the following characteristic: we create a set of intervals where the number range is 100 times the number of intervals and the length of the intervals is a random number between 0 - 1000. This means again the density of intervals in the number range is low: an average query on a certain point would hit approx. one interval. And this result is independent on the number of intervals in our set. Let's look at the result:The performance of the sequential algorithm is similar to the test result with d=0.01: it degrades for larger number of intervals. The tree does not degrade at all here. Even with a test set of 20000 intervals we could run more than 60000 queries per second. How can this be explained? The costly part of the tree traversal is when we hit a node with many intervals stored in it. These intervals then have to be copied to our result array of intervals. If we have relatively small intervals (compared to the number range) and less overlapping between the intervals, then each interval can be represented by a small number of segments (nodes) in the tree and therefore the number of intervals per segment keeps on a low level.

#### Conclusion

The applicability of the segment tree as a data structure to store intervals depends on a number of factors:

**dynamic vs. static**. The tree is a static structure. If the set of intervals is changing constantly then a dynamic data structure like the one described as sequential might be a better match.**initial build time**. The tree needs to be build up first. If multiple queries are required this delay might be compensated by an overall better query performance of the tree.**distribution of intervals**. The performance of the here presented JavaScript implementation of the segment tree highly depends on the distribution of the intervals. For small intervals with moderate overlapping the tree can outperform the sequential method by magnitudes.

## No comments:

## Post a Comment