03 May 2012

Building segment trees with Go

The segment tree implementation in JavaScript presented in my last post is already able to process large numbers of intervals. A practical use of this algorithm is, for example, the analysis of gene sequence data. I wanted to find out whether it is possible to push the performance for this task one step further.
There is still room for improvement in the JavaScript version, for example changing to typed arrays or using non-recursive implementations for building and querying the segment tree. But I saw the possibility to parallelize the building and traversal of the tree and to distribute the workload across multiple cores. For this, and also to minimize memory usage, another language is required.
First I thought about using the new concurrency features of C++11, but having enjoyed programming in JavaScript for a while now, I was looking for a programming paradigm that is closer to JS.
Soon I was excited about trying out Go for this task, as it seemed to be a promising candidate:
  • Being designed as a system programming language with static types, it should make it possible to model the data structures for the segment tree in a memory-efficient way
  • With its lightweight type system, garbage collection and fast compilation it feels more like a dynamic language such as JavaScript
  • According to the FAQ: "By its design, Go proposes an approach for the construction of system software on multicore machines"


First steps in Golang land


It's fun to learn Go. The documentation resources provided at http://golang.org/ are excellent. I found it especially helpful to read through the slides of a three-day workshop about Go (Day 1, Day 2, Day 3). The slides are no longer part of the current documentation, probably because they have not been updated for the current Go version 1. But I think the fundamental concepts have not changed, so they are still worth reading.
After this, the Tour of Go provides a good means to try things out and do some exercises. The specification is even readable, and on the golang-nuts group the core developers of the language are very active.

Translating from JavaScript to Go


The first step was to create a single-threaded version of the segment tree without the concurrency features of Go. The result can be found here: stree.go
When we measure the time to build the tree and to query a range, we get the following results:

To build the tree, Go needs only about a quarter of the time compared to JavaScript. Interestingly, the query method in Go is only around 25-30% faster. The query used here covers the complete number range, which is the worst case for the segment tree: all nodes have to be evaluated and a lot of intervals are found, which then have to be reduced to non-duplicates.
All tests were carried out on the following machine:
Go 1, Node.js (v0.6.11), Linux (3.1.9 x86_64), P8600 (Core2Duo@2.4GHz)

Running on multiple cores


Go offers two main language features for concurrency: goroutines and channels.
Goroutines can be used to solve the problem of building a segment tree in parallel: one could start multiple goroutines where each one independently creates a branch of the tree. The same concept applies to the querying algorithm: we can search separate branches of the tree with multiple goroutines for overlapping intervals and later merge the results. So much for the theory; here is the implementation: mtree.go
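To make the idea concrete, the following is a minimal sketch of this splitting scheme. It is not the actual mtree.go code; the node type and the build function are simplified assumptions, and the range is only split once at the root:

package main

import "fmt"

// node is a simplified segment tree node, only for illustration.
type node struct {
  from, to    int
  left, right *node
}

// build constructs a subtree sequentially.
func build(from, to int) *node {
  n := &node{from: from, to: to}
  if to-from > 1 {
    mid := (from + to) / 2
    n.left = build(from, mid)
    n.right = build(mid, to)
  }
  return n
}

// buildParallel splits the range once and builds the two halves
// in separate goroutines, collecting the subtrees over channels.
func buildParallel(from, to int) *node {
  n := &node{from: from, to: to}
  mid := (from + to) / 2
  lc, rc := make(chan *node), make(chan *node)
  go func() { lc <- build(from, mid) }()
  go func() { rc <- build(mid, to) }()
  n.left, n.right = <-lc, <-rc
  return n
}

func main() {
  root := buildParallel(0, 16)
  fmt.Println(root.left.to, root.right.from) // 8 8
}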
The following shows the result of the measurement.

We see improvements of only 7% and 15% when using goroutines. This is less than I expected. How can this be explained?

"Concurrency is not Parallelism"


Rob Pike explains in these slides what he sees as the difference:
Concurrency provides a way to structure a solution to solve a problem that may (but not necessarily) be parallelizable.
Go provides language features that allow a problem to be described elegantly in a concurrent way. This is the precondition before it can be executed in parallel. But whether the actual Go program can utilize parallelism and also achieve a performance benefit depends on a number of factors:

  1. the nature of the problem: can it be divided into independent subtasks?
  2. the concurrency overhead: the usage of these language features comes with a certain overhead, as communication is required to synchronize tasks. The benefit of parallelism must outweigh this overhead.
  3. the code quality: there are always multiple ways to structure concurrent solutions. Experience with the segment tree showed a high variance in how different solutions performed. More on this later.
  4. the runtime / implementation details: when a function is called with the keyword 'go' it does not start immediately, but is scheduled for later execution. The scheduling therefore has an impact on the parallel execution of goroutines. The internal memory model also influences how concurrent goroutines can access the same data structure (see the sketch after this list).
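A minimal sketch of point 4: a function started with 'go' only runs when the scheduler picks it up, so the main goroutine has to synchronize with it explicitly (here via a channel):

package main

import "fmt"

func main() {
  done := make(chan bool)
  go func() {
    fmt.Println("scheduled, now running")
    done <- true
  }()
  // Without this receive, main could exit before the
  // goroutine above was ever scheduled.
  <-done
}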


Experiences when building a segment tree in parallel


I want to analyze whether building a segment tree is a suitable problem for parallelism and share my experiences with implementing it in Go. As stated above, building a tree is parallelizable: multiple threads can each deal with building one of the branches of the tree. But is it also a "good" problem? I don't think so. An ideal case is given in Effective Go: an expensive operation without memory allocation, distributed across multiple cores. Building a segment tree is the opposite: the calculation is very simple, only comparing two numbers per new node of the tree, but with constant memory allocation for each node. These allocations can probably not be executed in parallel and therefore serialize the goroutines. An answer from Russ Cox in the discussion here leads to this conclusion.

When we start multiple goroutines to build our tree, we somehow need to know when all goroutines are finished and the tree is complete. In a wider sense: communication is required between concurrently running threads. Go promotes the paradigm of sharing by communicating:
Do not communicate by sharing memory; instead, share memory by communicating.
Go channels are a means to implement this paradigm. Once used to 'thinking in channels', I found them a safe and simple toolkit for the often counter-intuitive problems of parallel programming. But Go also supports mutexes to lock access to shared variables.
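To illustrate both styles, here is a small sketch (simplified, not taken from mtree.go) that collects partial results once via a channel, so that only one goroutine touches the result slice, and once via a mutex-guarded slice:

package main

import (
  "fmt"
  "sync"
)

// Channel style: workers send their partial results; only the
// receiving goroutine appends to the result slice.
func viaChannel(parts int) []int {
  ch := make(chan int)
  for i := 0; i < parts; i++ {
    go func(i int) { ch <- i * i }(i)
  }
  result := make([]int, 0, parts)
  for i := 0; i < parts; i++ {
    result = append(result, <-ch)
  }
  return result // order depends on scheduling
}

// Mutex style: the goroutines share the slice and serialize
// their access with a lock.
func viaMutex(parts int) []int {
  var mu sync.Mutex
  var wg sync.WaitGroup
  result := make([]int, 0, parts)
  for i := 0; i < parts; i++ {
    wg.Add(1)
    go func(i int) {
      defer wg.Done()
      mu.Lock()
      result = append(result, i*i)
      mu.Unlock()
    }(i)
  }
  wg.Wait()
  return result
}

func main() {
  fmt.Println(viaChannel(4), viaMutex(4))
}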

For the segment tree it was necessary to use these features in a sensible manner: as the problem as such is not optimal, the gain from parallelism can easily be eaten up by the overhead of communication and synchronization of goroutines. For every function in mtree.go I wrote at least three different versions, with different usage of channels and goroutines. For example, you can first start a number of goroutines and supply them with tasks via channels, or start a goroutine in place with the task as a direct function parameter (both variants are sketched below). These slight modifications of the structure often had a deep impact on the overall performance, which was often worse than with the single-threaded algorithm.
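The two variants can be sketched as follows; process() is a hypothetical stand-in for the actual per-branch work in mtree.go:

package main

import "fmt"

func process(task int) int { return task * 2 } // placeholder work

// Variant A: a fixed number of worker goroutines is started once
// and supplied with tasks through a channel.
func workerPool(tasks []int, workers int) []int {
  in, out := make(chan int), make(chan int)
  for w := 0; w < workers; w++ {
    go func() {
      for t := range in {
        out <- process(t)
      }
    }()
  }
  go func() {
    for _, t := range tasks {
      in <- t
    }
    close(in)
  }()
  results := make([]int, 0, len(tasks))
  for i := 0; i < len(tasks); i++ {
    results = append(results, <-out)
  }
  return results
}

// Variant B: one goroutine per task, the task passed directly
// as a function parameter.
func perTask(tasks []int) []int {
  out := make(chan int)
  for _, t := range tasks {
    go func(t int) { out <- process(t) }(t)
  }
  results := make([]int, 0, len(tasks))
  for i := 0; i < len(tasks); i++ {
    results = append(results, <-out)
  }
  return results
}

func main() {
  tasks := []int{1, 2, 3, 4}
  fmt.Println(workerPool(tasks, 2), perTask(tasks))
}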

Optimizations in this area, which is of course a corner case, can only be done (apart from an empirical approach) with knowledge of the underlying runtime or implementation details. For example, the question whether write access to a single field of an array blocks the whole data structure for other goroutines cannot be answered from the Go specification.

Another area of the runtime is the scheduler. It is a question of granularity how many subtasks you create from your original problem and therefore how many goroutines should be started in parallel. On the Core2Duo test machine the best performance for building the tree was achieved with 64 parallel goroutines, but for querying the tree the optimum was only 4 goroutines (given a segment tree with 100,000 intervals). To understand this behavior, a deeper understanding is again required: how are goroutines scheduled? When are they preempted? I don't want to go into more detail here, but this discussion should help to build a basic understanding, including the current limitations of the scheduler.
The question "Why does using GOMAXPROCS > 1 sometimes make my program slower?" belongs in this context as well.
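One detail belongs in this context: in Go 1 the runtime schedules goroutines onto a single OS thread by default, so parallel execution has to be enabled explicitly. A minimal sketch:

package main

import (
  "fmt"
  "runtime"
)

func main() {
  // Allow goroutines to run on all available cores;
  // in Go 1 the default value of GOMAXPROCS is 1.
  runtime.GOMAXPROCS(runtime.NumCPU())
  fmt.Println("GOMAXPROCS is now", runtime.GOMAXPROCS(0)) // 0 only queries the value
}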

Conclusion


It's hard not to like Go. It feels carefully crafted in all parts. I especially liked the lightweight type system that does not need classes but still allows object-oriented design. When looking at Go from a Java perspective, the simplifications and the reduction to the essential seem like a step in the right direction. Coming from JavaScript, I appreciated the additional structure and the strict compiler. Interfaces feel a bit weird in the beginning, but I think one gets used to them. For developers who think that parallel algorithms are often non-intuitive, Go channels provide a means to structure these problems elegantly.

Although Go is currently developing into a general-purpose language, it was designed for system programming and server applications. In this area, where networking, latency and huge numbers of concurrent clients are involved, Go can show its strengths.

The segment tree presented here does not fall into this category. It might be more representative of scientific calculations. Is Go usable in this area as well? I think the interesting question is not so much what a programmer with a profound knowledge of the runtime and the underlying implementation can squeeze out of the language, but what the average developer can do in little time. Given a suitable problem, Go has everything needed to express it in a concurrent way, and it will benefit from parallelism. With the expected advancement of the Go runtime, the scheduling should be optimized for long-running calculations. If the problem is not so suitable, like the segment tree presented here (and this might become obvious only at second glance), it is appropriate to implement the single-threaded version first and optimize for parallelism only if required.

29 February 2012

Segment tree implementation in JavaScript

A segment tree is a data structure in the form of a binary tree that allows storing intervals. It is sometimes mistaken for an interval tree, which has different characteristics.

To build a segment tree the endpoints of intervals are projected to form new endpoints of elementary segments:
(1, 5) (2, 3) (4, 7) → (1, 2, 3, 4, 5, 7)

A binary tree is then built with each leaf representing either an elementary segment or an endpoint:

[Figure: binary tree over the elementary segments. Source: Wikipedia]
Each interval can then be described as the sum of its sub-segments. We therefore store in every node the information whether the segment it represents is part of an interval, but only if the segment of this node's parent is not part of that interval. Otherwise we would create redundant data: if the segment represented by a certain node is contained in an interval, then of course the segments represented by the children of this node are also contained in the interval.
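This insertion rule can be sketched in a few lines. To keep a single language for the sketches on this page it is written in Go (as in the post above) rather than the actual JavaScript of the repository; the node type and the ranges are simplified assumptions:

package main

import "fmt"

// node represents the segment [from, to); ids holds the intervals
// stored at this node.
type node struct {
  from, to    int
  left, right *node
  ids         []int
}

func build(from, to int) *node {
  n := &node{from: from, to: to}
  if to-from > 1 {
    mid := (from + to) / 2
    n.left = build(from, mid)
    n.right = build(mid, to)
  }
  return n
}

// insert stores the interval id in the highest nodes whose segments
// are fully covered by [lo, hi). Recursion stops at such a node, so
// an interval is never stored redundantly in a node and its children.
func insert(n *node, lo, hi, id int) {
  if lo <= n.from && n.to <= hi {
    n.ids = append(n.ids, id)
    return
  }
  if n.left == nil {
    return // leaf that is not fully covered
  }
  mid := (n.from + n.to) / 2
  if lo < mid {
    insert(n.left, lo, hi, id)
  }
  if hi > mid {
    insert(n.right, lo, hi, id)
  }
}

func main() {
  root := build(0, 8)
  insert(root, 1, 5, 42)
  // [1, 5) is represented by the segments [1,2), [2,4) and [4,5)
  fmt.Println(root.left.right.ids) // [42]
}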

Implementation in JavaScript


An implementation of the segment tree as a Node.js module can be found in this GitHub repository: interval-query. It offers query methods to find all intervals in a certain range or at a certain point, and also the calculation of all overlapping intervals in a set of intervals. Besides the tree, an alternative implementation is provided in the form of a sequential algorithm that simply traverses the set of intervals in search of matches. In certain cases this can be faster than building up a tree; see below for a detailed analysis.

The code is not aggressively optimized for performance: the buildTree() method as well as the query methods use recursion. Also, the nodes of the tree are modeled as objects, so the tree is not implemented in an array. This leaves some room for optimization.

Performance analysis


First the tree has to be built. The following diagram shows the time required to build a tree depending on the number of intervals.
All tests were carried out on the following machine:
Node.js (v0.6.11), Linux (3.1.9 x86_64), P8600 (Core2Duo@2.4GHz 4GB RAM)

The time to build the tree with 1000 intervals is around 10 ms and rises linearly to approx. 450 ms for 20,000 intervals. For intervals with floats, the distribution of the intervals in the number range does not affect the time to build the tree.

Query methods
For the query methods, the distribution of the intervals in the number range has an impact on the performance of the tree. To measure this dependency we introduce the factor d (density), which describes the ratio of the number of intervals to the number range. An equally distributed set of 1000 intervals in a number range of 0 - 1000 would have the factor d = 1.


We see that for both algorithms, tree and sequential, the queries are faster for a lower density value. My current assumption is that this is more of a side effect: if there is a lower number of overlapping intervals and we query a certain point in the number range, the number of hits, i.e. resulting intervals, is also lower. The processing of this result set adds to the complete processing time.
If we limit the influence of this side effect by choosing a density value of 0.01 and compare the two algorithms, we get the following result:

Querying a point in a set of 1000 intervals is 3.5 times faster with the tree than with the sequential algorithm, and the ratio increases with a growing number of intervals.
Further queries
To illustrate the influence of the quality of the interval set one step further, another test was done with test data of the following characteristics: we create a set of intervals where the number range is 100 times the number of intervals and the length of each interval is a random number between 0 and 1000. This again means the density of the intervals in the number range is low: an average query on a certain point would hit approx. one interval, independent of the number of intervals in our set. Let's look at the result:

The performance of the sequential algorithm is similar to the test result with d = 0.01: it degrades for larger numbers of intervals. The tree does not degrade at all here; even with a test set of 20,000 intervals we could run more than 60,000 queries per second. How can this be explained? The costly part of the tree traversal is when we hit a node with many intervals stored in it: these intervals then have to be copied to our result array. If we have relatively small intervals (compared to the number range) and little overlap between the intervals, then each interval can be represented by a small number of segments (nodes) in the tree, and the number of intervals per segment stays at a low level.

Conclusion


The applicability of the segment tree as a data structure to store intervals depends on a number of factors:
  • dynamic vs. static: the tree is a static structure. If the set of intervals changes constantly, a dynamic data structure like the sequential one described above might be a better match.
  • initial build time: the tree needs to be built first. If multiple queries are required, this delay might be compensated by the overall better query performance of the tree.
  • distribution of intervals: the performance of the JavaScript implementation of the segment tree presented here depends highly on the distribution of the intervals. For small intervals with moderate overlap, the tree can outperform the sequential method by orders of magnitude.


26 January 2012

Remove duplicate elements from an array

In this article I want to discuss three different implementations in JavaScript for the following problem: given an array of n elements of type integer, how do we remove duplicate values?

Here is an example of the input array:
[5, 8, 3, 3, 9, 5, 78, 8, 43, 7, 7]

1. loop & splice

In this first approach we traverse the array and compare each element with the following elements. If a duplicate element is found, it is removed by calling Array.splice():
var loopAndSplice = function(unordered, compFn) {
  for (var i = 0; i < unordered.length; i++) {
    for (var j = i + 1; j < unordered.length; j++) {
      // use the compare function if provided, strict equality otherwise
      var equal = (compFn !== undefined)
                  ? compFn(unordered[i], unordered[j]) === 0
                  : unordered[i] === unordered[j];
      if (equal) {
        unordered.splice(j, 1); // remove the duplicate in place
        j--;
      }
    }
  }
  return unordered;
};
The index of the inner for loop (j) is decremented to compare the same position again after the removal of an element.
The following compare function (compFn), doing a simple subtraction, is used for all examples:
function(a, b) { return (a - b); }

2. sort & dedup

The second possibility is to sort the array with the built-in Array.sort() function and then do a single traversal to remove all adjacent duplicates.
var sortAndDeDup = function(unordered, compFn) {
  var result = [];
  var prev;
  unordered.sort(compFn).forEach(function(item) {
    // after sorting, duplicates are adjacent: compare each item
    // with the previously kept one
    var equal = (compFn !== undefined && prev !== undefined)
                ? compFn(prev, item) === 0 : prev === item;
    if (!equal) {
      result.push(item);
      prev = item;
    }
  });
  return result;
};

3. object dedup

Here we use property assignment to an object to remove duplicates. A property with a given name can only be added once to the same object, so multiple assignments of the same property filter out the duplicate values.
var objectDeDup = function(unordered) {
  var object = {};
  unordered.forEach(function(item) {
    object[item] = null; // repeated assignment of the same name has no effect
  });
  // transform the property names back into an array
  var result = Object.keys(object);
  return result;
};
The properties of the object are assigned the value null, as we are only interested in the property names and not in their values. In the next step the object is transformed into an array with the Object.keys() method.
The result array of this method is in ascending order without the need to sort it explicitly, and it consists of string values.

How does it perform?

The integer arrays used for these tests were created with a duplicate ratio of approx. 50%. The complete test script can be found here.


This leads to some first findings:

  • loop&splice is only usable for arrays with up to ~1000 elements; it degrades badly for larger sizes
  • object-dedup is 2 to 3 times faster than sort&dedup on V8, and on Firefox it outperforms the latter even by a factor of 3 to 6
  • V8 is around 8 times faster than Firefox on sort&dedup and 3 to 6 times faster on object-dedup
  • Firefox performs around 40% better with loop&splice than V8

Next, the memory consumption was analyzed on Node.js. The total heap consumption of the three methods can be seen in the next diagram:


This confirms that everything comes at a cost: the memory consumption of object-dedup can be 5 to 8 times higher than that of the other methods.

How about floats?

All three algorithms can also be used with float arrays. The test arrays again contain 50% duplicate floats.

  • loop&splice: Chrome is ~45% slower here than with integers. Firefox needs 30 times longer compared to integer arrays.
  • sort&dedup: here it is the other way round. Firefox needs only ~28% longer than with integers, while Chrome's performance degrades by a factor of 7 compared to integers (although it is still faster than Firefox overall).
  • object-dedup: again the fastest method for float arrays. Chrome needs 11 times longer than with integers, Firefox 3.7 times.



Chasin' strings?

Arrays with 40 character strings were used in this test. Again with a duplicate ratio of 50%.

  • Firefox shows quite similar results as in the previous tests with floats.
  • Chrome's performance degrades in all three algorithms: for sort&dedup by a factor of 4 to 6, and for object-dedup we see an 18-46% increase in processing time.
  • Firefox outperforms Chrome on sort&dedup by a factor of 2 to 3.
  • loop&splice degrades faster than in the previous tests.



Limitations of object-dedup: values are unescaped in this algorithm, which can lead to complications when processing strings that contain a backslash (\). In the following example a property is assigned to an object:
var obj = {};
obj['\u0041lbert'] = null;  // same as obj.Albert = null; 
This means the values '\u0041lbert' and 'Albert' would be considered equal by the object-dedup method!
Further problems when using this method with strings can be found in this article: an object is not a hash

Summary

We have seen three different methods to remove duplicate elements from an array. loop&splice is the typical O(n²) algorithm and not recommended for arrays with more than 1000 elements. sort&dedup performs much better and is the most straightforward way to perform this task. It uses the built-in Array.sort() method of the JavaScript runtime, which leads to good results for all types of elements in both tested browsers. object-dedup is a bit of an exotic bird among the compared algorithms. It exploits object property assignment, which is highly optimized in modern browsers (see hidden classes for Chrome). This leads to an impressive performance that was at least twice as good as sort&dedup in most of the tests and in some cases even 9 times faster. But it comes at the cost of higher memory consumption and the limitations mentioned above when it comes to strings (escape problem). For dedicated environments like Node.js, object-dedup can be a solution for performance-critical tasks. Further tests should evaluate efficiency and compatibility on older browsers.

Update: created a new test case at http://jsperf.com/dedup-int-array. It uses a slightly optimized loop&splice method that yields better results.