Pages

26 January 2012

Remove duplicate elements from an array

In this article I want to discuss three different implementations in JavaScript for the following problem: given an array of n elements of type integer, how to remove duplicate values?

Here an example of the input array:
[5, 8, 3, 3, 9, 5, 78, 8, 43, 7, 7]

1. loop & splice

In this first approach we traverse the array and compare each element with following elements. If an duplicate element is found it is removed by calling Array.splice():
var loopAndSplice = function(unordered, compFn) {
  for(var i = 0; i < unordered.length; i++) {
    for(var j = i + 1; j < unordered.length; j++) {
      var equal = (compFn !== undefined) 
                  ? compFn(unordered[i], unordered[j]) === 0 
                  : unordered[i] === unordered[j]; 
      if (equal) {
        unordered.splice(j, 1);
        j--;
      }
    }
  }
  return unordered;
}
The index of the inner for loop (j) is decremented to compare the same position again after removal of an element.
The following compare function (compFn) is used for all examples, doing a simple subtraction:
function(a, b) { return (a - b); }

2. sort & dedup

The second possibility is to sort the array with the built-in Array.sort() function and then do a single traversal to remove all adjacent duplicates.
var sortAndDeDup = function(unordered, compFn) {
  var result = [];
  var prev;
  unordered.sort(compFn).forEach(function(item) {
    var equal = (compFn !== undefined && prev !== undefined) 
                ? compFn(prev, item) === 0 : prev === item; 
    if (!equal) {
      result.push(item);
      prev = item;
    }
  });
  return result;
}

3. object dedup

Here we use property assignment to an object to remove duplicates. A property with a given name can only be added once to the same object. Therefore multiple assignments of the same property will filter out those values.
var objectDeDup = function(unordered) {
  var result = [];
  var object = {};
  unordered.forEach(function(item) {
    object[item] = null;
  });
  result = Object.keys(object);
  return result;
}
The properties of the object are assigned to the value null as we are only interested in the property names and not their values. In a next step the object is transformed to an array with the Object.keys() method.
The result array in this method is in ascending order without the need to explicitly sort it and consists of string values.

How does it perform?

The integer arrays used for these tests were created with a duplicate ratio of approx. 50%. The complete test script can be found here.


This leads to some first findings:

  • loop&splice is only usable for arrays with up to ~1000 elements, it degrades badly for larger sizes
  • object-dedup is 2 to 3 times faster than sort&dedup on V8, and on Firefox it outperforms the later even by a factor 3 to 6.
  • V8 is around 8 times faster on sort&dedup than Firefox and and 3 to 6 times faster on object-dedup
  • Firefox performs around 40% better with loop&splice than V8

Next the memory consumption was analyzed on node.js. The total heap consumption for the three methods can be seen in the next diagram:


Which does approve that everything comes with a cost: memory consumption for object-dedup can be 5 to 8 times higher than for the other methods.

How about floats?

All three algorithms can also be used with float arrays. The test arrays again contain 50% duplicate floats.

  • loop&splice: Chrome is here ~45% slower than with integers. Firefox needs 30 times longer compared to integer arrays.
  • sort&dedup: here it is the other way round. Firefox needs only ~28% longer than with integers. Chrome's performance degrades by a factor of 7 compared to integers (although overall faster than Firefox).
  • object-dedup: again the fastest method for float arrays. Chrome needs 11 times longer than with integers and Firefox 3.7 times



Chasin' strings?

Arrays with 40 character strings were used in this test. Again with a duplicate ratio of 50%.

  • Firefox shows quite similar results as with the previous tests with floats.
  • Chrome's performance degrades in all three algorithms. For sort&dedup with a factor of 4 to 6 and for object-dedup we see 18-46% increase in time to process the task.
  • Firefox outperforms Chrome with sort&dedup by a factor of 2 to 3.
  • loop&splice degrades faster than with the previous tests



Limitations of object-dedup: values are unescaped in this algorithm which can lead to complications in the processing of strings that contain a backslash (\). In the following example a property is assigned to an object:
var obj = {};
obj['\u0041lbert'] = null;  // same as obj.Albert = null; 
That means the values '\u0041lbert' and 'Albert' would be considered as equal by the object-dedup method!
Further problems when using this method with strings can be found in this article: an object is not a hash

Summary

We have seen three different methods to remove duplicate elements from an array. loop&splice is the typical O(n2) algorithm and not recommended for arrays with more than 1000 elements. sort&dedup performs much better and is the most straightforward way to perform this task. It uses the built-in Arrays.sort() method of the JavaScript runtime which leads to good results for all type of elements on both tested browsers. object-dedup is a bit the exotic bird of the compared algorithms. It exploits object property assignment for its task which is highly optimized in modern browsers (see hidden classes for Chrome). This leads to an impressive performance that was in most of the tests at least twice as good as sort&dedup and in some cases even 9 times faster. But it comes with the cost of higher memory consumption and limitations as mentioned above when it comes to strings (escape problem). For dedicated environments like node.js object-dedup can be a solution for performance critical tasks. Further tests should evaluate efficiency and compatibility on older browsers.

Update: created a new test case at http://jsperf.com/dedup-int-array. Here a slightly optimized loop&splice method is used that yields to better results.