Array-based semi-mutable sets
June 8, 2015
Persistent sets, on which my semi-mutable sets are based, are usually implemented as trees and are relatively inefficient in both insertion cost and memory usage. They are particularly inefficient when inserting many items in a single transaction, as each insert requires the creation of a new path to the newly inserted item, with all the associated memory allocation costs and subsequent garbage collection overhead. This problem can be severe when deserializing large sets, unless of course the data structure supports a merge function.
To illustrate this I have performed some quick and dirty tests on a few .Net data structures:
- array
- HashSet
- FSharpSet
FSharpSet is the only persistent data structure of the three, and it is the one I use in my semi-mutable set. The first test consists of populating the data structures with 1’000’000 random longs:
- long[]: 12 ms
- long[], sorted: 112 ms
- HashSet (added one at a time): 180 ms
- FSharpSet (added one at a time): 4990 ms
Nothing surprising here. Note that creating the FSharpSet from an array of longs isn’t any faster than adding one item at a time; a quick look at the source code indicates that the constructor actually adds the values one at a time.
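For reference, here is a rough sketch of what such a population test might look like. This is not the original benchmark code; the names are mine, and it assumes a reference to FSharp.Core for the FSharpSet<T> type.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using Microsoft.FSharp.Collections;

class InsertBenchmark
{
    static void Main()
    {
        const int n = 1000000;
        var rng = new Random(42);
        var values = new long[n];
        for (int i = 0; i < n; i++)
            values[i] = (long)(rng.NextDouble() * long.MaxValue);

        // long[]: copy the batch and sort it
        var sw = Stopwatch.StartNew();
        var copy = (long[])values.Clone();
        Array.Sort(copy);
        Console.WriteLine($"long[] sorted: {sw.ElapsedMilliseconds} ms");

        // HashSet: added one at a time
        sw.Restart();
        var hashSet = new HashSet<long>();
        foreach (var v in values) hashSet.Add(v);
        Console.WriteLine($"HashSet: {sw.ElapsedMilliseconds} ms");

        // FSharpSet: the constructor internally adds the values one at a time
        sw.Restart();
        var fsharpSet = new FSharpSet<long>(values);
        Console.WriteLine($"FSharpSet: {sw.ElapsedMilliseconds} ms");
    }
}
```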
The next test consists of measuring the enumeration speed of the above structures, with 100 enumerations of 1’000’000 elements:
- long[]: 820 ms
- HashSet: 1400 ms
- FSharpSet: 14000 ms
Again, no real surprise here, except maybe that the HashSet does rather well compared to the array. Note that as the enumeration is executed 100 times in a loop, the results might well be different when executed only once, due to caching of the data.
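The enumeration measurement itself is nothing more than repeated iteration; a sketch of the kind of helper that could be used, again with names of my own choosing:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class EnumerationBenchmark
{
    // 100 passes over the same structure, summing the values so the
    // loop body cannot be optimised away.
    public static long Measure(IEnumerable<long> source, string name)
    {
        var sw = Stopwatch.StartNew();
        long sum = 0;
        for (int i = 0; i < 100; i++)
            foreach (var v in source)
                sum += v;
        Console.WriteLine($"{name}: {sw.ElapsedMilliseconds} ms");
        return sum;
    }
}
```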
Finally, memory usage is again unsurprising:
- long[]: 7 MB
- HashSet: 37 MB
- FSharpSet: 34 MB
Arrays obviously win, by a factor of four. For small items such as longs the difference is massive, with an overhead over the actual data of roughly a factor of 3. On the other hand, 28 MB of overhead for 1’000’000 elements is about 28 bytes per element, which becomes almost negligible when dealing with large objects.
So, the idea is to combine the advantages of both arrays and persistent sets in a single structure that minimizes batch insertion time and memory consumption while maximizing enumeration speed. We create a persistent set of arrays of values. At its simplest, every new batch of values is inserted into the semi-mutable set as a single ordered array.
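A minimal sketch of the idea, not the actual implementation: each batch becomes one sorted array, and a new version of the set shares the existing arrays with its predecessor, so only the small outer array is copied on insertion. The ArraySet name and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ArraySet
{
    private readonly long[][] batches;   // each inner array is sorted

    public static readonly ArraySet Empty = new ArraySet(new long[0][]);

    private ArraySet(long[][] batches) { this.batches = batches; }

    // Insert a whole batch in one go: one sort plus one copy of the
    // outer array; the existing inner arrays are shared, not copied.
    public ArraySet AddBatch(long[] values)
    {
        var batch = (long[])values.Clone();
        Array.Sort(batch);

        var newBatches = new long[batches.Length + 1][];
        Array.Copy(batches, newBatches, batches.Length);
        newBatches[batches.Length] = batch;
        return new ArraySet(newBatches);
    }

    // Unordered enumeration: simply walk every batch in turn.
    public IEnumerable<long> Enumerate() => batches.SelectMany(b => b);
}
```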
Here is a short list of some of the disadvantages of this approach:
- it is expensive to enumerate all the elements in order, as one has to continuously hop between the arrays, unless the arrays are disjoint
- adding or removing a few elements requires copying a complete array
- there is nothing to stop a given set from holding multiple copies of the same element
However, since the arrays are ordered, it is relatively cheap to enumerate (not in order) all elements in a given range, though it will still be slower than with a fully ordered set, as all the arrays have to be searched (unless, again, they don’t overlap).
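Continuing the hypothetical ArraySet sketch above, such a range query could look roughly like this: binary-search each batch for the lower bound, then yield values until the upper bound is passed.

```csharp
// Enumerate all values in [lo, hi]; ordered within a batch, not globally.
public IEnumerable<long> Range(long lo, long hi)
{
    foreach (var batch in batches)
    {
        int start = Array.BinarySearch(batch, lo);
        if (start < 0) start = ~start;   // index of first element >= lo
        for (int i = start; i < batch.Length && batch[i] <= hi; i++)
            yield return batch[i];
    }
}
```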
In the next post I shall illustrate the implementation and use of this data structure with a fairly realistic example. At that point we will have a system with a very large number of relatively small arrays, and we will see how these arrays can be consolidated into fewer, larger ones.