Array based semi-mutable sets – implementation

A typical example where many items are added within a single transaction is a database that collects time-stamped events, also called time series. One can easily imagine a system where many thousands of events are added to a database every second. Obviously there is no point in creating a single transaction per event; in general one can batch them together. The delay between consecutive batches then depends on the acceptable delay before events become available to the database readers. One can imagine batching periods ranging from milliseconds to many seconds or more. Overall the important parameters of such a system are:

  • event production rate
  • batching period
  • events retention period

Together these parameters determine the liveliness and memory usage of the database. As an example consider NASDAQ trades. About 20 billion shares are traded every day on the NASDAQ; assuming an average of 300 shares per trade (approximate NYSE numbers) and a trading day of 7 hours, there are, on average, about 3000 trades per second. Assuming a retention period of 5 days, we are talking about keeping about 350 million trading transactions in memory. With a batching period of 1 second, we will be creating about 117'000 arrays, each containing on average 3000 events. While events are being added to the database, transactions beyond the retention date will be removed by the system. To avoid contention, I will do this garbage collection in the same transaction as the one that writes the new events. Essentially I will have a system with a single-threaded writer and multi-threaded readers.
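
These back-of-envelope figures are easy to sanity check. The following sketch just redoes the arithmetic with the rounded inputs quoted above (note that 5 days of one-second batches is exactly 126'000 arrays; the 117'000 figure comes from dividing the rounded 350 million event count by 3000):

```csharp
// Back-of-envelope check of the sizing numbers above.
const long sharesPerDay = 20_000_000_000;  // NASDAQ shares traded per day
const long sharesPerTrade = 300;           // approximate NYSE average
const long tradingSeconds = 7 * 3600;      // a 7 hour trading day
const long retentionDays = 5;
const long eventSize = 32;                 // bytes per TradingEvent

long tradesPerDay = sharesPerDay / sharesPerTrade;     // ~67 million
long tradesPerSecond = tradesPerDay / tradingSeconds;  // ~2650, i.e. roughly 3000
long retainedEvents = tradesPerDay * retentionDays;    // ~333 million
long retainedGB = retainedEvents * eventSize / 1_000_000_000;  // ~10 GB resident
long batchCount = retentionDays * tradingSeconds;      // 126'000 one-second batches
```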


The trading transactions will be represented using the following data structure:

public struct TradingEvent : IComparable {
    public long Id;
    public long Time; // unix milliseconds
    public short Price;
    public short Volume;
    public short Stock;
    public short BuyerId;
    public short SellerId;
    public short TraderId;

    public int CompareTo(object other) {
        return Time.CompareTo((long)other);
    }
}

which takes up 32 bytes. The following data structure holds the array of trades ordered by time:

public struct TradeBatch : IComparable {
    public Guid Id;
    public long First;
    public long Last;
    public TradingEvent[] Trades;

    public int CompareTo(object other) {
        var that = (TradeBatch)other;
        var result = First.CompareTo(that.First);
        if (result == 0) result = Id.CompareTo(that.Id);
        return result;
    }
}

which are then stored in the following semi-mutable set:

smxSet<TradeBatch> allTradeBatches;

The code first populates 5 days worth of events (about 378 million events, i.e. 11 GB) and then generates 3000 events every second: 500 events are generated with time stamps within the batching period of one second, another 500 within 2 seconds, another 500 within 5 seconds, 500 within 10 seconds, 500 within 30 seconds and the last 500 within 60 seconds. This models the case where events are received after some delay. Hence, on average, about 60 arrays overlap each other in time; in other words, for any given time, events can be found on average in 60 arrays.
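
The generator itself is not shown in this post; a hypothetical sketch of one second's worth of events could look like the following (the names GenerateBatch and the minimal TradingEvent stand-in are mine, not the actual code):

```csharp
using System;

// Hypothetical sketch of the per-second generator: 3000 events per batch,
// in six groups of 500 whose time stamps lag "now" by up to 1, 2, 5, 10,
// 30 and 60 seconds respectively, modelling late-arriving events.
static TradingEvent[] GenerateBatch(long nowMs, Random rng) {
    int[] delayWindowsSeconds = { 1, 2, 5, 10, 30, 60 };
    var events = new TradingEvent[3000];
    for (int g = 0; g < delayWindowsSeconds.Length; g++) {
        long windowMs = delayWindowsSeconds[g] * 1000L;
        for (int i = 0; i < 500; i++) {
            events[g * 500 + i] = new TradingEvent {
                Id = g * 500 + i,
                Time = nowMs - (long)(rng.NextDouble() * windowMs)
            };
        }
    }
    // keep each batch sorted by time so it can be binary searched later
    Array.Sort(events, (a, b) => a.Time.CompareTo(b.Time));
    return events;
}

// minimal stand-in for the TradingEvent struct above
struct TradingEvent { public long Id; public long Time; }
```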

Since the events within an array are sorted, enumerating all events within a time range is just a matter of filtering the batches that overlap the desired range and then binary chopping the arrays to get the indices that bound the time interval:

foreach (var batch in allTradeBatches) {
    if (batch.Last < start || batch.First > end) continue; // batch does not overlap the range
    var firstIndex = 0;
    var lastIndex = batch.Trades.Length;
    if (start >= batch.First && start <= batch.Last) {
        firstIndex = Array.BinarySearch(batch.Trades, start);
        if (firstIndex < 0) firstIndex = -(firstIndex + 1);
    }
    if (end >= batch.First && end <= batch.Last) {
        lastIndex = Array.BinarySearch(batch.Trades, end);
        if (lastIndex < 0) lastIndex = -(lastIndex + 1);
    }
    for (int i = firstIndex; i < lastIndex; i++) {
        totalVolume += batch.Trades[i].Volume;
    }
}

On average, the binary search should only have to be carried out twice on each of sixty arrays. Once the indices have been determined, all accesses to memory are sequential, i.e. rates of 10 GB/s can be expected, which should translate to 320 Mev/s. Assuming a maximum bandwidth of 50 GB/s when multithreaded, we could expect to read events at a rate of 1.6 Gev/s.

Obviously the binary search adds an overhead. The standard array implementation takes about 4 μs per array of 3000 events on my laptop. A basic custom implementation, one which doesn't have to go through the IComparable interface and the unboxing of the long, takes 1.4 μs. Assuming 120 binary searches, the standard implementation presents an overhead of up to 0.5 ms, or 170 μs with a custom implementation. To put things in context, this corresponds to the time it takes to sequentially access respectively 160 Kev or 54 Kev.
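
A custom implementation of the kind measured above might look like this (a sketch, not the benchmarked code): it compares the Time field directly as a long, so there is no IComparable dispatch and no boxing of the key.

```csharp
// Lower-bound binary search on the Time field. Returns the index of the
// first event with Time >= time, which is what -(Array.BinarySearch + 1)
// yields when the key is absent; with duplicate time stamps this returns
// the first match, whereas Array.BinarySearch may return any of them.
static int LowerBound(TradingEvent[] trades, long time) {
    int lo = 0, hi = trades.Length;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (trades[mid].Time < time) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// minimal stand-in for the TradingEvent struct above
struct TradingEvent { public long Time; }
```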

Interestingly, the binary search time of 4 μs corresponds to reading 1280 events; double that if you have to perform two binary searches and we could already have retrieved 2560 events! Reading the whole array takes almost the same time as performing the binary search! This is no longer the case with the custom implementation, but it is still a remarkable fact that highlights the incredible speed of sequential reads. To scan the whole array, the code to be executed would have to be something like:

for (int i = 0; i < batch.Trades.Length; i++) {
    if (batch.Trades[i].Time > start && batch.Trades[i].Time < end) {
        totalVolume += batch.Trades[i].Volume;
    }
}

So let's see how this compares to the same loop but without the conditions on the time. On an E5 2666 Xeon machine, it took 36 ms with the conditions versus 29 ms without them to enumerate 3000×3600 events. In other words, 300 Mev/s versus 372 Mev/s, a slowdown of about 20%.

The main advantage of this last version is that the arrays do not have to be sorted! Whether this is really an advantage will depend on the actual use case; in general more time is spent reading than writing, so it is often good practice to do the extra work once, at write time.

I have executed some more complete tests (still focused on enumerating large numbers of events rather than just a few, so the impact of the binary searches should be negligible). The code is executed in up to 8 threads; each thread enumerates a disjoint set of events. For example, in the case of two threads, the first one enumerates the first half of all events while the second thread enumerates the second half. This way all the data from the event batches is read from main memory rather than from the cache. The event set was populated with 5×7×3600 batches of 3000 events.


Single threaded performance stands at 315 Mev/s (almost 10 GB/s); maximum total bandwidth reaches about 1.3 Gev/s. These results include garbage collection of versions, generation of 3000 events every second and removal of old batches, but no persistence to disk.

Using this approach, the database is populated with 126'000 arrays, which is a lot, and on average 60 arrays overlap in time. Although this approach is most efficient at enumerating all the events within a given time range, it is important to note that overall the events are only partially ordered.

In the next post I shall illustrate how these small arrays can be consolidated into fully ordered larger non-overlapping ones.

Memory benchmarks: Addendum tertium: the forgotten TLB

Obviously I have completely forgotten to consider the TLB. In Haswell processors, the L1 TLB contains 64 entries while the L2 TLB holds 1024 entries. All memory accesses must be translated from virtual to physical address, so every single read operation requires a TLB lookup. An L2 TLB miss implies retrieving the TLB entry from memory, so in the worst case a single memory read requires two (chained) memory accesses.

In .Net, page size is 4 KB, so 1024 TLB entries will cover only 4 MB of memory. Remember that the benchmarks executed so far used 1 GB arrays, so in the case of random and chained memory accesses essentially all L2 TLB lookups should fail. But the access times observed don't seem to exhibit such large delays. So what is going on here? The only explanation I have is that the L3 cache is actually holding all the page-table entries (this could be easily tested by disabling the L3 cache). On x64 a page-table entry is 8 bytes long, so a 25 MB cache can hold about 3 million entries, covering at most 12.5 GB of memory. Hence, to feel the full effect of missed L3 caches we would have to work with arrays much larger than 12.5 GB. I added a few extra tests which deal with much larger arrays. Since I am using arrays of longs and in .Net arrays cannot have more than Int32.MaxValue elements, I have used an array of arrays of longs. Here is an extract of the code for a simplified version of chained reads:

var arrays = sizes.Max().EnumerateTo().Select(t => new long[max]).ToArray();
for (int i = 0; i < steps; i++) {
    // the first arrays should be retrieved from cache
    // the integer modulo should be negligible
    next = arrays[i % size][(11587L * (i + next)) & mask];
}

The loop for the random reads is the same except for the next variable not appearing on the right hand side.

Here are the results for the 8 cores machine with 30 GB of memory for chained reads:

[chart: chained reads on large arrays]

Well, the full impact of TLB misses is rather clear: memory access time has gone from 110 ns to 185 ns and keeps increasing. Caches have become quite complex and I will not attempt to explain the results! Suffice to say that it exhibits a classic cache saturation curve. For 1 GB arrays the cache hit rate is close to 100% (1 GB needs only 2 MB of TLB entries), while above 30 GB the hit rate will be close to 0% (30 GB requires 60 MB of TLB entries). Maximum values will be somewhere around 190 ns or 200 ns, which corresponds to a doubling of the access time (allowing for 10 or 15 ns for L3 access).

The other case to consider is the random reads one:

[chart: random reads on large arrays]

For random reads, the shape of the curve is very different… and frankly, I have even less to say 🙂 But a few things are worth mentioning. Firstly, it seems to start flat, which was not the case for the chained reads. Secondly, it saturates already at 10 GB with a latency over three times the starting value, while one could expect the maximum latency to be twice the starting value.

Unsurprisingly, array size has no effect on the performance of sequential reads. Note that it takes about 500 ns to read a 4 KB page. In the worst case, adding a 110 ns delay every 500 ns would increase the average latency by about 20 percent. In practice, subsequent TLB entries (assuming linear organization) should be prefetched, and hence have no impact.

So it looks like some of my previous results will not be applicable to very large data sets, so I had better rerun my previous benchmarks on much larger arrays!

The bottom line is that for very large arrays, TLB misses due to the tiny pages have a massive effect on latency. The best solution would be to use larger pages but, sadly enough, that option doesn't seem to be available in .Net.

[10 April 2016] I have updated some previous posts with the latest version of the benchmarks, which now share a single 1 GB array to minimize the impact of TLB misses.

Memory benchmarks: Addendum secundum: .Net Native

Windows 10 has been out for a while now and I eventually got a machine running it: the 4 cores machine I have used so far. So I was in a position to try out .Net Native which, disappointingly enough, seems to be supported only for so-called universal applications on Windows 10. .Net Native uses the same optimization stages as the native C++ compiler, so one would expect to get (in some cases) the same performance level.

Note that in the context of these memory benchmarks, I wasn't expecting many improvements, as in most cases we are memory bandwidth limited. The most obvious exception is the sequential read case which, as we already know, can be executed faster with vector instructions (15 GB/s versus 11 GB/s on the 4 cores machine). Having said that, I might try the concurrent random reads case of my Addendum primum. I haven't updated the code, which needs various changes to compile on the universal platform. The code as it stands now in Github sums the values in the array and then discards the result. This, obviously, won't go down very well with a good optimizing compiler, which will just avoid carrying out the calculation. Actually the impact wasn't always quite as drastic, but still, better safe than sorry.

So here is the latest greatest chart for sequential access on the four cores machine using .Net Native:

[chart: sequential access with .Net Native]

As expected, or hoped for, .Net Native gives similar performance to C++ or AVX2.

Memory benchmarks: Addendum primum

You might have noticed that I missed one case in my previous blog post: performing multiple independent random reads within a single thread. The code was expanded as follows:

switch (inThreadParallelismLevel) {
    case 1: {
        for (int i = 0; i < steps; i++) {
            total0 += array[(11587L * i) & mask];
        }
        break;
    }
    case 2: {
        for (int i = 0; i < steps; i++) {
            total0 += array[(24317L * i) & mask];
            total1 += array[(14407L * i) & mask];
        }
        break;
    }
    // cases 3 and up follow the same pattern
}

We saw that random reads performance scaled nicely with the number of threads (kind of linearly up to 5 threads on the 6 cores CPU). So the question is whether we can achieve the same when performing multiple independent random reads within a thread.

Rather disappointingly, this is not the case. Here are the results for the 8 cores machine; the graph represents the memory bandwidth versus the number of concurrent requests within the single thread:


So, a bit of a surprise: there is absolutely no advantage here in running multiple reads within a thread. The scaling observed in the case of chained reads doesn't happen here. Mind you, it is already going about 4 times faster here, so there is less scaling potential. More importantly, the scaling seen across threads/cores is not present here. So it looks like this will only scale if the memory read requests originate from different cores rather than from a single one.

Memory benchmarks

I have often wondered how current many-core CPUs can retrieve enough data from memory to feed all their cores. So I decided to carry out a few simple experiments.

I looked at three different scenarios: sequential memory access, dependent chained memory access and random memory access. The second is the case where the result of a read determines the address of the subsequent read (load-to-use).

In the first case, we expect to reach very high memory bandwidth, potentially hitting the maximum specified in the CPU spec with relatively few threads. In the case of chained memory access the bandwidth will reflect the memory’s full (huge) latency. In the third case the bandwidth should be higher but probably not by much. In all cases I am mainly interested in the scalability aspects, i.e. how does the single thread/core bandwidth scale as more threads/cores hit the memory bus.

Note that the code was put together fairly quickly, far more care should be taken to get accurate results, however this should be good enough to understand the scalability aspects I am interested in.

As the goal is to measure main memory access as opposed to cache access, and to minimize the impact of the limited TLB buffer, each thread accesses the same 1 GB array of 64 bit longs. The code is available on Github. For the sequential case the core loop is simply:

    for (int i = 0; i < max; i++) total += array[i];

The next benchmark is the one where the address of each memory access depends on the value of the previous read:

    for (int i = 0; i < max/8; i++) index = array[index];

where the array has been initialized so that the index performs a random walk through the whole array. This models the case where one walks down a dependency path, e.g.

    var x = a.b.c.d.e.f;
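
The text says the array has been initialized so that the index performs a random walk through the whole array. One way to do that (an assumption on my part; the benchmark's own initializer may differ in detail) is Sattolo's algorithm, which produces a permutation consisting of a single cycle:

```csharp
using System;

// Initialize the array so that index = array[index] walks a single random
// cycle through all elements: Sattolo's algorithm swaps each position only
// with a strictly earlier one, which guarantees a single n-cycle, so the
// chain visits every slot exactly once before repeating.
static long[] MakeChain(int max, int seed) {
    var array = new long[max];
    for (int i = 0; i < max; i++) array[i] = i;
    var rng = new Random(seed);
    for (int i = max - 1; i > 0; i--) {
        int j = rng.Next(i);              // 0 <= j < i, never i itself
        long tmp = array[i]; array[i] = array[j]; array[j] = tmp;
    }
    return array;
}
```

Following `index = array[index]` from any starting point then returns to that point only after exactly max steps.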

The code for the random access performance test is:

    for (int i = 0; i < max/8; i++) total += array[(i * 11587L) & (max-1)];

where we take large, non-periodic, steps to avoid prefetch and cache effects. Note that since the reads are much slower we perform 8 times fewer memory accesses.
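The masking trick relies on max being a power of two and the multiplier being odd: gcd(11587, 2^k) = 1, so i * 11587 mod max is a bijection that visits every index exactly once over max iterations, while consecutive reads land far apart. A quick check on a small array:

```csharp
using System;

// With max a power of two and an odd step, (i * step) & (max - 1) is a
// permutation of 0..max-1: every slot is hit exactly once per max
// iterations, yet consecutive indices are far apart, defeating prefetch.
const int max = 1024;                 // must be a power of two
const long step = 11587L;             // odd, hence coprime with max
var visited = new bool[max];
for (int i = 0; i < max; i++)
    visited[(int)((i * step) & (max - 1))] = true;
bool coversAll = Array.TrueForAll(visited, v => v);   // true: a full permutation
```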

In all of these benchmarks, ten runs of one hundred such loops are executed and the best (lowest) time is taken.

I executed this code on three different machines with the following specs:

  1. 4 cores – i7 2600K 3.4–3.5 GHz with 16 GB DDR3
  2. 6 cores – Xeon E5 1650 v3 3.4–3.5 GHz with 16 GB of four channel DDR4
  3. 8 cores – Xeon E5 2666 v3 2.9–3.2 GHz with 30 GB of four channel DDR4

The Xeon E5 2666 v3 is a custom Intel CPU for AWS, you can find more details in The Register.

All the results can be found on Github next to the actual code. Here are the results for sequential access which represents the bandwidth in GB/s as a function of the number of concurrent threads:


The first thing to mention is that there is no massive difference between the memory bandwidth of the various systems in the single threaded case, and beyond that the memory bandwidth scales similarly on all CPUs. Where the difference is significant, however, is how far each system scales. The oldest CPU stops at 20 GB/s (21 GB/s in the spec), the Xeon 1650 supports double that (spec says 68 GB/s), and the 8 cores Xeon 2666 reaches 50 GB/s with 8 threads (spec says 68 GB/s). Note that the old CPU uses DDR3 with two channels while the two Haswell CPUs use DDR4 with four channels, so no big surprise there.

Although not linear, the scalability of each of these is pretty good. The slowest and oldest system scales only up to twice its single threaded performance. The 1650 scales up to nearly five times its single threaded performance, while the 2666 goes up to about six or seven times (albeit starting lower). Note that neither of the Haswell CPUs gets anywhere close to its spec-ed memory bandwidth of 68 GB/s on the tested systems.

In terms of memory access time, in the single threaded case, the memory bandwidth of the 1650 is equivalent to 0.83 ns access time per long! This number is most impressive and is down to the quality of the CPU's prefetch unit (actually the streamer), which detects the sequential read pattern and prefetches aggressively (up to 20 cache lines ahead). You can read all the details (well, the ones they made public) about the prefetch unit in Intel's optimization manual.

Let's move on to the chained reads:


where the difference with sequential reads is striking: roughly speaking, chained reads are one hundred times slower than the sequential case. This corresponds to a latency of about 90 ns, which is pretty much what one can expect from DRAM nowadays. Note that the bandwidth increases almost linearly with the number of threads and shows no sign of saturating.

Here are the results for random reads. Single threaded bandwidth starts a bit below 0.4 GB/s, which corresponds to a latency of about 18 ns. Note that in this case the bandwidth increases fairly linearly with the number of threads and saturates at around 2.4–2.6 GB/s.


The difference with the chained reads, a factor of four or five, is due to the fact that since the reads are independent, the CPU can dispatch several of them to the memory controller pretty much concurrently. The memory controller can then take full advantage of the channel and bank-level parallelism as well as memory pipelining to perform concurrent memory reads.

Since this last result seems to indicate that the memory controller is able to dispatch maybe up to five (independent) memory requests concurrently, we could try to take advantage of this by executing multiple chained reads within a given thread:

    for (int i = 0; i < max/8; i++) {
        index0 = array[index0];
        index1 = array[index1];
    }

I have implemented this code with up to eight concurrent indices. Since all of these dependent chains access the same array, I made sure that all indices are sufficiently distant from each other to exclude any caching effect. The code was executed on up to 8 threads on the 8 cores machine. The abscissa (x-coordinate) is the product of the number of threads and the concurrency level:


In the single threaded case, the maximum bandwidth is 255 MB/s, which is very close to the single threaded random reads result of 283 MB/s. When running multithreaded, the maximum bandwidth reaches 1.94 GB/s, again close to the multithreaded random reads figure of 2.04 GB/s. So, in principle, and possibly at almost no cost at all, one can gain a factor of four in memory bandwidth by executing multiple chained reads within a single loop.

So the bottom line is:

  • sequential access is the king, by a factor of twenty to one hundred
  • chained reads can be made to run four times faster

In the case of chained reads, the huge 90 ns delay between consecutive memory reads should permit the execution of a large number of instructions, potentially hundreds of them, allowing quite a lot of processing of the retrieved data. This is also the case, albeit on a smaller scale, for random reads.

Many thanks to Asier for suggestions, comments, discussions and comparisons between C++ and C# performance!

[10 April 2016] The graphs in this post have been replaced with the latest version of the benchmarks, which minimizes TLB impact. The text has been edited accordingly.

Array based semi-mutable sets

Persistent sets, on which my semi-mutable sets are based, are usually implemented as trees and are relatively inefficient in both insertion costs and memory usage. They are particularly inefficient when inserting many items in a single transaction, as each insert requires the creation of a new path to the newly inserted item, with all the associated memory allocation costs and subsequent garbage collection overhead. This problem can be severe when deserializing large sets, unless of course the data structure supports a merge function.

To illustrate this I have performed some quick and dirty tests on a few .Net data structures:

  • array
  • HashSet
  • FSharpSet

The FSharpSet is the only persistent data structure of the three and is the one I use in my semi-mutable set. The first test consists of populating the data structures with 1'000'000 random longs:

    long[]    12 ms
    long[]   112 ms    and sorted
    HashSet  180 ms    added one at a time
    FSharp  4990 ms    added one at a time

Nothing surprising here. Note that creating the FSharpSet from an array of longs isn't faster than adding one item at a time; a quick look at the source code indicates that the constructor actually adds the values one at a time.
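
For reference, the array and HashSet rows of the table above can be reproduced with a quick sketch like this (timings will of course differ across machines; the FSharpSet case is omitted here as it needs FSharp.Core):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Populate the structures with 1'000'000 random longs and time each case.
const int N = 1_000_000;
var rng = new Random(42);
var values = new long[N];
for (int i = 0; i < N; i++)
    values[i] = ((long)rng.Next() << 32) | (uint)rng.Next();

var sw = Stopwatch.StartNew();
var sorted = (long[])values.Clone();
Array.Sort(sorted);
Console.WriteLine($"long[] sorted: {sw.ElapsedMilliseconds} ms");

sw.Restart();
var hashSet = new HashSet<long>();
foreach (var v in values) hashSet.Add(v);
Console.WriteLine($"HashSet:       {sw.ElapsedMilliseconds} ms");
```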

The next test consists of measuring the enumeration speed of the above structures: 100 enumerations of 1'000'000 elements:

    long[]    820 ms
    HashSet  1400 ms    
    FSharp  14000 ms

Again, no real surprise here, except maybe that the HashSet is doing rather well compared to the array. Note that as the enumeration is executed 100 times in a loop, the results might well be different when executed only once, due to the caching of the data.

Finally memory usage is again unsurprising:

    long[]    7 MB
    HashSet  37 MB
    FSharp   34 MB

Arrays obviously win, by a factor of four. For small objects such as longs the difference is massive, with an overhead over the actual data of a factor of three or four. On the other hand, 28 MB of overhead for 1'000'000 elements is about 28 bytes per element, which becomes almost negligible when dealing with large objects.

So the idea is to combine the advantages of both arrays and persistent sets in a single structure that minimizes batch insertion time and memory consumption while maximizing enumeration speed: we create a persistent set of arrays of values. At its simplest, every new batch of values is inserted in the semi-mutable set as an ordered array of values.

Here is a short list of some of the disadvantages of this approach:

  • it is expensive to enumerate all the elements in order, as one has to continuously hop between the arrays, unless the arrays are disjoint
  • adding or removing a few elements requires the copying of a complete array
  • there is nothing to stop multiple copies of the same element appearing in a given set

However, since the arrays are ordered, it is relatively cheap to enumerate (not in order) all elements in a given range, though this will still be slower than with a fully ordered set, as all the arrays have to be searched (unless, again, they don't overlap).

In the next post I shall illustrate the implementation and use of this data structure with a fairly realistic example. At that point we will have a system with a very large number of relatively small arrays, and we will see how they can be consolidated into fewer, larger ones.

Empty for a reason

So far I have found Option<T> for reference types to be the best way of avoiding null reference exceptions while keeping my code very readable. I use it when parsing strings, retrieving values from dictionaries, getting properties from object via reflection and more. Option<T> allows one to write very clean code without any risk of null reference exceptions. To illustrate this point, consider the following code:

string r = null;
if (someObject != null) {
    var value = someObject.GetPropertyValue<string>("P");
    int i;
    if (value != null && int.TryParse(value, out i)) {
        dict.TryGetValue(i, out r);
    }
}

which, using Option<T> and associated LINQ support can be rewritten as

var r = from o in someObject.AsOption()
        from s in o.GetPropertyValue<string>("P")
        from i in s.ParseAsInt()
        from v in i.Lookup(dict)
        select v;

On the first line we have an empty Option if someObject is null. The first from extracts the actual object if the Option is not empty; the second one retrieves a property value if the property exists, is of type string and its value is not null; then we try to parse the string, getting an integer value if the parsing is successful; finally we look up a dictionary and again get a value only if the dictionary holds the key. So all in all we have six different reasons to end up with an empty result. The main goal of Option is to avoid null reference exceptions, but quite often, in particular while debugging, one is actually interested in why the result is empty: is it because the dictionary didn't contain the key? Because the string didn't represent an integer? Etc. As the code stands there is no way of knowing the actual reason without redoing the whole calculation, typically by stepping through the code one step at a time.

An empty result is empty for one reason and one reason only. So the idea is that whenever a step returns an empty Option, the reason (a string) for returning empty is stored in the Option. Here is the implementation of the methods used in the above example; you will see that each time an empty Option is returned, the reason for doing so is specified:

public static Option<T> AsOption<T>(this T value) {
    if (value is ValueType) return new Option<T>(value);
    if ((object)value == null) return Option.Empty<T>("value is null");
    return new Option<T>(value);
}

public static Option<T> AsOption<T>(this T value, string reason) {
    if (value is ValueType) return new Option<T>(value);
    if ((object)value == null) return Option.Empty<T>(reason);
    return new Option<T>(value);
}

public static Option<R> GetPropertyValue<R>(this object value, string propertyName) {
    return from v in value.AsOption()
           from pi in v.GetPropertyInfo(propertyName)
           from cast in pi.GetValue(v).AsOption().As<R>()
           select cast;
}

public static Option<PropertyInfo> GetPropertyInfo<T>(this T value, string propertyName) {
    return from v in value.AsOption()
           from p in v.GetType().GetProperty(propertyName)
                      .AsOption("Property " + propertyName + " not found in " + v.GetType().Name)
           select p;
}

public Option<R> As<R>() {
    return this.HasValue && value is R
        ? ((R)(object)value).AsOption()
        : Option.Empty<R>("Cannot cast " + typeof(T).Name + " to " + typeof(R).Name);
}

public static Option<int> ParseAsInt(this string source) {
    int value;
    return int.TryParse(source, out value)
        ? value.AsOption()
        : Option.Empty<int>("Input string cannot be parsed:" + source);
}

public static Option<V> Lookup<K, V>(this K key, Dictionary<K, V> dict) {
    V value;
    return dict.TryGetValue(key, out value)
        ? new Option<V>(value)
        : Option.Empty<V>("Key not found:" + key.ToString());
}

So now, if the result of our query is empty, we can retrieve the reason for it being empty from the Option instance in the form of a string. Note that this was achieved without making a single modification to our code.
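
The Option<T> type itself is my own and is not listed in full here. A minimal sketch of what it needs in order to support the query syntax and carry the reason might look as follows (the real type has more members, including the non-generic Option.Empty<T> helper used above; SelectMany is the method the compiler targets for the chained from clauses):

```csharp
using System;

// Minimal sketch of an option type that records why it is empty.
public struct Option<T> {
    readonly T value;
    public bool HasValue { get; }
    public string Reason { get; }   // why the option is empty, null otherwise
    public T Value => value;

    public Option(T value) { this.value = value; HasValue = true; Reason = null; }
    Option(string reason) { value = default(T); HasValue = false; Reason = reason; }

    public static Option<T> Empty(string reason) => new Option<T>(reason);

    public Option<R> Select<R>(Func<T, R> map) =>
        HasValue ? new Option<R>(map(value)) : Option<R>.Empty(Reason);

    // Propagates the FIRST reason encountered down a chain of from clauses.
    public Option<R> SelectMany<U, R>(Func<T, Option<U>> bind, Func<T, U, R> project) {
        if (!HasValue) return Option<R>.Empty(Reason);
        var u = bind(value);
        return u.HasValue ? new Option<R>(project(value, u.Value)) : Option<R>.Empty(u.Reason);
    }
}
```

With this in place, a query such as `from a in x from b in f(a) select a + b` yields either a value or the reason of the first empty step, which is exactly what the example above relies on.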

Let’s now have a look at how the other version of the code would have to be modified to achieve the same thing:

string reason = "null someObject";
string r = null;
if (someObject != null) {
    reason = "Unable to retrieve property P in " + someObject.GetType().Name;
    var value = someObject.GetPropertyValue<string>("P").Value;
    int i;
    if (value != null) {
        reason = "Unable to parse " + value + " to an int";
        if (int.TryParse(value, out i)) {
            reason = "Key not found:" + i.ToString();
            dict.TryGetValue(i, out r);
        }
    }
}

where I have avoided using else branches to keep the code compact and slightly more readable, but clearly there is no contest between the two versions.

So far I have found this useful mostly for logging and debugging purposes. But I wonder whether a strongly typed version of this could be used for other purposes…

Exiting the monad with eval in LINQ

LINQ to objects queries are lazy, but this is actually rarely what I need, so in most cases I add a ToList() or a ToArray() in my code before using the result of the query. Since this requires the addition of brackets around the query, the readability of the code suffers, and often I skip LINQ and use chained extension methods instead:

var x = (from i in numbers
         where i < 0
         select i*2).ToArray();
var x = numbers.Where(i => i < 0).Select(i => i*2).ToArray();

The same happens with all methods that exit the monad, e.g. First/Last/Single, Any/All/Contains, Aggregate, Average/Min/Max/Sum, Count, etc. So wouldn’t it be nice to have an extra clause in LINQ that would let one apply any of these method to the result of the query?

var x = from i in numbers
        where i < 0
        select i*2
        eval ToArray();

I am using the eval keyword here as it does actually result in the evaluation of the query. In some cases we might want to include a lambda expression:

var x = from i in numbers
        where i < 0 select i*2 eval First(j => j > 8);

which isn’t very pretty but makes it clear that the last bit is not actually part of the query. The alternative would be to add new keywords to LINQ, e.g. first, last, any, etc. But this approach would result in a proliferation of keywords. The advantage of the above approach is that any existing method can be used.

Intermediate approaches might be possible, for example:

var x = from i in numbers
        where i < 0 select i*2 eval First j where j > 8;

where the where keyword could be used whenever the lambda is a predicate. Similarly for Sum/Max/Min/Average:

var x = from i in numbers
        where i < 0
        select i*2
        eval Sum j for j*j;

though this one doesn’t look too nice.

Never mind, I have created a fork of Roslyn on Codeplex to try out this idea… to be continued once the changes have been committed.

[24 hours later] I committed the code in a fork named AddEvalClauseToLinq. It was nice and easy to modify the syntax, but things got trickier when translating the query. I managed to get the example code (see below) to build and execute, but I would have to spend much more time on the compiler to do a less hacky job.

Here is the code I managed to compile:

var ints = new[] { 1, 2, 3 };
Func<int, bool> Trace = i => { Console.Write("<" + i + ">"); return true; };

Console.WriteLine("x = from i in ints where Trace(i) select i*2;");
var x = from i in ints
        where Trace(i)
        select i*2;
Console.WriteLine("type: " + x.GetType().Name);
foreach (var i in x) Console.Write(i + " ");
foreach (var i in x) Console.Write(i + " ");

Console.WriteLine("y = from i in ints where Trace(i) select i*2 eval ToList;");
var y = from i in ints
        where Trace(i)
        select i * 2
        eval ToList();
Console.WriteLine("type: " + y.GetType().Name);
foreach (var i in y) Console.Write(i + " ");
foreach (var i in y) Console.Write(i + " ");

Console.WriteLine("z = from i in ints where Trace(i) select i*2 eval First;");
var z = from i in ints
        where Trace(i)
        select i * 2
        eval First();
Console.WriteLine("type: " + z.GetType().Name);
Console.WriteLine("Value: " + z);

Console.WriteLine("s = from i in ints where Trace(i) select i*2 eval Sum(j => j*j);");
var s = from i in ints
        where Trace(i)
        select i * 2
        eval Sum(j => j*j);  
Console.WriteLine("type: " + s.GetType().Name);
Console.WriteLine("Value: " + s);

and here is the console output:

x = from i in ints where Trace(i) select i*2;
type: WhereSelectArrayIterator`2
<1>2 <2>4 <3>6
<1>2 <2>4 <3>6

y = from i in ints where Trace(i) select i*2 eval ToList;
<1><2><3>type: List`1
2 4 6
2 4 6

z = from i in ints where Trace(i) select i*2 eval First;
<1>type: Int32
Value: 2

s = from i in ints where Trace(i) select i*2 eval Sum(j => j*j);
<1><2><3>type: Int32
Value: 56

I think that the console output says it all: types are as expected, and only the first query, the one without eval, is executed multiple times.

The next Roslyn hack I would love to try is better syntactic support for tuples, as in

var x = from (i,j) in listOfTuples
        where i < 0
        select j*2;

or

var (i,j) = Tuple.Create(1,2);

depending on how hard it turns out to be.

[EDIT] Something very similar has been proposed in #3571

Join-Calculus library hosted on Codeplex

Just a quick note to mention that the Join-Calculus library is now hosted on Codeplex.

Combining IEnumerable with Option/Maybe in LINQ queries

I have found Option<T>/Maybe<T> (same as Nullable<T>, but usable with reference types and in LINQ) to be incredibly useful. The usefulness of Option/Maybe is well documented, as is the LINQ support for Maybe<T> or Option<T>. What I want to briefly address here is the interplay between Option<T> and IEnumerable<T>, essentially flattening IEnumerable<Option<T>> to just IEnumerable<T> and Option<IEnumerable<T>> to IEnumerable<T>:

    IEnumerable<Option<T>> => IEnumerable<T>
    Option<IEnumerable<T>> => IEnumerable<T>

In the first case I want to avoid having to deal with an enumeration of holes, which in general is meaningless, while in the second an empty Option<IEnumerable<T>> can just be expressed as an empty IEnumerable<T>.
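For reference, the examples that follow rely on only a handful of Option<T> members; a minimal sketch consistent with that usage (the actual type may well differ) could look like this:

```csharp
using System;

// Minimal Option<T> sketch: a presence flag, a constructor for present
// values, a Value accessor, and an Empty factory, matching the members
// used in this post.
public struct Option<T> {
    readonly bool hasValue;
    readonly T value;

    public Option(T value) { this.value = value; hasValue = true; }

    public bool HasValue { get { return hasValue; } }

    public T Value {
        get {
            if (!hasValue) throw new InvalidOperationException("Option is empty");
            return value;
        }
    }
}

public static class Option {
    // default(Option<T>) has hasValue == false, i.e. the empty option.
    public static Option<T> Empty<T>() { return default(Option<T>); }
}
```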

Taking concrete examples, this is what I would like to be able to write. Let’s first define a few helper methods to make our examples a little bit more realistic:

public static Option<V> GetValue<K, V>(this Dictionary<K, V> that, K key) {
    V value;
    return that.TryGetValue(key, out value)
        ? new Option<V>(value)
        : Option.Empty<V>();
}

public static Option<double> div(int n, int d) {
    return d == 0
        ? Option.Empty<double>()
        : new Option<double>(1.0 * n / d);
}

public static Option<double> inv(int n) {
    return n == 0
        ? Option.Empty<double>()
        : new Option<double>(1.0 / n);
}
The simplest case consists of only selecting Options that have a value in a standard IEnumerable query:

IEnumerable<double> invs = from i in new[] { 0, 1, 2 }
                           select inv(i);

which contains just two elements. The same can be done with an Option sub query:

IEnumerable<double> invs = from i in new[] { 0, 1, 2 }
                           select from v in inv(i)
                                  select v;

A more complex example:

IEnumerable<double> res = from i in new[] {1, 2}
                          from j in new[] {0, 2}
                          select div(i, j);

For the other operation we use a dictionary of IEnumerable to illustrate the implementation:

var d = new Dictionary<int, IEnumerable<int>>() {
    {1, new[] {1, 2, 3}}
};

IEnumerable<int> res = from v in d.GetValue(1)
                       from w in d.GetValue(1)
                       select v.Concat(w);

The following example returns an empty enumeration:

IEnumerable<int> res = from v in d.GetValue(1)
                       from w in d.GetValue(0)
                       select v.Concat(w);

Queries can also be nested:

IEnumerable<int> res = from e in from v in d.GetValue(1)
                                 select v
                       select e*2;

Here are the overloads I defined to support these examples:

public static IEnumerable<R> Select<T,R>(this IEnumerable<T> source, Func<T, Option<R>> selector) {
    foreach (var s in source) {
        var r = selector(s);
        if (r.HasValue) yield return r.Value;
    }
}

public static IEnumerable<R> SelectMany<T,C,R>(this IEnumerable<T> source, Func<T, IEnumerable<C>> collectionSelector, Func<T, C, Option<R>> resultSelector) {
    foreach (var s in source) {
        foreach (var r in collectionSelector(s)) {
            var result = resultSelector(s, r);
            if (result.HasValue) yield return result.Value;
        }
    }
}

public static IEnumerable<R> SelectMany<T,O,R>(this Option<T> that, Func<T, Option<O>> optionSelector, Func<T, O, IEnumerable<R>> selector) {
    if (that.HasValue) {
        var result = optionSelector(that.Value);
        if (result.HasValue) return selector(that.Value, result.Value);
    }
    return Enumerable.Empty<R>();
}

public static IEnumerable<R> Select<T,R>(this Option<T> that, Func<T, IEnumerable<R>> selector) {
    return that.HasValue
        ? selector(that.Value)
        : Enumerable.Empty<R>();
}

For the record here are the LINQ extension methods for Option:

public static Option<R> SelectMany<T, O, R>(this Option<T> that, Func<T, Option<O>> optionSelector, Func<T, O, R> selector) {
    if (that.HasValue) {
        var option = optionSelector(that.Value);
        if (option.HasValue) {
            return new Option<R>(selector(that.Value, option.Value));
        }
    }
    return Option.Empty<R>();
}

public static Option<R> SelectMany<T, O, R>(this Option<T> that, Func<T, Option<O>> optionSelector, Func<T, O, Option<R>> selector) {
    if (that.HasValue) {
        var option = optionSelector(that.Value);
        if (option.HasValue) return selector(that.Value, option.Value);
    }
    return Option.Empty<R>();
}

public static Option<R> SelectMany<T, R>(this Option<T> that, Func<T, Option<R>> selector) {
    if (that.HasValue) {
        var selected = selector(that.Value);
        if (selected.HasValue) return selected;
    }
    return Option.Empty<R>();
}

public static Option<R> Select<T, R>(this Option<T> that, Func<T, R> selector) {
    return that.HasValue
        ? new Option<R>(selector(that.Value))
        : Option.Empty<R>();
}

There may be other ways of combining Option and IEnumerable, but I often ended up confusing LINQ, which was unable to choose between two or more overloads. So far I have only used the first two methods in production code, where they have nicely demonstrated their worth in terms of readability and conciseness.
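As a quick usage sketch of the first of those overloads (with a hypothetical lookup table; GetValue is the helper defined earlier in this post):

```csharp
// Hypothetical dictionary; "AAPL" is deliberately absent.
var prices = new Dictionary<string, int> { { "MSFT", 42 } };

// GetValue returns Option<int>; the custom Select overload silently
// drops empty options, so missing keys simply vanish from the result.
IEnumerable<int> found = from s in new[] { "MSFT", "AAPL" }
                         select prices.GetValue(s);
// found contains the single element 42
```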
