Software Transactions

Model comparison

December 3, 2014 by Gabriel Horvath Leave a comment

A long time overdue post…

One of my preferred slides when presenting my transactional model compares different locking models. I start with the most restrictive model, a single global lock:

where I consider four threads named A,B,C,D. Writes are amber coloured, reads are in blue. Dashed sections represent blocked threads waiting to either write or read. Since there is a single global lock, the execution is serialized, there is no concurrent execution and there is always one active thread and three blocked ones. Long running “transactions” have a direct impact on the system’s throughput.

Things improve “dramatically” when we swap our single global lock for a reader-writer lock:

Hey it’s concurrent now! Well as long as we read we have concurrency, but as soon as we want to write all the reader threads are blocked. But still a huge improvement over having a single global lock. Also the writer thread has to wait for all reader threads to terminate before it can acquire the writer lock. The reader threads cannot be allowed to last too long if we want to have to give the writer thread a chance to update the system. So there are still some serious constraints. On the positive side there is one important property: since only one writer is allowed, with no concurrent reader, there is no need for any transactional mechanism, all writes performed can only be read after the writer lock has been released. So no partial writes can be observed by any reader. Long running “transactions” still have a direct impact on the system’s throughput: long running read transaction prevents writes and vice-versa.

Next comes the versioned system:

which uses a single, global writer lock. As before we can have any number of concurrent readers; the novelty is that since existing versions are strictly immutable, the reader threads can be executed concurrently with a writer thread. So in practice a reader thread can last as long as needed (at the cost of having to keep some versions in memory). We still have a single writer lock, so a thread that wants to perform writes has to wait for the lock to become available. So we are still single threaded in terms of writes. As a consequence there is still no need for any transactional mechanism as partial writes cannot be observed. Long running read-only transactions are possible, we can also have long running write transactions, but at the cost of blocking other would-be writers. The cost of moving to this higher level of concurrency is an increase in memory usage as all accessible versions must be preserved. Note that write transactions cannot fail when we are dealing with essentially single threaded writer systems. Additionally the thread can directly modify the versioned variables, no need to delay the writes to the commit phase. This model can also be fully single threaded in which case we get:

The last model is the one I have described in this blog:

Here we are adding support for long-running concurrent write transactions. In terms of locking we have the single global version lock which is very short (increment a number and copy it) and the longer per transactional box/variable lock that must be taken for the duration of the validation and update phase. In the above diagram we have thread A waiting on thread D to acquire a box lock and later we have thread C waiting on thread D. We have seen before that this can be sometimes mitigated by partitioning the data.

This last model comes at a cost: registration of reads and writes during the execution of the body of the transaction, validation stage and commit stage. Concurrent writer transactions also introduce the risk of conflicts, but we have seen that leveraging the semantics of data structures can greatly reduce this risk.

Any system with a decent level of concurrency will, most likely have to use at least versioned variables. We must remember that server processors can have up to 18 Intel cores (expensive, $4115 ) or 16 AMD cores (cheaper, $719) You can use dual socket, 8 cores per processor, systems in the cloud, that’s 32 hyper threaded cores at a relatively low cost.

I expect the difference between the single threaded writer model and the concurrent writer model to be, in terms of ease of implementation and in terms of performance to be absolutely massive. However, when dealing with concurrent, long-running transactions (I shall come back to discussing what a transaction can be considered a long-running one) write transactions there is no substitute to the full model.

Filed under immutable, Piecewise immutability, software transactions Tagged with concurrency, concurrent transactions, shared state, transactions

Empty for a reason

November 7, 2014 by Gabriel Horvath Leave a comment

So far I have found Option<T> for reference types to be the best way of avoiding null reference exceptions while keeping my code very readable. I use it when parsing strings, retrieving values from dictionaries, getting properties from object via reflection and more. Option<T> allows one to write very clean code without any risk of null reference exceptions. To illustrate this point, consider the following code:

string r = null;
if (someObject != null) {
    var value = someObject.GetPropertyValue<string>("P");
    int i;
    if (value != null && int.TryParse(value, out i)) {
        dict.TryGetValue(i, out r);
    }
}

which, using Option<T> and associated LINQ support can be rewritten as

var r = from o in someObject.AsOption()
        from s in o.GetPropertyValue<string>("P")
        from i in s.ParseAsInt()
        from v in i.Lookup(dict)
        select v;

on the first line we have an empty Option if someObject is null. The first from extracts the actual object if the Option is not empty, the second one retrieves a property value if the property exists and if it is of type string and if the value is not null, then we try to parse the string and get an integer value if the parsing is successful, finally we look up a dictionary and again get a value only if the dictionary holds the key. So all in all we have six different reasons to end up with an empty result. The main goal of Option is to avoid null reference exceptions, but quite often, in particular while debugging, one is actually interested in why the result is empty, is it because the dictionary didn’t contain the key? because the string didn’t represent an integer? etc. As the code stands there is no way of knowing the actual reason without redoing the whole calculation one step at a time, typically by stepping through the code one step at a time.

An empty result is empty for one reason and only one reason. So the idea is that whenever a step returns an empty option the reason (a string) for returning empty is stored in the option. Here is the implementation of the methods used in the above example, you will see that each time an empty Option is returned, the reason for doing so is specified:

public static Option<T> AsOption<T>(this T value) {
    if (value is ValueType) return new Option<T>(value);
    if ((object)value == null) return Option.Empty<T>("value is null");
    return new Option<T>(value);
}

public static Option<T> AsOption<T>(this T value, string reason) {
    if (value is ValueType) return new Option<T>(value);
    if ((object)value == null) return Option.Empty<T>(reason);
    return new Option<T>(value);
}

public static Option<R> GetPropertyValue<R>(this object value, string propertyName) {
    return from v in value.AsOption()
           from pi in v.GetPropertyInfo(propertyName)
           from cast in pi.GetValue(v).AsOption().As<R>()
           select cast;
}

public static Option<PropertyInfo> GetPropertyInfo<T>(this T value, string propertyName) {
    return from v in value.AsOption()
           from p in v.GetType()
                      .GetProperty(propertyName)
                      .AsOption("Property " + propertyName + " not found in " + v.GetType().Name)
           select p;
}

public Option<R> As<R>() {
    return this.HasValue && value is R
        ? ((R)(object)value).AsOption()
        : Option.Empty<R>("Cannot cast " + typeof(T).Name + " to " + typeof(R).Name);
}

public static Option<int> ParseAsInt(this string source) {
    int value;
    return int.TryParse(source, out value)
        ? value.AsOption()
        : Option.Empty<int>("Input string cannot be parsed:" + source);
}

public static IOption<V> Lookup<K, V>(this K key, Dictionary<K,V> dict) {
    V value;
    return dict.TryGetValue(key, out value)
        ? new Option<V>(value)
        : Option.Empty<V>("Key not found:" + key.ToString());
}

So now, if the result of our query is empty, we can retrieve the reason for it being empty from the Option instance in the form of a string. Note that this was achieved without making a single modification to our code.

Let’s now have a look at how the other version of the code would have to be modified to achieve the same thing:

string reason = "null someObject";
string r = null;
if (someObject != null){
    reason = "Unable to retrieve property P in " + someObject.GetType().Name;
    var value = someObject.GetPropertyValue<string>("P").Value;
    int i;
    if (value != null) {
        reason = "Unable to parse " + value + " to an int";
        if (int.TryParse(value, out i)){
            reason = "Key not found:" + i.ToString();
            dict.TryGetValue(i, out r);
        }
    }
}

where I have avoided using else branches to keep the code compact and slightly more readable, but clearly there is no contest between the two versions.

So far I have found this useful mostly for logging and debugging purposes. But I wonder whether a strongly typed version of this could be used for other purposes…

Filed under bagatelles Tagged with C#, debug, linq, option

Exiting the monad with eval in LINQ

November 5, 2014 by Gabriel Horvath 1 Comment

LINQ to object queries are lazy, but this is actually rarely what I need, so in most cases I add a ToList() or a ToArray() in my code before using the result of the query. Since this requires the addition of brackets around the query, the readability of the code suffers and often I skip LINQ and use chained extension methods instead:

var x = (from i in numbers
         where i < 0
         select i*2).ToArray();

var x = numbers.Where(i => i < 0).Select(i => i*2).ToArray();

The same happens with all methods that exit the monad, e.g. First/Last/Single, Any/All/Contains, Aggregate, Average/Min/Max/Sum, Count, etc. So wouldn’t it be nice to have an extra clause in LINQ that would let one apply any of these method to the result of the query?

var x = from i in numbers
        where i < 0
        select i*2
        eval ToArray();

I am using the eval keyword here as it does actually result in the evaluation of the query. In some cases we might want to include a lambda expression:

var x = from i in numbers
        where i < 0 select i*2 eval First(j => j > 8);

which isn’t very pretty but makes it clear that the last bit is not actually part of the query. The alternative would be to add new keywords to LINQ, e.g. first, last, any, etc. But this approach would result in a proliferation of keywords. The advantage of the above approach is that any existing method can be used.

Intermediate approaches might be possible, for example:

var x = from i in numbers
        where i < 0 select i*2 eval First j where j > 8;

where the where keyword could be used whenever the lambda is a predicate? Similarly for Sum/Max/Min/Average

var x = from i in numbers
        where i < 0
        select i*2
        eval Sum j for j*j;

though this one doesn’t look too nice.

Never mind, I have created a fork of Roslyn on Codeplex to try out this idea, … to be continued once the changes has been committed.

[24 hours later] I committed the code in a fork named AddEvalClauseToLinq. It was nice and easy to modify the syntax, but things got trickier when translating the query. I managed to get the example code (see below) to build and execute, but I would have to spend much more time on the compiler to do a less hacky job.

Here is the code I managed to compile:

var ints = new[] { 1, 2, 3 };
Func<int, bool> Trace = i => { Console.Write("<" + i + ">"); return true; };

Console.WriteLine("x = from i in ints where Trace(i) select i*2;");
var x = from i in ints
        where Trace(i)
        select i*2;
Console.WriteLine("type: " + x.GetType().Name);
foreach (var i in x) Console.Write(i + " ");
Console.WriteLine();
foreach (var i in x) Console.Write(i + " ");
Console.WriteLine();
Console.WriteLine();

Console.WriteLine("y = from i in ints where Trace(i) select i*2 eval ToList;");
var y = from i in ints
        where Trace(i)
        select i * 2
        eval ToList();
Console.WriteLine("type: " + y.GetType().Name);
foreach (var i in y) Console.Write(i + " ");
Console.WriteLine();
foreach (var i in y) Console.Write(i + " ");
Console.WriteLine();
Console.WriteLine();

Console.WriteLine("z = from i in ints where Trace(i) select i*2 eval First;");
var z = from i in ints
        where Trace(i)
        select i * 2
        eval First();
Console.WriteLine("type: " + z.GetType().Name);
Console.WriteLine("Value: " + z);
Console.WriteLine();
Console.WriteLine();

Console.WriteLine("s = from i in ints where Trace(i) select i*2 eval Sum(j => j*j);");
var s = from i in ints
        where Trace(i)
        select i * 2
        eval Sum(j => j*j);  
Console.WriteLine("type: " + s.GetType().Name);
Console.WriteLine("Value: " + s);

and here is the console output:

x = from i in ints where Trace(i) select i*2;
type: WhereSelectArrayIterator`2
<1>2 <2>4 <3>6
<1>2 <2>4 <3>6

y = from i in ints where Trace(i) select i*2 eval ToList;
<1><2><3>type: List`1
2 4 6
2 4 6

z = from i in ints where Trace(i) select i*2 eval First;
<1>type: Int32
Value: 2

m = from i in ints where Trace(i) select i*2 eval Sum(t => t*t);
<1><2><3>type: Int32
Value: 56

I think that the console output says it all, types are as expected and only the first query without eval is executed multiple times.

My next Roslyn hack I would love to try would be to get better syntaxic support for tuples, just as in

var x = from (i,j) in listOfTuples
        where i < 0
        select j*2;

or,

var (i,j) = Tuple.Create(1,2);

depending on how hard it turns out to be.

[EDIT] Something very similar has been proposed #3571

Filed under C# language Tagged with C#, linq, Roslyn

Join-Calculus library hosted on Codeplex

October 22, 2014 by Gabriel Horvath Leave a comment

Just a quick note to mention that the Join-Calculus library is now hosted on Codeplex.

Filed under join-calculus Tagged with C#

Combining IEnumerable with Option/Maybe in LINQ queries

October 13, 2014 by Gabriel Horvath Leave a comment

I have found that Option<T>/Maybe<T> (same as Nullable, but can be used with reference types and LINQ) to be incredibly useful, especially with LINQ support. Usefulness of Option/Maybe is well documented, so is also the LINQ support for Maybe<T> or Option<T>. What I want to briefly address here is the interplay between Option<T> and IEnumerable<T>. Essentially about flattening IEnumerable<Option<T>> to just IEnumerable<T> and Option<IEnumerable<T>> to IEnumerable<T>:

    IEnumerable<Option<T>> => IEnumerable<T>
    Option<IEnumerable<T>> => IEnumerable<T>

In the first case I want to avoid having to deal with an enumeration of holes, which in general is meaningless, while in the second an empty Option<IEnumerable<T>> can just be expressed as an empty IEnumerable<T>.

Taking concrete examples, this is what I would like to be able to write. Let’s first define a few helper methods to make our examples a little bit more realistic:

public static Option<V> GetValue<K, V>(this Dictionary<K, V> that, K key) {
    V value;
    return that.TryGetValue(key, out value)
        ? new Option<V>(value)
        : Option.Empty<V>();
}

public static Option<double> div(int n, int d) {
    return d == 0
        ? Option.Empty<double>()
        : new Option<double>(1.0n/d);
}

public static Option<double> inv(int n) {
    return n == 0
        ? Option.Empty<double>()
        : new Option<double>(1.0/n);
}

The simplest case consists of only selecting Options that have a value in a standard IEnumerable query:

IEnumerable<double> invs = from i in new[] { 0, 1, 2 }
                           select inv(i);

which contains just two elements. The same can be done with an Option sub query:

IEnumerable<double> invs = from i in new[] { 0, 1, 2 }
                           select from v in inv(i)
                                  select v;

A more complex example:

IEnumerable<double> res = from i in new[] {1, 2}
                          from j in new[] {0, 2}
                          select div(i, j);

For the other operation we use a dictionary of IEnumerable to illustrate the implementation:

var d = new Dictionary<int, IEnumerable<int>>() {
    {1, new[] {1, 2, 3}}
};

IEnumerable<int> res = from v in d.GetValue(1)
                       from w in d.GetValue(1)
                       select v.Concat(w);

The following example returns an empty enumeration:

IEnumerable<int> res = from v in d.GetValue(1)
                       from w in d.GetValue(0)
                       select v.Concat(w);

Queries can also be nested:

IEnumerable<int> res = from e in from v in d.GetValue(1)
                                 select v
                       select e*2;

Here are the overloads I defined to support these examples:

public static IEnumerable<R> Select<T,R>(this IEnumerable<T> source, Func<T, Option<R>> selector) {
    foreach (var s in source) {
        var r= selector(s);
        if (r.HasValue) yield return r.Value;
    }
}

public static IEnumerable<R> SelectMany<T,C,R>(this IEnumerable<T> source, Func<T, IEnumerable<C>> collectionSelector, Func<T, C, Option<R>> resultSelector) {
    foreach (var s in source) {
        foreach (var r in collectionSelector(s)) {
            var result= resultSelector(s, r);
            if (result.HasValue) yield return result.Value;
        }
    }
}

public static IEnumerable<R> SelectMany<T,O,R>(this Option<T> that, Func<T, Option<O>> optionSelector, Func<T, O, IEnumerable<R>> selector) {
    if (that.HasValue) {
        var result = optionSelector(that.Value);
        if (result.HasValue) return selector(that.Value, result.Value);
    }
    return Enumerable.Empty<R>();
}

public static IEnumerable<R> Select<T,R>(this Option<T> that, Func<T, IEnumerable<R>> selector) {
    return that.HasValue
        ? selector(that.Value)
        : Enumerable.Empty<R>();
}

For the record here are the LINQ extension methods for Option:

public static Option<R> SelectMany<T, O, R>(this Option<T> that, Func<T, Option<O>> optionSelector, Func<T, O, R> selector) {
    if (that.HasValue) {
        var option = optionSelector(that.Value);
        if (option.HasValue) {
            return new Option<R>(selector(that.Value, option.Value));
        }
    }
    return Option.Empty<R>();
}

public static Option<R> SelectMany<T, O, R>(this Option<T> that, Func<T, Option<O>> optionSelector, Func<T, O, Option<R>> selector) {
    if (that.HasValue) {
        var option = optionSelector(that.Value);
        if (option.HasValue) return selector(that.Value, option.Value);
    }
    return Option.Empty<R>();
}

public static Option<R> SelectMany<T, R>(this Option<T> that, Func<T, Option<R>> selector) {
    if (that.HasValue) {
        var selected = selector(that.Value);
        if (selected.HasValue) return selected;
    }
    return Option.Empty<R>();
}

public static Option<R> Select<T, R>(this Option<T> that, Func<T, R> selector) {
    return that.HasValue
        ? new Option<R>(selector(that.Value))
        : Option.Empty<R>();
}

There are be other ways of combining Option and IEnumerable, but quite often I ended up confusing LINQ as it was unable to make its choice between two or more overloads. So far, I have only used the first two methods in production code, and they have nicely demonstrated their worth in terms of readability and conciseness.

Filed under bagatelles Tagged with C#, linq, option

Runtime scopes

October 21, 2013 by Gabriel Horvath Leave a comment

Complexity

The project I am working on at the moment contains approximately 1000 interfaces and 4000 classes. We use Autofac for dependency injection, but not all the project uses dependency injection. Within the solution these types are grouped by folder/namespaces and this helps find related types. The only real organization is the use of one Autofac scope to manage sessions. So the types are grouped in two scopes, the global scope and the session scope.

I was expecting the scope separation between classes to be clearly visible on a diagram that represents the dependencies between types. In fact, producing such a diagram turns out to be much harder than I thought. This is a topic I will discuss further in another post. But I am hoping that a well organized diagram would clearly show a natural grouping of classes based on their dependencies.

My goal here is to somewhat formalize this grouping of classes. For this I am proposing to make use of the concept of scoping which plays an important role in many areas of computer science.

Scopes

Scopes are so pervasive that we use them pretty much unconsciously. In C# we have for example the following scopes:

statement block, delineated by curly brackets
using statement scope
method scope
class scope
namespace scope

Some characteristics and properties of scopes:

Always
Nested or side-by-side
Non crossing
Isolation
- Outer scopes have no access to the content of inner scopes
- Inner scopes might or might not have access to the content of outer scopes
- It is also good practice for inner scopes not to modify the state of outer scopes
- Sharing
  - Content of outer scopes might be available to inner scopes
  - Entry/exit
    - Data passed in on entry
    - Data returned on exit
    - Creation on entry
    - Disposal on exit
    - Restoring state on exit
    - Lifetime of object inside scope

Generally speaking, scope is considered to be a source code level concept. Scopes are enforced and resolved at compile time. Some aspects are also important at runtime, e.g. call stack management for method calls, disposal of objects when leaving a using scope. But I should emphasize that scopes are always bound to the source code. So the source code of an inner scope is defined within the code of the outer scope. This characteristic represents a strong constraint and limits the usefulness of lexical scopes.

Runtime scope

By contrast, a runtime scope defines the scoping of types declaratively rather than lexically. As such types defined in different namespaces or assemblies can live in the same scope. Note that runtime scopes are limited to classes and interfaces. Moreover I will only discuss runtime scopes in the context of (constructor) dependency injection. I will stick to Autofac as I am most familiar with it and as it already provides some support for scopes.

Declarative scoping

Every interface must define a scope (for the sake of simplicity lets limit ourselves, for now, to a single scope per interface) using a generic marking interface: interface IScoped<TScope> where TScope : IScope.

Scopes themselves are defined as interfaces that derive from the IScope interface.

Finally, scopes can be either singletons (only one can exist within a given scope at any time) or concurrent. I will consider defining other semantics as required, e.g. one-shot singleton or contiguous singletons. The marking interfaces are ISingleScope and IConcurrentScope.

Naturally scopes themselves are scoped.

With Autofac, one registers an implementation of an interface with a class non-declaratively, e.g. by calling the RegisterType method of the builder instance. This has two disadvantages, firstly as the sharing of the class is defined at the registration site rather than declaratively with the interface or class one has to repeatedly navigate between to the registration site and the interface\class implementations. It also makes the analysis of type dependencies by a tool much harder to perform unless one can use Roslyn. Going for a declarative approach solves these two problems.

Sharing, that is whether an instance is shared or not, or in other words whether an instance is a singleton or not can be defined declaratively using either marking interfaces or attributes. In my first version I use marking interfaces: ISingleton (the easy one) and the more dubious IManifold (please, please do suggest a better name).

Semantics

Let’s have a look at the semantics of runtime scopes.

A class or scope in TScope can only be resolved in a TScope or a sub-scope of TScope.
All injected arguments of all constructors of a class (i.e. all its dependencies) in TScope must be in either TScope or a super-scope of TScope.

Condition (1) means that the content of inner scopes is not available to outer scopes, but the content of outer scopes is visible to inner scopes. This last condition could easily be restricted so as to fully isolate the inner scopes. Condition (2) states that all dependencies must come from the same scope or from an outer scope. Here again we could restrict this to only allow dependencies from the same scope.

New scopes are created by existing scopes. A scope can only create a direct inner scope.

Disposal of scopes

Ever scope can be closed, i.e. disposed of. Thanks to Autofac, every class in the scope will also be disposed of if it implements the IDisposable interface.

Returned results

Just as functions return one or more values on exit, a scope can return one or more values when it is closed. The current mechanism is to instantiate a class in its outer scope using Autofac, so that this shared instance can then be retrieved by the outer scope. I am not really convinced by this mechanism and it doesn’t work well with concurrent scopes. I am investigating other solutions, e.g. continuation scopes or have the closed scope expose the result.

Filed under immutable, runtime scopes Tagged with scopes

A simple expression parser

December 11, 2011 by Gabriel Horvath Leave a comment

In my previous post I gave a brief introduction to my parser generator and its grammar definition language REBNF. Here I will give a complete example of a grammar for parsing simple arithmetic expressions:

namespace -> "Example.Expressions";

goal = Expressions;

Expressions = Expression/";"...;

Expression : Term | Terms;

Terms = Term\AddOp...;

AddOp = "+" | "-";

Term : Factor | Factors; Factors = Factor\MulOp...;

MulOp = "*" | "/";

Factor : constant | SubExpression;

SubExpression = "(" Expression ")";

decimalDigit :: ['0'-'9'];

constant == decimalDigit+;

newLine == U+000D | U+000A | U+000D U+000A | U+2028 | U+2029;

whitespaceChar :: [' ', U+0009, U+000B, U+000C];

whitespace == whitespaceChar+;

ignore -> whitespace;

ignore -> newLine;

This grammar contains also lexical definitions for three terminal tokens: constant, newLine and whitespace. The last two lines of the grammar are options which indicate that the whitespace and newLine tokens should be ignored by the parser. The option on the first line defines the namespace to use for the generated classes.

A class is generated for every terminal, here is the one for the constant token:

public partial class constant : ITerminal, IFactor {
    public constant(IToken token) {
        Token = token;
    }
    public IToken Token { get; private set; } }

All terminals implement the following ITerminal interface:

public interface ITerminal {
    IToken Token { get; }
}

So every terminal class exposes the token returned by the lexer. An empty interface for every alternative rule is also generated. These are IExpression, ITerm and IFactor. IExpression is inherited by the ITerm interface and implemented by the Terms class. Similarly for ITerm is inherited by the IFactor interface and implemented by the Factors class. Finally IFactor is implemented by the constant and SubExpression classes. So both constant and SubExpression classes implement all three interfaces.

The big advantage of this grammar is that an expression which consists of a constant only will contain just one instance of the constant class, rather than, as in more usual approaches, of a hierarchy of expression, term, factor and constant tree nodes.

I will now demonstrate how this can be used to implement an Evaluate function for an expression. Firstly I extend the IExpression interface (all generated classes and interfaces are marked as partial) with the following method:

public partial interface IExpression {
    double Evaluate();
}

Now all classes which implement this interface will have to implement this method, so we have the following:

public partial class Terms : ZsK.G5.Syntax.IExpression {
    public double Evaluate() {
        var result = Terms.First().Evaluate();
        foreach (var pair in Terms.Skip(1).Zip(AddOps, (factor, addOp) => new { op = addOp, val = factor.Evaluate() })) {
            if (pair.op.Minus == null) {
                result += pair.val;
            } else {
                result -= pair.val;
            }
        }
        return result;
    }
}
public partial class Factors {
    public double Evaluate() {
        var result = Factors.First().Evaluate();
        foreach (var pair in Factors.Skip(1).Zip(MulOps, (factor, mulOp) => new { op = mulOp, val = factor.Evaluate() })) {
            if (pair.op.Times == null) {
                result /= pair.val;
            } else {
                result *= pair.val;
            }
        }
        return result;
    }
}
public partial class constant : ITerminal, IFactor {
    public double Evaluate() {
        return int.Parse(this.Token.Value);
    }
}
public partial class SubExpression : IFactor {
    public double Evaluate() {
        return this.Expression.Evaluate();
    }
}

The processing on the lists of terms and factors could be improved by using better list processing primitives. Otherwise the code consists mostly in mapping lexical terms to actual values and operations. I think that this example illustrates nicely the advantages of mapping alternative rules to interfaces.

Walking the syntax tree

There are two ways one can walk the syntax tree, in the first case one’s code gets called at every step of the walk, in the other one, one can override specific method of a generated class. Let me start with the first one:

class Walker : WalkContext {
    public override void OnEnter(object node) {
        ...
    }
    public override void OnExit(object node) {
        ...
    }
    public override void OnLeaf(object leaf) {
        ...
    }
}

The following implementation of the OnLeaf method lets you reconstrcut the original source:

public override void OnLeaf(object leaf) {
    if (leaf != null) {
        IToken t = (leaf as dynamic).Token;
        Debug.Write(t.Value);
        if (t.Next != null) {
            if (t.Next.Symbol.Name == "whitespace" || t.Next.Symbol.Name == "newLine") {
                Debug.Write(t.Next.Value);
            }
        }
    }
}

Remember that we directed the grammar to ignore all whitespaces and new lines, so these tokens are not referenced anywhere in the syntax tree, however there are still stored in singly linked list of tokens as illustrated in the code. The following Walker class is also generated where I have only expanded just one of the generated method for illustration purposes:

public class Walker {
    public virtual void OnEnter(object node) {...}
    public virtual void OnExit(object node) {...}
    public virtual void OnLeaf(global::ZsK.G5.Interfaces.ITerminal terminal) {...}
    public virtual void Walk(AddOp rule) {...}
    public virtual void Walk(constant item) {...}
    public virtual void Walk(Expressions rule) {
        this.OnEnter(rule);
        int Expression_i = 0;
        if ((rule.Expression != null)) {
            for (Expression_i = 0; (Expression_i < (rule.Expression.Count - 0)); Expression_i = (Expression_i + 1)) {
                this.Walk(rule.Expression[Expression_i]);
                this.Walk(rule.SemiColon[Expression_i]);
            }
        }
        this.OnExit(rule);
    }
    public virtual void Walk(Factors rule) {...}
    public virtual void Walk(goal rule) {...}
    public virtual void Walk(IExpression item) {...}
    public virtual void Walk(IFactor item) {...}
    public virtual void Walk(ITerm item) {...}
    public virtual void Walk(Minus item) {...}
    public virtual void Walk(MulOp rule) {...}
    public virtual void Walk(newLine item) {...}
    public virtual void Walk(Plus item) {...}
    public virtual void Walk(RoundBra item) {...}
    public virtual void Walk(RoundKet item) {...}
    public virtual void Walk(SemiColon item) {...}
    public virtual void Walk(SubExpression rule) {...}
    public virtual void Walk(Terms rule) {...}
    public virtual void Walk(Times item) {...}
    public virtual void Walk(UpSlash item) {...}
    public virtual void Walk(whitespace item) {...}
}

This approach offers more flexibility than the first option.

I am in the process of writing the C# 4.0 grammar in REBNF. This is a really useful exercise (hopefully it will be more than that) as it is showing up already plenty of problems with the current form of REBNF. So changes are likely to happen in the next version of REBNF. Rev-ing the grammar itself is usually a rather painful process, and I have usually been waiting until I have collected enough annoying issues in the generator itself to move forward. I should specify that the current REBNF parser was generated using the previous version of REBNF, which in turn was generated using its preceding one, etc. The current REBNF stands at version 5.

Not quite sure what the next post on this topic will cover, I might discuss the C# grammar, or give more details about REBNF, or even add some functionality to this post’s expression parser example.

Filed under parser Tagged with C#, parser, REBNF, syntax tree

Introduction to REBNF

December 3, 2011 by Gabriel Horvath 2 Comments

My previous post was about a year old idea, this post is about a 20 years old one! It is basically a parser and scanner generator. Originally written in Turbo Pascal, then in Delphi Pascal, I eventually ported it to C# 2.0 and 3.0. This project has spent most of its life in a dormant state, and was painfully revived every couple of years.

This scanner and parser generator has the following features:

Combines scanner and parser definitions in a single grammar definition
Maps grammar rules to C# classes
Maps alternatives to C# interfaces
EBNF style grammar is restricted to ease mapping to classes
Supports repetition with separators
Syntax tree is strongly typed
Syntax tree contains all source tokens
Normally ignored tokens such as whitespaces and comments are also available in the syntax tree
Two types of syntax tree walk mechanisms are generated
Generates LR1 parsers

Traced execution

October 14, 2011 by Gabriel Horvath Leave a comment

I will present in this post a proof of concept of an idea I got nearly a year ago while trying to remote the execution of some code and got tired of the limitations of C# in general and of Linq.Expressions in particular. Run-time reflection in C# 4.0 is limited to type and Linq.Expressions, here I demonstrate a form of reflection that can be performed on a small subset of C# statements.

Piecewise immutable programming

May 14, 2011 by Gabriel Horvath Leave a comment

Although not a programming language expert at all, I will try to muster all my limited knowledge to briefly discuss software transactions in terms of programming model. Be aware that this is all very tentative…

On the one hand the level of isolation and immutability is very much reminiscent of functional programming, on the other hand the mutability of the versioned variables is essentially imperative programming. To better understand this relationship let’s see, very tentatively, how my model can be used to code in a “pure” imperative way or in a more functional one.