parser | Software Transactions

A simple expression parser

December 11, 2011 by Gabriel Horvath Leave a comment

In my previous post I gave a brief introduction to my parser generator and its grammar definition language REBNF. Here I will give a complete example of a grammar for parsing simple arithmetic expressions:

namespace -> "Example.Expressions";

goal = Expressions;

Expressions = Expression/";"...;

Expression : Term | Terms;

Terms = Term\AddOp...;

AddOp = "+" | "-";

Term : Factor | Factors; Factors = Factor\MulOp...;

MulOp = "*" | "/";

Factor : constant | SubExpression;

SubExpression = "(" Expression ")";

decimalDigit :: ['0'-'9'];

constant == decimalDigit+;

newLine == U+000D | U+000A | U+000D U+000A | U+2028 | U+2029;

whitespaceChar :: [' ', U+0009, U+000B, U+000C];

whitespace == whitespaceChar+;

ignore -> whitespace;

ignore -> newLine;

This grammar contains also lexical definitions for three terminal tokens: constant, newLine and whitespace. The last two lines of the grammar are options which indicate that the whitespace and newLine tokens should be ignored by the parser. The option on the first line defines the namespace to use for the generated classes.

A class is generated for every terminal, here is the one for the constant token:

public partial class constant : ITerminal, IFactor {
    public constant(IToken token) {
        Token = token;
    }
    public IToken Token { get; private set; } }

All terminals implement the following ITerminal interface:

public interface ITerminal {
    IToken Token { get; }
}

So every terminal class exposes the token returned by the lexer. An empty interface for every alternative rule is also generated. These are IExpression, ITerm and IFactor. IExpression is inherited by the ITerm interface and implemented by the Terms class. Similarly for ITerm is inherited by the IFactor interface and implemented by the Factors class. Finally IFactor is implemented by the constant and SubExpression classes. So both constant and SubExpression classes implement all three interfaces.

The big advantage of this grammar is that an expression which consists of a constant only will contain just one instance of the constant class, rather than, as in more usual approaches, of a hierarchy of expression, term, factor and constant tree nodes.

I will now demonstrate how this can be used to implement an Evaluate function for an expression. Firstly I extend the IExpression interface (all generated classes and interfaces are marked as partial) with the following method:

public partial interface IExpression {
    double Evaluate();
}

Now all classes which implement this interface will have to implement this method, so we have the following:

public partial class Terms : ZsK.G5.Syntax.IExpression {
    public double Evaluate() {
        var result = Terms.First().Evaluate();
        foreach (var pair in Terms.Skip(1).Zip(AddOps, (factor, addOp) => new { op = addOp, val = factor.Evaluate() })) {
            if (pair.op.Minus == null) {
                result += pair.val;
            } else {
                result -= pair.val;
            }
        }
        return result;
    }
}
public partial class Factors {
    public double Evaluate() {
        var result = Factors.First().Evaluate();
        foreach (var pair in Factors.Skip(1).Zip(MulOps, (factor, mulOp) => new { op = mulOp, val = factor.Evaluate() })) {
            if (pair.op.Times == null) {
                result /= pair.val;
            } else {
                result *= pair.val;
            }
        }
        return result;
    }
}
public partial class constant : ITerminal, IFactor {
    public double Evaluate() {
        return int.Parse(this.Token.Value);
    }
}
public partial class SubExpression : IFactor {
    public double Evaluate() {
        return this.Expression.Evaluate();
    }
}

The processing on the lists of terms and factors could be improved by using better list processing primitives. Otherwise the code consists mostly in mapping lexical terms to actual values and operations. I think that this example illustrates nicely the advantages of mapping alternative rules to interfaces.

Walking the syntax tree

There are two ways one can walk the syntax tree, in the first case one’s code gets called at every step of the walk, in the other one, one can override specific method of a generated class. Let me start with the first one:

class Walker : WalkContext {
    public override void OnEnter(object node) {
        ...
    }
    public override void OnExit(object node) {
        ...
    }
    public override void OnLeaf(object leaf) {
        ...
    }
}

The following implementation of the OnLeaf method lets you reconstrcut the original source:

public override void OnLeaf(object leaf) {
    if (leaf != null) {
        IToken t = (leaf as dynamic).Token;
        Debug.Write(t.Value);
        if (t.Next != null) {
            if (t.Next.Symbol.Name == "whitespace" || t.Next.Symbol.Name == "newLine") {
                Debug.Write(t.Next.Value);
            }
        }
    }
}

Remember that we directed the grammar to ignore all whitespaces and new lines, so these tokens are not referenced anywhere in the syntax tree, however there are still stored in singly linked list of tokens as illustrated in the code. The following Walker class is also generated where I have only expanded just one of the generated method for illustration purposes:

public class Walker {
    public virtual void OnEnter(object node) {...}
    public virtual void OnExit(object node) {...}
    public virtual void OnLeaf(global::ZsK.G5.Interfaces.ITerminal terminal) {...}
    public virtual void Walk(AddOp rule) {...}
    public virtual void Walk(constant item) {...}
    public virtual void Walk(Expressions rule) {
        this.OnEnter(rule);
        int Expression_i = 0;
        if ((rule.Expression != null)) {
            for (Expression_i = 0; (Expression_i < (rule.Expression.Count - 0)); Expression_i = (Expression_i + 1)) {
                this.Walk(rule.Expression[Expression_i]);
                this.Walk(rule.SemiColon[Expression_i]);
            }
        }
        this.OnExit(rule);
    }
    public virtual void Walk(Factors rule) {...}
    public virtual void Walk(goal rule) {...}
    public virtual void Walk(IExpression item) {...}
    public virtual void Walk(IFactor item) {...}
    public virtual void Walk(ITerm item) {...}
    public virtual void Walk(Minus item) {...}
    public virtual void Walk(MulOp rule) {...}
    public virtual void Walk(newLine item) {...}
    public virtual void Walk(Plus item) {...}
    public virtual void Walk(RoundBra item) {...}
    public virtual void Walk(RoundKet item) {...}
    public virtual void Walk(SemiColon item) {...}
    public virtual void Walk(SubExpression rule) {...}
    public virtual void Walk(Terms rule) {...}
    public virtual void Walk(Times item) {...}
    public virtual void Walk(UpSlash item) {...}
    public virtual void Walk(whitespace item) {...}
}

This approach offers more flexibility than the first option.

I am in the process of writing the C# 4.0 grammar in REBNF. This is a really useful exercise (hopefully it will be more than that) as it is showing up already plenty of problems with the current form of REBNF. So changes are likely to happen in the next version of REBNF. Rev-ing the grammar itself is usually a rather painful process, and I have usually been waiting until I have collected enough annoying issues in the generator itself to move forward. I should specify that the current REBNF parser was generated using the previous version of REBNF, which in turn was generated using its preceding one, etc. The current REBNF stands at version 5.

Not quite sure what the next post on this topic will cover, I might discuss the C# grammar, or give more details about REBNF, or even add some functionality to this post’s expression parser example.

Filed under parser Tagged with C#, parser, REBNF, syntax tree

Introduction to REBNF

December 3, 2011 by Gabriel Horvath 2 Comments

My previous post was about a year old idea, this post is about a 20 years old one! It is basically a parser and scanner generator. Originally written in Turbo Pascal, then in Delphi Pascal, I eventually ported it to C# 2.0 and 3.0. This project has spent most of its life in a dormant state, and was painfully revived every couple of years.

This scanner and parser generator has the following features:

Combines scanner and parser definitions in a single grammar definition
Maps grammar rules to C# classes
Maps alternatives to C# interfaces
EBNF style grammar is restricted to ease mapping to classes
Supports repetition with separators
Syntax tree is strongly typed
Syntax tree contains all source tokens
Normally ignored tokens such as whitespaces and comments are also available in the syntax tree
Two types of syntax tree walk mechanisms are generated
Generates LR1 parsers

Software Transactions

A simple expression parser

Walking the syntax tree

Introduction to REBNF

Tags

Categories

Blogs I Follow

Copyright © Gabriel Zs. K. Horvath