Introduction to REBNF
December 3, 2011 Leave a comment
My previous post was about a year old idea, this post is about a 20 years old one! It is basically a parser and scanner generator. Originally written in Turbo Pascal, then in Delphi Pascal, I eventually ported it to C# 2.0 and 3.0. This project has spent most of its life in a dormant state, and was painfully revived every couple of years.
This scanner and parser generator has the following features:
- Combines scanner and parser definitions in a single grammar definition
- Maps grammar rules to C# classes
- Maps alternatives to C# interfaces
- EBNF style grammar is restricted to ease mapping to classes
- Supports repetition with separators
- Syntax tree is strongly typed
- Syntax tree contains all source tokens
- Normally ignored tokens such as whitespaces and comments are also available in the syntax tree
- Two types of syntax tree walk mechanisms are generated
- Generates LR1 parsers
Here is a simplified version of REBNF in REBNF:
Grammar = Rule…;
Rule : ConcatenationRule | AlternativesRule;
AlternativesRule = identifier ‘:’ Alternatives ‘;’;
ConcatenationRule = identifier ‘=’ Expression ‘;’;
Expression = Term…;
Term : Option | Sequence | Alternatives;
Option = ‘[‘ Expression ‘]’ [‘!’ | ‘!!’ | ‘!!!’];
Sequence = SimpleFactor [‘/’ | ‘\’ | ‘/\’ SimpleFactor] ‘…’;
Alternatives = SimpleFactor\’|’…;
SimpleFactor : identifier | string;
Apart from the list and exclamation marks elements, this should be fairly self-explanatory. Note that contrary to EBNF, the alternative operator | has the highest precedence in REBNF. Since my grammar is more restrictive than EBNF, I am calling it Restricted EBNF.
The following examples illustrate the list feature:
"A"… matches A, AA, AAA, etc
"A"\"B"… matches A, ABA, ABABA, etc
"A"/"B"… matches AB, ABAB, ABABAB, etc
"A"/\"B"… matches A, AB, ABA, ABAB, etc
This construct is specifically aimed at repetitions with a separator or an operator. It also deals with the special case of the optional last separator.
The exclamation marks allow to link disparate optional factor:
["A"]! "B" ["C"]! matches B and ABC but won't match AB or BC.
["A"]! ["B"]!! C ["D"]! ["E"]!! matches C, ACD, BCE and ABCDE
I shall give a more detailed definition of REBNF in a future post.
Mapping to classes and interfaces
Production rules are mapped to classes or interfaces. An alternatives production rule (defined with a colon) maps to an interface. All simple factors listed on the right-hand-side must implement that interface. Note that the generated interfaces are empty. I will illustrate how these interfaces can be used in the next post.
A concatenation production rule (defined with an equal sign) maps to a class. Every factor defined on the right-hand-side maps to a property in the generated class. Sequences map to one or two generic lists (second list is for the separators). A constructor is generated for every possible reduction of the production rule.
I will use the following production rule to illustrates this mapping:
Rule = “private”|”public” “class” identifier [“:” identifier\”,”…];
Results in the following class being generated
// Rule = "private"|"public" "class" identifier [":" identifier\","...]; public partial class Rule { // Rule = "private" "class" identifier; public Rule(@private @private, @class @class, identifier @identifier) {...} // Rule = "public" "class" identifier; public Rule(@public @public, @class @class, identifier identifier) {...} // Rule = "private" "class" identifier ":" identifier\","...; public Rule(@private @private, @class @class, identifier identifier, Colon colon, DoubleList<identifier,Coma> listOfIdentifierAndComa) {...} // Rule = "public" "class" identifier ":" identifier\","...; public Rule(@public @public, @class @class, identifier identifier, Colon colon, DoubleList<identifier,Coma> listOfIdentifierAndComa) {...} public @private @private { get; private set; } public @public @public { get; private set; } public @class @class { get; private set; } public identifier identifier { get; private set; } public Colon Colon { get; private set; } public List<identifier> identifiers { get; private set; } public List<Coma> Comas { get; private set; } }
The generator will first generate all the classes and interfaces. These classes are then compiled into a parser. The output of the parser is an instance of the grammar’s root class. The parser itself is LR1 and at every reduction a new instance of the class corresponding to the production is created using the matching constructor. For historical reasons, the current implementation doesn’t fully leverage all the existing features of C# and relies far too heavily on generated source code. I hope to discuss some of the issues and advantages of using generated classes in a future post.
My next post will present a small but complete example and will illustrate how to take advantage of the interfaces associated to an alternative rule.