Saturday 16 July 2011

Damn, I never figured that defining the grammar would be so difficult. I guess that it's easy for me to think "The C grammar but cut all the declarator crap add some C++ features and types as expressions" and be done with it, but actually defining it so that I can parse it is a significantly greater challenge, even using compiler-compiler tools like yacc and lex. It turns out that you have to define grammar rules in a very context-sensitive fashion respective to exactly what is in the grammar, and modifying an existing grammar or defining a new one is a decidedly non-trivial process.

For example, take the following example grammar:

expression : arithmetic-expression
           | access-expression;

access-expression : expression '[' expression ']';

arithmetic-expression : expression '+' expression;


Of course, whilst this is the intent of the code, then actually this is thoroughly ambiguous and a simple input will yield the problem:

1 + 2[3]

Is it a:

(1 + 2)[3]

or b:

1 + (2[3])

This means that the need for disambiguating parenthesis is confusingly implied and the recursion has to be specified. The expression syntax instead has to look something like this:

expression : '(' arithmetic_expression ')'
           | access_expression;

access_expression : arithmetic_expression '[' arithmetic_expression ']';
arithmetic_expression : bitwise_expression '+' bitwise_expression;


You get the picture. A pity that this process cannot be automated. The grammar for even a relatively simple language like C is actually hundreds of lines long, a lot more complex than I had imagined. However, should I successfully complete a grammar, then both the parser and the lexer will be generated for me, which is nice, and the first step towards realizing my meta-programming type-inferring dream.

No comments:

Post a Comment