The Mysterious Grammar of Lua

30 Mar 2019

Lua is a wonderful programming/scripting language that is both lightweight and extensible.

Around five years ago, I wrote a dummy programming language whose syntax resembles Lua’s. Back then I did not know anything about formal language theory and wrote what felt natural to me, only to discover later that what I had written is called a recursive descent parser (albeit an awful realization of it). Nevertheless, it does resemble Lua. I was reviewing the code for laughs and possible modifications when something struck me as terribly odd, and here it is:

x=1x=3print(x)

Running this code in the Lua interpreter prints 3 (while my implementation prints 0, just to show you how bad it is).

What the hell is this? How is that even valid syntax? What does it do? How does it work?

TL;DR: These are three different statements. Aha! Because Lua has no semicolons, I hear you say. But it’s not that simple: go ahead and fire up your favorite Python interpreter and run the same code… SyntaxError. Still not convinced? Try it with JavaScript: still a SyntaxError. So what’s so special about Lua?

If you are not familiar with topics like lexical analysis, parsing and abstract syntax trees, I strongly recommend reading this article and then coming back here.

Lexical Analysis

Of course, the line above wasn’t what struck me first; I didn’t simply forget to add spaces and get surprised by the result. What surprised me most was this snippet from my lexer implementation (in C):

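if (isblank((unsigned char)str[pos])) {
    /* skip a run of blanks (spaces and tabs); isblank() is declared in <ctype.h> */
    while (isblank((unsigned char)str[pos])) ++pos;
    /* return whatever token follows, so the parser never sees the blanks */
    return get_token_value();
}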

What this does is silently consume all whitespace characters (spaces and tabs), so the parser never sees them. This is normal: almost all programming languages do that (apart from those where whitespace is significant, such as Python), so it’s not as odd as it seems. However, it made me curious to check where linefeed (LF) characters ('\n' in most languages, or simply newlines) get consumed. The lexer does read them and report them to the parser as ordinary ASCII characters, but the parser silently ignores them (I won’t show the code for that). Furthermore, the canonical Lua implementation consumes linefeed characters during lexical analysis, so its parser never sees them either.

That means that neither Lua nor my dummy imitation of it requires newlines anywhere in the program. Semicolons are supported as optional statement terminators, but they are not required; in fact, nothing at all is required to terminate a statement.

The token stream generated by the lexer would be as follows:

Name(x) '=' Number(1) Name(x) '=' Number(3) Name(print) '(' Name(x) ')'
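
To make the whitespace handling concrete, here is a minimal sketch of a lexer that produces a token stream in the notation used above. It is not my original lexer and not Lua's: the function name tokenize and its buffer-based interface are made up for illustration, and it only knows names, decimal integers and single-character symbols (no strings, comments, multi-character operators or full numeral syntax).

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper for illustration only.
 * Writes a readable token stream for `src` into `out` and returns the
 * number of tokens, or -1 if `out` is too small. */
int tokenize(const char *src, char *out, size_t outsz)
{
    size_t pos = 0, len = 0;
    int count = 0;

    if (outsz == 0) return -1;
    out[0] = '\0';

    while (src[pos] != '\0') {
        unsigned char c = (unsigned char)src[pos];
        const char *sep = count > 0 ? " " : "";
        size_t start = pos;
        int n;

        /* Spaces, tabs and newlines are dropped right here, so nothing
         * downstream can tell whether they were ever present. */
        if (isspace(c)) { ++pos; continue; }

        if (isdigit(c)) {
            /* Number: a run of decimal digits */
            while (isdigit((unsigned char)src[pos])) ++pos;
            n = snprintf(out + len, outsz - len, "%sNumber(%.*s)",
                         sep, (int)(pos - start), src + start);
        } else if (isalpha(c) || c == '_') {
            /* Name: a letter or underscore, then letters, digits, underscores */
            while (isalnum((unsigned char)src[pos]) || src[pos] == '_') ++pos;
            n = snprintf(out + len, outsz - len, "%sName(%.*s)",
                         sep, (int)(pos - start), src + start);
        } else {
            /* Anything else is treated as a one-character symbol */
            ++pos;
            n = snprintf(out + len, outsz - len, "%s'%c'", sep, c);
        }

        if (n < 0 || (size_t)n >= outsz - len) return -1;
        len += (size_t)n;
        ++count;
    }
    return count;
}
```

Calling tokenize on "x=1x=3print(x)" and on the same program written with spaces and one statement per line fills the buffer with the same ten tokens, which is the whole point: by the time the parser runs, the layout of the source is gone.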

Parsing

You can check The Complete Syntax of Lua in EBNF; I’ll use some snippets here (with numbered rules for convenience):

chunk ::= {stat [`;´]} [laststat [`;´]]     (1)
stat ::=  varlist `=´ explist |             (2)
          functioncall                      (3)
          ...
varlist ::= var {`,´ var}                   (4)
var ::=  Name ...                           (5)
explist ::= {exp `,´} exp                   (6)
exp ::=  nil | false | true | Number ...    (7)

Let’s try to parse our code in LL fashion with these rules. All programs start with chunk, and the only tokens that can start ‘laststat’ are ‘return’ and ‘break’. Our code starts with neither, so we must expand to at least one ‘stat’:

chunk
stat [`;´] {stat [`;´]} [laststat [`;´]]    (Using rule 1)
varlist `=´ explist [`;´] {stat [`;´]} [laststat [`;´]]    (Using rule 2)
var {`,´ var} `=´ explist [`;´] {stat [`;´]} [laststat [`;´]]    (Using rule 4)
Name(x) {`,´ var} `=´ explist [`;´] {stat [`;´]} [laststat [`;´]]    (Using rule 5)
Name(x) `=´ explist [`;´] {stat [`;´]} [laststat [`;´]]    (Reduction)
Name(x) `=´ {exp `,´} exp [`;´] {stat [`;´]} [laststat [`;´]]    (Using rule 6)
Name(x) `=´ exp {`,´ exp} [`;´] {stat [`;´]} [laststat [`;´]]    (Factoring)
Name(x) `=´ Number(1) {`,´ exp} [`;´] {stat [`;´]} [laststat [`;´]]    (Using rule 7)
Name(x) `=´ Number(1) [`;´] {stat [`;´]} [laststat [`;´]]    (Reduction)
Name(x) `=´ Number(1) {stat [`;´]} [laststat [`;´]]    (Reduction)
Name(x) `=´ Number(1) stat [`;´] {stat [`;´]} [laststat [`;´]]    (Expand, next token matches 'stat')
...

You can see in the last line that immediately after the terminal Number (which represents the 1 in x=1), we are back to expecting a statement.
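
The derivation above can be sketched as a tiny recursive descent parser. This is an illustration under heavy assumptions, not the Lua parser and not my old implementation: it covers only a cut-down grammar (chunk ::= {stat [`;´]}, stat ::= Name `=´ exp | Name `(´ [exp] `)´, exp ::= Number | Name), reads characters directly instead of tokens, and every name in it (Cursor, count_statements and so on) is made up for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical cursor over the source text. */
typedef struct { const char *s; size_t pos; } Cursor;

/* Skip whitespace (newlines included), then look at the next character. */
static int peek(Cursor *c)
{
    while (isspace((unsigned char)c->s[c->pos])) ++c->pos;
    return (unsigned char)c->s[c->pos];
}

/* Consume `ch` if it is the next significant character. */
static int accept_char(Cursor *c, char ch)
{
    if (peek(c) != (unsigned char)ch) return 0;
    ++c->pos;
    return 1;
}

static int parse_name(Cursor *c)
{
    int ch = peek(c);
    if (!isalpha(ch) && ch != '_') return 0;
    while (isalnum((unsigned char)c->s[c->pos]) || c->s[c->pos] == '_') ++c->pos;
    return 1;
}

static int parse_number(Cursor *c)
{
    if (!isdigit(peek(c))) return 0;
    while (isdigit((unsigned char)c->s[c->pos])) ++c->pos;
    return 1;
}

/* exp ::= Number | Name */
static int parse_exp(Cursor *c)
{
    return parse_number(c) || parse_name(c);
}

/* stat ::= Name '=' exp | Name '(' [exp] ')' */
static int parse_stat(Cursor *c)
{
    if (!parse_name(c)) return 0;
    if (accept_char(c, '=')) return parse_exp(c);
    if (accept_char(c, '(')) {
        if (peek(c) != ')' && !parse_exp(c)) return 0;
        return accept_char(c, ')');
    }
    return 0;
}

/* chunk ::= {stat [';']}
 * Returns the number of statements parsed, or -1 on a syntax error. */
int count_statements(const char *src)
{
    Cursor c = { src, 0 };
    int n = 0;

    while (peek(&c) != '\0') {
        if (!parse_stat(&c)) return -1;
        accept_char(&c, ';');   /* the terminator is optional */
        ++n;                    /* and we go straight back to expecting a stat */
    }
    return n;
}
```

With this sketch, count_statements("x=1x=3print(x)") returns 3. Nothing in the loop asks for a separator: once an exp ends, the only question is whether the next token can start another stat.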

So the generated AST would be a chunk with three statement nodes; or simply, a program consisting of three statements: two assignments followed by a function call.

This doesn’t work in Python (even though semicolons are optional there) because Python’s grammar requires every line of simple statements to end with a NEWLINE token, and statements sharing a line must be separated by semicolons.

For JavaScript, the situation is a bit more obscure, since every EBNF grammar I have found so far would allow the same construction; my assumption is that it fails at a later stage of compilation. If anyone can point out why JavaScript doesn’t allow it, feel free to ping me.
