Building Lexer with ANTLR

ANTLR Lexical Grammar

ANTLR is a parser generator. In this section, we will see how it can generate a lexical analyzer from a set of regular-expression rules.


ANTLR takes a lexer grammar file as its input. The grammar file has the following general structure.

lexer grammar <Grammar_name>;

@header {
  package <package_name>;
}

<Token_name> : <Regex pattern> ;
<Token_name> : <Regex pattern> ;
<Token_name> : <Regex pattern> -> skip;
...

Note:

-> skip indicates that the token is recognized by the lexer but is not included in the token stream.
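The way an ordered list of rules turns into tokens can be sketched in plain Java. This is only an illustration under stated assumptions: the rule table, the class name RuleOrderSketch, and the use of java.util.regex are all hypothetical stand-ins, and ANTLR's generated lexer is implemented differently. What the sketch does mirror is the documented matching policy: the rule that matches the longest prefix wins, a tie goes to the rule listed first, and rules marked as skipped produce no token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleOrderSketch {
    // One rule: a token name, a pattern, and whether matches are dropped (like "-> skip").
    static final class Rule {
        final String name;
        final Pattern pattern;
        final boolean skip;

        Rule(String name, String regex, boolean skip) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
            this.skip = skip;
        }
    }

    // Hypothetical rule table, listed in priority order like the rules of a grammar file.
    static final List<Rule> RULES = Arrays.asList(
        new Rule("KEYWORD", "var|function", false),
        new Rule("IDENT", "[a-z]+", false),
        new Rule("WS", "[ \\t\\r\\n]+", true));

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestEnd = pos;
            for (Rule rule : RULES) {
                Matcher m = rule.pattern.matcher(input).region(pos, input.length());
                // A strictly longer match replaces the current best;
                // on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = rule;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            if (!best.skip) {
                tokens.add(best.name + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "var" ties between KEYWORD and IDENT, so KEYWORD (listed first) wins;
        // "variable" is matched further by IDENT, so IDENT wins.
        System.out.println(tokenize("var variable")); // prints [KEYWORD:var, IDENT:variable]
    }
}
```

The whitespace between the two words is matched by the WS rule but never reaches the output list, which is the effect the skip command has on the token stream.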


Here is an example of a lexer grammar file.

mylexer/MyLexer.g4

lexer grammar MyLexer;

@header {
    package mylexer;
}

KEYWORD: 'constant' | 'var' | 'function';
IDENT: [a-z]+;
NUMBER: [0-9]('.'[0-9]+)?;
OPEN_PAREN: '(';
CLOSE_PAREN: ')';
OPEN_BRACE: '{';
CLOSE_BRACE: '}';
SEMICOLON: ';';
OP: ('+' | '-' | '*' | '/');
EQ: '=';
WS: [ \t\r\n]+ -> skip;

Note:

  • The grammar name must agree with the file name.
  • The generated lexer will be a Java class named mylexer.MyLexer.
  • By Java convention, the generated Java file must reside in a directory that matches the package name, here mylexer/ under a classpath root.
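The relationship between the package name and the directory can be written down as a small helper. This is a minimal sketch: the class SourceLayout and the method sourcePath are hypothetical names introduced for illustration and are not part of ANTLR or the JDK.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class SourceLayout {
    // Maps a package name and a class name to the source path,
    // relative to a classpath root, that the Java convention expects.
    static Path sourcePath(String packageName, String className) {
        return Paths.get(packageName.replace('.', '/'), className + ".java");
    }

    public static void main(String[] args) {
        // For the grammar above: directory mylexer, file MyLexer.java.
        System.out.println(sourcePath("mylexer", "MyLexer"));
    }
}
```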

Compilation

For the compilation process, we need the complete ANTLR jar file. We will assume that it is located at: /data/shared/antlr-4.9.1-complete.jar


We need to first generate the Java source file for the lexer. This is done by:

java -jar /data/shared/antlr-4.9.1-complete.jar mylexer/MyLexer.g4

This will generate the source file mylexer/MyLexer.java.


Next, we need to compile the generated Java source file into a Java class file. The classpath must include both the current directory . and the ANTLR jar file:

javac -cp /data/shared/antlr-4.9.1-complete.jar:. mylexer/MyLexer.java

This will generate mylexer/MyLexer.class
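The two commands can also be assembled programmatically, which makes the classpath rule explicit. The sketch below only builds and prints the command lines and does not run them. The class BuildCommands and its methods are hypothetical names. The one detail it adds is File.pathSeparator, which is : on Unix-like systems and ; on Windows, so the classpath written above with : assumes a Unix-like shell.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class BuildCommands {
    // Classpath containing the ANTLR jar and the current directory.
    static String classpath(String antlrJar) {
        return String.join(File.pathSeparator, antlrJar, ".");
    }

    // Step 1: run the ANTLR tool on the grammar to generate the Java source.
    static List<String> generateCommand(String antlrJar, String grammarFile) {
        return Arrays.asList("java", "-jar", antlrJar, grammarFile);
    }

    // Step 2: compile the generated source against the ANTLR jar.
    static List<String> compileCommand(String antlrJar, String sourceFile) {
        return Arrays.asList("javac", "-cp", classpath(antlrJar), sourceFile);
    }

    public static void main(String[] args) {
        String jar = "/data/shared/antlr-4.9.1-complete.jar"; // location assumed in the text
        System.out.println(String.join(" ", generateCommand(jar, "mylexer/MyLexer.g4")));
        System.out.println(String.join(" ", compileCommand(jar, "mylexer/MyLexer.java")));
    }
}
```

Each list could be handed to java.lang.ProcessBuilder to actually run the step, provided the jar exists at the assumed location.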

Using the generated lexer

The following Kotlin Jupyter notebook demonstrates how the generated Java lexer class can be used from Kotlin to perform lexical analysis of a simple program.

ANTLR-based lexical analyzer

This is the mylexer/MyLexer.g4:

lexer grammar MyLexer;

@header {
    package mylexer;
}

KEYWORD: 'constant' | 'var' | 'function';
IDENT: [a-z]+;
NUMBER: [0-9]('.'[0-9]+)?;
OPEN_PAREN: '(';
CLOSE_PAREN: ')';
OPEN_BRACE: '{';
CLOSE_BRACE: '}';
SEMICOLON: ';';
OP: ('+' | '-' | '*' | '/');
EQ: '=';
WS: [ \t\r\n]+ -> skip;

Make sure we load the proper libraries.

  • ANTLR runtime library
  • local package lookup in the current directory
In [1]:
@file:DependsOn("/data/shared/antlr-4.9.1-complete.jar")
@file:DependsOn(".")
In [2]:
import org.antlr.v4.runtime.*;
import mylexer.MyLexer;

Consider the same source code.

In [3]:
val source : String = 
    """
    constant pi = 3.1415;
    var radius = 10.4;
    var area = pi * square(radius);

    function square(x) {
      return x * x;
    }
    """.trimIndent()

We can construct an ANTLRInputStream from the string object.

In [4]:
val input = ANTLRInputStream(source)

ANTLR has generated a lexer based on the lexer grammar file.

In [5]:
var lexer = MyLexer(input);

We can get the token stream, and populate it using the ANTLR common token stream API.

In [6]:
var tokens = CommonTokenStream(lexer);
tokens.fill()
print("There are:" + tokens.size() + " tokens")
There are:34 tokens

Let's print out the tokens.

In [14]:
for(i in 0 until tokens.size()) {
    val token = tokens.get(i)
    val typename = 
        if(token.type >= 0)
            lexer.tokenNames[token.type]
        else
            "EOF"
    println(typename + ":" + token.text)
}
KEYWORD:constant
IDENT:pi
'=':=
NUMBER:3.1415
';':;
KEYWORD:var
IDENT:radius
'=':=
NUMBER:1
NUMBER:0.4
';':;
KEYWORD:var
IDENT:area
'=':=
IDENT:pi
OP:*
IDENT:square
'(':(
IDENT:radius
')':)
';':;
KEYWORD:function
IDENT:square
'(':(
IDENT:x
')':)
'{':{
IDENT:return
IDENT:x
OP:*
IDENT:x
';':;
'}':}
EOF:<EOF>
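In the listing above, the literal 10.4 appears as two tokens, NUMBER:1 and NUMBER:0.4, whereas 3.1415 is a single token. This follows from the NUMBER rule, which admits exactly one digit before the optional fractional part. The sketch below reproduces the effect with java.util.regex as a stand-in for the lexer rule. That is an assumption made for illustration, since ANTLR does not use java.util.regex, and the multi-digit pattern is a hypothetical variant that is not part of the grammar shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberRuleSketch {
    // Length of the prefix of text matched by regex, or 0 if no prefix matches.
    static int prefixLength(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        String singleDigit = "[0-9](\\.[0-9]+)?";  // regex analogue of the NUMBER rule
        String multiDigit = "[0-9]+(\\.[0-9]+)?";  // hypothetical variant with [0-9]+

        System.out.println(prefixLength(singleDigit, "3.1415")); // prints 6: whole literal
        System.out.println(prefixLength(singleDigit, "10.4"));   // prints 1: only "1"
        System.out.println(prefixLength(singleDigit, "0.4"));    // prints 3: the remainder
        System.out.println(prefixLength(multiDigit, "10.4"));    // prints 4: whole literal
    }
}
```

With a single leading digit, the match on 10.4 stops after 1 because the next character is 0 rather than a dot, and the lexer then starts a new token at 0.4. Writing the integer part as [0-9]+ in the grammar would be the corresponding change if multi-digit numbers should form one token.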