ANTLR is a powerful parser generator library. In this section, we will see how it can help us generate a lexical analyzer based on regular expression rules.
ANTLR requires us to write a lexer grammar file. The grammar file has the following general structure.
lexer grammar <Grammar_name>;
@header {
package <package_name>;
}
<Token_name> : <Regex pattern> ;
<Token_name> : <Regex pattern> ;
<Token_name> : <Regex pattern> -> skip;
...
Note: the -> skip command indicates that the token is recognized by the lexer but is not included in the token stream.
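To make the effect of -> skip concrete, here is a minimal Java sketch that imitates it with java.util.regex instead of ANTLR. The class name SkipSketch and the two rules WORD and WS are assumptions made for illustration only; they are not part of any generated lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipSketch {
    // Hypothetical rules: WORD is emitted, WS is matched but dropped (like "-> skip").
    private static final Pattern WORD = Pattern.compile("[a-z]+");
    private static final Pattern WS = Pattern.compile("[ \\t\\r\\n]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            Matcher ws = WS.matcher(text).region(pos, text.length());
            if (ws.lookingAt()) {
                pos = ws.end(); // recognized and consumed, but never added to the output
                continue;
            }
            Matcher word = WORD.matcher(text).region(pos, text.length());
            if (word.lookingAt()) {
                tokens.add("WORD:" + word.group());
                pos = word.end();
                continue;
            }
            throw new IllegalArgumentException("Unexpected character at position " + pos);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The whitespace between and after the words is consumed silently.
        System.out.println(tokenize("var  radius\n")); // prints [WORD:var, WORD:radius]
    }
}
```

The whitespace still has to be matched, otherwise the lexer would stop at the first blank. It simply never reaches the consumer of the token stream.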
Here is an example of a lexer grammar file.
mylexer/MyLexer.g4
lexer grammar MyLexer;
@header {
package mylexer;
}
KEYWORD: 'constant' | 'var' | 'function';
IDENT: [a-z]+;
NUMBER: [0-9]+ ('.' [0-9]+)?;
OPEN_PAREN: '(';
CLOSE_PAREN: ')';
OPEN_BRACE: '{';
CLOSE_BRACE: '}';
SEMICOLON: ';';
OP: ('+' | '-' | '*' | '/');
EQ: '=';
WS: [ \t\r\n]+ -> skip;
Note:
- The grammar name must agree with the file name.
- The generated lexer will be a Java class named mylexer.MyLexer.
- According to the Java convention, the generated Java file must reside in the directory path that agrees with the package name: $CLASSPATH/mylexer/.
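One more point about the example grammar is that KEYWORD and IDENT can both match the same text, such as var. ANTLR resolves this by preferring the longest match and, when lengths tie, the rule that appears first in the grammar. The Java sketch below imitates that selection with java.util.regex. The class RulePrioritySketch and the method winningRule are hypothetical names used only for illustration, and this is not how ANTLR is implemented internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RulePrioritySketch {
    // Rules kept in grammar order; LinkedHashMap preserves insertion order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("KEYWORD", Pattern.compile("constant|var|function"));
        RULES.put("IDENT", Pattern.compile("[a-z]+"));
    }

    /** Returns the rule that wins at the start of text, or null if none matches. */
    static String winningRule(String text) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(text);
            // Strict '>' means an earlier rule keeps the win when lengths tie.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winningRule("var"));      // prints KEYWORD (tie, first rule wins)
        System.out.println(winningRule("variable")); // prints IDENT (longer match wins)
        System.out.println(winningRule("radius"));   // prints IDENT (only IDENT matches)
    }
}
```

This is why KEYWORD is listed before IDENT in the grammar: with the order reversed, every keyword would be reported as an identifier.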
For the compilation process, we need to have the complete ANTLR library jar file.
We will assume that the ANTLR jar file is located at /data/shared/antlr-4.9.1-complete.jar.
We need to first generate the Java source file for the lexer. This is done by:
java -jar /data/shared/antlr-4.9.1-complete.jar mylexer/MyLexer.g4
This will generate the MyLexer.java source file.
Next, we need to compile the generated Java source file into a Java class file. Make sure that the classpath includes both the current directory (.) and the ANTLR jar file:
javac -cp /data/shared/antlr-4.9.1-complete.jar:. mylexer/MyLexer.java
This will generate mylexer/MyLexer.class.
This is a Kotlin Jupyter Notebook that demonstrates how the generated Lexer Java class can be used to perform lexical analysis of a simple program in Kotlin.
This is the grammar file, mylexer/MyLexer.g4:
lexer grammar MyLexer;
@header {
package mylexer;
}
KEYWORD: 'constant' | 'var' | 'function';
IDENT: [a-z]+;
NUMBER: [0-9]+ ('.' [0-9]+)?;
OPEN_PAREN: '(';
CLOSE_PAREN: ')';
OPEN_BRACE: '{';
CLOSE_BRACE: '}';
SEMICOLON: ';';
OP: ('+' | '-' | '*' | '/');
EQ: '=';
WS: [ \t\r\n]+ -> skip;
Make sure we load the proper libraries.
@file:DependsOn("/data/shared/antlr-4.9.1-complete.jar")
@file:DependsOn(".")
import org.antlr.v4.runtime.*;
import mylexer.MyLexer;
Consider the same source code.
val source : String =
"""
constant pi = 3.1415;
var radius = 10.4;
var area = pi * square(radius);
function square(x) {
    return x * x;
}
""".trimIndent()
We can construct a character stream from the string object. ANTLRInputStream is deprecated since ANTLR 4.7, so we use CharStreams.fromString instead.
val input = CharStreams.fromString(source)
ANTLR has generated a lexer class from the lexer grammar file. We create an instance of it over the input.
val lexer = MyLexer(input)
We can get the token stream and populate it using the ANTLR common token stream API. Note that the token count includes the final EOF token.
val tokens = CommonTokenStream(lexer)
tokens.fill()
println("There are " + tokens.size() + " tokens")
Let's print out the tokens.
for (i in 0 until tokens.size()) {
    val token = tokens.get(i)
    val typename =
        if (token.type >= 0)
            lexer.tokenNames[token.type]
        else
            "EOF"
    println(typename + ":" + token.text)
}
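The check on token.type in the loop above is needed because ANTLR gives the end-of-file token the type -1 (Token.EOF), which cannot be used as an array index, while user-defined token types start at 1. The Java sketch below shows the same guard in isolation. The class TokenNameSketch and its NAMES table are hypothetical stand-ins, not the array that ANTLR generates.

```java
class TokenNameSketch {
    // Same value as org.antlr.v4.runtime.Token.EOF.
    static final int EOF = -1;

    // Hypothetical name table; index 0 is a placeholder because token types start at 1.
    static final String[] NAMES = {"<INVALID>", "KEYWORD", "IDENT", "NUMBER"};

    static String nameOf(int type) {
        // Negative types never reach the array lookup.
        return type >= 0 ? NAMES[type] : "EOF";
    }

    public static void main(String[] args) {
        System.out.println(nameOf(1));   // prints KEYWORD
        System.out.println(nameOf(EOF)); // prints EOF
    }
}
```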