Consider the following Java program:
interface Adder {
int adder(int a, int b);
}
Let's think in terms of symbols.
interface $\dots$ $\longrightarrow$ i, n, t, e, r, f, a, c, e, $\dots$
How many symbols in $s$?
48 characters = 48 symbols (including whitespace)
How many symbols in $s$ if we use a binary alphabet?
i $\rightarrow 105 \rightarrow 01101001$
n $\rightarrow 110 \rightarrow 01101110$
Let's look at the string lengths using different encodings.
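Such character-to-binary mappings can be computed directly (a short Kotlin sketch; codes are padded to 8 bits):

```kotlin
// Map each character to its code point and its 8-bit binary form.
for (c in "in") {
    println("$c -> ${c.code} -> ${Integer.toBinaryString(c.code).padStart(8, '0')}")
}
// i -> 105 -> 01101001
// n -> 110 -> 01101110
```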
$\Sigma_\mathrm{CHAR} = \{a, b, c, \dots\}$
$\Sigma_\mathrm{BIN} = \{0, 1\}$
Is it better to use $\Sigma_\mathrm{CHAR}$ or $\Sigma_\mathrm{BIN}$ when we want to understand the Java code?
// This constructs a string which is
// the Java program.
fun javaprog() =
"""
interface Adder {
int adder(int a, int b);
}
""".trimIndent()
// We can count the characters
println("Length in characters: ${javaprog().length}")
// Now, we will concatenate all the binary
// representations of characters.
var binaryString = ""
for (s in javaprog()) {
// Char.code gives the character's code point; pad to 8 bits so
// each character contributes exactly 8 binary symbols.
binaryString += Integer.toBinaryString(s.code).padStart(8, '0')
}
println(binaryString)
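The two lengths can then be compared directly; a self-contained sketch (repeating the program string so it runs on its own):

```kotlin
val program = """
    interface Adder {
    int adder(int a, int b);
    }
""".trimIndent()

// Each character becomes exactly 8 binary symbols.
val bits = program
    .map { Integer.toBinaryString(it.code).padStart(8, '0') }
    .joinToString("")

println("Over CHAR alphabet: ${program.length} symbols")
println("Over BIN alphabet: ${bits.length} symbols") // 8 times longer
```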
Can we think of a more suitable alphabet to examine the Java program?
Yes.
We want to choose an alphabet specific to the Java programming language.
val newString = listOf<String>(
"interface",
"Adder",
"{",
"int",
"adder",
"(",
"int",
"a",
",",
"int",
"b",
")",
"}",
";",
)
newString
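Over this Java-specific alphabet, the same program needs far fewer symbols; a quick check (repeating the list above so it runs on its own):

```kotlin
val tokens = listOf(
    "interface", "Adder", "{", "int", "adder", "(",
    "int", "a", ",", "int", "b", ")", "}", ";",
)
// 14 symbols, versus 48 characters over the character alphabet.
println("Length in Java-alphabet symbols: ${tokens.size}")
```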
What if each symbol is also annotated by additional information about its role in the program?
enum class T {
Keyword,
Name,
Separator,
Type,
Punctuation,
}
data class Annotated(
val value: String,
val type: T,
)
val annotatedString = listOf(
Annotated("interface", T.Keyword),
Annotated("Adder", T.Name),
Annotated("{", T.Separator),
Annotated("int", T.Type),
Annotated("adder", T.Name),
Annotated("(", T.Separator),
Annotated("int", T.Type),
Annotated("a", T.Name),
Annotated(",", T.Punctuation),
Annotated("int", T.Type),
Annotated("b", T.Name),
Annotated(")", T.Separator),
Annotated("}", T.Separator),
Annotated(";", T.Punctuation),
)
annotatedString
A token is a symbol with additional annotation.
data class Token(val lexeme: String, val type: TokenTypes)
A lexeme is the string content of a token. Lexemes are extracted directly from the source code.
Lexical analysis is the process of converting source code into a sequence of tokens.
A lexical analyzer is a function that maps character-based source code to a string of tokens:
$$ \mathrm{lex}: \Sigma^*_\mathrm{CHAR} \to \mathrm{Token}^* $$
enum class TokenTypes {
Keyword,
Name,
Separator,
JavaType,
Punctuation,
}
val patterns = mapOf(
TokenTypes.JavaType to Regex("int|float|double|boolean|char"),
TokenTypes.Name to Regex("..."),
// ...
)
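A minimal sketch of such a `lex` function: try each pattern at the current position and emit the first match. The pattern table is re-declared here so the sketch is self-contained, and the `Name` and `Keyword` regexes are illustrative assumptions (the real `Name` pattern is elided above):

```kotlin
enum class TokenTypes { Keyword, Name, Separator, JavaType, Punctuation }

data class Token(val lexeme: String, val type: TokenTypes)

// Simplified pattern table; order matters (more specific patterns first).
val patterns = listOf(
    TokenTypes.Keyword to Regex("^(interface|class)\\b"),
    TokenTypes.JavaType to Regex("^(int|float|double|boolean|char)\\b"),
    TokenTypes.Name to Regex("^[A-Za-z_][A-Za-z0-9_]*"),
    TokenTypes.Separator to Regex("^[{}()]"),
    TokenTypes.Punctuation to Regex("^[,;]"),
)

fun lex(source: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var rest = source
    while (rest.isNotEmpty()) {
        // Skip whitespace between tokens.
        if (rest.first().isWhitespace()) {
            rest = rest.drop(1)
            continue
        }
        // Take the first pattern that matches at the current position.
        val (type, lexeme) = patterns.firstNotNullOf { (t, re) ->
            re.find(rest)?.let { t to it.value }
        }
        tokens += Token(lexeme, type)
        rest = rest.drop(lexeme.length)
    }
    return tokens
}

println(lex("interface Adder { int adder(int a, int b); }"))
```

This ordering makes keywords and type names win over the general `Name` pattern, which is why `int` is tokenized as a `JavaType` while `adder` falls through to `Name`.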