

Tokens and Lexical Analysis

Choose The Alphabet

Consider the following Java program:

interface Adder {
    int adder(int a, int b);
}

Let's think in terms of symbols.

interface $\dots$ $\longrightarrow$ i n t e r f a c e ...

How many symbols are in the program string $s$?

48 characters = 48 symbols (including whitespace)

How many symbols in $s$ if we use binary alphabet?

  • i $\rightarrow 105 \rightarrow 01101001$
  • n $\rightarrow 110 \rightarrow 01101110$

Let's look at the string lengths using different encodings.

  1. $\Sigma_\mathrm{CHAR} = \{a, b, c, \dots\}$


  2. $\Sigma_\mathrm{BIN} = \{0, 1\}$


Is it better to use $\Sigma_\mathrm{CHAR}$ or $\Sigma_\mathrm{BIN}$ when we want to understand the Java code?

In [1]:
// This constructs a string which is
// the Java program.

fun javaprog() =
    """
    interface Adder {
        int adder(int a, int b);
    }
    """.trimIndent()
In [2]:
// We can count the characters
println("Length in characters: ${javaprog().length}")
Length in characters: 48
In [3]:
// Now, we will concatenate the binary
// representations of all the characters.
// Note: Integer.toBinaryString drops leading zeros,
// so each character contributes a variable number of bits.
var binaryString = ""

for (s in javaprog()) {
    binaryString += Integer.toBinaryString(s.code)
}

println(binaryString)
1101001110111011101001100101111001011001101100001110001111001011000001000001110010011001001100101111001010000011110111010100000100000100000100000110100111011101110100100000110000111001001100100110010111100101010001101001110111011101001000001100001101100100000110100111011101110100100000110001010100111101110101111101

Can we think of a more suitable alphabet to examine the Java program?

  • Yes.

  • We want to choose an alphabet specific to the Java programming language.

In [4]:
val newString = listOf<String>(
    "interface",
    "Adder",
    "{",
    "int",
    "adder",
    "(",
    "int",
    "a",
    ",",
    "int",
    "b",
    ")",
    "}",
    ";",
)

newString
Out[4]:
[interface, Adder, {, int, adder, (, int, a, ,, int, b, ), }, ;]

What if each symbol is also annotated by additional information about its role in the program?

  1. The annotation will provide additional semantic information.


  2. It will be easier to understand the Java source code from the annotated symbols.
In [5]:
enum class T {
    Keyword,
    Name,
    Separator,
    Type,
    Punctuation,
}
In [6]:
data class Annotated (
    val value: String,
    val type: T,
)
In [7]:
val annotatedString = listOf(
    Annotated("interface", T.Keyword),
    Annotated("Adder", T.Name),
    Annotated("{", T.Separator),
    Annotated("int", T.Type),
    Annotated("adder", T.Name),
    Annotated("(", T.Separator),
    Annotated("int", T.Type),
    Annotated("a", T.Name),
    Annotated(",", T.Punctuation),
    Annotated("int", T.Type),
    Annotated("b", T.Name),
    Annotated(")", T.Separator),
    Annotated("}", T.Separator),
    Annotated(";", T.Punctuation),
)

annotatedString
Out[7]:
[Annotated(value=interface, type=Keyword), Annotated(value=Adder, type=Name), Annotated(value={, type=Separator), Annotated(value=int, type=Type), Annotated(value=adder, type=Name), Annotated(value=(, type=Separator), Annotated(value=int, type=Type), Annotated(value=a, type=Name), Annotated(value=,, type=Punctuation), Annotated(value=int, type=Type), Annotated(value=b, type=Name), Annotated(value=), type=Separator), Annotated(value=}, type=Separator), Annotated(value=;, type=Punctuation)]

Definition

  • A token is a symbol with an additional annotation:

    data class Token(val lexeme: String, val type: Type)
    
  • A lexeme is the string content of a token. Lexemes are extracted directly from the source code.

Lexical Analysis

  • Lexical analysis is the process of converting a source code into a sequence of tokens.

  • A lexical analyzer is a function that maps character-based source code to a sequence of tokens:

    $$ \mathrm{lex}: \Sigma^*_\mathrm{CHAR} \to \mathrm{Token}^* $$
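As a Kotlin sketch of this type, $\mathrm{lex}$ takes a `String` (an element of $\Sigma^*_\mathrm{CHAR}$) and returns a `List<Token>`. The body below is only a placeholder to make the type concrete, not a real lexer:

```kotlin
enum class Type { Keyword, Name, Separator, JavaType, Punctuation }

data class Token(val lexeme: String, val type: Type)

// lex : Σ*_CHAR → Token* becomes (String) -> List<Token> in Kotlin.
// Placeholder body: split on whitespace and label everything a Name.
// A real lexer is built from regular expressions in the next section.
fun lex(source: String): List<Token> =
    source.split(Regex("\\s+"))
        .filter { it.isNotEmpty() }
        .map { Token(it, Type.Name) }

println(lex("interface Adder"))
```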

Building Lexical Analyzers With Regular Expressions

Determine token types

In [8]:
enum class TokenTypes {
    Keyword,
    Name,
    Separator,
    JavaType,
    Punctuation,
}

Define lexeme patterns for each token type

In [9]:
val patterns = mapOf(
    TokenTypes.JavaType to Regex("int|float|double|boolean|char"),
    TokenTypes.Name to Regex("..."),
    // ...
)
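One way to complete this sketch (the specific patterns and the first-match strategy here are illustrative assumptions, not a prescribed design): order the patterns from most to least specific, anchor each at the current position, and repeatedly emit a token for the first pattern that matches. Types are redefined here so the cell is self-contained.

```kotlin
enum class TokenTypes { Keyword, Name, Separator, JavaType, Punctuation }

data class Tok(val lexeme: String, val type: TokenTypes)

// Order matters: keywords and types come before Name, so that
// "int" is matched as a JavaType rather than a Name.
val patterns = listOf(
    TokenTypes.Keyword     to Regex("^(interface|class|return)\\b"),
    TokenTypes.JavaType    to Regex("^(int|float|double|boolean|char)\\b"),
    TokenTypes.Name        to Regex("^[A-Za-z_][A-Za-z0-9_]*"),
    TokenTypes.Separator   to Regex("^[(){}]"),
    TokenTypes.Punctuation to Regex("^[,;]"),
)

fun lex(source: String): List<Tok> {
    val tokens = mutableListOf<Tok>()
    var rest = source.trimStart()
    while (rest.isNotEmpty()) {
        // Try each pattern at the current position; take the first match.
        val token = patterns.firstNotNullOfOrNull { (type, re) ->
            re.find(rest)?.let { Tok(it.value, type) }
        } ?: error("No pattern matches: $rest")
        tokens += token
        rest = rest.drop(token.lexeme.length).trimStart()
    }
    return tokens
}

println(lex("interface Adder { int adder(int a, int b); }"))
```

Running this on the example program yields the same 14-token sequence as the hand-annotated `annotatedString` above.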

What's next?

  1. We will discuss how to define languages using regular expressions.
  2. We will study the connection between regular expressions and automata.
  3. We will understand how regular expression libraries work by building pattern matchers using automata.
  4. We will look at how to build lexical analyzers with ANTLR.