Consider the following Java program:
interface Adder {
int adder(int a, int b);
}
Let's think in terms of symbols.
interface $\dots$ $\longrightarrow$ i, n, t, e, r, f, a, c, e, $\dots$
How many symbols in $s$?
48 characters = 48 symbols (including whitespace)
How many symbols in $s$ if we use a binary alphabet?
i $\rightarrow 105 \rightarrow 01101001$
n $\rightarrow 110 \rightarrow 01101110$
Let's look at the string lengths using different encodings.
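Such character-to-binary mappings can be computed directly (a short Kotlin sketch; codes are padded to 8 bits):

```kotlin
// Map each character to its code point and its 8-bit binary form.
for (c in "in") {
    println("$c -> ${c.code} -> ${Integer.toBinaryString(c.code).padStart(8, '0')}")
}
// i -> 105 -> 01101001
// n -> 110 -> 01101110
```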
$\Sigma_\mathrm{CHAR} = \{a, b, c, \dots\}$
$\Sigma_\mathrm{BIN} = \{0, 1\}$
Is it better to use $\Sigma_\mathrm{CHAR}$ or $\Sigma_\mathrm{BIN}$ when we want to understand the Java code?
// This constructs a string which is
// the Java program.
fun javaprog() =
"""
interface Adder {
int adder(int a, int b);
}
""".trimIndent()
// We can count the characters
println("Length in characters: ${javaprog().length}")
// Now, we will concatenate all the binary
// representations of characters.
var binaryString = ""
for (s in javaprog()) {
// Char.code gives the character's code point; pad to 8 bits so
// each character contributes exactly 8 binary symbols.
binaryString += Integer.toBinaryString(s.code).padStart(8, '0')
}
println(binaryString)
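The two lengths can then be compared directly; a self-contained sketch (repeating the program string so it runs on its own):

```kotlin
val program = """
    interface Adder {
    int adder(int a, int b);
    }
""".trimIndent()

// Each character becomes exactly 8 binary symbols.
val bits = program
    .map { Integer.toBinaryString(it.code).padStart(8, '0') }
    .joinToString("")

println("Over CHAR alphabet: ${program.length} symbols")
println("Over BIN alphabet: ${bits.length} symbols") // 8 times longer
```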
Can we think of a more suitable alphabet to examine the Java program?
Yes.
We want to choose an alphabet specific to the Java programming language.
val newString = listOf<String>(
"interface",
"Adder",
"{",
"int",
"adder",
"(",
"int",
"a",
",",
"int",
"b",
")",
"}",
";",
)
newString
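Over this Java-specific alphabet, the same program needs far fewer symbols; a quick check (repeating the list above so it runs on its own):

```kotlin
val tokens = listOf(
    "interface", "Adder", "{", "int", "adder", "(",
    "int", "a", ",", "int", "b", ")", "}", ";",
)
// 14 symbols, versus 48 characters over the character alphabet.
println("Length in Java-alphabet symbols: ${tokens.size}")
```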
What if each symbol is also annotated by additional information about its role in the program?
enum class T {
Keyword,
Name,
Separator,
Type,
Punctuation,
}
data class Annotated(
val value: String,
val type: T,
)
val annotatedString = listOf(
Annotated("interface", T.Keyword),
Annotated("Adder", T.Name),
Annotated("{", T.Separator),
Annotated("int", T.Type),
Annotated("adder", T.Name),
Annotated("(", T.Separator),
Annotated("int", T.Type),
Annotated("a", T.Name),
Annotated(",", T.Punctuation),
Annotated("int", T.Type),
Annotated("b", T.Name),
Annotated(")", T.Separator),
Annotated("}", T.Separator),
Annotated(";", T.Punctuation),
)
annotatedString
A token is a symbol with additional annotation.
data class Token(val lexeme: String, val type: TokenTypes)
A lexeme is the string content of a token. Lexemes are extracted directly from the source code.
Lexical analysis is the process of converting source code into a sequence of tokens.
A lexical analyzer is a function that maps character-based source code to a string of tokens:
$$ \mathrm{lex}: \Sigma^*_\mathrm{CHAR} \to \mathrm{Token}^* $$
enum class TokenTypes {
Keyword,
Name,
Separator,
JavaType,
Punctuation,
}
val patterns = mapOf(
TokenTypes.JavaType to Regex("int|float|double|boolean|char"),
TokenTypes.Name to Regex("..."),
// ...
)
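A minimal sketch of such a `lex` function: try each pattern at the current position and emit the first match. The pattern table is re-declared here so the sketch is self-contained, and the `Name` and `Keyword` regexes are illustrative assumptions (the real `Name` pattern is elided above):

```kotlin
enum class TokenTypes { Keyword, Name, Separator, JavaType, Punctuation }

data class Token(val lexeme: String, val type: TokenTypes)

// Simplified pattern table; order matters (more specific patterns first).
val patterns = listOf(
    TokenTypes.Keyword to Regex("^(interface|class)\\b"),
    TokenTypes.JavaType to Regex("^(int|float|double|boolean|char)\\b"),
    TokenTypes.Name to Regex("^[A-Za-z_][A-Za-z0-9_]*"),
    TokenTypes.Separator to Regex("^[{}()]"),
    TokenTypes.Punctuation to Regex("^[,;]"),
)

fun lex(source: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var rest = source
    while (rest.isNotEmpty()) {
        // Skip whitespace between tokens.
        if (rest.first().isWhitespace()) {
            rest = rest.drop(1)
            continue
        }
        // Take the first pattern that matches at the current position.
        val (type, lexeme) = patterns.firstNotNullOf { (t, re) ->
            re.find(rest)?.let { t to it.value }
        }
        tokens += Token(lexeme, type)
        rest = rest.drop(lexeme.length)
    }
    return tokens
}

println(lex("interface Adder { int adder(int a, int b); }"))
```

This ordering makes keywords and type names win over the general `Name` pattern, which is why `int` is tokenized as a `JavaType` while `adder` falls through to `Name`.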