A pattern is a string that describes a set of one or more strings.
So, each pattern represents a language.
Regular expressions form a special language of patterns.
Each single letter is a valid regular expression:
a
b
c
Their languages are:
$L(a) = \{a\}$, $L(b) = \{b\}$, $L(c) = \{c\}$, etc.
We can describe sets of single characters using character classes.
Example:
[abc]
is a character class, with
$$L(\mathrm{[abc]}) = L(a)\cup L(b)\cup L(c) = \{a, b, c\}$$
[a-z]
is a character class where
$$ L(\mathrm{[a-z]}) = \{a, b, c, \dots, z\} $$
[^a-z]
is a character class formed by negation.
$$L(\mathrm{[\hat{} a-z]}) = \Sigma - L(\mathrm{[a-z]})$$
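These character classes can be tried with Kotlin's `Regex` type, which is introduced in more detail later in these notes; a minimal sketch:

```kotlin
// Character classes as Kotlin regular expressions.
// Regex.matches(input) tests whether the whole input is in the language.
val abc = Regex("[abc]")       // L([abc]) = {a, b, c}
val lower = Regex("[a-z]")     // L([a-z]) = {a, ..., z}
val notLower = Regex("[^a-z]") // complement of [a-z] within the alphabet

check(abc.matches("a"))
check(!abc.matches("d"))
check(lower.matches("q"))
check(notLower.matches("Q"))
check(!notLower.matches("q"))
```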
There are many predefined character classes.
\w
: any word character. This is equivalent to [a-zA-Z0-9_]
\W
: any non-word character. This is equivalent to [^\w]
\d
: a single digit, equivalent to [0-9]
\D
: a non-digit character
\s
: a whitespace character, including space, tab, newline, and carriage return
\S
: a non-whitespace character
.
: any character in $\Sigma$
There are also boundary classes. They do not describe single characters, but rather boundaries in strings.
^
: start of a string. This is not to be confused with negating a character class.
$
: end of a string
\b
: word boundary
\B
: a non-word boundary
We can define patterns of longer (and more) strings by building up more complex regular expressions.
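The predefined character and boundary classes above can be exercised in Kotlin (covered in detail later); a sketch using raw strings to avoid doubling backslashes:

```kotlin
// Predefined character classes (raw strings avoid double escaping).
check(Regex("""\w""").matches("x"))    // word character
check(Regex("""\w""").matches("_"))
check(!Regex("""\W""").matches("x"))   // non-word character
check(Regex("""\d""").matches("7"))    // digit
check(Regex("""\s""").matches(" "))    // whitespace
check(Regex(""".""").matches("?"))     // any character

// Boundary classes match positions, not characters.
check(Regex("""\bcat\b""").containsMatchIn("a cat sat"))
check(!Regex("""\bcat\b""").containsMatchIn("concatenate"))
```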
Building blocks
Any character class is a pattern in the language $\mathrm{Regex}$.
Alternation
If $e_1\in\mathrm{Regex}$ and $e_2\in\mathrm{Regex}$, then $e_1 | e_2 \in \mathrm{Regex}$.
$$L(e_1|e_2) = L(e_1)\cup L(e_2)$$
Concatenation
If $e_1\in\mathrm{Regex}$ and $e_2\in\mathrm{Regex}$, then $e_1 e_2 \in \mathrm{Regex}$.
$$ L(e_1 e_2) = \{st : s\in L(e_1)\ \mathrm{and}\ t\in L(e_2)\}$$
Repetition
If $e\in\mathrm{Regex}$, then $e*\in\mathrm{Regex}$
$L(e*)$ consists of all concatenations of zero or more strings from $L(e)$:
$$L(e*) = \{s_1 s_2 \cdots s_n : n\ge 0,\ s_i\in L(e)\}$$
Repetition with one or more
If $e\in\mathrm{Regex}$, then $e+\in\mathrm{Regex}$
$L(e+)$ consists of all concatenations of one or more strings from $L(e)$, so $L(e+) = L(e\,e*)$.
Optional
If $e\in\mathrm{Regex}$, then $e?\in\mathrm{Regex}$
$L(e?)$ consists of the strings of $L(e)$ together with the empty string: $L(e?) = L(e)\cup\{\epsilon\}$.
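A minimal Kotlin sketch of the four constructions (alternation, concatenation, repetition, and option):

```kotlin
val alt = Regex("cat|dog")   // alternation: L = {cat, dog}
val concat = Regex("ab")     // concatenation of a and b: L = {ab}
val star = Regex("(ab)*")    // zero or more repetitions of ab
val plus = Regex("(ab)+")    // one or more repetitions of ab
val opt = Regex("(ab)?")     // zero or one occurrence of ab

check(alt.matches("dog"))
check(concat.matches("ab"))
check(star.matches(""))      // zero repetitions: the empty string
check(star.matches("ababab"))
check(!plus.matches(""))     // + requires at least one repetition
check(opt.matches("ab") && opt.matches(""))
```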
e1 = \d+
one or more digits
e2 = e1 (\. e1)?
a number with an optional fractional part
e3 = (e2 \,)*
zero or more numbers, each followed by a comma
e4 = e3 e2
a comma-separated list of numbers, ending without a comma
e5 = (e4 '\n')+
one or more lines, each containing a comma-separated list of numbers
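These patterns can be assembled in Kotlin with string templates; a sketch in which the names e1 through e5 mirror the definitions above:

```kotlin
// Build larger patterns from smaller ones via string interpolation.
val e1 = """\d+"""            // one or more digits
val e2 = """$e1(\.$e1)?"""    // a number with an optional fractional part
val e3 = """($e2,)*"""        // zero or more numbers, each followed by a comma
val e4 = """$e3$e2"""         // a comma-separated list of numbers
val e5 = """($e4\n)+"""       // one or more lines of such lists

check(Regex(e2).matches("3.1415"))
check(Regex(e4).matches("1,2.5,3"))
check(Regex(e5).matches("1,2\n3.5\n"))
```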
We will cover how Kotlin handles regular expressions.
Kotlin can convert any string to a regular expression.
"\\d"
"\\d".toRegex()
"\\d".toRegex().javaClass
"""\d"""
"""\d""".toRegex()
We can match strings to patterns given in regular expressions.
val numberPattern = """\d+(\.\d+)?""".toRegex()
numberPattern.matches("Hello")
numberPattern.matches("3.1415")
// matches is an infix method
numberPattern matches "100"
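Note that `matches` tests the entire string, not just a substring. A self-contained sketch, repeating the pattern definition from above:

```kotlin
val numberPattern = """\d+(\.\d+)?""".toRegex()

check(!numberPattern.matches("Hello"))    // no digits at all
check(numberPattern.matches("3.1415"))    // the whole string is a number
check(numberPattern matches "100")        // infix form
check(!numberPattern.matches("turn 44"))  // extra text: whole-string match fails
```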
We can also extract substrings from strings using patterns.
numberPattern.find("I will turn 44 in 2021") == null
numberPattern.find("My name is Ken") == null
numberPattern
    .find("I will turn 44 in 2021")
    ?.groups
    ?.get(0)
    ?.value
numberPattern
    .findAll("I will turn 44 in 2021")
    .map { match -> match.groups[0]?.value }
    .toList()
For the purpose of a lexical analyzer, we will want to limit regular expression matching to only the start of the string.
This is handled by the ^
special boundary.
val text = "let x = \"let x = 123\""
println(text)
val pattern = "^\\w+".toRegex()
val matcher = pattern.find(text)
matcher?.value
We will build a functional lexical analyzer using regular expressions.
Let's imagine a small program. It's a programmable calculator.
val source : String =
    """
    constant pi = 3.1415;
    var radius = 10.4;
    var area = pi * square(radius);
    function square(x) {
        return x * x;
    }
    """.trimIndent()
print(source)
Let's define:
TokenTypes: an enum of all possible token types.
Token: a class that has the token type and its lexeme.
enum class TokenTypes {
    Keyword,
    Identifier,
    Number,
    OpenParen,
    CloseParen,
    OpenBrace,
    CloseBrace,
    Semicolon,
    Operator,
    Whitespace,
    Eq,
}
data class Token(
    val t: TokenTypes,
    val lexeme: String,
) {
    override fun toString(): String {
        return "\"$lexeme\": $t"
    }
}
For each token type, we will also define a regular expression. Note that we add a ^
to each pattern to make sure it only matches from the start of the string.
val patterns = mapOf(
    TokenTypes.Keyword to "^(constant|var|function)".toRegex(),
    TokenTypes.Identifier to "^[a-z]\\w*".toRegex(),
    TokenTypes.Number to "^\\d+(\\.\\d+)?".toRegex(),
    TokenTypes.OpenParen to "^\\(".toRegex(),
    TokenTypes.CloseParen to "^\\)".toRegex(),
    TokenTypes.OpenBrace to "^\\{".toRegex(),
    TokenTypes.CloseBrace to "^\\}".toRegex(),
    TokenTypes.Semicolon to "^;".toRegex(),
    TokenTypes.Operator to "^[*/+\\-]".toRegex(),
    TokenTypes.Whitespace to "^[ \\t\\n\\r]".toRegex(),
    TokenTypes.Eq to "^=".toRegex(),
)
Let's build a function that will perform the lexical analysis.
fun getToken(source: String): Token? {
    for((t, p) in patterns) {
        val match = p.find(source)
        if(match != null) {
            return Token(t, match.value)
        }
    }
    return null
}
getToken(source)
fun lexer(input: String): List<Token> {
    var source = input
    val tokens = mutableListOf<Token>()
    var i = 0
    while(source.length > 0) {
        val t = getToken(source)
        if(t != null) {
            println("[$i]: $t")
            tokens.add(t)
            source = source.substring(t.lexeme.length)
        } else {
            throw Exception("Cannot find suitable rule for\n${source}")
        }
        i += 1
    }
    return tokens
}
Now, we can make use of the lexer.
val allTokens: List<Token>? = try {
    lexer(source)
} catch(e: Exception) {
    println(e)
    null
}
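For reference, the whole pipeline can be condensed into one self-contained sketch with a reduced token set. The names `Kind`, `Tok`, `rules`, and `lex` are illustrative, not part of the code above:

```kotlin
// A condensed lexer: token kinds, one anchored pattern per kind,
// and a loop that repeatedly takes the first matching rule.
enum class Kind { Keyword, Identifier, Number, Eq, Semicolon, Whitespace }

data class Tok(val kind: Kind, val lexeme: String)

// Order matters: Keyword must be tried before Identifier.
val rules = mapOf(
    Kind.Keyword to "^(constant|var|function)".toRegex(),
    Kind.Identifier to "^[a-z]\\w*".toRegex(),
    Kind.Number to "^\\d+(\\.\\d+)?".toRegex(),
    Kind.Eq to "^=".toRegex(),
    Kind.Semicolon to "^;".toRegex(),
    Kind.Whitespace to "^\\s+".toRegex(),
)

fun lex(input: String): List<Tok> {
    var rest = input
    val out = mutableListOf<Tok>()
    while (rest.isNotEmpty()) {
        // firstNotNullOf throws if no rule matches (Kotlin 1.5+).
        val (kind, m) = rules.firstNotNullOf { (k, p) ->
            p.find(rest)?.let { k to it }
        }
        out.add(Tok(kind, m.value))
        rest = rest.substring(m.value.length)
    }
    return out
}

val toks = lex("var pi = 3.14;").filter { it.kind != Kind.Whitespace }
check(toks.map { it.kind } ==
    listOf(Kind.Keyword, Kind.Identifier, Kind.Eq, Kind.Number, Kind.Semicolon))
```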
Having the program as tokens provides a great deal of flexibility in code analysis.
We can remove all whitespace tokens, since our program is insensitive to whitespace.
We can also check how many variable assignments there are.
//
// We are using List.filter with
// a lambda expression
// to remove all the whitespace tokens
//
val tokens = allTokens.orEmpty().filter {
    it.t != TokenTypes.Whitespace
}
tokens.forEach(::println)
// Find the index of every Eq token, then look up the token
// immediately before it: these are the assignment targets.
tokens.mapIndexed {
    i, token ->
    if(token.t == TokenTypes.Eq) {
        i
    } else {
        null
    }
}.filterNotNull().map {
    tokens[it - 1]
}.forEachIndexed {
    i, t -> println("[$i] $t")
}
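The mapIndexed / filterNotNull pair above can also be fused into mapIndexedNotNull; a self-contained sketch, where `Tk` and `tks` are illustrative stand-ins for the token list:

```kotlin
data class Tk(val t: String, val lexeme: String)

// Token stream for: var radius = 10.4; var area = pi;
val tks = listOf(
    Tk("Keyword", "var"), Tk("Identifier", "radius"), Tk("Eq", "="),
    Tk("Number", "10.4"), Tk("Semicolon", ";"),
    Tk("Keyword", "var"), Tk("Identifier", "area"), Tk("Eq", "="),
    Tk("Identifier", "pi"), Tk("Semicolon", ";"),
)

// Index every Eq token, then take the token just before it:
// these are the assignment targets.
val targets = tks
    .mapIndexedNotNull { i, tk -> if (tk.t == "Eq") i else null }
    .map { tks[it - 1].lexeme }

check(targets == listOf("radius", "area"))
```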
Regular expressions can be used to perform tokenization.
The language designer needs to figure out the token types and their respective patterns as regular expressions.
The lexical analyzer itself is just boilerplate code.
We will explore how ANTLR can be used to generate lexers from a lexer grammar.