
2 Regular Expressions

Regular Expressions

  • How are regular expressions defined?
  • How do we make use of regular expressions in Kotlin?
  • How can we perform string analysis using regular expressions?
  • How can we build an (inefficient) lexical analyzer using regular expressions?

Patterns

A pattern is a string that describes a set of strings.

So, each pattern represents a language.

Regular Expressions

Regular expressions form a special language of patterns.

  • Let $\mathrm{Regex}$ be the language of all valid regular expressions.
  • Each $e\in \mathrm{Regex}$ is a regular expression.
  • Each expression $e\in\mathrm{Regex}$ has a language $L(e)$.

Each single letter is a valid expression:

a
b
c

Their languages are:

$L(a) = \{a\}$, $L(b) = \{b\}$, $L(c) = \{c\}$, etc.

We can describe sets of single characters using character classes.

Example:

  • [abc] is a character class, with

    $$L(\mathrm{[abc]}) = L(a)\cup L(b)\cup L(c) = \{a, b, c\}$$

  • [a-z] is a character class where

    $$ L(\mathrm{[a-z]}) = \{a, b, c, \dots, z\} $$

  • [^a-z] is a character class formed by negation.

    $$L(\mathrm{[\hat{} a-z]}) = \Sigma - L(\mathrm{[a-z]})$$

There are many predefined character classes.

  • \w: any word character. This is equivalent to [a-zA-Z0-9_]
  • \W: any non-word character. This is equivalent to [^\w]
  • \d: a single digit, equivalent to [0-9].
  • \D: a non-digit character.
  • \s: a whitespace character, including space, tab, newline, and carriage return.
  • \S: a non-whitespace
  • .: any character in $\Sigma$ (by default, except newline)
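
We can already try these predefined classes with Kotlin's toRegex() (covered in detail below). The following REPL-style lines are illustrative, not part of the original session:

"""\d""".toRegex().matches("7")    // true: a single digit
"""\w""".toRegex().matches("_")    // true: underscore is a word character
"""\s""".toRegex().matches(" ")    // true: a space is whitespace
".".toRegex().matches("\n")        // false: . does not match newline by default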

There are also boundary classes. They do not describe single characters, but rather positions (boundaries) within strings.

  • ^: start of a string. This is not to be confused with negating a character class.
  • $: end of a string
  • \b: word boundary
  • \B: a non-word boundary
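
For example, \b lets a pattern match whole words only. A quick illustrative check (not from the original session):

"""\bcat\b""".toRegex().containsMatchIn("the cat sat")    // true
"""\bcat\b""".toRegex().containsMatchIn("concatenate")    // false: no word boundary around "cat"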

Regular expressions for longer strings

We can define patterns for longer (and more) strings by building up more complex regular expressions.

Building blocks

Any character class is a pattern in the language $\mathrm{Regex}$.

Alternation

If $e_1\in\mathrm{Regex}$ and $e_2\in\mathrm{Regex}$, then $e_1 | e_2 \in \mathrm{Regex}$.

$$L(e_1|e_2) = L(e_1)\cup L(e_2)$$

Concatenation

If $e_1\in\mathrm{Regex}$ and $e_2\in\mathrm{Regex}$, then $e_1 e_2 \in \mathrm{Regex}$.

$$ L(e_1 e_2) = \{st : s\in L(e_1)\ \mathrm{and}\ t\in L(e_2)\}$$
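
For example, combining alternation with concatenation: $L((a|b)c) = L(a|b)\,L(c) = \{ac, bc\}$.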

Repetition

If $e\in\mathrm{Regex}$, then $e*\in\mathrm{Regex}$

$L(e*)$ consists of all concatenations of zero or more strings from $L(e)$:

$$L(e*) = \{s_1 s_2 \cdots s_n : n \ge 0,\ s_i \in L(e)\}$$

In particular, the empty string $\epsilon$ is always in $L(e*)$.

Repetition with one or more

If $e\in\mathrm{Regex}$, then $e+\in\mathrm{Regex}$

$L(e+)$ consists of all concatenations of one or more strings from $L(e)$; equivalently, $L(e+) = L(e\,e*)$.

Optional

If $e\in\mathrm{Regex}$, then $e?\in\mathrm{Regex}$

$L(e?)$ consists of zero or one string from $L(e)$:

$$L(e?) = L(e) \cup \{\epsilon\}$$

Examples

  • e1 = \d+
  • a number
  • e2 = e1(\.e1)?
  • a decimal number
  • e3 = (e2,)*
  • zero or more numbers, each followed by a comma
  • e4 = e3 e2
  • a comma-separated list with at least one number
  • e5 = (e4 \n)+
  • one or more lines of comma-separated numbers
  • we can consider using this regular expression to parse simple CSV files (see the sketch below)
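
We can assemble these patterns in Kotlin by string interpolation. A minimal sketch, where the names e1 through e5 mirror the list above:

val e1 = """\d+"""                  // a number
val e2 = """$e1(\.$e1)?"""          // a decimal number
val e3 = """($e2,)*"""              // zero or more numbers, each followed by a comma
val e4 = """$e3$e2"""               // at least one comma-separated number
val e5 = """($e4\n)+""".toRegex()   // one or more lines of numbers

e5.matches("1,2.5,3\n4,5\n")        // true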

Regular Expressions in Kotlin

We will cover how Kotlin handles regular expressions.

Kotlin can convert any string into a regular expression using toRegex().

In [1]:
"\\d"
Out[1]:
\d
In [2]:
"\\d".toRegex()
Out[2]:
\d
In [3]:
"\\d".toRegex().javaClass
Out[3]:
class kotlin.text.Regex
In [4]:
"""\d"""
Out[4]:
\d
In [5]:
"""\d""".toRegex()
Out[5]:
\d
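
Note that the raw (triple-quoted) string avoids the double escaping needed in an ordinary string literal; both forms denote the same pattern. An illustrative check:

"\\d" == """\d"""    // true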

We can match strings against patterns given as regular expressions.

In [6]:
val numberPattern = """\d+(\.\d+)?""".toRegex()
In [7]:
numberPattern.matches("Hello")
Out[7]:
false
In [8]:
numberPattern.matches("3.1415")
Out[8]:
true
In [10]:
// matches is an infix method
numberPattern matches "100"
Out[10]:
true

We can also extract substrings from strings using patterns.

In [11]:
numberPattern.find("I will turn 44 in 2021") == null
Out[11]:
false
In [12]:
numberPattern.find("My name is Ken") == null
Out[12]:
true
In [14]:
numberPattern
    .find("I will turn 44 in 2021")
    ?.groups
    ?.get(0)
    ?.value
Out[14]:
44
In [15]:
numberPattern
    .findAll("I will turn 44 in 2021")
    .map({
        match -> match.groups.get(0)?.value
    }).toList()
Out[15]:
[44, 2021]
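
Since group 0 is always the whole match, MatchResult.value gives the same result a little more directly. A sketch:

numberPattern
    .findAll("I will turn 44 in 2021")
    .map { it.value }
    .toList()    // [44, 2021]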

For the purpose of building a lexical analyzer, we will want to limit regular expression matching to the start of the string.

This is handled by the ^ special boundary.

In [16]:
val text = "let x = \"let x = 123\""
println(text)
let x = "let x = 123"
In [21]:
val pattern = "^\\w+".toRegex()
val matcher = pattern.find(text)
matcher?.value
Out[21]:
let
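
To see the effect of the anchor, compare with the unanchored pattern. These lines are illustrative, not part of the original session:

"\\w+".toRegex().find("  let x")?.value     // let: the match may start anywhere
"^\\w+".toRegex().find("  let x")?.value    // null: no word character at the start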

What's next?

We will build a functional lexical analyzer using regular expressions.

3 Regexp Analyzer

Regular Expression Based Lexical Analyzer

A small program

Let's imagine a small program: a programmable calculator.

In [1]:
val source : String = 
    """
    constant pi = 3.1415;
    var radius = 10.4;
    var area = pi * square(radius);

    function square(x) {
      return x * x;
    }
    """.trimIndent()
print(source)
constant pi = 3.1415;
var radius = 10.4;
var area = pi * square(radius);

function square(x) {
  return x * x;
}

Let's define:

  1. TokenTypes: an enum of all possible token types.

  2. Token: a data class holding the token type and its lexeme.

In [2]:
enum class TokenTypes {
    Keyword,
    Identifier,
    Number,
    OpenParen,
    CloseParen,
    OpenBrace,
    CloseBrace,
    Semicolon,
    Operator,
    Whitespace,
    Eq,
}
In [3]:
data class Token(
    val t: TokenTypes,
    val lexeme: String,
) {
    override fun toString(): String {
        return "\"$lexeme\": $t"
    }
}
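
A quick illustrative sanity check of the toString format:

Token(TokenTypes.Number, "3.14")    // "3.14": Number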

For each token type, we also define a regular expression. Note that we add a ^ so that each pattern only matches at the start of the string.

In [9]:
val patterns = mapOf(
    TokenTypes.Keyword    to "^(constant|var|function)".toRegex(),
    TokenTypes.Identifier to "^[a-z]\\w*".toRegex(),
    TokenTypes.Number     to "^\\d+(\\.\\d+)?".toRegex(),
    TokenTypes.OpenParen  to "^\\(".toRegex(),
    TokenTypes.CloseParen to "^\\)".toRegex(),
    TokenTypes.OpenBrace  to "^\\{".toRegex(),
    TokenTypes.CloseBrace to "^\\}".toRegex(),
    TokenTypes.Semicolon  to "^;".toRegex(),
    TokenTypes.Operator   to "^[*/+\\-]".toRegex(),
    TokenTypes.Whitespace to "^[ \\t\\n\\r]".toRegex(),
    TokenTypes.Eq         to "^=".toRegex(),
)
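
One subtlety: mapOf preserves insertion order, and the lexing function below tries the patterns in that order, so Keyword must come before Identifier; otherwise var would be tokenized as an Identifier. An illustrative check (not from the original session):

patterns[TokenTypes.Keyword]!!.find("var x")?.value       // var
patterns[TokenTypes.Identifier]!!.find("var x")?.value    // var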

Let's build a function that will perform the lexical analysis.

In [10]:
fun getToken(source: String): Token? {
    for((t, p) in patterns) {
        val match = p.find(source)
        if(match != null) {
            return Token(t, match.value)
        }
    }
    return null
}
In [11]:
getToken(source)
Out[11]:
"constant": Keyword
In [12]:
fun lexer(input: String): List<Token> {
    var source = input
    val tokens = mutableListOf<Token>()
    var i = 0
    while(source.isNotEmpty()) {
        val t = getToken(source)
        if(t != null) {
            println("[$i]: $t")
            tokens.add(t)
            // consume the matched lexeme and continue with the rest of the source
            source = source.substring(t.lexeme.length)
        } else {
            throw Exception("Cannot find suitable rule for\n${source}")
        }
        i += 1
    }
    return tokens
}

Now, we can make use of the lexer.

In [47]:
val allTokens: List<Token>? = try {
    lexer(source)    
} catch(e: Exception) {
    println(e)
    null
}
[0]: "constant": Keyword
[1]: " ": Whitespace
[2]: "pi": Identifier
[3]: " ": Whitespace
[4]: "=": Eq
[5]: " ": Whitespace
[6]: "3.1415": Number
[7]: ";": Semicolon
[8]: "
": Whitespace
[9]: "var": Keyword
[10]: " ": Whitespace
[11]: "radius": Identifier
[12]: " ": Whitespace
[13]: "=": Eq
[14]: " ": Whitespace
[15]: "10.4": Number
[16]: ";": Semicolon
[17]: "
": Whitespace
[18]: "var": Keyword
[19]: " ": Whitespace
[20]: "area": Identifier
[21]: " ": Whitespace
[22]: "=": Eq
[23]: " ": Whitespace
[24]: "pi": Identifier
[25]: " ": Whitespace
[26]: "*": Operator
[27]: " ": Whitespace
[28]: "square": Identifier
[29]: "(": OpenParen
[30]: "radius": Identifier
[31]: ")": CloseParen
[32]: ";": Semicolon
[33]: "
": Whitespace
[34]: "
": Whitespace
[35]: "function": Keyword
[36]: " ": Whitespace
[37]: "square": Identifier
[38]: "(": OpenParen
[39]: "x": Identifier
[40]: ")": CloseParen
[41]: " ": Whitespace
[42]: "{": OpenBrace
[43]: "
": Whitespace
[44]: " ": Whitespace
[45]: " ": Whitespace
[46]: "return": Identifier
[47]: " ": Whitespace
[48]: "x": Identifier
[49]: " ": Whitespace
[50]: "*": Operator
[51]: " ": Whitespace
[52]: "x": Identifier
[53]: ";": Semicolon
[54]: "
": Whitespace
[55]: "}": CloseBrace

Having the program as tokens provides a great deal of flexibility in code analysis.

  1. We can remove all whitespace tokens, since our program is insensitive to whitespace.

  2. We can check how many variable assignments there are.

In [50]:
//
// We are using List.filter with a lambda
// to remove all the whitespace tokens.
// allTokens is nullable, so we first apply orEmpty().
//
val tokens = allTokens.orEmpty().filter({
    it.t != TokenTypes.Whitespace
})

tokens.forEach(::println)
"constant": Keyword
"pi": Identifier
"=": Eq
"3.1415": Number
";": Semicolon
"var": Keyword
"radius": Identifier
"=": Eq
"10.4": Number
";": Semicolon
"var": Keyword
"area": Identifier
"=": Eq
"pi": Identifier
"*": Operator
"square": Identifier
"(": OpenParen
"radius": Identifier
")": CloseParen
";": Semicolon
"function": Keyword
"square": Identifier
"(": OpenParen
"x": Identifier
")": CloseParen
"{": OpenBrace
"return": Identifier
"x": Identifier
"*": Operator
"x": Identifier
";": Semicolon
"}": CloseBrace
In [52]:
tokens.mapIndexed {
    i, token -> 
    if(token.t == TokenTypes.Eq) {
        i
    } else {
        null
    }
}.filterNotNull().map { 
    // the token immediately before '=' is the assignment target
    tokens[it - 1]
}.forEachIndexed { 
    i, t -> println("[$i] $t")
}
[0] "pi": Identifier
[1] "radius": Identifier
[2] "area": Identifier

Conclusion

Regular expressions can be used to perform tokenization.

  1. The language designer needs to figure out the token types and their respective patterns as regular expressions.

  2. The lexical analyzer itself is then just boilerplate code.

What's next?

We will explore how ANTLR can be used to generate lexers from a lexer grammar.