A pattern is a string that describes a set of one or more strings.
So, each pattern represents a language.
Regular expressions form a special language of patterns.
Each single letter is a valid regular expression:
a
b
c
Their languages are:
$L(a) = \{a\}$, $L(b) = \{b\}$, $L(c) = \{c\}$, etc.
We can describe sets of single characters using character classes.
Example:
[abc]
is a character class, with
$$L(\mathrm{[abc]}) = L(a)\cup L(b)\cup L(c) = \{a, b, c\}$$
[a-z]
is a character class where
$$ L(\mathrm{[a-z]}) = \{a, b, c, \dots, z\} $$
[^a-z]
is a character class formed by negation.
$$L(\mathrm{[\hat{} a-z]}) = \Sigma - L(\mathrm{[a-z]})$$
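These character classes can be tried with Kotlin's `Regex` type, which is introduced in more detail later in these notes; a minimal sketch:

```kotlin
// Character classes as Kotlin regular expressions.
// Regex.matches(input) tests whether the whole input is in the language.
val abc = Regex("[abc]")       // L([abc]) = {a, b, c}
val lower = Regex("[a-z]")     // L([a-z]) = {a, ..., z}
val notLower = Regex("[^a-z]") // complement of [a-z] within the alphabet

check(abc.matches("a"))
check(!abc.matches("d"))
check(lower.matches("q"))
check(notLower.matches("Q"))
check(!notLower.matches("q"))
```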
There are many predefined character classes.
\w
: any word character. This is equivalent to [a-zA-Z0-9_]
\W
: any non-word character. This is equivalent to [^\w]
\d
: a single digit, equivalent to [0-9]
\D
: a non-digit character
\s
: a whitespace character, including space, tab, newline, and carriage return
\S
: a non-whitespace character
.
: any character in $\Sigma$
There are also boundary classes. They do not describe single characters, but rather boundaries in strings.
^
: start of a string. This is not to be confused with negating a character class.
$
: end of a string
\b
: word boundary
\B
: a non-word boundary
We can define patterns of longer (and more) strings by building up more complex regular expressions.
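The predefined character and boundary classes above can be exercised in Kotlin (covered in detail later); a sketch using raw strings to avoid doubling backslashes:

```kotlin
// Predefined character classes (raw strings avoid double escaping).
check(Regex("""\w""").matches("x"))    // word character
check(Regex("""\w""").matches("_"))
check(!Regex("""\W""").matches("x"))   // non-word character
check(Regex("""\d""").matches("7"))    // digit
check(Regex("""\s""").matches(" "))    // whitespace
check(Regex(""".""").matches("?"))     // any character

// Boundary classes match positions, not characters.
check(Regex("""\bcat\b""").containsMatchIn("a cat sat"))
check(!Regex("""\bcat\b""").containsMatchIn("concatenate"))
```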
Building blocks
Any character class is a pattern in the language $\mathrm{Regex}$.
Alternation
If $e_1\in\mathrm{Regex}$ and $e_2\in\mathrm{Regex}$, then $e_1 | e_2 \in \mathrm{Regex}$.
$$L(e_1|e_2) = L(e_1)\cup L(e_2)$$
Concatenation
If $e_1\in\mathrm{Regex}$ and $e_2\in\mathrm{Regex}$, then $e_1 e_2 \in \mathrm{Regex}$.
$$ L(e_1 e_2) = \{st : s\in L(e_1)\ \mathrm{and}\ t\in L(e_2)\}$$
Repetition
If $e\in\mathrm{Regex}$, then $e*\in\mathrm{Regex}$
$L(e*)$ consists of all concatenations of zero or more strings from $L(e)$:
$$L(e*) = \{s_1 s_2 \cdots s_n : n\ge 0,\ s_i\in L(e)\}$$
Repetition with one or more
If $e\in\mathrm{Regex}$, then $e+\in\mathrm{Regex}$
$L(e+)$ consists of all concatenations of one or more strings from $L(e)$, so $L(e+) = L(e\,e*)$.
Optional
If $e\in\mathrm{Regex}$, then $e?\in\mathrm{Regex}$
$L(e?)$ consists of the strings of $L(e)$ together with the empty string: $L(e?) = L(e)\cup\{\epsilon\}$.
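A minimal Kotlin sketch of the four constructions (alternation, concatenation, repetition, and option):

```kotlin
val alt = Regex("cat|dog")   // alternation: L = {cat, dog}
val concat = Regex("ab")     // concatenation of a and b: L = {ab}
val star = Regex("(ab)*")    // zero or more repetitions of ab
val plus = Regex("(ab)+")    // one or more repetitions of ab
val opt = Regex("(ab)?")     // zero or one occurrence of ab

check(alt.matches("dog"))
check(concat.matches("ab"))
check(star.matches(""))      // zero repetitions: the empty string
check(star.matches("ababab"))
check(!plus.matches(""))     // + requires at least one repetition
check(opt.matches("ab") && opt.matches(""))
```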
e1 = \d+
one or more digits
e2 = e1 (\. e1)?
a number with an optional fractional part
e3 = (e2 \,)*
zero or more numbers, each followed by a comma
e4 = e3 e2
a comma-separated list of numbers, ending without a comma
e5 = (e4 '\n')+
one or more lines, each containing a comma-separated list of numbers
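These patterns can be assembled in Kotlin with string templates; a sketch in which the names e1 through e5 mirror the definitions above:

```kotlin
// Build larger patterns from smaller ones via string interpolation.
val e1 = """\d+"""            // one or more digits
val e2 = """$e1(\.$e1)?"""    // a number with an optional fractional part
val e3 = """($e2,)*"""        // zero or more numbers, each followed by a comma
val e4 = """$e3$e2"""         // a comma-separated list of numbers
val e5 = """($e4\n)+"""       // one or more lines of such lists

check(Regex(e2).matches("3.1415"))
check(Regex(e4).matches("1,2.5,3"))
check(Regex(e5).matches("1,2\n3.5\n"))
```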
We will cover how Kotlin handles regular expressions.
Kotlin can convert any string to a regular expression.
"\\d"
"\\d".toRegex()
"\\d".toRegex().javaClass
"""\d"""
"""\d""".toRegex()
We can match strings to patterns given in regular expressions.
val numberPattern = """\d+(\.\d+)?""".toRegex()
numberPattern.matches("Hello")
numberPattern.matches("3.1415")
// matches is an infix method
numberPattern matches "100"
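Note that `matches` tests the entire string, not just a substring. A self-contained sketch, repeating the pattern definition from above:

```kotlin
val numberPattern = """\d+(\.\d+)?""".toRegex()

check(!numberPattern.matches("Hello"))    // no digits at all
check(numberPattern.matches("3.1415"))    // the whole string is a number
check(numberPattern matches "100")        // infix form
check(!numberPattern.matches("turn 44"))  // extra text: whole-string match fails
```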
We can also extract substrings from strings using patterns.
numberPattern.find("I will turn 44 in 2021") == null
numberPattern.find("My name is Ken") == null
numberPattern
    .find("I will turn 44 in 2021")
    ?.groups
    ?.get(0)
    ?.value
numberPattern
    .findAll("I will turn 44 in 2021")
    .map { match -> match.groups[0]?.value }
    .toList()
For the purpose of a lexical analyzer, we will want to limit regular expression matching to only the start of the string.
This is handled by the ^
special boundary.
val text = "let x = \"let x = 123\""
println(text)
val pattern = "^\\w+".toRegex()
val matcher = pattern.find(text)
matcher?.value
We will build a functional lexical analyzer using regular expressions.
Let's imagine a small program. It's a programmable calculator.
val source : String =
    """
    constant pi = 3.1415;
    var radius = 10.4;
    var area = pi * square(radius);
    function square(x) {
        return x * x;
    }
    """.trimIndent()
print(source)
Let's define:
TokenTypes: an enum of all possible token types.
Token: a class that has the token type and its lexeme.
enum class TokenTypes {
    Keyword,
    Identifier,
    Number,
    OpenParen,
    CloseParen,
    OpenBrace,
    CloseBrace,
    Semicolon,
    Operator,
    Whitespace,
    Eq,
}
data class Token(
    val t: TokenTypes,
    val lexeme: String,
) {
    override fun toString(): String {
        return "\"$lexeme\": $t"
    }
}
For each token type, we will also define a regular expression. Note that we add a ^
to each pattern to make sure it only matches from the start of the string.
val patterns = mapOf(
    TokenTypes.Keyword to "^(constant|var|function)".toRegex(),
    TokenTypes.Identifier to "^[a-z]\\w*".toRegex(),
    TokenTypes.Number to "^\\d+(\\.\\d+)?".toRegex(),
    TokenTypes.OpenParen to "^\\(".toRegex(),
    TokenTypes.CloseParen to "^\\)".toRegex(),
    TokenTypes.OpenBrace to "^\\{".toRegex(),
    TokenTypes.CloseBrace to "^\\}".toRegex(),
    TokenTypes.Semicolon to "^;".toRegex(),
    TokenTypes.Operator to "^[*/+\\-]".toRegex(),
    TokenTypes.Whitespace to "^[ \\t\\n\\r]".toRegex(),
    TokenTypes.Eq to "^=".toRegex(),
)
Let's build a function that will perform the lexical analysis.
fun getToken(source: String): Token? {
    for((t, p) in patterns) {
        val match = p.find(source)
        if(match != null) {
            return Token(t, match.value)
        }
    }
    return null
}
getToken(source)
fun lexer(input: String): List<Token> {
    var source = input
    val tokens = mutableListOf<Token>()
    var i = 0
    while(source.length > 0) {
        val t = getToken(source)
        if(t != null) {
            println("[$i]: $t")
            tokens.add(t)
            source = source.substring(t.lexeme.length)
        } else {
            throw Exception("Cannot find suitable rule for\n${source}")
        }
        i += 1
    }
    return tokens
}
Now, we can make use of the lexer.
val allTokens: List<Token>? = try {
    lexer(source)
} catch(e: Exception) {
    println(e)
    null
}
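For reference, the whole pipeline can be condensed into one self-contained sketch with a reduced token set. The names `Kind`, `Tok`, `rules`, and `lex` are illustrative, not part of the code above:

```kotlin
// A condensed lexer: token kinds, one anchored pattern per kind,
// and a loop that repeatedly takes the first matching rule.
enum class Kind { Keyword, Identifier, Number, Eq, Semicolon, Whitespace }

data class Tok(val kind: Kind, val lexeme: String)

// Order matters: Keyword must be tried before Identifier.
val rules = mapOf(
    Kind.Keyword to "^(constant|var|function)".toRegex(),
    Kind.Identifier to "^[a-z]\\w*".toRegex(),
    Kind.Number to "^\\d+(\\.\\d+)?".toRegex(),
    Kind.Eq to "^=".toRegex(),
    Kind.Semicolon to "^;".toRegex(),
    Kind.Whitespace to "^\\s+".toRegex(),
)

fun lex(input: String): List<Tok> {
    var rest = input
    val out = mutableListOf<Tok>()
    while (rest.isNotEmpty()) {
        // firstNotNullOf throws if no rule matches (Kotlin 1.5+).
        val (kind, m) = rules.firstNotNullOf { (k, p) ->
            p.find(rest)?.let { k to it }
        }
        out.add(Tok(kind, m.value))
        rest = rest.substring(m.value.length)
    }
    return out
}

val toks = lex("var pi = 3.14;").filter { it.kind != Kind.Whitespace }
check(toks.map { it.kind } ==
    listOf(Kind.Keyword, Kind.Identifier, Kind.Eq, Kind.Number, Kind.Semicolon))
```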
Having the program as tokens provides a great deal of flexibility in code analysis.
We can remove all whitespace tokens, since our program is insensitive to whitespace.
We can also check how many variable assignments there are.
//
// We are using List.filter with
// a lambda expression
// to remove all the whitespace tokens
//
val tokens = allTokens.orEmpty().filter {
    it.t != TokenTypes.Whitespace
}
tokens.forEach(::println)
// Find the index of every Eq token, then look up the token
// immediately before it: these are the assignment targets.
tokens.mapIndexed {
    i, token ->
    if(token.t == TokenTypes.Eq) {
        i
    } else {
        null
    }
}.filterNotNull().map {
    tokens[it - 1]
}.forEachIndexed {
    i, t -> println("[$i] $t")
}
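The mapIndexed / filterNotNull pair above can also be fused into mapIndexedNotNull; a self-contained sketch, where `Tk` and `tks` are illustrative stand-ins for the token list:

```kotlin
data class Tk(val t: String, val lexeme: String)

// Token stream for: var radius = 10.4; var area = pi;
val tks = listOf(
    Tk("Keyword", "var"), Tk("Identifier", "radius"), Tk("Eq", "="),
    Tk("Number", "10.4"), Tk("Semicolon", ";"),
    Tk("Keyword", "var"), Tk("Identifier", "area"), Tk("Eq", "="),
    Tk("Identifier", "pi"), Tk("Semicolon", ";"),
)

// Index every Eq token, then take the token just before it:
// these are the assignment targets.
val targets = tks
    .mapIndexedNotNull { i, tk -> if (tk.t == "Eq") i else null }
    .map { tks[it - 1].lexeme }

check(targets == listOf("radius", "area"))
```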
Regular expressions can be used to perform tokenization.
The language designer needs to figure out the token types and their respective patterns as regular expressions.
The lexical analyzer itself is just boilerplate code.
We will explore how ANTLR can be used to generate lexers from a lexer grammar.