Deep Dive with Leex and Yecc - Part 2
This is the second post in a series about using Erlang’s leex and yecc in Elixir. In part one, we introduced dice rolling calculations and the need for a DSL to more easily translate those systems into functional code. We looked at various arithmetic notation systems to learn how we should structure our DSL, and ended up settling on the one most familiar to the world at large - infix notation.
Today, we take our next step in the journey and learn how to use leex to tokenize our DSL.
Various dice in a 3-D printed dice tower.
Tokenizing with Leex
In the last post, we looked at how we convert a word problem, “Calculate the value of five minus six times negative seven plus three,” into notation by identifying and writing down each symbol, then applying the notation’s grammar to it. Computers need to do the same, and they need instructions on how to do it. The first step in the process, identifying symbols, is called lexical analysis, and more specifically for us, tokenization.
Converting the string “1+2-3” into tokens.
Writing your own lexical analyzer can be painful, especially when you take into account all the edge cases and potential unknowns. That’s why many people turn to libraries that can generate lexical analyzers, known as lexer generators. Two of the most established lexer generators are lex and flex. Erlang has its own lexer generator, called leex, which operates very similarly to lex and flex.
Generating Our First Lexer
Lexer generators have their own DSL that describes how text should be tokenized. Instead of describing how it all works, let’s generate one of our own and poke around:
- create a new elixir project named `dice_roller` via `mix new dice_roller`
- create a folder named `src` in the project’s root folder
- create a file named `dice_lexer.xrl` in the `src` folder and open it in your editor of choice
- type the following code into your editor and save your file:
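```erlang
%% src/dice_lexer.xrl
%% (a minimal lexer - the definition names here are one
%% reasonable choice, not the only one)

Definitions.

Number   = [0-9]+
Operator = [+]

Rules.

{Number}           : {token, {int, TokenLine, TokenChars}}.
{Number}\.{Number} : {token, {float, TokenLine, TokenChars}}.
{Operator}         : {token, {op, TokenLine, TokenChars}}.

Erlang code.
```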
Boom, you have generated your very own number lexer! Now to take it for a spin. Start iex with `iex -S mix run`. Next, type `"1" |> to_charlist() |> :dice_lexer.string()` and hit enter. You should see something similar to the following:
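```
$ iex -S mix run
Compiling 1 file (.xrl)
Compiling 1 file (.erl)
...

iex(1)> "1" |> to_charlist() |> :dice_lexer.string()
{:ok, [{:int, 1, '1'}], 1}
```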
It works! Notice the two lines talking about compiling the `.xrl` and `.erl` files? When you compile or run your project, Elixir compiles any Erlang files in the `src` folder. Also, remember how leex is a lexer generator? The `.xrl` file is compiled and generates the lexer’s Erlang (`.erl`) file. Finally, Elixir compiles the `.erl` file and loads it as we start our application.
Now to poke around and explore the lexer itself…
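A rough transcript of that exploration (prompt numbers and error output will vary):

```elixir
# integers, floats, and the + operator all tokenize
iex(2)> :dice_lexer.string('2.5')
{:ok, [{:float, 1, '2.5'}], 1}

iex(3)> :dice_lexer.string('1+2')
{:ok, [{:int, 1, '1'}, {:op, 1, '+'}, {:int, 1, '2'}], 1}

# binaries are another story - the generated module only
# understands charlists, so passing a string raises an error
iex(4)> :dice_lexer.string("1+2")

# other operators and spaces are rejected as illegal characters
iex(5)> :dice_lexer.string('1-2')
{:error, {1, :dice_lexer, {:illegal, '-'}}, 1}

iex(6)> :dice_lexer.string('1 + 2')
{:error, {1, :dice_lexer, {:illegal, ' '}}, 1}
```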
Interesting. So what did our brief exploration reveal?
- the leex file is compiled into a module with the same name as the file (i.e. `:dice_lexer`)
- we tokenize strings with our lexer by calling `:dice_lexer.string/1`
- the lexer succeeds with character lists but fails on strings
- the lexer can tokenize integers, floating point numbers, and the `+` operator
- the lexer cannot handle any other operators, nor spaces
- the lexer returns a tuple `{:ok, tokens, end_line_number}`
Breaking Down the Leex File
We’ve generated our first lexer, used it to tokenize numbers and the `+` sign, and explored some of the edges. Now let’s go back and see how it all fits together. We’ll start with the leex file, now annotated with comments (yours may read a little differently):
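```erlang
%% src/dice_lexer.xrl

%% Definitions generate macros, usually regular expressions,
%% that can be referenced by name in the rules below.
Definitions.

Number   = [0-9]+
Operator = [+]

%% Rules pair a regular expression with erlang code. When the
%% expression matches, the code on the right of the : runs and
%% produces a token. Note that every rule ends in a period.
Rules.

{Number}           : {token, {int, TokenLine, TokenChars}}.
{Number}\.{Number} : {token, {float, TokenLine, TokenChars}}.
{Operator}         : {token, {op, TokenLine, TokenChars}}.

%% Erlang code holds helper functions used by the rules. We have
%% none yet, but the section header itself is required.
Erlang code.
```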
First, there are three things to note about working with Erlang. Erlang function calls and statements end in a period (`.`); if you don’t use a period, your Erlang code won’t compile. Also note that Erlang specifies atoms the way Elixir specifies variables: in the above, `token`, `int`, `float`, and `op` would appear as `:token`, `:int`, `:float`, and `:op` in Elixir. Finally, Erlang variables begin with a capital letter, so the left hand side of each definition is a variable.
As noted in the comments above, each leex file is composed of three main sections:

- Definitions are used to generate macro definitions, whose values are often regular expressions. In our case, a regular expression to capture numbers.
- Rules are a set of regular expressions that, upon matching anywhere in the string, send the matched text to the Erlang code on the right side of the rule’s `:` separator and generate tokens.
- Erlang code contains any Erlang helper functions used by the rules in this leex file.
Definitions are relatively simple, with the variable on the left and the value (e.g. regular expressions and constants) on the right. We won’t spend any more time on those. Also, we aren’t using any custom Erlang code, so we’ll skip the last section. For now, we will focus on token rules.
Token Rules
Token rules are where the magic happens, so let’s dive in by breaking down the first rule into its pieces:
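```erlang
%% the integer rule
{Number} : {token, {int, TokenLine, TokenChars}}.
```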
- `{Number}`: references the `Number` definition, whose value was the regular expression `[0-9]+`
- `:`: separates the regular expression on the left from the Erlang code on the right
- `{token, {int, TokenLine, TokenChars}}`: the Erlang code that acts upon the matched value, in our case a number
The regular expression on the left side of each rule is relatively simple: you combine definitions and strings into regular expressions. In our case, the first rule looks for integers using `{Number}`. The second rule finds floating point values using the regular expression `{Number}\.{Number}`, while the third grabs any operators, which for now is only `+`. However, the right hand side of each rule has a bit more going on.
Token rules can only return the following values:

- `{token, Token}`: returns a token consisting of the text matched by the regular expression
- `{end_token, Token}`: returns a token, but also signals that this is the last token
- `skip_token`: skips the matched text entirely, which is great for filtering out spaces and more
- `{error, ErrString}`: returns an error string
When returning a token, leex provides three predefined variables:

- `TokenChars`: the characters matched by this rule
- `TokenLen`: the number of characters matched
- `TokenLine`: the line number of the matched characters
Armed with this knowledge, we can now investigate the right hand side of each rule.
- `{token, _}`: tells leex we are returning a token
- `int`, `float`, `op`: the atom that identifies the type of data in this token tuple
- `TokenLine`: the line number of this token
- `TokenChars`: the characters matched by this rule
Adding More Token Rules
To reinforce how leex rules work, we’ll add another operation to our DSL - dice rolling. When adding a rule to leex, the high-level steps are the same:
- add or change definitions to accommodate the new token
- create a token rule that tells the lexer how to tokenize the expression
- verify our rules by testing edge cases
Dice Rolling Operator
The first step in adding a new rule is deciding whether we need to add a new definition or change an existing one. Dice rolling is an operation, and it has its own operator (`d`). Looking over our existing definitions, we find:
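```erlang
Definitions.

Number   = [0-9]+
Operator = [+]
```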
Looks like we already have a definition for operators, and the regular expression used matches our idea of an operator (`+`). Therefore, we’ll change this existing definition by adding the dice rolling operator (`d`):
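```erlang
Operator = [+d]
```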
Easy enough. Now to take it for a spin.
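```elixir
iex(1)> :dice_lexer.string('1d6')
{:ok, [{:int, 1, '1'}, {:op, 1, 'd'}, {:int, 1, '6'}], 1}

iex(2)> :dice_lexer.string('2d4+1')
{:ok,
 [{:int, 1, '2'}, {:op, 1, 'd'}, {:int, 1, '4'}, {:op, 1, '+'},
  {:int, 1, '1'}], 1}

# spaces are still illegal characters
iex(3)> :dice_lexer.string('1d4 + 2')
{:error, {1, :dice_lexer, {:illegal, ' '}}, 1}
```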
No surprises here: the dice rolling operator (`d`) reacts the same as addition. However, the last expression we tried had spaces, and it failed. If we are going to make a DSL that’s familiar, we need to handle spaces.
Skipping Spaces
While spaces have no bearing on the outcome of a calculated equation, people often add them to their algebraic equations to improve readability. Compare the following examples of the same equation, one without spaces and one with:

`5+6d(4+3^5/(1d4))+(5+1d4)d(4d6)`

`5 + 6d(4+3^5 / (1d4)) + (5+1d4)d(4d6)`
If we are going to make a DSL that feels familiar, we’ll have to take spaces into account. Fortunately, leex provides an easy way to strip them out:
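```erlang
%% in the Rules section - match one or more whitespace characters
%% and throw them away (\s is leex's escape for a space)
[\s\t\n\r]+ : skip_token.
```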
Trying it out:
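```elixir
iex(1)> :dice_lexer.string('1d4 + 2')
{:ok,
 [{:int, 1, '1'}, {:op, 1, 'd'}, {:int, 1, '4'}, {:op, 1, '+'},
  {:int, 1, '2'}], 1}
```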
As mentioned earlier in the post, leex rules can return several values, including a special one named `skip_token`. This tells leex to drop the matched text and continue tokenizing the rest of the string.
Handling Parentheses
Next up are parentheses. Infix notation uses them to group expressions that should be calculated before others. This means we need to know the boundaries of a parenthetical expression, so we should have separate tokens for left (`(`) and right (`)`) parentheses:
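```erlang
%% in the Rules section - tokenize each parenthesis as itself
\( : {token, {'(', TokenLine}}.
\) : {token, {')', TokenLine}}.
```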
and trying it out:
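```elixir
iex(1)> :dice_lexer.string('(1d4)+2')
{:ok,
 [{:"(", 1}, {:int, 1, '1'}, {:op, 1, 'd'}, {:int, 1, '4'},
  {:")", 1}, {:op, 1, '+'}, {:int, 1, '2'}], 1}
```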
You’ll notice that we use the parenthetical characters themselves instead of another named atom. The reason for this will become more apparent when we work with yecc.
Adding Other Operators
The other operators we care about include multiplication (`*`), division (`/`), and subtraction (`-`). Instead of doing all the work here, I leave that up to the reader. If you get stuck, I’ve put up a github repo with the final leex file and an example module using it.
Up Next
Using leex, we can convert our dice roll symbols into actual tokens. We start by creating definitions of the basic elements. Next, we create rules that use our definitions to interpret one or more characters in a string and convert them into token tuples. Now that we have our token tuples, we need to apply a grammar to them.
In the third blog post of the series, we will investigate taking our tokens and applying the grammar of our DSL to them using Erlang’s yecc.
Post by: rishenko