Treetop - I'd rather have one problem

Some people, when faced with a problem, think, “I know, I’ll use regular expressions”.

Now they have two problems.

I’d rather have one problem.

Treetop • Roland Swingler • LRUG May 2009

Tuesday, 19 May 2009

This quotation is used a lot in presentations, normally before the presenter delves into some gnarly regexps. I’m looking for a better way.

Example 1

I run a film listing site: http://filmli.st. All the data is scraped from other sites - getting the data is easy with net/http or httparty or similar and then parsing the html with nokogiri or hpricot, but...

... you still need to turn a text string like this into a list of Times so you can do interesting things with it. Regexps? No. That way lies madness.

Example 2

Chatroom bots need to be able to distinguish between messages that they should take actions on and those which they should ignore. How should we define what messages they should listen out for?

/^\s*whereis\s+(.+?)(?:\s+(?:on\s+)?(.+?))?\s*$/

Regular expressions? Pretty confusing.
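To make the confusion concrete, here is a small sketch that runs the regexp from the slide against a few messages. The sample messages and the helper name whereis_captures are made up for illustration; only the pattern itself comes from the slide.

```ruby
# The pattern from the slide, unchanged.
WHEREIS = /^\s*whereis\s+(.+?)(?:\s+(?:on\s+)?(.+?))?\s*$/

# Hypothetical helper: returns [person, day] or nil if the message is ignored.
def whereis_captures(message)
  match = WHEREIS.match(message)
  match && match.captures
end

[
  "whereis roland",
  "whereis roland on tuesday",
  "whereis roland smith on tuesday",
  "hello everyone"
].each do |message|
  puts "#{message.inspect} => #{whereis_captures(message).inspect}"
end
```

The third message is the interesting one: because the first group is lazy, the person comes back as "roland" and the day as "smith on tuesday", which is probably not what the bot author intended and is hard to see from the pattern alone.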

whereis <person> [[on] <day>]

Much nicer to have a simpler language.

Example 3

Scenario: producing human-readable tests
  Given I have non-technical stakeholders
  When I write some integration tests
  Then they should be understandable by everyone

Wouldn’t it be great if someone had written a library like this?

They have! Cucumber. Cucumber’s implementation got me started looking into...

Treetop. A Ruby library for Parsing Expression Grammars. Basically a parser generator, but really simple.

What is a parser?

A parser determines whether strings are syntactically valid according to a set of rules known as a grammar.

Yes / No

From a theoretical viewpoint, parsers just say true or false, depending on whether the string is valid or not.

Syntax Tree

A bare yes or no is not so useful, so instead we get back a syntax tree we can do useful things with.

whereis <person> [on <day>]

Let's try building a tree for this example. You can consider a string to be a list of characters, but to start getting meaning from it, you need a tree.

We have some words...

variables...

an optional part of an expression (enclosed with square brackets)

and a root node for the whole expression

expression
  words
  variable
  optional part
    words
    variable

grammar Message
end

Let's build that up in Treetop. Each of those four types of node in the tree is going to have a rule. We write these rules in a grammar, which you can think of like a Ruby module.

grammar Message
  rule expression
    (words / variable / optional_part)+
  end
end

The first rule for the whole expression. Lots of things should be familiar from regular expressions - ‘+’ for one or more, brackets for grouping, and ‘/’ is like the regexp ‘|’ for alternation. So this says an expression is one or more words, variables or optional parts, in any order.

grammar Message
  rule expression
    (words / variable / optional_part)+
  end

  rule words
    [^><\[\]]+
  end
end

words - character classes, just like regexps

  rule words
    [^><\[\]]+
  end

  rule variable
    '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9 ]* ) '>'
  end
end

variables are enclosed with angle brackets, can be any valid ruby identifier string, and are labeled so we can use part of the text later.

  rule words
    [^><\[\]]+
  end

  rule variable
    '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9 ]* ) '>'
  end

  rule optional_part
    "[" expression "]"
  end
end

optional parts are enclosed with square brackets. Here we see that rules can be recursive - which makes the parser significantly more powerful than regular expressions.
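To show what the recursion buys us, here is a hand-written recognizer that mirrors the four rules using only Ruby's standard library. It is a sketch for illustration, not what Treetop generates: the class name MessageRecognizer and the nested-array tree shape are assumptions of this example.

```ruby
require 'strscan'

# Hand-rolled equivalent of the Message grammar. Returns a nested array
# tree, or nil if the whole string does not match.
class MessageRecognizer
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  def parse
    tree = expression
    tree if tree && @scanner.eos?
  end

  private

  # expression <- (words / variable / optional_part)+
  def expression
    nodes = []
    while (node = words || variable || optional_part)
      nodes << node
    end
    nodes.empty? ? nil : [:expression, *nodes]
  end

  # words <- [^><\[\]]+
  def words
    text = @scanner.scan(/[^><\[\]]+/)
    text && [:words, text]
  end

  # variable <- '<' identifier '>'
  def variable
    text = @scanner.scan(/<[a-zA-Z_][a-zA-Z_0-9 ]*>/)
    text && [:variable, text[1..-2]]
  end

  # optional_part <- "[" expression "]"  (calls back into expression)
  def optional_part
    start = @scanner.pos
    if @scanner.scan(/\[/) && (inner = expression) && @scanner.scan(/\]/)
      [:optional_part, inner]
    else
      @scanner.pos = start # backtrack so another alternative can be tried
      nil
    end
  end
end

p MessageRecognizer.new("whereis <person> [[on] <day>]").parse
p MessageRecognizer.new("whereis [<person>").parse
```

The second call returns nil because the bracket is never closed. Matching brackets nested to any depth is exactly the kind of thing the recursive rule handles and a classic regular expression cannot.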

$ tt message.treetop

We compile the grammar with the tt command-line tool - you can also load grammars dynamically.

require 'message'

parser = MessageParser.new
tree = parser.parse("whereis <person>...")

this gives us a parser we can call from ruby code

require 'message'

parser = MessageParser.new
tree = parser.parse("whereis <person>...")

tree.elements[0].text_value            #=> "whereis "

tree.elements[1].identifier.text_value #=> "person"

each node knows about its children and its text_value. The label we defined earlier provides sugar methods to access particular subnodes.

Fri/Sun-Tue 4.00 7.00

Another example. This time we’ll think about the tree in a top down fashion rather than bottom up. This is closer to how treetop will actually evaluate an expression.

expression
  days
    day: Fri
    day range: Sun-Tue
      day: Sun
      day: Tue
  times
    time: 4.00
      hrs: 4
      mins: 00
    time: 7.00
      hrs: 7
      mins: 00

rule expression
  days " " times
end

rule times
  time (" " time)+
end

rule time
  hours "." minutes
end

rule hours
  "1" [0-2] / [0-9]
end

rule minutes
  [0-5] [0-9]
end

rule days
  (day !"-" / day_range) ("/" days)?
end

rule day_range
  day "-" day
end

rule day
  "Mon" / "Tue" / "Wed" / "Thu" / "Fri" / "Sat" / "Sun"
end

The !"-" after day in the days rule (highlighted in red on the slide) is a negative lookahead assertion. We need this because Treetop evaluates alternatives from left to right and takes the first one that matches - if we didn’t have the assertion then Sun-Tue would match Sun as a Day, not a DayRange, and we’d be left with “-Tue” which isn’t valid.
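The effect of the lookahead can be seen in a few lines of plain Ruby. This is a hand-written sketch of the days rule only, with ordered choice and backtracking done by hand; the method names and the lookahead: switch are inventions of this example, not part of Treetop.

```ruby
require 'strscan'

DAY = /Mon|Tue|Wed|Thu|Fri|Sat|Sun/

# day !"-" : a day that is NOT followed by a hyphen (when lookahead is on).
def single_day(scanner, lookahead)
  start = scanner.pos
  day = scanner.scan(DAY)
  return nil unless day
  if lookahead && scanner.check(/-/) # peek without consuming
    scanner.pos = start
    return nil
  end
  day
end

# day_range <- day "-" day
def day_range(scanner)
  start = scanner.pos
  from = scanner.scan(DAY)
  if from && scanner.scan(/-/) && (to = scanner.scan(DAY))
    "#{from}-#{to}"
  else
    scanner.pos = start
    nil
  end
end

# days <- (day !"-" / day_range) ("/" days)?
def days(scanner, lookahead)
  first = single_day(scanner, lookahead) || day_range(scanner) # ordered choice
  return nil unless first
  before_rest = scanner.pos
  if scanner.scan(%r{/})
    rest = days(scanner, lookahead)
    return [first, *rest] if rest
    scanner.pos = before_rest
  end
  [first]
end

# The whole input must be consumed for the parse to count.
def parse_days(text, lookahead: true)
  scanner = StringScanner.new(text)
  result = days(scanner, lookahead)
  result if result && scanner.eos?
end

p parse_days("Fri/Sun-Tue")                   # ["Fri", "Sun-Tue"]
p parse_days("Fri/Sun-Tue", lookahead: false) # nil
```

Without the lookahead the first alternative happily accepts Sun, the choice is never revisited, and the leftover "-Tue" makes the whole parse fail.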

Enriching Nodes

Adding in some semantics

rule time
  hours "." minutes
end

irb> aTimeNode.text_value       #=> "9.00"
irb> aTimeNode.elements.size    #=> 3
irb> aTimeNode.hours.text_value #=> "9"

rule time
  hours "." minutes {
    def to_seconds
      hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
    end
  }
end

irb> aTimeNode.text_value #=> "9.00"
irb> aTimeNode.to_seconds #=> 32400

We can add in methods inline in the grammar. This is just like a module scope, and we can do any ruby we like in here.

# in film_time.treetop
rule time
  hours "." minutes <TimeNode>
end

# in another .rb file
class TimeNode < Treetop::Runtime::SyntaxNode
  def to_seconds
    hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
  end
end

To my mind it is cleaner to split these out into actual subclasses of SyntaxNode, which keeps the grammar more readable. In some cases you need to use modules rather than subclasses.
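As a sketch of the module variant, the same method can live in a plain Ruby module that is named in the grammar in place of the class. To keep the example runnable without Treetop, FakeNode and FakeTime below are stand-ins invented here for real syntax nodes; the module name TimeSemantics is also an assumption.

```ruby
# The semantics, written as a mixin rather than a SyntaxNode subclass.
# It only assumes the node responds to hours and minutes, and that each
# of those responds to text_value.
module TimeSemantics
  def to_seconds
    hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
  end
end

# Stand-ins for syntax nodes, for illustration only.
FakeNode = Struct.new(:text_value)
FakeTime = Struct.new(:hours, :minutes) do
  include TimeSemantics
end

nine = FakeTime.new(FakeNode.new("9"), FakeNode.new("00"))
puts nine.to_seconds # 9 * 60 * 60 = 32400
```

A side benefit of the module form is that the arithmetic can be tested on its own like this, without building a parser first.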

Interpretation & Compilation

We’re going to build up a regular expression for the bot example. Each node will be responsible for building a different part of the regexp.
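One way this could look is sketched below. This is not the talk's implementation: the Struct classes stand in for Treetop nodes, the method name to_regexp_source is hypothetical, and the choice to turn each variable into a lazy named capture is an assumption of this example.

```ruby
# Each node type knows how to emit its own fragment of the regexp.
Words = Struct.new(:text) do
  def to_regexp_source
    Regexp.escape(text) # literal text, with special characters escaped
  end
end

Variable = Struct.new(:name) do
  def to_regexp_source
    "(?<#{name}>.+?)" # named capture so the bot can read the value back
  end
end

OptionalPart = Struct.new(:expression) do
  def to_regexp_source
    "(?:#{expression.to_regexp_source})?" # wrap the nested expression
  end
end

Expression = Struct.new(:children) do
  def to_regexp_source
    children.map(&:to_regexp_source).join
  end

  def to_regexp
    Regexp.new("\\A#{to_regexp_source}\\z")
  end
end

# Hand-built tree for the message "whereis <person>[ on <day>]".
tree = Expression.new([
  Words.new("whereis "),
  Variable.new("person"),
  OptionalPart.new(Expression.new([Words.new(" on "), Variable.new("day")]))
])

regexp = tree.to_regexp
match = regexp.match("whereis roland on tuesday")
puts match[:person] # roland
puts match[:day]    # tuesday
```

The root node only joins what its children produce, so supporting a new kind of node means adding one class with one method.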