Talk given at LRUG in May 2009 about Treetop, a Ruby parsing expression grammar library. It should hopefully convince you that parsers fit better than regular expressions in quite a few cases.
Some people, when faced with a problem, think, “I know, I’ll use regular expressions”.
Now they have two problems.
I’d rather have one problem.
Treetop • Roland Swingler • LRUG May 2009
Tuesday, 19 May 2009
This quotation is used a lot in presentations, normally before the presenter delves into some gnarly regexps. I’m looking for a better way.
Example 1
I run a film listing site: http://filmli.st. All the data is scraped from other sites. Getting the data is easy with net/http or httparty or similar, and so is parsing the HTML with nokogiri or hpricot, but...
<span>Fri/Sun-Tue 10.45 12.30 (Tue) 12.40 (not Tue) 4.00 7.00 9.30; Wed 3.00 7.30 9.00</span>
... you still need to turn a text string like this into a list of Times so you can do interesting things with it. Regexps? No. That way lies madness.
Example 2
Chatroom bots need to be able to distinguish between messages that they should take actions on and those which they should ignore. How should we define what messages they should listen out for?
/^\s*whereis\s+(.+?)(?:\s+(?:on\s+)?(.+?))?\s*$/
Regular expressions? Pretty confusing.
whereis <person> [[on] <day>]
Much nicer to have a simpler language.
Example 3
Scenario: producing human-readable tests
  Given I have non-technical stakeholders
  When I write some integration tests
  Then they should be understandable by everyone
Wouldn’t it be great if someone had written a library like this?
They have! Cucumber. Cucumber’s implementation got me started looking into...
Treetop: a Ruby Parsing Expression Grammar (PEG) library. Basically a parser generator, but really simple.
What is a parser?
A parser determines whether strings are syntactically valid according to a set of rules known as a grammar.
Yes / No
From a theoretical viewpoint, parsers just say true or false, depending on whether the string is valid or not.
Syntax Tree
Not so useful, so instead we get back a syntax tree we can do useful things with.
whereis <person> [on <day>]
Let’s try building a tree for this example. You can consider a string to be a list of characters, but to start getting meaning from it, you need a tree.
whereis <person> [on <day>]
words words
We have some words...
whereis <person> [on <day>]
words variable words variable
variables...
whereis <person> [on <day>]
words variable
optional part
words variable
an optional part of an expression (enclosed with square brackets)
whereis <person> [on <day>]
expression
  words
  variable
  optional part
    words
    variable
and a root node for the whole expression
grammar Message
end
Let’s build that up in Treetop. Each of those four types of node in the tree is going to have a rule. We write these rules in a grammar, which you can think of as being like a Ruby module.
grammar Message
  rule expression
    (words / variable / optional_part)+
  end
end
The first rule is for the whole expression. Lots of things should be familiar from regular expressions: ‘+’ for one or more, brackets for grouping, and ‘/’ is like the regexp ‘|’ for alternation. So this says an expression is one or more words, variables or optional parts, in any order.
grammar Message
  rule expression
    (words / variable / optional_part)+
  end

  rule words
    [^><\[\]]+
  end
end
words - character classes, just like regexps
grammar Message
  rule expression
    (words / variable / optional_part)+
  end

  rule words
    [^><\[\]]+
  end

  rule variable
    '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9 ]* ) '>'
  end
end
variables are enclosed with angle brackets, can be any valid ruby identifier string, and are labeled so we can use part of the text later.
grammar Message
  rule expression
    (words / variable / optional_part)+
  end

  rule words
    [^><\[\]]+
  end

  rule variable
    '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9 ]* ) '>'
  end

  rule optional_part
    "[" expression "]"
  end
end
optional parts are enclosed with square brackets. Here we see that rules can be recursive - which makes the parser significantly more powerful than regular expressions.
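To make the recursion concrete, here is a minimal hand-written sketch in plain Ruby that mirrors the four rules, using StringScanner from the standard library. It is not what Treetop generates; the class name MessageSketch and the nested-array tree shape are assumptions made purely for illustration.

```ruby
require 'strscan'

# Hand-written stand-in for the four Message rules (hypothetical, not Treetop output).
class MessageSketch
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  # Returns a nested-array tree, or nil unless the whole input matches.
  def parse
    tree = expression
    tree if tree && @scanner.eos?
  end

  private

  # expression <- (words / variable / optional_part)+
  def expression
    children = []
    while (child = words || variable || optional_part)
      children << child
    end
    children.empty? ? nil : [:expression, *children]
  end

  # words <- [^><\[\]]+
  def words
    text = @scanner.scan(/[^><\[\]]+/)
    text && [:words, text]
  end

  # variable <- '<' identifier '>'
  def variable
    return nil unless @scanner.scan(/<([a-zA-Z_][a-zA-Z_0-9 ]*)>/)

    [:variable, @scanner[1]]
  end

  # optional_part <- '[' expression ']'  (calls expression recursively)
  def optional_part
    start = @scanner.pos
    if @scanner.scan(/\[/) && (inner = expression) && @scanner.scan(/\]/)
      return [:optional_part, inner]
    end

    @scanner.pos = start # backtrack if any part of the sequence failed
    nil
  end
end

tree = MessageSketch.new('whereis <person> [on <day>]').parse
unbalanced = MessageSketch.new('whereis <person> [on <day>').parse
p tree
p unbalanced
```

Because optional_part calls back into expression, brackets can nest to any depth, and an unclosed bracket makes the whole parse fail rather than silently matching a prefix.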
$ tt message.treetop
We compile the grammar with the tt command-line tool; you can also load grammars dynamically.
require 'message'
parser = MessageParser.new
tree = parser.parse("whereis <person>...")
this gives us a parser we can call from ruby code
require 'message'
parser = MessageParser.new
tree = parser.parse("whereis <person>...")
tree.elements[0].text_value            #=> "whereis "
tree.elements[1].identifier.text_value #=> "person"
each node knows about its children and its text_value. The label we defined earlier provides sugar methods to access particular subnodes.
Fri/Sun-Tue 4.00 7.00
Another example. This time we’ll think about the tree in a top down fashion rather than bottom up. This is closer to how treetop will actually evaluate an expression.
Fri/Sun-Tue 4.00 7.00
expression
Fri/Sun-Tue 4.00 7.00
expression
days times
Fri/Sun-Tue 4.00 7.00
expression
  days
    day
    day range
  times
    time
    time
Fri/Sun-Tue 4.00 7.00
expression
  days
    day
    day range
      day
      day
  times
    time
      hrs
      mins
    time
      hrs
      mins
rule expression
  days " " times
end
rule times
  time (" " time)+
end

rule time
  hours "." minutes
end

rule hours
  "1" [0-2] / [0-9]
end

rule minutes
  [0-5] [0-9]
end
rule days
  (day !"-" / day_range) ("/" days)?
end

rule day_range
  day "-" day
end

rule day
  "Mon" / "Tue" / "Wed" / "Thu" / "Fri" / "Sat" / "Sun"
end
The !"-" after day in the days rule is a negative lookahead assertion. We need this because Treetop evaluates alternatives from left to right: without the assertion, Sun-Tue would match Sun as a day, not a day_range, and we’d be left with “-Tue”, which isn’t valid.
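The effect of the lookahead can be reproduced with a small hand-written sketch in plain Ruby. This uses StringScanner rather than Treetop, and the method names and the lookahead: switch are assumptions introduced only so that both behaviours can be compared side by side.

```ruby
require 'strscan'

DAY = /Mon|Tue|Wed|Thu|Fri|Sat|Sun/

# day <- "Mon" / "Tue" / ...
def day(scanner)
  scanner.scan(DAY)
end

# day_range <- day "-" day
def day_range(scanner)
  start = scanner.pos
  first = day(scanner)
  if first && scanner.scan(/-/) && (finish = day(scanner))
    return "#{first}-#{finish}"
  end

  scanner.pos = start # backtrack
  nil
end

# day !"-" : a single day that is not followed by a hyphen.
# With lookahead: false this degrades to a plain day match.
def single_day(scanner, lookahead: true)
  start = scanner.pos
  found = day(scanner)
  return found if found && !(lookahead && scanner.check(/-/))

  scanner.pos = start # backtrack
  nil
end

# days <- (day !"-" / day_range) ("/" days)?
# Ordered choice: the first alternative that succeeds wins.
def days(scanner, lookahead: true)
  first = single_day(scanner, lookahead: lookahead) || day_range(scanner)
  return nil unless first

  start = scanner.pos
  if scanner.scan(%r{/}) && (rest = days(scanner, lookahead: lookahead))
    return [first, *rest]
  end

  scanner.pos = start
  [first]
end

# Returns the list of days, or nil unless the whole string is consumed.
def parse_days(text, lookahead: true)
  scanner = StringScanner.new(text)
  result = days(scanner, lookahead: lookahead)
  result if scanner.eos?
end

p parse_days('Fri/Sun-Tue')
p parse_days('Fri/Sun-Tue', lookahead: false)
```

With the lookahead in place the single-day alternative declines Sun, so the range alternative gets its turn. Without it, Sun is accepted immediately and the leftover “-Tue” makes the overall parse fail.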
Enriching Nodes
Adding in some semantics
rule time
  hours "." minutes
end

irb> aTimeNode.text_value       #=> "9.00"
irb> aTimeNode.elements.size    #=> 3
irb> aTimeNode.hours.text_value #=> "9"
rule time
  hours "." minutes {
    def to_seconds
      hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
    end
  }
end

irb> aTimeNode.text_value #=> "9.00"
irb> aTimeNode.to_seconds #=> 32400
We can add in methods inline in the grammar. This is just like a module scope, and we can do any ruby we like in here.
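The arithmetic in to_seconds can be checked without a parser. The sketch below uses a plain Struct as a hypothetical stand-in for the time node; TimeSketch and from_text are illustrative names, not part of Treetop, and the hours and minutes members hold plain strings rather than syntax nodes.

```ruby
# Hypothetical stand-in for a parsed time node; a Treetop node would expose
# hours and minutes as child nodes rather than strings.
TimeSketch = Struct.new(:hours, :minutes) do
  # Split "9.00" into its hours and minutes text.
  def self.from_text(text)
    hours, minutes = text.split('.', 2)
    new(hours, minutes)
  end

  # Same calculation as the inline grammar method.
  def to_seconds
    hours.to_i * 60 * 60 + minutes.to_i * 60
  end
end

nine = TimeSketch.from_text('9.00')
half_twelve = TimeSketch.from_text('12.30')
puts nine.to_seconds        # 9 * 3600 = 32400
puts half_twelve.to_seconds # 12 * 3600 + 30 * 60 = 45000
```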
# in film_time.treetop
rule time
  hours "." minutes <TimeNode>
end

# in another .rb file
class TimeNode < Treetop::Runtime::SyntaxNode
  def to_seconds
    hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
  end
end
Cleaner in my mind to split these out into actual subclasses of SyntaxNode - keeps the grammar more readable. In some cases you need to have modules rather than subclasses.
Interpretation & Compilation
We’re going to build up a regular expression for the bot example. Each node will be responsible for building a different part of the regexp.
whereis <person> [on <day>]
/^whereis (.+?)(?:\s+on (.+?))?$/
optional part
words variable words variable
expression
Interpreter Pattern
The name is confusing: it comes from the GoF (Gang of Four) Interpreter pattern, but what we’re actually doing here is compilation. Each node gets an interpret method, and you treat the syntax tree as a composite.
# expression
def interpret
  children = elements.map { |node| node.interpret }
  Regexp.compile("^" + children.join + "$")
end
# words
def interpret
  Regexp.escape(text_value)
end
# variable
def interpret
  "(.+?)"
end
# optional_part
def interpret
  children = elements.map { |node| node.interpret }
  # single quotes, so that \s reaches the regexp instead of becoming a space
  '(?:\s+' + children.join + ')?'
end
Adding context
For anything more than a simple language, you’ll need to pass around context as you interpret the tree.
# expression
def interpret(context = [])
  children = elements.map do |node|
    node.interpret(context)
  end
  matcher = Regexp.new("^" + children.join + "$")
  ...
In our case we just want to record the list of variable names, so an Array will suffice. Each interpret method now needs to take this context.
# variable
def interpret(context)
  context << identifier.text_value.to_sym
  "(.+?)"
end
# expression
def interpret(context = [])
  children = elements.map do |node|
    node.interpret(context)
  end
  matcher = Regexp.new("^" + children.join + "$")

  # the block closes over context, so the matcher can report its variables
  matcher.define_singleton_method(:variables) { context }
  matcher
end
we decorate the regular expression with a list of the variables. In the real code, the returned match objects are also decorated so you have methods for each variable and don’t have to remember the captured groups by position
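Putting the pieces together, here is a minimal sketch of the whole compile step in plain Ruby. The Struct classes are hypothetical stand-ins for Treetop syntax nodes and the tree is built by hand rather than parsed, so this only illustrates the composite interpret calls and the context array; it is not the code from the real bot.

```ruby
# Hypothetical stand-ins for syntax nodes; each knows how to compile itself.
Words = Struct.new(:text_value) do
  def interpret(_context)
    Regexp.escape(text_value)
  end
end

Variable = Struct.new(:name) do
  def interpret(context)
    context << name.to_sym # record the variable name as a side effect
    '(.+?)'
  end
end

OptionalPart = Struct.new(:elements) do
  def interpret(context)
    '(?:\s+' + elements.map { |node| node.interpret(context) }.join + ')?'
  end
end

Expression = Struct.new(:elements) do
  def interpret(context = [])
    source = elements.map { |node| node.interpret(context) }.join
    matcher = Regexp.new('^' + source + '$')
    # Decorate the regexp with the variable names collected on the way.
    matcher.define_singleton_method(:variables) { context }
    matcher
  end
end

# Hand-built tree for: whereis <person> [on <day>]
tree = Expression.new([
  Words.new('whereis '),
  Variable.new('person'),
  OptionalPart.new([Words.new('on '), Variable.new('day')])
])

matcher = tree.interpret
match = matcher.match('whereis roland on tuesday')
named = matcher.variables.zip(match.captures).to_h
p matcher.variables
p named
```

Zipping the recorded variable names with the captures is one simple way to avoid remembering capture groups by position.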
Other Options
You can also build external interpreters / compilers that use the tree
Complications?
# We want to write:
hello [world]

# We actually mean:
hello[ world]
Whitespace shuffling. In the real code the grammar is more complicated, and most of the complication comes from dealing with edge cases like this.
# We should optimize:
hello [[[world]]]

# To this:
hello [world]
This isn’t done in the real code, but should be.
# Left recursion without consuming input: BAD
rule infinity_and_beyond
  infinity_and_beyond / "foo"
end
Problems?
Slow.
Other libraries
Racc: accepts yacc-style grammars. The Racc runtime is part of the Ruby standard distribution, so once you’ve built your parser there is no extra dependency.
Ragel: used by Mongrel and Thin.
Thanks!
Twitter: @knaveofdiamonds
XMPP bot: http://github.com/knaveofdiamonds/harken
Film listings for London’s indie cinemas: http://filmli.st
Treetop: http://github.com/nathansobo/treetop
http://treetop.rubyforge.org