Parsing With PLY

This is part 2 of 2 in parsing. Part 1 parses chemical equations using hand-written code.

Parsing with PLY

The ALGOL programming language in the late 1950s/early 1960s introduced the idea of computer languages based on a machineunderstandable grammar. The notation is called BNF for Backus-Naur Form. It wasn't until the 1970s though that people understood thepractice well enough to write general-purpose tools which helped build parsers. The first popular book describing the process is known as TheDragon Book because it had a picture of a knight attacking a dragon on the cover; a hint that this was a hard topic being conquered. The backcover had Don Quixote attacking windmills. It taught how to take the theory and turn it into practice. It's considered a classic because of itsimpact but the material is dated.

This is also the time when lex and yacc were introduced as part of Unix. Lex is a lexer and yacc is "yet another compiler compiler." As you caninfer, it wasn't the first. The term "compiler compiler" means it uses a language used to build a parser for another language. The term "compiler"is a misnomer in this case and the classical distinction between "compiled" and "interpreted" language has mostly become irrelevant.

There are many parser tools for Python. A web search will dig them up for you. The tools take several different approaches. The one I liked themost some years back was SPARK, Last month I looked around and found that PLY is the more modern tool for the niche that SPARK filed. Theauthor of PLY is David Beazley, who is also the author of SWIG.

PLY splits the processing up between the lexer and the parser. As with the final code in the "parsing by hand" talk the lexer assumes all thetokens are described by regular expressions and assumes they are all potentially values at any point in the input. This is usually true forlanguages designed to be read by parser generator tools, but false for most bioinformatics file formats. (Eg, "GENE" can be someone's name, asection header and a protein sequence, all in the same file. The meaning depends on the location in the file. PLY supports stateful systems likethis but I haven't tried that out.)

tokenizing with PLY's 'lex' module

PLY uses variables starting with "t_" to indicate the token patterns. If the variable is a string then it is interpreted as a regular expression and thematch value is the value for the token. If the variable is a function then its docstring contains the pattern and the function is called with thematched token. The function is free to modify the token or return a new token to be used in its place. If nothing is returned then the match isignore. Usually the function only changes the "value" attribute, which is initially the matched text. In the following the t_COUNT converts thevalue to an int.

import lex

tokens = ( "SYMBOL", "COUNT")

t_SYMBOL = ( r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|" r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|" r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|" r"U|V|W|Xe|Yb?|Z[nr]")

def t_COUNT(t): r"\d+" t.value = int(t.value) return t

def t_error(t): raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()

lex.input("CH3COOH")for tok in iter(lex.token, None): print repr(tok.type), repr(tok.value)

BTW, iter(f, sentinel) is a handy function. It calls f() and returns the return value using the iterator protocol. When the returned value equalsthe sentinel value it raises a StopIterationError. In other words, I converted the sequence of multiple function calls into an iterator, letting myuse a for loop.

When I run the code I get the following

'SYMBOL' 'C''SYMBOL' 'H''COUNT' 3'SYMBOL' 'C''SYMBOL' 'O''SYMBOL' 'O''SYMBOL' 'H'

You can see that the count was properly converted to an integer

parsing with PLY's 'yacc' module

PLY's parser works on tokens. It uses a BNF grammar that describes how those tokens are assembled. Parsers can handle some ambiguity. Inthis chemical equation example the grammar is ambiguous after reading a chemical symbol. There could be repeat count afterward or not. Thereare a huge number of techniques to resolve the ambiguity such as as lookahead (see what the next tokens are) and precedence rules (given thechoice between "*" and "+" use "*" first).

The parsing algorithms go by names like LALR(1), SLR, LL and LR. The most general purpose is the Earley algorithm (which SPARK used) butit's usually slower than the more limited ones I just listed. The limited ones only parse less complicated grammars. You might think that's aproblem, but in practice that's rarely the case. People have a hard time with complex grammars as well and it's better to have a syntax whichpeople can understand without deep thought.

http://www.dalkescientific.com/index.html

http://www.dalkescientific.com/index.html

http://dalkescientific.com/writings/NBN/parsing_by_hand.html

http://en.wikipedia.org/wiki/Backus-Naur_form

http://en.wikipedia.org/wiki/Principles_of_Compiler_Design

Here's the parser part of my chemical equation evaluator.

class Atom(object): def __init__(self, symbol, count): self.symbol = symbol self.count = count

# When parsing starts, try to make a "chemical_equation" because it's# the name on left-hand side of the first p_* function definition.def p_species_list(p): "chemical_equation : chemical_equation species" p[0] = p[1] + [p[2]]

def p_species(p): "chemical_equation : species" p[0] = [p[1]]

def p_single_species(p): """ species : SYMBOL species : SYMBOL COUNT """ if len(p) == 2: p[0] = Atom(p[1], 1) elif len(p) == 3: p[0] = Atom(p[1], p[2]) def p_error(p): print "Syntax error at '%s'" % p.value yacc.yacc()

print yacc.parse("H2SO4")

As you can see, I can have one or more rule defined per parser function, and I can have multiple functions which work on the same left-handside term. When using yacc, remember to return the information through p[0]. I keep expecting to "return" the object instead.

When I run the above, combined with the lexer code (and with the lexer test code removed) I get the following output

[<__main__.Atom object at 0xb83b0>, <__main__.Atom object at 0xc14d0>, <__main__.Atom object at 0xc12b0>]

That proved rather unhelpful. I can see there are 3 atoms but I can't see what the atoms are. I'll modify the Atom class to add a __repr__method.

class Atom(object): def __init__(self, symbol, count): self.symbol = symbol self.count = count def __repr__(self): return "Atom(%r, %r)" % (self.symbol, self.count)

and with that in place I see

[Atom('H', 2), Atom('S', 1), Atom('O', 4)]

parsetab.py

Generating the parser can be expensive (that is, take a lot of time). PLY will store the parser in a file called "parsetab.py" and reuse it if thegrammar is unchanged. If you with you can change the filename to something else.

The new count functions

There is only a minor change to the count functions. I don't need the tokenize step because PLY handles that for me. I can pass a string directlyinto the yacc.parse functions. Here's are the new versions

import collections

def atom_count(s): count = 0 for atom in yacc.parse(s): count += atom.count return count

def element_counts(s): counts = collections.defaultdict(int) for atom in yacc.parse(s): counts[atom.symbol] += atom.count return counts

Because the function APIs are unchanged I can use the same test suite as before. When I do that I found two problems:

1) I didn't get a TypeError when the input was an invalid string. This was because I was only printing the error, with

def p_error(p): print "Syntax error at '%s'" % p.value

when I should have raised an expection, with

def p_error(p): raise TypeError("unknown text at %r" % (p.value,))

2) I didn't support the empty string. My previous code allowed "" to mean "no chemical compound". Supporting that is straight-forward butrequired a bit of rearranging. I had to add a new start token which can either be the empty string (in which case p[0] gets an empty list) or be alist of chemical species. The new grammar is

def p_chemical_equation(p): """ chemical_equation : chemical_equation : species_list """ if len(p) == 1: # the empty string means there are no atomic symbols p[0] = [] else: p[0] = p[1]

def p_species_list(p): "species_list : species_list species" p[0] = p[1] + [p[2]]

def p_species(p): "species_list : species" p[0] = [p[1]]

def p_single_species(p): """ species : SYMBOL species : SYMBOL COUNT """ if len(p) == 2: p[0] = Atom(p[1], 1) elif len(p) == 3: p[0] = Atom(p[1], p[2])

The final code

For posterity's sake, here's the final code with all the fixes, docstrings and self-test code.

# A grammar for chemical equations like "H2O", "CH3COOH" and "H2SO4"# Uses David Beazley's PLY parser.# Implements two functions: count the total number of atoms in the equation and# count the number of times each element occurs in the equation.

import leximport yacc

tokens = ( "SYMBOL", "COUNT")

t_SYMBOL = ( r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|" r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|" r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|" r"U|V|W|Xe|Yb?|Z[nr]")

def t_COUNT(t): r"\d+" t.value = int(t.value) return t

def t_error(t): raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()

class Atom(object): def __init__(self, symbol, count): self.symbol = symbol self.count = count def __repr__(self): return "Atom(%r, %r)" % (self.symbol, self.count)

# When parsing starts, try to make a "chemical_equation" because it's# the name on left-hand side of the first p_* function definition.# The first rule is empty because I let the empty string be validdef p_chemical_equation(p): """ chemical_equation : chemical_equation : species_list """ if len(p) == 1: # the empty string means there are no atomic symbols p[0] = [] else: p[0] = p[1]

def p_species_list(p): "species_list : species_list species" p[0] = p[1] + [p[2]]

def p_species(p): "species_list : species" p[0] = [p[1]]

def p_single_species(p): """ species : SYMBOL species : SYMBOL COUNT """ if len(p) == 2: p[0] = Atom(p[1], 1)

elif len(p) == 3: p[0] = Atom(p[1], p[2]) def p_error(p): raise TypeError("unknown text at %r" % (p.value,)) yacc.yacc()

######

import collections

def atom_count(s): """calculates the total number of atoms in the chemical equation >>> atom_count("H2SO4") 7 >>> """ count = 0 for atom in yacc.parse(s): count += atom.count return count

def element_counts(s): """calculates counts for each element in the chemical equation >>> element_counts("CH3COOH")["C"] 2 >>> element_counts("CH3COOH")["H"] 4 >>> """ counts = collections.defaultdict(int) for atom in yacc.parse(s): counts[atom.symbol] += atom.count return counts

######def assert_raises(exc, f, *args): try: f(*args) except exc: pass else: raise AssertionError("Expected %r" % (exc,))

def test_element_counts(): assert element_counts("CH3COOH") == {"C": 2, "H": 4, "O": 2} assert element_counts("Ne") == {"Ne": 1} assert element_counts("") == {} assert element_counts("NaCl") == {"Na": 1, "Cl": 1} assert_raises(TypeError, element_counts, "Blah") assert_raises(TypeError, element_counts, "10") assert_raises(TypeError, element_counts, "1C")

def test_atom_count(): assert atom_count("He") == 1 assert atom_count("H2") == 2 assert atom_count("H2SO4") == 7 assert atom_count("CH3COOH") == 8 assert atom_count("NaCl") == 2 assert atom_count("C60H60") == 120 assert_raises(TypeError, atom_count, "SeZYou") assert_raises(TypeError, element_counts, "10") assert_raises(TypeError, element_counts, "1C")

def test(): test_atom_count() test_element_counts()

if __name__ == "__main__": test() print "All tests passed."

Copyright © 2001-2013 Andrew Dalke Scientific AB

Documents

Parsing With PLY