Parsing With PLY


This is part 2 of 2 in parsing. Part 1 parses chemical equations using hand-written code.

Parsing with PLY

The ALGOL programming language in the late 1950s/early 1960s introduced the idea of computer languages based on a machine-understandable grammar. The notation is called BNF, for Backus-Naur Form. It wasn't until the 1970s, though, that people understood the practice well enough to write general-purpose tools which helped build parsers. The first popular book describing the process is known as The Dragon Book because it had a picture of a knight attacking a dragon on the cover, a hint that this was a hard topic being conquered. The back cover had Don Quixote attacking windmills. It taught how to take the theory and turn it into practice. It's considered a classic because of its impact, but the material is dated.

This is also the time when lex and yacc were introduced as part of Unix. Lex is a lexer and yacc is "yet another compiler compiler." As you can infer, it wasn't the first. The term "compiler compiler" means it uses one language to build a parser for another language. The term "compiler" is a misnomer in this case, and the classical distinction between "compiled" and "interpreted" languages has mostly become irrelevant.

There are many parser tools for Python. A web search will dig them up for you. The tools take several different approaches. The one I liked the most some years back was SPARK. Last month I looked around and found that PLY is the more modern tool for the niche that SPARK filled. The author of PLY is David Beazley, who is also the author of SWIG.

PLY splits the processing up between the lexer and the parser. As with the final code in the "parsing by hand" talk, the lexer assumes all the tokens are described by regular expressions and assumes they are all potentially valid at any point in the input. This is usually true for languages designed to be read by parser generator tools, but false for most bioinformatics file formats. (E.g., "GENE" can be someone's name, a section header and a protein sequence, all in the same file. The meaning depends on the location in the file. PLY supports stateful systems like this but I haven't tried that out.)

tokenizing with PLY's 'lex' module

PLY uses variables starting with "t_" to indicate the token patterns. If the variable is a string then it is interpreted as a regular expression and the match value is the value for the token. If the variable is a function then its docstring contains the pattern and the function is called with the matched token. The function is free to modify the token or return a new token to be used in its place. If nothing is returned then the match is ignored. Usually the function only changes the "value" attribute, which is initially the matched text. In the following, t_COUNT converts the value to an int.

import lex

# Every token type the lexer can produce must be listed here
tokens = (
    "SYMBOL",
    "COUNT"
)

# A string rule: the matched text becomes the token's value
t_SYMBOL = (
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

# A function rule: the docstring is the pattern
def t_COUNT(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()

lex.input("CH3COOH")
for tok in iter(lex.token, None):
    # written so it prints the same under Python 2 and Python 3
    print("%r %r" % (tok.type, tok.value))

BTW, iter(f, sentinel) is a handy function. It calls f() repeatedly and hands back each return value through the iterator protocol. When the returned value equals the sentinel value it raises StopIteration. In other words, I converted the sequence of multiple function calls into an iterator, letting me use a for loop.

When I run the code I get the following:

'SYMBOL' 'C'
'SYMBOL' 'H'
'COUNT' 3
'SYMBOL' 'C'
'SYMBOL' 'O'
'SYMBOL' 'O'
'SYMBOL' 'H'

You can see that the count was properly converted to an integer.
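To make the rule semantics and the iter(f, sentinel) trick concrete without needing PLY installed, here is a rough standard-library sketch. It is not how PLY is implemented. The MiniLexer class, the RULES table, the simplified SYMBOL pattern and the SPACE rule are all made up for illustration. A rule with no callback keeps the matched text, a callback can convert it, and a callback that returns None causes the match to be ignored.

```python
import re

# Hypothetical stand-ins for PLY's t_* rules: (name, pattern, callback).
# The callback gets the matched text and returns the token value,
# or None to say "drop this match".
RULES = [
    ("SYMBOL", r"[A-Z][a-z]?", None),       # simplified, not the real element list
    ("COUNT", r"\d+", int),                 # convert the digits to an int
    ("SPACE", r"\s+", lambda text: None),   # matched, then ignored
]

class MiniLexer(object):
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def token(self):
        """Return the next (type, value) pair, or None at end of input."""
        while self.pos < len(self.text):
            for name, pattern, callback in RULES:
                m = re.compile(pattern).match(self.text, self.pos)
                if m is None:
                    continue
                self.pos = m.end()
                value = m.group() if callback is None else callback(m.group())
                if value is None:
                    break               # ignored match: scan the rules again
                return (name, value)
            else:
                # no rule matched at this position
                raise TypeError("Unknown text %r" % (self.text[self.pos:],))
        return None

lexer = MiniLexer("CH3 COOH")
# iter(f, sentinel) keeps calling lexer.token() until it returns None
tokens = list(iter(lexer.token, None))
print(tokens)
```

The space in "CH3 COOH" is consumed by the SPACE rule but never shows up in the token list, and the "3" arrives as the integer 3.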

parsing with PLY's 'yacc' module

PLY's parser works on tokens. It uses a BNF grammar that describes how those tokens are assembled. Parsers can handle some ambiguity. In this chemical equation example the grammar is ambiguous after reading a chemical symbol. There could be a repeat count afterward or not. There are a huge number of techniques to resolve the ambiguity, such as lookahead (see what the next tokens are) and precedence rules (given the choice between "*" and "+" use "*" first).

The parsing algorithms go by names like LALR(1), SLR, LL and LR. The most general purpose is the Earley algorithm (which SPARK used) but it's usually slower than the more limited ones I just listed. The limited ones only parse less complicated grammars. You might think that's a problem, but in practice that's rarely the case. People have a hard time with complex grammars as well and it's better to have a syntax which people can understand without deep thought.
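As a small illustration of lookahead, here is a hand-written sketch (not what PLY generates) that resolves the "count or no count" question by peeking at the next token. The parse_species function and the (type, value) tuple layout are assumptions made for this example only.

```python
def parse_species(tokens):
    """Group (type, value) tokens into (symbol, count) pairs."""
    species = []
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind != "SYMBOL":
            raise TypeError("unexpected %s token %r" % (kind, value))
        # One token of lookahead picks between the two alternatives:
        #   species : SYMBOL
        #   species : SYMBOL COUNT
        if i + 1 < len(tokens) and tokens[i + 1][0] == "COUNT":
            species.append((value, tokens[i + 1][1]))
            i += 2
        else:
            species.append((value, 1))
            i += 1
    return species

tokens = [("SYMBOL", "H"), ("COUNT", 2), ("SYMBOL", "S"),
          ("SYMBOL", "O"), ("COUNT", 4)]
print(parse_species(tokens))  # [('H', 2), ('S', 1), ('O', 4)]
```

A single token of lookahead is enough here because the only decision is whether the token after a SYMBOL is a COUNT.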

Here's the parser part of my chemical equation evaluator.

import yacc

class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count

# When parsing starts, try to make a "chemical_equation" because it's
# the name on left-hand side of the first p_* function definition.
def p_species_list(p):
    "chemical_equation : chemical_equation species"
    p[0] = p[1] + [p[2]]

def p_species(p):
    "chemical_equation : species"
    p[0] = [p[1]]

def p_single_species(p):
    """
    species : SYMBOL
    species : SYMBOL COUNT
    """
    # len(p) counts p[0] as well as the right-hand-side symbols
    if len(p) == 2:
        p[0] = Atom(p[1], 1)
    elif len(p) == 3:
        p[0] = Atom(p[1], p[2])

def p_error(p):
    print("Syntax error at '%s'" % p.value)

yacc.yacc()

print(yacc.parse("H2SO4"))

As you can see, I can have one or more rules defined per parser function, and I can have multiple functions which work on the same left-hand side term. When using yacc, remember to return the information through p[0]. I keep expecting to "return" the object instead.
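The p[0] convention is easy to try out by calling a rule function by hand. In this sketch a plain Python list stands in for the object PLY passes to the rule, which is an assumption for illustration only. The reduce_by_hand helper and the deliberately wrong p_returns_by_mistake rule are hypothetical names.

```python
class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count

def p_single_species(p):
    """
    species : SYMBOL
    species : SYMBOL COUNT
    """
    if len(p) == 2:
        p[0] = Atom(p[1], 1)
    elif len(p) == 3:
        p[0] = Atom(p[1], p[2])

def p_returns_by_mistake(p):
    # Hypothetical buggy rule: the Atom is returned instead of stored in p[0]
    return Atom(p[1], 1)

def reduce_by_hand(rule, *values):
    # Stand-in for a reduction: slot 0 is reserved for the result and
    # slots 1..n carry the values of the right-hand-side symbols.
    p = [None] + list(values)
    rule(p)        # whatever the rule returns is thrown away
    return p[0]    # only p[0] is read back

atom = reduce_by_hand(p_single_species, "H", 2)
print("%s %d" % (atom.symbol, atom.count))         # H 2
print(reduce_by_hand(p_returns_by_mistake, "H"))   # None
```

The second call shows the trap: the rule builds a perfectly good Atom, but because nothing was stored in p[0] the result is None.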

When I run the above, combined with the lexer code (and with the lexer test code removed), I get the following output:

[<__main__.Atom object at 0xb83b0>, <__main__.Atom object at 0xc14d0>, <__main__.Atom object at 0xc12b0>]

That proved rather unhelpful. I can see there are 3 atoms but I can't see what the atoms are. I'll modify the Atom class to add a __repr__ method.

class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count
    def __repr__(self):
        return "Atom(%r, %r)" % (self.symbol, self.count)

and with that in place I see

[Atom('H', 2), Atom('S', 1), Atom('O', 4)]

parsetab.py

Generating the parser can be expensive (that is, take a lot of time). PLY will store the parser in a file called "parsetab.py" and reuse it if the grammar is unchanged. If you wish, you can change the filename to something else.

The new count functions

There is only a minor change to the count functions. I don't need the tokenize step because PLY handles that for me. I can pass a string directly into the yacc.parse function. Here are the new versions:

import collections

def atom_count(s):
    count = 0
    for atom in yacc.parse(s):
        count += atom.count
    return count

def element_counts(s):
    counts = collections.defaultdict(int)
    for atom in yacc.parse(s):
        counts[atom.symbol] += atom.count
    return counts
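The accumulation step in element_counts can be tried on its own. In this sketch a hand-written list of (symbol, count) pairs stands in for the Atom objects the parser would produce for "CH3COOH". That list is an assumption for illustration, not parser output.

```python
import collections

# Stand-in for the parsed atoms of "CH3COOH"
atoms = [("C", 1), ("H", 3), ("C", 1), ("O", 1), ("O", 1), ("H", 1)]

counts = collections.defaultdict(int)
for symbol, count in atoms:
    counts[symbol] += count   # a missing key starts out as int() == 0

print(sorted(counts.items()))                # [('C', 2), ('H', 4), ('O', 2)]
print(counts == {"C": 2, "H": 4, "O": 2})    # True
```

A defaultdict compares equal to a plain dict with the same items, so callers can test the result against an ordinary dictionary literal.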

Because the function APIs are unchanged I can use the same test suite as before. When I did that I found two problems:

1) I didn't get a TypeError when the input was an invalid string. This was because I was only printing the error, with

def p_error(p):
    print("Syntax error at '%s'" % p.value)

when I should have raised an exception, with

def p_error(p):
    raise TypeError("unknown text at %r" % (p.value,))

2) I didn't support the empty string. My previous code allowed "" to mean "no chemical compound". Supporting that is straightforward but required a bit of rearranging. I had to add a new start symbol which can either be the empty string (in which case p[0] gets an empty list) or be a list of chemical species. The new grammar is

def p_chemical_equation(p):
    """
    chemical_equation :
    chemical_equation : species_list
    """
    if len(p) == 1:
        # the empty string means there are no atomic symbols
        p[0] = []
    else:
        p[0] = p[1]

def p_species_list(p):
    "species_list : species_list species"
    p[0] = p[1] + [p[2]]

def p_species(p):
    "species_list : species"
    p[0] = [p[1]]

def p_single_species(p):
    """
    species : SYMBOL
    species : SYMBOL COUNT
    """
    if len(p) == 2:
        p[0] = Atom(p[1], 1)
    elif len(p) == 3:
        p[0] = Atom(p[1], p[2])
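To see how the rearranged rules fit together, here is a sketch that applies them by hand in the order a bottom-up parser would plausibly use for "H2O" and for the empty string. Plain lists stand in for PLY's rule argument and (symbol, count) tuples stand in for Atom objects. Both are assumptions for illustration, as is the reduce_by_hand helper.

```python
def p_chemical_equation(p):
    if len(p) == 1:
        p[0] = []          # empty right-hand side: only slot 0 exists
    else:
        p[0] = p[1]

def p_species_list(p):
    p[0] = p[1] + [p[2]]   # left recursion: grow the list on the right

def p_species(p):
    p[0] = [p[1]]          # the first species starts the list

def reduce_by_hand(rule, *values):
    # Slot 0 holds the result; the remaining slots hold the
    # right-hand-side values.
    p = [None] + list(values)
    rule(p)
    return p[0]

# "H2O": species, species_list, species, species_list, chemical_equation
species_list = reduce_by_hand(p_species, ("H", 2))
species_list = reduce_by_hand(p_species_list, species_list, ("O", 1))
print(reduce_by_hand(p_chemical_equation, species_list))  # [('H', 2), ('O', 1)]

# "": the empty alternative is reduced with nothing on the right-hand side
print(reduce_by_hand(p_chemical_equation))                # []
```

The len(p) == 1 test works because the result slot is always present, even when the right-hand side is empty.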

The final code

For posterity's sake, here's the final code with all the fixes, docstrings and self-test code.

# A grammar for chemical equations like "H2O", "CH3COOH" and "H2SO4"
# Uses David Beazley's PLY parser.
# Implements two functions: count the total number of atoms in the equation and
# count the number of times each element occurs in the equation.

import lex
import yacc

tokens = (
    "SYMBOL",
    "COUNT"
)

t_SYMBOL = (
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

def t_COUNT(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()

class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count
    def __repr__(self):
        return "Atom(%r, %r)" % (self.symbol, self.count)

# When parsing starts, try to make a "chemical_equation" because it's
# the name on left-hand side of the first p_* function definition.
# The first rule is empty because I let the empty string be valid
def p_chemical_equation(p):
    """
    chemical_equation :
    chemical_equation : species_list
    """
    if len(p) == 1:
        # the empty string means there are no atomic symbols
        p[0] = []
    else:
        p[0] = p[1]

def p_species_list(p):
    "species_list : species_list species"
    p[0] = p[1] + [p[2]]

def p_species(p):
    "species_list : species"
    p[0] = [p[1]]

def p_single_species(p):
    """
    species : SYMBOL
    species : SYMBOL COUNT
    """
    if len(p) == 2:
        p[0] = Atom(p[1], 1)
    elif len(p) == 3:
        p[0] = Atom(p[1], p[2])

def p_error(p):
    raise TypeError("unknown text at %r" % (p.value,))

yacc.yacc()

######

import collections

def atom_count(s):
    """calculates the total number of atoms in the chemical equation
    >>> atom_count("H2SO4")
    7
    """
    count = 0
    for atom in yacc.parse(s):
        count += atom.count
    return count

def element_counts(s):
    """calculates counts for each element in the chemical equation
    >>> element_counts("CH3COOH")["C"]
    2
    >>> element_counts("CH3COOH")["H"]
    4
    """
    counts = collections.defaultdict(int)
    for atom in yacc.parse(s):
        counts[atom.symbol] += atom.count
    return counts

######
def assert_raises(exc, f, *args):
    try:
        f(*args)
    except exc:
        pass
    else:
        raise AssertionError("Expected %r" % (exc,))

def test_element_counts():
    assert element_counts("CH3COOH") == {"C": 2, "H": 4, "O": 2}
    assert element_counts("Ne") == {"Ne": 1}
    assert element_counts("") == {}
    assert element_counts("NaCl") == {"Na": 1, "Cl": 1}
    assert_raises(TypeError, element_counts, "Blah")
    assert_raises(TypeError, element_counts, "10")
    assert_raises(TypeError, element_counts, "1C")

def test_atom_count():
    assert atom_count("He") == 1
    assert atom_count("H2") == 2
    assert atom_count("H2SO4") == 7
    assert atom_count("CH3COOH") == 8
    assert atom_count("NaCl") == 2
    assert atom_count("C60H60") == 120
    assert_raises(TypeError, atom_count, "SeZYou")
    # these two now exercise atom_count rather than element_counts
    assert_raises(TypeError, atom_count, "10")
    assert_raises(TypeError, atom_count, "1C")

def test():
    test_atom_count()
    test_element_counts()

if __name__ == "__main__":
    test()
    print("All tests passed.")

Copyright © 2001-2013 Andrew Dalke Scientific AB