Text Parsing in Python - Gayatri Nittala - Madhubala Vasireddy


Page 1:

Text Parsing in Python

- Gayatri Nittala
- Madhubala Vasireddy

Page 2:

Text Parsing

► The three W's!
► Efficiency and Perfection

Page 3:

What is Text Parsing?

► common programming task
► extract or split a sequence of characters

Page 4:

Why Text Parsing?

► Simple file parsing
    A tab-separated file
► Data extraction
    Extract specific information from a log file
► Find and replace
► Parsers - syntactic analysis
► NLP
    Extract information from a corpus
    POS tagging

Page 5:

Text Parsing Methods

► String Functions
► Regular Expressions
► Parsers

Page 6:

String Functions

► String methods in Python (the str type; older code used the string module)
    Faster, easier to understand and maintain
► If you can do it with string functions, DO IT!
► Different built-in functions
    Find-Replace
    Split-Join
    Startswith and Endswith
    Is methods
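
A minimal sketch of the Split-Join pair on a tab-separated line, the "simple file parsing" case mentioned earlier. The record and its field names are made up for illustration.

```python
# A hypothetical tab-separated record; the values are assumptions.
line = "alice\t30\tParis\n"

# split() breaks the line into fields at every tab.
fields = line.rstrip("\n").split("\t")
name, age, city = fields

# join() is the inverse: glue the same fields back together with commas.
rejoined = ",".join(fields)

print(fields)
print(rejoined)
```

No regular expression is needed here because the separator is a fixed string.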

Page 7:

Find and Replace

► find, index, rindex, replace
► EX: Replace a string in all files in a directory

    import fileinput
    import glob
    import sys

    # path, stext and rtext are supplied by the caller
    files = glob.glob(path)
    for line in fileinput.input(files, inplace=True):
        # find() returns -1 when stext is absent and 0 for a match at the start of the line
        if line.find(stext) >= 0:
            line = line.replace(stext, rtext)
        # with inplace=True, stdout is redirected into the file being edited
        sys.stdout.write(line)
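
The four methods named on this slide differ mainly in how they report a missing substring. A small sketch, using a sample sentence invented for the purpose:

```python
text = "parse this text, then parse it again"

first = text.find("parse")       # index of the first occurrence
last = text.rindex("parse")      # index of the last occurrence
missing = text.find("lexer")     # find() signals "not found" with -1

try:
    text.index("lexer")          # index() raises instead of returning -1
    raised = False
except ValueError:
    raised = True

# replace() returns a new string; the original is left untouched.
replaced = text.replace("parse", "scan")

print(first, last, missing, raised)
print(replaced)
```

Because find() can legitimately return 0, a test such as `> 0` would skip matches at the very start of a string; compare against -1 instead.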

Page 8:

startswith and endswith

► Extract quoted words from the given text

    myString = "\"123\""
    if myString.startswith("\""):
        print("string with double quotes")

► Find if the sentences are interrogative or exclamatory
    ► What an amazing game that was!
    ► Do you like this?

    endings = ('!', '?')
    sentence.endswith(endings)
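
Both checks from this slide can be sketched as small helpers. The function names `is_quoted` and `sentence_kind` and the third sample sentence are illustrative assumptions, not part of the original slides.

```python
def is_quoted(s):
    """True if s begins and ends with a double quote (hypothetical helper)."""
    return len(s) >= 2 and s.startswith('"') and s.endswith('"')


def sentence_kind(sentence):
    """Classify a sentence by its final character (hypothetical helper)."""
    s = sentence.strip()
    if s.endswith("?"):
        return "interrogative"
    if s.endswith("!"):
        return "exclamatory"
    return "other"


sentences = ["What an amazing game that was!", "Do you like this?", "It rained."]

# endswith() accepts a tuple, so several endings are tested in one call.
flagged = [s for s in sentences if s.endswith(("!", "?"))]
kinds = [sentence_kind(s) for s in sentences]

print(is_quoted('"123"'), flagged, kinds)
```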

Page 9:

isMethods

► to check for letters, digits, character case, etc.

    >>> m = 'xxxasdf '
    >>> m.isalpha()
    False

    (False because of the trailing space)
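
A quick sketch comparing a few of the is-methods on sample strings. The samples are invented, and only methods that exist on Python's str type are used.

```python
samples = ["xxxasdf", "xxxasdf ", "2024", "abc123", "HELLO", "   "]

# Map each sample to the results of several is-methods.
report = {
    s: {
        "isalpha": s.isalpha(),
        "isdigit": s.isdigit(),
        "isalnum": s.isalnum(),
        "isupper": s.isupper(),
        "isspace": s.isspace(),
    }
    for s in samples
}

for s, checks in report.items():
    print(repr(s), checks)
```

Each method is True only when every character in the string passes the test, which is why a single trailing space makes isalpha() return False.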

Page 10:

Regular Expressions

► concise way to express complex patterns
► amazingly powerful
► wide variety of operations
► when you go beyond simple, think about regular expressions!

Page 11:

Real world problems

► Match IP addresses, email addresses, URLs
► Match balanced sets of parentheses
► Substitute words
► Tokenize
► Validate
► Count
► Delete duplicates
► Natural Language Processing

Page 12:

RE in Python

► Unleash the power - built-in re module
► Functions
    to compile patterns
        ► compile
    to perform matches
        ► match, search, findall, finditer
    to perform operations on match object
        ► group, start, end, span
    to substitute
        ► sub, subn
► Metacharacters
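
The functions listed on this slide can be seen side by side in a short sketch. The pattern and the sample text are assumptions chosen to keep the results easy to follow.

```python
import re

word = re.compile(r"[a-z]+")      # compile once, reuse many times
text = "tempo 123 allegro"

m = word.match(text)              # match() only tries at the start of the string
first = (m.group(), m.start(), m.end(), m.span())

s = word.search("123 allegro")    # search() scans forward for the first match
found = (s.group(), s.span())

all_words = word.findall(text)                     # list of matching strings
spans = [mo.span() for mo in word.finditer(text)]  # iterator of match objects

replaced = word.sub("X", text)           # new string
replaced_n = word.subn("X", text)        # (new string, number of substitutions)

print(first, found, all_words, spans, replaced, replaced_n)
```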

Page 13:

Compiling patterns

► re.compile()
► pattern for IP Address, from loosest to strictest

    ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
    ^\d+\.\d+\.\d+\.\d+$
    ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
    ^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
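
To see why the last pattern is worth its length, here is a sketch that compiles the third and fourth patterns and runs them on a few sample addresses. The names `loose`, `strict` and `octet` are illustrative, and the strict pattern is built from a repeated sub-pattern rather than written out four times.

```python
import re

# Third pattern from the slide: one to three digits per field, any value.
loose = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")

# Fourth pattern: each field restricted to 0-255.
octet = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"
strict = re.compile(r"^" + octet + r"(?:\." + octet + r"){3}$")

candidates = ["192.168.0.1", "255.255.255.255", "999.1.1.1", "256.1.1.1", "1.2.3"]

loose_ok = [c for c in candidates if loose.match(c)]
strict_ok = [c for c in candidates if strict.match(c)]

print("loose :", loose_ok)
print("strict:", strict_ok)
```

The loose pattern accepts 999.1.1.1 because it checks only the shape of the address, not the range of each field.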

Page 14:

Compiling patterns

► pattern for matching parentheses

    \(.*\)
    \([^)]*\)
    \([^()]*\)
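
The three patterns behave quite differently on nested input. A small sketch, using an invented expression as the test string:

```python
import re

text = "f(a) + g(b(c))"

# Greedy .* runs from the first "(" to the last ")".
greedy = re.findall(r"\(.*\)", text)

# Stops at the first ")", so a nested group is cut short.
no_close = re.findall(r"\([^)]*\)", text)

# Allows no parentheses inside, so only innermost groups match.
innermost = re.findall(r"\([^()]*\)", text)

print(greedy)
print(no_close)
print(innermost)
```

None of the three tracks nesting depth; they only differ in which flat approximation they pick.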

Page 15:

Substitute

► Perform several string substitutions on a given string

    import re

    def make_xlat(*args, **kwargs):
        adict = dict(*args, **kwargs)
        # one alternation of all keys, each escaped so it is matched literally
        rx = re.compile('|'.join(map(re.escape, adict)))
        def one_xlate(match):
            # look up the replacement for whichever key matched
            return adict[match.group(0)]
        def xlate(text):
            return rx.sub(one_xlate, text)
        return xlate
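
A usage sketch for the make_xlat closure, repeated here so the block runs on its own. Sorting the keys longest-first is an addition of this sketch (regex alternation takes the first alternative that matches, so a short key could otherwise shadow a longer one); the sample mapping is invented.

```python
import re


def make_xlat(*args, **kwargs):
    adict = dict(*args, **kwargs)
    # Assumption of this sketch: try longer keys first so they are not shadowed.
    keys = sorted(adict, key=len, reverse=True)
    rx = re.compile("|".join(map(re.escape, keys)))

    def one_xlate(match):
        return adict[match.group(0)]

    def xlate(text):
        return rx.sub(one_xlate, text)

    return xlate


swap = make_xlat({"cat": "dog", "dog": "cat"})
single_pass = swap("cat chases dog")

# Chained replace() calls rewrite text that was already substituted.
chained = "cat chases dog".replace("cat", "dog").replace("dog", "cat")

print(single_pass)
print(chained)
```

The single regex pass swaps both words correctly, while the chained calls collapse everything to one word.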

Page 16:

Count

► Split and count words in the given text

    import re

    p = re.compile(r'\W+')
    # split() leaves an empty string after the trailing '().', so filter it out
    words = [w for w in p.split('This is a test for split().') if w]
    len(words)
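
Counting can also mean counting how often each word occurs. A sketch using collections.Counter from the standard library; the sample sentence is invented.

```python
import re
from collections import Counter

# Pitfall: splitting on \W+ leaves an empty string when the text ends in punctuation.
pieces = re.split(r"\W+", "This is a test for split().")
naive_count = len(pieces)                       # counts the trailing ''
word_count = len(re.findall(r"\w+", "This is a test for split()."))

# Word frequencies: findall() returns only the words, Counter tallies them.
text = "The cat and the hat and the bat"
counts = Counter(re.findall(r"\w+", text.lower()))

print(naive_count, word_count)
print(counts.most_common(2))
```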

Page 17:

Tokenize

► Parsing and Natural Language Processing

    >>> s = 'tokenize these words'
    >>> words = re.compile(r'\b\w+\b|\$')
    >>> words.findall(s)
    ['tokenize', 'these', 'words']
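
The second alternative in the slide's pattern, `\$`, only shows its effect when the input contains a dollar sign. A short sketch with an invented sentence:

```python
import re

words = re.compile(r"\b\w+\b|\$")

# Words are matched by the first alternative, a literal "$" by the second.
tokens = words.findall("pay $5 now")

# finditer() additionally reports where each token was found.
positions = [(m.group(), m.start()) for m in words.finditer("pay $5 now")]

print(tokens)
print(positions)
```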

Page 18:

Common Pitfalls

► for operations on fixed strings or a single character class, with no case-sensitivity issues, string methods are usually enough
► re.sub() vs. str.replace()
► re.sub() vs. str.translate()
► match vs. search
► greedy vs. non-greedy
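
The last two pitfalls are easiest to see in code. A sketch with sample strings chosen for illustration:

```python
import re

# match() is anchored at the start of the string; search() is not.
anchored = re.match("super", "insuperable")
scanned = re.search("super", "insuperable")

# Greedy .* takes as much as possible; .*? takes as little as possible.
html = "<html><head><title>Title</title>"
greedy = re.match("<.*>", html).group()
lazy = re.match("<.*?>", html).group()

print(anchored, scanned.span())
print(greedy)
print(lazy)
```

For a fixed string such as "super", `"super" in text` or `text.find("super")` would do the same job without a regular expression.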

Page 19:

PARSERS

► Flat and Nested texts
► Nested tags, Programming language constructs
► Better to do less than to do more!

Page 20:

Parsing Non flat texts

► Grammar
► States
► Generate tokens and Act on them
► Lexer - Generates a stream of tokens
► Parser - Generate a parse tree out of the tokens
► Lex and Yacc
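
The lexer half of this pipeline can be sketched with the standard re module alone. This is a hand-written stand-in, not PLY; the token names and the `tokenize` function are assumptions made for the example.

```python
import re

# Hypothetical token types for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("OP", r"[=+\-*/()]"),
    ("SKIP", r"[ \t]+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Yield (kind, value) pairs; the stream a parser would consume."""
    for m in MASTER.finditer(text):
        kind = m.lastgroup          # name of the alternative that matched
        value = m.group()
        if kind == "SKIP":
            continue                # whitespace produces no token
        if kind == "MISMATCH":
            raise ValueError("unexpected character %r at position %d"
                             % (value, m.start()))
        if kind == "NUMBER":
            value = int(value)
        yield kind, value


tokens = list(tokenize("x = 3 + 42"))
print(tokens)
```

A parser would then consume this stream and build a tree according to grammar rules, which is the part a tool such as Yacc automates.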

Page 21:

Grammar Vs RE

► Floating Point

    #---- EBNF-style description of Python ---#
    floatnumber   ::= pointfloat | exponentfloat
    pointfloat    ::= [intpart] fraction | intpart "."
    exponentfloat ::= (intpart | pointfloat) exponent
    intpart       ::= digit+
    fraction      ::= "." digit+
    exponent      ::= ("e" | "E") ["+" | "-"] digit+
    digit         ::= "0"..."9"

Page 22:

Grammar Vs RE

    pat = r'''(?x)
        (                           # exponentfloat
            (                       # intpart or pointfloat
                (                   # pointfloat
                    (\d+)?[.]\d+    # optional intpart with fraction
                    |
                    \d+[.]          # intpart with period
                )                   # end pointfloat
                |
                \d+                 # intpart
            )                       # end intpart or pointfloat
            [eE][+-]?\d+            # exponent
        )                           # end exponentfloat
        |
        (                           # pointfloat
            (\d+)?[.]\d+            # optional intpart with fraction
            |
            \d+[.]                  # intpart with period
        )                           # end pointfloat
        '''
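
A sketch that exercises a condensed variant of this pattern against a few literals. The condensed form (non-capturing groups, `\d*` in place of `(\d+)?`) and the helper name `is_float_literal` are assumptions of this sketch; the structure follows the grammar from the previous slide.

```python
import re

FLOAT = re.compile(r"""
    (?:                                   # exponentfloat
        (?: \d*[.]\d+ | \d+[.] | \d+ )    # pointfloat or intpart
        [eE][+-]?\d+                      # exponent
    )
    |
    (?: \d*[.]\d+ | \d+[.] )              # pointfloat
""", re.VERBOSE)


def is_float_literal(text):
    """True if the whole string is a float literal under the grammar above."""
    return FLOAT.fullmatch(text) is not None


accepted = [s for s in ["3.14", "10.", ".001", "1e100", "3.14e-10", "42", "e5", "."]
            if is_float_literal(s)]
print(accepted)
```

A bare integer such as 42 is rejected because the grammar requires either a decimal point or an exponent.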

Page 23:

PLY - The Python Lex and Yacc

► higher-level and cleaner grammar language
► LALR(1) parsing
► extensive input validation, error reporting, and diagnostics
► Two modules: lex.py and yacc.py

Page 24:

Using PLY - Lex and Yacc

► Lex:
    ► Import the [lex] module
    ► Define a list or tuple variable 'tokens' naming the tokens the lexer is allowed to produce
    ► Define tokens by assigning to a specially named variable ('t_tokenName')
    ► Build the lexer

        mylexer = lex.lex()
        mylexer.input(mytext)  # handled by yacc

Page 25:

Lex

    t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

    def t_NUMBER(t):
        r'\d+'
        try:
            t.value = int(t.value)
        except ValueError:
            print("Integer value too large", t.value)
            t.value = 0
        return t

    t_ignore = " \t"

Page 26:

Yacc

► Import the 'yacc' module
► Get a token map from a lexer
► Define a collection of grammar rules
► Build the parser

    yacc.yacc()
    yacc.parse('x=3')

Page 27:

Yacc

► Specially named functions having a 'p_' prefix

    def p_statement_assign(p):
        'statement : NAME "=" expression'
        names[p[1]] = p[3]

    def p_statement_expr(p):
        'statement : expression'
        print(p[1])
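
To make the two grammar rules concrete without installing PLY, here is a hand-written recursive-descent sketch using only the standard library. It is a stand-in for what yacc generates, not PLY's API. The `expression` rule (terms joined by + or -) and all function names are assumptions, since the slides do not define them.

```python
import re

TOKEN = re.compile(r"\s*(?:(\d+)|([a-zA-Z_][a-zA-Z0-9_]*)|(\S))")
names = {}   # symbol table, same role as `names` in the grammar actions


def tokenize(text):
    tokens = []
    for number, name, op in TOKEN.findall(text):
        if number:
            tokens.append(("NUMBER", int(number)))
        elif name:
            tokens.append(("NAME", name))
        else:
            tokens.append((op, op))
    return tokens


def parse_term(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind, value = tokens[pos]
    if kind == "NUMBER":
        return value, pos + 1
    if kind == "NAME":
        return names[value], pos + 1
    raise SyntaxError("unexpected token %r" % (value,))


def parse_expression(tokens, pos):
    # expression : term (("+" | "-") term)*   (hypothetical rule)
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] in ("+", "-"):
        op = tokens[pos][0]
        rhs, pos = parse_term(tokens, pos + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, pos


def parse_statement(text):
    tokens = tokenize(text)
    # statement : NAME "=" expression
    if len(tokens) >= 2 and tokens[0][0] == "NAME" and tokens[1][0] == "=":
        value, pos = parse_expression(tokens, 2)
        if pos != len(tokens):
            raise SyntaxError("trailing tokens")
        names[tokens[0][1]] = value
        return None
    # statement : expression
    value, pos = parse_expression(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return value


parse_statement("x = 3")
result = parse_statement("x + 4")
print(result)
```

With PLY, the tokenizing and the rule dispatch are generated from the `t_` and `p_` definitions instead of being written by hand.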

Page 28:

Summary

► String Functions
    A rule of thumb: if string functions can do the job, use them.
► Regular Expressions
    Complex patterns - something beyond simple!
► Lex and Yacc
    Parse non-flat texts that follow some rules


Page 30:

Thank You

Q & A