CSE 467/567 Computational Linguistics
Carl [email protected]
Computer Science & Engineering
University at Buffalo
Fall 2006 CSE 467/567
Levels of processing
phonetics/phonology – sounds
morphology – word structure
syntax – sentence structure
semantics – meaning
pragmatics – goals of language use
discourse – utterances in context
Words: the building blocks of sentences
[S [NP [D the] [N dog]] [VP [V chased] [NP [D the] [N cat]]]]
Words have internal structure
readable = read + able
readability = read + able + ity
The structure of words can be described using a regular grammar.
Chomsky hierarchy
regular languages
context-free languages
context-sensitive languages
unrestricted languages
Problem
I often need to find an e-mail, but I have thousands of e-mails in my various folders. Suppose I want to find an e-mail about geese. The e-mail may mention “geese” or “goose”; also, if the word appears at the start of a sentence, its initial letter will be capitalized. So the search needs to match “goose”, “geese”, “Goose” or “Geese”.
Regular expressions (in Perl)
“a regular expression is an algebraic notation for characterizing a set of strings” [p. 22]
Regular expressions are commonly used to specify search strings. For example, the UNIX utility program grep lets the user specify a pattern to search for in files.
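A minimal sketch of a grep-like search. It is written with Python's `re` module rather than Perl, on the assumption that the basic pattern syntax shown in these slides behaves the same way in both. The `grep` helper and the sample subject lines are hypothetical.

```python
import re

def grep(pattern, lines):
    """Hypothetical helper: return the lines that contain a match for pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up sample data for illustration.
emails = ["Subject: lunch", "Subject: goose sighting", "Subject: taxes"]
matches = grep(r"goose", emails)
```

Here `matches` holds only the line that contains ‘goose’.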
Sequences of characters
Matching a sequence of characters: /…/
Examples:
/a/ matches the character ‘a’
/fred/ matches the string ‘fred’
Note: /fred/ does not match the string ‘Fred’!
In other words, patterns are case-sensitive.
Character disjunction (character classes)
Square brackets are used to indicate disjunction of characters.
Examples:
/[Ff]/ matches either ‘f’ or ‘F’
/[Ff]red/ matches either ‘fred’ or ‘Fred’
This form of disjunction applies only at the character level. A set of characters in square brackets is sometimes referred to as a character class.
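A small sketch comparing the literal pattern with the character-class pattern. It uses Python's `re` module as a stand-in for Perl; the list of test names is made up.

```python
import re

names = ["fred", "Fred", "FRED"]

# The literal pattern is case-sensitive: only the all-lowercase name matches.
literal = [n for n in names if re.fullmatch(r"fred", n)]

# The class [Ff] accepts either case, but only for the first character.
with_class = [n for n in names if re.fullmatch(r"[Ff]red", n)]
```

Neither pattern accepts ‘FRED’, since the class covers only the first letter.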
Ranges
Sometimes it is useful to specify “any digit” or “any letter”.
“Any digit” can be written as /[0123456789]/, since any of the ten digits satisfies the pattern.
An alternative is to use a special range notation: /[0-9]/
Any letter can be specified as /[A-Za-z]/
Range notation does not extend the power of regular expressions, but gives us a convenient way to express them.
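The equivalence of the two digit patterns can be checked mechanically. This is a sketch in Python's `re` module (assumed here to treat ranges as Perl does), tested over the printable ASCII characters only.

```python
import re
import string

explicit = re.compile(r"[0123456789]")
ranged = re.compile(r"[0-9]")
letter = re.compile(r"[A-Za-z]")

# The two digit patterns agree on every printable ASCII character.
same_digits = all(
    bool(explicit.fullmatch(c)) == bool(ranged.fullmatch(c))
    for c in string.printable
)

# Keep only the characters that the letter range accepts.
letters_matched = "".join(c for c in "a1B2-z" if letter.fullmatch(c))
```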
Complementing character classes
To search for a character that is not in a character class, place a caret (^) immediately after the opening square bracket of the class.
Examples:
/[^a]/ matches any single character except ‘a’
/[^0-9]/ matches any single character except a digit
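A brief sketch of complemented classes, using Python's `re.findall` to list every single character each pattern accepts. The input strings are made up for illustration.

```python
import re

# Every character of 'banana' that is not an 'a'.
non_a = re.findall(r"[^a]", "banana")

# Every character of 'a1b22c' that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b22c")
```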
Matching 0 or 1 occurrence
The ‘?’ matches zero or one occurrence of the preceding expression.
Examples:
/a?/ matches ‘a’ or ‘’ (nothing)
/cats?/ matches ‘cat’ or ‘cats’
Note that the “preceding expression”, in these examples, is a single letter. We’ll see how to form longer expressions later.
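A sketch of the optional ‘s’, again in Python's `re` module. `fullmatch` is used so that each whole word must fit the pattern; the word list is made up.

```python
import re

words = ["cat", "cats", "catss", "ca"]

# '?' applies only to the single letter before it, here the 's'.
matched = [w for w in words if re.fullmatch(r"cats?", w)]
```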
The Kleene star and plus
The Kleene star (*) matches zero or more occurrences of the preceding expression.
Examples:
/a*/ matches ‘’, ‘a’, ‘aa’, ‘aaa’, etc.
/[ab]*/ matches ‘’, ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.
+ matches one or more occurrences
+ is not necessary: /[ab]+/ is equivalent to /[ab][ab]*/
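A sketch checking, on a handful of made-up strings, that the plus form and its expansion with the star accept the same inputs. Python's `re` module stands in for Perl.

```python
import re

samples = ["", "a", "ab", "bba", "abc"]

star = [s for s in samples if re.fullmatch(r"[ab]*", s)]
plus = [s for s in samples if re.fullmatch(r"[ab]+", s)]
expanded = [s for s in samples if re.fullmatch(r"[ab][ab]*", s)]
```

Only the star form accepts the empty string; ‘abc’ is rejected by all three because ‘c’ is outside the class.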
Wildcard
The period (.) matches any single character except the newline (\n).
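A sketch of the wildcard using the illustrative pattern /c.t/, which is not from the slides. Python's `re` module shares the default that the period does not match a newline.

```python
import re

candidates = ["cat", "cut", "c\nt", "ct"]

# '.' must consume exactly one character, and by default not a newline.
hits = [s for s in candidates if re.fullmatch(r"c.t", s)]
```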
Anchors
Anchors are used to restrict a match to a particular position within a string.
^ anchors to the start of a string
$ anchors to the end of a string
/[Ff]red/ matches both ‘Fred’ and ‘Fred is home’
/^[Ff]red$/ matches ‘Fred’ but not ‘Fred is home’
\b anchors to a word boundary
\B anchors to a non-boundary
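A sketch contrasting the unanchored pattern with the anchored ones, in Python's `re` module. The string ‘Alfred’ is an added test case, chosen because it contains ‘fred’ inside a longer word.

```python
import re

texts = ["Fred", "Fred is home", "Alfred"]

# Unanchored: a match anywhere in the string is enough.
unanchored = [t for t in texts if re.search(r"[Ff]red", t)]

# ^ and $ force the pattern to span the whole string.
anchored = [t for t in texts if re.search(r"^[Ff]red$", t)]

# \b requires a word boundary on each side of the name.
whole_word = [t for t in texts if re.search(r"\b[Ff]red\b", t)]
```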
Conjunction
Two regular expressions are conjoined by juxtaposition (placing the expressions side by side).
Examples:
/a/ matches ‘a’
/m/ matches ‘m’
/am/ matches ‘am’ but not ‘a’ or ‘m’ alone
Disjunction
We have already seen disjunction of characters using the square bracket notation.
General disjunction is expressed using the vertical bar (|), also called the pipe symbol.
Unlike the [ ] form, which chooses among single characters, this form of disjunction lets us match any one of several alternative patterns.
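A sketch of general disjunction using the illustrative pattern /cat|dog/, which is not from the slides. Python's `re` module stands in for Perl.

```python
import re

animals = ["cat", "dog", "cow", "catdog"]

# Each alternative is a whole pattern, not a single character.
either = [a for a in animals if re.fullmatch(r"cat|dog", a)]
```

With `fullmatch`, ‘catdog’ is rejected because neither alternative covers the entire string.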
Grouping
Parentheses, ‘(’ and ‘)’, are used to group subpatterns of a larger pattern.
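Ex: /[Gg](ee|oo)se/ matches ‘goose’, ‘geese’, ‘Goose’ and ‘Geese’
The parentheses must enclose the whole disjunction: /[Gg](ee)|(oo)se/ would instead mean “/[Gg]ee/ or /oose/”, because | binds more loosely than juxtaposition.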
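A sketch of why the placement of the parentheses matters. It compares /[Gg](ee|oo)se/ with /[Gg](ee)|(oo)se/, given that | has lower precedence than juxtaposition. Python's `re` module is used as a stand-in for Perl, and the word list (including ‘gee’ and ‘moose’) is made up.

```python
import re

words = ["goose", "geese", "Goose", "Geese", "gee", "moose"]

# The disjunction is confined to the middle of the word.
grouped = re.compile(r"[Gg](ee|oo)se")

# Here | splits the whole pattern into '[Gg]ee' or 'oose'.
ungrouped = re.compile(r"[Gg](ee)|(oo)se")

grouped_hits = [w for w in words if grouped.search(w)]
ungrouped_hits = [w for w in words if ungrouped.search(w)]
```

The ungrouped form also accepts ‘gee’ and ‘moose’, which is not what the e-mail search needs.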
Replacement
In addition to matching, we can do replacements when a match is found:
Example: To replace the British spelling ‘colour’ with the American spelling ‘color’, we can write:
s/colour/color/
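A sketch of the same replacement with Python's `re.sub`, used here as a stand-in for Perl's s/// operator. One difference to keep in mind: Perl's s/// replaces only the first match unless the /g modifier is given, whereas `re.sub` replaces every match unless `count` is set. The sample sentence is made up.

```python
import re

text = "The colour of the sky; colours fade."

# Comparable to s/colour/color/ (first match only).
first_only = re.sub(r"colour", "color", text, count=1)

# Comparable to s/colour/color/g (every match).
all_matches = re.sub(r"colour", "color", text)
```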
Registers – saving matches
To save the text matched by part of a pattern for reuse later on, Perl provides registers.
Registers are named \#, where # is the number of the register; register n holds the text matched by the n-th parenthesized group.
Ex.
DE DO DO DO DE DA DA DA
IS ALL I WANT TO SAY TO YOU
/(D[AEO].)*/ will match the first line
/(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically
This pattern also matches strings like DA DE DE DE DA DO DO DO
\s matches a whitespace character
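A sketch of the register pattern in Python's `re` module, which supports the same \1, \2, \3 notation. The pattern is written with no literal spaces between the backreferences, since groups 2 and 3 already capture the separator in front of each syllable. The third test string is a made-up counterexample.

```python
import re

pattern = re.compile(r"(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3")

lyric = pattern.fullmatch("DE DO DO DO DE DA DA DA")
other = pattern.fullmatch("DA DE DE DE DA DO DO DO")

# Fails: the third syllable does not repeat register 2.
mismatch = pattern.fullmatch("DE DO DA DO DE DA DA DA")

# Contents of registers 1-3 for the lyric.
registers = lyric.groups() if lyric else None
```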
For more information
PERL Regular Expression TUTorial – http://perldoc.perl.org/perlretut.html
PERL Regular Expression reference page – http://perldoc.perl.org/perlre.html