CSE467/567 Computational Linguistics

Carl Alphonce
cse-467-alphonce@cse.buffalo.edu
Computer Science & Engineering
University at Buffalo
Fall 2006

Levels of processing

phonetics/phonology – sounds
morphology – word structure
syntax – sentence structure
semantics – meaning
pragmatics – goals of language use
discourse – utterances in context


Words: the building blocks of sentences

Parse tree for "the dog chased the cat":

[S [NP [D the] [N dog]] [VP [V chased] [NP [D the] [N cat]]]]


Words have internal structure

readable = read + able
readability = read + able + ity

The structure of words can be described using a regular grammar.


Chomsky hierarchy

regular languages

context-free languages

context-sensitive languages

unrestricted languages


Problem

I often need to find an e-mail, but I have thousands of e-mails in my various folders. Suppose I want to find an e-mail about geese. The e-mail may mention “geese” or “goose”; also, if the word appears at the start of a sentence, its initial letter will be capitalized. So the search needs to match “goose”, “geese”, “Goose” or “Geese”.


Regular expressions (in Perl)

“a regular expression is an algebraic notation for characterizing a set of strings” [p. 22]

Regular expressions are commonly used to specify search strings. For example, the UNIX utility program grep lets the user specify a pattern to search for in files.


Sequences of characters

Matching a sequence of characters: /…/

Examples:
/a/ matches the character ‘a’
/fred/ matches the string ‘fred’

Note: /fred/ does not match the string ‘Fred’!

In other words, patterns are case-sensitive.
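
A minimal sketch of case-sensitive literal matching. It uses Python's re module as a stand-in for Perl (an assumption of this example; the pattern syntax shown is the same in both), and the sample strings are made up for illustration.

```python
import re

# re.search scans the string for the first place the pattern matches.
lower = re.search(r"fred", "my name is fred")
upper = re.search(r"fred", "my name is Fred")

print(lower is not None)  # True: 'fred' occurs in the text
print(upper is not None)  # False: 'Fred' differs in case
```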


Character disjunction (character classes)

Square brackets are used to indicate disjunction of characters.

Examples:
/[Ff]/ matches either ‘f’ or ‘F’
/[Ff]red/ matches either ‘fred’ or ‘Fred’

This form of disjunction applies only at the character level. A set of characters in square brackets is sometimes referred to as a character class.
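
As a rough illustration, the character class can be tried on a few made-up candidate strings. Python's re module is assumed here as a stand-in for Perl; the square-bracket syntax is the same.

```python
import re

pattern = re.compile(r"[Ff]red")

# fullmatch succeeds only when the whole string matches the pattern.
candidates = ["fred", "Fred", "FRED", "bred"]
matches = [w for w in candidates if pattern.fullmatch(w)]

print(matches)  # ['fred', 'Fred']
```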


Ranges

Sometimes it is useful to specify “any digit” or “any letter”.

“Any digit” can be written as /[0123456789]/, since any of the ten digits satisfies the pattern.

An alternative is to use a special range notation: /[0-9]/

Any letter can be specified as /[A-Za-z]/

Range notation does not extend the power of regular expressions, but gives us a convenient way to express them.
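
The equivalence of the explicit and range forms can be checked with a short sketch. It assumes Python's re module as a stand-in for Perl, and the sample text is invented for the example.

```python
import re

text = "Room 467, floor 5"

# The explicit class and the range describe the same set of characters.
explicit = re.findall(r"[0123456789]", text)
ranged = re.findall(r"[0-9]", text)
letters = re.findall(r"[A-Za-z]", "CSE 467")

print(explicit == ranged)  # True
print(ranged)              # ['4', '6', '7', '5']
print(letters)             # ['C', 'S', 'E']
```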


Complementing character classes

To search for a character that is not in a character class, place the caret (^) as the first character inside the square brackets.

Examples:

/[^a]/ matches any single character except ‘a’

/[^0-9]/ matches any single character except a digit
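
A small sketch of complemented classes, assuming Python's re module as a stand-in for Perl. The last line shows that the caret is only special in the first position of the class; elsewhere it is an ordinary character.

```python
import re

# findall returns every non-overlapping match, left to right.
non_digits = re.findall(r"[^0-9]", "a1b2")
not_a = re.findall(r"[^a]", "banana")

# A caret that is not first in the class is matched literally.
caret_later = re.findall(r"[a^]", "a^b")

print(non_digits)   # ['a', 'b']
print(not_a)        # ['b', 'n', 'n']
print(caret_later)  # ['a', '^']
```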


Matching 0 or 1 occurrence

The ‘?’ matches zero or one occurrence of the preceding expression.

Examples:
/a?/ matches ‘a’ or ‘’ (nothing)
/cats?/ matches ‘cat’ or ‘cats’

Note that the “preceding expression”, in these examples, is a single letter. We’ll see how to form longer expressions later.
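
The optional marker can be sketched as follows, assuming Python's re module as a stand-in for Perl. The sample sentence is made up.

```python
import re

# 's?' makes the final 's' optional, so both forms are found.
found = re.findall(r"cats?", "one cat, two cats")

# /a?/ accepts the empty string and 'a', but not 'aa' as a whole.
accepted = [s for s in ["", "a", "aa"] if re.fullmatch(r"a?", s)]

print(found)     # ['cat', 'cats']
print(accepted)  # ['', 'a']
```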


The Kleene star and plus

The Kleene star (*) matches zero or more occurrences of the preceding expression.

Examples:
/a*/ matches ‘’, ‘a’, ‘aa’, ‘aaa’, etc.
/[ab]*/ matches ‘’, ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.

+ matches one or more occurrences
+ is not necessary: /[ab]+/ is equivalent to /[ab][ab]*/
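
A sketch comparing the star, the plus, and the expanded form of the plus on a few invented strings. Python's re module is assumed as a stand-in for Perl.

```python
import re

star = re.compile(r"[ab]*")
plus = re.compile(r"[ab]+")
expanded = re.compile(r"[ab][ab]*")  # the plus written out with a star

samples = ["", "a", "ab", "bba", "abc"]

star_ok = [s for s in samples if star.fullmatch(s)]
plus_ok = [s for s in samples if plus.fullmatch(s)]
expanded_ok = [s for s in samples if expanded.fullmatch(s)]

print(star_ok)                 # ['', 'a', 'ab', 'bba']
print(plus_ok)                 # ['a', 'ab', 'bba']
print(plus_ok == expanded_ok)  # True
```

Only the empty string separates the two operators: the star accepts it, the plus does not.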


Wildcard

The period (.) matches any single character except the newline (\n).
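
A brief sketch of the wildcard, assuming Python's re module as a stand-in for Perl (by default the period excludes the newline in both). The pattern c.t is a made-up example.

```python
import re

results = {
    "cat": re.fullmatch(r"c.t", "cat") is not None,
    "c t": re.fullmatch(r"c.t", "c t") is not None,
    "ct": re.fullmatch(r"c.t", "ct") is not None,            # no character to consume
    "c\\nt": re.fullmatch(r"c.t", "c\nt") is not None,       # newline is excluded
}

print(results)  # {'cat': True, 'c t': True, 'ct': False, 'c\\nt': False}
```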


Anchors

Anchors are used to restrict a match to a particular position within a string.

^ anchors to the start of a string
$ anchors to the end of a string

/[Ff]red/ matches both ‘Fred’ and ‘Fred is home’
/^[Ff]red$/ matches ‘Fred’ but not ‘Fred is home’

\b anchors to a word boundary
\B anchors to a non-boundary
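
The anchors can be sketched as follows, assuming Python's re module as a stand-in for Perl. The ‘cat’ strings used for \b and \B are invented for the example.

```python
import re

unanchored = re.search(r"[Ff]red", "Fred is home")
anchored = re.search(r"^[Ff]red$", "Fred is home")
exact = re.search(r"^[Ff]red$", "Fred")

text = "cat concat cats"
# \b on both sides: 'cat' must stand alone as a word.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
# \B in front: 'cat' must be preceded by another word character.
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

print(unanchored is not None)  # True
print(anchored is not None)    # False
print(exact is not None)       # True
print(whole_word)              # [0]
print(inside_word)             # [7]
```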


Conjunction

Two regular expressions are conjoined by juxtaposition (placing the expressions side by side).

Examples:

/a/ matches ‘a’

/m/ matches ‘m’

/am/ matches ‘am’ but not ‘a’ or ‘m’ alone


Disjunction

We have already seen disjunction of characters using the square bracket notation.

General disjunction is expressed using the vertical bar (|), also called the pipe symbol.

This form of disjunction matches any one of several alternative patterns, whereas the [ ] form chooses only among single characters.
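
A minimal sketch of general disjunction, assuming Python's re module as a stand-in for Perl. The sentence is a made-up example.

```python
import re

# Each side of the bar is a complete alternative pattern.
found = re.findall(r"goose|geese", "A goose joined the geese.")

print(found)  # ['goose', 'geese']
```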


Grouping

Parentheses, ‘(’ and ‘)’, are used to group subpatterns of a larger pattern.

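Ex: /[Gg](ee|oo)se/ matches ‘goose’, ‘geese’, ‘Goose’ and ‘Geese’. The disjunction must sit inside the parentheses: | has lower precedence than juxtaposition, so /[Gg](ee)|(oo)se/ would instead match ‘Gee’ or ‘gee’, or ‘oose’.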
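
Because | binds more loosely than juxtaposition, where the parentheses go matters. The sketch below compares two placements on a few test words, assuming Python's re module as a stand-in for Perl (the precedence rules are the same).

```python
import re

words = ["goose", "geese", "Goose", "Geese", "gee", "oose"]

# Alternation inside the group: [Gg], then 'ee' or 'oo', then 'se'.
grouped = re.compile(r"[Gg](ee|oo)se")
# Alternation outside the groups: either '[Gg]ee' or 'oose'.
ungrouped = re.compile(r"[Gg](ee)|(oo)se")

grouped_ok = [w for w in words if grouped.fullmatch(w)]
ungrouped_ok = [w for w in words if ungrouped.fullmatch(w)]

print(grouped_ok)    # ['goose', 'geese', 'Goose', 'Geese']
print(ungrouped_ok)  # ['gee', 'oose']
```

The first pattern covers the four spellings from the e-mail search problem; the second does not match any of them as a whole word.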


Replacement

In addition to matching, we can do replacements when a match is found:

Example: To replace the British spelling ‘colour’ with the American spelling ‘color’, we can write:

s/colour/color/
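
For comparison, a sketch of the same substitution in Python's re module, used here as a stand-in for Perl. One difference worth noting: Perl's s/// replaces only the first match unless the g modifier is added, while Python's re.sub replaces every match unless count is given. The sample sentence is made up.

```python
import re

text = "The colour of the sky; colours fade."

# Replaces every occurrence, like Perl's s/colour/color/g.
american = re.sub(r"colour", "color", text)
# Replaces only the first occurrence, like Perl's s/colour/color/.
first_only = re.sub(r"colour", "color", text, count=1)

print(american)    # The color of the sky; colors fade.
print(first_only)  # The color of the sky; colours fade.
```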


Registers – saving matches

To save the text matched by part of a pattern, so that it can be reused later on, Perl provides registers.

Registers are named \#, where # is the number of the register: \1 refers to the text matched by the first parenthesized group, \2 to the second, and so on.

Ex.

DE DO DO DO DE DA DA DA
IS ALL I WANT TO SAY TO YOU

/(D[AEO].)*/ matches the first line loosely, as any sequence of the syllables DA, DE or DO, each followed by one more character

/(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically

This pattern also matches strings like DA DE DE DE DA DO DO DO

\s matches a whitespace character
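
A sketch of the register pattern, assuming Python's re module as a stand-in for Perl (the \1, \2, \3 notation is the same inside a pattern). The pattern is written without spaces between its parts, since a literal space in a pattern has to match a space in the text; the third test string is an invented counterexample.

```python
import re

song = re.compile(r"(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3")

line = "DE DO DO DO DE DA DA DA"
other = "DA DE DE DE DA DO DO DO"
broken = "DE DO DA DO DE DA DA DA"  # second syllable is not repeated

m = song.fullmatch(line)
saved = m.groups() if m else None

print(saved)                                  # ('DE', ' DO', ' DA')
print(song.fullmatch(other) is not None)      # True
print(song.fullmatch(broken) is not None)     # False
```

Each register holds the exact text its group matched, including the separator character captured by the period, which is why repeating \2 reproduces ‘ DO’ with its leading space.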


For more information

Perl regular expression tutorial (perlretut) – http://perldoc.perl.org/perlretut.html

Perl regular expression reference page (perlre) – http://perldoc.perl.org/perlre.html