Page 1: Finding the needle(s) in the textual haystack

Textual Patterns

Finding the needle(s) in the textual haystack

http://www.flickr.com/photos/seadigs/4810947743/sizes/z/in/photostream/

Page 2: Finding the needle(s) in the textual haystack

Patterns

Consider the text below. How would you identify…
… Proper names?
… Email addresses?
… Dates?

From: Gow, Joe <[email protected]>
Subject: Reminder About Open Forums Today
Date: March 25, 2011 8:44:08 AM CDT
Bcc: [email protected]

Hello, everyone. I just wanted to send a quick reminder about the two campus wide Open Forums we're holding today from 2 to 3 and 3 to 4 p.m. in the Cleary Center. I'll host the first session from 2 to 3, and we'll cover any topics you'd like to discuss. Then from 3 to 4 Vice Chancellor Bob Hetzel will lead a conversation about the plans for a new Cowley Science Building. Please join us!

Thanks,

Joe

Joe Gow, Chancellor
University of Wisconsin-La Crosse

Page 3: Finding the needle(s) in the textual haystack

Patterns

What do you think of when you see the following?

MM/DD/YYYY

This is a (string) pattern.

Are there different patterns for this same thing?

How would you describe the pattern of a credit card number?

Page 4: Finding the needle(s) in the textual haystack

Regular Expressions

Regular expressions are “formulas” for string patterns.

Regular expressions follow a standard notation.

Regular expressions can be used in various computer applications and programming languages.

Applying a regular expression to a string (piece of text) is called pattern matching.

- The regular expression might match the string (or part of it) or it might not.
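As a minimal sketch of pattern matching, assuming Python's standard `re` module as the tool (the slides do not name one), the snippet below applies two patterns to one string. The sample string is taken from the email header shown earlier.

```python
import re

text = "Date: March 25, 2011 8:44:08 AM CDT"

# re.search scans the string and returns a match object for the first
# place where the pattern fits, or None if it fits nowhere.
found = re.search("March", text)
missing = re.search("April", text)

print(found.group())  # the part of the string that matched
print(missing)        # None, because the pattern does not match
```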

Page 5: Finding the needle(s) in the textual haystack

Regular Expression Notation

Regular expressions use a standard pattern language.

Any (non-meta) character is a pattern. The character pattern represents itself.

The '.' (period) is a pattern. The period (a meta character) represents "any character".

If A and B are both patterns, then so are:

AB : This represents the pattern A followed by the pattern B

F. matches Fa, FR and F3 but not fa or aF

A|B : This represents either the pattern A or the pattern B

P|Q matches P and Q but not R

Parentheses are special; they form a pattern group. Anything in parentheses is a group. A group is one "thing".

(red|blue) fish matches what strings?
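The notation on this slide can be tried out directly. The sketch below assumes Python's `re` module and uses a small hypothetical helper, `matches`, that keeps only the candidates a pattern matches in full. The candidate strings `green fish` and `red` are illustrative additions.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# F followed by any single character
print(matches("F.", ["Fa", "FR", "F3", "fa", "aF"]))

# either P or Q
print(matches("P|Q", ["P", "Q", "R"]))

# a group holding two alternatives, followed by the literal text " fish"
print(matches("(red|blue) fish", ["red fish", "blue fish", "green fish", "red"]))
```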

Page 6: Finding the needle(s) in the textual haystack

Example

How would you write an expression for the time on a digital 12-hour clock?

[HINT: Let’s divide & conquer]

A regular expression matching any possible hour:
1|2|3|4|5|6|7|8|9|10|11|12

A regular expression matching any possible minute:
(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)

A regular expression matching any possible time:
(1|2|3|4|5|6|7|8|9|10|11|12):(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
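The divide-and-conquer idea for the clock can be sketched in code by building the hour and minute patterns separately and then joining them. This assumes Python's `re` module; the names `HOUR`, `MINUTE`, `CLOCK` and `is_time` are illustrative only.

```python
import re

# Each part is written exactly as an alternation of literal characters.
HOUR = "(1|2|3|4|5|6|7|8|9|10|11|12)"
MINUTE = "(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)"

# Conquer: hour, a literal colon, then minute.
CLOCK = HOUR + ":" + MINUTE

def is_time(s):
    """True if the whole string is a valid 12-hour clock reading."""
    return re.fullmatch(CLOCK, s) is not None

for candidate in ["7:05", "12:59", "0:30", "13:00", "7:60"]:
    print(candidate, is_time(candidate))
```

Using `re.fullmatch` matters here: it forces the engine to try `10`, `11` and `12` when the shorter alternative `1` does not lead to a complete match.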

Page 7: Finding the needle(s) in the textual haystack

Repeating Patterns within Patterns

Quantifiers are used to allow and constrain repetitions. If re is a regular expression (pattern), then so are:

re* represents zero or more repetitions of re

re+ represents one or more repetitions of re

re? represents zero or one occurrences of re

re{n} represents exactly n repetitions of re (n is some positive integer)

re{m,n} represents at least m and no more than n repetitions of re (m, n are positive integers, m ≤ n)

Write a regular expression for Social Security Numbers

123-45-6789
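A minimal sketch of the quantifiers, assuming Python's `re` module. The helper `full` and the short test strings are hypothetical. The last pattern is one possible answer to the Social Security Number exercise that uses only the notation introduced so far.

```python
import re

def full(pattern, s):
    """True if the whole string s fits the pattern."""
    return re.fullmatch(pattern, s) is not None

print(full("ab*", "a"))         # * allows zero b's
print(full("ab+", "a"))         # + needs at least one b
print(full("ab?", "abb"))       # ? allows at most one b
print(full("(ab){2}", "abab"))  # the group must appear exactly twice
print(full("a{2,3}", "aaaa"))   # four a's exceed the maximum of three

# One possible SSN pattern: 3 digits, dash, 2 digits, dash, 4 digits.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"
SSN = DIGIT + "{3}-" + DIGIT + "{2}-" + DIGIT + "{4}"
print(full(SSN, "123-45-6789"))
```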

Page 8: Finding the needle(s) in the textual haystack

Example

• Text: I sometimes wonder if the manufacturers of foolproof items keep a fool or two on their payroll.

Pattern: o{2}1?
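The pattern can be run against the text with Python's `re.findall`, which returns every non-overlapping match. As printed, the pattern ends in the digit 1, which never occurs in the text. It may have been intended as the lowercase letter l; that reading is an assumption, so the sketch below runs both versions.

```python
import re

text = ("I sometimes wonder if the manufacturers of foolproof items "
        "keep a fool or two on their payroll.")

# As printed: two o's, optionally followed by the digit 1.
as_printed = re.findall("o{2}1?", text)

# Assumed alternative: two o's, optionally followed by the letter l.
with_letter_l = re.findall("o{2}l?", text)

print(as_printed)
print(with_letter_l)
```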

Page 9: Finding the needle(s) in the textual haystack

Escaped Characters

Some characters have special meaning in regular expressions, and others have no printable form. Such characters can still be represented using a 2-character notation, known as an escape code.

\+ represents +

\. represents .

The same technique works for * ? ( ) { } [ ] \ ^ $ |

\n represents the new line character

\t represents the tab character

\r represents the carriage return character

\v represents the vertical tab character

\f represents the form feed character
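A short sketch of escape codes in practice, assuming Python's `re` module. Python raw strings (the `r"..."` prefix) are used so that each backslash reaches the regular expression engine unchanged. The sample strings are made up for illustration.

```python
import re

# An escaped + matches a literal plus sign.
plus_signs = re.findall(r"\+", "1+1=2")

# Escaped periods match only real periods; unescaped ones match anything.
escaped = re.findall(r"B\.C\.", "500 B.C. and BxCx")
unescaped = re.findall(r"B.C.", "500 B.C. and BxCx")

# \t matches the tab character.
fields = re.split(r"\t", "a\tb\tc")

# re.escape adds the backslashes for you when the text is meant literally.
literal = re.escape("1+1=2")

print(plus_signs, escaped, unescaped, fields)
print(re.fullmatch(literal, "1+1=2") is not None)
```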

Page 10: Finding the needle(s) in the textual haystack

Location Symbols

There are also two “location” symbols.

^ matches the start of a line, including right after \n

$ matches the end of a line, including right before \n
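How the location symbols behave depends on the tool. In Python's `re` module, for instance, the per-line behaviour described here is switched on with the `re.MULTILINE` flag; without it, `^` and `$` refer to the start and end of the whole string. The three-line sample text below is made up for illustration.

```python
import re

text = "Right now.\nNot right now.\nRight now."

# Default mode: ^ is the start of the string, $ is the end of the string.
whole_string = re.findall(r"^Right now\.$", text)

# MULTILINE mode: ^ and $ also match right after and right before \n.
per_line = re.findall(r"^Right now\.$", text, flags=re.MULTILINE)

print(whole_string)
print(per_line)
```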

Page 11: Finding the needle(s) in the textual haystack

Sample Regular Expressions

(snow|rain)(flake|drop)

g(rr|ee)*

W.*W

B\.C\.

^Right now.$

^Right now.\$
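To explore what the sample expressions accept, the sketch below checks a few hand-picked strings against each one with Python's `re.fullmatch`. The candidate strings and the helper name `accepts` are hypothetical choices for illustration.

```python
import re

def accepts(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

print(accepts(r"(snow|rain)(flake|drop)", "raindrop"))  # one word from each group
print(accepts(r"g(rr|ee)*", "grree"))                   # g, then any mix of rr and ee
print(accepts(r"g(rr|ee)*", "gre"))                     # a lone r or e is not allowed
print(accepts(r"W.*W", "WOW"))                          # anything between two W's
print(accepts(r"B\.C\.", "BxCx"))                       # escaped periods are literal
print(accepts(r"^Right now.$", "Right now!"))           # the period is a wildcard
print(accepts(r"^Right now.\$", "Right now.$"))         # \$ is a literal dollar sign
```

The last two differ only in the escape: an unescaped `$` marks the end of the line, while `\$` requires an actual dollar sign in the text.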

Page 12: Finding the needle(s) in the textual haystack

Character Classes

Square brackets enclose a character class (a set of characters). The class will match any one character from the set. Within brackets…

• specific characters can be listed
• ranges are denoted using -

Examples

[aDb] matches a or D or b and nothing else

[c-e] matches c or d or e and nothing else

[a-z] matches any lowercase letter and nothing else

[a-zA-Z0-9] matches any alphabetic or numeric symbol

[a+*] matches a or + or * and nothing else
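The character class examples can be checked with Python's `re.findall`, which lists each character of a string that the class matches. The input strings are made up for illustration.

```python
import re

listed = re.findall("[aDb]", "abcD")          # only a, D and b are in the set
ranged = re.findall("[c-e]", "abcdef")        # the range c through e
alnum = re.findall("[a-zA-Z0-9]", "R2-D2!")   # letters and digits only
symbols = re.findall("[a+*]", "a+b*c")        # + and * are ordinary inside brackets

print(listed, ranged, alnum, symbols)
```

Note that `+` and `*` lose their quantifier meaning inside square brackets, so they need no escape there.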

Page 13: Finding the needle(s) in the textual haystack

Examples

Which of the following match [a-z][0-9]*
abc1z93a-9

Which of the following match [0-9]*[02468]
039929354

Give a pattern for social security numbers using character classes.
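One possible answer to the Social Security Number exercise, sketched with Python's `re` module, together with a quick way to test candidates against the two patterns on this slide. The candidate strings are hypothetical, chosen for illustration, and `full` is an illustrative helper.

```python
import re

def full(pattern, s):
    """True if the whole string s fits the pattern."""
    return re.fullmatch(pattern, s) is not None

# A lowercase letter followed by any number of digits.
print(full("[a-z][0-9]*", "z93"), full("[a-z][0-9]*", "a-9"))

# Any digits ending in an even digit, i.e. an even number.
print(full("[0-9]*[02468]", "354"), full("[0-9]*[02468]", "039"))

# SSN with character classes: far shorter than spelling out every digit.
SSN = "[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(full(SSN, "123-45-6789"))
```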

Page 14: Finding the needle(s) in the textual haystack

Example 1: Phone Numbers

Create a regular expression to match phone numbers. The phone numbers can take on the following forms:

800-555-1212
800 555 1212
800.555.1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234

Page 15: Finding the needle(s) in the textual haystack

Example 1: Phone Numbers

• Divide and conquer

Note that each phone number has at most five parts.
• prefix (the number 1)
• area code
• trunk (first three digits)
• rest (next 4 digits)
• extension (last digits; may be between 1 and 4 digits long)

• Consider defining each of these parts
– what is the prefix?
– what is the area code?
– what is the trunk?
– what is the rest?
– what is the extension?

Page 16: Finding the needle(s) in the textual haystack

Example 1: Phone Numbers

• We need to 'conquer' by combining the solutions for the parts.
• Rules:
– The prefix is optional
– One of the following must occur between the prefix and the area code: space, comma, dash, period
– One of the following must occur between the area code and the trunk: space, comma, dash, period
– One of the following must occur between the trunk and the rest: space, comma, dash, period
– An ‘x’ must occur between the rest and the extension.
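The rules above can be combined into one pattern. The following is a sketch of one possible solution in Python, not the only one; the names `SEP`, `PHONE` and `is_phone` are illustrative, and treating the extension as optional is an assumption drawn from the sample numbers that have none.

```python
import re

SEP = r"[ ,.-]"        # space, comma, period or dash (dash last, so it is literal)
PREFIX = "1"
AREA = r"[0-9]{3}"
TRUNK = r"[0-9]{3}"
REST = r"[0-9]{4}"
EXT = r"[0-9]{1,4}"    # extension of 1 to 4 digits

# Optional prefix + separator, then area, trunk and rest joined by
# separators, then an optional x followed by the extension.
PHONE = ("(" + PREFIX + SEP + ")?" + AREA + SEP + TRUNK + SEP + REST
         + "(x" + EXT + ")?")

def is_phone(s):
    """True if the whole string is a phone number under the stated rules."""
    return re.fullmatch(PHONE, s) is not None

for number in ["800-555-1212", "800 555 1212", "800.555.1212",
               "1-800-555-1212", "800-555-1212x1234", "800-555-1212-1234"]:
    print(number, is_phone(number))
```

Under the rule that an 'x' must precede the extension, the listed form 800-555-1212-1234 is rejected by this sketch; accepting it would mean also allowing a separator before the extension.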

Page 17: Finding the needle(s) in the textual haystack

Example 2: User Name

Suppose the rules for some system are that a user name must begin with a capital letter, followed by lowercase letters and/or dashes and/or periods. User names must be 3 to 16 characters long.

Examples
Dave
D.-riley
Rdave

Invalid
dave doesn’t begin with a capital letter

DDR3 capital letters and digits not permitted after first symbol

R too short
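One way to express these user name rules, sketched in Python: a capital letter, then 2 to 15 further characters drawn from lowercase letters, periods and dashes, which gives a total length of 3 to 16. The names `USER_NAME` and `is_user_name` are illustrative.

```python
import re

# First character: a capital letter.
# Remaining 2 to 15 characters: lowercase letters, periods or dashes.
USER_NAME = r"[A-Z][a-z.-]{2,15}"

def is_user_name(s):
    """True if the whole string is a valid user name under the stated rules."""
    return re.fullmatch(USER_NAME, s) is not None

for name in ["Dave", "D.-riley", "Rdave", "dave", "DDR3", "R"]:
    print(name, is_user_name(name))
```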

Page 18: Finding the needle(s) in the textual haystack

Example 3: MAC Address

Every computer network connection has a unique MAC address that is expressed as six numbers separated by colons. Each number consists of two hexadecimal digits.

Examples
10:22:93:04:91:00
AF:0C:AA:ED:B7:21

Invalid
10:22:93:04:91 too short

10:22:013:04:91 numbers must be two digits long, not three

AG:0C:AA:ED:B7:21 the letter “G” is not a hexadecimal digit
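A possible MAC address pattern, sketched in Python: one two-digit hexadecimal number, followed by exactly five more, each preceded by a colon. Accepting only uppercase A through F is an assumption based on the examples; adding `a-f` to the class would allow lowercase too.

```python
import re

HEX_PAIR = r"[0-9A-F]{2}"                  # two hexadecimal digits
MAC = HEX_PAIR + "(:" + HEX_PAIR + "){5}"  # six pairs joined by colons

def is_mac(s):
    """True if the whole string is a MAC address in the format above."""
    return re.fullmatch(MAC, s) is not None

for address in ["10:22:93:04:91:00", "AF:0C:AA:ED:B7:21",
                "10:22:93:04:91", "10:22:013:04:91", "AG:0C:AA:ED:B7:21"]:
    print(address, is_mac(address))
```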

Page 19: Finding the needle(s) in the textual haystack

Example 4: IPv4

Internet addresses are referred to as IP numbers. A common address consists of four non-negative integers separated by periods. These integers must each be within the range of 0-255.

Examples
1.01.001.0
255.255.255.255
193.24.17.2

Invalid
256.255.255.255 no number can be greater than 255

193.24.175. too few numbers

193.24:17.2 separators must be periods
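A range such as 0-255 cannot be written as a single character class, so it has to be split into cases by digit pattern. The sketch below is one possible way to do that in Python; the names `OCTET`, `IPV4` and `is_ipv4` are illustrative, and allowing leading zeros follows the example 1.01.001.0.

```python
import re

# One number from 0 to 255, at most three digits, leading zeros allowed:
#   25[0-5]            250-255
#   2[0-4][0-9]        200-249
#   [01]?[0-9]?[0-9]   0-199, written with one, two or three digits
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"

# Four numbers separated by literal periods.
IPV4 = OCTET + r"(\." + OCTET + r"){3}"

def is_ipv4(s):
    """True if the whole string is four period-separated numbers in 0-255."""
    return re.fullmatch(IPV4, s) is not None

for address in ["1.01.001.0", "255.255.255.255", "193.24.17.2",
                "256.255.255.255", "193.24.175.", "193.24:17.2"]:
    print(address, is_ipv4(address))
```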

Page 20: Finding the needle(s) in the textual haystack

Example 5: Email Addresses

• An email address consists of two strings separated by an @: localString @ domainString
• localString
– Must be one or more of the following characters: alphabetic, digits (0 through 9), or any of these !#$&’+-_/=?^`{|}~
– Periods are permitted but with the following restrictions: the first and last characters cannot be periods and there cannot be any consecutive periods.
– Note: There is another unusual notation for selected characters only allowed inside double quotes, which we will ignore.
• domainString
– Must be one or more of the following characters: alphabetic, digits, dashes or periods.
– Alternately, the domain could be written as a pair of square brackets enclosing four numbers separated by periods, where each of the four numbers is a non-negative number of one to three digits, e.g., [138.93.200.0]
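The email rules can be assembled piece by piece, in the same divide-and-conquer style as the earlier examples. The sketch below is one possible reading in Python, not a full validator: the names are illustrative, the apostrophe in the symbol list is read as a plain `'`, and the test addresses use the placeholder domain example.com.

```python
import re

# A run of permitted local characters, with no periods in it.
ATOM = r"[A-Za-z0-9!#$&'+\-_/=?^`{|}~]+"

# Runs joined by single periods: no leading, trailing or doubled periods.
LOCAL = ATOM + r"(\." + ATOM + ")*"

# Domain form 1: letters, digits, dashes and periods.
DOMAIN_NAME = r"[A-Za-z0-9.-]+"

# Domain form 2: [n.n.n.n] where each n has one to three digits.
NUM = r"[0-9]{1,3}"
DOMAIN_LITERAL = r"\[" + NUM + r"(\." + NUM + r"){3}\]"

EMAIL = LOCAL + "@(" + DOMAIN_NAME + "|" + DOMAIN_LITERAL + ")"

def is_email(s):
    """True if the whole string fits the simplified rules above."""
    return re.fullmatch(EMAIL, s) is not None

for address in ["user.name@example.com", "user@[138.93.200.0]",
                ".user@example.com", "user..name@example.com"]:
    print(address, is_email(address))
```

Writing the local part as runs of characters joined by single periods enforces all three period restrictions at once, without needing any extra notation.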