27
The Nearshore Advantage! Regular expressions Paolo Carrasco

Regular expressions

Embed Size (px)

DESCRIPTION

Trainning given a few years ago about Regular Expressions

Citation preview

Page 1: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Regular expressionsPaolo Carrasco

Page 2: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Agenda• Definition• How a regex engine works• Applications

– Matching• Regex Pattern• Characters

– Literal Characters– Special characters – Character classes/sets

• Anchors • Repetition and optional items

– Extracting• Grouping• Alternation• Groups

– Replacing• Advanced topics

– Greediness– Lookahead and lookbehind

• Techniques for Faster Expressions

Page 3: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Definition

• Basically, a regular expression is a pattern describing a certain amount of text.

• Their name comes from the mathematical theory on which they are based.

• Regular expressions provide a powerful, flexible, and efficient method for processing text.

• Sometimes it is also named regex or regexp.

Page 4: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

How a regex engine works

Regex Engine

The centerpiece of text processing with regular expressions is the regex engine.A regex "engine" is a piece of software that can process regular expressions, trying to match the pattern to the given string. Usually, the engine is part of a larger application and you do not access the engine directly.

API

Page 5: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

How a regex engine works

Regex Engine

Replace-ment

InputPattern

At a minimum, processing text using

regular expressions requires that the

engine be provided with the

following two items of information:

•The regular expression pattern to

identify in the text.

•The text to parse for the regular

expression pattern.

• (Optionally) A replacement string.

Page 6: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Applications

Matching

Extracting

Replacing

• The extensive pattern-matching

notation of regular expressions enables

you to quickly parse large amounts of

text to find specific character patterns;

to validate text to ensure that it

matches a predefined pattern; to

extract, edit, replace, or delete text

substrings; and to add the extracted

strings to a collection in order to

generate a report.

Page 7: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Applications: MatchingDetermine whether a string matches a regular expression pattern.

Page 8: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Regular Expression Pattern

Pattern

Literal Characters

Operators

Constructs

A regular expression is a pattern that the regular expression engine attempts to match in input text. A pattern consists of one or more character literals, operators, or constructs.

Page 9: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Characters: Literal Chars

• All characters except [\^$.|?*+() refer to the simple meaning of characters.

• All characters except the listed special characters match a single instance of themselves.

• Note: The regex engines are case sensitive by default.

Page 10: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Characters: Special chars or metacharacters

• Most of the times we will need special characters for complex matches.

• These characters are often called metacharacters.

[ \ ^ $ . | ? * + ( ) • The dot or period is one of

the most commonly used. Unfortunately, it is also the most commonly misused metacharacter.

• When a special char is needed as literal, it requires a backslash followed by any of metacharacters.

• A backslash escapes special characters to suppress their special meaning.

Page 11: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Escaped character Description Pattern Matches

\t Matches a tab, \u0009.

(\w+)\t "item1\t", "item2\t" in "item1\titem2\t"

\r Matches a carriage return, \u000D. (\r is not equivalent to the newline character, \n.)

\r\n(\w+) "\r\nThese" in "\r\nThese are\ntwo lines."

\n Matches a new line, \u000A.

\r\n(\w+) "\r\nThese" in "\r\nThese are\ntwo lines."

\s Matches an empty space

\w+\s\w+ “Hello world” in “Hello world”

Characters: Non-printable or escaped chars

Page 12: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Char classes/setsA character class matches any one of a set of characters.The order of the characters inside a character class does not matter.

• A character in the input string must match one of a specified set of characters.

Positive character

group

• A character in the input string must not match one of a specified set of characters.

Negative character

group

• The dot or period character is a wildcard character that matches any character except \n

Any character

Shorthand classes• Since certain character

classes are used often, a series of shorthand character classes are available.

• Shorthand character classes can be used both inside and outside the square brackets.

• Some shorthand have negated versions.

Page 13: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

AnchorsAnchor Description

^ The match must occur at the beginning of the string or line.

$ The match must occur at the end of the string or line, or before \n at the end of the string or line.

\b The match must occur on a word boundary.

\B The match must not occur on a word boundary.

• Anchors match a position before, after or between characters.

• They can be used to "anchor" the regex match at a certain position.

Page 14: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Repetition and optional items

• Match-zero-or-more*• Match-one-or-more+• Match-zero-or-one?• Interval{ }

Page 15: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Hands On Lab 1Processing which emails match a request.

Page 16: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Applications: ExtractingExtract a single match or the first match or all matches.

Page 17: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Grouping

• A group, also known as a subexpression, consists of an "open-group operator", any number of other operators, and a "close-group operator". (blahblahblahblahblahblahblahblahblah)

• Regex treats this sequence as a unit, just as mathematics and programming languages treat a parenthesized expression as a unit.

open-group-operator close-group-operator

Page 18: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Grouping: Alternation

• Alternation match one of a choice of regular expressions:– If you put the character(s) representing the

alternation operator between any two regular expressions A and B, the result matches the union of the strings that A and B match.

• It operates on the largest possible surrounding regular expressions. Thus, the only way you can delimit its arguments is to use grouping.

Page 19: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Grouping: Groups

• By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together.

Page 20: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Applications: ReplacingReplace a matched substring.

Page 21: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Replacement TextSintax

Feature .NET Java Perl ECMA Ruby

$& (whole regex match) YES error YES YES no

$0 (whole regex match) YES YES no no no

$1 through $99 (backreference) YES YES YES YES no

${1} through ${99} (backreference) YES error YES no no

${group} (named backreference) YES error no no no

$` (backtick; subject text to the left of the match)

YES error YES YES no

$' (straight quote; subject text to the right of the match)

YES error YES YES no

$_ (entire subject string) YES error YES IE only no

$+ (highest-numbered group in the regex)

YES error noIE and Firefox

no

$$ (escape dollar with another dollar) YES error no YES no

$ (unescaped dollar as literal text) YES error no YES YES

Page 22: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Advanced Topics"Everything should be made as simple as possible, but not simpler."

Page 23: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Greediness

• The repetition operators or quantifiers are greedy. • They will expand the match as far as they can, and

only give back if they must to satisfy the remainder of the regex.

• The quick fix to this problem is to make the quantifier lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do this by putting a question mark behind the plus in the regex.

Page 24: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Lookahead and lookbehind

Negative lookahead• It is indispensable if you

want to match something not followed by something else.

Positive lookahead• It is indispensable if you

want to match something not followed by something else.

Collectively, these are called "lookaround". They do not consume characters in the string, but only assert whether a match is possible or not.

Page 25: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Techniques for Faster Expressions•Avoid recompiling •Use non-capturing parentheses •Don't add superfluous parentheses •Don't use superfluous character classes •Use leading anchors

Common Sense Techniques

•Expose ^ and \G at the front of expressions •Expose $ at the end of expressions Expose Anchors

•The repetition operators or quantifiers are greedy.Lazy Versus Greedy: Be

Specific

•Put the most likely alternative first

•Distribute into the end of alternation

Lead the Engine to a Match

Page 26: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Hands On Lab 2Extract data from email and format it.

Page 27: Regular expressions

The Nearshore Advantage!The Nearshore Advantage!

Resources and Tools• Regex Testers

– http://www.gskinner.com/RegExr/– http://osteele.com/tools/rework/

• Regex Patterns Library– http://regexlib.com

• Complete Tutorials– http://www.regular-expressions.info/– Javascript:

• http://www.w3schools.com/jsref/jsref_obj_regexp.asp

– .NET: • http://msdn.microsoft.com/en-us/library/hs600312.aspx

– Java: • http://java.sun.com/docs/books/tutorial/essential/regex/index.html