
GREP

What's Grep?

Grep is a popular Unix program that supports a special pattern language for writing regular expressions

The regular expression grammar used by most software is based on grep's; Perl extends it further.

Regular Expression (Search String)

Compiles

Engine parses your search string

produces a state machine

Searches

Input sent into State Machine

Conceptually, 1 shape/letter at a time

Found: The State Machine Object changes state (in this example it is set from FALSE to TRUE)

User checks machine state when it completes running
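
The compile-then-search flow above can be sketched with Python's standard `re` module. This is only an illustration of the idea, not how grep itself is implemented; the pattern and input are the "john" example used in these slides, and the variable names are hypothetical.

```python
import re

# Compile: the engine parses the search string into a reusable pattern object
# (it plays the role of the state machine described above).
pattern = re.compile(r"john.*")

# Search: the input text is fed to the compiled pattern.
result = pattern.search("john was here")

# Check the outcome once the search has finished running.
found = result is not None
print(found)  # True
```

Compiling once and reusing the pattern object avoids re-parsing the search string for every input.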

Grep Expressions

The “grep” language is used to apply Regular Expressions in text processing

“Grep pattern” is another name for what are called “Regular Expressions”

Grep Expressions

A string of text to match, with special characters

“john.*”

would return True on a search of: “john was here”

Grep Expressions

“.*\.txt”

.* is anything (.) any length (*)

\. is literally a . (the \ before it means the next character is literal; that is, not special)

txt is just letter matching

This would pick out txt files (for example, from a list of file names)

It's similar to what you see in Windows, but it's not the same: it's more powerful than the simple “wildcards” (*) you often see.
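
As a rough sketch of the “.*\.txt” pattern at work, the Python snippet below filters a list of file names. The file names are made up for illustration.

```python
import re

# Hypothetical file names, used only for illustration.
names = ["notes.txt", "photo.jpg", "report.txt", "txt_backup.zip"]

# .* = anything, any length; \. = a literal dot; txt = plain letters.
pattern = re.compile(r".*\.txt")

txt_files = [name for name in names if pattern.search(name)]
print(txt_files)  # ['notes.txt', 'report.txt']
```

Because the pattern does not end with $, a name such as "notes.txt.bak" would also be accepted; adding $ at the end restricts the match to names that actually end in .txt.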

Special Chars

. = any single character

^ = beginning of a line

$ = end of line

\w = word characters (letters, digits, and underscore)

\d = decimal digits (0-9)

\ = escape character

Backslash \ (leans to the left)

most popular escape character

Uses:

sneak past illegal characters (e.g. \. for a literal dot)

make secret code characters (e.g. \d, \n)

Data encodings always have them

Examples

... = three of ANYTHING

\d\d\d = three digits (decimals)

remember the \ is the escape code

\w\w\w = three word characters (letters, digits, or underscore; no symbols)

good: abc, a34

bad: ab!, a b
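
These three-character patterns can be checked quickly in Python. This is a small sketch; `matches_whole` is a hypothetical helper, and `re.fullmatch` is used so the whole input has to fit the pattern.

```python
import re

def matches_whole(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r"...", "a#9"))     # True: any three characters
print(matches_whole(r"\d\d\d", "123"))  # True
print(matches_whole(r"\d\d\d", "12a"))  # False: a is not a digit
print(matches_whole(r"\w\w\w", "abc"))  # True
print(matches_whole(r"\w\w\w", "a34"))  # True: in Python, \w also accepts digits
print(matches_whole(r"\w\w\w", "ab!"))  # False: ! is a symbol
```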

Approach

searching for “john” or “joan”

What is the difference between them?

jo_n

what symbol works?

jo\wn

jo.n
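
Both candidate patterns can be compared side by side. In this Python sketch the list of names is invented for illustration; it shows that jo.n is the looser of the two.

```python
import re

# Hypothetical inputs: two real names plus two near misses.
names = ["john", "joan", "jo-n", "jon"]

with_word_char = [n for n in names if re.fullmatch(r"jo\wn", n)]
with_any_char = [n for n in names if re.fullmatch(r"jo.n", n)]

print(with_word_char)  # ['john', 'joan']
print(with_any_char)   # ['john', 'joan', 'jo-n']
```

Both patterns find “john” and “joan”; jo\wn is the safer choice when the middle character must be a letter or digit.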

Special Chars

\D = non-digits

\W = non-word characters

\s = white space

\S = non white space

\n = new line (return/enter key)

\t = tab

\s\s\s = three whitespace characters

tabs, spaces, possibly newlines

\D\s\W = non-digit, whitespace, non-word

Examples:

x !, ! !, = ?, A <tab> # (the last character cannot be a letter or digit, so “x 4” would not match)
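
A short Python check of these negated and whitespace classes, as a sketch; the inputs are chosen for illustration and `re.fullmatch` requires the whole input to fit.

```python
import re

three_spaces = re.compile(r"\s\s\s")
mixed = re.compile(r"\D\s\W")

print(bool(three_spaces.fullmatch(" \t\n")))  # True: space, tab, newline
print(bool(three_spaces.fullmatch(" a ")))    # False: a is not whitespace
print(bool(mixed.fullmatch("! !")))           # True
print(bool(mixed.fullmatch("x\t?")))          # True: tab counts as whitespace
print(bool(mixed.fullmatch("4 !")))           # False: 4 is a digit
```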

Quantity Chars

* = 0 or more

? = 0 or 1

+ = 1 or more

[] = any one of the chars inside the brackets, e.g. [abc]

[^] = any one char that is NOT inside the brackets

[a-zA-Z] = ranges of chars

Examples

X+ = 1 or more X

XXX

[XYZ] = any 1 of these chars

X, Y, Z

[XYZxyz]+ = 1+ of any of these

y, XYz, zYZZyX, ZZzzzzz
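
The quantity characters and bracket sets can be tried out as follows. This is a Python sketch; the sample strings reuse the ones above plus two made-up counterexamples.

```python
import re

one_or_more = re.compile(r"[XYZxyz]+")

# The last two samples are hypothetical counterexamples.
samples = ["y", "XYz", "zYZZyX", "ZZzzzzz", "", "XaZ"]
accepted = [s for s in samples if one_or_more.fullmatch(s)]
print(accepted)  # ['y', 'XYz', 'zYZZyX', 'ZZzzzzz']

print(bool(re.fullmatch(r"X+", "XXX")))     # True
print(bool(re.fullmatch(r"X*", "")))        # True: * allows zero
print(bool(re.fullmatch(r"X?", "XX")))      # False: ? allows at most one
print(bool(re.fullmatch(r"[^XYZ]", "a")))   # True: a is not in the set
```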

EXAMPLES

[a-zA-Z0-9] = any single letter or digit (no spaces or symbols)

\.?$ = maybe ends with a .

remember: $ is end of line

.* = 0 to ∞ of any character

[^abc]* = 0 to ∞ of anything but lowercase a, b, or c
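
A brief Python sketch of these examples; the input strings are invented for illustration.

```python
import re

# \.?$ : an optional literal dot right before the end of the line.
ends = re.compile(r"\.?$")
with_dot = ends.search("The end.").group()
without_dot = ends.search("The end").group()
print(repr(with_dot))     # '.'
print(repr(without_dot))  # ''

# [^abc]* : zero or more characters, none of them a, b, or c.
not_abc = re.compile(r"[^abc]*")
print(bool(not_abc.fullmatch("xyz 123")))  # True
print(bool(not_abc.fullmatch("")))         # True: zero characters is allowed
print(bool(not_abc.fullmatch("cab")))      # False

# [a-zA-Z0-9] : exactly one letter or digit.
print(bool(re.fullmatch(r"[a-zA-Z0-9]", "7")))  # True
print(bool(re.fullmatch(r"[a-zA-Z0-9]", " ")))  # False
```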

Problems

Unicode vs ASCII

The regular expression language is older than Unicode

Many new engines support Unicode

Minor extensions to the language will be required for full Unicode support

Options

RegExp Engines typically have options

ignoreCase

saves you from doing [Aa] for each

global

keeps searching after a match is found, until the end of the input; by default, it stops at the 1st match (useful for replace)
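
How these options look in practice depends on the engine. The sketch below uses Python's `re` module, where ignoreCase is the `re.IGNORECASE` flag; Python has no "global" flag, so the `count` argument of `re.sub` stands in for it here. The sample text is made up.

```python
import re

text = "John met john and JOHN."

# ignoreCase: one flag instead of writing [Jj][Oo][Hh][Nn].
case_sensitive = re.findall(r"john", text)
case_insensitive = re.findall(r"john", text, re.IGNORECASE)
print(case_sensitive)    # ['john']
print(case_insensitive)  # ['John', 'john', 'JOHN']

# "global": replace only the first match, or every match.
first_only = re.sub(r"john", "X", text, count=1, flags=re.IGNORECASE)
every_match = re.sub(r"john", "X", text, flags=re.IGNORECASE)
print(first_only)   # X met john and JOHN.
print(every_match)  # X met X and X.
```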

Options

multiline

Most engines break up the input into lines:

At the end of a line, the engine resets for the next line

This option would make it ignore line endings (unless you use ^ or $, which refer to the beginning and end of lines)
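
Line handling differs between engines, so the following is only a sketch of how one engine, Python's `re` module, behaves; its flag names (`re.MULTILINE`, `re.DOTALL`) and exact semantics are Python's and may not match other tools. The sample text is invented.

```python
import re

text = "alpha one\nbeta two"

# Without a flag, ^ only matches at the very start of the input.
start_default = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches at the start of every line.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
print(start_default)    # ['alpha']
print(start_multiline)  # ['alpha', 'beta']

# By default . does not match a newline; re.DOTALL lets a match cross lines.
crosses_default = re.search(r"one.beta", text) is not None
crosses_dotall = re.search(r"one.beta", text, re.DOTALL) is not None
print(crosses_default)  # False
print(crosses_dotall)   # True
```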

/Common Use/

/string/ delimiters are similar to “quotes” on strings

if you use a “string” instead, you must escape each backslash:

/\d\d/ (2-digit pattern between slashes)

vs

“\\d\\d” (the same 2-digit pattern as a quoted string)
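
Python has no /pattern/ literal, but its raw strings serve a similar purpose. The sketch below shows the same escaping issue, with a made-up input string.

```python
import re

escaped = "\\d\\d"  # ordinary string: each backslash must be doubled
raw = r"\d\d"       # raw string: backslashes are left alone

same_pattern = escaped == raw
print(same_pattern)  # True

two_digits = re.search(raw, "room 42").group()
print(two_digits)  # 42
```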
