GREP
What's Grep?
Grep is a popular Unix program that searches text using a special pattern language known as regular expressions
The regular expression grammar used by most software is based on grep's; Perl extends it further.
Regular Expression (Search String)
Compiles
Engine parses your search string
Produces a state machine (initial state: FALSE)
Searches
Input is sent into the state machine
Conceptually, 1 shape/letter at a time
Found: the state machine object changes state (in this example it is set to TRUE)
User checks the machine state when it completes running
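The compile, search, and check steps can be sketched with Python's built-in re module. This is only one engine's take on the idea; the search string and input text are made-up examples.

```python
import re

# Compile: the engine parses the search string into a matcher object.
pattern = re.compile(r"john")

# Search: the input text is fed to the compiled matcher.
result = pattern.search("john was here")

# Check: the caller inspects the outcome once the search has finished.
found = result is not None
print(found)  # True
```

Here `found` plays the role of the TRUE/FALSE state in the diagram.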
Grep Expressions
The “grep” language is used for doing Regular Expressions in text processing
“Grep pattern” is another name for what are called “Regular Expressions”
Grep Expressions
A string of text to match with special characters
“john.*”
would return True on a search of: “john was here”
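A quick check of this pattern, sketched in Python's re module (an assumption; any engine with the same syntax should behave alike). The second input string is a made-up counterexample.

```python
import re

match = re.search(r"john.*", "john was here")
print(match is not None)  # True
print(match.group(0))     # john was here

# A text that does not contain "john" gives no match at all.
no_match = re.search(r"john.*", "joan was here")
print(no_match is None)   # True
```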
Grep Expressions
“.*\.txt”
.* is anything (.) of any length (*)
\. is literally a . (the \ before it means the next character is literal; that is, not special)
txt is just letter matching
This would pick out the txt files
It's similar to what you see in Windows, but it's not the same: it's more powerful than the simple “wildcards” (*) you often see.
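A small sketch of the file name filter in Python's re module. The file names are invented, and `fullmatch` is used here (an assumption about the intent) so that the pattern has to cover the whole name.

```python
import re

names = ["notes.txt", "photo.png", "readme.txt", "txt_notes.doc"]
txt_pattern = re.compile(r".*\.txt")

# Keep only the names that the pattern matches from start to end.
txt_files = [n for n in names if txt_pattern.fullmatch(n)]
print(txt_files)  # ['notes.txt', 'readme.txt']

# Without the backslash, the dot matches any character, not just a literal dot.
loose = re.fullmatch(r".*.txt", "notes_txt") is not None
strict = re.fullmatch(r".*\.txt", "notes_txt") is not None
print(loose, strict)  # True False
```

The last two lines show why the escape matters: the unescaped dot happily accepts the underscore.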
Special Chars
. = any single character
^ = beginning of a line
$ = end of line
\w = word characters (letters, digits, and underscore)
\d = decimal digits (0-9)
\ = escape char
Backslash \ (leans to the left)
most popular escape character
Uses:
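sneak past illegal characters (treat a special character as a plain one)
make secret code characters (give a plain letter a special meaning, as in \d)
Data encodings almost always have one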
Examples
... = three of ANYTHING
\d\d\d = three numbers (decimal digits)
remember the \ is the escape code
\w\w\w = three word characters (letters or digits, no symbols)
good: abc, a34
bad: ab!
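These three patterns can be tried out with a short Python sketch. The helper `matches_all` is a hypothetical name introduced for illustration. Note that in Python's re module \w accepts digits and underscore as well as letters, so a letters-only test needs a class such as [a-zA-Z].

```python
import re

def matches_all(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches_all(r"...", "a!7"))     # True: any three characters
print(matches_all(r"\d\d\d", "123"))  # True
print(matches_all(r"\d\d\d", "12a"))  # False: "a" is not a digit
print(matches_all(r"\w\w\w", "abc"))  # True
print(matches_all(r"\w\w\w", "a34"))  # True: digits count as word characters
print(matches_all(r"\w\w\w", "ab!"))  # False: "!" is a symbol
print(matches_all(r"[a-zA-Z][a-zA-Z][a-zA-Z]", "a34"))  # False: letters only
```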
Approach
searching for “john” or “joan”
What is the difference between them?
jo_n
what symbol works?
jo\wn
jo.n
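Both candidates can be compared in a short Python sketch. The extra test names and the third pattern, jo[ha]n, are additions for illustration; it uses the bracket notation and is the tightest of the three.

```python
import re

names = ["john", "joan", "jon", "jo-n", "jo3n"]

# Record which names each candidate pattern accepts in full.
hits = {}
for pattern in (r"jo\wn", r"jo.n", r"jo[ha]n"):
    hits[pattern] = [n for n in names if re.fullmatch(pattern, n)]
    print(pattern, hits[pattern])
```

All three accept “john” and “joan”. They differ in how many unwanted names they let through: jo.n also accepts “jo-n” and “jo3n”, jo\wn still accepts “jo3n”, and jo[ha]n accepts neither.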
Special Chars
\D = non-numbers (any character that is not a digit)
\W = non-word characters
\s = white space
\S = non white space
\n = new line (return/enter key)
\t = tab
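\s\s\s = three whitespace characters
tabs, spaces, possibly newlines
\D\s\W = non-decimal, whitespace, non-word
Examples:
good: ! !
bad: x 4, = 4, A <tab> 5 (each ends in a digit, and a digit is a word character, so \W rejects it)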
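A Python sketch for checking candidate strings against \D\s\W. The sample “x<tab>-” is an added example. In Python's re module a digit counts as a word character, so \W rejects any sample whose last character is a digit.

```python
import re

pattern = re.compile(r"\D\s\W")
samples = ["! !", "x 4", "= 4", "A\t5", "x\t-"]

# True means the whole three-character sample matches the pattern.
results = {s: pattern.fullmatch(s) is not None for s in samples}
for sample, ok in results.items():
    print(repr(sample), ok)

# Space, tab, and newline all count as whitespace for \s.
three_spaces = re.fullmatch(r"\s\s\s", " \t\n") is not None
print(three_spaces)  # True
```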
Quantity Chars
* = 0 or more
? = 0 or 1
+ = 1 or more
[] = any one of the chars inside the brackets, e.g. [abc]
[^] = any one char that is NOT inside the brackets, e.g. [^abc]
[a-zA-Z] = ranges of chars
Examples
X+ = 1 or more X
XXX
[XYZ] = any 1 of these chars
X, Y, Z
[XYZxyz]+ = 1 or more of any of these
y, XYz, zYZZyX, ZZzzzzz
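The quantity characters and brackets can be exercised with a short Python sketch. The helper name `whole` is hypothetical, and the empty-string cases are added to show the difference between *, ?, and +.

```python
import re

def whole(pattern, text):
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"X+", "XXX"))            # True: one or more X
print(whole(r"X+", ""))               # False: + needs at least one
print(whole(r"X*", ""))               # True: * allows zero
print(whole(r"X?", "XX"))             # False: ? allows at most one
print(whole(r"[XYZ]", "Y"))           # True: exactly one char from the set
print(whole(r"[XYZ]", "XY"))          # False: two chars, the set matches one
print(whole(r"[XYZxyz]+", "zYZZyX"))  # True
```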
EXAMPLES
[a-zA-Z0-9] = any single letter or digit, but no spaces or symbols
\.?$ = maybe ends with a .
remember: $ is end of line
.* = 0 to ∞ of any character
[^abc]* = 0 to ∞ anything but lowercase a,b, or c
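A few of these patterns, tried in Python's re module as a rough check. The input strings are invented, and a + is added to the bracket pattern (an assumption) so it can cover a whole word.

```python
import re

# \.?$ finds an optional trailing period at the end of the text.
with_dot = re.search(r"\.?$", "The end.").group(0)
without_dot = re.search(r"\.?$", "The end").group(0)
print(repr(with_dot), repr(without_dot))  # '.' ''

# Letters and digits only: a space breaks the match.
alnum_ok = re.fullmatch(r"[a-zA-Z0-9]+", "abc123") is not None
alnum_space = re.fullmatch(r"[a-zA-Z0-9]+", "abc 123") is not None
print(alnum_ok, alnum_space)  # True False

# Anything except lowercase a, b, or c (uppercase is still allowed).
not_abc_ok = re.fullmatch(r"[^abc]*", "xyz ABC") is not None
not_abc_bad = re.fullmatch(r"[^abc]*", "cab") is not None
print(not_abc_ok, not_abc_bad)  # True False
```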
Problems
Unicode vs ASCII
The regular expression language is older than Unicode
Many newer engines support Unicode
Minor extensions to the language are required for full Unicode support
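As one illustration of engine-level Unicode support, Python's re module treats \w as Unicode-aware for text strings by default and offers the re.ASCII flag to restrict it. Other engines may behave differently; the sample word is arbitrary.

```python
import re

word = "naïve"

# Default for Python 3 text strings: \w covers Unicode letters such as ï.
unicode_match = re.fullmatch(r"\w+", word) is not None

# With re.ASCII, \w is limited to a-z, A-Z, 0-9, and underscore.
ascii_match = re.fullmatch(r"\w+", word, re.ASCII) is not None

print(unicode_match, ascii_match)  # True False
```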
Options
RegExp Engines typically have options
ignoreCase
saves you from doing [Aa] for each
global
keeps searching after a match is found, until the end of the input; by default it stops at the 1st match (useful for replace)
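A sketch of both options in Python's re module. Python has an ignore-case flag (re.IGNORECASE) but no flag named global; the closest equivalents, used here as an assumption about the intent, are search versus findall and the count argument of sub.

```python
import re

text = "Grep, grep, GREP"

# ignoreCase: without the flag only the lowercase spelling is found.
sensitive_start = re.search(r"grep", text).start()
insensitive_start = re.search(r"grep", text, re.IGNORECASE).start()
print(sensitive_start, insensitive_start)  # 6 0

# "global": search stops at the first match, findall keeps going.
all_hits = re.findall(r"grep", text, re.IGNORECASE)
print(all_hits)  # ['Grep', 'grep', 'GREP']

# Replace only the first match versus every match.
first_only = re.sub(r"grep", "X", text, count=1, flags=re.IGNORECASE)
every = re.sub(r"grep", "X", text, flags=re.IGNORECASE)
print(first_only)  # X, grep, GREP
print(every)       # X, X, X
```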
Options
multiline
Most engines break up the input into lines:
At the end of a line, the engine resets for the next line
This option makes it ignore line endings and search across them (unless you use ^ or $, which still refer to the beginning and end of lines)
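Engines differ in what their multiline option does, so the following is only how Python's re module handles it: by default the whole input is one piece and ^ and $ refer to its very start and end, while re.MULTILINE makes them apply to every line. The sample text is made up.

```python
import re

text = "first line\nsecond line"

# Default: ^ matches only at the very start of the input.
default_starts = re.findall(r"^\w+", text)

# re.MULTILINE: ^ also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'second']
```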
/Common Use/
/string/ is similar to “quotes” on strings
if you use a “string” instead, you must escape each backslash:
/\d\d/ (matches 2 digits, written as a pattern)
vs
“\\d\\d” (matches 2 digits, written as a string)
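Python has no /pattern/ literal, but its raw strings (the r prefix) serve a similar purpose, which makes the escaping difference easy to see. This is a sketch in one language; the input text is invented.

```python
import re

raw = r"\d\d"        # raw string: backslashes are kept as written
escaped = "\\d\\d"   # ordinary string: each backslash must be doubled

# Both spellings produce the same four-character pattern.
same = raw == escaped
print(same, len(raw))  # True 4

two_digits = re.search(raw, "room 42").group(0)
print(two_digits)  # 42
```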