Presentation at LDC09: Introduction To Regex in Lasso 8.5
2. What is regex?
Regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. (Wikipedia: http://en.wikipedia.org/wiki/Regex)
In plain English: Regex is a text-searching language.

3. How regex works
Three components are needed:
5. A regular expression that defines what to search for (e.g. \d to find a digit)

6. #1 Regex Engine
7. [string_replaceregexp]
8. [regexp]
9. [compare_regexp]
10. [compare_notregexp]
11. [match_regexp]
12. [match_notregexp]

13. #2 Some Text To Search Against
14. There may be performance and memory challenges when using regex against a very large [string]

15. #3 Regular Expressions: The regex language
17. White Space
18. Character Classes
19. Shorthand Character Classes
21. Quantifiers
22. Grouping

23. Literals
All characters search for their literal selves except for the following: [\^$.|?*+()
These must be escaped when searched for as a literal.
Example:
[string_findregexp('LDC is fun!',-find='fun')]
LP8: array: (fun)
L9: array(fun)

24. Literals (cont)
By default, regex is case-sensitive. Use the (?i) switch to make it case-insensitive.
Examples:
[string_findregexp('ABC abc',-find='abc')]
LP8: array: (abc)
L9: array(abc)
[string_findregexp('ABC abc',-find='(?i)abc')]
LP8: array: (ABC), (abc)
L9: array(ABC, abc)

25. Escaping Characters
In regular expressions, depending on the context, various characters have special meaning. To specify the literal character, you must escape it with a backslash (\). Because the backslash also has special meaning in Lasso, you must double the backslashes in Lasso (\\).

26. Escaping Characters (cont)
Example:
[string_findregexp('[date] returns the date', -find='\\[date\\]')]
LP8: array: ([date])
L9: array([date])
[string_findregexp('[date] returns the date', -find='[date]')]
LP8: array: (d), (a), (t), (e), (e), (t), (t), (e), (d), (a), (t), (e)
L9: array(d, a, t, e, e, t, t, e, d, a, t, e)

27. Dot
A dot (aka period symbol .) will match any single character except line returns. Use the switch (?s) to turn on matching line returns too.
Example:
[string_findregexp('LDC is fun! Turn on a fan.', -find='f.n')]
LP8: array: (fun), (fan)
L9: array(fun, fan)

28. Dot (cont)
[string_findregexp('1\n2\n3',-find='.')]
LP8: array: (1), (2), (3)
L9: array(1, 2, 3)
[string_findregexp('1\n2\n3',-find='(?s).')]
LP8: array: (1), ( ), (2), ( ), (3)
L9: array(1, , 2, , 3)

29. White Space
To find white space, use the Lasso equivalents:
Return = \r
Newline = \n
Tab = \t
Example:
[string_findregexp('1\n2\n3',-find='\n')]
LP8: array: ( ), ( )
L9: array( , )

30. Character Classes
Used to match against a set of characters contained within square brackets [ ]. Order of characters within the class does not matter (i.e. [abc] == [cba]). Reserved characters are ^ - ] \.
Example:
[string_findregexp('New Years Eve is 2009-12-31', -find='[123ae]')]
LP8: array: (e), (e), (a), (e), (2), (1), (2), (3), (1)
L9: array(e, e, a, e, 2, 1, 2, 3, 1)

31. Character Classes (cont)
Hyphen denotes a range (e.g. [0-9] means 0,1,2,...,9 and [a-z] means a,b,c,...,z).
Example:
[string_findregexp('abcdef',-find='[b-d]')]
LP8: array: (b), (c), (d)
L9: array(b, c, d)

32. Character Classes (cont)
A caret immediately after the opening square bracket negates the class: it matches any character not listed.
Example:
[string_findregexp('abcdef',-find='[^b-d]')]
LP8: array: (a), (e), (f)
L9: array(a, e, f)

33. Shorthand Character Classes
\d = [0-9]
\D = [^0-9]
\w = [a-zA-Z0-9_]
\W = [^a-zA-Z0-9_]
\s = [ \t\r\n]
\S = [^ \t\r\n]
Example:
[string_findregexp('1a2b3c',-find='\\d')]
LP8: array: (1), (2), (3)
L9: array(1, 2, 3)
[string_findregexp('1a2b3c',-find='\\D')]
LP8: array: (a), (b), (c)
L9: array(a, b, c)

34. Shorthand Character Classes (cont)
Example:
[string_findregexp('1a2b3c',-find='\\w')]
LP8: array: (1), (a), (2), (b), (3), (c)
L9: array(1, a, 2, b, 3, c)
[string_findregexp('1 2 3',-find='\\s')]
LP8: array: ( ), ( )
L9: array( , )

35. Positional Matching
^ matches the beginning of the text, $ matches the end of the text, and the (?m) switch makes ^ and $ match the beginning and end of each line.
Example:
[string_findregexp('1\n2\n3',-find='^\\d')]
LP8: array: (1)
L9: array(1)
[string_findregexp('1\n2\n3',-find='(?m)^\\d')]
LP8: array: (1), (2), (3)
L9: array(1, 2, 3)

36. Positional Matching (cont)
\b matches a word boundary (the position between a word character and a non-word character or start/end of line).
Example:
[string_findregexp('cape and ape',-find='\\bape')]
LP8: array: (ape)
L9: array(ape)
[string_findregexp('cape and ape',-find='ape')]
LP8: array: (ape), (ape)
L9: array(ape, ape)

37. Alternation
Vertical bar (|) is the OR operator for regex.
Example:
[string_findregexp('cat and rat',-find='cat|rat')]
LP8: array: (cat), (rat)
L9: array(cat, rat)

38. Quantifiers
Specifies the number to find:
* = 0 or more
+ = 1 or more
? = 0 or 1
{n} = n times
{n,m} = min n, max m times
{n,} = min n, no max
Example:
[string_findregexp('123aaabbb', -find='0*1+2?3{1}a{1,2}ab{2,}')]
LP8: array: (123aaabbb)
L9: array(123aaabbb)

39. Grouping
Round brackets ( ) group part of a regex together, allowing quantifiers to be applied to the group or alternation (AND/OR) to be limited to it. They also create backreferences, which we won't cover in this session, but know that Lasso returns each group match in addition to the overall match.
Example:
[string_findregexp('cat and rat',-find='(c|r)at')]
LP8: array: (cat), (c), (rat), (r)
L9: array(cat, c, rat, r)

40. Grouping (cont)
There is an option for non-capturing groups: (?: regex here...)
Example:
[string_findregexp('cat and rat',-find='(?:c|r)at')]
LP8: array: (cat), (rat)
L9: array(cat, rat)

41. Tips for Regular Expressions
42. When using regular expressions obtained from outside sources, you'll need to double up the backslashes (\) for Lasso (e.g. \d+ becomes \\d+).
43. User input used as part of a regular expression must be encoded (http://tagswap.net/lp_regexp_encode)

44. Putting it all together
45. Often, there are several ways to match. If one approach doesn't work, try another.
46. Great reference and tutorial site: www.regular-expressions.info

47. Examples
Extract names from comma-delimited list:
[string_findregexp('Abe Smith, Bob Jones, Cindy Hart, Darla King',-find='\\w+\\s+\\w+')]
LP8: array: (Abe Smith), (Bob Jones), (Cindy Hart), (Darla King)
L9: array(Abe Smith, Bob Jones, Cindy Hart, Darla King)

48. Examples (cont)
Extract phone numbers into a packed format:
[string_findregexp('(213) 555-1212',-find='\\d') ->join('')]
[string_findregexp('213-555-1212',-find='\\d') ->join('')]
[string_findregexp('213 555 1212',-find='\\d') ->join('')]
LP8: 2135551212 2135551212 2135551212
L9: 2135551212 2135551212 2135551212

49. Examples (cont)
Extract data from HTML:
[string_findregexp('<input name="secret" value="123">',-find='name="secret" value="[^"]+')]
LP8: array: (name="secret" value="123)
L9: array(name="secret" value="123)
[string_findregexp('<input name="secret" value="123">',-find='name="secret" value="([^"]+)')]
LP8: array: (name="secret" value="123), (123)
L9: array(name="secret" value="123, 123)

50. Examples (cont)
Extract data from HTML:
[string_findregexp('<input name="secret" value="123">',-find='(?:name="secret" value=")[^"]+')]
LP8: array: (name="secret" value="123)
L9: array(name="secret" value="123)
[string_findregexp('<input name="secret" value="123">',-find='(?
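
For readers who want to try the HTML extraction idea outside Lasso, here is a minimal sketch in Python (not Lasso code). It combines the capture-group approach from the HTML examples with the earlier tip that user input placed inside a pattern must be encoded. The function name extract_value and the sample markup are hypothetical, chosen only for illustration. Python's re.escape is used here as a stand-in for the encoding step; it is an assumption that it serves the same purpose as the tag linked in the tips.

```python
import re
from typing import Optional

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = '<input name="secret" value="123">'


def extract_value(html: str, field_name: str) -> Optional[str]:
    """Return the value attribute that follows name="<field_name>", or None."""
    # re.escape neutralises regex metacharacters in the caller-supplied name,
    # so a name such as "a.b" matches only a literal dot.
    pattern = 'name="' + re.escape(field_name) + '" value="([^"]+)'
    match = re.search(pattern, html)
    # group(0) would be the whole match; group(1) is the parenthesised part.
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_value(SAMPLE_HTML, "secret"))  # 123
    print(extract_value(SAMPLE_HTML, "sec.et"))  # None
```

Without the escaping step, the second call would treat the dot as "any character" and return 123, which is the kind of surprise the encoding tip is meant to prevent.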