Welcome to CSC 108
Introduction to Computer Programming
Lecture W10C
Drs. Michael Liut, Andi Bergen, Larry Zhang
Mathematical and Computational Sciences
University of Toronto Mississauga
November 20 2020
This session is being recorded!
Regular Expressions (Regex)
Definition 1: A regular expression (regex) is a sequence of characters that forms a search pattern.
Example uses:
1. Phone numbers
2. Email addresses
3. Postal Codes / Zip Codes
4. Valid variable names
(e.g., variable names cannot start with digits)
5. etc.
Regular Expressions (Regex)
1. Groups: (...) capturing vs. (?:...) non-capturing
2. Quantifiers: *, ?, {1,2}
3. Character classes: [A-Za-z]
4. Escape characters: \.
5. Logical operators: a|b
6. Use of raw strings in Python: r'regex'
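To see why raw strings matter, the short sketch below compares an ordinary and a raw string literal for the word-boundary escape \b. The sample text is made up for illustration.

```python
import re

# In an ordinary string literal, "\b" is the single backspace character.
# In a raw string literal, r"\b" keeps the backslash, so the regex engine
# receives the two characters \ and b and reads them as a word boundary.
plain = "\b"
raw = r"\b"
print(len(plain), len(raw))  # 1 2

text = "cat concat cat"  # made-up sample text
print(re.findall(r"\bcat\b", text))  # ['cat', 'cat']
print(re.findall("\bcat\b", text))   # [] (searches for backspace characters)
```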
Simple Usage
>>> import re
>>> txt = "Today's topic in CSC108: Regexes."
>>> x = re.search("CSC108", txt)
>>> if x:
...     print("Yes, there's a match")
...
Yes, there's a match

We could use the "in" operator for this simple example.
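As a small sketch of what re.search hands back: it returns a match object, or None when nothing matches, and the match object records what was found and where. The variable name m is chosen only for illustration.

```python
import re

txt = "Today's topic in CSC108: Regexes."

m = re.search("CSC108", txt)
if m is not None:
    print(m.group())            # the matched text: CSC108
    print(m.start(), m.end())   # 17 23

print("CSC108" in txt)            # True, the plain "in" test from the slide
print(re.search("CSC148", txt))   # None, because there is no match
```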
Find all phone numbers with area code 416
We don't know the exact string we are looking for (e.g., 416-555-1235), only the pattern it follows.
txt = ("143-614-3330, 556-732-3881, 680-964-1127, 568-769-3556, "
       "099-887-1597, 081-997-3959, 842-502-6372, 406-648-1681, "
       "416-475-8283, 259-778-2868, 105-776-7011, 912-576-5192, "
       "018-087-9554, 975-845-6860, 702-619-1033, 326-382-3556, "
       "416-294-6744, 957-135-4565, 667-624-1973, 603-418-9850")
We could use split(), loops, if, startswith() and string slicing.
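A minimal sketch of that regex-free approach, using a shortened subset of the phone numbers from the slide. The variable names are illustrative only.

```python
# A shortened subset of the phone numbers from the slide.
txt = "143-614-3330, 416-475-8283, 406-648-1681, 416-294-6744"

matches = []
for number in txt.split(", "):      # split() breaks the text into numbers
    if number.startswith("416-"):   # keep only the 416 area code
        matches.append(number)

print(matches)  # ['416-475-8283', '416-294-6744']

# The same test written with slicing instead of startswith().
same = [n for n in txt.split(", ") if n[:4] == "416-"]
```

This works only because the separator and the layout of each number are known in advance, which is the limitation a pattern removes.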
Find all phone numbers with area code 416
- A phone number is a sequence of characters that follows a pattern
- Pattern: 416- followed by 3 digits, a dash, and then 4 digits
Findall
- Search literals: "CSC108", "416-"
- Search ranges/classes: [a-zA-Z], [0-9]
- Wildcard: . (a dot matches any character except a newline)
- Zero or more / one or more occurrences: * and + (e.g., .* or .+)
- Escape characters: \. \? \+
- Logical operators: a|b (either a or b)
- Specific number of occurrences: {1} {2} {3}
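Each building block above can be tried on its own with re.findall. The sample strings below are made up for illustration.

```python
import re

print(re.findall(r"CSC108", "CSC108 and CSC148"))  # literal: ['CSC108']
print(re.findall(r"[0-9]", "a1b22"))               # class: ['1', '2', '2']
print(re.findall(r"c.t", "cat cot c-t ct"))        # wildcard: ['cat', 'cot', 'c-t']
print(re.findall(r"ab*", "a ab abb"))              # zero or more b: ['a', 'ab', 'abb']
print(re.findall(r"ab+", "a ab abb"))              # one or more b: ['ab', 'abb']
print(re.findall(r"\.", "a.b.c"))                  # escaped dot: ['.', '.']
print(re.findall(r"cat|dog", "cat dog bird"))      # either one: ['cat', 'dog']
print(re.findall(r"[0-9]{3}", "12 345 6789"))      # three digits: ['345', '678']
```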
Findall
How do we find a phone number with the "416" area code?
- Search literals: "416", "-"
- Search ranges/classes: [0-9]
- Specific number of occurrences: {1} {2} {3}
Findall
Find all phone numbers with area code "416" in this string.
'143-614-3330, 556-732-3881, 680-964-1127, 568-769-3556,
099-887-1597, 081-997-3959, 842-502-6372, 406-648-1681,
416-475-8283, 259-778-2868, 105-776-7011, 912-576-5192,
018-087-9554, 975-845-6860, 702-619-1033, 326-382-3556,
416-294-6744, 957-135-4565, 667-624-1973, 603-418-9850'
Findall
>>> x = re.findall(r"416-[0-9]{3}-[0-9]{4}", txt)
>>> print(x)
['416-475-8283', '416-294-6744']
>>> x = re.findall(r"416-\d{3}-\d{4}", txt)
>>> print(x)
['416-475-8283', '416-294-6744']
Algorithm Steps
Regex: "416-\d{3}-\d{4}"
Search text: "416-555-1234 , 516-55-51234"
- Iterate through the input and try to match the current input char against the current char of the regex
- If it matches, advance to the next input char and the next char in the regex
- If there was a match and the regex has no next char, add the matched string to the results
- If it did not match, advance to the next input char and reset the regex to its start
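The steps above can be sketched in a few lines of Python. This is a simplified teaching model, not how Python's re module is implemented. It assumes the pattern has already been expanded by hand into one test per expected character, and the names literal, digit, TOKENS and naive_findall are hypothetical.

```python
def literal(expected):
    """Return a test that accepts only the character `expected`."""
    return lambda ch: ch == expected

def digit(ch):
    """Test standing in for the digit class, restricted to ASCII digits."""
    return ch in "0123456789"

# "416-\d{3}-\d{4}" expanded by hand into one test per expected character.
TOKENS = ([literal(c) for c in "416-"] + [digit] * 3
          + [literal("-")] + [digit] * 4)

def naive_findall(tokens, text):
    results = []
    pos = 0     # index of the current regex token
    start = 0   # index in text where the current attempt began
    for i, ch in enumerate(text):
        if pos == 0:
            start = i
        if tokens[pos](ch):
            pos += 1                      # advance in the regex
            if pos == len(tokens):        # the regex has no next char
                results.append(text[start:i + 1])
                pos = 0
        else:
            pos = 0                       # reset the regex to its start
    return results

print(naive_findall(TOKENS, "416-555-1234 , 516-55-51234"))
# ['416-555-1234']
```

Because a failed attempt simply moves on to the next input char, this sketch misses some matches that re.findall reports, for example in "4416-555-1234". The re module also tries the pattern from later start positions.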
Algorithm Steps II
"416-\d{3}-\d{4}"
Search text: "416-555-1234 , 516-55-51234"
(Animation, one step per frame)
- The literals 4, 1, 6 and - are matched one input char at a time
- \d{3}: first, second and third occurrence of a digit
- The literal - is matched
- \d{4}: first, second, third and fourth occurrence of a digit
- After the fourth occurrence the regex is empty: take the result and reset the regex
Algorithm Steps II
"416-\d{3}-\d{4}"
Search text: "416-555-1234, 516-55-51234"
Result: ["416-555-1234"]
Continue trying to match one item of the regex at a time against the next character of the input string, until the input string is exhausted.
More examples (Find special characters)
>>> txt = "Abc. Hello! 100 + 8"
>>> x = re.findall(r".!+", txt)
>>> print(x)
['o!']    <-- We want . ! +
>>> x = re.findall(r"\.\!\+", txt)
>>> print(x)
[]    <-- Still not quite correct
>>> x = re.findall(r"[!.+]+", txt)
>>> print(x)
['.', '!', '+']
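A related tool that the slide does not use is re.escape, which backslash-escapes the metacharacters in a piece of text so that it is matched literally. The sketch below reuses the slide's string and searches for the literal text "100 + 8".

```python
import re

txt = "Abc. Hello! 100 + 8"

# re.escape backslash-escapes regex metacharacters, so the resulting
# pattern matches the given text character for character.
pattern = re.escape("100 + 8")
print(re.findall(pattern, txt))    # ['100 + 8']

# Without escaping, + is a quantifier: the pattern means "100", one or
# more spaces, another space, then "8", which does not occur in txt.
print(re.findall("100 + 8", txt))  # []
```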
More examples (Groups)
>>> txt = "csccsccsc"
>>> x = re.findall(r'(csc)', txt)
>>> print(x)
['csc', 'csc', 'csc']
>>> x = re.findall(r'(?:csc){3}', txt)
>>> print(x)
['csccsccsc']
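The difference between the two kinds of group shows up in what re.findall returns. A short sketch on the same string:

```python
import re

txt = "csccsccsc"

# Non-capturing group: findall reports the whole match.
print(re.findall(r"(?:csc){3}", txt))  # ['csccsccsc']

# Capturing group: findall reports the group's text instead, and a
# repeated group keeps only its last repetition.
print(re.findall(r"(csc){3}", txt))    # ['csc']
```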
More examples (Or)
>>> txt = "csccsccsc"
>>> x = re.findall("csc|cs", txt)
>>> print(x)
['csc', 'csc', 'csc']
>>> x = re.findall("cs|csc", txt)
>>> print(x)
['cs', 'cs', 'cs']

Alternatives are tried from left to right, so their order matters.
More examples (Words)
>>> txt = "Mike Miller, Mick Furrier, Mike Baker, Myke Mason"
>>> x = re.findall(r"Mike \w*", txt)
>>> print(x)
['Mike Miller', 'Mike Baker']
>>> x = re.findall(r"(M[iy]ke \w*)", txt)
>>> print(x)
['Mike Miller', 'Mike Baker', 'Myke Mason']
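Note that \w* also accepts zero word characters. A small sketch with made-up text in which one surname is missing shows how * and + differ here:

```python
import re

txt = "Mike , Mike Baker"  # made-up text with a missing surname

print(re.findall(r"Mike \w*", txt))  # ['Mike ', 'Mike Baker']
print(re.findall(r"Mike \w+", txt))  # ['Mike Baker']
```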
Task 1
from typing import List

def find_specific(txt: str) -> List[str]:
    """Return a list of all matches in txt that consist of one 'a'
    followed by one or more 'b'.
    >>> find_specific("aaabbbabbacb")
    ['abbb', 'abb']
    >>> find_specific("")
    []
    """
    pass
Answer
import re
from typing import List

def find_specific(txt: str) -> List[str]:
    """Return a list of all matches in txt that consist of one 'a'
    followed by one or more 'b'.
    >>> find_specific("aaabbbabbacb")
    ['abbb', 'abb']
    >>> find_specific("")
    []
    """
    return re.findall("ab+", txt)
Task 2
import re
from typing import List

def run_me(txt: str, my_regex: str) -> List[str]:
    if len(my_regex) <= 6:
        return re.findall(my_regex, txt)
    return ["Fail"]

def task2_helper(txt: str) -> List[str]:
    """Create a regex that is no more than 6 characters long and
    matches any substring that starts with an 'a', followed by 0 or
    more digits, and ends with the string 'CS'.
    >>> task2_helper("a998289CSC is great aaCS")
    ['a998289CS', 'aCS']
    >>> task2_helper("aCSC108")
    ['aCS']
    >>> task2_helper("")
    []
    """
    your_regex = r""
    return run_me(txt, your_regex)
Answer
import re
from typing import List

def run_me(txt: str, my_regex: str) -> List[str]:
    if len(my_regex) <= 6:
        return re.findall(my_regex, txt)
    return ["Fail"]

def task2_helper(txt: str) -> List[str]:
    """Create a regex that is no more than 6 characters long and
    matches any substring that starts with an 'a', followed by 0 or
    more digits, and ends with the string 'CS'.
    >>> task2_helper("a998289CSC is great aaCS")
    ['a998289CS', 'aCS']
    >>> task2_helper("aCSC108")
    ['aCS']
    >>> task2_helper("")
    []
    """
    your_regex = r'a\d*CS'
    return run_me(txt, your_regex)
Task 3
def task3(txt: str) -> bool:
    """Return True if the input text contains only valid variable
    names in Python.
    Hint: are the doctests complete?
    >>> task3("x y foo foobar")
    True
    >>> task3("x y x-y foobar")
    False
    >>> task3("foo_bar")
    True
    >>> task3(" ")
    False
    >>> task3("")
    False
    """
    pass
Answer
import re

def task3(txt: str) -> bool:
    """Return True if the input text contains only valid variable names in Python.
    Hint: are the doctests complete?
    >>> task3("x y foo foobar")
    True
    >>> task3("x y x-y foobar")
    False
    >>> task3("foo_bar")
    True
    >>> task3(" ")
    False
    >>> task3("")
    False
    """
    # A name starts with a letter or underscore, followed by word characters.
    res = re.findall(r"[A-Za-z_]\w*", txt)
    if not res:
        return False
    # Every non-space character must belong to one of the names found.
    return len(txt.replace(" ", "")) == len("".join(res))
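As an alternative sketch (not the slide's solution), each whitespace-separated token can be checked on its own with re.fullmatch. The function name all_valid_names is hypothetical, and this version does not reject reserved keywords such as "for".

```python
import re

def all_valid_names(txt: str) -> bool:
    """Return True if txt holds at least one token and every token
    looks like a Python variable name (keywords are not checked)."""
    names = txt.split()  # splits on any whitespace, not only spaces
    return bool(names) and all(
        re.fullmatch(r"[A-Za-z_]\w*", name) for name in names
    )

print(all_valid_names("x y foo foobar"))  # True
print(all_valid_names("x y x-y foobar"))  # False
print(all_valid_names("1abc"))            # False
```

Checking one token at a time avoids the length comparison, at the cost of treating tabs and newlines as separators too.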
Task 4
Rhyming words...
from typing import List

def task4(target: str, valid_words: List[str], quality: int = 2) -> List[str]:
    """Find all words that rhyme with the target string.
    Words are considered to rhyme when their last <quality> characters are identical.

    target: the string for which to find all rhyming words
    valid_words: contains all words in the language
    quality: how many characters at the end of the target have to match
    >>> task4("bat", ["cat", "mat", "butter", "adjective", "flat"])
    ['cat', 'mat', 'flat']
    >>> task4("batter", ["cat", "weather", "butter", "adjective", "flatter"])
    ['weather', 'butter', 'flatter']
    >>> task4("batter", ["cat", "weather", "butter", "adjective", "flatter"], 3)
    ['butter', 'flatter']
    >>> task4("a", ["cat", "weather", "butter", "adjective", "flatter"])
    []
    """
    pass
Answer
import re
from typing import List

def task4(target: str, valid_words: List[str], quality: int = 2) -> List[str]:
    """Find all words that rhyme with the target string.
    Words are considered to rhyme when their last <quality> characters are identical.

    target: the string for which to find all rhyming words
    valid_words: contains all words in the language
    quality: how many characters at the end of the target have to match
    >>> task4("bat", ["cat", "mat", "butter", "adjective", "flat"])
    ['cat', 'mat', 'flat']
    >>> task4("batter", ["cat", "weather", "butter", "adjective", "flatter"])
    ['weather', 'butter', 'flatter']
    >>> task4("batter", ["cat", "weather", "butter", "adjective", "flatter"], 3)
    ['butter', 'flatter']
    >>> task4("a", ["cat", "weather", "butter", "adjective", "flatter"])
    []
    """
    if len(target) < quality:
        return []
    v = ",".join(valid_words)
    # re.escape treats the suffix literally; \b stops a match from
    # ending in the middle of a longer word (e.g. "bat" inside "batter").
    reg = r"\w+" + re.escape(target[-quality:]) + r"\b"
    return re.findall(reg, v)
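A variant sketch that avoids joining the words: test each word separately with fullmatch, so a match always covers one whole word. The name rhymes_per_word is hypothetical, and the behaviour on the slide's doctests is intended to be the same.

```python
import re
from typing import List

def rhymes_per_word(target: str, valid_words: List[str],
                    quality: int = 2) -> List[str]:
    """Return the words whose last `quality` characters equal those of
    `target`, with at least one character in front of that ending."""
    if len(target) < quality:
        return []
    # re.escape keeps any special characters in the suffix literal.
    pattern = re.compile(r"\w+" + re.escape(target[-quality:]))
    return [word for word in valid_words if pattern.fullmatch(word)]

print(rhymes_per_word("bat", ["cat", "mat", "butter", "adjective", "flat"]))
# ['cat', 'mat', 'flat']
```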
Next Time
1. Sorting.
2. Time and Complexity.