Welcome to CSC 108

Introduction to Computer Programming

Lecture W10C

Drs. Michael Liut, Andi Bergen, Larry Zhang

Mathematical and Computational Sciences

University of Toronto Mississauga

November 20 2020

This session is being recorded!


Regular Expressions (Regex)

Definition 1: A sequence of characters that forms a search pattern.

Typical things to match with a pattern:

1. Phone numbers
2. Email addresses
3. Postal codes / ZIP codes
4. Valid variable names (e.g., variable names cannot start with digits)
5. etc.

Regular Expressions (Regex)

1. Groups: () capturing vs. (?:) non-capturing
2. Quantifiers: * ? {1,2}
3. Character classes: [A-Za-z]
4. Escape characters: \.
5. Logical operators: a | b
6. Use of raw strings in Python: r'regex'
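
To see why item 6 matters, here is a minimal sketch that compares a plain string with a raw string and also tries the {1,2} quantifier from item 2. The sample texts and the use of \b (the regex word-boundary marker) are illustrative choices, not taken from the slide.

```python
import re

# In a plain string, "\b" is a single backspace character;
# in a raw string, r"\b" is a backslash followed by the letter b.
plain_len = len("\b")
raw_len = len(r"\b")
print(plain_len, raw_len)  # 1 2

# Only the raw version reaches the regex engine as "word boundary".
plain_hits = re.findall("\bcat\b", "cat concat cats")
raw_hits = re.findall(r"\bcat\b", "cat concat cats")
print(plain_hits)  # []
print(raw_hits)    # ['cat']

# {1,2} matches one or two repetitions, preferring two when possible.
pairs = re.findall(r"a{1,2}", "aaa")
print(pairs)  # ['aa', 'a']
```

Writing every pattern as a raw string avoids having to remember which backslash sequences Python itself interprets.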

Simple Usage

>>> import re
>>> txt = "Today's topic in CSC108: Regexes."
>>> x = re.search("CSC108", txt)
>>> if x:
...     print("Yes, there's a match")
...
Yes, there's a match

We could use "in" for this simple example.
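
As a small sketch of the remark about "in", the lines below run both checks on the same text. The variable names found_with_in and match are chosen here for illustration.

```python
import re

txt = "Today's topic in CSC108: Regexes."

# The membership test only answers yes or no.
found_with_in = "CSC108" in txt
print(found_with_in)  # True

# re.search returns a match object (or None) that also records
# what was matched and where.
match = re.search("CSC108", txt)
print(match.group())  # CSC108
print(match.span())   # (17, 23)
```

For a fixed literal, "in" is enough; re.search becomes useful once the thing being looked for is a pattern rather than an exact string.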

Find all phone numbers with area code 416

We don’t know exactly what we are looking for (e.g., 416-555-1235), only a pattern.

txt = ('143-614-3330, 556-732-3881, 680-964-1127, 568-769-3556, '
       '099-887-1597, 081-997-3959, 842-502-6372, 406-648-1681, '
       '416-475-8283, 259-778-2868, 105-776-7011, 912-576-5192, '
       '018-087-9554, 975-845-6860, 702-619-1033, 326-382-3556, '
       '416-294-6744, 957-135-4565, 667-624-1973, 603-418-9850')

We could use split(), loops, if, startswith() and string slicing.
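
A minimal sketch of that regex-free approach might look as follows. The function name find_416_manually and the shortened sample string are assumptions made for illustration; the slide does not give this code.

```python
from typing import List

# Shortened sample of the phone list (an assumption, to keep the sketch brief).
sample = "406-648-1681, 416-475-8283, 259-778-2868, 416-294-6744, 603-418-9850"


def find_416_manually(text: str) -> List[str]:
    """Return the numbers in text shaped like 416-ddd-dddd, without regexes."""
    matches = []
    for candidate in text.split(","):
        candidate = candidate.strip()
        # Check the overall length, the area code, both dashes and the digits.
        has_shape = (
            len(candidate) == 12
            and candidate.startswith("416-")
            and candidate[4:7].isdigit()
            and candidate[7] == "-"
            and candidate[8:].isdigit()
        )
        if has_shape:
            matches.append(candidate)
    return matches


print(find_416_manually(sample))  # ['416-475-8283', '416-294-6744']
```

It works, but every part of the pattern needs its own check, which is the motivation for describing the pattern once as a regex.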

Find all phone numbers with area code 416

- A phone number is a sequence of characters that follows a pattern
- Pattern: 416- followed by 3 digits, a dash and then 4 digits

Findall

- Search literals: "CSC108", "416-"
- Search ranges/classes: [a-zA-Z], [0-9]
- Wildcard: . (the dot matches any character except a newline)
- Zero or more / one or more occurrences: * + (e.g., .* or .+)
- Escape characters: \. \? \+
- Logical operators: a | b (either a or b, not both)
- Specific number of occurrences: {1} {2} {3}
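
The building blocks listed on this slide can each be tried on one small string. This is only a sketch; the sample text and the variable names are made up for illustration.

```python
import re

text = "cat cot c.t ct caat"

# Wildcard: the dot stands for any single character.
wildcard = re.findall(r"c.t", text)
print(wildcard)  # ['cat', 'cot', 'c.t']

# Escaped dot: matches a literal '.' only.
escaped = re.findall(r"c\.t", text)
print(escaped)  # ['c.t']

# * allows zero or more 'a'; + requires at least one.
star = re.findall(r"ca*t", text)
plus = re.findall(r"ca+t", text)
print(star)  # ['cat', 'ct', 'caat']
print(plus)  # ['cat', 'caat']

# Alternation: either 'cat' or 'cot'.
either = re.findall(r"cat|cot", text)
print(either)  # ['cat', 'cot']

# Exact count: exactly two 'a' characters.
exact = re.findall(r"ca{2}t", text)
print(exact)  # ['caat']
```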

Findall

How do we find a phone number with “416” area code?

- Search literals: "416", "-"
- Search ranges/classes: [0-9]
- Specific number of occurrences: {1} {2} {3}

Findall

Find all phone numbers with “416” in this string.

'143-614-3330, 556-732-3881, 680-964-1127, 568-769-3556,
 099-887-1597, 081-997-3959, 842-502-6372, 406-648-1681,
 416-475-8283, 259-778-2868, 105-776-7011, 912-576-5192,
 018-087-9554, 975-845-6860, 702-619-1033, 326-382-3556,
 416-294-6744, 957-135-4565, 667-624-1973, 603-418-9850'

Findall

>>> x = re.findall("416-[0-9]{3}-[0-9]{4}", txt)
>>> print(x)
['416-475-8283', '416-294-6744']
>>> x = re.findall(r"416-\d{3}-\d{4}", txt)
>>> print(x)
['416-475-8283', '416-294-6744']

Algorithm Steps

Regex: “416-\d{3}-\d{4}”
Search text: “416-555-1234 , 516-55-51234”

- Iterate through the input and try to match each character to the current char of the regex
- If it matches, advance to the next input char and the next char in the regex
- If there was a match and the regex has no next char, add the string to the results
- If it did not match, advance to the next input char and reset the regex to its start

Algorithm Steps II

“416-\d{3}-\d{4}”

Search text: “416-555-1234 , 516-55-51234”

[Animation, slides 12 to 26: the regex and search text above are repeated on every frame while the match advances. Successive frames are labelled first, second and third occurrence, then first, second, third and fourth occurrence.]

After the fourth occurrence the regex is empty: take the result and reset the regex.

Result: [“416-555-1234”]

Continue to try to match one item of the regex at a time with the next character of the input string, until the input string is empty.
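
The steps on the "Algorithm Steps" slides can be sketched in a few lines of Python. This is a simplified illustration of the slide's description, not how the re module is actually implemented. The helper names expand, item_matches and naive_findall are hypothetical, and the sketch assumes the pattern contains only literal characters, \d and {n}.

```python
import re  # used only to compare the sketch with re.findall
from typing import List


def expand(pattern: str) -> List[str]:
    """Turn e.g. r"4\\d{2}" into one item per input character: ['4', '\\d', '\\d']."""
    items: List[str] = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\d", i):
            items.append("\\d")
            i += 2
        elif pattern[i] == "{":
            close = pattern.index("}", i)
            count = int(pattern[i + 1:close])
            # Repeat the previous item so that it appears count times in total.
            items.extend([items[-1]] * (count - 1))
            i = close + 1
        else:
            items.append(pattern[i])
            i += 1
    return items


def item_matches(item: str, ch: str) -> bool:
    """An item is either the digit class or a literal character (ASCII digits only)."""
    return ch in "0123456789" if item == "\\d" else ch == item


def naive_findall(pattern: str, text: str) -> List[str]:
    items = expand(pattern)
    results = []
    pos = 0    # index of the regex item to match next
    start = 0  # where the current candidate match began
    for i, ch in enumerate(text):
        if item_matches(items[pos], ch):
            if pos == 0:
                start = i
            pos += 1
            if pos == len(items):  # the regex has no next item: record the match
                results.append(text[start:i + 1])
                pos = 0
        else:
            pos = 0  # no match: reset the regex, move on to the next input char
    return results


pattern = r"416-\d{3}-\d{4}"
search_text = "416-555-1234 , 516-55-51234"
print(naive_findall(pattern, search_text))  # ['416-555-1234']
print(re.findall(pattern, search_text))     # ['416-555-1234']
```

Because the sketch resets without re-examining the character that caused the failure, it can miss a match that begins inside a failed attempt, for example in "4416-555-1234". re.findall does find that match.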

More examples (Find special characters)

>>> txt = "Abc. Hello! 100 + 8"
>>> x = re.findall(".!+", txt)
>>> print(x)
['o!']    <-- We want . ! +
>>> x = re.findall(r"\.\!\+", txt)
>>> print(x)
[]    <-- Still not quite correct
>>> x = re.findall("[!.+]+", txt)
>>> print(x)
['.', '!', '+']

More examples (Groups)

>>> txt = "csccsccsc"
>>> x = re.findall(r'(csc)', txt)
>>> print(x)
['csc', 'csc', 'csc']
>>> x = re.findall(r'(?:csc){3}', txt)
>>> print(x)
['csccsccsc']
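
The difference between capturing and non-capturing groups shows most clearly when the same quantifier is applied to both. The following sketch extends the slide's example; the phone-number line is an added illustration.

```python
import re

txt = "csccsccsc"

# With a capturing group, findall returns what the group captured.
# A repeated group keeps only its last repetition.
capturing = re.findall(r"(csc){3}", txt)
print(capturing)  # ['csc']

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"(?:csc){3}", txt)
print(non_capturing)  # ['csccsccsc']

# Several capturing groups give one tuple per match.
parts = re.findall(r"(\d{3})-(\d{3})-(\d{4})", "416-555-1234")
print(parts)  # [('416', '555', '1234')]
```

So (?:...) is the safer choice when a group is needed only for grouping and the full match is what should be returned.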

More examples (Or)

>>> txt = "csccsccsc"
>>> x = re.findall("csc|cs", txt)
>>> print(x)
['csc', 'csc', 'csc']
>>> x = re.findall("cs|csc", txt)
>>> print(x)
['cs', 'cs', 'cs']

More examples (Words)

>>> txt = "Mike Miller, Mick Furrier, Mike Baker, Myke Mason"
>>> x = re.findall(r"Mike \w*", txt)
>>> print(x)
['Mike Miller', 'Mike Baker']
>>> x = re.findall(r"(M[iy]ke \w*)", txt)
>>> print(x)
['Mike Miller', 'Mike Baker', 'Myke Mason']

Task 1

from typing import List

def find_specific(txt: str) -> List[str]:
    """Return a list of matches where the string
    begins with one 'a' followed by one or more 'b'.

    >>> find_specific("aaabbbabbacb")
    ['abbb', 'abb']
    >>> find_specific("")
    []
    """
    pass

Answer

import re
from typing import List

def find_specific(txt: str) -> List[str]:
    """Return a list of matches where the string
    begins with one 'a' followed by one or more 'b'.

    >>> find_specific("aaabbbabbacb")
    ['abbb', 'abb']
    >>> find_specific("")
    []
    """
    # One 'a', then one or more 'b'.
    return re.findall(r"ab+", txt)

Task 2

import re
from typing import List

def run_me(txt: str, my_regex: str) -> List[str]:
    if len(my_regex) <= 6:
        return re.findall(my_regex, txt)
    return ["Fail"]

def task2_helper(txt: str) -> List[str]:
    """Create a regex that is no more than 6 characters long
    that matches when a string, or substring, starts with
    an 'a' followed by 0 or more numerals and ends
    with the following string: 'CS'.

    >>> task2_helper("a998289CSC is great aaCS")
    ['a998289CS', 'aCS']
    >>> task2_helper("aCSC108")
    ['aCS']
    >>> task2_helper("")
    []
    """
    your_regex = r""
    return run_me(txt, your_regex)

Answer

import re
from typing import List

def run_me(txt: str, my_regex: str) -> List[str]:
    if len(my_regex) <= 6:
        return re.findall(my_regex, txt)
    return ["Fail"]

def task2_helper(txt: str) -> List[str]:
    """Create a regex that is no more than 6 characters long
    that matches when a string, or substring, starts with
    an 'a' followed by 0 or more numerals and ends
    with the following string: 'CS'.

    >>> task2_helper("a998289CSC is great aaCS")
    ['a998289CS', 'aCS']
    >>> task2_helper("aCSC108")
    ['aCS']
    >>> task2_helper("")
    []
    """
    # 'a', then zero or more digits, then the literal 'CS' (6 characters).
    your_regex = r'a\d*CS'
    return run_me(txt, your_regex)

Task 3

def task3(txt: str) -> bool:
    """Return True if the input text contains only
    valid variable names in Python.
    Hint: are the doctests complete?

    >>> task3("x y foo foobar")
    True
    >>> task3("x y x-y foobar")
    False
    >>> task3("foo_bar")
    True
    >>> task3(" ")
    False
    >>> task3("")
    False
    """
    pass

Answer

import re

def task3(txt: str) -> bool:
    """Return True if the input text contains only
    valid variable names in Python.
    Hint: are the doctests complete?

    >>> task3("x y foo foobar")
    True
    >>> task3("x y x-y foobar")
    False
    >>> task3("foo_bar")
    True
    >>> task3(" ")
    False
    >>> task3("")
    False
    """
    # A name starts with a letter or underscore, followed by word characters.
    res = re.findall(r"[A-Za-z_]\w*", txt)
    if not res:
        return False
    # Every non-space character must belong to one of the matched names.
    return len(txt.replace(" ", "")) == len("".join(res))
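
The hint asks whether the doctests are complete. Two cases they do not cover are Python keywords such as "for" and names separated by tabs rather than spaces. A stricter variant, offered only as a sketch, could split the text on whitespace and check each piece on its own. The name task3_strict is hypothetical, and the sketch assumes that only ASCII names count as valid and that keywords should be rejected.

```python
import keyword
import re

# Assumption: ASCII letters, digits and underscores only (re.ASCII limits \w).
IDENTIFIER = re.compile(r"[A-Za-z_]\w*", re.ASCII)


def task3_strict(txt: str) -> bool:
    """Return True if txt is a whitespace-separated list of valid variable names."""
    names = txt.split()  # splits on spaces, tabs and newlines
    if not names:
        return False
    return all(
        IDENTIFIER.fullmatch(name) is not None and not keyword.iskeyword(name)
        for name in names
    )


print(task3_strict("x y foo foobar"))  # True
print(task3_strict("x y x-y foobar"))  # False
print(task3_strict("for x"))           # False
print(task3_strict("x\ty"))            # True
```

Using fullmatch on each piece avoids comparing lengths, because a piece either is a name in its entirety or it is not.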

Task 4

Rhyming words...

from typing import List

def task4(target: str, valid_words: List[str], quality: int = 2) -> List[str]:
    """Find all words that rhyme with the target string.
    Words are considered to rhyme when their last <quality> characters are identical.

    target: the string for which you want to find all rhyming words
    valid_words: contains all words in the language
    quality: how many characters at the end of the target have to match

    >>> task4("bat", ["cat", "mat", "butter", "adjective", "flat"])
    ['cat', 'mat', 'flat']
    >>> task4("batter", ["cat", "weather", "butter", "adjective", "flatter"])
    ['weather', 'butter', 'flatter']
    >>> task4("batter", ["cat", "weather", "butter", "adjective", "flatter"], 3)
    ['butter', 'flatter']
    >>> task4("a", ["cat", "weather", "butter", "adjective", "flatter"])
    []
    """
    pass

Answer

import re
from typing import List

def task4(target: str, valid_words: List[str], quality: int = 2) -> List[str]:
    """Find all words that rhyme with the target string.
    Words are considered to rhyme when their last <quality> characters are identical.

    target: the string for which you want to find all rhyming words
    valid_words: contains all words in the language
    quality: how many characters at the end of the target have to match

    >>> task4("bat", ["cat", "mat", "butter", "adjective", "flat"])
    ['cat', 'mat', 'flat']
    >>> task4("batter", ["cat", "weather", "butter", "adjective", "flatter"])
    ['weather', 'butter', 'flatter']
    >>> task4("batter", ["cat", "weather", "butter", "adjective", "flatter"], 3)
    ['butter', 'flatter']
    >>> task4("a", ["cat", "weather", "butter", "adjective", "flatter"])
    []
    """
    if len(target) < quality:
        return []
    # Join the words into one string so a single findall can scan them all.
    v = ",".join(valid_words)
    # One or more word characters followed by the target's ending.
    reg = r"\w+" + target[-quality:]
    return re.findall(reg, v)
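
One thing the doctests do not exercise: the pattern in the answer is not anchored to the end of a word, so task4("bat", ["batter", "combat"]) returns ['bat', 'combat'], where 'bat' is only the start of "batter". A possible variant, sketched here under the assumption that each word should be tested separately and that quality is at least 1, uses fullmatch. The name rhymes is hypothetical.

```python
import re
from typing import List


def rhymes(target: str, valid_words: List[str], quality: int = 2) -> List[str]:
    """Return the words whose last `quality` characters equal those of target."""
    if len(target) < quality:
        return []
    # re.escape guards against endings that contain regex metacharacters.
    ending = re.escape(target[-quality:])
    pattern = re.compile(r"\w*" + ending)
    # fullmatch requires the ending to sit at the very end of the word.
    return [word for word in valid_words if pattern.fullmatch(word)]


print(rhymes("bat", ["cat", "mat", "butter", "adjective", "flat"]))
# ['cat', 'mat', 'flat']
print(rhymes("bat", ["batter", "combat"]))
# ['combat']
```

Testing word by word also means the result can only contain entries of valid_words, never a fragment of one.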

Next Time

1. Sorting.

2. Time and Complexity.
