
Welcome to CSC 108

Introduction to Computer Programming

Lecture W10C

Drs. Michael Liut, Andi Bergen, Larry Zhang

Mathematical and Computational Sciences

University of Toronto Mississauga

November 20, 2020

This session is being recorded!


Regular Expressions (Regex)

Definition 1: A regular expression is a sequence of characters that forms a search pattern.

Common things to match with a regex:

1. Phone numbers

2. Email addresses

3. Postal Codes / Zip Codes

4. Valid variable names

(e.g., variable names cannot start with digits)

5. etc.
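As a small illustration of item 4, the sketch below checks candidate variable names against a pattern. It assumes ASCII-only names (Python itself also accepts many Unicode letters) and does not reject keywords; the helper name `looks_like_variable_name` is hypothetical.

```python
import re

# Assumed simplification: a letter or underscore first,
# then any number of letters, digits, or underscores.
NAME_PATTERN = r"[A-Za-z_][A-Za-z0-9_]*"


def looks_like_variable_name(name):
    """Return True if the whole string fits NAME_PATTERN."""
    return re.fullmatch(NAME_PATTERN, name) is not None


print(looks_like_variable_name("total_2"))  # True
print(looks_like_variable_name("2total"))   # False: starts with a digit
```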


Regular Expressions (Regex)

1. Groups: (...) capturing vs. (?:...) non-capturing

2. Quantifiers: *? {1,2}

3. Character classes: [A-Za-z]

4. Escape characters: \.

5. Logical operators: a | b

6. Use of raw strings in Python: r'regex'
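To make items 1 and 6 concrete, here is a short sketch of how capturing and non-capturing groups change what `re.findall` returns, and what the `r` prefix does. The sample text and the `MAT` alternative are made up for illustration.

```python
import re

text = "CSC108 and CSC148"

# Capturing group: findall returns only the captured part of each match.
captured = re.findall(r"CSC(\d{3})", text)

# Non-capturing group: the group only structures the pattern,
# so findall returns the whole match.
whole = re.findall(r"(?:CSC|MAT)\d{3}", text)

# Raw string: the backslash is kept as-is and handed to the regex engine.
same_pattern = r"\d" == "\\d"

print(captured)      # ['108', '148']
print(whole)         # ['CSC108', 'CSC148']
print(same_pattern)  # True
```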


Simple Usage

>>> import re
>>> txt = "Today's topic in CSC108: Regexes."
>>> x = re.search("CSC108", txt)
>>> if x:
...     print("Yes, there's a match")
...
Yes, there's a match

We could use "in" for this simple example.


Find all phone numbers with area code 416

We don’t know the exact string we are looking for (e.g., 416-555-1235), only the pattern it follows.

txt = ('143-614-3330, 556-732-3881, 680-964-1127, 568-769-3556, '
       '099-887-1597, 081-997-3959, 842-502-6372, 406-648-1681, '
       '416-475-8283, 259-778-2868, 105-776-7011, 912-576-5192, '
       '018-087-9554, 975-845-6860, 702-619-1033, 326-382-3556, '
       '416-294-6744, 957-135-4565, 667-624-1973, 603-418-9850')

We could use split(), loops, if, startswith() and string slicing.
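For comparison, a minimal sketch of that manual approach is shown below. It uses a shortened list taken from the numbers above and assumes the separator is always exactly a comma followed by one space.

```python
# A shortened version of the list, for illustration only.
sample = "143-614-3330, 416-475-8283, 406-648-1681, 416-294-6744"

matches = []
for number in sample.split(", "):
    # Keep only the numbers whose area code is 416.
    if number.startswith("416-"):
        matches.append(number)

print(matches)  # ['416-475-8283', '416-294-6744']
```

This works here, but it does not check that the rest of each entry really looks like a phone number, which is where a pattern helps.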


Find all phone numbers with area code 416

- A phone number is a sequence of characters that follows a pattern
- Pattern: 416- followed by 3 digits, a dash, and then 4 digits


Findall

- Search literals: "CSC108", "416-"
- Search ranges/classes: [a-zA-Z], [0-9]
- Wildcard: . (the dot matches any single character except a newline)
- Repetition: * (zero or more occurrences), + (one or more occurrences), e.g., .* or .+
- Escape characters: \. \? \+
- Logical operators: a|b (matches either a or b)
- Specific number of occurrences: {1} {2} {3}
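Each building block in this list can be tried with `re.findall`. The sketch below uses a made-up sample sentence; the variable names are only for illustration.

```python
import re

sample = "Is CSC108 fun? Yes. Cost: 3+4 cats or dogs."

literal = re.findall(r"CSC108", sample)       # search literal
digits = re.findall(r"[0-9]", sample)         # character class
wildcard = re.findall(r"c.t", sample)         # dot matches any one character
runs = re.findall(r"[0-9]+", sample)          # one or more digits
escaped = re.findall(r"\?|\+", sample)        # escaped ? or escaped +
either = re.findall(r"cats|dogs", sample)     # alternation
exactly3 = re.findall(r"[0-9]{3}", sample)    # exactly three digits

print(literal)   # ['CSC108']
print(digits)    # ['1', '0', '8', '3', '4']
print(wildcard)  # ['cat']
print(runs)      # ['108', '3', '4']
print(escaped)   # ['?', '+']
print(either)    # ['cats', 'dogs']
print(exactly3)  # ['108']
```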


Findall

How do we find a phone number with “416” area code?

- Search literals: “416”, “-”
- Search ranges/classes: [0-9]
- Specific number of occurrences: {1} {2} {3}


Findall

Find all phone numbers with “416” in this string.

txt = ('143-614-3330, 556-732-3881, 680-964-1127, 568-769-3556, '
       '099-887-1597, 081-997-3959, 842-502-6372, 406-648-1681, '
       '416-475-8283, 259-778-2868, 105-776-7011, 912-576-5192, '
       '018-087-9554, 975-845-6860, 702-619-1033, 326-382-3556, '
       '416-294-6744, 957-135-4565, 667-624-1973, 603-418-9850')


Findall

>>> x = re.findall("416-[0-9]{3}-[0-9]{4}", txt)
>>> print(x)
['416-475-8283', '416-294-6744']
>>> x = re.findall(r"416-\d{3}-\d{4}", txt)
>>> print(x)
['416-475-8283', '416-294-6744']


Algorithm Steps

Pattern: “416-\d{3}-\d{4}”   Search text: “416-555-1234 , 516-55-51234”

- Iterate through the input, trying to match each input character against the current character of the regex
- If it matches, advance to the next input character and the next character of the regex
- If it matched and the regex has no next character, add the matched string to the results
- If it did not match, advance to the next input character and reset the regex to its start
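The steps above can be sketched roughly as follows. This is a simplified illustration, not how Python's `re` module is implemented: it assumes the pattern has already been expanded by hand into a fixed list of tokens, and the names `token_matches`, `naive_findall`, and `TOKENS` are hypothetical. Unlike the steps above, after a failed attempt it restarts one character past the attempted start position, so that a candidate overlapping the failed attempt is not skipped.

```python
import re


def token_matches(token, ch):
    """Return True if one input character fits one pattern token."""
    if token == r"\d":
        return ch in "0123456789"
    return token == ch


def naive_findall(tokens, text):
    """Collect every non-overlapping substring that fits all tokens."""
    results = []
    start = 0
    while start + len(tokens) <= len(text):
        candidate = text[start:start + len(tokens)]
        if all(token_matches(t, c) for t, c in zip(tokens, candidate)):
            # The regex has no next token left: record the match.
            results.append(candidate)
            start += len(tokens)
        else:
            # Mismatch: reset the pattern and try the next start position.
            start += 1
    return results


# "416-\d{3}-\d{4}" expanded by hand into one token per character.
TOKENS = list("416-") + [r"\d"] * 3 + ["-"] + [r"\d"] * 4

search_text = "416-555-1234 , 516-55-51234"
found = naive_findall(TOKENS, search_text)
print(found)  # ['416-555-1234']
print(found == re.findall(r"416-\d{3}-\d{4}", search_text))  # True
```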


Algorithm Steps II

“416-\d{3}-\d{4}”