Lecture 5 Regular Expressions CSCI – 1900 Mathematics for Computer Science Fall 2014 Bill Pine

Lecture 5 Regular Expressions

CSCI – 1900 Mathematics for Computer Science

Fall 2014

Bill Pine

Lecture Introduction

• Reading
– Rosen - Section 13.4 (pages 879 - 880)

• Review of Strings
• Regular Expressions

Review of Strings

• Recall: a string is a sequence of letters or symbols written without commas

• Example:
– The sequence of characters W, a, k, e, (space), u, p
– Is represented by the string “Wake up”

• Another example
– a, b, a, b, a, b, a, … is a sequence, and “abababa…” is the corresponding string
– The corresponding set is {a, b}
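
A minimal Python sketch of the two examples above. The variable names are illustrative only, and the infinite sequence a, b, a, b, … is stood in for by a finite prefix, since a program can only hold finitely many symbols.

```python
# The sequence W, a, k, e, (space), u, p, written without commas
sequence = ["W", "a", "k", "e", " ", "u", "p"]
string = "".join(sequence)
print(string)  # Wake up

# A finite prefix of the sequence a, b, a, b, ... and its corresponding set
prefix = "abababa"
print(set(prefix) == {"a", "b"})  # True
```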

Strings and Regular Expressions

• Given a set A, A* is the set of all finite sequences of elements of A
• Example:

– A = alphabet = {a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z}

– A* = words (the finite sequences, from A, written without commas)

– A* contains all possible words, even those that are unpronounceable or make no sense such as “prsartkc”

• The empty sequence or empty string is represented by Λ
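
A* is infinite, so it cannot be listed in full, but its elements can be enumerated shortest first. The sketch below assumes a small alphabet and a length cap; the function name `words_up_to` is hypothetical and not part of the lecture.

```python
from itertools import product

def words_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        # product(..., repeat=0) yields one empty tuple: the empty string
        for letters in product(sorted(alphabet), repeat=n):
            yield "".join(letters)

small = list(words_up_to({"a", "b"}, 2))
print(small)  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The first element is the empty string, which belongs to A* for every set A.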

Catenation

• Two strings may be joined into a single string
• Assume w1 = s1s2s3s4…sn and w2 = t1t2t3t4…tk

• The catenation of w1 with w2 is the sequence s1s2s3s4…snt1t2t3t4…tk

• Notation: the catenation of w1 with w2 is written as w1·w2 or w1w2

• Example
– w1 = Bat, w2 = woman

– w1w2 = Batwoman
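
The same example as a short Python sketch. The helper name `catenate` is illustrative; Python's built-in `+` on strings is what does the joining.

```python
def catenate(w1, w2):
    """Return the string formed by the symbols of w1 followed by those of w2."""
    return w1 + w2

w1 = "Bat"
w2 = "woman"
w1w2 = catenate(w1, w2)
print(w1w2)  # Batwoman

# If w1 has n symbols and w2 has k symbols, the catenation has n + k symbols
print(len(w1w2) == len(w1) + len(w2))  # True
```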

Catenation (cont)

• In many computer languages, an operator such as + (for example, in Java and Python) or the pipe symbols || (for example, in SQL) denotes catenation

• Sometimes catenation is referred to as concatenation

Some Properties of Catenation

• If w1, w2 are elements of A*, then w1w2 is an element of A*

• wΛ = w and Λw = w, where Λ is the null string
• A subset B of A* has its own set B*, which contains sentences made up from the words of B
• For example:
– B = {Kirk, Spock, Flies, Runs, Well, Ship} is a subset of A*, where A = Latin alphabet
– The string “KirkRunsWell” is an element of B*
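
One way to check membership in B* is to test whether a string can be split into words of B. The sketch below is only an illustration: the function name `in_star` is hypothetical, and the dynamic-programming approach is a standard technique rather than anything prescribed by the lecture.

```python
def in_star(s, words):
    """Return True if s is a catenation of zero or more elements of `words`."""
    # reachable[i] is True when the first i symbols of s split into words
    reachable = [False] * (len(s) + 1)
    reachable[0] = True  # the empty string is always in B*
    for end in range(1, len(s) + 1):
        for w in words:
            start = end - len(w)
            if w and start >= 0 and reachable[start] and s[start:end] == w:
                reachable[end] = True
                break
    return reachable[len(s)]

B = {"Kirk", "Spock", "Flies", "Runs", "Well", "Ship"}
print(in_star("KirkRunsWell", B))  # True
print(in_star("KirkRun", B))       # False
print(in_star("", B))              # True

# Catenating with the null string leaves a string unchanged
print("" + "Kirk" == "Kirk" and "Kirk" + "" == "Kirk")  # True
```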

Regular Expressions

• The following is from http://etext.lib.virginia.edu/helpsheets/regex.html:

"Regular expressions trace back to the work of an American mathematician by the name of Stephen Kleene (one of the most influential figures in the development of theoretical computer science) who developed regular expressions as a notation for describing what he called 'the algebra of regular sets.' His work eventually found its way into some early efforts with computational search algorithms, and from there to some of the earliest text-manipulation tools on the Unix platform (including ed and grep). In the context of computer searches, the '*' is formally known as a 'Kleene star.'"

Regular Expressions (cont)

• A regular expression on a set A is a recursive formula for a sequence

• A regular expression consists of
– The elements of A,
– And the symbols ( , ) , ∨ , * , Λ

• These symbols have the following interpretations
– ( and ) are grouping symbols
– ∨ is the OR symbol
– * means zero or more catenations
– Λ is the null string

Regular Expressions (cont)

• An expression is regular if it can be constructed according to the following five rules
– The symbol Λ is a regular expression (RE1)
– If x ∈ A, the symbol x is a regular expression (RE2)
– If α and β are regular expressions, then the expression αβ is regular (RE3)
– If α and β are regular expressions, then the expression (α ∨ β) is regular (RE4)
– If α is a regular expression, then the expression (α)* is regular (RE5)
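
The five rules can be read as five constructors of a small recursive data type. The sketch below is one possible encoding, not the lecture's own: the class names and the `show` function are hypothetical, and the symbols Λ (null string) and ∨ (OR) used when rendering are assumed notation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Null:      # RE1: the null-string symbol
    pass

@dataclass(frozen=True)
class Sym:       # RE2: a single element x of A
    x: str

@dataclass(frozen=True)
class Cat:       # RE3: one expression followed by another
    left: object
    right: object

@dataclass(frozen=True)
class Or:        # RE4: parenthesized OR of two expressions
    left: object
    right: object

@dataclass(frozen=True)
class Star:      # RE5: parenthesized expression followed by *
    inner: object

def show(e):
    """Render an expression exactly as the five rules write it."""
    if isinstance(e, Null):
        return "Λ"
    if isinstance(e, Sym):
        return e.x
    if isinstance(e, Cat):
        return show(e.left) + show(e.right)
    if isinstance(e, Or):
        return "(" + show(e.left) + " ∨ " + show(e.right) + ")"
    if isinstance(e, Star):
        return "(" + show(e.inner) + ")*"
    raise TypeError(f"not a regular expression: {e!r}")

# Built with RE2 three times, then RE4, RE5, and RE3
expr = Cat(Sym("a"), Star(Or(Sym("b"), Sym("c"))))
print(show(expr))  # a((b ∨ c))*
```

Because every value is built from these constructors only, any expression this code can represent is regular by construction.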

Regular Expressions (cont)

• A regular expression over A corresponds to a subset of A*

• This is called a regular subset of A* or just regular set

• These subsets are built by rules that correspond, one for one, to the five rules for regular expressions
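
A hedged sketch of how a regular expression maps to its regular set. It assumes the usual correspondence: the null string gives the set containing only the empty string, a symbol x gives {x}, catenation gives all pairwise catenations, OR gives the union, and * gives zero or more catenations. Expressions are encoded as nested tuples, a representation chosen here for brevity. Because a regular set may be infinite, the hypothetical function `language` returns only the strings up to a chosen length.

```python
def language(expr, max_len):
    """Return the strings of length <= max_len in the regular set of expr."""
    kind = expr[0]
    if kind == "null":   # null string -> the set containing the empty string
        return {""}
    if kind == "sym":    # x -> {x}
        return {expr[1]} if len(expr[1]) <= max_len else set()
    if kind == "or":     # OR -> union of the two sets
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "cat":    # catenation -> every u followed by every v
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if kind == "star":   # * -> zero or more catenations
        base = language(expr[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:  # stops once no new string fits within max_len
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind!r}")

# a followed by zero or more catenations of b or c
expr = ("cat", ("sym", "a"), ("star", ("or", ("sym", "b"), ("sym", "c"))))
print(sorted(language(expr, 2)))  # ['a', 'ab', 'ac']
```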

Key Concepts Summary

• Review of Strings
• Regular Expressions