Cs3102: Theory of Computation Class 10: DFAs in Practice Spring 2010 University of Virginia David...


cs3102: Theory of Computation

Class 10: DFAs in Practice

Spring 2010
University of Virginia
David Evans

Menu

• Today:
– Preparing for Exam 1
– Language class for Deterministic PDAs
– Applications of DFAs

• Thursday:
– Exam Review (if you send questions and/or topics)
– Applications of probabilistic DFAs and Grammars

Exam 1

• In class, next Tuesday, 2 March
• Covers:
– Classes 1-9 (10 and 11)
– Sipser Ch 0-2
– Problem Sets 1-3 + Comments

Exam 1

Note: unlike nearly all other sets we draw in this class, all of these sets are finite, and the size (roughly) represents the relative size.

What’s on the Exam?

Definitions
– Language, problem, sets

Constructing and understanding computing models
– Finite automata (DFA, NFA)
– Pushdown automata (DPDA, NPDA)
– Grammars (Context-Free Grammar)

Language Classes: Regular and Context Free
– Show a language is in the class
– Show a language is not in the class
– Prove or disprove a closure property

Proof Methods
– Proof by Induction
– Proof by Construction
– Understand and use the pumping lemmas for RL and CFL

Sample exam on website should give you a good idea what to expect

Your exam will probably also have “what’s wrong with this proof” questions

Exam 1 Notesheet

For Exam 1, you may use only:
– Your own brain and body
– A low-tech writing instrument (pen or pencil)
– A single page (both sides) of notes that you create

You may work with others to create your notes page.

Admiral Grace Hopper

John von Neumann

Albert Einstein

Exam Help Available

• Office Hours:
– Thursdays, 8:30-9:30am
– Thursdays, after class
– Fridays, 10-11:30am (Sonali in Stacks)
– Mondays, 1:15-3pm

• TA’s Exam Review Session
– This Sunday, 5-6:30pm, Olsson 228E

[Figure: nested language classes: All Languages ⊃ Context-Free (CFG or NPDA) ⊃ Regular Languages (DFA, NFA, RE, RG) ⊃ Finite Languages. Example languages marked in the diagram: w, a^n, a^n b^n c^n, ww.]

Where are the languages recognized by a Deterministic PDA?

Proving Set Equivalence

A = B ⇔ A ⊆ B and B ⊆ A

Sets A and B are equivalent if A is a subset of B and B is a subset of A.

Proving Formalism Equivalence


Proving Formalism Non-Equivalence

[Figure: All Languages ⊃ Context-Free (CFG or NPDA) ⊃ Regular Languages (DFA, NFA, RE, RG). The language a^n b^n is marked in the diagram.]

Which of these could be true?

[Figure: two candidate diagrams, each placing the DPDA class relative to the Regular Languages (DFA, NFA, RE, RG) and the Context-Free languages (NPDA).]

How can we distinguish these two plausible possibilities?


Find some language A that can be recognized by some NPDA but not by any DPDA.


Prove by construction: for any NPDA, there is a DPDA that recognizes the same language.

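[Figure: state diagram of an NPDA for A. Transition labels, written as "input, pop → push": ε, ε → $; a, ε → +; ε, ε → ε; b, + → ε; ε, $ → ε; ε, ε → ε; b, + → ε; b, ε → ε; ε, $ → ε.]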

Proof by contradiction: Assume there is a DPDA that recognizes A. Show how to construct an NPDA that recognizes some language we know is not context-free.

Proved by construction: We showed an NPDA that recognizes A.

Proof by contradiction. Suppose there is a DPDA M that recognizes A. It must be in an accept state only after processing a^i b^i and a^i b^(2i).

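[Figure: the computation of M on a^i b^(2i), in two halves.]

First half: transitions of the form "a, α → β" and "b, α → β" (2i transitions, consuming a^i b^i)

Second half: transitions of the form "b, α → β" (i transitions, consuming b^i)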

Construct M’: copy all the states on the second half, replacing b with c:

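[Figure: M’. The first half keeps its transitions "a, α → β" and "b, α → β"; the copied second half uses "c, α → β" in place of "b, α → β".]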

What is the language of M’?


Not a Context-Free Language!

We have a contradiction: if A is in L(DPDA), we could use the DPDA that recognizes A to construct a DPDA that recognizes a non-context-free language! Hence, A must not be in L(DPDA).

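[Figure: All Languages ⊃ Context-Free Languages (CFG or NPDA) ⊃ Deterministic Context-Free Languages ⊃ Regular Languages (DFA, NFA, RE, RG). The language a^n b^n is marked among the deterministic context-free languages; A is marked among the context-free languages.]

Deterministic Context-Free Languages: recognized by a DPDA (or DCFG)

Context-Free Languages ⊋ Deterministic Context-Free Languages ⊋ Regular Languages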
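To make the deterministic case concrete, here is a minimal Python sketch of a deterministic, stack-based recognizer for a^n b^n. It is an illustration under stated assumptions, not a construction from the lecture: the function name `accepts_anbn` is hypothetical, n = 0 (the empty string) is assumed to be in the language, and the `seen_b` flag stands in for the finite-state control of a DPDA.

```python
def accepts_anbn(text):
    """Deterministically decide whether text has the form a^n b^n (n >= 0 assumed)."""
    stack = []        # one "+" is pushed per "a", mirroring a PDA stack
    seen_b = False    # finite control: have we moved on to the b's?
    for symbol in text:
        if symbol == "a":
            if seen_b:
                return False      # an "a" after a "b" can never be accepted
            stack.append("+")
        elif symbol == "b":
            if not stack:
                return False      # more b's than a's
            seen_b = True
            stack.pop()
        else:
            return False          # symbol outside the alphabet {a, b}
    return not stack              # accept only if every "a" was matched
```

Every input symbol determines exactly one move, so no guessing is needed. Recognizing A, by contrast, seems to require guessing in advance whether to pop one stack symbol per b or one per two b's.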

DFAs in Practice

Malware Scanner

W32.Bolzano.Gen: 576a222bd2c20400558b4c240cd9ffff07fbffffff{0-2}5c4e544c445200{0-2}5c57494e4e545c73797374656d33325c6e746f736b726e6c2e65786500{0-29}3b4658

W32.MyLife.E: 7a6172793230*40656d61696c2e636f6d

Note: These are the signatures from ClamAV, an open source virus scanner.
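These signatures are regular expressions over bytes. The sketch below translates a simplified signature of this style into a Python bytes regex. It assumes that `*` matches any run of bytes and that `{n-m}` matches between n and m arbitrary bytes; other wildcard forms are not handled. The function name `signature_to_regex` is hypothetical, not part of ClamAV.

```python
import re

def signature_to_regex(signature):
    """Translate a simplified hex signature into a compiled bytes regex.

    Assumed wildcards: "*" (any number of bytes) and "{n-m}" (n to m bytes).
    """
    parts = []
    i = 0
    while i < len(signature):
        if signature[i] == "*":
            parts.append(b".*")
            i += 1
        elif signature[i] == "{":
            close = signature.index("}", i)
            low, high = signature[i + 1:close].split("-")
            parts.append(b".{%d,%d}" % (int(low), int(high)))
            i = close + 1
        else:
            # Two hex digits encode one literal byte.
            parts.append(re.escape(bytes.fromhex(signature[i:i + 2])))
            i += 2
    # DOTALL lets "." match every byte value, including newline.
    return re.compile(b"".join(parts), re.DOTALL)

mylife = signature_to_regex("7a6172793230*40656d61696c2e636f6d")
```

Under these assumptions the W32.MyLife.E signature decodes to the bytes `zary20`, then any bytes, then `@email.com`, so `mylife.search(data)` reports whether a byte string `data` contains that pattern.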

[Figure labels: Files, Network Traffic]

String Matching

[Figure: DFA with states q0, q1, q2, q3, q4, q5 and transitions labeled t, r, u, t, h.]

We hold these truths to be self-evident, that …

How much work is it to scan a string of length N for a signature?
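One way to see the cost is to build the matching DFA and count its steps: it makes exactly one transition per input character, so scanning N characters takes N transitions. The sketch below is a straightforward, unoptimized construction. The names `build_matcher_dfa` and `scan` are hypothetical, and characters that do not occur in the pattern are assumed to lead back to state 0.

```python
def build_matcher_dfa(pattern):
    """Return delta, where delta[state][ch] is the next state.

    State k means: the last k characters read equal pattern[:k].
    """
    alphabet = set(pattern)
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            candidate = pattern[:state] + ch
            # Next state: longest prefix of pattern that is a suffix of candidate.
            k = min(m, len(candidate))
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta

def scan(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    delta = build_matcher_dfa(pattern)
    state = 0
    for i, ch in enumerate(text):
        state = delta[state].get(ch, 0)   # one transition per character
        if state == len(pattern):
            return i - len(pattern) + 1
    return -1
```

For the pattern "truth" this produces the six states q0 through q5. The construction cost depends only on the pattern, not on the text.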

Faster String Matching


We hold these truths to be self-evident, that …

s[4] = h?
s[9] = h?  s[8] = t?  s[7] = u?

[Figure: the pattern "truth" aligned under the text at two successive positions.]

Skip table for "truth":
h: 0
t: 1
u: 2
r: 3
any other character: 5
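The skipping idea can be sketched in Python as follows. This is a simplified variant that uses only a skip table keyed on the text character under the last pattern position. It is not the full Boyer-Moore algorithm, and the names `build_skip_table` and `skip_search` are hypothetical. A character that does not occur in the pattern is assumed to allow a shift of the full pattern length.

```python
def build_skip_table(pattern):
    """Map each pattern character to its distance from the end of the pattern."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, so the last occurrence wins.
    return {ch: m - 1 - i for i, ch in enumerate(pattern)}

def skip_search(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    m = len(pattern)
    skip = build_skip_table(pattern)
    end = m - 1                      # text index under the pattern's last character
    while end < len(text):
        shift = skip.get(text[end], m)
        if shift > 0:
            end += shift             # no match can end before end + shift
            continue
        # The last character matches; compare the rest right to left.
        k = 1
        while k < m and text[end - k] == pattern[m - 1 - k]:
            k += 1
        if k == m:
            return end - m + 1
        end += 1                     # conservative shift after a partial match
    return -1
```

When most text characters do not occur in the pattern, the loop examines only a small fraction of the text, which is where the speedup comes from.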

DFA / Skipping DFA

Is a “Skipping DFA” still a DFA?

(That is, does it still only accept the Regular Languages?)

J. Strother Moore (UT Austin)

Boyer-Moore Fast String Searching Algorithm (1977)

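Best case: about N/w comparisons, where N is the length of the text and w is the length of the search string (a text character that does not occur in the search string lets the search skip ahead w positions)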

Is this fast enough for a malware scanner?

Virus Detection

Total number of signatures: 720,033

[Figure: signature database size in MB (axis from 2 to 12) by date (11/01 to 03/06), for Symantec and RAV AV. Source: Nate Paul’s study.]

Can we scan one input for many possible malware signatures quickly?

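Combining DFAs?

Regular languages are closed under union:

[Figure: a new start state q0 with two ε-transitions, together with the states qA0, qA1 of machine A and qB0, qB1 of machine B, and transitions labeled a.]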

How many states are there now?
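If the combined machine must stay deterministic, one standard option is the product construction, sketched below. Each product state pairs a state of A with a state of B, so the state count multiplies rather than adds. The tuple layout `(states, start, accepting, delta)`, the function names, and the two toy machines are assumptions for illustration only.

```python
from itertools import product

def union_dfa(dfa_a, dfa_b, alphabet):
    """Product construction: accept if either DFA accepts.

    Each DFA is (states, start, accepting, delta) with a total transition
    function delta[(state, symbol)].
    """
    states_a, start_a, accept_a, delta_a = dfa_a
    states_b, start_b, accept_b, delta_b = dfa_b
    states = set(product(states_a, states_b))
    delta = {
        ((p, q), symbol): (delta_a[(p, symbol)], delta_b[(q, symbol)])
        for (p, q) in states
        for symbol in alphabet
    }
    accepting = {(p, q) for (p, q) in states if p in accept_a or q in accept_b}
    return states, (start_a, start_b), accepting, delta

def accepts(dfa, text):
    """Run the DFA on text and report whether it ends in an accepting state."""
    _, state, accepting, delta = dfa
    for symbol in text:
        state = delta[(state, symbol)]
    return state in accepting

# Hypothetical toy machines over the alphabet {a, b}.
contains_a = ({0, 1}, 0, {1}, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1})
ends_in_b = ({0, 1}, 0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})
combined = union_dfa(contains_a, ends_in_b, "ab")
```

Two 2-state machines give a 4-state product. Repeating this across hundreds of thousands of signature machines is clearly impractical, which motivates sharing states between signatures.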

Signatures

First byte    Set of signatures
00000000      ~720000/256
00000001      ~720000/256
00000010      ~720000/256
…
11111111      ~720000/256

Try a Trie

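[Figure: trie rooted at q0. The first input byte (0x00, 0x01, 0x02, …, 0xFF) selects one of the states q00, q01, q02, …, qFF; the second byte (0x00, 0x01, 0x02, …, 0xFF) selects a state such as q0000, q0001, q0002, …, q01FF.]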

720000/(256*256) ≈ 11 signatures remaining per two-byte prefix (assuming signatures are spread evenly)

Alfred V. Aho and Margaret J. Corasick, 1975

[Figure: state q0000 with a transition on 0x02, labeled Alureona.]
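A trie with failure links, in the style of Aho and Corasick, lets one pass over the input check all signatures at once. The following is a compact Python sketch, not the implementation of any particular scanner. The names `build_automaton` and `scan_all` and the toy signatures are hypothetical.

```python
from collections import deque

def build_automaton(signatures):
    """Build trie edges, failure links and outputs for a dict name -> bytes."""
    goto = [{}]       # goto[state][byte] = next state (trie edges)
    fail = [0]        # fail[state] = state for the longest proper suffix in the trie
    output = [[]]     # names of signatures that end at this state
    for name, pattern in signatures.items():
        state = 0
        for byte in pattern:
            if byte not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        output[state].append(name)
    # Breadth-first pass: failure links of shallower states are set first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output

def scan_all(data, automaton):
    """Return (end position, signature name) for every match in data."""
    goto, fail, output = automaton
    state = 0
    hits = []
    for position, byte in enumerate(data):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for name in output[state]:
            hits.append((position, name))
    return hits

# Hypothetical toy signatures, not real malware signatures.
toy = build_automaton({"he": b"he", "she": b"she", "hers": b"hers"})
```

The scan never moves backward in the input. Each byte is read once, and the failure links take the place of restarting the match from an earlier position.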

Scanner Demo

http://www.virustotal.com

Evasive Malware

Metamorphic Code: as virus propagates, each new copy is different

How hard is it to automatically modify code without changing its behavior?

Detecting Evasive Malware

• Less exact signatures (e.g., W32.MyLife.E: 7a6172793230*40656d61696c2e636f6d)
– Dangerous: you start matching benign programs if you’re not careful!

• Behavioral signatures: match the behavior, not the program text
– Undecidable in general (we’ll see in a few weeks)
– Expensive and difficult in practice (but done by all decent scanners)

Faster String Scanning

Charge

• We focus on DFAs, NFAs, PDAs, CFGs, etc. as abstract models: Number of states, time to process, etc. don’t matter

• Lots of real applications of these models: but in practice, what matters is different

If you have topics you want me to review, post comments (on today’s class announcement) by 5pm tomorrow.
