cs3102: Theory of Computation
Class 10: DFAs in Practice
Spring 2010, University of Virginia, David Evans
Menu
• Today:
  – Preparing for Exam 1
  – Language class for Deterministic PDAs
  – Applications of DFAs
• Thursday:
  – Exam Review (if you send questions and/or topics)
  – Applications of probabilistic DFAs and Grammars
Exam 1
• In class, next Tuesday, 2 March
• Covers:
  – Classes 1-9 (10 and 11)
  – Sipser Ch 0-2
  – Problem Sets 1-3 + Comments
Exam 1
Note: unlike nearly all other sets we draw in this class, all of these sets are finite, and the size (roughly) represents the relative size.
What’s on the Exam?
• Definitions
  – Language, problem, sets
• Constructing and understanding computing models
  – Finite automata (DFA, NFA)
  – Pushdown automata (DPDA, NPDA)
  – Grammars (Context-Free Grammar)
• Language Classes: Regular and Context Free
  – Show a language is in the class
  – Show a language is not in the class
  – Prove or disprove a closure property
• Proof Methods
  – Proof by Induction
  – Proof by Construction
  – Understand and use the pumping lemmas for RL and CFL
Sample exam on website should give you a good idea what to expect
Your exam will probably also have “what’s wrong with this proof” questions
Exam 1 Notesheet
For Exam 1, you may use only:
  – Your own brain and body
  – A low-tech writing instrument (pen or pencil)
  – A single page (both sides) of notes that you create
You may work with others to create your notes page.
Admiral Grace Hopper
John von Neumann
Albert Einstein
Exam Help Available
• Office Hours:
  – Thursdays, 8:30-9:30am
  – Thursdays, after class
  – Fridays, 10-11:30am (Sonali in Stacks)
  – Mondays, 1:15-3pm
• TA’s Exam Review Session
  – This Sunday, 5-6:30pm, Olsson 228E
[Figure: nested language classes: Finite Languages inside Regular Languages (DFA, NFA, RE, RG) inside Context-Free (CFG or NPDA) inside All Languages; example languages marked: w, a^n, a^n b^n c^n, ww]
Where are the languages recognized by a Deterministic PDA?
Proving Set Equivalence
A = B ⇔ A ⊆ B and B ⊆ A
Sets A and B are equivalent if A is a subset of B and B is a subset of A.
Proving Formalism Equivalence
Proving Formalism Non-Equivalence
[Figure: Regular Languages (DFA, NFA, RE, RG) inside Context-Free (CFG or NPDA) inside All Languages; a^n b^n marked as an example language]
Which of these could be true?
[Figure: two candidate diagrams, each relating Regular Languages (DFA, NFA, RE, RG), DPDA, and Context-Free (NPDA)]
How can we distinguish these two plausible possibilities?
• Find some language A that can be recognized by some NPDA but not by any DPDA.
• Prove by construction: for any NPDA, there is a DPDA that recognizes the same language.
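[Figure: NPDA for A. Transition labels: ε, ε → $; a, ε → +; then two branches. First branch: ε, ε → ε; b, + → ε; ε, $ → ε. Second branch: ε, ε → ε; b, + → ε; b, ε → ε; ε, $ → ε]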
Proof by contradiction: Assume there is a DPDA that recognizes A. Show how to construct an NPDA that recognizes some language we know is not context free.
Proved by construction: We showed an NPDA that recognizes A.
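The NPDA construction can be tried out with a small simulator. This is only a sketch under assumptions: it takes A = {a^i b^i} ∪ {a^i b^(2i)} (as the accept conditions in the proof suggest), and the state names q0 to q6 and the tuple encoding of transitions are hypothetical, chosen for illustration.

```python
# Each transition is (source, input symbol, symbol to pop, symbol to push, target).
# The empty string "" plays the role of ε. State names are hypothetical.
TRANSITIONS = [
    ("q0", "", "", "$", "q1"),    # ε, ε → $
    ("q1", "a", "", "+", "q1"),   # a, ε → +
    ("q1", "", "", "", "q2"),     # guess: one b per a
    ("q2", "b", "+", "", "q2"),   # b, + → ε
    ("q2", "", "$", "", "q3"),    # ε, $ → ε
    ("q1", "", "", "", "q4"),     # guess: two b's per a
    ("q4", "b", "+", "", "q5"),   # b, + → ε
    ("q5", "b", "", "", "q4"),    # b, ε → ε
    ("q4", "", "$", "", "q6"),    # ε, $ → ε
]
ACCEPTING = {"q3", "q6"}


def npda_accepts(word, transitions=TRANSITIONS, start="q0", accepting=ACCEPTING):
    """Explore every reachable configuration (state, input position, stack)."""
    frontier = [(start, 0, ())]
    seen = set(frontier)
    while frontier:
        state, pos, stack = frontier.pop()
        if pos == len(word) and state in accepting:
            return True
        for src, read, pop, push, dst in transitions:
            if src != state:
                continue
            if read and (pos >= len(word) or word[pos] != read):
                continue
            if pop and (not stack or stack[-1] != pop):
                continue
            new_stack = stack[:-1] if pop else stack
            if push:
                new_stack = new_stack + (push,)
            config = (dst, pos + (1 if read else 0), new_stack)
            if config not in seen:
                seen.add(config)
                frontier.append(config)
    return False
```

The simulator tracks a set of configurations because the machine has to guess which branch to take after the a's; a DPDA has no such freedom, which is what the proof by contradiction exploits.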
Proof by contradiction. Suppose there is a DPDA M that recognizes A. It must be in an accept state only after processing a^i b^i and a^i b^(2i).
[Figure: first half of M: … a, α → β … b, α → β … (2i transitions, consuming a^i b^i); second half of M: … b, α → β … b, α → β … (i transitions, consuming b^i)]
Construct M’: copy all the states on the second half, replacing b with c:
[Figure: M’: … a, α → β … b, α → β … c, α → β … c, α → β …]
What is the language of M’?
Not a Context-Free Language!
We have a contradiction: if A is in L(DPDA), we could use the DPDA that recognizes A to construct a PDA that recognizes a non-context-free language! Hence, A must not be in L(DPDA).
[Figure: Regular Languages (DFA, NFA, RE, RG) inside Context-Free (CFG or NPDA) inside All Languages; example languages a^n b^n and A marked]
Deterministic Context-Free Languages: recognized by a DPDA (or DCFG)
Context-Free Languages ⊃ Deterministic Context-Free Languages ⊃ Regular Languages
DFAs in Practice
Malware Scanner
W32.Bolzano.Gen: 576a222bd2c20400558b4c240cd9ffff07fbffffff{0-2}5c4e544c445200{0-2}5c57494e4e545c73797374656d33325c6e746f736b726e6c2e65786500{0-29}3b4658
W32.MyLife.E: 7a6172793230*40656d61696c2e636f6d
Note: These are the signatures from ClamAV, an open source virus scanner.
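Both wildcard forms in these signatures (`*` and `{n-m}`) stay within the regular languages, so a signature can be turned into an ordinary regular expression. A minimal sketch, assuming `*` means "any number of bytes" and `{n-m}` means "between n and m arbitrary bytes"; the function name `signature_to_regex` is made up for this example, and real ClamAV signatures have further features that are not handled here.

```python
import re


def signature_to_regex(signature):
    """Translate a hex signature with * and {n-m} wildcards into a bytes regex."""
    parts = []
    i = 0
    while i < len(signature):
        ch = signature[i]
        if ch == "*":
            parts.append(b".*")                      # any number of bytes
            i += 1
        elif ch == "{":
            close = signature.index("}", i)
            low, high = signature[i + 1:close].split("-")
            parts.append(b".{%d,%d}" % (int(low), int(high)))
            i = close + 1
        else:
            byte = bytes.fromhex(signature[i:i + 2])  # two hex digits = one byte
            parts.append(re.escape(byte))
            i += 2
    # DOTALL so that "." also matches the newline byte
    return re.compile(b"".join(parts), re.DOTALL)


mylife = signature_to_regex("7a6172793230*40656d61696c2e636f6d")
print(mylife.search(b"From: zary20-anything@email.com") is not None)
```

Decoded, the W32.MyLife.E signature is the bytes "zary20", then anything, then "@email.com".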
[Figure: scanner inputs: Files, Network Traffic]
String Matching
q0 --t--> q1 --r--> q2 --u--> q3 --t--> q4 --h--> q5
We hold these truths to be self-evident, that …
How much work is it to scan a string of length N for a signature?
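One way to think about the question: a DFA reads each input symbol exactly once, so scanning takes N transitions. A sketch of such a matcher follows; the function names and the table layout are illustrative assumptions, and the table is built in the simplest (not the fastest) way.

```python
def build_matcher(pattern):
    """Transition table of a DFA whose state k means: the last k symbols read
    are the first k symbols of the pattern (k as large as possible)."""
    alphabet = set(pattern)
    table = []
    for state in range(len(pattern) + 1):
        row = {}
        for ch in alphabet:
            candidate = pattern[:state] + ch
            k = min(len(pattern), len(candidate))
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        table.append(row)
    return table


def scan(text, pattern):
    """Return the index of the first match, or -1. One table lookup per symbol."""
    table = build_matcher(pattern)
    state = 0
    for i, ch in enumerate(text):
        # Symbols that do not occur in the pattern always lead back to state 0.
        state = table[state].get(ch, 0)
        if state == len(pattern):
            return i - len(pattern) + 1
    return -1


print(scan("We hold these truths to be self-evident", "truth"))
```

Unlike the straight-line picture above, the table also contains the backward transitions needed after a partial match (for example, reading "trutr" leaves the machine in state 2, not state 0).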
Faster String Matching
We hold these truths to be self-evident, that …
Compare the signature "truth" against the text starting from its last character:
s[4] = h?   s[9] = h?
s[8] = t?   s[7] = u?
Skip table for "truth":
h: 0
t: 1
u: 2
r: 3
a, b, c, d, e, f, g, i, j, k, l, m, n, o, p, q, s, v, w, x, y, z: 5
DFA / Skipping DFA
Is a “Skipping DFA” still a DFA?
(That is, does it still only accept the Regular Languages?)
J. Strother Moore (UT Austin)
Boyer-Moore Fast String Searching Algorithm (1977)
Best case: about N/w comparisons, where N is the length of the text and w is the length of the search string (a mismatch on a character that does not occur in the search string skips w positions)
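The skipping idea can be sketched in a few lines. Note that this is the simplified Horspool variant (bad-character skips only), not the full 1977 algorithm, and the function names are assumptions for this example. In this variant a character that occurs only in the last pattern position (h in "truth") gets the default skip after a failed alignment.

```python
def skip_table(pattern):
    """Distance from the last occurrence of each character (ignoring the final
    position of the pattern) to the end of the pattern."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    skips = skip_table(pattern)
    comparisons = 0
    end = m - 1                      # text index aligned with the last pattern char
    while end < n:
        k = 0
        while k < m:                 # compare right to left
            comparisons += 1
            if text[end - k] != pattern[m - 1 - k]:
                break
            k += 1
        if k == m:
            return end - m + 1, comparisons
        # Characters that are not in the table skip the whole pattern length.
        end += skips.get(text[end], m)
    return -1, comparisons


print(horspool_search("We hold these truths to be self-evident", "truth"))
print(horspool_search("aaaaaaaaaa", "truth"))
```

On the all-a text every alignment costs a single comparison and the window advances by the full pattern length, which is the best case.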
Is this fast enough for a malware scanner?
Virus Detection
Total number of signatures: 720,033
[Figure: signature database size (MB, axis from 2 to 12) by date (11/01 to 03/06) for Symantec and RAV AV]
Nate Paul’s study
Can we scan one input for many possible malware signatures quickly?
Combining DFAs?
Regular languages closed under union:
[Figure: new start state q0 with ε-transitions to qA0 and qB0; qA0 goes to qA1 on a, qB0 goes to qB1 on a, and so on (…)]
How many states are there now?
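The ε-construction adds only one new state, but the result is an NFA. If a deterministic machine is needed, one option is the product construction, where each state is a pair of states of the two machines. A sketch under assumptions: the dictionary layout for a DFA and the two tiny example machines are made up for illustration.

```python
from collections import deque


def union_dfa(a, b, alphabet):
    """Product construction: run both DFAs in lockstep, accept if either accepts.
    Only pairs reachable from the start pair are built."""
    start = (a["start"], b["start"])
    delta, seen, queue = {}, {start}, deque([start])
    while queue:
        qa, qb = queue.popleft()
        for ch in alphabet:
            nxt = (a["delta"][qa, ch], b["delta"][qb, ch])
            delta[(qa, qb), ch] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    accept = {s for s in seen if s[0] in a["accept"] or s[1] in b["accept"]}
    return {"start": start, "accept": accept, "delta": delta, "states": seen}


def accepts(dfa, word):
    state = dfa["start"]
    for ch in word:
        state = dfa["delta"][state, ch]
    return state in dfa["accept"]


# Two hypothetical 2-state machines over {a, b}.
ends_with_a = {"start": 0, "accept": {1},
               "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}}
even_length = {"start": 0, "accept": {0},
               "delta": {(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}}

either = union_dfa(ends_with_a, even_length, "ab")
print(len(either["states"]))
```

Here two 2-state machines give a 4-state product. The number of pairs can grow multiplicatively with each machine added, which is why combining hundreds of thousands of signature DFAs this way is not practical.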
Signatures
First byte    Set of signatures
00000000      ~720000/256
00000001      ~720000/256
00000010      ~720000/256
…
11111111      ~720000/256
Try a Trie
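[Figure: trie rooted at q0 with one edge per first byte (0x00, 0x01, 0x02, …, 0xFF) leading to states q00, q01, q02, …, qFF; each of those has one edge per second byte (0x00, 0x01, 0x02, …, 0xFF) leading to states q0000, q0001, q0002, …, q01FF, …]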
720000/(256*256) ~ 11
Alfred V. Aho and Margaret J. Corasick, 1975
[Figure labels: q0000, 0x02, Alureona]
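Aho and Corasick's 1975 algorithm turns such a trie into a single machine by adding failure links, so one pass over the input finds every signature. The following is a compact sketch, not a production scanner: the signature names are hypothetical, wildcards are not handled, and the state tables are plain Python lists.

```python
from collections import deque


def build_automaton(signatures):
    """signatures: dict mapping a name to a bytes pattern."""
    goto = [{}]    # goto[state][byte] -> next state (the trie edges)
    fail = [0]     # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]     # out[state] -> names of signatures ending at this state
    for name, pattern in signatures.items():
        state = 0
        for byte in pattern:
            if byte not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        out[state].append(name)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:                              # breadth-first over the trie
        state = queue.popleft()
        for byte, child in goto[state].items():
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(byte, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def scan(data, automaton):
    """Return (end position, signature name) for every match, in one pass."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, byte in enumerate(data):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for name in out[state]:
            hits.append((pos, name))
    return hits


# Hypothetical signatures, for illustration only.
automaton = build_automaton({
    "sig_a": bytes.fromhex("7a6172793230"),
    "sig_b": b"ry2",
    "sig_c": b"email",
})
print(scan(b"xxzary20@email.com", automaton))
```

The scan time depends on the input length and the number of matches reported, not on how many signatures are loaded; the cost of many signatures shows up as memory for the trie.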
Evasive Malware
Metamorphic Code: as virus propagates, each new copy is different
How hard is it to automatically modify code without changing its behavior?
Detecting Evasive Malware
• Less exact signatures (e.g., W32.MyLife.E: 7a6172793230*40656d61696c2e636f6d)
  – Dangerous: start matching benign programs if you’re not careful!
• Behavioral signatures: match the behavior, not the program text
  – Undecidable in general (we’ll see in a few weeks)
  – Expensive and difficult in practice (but done by all decent scanners)
Faster String Scanning
Charge
• We focus on DFAs, NFAs, PDAs, CFGs, etc. as abstract models: Number of states, time to process, etc. don’t matter
• Lots of real applications of these models: but in practice, what matters is different
If you have topics you want me to review, post comments (on today’s class announcement) by 5pm tomorrow.