Fact: There are people who don’t know how to match patterns!

Is this a match??

Where does one start?

1. Prof. Amir’s course

2. Are there any books on Pattern Matching?

Books, Books, …

1. Computing Patterns in Strings, Bill Smyth, to appear in 2003

2. Pattern matching and text compression algorithms, M. Crochemore and T. Lecroq, Chapter 8 in the “Computer Science and Engineering Handbook”, to appear in 2003

More to appear this year…

To appear:

“Applied Combinatorics on Words”, 2003

and … “Word Combinatorics”

Edited by: M. Lothaire

Is “Pattern Matching” a community?

1. Conferences

2. Bibliographies

3. Pattern Matching Forum

4. Software and Animations

Conferences

Pattern Matching:

CPM, SPIRE, Prague Stringology Conference

Sister Conferences:

SIGIR, LATIN, DCC, KDD, …

Theory Conferences:

STOC, FOCS, ICALP, SODA, ESA, WADS, SWAT, STACS,…

Bibliography on Pattern Matching

Part of: The Collection of Computer Science Bibliographies, Bibliographies on Theory/Foundations of Computer Science

Bibliography author: Thierry Lecroq
Maintained by: U. of Karlsruhe, Germany
Number of references: 2184
Number of online publications: 2
Most recent reference: 2002
Last update: November 29, 2002
Supported: yes

Pattern Matching Pointers

Maintained by: Stefano Lonardi

Contents (last updated: Mon Aug 12 16:01:47 PDT 2002)

People

Pattern Matching Discussion Boards

Conference announcements

Resources: on-line bibliographies, journals, proceedings, software, newsgroups.

The purpose of this page is to serve as an index to information relevant to Pattern Matching/Computational Biology researchers. We prefer to point to information rather than store it locally. We include all submissions that seem appropriate. However, inclusion should not be interpreted as an endorsement of a contribution's accuracy or importance.

Introduction

Combinatorial Pattern Matching addresses issues of searching and matching strings and more complicated patterns such as trees, regular expressions, graphs, point sets, and arrays. The goal is to derive non-trivial combinatorial properties for such structures and then to exploit these properties in order to achieve improved performance for the corresponding computational problem.

Pattern Matching Forum

Pattern Matching Discussion Board: Pattern Matching Problems

Subtopic (Msgs; Last Updated)

• <String Libray> anyone? (4)
• Matching pattern for stable marriage problem (2; 11/07 07:00am)
• Point pattern matching (2)
• Pattern matching of strings (5)
• Bit Pattern Matching (2)
• String matching? (3)
• Pattern Matching in C (5; 12/30 05:17pm)
• Person demographics matching (1)
• X-ray image comparison and matching (2; 07/18 04:03am)
• I ,m aware of how I start or use algorithm approach to difine the similarty between two shapes (1)
• I ,m aware of how I start or use algorithm approach to difine the similarty between two shapes (2)

String Matching Animations

1. 30 Exact String Matching Algorithms Animated in Java, Christian Charras and Thierry Lecroq, http://www-igm.univ-mlv.fr/~lecroq/string/

2. Java Applets for Sequence Comparison Algorithms, Christian Charras and Thierry Lecroq

3. Animations around the Globe: Stephen Campbell, UK; Gusfield, USA; Navarro, Chile; Buhler, Germany; Cássia, Brazil; Michailidis, Macedonia

Obviously, there is a history.

Is there a future?

Applications in Pattern Matching

1. Computational Biology
2. Compilers
3. Search Engines
4. XML
5. Musicology
6. Meteorology
7. Image Processing/X-Ray
8. Databases/SQL
9. Data Mining
10. TCP-IP Routing Tables
11. Soundex
12. Data Compression
13. P2P Networking (Napster, Kazaa, Gnutella)
14. Intrusion Detection

Computational Biology

1. Sequence Alignment - LCS, Edit Distance,…

2. Multiple Sequence Alignment

3. Gene Finding

4. Phylogeny

5. Physical Mappings

6. Genome Rearrangements

7. DNA Chips and Gene Networks

Sequence Alignment – Edit Distance

Dynamic programming: O(nm)

D(i,j) for the strings AGT (rows) and AACT (columns):

              A   A   C   T
          0   1   2   3   4
      A   1   0   1   2   3
      G   2   1   1   2   3
      T   3   2   2   2   2

Edit Operations

1. Insert 2. Delete 3. Mismatch
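
The table on the previous slide can be reproduced with a short sketch of the standard O(nm) dynamic program. The function name `edit_table` and the unit cost for each of the three operations are assumptions made for illustration; the slides do not state the costs explicitly.

```python
def edit_table(a, b):
    """Return the full edit-distance table D for strings a (rows) and b (columns).

    Assumption: insert, delete and mismatch each cost 1.
    """
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i          # delete the first i characters of a
    for j in range(m + 1):
        D[0][j] = j          # insert the first j characters of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,             # delete
                D[i][j - 1] + 1,             # insert
                D[i - 1][j - 1] + mismatch,  # match or mismatch
            )
    return D


table = edit_table("AGT", "AACT")
for row in table:
    print(row)
print("edit distance:", table[-1][-1])
```

The bottom-right cell holds the distance between the two full strings; every cell depends only on its three upper-left neighbours, which is where the O(nm) bound comes from.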

Sequence Alignment

Databases

1. Nucleic Acids: ENTREZ, SRS, BankIt, EMBL, NDB, dbEST
2. Proteins: SwissProt, PIR, OWL, Molecules ‘R Us
3. Chromosome Maps: CEPH-Genethon, CHLC, NCBI
4. Factors and Motifs: TFD, Prosite (New!)
5. Enzymes: REBASE, ENZYME, EC Enzyme DB, Merops
6. Organism specific databases: Many!

Alignment Algo’s: BLAST, FASTA, PAM, Prosite, BLOCKS, BLOSUM, Teiresias

Sequence Alignment

Online Approximate Queries; Indexing with Errors

[Figure: a pattern queried against a database]

• For (small!) constant distance, it seems that there may be hope…

Multiple Sequence Alignment

Problem: Given strings S1,…,Sk, find a string S closest to all of them

Closest: sum-of-pairs, distance-from-consensus

Solution: Dynamic programming (exponential in k); NP-completeness; heuristics; approximations

Multiple Alignment to a Phylogenetic tree

{ aba cdaa daab mada dag lab abda daa }

[Figure: a tree whose nodes are labeled aba, abda, cdaa, daa, mada, dgab, lab, dag, with edge weights of 1 and 2]

Optimal alignment: 2+2+2+2+2+1+1=12

DNA Chips – Sequencing by Hybridization

From Jeremy Buhler’s Web pages:

1. Choosing Cell Populations

2. mRNA Extraction and Reverse Transcription

3. Fluorescent Labeling of cDNA's

4. Hybridization to a DNA Microarray

5. Scanning the Hybridized Array

6. Interpreting the Scanned Image

Compilers

Grammars (EBNF): Regular Expressions

Regular Expression Search: search for a regular expression of length m in a text of length n.

Time: O(nm), since 1968!
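
The O(nm) bound comes from simulating a nondeterministic automaton: keep the set of active pattern states and update all of them for each text character. Below is a minimal sketch of that idea for a deliberately restricted pattern language (literal characters, `.` for any character, and `x*` for repetition of a single symbol). The restricted syntax and the names `parse`, `close_over_stars` and `regex_search_ends` are assumptions for illustration; full regular expressions with alternation and grouping need a more general automaton construction.

```python
def parse(pattern):
    """Split a restricted pattern into (symbol, starred) tokens."""
    tokens, i = [], 0
    while i < len(pattern):
        starred = i + 1 < len(pattern) and pattern[i + 1] == "*"
        tokens.append((pattern[i], starred))
        i += 2 if starred else 1
    return tokens


def close_over_stars(active, tokens):
    """A starred token may match zero characters, so its state leaks forward."""
    for i, (_, starred) in enumerate(tokens):
        if active[i] and starred:
            active[i + 1] = True


def regex_search_ends(pattern, text):
    """Return every end position j such that some text[i:j] matches pattern.

    State i means "the first i tokens have been matched". Each text
    character costs O(m) work, giving O(nm) overall.
    """
    tokens = parse(pattern)
    k = len(tokens)
    active = [False] * (k + 1)
    active[0] = True
    close_over_stars(active, tokens)
    ends = [0] if active[k] else []
    for j, c in enumerate(text):
        nxt = [False] * (k + 1)
        nxt[0] = True  # a match may start at any text position
        for i, (symbol, starred) in enumerate(tokens):
            if active[i] and symbol in (".", c):
                nxt[i if starred else i + 1] = True
        close_over_stars(nxt, tokens)
        active = nxt
        if active[k]:
            ends.append(j + 1)
    return ends


print(regex_search_ends("ab*c", "xacabbc"))
print(regex_search_ends("a.c", "abcaxc"))
```

The set of active states is stored as a boolean list, so no backtracking is ever needed, which is the design choice that keeps the running time bounded by the product of the two lengths.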

Compilers

Structure Matching

Yale: CS 421/521 (ML and SML): Compilers and Interpreters, Fall 2002

Parameterized/Function Matching

Parameterized Matching

Pattern: a b b c a b
Text: z x x x y y z x y z y z x y w z x y

f(a)=x, f(b)=y, f(c)=z

Prog.c

int a,b;
a=1;
a = g(a)*5+f(a);
b=2;
a = func(a,b);
a = a*g(b);
b=1;
b = g(b)*5+f(b);
….

Pattern

c=1;
c = g(c)*5+f(c);
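
To make the renaming idea concrete, here is a naive sketch that slides the pattern over the text and tries to build the mapping f symbol by symbol. The function name `parameterized_matches` and the `injective` flag are assumptions for illustration: with `injective=True` two pattern symbols may not map to the same text symbol (parameterized matching), and with `injective=False` any consistent function is accepted (function matching). This is the straightforward O(nm) check, not one of the faster published algorithms.

```python
def parameterized_matches(pattern, text, injective=True):
    """Return (start, mapping) for every window of text that equals
    pattern after renaming the pattern symbols through a mapping f."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        f = {}          # pattern symbol -> text symbol
        used = set()    # text symbols already assigned
        ok = True
        for p, t in zip(pattern, text[start:start + m]):
            if p in f:
                if f[p] != t:       # inconsistent with earlier choice
                    ok = False
                    break
            else:
                if injective and t in used:
                    ok = False      # two pattern symbols share an image
                    break
                f[p] = t
                used.add(t)
        if ok:
            hits.append((start, f))
    return hits


# The pattern and text from the slide, written without spaces.
for start, mapping in parameterized_matches("abbcab", "zxxxyyzxyzyzxywzxy"):
    print(start, mapping)
```

On the slide's example the only match starts at text index 3 (0-based) and recovers f(a)=x, f(b)=y, f(c)=z.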

Search Engines

1. Crawl

2. Index

3. Search

Standard Web Search Engine Architecture

[Figure: crawl the web; check for duplicates and store the documents; create an inverted index; the search engine servers take the user query, look up DocIds in the inverted index, and show results to the user]

Search Engine Queries

Search on inverted index (use ranking schemes: most cited, link analysis, most visited, term frequency), e.g. PageRank™

Challenges

1. Distributed queries.
2. Boolean queries.

Distributed Queries

[Figure: a query distributed over many inverted indexes]

Boolean Queries

Query terms: BFS, Dijkstra

Posting lists of DocIds (one per term: Dijkstra, BFS): 563 33 2131 …; 12 78 33 …

And = Intersection

Or = Union

Not = not in list
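
The three Boolean operators map directly onto set operations over posting lists. The sketch below builds a tiny inverted index and evaluates each operator; the three documents, their texts, and the function names are hypothetical and chosen only to echo the DocIds on the slide.

```python
def build_inverted_index(docs):
    """Map each lowercased term to the set of DocIds containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def query_and(index, a, b):
    return index.get(a, set()) & index.get(b, set())   # And = intersection


def query_or(index, a, b):
    return index.get(a, set()) | index.get(b, set())   # Or = union


def query_not(index, term, all_ids):
    return all_ids - index.get(term, set())            # Not = not in list


# Hypothetical documents, for illustration only.
docs = {
    12: "BFS traversal of a graph",
    33: "BFS versus Dijkstra",
    563: "Dijkstra shortest paths",
}
index = build_inverted_index(docs)
all_ids = set(docs)

print(query_and(index, "bfs", "dijkstra"))
print(query_or(index, "bfs", "dijkstra"))
print(query_not(index, "dijkstra", all_ids))
```

Real engines keep posting lists sorted and merge them instead of materialising Python sets, but the semantics of the three operators are the same.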

XML – Extensible Markup Language

XML document Tree representation

XML

XQuery: Query language for XML

Query types: …, path expression, …

More extensive: Tree Pattern Matching, Kilpeläinen [92] (Hot in Automata community)
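
As a small illustration of a path expression evaluated on the tree representation of an XML document, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset rather than XQuery itself. The XML document and its element names (`library`, `book`, `title`, `author`) are hypothetical; the titles are borrowed from the book list earlier in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, for illustration only.
XML_TEXT = """
<library>
  <book>
    <title>Computing Patterns in Strings</title>
    <author>Bill Smyth</author>
  </book>
  <book>
    <title>Applied Combinatorics on Words</title>
    <author>M. Lothaire</author>
  </book>
</library>
"""


def select_text(root, path):
    """Evaluate a path expression and return the text of each matching node."""
    return [element.text for element in root.findall(path)]


root = ET.fromstring(XML_TEXT.strip())   # parse into a tree of elements
titles = select_text(root, "./book/title")   # child steps from the root
authors = select_text(root, ".//author")     # descendants at any depth

print(titles)
print(authors)
```

A path expression matches a single root-to-node path; tree pattern matching generalises this by matching a whole pattern tree with several branches at once.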

XML

Tree Pattern Matching

[Figure: a pattern tree matched step by step against a larger tree; nodes are labeled a, b, c, d, f, g]

XML

Extended Tree Pattern Matching

[Figure: a pattern tree matched step by step against a larger tree; nodes are labeled a, b, c, d, f, g]

Musicology

1. Music Comparison
2. Music Information Retrieval
3. Music Pattern Induction

Attributes: Duration and Pitch

Properties: transposition invariance, polyphony, and the musical context

Copyright Infringement: (US federal court)

Pitch sequences

Lerdahl and Jackendoff:

“the importance of parallelism (that is, approximate or literal repetition) in musical structure cannot be overestimated. The more parallelism one can detect, the more internally coherent an analysis becomes, and the less independent information must be processed and retained in hearing or remembering a piece.”

Pattern Matching?

1. Find approximate match to pitch sequences, with distance defined by properties.

2. Can music be de-polyphony-ized? i.e. create multiple monophony tracks by differentiating patterns?

3. Automatic detection of transposition invariance?
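
One common way to obtain transposition invariance for exact matching is to compare sequences of pitch intervals instead of absolute pitches, since transposing a melody shifts every pitch by the same amount and leaves the intervals unchanged. The sketch below assumes monophonic melodies given as lists of MIDI note numbers; the function names and the example melody are hypothetical, and duration and polyphony are ignored.

```python
def intervals(pitches):
    """Differences between successive pitches (in semitones)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def find_transposed(pattern, melody):
    """Return start indices where pattern occurs in melody up to transposition."""
    p, m = intervals(pattern), intervals(melody)
    k = len(p)
    return [i for i in range(len(m) - k + 1) if m[i:i + k] == p]


# Hypothetical example: a C major scale as MIDI note numbers.
melody = [60, 62, 64, 65, 67, 69, 71, 72]
pattern = [67, 69, 71]   # two whole steps upward

print(find_transposed(pattern, melody))
```

Approximate matching, as asked for in item 1, would replace the equality test on interval windows with a distance such as edit distance over the interval sequences.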

Musicology

Projects in MIR (Music Information Retrieval):

1. University of Waikato, New Zealand
2. University of Massachusetts, US
3. King's College and City University in London, UK
4. Université Pierre et Marie Curie, France
5. Università degli Studi di Milano, Italy
6. University of Helsinki, Finland
7. more…

New annual conference since 2000: MIR

Meteorology

Immediate weather prediction = Atmospheric Models, e.g. Eta, RUCS, AVN/MRF, Ensembles, MM5, ARP5, MOS, Global Ocean Model, etc. Based upon the work of hydrodynamicist V. Bjerknes (1904).

Long-term weather prediction = Pattern Search: El Niño, La Niña, Climate Prediction Centers, Military, NASA, Air Quality Research, Large Scale Computers

Meteorology

Weather Pattern Recognition

Difficult!

Measure: temperature, wind speed, wind direction, …

Parameters: height of measurement (3, 10 meters off ground), elevation, barometric pressure, cloudiness, stability measurement, …

σθ

Atmospheric image recognition

Pei and Lin (1995) operations: scaling, rotation, translation, and skew (due to the curvature of the earth)

Futuristic Paradigms in Pattern Matching?

I don’t know. But let me give you some general thoughts…

Pattern Discovery

Pattern Discovery - Examples

Mining

1. Data(base) Mining

2. Text mining

3. Web mining

Motif Discovery

Motif Discovery

Motif = Pattern of the form x1-x2-…-xn, where “-” is a bounded gap

Biological Databases: Find all frequent motifs

Current research: Suffix trees for text with gaps
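
To make the motif definition concrete, the sketch below searches a text for one given motif x1-x2-…-xn by compiling it into a regular expression in which each “-” becomes a gap of bounded length. It only locates a known motif; it does not discover frequent motifs, and it is not the suffix-tree approach mentioned above. The function names, the gap bounds, and the example sequence are assumptions for illustration.

```python
import re


def motif_regex(parts, max_gap, min_gap=0):
    """Build a regex for parts[0]-parts[1]-...; each '-' is a gap of
    min_gap..max_gap arbitrary characters."""
    gap = ".{%d,%d}" % (min_gap, max_gap)
    return gap.join(re.escape(part) for part in parts)


def find_motif(parts, max_gap, text, min_gap=0):
    """Return every start position at which the motif occurs.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = "(?=%s)" % motif_regex(parts, max_gap, min_gap)
    return [match.start() for match in re.finditer(pattern, text)]


# Hypothetical DNA-like text: AC followed by GT within a gap of at most 2.
text = "ACGTTACAAGTACTTTGT"
print(find_motif(["AC", "GT"], 2, text))
```

In this example the occurrence of AC at index 11 is rejected because the next GT is three characters away, which exceeds the gap bound.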

Approximation Algorithms

Examples

1. Shortest Common Superstring

2. Edit Distance with Block operations

3. Phylogenetic trees

4. Evolutionary trees

5. …
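
As an illustration of the first example, here is a sketch of the well-known greedy heuristic for Shortest Common Superstring: repeatedly merge the two strings with the largest overlap. The slides do not name a particular algorithm, so treating greedy merging as the representative approximation is an assumption, as are the function names. The result always contains every input string, but it is not guaranteed to be the shortest possible.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: merge the pair with maximum overlap until one string remains."""
    unique = sorted(set(strings))
    # Strings contained in another string add nothing to the superstring.
    strs = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(strs) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strs):
            for j, b in enumerate(strs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = strs[best_i] + strs[best_j][best_k:]
        strs = [s for idx, s in enumerate(strs) if idx not in (best_i, best_j)]
        strs.append(merged)
    return strs[0] if strs else ""


pieces = ["abc", "bcd", "cde"]
result = greedy_superstring(pieces)
print(result)
```

Sorting the input first only makes tie-breaking deterministic; any tie-breaking rule yields a valid superstring.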

Summary

• Look beyond your focused research

• Don’t try this at home.