Multi-seed lossless filtration

Gregory Kucherov, Laurent Noé (LORIA/INRIA, Nancy, France)
Mikhail Roytberg (Institute of Mathematical Problems in Biology, Puschino, Russia)

CPM (Istanbul), July 5-7, 2004


Text filtration: general principle

[Figure: a filter selects potential matches in the text; verification then keeps the true matches]

lossless and lossy filters

Filtration applied to local similarity search

[Figure: the filter selects potential similarities; verification then keeps the true similarities]

Gapless similarities. Hamming distance.

Similarities are defined through Hamming distance

[Figure: the pattern GCTACGACTTCGAGCTGC aligned with the text ...CTCAGCTATGACCTCGAGCGGCCTATCTA...; m is the length of the similarity, k the number of mismatches]

(m,k)-problem, (m,k)-instances

This work: lossless filtering
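As a minimal sketch of the definition above, the following Python snippet scans a text for windows within Hamming distance k of a pattern. The function names `hamming` and `find_instances` are illustrative and not taken from the slides; the two sequences are the ones shown in the figure.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def find_instances(pattern, text, k):
    """Return (offset, distance) for every window of the text that is
    within Hamming distance k of the pattern (naive scan, no filter)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        d = hamming(pattern, text[start:start + m])
        if d <= k:
            hits.append((start, d))
    return hits


pattern = "GCTACGACTTCGAGCTGC"             # m = 18
text = "CTCAGCTATGACCTCGAGCGGCCTATCTA"
print(find_instances(pattern, text, k=3))
```

This naive scan is what a filter is meant to speed up: the filter discards most windows cheaply, and only the surviving candidates are verified this way.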

Lossless filtering by contiguous fragment

PEX (Navarro&Raffinot 2002)
– Searching for a contiguous pattern: every (m,k)-instance contains a conserved fragment of length ⌊m/(k+1)⌋
– Example: m=18, k=3 gives the seed ####

PEX with errors
– Searching for a contiguous pattern with l possible errors
– Example: m=18, k=3 gives #########(1), a fragment of length 9 with 1 error
• Efficient only for small alphabets and small l
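The pigeonhole argument behind the contiguous fragment can be checked by brute force on small parameters. The sketch below (helper names are illustrative) enumerates all placements of k mismatches in a window of length m and reports the shortest "longest conserved run"; it is an exhaustive check, not the PEX algorithm itself.

```python
from itertools import combinations


def longest_run(m, mismatches):
    """Length of the longest run of matches in a window of length m."""
    best = run = 0
    for pos in range(m):
        run = 0 if pos in mismatches else run + 1
        best = max(best, run)
    return best


def min_longest_run(m, k):
    """Worst case over all placements of exactly k mismatches."""
    return min(longest_run(m, set(c)) for c in combinations(range(m), k))


m, k = 18, 3
print(min_longest_run(m, k), m // (k + 1))
```

For the (18,3)-problem both numbers agree, which is why the contiguous seed #### is lossless there while ##### is not.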

Superposition of two filters

Pevzner&Waterman 1995

Idea: combine PEX with another filter based on a regularly-spaced seed

PEX: ####

spaced PEX (matches occurring at every k+1 positions): #---#---#---#

[Figure: the seed #---#---#---# shifted along the instance]

Spaced seeds

Spaced seeds (spaced q-grams)
– proposed by Burkhardt & Kärkkäinen (CPM 2001) for solving (m,k)-problems

Principle
– Searching for spaced rather than contiguous patterns
– Selectivity
• defined by the weight of the seed (number of #’s)

###-##

Example: (18,3)-problem

###-##

[Figure: the seed ###-## shifted along an (18,3)-instance]
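Whether a seed solves a small (m,k)-problem can be verified exhaustively. The sketch below is a brute-force check (not the method of the paper): it tries every placement of exactly k mismatches, which suffices because removing a mismatch can only create more seed occurrences. The helper names are illustrative.

```python
from itertools import combinations


def occurs(seed, mismatches, m):
    """True if the seed fits somewhere in the window with every '#'
    position falling on a match ('-' positions are jokers)."""
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(start + o not in mismatches for o in offsets)
        for start in range(m - len(seed) + 1)
    )


def is_lossless(seeds, m, k):
    """True if every (m,k)-instance contains an occurrence of some seed."""
    for placement in combinations(range(m), k):
        bad = set(placement)
        if not any(occurs(seed, bad, m) for seed in seeds):
            return False
    return True


print(is_lossless(["###-##"], 18, 3))   # spaced seed of weight 5
print(is_lossless(["#####"], 18, 3))    # contiguous seed of weight 5
```

The comparison illustrates the point of the slide: at equal weight 5, the spaced seed detects every (18,3)-instance while the contiguous one misses some (for example mismatches at positions 4, 9 and 14).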

Spaced seeds for sequence comparison

Ma, Tromp, Li 2002 (PatternHunter)

Estimating seed sensitivity: Keich et al 2002, Buhler et al 2003, Brejova et al 2003, Choi&Zhang 2004, Choi et al 2004, Kucherov et al 2004, ...

Extended seed models: BLASTZ 2003, Brejova et al 2003, Chen&Sung 2003, Noé&Kucherov 2004, ...

Families of spaced seeds

This work: lossless filtration using spaced seed families (extension of Burkhardt&Kärkkäinen 2001)
– single filter based on several distinct seeds
– each seed detects a part of (m,k)-instances, but together they must detect all (m,k)-instances

Independent work (lossy seed families for sequence alignment):
– Li, Ma, Kisman, Tromp 2004 (PatternHunter II)
– Xu, Brown, Li, Ma, this conference
– Sun, Buhler, RECOMB 2004 (Mandala)

Example: (18,3)-problem (cont)

Family F solves the (18,3)-problem:

F = { ##-#-#### , ###---#--##-# }

– every (18,3)-instance contains an occurrence of a seed of F
– all seeds of the family have the same weight 7
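The claim can be re-checked by exhaustive enumeration, since the (18,3)-problem has only 816 placements of 3 mismatches. This is a brute-force sketch with illustrative helper names; the two seeds are the split of the fused string on the slide into two seeds of weight 7.

```python
from itertools import combinations


def weight(seed):
    """Number of '#' (match) positions of a seed."""
    return seed.count("#")


def occurs(seed, mismatches, m):
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(start + o not in mismatches for o in offsets)
        for start in range(m - len(seed) + 1)
    )


def is_lossless(seeds, m, k):
    return all(
        any(occurs(seed, set(placement), m) for seed in seeds)
        for placement in combinations(range(m), k)
    )


F = ["##-#-####", "###---#--##-#"]
print([weight(s) for s in F])
print("family lossless:", is_lossless(F, 18, 3))
for s in F:
    # A single seed of the family is not expected to be lossless on its own.
    print(s, "alone:", is_lossless([s], 18, 3))
```

Running the per-seed check next to the family check shows the division of labour described above: each seed covers part of the instances, and only the family as a whole has to cover all of them.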

Example: (18,3)-problem (cont)

w=7: ##-#-#### , ###---#--##-#

w=9: ##-##-##### , ###-####--## , ###-##---#-### , ##----####-### , ###---#-#-##-## , ###-#-#-#-----###

Comparative selectivity

w=4  ~39·10^-4    ####
w=5  ~9.8·10^-4   ###-##
w=7  ~1.2·10^-4   ##-#-#### , ###---#--##-#
w=9  ~0.23·10^-4  ##-##-##### , ###-####--## , ###-##---#-### , ##----####-### , ###---#-#-##-## , ###-#-#-#-----###

Selectivity of families on Bernoulli similarities (p(match) = 1/4) estimated as the probability for one of the seeds to occur at a given position
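The figures above can be reproduced with a one-line estimate: each seed of weight w occurs at a given position with probability p^w, and the family estimate sums these over the seeds (a union bound, which is an assumption about how the slide's numbers were obtained; it matches them to the printed precision). The function name `selectivity` is illustrative.

```python
def selectivity(seeds, p_match=0.25):
    """Estimated probability that one of the seeds occurs at a given
    position of a random (Bernoulli) similarity: sum of p^weight."""
    return sum(p_match ** seed.count("#") for seed in seeds)


families = {
    "w=4": ["####"],
    "w=5": ["###-##"],
    "w=7": ["##-#-####", "###---#--##-#"],
    "w=9": ["##-##-#####", "###-####--##", "###-##---#-###",
            "##----####-###", "###---#-#-##-##", "###-#-#-#-----###"],
}
for name, seeds in families.items():
    print(name, f"{selectivity(seeds) * 1e4:.2f} x 10^-4")
```

Going from one seed of weight 4 to six seeds of weight 9 lowers the estimated false-positive rate by more than two orders of magnitude while keeping the filter lossless.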

How far should we go

A trivial extreme solution
– … would be to pick all seeds of weight m – k (~ C(m,k) of them)
– selectivity 100% (no false positives)
– prohibitive cost except for very small problems

We are interested in intermediate solutions:
– relatively small number of seeds (< 10) to keep the hash table of a reasonable size,
– the seed weight sufficiently large to obtain a good selectivity

Results

Computing properties of seed families

Seed design
– Seed expansion/contraction
– Periodic seeds
– Seed optimality
– Heuristic seed design

Experiments
– Examples of designed seed families
– Application to computing specific oligonucleotides

Measuring the efficiency of a family

Burkhardt&Kärkkäinen: optimal threshold of a seed: minimal number of seed occurrences over all (m,k)-instances

A seed family F is lossless iff the optimal threshold T_F(m,k) ≥ 1

T_F(m,k) can be computed by a dynamic programming algorithm in time O(m·k·2^(S+1)) and space O(k·2^(S+1)), where S is the maximal length of a seed from F

– optimizations are possible (see the paper)
– the resulting space and time complexity is the same as in the Burkhardt&Kärkkäinen algorithm
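A minimal sketch of such a dynamic program is given below. It is written in the spirit of the description above and is not the optimized algorithm of the paper: the state is the pair (last S symbols of the instance, number of mismatches used so far) and the value is the minimal number of seed occurrences seen so far. The function name `optimal_threshold` and the state layout are assumptions.

```python
def optimal_threshold(seeds, m, k):
    """Minimal total number of seed occurrences over all (m,k)-instances."""
    S = max(len(seed) for seed in seeds)
    shapes = [(len(s), [i for i, c in enumerate(s) if c == "#"]) for s in seeds]

    def hits(suffix):
        # Count the seeds that occur ending exactly at the last symbol.
        count = 0
        for span, offsets in shapes:
            if len(suffix) >= span:
                tail = suffix[len(suffix) - span:]
                if all(tail[o] == 1 for o in offsets):
                    count += 1
        return count

    states = {((), 0): 0}          # (suffix, mismatches used) -> min hits
    for _ in range(m):
        nxt = {}
        for (suffix, used), count in states.items():
            for bit in (1, 0):     # 1 = match, 0 = mismatch
                new_used = used + (bit == 0)
                if new_used > k:
                    continue
                new_suffix = (suffix + (bit,))[-S:]
                total = count + hits(new_suffix)
                key = (new_suffix, new_used)
                if total < nxt.get(key, float("inf")):
                    nxt[key] = total
        states = nxt
    return min(states.values())


print(optimal_threshold(["####"], 18, 3))
print(optimal_threshold(["#####"], 18, 3))
```

The number of states is at most 2^S·(k+1) per position, which is where the exponential dependence on the seed length comes from; a family is lossless exactly when the returned value is at least 1.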

Measuring the efficiency of a family (cont)

Using a similar DP technique we can compute, within the same time complexity bound:

– the number U_F(m,k) of undetected (m,k)-similarities for a (lossy) family F
– the contribution of a seed of F, i.e. the number of (m,k)-similarities detected exclusively by this seed

[see the paper for details]
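For small problems both quantities can be obtained by plain enumeration, which is handy for cross-checking a DP implementation. The sketch below is brute force rather than the DP of the paper, and it assumes that similarities are counted over placements of exactly k mismatches; the function names are illustrative.

```python
from itertools import combinations


def occurs(seed, mismatches, m):
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(start + o not in mismatches for o in offsets)
        for start in range(m - len(seed) + 1)
    )


def undetected(seeds, m, k):
    """Number of mismatch placements on which no seed of the family occurs."""
    return sum(
        1
        for placement in combinations(range(m), k)
        if not any(occurs(s, set(placement), m) for s in seeds)
    )


def contributions(seeds, m, k):
    """For each seed, the number of placements detected by this seed only."""
    exclusive = [0] * len(seeds)
    for placement in combinations(range(m), k):
        bad = set(placement)
        found = [i for i, s in enumerate(seeds) if occurs(s, bad, m)]
        if len(found) == 1:
            exclusive[found[0]] += 1
    return exclusive


print(undetected(["#####"], 18, 3))
print(contributions(["####", "#####"], 18, 3))
```

A seed with a contribution of 0 is redundant in its family, which is the signal the heuristic design uses to decide which seeds to replace.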

Design of seed families

Pruning exhaustive search tree (Burkhardt&Kärkkäinen)

– Construct all solutions of weight w from solutions of weight w – 1
– Example: if ##--#--# and ##-#---# are solutions of weight w-1, consider their «union» ##-##--# of weight w.
– Prohibitive cost:
• more than a week for computing all single-seed solutions of the (50,5)-problem
• the search space blows up for multi-seed families
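The «union» step of the example can be written in a couple of lines. This sketch only shows the candidate-generation step for two seeds of equal span; the pruning and the verification of each candidate are not shown, and the function name `union` is illustrative.

```python
def union(seed_a, seed_b):
    """Position-wise union of two seeds of the same span:
    a position is '#' if it is '#' in either seed."""
    if len(seed_a) != len(seed_b):
        raise ValueError("seeds must have the same span")
    return "".join(
        "#" if "#" in (a, b) else "-" for a, b in zip(seed_a, seed_b)
    )


candidate = union("##--#--#", "##-#---#")
print(candidate, candidate.count("#"))
```

Each candidate produced this way still has to be checked against the (m,k)-problem, which is what makes the exhaustive search expensive.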

Seed expansion/contraction

Burkhardt&Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem:

###-#--###-#--###-#   (the only solution of weight 12 of the (25,2)-problem)
#-#-#---#-----#-#-#---#-----#-#-#---#

– Let F⊗i be the i-regular expansion of F obtained by inserting i-1 jokers between successive positions of each seed of F
– Example: If F = { ###-# , ##-## } then
F⊗2 = { #-#-#---# , #-#---#-# }
F⊗3 = { #--#--#-----# , #--#-----#--# }
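The i-regular expansion is a purely syntactic operation and can be sketched directly from its definition. The function names `expand` and `contract` are illustrative; `contract` assumes its input really is an i-regular expansion.

```python
def expand(seed, i):
    """i-regular expansion: insert i-1 jokers between successive positions."""
    return ("-" * (i - 1)).join(seed)


def contract(seed, i):
    """Inverse of expand: keep every i-th position of an expanded seed."""
    if any(c == "#" for pos, c in enumerate(seed) if pos % i != 0):
        raise ValueError("seed is not an i-regular expansion")
    return seed[::i]


F = ["###-#", "##-##"]
print([expand(s, 2) for s in F])
print([expand(s, 3) for s in F])
print(expand("###-#--###-#--###-#", 2))
```

The last line reproduces the second weight-12 seed of the (50,5)-problem from the first one, which is the observation that motivates the expansion lemma.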

Seed expansion/contraction (cont)

Lemma:
– If a family F solves an (m,k)-problem, then both F and F⊗i solve the (i·m, (k+1)·i - 1)-problem
– If a family F⊗i solves the (i·m,k)-problem, then its i-contraction F solves the (m, ⌊k/i⌋)-problem

Example (i = 2):
F = { ##-#-#### , ###---#--##-# } solves the (18,3)-problem
F and F⊗2 = { #-#---#---#-#-#-# , #-#-#-------#-----#-#---# } solve the (36,7)-problem
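The first part of the lemma can be sanity-checked by brute force on a deliberately tiny case, since the (36,7)-problem is too large for naive enumeration in Python. The seed ##-# and the parameters (8,1) with i = 2 are chosen here purely for illustration and do not come from the slides.

```python
from itertools import combinations


def occurs(seed, mismatches, m):
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(start + o not in mismatches for o in offsets)
        for start in range(m - len(seed) + 1)
    )


def is_lossless(seeds, m, k):
    return all(
        any(occurs(s, set(placement), m) for s in seeds)
        for placement in combinations(range(m), k)
    )


def expand(seed, i):
    return ("-" * (i - 1)).join(seed)


F, m, k, i = ["##-#"], 8, 1, 2
big_m, big_k = i * m, (k + 1) * i - 1          # (16, 3)
print(is_lossless(F, m, k))                                  # premise
print(is_lossless(F, big_m, big_k))                          # F itself
print(is_lossless([expand(s, i) for s in F], big_m, big_k))  # expansion
```

The intuition is a pigeonhole argument: with (k+1)·i - 1 mismatches spread over i blocks of length m (or over the i residue classes of positions, for the expansion), at least one block has at most k mismatches.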

Periodic seeds

Iterating short seeds with good properties into longer seeds

###-#--  →  ###-#--###-#--###-#

Cyclic problem

Lemma: If a seed Q solves a cyclic (m,k)-problem, then the seed Q_i = [Q · -^(m-s(Q))]^i · Q solves the linear (m·(i+1)+s(Q)-1, k)-problem, where s(Q) is the span of Q.

Example (i = 1):
Cyclic (11,3)-problem: ###-#--#---
Linear (29,3)-problem: ###-#--#---###-#--#
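The example can be replayed by brute force: first check the seed on the cyclic (11,3)-problem, then build the iterated seed and check it on the linear (29,3)-problem. The helper names are illustrative, and the construction follows the reading Q_i = [Q padded with jokers to length m] repeated i times, followed by Q, which is what the example string shows.

```python
from itertools import combinations


def solves_cyclic(seed, m, k):
    """Seed positions wrap around a cyclic word of length m."""
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    for placement in combinations(range(m), k):
        bad = set(placement)
        if not any(
            all((start + o) % m not in bad for o in offsets)
            for start in range(m)
        ):
            return False
    return True


def solves_linear(seed, m, k):
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    for placement in combinations(range(m), k):
        bad = set(placement)
        if not any(
            all(start + o not in bad for o in offsets)
            for start in range(m - len(seed) + 1)
        ):
            return False
    return True


def periodic_seed(q, m, i):
    """[Q followed by m - s(Q) jokers] repeated i times, then Q."""
    return (q + "-" * (m - len(q))) * i + q


Q, m, k = "###-#--#", 11, 3
Q1 = periodic_seed(Q, m, 1)
print(Q1, solves_cyclic(Q, m, k), solves_linear(Q1, m * 2 + len(Q) - 1, k))
```

Checking the cyclic problem is much cheaper than checking the linear one it implies, which is the practical appeal of designing seeds on the cyclic problem first.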

Extension to multi-seed case

Cyclic (11,3)-problem: ###-#--#---

Linear (25,3)-problem: { ###-#--#---###-#--# , #--#---###-#--#---### }

Asymptotic optimality

Theorem: Fix a number of errors k. Let w(m) be the maximal weight of a seed solving the linear (m,k)-problem. Then [equation not recovered from the slide]

– the fraction of the number of jokers tends to 0 but the convergence speed depends on k
– seed expansion cannot provide an asymptotically optimal solution

Non-asymptotic optimality

Fix a number of errors k.
– For each seed (seed family) Q there exists m_Q s.t. for all m ≥ m_Q, Q solves the (m,k)-problem
– For a class of seeds C, Q is an optimal seed in C iff Q realizes the minimal m_Q over all seeds of C

Lemma: Let n be an integer and r = n/3. For every k ≥ 2, seed #^(n-r)-#^r is optimal among seeds of weight n with one joker.

Example: ####-## is optimal among the seeds of weight 6 with 1 joker: it solves all (m,2)-problems for m≥16, all (m,3)-problems for m≥20
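The threshold m_Q can be found by brute force for short seeds: try increasing window lengths until the seed becomes lossless (once a seed solves the (m,k)-problem it also solves the (m+1,k)-problem). The function names and the search limit `m_max` are assumptions of this sketch.

```python
from itertools import combinations


def is_lossless(seed, m, k):
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    for placement in combinations(range(m), k):
        bad = set(placement)
        if not any(
            all(start + o not in bad for o in offsets)
            for start in range(m - len(seed) + 1)
        ):
            return False
    return True


def min_window(seed, k, m_max=40):
    """Smallest m (up to m_max) such that the seed solves the (m,k)-problem."""
    for m in range(len(seed), m_max + 1):
        if is_lossless(seed, m, k):
            return m
    return None


# All seeds of weight 6 with exactly one joker, compared for k = 2.
for seed in ["#-#####", "##-####", "###-###", "####-##", "#####-#"]:
    print(seed, min_window(seed, 2))
```

Comparing the printed thresholds across the class shows which joker position gives the smallest m_Q, which is the notion of optimality used in the lemma.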

Heuristic seed design: genetic algorithm

– a population of seed families is evolving by mutating and crossing over
– seed families are screened against sets of difficult (m,k)-instances
– for a family that detects all difficult instances, the number of undetected similarities is computed by a DP algorithm. A family is kept if it yields a smaller number than currently known families do
– compute the contribution of each seed of the family. Mutate the least “valuable” seeds.

[Figure: seed families are screened against difficult (m,k)-instances; the two steps are labeled "select and reorder" and "select"]

Example: (25,2)-problem

Application of lossless filtering: oligo design

Specific oligonucleotides: small DNA molecules (10-50bp) that hybridize with a target sequence and do not hybridize with background sequences (e.g. the rest of the genome)

Formalization: given a sequence (or database), find all windows of length m which do not occur elsewhere within k substitution errors
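The formalization can be written down as a naive quadratic scan, which is useful as a reference on small inputs. This sketch is the unfiltered baseline, not the seed-based filter; it assumes that "elsewhere" means any other start position (overlapping windows included), and the function names are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def specific_windows(sequence, m, k):
    """Start positions of the windows of length m that have no other
    window of the sequence within k substitution errors."""
    windows = [sequence[i:i + m] for i in range(len(sequence) - m + 1)]
    result = []
    for i, w in enumerate(windows):
        if all(j == i or hamming(w, v) > k for j, v in enumerate(windows)):
            result.append(i)
    return result


seq = "ACGTACGA"
print(specific_windows(seq, m=4, k=0))
print(specific_windows(seq, m=4, k=1))
```

A lossless filter replaces the inner all-against-all comparison: windows sharing no seed occurrence with any other window can be declared specific without computing a single Hamming distance.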

Seed design: (32,5)-problem

Experiment

This filter has been applied to the rice EST database (100015 sequences of total size ~42 Mbp)

All 32-windows occurring elsewhere within 5 errors have been computed

The computation took slightly more than 1 hour on a P4 3GHz computer

87% of the database has been “filtered out”

Further questions

Combinatorial structure of optimal seed families

Efficient design algorithm

Questions

Conclusions

A filtration method for approximate pattern matching
– Based on the design and use of a family of spaced seeds.
– Selective in practice, but the seed design requires a computational effort.

Possible extensions
– Consider spaced seeds that allow one error.

Open problems
– An efficient algorithm for designing the optimal seed family?

References

[1] S. Burkhardt and J. Kärkkäinen, Better Filtering with Gapped q-Grams, Fundamenta Informaticae 23:1001-1018, 2003

[2] P. Pevzner and M. Waterman, Multiple Filtration and Approximate Pattern Matching, Algorithmica 13(1/2):135-154, 1995

[3] J. SantaLucia, A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics, Proc. Natl. Acad. Sci. USA 95:1460-1465, 1998

[4] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical on-line search algorithms for texts, Cambridge University Press, 2002

[5] …

Problem statement

Biological problem
Oligonucleotide: a DNA fragment of fixed size that hybridizes only with a given region of a target sequence.

Search for the oligonucleotides specific to a sequence.

Oligo design
• DNA chips (microarrays)

Primer design
• PCR

Problem statement (cont)

Specificity
Given:
– A target sequence S
– A background sequence B

Find a motif of size m that hybridizes with a region of S and with no region of B

Problem statement (cont)

How to define a specific oligonucleotide?
– It is a DNA fragment M of fixed size m.
– It must be specific:
• hybridize with a region of a target sequence S (exact match)
• be far from every fragment of a background sequence B.

Example

On the (m=18,k=3) problem

###.##

[Figure: the seed ###.## shifted along an (18,3)-instance]

Combination of filters

Many algorithms propose a double-filtering solution and report the overall selectivity of the two filters taken together.

[Figure: Filter 1 followed by Filter 2; labels Q and T]

Combination of filters (cont)

– Combining filters always improves the theoretical selectivity
– In practice, the efficiency depends on the selectivity of the first filter used.

[Figure: Filter 1 followed by Filter 2]

Example

Family of spaced seeds

##.#.#### , ###...#..##.#

##.##.##### , ###.####..## , ###.##...#.### , ##....####.### , ###...#.#.##.## , ###.#.#.#.....###

Measuring the efficiency of a family

Problems:
– Measure the number of instances not detected by a family.
– Measure the contribution of a seed to the number of instances solved.

Dynamic programming algorithm
– Idea: reduce the instances of (m,k)-problems to subproblems (m' < m, k' < k) by introducing a known word w.
– Do not explore sub-instances that are trivial or whose result can be predicted by a precomputation.

[Figure: a known word w followed by an (m',k') subproblem]

Measuring the efficiency of families

General scheme

G(w, m', k')  →  G(w.1, m'+1, k') , G(w.0, m'+1, k'-1)

The words w can be of bounded size
– Span of the largest seed of the family
– Keep only the suffix w[ |w| - spmax + 1 .. |w| ]

Precomputation
– For each word w, consider its longest suffix that can give rise to a match.

[Figure: a known word w followed by an (m',k') subproblem]

Asymptotic results on the cyclic (m,k)-problem

We consider the weight w(m) of the optimal seed of a cyclic (m,k)-problem (k fixed)

– Good news: the ratio between the number of jokers of the seed and its total length tends to 0.
– Bad news: the larger k is, the slower the convergence

Asymptotic results on the linear (m,k)-problem

We consider the weight w(m) of the optimal seed of a linear (m,k)-problem (k fixed)

– Good news: the ratio between the number of jokers of the seed and its total length tends to 0.
– Bad news: the larger k is, the slower the convergence

Heuristic seed design: genetic algorithm

Optimization

Genetic algorithm (stochastic optimization)
– Selection of the seed families solving the largest number of (m,k)-instances
• evolution (by a number of techniques) of the seeds making up the family
• measure of the number of unsolved (m,k)-instances
– Genetic algorithm: convergence to an optimal solution is not guaranteed (and unlikely on large instances)

Proposed design method

Algorithm designing a seed family
– Input:
• An (m,k)-problem
• A family size s and the weight w of the desired seeds
– Output:
• possibly a family of s seeds of weight w solving the (m,k)-problem
• otherwise the best family found so far and the number of (m,k)-instances not detected by this family

The method developed is a heuristic.

Objective

Search the target sequence S for all specific motifs.
– specific motifs: those whose variants obtained by substituting k letters are not found in the text B.
– use approximate pattern-matching techniques (text filtration).

Application to oligo design

The proposed method can serve as a filter for the search for specific oligos
– It cannot fully replace that search
• Computation of the free energy (non-hybridization energy) on the target sequence and on the background sequence (SantaLucia model [3])
• In particular, check that the oligo cannot fold back on itself.
– It is only a heuristic for finding the optimal oligo (in terms of binding energy)
• Indel errors must sometimes be considered.
• the optimal oligo (in terms of selectivity) is not necessarily the exact complement.

Example: (25,3)-problem