21
Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif Cihan Tas ctas@cs CMSC 838 Presentation

Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

  • Upload
    brant

  • View
    45

  • Download
    0

Embed Size (px)

DESCRIPTION

Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays. Nazif Cihan Tas ctas@cs CMSC 838 Presentation. Motivation. DNA microarrays techniques are used intensely for identification of biological agents Gene Expression Studies Diagnostic Purposes - PowerPoint PPT Presentation

Citation preview

Page 1: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for

DNA Microarrays

Nazif Cihan Tas

ctas@csCMSC 838 Presentation

       

   

Page 2: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Motivation

DNA microarrays techniques are used intensely for identification of biological agents Gene Expression Studies Diagnostic Purposes

Identification of Microorganisms in samples Item Extraction

Complex Problem Find the necessary probes and the temperature Probe sets should be reliably detect and differentiate target

sequences Large Databases NEW!! Homologous Genes (how to find specific probes)

Page 3: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Talk Overview

Overview of talk Motivation Problem Statement Algorithm Mathematical Aspects Experimentation Discussion

Page 4: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Problem Statement

Positive Probes Database set S0

Target S1

For each sequence in S1, find at least one probe

For S0 - S1 try to avoid it (but do not care if happens)

High Specificity: # of non-target matches are minimized High Sensitivity: # of covered target seq. is maximized

S0S1

Page 5: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Problem Statement

Negative Probes Determine as few as possible probes which together hybridizes

with all sequences in S0 - S1 but with NONE in S1.

High Specificity: No seq. in S1 may hybridize

High Sensitivity: Max # of seq. in S0 - S1 be covered

S1S0

Page 6: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Probe Design Constraints

Sequence Related Length of probes Deviation of melting temperature of probe-target hybrids must

be low (for physical reasons) No self complementary regions longer than four nucleotides

(not descriptive enough) Melting temperatures of target and non-target seq. must be

larger than a predefined (too close, too hard to identify) Ensuring a minimum number of mismatches is enough

(homologous sequences)

System Related Execution Time Usability

Page 7: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Algorithm

Overview Probe Generation Hybridization Prediction Probe Selection

Page 8: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

AlgorithmProbe Generation

Subproblem: Generate probe candidates for the sequences Keep the set as small as possible without losing any optimal

candidate (exclude infeasible ones)

Suffix Tree Why?

Allows fast recognition of repetitive subsequences Identifies non-unique probes (i.e. with more than one target) Efficient for memory and for T computation (reduce time)

How? Tree is constructed from the sequences Traversed (Watson-Crick complement)

Page 9: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Suffix Tree

Input: TACTACA TACTACA ACTACA CTACA TACA ACA CA A

$ denotes end of string

Constructed in linear time

Page 10: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Probe Generation

Further Improvements Filters applied for cut off

Probe length (predefined) G-C content (for temperature) Self-complementarity

Probes should not contain complements as subsequences

Finally, remove highly conserved (non-specific) regions Insert into hashtables according to their lengths

Page 11: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Algorithm

Page 12: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

AlgorithmHybridization Prediction

Subproblem: Search for the right probe Search is expensive, Intelligent Hashing used

Design A frame is moved over target and nontarget seqs. with several lengths

Previous algorithm (Kaderali 2002): Use the suffix tree At each step, hash values are calculated. If hit, predict melting

temperature, store in hybridization matrix. If there are too many hits for a probe, then it is not unique, remove it Why intelligent?

Hash time is linear Allows inexact matching because of hashing (No analysis)

Parallelization Several threads are searching for probe targets.

Tree and hashtables are fixed. One thread writes to the final matrix

Page 13: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Hybridization Prediction

Empirical Simulation: One million random probe-target pairings generated Four mismatches or one insertion or deletion plus one strong

central mismatch chosen T<20 C for 93%

Complexity is O ( |S0| |S1| ) Possible probe candidates is |S1| (linear) Each position of database S0 must be checked

Page 14: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Algorithm

Page 15: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

AlgorithmProbe Selection

Subproblem: Use the hybridization matrix to finalize the probe selection

We have positive probes and negative probes to proceed Algorithm Analysis:

For each probe candidate g: #of matches in S1

b: #of matches in S0 - S1

t: highest melting point in S1

Probes for which g or b values is too large, are removed Sort according to g,b and t. Apply Depth First Search

Advantages Performs well (No comparison though) Guarantees to choose all specific probes if any were found.

Disadvantages can NOT guarantee optimal selection in terms of coverage

Page 16: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Negative Probe Selection

Let S2 =S0 - S1 and B subset of S2 . The probes that

detect S1 also detects some of B elements.

Algorithm for Negative Probes Apply probe generating and preselection for B.

Conduct hybridization on B U S1 .

Remove the probes which hybridizes with S1 .

Sort the remaining probes according to their hit number. Successively select the probes which covers most target seq.

Guarantees optimal solution for coverage and probe number usage

Page 17: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

AlgorithmProbe Selection

Page 18: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Mathematical Aspects

Page 19: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Experimentation

Parallelized on SMP platform Classic worker-producer Intel Dual Pentium III 933 MHz, 1 GB memory

Test data ssu rRNA of ARB project 20.282 ssu rRNA sequences 1.401 < lengths < 4.179 %97 of them are shorter than 2.000 bases

Page 20: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Discussion

High Performance Execution is linear with size of database, decreases if longer

probes are used

Low Memory Consumption Depends on the size of the sequence selection, NOT database

size

Automatic Design of Group Probes and negative probes

High Quality Probe Design

Page 21: Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays

CMSC 838T – Presentation

Discussion

Comparison with previous work vs. ARB

Not suited for large scale probe design vs. LCF

Does not consider highly conserved data Memory consumption is high Works well with short probes only

vs. others Mostly can not deal with insertion and deletions Execution is slow