
University of Nebraska at Omaha

Innovative Database Models and Advanced Tools in Bioinformatics

Hesham H. Ali

UNO Bioinformatics Research Group

Department of Computer Science College of Information Science and Technology

Key Challenges Facing Bioinformatics Research

• Significant gaps between tool developers and tool users

– Different objectives

– Different funding agencies

– Different academic cultures

• Significant problems with available biological data

– Archival based

– Lack of structure

Source: ncbi.nih.gov

Problems with Current Biological Data

• Large volumes of biological data are already available in public data banks or from microarray experiments, and new data is being produced at an increasing rate

• The increasing pressure to maximize the use of the available data, particularly to impact key related industries (biotech companies, biotech drugs)

• The large degree of heterogeneity of the available data in terms of quality

UNO Bioinformatics Research Group

• Group Triangle

– Research motivated by real biological problems

– Innovative Database Models

– Advanced tools

Biological Questions Addressed by our Group

• Molecular diagnosis - Identification

– Sequence based id

– Enzyme (cutting order) based id

– Instrumentation (Mass Spec, WAVE) based id

• Basic Molecular Biology - Gene regulation

– Microarray Analysis

– Motif discovery/searching

• Epidemiology and Clinical Research

– Patient tracking system

– Clinical expert system

Bioinformatics Solutions to These Problems

• Develop new inventive database models

– Custom database for specific domains

– Centralized, structured, integrated data

• Develop innovative Bioinformatics tools

– Clustering algorithms

– Advanced motif finding approaches

Database Models

• Customized (Private) Solution:

Custom based Data Base Model

High degree of quality and consistency

• Centralized (Public) Solution:

New Curated Integrated and Structured DataBase Model

Model One: Custom Databases

• Allowing researchers to create custom sets of genetic data suited to their specific needs.

• Allowing researchers to control the quality of genetic data in their custom data sets through fine-tuning parameters.

• Searching data using optimal alignment algorithms, rather than using heuristic methods.

• Giving researchers/clinicians the ability to formulate sequence identification concepts and test their ideas against a validated database

• Incorporating information from GenBank if needed

The Sequence Identification Problem

• Identification of organisms using obtained sequences is a very important problem

• Relying on wet lab methods only is not enough

• Employing identification algorithms using signature motifs to complement the experimental approaches

• Currently, no robust software tool is available for aiding researchers and clinicians in the identification process

• Such a tool would have to utilize biological knowledge and databases to identify sequences

• The size and the quality of the data are both concerns and would need to be dealt with

Nebraska gets its very own organism

While trying to pinpoint the cause of a lung infection in local cancer patients, they discovered a previously unknown micro-organism. And they've named it "mycobacterium nebraskense," after the Cornhusker state.

It was discovered a few weeks ago using Mycoalign, a bioinformatics program developed at PKI

Source: Omaha World Herald, March 21, 2005

Model Two: Centralized Database - the Integrated Model

A new integrated model based on:

– Organized and curated database

– True non-redundancy by having one record for each polymorphic set, with a pointer to the rest of the set if needed

– Allowing advanced queries

– Being user-friendly and employing true automation

– Employing various algorithms with different levels of accuracy and speed for conducting homology searches.

The Clean Gene Package

• A set of integrated database and alignment tools:

– Edited and curated

– Web based

– Of manageable size

– Based on hierarchical database model

– Utilize various alignment algorithms

– Allows advanced automated queries

– Allows fast and accurate searches

The Key Challenges

• The new structured relational database model

• Identification of equivalence classes of records (polymorphic sets)

• Identification of a good representative for each set

• Curation and classification

• Accurate annotation

• Advanced data mining tools

• A user-friendly interface that employs true automation for interfacing with the database

Tool I: Clustering Biological Data

• Clustering is a fundamental technique in finding a structure in a collection of unlabeled data.

• Basically, clustering is the process of organizing objects into groups whose members are similar in some way.

• A good Clustering tool is a key component in analyzing microarray data

Message Passing Clustering (MPC)

• Inspired by real-world situations: elements with similar attributes cluster together simultaneously

• Advantages:

– Easy to understand and use.

– By taking advantage of communication among data objects, MPC is able to balance the global and local structure and can be performed in parallel.

– The “message” has a flexible structure, which allows further development to fit different research interests.

– We have extended the basic MPC to Weighted MPC, Stochastic MPC, and Semi-supervised MPC.

Basic MPC

[Figure: phylogenetic trees of the Mycobacterium strains (M. chelonae, M. flavescens, M. fortuitum, M. peregrinum, M. terrae, M. gordonae, M. intracellulare, M. kansasii, M. xenopi); panels: a. NJ, b. MPC]

• The phylogenetic trees of Mycobacterium (9 species, 34 strains), constructed by the Neighbor Joining (NJ) and MPC methods.

Weighted MPC (WMPC) with Adaptive Feature Scaling

• Add a weight associated with each cluster-feature pair. A single feature may have multiple weights in different clusters and, within one cluster, all features may have different weights.

• Update the weights during the clustering process. If the similarity between two clusters that are about to merge is high (low) on some dimension, then we increase (decrease) the weight on that dimension in the newly merged cluster.

• Testing WMPC on Colon Cancer data (2000 genes in 40 tumor and 22 normal samples) gave a higher classification rate.

• Two benefits:

– Strengthens the signal features while reducing the noise features, making clustering results more accurate.

– More importantly, reveals the contribution of the features (genes) to the clusters (samples), which helps identify the set of genes responsible for certain diseases.
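The weight update described above can be sketched as follows. This is only an illustration: the slides do not give the actual update formula, so the per-dimension similarity measure, the `threshold`, and the `rate` parameter below are all assumptions, as are the function names.

```python
def merge_weights(centroid_a, centroid_b, weights_a, weights_b,
                  threshold=0.5, rate=0.1):
    """Return per-feature weights for the cluster formed by merging a and b.

    Assumed rule (not taken from the slides): start from the mean of the two
    parent weights, then scale it up when the clusters agree on a dimension
    and down when they disagree.
    """
    merged = []
    for xa, xb, wa, wb in zip(centroid_a, centroid_b, weights_a, weights_b):
        # Hypothetical similarity in (0, 1]: equals 1 when the values agree.
        similarity = 1.0 / (1.0 + abs(xa - xb))
        base = (wa + wb) / 2.0
        if similarity >= threshold:
            merged.append(base * (1.0 + rate))   # strengthen a signal feature
        else:
            merged.append(base * (1.0 - rate))   # dampen a noisy feature
    return merged


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance, using one weight per feature."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)) ** 0.5


# Two clusters that agree on feature 0 and disagree on feature 1.
new_weights = merge_weights([1.0, 0.0], [1.0, 5.0], [1.0, 1.0], [1.0, 1.0])
```

With all weights initialized to 1 and `rate=0`, the weights never change and `weighted_distance` is the ordinary Euclidean distance, which matches the idea that the unweighted method is a special case.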

Stochastic MPC (SMPC) Based on Kernel Functions

[Figure: objects a to f placed by distance from the target object (at 0), annotated "Tie?", "Chance to merge?" and "Kick out?"; titled "Kernel Density Estimates Using Gaussian Kernels" and "Probability Density Estimates Based on Little Gaussian Kernel Functions"]
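A Gaussian kernel density estimate places one small Gaussian on each observation and averages them. The sketch below shows only that standard construction; how SMPC turns the resulting density into merge probabilities is not spelled out in the slides, so that step is left out. The function name and the sample distances are made up for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    if not samples:
        raise ValueError("at least one sample is required")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        # Sum of one Gaussian bump per sample, each centred on that sample.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm

    return density


# Hypothetical distances from a target object to six neighbours (a to f).
distances = [0.8, 0.9, 1.5, 2.4, 2.5, 2.6]
density = gaussian_kde(distances, bandwidth=0.3)
near, far = density(0.85), density(4.0)   # density is higher near the data
```

A smaller bandwidth makes the estimate follow individual objects more closely, while a larger one smooths over them.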

Semi-supervised MPC

• Clustering methods are considered unsupervised, meaning that the reduction is derived solely from the data rather than reflecting any previous knowledge.

• Classification methods are considered supervised because, in the training phase, the sample classes are already known and we classify the objects into known groups.

• Between clustering and classification: Unlabeled data with prior knowledge, such as constraints and hypotheses.

• The goal of semi-supervised clustering is to guide the clustering, using the prior knowledge, to get better partitions.

Semi-supervised MPC Instance-level Constraints

• Colon Cancer data (2000 genes in 40 tumor and 22 normal samples).

• We cluster samples with genes as features. Since the constraints are the known labels of the samples (instances), they are called instance-level constraints.

[Figure: accuracy (Rand index, 0 to 1) versus number of samples with an initial label (0 to 60); series: OP, IP]

We want to see how well our method can separate the normal and tumor tissues, given different numbers of known sample labels as prior knowledge.

OP: Output partition after clustering.

IP: Input constraints presented before clustering.

Combining the power of clustering with background information achieves better performance than either in isolation.
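The accuracy measure on the vertical axis, the Rand index, is the fraction of object pairs on which two partitions agree (both put the pair together, or both keep it apart). A minimal sketch, with toy labels that are not taken from the colon cancer data:

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs treated the same way by both partitions."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    if len(labels_true) < 2:
        raise ValueError("at least two objects are required")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        total += 1
    return agree / total


# Cluster names do not matter, only which objects are grouped together.
perfect = rand_index(["tumor", "tumor", "normal", "normal"], [1, 1, 0, 0])
```

Because only pair membership is compared, the index does not depend on how the clusters are numbered.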

Semi-supervised MPC Attribute-level Constraints

Cluster | t-statistic | Cluster Size | Gene Name (ORF)
1 | -5.73831 | 22 | T57079, T60860, R38758, R55828, H55759, X58401, H63361, R56207, U25435, R09138, H82741, H16096, H64576, L34059, U12140, H78346, H62177, M55422, H89481, U12134, H11460, R28608
2 | -5.11799 | 19 | R00254, H41147, R42765, T51139, D21205, H70635, H69819, H71122, H81802, R46731, M29550, H14607, R88749, J00146, H18451, M96859, X69115, X51435, H06189
3 | 13.9581 | 17 | L12723, M81637, D59253, X65873, M77698, T84049, M88108, R56401, H09719, X74330, D26018, D26067, R56630, X13482, H16991, T86749, H92195
4 | 9.81885 | 14 | R39531, R21901, J02645, R34876, H72234, T92259, H47107, T65740, Z15115, D43951, L19437, X82103, U29175, T65438
5 | -8.42722 | 42 | M16029*, M16029*, D38549, U03100, L24038, J02906, X93499, M73481, T71207, R73129, X56253, M81758, R27017, U18934, X17651, H47650, H89688, H23135, R39130, H82631, H45526*, H45526*, D90188, L06895, L34774, T58756, L11370, U34252, H78063, D21239, M96824, M24470, T77446, R53612, R38513*, R38513*, R44677, M98045, T88712, R54317, M77477, H11054
6 | 14.9275 | 33 | R42127, T90350, L25941, T65790, R22779, M90516, R11485, X68194, X63629, J05032, M34175, X74795, U34074, M58050, H20512, U18299, H80114, R67987, R56399, U10324, R56443, X17025, R40717*, R40717*, D13630, L24203, T96873, X53586, X73478, X14618, M55543, H42884, X54101
7 | 10.3472 | 12 | D13315, T70062, U30825, T89115, D26600, T60437, D50063, R20804, R09479, R42837, D14658, H87473
8 | -9.94085 | 14 | T61333, M28882, M69066, M14539, T67406, M37721, T79831, M69135, M36634, U25138, R48303, U31525, T51539, H21042
9 | 8.59304 | 11 | U22055, L41559, M88279, R37114, R96357, U29607, H82719, H04802, M22632, R27813, R60195
10 | 5.97321 | 29 | H48027, M14200, T52642, R85464, T52343, T72503, T49732, T95048, T47584, R21547, L28010, M21339, T87527, T61338, T79813, L36844, M34192, M69238, L26405, H06970, M84326, T70063, X66363, H85878, X62153, T67905, T57872, Z48950, R62425
11 | -7.47053 | 16 | T93284, H80975, R73052, X05610, X79683, X55187, U30827, T64974, D31887, M92843, R67358, R46753, X68277, H50623, H15813, X51345
12 | 12.6072 | 12 | T63591, T63370, T53412, T53396, T63133, R44884, R84411, X12671, R08183, M22382, D63874, D21261
13 | 5.07119 | 37 | H46728, T57686, T68848, M11354, T69026, M14630, T72879, T47144, T61627, X61971, M94345, R42570, H13194, T65580, H15542, J03077, D30655
14 | -5.35587 | 16 | T61661, T62220, T96832, H80240, H86060, T63508, H24754, T63499, M11799, L28809, T63484, M57710, M33680, U12255, H88360, R78934

Gene clusters illustrating differentially expressed genes in tumor and normal samples

[Figure panels: a. Cluster 6; b. Cluster 8]

Generalizations

• WMPC extends the unweighted MPC to the weighted MPC.

– If we initialize all entries in w to be 1 and never change the weights, WMPC (MPC-AFS) becomes a regular MPC.

• SMPC extends the deterministic MPC to the stochastic MPC.

– If we choose a particular kernel function (rectangular) and a particular bandwidth parameter (the minimum distance between the target cluster and all the others) to estimate the probability, SMPC reduces to a regular MPC.

• Semi-supervised MPC extends unsupervised MPC to partially supervised MPC.

– Unsupervised MPC can be considered a special case of semi-supervised MPC with null background information and constraints.

Tool II: Motif Finding/Data Mining Tool

• Given a set of known binding sites, develop a representation of these binding sites that can be used to search for additional instances of those binding sites in the genome.

• Given a set of sequences known to be co-regulated (e.g. as shown by an expression array), determine the binding locations in the sequences and determine a representation for binding specificity.

Motif Representations

• Static Sequences: tataat

• Regular Expressions (RegEx): tat[at].t

• Sequences with N errors: tataat:2

• RegEx with N errors: tat[at].t:2

• Mononucleotide Scoring Matrices:

Position    1     2     3     4     5     6
Consensus   a     t     [at]  .     t     t
a           75%   25%   50%   25%   25%   0%
t           25%   75%   50%   25%   75%   100%
g           0%    0%    0%    25%   0%    0%
c           0%    0%    0%    25%   0%    0%

• Dinucleotide Scoring Matrices (HMMs)

• Multinucleotide Scoring Matrices
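The first four representations can be tried out with a few lines of Python. This is a sketch under one assumption: "N errors" is read here as at most N substitutions (Hamming distance), since the slides do not say whether insertions and deletions also count. The function names and the test sequence are illustrative.

```python
import re


def find_regex(pattern, sequence):
    """Start positions of all (possibly overlapping) regex matches."""
    # A zero-width lookahead lets matches overlap each other.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence)]


def find_with_errors(motif, sequence, max_errors):
    """(start, mismatches) for windows within max_errors substitutions."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        mismatches = sum(a != b for a, b in zip(motif, window))
        if mismatches <= max_errors:
            hits.append((start, mismatches))
    return hits


sequence = "ggtataatcc"
exact_hits = find_with_errors("tataat", sequence, 0)   # static sequence
regex_hits = find_regex("tat[at].t", sequence)         # regular expression
loose_hits = find_with_errors("tataat", "tatcat", 1)   # one substitution allowed
```

Combining both ideas (a regular expression with N errors) needs an approximate regex matcher, which Python's standard `re` module does not provide.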

Searching for Known Motifs

1. Obtain a multiple sequence alignment of known motifs (e.g. from gel shift assay)

…atagtt…
…aattat…
…attatt…
…ttactt…

2. Construct representation

Consensus   a     t     [at]  .     t     t
a           75%   25%   50%   25%   25%   0%
t           25%   75%   50%   25%   75%   100%
g           0%    0%    0%    25%   0%    0%
c           0%    0%    0%    25%   0%    0%

3. Score all possible windows in the data set based on:

$$I_{seq}(i) = \sum_{b} f_{b,i} \log_2 \frac{f_{b,i}}{p_b}$$

where $f_{b,i}$ is the frequency of base $b$ at position $i$ and $p_b$ is the background frequency of base $b$.

4. Output results from the data set that score above a specified threshold
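The four steps above can be sketched as follows, using the four aligned sites from step 1. `column_information` implements the information content formula from step 3 directly. The window score is an assumption: the slides do not say how a single window is scored against the matrix, so a common log-odds sum with a small pseudocount is used here. The uniform background and the scanned sequence are also illustrative.

```python
import math

BASES = "atgc"


def frequency_matrix(aligned_sites):
    """Per-position base frequencies from equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for i in range(length):
        column = [site[i] for site in aligned_sites]
        matrix.append({b: column.count(b) / len(column) for b in BASES})
    return matrix


def column_information(freqs, background):
    """Sum over bases of f * log2(f / p); terms with f == 0 contribute 0."""
    return sum(f * math.log2(f / background[b])
               for b, f in freqs.items() if f > 0)


def score_window(window, matrix, background, pseudocount=0.01):
    """Assumed log-odds score; the pseudocount avoids taking log of zero."""
    total = 0.0
    for base, freqs in zip(window, matrix):
        smoothed = (freqs[base] + pseudocount) / (1.0 + 4 * pseudocount)
        total += math.log2(smoothed / background[base])
    return total


def scan(sequence, matrix, background, threshold):
    """(start, score) for every window scoring at least the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(sequence[start:start + width], matrix, background)
        if score >= threshold:
            hits.append((start, score))
    return hits


sites = ["atagtt", "aattat", "attatt", "ttactt"]
uniform = {b: 0.25 for b in BASES}          # assumed background frequencies
pfm = frequency_matrix(sites)
info = [column_information(col, uniform) for col in pfm]
hits = scan("ggggatattttgggg", pfm, uniform, threshold=5.0)
```

Position 6, where every site has t, carries the full 2 bits of information against a uniform background, while position 4, where all four bases are equally frequent, carries none.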

Finding Unknown Motifs

1. Input a set of co-expressed sequences that are related by micro-array experiment

2. Input: motif length n

3. Score all possible windows by first constructing a multiple sequence alignment of the window to all other possible matches in the other sequences

4. Rank the set of all possible scoring matrices of length n based on information content relative to background:

$$I_{seq}(i) = \sum_{b} f_{b,i} \log_2 \frac{f_{b,i}}{p_b}$$

5. Output an ordered list of motifs and corresponding scoring matrices.

AGAST: Advanced Grammar Alignment Search Tool

• Capitalize on the advantages of alignment.

• Provide a formal and robust method for computing bio-relationships

• Provide optimum results based on the input.

• Calculate relationships at the same time as alignment.

• Allow for user knowledge and subsequence relationships.

• Record attributes and sequence attributes can be considered simultaneously.

• Dynamically construct requisite algorithms in a user friendly way, thus limiting development time and technical knowledge requirements.

AGAST: Advanced Grammar Alignment Search Tool

Advantages:

• It can evaluate regular expressions important to biology as well as traditional RegEx tools.

• It can evaluate traditional alignments.

• It can do any combination of RegEx and traditional alignments.

BioRegEx

BioRegEx II

Example of an Advanced Query

Find a sequence that contains:

tatatagcagcccatgagccggcccgcadtgctagttcag

Transcription Start Site

5-10 bases

Start Codon

Any Number of Bases

Functional Unit

Example Query:

tatata.*{5,10}atg.*[gc]ca[at]gct[atgc]g:2.*

tatatagcaggggcccatgagccggcccccadagctcgttcag

tatatagcagcccatgagccggcccgcadtgagttcag

Score: 0

Score: 2

Conventional Problem

• Motif Searching programs do not calculate based on combinatorial regulation modules (instead they calculate based on probability of a single motif).

• We have developed and tested a program that considers an ordered set of motifs (or sequence attributes) and searches based on a context of adjacent elements (or grammar).

Next Steps

• Extract multiple motifs in the context of regulatory control networks.

• Use phylogenetic footprinting and gene regulatory network information to compare and contrast gene regulation networks and extrapolate combinatorial control mechanisms and corresponding motifs upstream of genes.

Other Current Research Projects

• Advanced tools for identifying splice sites

– Using ab initio Bayesian networks based approaches

– Using homology graph theory based approaches

• Fast Recognition of Microorganisms using enzyme cutting sequence, mass spectrometry or sequence based approaches

• Gene Prediction using Comparative Genomics

• Reconstructing Gene Regulatory Networks

• Clustering Techniques for Simplifying Protein Sequences

Acknowledgment

Kiran Bastola, Alexander Churbanov, Xutao Deng, Huimin Geng, Steven Hinrichs, Xiaolu Huang, Daniel Kuyper, Mark Pauley, Daniel Quest
