Computational Methods in Bioinformatics-Dr Elshafei


March 16, 2004 1

Computational Methods in Bioinformatics

Dr. Moustafa Elshafei
Systems Engineering Department



Topics

What is Bioinformatics?
Introduction to molecular genetics
Some challenging problems
Review of the current computational techniques
Future approaches
Conclusion



What is Bioinformatics?

Bioinformatics is a management information system for molecular biology:

Organization of a huge amount of information in gene banks and protein banks
Data mining and analysis tools
Modeling, interpreting, and predicting biological activities



Introduction to molecular genetics

Molecules
Lipids
Proteins



Nucleus and Nucleolus

Plant cell: note the large nucleus and nucleolus in the centre of the cell.



Chromosomes and Genes



How long is DNA?

The DNA helix (2 nm wide) is wound around histone fibres of diameter 11 nm, then compacted into a 30 nm chromatin fiber, then coiled to a diameter of 700 nm, and finally formed into chromosomes of 1400 nm diameter.

If a human DNA strand had a diameter of 1 mm, it would stretch to 25 km. It would be wound, twisted, and coiled until it became a chromosome 50 cm in diameter and 4 meters long.



Chromosomes

Chromosomes are the cellular components that contain genes; in animals and plants they are located in the nucleus.
Genes are the functional units of inheritance.
Genes are specific segments of DNA that code for specific proteins, which control cell structure and function.



The number of chromosomes varies from one organism to another:

Human 46
Chicken 78
Mouse 40
Wheat 42
Corn 20
Fruit fly 8
Scorpion 4



Genes & Genetics



Deoxyribonucleic acid (DNA)

DNA is a pair of sequences built from four nucleotides: cytosine (C), guanine (G), adenine (A), and thymine (T). A pairs with T, and C pairs with G; the pairs are held together by hydrogen bonds.

TCTCGGCATTAGGGCCT



TCTCGGCATTAGGGCCT

AGAGCCGTAATCCCGGA
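The pairing rule (A with T, C with G) is enough to derive one strand from the other. The following is a minimal sketch; the function names `complement` and `reverse_complement` are illustrative choices, not part of the original slides.

```python
# Watson-Crick pairing: A<->T and C<->G.
PAIRING = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base complement, read in the same direction."""
    return strand.upper().translate(PAIRING)

def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in its own 5'->3' direction."""
    return complement(strand)[::-1]

top = "TCTCGGCATTAGGGCCT"
print(complement(top))  # AGAGCCGTAATCCCGGA, the lower strand shown above
```

The reverse complement is included because sequence tools usually report the second strand in its own reading direction.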



Genome length in nucleotide pairs

Virus        5k
E. coli      4,700k
Human being  3,000,000k
Corn         4,500,000k
Salamander   72,500,000k


Genes and proteins

Genes are segments of DNA which code for proteins.
A segment of DNA that codes for a specific protein is a structural gene.
Protein synthesis is also governed by a genetic code.
Every function in a cell is controlled by some kind of protein.
Proteins are formed as chains drawn from 20 amino acids.
Each group of three nucleotides is called a codon.
The 64 possible codons are mapped to Start, Stop, and the 20 amino acids.
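The count of 64 follows from 4 bases in 3 positions (4 x 4 x 4). As a small sketch, the table below is the standard genetic code written in the compact TCAG ordering; it is not taken from the slides, and the names `BASES`, `AMINO`, and `CODON_TABLE` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks Stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")

# Map each of the 4**3 codons to its amino acid (or Stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
amino_acids = set(AMINO) - {"*"}

print(len(CODON_TABLE))    # number of codons
print(stop_codons)         # the Stop codons
print(len(amino_acids))    # number of distinct amino acids
print(CODON_TABLE["ATG"])  # ATG codes for methionine and is the usual Start
```

Because 61 codons encode only 20 amino acids, most amino acids are represented by more than one codon.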



Protein Mapping

A protein consists of a chain of amino acids.
There are 20 amino acids.
Each amino acid is coded by three bases.
During protein synthesis DNA is transcribed to mRNA, with T replaced by U.
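The two steps above (T to U, then three bases per amino acid) can be sketched as follows. This assumes the input is the coding strand, so transcription reduces to the T-to-U substitution mentioned on the slide; the table is the standard genetic code, and `transcribe` and `translate` are illustrative names.

```python
from itertools import product

RNA_BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a Stop codon.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
RNA_CODONS = {"".join(c): aa for c, aa in zip(product(RNA_BASES, repeat=3), AMINO)}

def transcribe(coding_dna: str) -> str:
    """DNA -> mRNA for a coding-strand input: replace T with U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons from the first AUG until a Stop codon or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no Start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = RNA_CODONS[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)

mrna = transcribe("ATGGCCTTTTAA")
print(mrna)             # AUGGCCUUUUAA
print(translate(mrna))  # MAF
```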



Protein Expression



Gene length ranges from 30k to 250k bp; exon regions range from 69 to 3106 bp. Introns can be as large as 32k bp.

The mean internal coding exon is 150 bp.

Eukaryotes have only 10% of their DNA coding for proteins. Humans may have as little as 1% coding for proteins. Viruses and prokaryotes use a great deal more of their DNA.

The Human Genome Project, completed in 2003, found 3 billion bp and about 30,000 genes, compared to 13,600 for the fruit fly, over 14,000 for mosquitoes, and 50,000 for rice.



"If the number of genes really turns out to be about 30,000, then this can be a testament to the marvellous design of life. Only a genius could create us with so few genes performing so many functions."

A famous scientist in genetics.



An RNA gene is any gene that is not translated into a protein. A commonly used synonym of "RNA gene" is noncoding RNA (ncRNA).

RNA genes code for certain regulatory functions.

RNA genes are not predictable by current algorithms. It is not clear how many of these are hidden in the human genome.



Gene Banks



Challenges

1- Gene finding: try to identify a potential gene region in DNA; however, only 1-3% of the human genome is translated into proteins.

2- Finding a region of interest: raw sequencing is performed on pieces of random lengths between 500 and 5000 bp, with possibly large overlapping parts at both ends and 6 possible interpretations of each strand. Algorithms are needed to align the fragments.

3- Multiple alignment of a set of genes to reveal regions of similarity and cross-species changes.

4- Local alignment and similarity search, statistical grouping, clustering, and statistical similarity measures for coarse classification.

5- Protein structure prediction: given a protein sequence, how would it fold itself into a specific complex 3D shape?

6- Locating the non-coding genes (RNA).
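To make the fragment-alignment problem in item 2 concrete, here is a minimal sketch that joins two reads by their longest exact suffix-prefix overlap. It is an illustration only: real assemblers tolerate sequencing errors and handle many fragments at once, and the names `overlap_length`, `merge_fragments`, and `min_overlap` are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):  # try longest first
        if left.endswith(right[:k]):
            return k
    return 0

def merge_fragments(left: str, right: str, min_overlap: int = 3) -> str:
    """Join two fragments, writing the shared overlap only once.

    With no qualifying overlap the fragments are simply concatenated.
    """
    k = overlap_length(left, right, min_overlap)
    return left + right[k:]

read_1 = "TCTCGGCATT"
read_2 = "GCATTAGGGCCT"
print(overlap_length(read_1, read_2))   # 5 (the shared "GCATT")
print(merge_fragments(read_1, read_2))  # TCTCGGCATTAGGGCCT
```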



Methods

Similarity Search

Content search

Signal Search



Common Software Uses

Similarity analysis
Sequence analysis
Sequence alignment
Population genetics statistical analysis
Format conversion, database maintenance and searching


Database Fast Search

BLAST & FASTA

Query a database for DNA sequences similar to a given sequence.
Rely on identification of brief subsequences (k-tuples). Multiple k-tuples serve as seeds for extended alignment.
Versions exist for DNA and protein sequences.
Limited capability to handle gaps in coding regions.
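The k-tuple idea can be sketched in a few lines: index every k-tuple of the database sequence, look up the query's k-tuples to find seeds, then extend a seed while the bases keep matching. This is a simplified illustration, not the actual BLAST or FASTA implementation (both add scoring, thresholds, and statistics); all function names here are hypothetical.

```python
from collections import defaultdict

def index_kmers(sequence: str, k: int) -> dict:
    """Map each k-tuple to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def find_seeds(query: str, subject: str, k: int = 4) -> list:
    """Return (query_pos, subject_pos) pairs where a k-tuple matches exactly."""
    subject_index = index_kmers(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in subject_index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, s_pos))
    return seeds

def extend_seed(query: str, subject: str, q_pos: int, s_pos: int, k: int) -> str:
    """Extend a seed to the right, without gaps, while the bases match."""
    end = k
    while (q_pos + end < len(query) and s_pos + end < len(subject)
           and query[q_pos + end] == subject[s_pos + end]):
        end += 1
    return query[q_pos:q_pos + end]

subject = "TCTCGGCATTAGGGCCT"
query = "GGCATTAG"
seeds = find_seeds(query, subject, k=4)
print(seeds[0])                                      # (0, 4)
print(extend_seed(query, subject, *seeds[0], k=4))   # GGCATTAG
```

The ungapped extension also shows why such tools handle gaps poorly: a single inserted or deleted base ends the extension.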



Gene Prediction/Gene analysis

The most common:

GRAIL*
FGENEH/FGENES
MZEF
GENSCAN*
Procrustes
GeneID
GeneParser
HMMgene


GRAIL

Gene Recognition and Analysis Internet Link

There are multiple versions: GRAIL 1, GRAIL 1a, GRAIL 2, GRAIL III, etc.

GRAIL II uses neural networks to classify introns and exons.
GRAIL III uses dynamic programming to find the optimal combinations of introns and exons.

Refinements: consideration of contextual information, and linguistic methods.


GenScan

Predicts complete gene structures.
The input sequence may represent more than one gene.
It follows a probabilistic model.
Uses a Markov model, specifically a Generalized Hidden Markov Model.
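As a rough illustration of the hidden Markov idea, the sketch below labels each base as "coding" or "noncoding" with the Viterbi algorithm. It is a plain two-state HMM, not GenScan's generalized model (which also models state durations and gene signals), and every probability in it is a made-up value chosen only for the example.

```python
import math

STATES = ("coding", "noncoding")
# All numbers below are hypothetical, chosen for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

def viterbi(sequence: str) -> list:
    """Return the most probable state path for a non-empty DNA sequence."""
    # scores[s] = best log-probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][sequence[0]])
              for s in STATES}
    backpointers = []
    for symbol in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES,
                            key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][symbol]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCATATAT"))
```

With these toy parameters the GC-rich half is labelled coding and the AT-rich half noncoding; a real gene finder estimates its parameters from annotated genes.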



Multiple Sequence Alignment Programs

Discover the commonalities and evolutionary relations among a set of genes or proteins.

Examples:
ClustalW
DiAlign
MAP

Alignment editors:
Bioedit



ClustalW

Finds the best global alignment for a set of input sequences (nucleic acid or protein).
A global alignment refers to the best match over the total length of the sequences.
Produces a similarity tree with scores.
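For two sequences, a global alignment can be computed by dynamic programming (the Needleman-Wunsch scheme). The sketch below uses simple match, mismatch, and linear gap scores that are assumptions for illustration; ClustalW itself uses substitution matrices and more elaborate gap penalties.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned to gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned to gaps only
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

best, row_a, row_b = global_align("GATTACA", "GATACA")
print(best)   # 4: six matches and one gap
print(row_a)
print(row_b)
```

Filling the table takes time proportional to the product of the two sequence lengths, which is the per-pair cost behind the pairwise step of a progressive aligner.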



CLUSTALW

Step 1: Pairwise alignment, distance matrix
Calculates distance scores between pairs
Cost: O(q^2 l^2), where q is the number of sequences and l is the mean length

Step 2: Guide tree
Group nearest first
Build tree sequentially
Cost: O(q^3)

Step 3: Progressive alignment
Align, starting at leaves of tree
Cost: O(q l^2)

Other programs (MAP) use DP to find the most likely evolutionary sequence.
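Steps 1 and 2 can be sketched on a toy input. To keep the example short, the distance here is the fraction of mismatched positions between equal-length sequences (standing in for scores from real pairwise alignments), and the tree is built by average-linkage grouping of the nearest clusters; ClustalW's own distance and tree construction differ in detail. All names are illustrative.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of positions that differ; assumes equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch expects equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs: dict) -> dict:
    """Step 1: one distance per unordered pair, q(q-1)/2 pairs in total."""
    return {frozenset((m, n)): p_distance(seqs[m], seqs[n])
            for m, n in combinations(seqs, 2)}

def guide_tree(seqs: dict):
    """Step 2: repeatedly join the two nearest clusters (nearest first)."""
    dist = distance_matrix(seqs)
    clusters = [(name, [name]) for name in seqs]  # (subtree, leaf names)

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTGTACCA", "d": "TTGTACCT"}
print(guide_tree(seqs))  # (('a', 'b'), ('c', 'd'))
```

Step 3 would then walk this tree from the leaves upward, aligning the closest sequences first and adding the more distant groups progressively.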


Protein Structure Prediction

Neural networks (NNs) are the basis for many well-known software packages for predicting protein structures.

The main software packages:
nnPredict
Predict Protein
Predator
PSIPRED
SOPMA



POSSIBLE RESEARCH DIRECTIONS

Neuro-fuzzy techniques
Genetic algorithms
Theory of error-correction codes
Wavelets
Spectrum analysis
Dynamic modeling of protein expression



THANK YOU
