Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)

Bioinformatics Programming


Page 1: Bioinformatics Programming

Bioinformatics Programming

1

EE, NCKU Tien-Hao Chang (Darby Chang)

Page 2: Bioinformatics Programming

Molecular biology

Nucleic acid
– DNA
– RNA

Central dogma
– Transcription
– Translation

Protein
– Amino acid
– Primary structure
– Secondary structure
– Tertiary structure

2

Page 3: Bioinformatics Programming

Nucleic acid

A nucleic acid is a macromolecule composed of chains of monomeric nucleotides

In biochemistry these molecules carry genetic information or form structures within cells

The most common nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA)

3

Page 4: Bioinformatics Programming

4

http://juang.bst.ntu.edu.tw/BC2008/images/NA%20Fig1.jpg

Page 5: Bioinformatics Programming

Nucleic acid components

Sugar

5

http://www.mun.ca/biology/scarr/Fg10_09b_revised.gif

Page 6: Bioinformatics Programming

Nucleic acid components

Base

Purine
– Adenine (A) and guanine (G)

Pyrimidine
– Thymine (T), cytosine (C)
– Uracil (U, only in RNA)

6

Page 7: Bioinformatics Programming

7

http://www.elmhurst.edu/~chm/vchembook/images/580bases.gif

Page 8: Bioinformatics Programming

8

http://fig.cox.miami.edu/~cmallery/150/chemistry/sf3x14a.jpg

Page 9: Bioinformatics Programming

DNA

Chemically, DNA is a long polymer of simple units called nucleotides, with a backbone made of sugars and phosphate groups joined by ester bonds

Attached to each sugar is one of four types of molecules called bases

It is the sequence of these four bases along the backbone that encodes information

9

http://upload.wikimedia.org/wikipedia/commons/8/87/DNA_orbit_animated_small.gif

Page 10: Bioinformatics Programming

DNA

Base pairing

Each type of base on one strand bonds with just one type of base on the other strand

Purines form hydrogen bonds with pyrimidines: A bonds only with T, and C bonds only with G

The two types of base pair form different numbers of hydrogen bonds: AT forms two hydrogen bonds and GC forms three

Chargaff's rule
– A=T and G=C

DNA sequence– 5’CpGpCpApApTpT

3’TpTpApApCpGpC

– CGCGAATT
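The pairing rule and Chargaff's rule can be checked with a few lines of Perl. This is only a minimal sketch: the helper names `complement` and `base_counts` are hypothetical, and the example strand is borrowed from the alignment slides later in this deck. It builds the partner strand and counts bases on one strand and on both strands together.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the partner strand under A-T and C-G pairing,
# written in the same left-to-right order as the input.
sub complement {
    my ($strand) = @_;
    ( my $partner = uc $strand ) =~ tr/ACGT/TGCA/;
    return $partner;
}

# Hypothetical helper: count A, C, G and T; returns a hash reference.
sub base_counts {
    my ($seq) = @_;
    my %count = ( A => 0, C => 0, G => 0, T => 0 );
    for my $base ( split //, uc $seq ) {
        $count{$base}++ if exists $count{$base};
    }
    return \%count;
}

my $strand  = 'GAATTCAGTTA';
my $partner = complement($strand);
my $single  = base_counts($strand);
my $duplex  = base_counts( $strand . $partner );

print "strand : $strand\n";
print "partner: $partner\n";
print "one strand  : A=$single->{A} T=$single->{T} G=$single->{G} C=$single->{C}\n";
print "both strands: A=$duplex->{A} T=$duplex->{T} G=$duplex->{G} C=$duplex->{C}\n";
```

A single strand need not contain equal amounts of A and T; the equality appears once both strands of the duplex are counted.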

10

Page 11: Bioinformatics Programming

11

http://openlearn.open.ac.uk/file.php/2645/S377_1_005i.jpg

Page 12: Bioinformatics Programming

12

Double helix

http://www.coe.drexel.edu/ret/personalsites/2005/dayal/curriculum1_files/image001.jpg

Page 13: Bioinformatics Programming

Hydrogen bond

A hydrogen bond exists between an electronegative atom and a hydrogen atom bonded to another electronegative atom

This type of force always involves a hydrogen atom, hence the name hydrogen bonding; in the strongest cases its energy approaches that of weak covalent bonds (about 155 kJ/mol), although most hydrogen bonds in biomolecules are far weaker

Biological functions
– DNA/RNA base pairing
– protein secondary/tertiary structure formation
– some properties of the water molecule
– antibody-antigen (and other protein-protein) binding

13

Page 14: Bioinformatics Programming

14

Hydrogen bonds result from differences in electronegativity

http://upload.wikimedia.org/wikipedia/commons/4/43/Liquid_water_hydrogen_bond.png

Page 15: Bioinformatics Programming

15

Major and minor grooves

http://courses.biology.utah.edu/horvath/biol.3525/1_DNA/Fig2/marty_1.jpg

Page 16: Bioinformatics Programming

DNA structure

16

http://www.youtube.com/watch?v=qy8dk5iS1f0&NR=1

Page 17: Bioinformatics Programming

Any Questions?

17

About DNA

Page 18: Bioinformatics Programming

Central dogma

18

Page 19: Bioinformatics Programming

19

http://fig.cox.miami.edu/~cmallery/255/255hist/mcb4.1.dogma.jpg

Page 20: Bioinformatics Programming

Central dogma

The process by which information is extracted from the nucleotide sequence of a gene and then used to make a protein is essentially the same for all living things on Earth and is described by the grandly named central dogma of molecular biology

Information in cells passes from DNA to RNA to proteins

20

http://upload.wikimedia.org/wikipedia/commons/3/3a/Crick's_1958_central_dogma.svg

Page 21: Bioinformatics Programming

RNA

Information stored in DNA is used to make a more transient, single-stranded polynucleotide called RNA (ribonucleic acid)

RNA is very similar to DNA, but differs in a few important structural details
– in the cell RNA is usually single stranded, while DNA is usually double stranded
– RNA nucleotides contain ribose while DNA contains deoxyribose (a ribose derivative that lacks one oxygen atom)
– in RNA the base uracil substitutes for thymine, which is present in DNA

21

Page 22: Bioinformatics Programming

22

http://www.dadamo.com/wiki/dna-rna.png

Page 23: Bioinformatics Programming

Central dogma

Transcription

Transcription is the synthesis of RNA under the direction of DNA

Both nucleic acid sequences use the same language, and the information is simply transcribed, or copied, from one molecule to the other

The DNA sequence is enzymatically copied by RNA polymerase to produce a complementary RNA strand, called messenger RNA (mRNA)

23

Page 24: Bioinformatics Programming

DNA transcription

24

http://www.youtube.com/watch?v=vJSmZ3DsntU

Page 25: Bioinformatics Programming

Transcription detail

25

http://www-class.unl.edu/biochem/gp2/m_biology/animation/m_animations/gene2.swf

Page 26: Bioinformatics Programming

RNA

Various types

mRNA
– messenger RNA (mRNA) is the RNA that carries information from DNA to the ribosome
– the coding sequence of the mRNA determines the amino acid sequence in the protein that is produced

Non-coding RNA
– many RNAs do not code for protein

– these non-coding RNA can be encoded by their own genes (RNA genes), but can also derive from mRNA introns

– the most prominent examples of non-coding RNAs are transfer RNA (tRNA) and ribosomal RNA (rRNA), both of which are involved in the process of translation

– there are also non-coding RNAs involved in gene regulation, RNA processing and other roles

26

Page 27: Bioinformatics Programming

Central dogma

Translation

Translation is the second stage of protein biosynthesis

Translation occurs in the cytoplasm, where the ribosomes are located

In translation, mRNA is decoded to produce a specific polypeptide according to the rules specified by the genetic code

Many types of transcribed RNA, such as transfer RNA, ribosomal RNA, and small nuclear RNA, are not translated into an amino acid sequence
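Decoding an mRNA codon by codon can be sketched in Perl as follows. This is an illustration under stated assumptions, not a complete translator: the `%codon_table` below lists only the handful of standard-code codons needed for the toy input, the sub name `translate` is hypothetical, and any codon missing from the partial table is reported as X.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Partial codon table (standard genetic code); '*' marks a stop codon.
# Only the codons used in this toy example are listed.
my %codon_table = (
    AUG => 'M', UUU => 'F', GGC => 'G', AAA => 'K',
    UAA => '*', UAG => '*', UGA => '*',
);

# Hypothetical helper: read the mRNA three bases at a time from position 0
# and stop at the first stop codon.
sub translate {
    my ($mrna) = @_;
    my $protein = '';
    for ( my $i = 0 ; $i + 3 <= length($mrna) ; $i += 3 ) {
        my $codon = substr( $mrna, $i, 3 );
        my $aa    = $codon_table{$codon};
        $aa = 'X' unless defined $aa;    # codon not in the partial table
        last if $aa eq '*';
        $protein .= $aa;
    }
    return $protein;
}

print translate('AUGUUUGGCAAAUAA'), "\n";    # prints MFGK
```

A real program would carry all 64 codons; the loop itself would not change.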

27

Page 28: Bioinformatics Programming

From RNA to protein synthesis

28

http://www.youtube.com/watch?v=NJxobgkPEAo

Page 29: Bioinformatics Programming

Protein translation

29

http://www.youtube.com/watch?v=nl8pSlonmA0

Page 30: Bioinformatics Programming

30

Genetic code

http://biology.kenyon.edu/courses/biol114/Chap05/code.gif

Page 31: Bioinformatics Programming

Any Questions?

31

About central dogma

Page 32: Bioinformatics Programming

Protein

32

Page 33: Bioinformatics Programming

Protein

Proteins are large organic compounds made of amino acids arranged in a linear chain and joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues

Proteins can also work together to achieve a particular function, and they often associate to form stable complexes

33

Page 34: Bioinformatics Programming

Protein

Amino acid

In chemistry, an amino acid is a molecule that contains both amine and carboxyl functional groups

In biochemistry, this term refers to alpha-amino acids with the general formula H2NCHRCOOH, where R is an organic substituent

In the alpha-amino acids, the amino and carboxylate groups are attached to the same carbon, which is called the α-carbon

34

Page 35: Bioinformatics Programming

35

http://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/AminoAcidball.svg/702px-AminoAcidball.svg.png

Page 36: Bioinformatics Programming

Amino acid

Various side chains

The various alpha-amino acids differ in which side chain (R group) is attached to their alpha carbon

They can vary in size from just a hydrogen atom in glycine, through a methyl group in alanine, to a large heterocyclic group in tryptophan

36

Page 37: Bioinformatics Programming

37

http://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Aa.svg/2000px-Aa.svg.png

Page 38: Bioinformatics Programming

38

http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-7.JPG

Page 39: Bioinformatics Programming

39

http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-9.JPG

Page 40: Bioinformatics Programming

40

http://www.russell.embl-heidelberg.de/aas/other_images/lb3.gif

Page 41: Bioinformatics Programming

Amino acid

The building blocks of proteins

Amino acids combine in a condensation reaction that releases water; each incorporated amino acid becomes an “amino acid residue”, and adjacent residues are held together by a peptide bond

Proteins are defined by their unique sequence of amino acid residues; this sequence is the primary structure of the protein

Just as the letters of the alphabet can be combined to form an almost endless variety of words, amino acids can be linked in varying sequences to form a vast variety of proteins

41

Page 42: Bioinformatics Programming

42

Peptide bond

http://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Peptidformationball.svg/2000px-Peptidformationball.svg.png

Page 43: Bioinformatics Programming

43

http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-11.JPG

Page 44: Bioinformatics Programming

44

http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-13.JPG

Page 45: Bioinformatics Programming

Protein

After knowing amino acids

Amino acids form short polymer chains called peptides, or longer chains called either polypeptides or proteins

The process of such formation from an mRNA template is known as translation, which is part of protein biosynthesis

Twenty amino acids are encoded by the standard genetic code and are called proteinogenic or standard amino acids

45

Page 46: Bioinformatics Programming

Protein structure hierarchy

46

Page 47: Bioinformatics Programming

47

http://cropandsoil.oregonstate.edu/classes/css430/lecture%209-07/figure-09-03.JPG

Page 48: Bioinformatics Programming

48

http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-4.JPG

Page 49: Bioinformatics Programming

49

http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-8.JPG

Page 50: Bioinformatics Programming

50

http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-9.JPG

Page 51: Bioinformatics Programming

Protein structure hierarchy

Secondary structure

In biochemistry and structural biology, secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA)

It does not, however, describe specific atomic positions in three-dimensional space, which are considered to be tertiary structure

51

Page 52: Bioinformatics Programming

52

http://juang.bst.ntu.edu.tw/BC2008/images/Protein(2)%202007/P2-3.JPG

Page 53: Bioinformatics Programming

Protein structure hierarchy

Tertiary structure

The three-dimensional structure of a protein or any other macromolecule, as defined by the atomic coordinates

It describes the spatial relations among its secondary structures

Tertiary structure is considered to be largely determined by the protein’s primary sequence, that is, the sequence of amino acids of which it is composed

The majority of protein structures known to date have been solved with the experimental technique of X-ray crystallography

A second common way of solving protein structures uses NMR (nuclear magnetic resonance)
– provides lower-resolution data and is limited to relatively small proteins
– can provide time-dependent information about the motion of a protein in solution

53

Page 54: Bioinformatics Programming

54

http://campusapps.fullerton.edu/news/arts/2003/photos/protein-art.jpg

Page 55: Bioinformatics Programming

Protein structure hierarchy

Quaternary structure

Many proteins are actually assemblies of more than one polypeptide chain, which in the context of the larger assemblage are known as protein subunits

In addition to the tertiary structure of the subunits, multiple-subunit proteins possess a quaternary structure, which is the arrangement into which the subunits assemble

55

http://courses.cm.utexas.edu/jrobertus/ch339k/overheads-1/ch6_quat-struct1.jpg

Page 56: Bioinformatics Programming

Protein sub-structure

56

Page 57: Bioinformatics Programming

Protein sub-structure

Domain

A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain

Each domain forms a compact three-dimensional structure and often can be independently stable and folded

Domains vary in length from about 25 amino acids up to 500 amino acids

The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges

Domains often form functional units

57

http://upload.wikimedia.org/wikipedia/commons/6/67/1pkn.png

Page 58: Bioinformatics Programming

Protein domain

Zinc finger

58

http://upload.wikimedia.org/wikipedia/commons/7/79/Zinc_finger_DNA_complex.png

Page 59: Bioinformatics Programming

Protein sub-structure

Motif

Sequence motif
– a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, biological significance
– for proteins, a sequence motif is distinguished from a structural motif, which is formed by the three-dimensional arrangement of amino acids that may not be adjacent in the sequence

Structural motif
– a three-dimensional structural element or fold within the chain, which also appears in a variety of other molecules
– in the context of proteins, the term is sometimes used interchangeably with “structural domain,” although a domain need not be a motif and, if it contains a motif, need not be made up of only one

59

Page 60: Bioinformatics Programming

60

Page 61: Bioinformatics Programming

61

http://www.biomedcentral.com/content/figures/1471-2164-8-60-8.jpg

Page 62: Bioinformatics Programming

62

http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-3.JPG

Page 63: Bioinformatics Programming

Molecular biology

Reference

Website of Prof. Juang (莊榮輝), National Taiwan University
– http://juang.bst.ntu.edu.tw/BC2008/index.htm

Molecular biology website, National Chiao Tung University
– http://www.life.nctu.edu.tw/~mb/c40101.htm

63

Page 64: Bioinformatics Programming

Any Questions?

64

About molecular biology

Page 65: Bioinformatics Programming

65

Sequence alignment
In: a FASTA file
Out: pairwise sequence alignment

Requirement
- output alignment score (identity)
- complexity/teamwork report
- using Perl would be the best

Bonus
- alignment allowing mismatches
- output alignment

Page 66: Bioinformatics Programming

66

Deadline: 2010/4/27 23:59

Zip your code, a step-by-step README, complexity analyses, and anything worthy of extra credit. Email to [email protected].

Page 67: Bioinformatics Programming

Input
– Download from UniProt

– UniProt is the universal protein resource, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information.

– http://www.uniprot.org/uniprot/?query=Saccharomyces+cerevisiae+transcription+factor+AND+reviewed%3ayes&force=yes&format=fasta

– >sp|P32333|MOT1_YEAST TATA-binding protein-associated fac...

MTSRVSRLDRQVILIETGSTQVVRNMAADQMGDLAKQHPEDILSLLSRVYPFLLVKKWET

...

TFIKTLR

>sp|Q00947|STP1_YEAST Transcription factor STP1...

MPSTTLLFPQKHIRAIPGKIYAFFRELVSGVIISKPDLSHHYSCENATKEEGKDAADEEK

...

>sp|P38830|NDT80_YEAST Meiosis-specific transcription fac...

MNEMENTDPVLQDDLVSKYERELSTEQEEDTPVILTQLNEDGTTSNYFDKRKLKIAPRST

...

Output
– >MOT1_YEAST
  STP1_YEAST 90
  NDT80_YEAST 80
  >STP1_YEAST
  MOT1_YEAST 90
  NDT80_YEAST 70
  >NDT80_YEAST
  MOT1_YEAST 80
  STP1_YEAST 70
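Reading the FASTA input is the first step of the assignment. The sketch below is one possible way to do it, not a prescribed solution: the sub name `read_fasta` is hypothetical, the two records are shortened toy sequences rather than real UniProt data, and taking the entry name from the third `|`-separated header field is an assumption based on the header lines shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA text from a filehandle into
# a hash reference mapping entry name => sequence.
sub read_fasta {
    my ($fh) = @_;
    my ( %seq, $name );
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;          # skip blank lines
        if ( $line =~ /^>(\S+)/ ) {        # header line
            my $id    = $1;
            my @field = split /\|/, $id;
            $name = @field >= 3 ? $field[2] : $id;
            $seq{$name} = '';
        }
        elsif ( defined $name ) {          # sequence line: append to current entry
            $seq{$name} .= $line;
        }
    }
    return \%seq;
}

# Toy input in the same header style as the UniProt download.
my $fasta = <<'END';
>sp|P32333|MOT1_YEAST toy record
MTSRVSRLDR
QVILIETG
>sp|Q00947|STP1_YEAST toy record
MPSTTLLFPQ
END

open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my $sequences = read_fasta($fh);
close $fh;

print "$_\t", length( $sequences->{$_} ), "\n" for sort keys %$sequences;
```

Once the sequences are in a hash, the required output is a double loop over the entry names, scoring each pair.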

Page 68: Bioinformatics Programming

Sequence similarity

Sequence identity

Sequence alignment
– dynamic programming
– backtracking
– substitution matrix

68

Page 69: Bioinformatics Programming

Sequence similarity

Identity

Which sequence is more similar to DKELIR?
– EPELIR or DKGLIR

A trivial (but useful) concept, identity
– DKELIR
  EPELIR
    ****    identity: 4/6 = 66.7%
– DKELIR
  DKGLIR
  ** ***    identity: 5/6 = 83.3%
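Identity between two equal-length sequences is a position-by-position comparison. A minimal sketch (the sub name `identity` is hypothetical; it returns a fraction between 0 and 1 and refuses sequences of different lengths):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fraction of positions at which two
# equal-length sequences carry the same residue.
sub identity {
    my ( $s, $t ) = @_;
    die "sequences must have equal length\n" unless length($s) == length($t);
    return 0 unless length($s);
    my $same = 0;
    for my $i ( 0 .. length($s) - 1 ) {
        $same++ if substr( $s, $i, 1 ) eq substr( $t, $i, 1 );
    }
    return $same / length($s);
}

printf "%.1f%%\n", 100 * identity( 'DKELIR', 'EPELIR' );    # 66.7%
printf "%.1f%%\n", 100 * identity( 'DKELIR', 'DKGLIR' );    # 83.3%
```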

69

Page 70: Bioinformatics Programming

Sequence similarity

Alignment

When two sequences have different lengths
– DKELIR
  MERPEPELIR    identity: 0%

Obviously, we need to shift the first sequence by 4 residues
–     DKELIR
  MERPEPELIR

This movement is the so-called alignment
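The shift can be found by brute force: try every offset of the shorter sequence along the longer one and keep the offset with the most identical positions. The sketch below does only that (no gaps yet); the sub name `best_offset` is hypothetical, and it assumes the first argument is not longer than the second.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: slide $short along $long without gaps and
# return (best offset, number of identical positions at that offset).
sub best_offset {
    my ( $short, $long ) = @_;
    my ( $best_off, $best ) = ( 0, -1 );
    for my $off ( 0 .. length($long) - length($short) ) {
        my $matches = 0;
        for my $i ( 0 .. length($short) - 1 ) {
            $matches++
              if substr( $short, $i, 1 ) eq substr( $long, $off + $i, 1 );
        }
        ( $best_off, $best ) = ( $off, $matches ) if $matches > $best;
    }
    return ( $best_off, $best );
}

my ( $shift_by, $identical ) = best_offset( 'DKELIR', 'MERPEPELIR' );
print "shift by $shift_by residues, $identical identical positions\n";
# prints: shift by 4 residues, 4 identical positions
```

This costs time proportional to the product of the two lengths and cannot open gaps inside a sequence, which is what the dynamic programming slides address.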

Page 71: Bioinformatics Programming

Sequence alignment

A way of arranging the sequences of DNA, RNA or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

More complex alignments may involve gaps (EDKELIR vs. MERPEPELIR)– E----DKELIR

MERPEP--ELIR

* **** identity: 5/11 = 45.5%

And substitution matrix– E--DKELIR

MERPEPELIR

* .**** identity: 6/9 = 66.7%

71

Page 72: Bioinformatics Programming

Sequence alignment

Dynamic programming

      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  1  1  1  1  1  1  1  1  1  1  1
G  0  1  1  1  1  1  1  1  2  2  2  2
A  0  1  2  2  2  2  2  2  2  2  2  3
T  0  1  2  2  3  3  3  3  3  3  3  3
C  0  1  2  2  3  3  4  4  4  4  4  4
G  0  1  2  2  3  3  4  4  5  5  5  5
A  0  1  2  3  3  3  4  5  5  5  5  6

72

Page 73: Bioinformatics Programming

A class of solution methods for solving sequential decision problems with a compositional cost structure

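In this matrix, each element S_{i,j} is the best alignment score between the two corresponding sub-sequences (the first i residues of the row sequence and the first j residues of the column sequence)
– the key is to find the relationship between the problem S_{i,j} and its sub-problems S_{α,β}, where α ≤ i and β ≤ j

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } a_i = b_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
\]

Here a_i is the i-th residue of the row sequence and b_j is the j-th residue of the column sequence; the diagonal term applies only when the two residues match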
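The recurrence translates almost line by line into Perl. The following is a minimal sketch: the sub name `fill_matrix` is hypothetical, and treating GGATCGA as the row sequence and GAATTCAGTTA as the column sequence is read off the matrix on the slides. It fills the whole matrix and prints it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fill the match-count matrix and return a reference to it.
sub fill_matrix {
    my ( $row_seq, $col_seq ) = @_;
    my @row = split //, $row_seq;
    my @col = split //, $col_seq;

    # Row 0 and column 0 stay 0: nothing can match an empty prefix.
    my @s = map { [ (0) x ( @col + 1 ) ] } 0 .. scalar @row;

    for my $i ( 1 .. scalar @row ) {
        for my $j ( 1 .. scalar @col ) {
            my $best = $s[ $i - 1 ][$j];                             # from above
            $best = $s[$i][ $j - 1 ] if $s[$i][ $j - 1 ] > $best;    # from the left
            if ( $row[ $i - 1 ] eq $col[ $j - 1 ] ) {                # match: diagonal + 1
                my $diag = $s[ $i - 1 ][ $j - 1 ] + 1;
                $best = $diag if $diag > $best;
            }
            $s[$i][$j] = $best;
        }
    }
    return \@s;
}

my $matrix = fill_matrix( 'GGATCGA', 'GAATTCAGTTA' );
print join( ' ', @$_ ), "\n" for @$matrix;
print "best score: $matrix->[-1][-1]\n";    # 6
```

Every cell is computed once from three already-filled neighbours, so the work grows with the product of the two sequence lengths.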

Page 74: Bioinformatics Programming

Dynamic programming

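Insertion gap

      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  1  1  1  1  1  1  1  1  1  1  1
G  0  1  1  1  1  1  1  1  2  2  2  2
A  0  1  2  2  2  2  2  2  2  2  2  3
T  0  1  2  2  3  3  ?
C  0
G  0
A  0

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } a_i = b_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
\]

The ? cell (row T, column C) is not a match, so it takes the larger of its upper neighbour (2) and its left neighbour (3): the value 3 comes from the left, s_{i,j-1}, which corresponds to an insertion gap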

Page 75: Bioinformatics Programming

Dynamic programming

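Deletion gap

      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  1  1  1  1  1  1  1  1  1  1  1
G  0  1  1  1  1  1  1  1  2  2  2  2
A  0  1  2  2  2  2  2  2  2  2  2  3
T  0  1  2  2  3  3  3  3  3  3  3  3
C  0  1  2  2  ?
G  0
A  0

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } a_i = b_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
\]

The ? cell (row C, first T column) is not a match; its left neighbour is 2 and its upper neighbour is 3, so the value 3 comes from above, s_{i-1,j}, which corresponds to a deletion gap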

Page 76: Bioinformatics Programming

Dynamic programming

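Match

      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  1  1  1  1  1  1  1  1  1  1  1
G  0  1  1  1  1  1  1  1  2  2  2  2
A  0  1  2  2  2  2  2  2  2  2  2  3
T  0  1  2  2  ?
C  0
G  0
A  0

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } a_i = b_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
\]

The ? cell (row T, first T column) is a match, so it takes its diagonal neighbour plus one: s_{i-1,j-1} + 1 = 2 + 1 = 3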

Page 77: Bioinformatics Programming

Dynamic programming

Relationship

Two key ingredients make an optimization problem suitable for a dynamic-programming solution
– each substructure is optimal: an optimal solution is built from optimal solutions to its sub-problems
– sub-problems are dependent (they overlap); otherwise, a divide-and-conquer approach is the choice

Now that we know the three relationships
– insertion gap
– deletion gap
– match

we can easily construct an alignment from this matrix with the so-called backtracking technique

77

Page 78: Bioinformatics Programming

Dynamic programming

Backtracking

      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  1  1  1  1  1  1  1  1  1  1  1
G  0  1  1  1  1  1  1  1  2  2  2  2
A  0  1  2  2  2  2  2  2  2  2  2  3
T  0  1  2  2  3  3  3  3  3  3  3  3
C  0  1  2  2  3  3  4  4  4  4  4  4
G  0  1  2  2  3  3  4  4  5  5  5  5
A  0  1  2  3  3  3  4  5  5  5  5  6

78

Page 79: Bioinformatics Programming

Backtracking

Alternative paths

      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  1  1  1  1  1  1  1  1  1  1  1
G  0  1  1  1  1  1  1  1  2  2  2  2
A  0  1  2  2  2  2  2  2  2  2  2  3
T  0  1  2  2  3  3  3  3  3  3  3  3
C  0  1  2  2  3  3  4  4  4  4  4  4
G  0  1  2  2  3  3  4  4  5  5  5  5
A  0  1  2  3  3  3  4  5  5  5  5  6

79

Page 80: Bioinformatics Programming

Backtracking

The backtracking algorithm enumerates a set of partial candidates that, in principle, could be completed in various ways to give all the possible solutions to the given problem

A dynamic programming matrix can produce all possible alignments of the best score from different backtracking paths

Alternative paths
– G-AATTCAGTTA
  GGA-T-C-G--A
  * * * * *  *    identity: 6/12 = 50%
– G-AATTCAGTTA
  GGA--TC-G--A
  * *  ** *  *    identity: 6/12 = 50%
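Backtracking can be sketched as a walk from the bottom-right cell back to the top-left one. The code below is a sketch under assumptions: the sub name `align` is hypothetical, it recomputes the match-count matrix itself so that it is self-contained, and it returns only one of the equally good paths (ties are broken by trying the diagonal first, then the left neighbour, then the upper one).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (gapped column sequence, gapped row sequence, score).
sub align {
    my ( $col_seq, $row_seq ) = @_;
    my @x = split //, $col_seq;    # along the columns
    my @y = split //, $row_seq;    # along the rows

    # Fill the matrix: +1 on the diagonal for a match, otherwise copy a neighbour.
    my @s = map { [ (0) x ( @x + 1 ) ] } 0 .. scalar @y;
    for my $i ( 1 .. scalar @y ) {
        for my $j ( 1 .. scalar @x ) {
            my $best = $s[ $i - 1 ][$j];
            $best = $s[$i][ $j - 1 ] if $s[$i][ $j - 1 ] > $best;
            if ( $y[ $i - 1 ] eq $x[ $j - 1 ] && $s[ $i - 1 ][ $j - 1 ] + 1 > $best ) {
                $best = $s[ $i - 1 ][ $j - 1 ] + 1;
            }
            $s[$i][$j] = $best;
        }
    }

    # Walk back from the bottom-right cell, prepending one column per step.
    my ( $i, $j ) = ( scalar @y, scalar @x );
    my ( $top, $bottom ) = ( '', '' );
    while ( $i > 0 || $j > 0 ) {
        if (   $i > 0 && $j > 0
            && $y[ $i - 1 ] eq $x[ $j - 1 ]
            && $s[$i][$j] == $s[ $i - 1 ][ $j - 1 ] + 1 )
        {
            $top    = $x[ $j - 1 ] . $top;       # match: move diagonally
            $bottom = $y[ $i - 1 ] . $bottom;
            $i--;
            $j--;
        }
        elsif ( $j > 0 && ( $i == 0 || $s[$i][$j] == $s[$i][ $j - 1 ] ) ) {
            $top    = $x[ $j - 1 ] . $top;       # move left: gap in the row sequence
            $bottom = '-' . $bottom;
            $j--;
        }
        else {
            $top    = '-' . $top;                # move up: gap in the column sequence
            $bottom = $y[ $i - 1 ] . $bottom;
            $i--;
        }
    }
    return ( $top, $bottom, $s[-1][-1] );
}

my ( $top, $bottom, $score ) = align( 'GAATTCAGTTA', 'GGATCGA' );
print "$top\n$bottom\nscore: $score\n";
```

Changing the order in which the three neighbours are tried selects a different path of the same score, which is where the alternative alignments on the slide come from.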

80

Page 81: Bioinformatics Programming

Sequence alignment

Substitution matrix

Some alignments may involve a mismatch relationship

– E----DKELIR

MERPEP--ELIR

* **** identity: 5/11 = 45.5%

– E--DKELIR

MERPEPELIR

* .**** identity: 6/9 = 66.7%

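\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + \mathrm{score}(a_i, b_j) \\
s_{i-1,j} + \mathrm{score}(a_i, -) \\
s_{i,j-1} + \mathrm{score}(-, b_j)
\end{cases}
\]

Substitution matrix (match 8, mismatch -5, gap -3)

        -    A    C    G    T
-           -3   -3   -3   -3
A      -3    8   -5   -5   -5
C      -3   -5    8   -5   -5
G      -3   -5   -5    8   -5
T      -3   -5   -5   -5    8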

Page 82: Bioinformatics Programming

Sequence alignment with substitution matrix

Initialize

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3
G   -6
A   -9
T  -12
C  -15
G  -18
A  -21

Substitution matrix: match 8, mismatch -5, gap -3

Page 83: Bioinformatics Programming

Sequence alignment with substitution matrix

Match

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    ?
G   -6
A   -9
T  -12
C  -15
G  -18
A  -21

Substitution matrix: match 8, mismatch -5, gap -3

Page 84: Bioinformatics Programming

Sequence alignment with substitution matrix

Deletion gap

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    8    ?
G   -6
A   -9
T  -12
C  -15
G  -18
A  -21

Substitution matrix: match 8, mismatch -5, gap -3

Page 85: Bioinformatics Programming

Sequence alignment with substitution matrix

Insertion gap

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    8    5    2   -1   -4   -7  -10  -13  -16  -19  -22
G   -6    ?
A   -9
T  -12
C  -15
G  -18
A  -21

Substitution matrix: match 8, mismatch -5, gap -3

Page 86: Bioinformatics Programming

Sequence alignment with substitution matrix

Mismatch

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    8    5    2   -1   -4   -7  -10  -13  -16  -19  -22
G   -6    5    ?
A   -9
T  -12
C  -15
G  -18
A  -21

Substitution matrix: match 8, mismatch -5, gap -3

Page 87: Bioinformatics Programming

Sequence alignment with substitution matrix

Initialize

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    8    5    2   -1   -4   -7  -10  -13  -16  -19  -22
G   -6    5    3    ?
A   -9
T  -12
C  -15
G  -18
A  -21

Substitution matrix: match 8, mismatch -5, gap -3

Page 88: Bioinformatics Programming

Sequence alignment with substitution matrix

Complete

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    8    5    2   -1   -4   -7  -10  -13  -16  -19  -22
G   -6    5    3    0   -3   -6   -9  -12   -2   -5   -8  -11
A   -9    2   13   11    8    5    2   -1   -4   -7  -10    0
T  -12   -1   10    8   19   16   13   10    7    4    1   -2
C  -15   -4    7    5   16   14   24   21   18   15   12    9
G  -18   -7    4    2   13   11   21   19   29   26   23   20
A  -21  -10    1   12   10    8   18   29   26   24   21   31

88

Page 89: Bioinformatics Programming

Sequence alignment with substitution matrix

Backtracking

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    8    5    2   -1   -4   -7  -10  -13  -16  -19  -22
G   -6    5    3    0   -3   -6   -9  -12   -2   -5   -8  -11
A   -9    2   13   11    8    5    2   -1   -4   -7  -10    0
T  -12   -1   10    8   19   16   13   10    7    4    1   -2
C  -15   -4    7    5   16   14   24   21   18   15   12    9
G  -18   -7    4    2   13   11   21   19   29   26   23   20
A  -21  -10    1   12   10    8   18   29   26   24   21   31
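The scored version differs from the match-count version only in its initialization and in the three terms of the maximum. The sketch below makes its assumptions explicit: the scores (match 8, mismatch -5, gap -3) are inferred from the worked cells of the matrix above rather than stated by name on the slides, and `global_score` and `max_of` are hypothetical names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed scores, inferred from the worked matrix: match 8, mismatch -5, gap -3.
my ( $MATCH, $MISMATCH, $GAP ) = ( 8, -5, -3 );

# Hypothetical helper: largest value in a list.
sub max_of {
    my $max = shift;
    for my $value (@_) { $max = $value if $value > $max }
    return $max;
}

# Hypothetical helper: score of the best global alignment of the two sequences.
sub global_score {
    my ( $row_seq, $col_seq ) = @_;
    my @row = split //, $row_seq;
    my @col = split //, $col_seq;

    my @s;
    $s[$_][0] = $_ * $GAP for 0 .. scalar @row;    # first column: leading gaps
    $s[0][$_] = $_ * $GAP for 0 .. scalar @col;    # first row: leading gaps

    for my $i ( 1 .. scalar @row ) {
        for my $j ( 1 .. scalar @col ) {
            my $pair = $row[ $i - 1 ] eq $col[ $j - 1 ] ? $MATCH : $MISMATCH;
            $s[$i][$j] = max_of(
                $s[ $i - 1 ][ $j - 1 ] + $pair,    # match or mismatch
                $s[ $i - 1 ][$j] + $GAP,           # gap (from above)
                $s[$i][ $j - 1 ] + $GAP,           # gap (from the left)
            );
        }
    }
    return $s[-1][-1];
}

print global_score( 'GGATCGA', 'GAATTCAGTTA' ), "\n";    # 31
```

With a protein substitution matrix the only change would be looking up the pair score in a table instead of the two-valued `$MATCH`/`$MISMATCH` test.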

89

Page 90: Bioinformatics Programming

Beginning Perl for Bioinformatics

90

Page 91: Bioinformatics Programming

Biology and computer science

Bioinformatics
– biological data is proliferating rapidly
– computer-based tools now play an increasingly critical role in the advancement of biological research
– bioinformatics, a rapidly evolving discipline, is the application of computational tools and techniques to the management and analysis of biological data

Recently, the new term in silico has become a common reference to biological studies carried out in the computer
– in vivo: in life, that is, in a living organism
– in vitro: in glass, that is, in the test tube
– in silico: in silicon, that is, in an algorithm run on a computer; most computer chips are made primarily of silicon

91

Page 92: Bioinformatics Programming

Getting started with Perl

Perl is a popular programming language that's extensively used in areas such as bioinformatics and web programming

The word Perl refers to
– the language in which you will write programs
  • Perl programs, Perl scripts, Perl code
– the application on your computer that runs those programs
  • Perl interpreter

A low and long learning curve
– one can get started very quickly, and then learn additional topics as needed
  • e.g., object-oriented programming is also well supported in Perl

Perl's benefits
– ease of programming, rapid prototyping, portability, a healthy community and abundant online resources

92

Page 93: Bioinformatics Programming

Perl

Sequences and strings

Example 4-1. Putting DNA into the computer

#!/usr/bin/perl -w
# Storing DNA in a variable, and printing it out
use strict;

# First we store the DNA in a variable called $DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Next, we print the DNA onto the screen
print $DNA;

Checkpoints
– #: a comment in Perl, like // in C
– #!: the magic line in Unix, ignored in Windows
– -w: display warnings, a good habit
– use strict: forces programmers to declare variables first, a good habit
– my: declares a variable
– $: marks a scalar variable
– print: a convenient printf()

93

Page 94: Bioinformatics Programming

Example 4-2. Concatenating DNA

#!/usr/bin/perl -w
# Concatenating DNA

# Store two DNA fragments into two variables
my $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
my $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Using "string interpolation"
my $DNA3 = "$DNA1$DNA2";
print "$DNA3\n\n";

# An alternative way using the "dot operator":
$DNA3 = $DNA1 . $DNA2;
print "$DNA3\n\n";

# Print the same thing without using the variable $DNA3
print $DNA1, $DNA2, "\n\n";
print "$DNA1$DNA2\n\n";

Checkpoints
– "": a string; notice that "$a\n" (interpolated) is different from '$a\n' (taken literally)
– .: the concatenation operator
– print: allows multiple arguments

Maybe the Perl slogan should be, “There are more than two ways to do it.”

94

Page 95: Bioinformatics Programming

Example 4-3. Transcribing DNA into RNA

#!/usr/bin/perl -w
# Transcribing DNA into RNA
use strict;

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Transcribe the DNA to RNA
# Substitute all T's with U's
my $RNA = $DNA;
$RNA =~ s/T/U/g;

print "$RNA\n";

Checkpoints
– =~: the binding operator
– s///: the substitute command, where g (global) is a modifier of s

95

Page 96: Bioinformatics Programming

96

Page 97: Bioinformatics Programming

Using the Perl documentation

A Perl programmer’s most important resource is the Perl documentation
– it should be installed on your computer
– it may also be found on the Internet at the Perl site
– just Google ‘perl’

perldoc
– $ perldoc -f printf

http://www.perl.com
– Documentation link > Perl’s Builtin Functions > Alphabetical Listing of Perl’s Functions

The Perl documentation
– checking the examples it gives is usually the quickest way to learn a function
– it may answer some questions but raise others
  • e.g., the documentation of print starts out: “Prints a string or a comma-separated list of strings.” But then comes a bunch of gibberish (or is it just the learning curve?): Filehandles? Output streams? List context?
– it also includes several tutorials

97

Page 98: Bioinformatics Programming

Example 4-4. Calculating the reverse complement of a strand of DNA

#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
use strict;

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Calculate the reverse complement
# Warning: this attempt will fail!

# First, copy the DNA into new variable $revcom (short for REVerse COMplement)
# Notice that variable names can use lowercase letters like "revcom"
# as well as uppercase like "DNA". In fact, lowercase is more common.
my $revcom = reverse $DNA;

# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G
$revcom =~ s/A/T/g;
$revcom =~ s/T/A/g;
$revcom =~ s/G/C/g;
$revcom =~ s/C/G/g;

print "$revcom\n";

This example has a logical bug

Checkpoints
– reverse(): reverses a string; the parentheses can be omitted

98

Page 99: Bioinformatics Programming

What is the bug?

Page 100: Bioinformatics Programming

Example 4-4. Calculating the reverse complement of a strand of DNA

#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
use strict;

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# The problem is that the first two substitute commands above
# change all the A's to T's (so there are no A's) and then all the
# T's to A's (so all the original A's and T's are all now A's).
# Same thing happens to the G's and C's all turning into G's.

# Make a copy of the DNA
my $revcom = reverse $DNA;

# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G
$revcom =~ tr/ACGTacgt/TGCAtgca/;

print "$revcom\n";

Checkpoints
– tr: the transliteration operator
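The difference between the two attempts is easiest to see side by side on a short input. In this sketch the sub names `revcom_buggy` and `revcom` are hypothetical wrappers around the two versions of Example 4-4, and AACG is a toy sequence chosen so that the two results differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First attempt: four sequential substitutions. Each one also rewrites
# bases that an earlier substitution has already converted.
sub revcom_buggy {
    my ($dna) = @_;
    my $revcom = reverse $dna;
    $revcom =~ s/A/T/g;
    $revcom =~ s/T/A/g;
    $revcom =~ s/G/C/g;
    $revcom =~ s/C/G/g;
    return $revcom;
}

# Second attempt: tr maps every character exactly once, in a single pass.
sub revcom {
    my ($dna) = @_;
    my $revcom = reverse $dna;
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}

print revcom_buggy('AACG'), "\n";    # GGAA (wrong: only A and G survive)
print revcom('AACG'),       "\n";    # CGTT
```

A quick sanity check for any reverse-complement routine is that applying it twice returns the original sequence.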

100

Page 101: Bioinformatics Programming

101

Page 102: Bioinformatics Programming

Example 4-5. Reading protein sequence data from a file

#!/usr/bin/perl -w
# Reading protein sequence data from a file
use strict;

# The filename of the file containing the protein sequence data
my $proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file, and associate a "filehandle" with it.
open FH, $proteinfilename;

# Now we do the actual reading of the protein sequence data from the file, by using
# the angle brackets < and > to get the input from the filehandle. We store the data
# into our variable $protein.
my $protein = <FH>;

# Now that we've got our data, we can close the file.
close FH;

print $protein;

Checkpoints
– open(): opens a file, like fopen() in C
– filehandle: an interface between the program and the file system, a very common concept
– <>: the line-input operator in Perl; it reads from a filehandle
– close(): closes a file, like fclose() in C

102

Page 103: Bioinformatics Programming

Example 4-6. Reading protein sequence data from a file, take 2

#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 2
use strict;

# Open a file
open FH, 'NM_021964fragment.pep';

# Suppose that the file has three lines, and since the read only is returning
# one line, we'll read a line and print it, three times.

# First line
my $protein = <FH>;
print $protein;

# Second line
$protein = <FH>;
print $protein;

# Third line
$protein = <FH>;
print $protein;

close FH;

Checkpoints
– how clumsy this is: the number of lines is hard-coded and the same two statements are repeated

103

Page 104: Bioinformatics Programming

Example 4-7. Reading protein sequence data from a file, take 3

#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 3
use strict;

# Open a file
open FH, 'NM_021964fragment.pep';

# Read the protein sequence data from the file, and
# store it into the array variable @protein
my @protein = <FH>;

print @protein;

close FH;

Checkpoints
– @: marks an array variable
  • $protein[0] is the first element of @protein
– print: knows how to print an array, no loop is required

104

Page 105: Bioinformatics Programming

Perl facilities for array
– my @bases = ( 'A', 'C', 'G', 'T' );
  print @bases;               # A C G T

pop, push (for stack)
– my $base1 = pop @bases;
  print $base1;               # T
  print @bases;               # A C G
  push @bases, $base1;
  print @bases;               # A C G T

shift, unshift (for queue)
– my $base1 = shift @bases;
  print $base1;               # A
  print @bases;               # C G T
  unshift @bases, $base1;
  print @bases;               # A C G T

reverse
– my @reverse = reverse @bases;
  print @reverse;             # T G C A

scalar (length of an array)
– print scalar @bases;        # 4

splice (insert/delete/replace elements at an arbitrary place)
– extremely powerful!
– splice @bases, 2, 0, 'X';
  print @bases;               # A C X G T
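The three uses of splice named above (delete, insert, replace) differ only in the length argument and in whether replacement elements follow. A short sketch, using the same @bases array (the variable name `@removed` is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases = ( 'A', 'C', 'G', 'T' );

# Delete: remove 2 elements starting at index 1; splice returns what it removed.
my @removed = splice @bases, 1, 2;
print "@bases | @removed\n";    # A T | C G

# Insert: a length of 0 removes nothing and puts the list in before index 1.
splice @bases, 1, 0, @removed;
print "@bases\n";               # A C G T

# Replace: remove 1 element at the last index and put 'U' in its place.
splice @bases, -1, 1, 'U';
print "@bases\n";               # A C G U
```

Putting the array inside double quotes, as in `"@bases"`, joins the elements with spaces, which makes the printed contents easier to read than a bare `print @bases`.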

105

Page 106: Bioinformatics Programming

Any Questions?

106

About Perl facilities for array

Page 107: Bioinformatics Programming

How to

107

my @a = ( '1', '2', '3', '4' );

splice @a, 2, 1, 'a', 'b';

print @a; # 1 2 a b 4

answer

Page 108: Bioinformatics Programming

Example 4-8. Scalar context and list context

#!/usr/bin/perl -w
# Demonstration of "scalar context" and "list context"
use strict;

my @bases = ( 'A', 'C', 'G', 'T' );

my $a = @bases;
print $a;    # 4

($a) = @bases;
print $a;    # A

Checkpoints
– many Perl operations behave differently depending on the context in which they are used
– another example is reverse(), which reverses the characters of a string in scalar context and the order of the elements in list context
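The context-dependent behaviour of reverse is a common source of surprises, so it is worth a small sketch (the variable names are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases = ( 'A', 'C', 'G', 'T' );

# List context: the order of the elements is reversed.
my @in_list_context = reverse @bases;

# Scalar context: the elements are concatenated, then the characters are reversed.
my $in_scalar_context = reverse @bases;

# A string in scalar context: its characters are reversed.
my $string_reversed = reverse 'AACG';

# A string in list context: a one-element list is "reversed", so nothing changes.
my @string_in_list = reverse 'AACG';

print "@in_list_context\n";     # T G C A
print "$in_scalar_context\n";   # TGCA
print "$string_reversed\n";     # GCAA
print "@string_in_list\n";      # AACG
```

This is why `print reverse $DNA;` prints the sequence unchanged (print supplies list context), while `my $revcom = reverse $DNA;` reverses it.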

108

Page 109: Bioinformatics Programming

Perl

Flow control

Example 5-1. if-elsif-else

#!/usr/bin/perl -w
# if-elsif-else
use strict;

my $word = 'MNIDDKL';

# if-elsif-else conditionals
if ( 'QSTVSGE' eq $word ) {
    print "QSTVSGE\n";
} elsif ( 'MNIDDKL' eq $word ) {
    print "MNIDDKL--the magic word!\n";
} else {
    print "Is \"$word\" a peptide?\n";
}

Checkpoints
– elsif: else if
– eq: equal, checks for equality between strings
– ne: not equal, checks for inequality between strings

109

Page 110: Bioinformatics Programming

Example 5-2. Reading protein sequence data from a file, take 4

#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 4
use strict;

# In case the open fails, print an error message and exit.
my $fn = 'NM_021964fragment.pep';
unless ( open FH, $fn ) {
    print "Could not open file $fn!\n";
    exit;
}

# Read and print line-by-line.
my $protein;
while ( $protein = <FH> ) { print $protein; }

close FH;

Checkpoints
– unless: the opposite of if, just to read more like English
– open: returns false when it fails
– exit: terminates the program, like exit() in C
– In Perl, an assignment returns the value assigned. If there is another line to read in, the assignment occurs, $protein is not null, and the conditional is true.
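The same read loop is often written with a lexical filehandle and the three-argument form of open, which is what `perldoc -f open` recommends for new code. The sketch below is an alternative spelling, not the slides' version: the sub name `read_sequence` is hypothetical, and the file is a temporary one created with the core module File::Temp, holding made-up residues, so that the example runs without NM_021964fragment.pep.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small temporary file with toy content (not real sequence data).
my ( $out, $filename ) = tempfile( UNLINK => 1 );
print {$out} "MNIDDKLQST\nVSGEPELIRD\n";
close $out or die "Could not close $filename: $!\n";

# Hypothetical helper: read a file line by line and join the lines
# into one sequence string, dying with a message if the open fails.
sub read_sequence {
    my ($path) = @_;
    open my $fh, '<', $path or die "Could not open file $path: $!\n";
    my $sequence = '';
    while ( my $line = <$fh> ) {
        chomp $line;            # drop the trailing newline
        $sequence .= $line;
    }
    close $fh;
    return $sequence;
}

print read_sequence($filename), "\n";    # MNIDDKLQSTVSGEPELIRD
```

The lexical `$fh` is closed automatically when it goes out of scope, and `$!` reports why an open failed.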

110

Page 111: Bioinformatics Programming

Perl

Code layout

Format A
– while ($a) {
      if ($b) {
          print "ok\n";
      }
  }

Format B
– while ($a)
  {
      if ($b)
      {
          print "ok\n";
      }
  }

Format C
– while ($a)
  {
  if ($b)
  {
  print "ok\n";
  }
  }

Format D
– while($a){if($b){print "ok\n";}}

A and B are common ways to lay out code; A is preferred in Perl

DON’T use C or D, ever!

Perl provides a guide for code style
– $ perldoc perlstyle
– however, these are guidelines rather than rules, and you may use your own judgment

111

Page 112: Bioinformatics Programming

Perl

Subroutine

Example 6-1. A subroutine to append ACGT to DNA

#!/usr/bin/perl -w
# A program with a subroutine to append ACGT to DNA
use strict;

my $dna = 'CGACGTCTTCTCAGGCGA';

# The argument is $dna; the result is $longer_dna
my $longer_dna = &addACGT($dna);
print $longer_dna;

####################
# Subroutines for Example 6-1
sub addACGT {
    my $dna = shift @_;
    $dna .= 'ACGT';
    return $dna;
}

Checkpoints
– &: calls a subroutine; it can be omitted but helps syntax highlighting (e.g., in vi)
– sub: subroutine definition
– my $dna: a scoping issue (this $dna belongs to the subroutine only)
– @_: the argument array, a special variable in Perl
– return: returns a value

112

Page 113: Bioinformatics Programming

Perl

Scoping

Make the variables specific to the subroutine with my
– my is a keyword in Perl that limits variables to the block where they are used (here the block is the subroutine)

Making variables local to a restricted part of a program is called scoping. There are different models of scoping:
– in Perl, using my variables is known as lexical scoping, also known as static scoping
– dynamic scoping is hard to track; use it as little as possible

Manipulating $_[0] directly changes the caller's argument (call by reference)
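The difference between copying the argument into a my variable and writing to $_[0] can be seen in a few lines. In this sketch the sub names `append_copy` and `append_in_place` are hypothetical; the DNA string is the one used in Example 6-1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Works on a private copy: the caller's variable is untouched.
sub append_copy {
    my ($dna) = @_;
    $dna .= 'ACGT';
    return $dna;
}

# Works on $_[0], which is an alias of the caller's variable:
# the caller sees the change.
sub append_in_place {
    $_[0] .= 'ACGT';
    return;
}

my $dna    = 'CGACGTCTTCTCAGGCGA';
my $longer = append_copy($dna);
print "after append_copy:     $dna\n";    # unchanged
append_in_place($dna);
print "after append_in_place: $dna\n";    # now ends in ACGT
```

Copying first, as Example 6-1 does with `my $dna = shift @_;`, is the safer default because the subroutine then cannot modify its caller's data by accident.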

113

Page 114: Bioinformatics Programming

Any Questions?

114

About Perl subroutine

Page 115: Bioinformatics Programming

How to return multiple values?

Answer #1
return ( 'a', 'b' );

Answer #2
$_[0] = 'a'; $_[1] = 'b';
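Both answers can be tried on a small bioinformatics task, counting G/C and A/T bases. This is a sketch only: the sub names `gc_and_at` and `fill_gc_and_at` are hypothetical, and the DNA string is the one from Example 6-1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Answer #1: return a list and let the caller unpack it.
sub gc_and_at {
    my ($dna) = @_;
    my $gc = ( $dna =~ tr/GCgc// );    # tr with an empty replacement only counts
    my $at = ( $dna =~ tr/ATat// );
    return ( $gc, $at );
}

# Answer #2: write the results into the caller's variables through @_.
sub fill_gc_and_at {
    my $dna = $_[0];
    $_[1] = ( $dna =~ tr/GCgc// );
    $_[2] = ( $dna =~ tr/ATat// );
    return;
}

my ( $gc, $at ) = gc_and_at('CGACGTCTTCTCAGGCGA');
print "GC=$gc AT=$at\n";               # GC=11 AT=7

my ( $gc2, $at2 );
fill_gc_and_at( 'CGACGTCTTCTCAGGCGA', $gc2, $at2 );
print "GC=$gc2 AT=$at2\n";             # GC=11 AT=7
```

Returning a list is usually easier to read; writing through @_ hides the fact that the arguments are outputs.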