Bioinformatics Programming


1

Tien-Hao Chang (Darby Chang), EE, NCKU

Molecular biology

Nucleic acid
– DNA
– RNA

Central dogma
– Transcription
– Translation

Protein
– Amino acid
– Primary structure
– Secondary structure
– Tertiary structure

2

Nucleic acid

A nucleic acid is a macromolecule composed of chains of monomeric nucleotides

In biochemistry, these molecules carry genetic information or form structures within cells

The most common nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA)

3

4

http://juang.bst.ntu.edu.tw/BC2008/images/NA%20Fig1.jpg

Nucleic acid components

Sugar

5

http://www.mun.ca/biology/scarr/Fg10_09b_revised.gif

Nucleic acid components

Base

Purine
– Adenine (A) and guanine (G)

Pyrimidine
– Thymine (T) and cytosine (C)
– Uracil (U, only in RNA)

6

7

http://www.elmhurst.edu/~chm/vchembook/images/580bases.gif

8

http://fig.cox.miami.edu/~cmallery/150/chemistry/sf3x14a.jpg

DNA

Chemically, DNA is a long polymer of simple units called nucleotides, with a backbone made of sugars and phosphate groups joined by ester bonds

Attached to each sugar is one of four types of molecules called bases

It is the sequence of these four bases along the backbone that encodes information

9

http://upload.wikimedia.org/wikipedia/commons/8/87/DNA_orbit_animated_small.gif

DNA

Base pairing

Each type of base on one strand forms a bond with just one type of base on the other strand

Purines form hydrogen bonds to pyrimidines, with A bonding only to T, and C bonding only to G

The two types of base pairs form different numbers of hydrogen bonds: AT forms two hydrogen bonds and GC forms three

Chargaff's rule
– A=T and G=C

DNA sequence
– 5’CpGpCpApApTpT
  3’TpTpApApCpGpC
– CGCGAATT

10

11

http://openlearn.open.ac.uk/file.php/2645/S377_1_005i.jpg

12

Double helix

http://www.coe.drexel.edu/ret/personalsites/2005/dayal/curriculum1_files/image001.jpg

Hydrogen bond

A hydrogen bond exists between an electronegative atom and a hydrogen atom bonded to another electronegative atom

This type of force always involves a hydrogen atom, and the energy of this attraction is close to that of weak covalent bonds (155 kJ/mol), hence the name hydrogen bonding

Biological functions
– DNA/RNA base pairing
– protein secondary/tertiary structure formation
– some properties of the water molecule
– antibody-antigen (and other protein-protein) binding

13

14

Hydrogen bonds result from electronegativity

http://upload.wikimedia.org/wikipedia/commons/4/43/Liquid_water_hydrogen_bond.png

15

Major and minor grooves

http://courses.biology.utah.edu/horvath/biol.3525/1_DNA/Fig2/marty_1.jpg

DNA structure

16

http://www.youtube.com/watch?v=qy8dk5iS1f0&NR=1

Any Questions?

17

About DNA

Central dogma

18

19

http://fig.cox.miami.edu/~cmallery/255/255hist/mcb4.1.dogma.jpg

Central dogma

The process by which information is extracted from the nucleotide sequence of a gene and then used to make a protein is essentially the same for all living things on Earth and is described by the grandly named central dogma of molecular biology

Information in cells passes from DNA to RNA to proteins

20

http://upload.wikimedia.org/wikipedia/commons/3/3a/Crick's_1958_central_dogma.svg

RNA

Information stored in DNA is used to make a more transient, single-stranded polynucleotide called RNA (ribonucleic acid)

RNA is very similar to DNA, but differs in a few important structural details
– in the cell, RNA is usually single-stranded, while DNA is usually double-stranded
– RNA nucleotides contain ribose, while DNA nucleotides contain deoxyribose (a type of ribose that lacks one oxygen atom)
– in RNA, the base uracil substitutes for thymine, which is present in DNA

21

22

http://www.dadamo.com/wiki/dna-rna.png

Central dogma

Transcription

Transcription is the synthesis of RNA under the direction of DNA

Both nucleic acid sequences use the same language, and the information is simply transcribed, or copied, from one molecule to the other

The DNA sequence is enzymatically copied by RNA polymerase to produce a complementary RNA strand, called messenger RNA (mRNA)

23

DNA transcription

24

http://www.youtube.com/watch?v=vJSmZ3DsntU

Transcription detail

25

http://www-class.unl.edu/biochem/gp2/m_biology/animation/m_animations/gene2.swf

RNA

Various types

mRNA
– messenger RNA (mRNA) is the RNA that carries information from DNA to the ribosome
– the coding sequence of the mRNA determines the amino acid sequence in the protein that is produced

Non-coding RNA
– many RNAs do not code for protein
– these non-coding RNAs can be encoded by their own genes (RNA genes), but can also derive from mRNA introns
– the most prominent examples of non-coding RNAs are transfer RNA (tRNA) and ribosomal RNA (rRNA), both of which are involved in the process of translation
– there are also non-coding RNAs involved in gene regulation, RNA processing and other roles

26

Central dogma

Translation

Translation is the second stage of protein biosynthesis

Translation occurs in the cytoplasm, where the ribosomes are located

In translation, mRNA is decoded to produce a specific polypeptide according to the rules specified by the genetic code

Many types of transcribed RNA, such as transfer RNA, ribosomal RNA, and small nuclear RNA, are not necessarily translated into an amino acid sequence
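The decoding step can be sketched in Perl as a lookup of successive three-letter codons. This is only an illustration: the hash %codon_table and the subroutine translate_mrna are hypothetical names, and the table lists just a handful of the 64 codons of the standard genetic code (a real program would list them all).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative subset of the standard genetic code ('*' marks a stop codon).
# A complete table would contain all 64 codons.
my %codon_table = (
    AUG => 'M', GCC => 'A', AAG => 'K', UGG => 'W', UUU => 'F', GGU => 'G',
    UAA => '*', UAG => '*', UGA => '*',
);

# Translate an mRNA string codon by codon, stopping at the first stop codon.
# Codons missing from the (partial) table are reported as 'X'.
sub translate_mrna {
    my ($mrna) = @_;
    my $protein = '';
    for ( my $i = 0 ; $i + 3 <= length($mrna) ; $i += 3 ) {
        my $codon = substr( $mrna, $i, 3 );
        my $aa = exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
        last if $aa eq '*';
        $protein .= $aa;
    }
    return $protein;
}

print translate_mrna('AUGGCCAAGUAA'), "\n";    # MAK
```

Reading stops at UAA, so the stop codon itself contributes no residue to the output.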

27

From RNA to protein synthesis

28

http://www.youtube.com/watch?v=NJxobgkPEAo

Protein translation

29

http://www.youtube.com/watch?v=nl8pSlonmA0

30

Genetic code

http://biology.kenyon.edu/courses/biol114/Chap05/code.gif

Any Questions?

31

About central dogma

Protein

32

Protein

Proteins are large organic compounds made of amino acids arranged in a linear chain and joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues

Proteins can also work together to achieve a particular function, and they often associate to form stable complexes

33

Protein

Amino acid

In chemistry, an amino acid is a molecule that contains both amine and carboxyl functional groups

In biochemistry, this term refers to alpha-amino acids with the general formula H2NCHRCOOH, where R is an organic substituent

In the alpha-amino acids, the amino and carboxylate groups are attached to the same carbon, which is called the α-carbon

34

35

http://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/AminoAcidball.svg/702px-AminoAcidball.svg.png

Amino acid

Various side chains

The various alpha-amino acids differ in which side chain (R group) is attached to their alpha carbon

They can vary in size from just a hydrogen atom in glycine, through a methyl group in alanine, to a large heterocyclic group in tryptophan

36

37

http://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Aa.svg/2000px-Aa.svg.png

38

http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-7.JPG

39


40

http://www.russell.embl-heidelberg.de/aas/other_images/lb3.gif

Amino acid

The building blocks of proteins

Amino acids combine in a condensation reaction that releases water; the resulting “amino acid residues” are held together by a peptide bond

Proteins are defined by their unique sequence of amino acid residues; this sequence is the primary structure of the protein

Just as the letters of the alphabet can be combined to form an almost endless variety of words, amino acids can be linked in varying sequences to form a vast variety of proteins

41

42

Peptide bond

http://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Peptidformationball.svg/2000px-Peptidformationball.svg.png

43


44


Protein

From amino acids to proteins

Amino acids form short polymer chains called peptides, or longer chains called either polypeptides or proteins

The process of forming such chains from an mRNA template is known as translation, which is part of protein biosynthesis

Twenty amino acids are encoded by the standard genetic code and are called proteinogenic or standard amino acids

45

Protein structure hierarchy

46

47

http://cropandsoil.oregonstate.edu/classes/css430/lecture%209-07/figure-09-03.JPG

48

http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-4.JPG

49


50



Secondary structure

In biochemistry and structural biology, secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA)

It does not, however, describe specific atomic positions in three-dimensional space, which are considered to be tertiary structure

51

52



Tertiary structure

The three-dimensional structure of a protein or any other macromolecule, as defined by the atomic coordinates

It describes the spatial relations among its secondary structures

Tertiary structure is considered to be largely determined by the protein’s primary sequence, that is, the sequence of amino acids of which it is composed

The majority of protein structures known to date have been solved with the experimental technique of X-ray crystallography

A second common way of solving protein structures uses NMR (nuclear magnetic resonance)
– it yields lower-resolution data and is limited to relatively small proteins
– it can provide time-dependent information about the motion of a protein in solution

53

54

http://campusapps.fullerton.edu/news/arts/2003/photos/protein-art.jpg


Quaternary structure

Many proteins are actually assemblies of more than one polypeptide chain, which in the context of the larger assemblage are known as protein subunits

In addition to the tertiary structure of the subunits, multiple-subunit proteins possess a quaternary structure, which is the arrangement into which the subunits assemble

55

http://courses.cm.utexas.edu/jrobertus/ch339k/overheads-1/ch6_quat-struct1.jpg

Protein sub-structure

56


Domain

A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain

Each domain forms a compact three-dimensional structure and often can be independently stable and folded

Domains vary in length from about 25 amino acids up to about 500 amino acids

The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges

Domains often form functional units

57

http://upload.wikimedia.org/wikipedia/commons/6/67/1pkn.png

Protein domain

Zinc finger

58

http://upload.wikimedia.org/wikipedia/commons/7/79/Zinc_finger_DNA_complex.png


Motif

Sequence motif
– a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance
– for proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent

Structural motif
– a three-dimensional structural element or fold within the chain, which also appears in a variety of other molecules
– in the context of proteins, the term is sometimes used interchangeably with “structural domain,” although a domain need not be a motif nor, if it contains a motif, need it be made up of only one

59

60

61

http://www.biomedcentral.com/content/figures/1471-2164-8-60-8.jpg

62


Molecular biology

Reference

Website of Prof. Rong-Huay Juang, National Taiwan University
– http://juang.bst.ntu.edu.tw/BC2008/index.htm

Molecular biology website, National Chiao Tung University
– http://www.life.nctu.edu.tw/~mb/c40101.htm

63


Any Questions?

64

About molecular biology

65

Sequence alignment

In: a FASTA file
Out: pairwise sequence alignment

Requirement
- output alignment score (identity)
- complexity/teamwork report
- using Perl would be the best

Bonus
- alignment allowing mismatches
- output alignment

66

Deadline: 2010/4/27 23:59

Zip your code, a step-by-step README, complexity analyses, and anything worthy of extra credit. Email to [email protected].

Input

– Download from UniProt

– UniProt is the universal protein resource, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information.

– http://www.uniprot.org/uniprot/?query=Saccharomyces+cerevisiae+transcription+factor+AND+reviewed%3ayes&force=yes&format=fasta

>sp|P32333|MOT1_YEAST TATA-binding protein-associated fac...

MTSRVSRLDRQVILIETGSTQVVRNMAADQMGDLAKQHPEDILSLLSRVYPFLLVKKWET

...

TFIKTLR

>sp|Q00947|STP1_YEAST Transcription factor STP1...

MPSTTLLFPQKHIRAIPGKIYAFFRELVSGVIISKPDLSHHYSCENATKEEGKDAADEEK

...

>sp|P38830|NDT80_YEAST Meiosis-specific transcription fac...

MNEMENTDPVLQDDLVSKYERELSTEQEEDTPVILTQLNEDGTTSNYFDKRKLKIAPRST

...

Output

>MOT1_YEAST

STP1_YEAST 90

NDT80_YEAST 80

>STP1_YEAST

MOT1_YEAST 90

NDT80_YEAST 70

>NDT80_YEAST

MOT1_YEAST 80

STP1_YEAST 70
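Reading the FASTA input could look roughly like the Perl sketch below. The subroutine name read_fasta, the choice to keep the third "|"-separated field of a UniProt header as the identifier, and the two short demo records are all assumptions made for illustration; they are not part of the assignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read FASTA records from an open filehandle.
# Returns a list of [ id, sequence ] pairs in file order.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        $line =~ s/\r$//;              # tolerate Windows line endings
        next if $line eq '';
        if ( $line =~ /^>(\S+)/ ) {
            # Header such as "sp|P32333|MOT1_YEAST ...": keep the entry name
            my @fields = split /\|/, $1;
            my $id = @fields >= 3 ? $fields[2] : $1;
            push @records, [ $id, '' ];
        }
        elsif (@records) {
            $records[-1][1] .= $line;  # sequence lines are concatenated
        }
    }
    return @records;
}

# Demonstration on an in-memory string with made-up records
my $text = ">sp|X00001|DEMO1_YEAST demo\nMTSR\nVSRL\n>sp|X00002|DEMO2_YEAST demo\nMPST\n";
open( my $fh, '<', \$text ) or die "Cannot open in-memory file: $!";
for my $rec ( read_fasta($fh) ) {
    print "$rec->[0]\t$rec->[1]\n";
}
close $fh;
```

For a real file, the same subroutine can be given a handle opened with open( my $fh, '<', $filename ).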


Sequence similarity

Sequence identity

Sequence alignment
– dynamic programming
– backtracking
– substitution matrix

68

Sequence similarity

Identity

Which sequence is more similar to DKELIR?
– EPELIR or DKGLIR

A trivial (but useful) concept: identity
– DKELIR
  EPELIR
    ****   identity: 4/6 = 66.7%
– DKELIR
  DKGLIR
  ** ***   identity: 5/6 = 83.3%
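The identity of two equal-length sequences can be computed position by position. The Perl sketch below is a minimal illustration; the subroutine name identity is hypothetical, and it assumes the sequences are already aligned without gaps.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fraction of positions at which two equal-length sequences agree.
sub identity {
    my ( $s, $t ) = @_;
    die "sequences must have the same non-zero length\n"
      unless length($s) > 0 && length($s) == length($t);
    my $matches = 0;
    for my $i ( 0 .. length($s) - 1 ) {
        $matches++ if substr( $s, $i, 1 ) eq substr( $t, $i, 1 );
    }
    return $matches / length($s);
}

printf "%.1f%%\n", 100 * identity( 'DKELIR', 'EPELIR' );    # 66.7%
printf "%.1f%%\n", 100 * identity( 'DKELIR', 'DKGLIR' );    # 83.3%
```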

69

Sequence similarity

Alignment

When two sequences have different lengths
– DKELIR
  MERPEPELIR   identity: 0%

Obviously, we need to shift the first sequence by 4 residues
–     DKELIR
  MERPEPELIR

This shifting is called alignment
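The shifting idea can be tried out by sliding the shorter sequence along the longer one and counting matches at each offset. This Perl sketch handles only ungapped shifts; the name best_shift is hypothetical, and it assumes the first argument is no longer than the second.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Try every ungapped offset of $short within $long and return
# (offset, matches) for the offset with the most matching positions.
sub best_shift {
    my ( $short, $long ) = @_;
    my ( $best_offset, $best_matches ) = ( 0, -1 );
    for my $offset ( 0 .. length($long) - length($short) ) {
        my $matches = 0;
        for my $i ( 0 .. length($short) - 1 ) {
            $matches++
              if substr( $short, $i, 1 ) eq substr( $long, $offset + $i, 1 );
        }
        ( $best_offset, $best_matches ) = ( $offset, $matches )
          if $matches > $best_matches;
    }
    return ( $best_offset, $best_matches );
}

my ( $offset, $matches ) = best_shift( 'DKELIR', 'MERPEPELIR' );
print ' ' x $offset, "DKELIR\n", "MERPEPELIR\n";
print "offset $offset, $matches matches\n";    # offset 4, 4 matches
```

Gaps inside either sequence are beyond this sketch; the dynamic-programming slides that follow address them.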

Sequence alignment

A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

More complex alignments may involve gaps (EDKELIR vs. MERPEPELIR)
–  E----DKELIR
  MERPEP--ELIR
   *      ****   identity: 5/11 = 45.5%

And a substitution matrix
–  E--DKELIR
  MERPEPELIR
   *   .****   identity: 6/9 = 66.7%

71

Sequence alignment

Dynamic programming

    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6

72

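Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure

In this matrix, each element $s_{i,j}$ is the best alignment score between the two corresponding sub-sequences (the first $i$ letters of the row sequence $a$ and the first $j$ letters of the column sequence $b$)
– the key is to find the relationship between the problem $s_{i,j}$ and its sub-problems $s_{\alpha,\beta}$, where $\alpha \le i$ and $\beta \le j$

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } a_i = b_j \text{ (match)} \\
s_{i-1,j} & \text{(deletion gap)} \\
s_{i,j-1} & \text{(insertion gap)}
\end{cases}
$$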
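The matrix on the previous slide can be reproduced with a short Perl sketch of this recurrence: a match adds 1 to the diagonal neighbour, and gaps cost nothing. The subroutine name fill_matrix is hypothetical; rows follow the first argument and columns the second, as on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fill the dynamic-programming matrix for the "match = +1" scoring.
# Returns a reference to a 2-D array with one extra leading row and column of zeros.
sub fill_matrix {
    my ( $row_seq, $col_seq ) = @_;
    my @x = split //, $row_seq;
    my @y = split //, $col_seq;
    my @s;
    $s[$_][0] = 0 for 0 .. scalar(@x);
    $s[0][$_] = 0 for 0 .. scalar(@y);
    for my $i ( 1 .. scalar(@x) ) {
        for my $j ( 1 .. scalar(@y) ) {
            # Best of the cell above (deletion gap) and the cell to the left (insertion gap)
            my $best = $s[ $i - 1 ][$j] > $s[$i][ $j - 1 ] ? $s[ $i - 1 ][$j] : $s[$i][ $j - 1 ];
            # The diagonal move counts only when the two letters match
            if ( $x[ $i - 1 ] eq $y[ $j - 1 ] && $s[ $i - 1 ][ $j - 1 ] + 1 > $best ) {
                $best = $s[ $i - 1 ][ $j - 1 ] + 1;
            }
            $s[$i][$j] = $best;
        }
    }
    return \@s;
}

my $s = fill_matrix( 'GGATCGA', 'GAATTCAGTTA' );
print join( ' ', @$_ ), "\n" for @$s;          # one matrix row per line
print "best score: $s->[7][11]\n";             # best score: 6
```

The two nested loops visit each cell once, so the running time is proportional to the product of the two sequence lengths.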

Dynamic programming

Insertion gap

    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 ?
C 0
G 0
A 0


74

Dynamic programming

Deletion gap

    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 ?
G 0
A 0

75


Dynamic programming

Match

    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 ?
C 0
G 0
A 0

76


Dynamic programming

Relationship

Two key ingredients for an optimization problem to be suitable for a dynamic-programming solution
– optimal substructure: an optimal solution is built from optimal solutions to its sub-problems
– sub-problems are dependent (they overlap); otherwise, a divide-and-conquer approach is the better choice

Now that we know the three relationships
– insertion gap
– deletion gap
– match

we can easily construct an alignment from this matrix with the so-called backtracking technique

77

Dynamic programming

Backtracking

    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6

78

Backtracking

Alternative paths

    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6

79

Backtracking

The backtracking algorithm enumerates a set of partial candidates that, in principle, could be completed in various ways to give all the possible solutions to the given problem

A dynamic programming matrix can produce all possible alignments with the best score by following different backtracking paths

Alternative paths
– G-AATTCAGTTA
  GGA-T-C-G--A
  * * * * *  *   identity: 6/12 = 50%
– G-AATTCAGTTA
  GGA--TC-G--A
  * *  ** *  *   identity: 6/12 = 50%
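One backtracking path can be recovered with the Perl sketch below, which fills the match-count matrix and then walks back from the bottom-right corner. The name align_by_backtracking and the tie-breaking order (diagonal first, then up, then left) are assumptions of this sketch; a different order would yield one of the alternative paths, and enumerating all of them is not attempted here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns (aligned column sequence, aligned row sequence, score).
sub align_by_backtracking {
    my ( $row_seq, $col_seq ) = @_;
    my @x = split //, $row_seq;
    my @y = split //, $col_seq;

    # Fill step: a match adds 1 to the diagonal neighbour, gaps cost 0
    my @s;
    $s[$_][0] = 0 for 0 .. scalar(@x);
    $s[0][$_] = 0 for 0 .. scalar(@y);
    for my $i ( 1 .. scalar(@x) ) {
        for my $j ( 1 .. scalar(@y) ) {
            my $best = $s[ $i - 1 ][$j] > $s[$i][ $j - 1 ] ? $s[ $i - 1 ][$j] : $s[$i][ $j - 1 ];
            if ( $x[ $i - 1 ] eq $y[ $j - 1 ] && $s[ $i - 1 ][ $j - 1 ] + 1 > $best ) {
                $best = $s[ $i - 1 ][ $j - 1 ] + 1;
            }
            $s[$i][$j] = $best;
        }
    }

    # Backtracking step: at each cell, move to a neighbour that explains its value
    my ( $i, $j ) = ( scalar(@x), scalar(@y) );
    my $score = $s[$i][$j];
    my ( $top, $bottom ) = ( '', '' );
    while ( $i > 0 || $j > 0 ) {
        if (   $i > 0 && $j > 0
            && $x[ $i - 1 ] eq $y[ $j - 1 ]
            && $s[$i][$j] == $s[ $i - 1 ][ $j - 1 ] + 1 )
        {
            # match: both letters are consumed
            $top    = $y[ $j - 1 ] . $top;
            $bottom = $x[ $i - 1 ] . $bottom;
            $i--; $j--;
        }
        elsif ( $i > 0 && ( $j == 0 || $s[$i][$j] == $s[ $i - 1 ][$j] ) ) {
            # came from above: a row letter faces a gap
            $top    = '-' . $top;
            $bottom = $x[ $i - 1 ] . $bottom;
            $i--;
        }
        else {
            # came from the left: a column letter faces a gap
            $top    = $y[ $j - 1 ] . $top;
            $bottom = '-' . $bottom;
            $j--;
        }
    }
    return ( $top, $bottom, $score );
}

my ( $top, $bottom, $score ) = align_by_backtracking( 'GGATCGA', 'GAATTCAGTTA' );
print "$top\n$bottom\nscore: $score\n";
```

Every diagonal step lowers the score by exactly one, so the number of matching columns in the printed alignment equals the score in the bottom-right cell.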

80

Sequence alignment

Substitution matrix

Some alignments may involve mismatches
–  E----DKELIR
  MERPEP--ELIR
   *      ****   identity: 5/11 = 45.5%
–  E--DKELIR
  MERPEPELIR
   *   .****   identity: 6/9 = 66.7%

81

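$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + \mathrm{score}(a_i, b_j) \\
s_{i-1,j} + \mathrm{score}(a_i, -) \\
s_{i,j-1} + \mathrm{score}(-, b_j)
\end{cases}
$$

Substitution matrix (match 8, mismatch -5, gap -3)

     -   A   C   G   T
-       -3  -3  -3  -3
A   -3   8  -5  -5  -5
C   -3  -5   8  -5  -5
G   -3  -5  -5   8  -5
T   -3  -5  -5  -5   8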

Sequence alignment with substitution matrix

Initialize

G A A T T C A G T T A

0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33

G -3

G -6

A -9

T -12

C -15

G -18

A -21

82



Match

G A A T T C A G T T A

0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33

G -3 ?

G -6

A -9

T -12

C -15

G -18

A -21

83



Deletion gap

G A A T T C A G T T A

0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33

G -3 8 ?

G -6

A -9

T -12

C -15

G -18

A -21

84



Insertion gap

G A A T T C A G T T A

0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33

G -3 8 5 2 -1 -4 -7 -10 -13 -16 -19 -22

G -6 ?

A -9

T -12

C -15

G -18

A -21

85



Mismatch

G A A T T C A G T T A

0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33

G -3 8 5 2 -1 -4 -7 -10 -13 -16 -19 -22

G -6 5 ?

A -9

T -12

C -15

G -18

A -21

86



Initialize

G A A T T C A G T T A

0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33

G -3 8 5 2 -1 -4 -7 -10 -13 -16 -19 -22

G -6 5 3 ?

A -9

T -12

C -15

G -18

A -21


87


Complete

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    8    5    2   -1   -4   -7  -10  -13  -16  -19  -22
G   -6    5    3    0   -3   -6   -9  -12   -2   -5   -8  -11
A   -9    2   13   11    8    5    2   -1   -4   -7  -10    0
T  -12   -1   10    8   19   16   13   10    7    4    1   -2
C  -15   -4    7    5   16   14   24   21   18   15   12    9
G  -18   -7    4    2   13   11   21   19   29   26   23   20
A  -21  -10    1   12   10    8   18   29   26   24   21   31
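The completed matrix can be recomputed with the Perl sketch below, which applies the recurrence with the slide's scores (match 8, mismatch -5, gap -3). The subroutine name global_score_matrix and its parameter order are hypothetical, and a single match/mismatch pair stands in for a full substitution matrix.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fill the global-alignment matrix; rows follow $row_seq, columns follow $col_seq.
sub global_score_matrix {
    my ( $row_seq, $col_seq, $match, $mismatch, $gap ) = @_;
    my @x = split //, $row_seq;
    my @y = split //, $col_seq;
    my @s;
    # Initialize: aligning a prefix against nothing costs one gap per letter
    $s[$_][0] = $_ * $gap for 0 .. scalar(@x);
    $s[0][$_] = $_ * $gap for 0 .. scalar(@y);
    for my $i ( 1 .. scalar(@x) ) {
        for my $j ( 1 .. scalar(@y) ) {
            my $diag = $s[ $i - 1 ][ $j - 1 ]
              + ( $x[ $i - 1 ] eq $y[ $j - 1 ] ? $match : $mismatch );
            my $up   = $s[ $i - 1 ][$j] + $gap;    # deletion gap
            my $left = $s[$i][ $j - 1 ] + $gap;    # insertion gap
            my $best = $diag;
            $best = $up   if $up > $best;
            $best = $left if $left > $best;
            $s[$i][$j] = $best;
        }
    }
    return \@s;
}

my $s = global_score_matrix( 'GGATCGA', 'GAATTCAGTTA', 8, -5, -3 );
printf "%4d" x @$_ . "\n", @$_ for @$s;    # one matrix row per line
print "alignment score: $s->[7][11]\n";    # alignment score: 31
```

Backtracking from the bottom-right cell works as before, except that each step checks which of the three terms produced the cell's value.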

88


Backtracking

          G    A    A    T    T    C    A    G    T    T    A
     0   -3   -6   -9  -12  -15  -18  -21  -24  -27  -30  -33
G   -3    8    5    2   -1   -4   -7  -10  -13  -16  -19  -22
G   -6    5    3    0   -3   -6   -9  -12   -2   -5   -8  -11
A   -9    2   13   11    8    5    2   -1   -4   -7  -10    0
T  -12   -1   10    8   19   16   13   10    7    4    1   -2
C  -15   -4    7    5   16   14   24   21   18   15   12    9
G  -18   -7    4    2   13   11   21   19   29   26   23   20
A  -21  -10    1   12   10    8   18   29   26   24   21   31

89

Beginning Perl for Bioinformatics

90

Biology and computer science

Bioinformatics
– biological data is proliferating rapidly
– computer-based tools now play an increasingly critical role in the advancement of biological research
– bioinformatics, a rapidly evolving discipline, is the application of computational tools and techniques to the management and analysis of biological data

Recently, the new term in silico has become a common reference to biological studies carried out in the computer
– in vivo: in life, that is, in a living organism
– in vitro: in glass, that is, in the test tube
– in silico: in algorithm; most computer chips are made primarily of silicon

91

Getting started with Perl

Perl is a popular programming language that's extensively used in areas such as bioinformatics and web programming

The word Perl refers to
– the language in which you will write programs
  • Perl programs, Perl scripts, Perl code
– the application on your computer that runs those programs
  • Perl interpreter

A low and long learning curve
– one can get started very quickly, and then learn additional topics as needed
  • e.g., object-oriented programming is also well supported in Perl

Perl's benefits
– ease of programming, rapid prototyping, portability, a healthy community, and abundant online resources

92

Perl

Sequences and strings

Example 4-1. Putting DNA into the computer

#!/usr/bin/perl -w
# Storing DNA in a variable, and printing it out
use strict;

# First we store the DNA in a variable called $DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Next, we print the DNA onto the screen
print $DNA;

Checkpoints
– #          comments in Perl, like // in C
– #!         magic line in Unix, ignored in Windows
– -w         displays warnings, a good habit
– use strict forces programmers to declare variables first, a good habit
– my         declares a variable
– $          scalar variable
– print      a convenient printf()

93

Example 4-2. Concatenating DNA

#!/usr/bin/perl -w
# Concatenating DNA

# Store two DNA fragments into two variables
my $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
my $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Using "string interpolation"
my $DNA3 = "$DNA1$DNA2";
print "$DNA3\n\n";

# An alternative way using the "dot operator":
$DNA3 = $DNA1 . $DNA2;
print "$DNA3\n\n";

# Print the same thing without using the variable $DNA3
print $DNA1, $DNA2, "\n\n";
print "$DNA1$DNA2\n\n";

Checkpoints
– ""    string; notice that "$a\n" differs from '$a\n' (only double quotes interpolate)
– .     concatenation operator
– print allows multiple arguments

Maybe the Perl slogan should be, “There are more than two ways to do it.”

94

Example 4-3. Transcribing DNA into RNA

#!/usr/bin/perl -w
# Transcribing DNA into RNA
use strict;

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Transcribe the DNA to RNA
# Substitute all T's with U's
my $RNA = $DNA;
$RNA =~ s/T/U/g;
print "$RNA\n";

Checkpoints
– =~   binding operator
– s/// substitution command, where g (global) is a modifier of s

95

96

Using the Perl documentation

A Perl programmer’s most important resource is the Perl documentation
– it should be installed on your computer
– it may also be found on the Internet at the Perl site
– just Google ‘perl’

perldoc
– $ perldoc -f printf

http://www.perl.com
– Documentation link > Perl’s Builtin Functions > Alphabetical Listing of Perl’s Functions

The Perl documentation
– checking out the examples it gives is usually the quickest way
– it may answer some questions but raise others
  • e.g., the documentation of print starts out: “Prints a string or a comma-separated list of strings.” But then comes a bunch of gibberish (or is it just the learning curve?): filehandles? output streams? list context?
– it also includes several tutorials

97


Example 4-4. Calculating the reverse complement of a strand of DNA

#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
use strict;

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Calculate the reverse complement
# Warning: this attempt will fail!

# First, copy the DNA into new variable $revcom (short for REVerse COMplement)
# Notice that variable names can use lowercase letters like "revcom"
# as well as uppercase like "DNA". In fact, lowercase is more common.
my $revcom = reverse $DNA;

# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G
$revcom =~ s/A/T/g;
$revcom =~ s/T/A/g;
$revcom =~ s/G/C/g;
$revcom =~ s/C/G/g;
print "$revcom\n";

This example has a logical bug

Checkpoints
– reverse() reverses a string; the parentheses can be omitted

98

What is the bug?

Example 4-4. Calculating the reverse complement of a strand of DNA

#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
use strict;

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# The problem is that the first two substitute commands above
# change all the A's to T's (so there are no A's) and then all the
# T's to A's (so all the original A's and T's are all now A's).
# Same thing happens to the G's and C's all turning into G's.

# Make a copy of the DNA
my $revcom = reverse $DNA;

# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G
$revcom =~ tr/ACGTacgt/TGCAtgca/;
print "$revcom\n";

Checkpoints
– tr transliteration operator; it replaces all listed characters in a single pass

100

101

Example 4-5. Reading protein sequence data from a file

#!/usr/bin/perl -w
# Reading protein sequence data from a file
use strict;

# The filename of the file containing the protein sequence data
my $proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file, and associate a "filehandle" with it.
open FH, $proteinfilename;

# Now we do the actual reading of the protein sequence data from the file, by using
# the angle brackets < and > to get the input from the filehandle. We store the data
# into our variable $protein.
my $protein = <FH>;

# Now that we've got our data, we can close the file.
close FH;

print $protein;

Checkpoints
– open()     opens a file, like fopen() in C
– filehandle an interface between the program and the file system, a very common concept
– <>         an operator in Perl that reads from a filehandle; just memorize it
– close()    closes a file, like fclose() in C

102

Example 4-6. Reading protein sequence data from a file, take 2

#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 2
use strict;

# Open a file
open FH, 'NM_021964fragment.pep';

# Suppose that the file has three lines, and since the read only returns
# one line, we'll read a line and print it, three times.

# First line
my $protein = <FH>;
print $protein;

# Second line
$protein = <FH>;
print $protein;

# Third line
$protein = <FH>;
print $protein;

close FH;

Checkpoints
– note how clumsy this is: one read and one print per line

103



use strict;

# Open a file
open FH, 'NM_021964fragment.pep';

# Read the protein sequence data from the file, and
# store it into the array variable @protein
my @protein = <FH>;
print @protein;
close FH;

Checkpoints
– @     array variable
  • $protein[0] is the first element of @protein
– print knows how to print an array; no loop is required

104

Perl facilities for arrays
– my @bases = ( 'A', 'C', 'G', 'T' );
  print @bases; # A C G T

pop, push (for stack)
– my $base1 = pop @bases;
  print $base1; # T
  print @bases; # A C G
  push @bases, $base1;
  print @bases; # A C G T

shift, unshift (for queue)
– my $base1 = shift @bases;
  print $base1; # A
  print @bases; # C G T
  unshift @bases, $base1;
  print @bases; # A C G T

reverse
– my @reverse = reverse @bases;
  print @reverse; # T G C A

scalar (length of an array)
– print scalar @bases; # 4

splice (insert/delete/replace elements at an arbitrary place)
– extremely powerful!
– splice @bases, 2, 0, 'X';
  print @bases; # A C X G T

105

Any Questions?

106

About Perl facilities for array

How to

Answer

my @a = ( '1', '2', '3', '4' );
splice @a, 2, 1, 'a', 'b';
print @a; # 1 2 a b 4

Example 4-8. Scalar context and list context

#!/usr/bin/perl -w
# Demonstration of "scalar context" and "list context"
use strict;

my @bases = ( 'A', 'C', 'G', 'T' );

my $a = @bases;
print $a; # 4

($a) = @bases;
print $a; # A

Checkpoints
– many Perl operations behave differently depending on the context in which they are used
– another example is reverse() on an array of strings
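The reverse() case mentioned in the checkpoints can be illustrated with a short Perl sketch; the variable names and the two-codon example are made up for this demonstration. In list context reverse returns the elements in reverse order, while in scalar context it concatenates its arguments and reverses the characters of the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @codons = ( 'AUG', 'GCC' );

# List context: the order of the elements is reversed
my @reversed_list = reverse @codons;      # ( 'GCC', 'AUG' )
print join( ' ', @reversed_list ), "\n";  # GCC AUG

# Scalar context: the elements are joined, then the characters are reversed
my $reversed_string = reverse @codons;    # 'CCGGUA'
print "$reversed_string\n";               # CCGGUA

# print supplies list context, so force scalar context to reverse a string
print scalar reverse('ACGT'), "\n";       # TGCA
```

This is why Example 4-4 assigns the result of reverse to a scalar variable: the assignment provides the scalar context that reverses the string.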

108

Perl

Flow control

Example 5-1. if-elsif-else

#!/usr/bin/perl -w
# if-elsif-else
use strict;

my $word = 'MNIDDKL';

# if-elsif-else conditionals
if ( 'QSTVSGE' eq $word ) {
    print "QSTVSGE\n";
} elsif ( 'MNIDDKL' eq $word ) {
    print "MNIDDKL--the magic word!\n";
} else {
    print "Is \"$word\" a peptide?\n";
}

Checkpoints
– elsif else if
– eq    equal, checks for equality between strings
– ne    not equal, checks for inequality between strings

109



use strict;

# In case the open fails, print an error message and exit.
my $fn = 'NM_021964fragment.pep';
unless ( open FH, $fn ) {
    print "Could not open file $fn!\n";
    exit;
}

# Read and print line-by-line.
my $protein;
while ( $protein = <FH> ) { print $protein; }
close FH;

Checkpoints
– unless the opposite of if, reads more like English
– open   returns false when it fails
– exit   terminates the program, like exit() in C
– In Perl, an assignment returns the value assigned. If there is another line to read, the assignment occurs, $protein is defined, and the condition is true; at the end of the file the read returns undef and the loop ends.

110

Perl

Code layout

Format A
while ($a) {
    if ($b) {
        print "ok\n";
    }
}

Format B
while ($a)
{
    if ($b)
    {
        print "ok\n";
    }
}

Format C
while ($a)
if ($b)
{
print "ok\n";
}
}

Format D
while($a){if($b){print "ok\n";}}

A and B are common ways to lay out code; A is preferred in Perl

DON’T use C or D, ever!

Perl provides a guide for code style
– $ perldoc perlstyle
– however, these are guidelines, not rules, and you may use your own judgment

111

Perl

Subroutine

Example 6-1. A subroutine to append ACGT to DNA

#!/usr/bin/perl -w
# A program with a subroutine to append ACGT to DNA
use strict;

my $dna = 'CGACGTCTTCTCAGGCGA';

# The argument is $dna; the result is $longer_dna
my $longer_dna = &addACGT($dna);
print $longer_dna;

####################
# Subroutines for Example 6-1
sub addACGT {
    my $dna = shift @_;
    $dna .= 'ACGT';
    return $dna;
}

Checkpoints
– &       calls a subroutine; can be omitted, but helps syntax highlighting (e.g., in vi)
– sub     subroutine definition
– my $dna scoping issue
– @_      argument array, a special variable in Perl
– return  returns a value

112

Perl

Scoping

Make the variables specific to the subroutine with my
– my is a keyword in Perl that limits variables to the block where they are declared (here the block is the subroutine)

Making variables local to a restricted part of a program is called scoping. There are different models of scoping:
– in Perl, using my variables is known as lexical scoping, also known as static scoping
– dynamic scoping is hard to track; use it as little as possible

Manipulating $_[0] directly will change the caller's argument (call by reference)
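Both points can be seen side by side in a small Perl sketch; the subroutine names append_copy and append_in_place are hypothetical. The first copies its argument into a my variable and leaves the caller's data alone, while the second writes through $_[0], which is an alias for the caller's variable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'ACGT';    # lexical variable of the main program

# Works on a private copy: this $dna exists only inside the subroutine block
sub append_copy {
    my ($dna) = @_;
    $dna .= 'TTTT';
    return $dna;
}

# Works on the caller's variable: the elements of @_ alias the arguments
sub append_in_place {
    $_[0] .= 'TTTT';
    return;
}

my $longer = append_copy($dna);
print "$dna $longer\n";    # ACGT ACGTTTTT

append_in_place($dna);
print "$dna\n";            # ACGTTTTT
```

Copying @_ into my variables on the first line of a subroutine is the usual way to avoid changing the caller's data by accident.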

113

Any Questions?

114

About Perl subroutine

How to return multiple values?

Answer #1
return ( 'a', 'b' );

Answer #2
$_[0] = 'a'; $_[1] = 'b';
