Bioinformatics Programming
1
EE, NCKU
Tien-Hao Chang (Darby Chang)
Molecular biology
Nucleic acid
– DNA
– RNA
Central dogma
– Transcription
– Translation
Protein
– Amino acid
– Primary structure
– Secondary structure
– Tertiary structure
2
Nucleic acid
A nucleic acid is a macromolecule composed of chains of monomeric nucleotides
In biochemistry these molecules carry genetic information or form structures within cells
The most common nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA)
3
4
http://juang.bst.ntu.edu.tw/BC2008/images/NA%20Fig1.jpg
Nucleic acid components
Sugar
5
http://www.mun.ca/biology/scarr/Fg10_09b_revised.gif
Nucleic acid components
Base
Purine
– Adenine (A) and guanine (G)
Pyrimidine
– Thymine (T), cytosine (C)
– Uracil (U, only in RNA)
6
7
http://www.elmhurst.edu/~chm/vchembook/images/580bases.gif
8
http://fig.cox.miami.edu/~cmallery/150/chemistry/sf3x14a.jpg
DNA
Chemically, DNA is a long polymer of simple units called nucleotides, with a backbone made of sugars and phosphate groups joined by ester bonds
Attached to each sugar is one of four types of molecules called bases
It is the sequence of these four bases along the backbone that encodes information
9
http://upload.wikimedia.org/wikipedia/commons/8/87/DNA_orbit_animated_small.gif
DNA
Base pairing
Each type of base on one strand forms a bond with just one type of base on the other strand
Here, purines form hydrogen bonds to pyrimidines, with A bonding only to T, and C bonding only to G
The two types of base pairs form different numbers of hydrogen bonds: AT forms two hydrogen bonds and GC forms three
Chargaff's rule
– A=T and G=C
DNA sequence
– 5’CpGpCpApApTpT
  3’TpTpApApCpGpC
– CGCGAATT
10
11
http://openlearn.open.ac.uk/file.php/2645/S377_1_005i.jpg
12
Double helix
http://www.coe.drexel.edu/ret/personalsites/2005/dayal/curriculum1_files/image001.jpg
Hydrogen bond
A hydrogen bond exists between an electronegative atom and a hydrogen atom bonded to another electronegative atom
This type of force always involves a hydrogen atom, hence the name hydrogen bonding; in the strongest cases its energy approaches that of weak covalent bonds (about 155 kJ/mol), although most hydrogen bonds are much weaker
Biological functions
– DNA/RNA base pairing
– protein secondary/tertiary structure formation
– some properties of the water molecule
– antibody-antigen (and other protein-protein) binding
13
14
Hydrogen bonds result from differences in electronegativity
http://upload.wikimedia.org/wikipedia/commons/4/43/Liquid_water_hydrogen_bond.png
15
Major and minor grooves
http://courses.biology.utah.edu/horvath/biol.3525/1_DNA/Fig2/marty_1.jpg
DNA structure
16
http://www.youtube.com/watch?v=qy8dk5iS1f0&NR=1
Any Questions?
17
About DNA
Central dogma
18
19
http://fig.cox.miami.edu/~cmallery/255/255hist/mcb4.1.dogma.jpg
Central dogma
The process by which information is extracted from the nucleotide sequence of a gene and then used to make a protein is essentially the same for all living things on Earth and is described by the grandly named central dogma of molecular biology
Information in cells passes from DNA to RNA to proteins
20
http://upload.wikimedia.org/wikipedia/commons/3/3a/Crick's_1958_central_dogma.svg
RNA
Information stored in DNA is used to make a more transient, single-stranded polynucleotide called RNA (ribonucleic acid)
RNA is very similar to DNA, but differs in a few important structural details
– in the cell RNA is usually single stranded, while DNA is usually double stranded
– RNA nucleotides contain ribose while DNA contains deoxyribose (a type of ribose that lacks one oxygen atom)
– in RNA the base uracil substitutes for thymine, which is present in DNA
21
22
http://www.dadamo.com/wiki/dna-rna.png
Central dogma
Transcription
Transcription is the synthesis of RNA under the direction of DNA
Both nucleic acid sequences use the same language, and the information is simply transcribed, or copied, from one molecule to the other
The DNA sequence is enzymatically copied by RNA polymerase to produce a complementary RNA strand, called messenger RNA (mRNA)
23
DNA transcription
24
http://www.youtube.com/watch?v=vJSmZ3DsntU
Transcription detail
25
http://www-class.unl.edu/biochem/gp2/m_biology/animation/m_animations/gene2.swf
RNA
Various types
mRNA
– messenger RNA (mRNA) is the RNA that carries information from DNA to the ribosome
– the coding sequence of the mRNA determines the amino acid sequence in the protein that is produced
Non-coding RNA
– many RNAs do not code for protein
– these non-coding RNAs can be encoded by their own genes (RNA genes), but can also derive from mRNA introns
– the most prominent examples of non-coding RNAs are transfer RNA (tRNA) and ribosomal RNA (rRNA), both of which are involved in the process of translation
– there are also non-coding RNAs involved in gene regulation, RNA processing and other roles
26
Central dogma
Translation
Translation is the second stage of protein biosynthesis
Translation occurs in the cytoplasm, where the ribosomes are located
In translation, mRNA is decoded to produce a specific polypeptide according to the rules specified by the genetic code
Many types of transcribed RNA, such as transfer RNA, ribosomal RNA, and small nuclear RNA, are not necessarily translated into an amino acid sequence
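Decoding mRNA by the genetic code can be sketched in Perl as a codon-by-codon table lookup. This is only an illustration: the table below holds a handful of codons from the standard genetic code rather than all 64, and the names %codon_table and translate_mrna are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Partial codon table (standard genetic code); '*' marks a stop codon.
# Only a few codons are listed, so this is not a complete translator.
my %codon_table = (
    AUG => 'M', UUU => 'F', GGC => 'G', AAA => 'K',
    GAU => 'D', UGG => 'W', UAA => '*', UAG => '*', UGA => '*',
);

# Translate an mRNA string three bases at a time, stopping at the first
# stop codon. Codons missing from the partial table are reported as 'X'.
sub translate_mrna {
    my ($mrna) = @_;
    my $protein = '';
    for ( my $i = 0 ; $i + 3 <= length($mrna) ; $i += 3 ) {
        my $codon = substr( $mrna, $i, 3 );
        my $aa = exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
        last if $aa eq '*';
        $protein .= $aa;
    }
    return $protein;
}

print translate_mrna('AUGUUUGGCAAAUAA'), "\n";    # MFGK
```

A real program would fill in the full table (or use an existing library) and would also have to pick the reading frame.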
27
From RNA to protein synthesis
28
http://www.youtube.com/watch?v=NJxobgkPEAo
Protein translation
29
http://www.youtube.com/watch?v=nl8pSlonmA0
30
Genetic code
http://biology.kenyon.edu/courses/biol114/Chap05/code.gif
Any Questions?
31
About central dogma
Protein
32
Protein
Proteins are large organic compounds made of amino acids arranged in a linear chain and joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues
Proteins can also work together to achieve a particular function, and they often associate to form stable complexes
33
Protein
Amino acid
In chemistry, an amino acid is a molecule that contains both amine and carboxyl functional groups
In biochemistry, this term refers to alpha-amino acids with the general formula H2NCHRCOOH, where R is an organic substituent
In the alpha amino acids, the amino and carboxylate groups are attached to the same carbon, which is called the α-carbon
34
35
http://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/AminoAcidball.svg/702px-AminoAcidball.svg.png
Amino acid
Various side chains
The various alpha amino acids differ in which side chain (R group) is attached to their alpha carbon
They can vary in size from just a hydrogen atom in glycine through a methyl group in alanine to a large heterocyclic group in tryptophan
36
37
http://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Aa.svg/2000px-Aa.svg.png
38
http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-7.JPG
39
http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-9.JPG
40
http://www.russell.embl-heidelberg.de/aas/other_images/lb3.gif
Amino acid
The building blocks of proteins
Amino acids combine in a condensation reaction that releases water; each amino acid incorporated into the chain is then called an “amino acid residue”, and adjacent residues are held together by peptide bonds
Proteins are defined by their unique sequence of amino acid residues; this sequence is the primary structure of the protein
Just as the letters of the alphabet can be combined to form an almost endless variety of words, amino acids can be linked in varying sequences to form a vast variety of proteins
41
42
Peptide bond
http://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Peptidformationball.svg/2000px-Peptidformationball.svg.png
43
http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-11.JPG
44
http://juang.bst.ntu.edu.tw/BC2008/images/Amino(1)%202007/A1-13.JPG
Protein
After knowing amino acids
Amino acids form short polymer chains called peptides or longer chains called either polypeptides or proteins
The process of such formation from an mRNA template is known as translation, which is part of protein biosynthesis
Twenty amino acids are encoded by the standard genetic code and are called proteinogenic or standard amino acids
45
Protein structure hierarchy
46
47
http://cropandsoil.oregonstate.edu/classes/css430/lecture%209-07/figure-09-03.JPG
48
http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-4.JPG
49
http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-8.JPG
50
http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-9.JPG
Protein structure hierarchy
Secondary structure
In biochemistry and structural biology, secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA)
It does not, however, describe specific atomic positions in three-dimensional space, which are considered to be tertiary structure
51
52
http://juang.bst.ntu.edu.tw/BC2008/images/Protein(2)%202007/P2-3.JPG
Protein structure hierarchy
Tertiary structure
The three-dimensional structure of a protein or any other macromolecule, as defined by the atomic coordinates
Describes the spatial relations among its secondary structures
Tertiary structure is considered to be largely determined by the protein’s primary sequence, or the sequence of amino acids of which it is composed
The majority of protein structures known to date have been solved with the experimental technique of X-ray crystallography
A second common way of solving protein structures uses NMR (nuclear magnetic resonance)
– provides lower-resolution data and is limited to relatively small proteins
– can provide time-dependent information about the motion of a protein in solution
53
54
http://campusapps.fullerton.edu/news/arts/2003/photos/protein-art.jpg
Protein structure hierarchy
Quaternary structure
Many proteins are actually assemblies of more than one polypeptide chain, which in the context of the larger assemblage are known as protein subunits
In addition to the tertiary structure of the subunits, multiple-subunit proteins possess a quaternary structure, which is the arrangement into which the subunits assemble
55
http://courses.cm.utexas.edu/jrobertus/ch339k/overheads-1/ch6_quat-struct1.jpg
Protein sub-structure
56
Protein sub-structure
Domain
A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain
Each domain forms a compact three-dimensional structure and often can be independently stable and folded
Domains vary in length from about 25 to about 500 amino acids
The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges
Domains often form functional units
57
http://upload.wikimedia.org/wikipedia/commons/6/67/1pkn.png
Protein domain
Zinc finger
58
http://upload.wikimedia.org/wikipedia/commons/7/79/Zinc_finger_DNA_complex.png
Protein sub-structure
Motif
Sequence motif
– a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance
– for proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent
Structural motif
– a three-dimensional structural element or fold within the chain, which appears also in a variety of other molecules
– in the context of proteins, the term is sometimes used interchangeably with “structural domain,” although a domain need not be a motif nor, if it contains a motif, need it be made up of only one
59
60
61
http://www.biomedcentral.com/content/figures/1471-2164-8-60-8.jpg
62
http://juang.bst.ntu.edu.tw/BC2008/images/Protein(1)%202007/P1-3.JPG
Molecular biology
Reference
Prof. Rong-Huay Juang's website, National Taiwan University
– http://juang.bst.ntu.edu.tw/BC2008/index.htm
Molecular biology website, National Chiao Tung University
– http://www.life.nctu.edu.tw/~mb/c40101.htm
63
Any Questions?
64
About molecular biology
65
Sequence alignment
In: a FASTA file
Out: pairwise sequence alignment
Requirement
- output alignment score (identity)
- complexity/teamwork report
- using Perl would be the best
Bonus
- alignment allowing mismatches
- output alignment
66
Deadline
2010/4/27 23:59
Zip your code, a step-by-step README, complexity analyses, and anything worthy of extra credit. Email to [email protected].
Input
– Download from UniProt
– UniProt is the universal protein resource, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information.
– http://www.uniprot.org/uniprot/?query=Saccharomyces+cerevisiae+transcription+factor+AND+reviewed%3ayes&force=yes&format=fasta
– >sp|P32333|MOT1_YEAST TATA-binding protein-associated fac...
MTSRVSRLDRQVILIETGSTQVVRNMAADQMGDLAKQHPEDILSLLSRVYPFLLVKKWET
...
TFIKTLR
>sp|Q00947|STP1_YEAST Transcription factor STP1...
MPSTTLLFPQKHIRAIPGKIYAFFRELVSGVIISKPDLSHHYSCENATKEEGKDAADEEK
...
>sp|P38830|NDT80_YEAST Meiosis-specific transcription fac...
MNEMENTDPVLQDDLVSKYERELSTEQEEDTPVILTQLNEDGTTSNYFDKRKLKIAPRST
...
Output
– >MOT1_YEAST
STP1_YEAST 90
NDT80_YEAST 80
>STP1_YEAST
MOT1_YEAST 90
NDT80_YEAST 70
>NDT80_YEAST
MOT1_YEAST 80
STP1_YEAST 70
67
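Reading the FASTA input could start from a sketch like the one below. It assumes UniProt-style headers and takes the third '|'-separated field as the sequence ID, as the sample output suggests; the name parse_fasta and the two toy records are hypothetical, not part of the assignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse FASTA text into [id, sequence] pairs, keeping the input order.
# Assumption: for headers like ">sp|P32333|MOT1_YEAST ..." the ID is the
# third '|'-separated field; otherwise the first word after '>' is used.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line ( split /\n/, $text ) {
        $line =~ s/\s+$//;    # drop trailing whitespace
        next if $line eq '';
        if ( $line =~ /^>(\S+)/ ) {
            my $header = $1;
            my @fields = split /\|/, $header;
            my $id     = @fields >= 3 ? $fields[2] : $header;
            push @records, [ $id, '' ];
        }
        elsif (@records) {
            $records[-1][1] .= $line;    # sequence lines are concatenated
        }
    }
    return @records;
}

# Toy input for illustration only (not real UniProt entries).
my $fasta = <<'END';
>sp|X00001|TOY1_TEST hypothetical record
DKELIR
EPELIR
>sp|X00002|TOY2_TEST another hypothetical record
MERPEPELIR
END

for my $rec ( parse_fasta($fasta) ) {
    printf "%s\t%d residues\n", $rec->[0], length( $rec->[1] );
}
```

For a real file, the same loop can read from a filehandle line by line instead of splitting a string.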
Sequence similarity
Sequence identity
Sequence alignment
– dynamic programming
– backtracking
– substitution matrix
68
Sequence similarity
Identity
Which sequence is more similar to DKELIR?
– EPELIR or DKGLIR
A trivial (but useful) concept, identity
– DKELIR
  EPELIR
    **** identity: 4/6 = 66.7%
– DKELIR
  DKGLIR
  ** *** identity: 5/6 = 83.3%
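The identity calculation above can be written as a short Perl subroutine. This is a minimal sketch that assumes two non-empty sequences of equal length; percent_identity is a name chosen for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent identity of two equal-length sequences:
# number of identical positions divided by the length.
sub percent_identity {
    my ( $s1, $s2 ) = @_;
    die "sequences must have the same length\n"
      if length($s1) != length($s2);
    my $matches = 0;
    for my $i ( 0 .. length($s1) - 1 ) {
        $matches++ if substr( $s1, $i, 1 ) eq substr( $s2, $i, 1 );
    }
    return 100 * $matches / length($s1);
}

printf "%.1f%%\n", percent_identity( 'DKELIR', 'EPELIR' );    # 66.7%
printf "%.1f%%\n", percent_identity( 'DKELIR', 'DKGLIR' );    # 83.3%
```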
69
Sequence similarity
Alignment
When two sequences have different lengths
– DKELIR
  MERPEPELIR identity: 0%
Obviously, we need to shift the first sequence by 4 residues
–     DKELIR
  MERPEPELIR
This shifting is called alignment
70
Sequence alignment
A way of arranging the sequences of DNA, RNA or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences
More complex alignments may involve gaps (EDKELIR vs. MERPEPELIR)
–  E----DKELIR
  MERPEP--ELIR
   *      **** identity: 5/11 = 45.5%
And a substitution matrix
–  E--DKELIR
  MERPEPELIR
   *  . **** identity: 6/9 = 66.7%
71
Sequence alignment
Dynamic programming
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6
72
A class of solution methods for solving sequential decision problems with a compositional cost structure
In this matrix, each element $s_{i,j}$ holds the best alignment score between the two corresponding sub-sequences
– the key is to find the relationship between the problem $s_{i,j}$ and its sub-problems $s_{\alpha,\beta}$, where $\alpha \le i$ and $\beta \le j$
73
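$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } a_i = b_j \text{ (match)} \\
s_{i-1,j} & \text{(deletion gap)} \\
s_{i,j-1} & \text{(insertion gap)}
\end{cases}
$$
where $a_i$ is the $i$-th residue of the row sequence and $b_j$ is the $j$-th residue of the column sequence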
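The recurrence above can be sketched in Perl as a double loop that fills the matrix row by row. This is an illustration under the simple scoring of these slides (+1 for a match, 0 for gaps); dp_matrix is a name chosen here, not something from the course material.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fill the dynamic programming matrix: s[i][j] is the best score for the
# first i residues of the row sequence and the first j residues of the
# column sequence (+1 per match, gaps cost nothing).
sub dp_matrix {
    my ( $row_seq, $col_seq ) = @_;
    my @x = split //, $row_seq;
    my @y = split //, $col_seq;
    my ( $n, $m ) = ( scalar @x, scalar @y );
    my @s;
    $s[$_][0] = 0 for 0 .. $n;
    $s[0][$_] = 0 for 0 .. $m;
    for my $i ( 1 .. $n ) {
        for my $j ( 1 .. $m ) {
            my $best = $s[ $i - 1 ][$j];                              # deletion gap
            $best = $s[$i][ $j - 1 ] if $s[$i][ $j - 1 ] > $best;     # insertion gap
            if ( $x[ $i - 1 ] eq $y[ $j - 1 ] ) {                     # match
                my $diag = $s[ $i - 1 ][ $j - 1 ] + 1;
                $best = $diag if $diag > $best;
            }
            $s[$i][$j] = $best;
        }
    }
    return \@s;
}

my $s = dp_matrix( 'GGATCGA', 'GAATTCAGTTA' );
print join( ' ', @$_ ), "\n" for @$s;
print "best score: $s->[7][11]\n";    # best score: 6
```

The time and memory cost are both proportional to the product of the two sequence lengths, which is worth stating in the complexity report.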
Dynamic programming
Insertion gap
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 ?
C 0
G 0
A 0
74
Dynamic programming
Deletion gap
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 ?
G 0
A 0
75
Dynamic programming
Match
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 ?
C 0
G 0
A 0
76
Dynamic programming
Relationship
Two key ingredients make an optimization problem suitable for a dynamic-programming solution
– optimal substructure: an optimal solution is composed of optimal solutions to its sub-problems
– overlapping sub-problems: the sub-problems depend on one another; if they were independent, a divide-and-conquer approach would be the choice
Now that we know the three relationships
– insertion gap
– deletion gap
– match
we can easily construct an alignment from this matrix with the so-called backtracking technique
77
Dynamic programming
Backtracking
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6
78
Backtracking
Alternative paths
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6
79
Backtracking
The backtracking algorithm enumerates a set of partial candidates that, in principle, could be completed in various ways to give all the possible solutions to the given problem
A dynamic programming matrix can produce all possible alignments of the best score from different backtracking paths
Alternative paths
– G-AATTCAGTTA
  GGA-T-C-G--A
  * * * * *  * identity: 6/12 = 50%
– G-AATTCAGTTA
  GGA--TC-G--A
  * *  ** *  * identity: 6/12 = 50%
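One way to turn the filled matrix into an alignment is to walk back from the bottom-right corner, as sketched below. The order in which ties are broken here (match first, then up, then left) is an arbitrary assumption; another order follows a different path and yields another alignment with the same score. The name align_by_backtracking is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fill the +1-per-match matrix, then backtrack from s[n][m] to s[0][0].
# Returns the aligned column sequence, the aligned row sequence, and the score.
sub align_by_backtracking {
    my ( $row_seq, $col_seq ) = @_;
    my @x = split //, $row_seq;
    my @y = split //, $col_seq;
    my ( $n, $m ) = ( scalar @x, scalar @y );

    my @s;
    push @s, [ (0) x ( $m + 1 ) ] for 0 .. $n;
    for my $i ( 1 .. $n ) {
        for my $j ( 1 .. $m ) {
            my $up   = $s[ $i - 1 ][$j];
            my $left = $s[$i][ $j - 1 ];
            my $best = $up > $left ? $up : $left;
            if ( $x[ $i - 1 ] eq $y[ $j - 1 ] && $s[ $i - 1 ][ $j - 1 ] + 1 > $best ) {
                $best = $s[ $i - 1 ][ $j - 1 ] + 1;
            }
            $s[$i][$j] = $best;
        }
    }

    my ( $i, $j ) = ( $n, $m );
    my ( $top, $bottom ) = ( '', '' );
    while ( $i > 0 || $j > 0 ) {
        if (   $i > 0
            && $j > 0
            && $x[ $i - 1 ] eq $y[ $j - 1 ]
            && $s[$i][$j] == $s[ $i - 1 ][ $j - 1 ] + 1 )
        {
            # match: consume one residue from each sequence
            $top    = $y[ $j - 1 ] . $top;
            $bottom = $x[ $i - 1 ] . $bottom;
            $i--;
            $j--;
        }
        elsif ( $i > 0 && $s[$i][$j] == $s[ $i - 1 ][$j] ) {
            # move up: the row residue is aligned with a gap
            $top    = '-' . $top;
            $bottom = $x[ $i - 1 ] . $bottom;
            $i--;
        }
        else {
            # move left: the column residue is aligned with a gap
            $top    = $y[ $j - 1 ] . $top;
            $bottom = '-' . $bottom;
            $j--;
        }
    }
    return ( $top, $bottom, $s[$n][$m] );
}

my ( $top, $bottom, $score ) = align_by_backtracking( 'GGATCGA', 'GAATTCAGTTA' );
print "$top\n$bottom\nscore: $score\n";
```

Enumerating all best-scoring alignments would require following every tied branch, for example with recursion, instead of taking the first one.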
80
Sequence alignment
Substitution matrix
Some alignments may involve mismatches
–  E----DKELIR
  MERPEP--ELIR
   *      **** identity: 5/11 = 45.5%
–  E--DKELIR
  MERPEPELIR
   *  . **** identity: 6/9 = 66.7%
81
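$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + \mathrm{score}(a_i, b_j) \\
s_{i-1,j} + \mathrm{score}(a_i, -) \\
s_{i,j-1} + \mathrm{score}(-, b_j)
\end{cases}
$$
Substitution matrix (- denotes a gap)
      A   C   G   T   -
  A   8  -5  -5  -5  -3
  C  -5   8  -5  -5  -3
  G  -5  -5   8  -5  -3
  T  -5  -5  -5   8  -3
  -  -3  -3  -3  -3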
Sequence alignment with substitution matrix
Initialize
G A A T T C A G T T A
0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33
G -3
G -6
A -9
T -12
C -15
G -18
A -21
82
Sequence alignment with substitution matrix
Match
G A A T T C A G T T A
0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33
G -3 ?
G -6
A -9
T -12
C -15
G -18
A -21
83
Sequence alignment with substitution matrix
Deletion gap
G A A T T C A G T T A
0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33
G -3 8 ?
G -6
A -9
T -12
C -15
G -18
A -21
84
Sequence alignment with substitution matrix
Insertion gap
G A A T T C A G T T A
0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33
G -3 8 5 2 -1 -4 -7 -10 -13 -16 -19 -22
G -6 ?
A -9
T -12
C -15
G -18
A -21
85
Sequence alignment with substitution matrix
Mismatch
G A A T T C A G T T A
0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33
G -3 8 5 2 -1 -4 -7 -10 -13 -16 -19 -22
G -6 5 ?
A -9
T -12
C -15
G -18
A -21
86
Sequence alignment with substitution matrix
Initialize
G A A T T C A G T T A
0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33
G -3 8 5 2 -1 -4 -7 -10 -13 -16 -19 -22
G -6 5 3 ?
A -9
T -12
C -15
G -18
A -21
87
Sequence alignment with substitution matrix
Complete
G A A T T C A G T T A
0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33
G -3 8 5 2 -1 -4 -7 -10 -13 -16 -19 -22
G -6 5 3 0 -3 -6 -9 -12 -2 -5 -8 -11
A -9 2 13 11 8 5 2 -1 -4 -7 -10 0
T -12 -1 10 8 19 16 13 10 7 4 1 -2
C -15 -4 7 5 16 14 24 21 18 15 12 9
G -18 -7 4 2 13 11 21 19 29 26 23 20
A -21 -10 1 12 10 8 18 29 26 24 21 31
88
Sequence alignment with substitution matrix
Backtracking
G A A T T C A G T T A
0 -3 -6 -9 -12 -15 -18 -21 -24 -27 -30 -33
G -3 8 5 2 -1 -4 -7 -10 -13 -16 -19 -22
G -6 5 3 0 -3 -6 -9 -12 -2 -5 -8 -11
A -9 2 13 11 8 5 2 -1 -4 -7 -10 0
T -12 -1 10 8 19 16 13 10 7 4 1 -2
C -15 -4 7 5 16 14 24 21 18 15 12 9
G -18 -7 4 2 13 11 21 19 29 26 23 20
A -21 -10 1 12 10 8 18 29 26 24 21 31
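The scored version of the recurrence can be sketched in Perl as follows. The scores are read off the worked matrices in these slides (match +8, mismatch -5, gap -3); the names score_pair and global_alignment_matrix are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scores as used in the worked example: match +8, mismatch -5, gap -3.
my ( $MATCH, $MISMATCH, $GAP ) = ( 8, -5, -3 );

# Substitution score for aligning residue $p with residue $q.
sub score_pair {
    my ( $p, $q ) = @_;
    return $p eq $q ? $MATCH : $MISMATCH;
}

# Fill the global alignment matrix with gap penalties on the borders.
sub global_alignment_matrix {
    my ( $row_seq, $col_seq ) = @_;
    my @x = split //, $row_seq;
    my @y = split //, $col_seq;
    my ( $n, $m ) = ( scalar @x, scalar @y );
    my @s;
    $s[$_][0] = $_ * $GAP for 0 .. $n;    # first column: leading gaps
    $s[0][$_] = $_ * $GAP for 0 .. $m;    # first row: leading gaps
    for my $i ( 1 .. $n ) {
        for my $j ( 1 .. $m ) {
            my $diag = $s[ $i - 1 ][ $j - 1 ] + score_pair( $x[ $i - 1 ], $y[ $j - 1 ] );
            my $up   = $s[ $i - 1 ][$j] + $GAP;
            my $left = $s[$i][ $j - 1 ] + $GAP;
            my $best = $diag;
            $best = $up   if $up > $best;
            $best = $left if $left > $best;
            $s[$i][$j] = $best;
        }
    }
    return \@s;
}

my $nw = global_alignment_matrix( 'GGATCGA', 'GAATTCAGTTA' );
for my $row (@$nw) {
    print join( ' ', map { sprintf( '%3d', $_ ) } @$row ), "\n";
}
print "final score: $nw->[7][11]\n";    # final score: 31
```

Backtracking works as before, except that each step checks which of the three terms produced the cell's value.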
89
Beginning Perl for Bioinformatics
90
Biology and computer science
Bioinformatics
– biological data is proliferating rapidly
– computer-based tools now play an increasingly critical role in the advancement of biological research
– bioinformatics, a rapidly evolving discipline, is the application of computational tools and techniques to the management and analysis of biological data
Recently, the new term in silico has become a common reference to biological studies carried out in the computer
– in vivo: in life, that is, in a living organism
– in vitro: in glass, that is, in the test tube
– in silico: in algorithm; most computer chips are made primarily of silicon
91
Getting started with Perl
Perl is a popular programming language that's extensively used in areas such as bioinformatics and web programming
The word Perl refers to
– the language in which you will write programs
  • Perl programs, Perl scripts, Perl code
– the application on your computer that runs those programs
  • Perl interpreter
A gentle but long learning curve
– one can get started very quickly, and then learn additional topics as needed
  • e.g., object-oriented programming is also well supported in Perl
Perl's benefits
– ease of programming, rapid prototyping, portability, a healthy community and abundant online resources
92
Perl
Sequences and strings
Example 4-1. Putting DNA into the computer
#!/usr/bin/perl -w
# Storing DNA in a variable, and printing it out
use strict;
# First we store the DNA in a variable called $DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Next, we print the DNA onto the screen
print $DNA;
Checkpoints
– # starts a comment in Perl, like // in C
– #! magic line in Unix, ignored in Windows
– -w displays warnings, a good habit
– use strict forces programmers to declare variables first, a good habit
– my declares a variable
– $ scalar variable
– print a convenient printf()
93
Example 4-2. Concatenating DNA
#!/usr/bin/perl -w
# Concatenating DNA
# Store two DNA fragments into two variables
my $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
my $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';
# Using "string interpolation"
my $DNA3 = "$DNA1$DNA2";
print "$DNA3\n\n";
# An alternative way using the "dot operator":
$DNA3 = $DNA1 . $DNA2;
print "$DNA3\n\n";
# Print the same thing without using the variable $DNA3
print $DNA1, $DNA2, "\n\n";
print "$DNA1$DNA2\n\n";
Checkpoints
– "" string; notice that "$a\n" is different from '$a\n' (only double quotes interpolate)
– . concatenation operator
– print allows multiple arguments
Maybe the Perl slogan should be, “There are more than two ways to do it.”
94
Example 4-3. Transcribing DNA into RNA
#!/usr/bin/perl -w
# Transcribing DNA into RNA
use strict;
# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Transcribe the DNA to RNA
# Substitute all T's with U's
my $RNA = $DNA;
$RNA =~ s/T/U/g;
print "$RNA\n";
Checkpoints
– =~ binding operator
– s/// substitution command, where g is a modifier of s (replace all occurrences)
95
96
Using the Perl documentation
A Perl programmer’s most important resource is the Perl documentation
– it should be installed on your computer
– it may also be found on the Internet at the Perl site
– just Google ‘perl’
perldoc
– $ perldoc -f printf
http://www.perl.com
– Documentation link > Perl’s Builtin Functions > Alphabetical Listing of Perl’s Functions
The Perl documentation
– checking out the examples it gives is usually the quickest way
– it may answer some questions but raise others
  • e.g., the documentation of print starts out: “Prints a string or a comma-separated list of strings.” But then comes a bunch of gibberish (or it is just the learning curve!): filehandles? Output streams? List context?
– it also includes several tutorials
97
Example 4-4. Calculating the reverse complement of a strand of DNA
#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
use strict;
# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Calculate the reverse complement
# Warning: this attempt will fail!
# First, copy the DNA into new variable $revcom (short for REVerse COMplement)
# Notice that variable names can use lowercase letters like "revcom"
# as well as uppercase like "DNA". In fact, lowercase is more common.
my $revcom = reverse $DNA;
# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G
$revcom =~ s/A/T/g;
$revcom =~ s/T/A/g;
$revcom =~ s/G/C/g;
$revcom =~ s/C/G/g;
print "$revcom\n";
This example has a logical bug
Checkpoints
– reverse() reverses a string; the parentheses can be omitted
98
What is the bug?
99
Example 4-4. Calculating the reverse complement of a strand of DNA
#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
use strict;
# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# The problem is that the first two substitute commands above
# change all the A's to T's (so there are no A's) and then all the
# T's to A's (so all the original A's and T's are all now A's).
# Same thing happens to the G's and C's all turning into G's.
# Make a reversed copy of the DNA
my $revcom = reverse $DNA;
# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G
$revcom =~ tr/ACGTacgt/TGCAtgca/;
print "$revcom\n";
Checkpoints
– tr transliteration operator; it replaces all characters in one pass, so the substitutions cannot interfere
100
101
Example 4-5. Reading protein sequence data from a file
#!/usr/bin/perl -w
# Reading protein sequence data from a file
use strict;
# The filename of the file containing the protein sequence data
my $proteinfilename = 'NM_021964fragment.pep';
# First we have to "open" the file, and associate a "filehandle" with it.
open FH, $proteinfilename;
# Now we do the actual reading of the protein sequence data from the file, by using
# the angle brackets < and > to get the input from the filehandle. We store the data
# into our variable $protein.
my $protein = <FH>;
# Now that we've got our data, we can close the file.
close FH;
print $protein;
Checkpoints
– open() opens a file, like fopen() in C
– filehandle an interface between the program and the file system, a very common concept
– <> the line-input operator in Perl; in scalar context it reads one line
– close() closes a file, like fclose() in C
102
Example 4-6. Reading protein sequence data from a file, take 2
#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 2
use strict;
# Open a file
open FH, 'NM_021964fragment.pep';
# Suppose that the file has three lines, and since the read only is returning
# one line, we'll read a line and print it, three times.
# First line
my $protein = <FH>;
print $protein;
# Second line
$protein = <FH>;
print $protein;
# Third line
$protein = <FH>;
print $protein;
close FH;
Checkpoints
– reading one line at a time like this is clumsy and only works if the number of lines is known in advance
103
Example 4-7. Reading protein sequence data from a file, take 3
#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 3
use strict;
# Open a file
open FH, 'NM_021964fragment.pep';
# Read the protein sequence data from the file, and
# store it into the array variable @protein
my @protein = <FH>;
print @protein;
close FH;
Checkpoints
– @ array variable
  • $protein[0] is the first element of @protein
– print knows how to print an array, no loop is required
104
Perl facilities for array
– my @bases = ( 'A', 'C', 'G', 'T' );
  print @bases; # A C G T
pop, push (for stack)
– my $base1 = pop @bases;
  print $base1; # T
  print @bases; # A C G
  push @bases, $base1;
  print @bases; # A C G T
shift, unshift (for queue)
– my $base1 = shift @bases;
  print $base1; # A
  print @bases; # C G T
  unshift @bases, $base1;
  print @bases; # A C G T
reverse
– my @reverse = reverse @bases;
  print @reverse; # T G C A
scalar (length of an array)
– print scalar @bases; # 4
splice (insert/delete/replace elements at an arbitrary place)
– extremely powerful!
– splice @bases, 2, 0, 'X';
  print @bases; # A C X G T
105
Any Questions?
106
About Perl facilities for array
How to
107
my @a = ( '1', '2', '3', '4' );
splice @a, 2, 1, 'a', 'b';
print @a; # 1 2 a b 4
answer
Example 4-8. Scalar context and list context
#!/usr/bin/perl -w
# Demonstration of "scalar context" and "list context"
use strict;
my @bases = ( 'A', 'C', 'G', 'T' );
my $a = @bases;
print $a; # 4
($a) = @bases;
print $a; # A
Checkpoints
– many Perl operations behave differently depending on the context in which they are used
– another example is reverse(), which reverses the characters of a string in scalar context but the order of the elements in list context
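The context-dependent behavior of reverse can be seen in a few lines. This is a small illustration; the variable names are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'ACGT';

# Scalar context: reverse joins its arguments and reverses the characters.
my $reversed = reverse $dna;

# List context: reverse reverses the order of the list elements instead,
# so a one-element list comes back unchanged.
my @list  = reverse $dna;
my @bases = reverse qw(A C G T);

print "$reversed\n";                   # TGCA
print "@list\n";                       # ACGT
print "@bases\n";                      # T G C A
print scalar( reverse $dna ), "\n";    # TGCA
print reverse($dna), "\n";             # ACGT (print supplies list context)
```

The last line is a common pitfall: print imposes list context, so scalar() is needed to reverse the string in place.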
108
Perl
Flow control
Example 5-1. if-elsif-else
#!/usr/bin/perl -w
# if-elsif-else
use strict;
my $word = 'MNIDDKL';
# if-elsif-else conditionals
if ( 'QSTVSGE' eq $word ) {
print "QSTVSGE\n";
} elsif ( 'MNIDDKL' eq $word ) {
print "MNIDDKL--the magic word!\n";
} else {
print "Is \"$word\" a peptide?\n";
}
Checkpoints
– elsif else if
– eq equal, checks for equality between strings
– ne not equal, checks for inequality between strings
109
Example 5-2. Reading protein sequence data from a file, take 4
#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 4
use strict;
# In case the open fails, print an error message and exit.
my $fn = 'NM_021964fragment.pep';
unless ( open FH, $fn ) {
print "Could not open file $fn!\n";
exit;
}
# Read and print line-by-line.
my $protein;
while ( $protein = <FH> ) { print $protein; }
close FH;
Checkpoints
– unless the opposite of if, reads more like English
– open returns false when it fails
– exit terminates the program, like exit() in C
– In Perl, an assignment returns the assigned value. As long as there is another line to read, the assignment yields a defined value and the condition is true; at end of file <FH> returns undef and the loop ends.
110
Perl
Code layout
Format A
– while ($a) {
      if ($b) {
          print "ok\n";
      }
  }
Format B
– while ($a)
  {
      if ($b)
      {
          print "ok\n";
      }
  }
Format C
– while ($a)
  {
  if ($b)
  {
  print "ok\n";
  }
  }
Format D
– while($a){if($b){print "ok\n";}}
A and B are common ways to lay out code; A is preferred in Perl
DON’T use C or D, ever!
Perl provides a guide for code style
– $ perldoc perlstyle
– however, these are guidelines, not rules, and you may use your own judgment
111
Perl
Subroutine
Example 6-1. A subroutine to append ACGT to DNA
#!/usr/bin/perl -w
# A program with a subroutine to append ACGT to DNA
use strict;
my $dna = 'CGACGTCTTCTCAGGCGA';
# The argument is $dna; the result is $longer_dna
my $longer_dna = &addACGT($dna); print $longer_dna;
####################
# Subroutines for Example 6-1
sub addACGT {
my $dna = shift @_;
$dna .= 'ACGT';
return $dna;
}
Checkpoints
– & calls a subroutine; it can be omitted but helps syntax highlighting (e.g., in vi)
– sub subroutine definition
– my $dna scoping issue
– @_ argument array, a special variable in Perl
– return return value
112
Perl
Scoping
Make the variables specific to the subroutine with my
– my is a keyword in Perl that limits variables to the block where they are declared (here the block is the subroutine)
Making variables local to a restricted part of a program is called scoping. There are different models of scoping:
– in Perl, using my variables is known as lexical scoping, also known as static scoping
– dynamic scoping is hard to track; use it as little as possible
Manipulating $_[0] directly will change the caller's argument (call by reference)
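The difference between working on a my copy and working on $_[0] can be shown with two small subroutines. The names append_copy and append_in_place are made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Works on a private (lexically scoped) copy: the caller's variable is untouched.
sub append_copy {
    my ($dna) = @_;
    $dna .= 'ACGT';
    return $dna;
}

# Works on $_[0], which is an alias of the caller's variable.
sub append_in_place {
    $_[0] .= 'ACGT';
    return;
}

my $dna    = 'CGA';
my $longer = append_copy($dna);
print "$dna $longer\n";    # CGA CGAACGT

append_in_place($dna);
print "$dna\n";            # CGAACGT
```

Copying the arguments with my at the top of a subroutine is the usual habit, because changes through $_[0] are easy to overlook at the call site.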
113
Any Questions?
114
About Perl subroutine
How to
115
Return multiple values?
answer#1: return ( 'a', 'b' );
answer#2: $_[0] = 'a'; $_[1] = 'b';
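Both answers can be tried in a short sketch. The subroutines below count G/C and A/T bases; their names are hypothetical, and the empty replacement list in tr is used only for counting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Answer #1: return a list and unpack it at the call site.
sub count_gc_at {
    my ($dna) = @_;
    my $gc = ( $dna =~ tr/GC// );    # tr with an empty replacement only counts
    my $at = ( $dna =~ tr/AT// );
    return ( $gc, $at );
}

# Answer #2: write the results into the caller's variables through @_,
# whose elements are aliases of the arguments.
sub count_gc_at_into {
    my $dna = $_[0];
    $_[1] = ( $dna =~ tr/GC// );
    $_[2] = ( $dna =~ tr/AT// );
    return;
}

my ( $gc, $at ) = count_gc_at('GGCAT');
print "GC=$gc AT=$at\n";      # GC=3 AT=2

my ( $gc2, $at2 );
count_gc_at_into( 'GGCAT', $gc2, $at2 );
print "GC=$gc2 AT=$at2\n";    # GC=3 AT=2
```

Returning a list is usually easier to read; writing through @_ hides the fact that the arguments are outputs.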