BioSci D145 Lecture #3blumberg-serv.bio.uci.edu/biod145-w2020/biod145-lecture3... · 2020. 1. 22. · • Enumerate 2-3 specific aims in the form of questions that test your hypothesis

BioSci D145 lecture 1 page 1 ©copyright Bruce Blumberg 2020. All rights reserved

BioSci D145 Lecture #3

• Bruce Blumberg ([email protected]) – 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment) – phone 824-8573

• TA – Angela Kuo ([email protected])

– 4311 Nat Sci 2– office hours W 10-12 – Phone 824-6873

• check e-mail regularly for announcements, etc.

• Lectures will be posted in advance (without answers)

• Updated lectures (with answers) will be posted after lecture

– http://blumberg-lab.bio.uci.edu/biod145-w20120

• Don’t forget to discuss term paper topics with me in office hours or by email
• Last year’s midterm is posted


Term paper specific aims

• Title of your proposal
• A paragraph introducing your topic and explaining why it is important; i.e., what impact will the knowledge gained have?
  – Why should any funding agency give you money to pursue this research?
    • NIH now requires a statement of human health relevance for all grant applications
    • NSF wants to know the intellectual merit of your proposed research and its broader impacts
• Present your hypothesis
  – A supposition or conjecture put forth to account for known facts; esp. in the sciences, a provisional supposition from which to draw conclusions that shall be in accordance with known facts, and which serves as a starting-point for further investigation by which it may be proved or disproved and the true theory arrived at.
• Enumerate 2-3 specific aims in the form of questions that test your hypothesis
  – At least one of these aims needs to have a strong “whole genome” component
  – This is not a review article; propose something new.



Isothermal amplification – the solution to template preparation

• How to make template preparation faster, easier and more reliable?
  – Eliminate automation requirement, amplify starting material in some other way
  – Φ29 DNA polymerase (aka TempliPhi) – https://youtu.be/CaFq9cnfTZI
  – Enzyme has high processivity and strand displacement activity
    • Isothermal reaction produces huge quantities of DNA from tiny amount of input
    • More efficient than PCR (no temp change, no machine, no cleanup)


Modern DNA sequence analysis

• Cycle sequencing
  – Virtually all routine DNA sequencing today is done by cycle sequencing with fluorescent ddNTPs
    • ABI Big Dye chemistry
  – Template preparation still tedious for small scale
    • TempliPhi used in genome centers (no need for most automation)
  – Capillary sequencers predominant for small scale sequencing
    • Retrogen and similar companies
• But next generation sequencing has already rapidly displaced old technology in genome centers.
  – 454 sequencing (Roche)
  – Solexa (Illumina) *dominant player at the moment*
  – SOLiD (Applied Biosystems) (dead technology due to poor support)
• 3rd generation sequencing (individual DNA molecule) now available
  – e.g., Pacific Biosciences (sequence reads of 1,000-10,000 bases)
  – Oxford Nanopore Technologies (read length 5 kb-200 kb)


DNA sequence analysis

• Landmarks in DNA sequencing
  – Sanger, Nicklen and Coulson. Sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. 74, 5463-5467 (1977).
  – Sanger, F. et al. The nucleotide sequence of bacteriophage ΦX174. J Mol Biol 125, 225-46 (1978).
  – Sutcliffe, J. G. Complete nucleotide sequence of the Escherichia coli plasmid pBR322. Cold Spring Harb Symp Quant Biol 43, 77-90 (1979).
  – Sanger et al. Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol 162, 729-73 (1982).
  – Messing, J., Crea, R. & Seeburg, P. H. A system for shotgun DNA sequencing. Nucl. Acids Res 9, 309-21 (1981).
  – Anderson, S. et al. Sequence and organization of the human mitochondrial genome. Nature 290, 457-65 (1981).
  – Deininger, P. L. Random subcloning of sonicated DNA: application to shotgun DNA sequence analysis. Anal Biochem 129, 216-23 (1983).
  – Baer et al. DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature 310, 207-11 (1984). (189 kb)
  – Innis et al. DNA sequencing with Taq DNA polymerase and direct sequencing of PCR-amplified DNA. Proc. Natl. Acad. Sci. 85, 9436-9440 (1988).


DNA sequence analysis (contd)

• Landmarks in DNA sequencing (contd)
  – 1995 - Haemophilus influenzae (1.83 Mb)
    • first bacterium sequenced, human pathogen
  – 1995 - Mycoplasma genitalium (0.58 Mb)
    • smallest free living organism
  – 1996 - Saccharomyces cerevisiae genome (13 Mb)
  – 1996 - Methanococcus jannaschii (1.66 Mb)
    • first Archaebacterium
  – 1997 - Escherichia coli (4.6 Mb)
  – 1997 - Bacillus subtilis (4.2 Mb)
  – 1997 - Borrelia burgdorferi (1.44 Mb)
    • Lyme disease
  – 1997 - Archaeoglobus fulgidus (2.18 Mb)
    • first sulfur metabolizing bacterium
  – 1997 - Helicobacter pylori (1.66 Mb)
    • first bacterium proven to cause cancer


DNA sequence analysis (contd)

• Landmarks in DNA sequencing (contd)
  – 1998 - Treponema pallidum (1.14 Mb)
  – 1998 - Caenorhabditis elegans genome (97 Mb)
  – 1999 - Deinococcus radiodurans (3.28 Mb)
    • resistant to radiation, starvation, ox stress
  – 2000 - Drosophila melanogaster (120 Mb)
  – 2000 - Arabidopsis thaliana (115 Mb)
  – 2001 - Escherichia coli O157:H7 (4.1 Mb)
    • Pathogenic variant of E. coli
  – 2001 – draft Human “genome”
  – 2002 – mouse genome
  – 2002 – Ciona intestinalis
    • Primitive chordate
  – 2003 – “complete” human genome
  – 2004 – rat genome
  – 2006 – Human “genome” complete sequence of all chromosomes
  – 2010 – Neanderthal genome sequenced
  – 2012 – Denisovan genome sequenced


DNA Sequence analysis

• Complete DNA sequence (all nts both strands, no gaps)
  – complete sequence is desirable but takes time
    • how long depends on size and strategy employed
  – which strategy to use depends on various factors
    • how large is the clone?
      – cDNA
      – genomic
    • How fast is sequence required?
• sequencing strategies
  – primer walking
  – cloning and sequencing of restriction fragments
  – progressive deletions
    • Bidirectional, unidirectional
  – Shotgun sequencing
    • whole genome
    • with mapping
      – map first (C. elegans)
      – map as you go (many)


DNA Sequence analysis (contd)

• Primer walking - walk from the ends with oligonucleotides
  – sequence, back up ~50 nt from end, make a primer and continue
• Why back up?
  – Need to see overlap to be sure about sequence you are reading
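The walking arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a protocol: the 700 nt read length is an assumed value, the 50 nt back-up follows the slide, and `walking_rounds` is a hypothetical helper name.

```python
import math

def walking_rounds(insert_len, read_len=700, backup=50):
    """Number of sequencing rounds needed to walk one strand of an insert.

    Each new primer is placed `backup` nt before the end of the previous
    read, so every round after the first adds only (read_len - backup) nt.
    """
    if insert_len <= read_len:
        return 1
    net_gain = read_len - backup
    return 1 + math.ceil((insert_len - read_len) / net_gain)

# A 3 kb insert read from one end (read length is an assumed value)
print(walking_rounds(3000))
```

Because each round has to wait for the previous read before the next primer can be designed, the number of rounds is what sets the pace of the method.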



• Primer walking (contd)
  – advantages
    • very simple
    • no possibility to lose bits of DNA
      – restriction mapping
      – deletion methods
    • no restriction map needed
    • best choice for short DNA
  – disadvantages
    • slowest method
      – about a week between sequencing runs
    • oligos are not free (and not reusable)
    • not feasible for large sequences
  – applications
    • cDNA sequencing when time is not critical
    • targeted sequencing
      – verification
      – closing gaps in sequences



• Cloning and sequencing of restriction fragments
  – once the most popular method
    • make a restriction map, subclone fragments
    • sequence
  – advantages
    • straightforward
    • directed approach
    • can go quickly
    • cloned fragments often useful otherwise
      – RNase protection, nuclease mapping, in situ hybridization
  – disadvantages
    • possible to lose small fragments
      – must run high quality analytical gels
    • depends on quality of restriction map
      – mistaken mapping -> wrong sequence
    • restriction site availability
  – applications
    • sequencing small cDNAs
    • isolating regions to close gaps


DNA Sequence analysis (contd)

• nested deletion strategies - sequential deletions from one end of the clone
  – cut, close and sequence
    • Approach
      – make restriction map
      – use enzymes that cut in polylinker and insert
      – Religate, sequence from end with restriction site
      – repeat until finished, filling in gaps with oligos
    • advantages
      – Fast, simple, efficient
    • disadvantages
      – limited by restriction site availability in vector and insert
      – need to make a restriction map


• nested deletion strategies (contd)
  – Exonuclease III-mediated deletion
    • cut with polylinker enzyme
      – protect ends
        » 3’ overhang
        » phosphorothioate
    • cut with enzyme between first cut and the insert
      – can’t leave 3’ overhang
    • timed digestions with Exonuclease III
    • stop reactions, blunt ends
    • ligate and size select recombinants
    • sequence
    • advantages
      – unidirectional
      – processivity of enzyme gives nested deletions




• Nested deletion strategies
  – Exonuclease III-mediated deletion (contd)
    • disadvantages
      – need two unique restriction sites flanking insert on each side
      – best used successively to get > 10 kb total deletions
      – may not get complete overlaps of sequences
        » fill in with restriction fragments or oligos
    • applications
      – method of choice for moderate size sequencing projects
        » cDNAs
        » genomic clones
      – good for closing larger gaps
• Small-scale sequence analysis – how is it practiced today?
  – Primer walking
  – ExoIII-mediated deletion with primer walking


Genome sequencing

• The problem
  – Genome sizes for most eukaryotes are large (10^8-10^9 bp)
  – High quality sequences only about 600-800 bp per run (Sanger)
  – Nextgen sequences ~150 bp/read
• The solution
  – Break genome into lots of bits and sequence them all
  – Reassemble with computer
• The benefit
  – Rapid increase in information about genome size, gene comparisons, etc.
• The cost
  – 3 x 10^9 bp (human haploid genome) ÷ 600 bp/reaction = 5 x 10^6 reactions for 1x coverage!
  – Need both strands (x2), need overlaps and need to be sure of sequences
  – ~10^7-10^8 reactions/runs required for a human-sized genome
  – About $1-2 per reaction these days, ~$8 commercially.
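The cost estimate is simple division, and a short Python sketch makes it easy to try other genome sizes or read lengths. The 10x coverage used below is an assumed target chosen for illustration (the slide only states that more than 1x is needed), and `reactions_needed` is a hypothetical helper.

```python
def reactions_needed(genome_bp, read_bp, coverage=1.0):
    """Reads required to reach a given average coverage."""
    return genome_bp * coverage / read_bp

human = 3e9      # haploid human genome, bp
sanger = 600     # usable bases per Sanger reaction

one_x = reactions_needed(human, sanger)        # reactions for 1x coverage
ten_x = reactions_needed(human, sanger, 10)    # reactions for an assumed 10x
print(f"1x: {one_x:.1e} reactions, 10x: {ten_x:.1e} reactions")

# Cost range at the quoted $1-2 per reaction, for the assumed 10x coverage
low, high = ten_x * 1, ten_x * 2
print(f"10x cost: ${low:.1e} to ${high:.1e}")
```

At 10x the count lands inside the 10^7-10^8 range given on the slide.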


Genome sequencing (contd)

• Shotgun sequencing NOT invented by Craig Venter
  – Messing 1981 first description of shotgun sequencing
  – Sanger lab developed current methods in 1983
  – approach
    • blast genome into small chunks (https://youtu.be/ihPEvtPuc30)
    • clone these chunks
      – 3-5 kb, 8 kb plasmid
      – 40 kb fosmid: jump repetitive sequences
    • sequence + assemble by computer
  – A priori difficulties
    • how to get nice uniform distribution
    • how to assemble fragments
    • what to do about repeats?
    • How to minimize sequence redundancy?
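To make the "assemble by computer" step concrete, here is a toy greedy overlap assembler in Python. It is only a sketch under strong assumptions (error-free reads, one strand, no repeats) and is not how production assemblers work; `overlap` and `greedy_assemble` are hypothetical names.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: remaining contigs are separated by gaps
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs

# Three overlapping toy reads from one short sequence
reads = ["ATGGCGTAC", "GTACGTTAG", "CGTTAGC"]
print(greedy_assemble(reads))
```

A repeat longer than the reads gives several equally good overlaps, and the greedy choice can then join the wrong pieces, which is the "what to do about repeats?" difficulty listed above.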






Genome sequencing (contd)

• Shotgun sequencing (contd)
  – How to minimize sequence redundancy?
    • Best way to minimize redundancy is map before you start
      – C. elegans was done this way - when the sequence was finished, it was FINISHED
        » mapping took almost 10 years
      – mapping much too tedious and nonprofitable for Celera
        » who cares about redundancy, let’s sequence and make $$
        » There is scientific value to draft genomes, too.
    • why does redundancy matter?
      – Finished sequence today costs about $0.50/base
      – Note that 10x, 99.995% coverage leaves at least 150 kb unsequenced
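The 150 kb figure is 0.005% of a 3 x 10^9 bp genome. The sketch below reproduces that number and, as an added assumption not stated on the slide, compares it with the Poisson (Lander-Waterman style) estimate in which the fraction of bases never sampled at average depth c is e^(-c).

```python
import math

GENOME_BP = 3e9  # haploid human genome

def unsequenced_bp(fraction_covered, genome_bp=GENOME_BP):
    """Bases left unsequenced at a given fractional coverage."""
    return (1.0 - fraction_covered) * genome_bp

def poisson_fraction_covered(depth):
    """Expected fraction of bases hit at least once at average depth `depth`.

    Assumes reads land uniformly at random, which real libraries only
    approximate.
    """
    return 1.0 - math.exp(-depth)

print(round(unsequenced_bp(0.99995)))          # bases missed at 99.995% coverage
print(round(poisson_fraction_covered(10), 5))  # Poisson estimate at 10x depth
```

Cloning bias and repeats make real gaps larger than the random-sampling estimate, which is consistent with the slide's "at least".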


Other sequencing technologies

• Sequencing by hybridization
  – Construct a high-density microchip with all possible combinations of a short oligonucleotide
    • Up to 25-mers
    • By photolithography
      – Synthesized on chip directly
  – Label and hybridize fragment to be sequenced
  – Wash stringently
  – Read fluorescent spots
  – Reconstruct sequence by computer
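The "reconstruct sequence by computer" step can be illustrated with a toy Python sketch. It treats the set of fluorescent spots as the k-mer spectrum of the target and chains k-mers that overlap by k-1 bases. The 5-mers, the helper names, and the requirement that every (k-1)-mer in the target be unique are all simplifying assumptions for illustration.

```python
def spectrum(seq, k):
    """All k-mers present in `seq`: the probes that would light up on the chip."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def reconstruct(kmers):
    """Rebuild a sequence from its k-mer spectrum.

    Assumes every (k-1)-mer occurs only once in the target, so each k-mer
    has at most one successor. Repeats break this assumption.
    """
    kmers = set(kmers)
    by_prefix = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    suffixes = {kmer[1:] for kmer in kmers}
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1 or any(len(v) > 1 for v in by_prefix.values()):
        raise ValueError("spectrum is ambiguous (repeat or circular sequence)")
    current = starts[0]
    seq = current
    for _ in range(len(kmers) - 1):
        nxt = by_prefix.get(current[1:])
        if not nxt:
            break
        current = nxt[0]
        seq += current[-1]
    return seq

target = "ATGGCGTACGTTAGC"
print(reconstruct(spectrum(target, 5)) == target)
```

The ambiguity check shows why the approach struggles with de novo sequencing: any repeat of length k-1 or more gives a k-mer two possible successors.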


Other sequencing technologies (contd)

• Sequencing by hybridization rarely used for de novo sequencing
  – Extremely fast and useful to sequence something you already know the sequence of but want to identify mutation - resequencing
  – Disease causing changes
    • e.g., in mitochondrial DNA
  – SNP discovery
  – Works best for examining sequence of <10 kb


Other sequencing technologies (contd)

• https://www.thermofisher.com/us/en/home/life-science/microarray-analysis/affymetrix.html?navMode=35810&aId=productsNav
• SNP discovery
  – Photo shows mitochondrial chip
  – Right panel shows pairs of normal (top) vs disease (bottom) (Leber’s Hereditary Optic Neuropathy)
    • Top 3 disease mutations
    • Bottom control with no change


Other sequencing technologies – Next Generation sequencing

• 2nd generation = high throughput, short sequences
• 3rd generation = single molecule sequencing
  – Small number of sequence templates (thousands) but very long reads (~10^5 bp)
• What is the immediate implication of this technology for genome assembly?
  – We should now be able to completely sequence large insert clones directly and avoid fragmentation by repetitive elements!
• See Metzker, M.L. (2010) Sequencing technologies — the next generation, Nature Reviews Genetics 11, 31-46.



• Illumina (Solexa) sequencing
  – https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf
  – Based on synthesis of complementary strand to a template (like Sanger)
    • Detection of elongation with labeled terminators
  – Steps
    • Library generation - fragment genome to appropriate size (depends on application) and add adapters to each end
    • Cluster generation – capture fragments on lawn of oligos and amplify
    • Sequencing – reversible terminator
    • Data analysis
      – align reads to reference genome
      – Analysis of reads



• Illumina sequencing (contd)
  – Library preparation – fragment target and add adapters.
    • Can multiplex to gain additional capacity
    • That is, HiSeq X can generate 1.8 Tb sequence per run, but we don’t need this much for most applications, so use different adapters to “bar-code” samples.
    • This way, you can get many sequences from one run and then deconvolute them
    • also has advantage of removing batch effects
      – Can directly compare all sequences with each other because they come from same run of machine.
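The "deconvolute" step is a lookup from barcode to sample, sketched below in Python. The 4 nt barcodes, sample names and reads are invented for illustration, and the sketch assumes the barcode sits at the start of each read and must match exactly. Production pipelines typically read the index separately and tolerate mismatches.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Sort reads into per-sample bins by their leading barcode.

    `barcodes` maps barcode sequence -> sample name (all the same length).
    Reads whose prefix matches no barcode exactly go to 'undetermined'.
    """
    bc_len = len(next(iter(barcodes)))
    bins = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:bc_len], "undetermined")
        # strip the barcode from assigned reads; keep unassigned reads whole
        bins[sample].append(read[bc_len:] if sample != "undetermined" else read)
    return dict(bins)

barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}   # hypothetical barcodes
reads = ["ACGTGGATCC", "TGCATTTAAA", "ACGTCCCGGG", "GGGGAAAATT"]
result = demultiplex(reads, barcodes)
print(result)
```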





• Bar coding sequence analysis

Popular deep sequencing technologies - 2020


Popular deep sequencing technologies – PacBio SMRT sequencing






Popular deep sequencing technologies – Oxford Nanopore

https://nanoporetech.com/applications/dna-nanopore-sequencing
https://youtu.be/CGWZvHIi3i0



• Deep sequencing - what is the point?
  – Can generate huge number of reads in parallel
    • iSeq100 – 1.2 Gb (4 million reads/run, 2 x 150 bp)
    • MiniSeq – 7.5 Gb (25 million reads/run, 2 x 150 bp)
    • MiSeq – 15 Gb (15 million reads/run, 2 x 300 bp)
    • NextSeq – 120 Gb (400 million reads/run, 2 x 150 bp)
    • HiSeq – 1.5 Tb (5 billion reads/run, 2 x 150 bp)
    • HiSeq X – 1.8 Tb (6 billion reads/run, 2 x 150 bp)
    • NovaSeq – 6.0 Tb (20 billion reads/run, 2 x 150 bp)
• What is massively parallel sequencing good for?
  – Rapid sequencing of genomes, or resequencing of known sequences
  – Ancient DNA (even dinosaurs?)
    • Probably not, 1.5 million years appears to be upper limit
  – ChIP-sequencing
  – Sequencing ESTs or other tags
  – Determining microbial diversity in field samples
  – Transcriptome sequencing
  – Single cell sequencing
  – Identification of infrequent variants in large populations (e.g., viruses)
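The per-run yields in the platform list follow from reads x read length x 2 (paired-end). The Python sketch below checks four of the listed entries and converts each to nominal fold coverage of a 3 x 10^9 bp genome. That conversion is an added illustration, and `run_yield_gb` is a hypothetical helper. The MiSeq entry as listed (15 million reads, 2 x 300 bp) gives 9 Gb under this rule rather than 15 Gb, so the listed figures are best read as nominal.

```python
def run_yield_gb(reads_per_run, read_len, paired=True):
    """Nominal bases per run in Gb: reads x read length x 2 for paired-end."""
    return reads_per_run * read_len * (2 if paired else 1) / 1e9

platforms = {          # (reads per run, read length in bp), from the list above
    "iSeq100": (4e6, 150),
    "MiniSeq": (25e6, 150),
    "NextSeq": (400e6, 150),
    "NovaSeq": (20e9, 150),
}
HUMAN_BP = 3e9
for name, (reads, length) in platforms.items():
    gb = run_yield_gb(reads, length)
    print(f"{name}: {gb:g} Gb, ~{gb * 1e9 / HUMAN_BP:.1f}x human genome")
```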


Amplicon sequencing

• Idea is to sequence many copies of the same thing
  – Gene sequence
  – mRNA transcript


Amplicon sequencing (contd)

• What is amplicon sequencing good for?

– Discovery of rare somatic mutations in complex samples (e.g., cancerous tumors - mixed with germline DNA) based on ultra-deep sequencing of amplicons

– Sequencing collections of exons from populations of individuals to identify diversity

– Sequencing collections of human exons from populations of individuals for the identification of rare alleles associated with disease

– Analysis of viral quasi-species present within infected populations in the context of epidemiological studies (find virulent mutations in population)

– Evolutionary biology in populations
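Rare-variant discovery by ultra-deep amplicon sequencing comes down to counting bases at each position across many reads of the same amplicon. The Python sketch below uses invented toy reads and assumes they are already aligned to the same start with no indels. A real analysis would also need an aligner, quality filtering, and a way to separate true variants from sequencing errors.

```python
from collections import Counter

def allele_frequencies(reads, position):
    """Fraction of reads carrying each base at a 0-based amplicon position.

    Assumes all reads start at the same amplicon coordinate and contain
    no insertions or deletions.
    """
    counts = Counter(read[position] for read in reads if len(read) > position)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

# Toy data: 1 variant read among 1000 copies of the same amplicon
reads = ["ACGTACGT"] * 999 + ["ACGAACGT"]
freqs = allele_frequencies(reads, 3)
print(freqs)
```

The deeper the sequencing of the amplicon, the lower the variant frequency that can be counted at all, which is why amplicon work relies on very high read depth.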

The human genome

• On Feb 12, 2001, Celera and the Human Genome Project published “draft” human genome sequences
  – Celera -> 39114
  – Ensembl -> 29691
  – Consensus from all sources ~30K
• Number of genes
  – C. elegans – 19,000
  – Arabidopsis - 25,000
• Predictions had been from 50-140k human genes
  – What’s up with that?
  – Are we only slightly more complicated than a weed?
  – How can we possibly get a human with less than 2x the number of genes as C. elegans?
  – Implications?
    • UNRAVELING THE DNA MYTH: The spurious foundation of genetic engineering, Barry Commoner, Harper’s Magazine, Feb 2002


The human genome

• The answer
  – Gene sets don’t overlap completely (duh)
  – Floor is 42K (= 42113)
  – 130029 UniGene clusters in build #236 (from EST and mRNA sequencing), the “final” count
  – http://www.ncbi.nlm.nih.gov/unigene
  – Up from 123,459 in 2013 (85,793, 105,680, 128,826, 123,891 previous years)
• Important questions to be answered about what constitutes a “gene”
  – Crick genes? DNA-RNA-protein
  – How about RNAs?
  – miRNAs?
  – Antisense transcripts?
  – lncRNAs?



– Whole genome shotgun sequencing (Celera)
  • premise is that rapid generation of draft sequence is valuable
  • why bother trying to clone and sequence difficult regions?
    – Basically just forget regions of repetitive DNA - not cost effective
  • using this approach, genomes rarely are completely finished
    – rule of thumb is that it takes at least as long to finish the last 5% as it took to get the first 95%
  • problems
    – sequence may never be complete as is C. elegans
    – much redundant sequence with many sparse regions and lots of gaps.
    – Fragment assembly for regions of highly repetitive DNA is dubious at best
    – “Finished” fly and human genomes lack more than a few already characterized genes


Genome sequencing (contd)

• Knowing what we know now – how to approach a large new genome?
  – Xenopus tropicalis 1.7 Gb (about ½ human)
  – BAC end sequencing
  – Whole genome shotgun
  – HAPPY mapping and radiation hybrid mapping to order scaffolds
  – Gaps closed with BACs
  – 8.5 x coverage (but > 9000 scaffolds for 18 chromosomes)
• 2019 update – now version 10.0
  – FINALLY integrated BAC end sequences and genetic map
  – 99.86% of genome mapped to chromosomes
    • 167 scaffolds, ~150 Mbp, 10 chromosomes
  – ~45k protein coding genes
• Xenopus laevis – v9.2
  – >90% of genome in chromosomal scaffolds
  – 2 “subgenomes” fully characterized.


Comparison of typical model organisms used in biomedical research


Evolutionary trees for model organisms


