WHOLE GENOME ANALYSIS R.Priyanka M.Sc Biotechnology

Page 1: Whole genome analysis

WHOLE GENOME ANALYSIS

R. Priyanka, M.Sc Biotechnology

Page 2: Whole genome analysis

A genome is the entire complement of genetic material of an organism, virus or organelle; in a eukaryotic organism, it is the haploid set of chromosomes.

Whole genome is the complete genome set of an organism.

Whole genome sequencing is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time.

Page 3: Whole genome analysis

Whole genome analysis = Whole genome sequencing + Bioinformatics

The term "sequence analysis" implies subjecting a DNA or peptide sequence to sequence alignment, sequence database searches, repeated sequence searches, or other bioinformatics methods on a computer. Sequence analysis in bioinformatics is an automated, computer-based examination of characteristic fragments, e.g. of a DNA strand.

Page 4: Whole genome analysis

WHOLE GENOME ANALYSIS

Genomic analysis is the identification, measurement or comparison of genomic features such as DNA sequence, structural variation, gene expression, or regulatory and functional element annotation at a genomic scale. Methods for genomic analysis typically require high-throughput sequencing or microarray hybridization and bioinformatics.

Page 5: Whole genome analysis

History of Sequencing

• Allan Maxam and Walter Gilbert developed a chemical method of DNA sequencing in 1976-1977.

• This method was technically complex and relied on extensive use of hazardous chemicals, so it fell out of favor.

• Sanger and Coulson developed the chain-termination method in 1977.

• It can only be used for fairly short strands (100 to 1000 base pairs); longer sequences must be subdivided into smaller fragments.

Page 6: Whole genome analysis

History of genome sequencing

• Bacteriophage φX174 was the first DNA genome to be sequenced, in 1977 by Sanger and colleagues, with 11 genes and 5,386 base pairs (bp). It is a polyhedral virus whose genome is smaller than those of the T phages.

• Haemophilus influenzae was the first bacterial genome to be sequenced, in 1995 by J. Craig Venter's team. It is a Gram-negative bacterium with a genome of 1.8 million bp.

• The first nearly complete human genomes sequenced were those of J. Craig Venter, James Watson, a Yoruban individual and Seong-Jin Kim.

Page 7: Whole genome analysis

WHY WHOLE GENOME SEQUENCING?

• Information about the coding and non-coding parts of an organism's genome.
• To find out important pathways in microbes.
• For evolutionary study and species comparison.
• For more effective personalized medicine (why a drug works for person X and not for Y).
• Identification of important secondary metabolite pathways (e.g. in plants).
• Disease-susceptibility prediction based on gene sequence variation.

Page 8: Whole genome analysis

STEPS OF GENOME ANALYSIS

1) Genome sequence assembly
2) Identify repetitive sequences – mask out
3) Gene prediction – train a model for each genome
4) Genome annotation – process of attaching biological information to sequence
5) Metabolic pathways and regulation – to find missing genes
6) Protein 2D gel electrophoresis – to detect translational product
7) Functional genomics
8) Gene location/gene map
9) Self-comparison of proteome
10) Comparative genomics
11) Identify clusters of functionally related genes
12) Evolutionary modeling – to analyze chromosomal arrangement and duplications; predictions can be made

Page 9: Whole genome analysis

Timeline of Large-Scale Genomic Analysis

Page 10: Whole genome analysis

GENE PREDICTION & RECOGNITION

• Easy for prokaryotes (single cell) – one gene, one protein

• More difficult for eukaryotes (multicell) – one gene, many proteins

• Very difficult for Human – short exons separated by non-coding long introns

• Gene recognition is by sequence alignment
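The point about prokaryotic genes being comparatively easy to find can be illustrated with a naive open reading frame (ORF) scan. This is only a sketch under simplifying assumptions that are not on the slide: it looks at the forward strand only, treats ATG as the only start codon, and ignores codon usage, ribosome binding sites and introns. The function name `find_orfs` and the example sequence are hypothetical.

```python
# Naive ORF scan: ATG ... stop codon, in the three forward reading frames.
# Illustrative only; real gene predictors use trained statistical models.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) index pairs of ORFs, stop codon included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # open a candidate gene at the first ATG
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and keep scanning this frame
    return sorted(orfs)


example = "CCATGAAATTTGGGTAACC"  # made-up sequence
for begin, end in find_orfs(example):
    print(begin, end, example[begin:end])
```

In a eukaryotic genome the same scan would break at every intron, which is one reason the slide rates human gene recognition as very difficult.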

Page 11: Whole genome analysis

Human Genome data

For example, the human genome contains 3.1 × 10⁹ bp.

Difficulties:
• Small genes are hard to identify
• Some genes are rarely expressed and do not have normal codon usage patterns – thus hard to detect

Page 12: Whole genome analysis

Human Genome Data

© American Society for Investigative Pathology

Time period | Turn-around time | Cost per genome
1990-2003 | ~5 years | ~$3 billion
2003-2009 | ~6 months | $300,000
2010-2014 | < 1 month | $3,800/exome, $20,000/WGS
2015 | 15 minutes | $100

Page 13: Whole genome analysis

NEXT GENERATION SEQUENCING

• Recently a number of faster and cheaper sequencing methods have been developed.

– In October 2006, the X Prize Foundation, working in collaboration with the J. Craig Venter Science Foundation, established the Archon X Prize for Genomics, intending to award US $10 million to "the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 1,000,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $1,000 per genome". An error rate of 1 in 1,000,000 bases, out of a total of approximately six billion bases in the human diploid genome, would mean about 6,000 errors per genome. In clinical use, such as predictive medicine, currently 1,400 clinical single gene sequencing tests are over ... The prize was cancelled in August 2013.

– Methods are being developed that aim to sequence the entire human genome for $1000, to allow personal genomics.

– One of the most widely used new methods combines the pyrosequencing biochemical reactions (invented by Nyrén and Ronaghi in 1996) with the massively parallel microfluidics technology invented by the 454 Life Sciences company. This combined technology is called "454 sequencing".
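The error estimate quoted for the X Prize accuracy target is a one-line calculation. The sketch below simply reproduces it from the two figures given in the text (six billion bases, one error per 1,000,000 bases); the function and constant names are hypothetical.

```python
# Expected number of errors in one genome at a given accuracy target.
def expected_errors(total_bases, bases_per_error):
    """Integer count of errors when one base in `bases_per_error` is wrong."""
    return total_bases // bases_per_error


DIPLOID_GENOME_BASES = 6_000_000_000  # approximate figure from the text
PRIZE_BASES_PER_ERROR = 1_000_000     # "one error in every 1,000,000 bases"

print(expected_errors(DIPLOID_GENOME_BASES, PRIZE_BASES_PER_ERROR))  # 6000
```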

Page 14: Whole genome analysis

NEXT GENERATION SEQUENCING

Platforms: Roche's 454 FLX, Ion Torrent, Illumina, ABI's SOLiD

• Array based sequencing
• Sequence the full genome of an organism in a few days at a very low cost.
• Produce high-throughput data in the form of short reads.

Page 15: Whole genome analysis

Towards ‘Next Generation’ sequencing instruments

• Capacity greater than one gigabase per run
• Drastic decrease in costs per genome
• Sequencing of whole bacterial genomes in a single run
• Sequencing genomes of individuals
• Metagenomics: sequencing DNA extracted from environmental samples
• Looking for rare variants in a single amplified region, in tumors or viral infections
• Transcriptome sequencing: total cellular mRNA converted to cDNA

Page 16: Whole genome analysis

PYROSEQUENCING

• This technique sequences DNA using chemiluminescent enzymatic reactions.

Step 1: A single-stranded DNA template is prepared by alkali denaturation, and a dNTP is added.

Step 2: In DNA synthesis, a dNTP is attached to the 3' end of the growing DNA strand. DNA polymerase elongates the strand using dNTPs. The two terminal phosphates of each incorporated dNTP are released as pyrophosphate (PPi).

Page 17: Whole genome analysis

Samples are collected and a library is prepared, with adapters added to both ends of each fragment

Individual fragments are captured using the adapters

Roche’s 454 FLX

Page 18: Whole genome analysis

Fragments are amplified by PCR, and the picotiter plate is loaded (sample loading takes 8 h)

Sequencing is carried out and the data are analysed

Page 19: Whole genome analysis

Step 3: ATP sulfurylase is normally used in sulfur assimilation: it converts ATP and inorganic sulfate to adenosine 5'-phosphosulfate (APS) and PPi. In pyrosequencing the reaction runs in reverse, so the PPi released in Step 2 is converted, together with APS, into ATP. Luciferase is the enzyme that causes fireflies to glow. It uses luciferin and ATP as substrates, converting luciferin to oxyluciferin and releasing visible light.

Page 20: Whole genome analysis

• The four dNTPs are added one at a time.

• The amount of light released is proportional to the number of nucleotides added to the new DNA strand. Thus, if the sequence has 2 A's in a row, both get added and twice as much light is released.

Step 4: After the reaction has completed, apyrase is added to destroy any leftover dNTPs.

• The pyrosequencing machine cycles between the 4 dNTPs many times, building up the complete sequence. About 300 bp of sequence is possible (as compared to 800-1000 bp with Sanger sequencing).

Step 5: The light is detected with a charge-coupled device (CCD) camera.

Pyrosequencing method
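Because the light signal is proportional to the number of identical nucleotides incorporated in one flow, the read can be reconstructed from the per-flow intensities. The sketch below assumes idealised signals normalised so that one incorporation gives about 1.0; the flow order, the signal values and the function name `flowgram_to_sequence` are all made up for illustration.

```python
# Turn a pyrosequencing "flowgram" (one light reading per dNTP flow) into bases.
def flowgram_to_sequence(flow_order, signals):
    """Round each signal to a homopolymer length and emit that many bases."""
    bases = []
    for nucleotide, signal in zip(flow_order, signals):
        count = round(signal)  # ~2.0 means two identical bases in a row
        bases.append(nucleotide * count)
    return "".join(bases)


flow_order = "TACG" * 2  # the four dNTPs are flowed one at a time, repeatedly
signals = [1.02, 1.96, 0.05, 0.98, 0.0, 1.1, 2.9, 0.1]
print(flowgram_to_sequence(flow_order, signals))  # TAAGACCC
```

Rounding is also where homopolymer errors come from: a reading of 2.5 for a long run of the same base cannot be called reliably as 2 or 3.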

Page 21: Whole genome analysis

• Sample preparation: Extracted and purified DNA.

• Tagmentation: a transposome fragments the DNA and tags the fragments with adapter sequences.

• Two different adapters are added, one on each end of the DNA, which is then bound to a slide (flow cell) coated with the complementary sequences for each adapter.

• This allows “bridge PCR”, producing a small spot of amplified DNA on the slide.

• The slide contains millions of individual DNA spots. The spots are visualized during the sequencing run, using the fluorescence of the nucleotide being added.

Illumina Sequencing

Page 22: Whole genome analysis

Illumina Sequencing Chemistry

• Cluster generation: process where each fragment is isothermally amplified.

• Reverse strands are cleaved and washed away while forward strands remain.

• This method uses the basic Sanger idea of "sequencing by synthesis" of the second strand of a DNA molecule. Starting with a primer, new bases are added one at a time, with fluorescent tags used to determine which base was added.

• The fluorescent tags block the 3’-OH of the new nucleotide, and so the next base can only be added when the tag is removed.

• The cycle is repeated 50-100 times.
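The cycle described above (add one blocked, tagged base, image it, remove the tag) can be mimicked in a few lines. This is a toy model, not the instrument's base-calling: it assumes a single error-free template written in the direction the polymerase reads it, and the name `sequence_by_synthesis` is hypothetical.

```python
# Toy sequencing-by-synthesis: exactly one complementary base per cycle.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, cycles):
    """Return the read produced after `cycles` incorporate/image/cleave rounds."""
    read = []
    for position in range(min(cycles, len(template))):
        # The 3'-OH block guarantees a single incorporation in this cycle.
        base = COMPLEMENT[template[position]]
        read.append(base)  # "imaging": the tag colour identifies the base
        # Cleaving the tag unblocks the strand for the next cycle.
    return "".join(read)


template = "TACGGTCA"  # made-up template
print(sequence_by_synthesis(template, cycles=5))  # ATGCC
```

Read length equals the number of cycles run, and because only one base is added per cycle, runs of identical bases are read one at a time rather than as a single stronger signal.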

Page 23: Whole genome analysis

SOLiD Sequencing

1. Fragment library – 2 types of fragments (single & mate-paired)
2. Ligation of adapters
3. Substrate preparation
4. Hybridization – clonal amplification
5. Emulsion PCR
6. Di-base probes (fluorescently labelled) are added
7. Fluorescence is measured

Page 24: Whole genome analysis

Emulsion preparation: water + capture beads + enzyme + DNA fragments + synthetic oil are vigorously shaken. Water droplets form around the beads, i.e. an emulsion.
• Each plate has 1.6 million wells.
• The plate is designed so that only one bead will fit in each well.

Page 25: Whole genome analysis
Page 26: Whole genome analysis

STEPS

Page 27: Whole genome analysis

CMOS- complementary metal oxide semiconductor

Page 28: Whole genome analysis

• The semiconductor chip contains millions of wells covering millions of pixels.
• The chip captures chemical information from the DNA sequence and translates it into digital information.
• DNA is cleaved into millions of fragments, each attached to its own bead, and the wells are flooded with DNA nucleotides.
• Each time a nucleotide is incorporated (e.g. G pairing with C), a hydrogen ion is released, changing the pH of the solution in the well.
• The chemical change is read on the chip by an ion-sensitive layer beneath the well.
• If the nucleotide is not complementary to the strand (e.g. G opposite T), no ion is released.
• This process occurs simultaneously in millions of wells.

Page 29: Whole genome analysis

CHIP MACHINE

Page 30: Whole genome analysis
Page 31: Whole genome analysis

PacBio

This is a Natural process

Page 32: Whole genome analysis

• Each SMRT (single molecule, real time) cell has 10,000 zero-mode waveguides.
• Step 1: DNA polymerase is immobilized at the bottom of each waveguide.
• Step 2: Phospholinked nucleotides are added.
• Step 3: Each nucleotide is labelled with a different coloured fluorophore.
• Step 4: The base is detected from the light pulse produced upon incorporation (up to 1000-fold).

Page 33: Whole genome analysis

Single molecule real time(SMRT) sequencing

Page 34: Whole genome analysis

STEPS

Page 35: Whole genome analysis
Page 36: Whole genome analysis

Instrument | PacBio | Ion Torrent | 454 | Illumina | SOLiD
Method | Single molecule in real time | Ion semiconductor | Pyrosequencing | Synthesis | Ligation
Read length | 3 kb | 200 bp | 700 bp | 50 to 250 bp | 50+35 or 50+50 bp
Error type | Indel | Indel | Indel | Substitution | A-T bias
Error % | 13 | ~1 | ~0.1 | ~0.1 | ~0.1
Reads per run | 35,000-75,000 | Up to 4M | 1M | Up to 3.2G | 1.2 to 1.4G
Time/run | 30 min to 2 h | 2 h | 24 h | 1-10 days | 1 to 2 weeks
Cost/million bases ($) | 2 | 1 | 10 | 0.05 to 0.15 | 0.13
Advantages | Longest read length & fast | Less expensive & fast | Long read size & fast | High sequence yield, cost, accuracy | Low cost per base
Disadvantages | Low yield at high accuracy; expensive | Homopolymer errors | Runs are expensive; homopolymer errors | Expensive | Slower than other methods; read lengths; longevity of platform
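The cost-per-million-bases row can be turned into a rough per-genome estimate. The sketch below uses the table's cost figures and the 3.1 × 10⁹ bp human genome size from an earlier slide; the 30-fold coverage is an assumption chosen for illustration, and the names are hypothetical. It ignores library preparation, instrument and analysis costs.

```python
# Rough reagent cost to sequence a human genome at a chosen coverage.
HUMAN_GENOME_BP = 3_100_000_000
COVERAGE = 30  # assumed depth, not from the slides

# (low, high) cost in US dollars per million bases, from the comparison table.
COST_PER_MILLION_BASES_USD = {
    "PacBio": (2.0, 2.0),
    "Ion Torrent": (1.0, 1.0),
    "454": (10.0, 10.0),
    "Illumina": (0.05, 0.15),
    "SOLiD": (0.13, 0.13),
}


def genome_cost_usd(genome_bp, coverage, cost_per_million_bases):
    """Cost of sequencing genome_bp * coverage bases at the given rate."""
    million_bases = genome_bp * coverage / 1_000_000
    return million_bases * cost_per_million_bases


for platform, (low, high) in COST_PER_MILLION_BASES_USD.items():
    low_cost = genome_cost_usd(HUMAN_GENOME_BP, COVERAGE, low)
    high_cost = genome_cost_usd(HUMAN_GENOME_BP, COVERAGE, high)
    print(f"{platform:12s} ${low_cost:,.0f} to ${high_cost:,.0f}")
```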

Page 37: Whole genome analysis

Equipment | Applications
454 | Whole genome sequencing, resequencing, metagenomics
Ion Torrent | Small de novo genome sequencing, amplicon sequencing, metagenomics, validation
Illumina | Small de novo genome sequencing, cytogenetic analysis, metagenomics, validation, transcriptome sequencing (RNA-Seq), whole exome sequencing, whole genome sequencing
SOLiD | Transcriptome sequencing (RNA-Seq), whole exome sequencing, whole genome sequencing
Pacific Biosciences | Small genomes, epigenomics

Page 38: Whole genome analysis

Assembly Problems

• There are random mutations (either naturally occurring cell-to-cell variation or generated by PCR or cloning).
• Sometimes the cloning vector itself gets sequenced. Getting rid of vector sequences is easy once the problem is recognized.
• Most genomes contain multiple copies of many sequences. Repeat sequence DNA is very common in eukaryotes. High quality sequencing is helpful.
• Sequencing errors, bad data, random mutations, misreadings.
• Data are produced in the form of short reads.
• Short reads contain low-quality bases and vector/adaptor contamination.
• Several genome assemblers are available, but their performance has to be compared to find the best one.
• Quality control
• Patent and licensing restrictions
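Two of the problems listed above, adaptor contamination and low-quality bases, are usually handled by cleaning reads before assembly. The following is a minimal sketch of that idea, assuming exact adaptor matches and per-base Phred scores given as integers. The adaptor string, the quality threshold of 20 and the function name `trim_read` are illustrative assumptions, and dedicated trimming tools do considerably more.

```python
# Clean one read: cut at the adaptor, then trim low-quality bases from the 3' end.
def trim_read(seq, quals, adapter, min_quality=20):
    """Return (sequence, qualities) after adaptor and quality trimming."""
    position = seq.find(adapter)
    if position != -1:
        # Everything from the adaptor onwards is not genomic sequence.
        seq, quals = seq[:position], quals[:position]
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1  # drop trailing bases below the threshold
    return seq[:end], quals[:end]


adapter = "AGATCGGAAG"  # illustrative adaptor sequence
read = "ACGTACGT" + adapter
qualities = [30, 30, 30, 30, 30, 30, 12, 8] + [25] * len(adapter)
print(trim_read(read, qualities, adapter))
```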

Page 39: Whole genome analysis

Short Reads

Platforms: 454 FLX, SOLiD, Illumina, Ion Torrent (low cost and less time)

Genomic Fragments

Page 40: Whole genome analysis
Page 41: Whole genome analysis

BENEFITS

• Treatments based on genomics
• Improve outcomes
• Faster diagnosis
• More precise prognosis
• Effective therapy
• Reduce healthcare costs

Page 42: Whole genome analysis

Applications

Oncology
• Determine the preferred therapeutic agent for each tumor
• Ascertain which patients are most likely to benefit from a given therapy

Molecular Pathology
• Disease-specific tests
• Cost of a single test: $100 - $5,000
• Individual test validation and performance

Medicine
• Human genomic sequence in public databases allows rapid identification of disease genes by positional cloning
• Inherited diseases can be identified

Page 43: Whole genome analysis