Bioinformatics-in-a-Box 04/18/15
Vermont Genetics Network
Professional Development Event
Pomeroy Alumni Center, St. Michael’s College, Colchester, VT
Faye D. Schilkey, National Center for Genome Resources
NCGR: National Center for Genome Resources
• Not-for-profit research organization
• Formed: 1994 in Santa Fe, NM
• Expertise: Bioinformatics (21 yrs) and Next Gen Sequencing (8 yrs)
• Applies bioinformatics, software engineering and next-generation sequencing to solve the -omic challenges of the 21st century
• Collaborative research and services
Faye D. Schilkey, BS Computer Engineering
• First Career: Software engineering in automotive (robotics) and aerospace (guidance and autopilot systems)
• Second Career (Big Data):
  • IT/Software Engineering/Database Development in Genomics/Bioinformatics (> 15 yrs)
  • Genome Sequencing Center Operations & Services (8 yrs)
  • Director, NM INBRE Bioinformatics Core (9 yrs)
  • Director, NM INBRE Sequencing & Bioinformatics Core (8 yrs)
  • Founding Steering Committee Member of Network of IDeA-funded Core Laboratories NICL (6 yrs)
Agenda
• NCGR
• NM-INBRE Sequencing & Bioinformatics Core (SBC) and IDeA research advancement
• Sequencing and bioinformatics technologies
• Bioinformatics-in-a-Box
• Collaboration/education avenues
  • Summer Bioinformatics Intensive Internship
  • NM Bioinformatics, Science and Technology (NMBIST) conference
• Sequencing and bioinformatics project ideas
• Conclusion and discussion
Research at NCGR
• Focus
  • Human health and nutrition
  • Plant science
• > 200 publications
AJ Brass Foundation
Human Health Research Publications at NCGR
Dengue virus infection (Virology 2015)
Vibrio cholerae (Genomics Discovery 2014)
Guinea Pig (Genome Announc 2013)
Eyeless Hedgehog (PLoS One 2012)
Carrier Screening (Beyond Batten - Sci Transl Med 2011)
Multiple Sclerosis (Twins study – Nature April 29, 2010 cover)
Sepsis (J Clin Microbiol. 2010)
Korean Genome (Nature 2009)
Mesothelioma (Proc Natl Acad Sci 2008)
Schizophrenia (PLoS One 2008)
• Medicago truncatula (Barrel clover) HapMap (500 Mb) – Cornell, UVM, JCVI, NSF, UCSC, INRA-Montpellier, ENSAT-Toulouse, Boyce Thompson Inst. – Samuel Roberts Noble Foundation
• Medicago sativa (Alfalfa) Genome (860 Mb) – Samuel Roberts Noble Foundation
• Theobroma cacao (Chocolate) Genome (330 Mb) – USDA-ARS & Mars, Inc., Washington State University, JGI, USDA-ARS, IBM, PIPRA, CUGI
• Glycine max (Soybean) (1 Gbp) and Zea mays (Maize) (2 Gb) Genetic Diversity – Syngenta
• Sorghum Transcriptome – USDA-ARS
• Gossypium arboreum (Cotton) Genome (1.7 Gbp) – Texas Tech University & Bayer Crop Sciences
• Phytophthora capsici (100 Mbp) – Univ. of Tennessee, Ohio State Univ., USDA/NSF
• Legume Disease Resistance – National Science Foundation, University of California – Davis
Plant/Animal/Fungus/Bacteria Science
• Chickpea & Pigeon Pea Diversity – CIMMYT - Generation Challenge Program, ICRISAT
• Andean Birds (Hummingbird) Transcriptome (1 Gbp) – UNM, NSF
• Green Microalga (85 Mbp) and Diatom strain RGd-1 (25 Mbp) Genomes – Center for Biofilm Engineering, Montana State University
• Staphylococcus aureus strains (3 Mbp) – NMSU, OSU, NIH, NM-INBRE
• Burkholderia glumae (rice blight) genome (7.3 Mbp) – Louisiana State University
• Bacteroides xylanisolvens strains (6 Mbp) – USDA-ARS, DARPA, Vital Probes
• Polaromonas sp. strain CG9_12 (pollutant degradation) Genome (5 Mbp) – Center for Biofilm Engineering, Montana State University
• Kibdelosporangium sp. MJ126-NF4 (Actinobacteria having natural products: anti-bacteria/viral/cancer) Genome (11 Mbp) – UNM
NM-INBRE Sequencing and Bioinformatics Core (SBC) research advancement
[Chart: number of projects, pubs, and grants per year, 2008 to 2014]
Serving to date > 160 researchers/postdocs/students
Pubs in press (31)
Grants Awarded/Continued (30)
Projects (66)
NM INBRE SBC Collaborations > 65 projects
[Figure: map of project counts by collaborating location, for 2008-2013 and 2014-2015; legend labels: INBRE 40, HHMI-SEA Phage 9, INBRE 17]
• Dr. Charles “Chad” Melancon - "De Novo Genome Sequencing of Novel Bacterial Isolates from Cave Environments." - UNM
• Dr. Douglas J. Perkins - “Discovery of Genetic Biomarkers for Severe Malaria” - UNM
• Dr. Rebecca A. Reiss - “Nanoinformatics: Characterizing Cell Proliferation on Nanostructured Titanium” - NM Institute of Mining & Tech
• Dr. Travis R. Robbins - “Comparing genomic variation caused by invasion of a novel threat versus geographic separation of populations” - NNMC
• Dr. Alvaro Romero - “Study of transcriptional changes upon dengue virus infection in the Asian tiger mosquito, Aedes albopictus” - NMSU
• Dr. Hitoshi Tsujimoto - “Study of transcriptional changes upon dengue virus infection in the Asian tiger mosquito, Aedes albopictus” - NMSU
• Drs. Ben Wheaton & Rob Miller - “The role of the immune system in spinal cord injury and recovery.” - UNM
• Dr. Tim Wright - “Genomic Approaches to Detecting Evolutionary Responses in Biological Invaders” - NMSU
2014-2015 pilot awardees
• Dr. Colleen Fordyce - “Cellular pH during carcinogenesis and how pH can be exploited for therapeutic benefit” - UNM
• Dr. Michael Franklin - “Epigenetics of Pseudomonas aeruginosa during biofilm growth” - Montana State
• Dr. Kathryn A. Hanley - “Quasispecies Dynamics of West Nile Virus in Avian Reservoir Hosts” - NMSU
• Dr. Zoe Harrold - “Fire and Ice: metagenomic investigations of a unique sub-glacial ice cave system” - Montana State
• Dr. Mario Izaquierre-Sierra - “Transposable Element Regulation in Land Plants: Arabidopsis coilin and Cajal bodies, a case study.” - NNMC
• Dr. Thomas L. Kieft - “Metagenomic Sequencing of U-Contaminated Soils and Sediments” - NMTech
• Dr. Samuel A. Lee - “Illumina RNA-seq expression analysis of cranberry-derived proanthocyanidins for the prevention of Candida albicans urinary biofilms” – UNM
• Dr. Nora Perrone-Bizzozero - “Identification of KSRP neuronal RNA targets by RIP-Seq” - UNM
• Dr. Giancarlo Lopez-Martinez - “The transcriptomics of low-oxygen hormesis and irradiation: What drives the strong organismal performance improvement?” - NMSU
2014-2015 pilot awardees (cont.)
Next Gen Sequencing Applications
Digital Transcript Expression
Small RNA Discovery & Expression
ChIP-SEQ
Genome Structural Variation
Mutation Frequencies
DNAse1 HS Sites
Genetic Association
De novo genome Sequencing
DNA Methylation
Metagenomics
Exome Sequencing
Splice Isoform Abundance
SBC technologies accelerate IDeA research: Sequencing
Illumina HiSeq2000:
• RNA, DNA, microRNA, and ChIP seq
• 1x and 2x 50/100bp read lengths, ~300Gb yield/10-day run
PacBio RS II: Single Molecule Real-Time observation of DNA synthesis
• No PCR bias, faster and accurate P6 polymerase
• ~8000bp average read lengths
• > 40kb read lengths
• > 500Mb per v3 SMRT Cell
• 8-16Gb yield per 16 cell run in 48 hours
• DNA, De novo assembly, Base modification detection
• IsoSeq: Determine the transcript landscape of your organism by sequencing full-length transcripts and gene isoforms. No assembly required!
Why Sequence mRNA?
1. Cost Effective: Transcriptome ≈ 2% of the Genome
2. Biologically relevant – active in affected cell or tissue
3. Enables genomic congruence analysis (gene expression, isoform usage and non-synonymous variant information)
4. Identifies mutations that are not apparent by genome sequencing (epigenetic silencing, RNA editing, allele-specific expression)
Drew Sheneman, New Jersey -- The Newark Star Ledger
de novo Assembly
Bioinformatics
1) New simple bioinformatics tool for “biologists”
Focused on the most popular Next Gen Sequencing experiments:
• RNA-Seq (expression analysis)
• DNA-Seq (mutation detection)
• microRNA seq analysis
2) Custom bioinformatics for de novo/hybrid assemblies, ChIP, metagenomics, etc.
RNA-Seq Analysis What’s involved?
QUALITY CHECK TOOLS
• FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • Evaluate data quality based on several benchmarks (seq quality, GC content)
  • Easy to read report
  • Important to verify that the samples have consistent quality
• BLAST: http://www.ncbi.nlm.nih.gov/books/NBK1762/
  • Verify species
Bioinformatician obtains data and downloads, installs, updates and runs…
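To make the two benchmarks named above (sequence quality and GC content) concrete, here is a minimal Python sketch of how they can be computed from FASTQ records. It is not FastQC's implementation; the function names, the Phred+33 encoding, and the two example reads are assumptions made for illustration.

```python
from statistics import mean

def parse_fastq(lines):
    """Yield (sequence, quality_string) pairs from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue                # skip blank lines between records
        seq = next(it).strip()
        next(it)                    # '+' separator line
        qual = next(it).strip()
        yield seq, qual

def per_position_quality(records, offset=33):
    """Mean Phred score at each read position (Phred+33 encoding assumed)."""
    columns = []
    for _, qual in records:
        for i, ch in enumerate(qual):
            if i == len(columns):
                columns.append([])
            columns[i].append(ord(ch) - offset)
    return [mean(col) for col in columns]

def gc_fraction(records):
    """Fraction of G/C bases across all reads."""
    gc = total = 0
    for seq, _ in records:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += len(s)
    return gc / total if total else 0.0

# Two made-up reads, for illustration only.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "GGCC", "+", "II55",
]
reads = list(parse_fastq(example))
print(per_position_quality(reads))
print(gc_fraction(reads))
```

Running the same two functions on every sample and comparing the outputs side by side is one simple way to check that samples have consistent quality.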
TOOLS TO ALIGN/MAP READS TO GENOME
Popular alignment algorithm
• Tophat 2.0: http://ccb.jhu.edu/software/tophat/index.shtml
• Tophat 1.2/1.3
But what genome (and version) are you mapping against?
• UCSC: ftp://hgdownload.cse.ucsc.edu/goldenPath/
• NCBI: ftp://ftp.ncbi.nih.gov/genomes/
• Ensembl: ftp://ftp.ensembl.org/pub/
• Custom
Bioinformatician downloads, installs, updates and runs…
READ QUANTIFICATION TOOLS
• HTSeq-count: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
  • Raw hit count
  • Transcript or Gene-based results
• Cufflinks: http://cufflinks.cbcb.umd.edu/
  • Normalizing, transcript-based quantification
  • FPKM/RPKM values
  • Gene-based values are aggregates
Bioinformatician downloads, installs, updates and runs…
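The difference between a raw hit count and an FPKM value can be shown with the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). The sketch below is illustrative arithmetic only, not Cufflinks code; the function names and example numbers are assumptions.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)

def gene_level_fpkm(isoform_fpkms):
    """Aggregate a gene-level value by summing its isoforms' FPKM values."""
    return sum(isoform_fpkms)

# Made-up numbers: 500 fragments on a 2,000 bp transcript,
# in a library of 20 million mapped fragments.
value = fpkm(500, 2000, 20_000_000)
print(value)
```

Because the value is scaled by transcript length and library size, it can be compared across transcripts and samples, whereas raw hit counts are what count-based tools expect as input.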
EXPRESSION ANALYSIS TOOLS
• DESeq: http://bioconductor.org/packages/release/bioc/html/DESeq.html
  • Requires up-to-date R installation; works with raw-hit-count values
• edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
  • Requires up-to-date R installation; works with raw-hit-count values
• Cuffdiff: http://cufflinks.cbcb.umd.edu/
  • Part of Cufflinks; new version also works with CSV files; works with FPKM values
Bioinformatician downloads, installs, updates and runs…
COLLECT AND INTEGRATE ANNOTATION
ENSEMBL: http://www.ensembl.org/info/docs/api/index.html
NCBI: http://www.ncbi.nlm.nih.gov/refseq/
GO Interactive: http://amigo.geneontology.org/amigo
KEGG Interactive: http://www.genome.jp/kegg/genes.html
PubMed: http://www.ncbi.nlm.nih.gov/pubmed
Bioinformatician downloads, installs, updates resources and writes scripts to….
Bioinformatician writes custom scripts in Perl, AWK and Python to
Find significant genes/ elements
Compare analysis results
RNA-Seq experiment and analysis … and analysis …
Experimental design
Sequencing provider: sequence files
Bioinformatician downloads, installs/updates various tools and performs:
• Quality checks
• Read mapping to genome
• Quantification of reads
• Expression analysis (e.g. DESeq)
• Annotation
• Significant gene discovery
• Result comparison
Results: 2 months later, after hard work by the Bioinformatician
“What if?” requires the analysis to be repeated
Enter Bioinformatics-in-a-Box™
Web-based tool for organized data management and analysis of NGS data
A Bioinformatics tool for “Biologists” and Bioinformaticians with large workloads!
• Securely share results and analysis steps; work together!
• 365x24 results
• Publish faster
• Easily execute “what if” questions
• Support provided every step of the way to ensure success
• Collaborate
• Organized analysis artifacts
• Parameter tracking
• Computation power/disk is in the cloud or on your hardware
Example: Start with an RNA-Seq Data set
• Six Samples
  • 3 Normal Prostate and 3 Prostate Adenocarcinoma Samples
• SRA Project
  • SRP003611
• Publication
  • Nacu, S., et al., Deep RNA sequencing analysis of readthrough gene fusions in human prostate adenocarcinoma and reference samples. BMC Med Genomics, 2011. 4: p. 11.
Bioinformatics-in-a-Box™: Obtain the Data
• Load your own data or from SRA
• Combine Technical Replicates
RNA Seq Experiment
Bioinformatics-in-a-Box: Quality Check
• A click to run FastQC
• A click to run BLAST & align to NCBI All Genomes Database (nr)
Bioinformatics-in-a-Box: Read Quantification
Integrated Tools:
• Cufflinks: FPKM values
• HTSeq-count: Hit Count values
Bioinformatics-in-a-Box: Expression Analysis
• Cuffdiff: takes FPKM
  • Genes, isoforms, TSS
• edgeR: takes Hit Count
  • Genes or isoforms
• DESeq: takes Hit Count
  • Genes or isoforms
Bioinformatics-in-a-Box: Integrated Annotations
• ENSEMBL
• NCBI
• GO
• KEGG
• PubMed
Bioinformatics-in-a-Box: Find Significant Genes/Elements
Set filter criteria
• P-value
• Adjusted p-value
• Fold change
• Absolute expression
Save your subset of genes
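The filter criteria listed above can be sketched in a few lines of Python. This is an illustration, not the tool's own code: the Benjamini-Hochberg procedure is used here as one common way to obtain adjusted p-values, and the thresholds, the +1 pseudocount, the field names, and the four example genes are all assumptions.

```python
import math

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def select_genes(rows, max_padj=0.05, min_abs_log2fc=1.0, min_mean_expr=10.0):
    """Keep genes passing adjusted p-value, fold change and expression cutoffs."""
    padj = benjamini_hochberg([r["pvalue"] for r in rows])
    kept = []
    for r, q in zip(rows, padj):
        # +1 pseudocount (an assumption) avoids division by zero.
        log2fc = math.log2((r["treated"] + 1) / (r["control"] + 1))
        mean_expr = (r["treated"] + r["control"]) / 2
        if q <= max_padj and abs(log2fc) >= min_abs_log2fc and mean_expr >= min_mean_expr:
            kept.append(r["gene"])
    return kept

# Made-up expression values and p-values, for illustration only.
rows = [
    {"gene": "geneA", "control": 100, "treated": 400, "pvalue": 0.001},
    {"gene": "geneB", "control": 100, "treated": 110, "pvalue": 0.5},
    {"gene": "geneC", "control": 2,   "treated": 8,   "pvalue": 0.01},
    {"gene": "geneD", "control": 200, "treated": 50,  "pvalue": 0.02},
]
print(select_genes(rows))
```

In this example geneB fails the adjusted p-value cutoff and geneC fails the absolute expression cutoff, so changing a single threshold and re-running answers a "what if" question without repeating the upstream analysis.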
Bioinformatics-in-a-Box: Compare Results
Single Sample (T1 vs. C1) vs. Biological Replicates (T1,2,3 vs. C1,2,3)
[Venn diagram region counts: 5369, 1317, 962]
Indicates too many false positives with single-sample comparisons
Conclusion: Biological replicates are preferable
Bioinformatics-in-a-Box: Compare DE Results
Indicates many differences between algorithms!
Conclusion: It is advisable to use multiple algorithms
DESeq vs. edgeR vs. Cuffdiff
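A comparison like this reduces to set arithmetic on the lists of significant genes. The sketch below, with made-up gene IDs and a hypothetical `compare_calls` helper, groups every gene by exactly which tools reported it, which is the information a three-way Venn diagram displays.

```python
def compare_calls(calls):
    """Map each combination of tools to the genes called by exactly those tools.

    calls: dict of tool name -> set of gene IDs reported as significant.
    """
    all_genes = set().union(*calls.values())
    regions = {}
    for gene in all_genes:
        tools = frozenset(t for t, genes in calls.items() if gene in genes)
        regions.setdefault(tools, set()).add(gene)
    return regions

# Made-up gene lists, for illustration only.
calls = {
    "DESeq": {"g1", "g2", "g3"},
    "edgeR": {"g2", "g3", "g4"},
    "Cuffdiff": {"g3", "g5"},
}
regions = compare_calls(calls)
for tools, genes in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(tools), sorted(genes))
```

Genes in the region shared by all tools are the most consistently supported candidates; genes unique to one tool are the ones worth a closer look.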
Bioinformatics-in-a-Box: Compare Results Limma vs. NGS Algorithms
Conclusion: Limma found genes undetected by NGS tools
Bioinformatics-in-a-Box: Compare Results
Limma detects differential genes missed by edgeR & DESeq
Limma vs. NGS Algorithms
Conclusion: Traditional algorithms can be useful for analyzing NGS data
DNA-Seq Mutation Analysis
DNA-Seq Mutation Analysis: Analysis steps
1. Obtain and load data
2. Quality check
3. Align to genome
   • Bowtie, Bowtie2, BWA
4. Check actual coverage (optional)
5. Mutation detection
   • GATK, samtools, pindel
6. Compare results
Start with Data set: Human Exome
• Enrichment: Agilent SureSelect v4
• Configuration: 2x100; approximately 100 million reads
• Theoretical average coverage: ~130x
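Theoretical average coverage is simple arithmetic: total sequenced bases divided by the size of the targeted region. The sketch below assumes the 100 million figure counts individual 100 bp reads; the target size is not stated on the slide, so the code only derives the size implied by the quoted numbers rather than asserting a real capture size.

```python
def theoretical_coverage(read_count, read_length_bp, target_size_bp):
    """Average depth if every sequenced base landed uniformly on the target."""
    return read_count * read_length_bp / target_size_bp

def implied_target_size(read_count, read_length_bp, coverage):
    """Target size (bp) that would give the stated coverage."""
    return read_count * read_length_bp / coverage

reads = 100_000_000      # approximate read count from the slide
read_length = 100        # 2x100 configuration
target = implied_target_size(reads, read_length, 130)   # derived, not a quoted spec
print(round(target / 1e6, 1), "Mb implied target")
print(theoretical_coverage(reads, read_length, target))
```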
Bioinformatics-in-a-Box: Quality Check
• Note quality drop-off after base 60
Bioinformatics-in-a-Box: Align to Genome
Set mapping parameters, including trimming
Set pairing parameters
Bioinformatics-in-a-Box: Check actual coverage
Lower than theoretical, as expected
Bioinformatics-in-a-Box: Integrated tools for SNP Detection
• GATK: https://www.broadinstitute.org/gatk/
• Samtools: http://samtools.sourceforge.net/
• FreeBayes: https://github.com/ekg/freebayes
Longer INDELs (> ~10b) and other SV
• Pindel: http://gmt.genome.wustl.edu/pindel/current/
Bioinformatics-in-a-Box™: Mutation detection
Select an algorithm of choice
Set pre-processing options
SNP & INDEL Detection by hand
• Using scripts, integrate annotation
  • dbSNP, 1000genomes: URL API is slow, recommend local database installation
  • Classification: snpEff: http://snpeff.sourceforge.net/
• Selection, result comparison
  • Algorithm-specific filtering
  • Perl, Python, etc.
• Using scripts, filter by location, coverage, quality, type of mutation, codon impact, protein impact, clinical impact
• Using scripts, compare results
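As a rough idea of what such a hand-written filtering script looks like, here is a minimal Python sketch that filters VCF records by quality, coverage and mutation type. It assumes biallelic records with depth stored in the standard INFO key `DP`; the thresholds, function names, and example records (including the variant ID) are made up for illustration.

```python
def parse_info(info):
    """Turn 'DP=35;AF=0.5;DB' into {'DP': '35', 'AF': '0.5', 'DB': True}."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out

def variant_type(ref, alt):
    """Classify a biallelic record as SNP or INDEL by allele lengths."""
    return "SNP" if len(ref) == 1 and len(alt) == 1 else "INDEL"

def filter_vcf(lines, min_qual=30.0, min_depth=10, keep_types=("SNP", "INDEL")):
    """Return (chrom, pos, ref, alt, type, has_known_id) for passing records."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip header and blank lines
        chrom, pos, vid, ref, alt, qual, _flt, info = line.rstrip("\n").split("\t")[:8]
        if qual == ".":
            continue                      # no quality value to filter on
        depth = int(parse_info(info).get("DP", 0))
        vtype = variant_type(ref, alt)
        if float(qual) >= min_qual and depth >= min_depth and vtype in keep_types:
            kept.append((chrom, int(pos), ref, alt, vtype, vid != "."))
    return kept

# Made-up records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\trs_example\tA\tG\t50\tPASS\tDP=40",
    "chr1\t2000\t.\tC\tT\t12\tPASS\tDP=35",
    "chr2\t3000\t.\tAT\tA\t99\tPASS\tDP=8",
    "chr2\t4000\t.\tG\tGTT\t60\tPASS\tDP=22;DB",
]
for record in filter_vcf(example_vcf):
    print(record)
```

Each caller writes its own INFO and FORMAT fields, which is why the filtering is described as algorithm-specific; a real script would need per-tool handling beyond this sketch.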
Bioinformatics-in-a-Box™: Integrated Annotations
• Known SNP
• Location, gene (if appropriate)
• Codon, amino-acid, protein impact
• Up-stream/down-stream sequences, quality, coverage, allele frequency
Bioinformatics-in-a-Box™: SNP Details
Bioinformatics-in-a-Box™: SNP Quality
Mutation Viewer
Bioinformatics-in-a-Box™: Insertion
Mutation Viewer
Bioinformatics-in-a-Box™: Deletion
Mutation Viewer
Bioinformatics-in-a-Box: Selecting SNPs and INDELs
• Filter by quality, location, impact, etc.
• Save dataset
Bioinformatics-in-a-Box: Compare SNP results GATK versus Sam
Different algorithms generate different results
Bioinformatics-in-a-Box: Compare SNP results using Samtools & BWA versus Bowtie2
Different ALIGNMENT algorithms generate different results
Bioinformatics-in-a-Box™: Compare InDel results using Samtools & BWA vs. Bowtie2
Different ALIGNERS generate different results
End of DNA Mutation Detection
Bioinformatics-in-a-Box: 365x24
• Data analysis
  • Peer reviewed algorithms
  • RNA-Seq, SNP Detection and Genotyping and miRNA-Seq
  • What if? Scenarios
• Data management
  • Linking all primary data, algorithms, genome references, parameters with results
  • Breadcrumb trail of what has been done, with what settings and versions (algorithms and references)
• Secure worldwide collaboration
• Hands-on support (and documentation if you must…)
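The "breadcrumb trail" idea, linking inputs, tool versions, references and parameters to results, can be illustrated with a small record structure. This is a hypothetical sketch, not the tool's actual data model; every class name, field, file name, and version string below is a placeholder.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class AnalysisStep:
    tool: str            # algorithm that was run
    version: str         # exact tool version
    reference: str       # genome reference and build used
    parameters: dict     # settings passed to the tool
    inputs: list         # files consumed
    outputs: list        # files produced

@dataclass
class AnalysisTrail:
    project: str
    steps: list = field(default_factory=list)

    def record(self, step):
        """Append a step and return the trail so calls can be chained."""
        self.steps.append(step)
        return self

    def to_json(self):
        """Serialize the whole trail for sharing or re-running."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Placeholder tool names, versions and files, for illustration only.
trail = AnalysisTrail("example-project")
trail.record(AnalysisStep("example-aligner", "0.0.0", "example-genome-build",
                          {"trim": 60}, ["sample1.fastq"], ["sample1.bam"]))
trail.record(AnalysisStep("example-caller", "0.0.0", "example-genome-build",
                          {"min_qual": 30}, ["sample1.bam"], ["sample1.vcf"]))
print(trail.to_json())
```

With a record like this, answering a "what if" question means copying the trail, changing one parameter, and re-running from the affected step onward.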
NCGR/NM-INBRE Bioinformatics Internship
The National Center for Genome Resources & NM-INBRE Present
June 15, 2015 - July 31, 2015 (tentative dates)
7-Week Intensive Program
• June 15 – June 26: 2 weeks of instruction
• June 29 – July 31: 5 weeks of hands-on projects including a presentation of your work
Deadline to apply: 11:59pm Thursday, April 30, 2015
SPACE IS LIMITED
Targeted towards: Grads and undergrads
PREREQUISITE: The program requires some knowledge of UNIX and includes prerequisite reading and understanding of chapters 4 and 5 of the following: http://my.safaribooksonline.com/book/bioinformatics/1565926641
Annual Educational Symposia
New Mexico BioInformatics, Science and Technology (NMBIST) Symposium
“Transcriptional Control”
March 26-27, 2015, Drury Plaza Hotel, Santa Fe, NM
- Experts in the field
- Student poster session
- Student speaking slot competition
- Highlights: Dr. Klemens Hartel
Sequencing and Bioinformatics project ideas
1) Small genome sequencing and de novo assembly: draft assemblies for genomes up to 100Mb in size.
• PacBio only sequencing and assembly
• Illumina only assembly
• PacBio/Illumina/454 hybrid assembly approaches
2) PacBio sequencing and analysis, projects include
• IsoSeq pilot
• Base Modification Detection
3) Illumina genomic sequencing and mutation detection
4) Illumina RNA-seq or miRNA-seq and expression analysis
5) Bioinformatics only
6) Custom
Conclusion and discussion
The NM-INBRE SBC has the resources and track record to advance your research:
• Sequencing: Illumina and PacBio technologies
• Bioinformatics: Standard pipelines and custom analysis
Work with VGN to impact science!
Please contact us to find out more at [email protected]!
Acknowledgments
NCGR/NM-INBRE Sequencing and Bioinformatics Core
Science/Bioinformatics Sequencing Lab *Anitha Sundararajan Peter Nagm *Johnny Sena Jennifer Jacobi Joann Mudge Pooja Umale Nico Devitt Thiru Ramaraj IT/Administration Stephanie Guida Forrest Black Connor Cameron Kathy Myers Andrew Farmer *Lisana Chavez Boris Umylny Callum Bell NIH NIGMS (5P20GM103451)
Thank you! Faye D. Schilkey
[email protected] of: 505-995-4449 cl: 505-660-4388