NGS data analyses with BioUML

Fedor Kolpakov
Biosoft.Ru, Ltd.; Institute of Systems Biology, Ltd.
Novosibirsk, Russia


Agenda
• BioUML overview
• NGS tools
  – quality control
  – alignment tools
  – annotation tools
  – workflows
• Genome browser
• Archakov’s genome
• Ribosome profiling
• Live demonstration


BioUML overview

BioUML platform
• BioUML is an open source integrated platform for systems biology. Its capabilities include access to databases with experimental data and tools for formalized description, visual modeling and analysis of complex biological systems.
• Script support (R, JavaScript) and workflow support make it well suited to the analysis of high-throughput data.
• Plug-in based architecture (the Eclipse runtime from IBM is used) allows new functionality to be added as plug-ins.

The BioUML platform consists of three parts:
• BioUML server – provides access to biological databases;
• BioUML workbench – standalone application;
• BioUML web edition – web interface based on AJAX technology.

Main platforms for bioinformatics and BioUML
• Taverna: standalone application, powerful workflows
• Galaxy: workflows, web interface, collaborative research, genome browser
• R/Bioconductor: scripts, statistics, plots
• BioClipse: Eclipse plug-in based architecture, chemoinformatics
• BioUML platform: standalone application, powerful workflows; web interface, collaborative research; genome browser; scripts, statistics, plots; Eclipse plug-in based architecture, chemoinformatics
  + systems biology: visual modelling, simulation, parameter fitting, …
  + chat for on-line consultations

Market – Platform
• Android market – Android
• AppStore – MacOS, iPod, iPhone
• Biostore – BioUML

BioUML ecosystem
• Biostore
• BioUML platform
• Developers (provide tools and databases)
  - plug-ins: methods, visualization, etc.
  - databases
• Users (use)
  - subscriptions
  - collaborative & reproducible research
• Experts (provide services)
  - services for data analysis
  - on-line consultations

NGS
- methods integrated into BioUML (Bowtie, MACS, ChIPHorde, ChIPMunk, …)
- programs integrated into Galaxy
- R packages
- annotation of detected peaks (SNPs, sites, etc.)
- visualization
- workflows
  - ChIP-seq
  - RNA-seq
  - assembly and annotation of the human genome (in progress)
  - support for parallelizing external programs as part of a workflow
- GTRD database (based on ChIP-seq data)
- dedicated servers
  - Amazon EC2 – on demand
  - Biodatomics – 64 cores, 256 GB of memory.


Galaxy – analyses methods


Galaxy - workflow

Raw data preprocessing
• Track statistics: gathers various statistics about a track or FASTQ file.
• Preprocess raw reads: removes reads that fail simple quality tests, removes adapters, and trims low-quality bases from read ends.
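The three preprocessing operations named above can be illustrated with a minimal sketch. It is not the BioUML implementation: the function name, the exact-match adapter search, the adapter sequence and the thresholds are all assumptions chosen for illustration.

```python
def preprocess_read(seq, quals, adapter="AGATCGGAAGAGC", min_q=20, min_len=20):
    """Return the cleaned (seq, quals) pair, or None if the read is rejected.

    quals is a list of integer Phred scores, one per base.
    All default values are illustrative assumptions.
    """
    # Remove the adapter and everything downstream of it (exact match only).
    pos = seq.find(adapter)
    if pos != -1:
        seq, quals = seq[:pos], quals[:pos]

    # Trim low-quality bases from both read ends.
    start, end = 0, len(seq)
    while end > start and quals[end - 1] < min_q:
        end -= 1
    while start < end and quals[start] < min_q:
        start += 1
    seq, quals = seq[start:end], quals[start:end]

    # Simple quality tests: minimum length and no ambiguous bases.
    if len(seq) < min_len or "N" in seq:
        return None
    return seq, quals
```

A production tool would also tolerate mismatches in the adapter and partial adapters at the read end; the sketch keeps exact matching so that each step stays visible.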

Short read alignment:

Bowtie
- fast
- no indels
- used for ChIP-seq

Novoalign
- single-end and paired-end
- in nucleotide and color space
- handles indels
- finds global optimum alignments using the full Needleman-Wunsch algorithm
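For readers unfamiliar with Needleman-Wunsch, the following is a textbook sketch of global alignment by dynamic programming. It is not Novoalign's code, and the scoring values (match, mismatch, linear gap penalty) are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of substitution, deletion, insertion.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because every cell of the table is filled, the result is a global optimum for the chosen scoring scheme, and gaps (indels) are handled naturally. The cost is time and memory proportional to the product of the two sequence lengths.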

RNA-seq with TopHat and Cuff* tools

ChIP-seq
• Bowtie for alignment
• MACS for peak calling
• ChIPMunk, IPS, MEME for motif discovery

Popular NGS toolboxes available: GATK, Picard, SAMtools


An example: workflow for analyses of ChIP-Seq data

An example: RNA-seq workflow

NGS data quality control

Two examples:
• RNA-seq data (rat, IPS)
• genome data – Archakov’s genome

Track statistics (FastQC)
• Estimates the quality of raw or aligned reads, as the FastQC program does: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• All original FastQC processors are supported
• Works faster than FastQC
• Additional processor: Overrepresented prefixes
• Overrepresented K-mers works more precisely (does not skip 80% of sequences)
• Along with the HTML report, separate statistics tables are generated and accessible for further analysis
• Ability to merge several reports into a composite report
• Like any BioUML analysis, it can become part of workflows, scripts, etc.
• Tested on Archakov AP3 (raw reads: 5.9Gb csfasta + 12.7Gb qual), analysis time: 36 min (all processors)
• Tested on Zakian db50 (raw reads: 6.5Gb fastq), analysis time: 7 min (all processors)
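Two of the statistics listed above, quality per base and K-mer counts, can be sketched in a few lines. This is an illustration only, not the BioUML or FastQC code. It assumes four-line FASTQ records with Phred+33 quality encoding, and the function names are hypothetical.

```python
from collections import Counter

def fastq_records(handle):
    """Yield (name, sequence, quality_string) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or trailing blank line)
        seq = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual

def per_base_mean_quality(records, offset=33):
    """Mean Phred score at each read position (assumes Phred+33 encoding)."""
    sums, counts = [], []
    for _, _, qual in records:
        for i, ch in enumerate(qual):
            if i == len(sums):
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

def kmer_counts(records, k=5):
    """Count every k-mer in every read, without subsampling the reads."""
    counts = Counter()
    for _, seq, _ in records:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

Counting K-mers over all reads rather than a sample trades speed for precision, which is the trade-off the slide points to.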

Track statistics launch
• Input data: BAM, FastQ and SOLiD (colorspace) data are supported
• Whether reads should be aligned by the left or right side
• Individual processors can be switched off to save time

Track statistics results (Archakov AP3):
• Quality per base
• Quality per sequence
• Nucleotide content per base
• GC content per base
• GC content per sequence
• N content per base
• Duplicate sequences
• Overrepresented sequences and 5-mers
• Overrepresented prefixes

Track statistics results (Zakian db50):
• Quality per base
• Quality per sequence
• Nucleotide content per base
• GC content per base
• GC content per sequence
• N content per base
• Duplicate sequences
• Overrepresented sequences and 5-mers


Genome browser

Genome browser: main features
• uses AJAX and HTML5 <canvas> technologies
• interactive: dragging, semantic zoom
• tracks support
  • Ensembl
  • DAS servers
  • user-loaded BED/GFF/Wiggle files


DAS

The Distributed Annotation System (DAS) defines a communication protocol used to exchange annotations on genomic or protein sequences.

It is motivated by the idea that such annotations should not be provided by single centralized databases, but should instead be spread over multiple sites. Data distribution, performed by DAS servers, is separated from visualization, which is done by DAS clients.

DAS is a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up sequence annotation information from multiple distant web sites, collate the information, and display it to the user in a single view.

DAS is heavily used in the genome bioinformatics community. Over the last years we have also seen growing acceptance in the protein sequence and structure communities.
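The client side of this design, one client collating annotations from several servers into a single view, can be sketched as follows. The sketch leaves out the network layer and the DAS XML format entirely: the server responses are assumed to be already parsed into (start, end, label) tuples, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    source: str  # which annotation server the feature came from
    start: int   # 1-based inclusive start on the reference
    end: int     # 1-based inclusive end on the reference
    label: str

def collate(responses, start, end):
    """Merge per-server feature lists into one view of the segment [start, end].

    responses maps a server name to a list of (start, end, label) tuples,
    standing in for already-parsed server replies.
    """
    merged = [Feature(source, s, e, label)
              for source, feats in responses.items()
              for (s, e, label) in feats
              if s <= end and e >= start]  # keep features overlapping the segment
    return sorted(merged, key=lambda f: (f.start, f.end, f.source))
```

Each feature keeps its source, so the client can draw one track per server while still presenting everything on a shared coordinate system.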


Genome browser

Two BAM tracks are compared with each other (example view on Human NCBI37 Chr.1). A profile showing the coverage is visible.
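A coverage profile of this kind can be computed from read intervals alone. The sketch below is one common way to do it (a difference array followed by a running sum). It is not taken from the BioUML source, and it assumes 1-based inclusive coordinates.

```python
def coverage_profile(reads, region_start, region_end):
    """Per-position read depth over [region_start, region_end].

    reads is an iterable of (start, end) alignment intervals,
    1-based and inclusive (an assumption of this sketch).
    """
    length = region_end - region_start + 1
    diff = [0] * (length + 1)
    for start, end in reads:
        lo, hi = max(start, region_start), min(end, region_end)
        if lo > hi:
            continue  # read lies outside the region
        diff[lo - region_start] += 1      # depth rises where the read begins
        diff[hi - region_start + 1] -= 1  # and falls just after it ends
    profile, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        profile.append(depth)
    return profile
```

Computing one profile per BAM track over the same region gives two lists that can be plotted above each other for comparison.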

Genome browser

Upon zooming, individual reads become visible. All information associated with the selected read is displayed in the Info box.

Genome browser

At the detailed scale, a graph of phred qualities is displayed along with the nucleotides that differ between the read and the reference sequence.

NGS data: Archakov’s genome

Preprocessing
1. Remove duplicates
• The purpose is to mitigate the effects of PCR amplification bias introduced during library construction. Two read pairs are considered duplicates if they align to the same genomic position.
• >60% were removed as duplicates
• Alignments after this step: 213 531 460
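The duplicate rule stated above can be sketched as a grouping step. The slide does not say which toolbox performed this step or how ties were resolved. The key used here (chromosome, both mate positions, strand) and the choice to keep the pair with the highest summed base quality are assumptions, as are the dictionary field names.

```python
def remove_duplicates(pairs):
    """Keep one read pair per alignment position.

    Each pair is a dict with hypothetical keys:
    chrom, pos1, pos2, strand, qual_sum.
    """
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["pos1"], pair["pos2"], pair["strand"])
        # Among pairs at the same position keep the highest summed base quality.
        if key not in best or pair["qual_sum"] > best[key]["qual_sum"]:
            best[key] = pair
    return list(best.values())
```

Real tools usually mark duplicates with a flag instead of deleting them, so that later steps can decide whether to ignore them.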

Preprocessing
2. Local realignment
Read mapping algorithms operate on each read independently. Local realignment realigns reads so that the number of mismatching bases is minimized across all the reads.

Preprocessing
3. Remove duplicates after realignment
• Realignment may change the genomic positions of read pairs, so additional duplicates can be identified after this step.
• 712 reads were removed (<0.00035%)

Preprocessing
4. Recalibration of base quality values
For each base in each read, various covariates are calculated (such as reported quality score, cycle, dinucleotide, GC-content). These values are used to build a model that predicts sequencing errors. The model is then applied to calculate an empirical base quality score, which overwrites the phred quality score currently in the read.
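The idea behind recalibration can be shown with a deliberately simplified sketch: bases are binned by their covariates, the mismatch rate in each bin is measured, and that rate is converted back to a Phred score. This is not the GATK implementation. The choice of three covariates, the +1/+2 smoothing and the function names are assumptions.

```python
import math
from collections import defaultdict

def build_model(observations):
    """Map each covariate bin to an empirical Phred score.

    observations: iterable of (reported_q, cycle, dinucleotide, is_mismatch),
    taken from bases at positions not known to be variant.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]
    for reported_q, cycle, dinuc, is_mismatch in observations:
        entry = counts[(reported_q, cycle, dinuc)]
        entry[0] += 1
        entry[1] += int(is_mismatch)
    # Q = -10 * log10(error rate); +1/+2 smoothing avoids log(0) in clean bins.
    return {key: -10 * math.log10((mm + 1) / (n + 2))
            for key, (n, mm) in counts.items()}

def recalibrate(reported_q, cycle, dinuc, model):
    """Return the empirical score for a base, or the reported one for unseen bins."""
    return round(model.get((reported_q, cycle, dinuc), reported_q))
```

For example, a bin reported as Q30 in which 9 of 998 bases mismatch the reference has a smoothed error rate of 10/1000, so its bases would be rewritten as Q20.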

Genotyping
1. Call SNVs with GATK 'Unified Genotyper'
2. Assign a well-calibrated probability to each variant call.
• Estimate the probability that an SNV is a true genetic variant versus a sequencing or data processing artifact, given the SNP call annotations provided by 'Unified Genotyper' (for example DepthOfCoverage, StrandBias, HaplotypeScore, ReadPosRankSumTest).
  o Variant Annotator - create the set of "true variants" from the dbSNP, HapMap and 1000 Genomes databases.
  o Variant Recalibrator - create a Gaussian mixture model by looking at the annotation values over a high-quality subset of the input call set ("true variants").
  o Apply Variant Recalibration - apply the model parameters to each variant identified by Unified Genotyper, calculating the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
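The log odds calculation can be illustrated with a heavily simplified stand-in. Instead of a Gaussian mixture, the sketch fits a single Gaussian with independent dimensions to the annotation vectors of "true" variants and another to a set of known-bad variants, then scores a call by the difference of log densities. The simplification, the availability of a labelled bad set and all names are assumptions. This is not the GATK algorithm.

```python
import math

def fit_gaussian(rows):
    """Fit per-dimension mean and variance to a list of annotation vectors."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    # A small variance floor keeps the density finite for constant annotations.
    variances = [max(sum((r[d] - means[d]) ** 2 for r in rows) / n, 1e-6)
                 for d in range(dims)]
    return means, variances

def log_density(x, model):
    """Log of the Gaussian density with independent dimensions."""
    means, variances = model
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, means, variances))

def log_odds(x, good_model, bad_model):
    """Positive values favour a true variant, negative values an artifact."""
    return log_density(x, good_model) - log_density(x, bad_model)
```

A call whose annotations resemble the training set of true variants receives positive log odds, and a threshold on this value then separates calls to keep from calls to filter.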

Genotyping
3. Call indels with GATK 'Unified Genotyper'
4. Assign a well-calibrated probability to each indel.
Similar to SNV calling, but only indels from 1000 Genomes are used as "true variants".

Genotyping
5. Filter out low quality variant calls.
   • 1 783 656 SNVs
   • 17 110 indels
6. Annotate identified variants relative to genes.

Genotyping: affected genes
http://cloud-biotech.com/bioumlweb/#de=data/Collaboration/Dr.Archakov/Data/alignment/Ap1.bam-CleanedAlignment/Genotyping2/tmp/Raw-affected-annotated

Genotyping: potential loss of function
118 genes have mutations that potentially affect function

Mutation in the exon of MAP4K3


Gene ontology classification

Full table

Genome browser
Example of deletion and insertion presentation in the genome browser


Ribosome profiling

Ingolia N.T. et al., Cell, 2011. Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity and Dynamics of Mammalian Proteomes

The paper presents results of:
- genome-wide profiling of ribosome positions (sequencing of ribosome-protected mRNA fragments);
- measurement of the translation elongation rate (pulse-chase strategy).

Analysis of the data revealed:
- thousands of strong translational pause sites;
- thousands of unannotated translation products, including:
  - N-terminal extensions and truncations;
  - upstream open reading frames starting at both AUG and non-AUG codons, whose translation changes after differentiation;
  - highly translated short ORFs in the majority of annotated lincRNAs: short, polycistronic ribosome-associated coding RNAs (sprcRNAs), which encode small proteins.

These findings reveal an additional level of complexity in the mammalian proteome.


Live demonstration