Microsoft Code of Conduct… · The world of genomics is rapidly evolving An explosion of new technologies… Long read sequencing Single cell sequencing Spatial sequencing Optical

Page 1
Page 2

Microsoft Code of Conduct

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. This includes all Microsoft events and gatherings, including on digital platforms, where we seek to create a respectful, friendly, fun and inclusive experience for all participants.

We expect all digital event participants to uphold the principles of this Code of Conduct, which covers the main digital event and all related activities. We do not tolerate disruptive or disrespectful behavior, messages, images, or interactions by any participant, in any form, in any aspect of the program, including business and social activities, regardless of location.

Microsoft will not tolerate harassment or discrimination based on age, ancestry, color, gender identity or expression, national origin, physical or mental disability, religion, sexual orientation, or any other characteristic protected by applicable local laws, regulations, and ordinances.

We encourage everyone to assist in creating a welcoming and safe environment. Please report any concerns, harassing behavior, or suspicious or disruptive activity to the Business Conduct Hotline (1-877-320-MSFT or [email protected]). Microsoft reserves the right to refuse admittance to or remove any person from Microsoft Build at any time at its sole discretion.

Page 3

An introduction to genomics data analysis on the Azure Cloud

Roberto Lleras

Senior Scientist

Microsoft Genomics

Page 4

Agenda

How do we fully harness the power of genomics in the 21st century?

Genomics in the cloud

Challenges in genomics and data science: how do we put it all together?

Page 5

Structure of DNA—deoxyribonucleic acid

Double stranded, stable storage material

Each cell contains the entire genome

4-letter alphabet: A, C, G, T

Human genome (23 chromosomes) is 3.2 billion bases long (and encoded 2x per cell)
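
The figures on this slide allow a quick back-of-the-envelope estimate of how much information a genome holds. The sketch below assumes the most compact fixed-width encoding of a 4-letter alphabet (2 bits per base); that encoding is an illustrative assumption, not something the slides specify, and real sequencing files are much larger because they store many overlapping reads plus quality scores.

```python
import math

ALPHABET = "ACGT"                      # the 4-letter DNA alphabet
GENOME_LENGTH_BASES = 3_200_000_000    # figure quoted on the slide
COPIES_PER_CELL = 2                    # "encoded 2x per cell"

# Assumption: fixed-width binary encoding, so 4 symbols need 2 bits each.
bits_per_base = math.ceil(math.log2(len(ALPHABET)))

# Bytes needed for one copy of the genome, and for both copies in a cell.
one_copy_bytes = GENOME_LENGTH_BASES * bits_per_base // 8
per_cell_bytes = one_copy_bytes * COPIES_PER_CELL

print(f"bits per base:      {bits_per_base}")
print(f"one genome copy:    {one_copy_bytes / 1e9:.1f} GB")
print(f"both copies (cell): {per_cell_bytes / 1e9:.1f} GB")
```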

Page 6

The future of healthcare (and many other industries) is powered by genomics

Tailor drugs to patients for more effective treatment

Design personalized cancer treatments based on analysis of tissue from the tumor

Use gene therapy to treat or prevent disease

Rapidly identify infectious agents in the environment and report that information

Determine the cause of developmental issues in newborns sooner

Predict inherited disease

Page 7

Genomics at scale: a storage and compute challenge

Data

A single Illumina NovaSeq 6000 can generate around 300 terabytes of data every year. [1]

Sequencing a whole human genome generates around 100 gigabytes of data.

By 2025, it is estimated that as much as 40 exabytes of storage capacity will be needed for human genomic data. [2]

Compute

Analysis of a whole human genome requires hundreds of core-hours of compute time.

It would require over 9 million core-hours to analyze the genomic data of everyone in New York’s Madison Square Garden.

Integrating genomic data into Data Lakes for AI + ML data science requires specialized knowledge and tooling.

[1] Illumina

[2] Z. D. Stephens et al., PLOS Biology, 2015
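
The storage and compute figures on this slide can be related to one another with simple arithmetic. In the sketch below, the 300 TB, 100 GB and 40 EB values come from the slide; the per-genome cost of 450 core-hours and the arena capacity of 20,000 people are assumptions chosen only to show how a total of 9 million core-hours could arise from "hundreds of core-hours" per genome.

```python
# Decimal storage units.
GB = 10**9
TB = 10**12
EB = 10**18

# Figures quoted on the slide.
instrument_bytes_per_year = 300 * TB
bytes_per_genome = 100 * GB
projected_bytes_2025 = 40 * EB

# How many whole genomes those volumes correspond to.
genomes_per_instrument_year = instrument_bytes_per_year // bytes_per_genome
genomes_in_projection = projected_bytes_2025 // bytes_per_genome

# Assumptions (not from the slide): one plausible pair of values that
# is consistent with the 9 million core-hour total.
ASSUMED_CORE_HOURS_PER_GENOME = 450
ASSUMED_ARENA_CAPACITY = 20_000
arena_core_hours = ASSUMED_CORE_HOURS_PER_GENOME * ASSUMED_ARENA_CAPACITY

print(f"genomes per instrument per year: {genomes_per_instrument_year:,}")
print(f"genomes in 40 EB:                {genomes_in_projection:,}")
print(f"core-hours for a full arena:     {arena_core_hours:,}")
```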

Page 8

Genomics (+AI and ML) in the cloud

Sequencing data

>CTACGGT
ACTTACGGACGCGAGAGCGGCATTTACCT

Bulk data transfer and storage (100s of GB per sample)

Sequence alignment (compute-intensive problem per sample)
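
To make the alignment step concrete, here is a deliberately naive sketch: it slides a short read along a reference and keeps the placement with the fewest mismatches. The reference string is the one shown on this slide; the reads and the function name `best_offset` are hypothetical. Production aligners rely on genome indexes and allow insertions and deletions, so this only illustrates the idea of placing a read.

```python
def best_offset(reference: str, read: str) -> tuple[int, int]:
    """Return (offset, mismatches) of the best ungapped placement of read."""
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best = (0, len(read) + 1)  # worse than any real placement
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        # Hamming distance between the read and this reference window.
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


REFERENCE = "ATTACGGATTACCATGGGCATTTA"  # reference sequence from the slide

# Hypothetical reads: one exact copy of the reference, one with a change.
print(best_offset(REFERENCE, "GGATTACC"))
print(best_offset(REFERENCE, "GGATTGCC"))
```

This brute-force scan costs time proportional to the reference length times the read length for every read, which is one reason real pipelines index the reference first.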

Sample metadata + more

Reference: ATTACGGATTACCATGGGCATTTA
Sample: ATTACGGATTGCCATGGGCATTTA
Sample: ATTACGGATTGCCATGGGCATTTA
Sample: ATTACGGATTGCCATGGTCATTTA

Variant calling and filtration (Train ML models to more accurately spot mutations)
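
The reference and sample sequences on this slide are already aligned and of equal length, so the simplest possible form of variant calling is a position-by-position comparison. The sketch below does only that; the function name `call_substitutions` is hypothetical, and real variant callers work from many noisy reads per position and use statistical or ML models rather than a direct string comparison.

```python
from collections import Counter

REFERENCE = "ATTACGGATTACCATGGGCATTTA"
SAMPLES = [
    "ATTACGGATTGCCATGGGCATTTA",
    "ATTACGGATTGCCATGGGCATTTA",
    "ATTACGGATTGCCATGGTCATTTA",
]


def call_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Count how many samples carry each substitution.
variant_counts = Counter(
    variant for sample in SAMPLES for variant in call_substitutions(REFERENCE, sample)
)
for (position, ref_base, alt_base), count in sorted(variant_counts.items()):
    print(f"position {position}: {ref_base}>{alt_base} in {count} of {len(SAMPLES)} samples")
```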

Healthy Individuals

Affected Individuals

Variant interpretation (Integrate metadata + variants to assess impact of mutations)

Clinical data

Socioeconomic data

Image data

Known variants

Patient histories

Disease research

Reference databases

Structure and store other datatypes for easy access

Human interface and hypothesis testing (Assess model performance and determine experimental results)

Human interpretation

Page 9

Genomics (+AI and ML) in the cloud

Sequencing data Sample metadata + more

Clinical data

Socioeconomic data

Image data

Reference databases

Human interpretation

Human interface research

Data security and governance

Machine learning

Model generation

Statistical analyses

Data science research

Data structures

Machine vision

Natural language processing

Algorithms development

Artificial intelligence

Hardware acceleration

Algorithms development

Hardware acceleration

Bioinformatics

Data transfer

Compression

Archiving

Data sharing

Page 10

Genomics (+AI and ML) in the cloud

Sequencing data Sample metadata + more

Clinical data

Socioeconomic data

Image data

Reference databases

Human interpretation

Specialized hardware (FPGAs, GPUs, TPUs)

Page 11

Insights through multimodal data analytics pipelines

Data sources: Genomics/Omics (structured), Genomics/Omics (unstructured), Medical Imaging (unstructured), EMR (unstructured), Business apps (structured), Notes (unstructured)

Pipeline stages: INGEST, PREP, MODEL & TRAIN, VISUALIZE & SHARE

Services shown: Data Factory, Data Store, Stream Analytics, Azure Databricks, SQL Data Warehouse, HDInsight, Data Lake Analytics, Machine Learning, Cognitive Services, Power BI, SharePoint, Teams

Page 12

The newspaper problem

Example from www.bioinformaticsalgorithms.com

Page 13

The newspaper problem as an overlapping puzzle

Example from www.bioinformaticsalgorithms.com
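
The overlapping-puzzle view can be sketched in a few lines: repeatedly find the two fragments whose ends overlap the most and glue them together. This greedy merge is a simplification chosen for illustration, not necessarily the approach taken in the cited example; the function names and the three toy fragments are hypothetical, and the sketch ignores sequencing errors, repeats, and fragments contained inside other fragments.

```python
def overlap(left: str, right: str, min_length: int = 1) -> int:
    """Length of the longest suffix of left that is also a prefix of right."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(fragments: list[str], min_length: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no pair overlaps enough."""
    pieces = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(pieces) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # remaining pieces cannot be joined
        merged = pieces[best_i] + pieces[best_j][best_len:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return pieces


# Hypothetical fragments of a short sequence, given in shuffled order.
FRAGMENTS = ["ATGGGCATTTA", "ATTACGGATTACC", "GATTACCATGGGC"]
print(greedy_assemble(FRAGMENTS))
```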

Page 14

Multiple unsequenced copies

Randomly fragment the genome

Resulting overlapping reads; the higher the coverage, the better the quality

Apply bioinformatics tools to reassemble the reads

Unordered sequenced segments (reads), 2–3 billion reads

Requires a large amount of computing power and storage capacity

From experimental to computational challenges
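
Coverage, mentioned above, is commonly estimated as the total number of sequenced bases divided by the genome length. The sketch below applies that ratio to the 2–3 billion reads quoted on this slide and the 3.2 billion base genome length quoted earlier in the deck; the read length of 100 bases is an assumption, since the slides do not state one.

```python
def mean_coverage(num_reads: int, read_length_bases: int, genome_length_bases: int) -> float:
    """Average number of reads covering each base: N * L / G."""
    return num_reads * read_length_bases / genome_length_bases


GENOME_LENGTH_BASES = 3_200_000_000   # quoted earlier in the deck
ASSUMED_READ_LENGTH_BASES = 100       # assumption, not stated on the slides

for num_reads in (2_000_000_000, 3_000_000_000):
    coverage = mean_coverage(num_reads, ASSUMED_READ_LENGTH_BASES, GENOME_LENGTH_BASES)
    print(f"{num_reads:,} reads -> about {coverage:.1f}x coverage")
```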

Page 15

Secondary analysis pipeline

Best practices pipeline recommended by the Broad Institute of MIT and Harvard

Microsoft Genomics—optimized service on Azure

            Input size    Compute time
Average     43 GB         5 h
Largest     398 GB        52 h

Average compute time 30 h

Accelerate precision medicine with Microsoft Genomics. Microsoft, 2018
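
The two rows of input size and compute time above imply a processing rate that can be checked directly. The sketch below only divides the quoted sizes by the quoted times; it makes no claim about how the service scales between or beyond these two data points.

```python
# (input size in GB, compute time in hours) as quoted on the slide.
RUNS = {
    "average": (43, 5),
    "largest": (398, 52),
}

# Implied throughput in GB of input processed per hour.
throughput_gb_per_hour = {
    label: size_gb / hours for label, (size_gb, hours) in RUNS.items()
}

for label, rate in throughput_gb_per_hour.items():
    print(f"{label}: {rate:.2f} GB/h")
```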

Page 16

The world of genomics is rapidly evolving

An explosion of new technologies…

Long read sequencing

Single cell sequencing

Spatial sequencing

Optical mapping, chromosome capture

Increased accessibility and accelerating innovation in a connected world demand a rapidly scalable, elastic platform for conducting research and deploying at scale

Page 17

Cromwell on Azure

Azure implementation of the Broad Institute’s Cromwell workflow engine using the GA4GH Task Execution Service (TES) backend

Free OSS solution on GitHub made available under the MIT license

Easy to install, configure, and use

Leverages Azure Batch compute and Blob storage for near-infinite scalability

Support for authenticated access in a workgroup setting

Best Practice Pipelines supported on Cromwell—BWA/GATK, MuTect2, RNA-Seq, ATAC-Seq, etc.
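
To give a sense of what travels over the TES backend mentioned above, the sketch below builds a task description using field names from the GA4GH TES task schema (`name`, `inputs`, `outputs`, `executors`, `resources`). Everything else is hypothetical: the helper `make_tes_task`, the container image, the URLs, and the resource sizes are placeholders. In normal use Cromwell generates such tasks from a workflow definition, so users would not usually write them by hand.

```python
import json


def make_tes_task(name, image, command, stdout_path, inputs, outputs,
                  cpu_cores, ram_gb, disk_gb):
    """Build a dict shaped like a GA4GH TES task (illustrative helper)."""
    return {
        "name": name,
        "inputs": [{"url": url, "path": path} for url, path in inputs],
        "outputs": [{"url": url, "path": path} for url, path in outputs],
        "executors": [{"image": image, "command": command, "stdout": stdout_path}],
        "resources": {"cpu_cores": cpu_cores, "ram_gb": ram_gb, "disk_gb": disk_gb},
    }


# All names, URLs, and sizes below are hypothetical placeholders.
task = make_tes_task(
    name="align-sample",
    image="example.org/hypothetical/bwa:latest",
    command=["bwa", "mem", "-t", "16", "/data/ref.fa", "/data/sample.fastq.gz"],
    stdout_path="/data/aligned.sam",
    inputs=[
        ("https://example.org/inputs/ref.fa", "/data/ref.fa"),
        ("https://example.org/inputs/sample.fastq.gz", "/data/sample.fastq.gz"),
    ],
    outputs=[("https://example.org/outputs/aligned.sam", "/data/aligned.sam")],
    cpu_cores=16,
    ram_gb=64,
    disk_gb=500,
)

print(json.dumps(task, indent=2))
```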

Page 18
Page 19

Cromwell on Azure (Genomics Workflow Orchestration)

Azure Services (Storage, Compute, DB, ML, PBI)

Secondary Analysis Tertiary Analysis Presentation

Automated genomics + data science and ML pipeline

Page 20

Visualize scientific discovery in real time

Page 21

Page 22

Q & A

Page 23

What next?

Microsoft student resources can be found at the GitHub repository for further learning opportunities: aka.ms/StudentsAtBuild

Microsoft Learn for Students is the place to develop practical skills through fun, interactive modules and paths. Plus, educators can get access to Microsoft classroom materials and curriculum. Find it all at: aka.ms/learnforstudents

Azure for Students gives you $100 in credit on the Azure Cloud. Build your skills in trending tech including data science, artificial intelligence (AI), machine learning, and other areas with access to professional developer tools. Start here: aka.ms/azureforstudents

Imagine Cup is more than just a competition: you can work with friends (and make new ones), network with professionals, gain new skills, make a difference in the world around you, and get the chance to win cash and cloud credits. To find out more: higher education students, visit imaginecup.com/. Educators of students ages 13–18, start with Imagine Cup Jr.

Microsoft Learn Student Ambassadors are a global group of campus leaders who are eager to help fellow students, lead in their local tech community, and develop technical and career skills for the future. Learn more at: studentambassadors.microsoft.com/

Page 24

Interested in genomics?

Bioinformatics algorithms

bioinformaticsalgorithms.com

Tools

Cromwell on Azure: https://github.com/microsoft/CromwellOnAzure

Microsoft Genomics: https://www.microsoft.com/en-us/genomics

Page 25

Microsoft genomics service architecture

“msgen” CLI

Customer

Blob Storage

Azure Portal

API Management

Primary DB

App Insights

Monitoring

Azure AD

Genomics Admin

Data Warehouse

Microsoft Genomics Region Boundary

REST API

Web App

Resource Provider

Admin Portal

Data Factory

Internal Reference Data

Logs

Azure Batch VM Pools

Page 26

© Copyright Microsoft Corporation. All rights reserved.