(SEC313) Security & Compliance at the Petabyte Scale

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Security and Compliance at the Petabyte Scale

Lessons from the National Cancer Institute’s Cancer Genomics Cloud Pilot

Igor Bogicevic, CTO

Angel Pizarro, AWS Scientific Computing

October 2015

What to expect from this session

• Background: Unique challenges for securing genomics

information

• Case study: Democratizing access to The Cancer

Genome Atlas (TCGA) through the Seven Bridges

Cancer Genomics Cloud

• Deep dives: How we’ve leveraged AWS to support

secure and compliant genomics research

Why is securing genomics

information hard?

i) Genomics data is big…and getting bigger

NGS: Next Generation Sequencing

NGS sequencers include machines from Illumina, Life Technologies, and Pacific Biosciences. Human genome data based on estimates of whole human genomes sequenced

Sources: Financial reports of Illumina, Life Technologies, Pacific Biosciences; revenue guidances; JP Morgan; The Economist; Seven Bridges Analysis.

Between 2014–2018, production of new NGS data to exceed 2 exabytes

[Chart: number of sequencers and genomic data (Tb)]

ii) Genomes are inherently sensitive

Very personal (including your relatives…)

Can’t fully anonymize information

Can’t take it back once it’s out there

iii) Research is highly collaborative and

diverse

It occurs in large teams... ...with numerous analytical tools

The Challenge

Enable thousands of researchers

using hundreds of (custom) tools

to analyze petabytes of highly sensitive data

in a secure and compliant environment

Case study:

Bringing the Cancer Genome

Atlas (TCGA) to the Cloud

This project has been funded in whole or in part with Federal funds from the

National Cancer Institute, National Institutes of Health, Department of Health

and Human Services, under Contract No. HHSN261201400008C.

TCGA is one of the richest and most complete

genomics data sets in the world

34 tumor types

from thousands

of patients…

…analyzed across

multiple

dimensions…

…by researchers

across the US…

…at a cost of

$375 million.

1.5+ petabytes, growing to 3.5 petabytes in the next year

But learning from this data is challenging

The Cancer Genomics Cloud Pilots seek to

directly address these difficulties

• Initiated by Dr. Harold Varmus in 2013

• BAA issued in January 2014

• 3 pilots awarded September 2014

o Broad Institute

o Institute for Systems Biology

o Seven Bridges Genomics

Early access: November 2015

Open release: January 2016

www.CancerGenomicsCloud.org

Our approach to democratizing

access to TCGA data

The components of democratized access –

Data

● Immediately and securely access

petabytes of open-access and

controlled-access cancer genomics

data.

● Analyze data from your private

cohorts alongside public data.

● Data access governed by the NIH

Genomic Data Sharing Policy.

● As an NIH trusted partner, Seven

Bridges is able to authorize approved

researchers.

● First controlled access genomic

dataset on AWS.

● Coming soon: http://aws.amazon.com/public-data-sets/tcga/

The components of democratized access –

Reproducibility

1.1.2 2.0a 2.3Lite

● Execute workflows from primary

analysis through visualization.

● Each result is always associated with

a complete snapshot of the tool

versions, parameters, and input files.
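The snapshot idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Seven Bridges implementation: the `RunSnapshot` class, its field names, and the tool name "aligner" are hypothetical, and the version strings are borrowed from the labels on the slide.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def digest(content: bytes) -> str:
    """SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class RunSnapshot:
    """Hypothetical record tying one result to everything that produced it."""
    tool: str
    tool_version: str
    parameters: dict
    input_digests: dict  # file name -> SHA-256 hex digest of its content

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so equal snapshots hash identically.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


inputs = {"sample.fastq": digest(b"ACGT")}
snap_a = RunSnapshot("aligner", "2.0a", {"threads": 4}, inputs)
snap_b = RunSnapshot("aligner", "2.3Lite", {"threads": 4}, inputs)

# Same inputs and parameters, different tool version: different fingerprint.
print(snap_a.fingerprint() != snap_b.fingerprint())
```

Storing the fingerprint next to each result makes it cheap to check whether two results came from the same tool version, parameters, and inputs.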

The components of democratized access –

Open standards

● Native execution of Docker-based Common

Workflow Language (CWL) pipelines allows

portability and sharing of custom tools.

● APIs support workflow automation and

enhance interoperability.
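As a rough illustration of what a Docker-based CWL tool description looks like, the sketch below builds one as a Python dictionary and serializes it to JSON, which CWL accepts alongside YAML. The image name, the command, and the input and output names are hypothetical, and the `v1.0` version string is an assumption (the talk predates the CWL 1.0 release).

```python
import json


def make_cwl_tool(docker_image: str, base_command: list) -> dict:
    """Build a minimal CWL CommandLineTool description (illustrative only)."""
    return {
        "cwlVersion": "v1.0",  # assumed version, not stated in the slides
        "class": "CommandLineTool",
        "baseCommand": base_command,
        # The tool runs inside this container image, which is what makes it portable.
        "hints": {"DockerRequirement": {"dockerPull": docker_image}},
        "inputs": {
            "reads": {"type": "File", "inputBinding": {"position": 1}},
        },
        "outputs": {
            "report": {"type": "stdout"},
        },
        "stdout": "report.txt",
    }


tool = make_cwl_tool("example/line-counter:1.0", ["wc", "-l"])
print(json.dumps(tool, indent=2))
```

Because the description names the container image and the command line explicitly, the same file can be shared and run on any CWL-aware platform.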

...implemented through our genomics platform

How we’ve leveraged AWS to

support secure and compliant

genomics research

Security and compliance―connected, but separate.

Security

• Network and data security overview

• Parallel file access at scale

• Enabling secure computation using researcher-contributed tools

• Enabling secure user access and collaboration

Simplified system architecture

[Architecture diagram. Components shown:

• Encrypted Amazon S3 buckets

• Virtual private cloud (development environment): dynamic worker instances, infrastructure server

• Virtual private cloud (production environment): dynamic worker instances, infrastructure server

• Seven Bridges website

• IPSEC VPN to Seven Bridges offices

• Open VPN gateway for the remote workforce

• AWS IPSEC

• User: access platform, download data

Legend: data flow, secure access point]

Securing the network

• Extensive use of virtual private clouds (VPCs)

• Separate dev and production environments

Dev | Production

● Built-in IPSEC allows easy

network integration

• Open VPN to secure remote

user access

● Each instance and VPC is

individually firewalled

Securing data

• At-rest encryption

• Amazon S3 SSE, SSE-KMS

• Amazon EBS encryption

• Ephemeral storage

• In transit

• TLS used exclusively for S3 traffic

● From other users

• AWS IAM to access other users’ buckets

Controls to support secure data

• Atomic data access

• Data locality

• Dedicated tenancy on

computation instances

• Using only encrypted storage

• Strict data purging

Amazon S3 Amazon EBS Amazon EC2

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "112",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::examplebucket/*",
      "Condition": {
        "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
      }
    }
  ]
}
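To make the effect of the bucket policy above concrete, here is a small Python sketch that models only its single Deny statement. It is a simplified stand-in for illustration, not the AWS policy evaluation engine; the helper `put_is_denied` and the example object keys are hypothetical. It relies on one documented behavior: a negated operator such as `StringNotEquals` also matches when the key is absent from the request.

```python
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "112",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::examplebucket/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
            },
        }
    ],
}


def put_is_denied(policy: dict, action: str, resource: str, headers: dict) -> bool:
    """Return True if a Deny statement in the policy matches the request."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Deny" or stmt["Action"] != action:
            continue
        # Only trailing-wildcard resources are handled in this sketch.
        if not resource.startswith(stmt["Resource"].rstrip("*")):
            continue
        for key, required in stmt["Condition"]["StringNotEquals"].items():
            header = key.split(":", 1)[1]  # "s3:x-amz-..." -> "x-amz-..."
            # A missing header or a different value both satisfy StringNotEquals.
            if headers.get(header) != required:
                return True
    return False


target = "arn:aws:s3:::examplebucket/reads/sample.bam"
print(put_is_denied(POLICY, "s3:PutObject", target, {}))
print(put_is_denied(POLICY, "s3:PutObject", target,
                    {"x-amz-server-side-encryption": "AES256"}))
```

Note that, as written, the policy would also reject uploads that request SSE-KMS (header value `aws:kms`), so a bucket that accepts both encryption modes would need a different condition.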

dm-crypt

Parallel file access at scale

The Challenge:

Many bioinformatics tasks require sharing of

intermediary results between multiple instances.

Parallel file access at scale – NFS

Observed network

saturation at ~8 NFS clients.

Hypothesis

• Amazon S3 would remove single NFS server bandwidth

bottleneck.

• Presenting user’s S3 objects as a local filesystem could provide

an elegant abstraction that any application could use.

• Cumulative S3 read/write speed should scale mostly linearly

with number of workers.

• Total read/write speed on shared S3 objects should significantly

exceed NFS server solution speed on >10 workers.
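The hypothesis above can be expressed as a toy throughput model. The numbers here are hypothetical and chosen only so that the NFS server saturates at 8 clients, matching the observation on the previous slide; they are not measurements from the talk.

```python
def nfs_aggregate_mb_s(workers: int, per_worker_mb_s: float,
                       server_cap_mb_s: float) -> float:
    """All clients share one server link, so the total is capped."""
    return min(workers * per_worker_mb_s, server_cap_mb_s)


def s3_aggregate_mb_s(workers: int, per_worker_mb_s: float,
                      efficiency: float = 1.0) -> float:
    """Each worker talks to S3 directly; efficiency < 1 models 'mostly linear'."""
    return workers * per_worker_mb_s * efficiency


PER_WORKER = 100.0  # hypothetical MB/s achievable by one worker
NFS_CAP = 800.0     # hypothetical server limit: saturation at 8 clients

for n in (4, 8, 16, 64):
    nfs = nfs_aggregate_mb_s(n, PER_WORKER, NFS_CAP)
    s3 = s3_aggregate_mb_s(n, PER_WORKER, efficiency=0.9)
    print(f"{n:3d} workers: NFS {nfs:7.1f} MB/s, S3 {s3:7.1f} MB/s")
```

Under these assumptions the two curves cross just below the saturation point, after which the NFS total stays flat while the S3 total keeps growing with the number of workers.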

Parallel access at scale – SBG-FS/Amazon S3

Amazon S3

SBG-FS single worker performance

[Chart. Axis labels: Throughput (MB/s), Compute Instances. Series: 1st read (SBG-FS Prefetch), Write (SBG-FS Upload), 2nd read (SBG-FS Cache). Labeled values: 90, 215, 894.]

SBG-FS cumulative worker performance

[Chart. Axis labels: Throughput (GB/s), Compute Instances. Series: 1st read (SBG-FS Prefetch), Write (SBG-FS Upload), 2nd read (SBG-FS Cache).]

SBG-FS auditing capabilities

Amazon S3

Enabling secure computation using

researcher-contributed tools

The Challenge:

10,000+ bioinformatics tools

50+ tools used in a single TCGA marker paper

Our Approach:

Common Workflow Language (CWL) wrapper

Seven Bridges Platform

Benefits of using Docker to deploy user-contributed tools

• Enables solid resource

isolation at the container

level

• Simplifies deploying and

managing tools at scale

Security risks posed by use of Docker

• Docker daemon runs under

root privileges

• User can intentionally or

unintentionally add malicious

apps

• If resource management is not set properly, apps could do damage outside their containers

Enabling secure use of Docker containers

● Know your private vs. public

resources

● Isolate network resources for

each container (firewalling)

• Be careful with linking

containers

• Aggregate logs (forensics)
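One way to apply the points above when launching an untrusted, researcher-contributed tool is to build the `docker run` command line with isolation options switched on by default. The sketch below only constructs the argument list; the function name, the default limits, and the image name are hypothetical, and this is not presented as the configuration used on the Seven Bridges Platform.

```python
def hardened_docker_run(image: str, command: list, memory: str = "4g",
                        cpus: str = "2", uid: int = 1000) -> list:
    """Return a docker run argument list with conservative isolation settings."""
    return [
        "docker", "run", "--rm",
        "--network", "none",          # no network access from inside the container
        "--memory", memory,           # cap memory so one tool cannot starve the host
        "--cpus", cpus,               # cap CPU usage
        "--pids-limit", "256",        # limit process count (fork bombs)
        "--cap-drop", "ALL",          # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--read-only",                # root filesystem is read-only
        "--user", f"{uid}:{uid}",     # do not run the tool as root
        image, *command,
    ]


argv = hardened_docker_run("example/variant-caller:1.0", ["call", "--input", "/data/in.bam"])
print(" ".join(argv))
```

With `--read-only`, a tool that writes results needs an explicitly mounted writable volume, which also makes it clear exactly where output can land.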

Enabling secure access

● Organizations have diverse

models of internal structure

and responsibilities

• Roles and authentication

models are very diverse

• Federated authentication

and SSO

Supporting federated login for controlled data

access

[Diagram: federated login flow. Labels: SAML, Verify, Approved Researchers, cron x 24hr, Metadata service, ELK stack, Error Message]

Enabling collaboration

• SBG Platform provides isolation

of resources at project level

• Users can share projects and

control access through roles

• Basic role provides read access only; write/copy privileges are separate from execution
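The role model described above can be sketched with a small permission-flag structure. Only the read-only basic role and the separation of write/copy from execution come from the slide; the other role names and the `Project` class are hypothetical.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    COPY = auto()
    WRITE = auto()
    EXECUTE = auto()


# Hypothetical role names; only the read-only basic role comes from the slide.
ROLES = {
    "basic": Permission.READ,
    "contributor": Permission.READ | Permission.COPY | Permission.WRITE,
    "analyst": Permission.READ | Permission.EXECUTE,
}


class Project:
    """Resources are isolated per project; access is granted per member."""

    def __init__(self, name: str):
        self.name = name
        self._members = {}  # user name -> Permission flags

    def add_member(self, user: str, role: str) -> None:
        self._members[user] = ROLES[role]

    def allows(self, user: str, needed: Permission) -> bool:
        # Users outside the project get no access at all.
        granted = self._members.get(user, Permission(0))
        return (granted & needed) == needed


project = Project("tcga-demo")
project.add_member("alice", "basic")
project.add_member("bob", "analyst")
print(project.allows("alice", Permission.READ), project.allows("alice", Permission.EXECUTE))
```

Keeping execution as its own flag means a collaborator can be allowed to run analyses without being able to copy controlled data out of the project, and the reverse.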

One Billing Group

per project

Multiple users and

roles per project

Users participate in projects

and can provide funding

Project-specific user roles

Multiple users per project

Clear funding/payment

responsibility

Overall system security is enabled by

monitoring and testing

• Penetration testing

• Patch management

• Software and infrastructure vulnerability assessments

• Monitoring of platform performance and availability

• Pandora FMS/OSSEC/Sysdig

• Auditing and logs at a project and platform level

• Logs aggregated and available for inspection with ELK

stack
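Log aggregation of this kind is easier when every audit event is emitted as one JSON object per line, which a collector such as Logstash can ingest without custom parsing. The sketch below uses only the Python standard library; the field names (`scope`, `project`, `user`) and the logger name are assumptions for illustration, not the platform's actual log schema.

```python
import io
import json
import logging


class JsonAuditFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "time": self.formatTime(record),
            "level": record.levelname,
            # Hypothetical fields, passed by callers through `extra=`.
            "scope": getattr(record, "scope", "platform"),
            "project": getattr(record, "project", None),
            "user": getattr(record, "user", None),
            "message": record.getMessage(),
        }
        return json.dumps(event, sort_keys=True)


def build_audit_logger(stream) -> logging.Logger:
    logger = logging.getLogger("audit")
    logger.handlers.clear()   # avoid duplicate handlers on repeated setup
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonAuditFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
audit = build_audit_logger(buffer)
audit.info("task started", extra={"scope": "project", "project": "tcga-demo", "user": "alice"})
print(buffer.getvalue().strip())
```

Tagging each event with a scope keeps project-level and platform-level audit trails separable after aggregation.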

Putting it all together

1. User logs on to the platform

2. Platform creates a unique signed URL for the user

3. Using signed URL, data is uploaded to an encrypted Amazon S3 bucket

4. After the user starts a computation, the Seven Bridges Platform calculates the optimal execution plan and starts dedicated task worker instances

5. Worker instances securely pull data from Amazon S3

6. Worker instances are able to securely share intermediate data

7. Final results are uploaded to Amazon S3
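The signed URL in steps 2 and 3 above rests on a simple idea: the platform signs the request details and an expiry time with a secret, so the storage side can verify the URL without the user ever holding long-lived credentials. The sketch below shows that idea with a generic HMAC scheme from the Python standard library. It is deliberately not AWS Signature Version 4 and not the platform's code; the secret, the parameter names, and the upload host are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical; a real secret never appears in source code


def sign_upload_url(base_url: str, user: str, object_key: str,
                    expires_at: int, secret: bytes = SECRET) -> str:
    """Return a URL carrying the user, object key, expiry, and an HMAC over all three."""
    payload = f"{user}\n{object_key}\n{expires_at}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    query = urlencode({"user": user, "key": object_key,
                       "expires": expires_at, "sig": signature})
    return f"{base_url}?{query}"


def verify_upload_url(url: str, now: int, secret: bytes = SECRET) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
    try:
        expires_at = int(params["expires"])
        payload = f"{params['user']}\n{params['key']}\n{expires_at}".encode("utf-8")
        signature = params["sig"]
    except (KeyError, ValueError):
        return False
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature) and now <= expires_at


url = sign_upload_url("https://upload.example.org/put", "alice",
                      "cohort1/sample.bam", expires_at=1_700_000_000)
print(verify_upload_url(url, now=1_699_999_000))
```

Because the user name and object key are part of the signed payload, a URL issued to one user for one object cannot be reused for another.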

[Diagram: User, Seven Bridges Platform, computation environment (Amazon EC2 instances with data sharing between instances), and encrypted Amazon S3 bucket, annotated with steps 1 to 7 from the list above.]

Lessons learned from petabyte-scale security

• Isolate resources as much as possible

• Encrypt everything―it will make your life easier

• Understand the scale of the data

• Measure everything

• Leverage the infrastructure

Compliance

When we talk about compliance, we talk about

• Building trust

• Shared language

Compliance frameworks

dbGaP: Protect against risk associated with release of genomes of individuals consenting to participate in research studies.

HIPAA: Protect against risk associated with release of Protected Health Information (PHI).

ISO 27001: Provides a framework for general security management of assets across the organization and is a general specification for an information security management system (ISMS).

Shared responsibility == compliance coordination

Stacked responsibility:

Researcher: Users | Groups | Projects | Applications

Seven Bridges Genomics: Data Security, Data Provenance, Application Monitoring, OS, Network, etc.

AWS: Facilities, Infrastructure, Virtualization, API and Service Endpoints

Auditor

Shared responsibility across frameworks

[Matrix: frameworks (dbGaP, HIPAA, ISO 27001) against responsible parties (Researcher, AWS, Seven Bridges)]

Securely integrating with platforms

Security and compliance in practice

Stacked responsibility: Data Security; Data Provenance; Application Monitoring; OS, Network, etc.; Users | Groups | Projects | Applications; Facilities; Infrastructure; Virtualization; API and Service Endpoints

Horizontal responsibility: Seven Bridges Genomics, Researcher, Amazon Web Services

Use case: Analyze Personal Genome Project data

http://personalgenomes.org

VPC subnet

Dedicated instance

1000 Genomes

Strategies to follow

• Rely on the platform as much as possible

• Follow security best practices outlined in the AWS

documentation

• Have a checklist!

Compliance checklist

AWS security

VPC, security groups, encrypted storage

Protect AWS credentials

Protect platform credentials

SOPs for OS and application updates

Audit and logging of the activities outside of platform

Data provenance and lifecycle

AWS architecture

IAM instance role

VPC subnet

Security

group

Virtual private cloud

• Access platforms via

Internet or VPC peering

• DevOps for instance and

application management

• Protect credentials with

AWS IAM and AWS KMS

Secure bootstrapping with instance UserData

AWS Command Line Interface

Secure and format local storage
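A bootstrap script for securing and formatting local storage might look like the following sketch, which renders a shell script suitable for passing as instance UserData. It assumes dm-crypt via `cryptsetup` (dm-crypt is named on an earlier slide), and the device path, mapper name, and mount point are hypothetical defaults. The script is not taken from the talk, and it should be tested on a disposable instance because `luksFormat` destroys existing data on the device.

```python
def render_user_data(device: str = "/dev/xvdb", mapper_name: str = "scratch",
                     mount_point: str = "/scratch") -> str:
    """Render a boot script that encrypts, formats, and mounts a local volume."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "# One-time random key kept in tmpfs; it is never written to persistent disk.",
        "KEYFILE=$(mktemp -p /dev/shm)",
        'head -c 64 /dev/urandom > "$KEYFILE"',
        f'cryptsetup luksFormat --batch-mode --key-file "$KEYFILE" {device}',
        f'cryptsetup luksOpen --key-file "$KEYFILE" {device} {mapper_name}',
        f"mkfs.ext4 /dev/mapper/{mapper_name}",
        f"mkdir -p {mount_point}",
        f"mount /dev/mapper/{mapper_name} {mount_point}",
        "# The kernel keeps the key while the mapping is open; the file can go.",
        'rm -f "$KEYFILE"',
    ]
    return "\n".join(lines) + "\n"


print(render_user_data())
```

Because the key exists only for the lifetime of the running instance, data left on the local volume is unreadable once the instance stops, which fits the strict data purging control listed earlier.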

Compliance checklist

AWS security

VPC, security groups, encrypted storage

Protect AWS credentials

Protect platform credentials

SOPs for OS and application updates

❑ Audit and logging of the activities outside of platform

❑ Data provenance and lifecycle

Thank you!

Remember to complete

your evaluations!
