An Open-Source Format for Personal Genome Representation Enabling Fast Queries and Analyses of Human Genomes

Compact Genome Format

Sally Guthrie, Research Scientist, Curoverse ([email protected])

ACKNOWLEDGEMENTS

Alexander Wait Zaranek, Chief Scientist, Curoverse
Abram Connelly, Research Scientist, Curoverse

CURRENT USES OF GENOMIC DATA

Patient Care
• Analyze one genome for rare and pathogenic variants

Population Analysis
• Examine a population for rare variants
• Separate a population into subgroups
• Case/Control Studies and GWA Studies
• Can require merging multiple data sets
• Can require using supervised and unsupervised machine learning

VARIANT CALL FORMAT (VCF) SNAPSHOT

Advantages
• Very flexible
• Easily annotated with canonical or in-house annotation pipelines
• Can be small (with compression)

Disadvantages
• Difficult to merge VCFs between studies
• Can be slow to query and run machine learning algorithms on (requires pre-processing)

WHAT IS COMPACT GENOME FORMAT (CGF)?

Compact Genome Format is a compressed representation of a genomic sequence.

It allows analysis to be run directly on the compressed data.

It represents a sequence using a series of vectors:
• Each position in a vector is termed a “tile”
• The value at that position points to a sequence in a “Tile Library,” a pan-genome

GENERATING THE REFERENCE TILE LIBRARY

Human Reference Genome (with tag sets highlighted)

Tag Set: …

1. Choose a tag set of unique 24-base-long sequences
2. Map the tag set to a reference genome

GENERATING THE REFERENCE TILE LIBRARY

1. Save sequence between each tag pair to the tile library
2. Give these sequences a value (0)

Tile Position Id: 00.0000, 00.0001, 00.0002, …
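The tiling steps above can be sketched in a few lines of Python. This is a toy illustration, not the Lightning implementation: the function name, the 4-base tags (the slides specify 24 bases), and the choice to keep both flanking tags inside each tile are all assumptions. The position IDs follow the `00.0000` pattern shown on the slide, read here as a two-digit prefix plus a four-digit hexadecimal step; the meaning of the prefix is not stated in the slides.

```python
def build_reference_tiles(reference, tags, prefix=0):
    """Cut a reference at consecutive tags; return {position_id: {0: sequence}}.

    Assumption: each tile runs from the start of one tag to the end of the
    next tag, so neighbouring tiles overlap by one tag.
    """
    starts = []
    for tag in tags:
        first = reference.find(tag)
        # A tag is only usable as an anchor if it maps to exactly one place.
        if first < 0 or reference.find(tag, first + 1) >= 0:
            raise ValueError(f"tag {tag!r} must occur exactly once")
        starts.append(first)
    if starts != sorted(starts):
        raise ValueError("tags must be listed in genomic order")

    library = {}
    for step in range(len(tags) - 1):
        begin = starts[step]
        end = starts[step + 1] + len(tags[step + 1])
        position_id = f"{prefix:02x}.{step:04x}"
        # The reference sequence at every tile position gets the value 0.
        library[position_id] = {0: reference[begin:end]}
    return library


# Toy data (hypothetical): a 24-base "genome" and three 4-base tags.
TOY_REFERENCE = "ACGTTTGACCGGATATCCAAGGCT"
TOY_TAGS = ["ACGT", "CCGG", "CCAA"]
reference_library = build_reference_tiles(TOY_REFERENCE, TOY_TAGS)
```

With three tags the toy reference yields two tile positions, `00.0000` and `00.0001`, each holding the reference sequence under value 0.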

EXTENDING THE TILE LIBRARY

…010020……011031…

Tile Library
Tile Position Id: 00.002b, 00.002c, 00.002d, 00.002e, 00.002f, 00.0030, …

EXTENDING THE TILE LIBRARY

…00201*……1*11**…

Tile Library
Tile Position Id: 00.002b, 00.002c, 00.002d, 00.002e, 00.002f, 00.0030, …
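A minimal sketch of how a genome might be turned into a vector while the library grows. It assumes the genome has already been cut into tiles at the tag positions, and that a previously unseen sequence simply receives the next unused integer at its tile position; the slides do not say how Lightning numbers new variants. The `*` entries in the vector above and tiles that span several positions are not modeled. All names and sequences are hypothetical.

```python
def encode_genome(tile_sequences, library):
    """Map each (position_id, sequence) pair to a tile value, extending the library.

    Assumption: a sequence not yet in the library is stored under the next
    unused integer for that tile position.
    """
    vector = []
    for position_id, sequence in tile_sequences:
        variants = library.setdefault(position_id, {})
        for value, known in variants.items():
            if known == sequence:
                break  # sequence already in the library; reuse its value
        else:
            value = len(variants)  # first unseen sequence at an empty position is 0
            variants[value] = sequence
        vector.append(value)
    return vector


def decode_genome(vector, position_ids, library):
    """Look each tile value back up in the library."""
    return [library[pid][value] for pid, value in zip(position_ids, vector)]


# Toy library (hypothetical sequences) holding only the reference tiles.
library = {
    "00.002b": {0: "ACGTTTGACCGG"},
    "00.002c": {0: "CCGGATATCCAA"},
}
positions = ["00.002b", "00.002c"]

person_a = [("00.002b", "ACGTTTGACCGG"), ("00.002c", "CCGGATTTCCAA")]
person_b = [("00.002b", "ACGTATGACCGG"), ("00.002c", "CCGGATTTCCAA")]

vector_a = encode_genome(person_a, library)
vector_b = encode_genome(person_b, library)
decoded_b = decode_genome(vector_b, positions, library)
```

Both people carry the same non-reference sequence at `00.002c`, so it is stored once and both vectors point to it. This sharing is what would let the library grow more slowly than the number of genomes.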

RATE OF GROWTH OF THE TILE LIBRARY

CGF AND TILE LIBRARY FACILITATE QUERIES ON SEQUENCES

Requires: beginning locus and end locus
Returns: the sequences between the two loci for all people in the population
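A sketch of the locus-range query, assuming each tile position carries its reference start and end coordinates (`tile_spans`, a hypothetical structure). For brevity it returns whole tiles that overlap the requested interval; trimming to the exact bases, and handling tiles whose length differs from the reference, is left out. This is not the Lightning API.

```python
def query_region(begin, end, tile_spans, library, population):
    """Return {person: [tile sequences]} for tiles overlapping [begin, end).

    tile_spans: list of (position_id, ref_start, ref_end), in genomic order.
    population: {person: vector of tile values}, one value per tile position.
    """
    hits = [
        index
        for index, (_, ref_start, ref_end) in enumerate(tile_spans)
        if ref_start < end and ref_end > begin
    ]
    # Only the compact vectors are scanned; sequences are fetched by lookup.
    return {
        person: [library[tile_spans[i][0]][vector[i]] for i in hits]
        for person, vector in population.items()
    }


# Toy data (hypothetical), with neighbouring tiles overlapping by one tag.
tile_spans = [("00.0000", 0, 12), ("00.0001", 8, 20)]
library = {
    "00.0000": {0: "ACGTTTGACCGG", 1: "ACGTATGACCGG"},
    "00.0001": {0: "CCGGATATCCAA", 1: "CCGGATTTCCAA"},
}
population = {"person_a": [0, 1], "person_b": [1, 1]}

narrow = query_region(13, 18, tile_spans, library, population)
wide = query_region(5, 10, tile_spans, library, population)
```

The query touches only the vector entries for the overlapping tile positions, which is one plausible reason such lookups can be fast across a population.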

TIME USED FOR QUERIES ON SEQUENCES

TIME PER BASE FOR QUERIES ON SEQUENCES

CGF FACILITATES SEVERAL IMPORTANT ANALYSIS TYPES

• Unsupervised machine learning
• Supervised machine learning (case/control)
• GWAS

These analyses encompass all variation, not just SNP variation.
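Because every genome is a vector over the same tile positions, a population can be stacked into a matrix with little pre-processing. The sketch below one-hot encodes tile values in plain Python. One-hot encoding is a common choice assumed here for illustration; the slides do not state which feature encoding the authors use, and the names and data are hypothetical.

```python
def one_hot_features(population):
    """Turn {person: tile-value vector} into a 0/1 feature matrix.

    Each column is one (tile position index, tile value) pair observed in the
    population, so any kind of sequence difference becomes a feature.
    """
    people = sorted(population)
    n_positions = len(population[people[0]])
    columns = []
    for position in range(n_positions):
        observed = {population[person][position] for person in people}
        for value in sorted(observed):
            columns.append((position, value))
    matrix = [
        [1 if population[person][position] == value else 0
         for position, value in columns]
        for person in people
    ]
    return people, columns, matrix


# Toy population (hypothetical): three people, three tile positions.
population = {
    "person_a": [0, 1, 0],
    "person_b": [1, 1, 0],
    "person_c": [0, 2, 0],
}
people, columns, matrix = one_hot_features(population)
```

The rows of `matrix` could then be passed to a clustering routine or, together with case/control labels, to a classifier.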

COMPACT GENOME FORMAT FINAL THOUGHTS

• Allows annotations: the Tile Library can be annotated by canonical and in-house annotation pipelines, thus automatically applying annotations to all CGF files
• Small
• Standardized
• Fast to query
• Designed for machine learning

Thank you!

Any Questions?

Preliminary implementation: lightning-dev3.curoverse.com/brca
Source code: https://github.com/curoverse/lightning
Software license: GNU AGPLv3

GENERATING THE REFERENCE TILE LIBRARY WITH MULTIPLE TAG SETS

Tag Set … … Tag Set

RATE OF GROWTH OF THE TILE LIBRARY (NO CALLS CREATE VARIANTS)