An Open-Source Format for Personal Genome Representation Enabling Fast Queries and Analyses of Human Genomes
Compact Genome Format
Sally Guthrie, Research Scientist, Curoverse
ACKNOWLEDGEMENTS
Alexander Wait Zaranek, Chief Scientist, Curoverse
Abram Connelly, Research Scientist, Curoverse
CURRENT USES OF GENOMIC DATA
Patient Care
• Analyze one genome for rare and pathogenic variants
Population Analysis
• Examine a population for rare variants
• Separate a population into subgroups
• Case/Control Studies and GWA Studies
• Can require merging multiple data sets
• Can require using supervised and unsupervised machine learning
VARIANT CALL FORMAT (VCF) SNAPSHOT
Advantages
• Very flexible
• Easily annotated with canonical or in-house annotation pipelines
• Can be small (with compression)
Disadvantages
• Difficult to merge VCFs between studies
• Can be slow to query and to run machine learning algorithms on (requires pre-processing)
WHAT IS COMPACT GENOME FORMAT (CGF)?
Compact Genome Format is a compressed representation of a genomic sequence
Allows analysis to be run on the compressed data
Represents a sequence using a series of vectors
• Each position in a vector is termed a “tile”
• The value at each position points to a sequence in a “Tile Library,” a pan-genome
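The vector-plus-library idea above can be sketched in a few lines of Python. This is a toy illustration only: the sequences, the `TILE_LIBRARY` layout, and the `decode` helper are assumptions made for this example, not the actual CGF encoding.

```python
# Hypothetical toy tile library: tile position -> list of known sequences.
# Index 0 holds the reference sequence; higher indices hold other variants.
TILE_LIBRARY = {
    "00.0000": ["ACGTAC", "ACGTTC"],
    "00.0001": ["GGATCA", "GGCTCA", "GGATCC"],
    "00.0002": ["TTAGGC"],
}

# Zero-padded lowercase hex positions sort correctly as plain strings.
TILE_ORDER = sorted(TILE_LIBRARY)


def decode(genome_vector):
    """Rebuild a sequence by looking up each vector value in the library."""
    return "".join(
        TILE_LIBRARY[position][variant]
        for position, variant in zip(TILE_ORDER, genome_vector)
    )


# One person's genome is just a short vector of variant indices.
person = [1, 0, 0]
print(decode(person))
```

Storing small integers per tile instead of the bases themselves is what keeps each genome compact; the sequences live once in the shared library.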
GENERATING THE REFERENCE TILE LIBRARY
Human Reference Genome (with tag sets highlighted)
Tag Set: …
1. Choose a tag set of unique 24-base-long sequences
2. Map the tag set to a reference genome
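A minimal sketch of the two steps above, assuming that "unique" means the tag occurs exactly once in the reference. The function names `unique_kmers` and `map_tags` are hypothetical, and the toy example uses 4-base tags rather than 24-base tags to stay readable.

```python
from collections import Counter


def unique_kmers(reference, k=24):
    """Return {kmer: start} for every k-mer that occurs exactly once."""
    last_start = len(reference) - k + 1
    counts = Counter(reference[i:i + k] for i in range(last_start))
    return {
        reference[i:i + k]: i
        for i in range(last_start)
        if counts[reference[i:i + k]] == 1
    }


def map_tags(reference, tags):
    """Map each tag to its 0-based start, or None if it is absent or repeated.

    Assumes all tags have the same length.
    """
    unique = unique_kmers(reference, len(tags[0]))
    return {tag: unique.get(tag) for tag in tags}


# Toy reference: "ACGT" appears twice, so it cannot serve as a tag.
toy_reference = "ACGTACGTTTGA"
print(map_tags(toy_reference, ["ACGT", "GTTT"]))
```

Counting every window handles overlapping repeats, which a plain substring count would miss.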
GENERATING THE REFERENCE TILE LIBRARY
1. Save the sequence between each tag pair to the tile library
2. Give each of these reference sequences the value 0
Tile Position Id
00.0000
00.0001
00.0002
… …
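The construction on this slide can be sketched as follows. Several details are assumptions for illustration: the `build_reference_library` name, reading a tile position such as 00.0001 as two hex fields (called `path` and `step` here), and leaving the flanking tags out of the stored sequence, which the slides do not specify.

```python
def build_reference_library(reference, tag_starts, tag_len, path=0):
    """Cut the reference at consecutive tags and store each piece as value 0."""
    library = {}
    starts = sorted(tag_starts)
    for step, (left, right) in enumerate(zip(starts, starts[1:])):
        # Assumed id layout: two hex digits, a dot, four hex digits.
        position = f"{path:02x}.{step:04x}"
        # Sequence strictly between the end of one tag and the start of the next.
        between = reference[left + tag_len:right]
        # List index 0 is the reference sequence for this tile position.
        library[position] = [between]
    return library


# Toy reference with 3-base tags AAA, CCC, GGG at offsets 0, 6, 12.
toy_reference = "AAATCGCCCGATGGG"
print(build_reference_library(toy_reference, [0, 6, 12], tag_len=3))
```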
EXTENDING THE TILE LIBRARY
…010020……011031…
Tile Library
Tile Position Id
00.002b
00.002c
00.002d
00.002e
00.002f
00.0030
……
EXTENDING THE TILE LIBRARY
…00201*……1*11**…
Tile Library
Tile Position Id
00.002b
00.002c
00.002d
00.002e
00.002f
00.0030
……
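One way to picture extending the library: when a genome carries a sequence not yet stored at a tile position, the sequence is appended and its new index becomes the value in that person's vector. The sketch below is an assumption-laden illustration; `encode_tile` and `encode_genome` are hypothetical names, the sequences are invented, and the `*` entries shown in the vectors above are not modeled because the slides do not define them.

```python
def encode_tile(library, position, sequence):
    """Return the variant index for a sequence, adding it if it is new."""
    variants = library.setdefault(position, [])
    if sequence not in variants:
        variants.append(sequence)  # the library grows only for unseen sequences
    return variants.index(sequence)


def encode_genome(library, tiles):
    """Turn a list of (tile position, sequence) pairs into a vector of indices."""
    return [encode_tile(library, position, seq) for position, seq in tiles]


# Toy library holding only reference sequences (index 0) at two positions.
library = {"00.002b": ["ACGT"], "00.002c": ["GGCA"]}

first_person = [("00.002b", "ACGT"), ("00.002c", "GGTA")]
print(encode_genome(library, first_person))  # second tile is new to the library
print(library["00.002c"])
```

A second person with the same variant reuses the existing index, so the library grows with the number of distinct sequences rather than the number of genomes.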
CGF AND TILE LIBRARY FACILITATE
Queries on Sequences
• Requires: a beginning locus and an end locus
• Returns: the sequences between the two loci for all people in the population
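A hedged sketch of such a query, at tile granularity. The coordinate table `TILES`, the toy `LIBRARY` and `POPULATION`, and the `query` function are all assumptions for this example; the slides do not describe how loci are mapped to tile positions.

```python
# Hypothetical map from tile position to half-open reference coordinates.
TILES = [
    ("00.0000", 0, 6),
    ("00.0001", 6, 12),
    ("00.0002", 12, 18),
]

# Toy tile library: index 0 is the reference sequence at each position.
LIBRARY = {
    "00.0000": ["ACGTAC", "ACGTTC"],
    "00.0001": ["GGATCA", "GGCTCA", "GGATCC"],
    "00.0002": ["TTAGGC"],
}

# Each person is one vector of variant indices, in TILES order.
POPULATION = {
    "person_a": [0, 0, 0],
    "person_b": [1, 2, 0],
}


def query(begin_locus, end_locus):
    """Return, per person, the sequences of all tiles overlapping the range."""
    hits = [
        i for i, (_, start, end) in enumerate(TILES)
        if start < end_locus and end > begin_locus
    ]
    return {
        name: "".join(LIBRARY[TILES[i][0]][vector[i]] for i in hits)
        for name, vector in POPULATION.items()
    }


print(query(4, 10))
```

The lookup touches only the vector entries for the overlapping tiles, so the cost per person does not depend on genome length.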
CGF FACILITATES SEVERAL IMPORTANT ANALYSIS TYPES
Unsupervised Machine Learning
Supervised Machine Learning (Case/Control)
GWAS
These analyses encompass all variation, not just SNP variation
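Because each genome is already a fixed-length vector, it can be turned into a feature matrix with little pre-processing. The sketch below one-hot encodes toy tile vectors; the `one_hot` function and the data are assumptions for illustration, and the resulting matrix is only one of several encodings a learning pipeline might use.

```python
def one_hot(vectors):
    """One-hot encode tile vectors.

    Returns (columns, matrix), where each column is a (tile index, variant)
    pair observed in the data and each row is one person.
    """
    n_tiles = len(vectors[0])
    columns = sorted({(i, v[i]) for v in vectors for i in range(n_tiles)})
    matrix = [
        [1 if v[i] == variant else 0 for i, variant in columns]
        for v in vectors
    ]
    return columns, matrix


# Three toy genomes over three tile positions.
genomes = [
    [0, 1, 0],
    [0, 2, 0],
    [1, 1, 0],
]
columns, matrix = one_hot(genomes)
print(columns)
for row in matrix:
    print(row)
```

Each column stands for a whole tile variant, so insertions, deletions, and multi-base changes are encoded the same way as single-base substitutions.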
COMPACT GENOME FORMAT FINAL THOUGHTS
• Allows annotations: the Tile Library can be annotated by canonical and in-house annotation pipelines, which automatically applies the annotations to all CGF files
• Small
• Standardized
• Fast to query
• Designed for machine learning
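The annotation point can be illustrated with a small sketch: an annotation is attached once to a (tile position, variant) pair in the library, and any genome carrying that variant picks it up by lookup. The `ANNOTATIONS` table, its placeholder text, and the `annotate` function are hypothetical.

```python
# Hypothetical annotations keyed by (tile position, variant index).
ANNOTATIONS = {
    ("00.0001", 2): ["placeholder note from an in-house pipeline"],
}

TILE_ORDER = ["00.0000", "00.0001", "00.0002"]


def annotate(genome_vector):
    """Collect the library annotations that apply to one genome vector."""
    return {
        position: ANNOTATIONS[(position, variant)]
        for position, variant in zip(TILE_ORDER, genome_vector)
        if (position, variant) in ANNOTATIONS
    }


print(annotate([1, 2, 0]))  # carries the annotated variant
print(annotate([0, 0, 0]))  # carries none
```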
Thank you!
Any Questions?
Preliminary implementation: lightning-dev3.curoverse.com/brca
Source code: https://github.com/curoverse/lightning
Software license: GNU AGPLv3