The Future of DNA Sequencing Technology. Graham Taylor, Melbourne University, Human Variome Project (Australia), Victorian Clinical Genetics Laboratories


Page 1: Graham Taylor - The future of DNA sequencing technology

The Future of DNA Sequencing Technology

Graham Taylor
Melbourne University, Human Variome Project (Australia), Victorian Clinical Genetics Laboratories

Page 2: Graham Taylor - The future of DNA sequencing technology

Context and Topics

1. Technology review and idiosyncratic selection of noteworthy developments and trends in NGS hardware and software from the perspective of Genomic Medicine (so mostly human genome) in the context of meeting clinical needs. Sadly, not covering transcriptomics, ChIP-seq or non-human genomes.

2. Applications and implications for diagnostics

Page 3: Graham Taylor - The future of DNA sequencing technology

An Ultimate Goal for Sequence Analysis?

For sequencing
– Chromosome-length reads
– Perfect base calling accuracy
– Each molecule is read
– Highly parallel

For analysis
– De novo assembly
– Well curated reference resources
– Data integrated with other biological and medical resources

Page 4: Graham Taylor - The future of DNA sequencing technology

Research, translation and service

• Original
• Surprising
• >80% accurate
• Numerator-driven: get publications
• Bespoke

• Proven
• Predictable
• >99.99% accurate
• Denominator-driven (cost sensitive)
• Standardised

Page 5: Graham Taylor - The future of DNA sequencing technology

Cost and performance

[Figures: cost per base; Illumina share price]

Now is the winter of our discount tests (unless you are Illumina)

Page 6: Graham Taylor - The future of DNA sequencing technology

The case for disease-centric analysis

• $1,000 genomes or 1,000 x $1 interesting regions?
• How to validate 3.5 x 10^9 tests
• Sequencing costs are not limiting
• Quality and accuracy are incomplete
• Perform tests for a (clinical) reason

Page 7: Graham Taylor - The future of DNA sequencing technology

Sequence performance and clinical needs

[Figure axes: number of reads; length of reads]

                            Genetics  Tumor-Analysis  Microbiology
Sample/library preparation  3         4               4
Base calling accuracy       5         5               3
De novo assembly            3         5               4
Detect rare events          3         5               5
Portability                 2         3               4

Page 8: Graham Taylor - The future of DNA sequencing technology

How many variants per exome?

SNP count           Study
20,000              Choi et al. PNAS 2009
142,000             Mullikin NIH, unpublished 2010
50,000              Clark et al. Nature Biotechnology 2011
125,000             Smith et al. Genome Biology 2011
100,000             Johnston & Biesecker Human Molecular Genetics 2013
200,000 to 400,000  Yang et al. N Engl J Med 2013

• 20-fold range
• Exome designs vary
• Likely to be higher variant count in African populations as the reference sequence is non-African

Page 9: Graham Taylor - The future of DNA sequencing technology

Low concordance of multiple variant-calling pipelines
O’Rawe et al. Genome Medicine 2013, 5:28

SNV concordance: 57.4%
Indel concordance: 26.8%

Page 10: Graham Taylor - The future of DNA sequencing technology

Venn diagrams of selected CNV detection methods in real data processing

Duan J, Zhang J-G, Deng H-W, Wang Y-P (2013) Comparative Studies of Copy Number Variation Detection Methods for Next-Generation Sequencing Technologies. PLoS ONE 8(3): e59128. doi:10.1371/journal.pone.0059128 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059128

Page 11: Graham Taylor - The future of DNA sequencing technology

De novo Assembly (the unfinished genome)

• Huddleston et al. Genome Res. 2014. 24: 688-696
– Within the human genome, there are >900 annotated genes mapping to large segmental duplications. Such genes are typically missing or misassembled in working draft assemblies of genomes
– The widespread adoption of next-generation sequencing methods for de novo genome assemblies has complicated the assembly of repetitive sequences and their organization
– Resolved regions that are complex in a genome-wide context but simple in isolation for a fraction of the time and cost of traditional methods using long-read single molecule, real-time (SMRT) sequencing and assembly technology
– SMRT sequencing of large-insert clones can significantly improve sequence assembly within complex repetitive regions of genomes

Page 12: Graham Taylor - The future of DNA sequencing technology

Recent past and future

RIP        Coming soon?

Page 13: Graham Taylor - The future of DNA sequencing technology

SBS

• GnuBIO/BioRad: emulsion microfluidics for targeted sequencing and hotspot analysis of rare variants
• LaserGen: Lightning Terminators™; increased accuracy, longer reads and faster cycle-times. Weidong Wu et al. Termination of DNA synthesis by N6-alkylated, not 3′-O-alkylated, photocleavable 2′-deoxyadenosine triphosphate. Nucleic Acids Res. Oct 2007; 35(19): 6339–6349.
• Qiagen/Intelligent Biosystems
• QuantuMDx: Zheng G, Patolsky F, Cui Y, Wang WU, Lieber CM. Multiplexed electrical detection of cancer markers with nanowire sensor arrays. Nat Biotechnol. 2005 Oct;23(10):1294-301.

Page 14: Graham Taylor - The future of DNA sequencing technology

Currently SBS are Market Leaders

• Illumina
• Proton Torrent
• PacBio

Page 15: Graham Taylor - The future of DNA sequencing technology

PacBio

• English et al. (2012) Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS ONE 7(11): e47768
• Loomis et al. (2012) Sequencing the unsequenceable: Expanded CGG-repeat alleles of the fragile X gene. Genome Research

Page 16: Graham Taylor - The future of DNA sequencing technology

Nanopores

• Electronic BioSciences: developing a system with a single/few pores with a very fast rate of sequencing of ~50kb/second
• Genia: DNA polymerase to incorporate nucleotides with PEG-based NanoTags. As the bases are incorporated the NanoTags are cleaved, allowing them to travel through the pore where they can be measured, generating sequence-specific information
• IBM: a solid state nanopore using alternating layers of metal and dielectric material to control the rate of passage through the nanopore
• NABsys: Modified DNA (e.g. by SBH) read via nanopore. Not yet sequencing, but very long reads
• NobleGen: combination of optical detection on nanopores
• Oxford Nanopore: exonuclease and strand-based nanopore methods

Page 17: Graham Taylor - The future of DNA sequencing technology

Real long reads
Nanopore sequencing

8,476 base single read

Page 18: Graham Taylor - The future of DNA sequencing technology

Not production ready

[Figure: wiggle plot of mean signal (picoamps, axis 30 to 70); total time 273 seconds]

Viterbi algorithm for all trinucleotides
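The slide only names the technique, so the following is a minimal sketch of what Viterbi decoding over trinucleotide states could look like, not the base caller shown in the talk. It assumes one hidden state per trinucleotide (64 states), a strand that advances exactly one base per measurement, and a Gaussian noise model. The current levels in `LEVELS`, the noise `SIGMA`, and the helper names are all hypothetical and chosen purely for illustration.

```python
import itertools
import math

BASES = "ACGT"
# One hidden state per trinucleotide: 4^3 = 64 states.
KMERS = ["".join(p) for p in itertools.product(BASES, repeat=3)]

# HYPOTHETICAL current levels: evenly spaced between 30 and 70 pA.
# Real pore levels are measured empirically and are not this well separated.
LEVELS = {k: 30.0 + 40.0 * i / (len(KMERS) - 1) for i, k in enumerate(KMERS)}
SIGMA = 0.2  # assumed measurement noise (pA)


def log_emission(kmer, current):
    """Gaussian log-likelihood of one current reading, up to a constant."""
    z = (current - LEVELS[kmer]) / SIGMA
    return -0.5 * z * z


def viterbi(currents):
    """Return the most likely trinucleotide path for a list of readings."""
    score = {k: log_emission(k, currents[0]) for k in KMERS}
    backpointers = []
    for current in currents[1:]:
        new_score, pointer = {}, {}
        for k in KMERS:
            # The strand moves one base per step, so a predecessor must
            # end with the first two bases of k (four candidates).
            best = max((b + k[:2] for b in BASES), key=lambda p: score[p])
            new_score[k] = score[best] + math.log(0.25) + log_emission(k, current)
            pointer[k] = best
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(KMERS, key=lambda k: score[k])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


def bases_from_path(path):
    """Collapse overlapping trinucleotides back into a base sequence."""
    return path[0] + "".join(k[-1] for k in path[1:])


def simulate_currents(seq):
    """Noise-free readings for a sequence under the hypothetical levels."""
    return [LEVELS[seq[i:i + 3]] for i in range(len(seq) - 2)]
```

Calling `bases_from_path(viterbi(simulate_currents(seq)))` on a short sequence recovers it under these idealised assumptions. Real signal has variable dwell times, skipped bases and overlapping levels, which is presumably part of why the slide is titled "Not production ready".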

Page 19: Graham Taylor - The future of DNA sequencing technology

Electron Microscopy

• ZS Genetics: directly visualizes the sequence of DNA molecules using electron microscopy. Proof of principle by the use of a dUTP nucleotide with a single mercury atom attached to the nitrogenous base. This modification is small enough to allow very long molecules with labels at each A-U to be seen using annular dark-field scanning transmission electron microscopy (ADF-STEM). Bell DC, Thomas WK, Murtagh KM, Dionne CA, Graham AC, Anderson JE, Glover WR. DNA base identification by electron microscopy. Microsc Microanal. 2012 Oct;18(5):1049-53.
• Reveo: an atomic force microscopy approach, called the Omni Molecular Recognizer Application (OmniMoRA), that will use arrays of nano-knife edge probes to measure the vibrational characteristics of individual bases on DNA molecules that have been stretched and immobilized on a surface

Page 20: Graham Taylor - The future of DNA sequencing technology

Electron Microscopy

Progress toward an aberration-corrected low energy electron microscope for DNA sequencing and surface analysis. Mankos M, Shadman K, N'diaye AT, Schmid AK, Persson HH, Davis RW. J Vac Sci Technol B Nanotechnol Microelectron. 2012 Nov;30(6):6F402

Imaging of reduced 5′-/5DTPA/C-20mer on Au substrate: (a) (b) AFM images at two magnifications, (c) height profile along line shown in (a), (d) height profile along line shown in (b), and (e) LEEM images at three different landing energies.

Aiming for 50 megabase reads with phred 60

Page 21: Graham Taylor - The future of DNA sequencing technology

Hardware Trends

• Clonal sequencing
– Increasing accuracy
– Increasing read lengths
– Increasing read counts

• Single molecule sequencing
– PacBio
– Oxford Nanopore

Page 22: Graham Taylor - The future of DNA sequencing technology

Increasing read counts via patterned flow cells

• Patterned flow-cells useful for nucleic acid analysis US 20120316086 A1
• Kinetic exclusion amplification of nucleic acid libraries WO 2013188582 A1
– (i) capturing the different target nucleic acids at the amplification sites at an average capture rate, and
– (ii) amplifying the target nucleic acids captured at the amplification sites at an average amplification rate, wherein the average amplification rate exceeds the average capture rate.

Page 23: Graham Taylor - The future of DNA sequencing technology

Patterned flow cells, super Poisson kinetics
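To make "super Poisson" concrete, here is a small worked calculation. If templates land at random, the number per site is Poisson distributed and the fraction of sites holding exactly one template is λ·e^(−λ), which peaks at 1/e (about 36.8%) when λ = 1. The second function is a deliberately crude toy model of kinetic exclusion and an assumption of this sketch, not the patented method: capture is a Poisson process, a captured template fills its site after a fixed time of 1/amplification_rate, and a site counts as clonal if no second template arrives during that window.

```python
import math


def poisson_single_occupancy(lam):
    """Fraction of sites with exactly one template under random (Poisson) seeding."""
    return lam * math.exp(-lam)


def exclusion_clonal_fraction(capture_rate, amplification_rate, duration):
    """Toy estimate of occupied-and-clonal sites when amplification outruns capture.

    Assumptions (illustrative only): capture is a Poisson process with the
    given rate per site; once a template is captured the site is filled after
    a fixed time 1 / amplification_rate; a second capture inside that window
    would make the site polyclonal.
    """
    occupied = 1.0 - math.exp(-capture_rate * duration)
    fill_time = 1.0 / amplification_rate
    no_second_capture = math.exp(-capture_rate * fill_time)
    return occupied * no_second_capture


if __name__ == "__main__":
    print("Poisson optimum:", poisson_single_occupancy(1.0))
    print("Toy exclusion model:", exclusion_clonal_fraction(1.0, 100.0, 5.0))
```

Under this toy model the usable fraction rises well above the Poisson ceiling once the amplification rate is much larger than the capture rate, which is the condition stated in claim (ii) above.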

Page 24: Graham Taylor - The future of DNA sequencing technology

Pseudo-long reads via “Moleculo”

Page 25: Graham Taylor - The future of DNA sequencing technology

Genome informatics example

• Does Moleculo’s technology have both a wet lab and a bioinformatics aspect?

• Yes, it’s about 50:50. One doesn’t make sense without the other. There are two components: first, there is a molecular biology kit and protocol that takes in genomic DNA and turns it into a sequencer-compatible library. After modifying and tagging the DNA, this allows the second component, the algorithmic part, to take the short reads and reconstruct long reads using those tags. Those are two separate parts. We developed both on campus, and improved upon them after we started the company last year.

Page 26: Graham Taylor - The future of DNA sequencing technology

Reducing assembly complexity of microbial genomes with single-molecule sequencing

identifying DNA modification, such as methylation patterns, directly from the single-molecule sequencing data [15]. While adoption of this technology was initially slowed by the low accuracy of the single-pass sequences, recent advancements have demonstrated that this drawback can be algorithmically managed to produce assemblies of unmatched continuity [7,8,16]. Steady improvements to the PacBio technology continue to increase read lengths and yield [17], while future technologies promise to combine accuracy with length using either nanopores [11] or advanced sample preparation [18]. Improved microbial genome assembly is an obvious application of these recent developments in long-read sequencing.

Genome assembly is the process of reconstructing a genome from many shorter sequencing reads [19-21]. It is typically formulated as finding a traversal of a properly defined graph of reads, with the ultimate goal of reconstructing the original genome as faithfully as possible. Repeated sequence in the genome induces complexity in the graph and poses the greatest challenge to all assembly algorithms [22]. In addition, repeats are often the focus of analysis [23-25], making their correct assembly critical for subsequent studies. However, repeats can only be resolved by a spanning read or read pair that is uniquely anchored on both sides. Read pairs are typically used due to their length potential (tens of kilobase pairs), but introduce additional complexity because they cannot be precisely sized. Alternatively, long-read sequencing promises to more accurately resolve repeats and directly assemble genomes into their constituent replicons. Figure 1 shows the benefit of increasing read length when assembling Escherichia coli K12 MG1655. This genome can only be assembled into a single contig when the read length exceeds the size of the longest repeat in the genome, a multi-copy rDNA operon. The rDNA operon, sized around 5 to 7 kbp, is the largest repeat class in most bacteria and archaea [26]. Therefore, sequencing reads longer than the rDNA operon, such as those produced by single-molecule sequencing, can automatically close most microbial genomes.

ALLPATHS-LG was the first assembler shown to produce complete microbial genomes using single-molecule sequences [7]. Utilizing a combination of PacBio RS single-molecule reads (2 to 3 kbp), short-range Illumina read pairs (<300 bp insert), and long-range Illumina read pairs (3 to 10 kbp insert), ALLPATHS-LG assembles the Illumina reads first using a de Bruijn graph and incorporates PacBio reads afterwards to patch coverage gaps and resolve repeats. Riberio et al. [7] tested this method on 16 genomes and consensus accuracy was measured at 99.9999% on 3 genomes with an available reference. Four of the sixteen genomes were successfully assembled into a complete genome - the remaining genomes were all highly continuous but left unresolved due to large-scale repeats. These results are promising, especially in terms of consensus accuracy; however, the method requires two different sequencing platforms and three library preparations, which limits its efficiency. In addition, the jumping libraries were observed to be inconsistent at spanning large repeats due to biases in the library construction process.

Ideally, complete genomes could be reconstructed from a single fragment library, minimizing costs. Previously, pair libraries were the only sequencing method capable of spanning large repeats, such as the rDNA operon, but the PacBio RS is now capable of producing single-molecule reads of the same length. Leveraging this recent development, we present an approach for microbial genome closure that relies on overlapping and assembling single-molecule reads de novo rather than

Figure 1 (panels A, B, C). Genome assembly graph complexity is reduced as sequence length increases. Three de Bruijn graphs for E. coli K12 are shown for k of 50, 1,000, and 5,000. The graphs are constructed from the reference and are error-free following the methodology of Kingsford et al. [27]. Non-branching paths have been collapsed, so each node can be thought of as a contig with edges indicating adjacency relationships that cannot be resolved, leaving a repeat-induced gap in the assembly. (A) At k = 50, the graph is tangled with hundreds of contigs. (B) Increasing the k-mer size to k = 1,000 significantly simplifies the graph, but unresolved repeats remain. (C) At k = 5,000, the graph is fully resolved into a single contig. The single contig is self-adjacent, reflecting the circular chromosome of the bacterium.

Koren et al. Genome Biology 2013, 14:R101 http://genomebiology.com/2013/14/9/R101

Long, single-molecule reads are sufficient for the complete assembly of most known microbial genomes. The assemblies presented here have good likelihood and finished-grade consensus accuracy exceeding 99.9999%.

Koren et al. Genome Biology 2013, 14:R101
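The point of Figure 1, that the assembly graph untangles once k exceeds the longest repeat, can be reproduced on a toy scale. The sketch below is an illustration under stated assumptions rather than the authors' method: it builds a tiny circular "genome" containing one exact two-copy repeat (all sequences are random and hypothetical, generated from a fixed seed) and counts k-mers that have more than one distinct successor, which is where a de Bruijn graph branches.

```python
import random
from collections import defaultdict


def toy_genome(repeat_len=20, unique_len=200, seed=0):
    """Circular toy genome: unique + repeat + unique + repeat (hypothetical data)."""
    rng = random.Random(seed)

    def rand(m):
        return "".join(rng.choice("ACGT") for _ in range(m))

    repeat = rand(repeat_len)
    return rand(unique_len) + repeat + rand(unique_len) + repeat


def branching_nodes(genome, k):
    """Count k-mers with more than one distinct successor in a circular genome."""
    n = len(genome)
    # Append the first k bases so every start position yields a k-mer
    # and its successor across the circular junction.
    doubled = genome + genome[:k]
    successors = defaultdict(set)
    for i in range(n):
        kmer = doubled[i:i + k]
        nxt = doubled[i + 1:i + 1 + k]
        successors[kmer].add(nxt)
    return sum(1 for s in successors.values() if len(s) > 1)


if __name__ == "__main__":
    genome = toy_genome()
    for k in (10, 40):
        print(k, branching_nodes(genome, k))
```

With k shorter than the 20-base repeat the graph contains at least one branch point; with k well above the repeat length every k-mer includes unique flanking sequence and the branches disappear, mirroring panels (A) to (C) at a much smaller scale.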

Page 27: Graham Taylor - The future of DNA sequencing technology

Clinical Drivers

• Manageable workflow
• Cost efficiency
• Sensitivity and specificity
• Referring to the clinical question
• Depth vs. breadth of coverage

Page 28: Graham Taylor - The future of DNA sequencing technology

Adapting NGS to purpose

[Figure: counts (axis 0 to 1200) for control and expansion samples; series (GGCCCC)4, (GGCCCC)5 and (GGCCCC)6]

Merging forward and reverse reads

Use a HiFi polymerase

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTTTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGTATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGTA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATTAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

TAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATCTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGGAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAGAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

Discarding rare (wrong) reads

Detecting allele expansion

[Figure: count against nucleotide base pair length (110 to 128) for "Control" and CG11 (MI095); count axis 0 to 8000]

Sample  Case   Gene   Genomic coordinates      RefSeq       Variant (c.)         Variant (p.)          Allele % (Amplivar)  Allele % (MiSeq Reporter)
1       22029  BRCA2  chr13_329116280G>T       NM_000059.3  c.3136G>T            p.Glu1046Ter          26.03%               22.37%
2       22814  BRCA1  chr17_412464430insGA     NM_007300.3  c.1105_1106insTC     p.Asp369ValfsTer6     76.27%               75.65%
3       23074  BRCA2  chr13_329144380delT      NM_000059.3  c.5946delT           p.Ser1982ArgfsTer22   74.36%               74.27%
4       23162  BRCA1  chr17_412760430delCT     NM_007300.3  c.68_69delAG         p.Glu23ValfsTer17     76.04%               78.69%
5       23165  BRCA2  chr13_329140660delAATT   NM_000059.3  c.5574_5577delAATT   p.Ile1859LysfsTer3    100.00%              95.53%
5       23165  BRCA1  chr17_412444380del75     NM_007300.3  c.3005_3079del75     p.Asn1002_Ile1027del  48.71%               Not found
6       23179  BRCA1  chr17_41215948_G>A       NM_007300.3  c.5095C>T            p.Arg1699Trp          85.24%               87.14%
7       23210  BRCA2  chr13_329688360insT      NM_000059.3  c.9266_9267insT      p.Val3091ArgfsTer20   60.06%               59.64%
8       23815  BRCA1  chr17_412562060insA      NM_007300.3  c.374dupT            p.Gln126ProfsTer16    100.00%              94.23%
9       23824  BRCA1  chr17_4125691550delA     NM_007300.3  c.271delT            p.Cys91ValfsTer28     81.62%               83.32%
10      23828  BRCA2  chr13_329128870delATTAC  NM_000059.3  c.4395_4399delATTAC  p.Leu1466PhefsTer2    93.20%               91.76%

MSI by NGS

Genotyping

Page 29: Graham Taylor - The future of DNA sequencing technology

Library Construction

Covaris optimising

Lane  Sample
M     E-gel Quant ladder
1     gDNA
2     A
3     B
4     C
5     50bp ladder

Peak at 200bp
50 bp ladder, 350bp bright band

B: 213.4 bp    B: 208.8 bp    C: 201.4 bp

Page 30: Graham Taylor - The future of DNA sequencing technology

Nextera method of Library Construction

Page 31: Graham Taylor - The future of DNA sequencing technology

Higher coverage, greater reproducibility

[Figure axes: coverage; coefficient of variation]

Page 32: Graham Taylor - The future of DNA sequencing technology

Can we use capture coverage to report dosage to diagnostic standards?

[Figure: panels of samples against targets; autosomal targets and chrX targets]

Inter-sample variation is low, but low coverage prevents dosage estimation

Chr X is a good first pass test for dosage

Page 33: Graham Taylor - The future of DNA sequencing technology

XX vs. XY

8 female cases and 16 male cases showing reproducibility of coverage of X loci within each group. Loci with higher SDs were associated with reduced coverage.
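As a hedged illustration of the comparison described here, the sketch below normalises each sample's per-target coverage by its own mean autosomal coverage and then compares chrX targets between an XX group and an XY group, where a ratio near 2 is expected for a single-copy difference. The function names, the dictionary layout and the normalisation choice are assumptions made for this example; the slides do not state how the original analysis was computed.

```python
from statistics import mean, pstdev


def normalise(sample_cov, autosomal_targets):
    """Scale one sample's coverage so its mean autosomal coverage is 1.0."""
    scale = mean(sample_cov[t] for t in autosomal_targets)
    return {t: c / scale for t, c in sample_cov.items()}


def coefficient_of_variation(values):
    """Population SD divided by the mean; lower means more reproducible."""
    values = list(values)
    return pstdev(values) / mean(values)


def xx_xy_ratio(female_samples, male_samples, x_targets, autosomal_targets):
    """Per-target ratio of mean normalised coverage, XX group over XY group."""
    females = [normalise(s, autosomal_targets) for s in female_samples]
    males = [normalise(s, autosomal_targets) for s in male_samples]
    return {
        t: mean(s[t] for s in females) / mean(s[t] for s in males)
        for t in x_targets
    }


if __name__ == "__main__":
    # Hypothetical read depths, not data from the talk.
    females = [{"auto1": 100, "auto2": 200, "chrX_1": 150},
               {"auto1": 50, "auto2": 100, "chrX_1": 75}]
    males = [{"auto1": 100, "auto2": 200, "chrX_1": 75}]
    print(xx_xy_ratio(females, males, ["chrX_1"], ["auto1", "auto2"]))
```

A target whose XX/XY ratio sits far from 2, or whose coefficient of variation within a group is high, would be a poor candidate for dosage calls, which matches the observation that loci with higher SDs had reduced coverage.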

[Figures: two charts across X loci (x-axis 0 to 80) with series Average XX and Average XY; annotations 870 and 160]

Page 34: Graham Taylor - The future of DNA sequencing technology

EPCAM exon 9 to MSH2 exon 8 and EPCAM exon 9 to MSH2 exon 1 deletions

[Figure: per-target values (y-axis 0 to 1.6) across chr2 target regions in EPCAM (chr2:47596595-47613802), MSH2 (chr2:47630281-47710138) and MSH6 (chr2:48010323-48034049)]

Page 35: Graham Taylor - The future of DNA sequencing technology

Integrating data handling with sequencing operation

• Syncing MiSeq (miseq_rsync.sh): copies MiSeq run directory to server (every hour)
• Monitor data (monitor_data.sh): partly auto-fills metadata file (based on samplesheet metadata file name)
• Monitor run status (monitor_run_status_miseq.sh): checks if the run finishes with FASTQs generated on board
• Monitor metadata (monitor_metadata.sh): checks if the analysis is complete, then if metadata contains sufficient information for FASTQ renaming. Also checks and starts analysis workflow in samplesheet
• Rename FASTQs (rename_files.py): renames FASTQ files
• Workflow (/storage/local/sw/system_automagic/workflows): depending on the workflow defined in samplesheet (Project)

Page 36: Graham Taylor - The future of DNA sequencing technology

Automatic data population for sample sheets

Page 37: Graham Taylor - The future of DNA sequencing technology

HiSeq    MiSeq

Production: Taffy, Centos, 11TB
Development: S’Box, Centos, 1 TB Systems, 3TB Sandbox
Production: MiSeq Reporter PC, Windows
Backup: Biocube, Centos, RAID6 20TB

Image systems at install and after update
Snapshots: 14 12-hourly, 4 weekly, 12 monthly, 3 yearly
Snapshots (System, Scripts + Software only)
Transfer data as needed
Transfer new runs and backup analysis

Backup: ITS Tape (2 copies)
Most recent snapshot every 6 months for 3 years rolling

Required additional resources:
1. MiSeq Reporter Desktop PC
2. New Desktop PC for Seb
3. ITS Account
4. (NAS to expand Biocube)

Cost estimate:
1. 2x Desktop PCs: 3 – 4k
2. ITS tapes: max 11k/3 years for 20TB
3. (NAS: approx 10k for 28TB)

Run raw data to keep:
1. MiSeq: Complete Run Folder
2. HiSeq: Run Folder excl. Bcl, image files etc. (fastq = raw data)
3. Compress data after 1 month (excl. fastq)
4. Discard external data after 2 months

Page 38: Graham Taylor - The future of DNA sequencing technology

NGS-DB

Projects: Project_ID, Project_Description, User
Experiments: Exp_ID, Project_ID, Exp_Description, Prep_Method, Kit_ID (if applicable), Pipeline(s), LibOperator
Samples [LIMS?]: Sample_ID, Patient_ID, Family_ID, Sample_Type, Sample_Source, Sample_Conc, QC_Score
Libraries: Library_ID, Sample_ID, Exp_ID, Barcode_1, Barcode_2, LibConc, QC_P/F, Pool_IDs, MiSeqRun_IDs, HiSeqRun_Ids (FC+Lane)
MiSeqRuns: MiSeqRun_ID, MiCartridge_ID, MiSeq_LoadConc, SeqOperator, MiFC_ID, MiSeq_RunDate, MiSeq_ClustDens, MiSeq_PFClustDens, MiSeq_Reads, MiSeq_PFReads, ...
HiSeqRuns: HiSeqRun_ID, HiFC_ID, SeqOperator, HiSeq_LoadConc, SBS_RGT, Cluster_RGT, cBot_RGT, HiSeq_RunDate, HiSeq_ClustDens, HiSeq_PFClustDens, HiSeq_Reads, HiSeq_PFReads, ...
HiSeq_Flowcells: HiFC_ID, HiFC_CatNo, HiFC_LotNo, HiFC_Type, HiFC_expiry, HiSeqRun_ID
MiSeq_Flowcells: MiFC_ID, MiFC_CatNo, MiFC_LotNo, MiFC_Type, MiFC_expiry, MiSeqRun_ID
HiSeq_SBS (clust, cBot): SBS_RGT, SBS_expiry, SBS_CatNo, SBS_LotNo, SBS_Type, HiSeqRun_ID
MiSeq_Cartridges: Cart_ID, Cart_CatNo, Cart_LotNo, Cart_TypeNo, Cart_expiry, MiSeqRun_ID
LibraryPools: Pool_ID, Library_IDs, Exp_IDs, Barcodes_1, Barcodes_2, PoolConc, QC_P/F, MiSeqRun_ID(s), HiSeqRun_ID(s), HiSeqRun_Ids (FC+Lane)
User: Name, Institution, Contact Info, Billing Info, ...
Others: Pipelines, LibOperators, SeqOperators, Kits, PrepMethods
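One way to see how these tables relate is to express a small part of the design as an actual schema. The sketch below covers only Projects, Experiments, Samples and Libraries, uses SQLite through Python's standard sqlite3 module, and takes the shared ID fields as foreign keys. The column types, the choice of SQLite, the spelled-out field names and the helper functions are assumptions for illustration; the slide does not say which database engine was used.

```python
import sqlite3

# Subset of the NGS-DB design; all column types are assumed.
SCHEMA = """
CREATE TABLE Projects (
    Project_ID TEXT PRIMARY KEY,
    Project_Description TEXT,
    "User" TEXT
);
CREATE TABLE Experiments (
    Exp_ID TEXT PRIMARY KEY,
    Project_ID TEXT NOT NULL REFERENCES Projects(Project_ID),
    Exp_Description TEXT,
    Prep_Method TEXT
);
CREATE TABLE Samples (
    Sample_ID TEXT PRIMARY KEY,
    Patient_ID TEXT,
    Sample_Type TEXT
);
CREATE TABLE Libraries (
    Library_ID TEXT PRIMARY KEY,
    Sample_ID TEXT NOT NULL REFERENCES Samples(Sample_ID),
    Exp_ID TEXT NOT NULL REFERENCES Experiments(Exp_ID),
    Barcode_1 TEXT,
    Barcode_2 TEXT
);
"""


def create_db(path=":memory:"):
    """Create the illustrative schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def libraries_for_project(conn, project_id):
    """List library IDs belonging to a project via its experiments."""
    rows = conn.execute(
        "SELECT l.Library_ID FROM Libraries l "
        "JOIN Experiments e ON e.Exp_ID = l.Exp_ID "
        "WHERE e.Project_ID = ? ORDER BY l.Library_ID",
        (project_id,),
    )
    return [row[0] for row in rows]
```

Fields such as Pool_IDs or MiSeqRun_IDs that hold several IDs in one column would normally become link tables in a relational design; they are left out here to keep the sketch short.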

Page 39: Graham Taylor - The future of DNA sequencing technology

Alternatives to read mapping and alignment

• Grouped read testing
– Amplivar, AmbiVert
• Tiled matching
– MIST
• kmer subtraction
– Diamund
• Detection of allele expansions

Page 40: Graham Taylor - The future of DNA sequencing technology

coverage statistics
each amplicon read sorted by primers
grouped amplicon variants
grab amplicons
sort by locus
group amplicons
Edit Distance
Read Counts, Read Distribution and Analysis

Using the amplimer sequence to grab each amplicon is an alternative to querying the entire sequence output, with the advantage that each set of reads should be more homogeneous and amenable to grouping. Groups above a certain abundance (corresponding to the detection limit selected) can then be compared in detail with the canonical sequence using string comparison tools such as the Levenshtein (edit) distance, or by Smith-Waterman alignment. Using this approach we have confirmed that variants can be identified de novo, but with more interference from sequence errors than by grouped read typing. We have also shown that the current TruSeq Cancer Panel kit co-amplifies a region of chromosome 22 containing a perfect match to the pathogenic KIT Exon 11 c.1669T>A mutation. Artifactual data from the duplicated region risks the reporting of specious variants as false positive results.
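The steps in this paragraph (grab reads by amplimer, group them, keep groups above a detection limit, compare each group with the canonical sequence) can be sketched as follows. This is an illustration, not Amplivar's code: the function names, the prefix-only primer match and the example threshold are assumptions, and the Levenshtein routine is the textbook dynamic-programming version.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def grab_amplicon(reads, primer):
    """Keep reads that start with the given amplimer (simplified matching)."""
    return [r for r in reads if r.startswith(primer)]


def abundant_groups(reads, min_fraction):
    """Group identical reads and keep groups at or above the detection limit."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {seq: n for seq, n in counts.items() if n / total >= min_fraction}


def compare_to_canonical(groups, canonical):
    """Return (count, edit distance, sequence) tuples, most abundant first."""
    return sorted(
        ((n, levenshtein(seq, canonical), seq) for seq, n in groups.items()),
        reverse=True,
    )
```

A group with distance 0 is wild type; a well-supported group with a small non-zero distance is a candidate variant, while low-count groups are more likely to be sequencing errors, which is why the abundance threshold is applied before the comparison.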

[Figure: Average Read Count per Amplicon +/- SEM (y-axis 0 to 6000) for amplicons in NRAS, PIK3CA, KIT, EGFR, BRAF, PTEN and KRAS]

Smith-Waterman

BLAST

KRAS point mutation

363 reads; common mutation: p.G12A; chr12:25,398,290C>G; c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
220 reads; wildtype
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
64 reads; non-coding chr12:295,398,329C>T
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
39 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATTGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
27 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGATTATATTAGAACATGTCACACATAAGGTTA
26 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGTAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
17 reads; two non-coding errors, non adjacent
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCCGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
17 reads; corresponding to c.35G>A
TGTATCGTCAAGGTACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

EGFR 15 base deletion

>1141_>EGFR9_78 chr7 55242418-55242511 1141
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
>646_>EGFR9_78 chr7 55242418-55242511 646
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
>60_>EGFR9_78 chr7 55242418-55242511 60
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTTCATGGCT
>57_>EGFR9_78 chr7 55242418-55242511 57
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGTTTTGCTGTGTGGGGGTCCATGGCT
>54_>EGFR9_78 chr7 55242418-55242511 54
GACTTTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
>51_>EGFR9_78 chr7 55242418-55242511 51
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCTTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

EGFR 15 base deletion with Smith-Waterman alignment

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAG---------------ACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

Amplicon Sequencing: treating reads as groups

Page 41: Graham Taylor - The future of DNA sequencing technology

Amplivar vs. Alignment

Amplivar                              Alignment (e.g. BWA)
Groups reads                          Uses individual reads
Designed for amplicons                Designed for randomly sheared fragments
Works with FASTA after filtering      Works with FASTQ
Matches against target list           Aligns against whole genome
Alignment is an optional late stage   Alignment is a required early stage

Page 42: Graham Taylor - The future of DNA sequencing technology

quality_filter

• Hard coded quality filters
• Output FASTA and qcore files
• FASTA files (.fna) and quality files (.csv) written to merged folder

Page 43: Graham Taylor - The future of DNA sequencing technology

fastq and fasta

FASTA

>MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC

FASTQ

@MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC
+
AAAAADAFFFFFGGGFGGFGGFHFGFHHFGAEGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
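The relationship between the two records can be sketched in Python, one of the languages Amplivar depends on. This is a minimal illustration and not Amplivar's own converter: the function name `fastq_to_fasta` is hypothetical, and the sketch assumes strict four-line FASTQ records with no quality filtering.

```python
from io import StringIO


def fastq_to_fasta(fastq_handle):
    """Yield (fasta_header, sequence) pairs from a four-line-per-record FASTQ stream.

    Hypothetical helper for illustration only; it drops the '+' line and the
    quality string and swaps the leading '@' for '>'.
    """
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a trailing blank line)
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@': " + header)
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality string, ignored
        yield ">" + header[1:], sequence


# The record shown on the slide, used as example input.
example_fastq = (
    "@MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name\n"
    "TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC\n"
    "+\n"
    "AAAAADAFFFFFGGGFGGFGGFHFGFHHFGAEGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n"
)

for fasta_header, sequence in fastq_to_fasta(StringIO(example_fastq)):
    print(fasta_header)
    print(sequence)
```

Because the quality string is discarded here, any quality filtering has to happen before this conversion, which matches the order of the steps on the quality_filter slide.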

Page 44: Graham Taylor - The future of DNA sequencing technology

group_fasta_reads

• @file_list = glob "$merged_dir/*fna";

Page 45: Graham Taylor - The future of DNA sequencing technology

Grouped reads

3    AGACAACTGTTCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATAG
1    ACCACTTTTGGAGGGAGATTTCGCTCCTGAAGAAAATTCGACAGCTTTGTGCCTGGCTAATTCT
527  AGTGTATCCATTTTCTTCTCTCTGACCTTTGGCCCCCTACATCGACCATTCTGCAAGGTTAACA
1    CTCACCCCCAGACTGGGTTTTTAGGTCTCGGTTTACAAGTTTCTTATGCTGATGCTGAAAAAAA
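Grouping amounts to counting identical read sequences. The sketch below shows the idea in Python; `group_reads` is a hypothetical name, and sorting by descending count is a choice made here for readability (the slide's listing is not sorted that way).

```python
from collections import Counter


def group_reads(sequences):
    """Collapse identical read sequences into (count, sequence) pairs.

    Illustrative only: pairs are returned largest group first, with ties
    broken alphabetically so the output is deterministic.
    """
    counts = Counter(sequences)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(count, sequence) for sequence, count in ordered]


# Toy input: three copies of one read and a single copy of another.
toy_reads = ["ACGTACGT", "ACGTACGT", "TTGACCAA", "ACGTACGT"]
for count, sequence in group_reads(toy_reads):
    print(count, sequence)
```

Working with one line per distinct sequence means that later steps (genotyping, flank matching) touch each sequence once, however many reads carry it.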

Page 46: Graham Taylor - The future of DNA sequencing technology

Usual suspects file

4 column tab separated text file with Unix line endings
• Column 1: RefSeq identifier
• Column 2: cDNA HGVS nomenclature
• Column 3: codon change HGVS nomenclature
• Column 4: Sequence to match
• Usual suspects files available for TruSeq cancer panel and for PCRbrary

RefSeq            cDNA description      codon change   sequence
BRAF_NM_004333.4  c.1798G               V600           CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1798_1799delinsAA   V600K          CTCCATCGAGATTTCTTTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1798_1799delinsAG   V600R          CTCCATCGAGATTTCCTTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1798G>A             V600K          CTCCATCGAGATTTCATTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1799T               V600           CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1799_1800delinsAA   V600E          CTCCATCGAGATTTTTCTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1799_1800delinsAT   V600D          CTCCATCGAGATTTTACTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1799T>A             V600E          CTCCATCGAGATTTCTCTGTAGCTAGACCAAA
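One way to read this file is as a lookup table: each grouped read that contains a listed sequence contributes its read count to that variant. The Python sketch below assumes exactly that and nothing more; `parse_usual_suspects` and `genotype` are hypothetical names, matching is a plain substring test on one strand, and the second read in the example is made up for illustration.

```python
def parse_usual_suspects(text):
    """Parse tab-separated lines: RefSeq, cDNA description, codon change, sequence."""
    suspects = []
    for line in text.splitlines():
        if not line.strip():
            continue
        refseq, cdna, codon, sequence = line.rstrip("\n").split("\t")
        suspects.append((refseq, cdna, codon, sequence))
    return suspects


def genotype(grouped_reads, suspects):
    """Sum read counts per variant for grouped reads containing the target sequence.

    grouped_reads is a list of (count, sequence) pairs. This is a sketch of
    the matching idea, not Amplivar's implementation.
    """
    totals = {}
    for refseq, cdna, codon, target in suspects:
        totals[(refseq, cdna, codon)] = sum(
            count for count, read in grouped_reads if target in read
        )
    return totals


# Two rows taken from the table above.
suspects_text = (
    "BRAF_NM_004333.4\tc.1799T\tV600\tCTCCATCGAGATTTCACTGTAGCTAGACCAAA\n"
    "BRAF_NM_004333.4\tc.1799T>A\tV600E\tCTCCATCGAGATTTCTCTGTAGCTAGACCAAA\n"
)

# First read is from the grouped-reads slide; the second is a made-up V600E read.
wild_type_read = "AGACAACTGTTCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATAG"
example_reads = [
    (3, wild_type_read),
    (2, wild_type_read.replace("TTTCACTG", "TTTCTCTG")),
]

for key, total in genotype(example_reads, parse_usual_suspects(suspects_text)).items():
    print(key, total)
```

Dividing each variant total by the wild-type total (or by the amplicon total) would give percentages of the kind shown in a genotype table; which denominator Amplivar uses is not stated on the slides.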

Page 47: Graham Taylor - The future of DNA sequencing technology

Genotype table

Columns:
1. BRAF 600 wt V600
2. BRAF 600 c.1799T>A V600E
3. KRAS 12 & 13 wt G12/G13
4. KRAS 12 c.34G>A G12S
5. KRAS 12 c.34G>C G12R
6. KRAS 12 c.34G>T G12C
7. KRAS 12 c.35G>A G12D
8. KRAS 12 c.35G>C G12A
9. KRAS 12 c.35G>T G12V
10. KRAS 13 c.38G>A G13D

library              1      2      3      4    5    6    7    8      9    10
DL130016FTGx120036   100.0  0.0    100.0  0.0  0.0  0.0  0.0  74.2   0.0  0.0
DL130028FTGx120036   100.0  0.0    100.0  0.0  0.0  0.0  0.0  30.6   0.0  0.0
DL130040FTGx120036   100.0  0.0    100.0  0.0  0.0  0.0  0.0  50.0   0.0  0.0
DL130052FTGx120036   100.0  0.0    100.0  0.0  0.0  0.0  0.0  130.8  0.0  0.0
DL130064FTGx120036   100.0  0.4    100.0  0.0  0.0  0.0  0.0  300.0  0.0  0.0
DL130076FTGx120036   100.0  0.0    100.0  0.0  0.0  0.0  0.0  126.5  0.0  0.0
DL130018FTGx120041   100.0  0.0    100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0
DL130030FTGx120041   100.0  0.0    100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0
DL130042FTGx120041   100.0  0.0    100.0  1.5  0.0  0.0  0.0  0.0    0.0  0.0
DL130054FTGx120041   100.0  0.0    100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0
DL130066FTGx120041   100.0  0.2    100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0
DL130078FTGx120041   100.0  0.0    100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0
DL130015FTGx120044   100.0  49.9   100.0  0.0  0.0  0.0  0.0  0.0    0.0  3.4
DL130027FTGx120044   100.0  45.9   100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0
DL130039FTGx120044   100.0  54.9   100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0
DL130051FTGx120044   100.0  32.5   100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0
DL130063FTGx120044   100.0  45.3   100.0  0.0  0.0  0.0  0.7  0.0    0.0  0.0
DL130075FTGx120044   100.0  43.2   100.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0

Page 48: Graham Taylor - The future of DNA sequencing technology

Amplicon flank file

4 column tab separated text file with Unix line endings
• Column 1: amplicon identifier with unique number
• Column 2: size (not currently calculated)
• Column 3: co-ords
• Column 4: Flanks with (.*) to capture the sequence between amplimers
• Flank files available for PCRbrary, TruSeq Cancer Panel and Olga’s TruSeq panel

ID         Size  Co-ords                    Sequence
1_MPL1_2   175   chr1:43815006-43815137     GCCGTAGGTGCGCACG(.*)TCAGCAGCAGCAGG
2_NRAS1_7  175   chr1:115256526-115256653   GCATTCCCTGTGGTTTT(.*)AGAGTACAGTGCCATG
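The fourth column is already a regular expression, so assigning a read to an amplicon and extracting the sequence between the primers is a single pattern search. The Python sketch below illustrates this under simple assumptions: `grab_by_flanks` is a hypothetical name, each read is assigned to the first amplicon whose pattern matches, and the example read is synthetic.

```python
import re


def grab_by_flanks(grouped_reads, flanks):
    """Assign grouped reads to amplicons by flank pattern.

    grouped_reads: list of (count, sequence) pairs.
    flanks: dict mapping amplicon ID to a pattern such as 'LEFT(.*)RIGHT'.
    Returns a dict mapping amplicon ID to a list of (count, captured_insert).
    Sketch only; the first matching amplicon wins.
    """
    compiled = {amplicon: re.compile(pattern) for amplicon, pattern in flanks.items()}
    hits = {amplicon: [] for amplicon in flanks}
    for count, read in grouped_reads:
        for amplicon, regex in compiled.items():
            match = regex.search(read)
            if match:
                hits[amplicon].append((count, match.group(1)))
                break
    return hits


# Flank patterns copied from the table above.
example_flanks = {
    "1_MPL1_2": "GCCGTAGGTGCGCACG(.*)TCAGCAGCAGCAGG",
    "2_NRAS1_7": "GCATTCCCTGTGGTTTT(.*)AGAGTACAGTGCCATG",
}

# Synthetic read: MPL flanks around a made-up six-base insert.
synthetic_reads = [(5, "GCCGTAGGTGCGCACG" + "AAACCC" + "TCAGCAGCAGCAGG")]
print(grab_by_flanks(synthetic_reads, example_flanks))
```

Note that `(.*)` is greedy, so if the right-hand flank occurred twice in a read the capture would run to the last occurrence.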

Page 49: Graham Taylor - The future of DNA sequencing technology

Grouping Reads

amplivar -i * -j * -k *

-i /storage/local/sandbox/working_directory
-j usual_suspects.txt
-k flanking_primers

• Merges read pairs, quality filters, converts to fasta, groups by sequence
• Counts reads corresponding to each amplicon
• Genotypes according to the usual suspects table with read counts
• Groups reads by amplicon for mutation scanning

Page 50: Graham Taylor - The future of DNA sequencing technology

AMPLIVAR

• SeqPrep: Remove adapters & Merge reads
• Filter reads by quality
• Convert fastq2fasta
• Group fasta reads
• Genotype grouped reads
• Grab reads by flanks
• Sort reads by locus

Page 51: Graham Taylor - The future of DNA sequencing technology

AMPLIVAR WRAPPER

AMPLIVAR

• SeqPrep: Remove adapters & Merge reads
• Filter reads by quality
• Convert fastq2fasta
• Group fasta reads
• Genotype grouped reads
• Grab reads by flanks
• Sort reads by locus

Wrapper

• Create symbolic links for fastqs
• Create subdirectories for each fastq file pair (R1 & R2)
• Run amplivar
• Run sort amplicons
• Run blat on grouped, sorted reads
• Convert blat psl2sam, sam2bam
• Run bamleftalign
• Inflate bam
• Run VarScan
• Run VEP

Page 52: Graham Taylor - The future of DNA sequencing technology

AmpliVar Required tools

• SeqPrep (C)
• Blat (C)
• Samtools (C)
• Bamleftalign (C++)
• VarScan (java)
• Bash
• Perl
• Python

Page 53: Graham Taylor - The future of DNA sequencing technology
Page 54: Graham Taylor - The future of DNA sequencing technology

Sorted, locus (amplicon)-based files

Each record: header (amplicon ID, size, co-ords), then the grouped read sequence followed by its read count.

>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 734
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACAAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 4
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGCTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 4
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTACCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 4
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCAGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 4
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGAGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 4
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAGCCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 4
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGGTCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 4
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCAGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 3
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCAAAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 3
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCAAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 3
>1_MPL1_2 175 chr1:43815006-43815137
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTCTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC 3

Page 55: Graham Taylor - The future of DNA sequencing technology

Amplicon (all rows): 1030_BRCA1_exon14 175 chr17:41226334-41226449

Sample              % of major allele  Read
MS0318_1164 Blood   100%               AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT
MS0323_1164 Frozen  100%               AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT
MS0313_1164 FFPE    100%               AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT
MS0313_1164 FFPE    28.40%             AAGAGGAGCTTATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT
MS0313_1164 FFPE    26.20%             AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATTTAGGTAATATTTCATCT

Most reads are error free
FFPE contaminates the evidence
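The "% of major allele" column can be read as each grouped sequence's read count expressed relative to the most abundant sequence for that amplicon. The Python sketch below assumes that definition; `percent_of_major` is a hypothetical name and the read counts are invented so that the result resembles the FFPE rows (the slide does not give the underlying counts).

```python
def percent_of_major(grouped_reads):
    """Express each group's count as a percentage of the largest group.

    grouped_reads: list of (count, sequence) pairs for one amplicon.
    Returns (sequence, percent) pairs, largest group first. Sketch only.
    """
    if not grouped_reads:
        return []
    major = max(count for count, _ in grouped_reads)
    ordered = sorted(grouped_reads, key=lambda pair: -pair[0])
    return [(sequence, round(100.0 * count / major, 2)) for count, sequence in ordered]


# Invented counts for one amplicon in one sample (labels stand in for sequences).
invented_counts = [(500, "major"), (142, "variant_A"), (131, "variant_B")]
for sequence, percent in percent_of_major(invented_counts):
    print(sequence, percent)
```

With this definition the major sequence always scores 100%, so a sample whose only row is 100% (as for blood and frozen tissue above) has no secondary sequence reported.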

Page 56: Graham Taylor - The future of DNA sequencing technology

Clustal  alignment  &  phylogeny  of  errors  

Page 57: Graham Taylor - The future of DNA sequencing technology

Merged forward and reverse reads

PandaSeq and SeqPrep can merge overlapping read pairs to make them even longer and more accurate. In an unselected 100 base read pair run enriched for hereditary cancer genes, over 20% of the reads could be merged. With longer reads and suitable experimental design this fraction could be increased.

Pairs Processed:      18,456,760
Pairs Merged:         4,029,383
Pairs With Adapters:  32,899
Pairs Discarded:      646
percent merged:       21.83
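The reported percentage follows directly from the first two counts, as this short Python check shows (variable names are chosen here for illustration).

```python
# Counts taken from the merge summary above.
pairs_processed = 18_456_760
pairs_merged = 4_029_383

# Fraction of read pairs that could be merged, as a percentage.
percent_merged = 100.0 * pairs_merged / pairs_processed
print(f"{percent_merged:.2f}")
```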

locus                     1 total  2 total
chr17 41223060 41223115   2631     2777
chr17 41223076 41223130   2452     2674
chr17 41223101 41223146   2223     2501

[Bar chart: "1 total" and "2 total" for each of the three loci above; vertical axis 0 to 3000.]

Probability of a given length read as a subset of a longer read in a normal distribution of longer reads: the “minimum substring problem”

This approach might make sense with longer reads

Average read length from 101 to 150 bases

Page 58: Graham Taylor - The future of DNA sequencing technology

Orthogonal validation without Sanger

[Figure: UCSC Genome Browser view, hg19, chr17:41,223,070-41,223,105 (scale bar 10 bases). Tracks: flanks; Your Sequence from Blat Search (YourSeq); RefSeq Genes; Simple Nucleotide Polymorphisms (dbSNP 137) Found in >= 1% of Samples, showing rs1799966; Duplications of >1000 Bases of Non-RepeatMasked Sequence; Repeating Elements by RepeatMasker.]

Displayed sequence: AGTCATCATACTCGTCGTCGACCTGAGACCCGTCTAAGA

Flanks:
CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT
TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT
AGATTCTGCAACTTT(.*)CAATGCAGAGGTTGAG

[Bar chart: Forward reads inc rs1799966; read counts (axis 0 to 500) for GCCCAGAGTCCAGCTGCTGCTCATAC and GCCCAG*G*GTCCAGCTGCTGCTCATAC.]

Page 59: Graham Taylor - The future of DNA sequencing technology

Downstream processing of FASTA files

BWA, BLAT, annotation

VEP

Page 60: Graham Taylor - The future of DNA sequencing technology

MIST  

Page 61: Graham Taylor - The future of DNA sequencing technology

A schematic of the workflow used by MiST.

Subramanian S et al. Nucl. Acids Res. 2013;41:e154

Page 62: Graham Taylor - The future of DNA sequencing technology

Identification of potentially paralogous read pairs.

Subramanian S et al. Nucl. Acids Res. 2013;41:e154

Page 63: Graham Taylor - The future of DNA sequencing technology

Motivation for the use of Geoseq in variant calling.

Subramanian S et al. Nucl. Acids Res. 2013;41:e154

Page 64: Graham Taylor - The future of DNA sequencing technology

Comparison of MiST and GATK. Each box has three sets of numbers, from left to right they are variant calls, (i) unique to MiST, (ii) common to both platforms and (iii) unique to GATK.

Filters are applied to remove calls occurring in public databases like dbSNP (17), 1000 Genomes (18) and a collection of already known private variants.

Subramanian S et al. Nucl. Acids Res. 2013;41:e154

Page 65: Graham Taylor - The future of DNA sequencing technology

DIAMUND: Direct Comparison of Genomes to Detect Mutations

Figure 1. Outline of initial steps in the Diamund algorithm, which identifies all k-mers unique to an affected proband and missing from both unaffected parents. The first step identifies k-mers, after which the proband data are filtered to remove k-mers resulting from sequencing errors. Intersecting all three sets identifies k-mers that are unique to the proband.

sequencing, where the number of true but clinically irrelevant variants will be 50 times greater.

Here, we introduce a new method, DIAMUND (direct alignment for mutation discovery), which takes a different approach to exome and whole-genome analysis, and as a result produces dramatically smaller sets of candidate mutations. Rather than aligning all samples to the reference genome, we align the sequences directly to one another. This method is designed primarily for two types of analyses: (1) self-comparisons, where diseased tissue is compared with normal tissue from the same individual, and (2) family studies, where the differences among the DNA sequences from the subjects are far fewer than the differences between any subject and the reference genome.

Our method does not require that the raw sequencing reads, usually numbering 100 million or more for a whole exome, be aligned to the GRC37 reference genome, nor does it require a complex genome assembly or an all-versus-all alignment of these large data sets. As we explain in detail below, we use a more efficient algorithm that allows us to quickly find sequences that are unique to any sample.

We have implemented and tested DIAMUND on exomes representing two types of analysis problem. First, we considered self-comparisons, in which DNA from primary cultured fibroblasts derived from diseased tissue in an affected individual was compared with DNA from nondiseased primary cultured fibroblasts from the same individual. For the analysis of tumor cells or other somatic mosaic genetic abnormalities, this direct comparison should yield a smaller set of variants than an analysis that first compares all sequences to the reference genome. Second, we looked at three parent–child trios in which a de novo mutation in the child was suspected to be causing disease. The standard algorithm would compare all three individuals to the reference genome, generating very large lists of variants, many of which are shared by the child and a parent. By comparing the child’s DNA directly to both parents, we can quickly identify all de novo mutations, without losing sensitivity and without detecting family-specific variants that add noise to the process. For each of these problems, the number of true de novo mutations is very small, obviating the need for the aggressive filters that exome and whole-genome pipelines use, which might eliminate the true variant of interest.

De novo mutations may account for a high proportion of Mendelian disorders. Yang et al. recently reported [Yang et al., 2013] on exome sequencing of 250 probands and their families, among which they identified 33 patients with autosomal dominant and nine with X-linked diseases. Of these, 83% of the autosomal dominant and 40% of the X-linked mutations occurred de novo.

In addition to generating fewer false positives, direct comparison between samples within a family, or between affected and unaffected tissue, allows for detection of mutations in regions that are entirely missing from the reference genome. It has already been shown that some human populations have large shared genomic regions, often spanning many megabases [Li et al., 2010], which are missing entirely from the human reference genome. These include novel segmental duplications [Schuster et al., 2010] as well as entirely novel sequences. If a mutation of interest happens to fall in one of these regions, then conventional methods will be guaranteed to miss it. Our direct comparison algorithm, in contrast, includes these regions and is quite capable of finding mutations within them.

An important caveat is that DIAMUND is not intended to solve the more general problem of variant detection in any sample. It is designed to take advantage of very closely related samples where direct between-sample comparisons can more effectively identify mutations present in just one or a subset of the samples.

Methods

DIAMUND begins with two or more sets of DNA sequences, or “reads,” generated by a sequencing instrument. Here, we describe the algorithm as applied to three trios consisting of an affected individual (or proband) and two unaffected parents. Specializing the algorithm to two samples, where one is normal and the other is diseased (e.g., cancerous) tissue from the same individual, is straightforward.

One way of directly comparing two or more genomes is to assemble each data set de novo, using any of several next-generation sequence assemblers [Schatz et al., 2010], and then compare the assemblies using a whole-genome alignment algorithm such as MUMmer [Delcher et al., 1999; Kurtz et al., 2004]. However, whole-genome assembly is computationally costly and can produce erroneous assemblies, which in turn might create even larger problems than aligning all reads to the reference genome. Instead, DIAMUND uses a direct approach in which we count all sequences of length k in all the reads, for some fixed value of k, and then compare these k-mers to one another. Here, we outline the 10 major steps of the algorithm; the initial steps are illustrated in Figure 1.

284 HUMAN MUTATION, Vol. 35, No. 3, 283–288, 2014

Page 66: Graham Taylor - The future of DNA sequencing technology

Filtering statistics

Table 1. Illustration of the Data Reduction at Each Step from Raw Reads to a Final Set of Mutated Loci

Data remaining at the end of step:

Filtering step                                           Disease/normal pair  Family trio BH1019  Family trio BH2041  Family trio BH2688
Number of reads from proband/diseased tissue             118,414,556          84,201,820          75,877,750          103,527,644
Number of 27-mers in proband/diseased tissue             911,738,627          795,477,167         517,272,851         1,088,610,020
Number of k-mers with count >10                          77,903,885           61,805,320          64,719,150          113,066,951
Remove vector sequence                                   77,898,848           61,800,798          64,713,995          113,062,417
Eliminate k-mers found in reference GRC37 exome          17,821,359           9,385,347           10,730,208          50,535,681
Eliminate k-mers found in parent exomes/normal tissue    10,568               65,352              20,130              2,006
Identify reads containing k-mers                         32,829 reads         148,496             46,454              4,404
Remove reads containing vector                           15,260               125,648             38,799              2,760
Number of contigs after assembly                         2,147                13,189              3,755               359
Number of contigs with >3 reads after merging contigs    279 contigs          1,437               701                 71
Identify variants covered by reads from normal tissue    55 contigs           5                   6                   2
Keep variants with >5% coverage                          42 variants          5                   6                   2
Find variants in coding regions                          14 variants          3                   3                   1
Remove synonymous SNPs                                   10 variants          2                   3                   1

Step 1: We utilize an efficient parallel algorithm, Jellyfish [Marcais and Kingsford, 2011], for the k-mer counting step. This first step converts the reads for each exome (or genome) to a set of k-mers, which should in theory be a much smaller data set: the number of k-mers in an exome is equivalent to the length of the exome, 50–60 Mbp using current exome capture kits. However, the initial set is dramatically larger, due primarily to sequencing errors, which we address below. We sort each set of k-mers to allow for efficient intersection operations in subsequent steps. Sorting N k-mers requires O(N log N) time, after which computing the intersection with another set of k-mers requires only O(N) time.

Step 2: The second step in the DIAMUND algorithm removes all k-mers from the proband (but not from the unaffected samples) that are likely to represent sequencing errors. Note that every sequencing error introduces k new k-mers. If k is sufficiently large, then virtually all of these k-mers will be unique, i.e., they will not occur in the genome or elsewhere in the reads. Combined with the fact that exome coverage is usually very deep, we can safely assume that any k-mer that occurs just once represents an error.

After empirical observations of multiple exomes, we observed that even k-mers occurring more than once are usually errors. Due to biases in sequencing technology, exome data sets may contain erroneous k-mers that occur 10 or more times, particularly for regions that contain very deep coverage (which can exceed 1000-fold for some exonic targets). For the exomes we have analyzed, average coverage is approximately 80–100×, which means that a novel, heterozygous mutation should have 40–50× coverage. Even in regions with lower coverage, novel mutations should have 20 or more reads (and k-mers) covering them. Note that in the case of mosaicism, a much lower proportion than 50% of the reads might contain the mutation; the software can be adjusted to report such cases.

Given these observations, at this stage, we discard all k-mers that occur fewer than 10 times. We tested different values before choosing 10 as the default value, and this can easily be adjusted for data sets with lower or higher coverage. In our tests, a minimum value of 10 excluded an extremely small number of true k-mers.

Step 3: After removing likely sequencing errors, some k-mers may remain due to vector contamination. We precompute all k-mers in known vectors, taken from the UniVec database (www.ncbi.nlm.nih.gov/tools/vecscreen/univec), and remove these from the exome representing the proband (or the diseased tissue, in the case of normal vs. diseased tissue comparisons).

We also observe that any k-mer that occurs in the reference genome is probably not the cause of disease. We precompute all k-mers from the targeted regions of the GRC37 genome, and remove these “normal” k-mers from the proband’s data. Note that this set can easily be expanded to include a larger set of variants known to be harmless.

Step 4: After computing all k-mers in the reads from the proband and both parents, the third step computes the intersection between proband and mother, and separately between proband and father (Fig. 1). We collect all k-mers unique to the proband but missing from the mother, and repeat this step for the father. We then intersect the two resulting files to give us a single file that contains all k-mers found in the proband but missing from both unaffected parents. These form our initial set that should contain any de novo mutations in the affected individual.

Step 5: At this point, DIAMUND usually has reduced the initial set of k-mers over 10,000-fold, leaving between 2,000 and 65,000 k-mers (Table 1). For the fifth step, we collect the reads containing these k-mers. This requires us to align the k-mers back to the original reads, because the Jellyfish k-mer counter does not keep track of the source of each k-mer. DIAMUND can use either of two efficient alignment systems for this step: MUMmer [Delcher et al., 1999; Kurtz et al., 2004], a suffix tree-based algorithm that rapidly finds exact matches; or Kraken [Wood and Salzberg, 2013], a fast sequence classifier that we modified to provide the output needed by our system. Kraken is the default choice because it is significantly faster. In our experiments, the number of reads identified in this step ranged from 4,400 to 148,000 (Table 1).

Step 6: Despite every effort to screen reads for contamination, some small fragments of vector sequences often still remain in the reads. If these vectors happen to contaminate only the proband (or affected) data set, they will appear to be novel mutations. We eliminate these by comparing the reads identified in the previous step to the UniVec database (www.ncbi.nlm.nih.gov/tools/vecscreen/univec) using the vecscreen program, and removing any reads with vector sequence. Note that running vecscreen on the original data would be extremely demanding computationally, but because the number of reads at this step has been reduced approximately 1,000-fold, it is relatively fast.
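The counting, thresholding and subtraction in Steps 1 to 4 can be illustrated on a toy scale in Python. This is a sketch of the idea only and not the DIAMUND implementation: it uses in-memory sets instead of Jellyfish output and sorted-file intersection, it ignores reverse complements and vector screening, and the names `count_kmers` and `proband_specific_kmers` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads (Step 1, toy version)."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def proband_specific_kmers(proband_reads, parent_read_sets, k,
                           min_count=10, reference_kmers=frozenset()):
    """Return k-mers seen at least min_count times in the proband and never
    in the reference set or in any parent.

    Only the proband is thresholded (Step 2); parents keep all their k-mers
    so that nothing present in a parent can look proband-specific (Step 4).
    """
    candidates = {
        kmer for kmer, n in count_kmers(proband_reads, k).items() if n >= min_count
    }
    candidates -= set(reference_kmers)
    for parent_reads in parent_read_sets:
        candidates -= set(count_kmers(parent_reads, k))
    return candidates


# Toy data: the proband read differs from the parental read by one substitution.
parent_read = "ACGTACGGTCAATGCCTGAA"
proband_read = "ACGTACGGTCGATGCCTGAA"
novel = proband_specific_kmers([proband_read] * 10, [[parent_read], [parent_read]], k=5)
print(sorted(novel))
```

In this toy case the single substitution leaves exactly k = 5 proband-specific k-mers, which is the "every sequencing error introduces k new k-mers" observation from Step 2 applied to a real variant instead of an error; the count threshold is what separates the two.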


Page 67: Graham Taylor - The future of DNA sequencing technology

Panagopoulos et al. PLoS ONE 2014, Volume 9 (6), e99439

The “Grep” Command But Not FusionMap, FusionFinder or ChimeraScan Captures the CIC-DUX4 Fusion Gene from Whole Transcriptome Sequencing Data on a Small Round Cell Tumor with t(4;19)(q35;q13)

Three fusion-finder programs, FusionMap, FusionFinder, and ChimeraScan, generated a plethora of fusion transcripts but not the biologically important and cancer-specific fusion gene, the CIC-DUX4 chimeric transcript. It was necessary to use the “grep” command-line utility to sift out the latter from the many data produced by the automated algorithms. Cytogenetic, FISH, and clinico-pathologic tumor features hinted at the presence of the said fusion, but it was eventually found only after the manual “grep” function had been used.

Simple is good
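The same kind of direct search is easy to sketch in Python. The example below scans the sequence lines of a FASTQ stream for a query on either strand; it is an illustration of the idea and not the procedure from the paper, `grep_reads` is a hypothetical name, and the query and reads are made up (the slide does not give the junction sequence that was searched for).

```python
def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA string."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def grep_reads(fastq_lines, query):
    """Return read sequences containing the query or its reverse complement.

    fastq_lines: iterable of lines from a four-line-per-record FASTQ file.
    Only the second line of each record (the sequence) is searched.
    """
    query = query.upper()
    query_rc = reverse_complement(query)
    hits = []
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 == 1:
            sequence = line.strip().upper()
            if query in sequence or query_rc in sequence:
                hits.append(sequence)
    return hits


# Made-up reads and query, for illustration only.
toy_fastq = [
    "@r1", "ACGTTTGACC", "+", "IIIIIIIIII",
    "@r2", "GGGGGGGGGG", "+", "IIIIIIIIII",
    "@r3", "GGTCAAACGT", "+", "IIIIIIIIII",
]
print(grep_reads(toy_fastq, "TTTGAC"))
```

An exact-match search like this only finds what it is told to look for, which is why the cytogenetic and FISH hints mentioned above mattered: they supplied the candidate fusion to search for.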

Page 68: Graham Taylor - The future of DNA sequencing technology


2. For each maximal repetition Y, identify the minimum unit U such that U is not a repetition and Y is a concatenation of multiple occurrences of U and a prefix of U. For example, when Y = (CAG)6CA, U = CAG.

3. An approximate repetition is a substring such that its alignment with repetition (U)m is decomposed into series of exact matches of length |U| or more, and neighboring series must have only one mismatch, one insertion, or one deletion between them in the alignment, where |U| indicates the length of U. We calculate an approximate repetition by extending a maximal (exact) repetition in both directions in a greedy manner. For example, given

CGCCCGCAGCGCAT(CAG)6CATCAGGGA,

we can extend repetition (CAG)6CA to the underlined substring,

CGCCCGCAGC-GCAT(CAG)6CATCAGGGA,

where bold letters represent mismatches and “-” indicates a deletion. In this way, we retrieve an approximate STR that is not necessarily an exact repeat of the minimum unit U, but may contain mismatches and indels.

4. A read may contain multiple overlapping STRs with the same unit. If two overlap, eliminate the shorter one. If both are of the same length, select one arbitrarily.

The algorithm is able to process ten million reads of length 100 bases in ~1700 s on a Xeon X5690 with a clock rate of 3.47-GHz (Supplementary Fig. S1). As the computational time is proportional to the number of reads, ~47 hours is required to process 1 billion 100-bp reads, confirming the practicality of the method for processing real human resequencing data.
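Step 2 and the runtime estimate can both be checked with a few lines of Python. The sketch below takes the minimum unit to be the prefix given by the smallest period of the repetition, which is a simplification of the definition quoted above, and `minimum_unit` is a hypothetical name; it is not the authors' code.

```python
def minimum_unit(repetition):
    """Return the shortest prefix U such that the string is U repeated,
    optionally followed by a prefix of U (the smallest period of the string)."""
    length = len(repetition)
    for unit_length in range(1, length + 1):
        unit = repetition[:unit_length]
        if all(repetition[i] == unit[i % unit_length] for i in range(length)):
            return unit
    return ""  # only reached for an empty input string


# The example from the text: Y = (CAG)6CA.
example_repetition = "CAG" * 6 + "CA"
print(minimum_unit(example_repetition))

# Runtime scaling quoted in the text: ~1700 s per ten million reads,
# assumed proportional to the number of reads.
seconds_per_ten_million = 1700
hours_for_one_billion = seconds_per_ten_million * (1_000_000_000 / 10_000_000) / 3600
print(round(hours_for_one_billion, 1))
```

The scaling line simply multiplies the quoted 1700 s by 100 and converts to hours, which gives a little over 47 hours, in line with the figure stated in the text.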

Fig. 1. Sensing and locating short tandem repeats (STRs) in short reads. (A) An original short read. (B) An approximate STR (AGAGGC)n (n=6) in the short read. The central four copies of AGAGGC are an exact STR with no mutations, while the flanking copies contain the mutations shown in bold letters. If one of the regions (black) surrounding the STR aligns in a unique position, the STR can be located in the genome. (C) A read occupied by an approximate STR. (D) Sensing STRs from frequency distributions of (AGAGCC)n in NA12877 (father of the HapMap CEU trio), NA12878 (mother), and NA18507 (an African male). The x-axis is the lengths of STR occurrences detected in a read, and the y-axis is the frequency of reads containing STR occurrences of the length indicated on the x-axis. Note that 100-bp-long STR occurrences are frequent in NA12877, while no STR occurrences of length >70 bp are observed in samples NA12878 and NA18507. (E) When a read is filled with an STR (red), we attempt to anchor the other end read (blue) to a unique position unambiguously. (F, G) An STR is located easily if its location can be sandwiched using information on paired-end reads. The length of an STR of length < 100 bp is easily estimated (F), while determining the length of a much longer STR is nontrivial (G). We need to use third-generation sequencers, such as PacBio RS, with the capability of reading DNA fragments having a length of thousands of bases.


Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing

Koichiro Doi, Taku Monjo, Pham H. Hoang, Jun Yoshimura, Hideaki Yurino, Jun Mitsui, Hiroyuki Ishiura, Yuji Takahashi, Yaeko Ichikawa, Jun Goto, Shoji Tsuji and Shinichi Morishita

University of Tokyo

Page 69: Graham Taylor - The future of DNA sequencing technology

Standards  

Page 70: Graham Taylor - The future of DNA sequencing technology

Human  Variome  Project  

Page 71: Graham Taylor - The future of DNA sequencing technology

Prototype  NGS  database  

Page 72: Graham Taylor - The future of DNA sequencing technology

Report  

Page 73: Graham Taylor - The future of DNA sequencing technology

Sharing Experience with TruSight One

• In partnership with Illumina, RCPA and the HGSA, Kim Flintoff (Wellington Regional Genetics Laboratory) is leading an evaluation of exon sequencing using Illumina’s TruSight One panel. Two Coriell family trios will be sequenced by New Zealand Genomics Limited and the data will be shared on an HVPA database.

• The VCF file will be available on the HVPA LOVD database and performance stats will also be made available.

Page 74: Graham Taylor - The future of DNA sequencing technology

Next Steps

• Robust standards for genomic medicine
• Databases and data content
  – Access to identified and de-identified data (consent and confidentiality)
  – Database accreditation process in prep with RCPA
  – Defining the performance of various aligners, variant callers and annotation programs
  – Clinical grade Variant Call Format (VCF)
  – Metafile covering data trail: what was tested, what was not tested

 

Page 75: Graham Taylor - The future of DNA sequencing technology

Data quality classes

Differentiate between three classes of data:

• Clinically Reported data: the class of data that the HVP Australian Node was originally designed to collect and share, that is, data generated in a NATA accredited Australian diagnostic laboratory and able to be included in a clinical report.
• Unreported Clinical Quality data: data generated in a NATA accredited diagnostic laboratory but not capable of being included in a clinical report. This class would consist primarily of next-generation sequencing (NGS) data.
• Unaccredited data: data generated by an Australian laboratory that has not been NATA accredited.

A new filtering option would be made available to allow users to view only data of a certain class.

Page 76: Graham Taylor - The future of DNA sequencing technology

Standards for Accreditation of DNA Sequence Variation Databases

Quality Use of Pathology Program (QUPP): a national project for the Development of Standards for Accreditation of DNA Sequence Variation Databases has been jointly initiated by the Royal College of Pathologists of Australasia (RCPA) and the Human Variome Project (HVP).

Background

• There is a rapidly increasing volume, spectrum, and complexity of genetic tests emerging within diagnostic pathology laboratories. In particular, high throughput sequencing methods such as targeted panel, exome (WES), and whole genome sequencing (WGS) are producing an increasing quantity of genetic data requiring analysis and interpretation, forming a substantial proportion of the workload.

• Currently, there is a plethora of online mutation databases to refer to; however, there is a distinct lack of databases that meet the stringent accuracy and reproducibility that the clinical diagnostic environment demands. Additionally, the current databases are "fractured", with varied access to and sharing of the data within, and variable quality due to errors or inaccurate data posting, all of which is a clear risk to the quality of patient care. With more widespread, secure sharing of variants and associated phenotypes, the value of cumulative variant information will accelerate the delivery of accurate, actionable, and efficient clinical reports.

• There are currently no standards or equivalent mechanisms for accreditation of databases to ensure the accuracy and quality of data uploaded into any central repository to meet the needs of the clinical diagnostics environment.

Page 77: Graham Taylor - The future of DNA sequencing technology

Pathogenicity

1. "Deleterious- and Disease-Allele Prevalence in Healthy Individuals: Insights from Current Predictions, Mutation Databases, and Population-Scale Resequencing" Yali Xue, Yuan Chen, Qasim Ayub, Ni Huang, Edward V. Ball, Matthew Mort, Andrew D. Phillips, Katy Shaw, Peter D. Stenson, David N. Cooper, Chris Tyler-Smith, and the 1000 Genomes Project Consortium. Am J Hum Genet 91, 1022–1032, 2012

2. "Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset" Tjaart A. P. de Beer, Roman A. Laskowski, Sarah L. Parks, Botond Sipos, Nick Goldman, Janet M. Thornton. PLOS Comp Biol 9, 1–15, 2013

3. "Large Numbers of Genetic Variants Considered to be Pathogenic are Common in Asymptomatic Individuals" Christopher A. Cassa, Mark Y. Tong, and Daniel M. Jordan. HuMu 34(9), 1216–1220, 2013

4. "Integrated sequence analysis pipeline provides one-stop solution for identifying disease-causing mutations" Hao Hu, Thomas F Wienker, Luciana Musante, Vera MM Kalscheuer, Peter N Robinson, H Hilger Ropers. HuMu, under review

 

Page 78: Graham Taylor - The future of DNA sequencing technology

Table 1. List of selected CNV detection methods.

Duan J, Zhang J-G, Deng H-W, Wang Y-P (2013) Comparative Studies of Copy Number Variation Detection Methods for Next-Generation Sequencing Technologies. PLoS ONE 8(3): e59128. doi:10.1371/journal.pone.0059128 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059128

Page 79: Graham Taylor - The future of DNA sequencing technology

Summary  

• Current sequencing technology has plenty of room for improvement w.r.t. read length and accuracy

• Many informatics challenges relate to managing poor quality data or technological limitations and will go away with longer, more accurate reads

• Annotation, data sharing and integrating variant data with clinical and phenotypic data are the high value healthcare deliverables

Page 80: Graham Taylor - The future of DNA sequencing technology

Acknowledgments  

• Genomic Medicine & Centre for Translational Pathology, University of Melbourne: Arthur Hsu, Olga Kondrashova, Sebastian Lunke, Clare Love, Renate Marquis-Nicholson, Kym Pham, Paul Waring

• Human Variome Project: Tim Smith, Alan Lo, Dick Cotton

•  Melbourne  Genomics  Health  Alliance:  Clara  Gaff,  Kathryn  North,  Doug  Hilton,  Stephen  Smith  

Page 81: Graham Taylor - The future of DNA sequencing technology

Targeted  Tumour  Sequencing:    

© 2014 Illumina, Inc. All rights reserved.

BWA Enrichment, Version 1.0.0.1

Enrichment Sequencing Report

Page 2

Sample Information

Sample ID: TL140380

Sample Name: TL140380

Total PF Reads: 77,538,750

Percent Q30: 78.6%

Adapters Trimmed: Yes

Median Read Length: 151 bp

Enrichment Summary

Target Manifest: TruSight One v1.0
Total Length of Targeted Reference: 11,946,514 bp
Padding Size: 150 bp

Note: All enrichment values are calculated without padding (sequence immediately upstream and downstream) unless otherwise stated.

Read Level Enrichment

Total Aligned Reads: 75,690,682
Percent Aligned Reads: 97.6%
Targeted Aligned Reads: 49,355,753
Read Enrichment: 65.2%
Padded Target Aligned Reads: 57,751,970
Padded Read Enrichment: 76.3%

Base Level Enrichment

Total Aligned Bases: 10,449,595,970
Targeted Aligned Bases: 4,976,953,404
Base Enrichment: 47.6%
Padded Target Aligned Bases: 7,636,013,223
Padded Base Enrichment: 73.1%

Coverage Summary

Mean Region Coverage Depth: 416.6X
Uniformity of Coverage (Pct > 0.2*mean): 96.1%
Target Coverage at 1X: 99.4%
Target Coverage at 10X: 99.1%
Target Coverage at 20X: 98.9%
Target Coverage at 50X: 97.9%
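The headline figures in the TL140380 report can be cross-checked from its raw counts. The sketch below assumes, without confirmation from the report itself, that each percentage is a plain ratio of the counts shown and that mean region coverage is targeted aligned bases divided by the target length. The variable names and the `percent` helper are illustrative.

```python
def percent(part, whole):
    """Express part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Counts copied from the TL140380 enrichment report.
total_pf_reads = 77_538_750
total_aligned_reads = 75_690_682
targeted_aligned_reads = 49_355_753
padded_aligned_reads = 57_751_970
total_aligned_bases = 10_449_595_970
targeted_aligned_bases = 4_976_953_404
padded_aligned_bases = 7_636_013_223
target_length_bp = 11_946_514

# Assumed definitions: simple ratios of the counts above.
metrics = {
    "Percent Aligned Reads": percent(total_aligned_reads, total_pf_reads),
    "Read Enrichment": percent(targeted_aligned_reads, total_aligned_reads),
    "Padded Read Enrichment": percent(padded_aligned_reads, total_aligned_reads),
    "Base Enrichment": percent(targeted_aligned_bases, total_aligned_bases),
    "Padded Base Enrichment": percent(padded_aligned_bases, total_aligned_bases),
    "Mean Region Coverage Depth": round(targeted_aligned_bases / target_length_bp, 1),
}

for name, value in metrics.items():
    print(f"{name}: {value}")
```

Under these assumptions the recomputed values agree with the reported 97.6%, 65.2%, 76.3%, 47.6%, 73.1% and 416.6X.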

Constitutional   Frozen

Small Variants Summary

SNVs Insertions Deletions

Total Passing 8,113 192 230

Percent Found in dbSNP 98.8% 87.5% 74.3%

Het/Hom Ratio 1.7 1.8 2.5

Ts/Tv Ratio 3.1 - -
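For reference, a transition/transversion ratio such as the one reported above is conventionally the number of transitions (A<->G, C<->T) divided by the number of transversions (all other substitutions). A minimal sketch, using made-up SNVs rather than data from this report, and a hypothetical `ts_tv_ratio` helper:

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs):
    """Ts/Tv for a list of (ref, alt) single-base substitutions."""
    ts = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ts
    if tv == 0:
        raise ValueError("no transversions: ratio is undefined")
    return ts / tv


# Hypothetical SNVs: six transitions and two transversions.
snvs = [
    ("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
    ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T"),
]
print(ts_tv_ratio(snvs))
```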

Variants by Sequence Context

SNVs Insertions Deletions

Number in Genes 8,206 187 225

Number in Exons 6,927 80 107

Number in Coding Regions 6,587 50 64

Number in UTR Regions 340 30 43

Number in Splice Site Regions 742 54 69

Genes include exons, introns and UTR regions. Exons include coding and UTR regions. UTR regions include 5' and 3' UTR regions. Splice site regions include regions annotated as splice acceptor, splice donor, splice site or splice region.

Variants by Consequence

SNVs Insertions Deletions

Frameshift - 20 23

Non-synonymous 2,886 30 40

Synonymous 3,676 - -

Stop Gained 19 0 0

Stop Lost 6 0 0

Variation consequences are calculated following the guidelines at http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences

Coverage Summary

Mean Region Coverage Depth: 2555.9X
Uniformity of Coverage (Pct > 0.2*mean): 94.6%
Target Coverage at 1X: 99.5%
Target Coverage at 10X: 99.3%
Target Coverage at 20X: 99.3%
Target Coverage at 50X: 99.1%

Small Variants Summary

SNVs Insertions Deletions

Total Passing 8,244 184 255

Percent Found in dbSNP 98.8% 88.6% 70.6%

Het/Hom Ratio 1.7 1.5 2.9

Ts/Tv Ratio 3.0 - -

Variants by Sequence Context

SNVs Insertions Deletions

Number in Genes 8,336 182 250

Number in Exons 7,033 79 101

Number in Coding Regions 6,685 49 59

Number in UTR Regions 348 30 42

Number in Splice Site Regions 762 51 89

Genes include exons, introns and UTR regions. Exons include coding and UTR regions. UTR regions include 5' and 3' UTR regions. Splice site regions include regions annotated as splice acceptor, splice donor, splice site or splice region.

Variants by Consequence

SNVs Insertions Deletions

Frameshift - 22 24

Non-synonymous 2,952 27 34

Synonymous 3,705 - -

Stop Gained 22 0 0

Stop Lost 6 0 0

Variation consequences are calculated following the guidelines at http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences

Sample Information

Sample ID: WES001FR1

Sample Name: WES001FR1

Total PF Reads: 454,879,338

Percent Q30: 81.9%

Adapters Trimmed: Yes

Median Read Length: 151 bp

Enrichment Summary

Target Manifest: TruSight One v1.0
Total Length of Targeted Reference: 11,946,514 bp
Padding Size: 150 bp

Note: All enrichment values are calculated without padding (sequence immediately upstream and downstream) unless otherwise stated.

Read Level Enrichment

Total Aligned Reads: 445,404,466
Percent Aligned Reads: 97.9%
Targeted Aligned Reads: 305,571,144
Read Enrichment: 68.6%
Padded Target Aligned Reads: 341,517,122
Padded Read Enrichment: 76.7%

Base Level Enrichment

Total Aligned Bases: 60,040,931,299
Targeted Aligned Bases: 30,533,612,135
Base Enrichment: 50.9%
Padded Target Aligned Bases: 44,910,193,400
Padded Base Enrichment: 74.8%

Page 82: Graham Taylor - The future of DNA sequencing technology
Page 83: Graham Taylor - The future of DNA sequencing technology

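Called Variants

[Figure: Venn diagram of called variants in the FFPE, Frozen and Blood samples. Region counts as extracted, with their assignment to regions not recoverable: 16,711; 2,095; 154; 979; 3,641; 9,368; 2,455]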

Page 84: Graham Taylor - The future of DNA sequencing technology

Gene  List  

techniques allow for the rapid detection of EGFR mutations with high sensitivity and specificity. However, confirmation of mutations via direct sequencing is still necessary.27,76,77 Though not of any current clinical use, an assay that provides a rapid assessment of EGFR mutation status in as little as 30 min using a ‘smart amplification process’ has been described. These may one day provide greatly improved turnaround times for this analysis.78

Formalin-fixed and paraffin-embedded tissue is perfectly suitable for fluorescence in situ hybridization (FISH) and DNA-based tests, but tissue preservation is critical for a successful test. Decalcified and ethanol-fixed tissue, as well as tissues containing abundant necrosis, should be avoided.

The ability to detect multiple driver mutations in lung adenocarcinoma has revolutionized the medical management of this disease and multiplexed testing for all common driver mutations will provide physicians with a more precise guide for therapy.9

Recently, Kris et al79 identified 10 driver mutations in tumor samples from 1000 lung adenocarcinoma patients enrolled in the National Cancer Institute Lung Cancer Mutation Consortium. The mutations, involving KRAS, EGFR, ERBB2 (HER2), BRAF, PIK3CA, AKT1, MAP2K1, and NRAS, were screened using standard multiplexed assays and FISH. Driver mutations were detected in 60% of tumors. The incidences of mutations were as follows: KRAS 25%, EGFR 23%, ALK rearrangements 6%, BRAF 3%, PIK3CA 3%, MET amplifications 2%, ERBB2 1%, MAP2K1 0.4%, NRAS 0.2%, and AKT1 0% (Figure 3).12,67–71 It is noteworthy that 95% of molecular lesions were mutually exclusive.79

EGFR mutations are responsible for the constitutive activation of the tyrosine kinase receptor. These mutations are also most frequently associated with either sensitivity or resistance to EGFR TKIs (Figure 2).6,80–84 The response-associated mutations are linked with response rates of >70% in patients treated with either erlotinib or gefitinib.85,86 However, up to 25% of patients with TKI resistance-associated mutations will also respond to the therapy.67 Pao et al7 analyzed EGFR mutation of exons 18–24 in tumors from 10 gefitinib-responsive and from 7 erlotinib-responsive patients. The results demonstrated that EGFR mutations were present in 7 of 10 (70%) gefitinib-responsive and in 5 of 7 (71%) erlotinib-responsive tumors.

EGFR genotype was more useful than clinical characteristics for selection of appropriate patients for consideration of first-line therapy with an EGFR TKI.85 EGFR mutations are generally associated with sensitivity to TKI therapy.71,87 Both retrospective and prospective studies have demonstrated that lung adenocarcinoma patients carrying such an EGFR mutation and who were treated with TKIs had significantly higher response rates and longer progression-free survival than patients without an EGFR mutation,5–7,25,29,71,83,85,87,88 with some patients experiencing rapid, complete, or partial responses that were persistent.55 Jackman et al85 studied 223 chemotherapy-naïve patients with advanced lung cancer of non-small cell type, among which 86% were adenocarcinomas. Sensitizing EGFR mutations were found in 84 carcinomas, 89% of which were adenocarcinomas. The mutations were associated with a 67% response rate, with a time to progression of 11.8 months, and overall survival of 23.9 months.85 Exon 19 deletions were associated with a relatively longer median time to progression and overall survival compared with L858R (exon 21) mutations. Wild-type EGFR was found in 139 patients (62%), and this finding was associated with poor outcomes (response rate, 3%; time to progression, 3.2 months), irrespective of KRAS status.

EGFRvIII Mutation

EGFR variant III (EGFRvIII), a mutation resulting from an in-frame deletion of exons 2–7 of the coding sequence (amino acids 6–273), has been associated with a subset of squamous cell lung cancers.89–91 A number of functional differences between EGFRvIII and EGFR have been characterized.90,91 EGFRvIII has been identified in an array of human solid tumors, including glioblastoma, breast cancer, ovarian cancer, prostate cancer, and lung cancer. Although EGFRvIII fails to bind EGF, its intracellular tyrosine

Figure 3 Frequency of major driver mutations in signaling molecules in lung adenocarcinomas. About 64% of all adenocarcinoma cases harbor somatic driver mutations. According to the National Cancer Institute Lung Cancer Mutation Consortium data,79 ~23% of lung adenocarcinomas harbor EGFR mutations. The EGFR mutation status of the cancer is associated with its responsiveness or resistance to EGFR TKI therapy. KRAS mutations are more frequently found in adenocarcinomas (25%), which are mutually exclusive with EGFR mutations. Mutations in KRAS have been proposed as one of the mechanisms of primary resistance to gefitinib and erlotinib therapy. A subset of adenocarcinoma cases harbors a transforming fusion gene, EML4–ALK (6%), which mainly involves adenocarcinoma from non-smokers with wild-type EGFR and KRAS mutations. The mutation frequency of BRAF is 3%, PIK3CA 3%, MET amplifications 2%, ERBB2 (Her2/neu) 1%, MAP2K1 0.4%, and NRAS 0.2%. Each of the molecular alterations has a role in the signal pathways, activating important cell functions, including cell proliferation and survival. Approximately 36.4% of lung adenocarcinomas do not harbor currently detectable mutations.

Source: L Cheng et al, Molecular pathology of lung cancer. Modern Pathology (2012) 25, 347–369 (page 350)

Page 85: Graham Taylor - The future of DNA sequencing technology

Filtering Variants

All variants            None   Qual   Not in Blood
Blood                   9828   8551   NA
Frozen                  9920   8736   126
FFPE                    9709   8163   199

Variants in Gene List   None   Qual   Not in Blood
Blood                   27     18     NA
Frozen                  27     23     2 (EGFR)
FFPE                    25     19     3 (EGFR, ROS)
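The successive filters in the table (no filter, quality filter, then removal of variants also seen in blood) can be sketched as below. This is an illustration under stated assumptions, not the pipeline used for the slide: the `Variant` layout, the quality threshold of 30 and all example variants are hypothetical, and the sketch assumes the tumour calls are subtracted against the unfiltered blood calls.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    # Hypothetical minimal variant record.
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


def key(v):
    """Identity of a variant, ignoring its quality score."""
    return (v.chrom, v.pos, v.ref, v.alt)


def quality_filter(variants, min_qual):
    """Keep variants whose quality is at least `min_qual`."""
    return [v for v in variants if v.qual >= min_qual]


def not_in_blood(tumour, blood, min_qual):
    """Quality-filtered tumour variants that are absent from the blood sample."""
    blood_keys = {key(v) for v in blood}  # unfiltered blood calls (assumption)
    return [v for v in quality_filter(tumour, min_qual) if key(v) not in blood_keys]


# Made-up example calls.
blood = [Variant("chr1", 1000, "A", "G", 300.0)]
tumour = [
    Variant("chr1", 1000, "A", "G", 250.0),  # also in blood: constitutional
    Variant("chr1", 2000, "C", "T", 12.0),   # fails the quality threshold
    Variant("chr2", 3000, "G", "A", 180.0),  # tumour only
]

MIN_QUAL = 30.0  # hypothetical threshold
print(len(tumour), len(quality_filter(tumour, MIN_QUAL)),
      len(not_in_blood(tumour, blood, MIN_QUAL)))
```

Subtracting against the unfiltered blood calls is the more conservative choice: a constitutional variant that was called with low quality in blood is still removed from the tumour list.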

Page 86: Graham Taylor - The future of DNA sequencing technology

EGFR  p.L858R  

Page 87: Graham Taylor - The future of DNA sequencing technology

EGFR  p.T790M  

Page 88: Graham Taylor - The future of DNA sequencing technology

Confirmation by PCR

[Figure: two panels. Panel "EGFR normalised" (axis 0.0 to 250.0), assays on EGFR NM_005228.3: T790 WT; c.2350T>C, p.S784P; c.2351C>T, p.S784F; c.2354C>T, p.T785I; c.2356G>A, p.V786M; c.2368A>G, p.T790A; c.2369C>T, p.T790M; 828 & 861 wt; c.2572C>A, p.L858M; c.2573_2574delinsGT; c.2573T>A, p.L858Q; c.2573T>G, p.L858R; c.2579A>T, p.K860I; c.2582T>A, p.L861Q; c.2582T>G, p.L861R. Panel "KRAS normalised" (axis 0.0 to 2.0), assays on KRAS NM_033360.2: c.34G>A, p.G12S; c.34G>C, p.G12R; c.34G>T, p.G12C; c.35G>A, p.G12D; c.35G>C, p.G12A; c.35G>T, p.G12V; c.37G>A, p.G13S; c.37G>C, p.G13R; c.37G>T, p.G13C; c.38G>A, p.G13D; c.38G>C, p.G13A; c.38G>T, p.G13V.]

Page 89: Graham Taylor - The future of DNA sequencing technology

Gene  list  summary  

gene    locus                           covered by capture panel?  Observations          Notes
AKT1    chr14:105,235,761-105,262,116   Y                          wt                    activating point
ALK     chr2:29,415,410-30,146,821      Y                          wt                    rearrangements and secondary resistance mutations
BRAF    chr7:140,433,813-140,624,564    Y                          wt                    activating point
EGFR    chr7:55,248,979-55,259,567      Y                          L858R; T790M          activating points, indels, resistance point
HER2    chr17:37,844,393-37,884,915     Y                          wt                    amplification, activating point
KRAS    chr12:25,386,768-25,403,863     Y                          ?                     activating point
MAP2K1  chr15:66,679,211-66,783,882     Y                          changed allele ratio
MET     chr7:116,312,459-116,409,963    Y                          changed allele ratio  mutation and amplification
NRAS    chr1:115,247,085-115,259,515    Y                          wt                    activating point
PIK3CA  chr3:178,866,311-178,952,497    Y                          wt                    activating point
ROS1    chr6:117,609,530-117,747,018    Y                          wt                    rearrangements
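Reading the EGFR locus in the table as the interval chr7:55,248,979-55,259,567, a quick check that a variant position falls inside a listed region might look like this. The slide does not state the genome build; the sketch assumes GRCh37/hg19, where EGFR p.L858R (c.2573T>G) is at chr7:55,259,515 and p.T790M (c.2369C>T) is at chr7:55,249,071. The helper names `parse_locus` and `covers` are illustrative.

```python
import re

# Matches strings such as "chr7:55,248,979-55,259,567".
LOCUS_RE = re.compile(r"^(chr[0-9XYM]+):([\d,]+)-([\d,]+)$")


def parse_locus(text):
    """Split a 'chrom:start-end' string into (chrom, start, end) with integer coordinates."""
    match = LOCUS_RE.match(text)
    if match is None:
        raise ValueError(f"not a locus string: {text!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def covers(locus, chrom, pos):
    """True if position `pos` on `chrom` lies within the (inclusive) locus interval."""
    locus_chrom, start, end = parse_locus(locus)
    return locus_chrom == chrom and start <= pos <= end


EGFR_LOCUS = "chr7:55,248,979-55,259,567"

# Positions assume GRCh37/hg19 coordinates (not stated on the slide).
print(covers(EGFR_LOCUS, "chr7", 55_259_515))  # EGFR p.L858R
print(covers(EGFR_LOCUS, "chr7", 55_249_071))  # EGFR p.T790M
```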