Bioc strucvariant seattle_11_09

Using R and Bioconductor to Find Structural Variants in Short Read Sequencing Data
Sean Davis, MD, PhD
National Cancer Institute, National Institutes of Health
Bethesda, MD

Why use R and Bioconductor?

[Figure labels: phenotype; Gene Copy Number; Sequence Variation; Chromatin Structure and Function; Gene Expression; Transcriptional Regulation; DNA Methylation; Patient and Population Characteristics]

Why structural variation?

Overview

What is structural variation and why might it be important in biology?
What is paired-end sequencing?
How can paired-end sequencing be used for finding structural variants?
How can the existing tools within R and Bioconductor be leveraged to find structural variants in the genome?

What is structural variation?

Insertions
Deletions
Translocations
  Intrachromosomal
  Interchromosomal
Inversions
[copy number variation]

Importance of Structural Variation

Can alter gene expression both directly and indirectly
  Deletion, insertion, or translocation that disrupts or removes transcript(s)
  Translocation that alters regulatory environment
  Can place two distant functional elements in proximity to each other (gene fusion events are an example)
Possibly change chromatin structure

[Figure panels: Normal Karyotype, Tumor Karyotype]

Redon et al., Nature 2006

A Genome View of Copy Number

[Figure: paired-end sequencing diagram, labeled Insert, Read, Read]

Medvedev et al., Nature 2009

Paired-end reads and SV

For the paired-end data, determine the distribution of insert sizes
Find reads that show an unusually high or low insert size (outside mean +/- 3 sd, for example)
Cluster these abnormal related pairs
Where there is significant clustering, there may be evidence for a structural variant
The type of the structural variant can be determined using the relationships between clusters
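The steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the R/Bioconductor code used in the talk. The function names, the (position, insert_size) tuple layout, and the max_gap and min_pairs thresholds are all assumptions made for the example. The last step, inferring the variant type from the relationships between clusters, is not attempted.

```python
from statistics import mean, pstdev


def insert_size_limits(sizes, k=3.0):
    """Return (low, high) cut-offs at mean -/+ k standard deviations."""
    m = mean(sizes)
    sd = pstdev(sizes)
    return m - k * sd, m + k * sd


def abnormal_pairs(pairs, k=3.0):
    """Keep pairs whose insert size falls outside the cut-offs.

    Each pair is a (leftmost_position, insert_size) tuple; this layout is
    hypothetical and chosen only to keep the example small.
    """
    low, high = insert_size_limits([size for _, size in pairs], k)
    return [p for p in pairs if p[1] < low or p[1] > high]


def cluster_pairs(pairs, max_gap=500, min_pairs=3):
    """Group pairs whose positions lie within max_gap of the previous pair.

    Clusters with fewer than min_pairs members are dropped as likely noise.
    """
    clusters = []
    current = []
    for pair in sorted(pairs):
        if current and pair[0] - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append(pair)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_pairs]


if __name__ == "__main__":
    # Toy data: 1000 ordinary pairs plus 10 pairs with a very large insert size.
    ordinary = [(i * 1000, 180 + i % 41) for i in range(1000)]
    unusual = [(400_000 + 20 * i, 5000) for i in range(10)]
    for cluster in cluster_pairs(abnormal_pairs(ordinary + unusual)):
        print(cluster[0][0], cluster[-1][0], len(cluster))
```

Because the mean and standard deviation are computed over all pairs, abnormal pairs inflate the cut-offs themselves. With many outliers, a more robust estimate such as the median and median absolute deviation may be a better choice.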

Experimental Setup

Choose a 1 Mb region on chromosome 17 (from 40 Mb to 41 Mb) as the reference sequence
Make a new sequence that has several structural variants in it and use that as the basis for our sequencing
Simulate 100k paired-end reads using MAQ simulate
  Simulated with mean insert size 200, sd 20
  35 bp reads
  Allow errors according to error model from real data
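As a rough picture of what the simulation step produces, the sketch below draws read pairs from a reference string with normally distributed insert sizes. It is not MAQ's simulator and makes several assumptions: the insert size is taken to be the full fragment length, no sequencing errors are introduced, and the function names and return layout are hypothetical.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_pairs(reference, n_pairs, read_len=35,
                   mean_insert=200, sd_insert=20, seed=0):
    """Sample error-free read pairs from `reference`.

    Returns a list of (start, insert_size, read1, read2) tuples, where
    read1 is the first read_len bases of the fragment and read2 is the
    reverse complement of its last read_len bases.
    """
    if len(reference) < read_len:
        raise ValueError("reference is shorter than one read")
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        insert = round(rng.gauss(mean_insert, sd_insert))
        # Redraw fragments that cannot hold a read or do not fit the reference.
        if insert < read_len or insert > len(reference):
            continue
        start = rng.randrange(len(reference) - insert + 1)
        fragment = reference[start:start + insert]
        read1 = fragment[:read_len]
        read2 = reverse_complement(fragment[-read_len:])
        pairs.append((start, insert, read1, read2))
    return pairs


if __name__ == "__main__":
    rng = random.Random(1)
    toy_reference = "".join(rng.choice("ACGT") for _ in range(10_000))
    for start, insert, read1, read2 in simulate_pairs(toy_reference, 3):
        print(start, insert, read1, read2)
```

In the talk the reads are drawn from the modified sequence and then aligned to the unmodified reference, which is what makes the structural variants show up as abnormal pairs.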

Align paired-end data to the human reference genome using BWA
Convert output of BWA to sorted and indexed BAM
Use R and Bioconductor tools to try to rediscover the structural variants in the simulation

The Sample Sequence

The segment between 40.4 Mb and 40.5 Mb is translocated to sit between 40.04 Mb and 40.05 Mb
The segment between 40.1 Mb and 40.11 Mb is tandemly duplicated five times (a copy number variation)

Bioconductor and R Tools Used

Rsamtools
IRanges
  RangedData object is used to store paired-end reads and subset abnormally mapped pairs
  Calculate coverage on abnormally mapped pairs
R graphics for making plots

Get data into R

Rsamtools -> RangedData
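The talk does this step in R with Rsamtools. As a hedged stand-in, the sketch below reads the same per-read fields from one line of plain-text SAM in Python. The column order and flag bit values follow the SAM format; the function names and the dictionary layout are assumptions for illustration and do not mirror the Rsamtools API.

```python
# Flag bits defined by the SAM format.
PAIRED = 0x1          # read is part of a pair
PROPER_PAIR = 0x2     # pair aligned as expected by the aligner
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate is unmapped
MATE_REVERSE = 0x20   # the mate is on the reverse strand


def parse_sam_line(line):
    """Extract the fields relevant to paired-end analysis from one SAM record.

    Header lines (starting with '@') are not handled here.
    """
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),
        "rnext": fields[6],
        "pnext": int(fields[7]),
        "tlen": int(fields[8]),
    }


def has_flag(flag, bit):
    """Return True when `bit` is set in the SAM flag value."""
    return (flag & bit) != 0


if __name__ == "__main__":
    # A made-up record with flag 41 = 1 (paired) + 8 (mate unmapped) + 32.
    line = "\t".join(["read1", "41", "chr17", "40040100", "37", "35M",
                      "=", "40040100", "0", "A" * 35, "I" * 35])
    record = parse_sam_line(line)
    print(record["qname"], has_flag(record["flag"], MATE_UNMAPPED))
```

Keeping only position, mate position, template length, and flag per read is enough for the later filtering of abnormal pairs.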

Using the SAM Flag Field

41     = 1 0 1 0 0 1
8      = 0 0 1 0 0 0
41 & 8 = 0 0 1 0 0 0 = 8

A nonzero result means the bit for 8 (mate unmapped) is set in flag 41.

Insert Size Distribution

Future Work

Build infrastructure for dealing more easily with paired-end reads
Refine workflow for finding and clustering abnormal related pairs
Define or implement algorithms for taking raw clustering results and converting them into biologically meaningful descriptions of the structural variants
Lots, lots, lots more

A couple of final thoughts

Public data
  1000 Genomes data
  SRA (NCBI short read sequencing archive)
  NCBI GEO: not just for microarrays
Interactive visualization
  UCSC genome browser (rtracklayer)
  Integrated Genome Browser (IGB, available from Affymetrix and at SourceForge)
  Integrative Genomics Viewer (IGV, available from the Broad Institute)
