
Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach


CB Hong*, KJ Kim

4-5 February 2015

Contents

1 TCGA Benchmark 4 Data Set
  1.1 Downloading TCGA Data with GeneTorrent
  1.2 Preparing the Sample Data Set
  1.3 Checking the Extracted Regions
  1.4 Checking the Practice Data
  1.5 Summary

2 Somatic Mutation Prediction
  2.1 Running SomaticSniper and Applying the Pre-filter (164 sec)
  2.2 Running VarScan2 and Applying the Pre-filter (10 min)
  2.3 Running MuTect and Applying the Pre-filter (18 min)
  2.4 Summary

3 Finding Full Consensus / Partial Consensus sSNVs
  3.1 Extracting Bi-allelic SNPs Only
  3.2 Finding the Full Consensus / Partial Consensus
  3.3 Counting Full Consensus / Partial Consensus Calls
  3.4 Summary

4 Applying an Additional Filter
  4.1 Calling Normal and Tumor Variants with UnifiedGenotyper (8 min)
  4.2 Filtering SNVs - Full Consensus (optional)
  4.3 Filtering SNVs - Partial Consensus (SomaticSniper/MuTect)
  4.4 Counting Full Consensus / Partial Consensus Calls After the GATK Filter
  4.5 Summary

5 Validation
  5.1 Preparing the COSMIC and CCLE Data
  5.2 Running the Validation - Consensus / Partial Consensus
  5.3 Summary

6 Other Somatic Mutation Callers - Strelka, Virmid
  6.1 Strelka (1 min 38 sec)
  6.2 Virmid (33 min)

7 Linux for Genome Research
  7.1 The Practice Linux Server
  7.2 Connecting to the Practice Linux Server - Windows Users
  7.3 Connecting to the Practice Linux Server - Mac or Linux Users
  7.4 Finding Out Linux System Information
  7.5 The Linux File System
  7.6 Adding a Hard Disk in Linux
  7.7 File-Related Commands
  7.8 Linux Network Information
  7.9 Understanding Compression in Linux
  7.10 Installing Software on Linux
    7.10.1 Installing Software with APT
    7.10.2 Installing Software by Compiling Source Code

* KT GenomeCloud


1 TCGA Benchmark 4 Data Set

In this hands-on session we use the TCGA mutation calling benchmark 4 datasets to learn how to find somatic mutations. The genome sequencing benchmark dataset was created by artificially mixing a fixed proportion (5%-95%) of normal sample into the tumor sample. Of these mixtures, we use n40t60 (60% tumor mixed with 40% normal) together with its matched normal sample. The data are available for download in BAM format from the TCGA Benchmark web page.

1.1 Downloading TCGA Data with GeneTorrent

• Steps: install the download software, download the key and UUID files, then download the sample.

• Example: download the public key for the TCGA Benchmark Data Set.

• https://cghub.ucsc.edu/datasets/benchmark_download.html

$ cd
$ wget https://cghub.ucsc.edu/software/downloads/cghub_public.key

• Download the UUID (universally unique identifier) file, which holds the download information for a specific file.

• TCGA Benchmark cell line: HCC1143 tumor 50x

$ curl "https://cghub.ucsc.edu/cghub/metadata/analysisAttributes?analysis_id=ad3d4757-f358-40a3-9d92-742463a95e88" \
    -o uuid.txt

$ more uuid.txt
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<center_name>UCSC</center_name>
<study>TCGA_MUT_BENCHMARK_4</study>
<files>
<file>
<filename>G15511.HCC1143.1.bam</filename>
<filesize>255795959440</filesize>
</file>

• Download the data with gtdownload.

$ cd
$ gtdownload -c cghub_public.key -vv -d uuid.txt

1.2 Preparing the Sample Data Set

• Extract part of a BAM file, then sort and index it. Extracting by chromosome (-b: write the output in BAM format):

$ cd
$ samtools view -b in.bam 1 > chr1.bam
$ samtools sort chr1.bam chr1_sorted
$ samtools index chr1_sorted.bam

• Extracting specific regions (using a BED file):

$ cd
$ cat chr17.bed
17:5967-6207
17:11197-11389
17:11806-12018
17:13897-14017
17:22307-22427
17:30843-30963
17:31151-31279
17:63618-63738
17:65398-65638
17:69410-69530
17:96838-97108
17:131511-131661
17:169155-169395
17:170984-171254
17:177205-177355
17:260100-260308
17:262897-263257
17:263317-263947
$ cat chr17.bed | xargs samtools view -b in.bam \
    > exome.bam
$ samtools sort exome.bam exome_sorted
$ samtools index exome_sorted.bam
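The chr17.bed file above holds samtools-style region strings rather than standard three-column BED records. If the targets are only available as a tab-separated BED file, a conversion along the following lines may help. This is a minimal sketch: the helper name bed_to_regions and the input coordinates are made up for illustration, and it assumes BED's 0-based, half-open coordinates need shifting to the 1-based, inclusive form that samtools regions use.

```shell
#!/bin/sh
# Hypothetical helper: convert 3-column BED records (0-based, half-open)
# into samtools-style region strings (1-based, inclusive).
bed_to_regions() {
    awk -F'\t' '!/^(#|track|browser)/ && NF >= 3 {
        printf "%s:%d-%d\n", $1, $2 + 1, $3
    }' "$1"
}

# Made-up example input; the coordinates are illustrative only.
tmp_bed=$(mktemp)
printf '17\t5966\t6207\n17\t11196\t11389\n' > "$tmp_bed"

regions=$(bed_to_regions "$tmp_bed")
echo "$regions"
rm -f "$tmp_bed"
```

The converted list could then be piped to xargs samtools view in the same way as chr17.bed.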

1.3 Checking the Extracted Regions

• Write per-read information in BED format. Adding the file as a custom track in the UCSC Genome Browser is a simple way to inspect the aligned reads.

$ cd
$ bamToBed -i exome_sorted.bam > cov_1.bed

• Write the coverage of the BAM file to a BED file. The read-depth information can be used as the input for drawing a histogram.

$ cd
$ samtools view -b exome_sorted.bam | \
    genomeCoverageBed -ibam stdin > cov_2.bed
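As a rough illustration of how the histogram written by genomeCoverageBed can be summarized, the sketch below computes a mean depth per chromosome. It assumes the default histogram layout (chromosome, depth, number of bases at that depth, chromosome size, fraction); the helper name mean_depth and the input numbers are made up.

```shell
#!/bin/sh
# Mean depth per chromosome from genomeCoverageBed histogram output
# (columns: chrom, depth, bases at that depth, chrom size, fraction).
mean_depth() {
    awk -F'\t' '$1 != "genome" { sum[$1] += $2 * $3; size[$1] = $4 }
        END { for (c in sum) printf "%s\t%.2f\n", c, sum[c] / size[c] }' "$1" | sort
}

# Made-up histogram for a 100-bp "chromosome".
tmp_cov=$(mktemp)
printf '17\t0\t50\t100\t0.5\n17\t10\t30\t100\t0.3\n17\t20\t20\t100\t0.2\n' > "$tmp_cov"

depth_table=$(mean_depth "$tmp_cov")
echo "$depth_table"
rm -f "$tmp_cov"
```

On real data the same function would be pointed at cov_2.bed.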

1.4 Checking the Practice Data

• List of samples, programs, and reference data

$ cd /somatic_bench
$ pwd
/somatic_bench
$ ls -al
total 176
drwxr-xr-x  7 root root   4096 Jan 21 15:25 .
drwxr-xr-x 25 root root   4096 Jan 20 08:53 ..
drwxr-xr-x  9 root root   4096 Jan 21 08:15 app
drwxr-xr-x  2 root root   4096 Jan 21 14:38 bam
drwxr-xr-x  2 root root   4096 Jan 19 11:43 reference
drwxr-xr-x  2 root root   4096 Jan 21 15:24 script
drwxr-xr-x  2 root root 151552 Jan 21 12:59 tmp
$ more /somatic_bench/script/somatic_call_bench.sh
input_bam1="/somatic_bench/bam/hcc1143.ccle.n40t60.sorted.bam"
input_bam2="/somatic_bench/bam/hcc1143.ccle.b.sorted.bam"
gatk_b37="/somatic_bench/reference/human_g1k_v37_decoy.fasta"
temp_dir="/somatic_bench/tmp/"
$ cd
$ ln -s /somatic_bench/bam/hcc1143.ccle.n40t60.sorted.bam tumor.bam
$ ln -s /somatic_bench/bam/hcc1143.ccle.b.sorted.bam normal.bam

1.5 Summary

• Programs used: wget, curl, gtdownload, samtools, bedtools (bamToBed, genomeCoverageBed)

• Outputs: a .bam containing only the regions of interest, and a .bed showing the coverage of that .bam


2 Somatic Mutation Prediction

Using SomaticSniper, VarScan2, and MuTect, we find somatic mutations in the sample data set (tumor and matched normal BAM).

• Analysis commands: https://gist.github.com/hongiiv/06611f189f4c8158edb0
• SAMtools: v0.1.19
• GATK: v2.8.1
• MuTect: v1.1.4
• SomaticSniper: v1.0.4
• Strelka: v1.0.14
• Virmid: v1.1.1

2.1 Running SomaticSniper and Applying the Pre-filter (164 sec)

SomaticSniper was developed in 2011 by Li Ding's group at Washington University, which also created VarScan2. It uses Bayesian probability with posterior filtering. Its main strength is high computational efficiency.

• -J: joint genotyping mode with default prior probability of a somatic mutation (0.01)

• -n, -t: normal/tumor sample id (for VCF header)

• -F: output format (classic, vcf, bed)

• -f: path to the ref.fasta file

$ cd
$ bam-somaticsniper \
    -J \
    -F vcf \
    -n HCC1143_Normal \
    -t HCC1143_Tumor \
    -f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    tumor.bam normal.bam \
    HCC1143_somaticsniper.vcf

• (Filter options) Reads with a mapping quality of 0 were filtered prior to somatic mutation identification. Predictions with a 'somatic score' of 40 or greater were considered for the subsequent downstream validation and analysis steps.

• GATK's SelectVariants can be used to extract only the variants of interest.

• The filter uses the SSC (somatic score) and MQ (mapping quality) values in the FORMAT field of the VCF file.

$ cd
$ ln -s /somatic_bench/app/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar ./
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                      Priority
------------------------------------------------------------
  0            /usr/lib/jvm/java-7-oracle/jre/bin/java   2
  1            /usr/lib/jvm/java-6-oracle/jre/bin/java   1
* 2            /usr/lib/jvm/java-7-oracle/jre/bin/java   2

Press enter to keep the current choice [*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-7-oracle/jre/bin/java
$ java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_somaticsniper.vcf \
    -o HCC1143_somaticsniper_filter.vcf \
    -sn HCC1143_Tumor -sn HCC1143_Normal \
    -select 'vc.getGenotype("HCC1143_Tumor").getExtendedAttribute("SSC") >= 40 && (vc.getGenotype("HCC1143_Tumor").getExtendedAttribute("MQ") > 0 || vc.getGenotype("HCC1143_Normal").getExtendedAttribute("MQ") > 0)'
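To make the somatic-score part of the SelectVariants expression above more concrete, here is a rough awk equivalent of that one condition. It is only a sketch under stated assumptions: the function name filter_by_ssc is hypothetical, the toy VCF uses a simplified FORMAT string (GT:MQ:SSC) that is made up for the example, and the mapping-quality condition is not reproduced.

```shell
#!/bin/sh
# Keep VCF records whose tumor-sample SSC value is at least a threshold.
# usage: filter_by_ssc in.vcf tumor_sample_name min_ssc
filter_by_ssc() {
    awk -F'\t' -v tumor="$2" -v min="$3" '
        /^##/     { print; next }
        /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i == tumor) col = i; print; next }
        col {
            # Locate SSC by its position in the FORMAT column (field 9).
            n = split($9, keys, ":"); split($col, vals, ":")
            for (k = 1; k <= n; k++)
                if (keys[k] == "SSC" && vals[k] + 0 >= min) { print; break }
        }' "$1"
}

# Made-up two-record VCF; values are illustrative only.
tmp_vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$tmp_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHCC1143_Normal\tHCC1143_Tumor\n' >> "$tmp_vcf"
printf '17\t100\t.\tA\tG\t.\t.\t.\tGT:MQ:SSC\t0/0:60:.\t0/1:60:55\n' >> "$tmp_vcf"
printf '17\t200\t.\tC\tT\t.\t.\t.\tGT:MQ:SSC\t0/0:60:.\t0/1:60:12\n' >> "$tmp_vcf"

kept_records=$(filter_by_ssc "$tmp_vcf" HCC1143_Tumor 40 | grep -vc '^#')
echo "records with SSC >= 40: $kept_records"
rm -f "$tmp_vcf"
```

Looking the tumor column up by name in the #CHROM line avoids relying on the order in which the caller wrote the two samples.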

• Compare the number of mutations before and after filtering.

$ cd
$ grep -v "#" HCC1143_somaticsniper.vcf | wc -l
583
$ grep -v "#" HCC1143_somaticsniper_filter.vcf | wc -l
161
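The counting commands above drop every line that contains "#" anywhere, which works as long as no data line carries that character. A slightly stricter variant, sketched below with a made-up VCF and a hypothetical helper name count_records, anchors the pattern to the start of the line so only header lines are excluded.

```shell
#!/bin/sh
# Count VCF data records: every line that does not START with '#'.
count_records() {
    grep -vc '^#' "$1"
}

# Made-up VCF in which one data line carries a '#' inside its INFO column.
tmp_vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp_vcf"
printf '17\t100\t.\tA\tG\t.\t.\tNOTE=a#b\n17\t200\t.\tC\tT\t.\t.\tDP=9\n' >> "$tmp_vcf"

loose_count=$(grep -vc '#' "$tmp_vcf")       # drops any line containing '#'
anchored_count=$(count_records "$tmp_vcf")   # drops header lines only
echo "loose: $loose_count, anchored: $anchored_count"
rm -f "$tmp_vcf"
```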

2.2 Running VarScan2 and Applying the Pre-filter (10 min)

VarScan2 was developed by Li Ding's group at Washington University in 2012, one year after SomaticSniper. Unlike the other tools, it uses Fisher's exact test together with filtering and FDR correction. Its main strength is sensitive detection of high-quality sSNVs. Also unlike the other tools, it takes pileup or mpileup files as input rather than .bam files.

• Use samtools mpileup to generate pileup/mpileup files for the normal and tumor samples.

• During the mpileup step, filter with the options -q 1 (skip alignments with mapQ smaller than INT) and -B (disable BAQ computation).

• When VarScan is given input in mpileup format (footnote 1), pass the '--mpileup 1' option.

$ cd
$ samtools mpileup \
    -f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -q 1 -B normal.bam > HCC1143_n.pileup

$ samtools mpileup \
    -f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -q 1 -B tumor.bam > HCC1143_t.pileup

$ ln -s /somatic_bench/app/VarScan/VarScan.v2.3.3.jar ./
$ java -jar VarScan.v2.3.3.jar \
    somatic HCC1143_n.pileup HCC1143_t.pileup \
    HCC1143_varscan \
    --output-vcf 1

14617150 positions in tumor
14616970 positions shared in normal
13721478 had sufficient coverage for comparison
13700958 were called Reference
0 were mixed SNP-indel calls and filtered
18427 were called Germline
1562 were called LOH
450 were called Somatic
81 were called Unknown
0 were called Variant

1 Existing documentation is written around samtools pileup, but pileup was dropped in a samtools update and replaced by mpileup. mpileup can still pile up a single sample. VarScan also supports an mpileup file that contains both the normal and the tumor sample.

• As shown below, VarScan2 writes its INDEL and SNP results to separate VCF files (HCC1143_varscan.indel.vcf, HCC1143_varscan.snp.vcf).

drwxr-xr-x 2 root root    4096 Jan 30 09:52 ./
drwxr-xr-x 5 root root    8192 Jan 30 09:35 ../
-rw-r--r-- 1 root root  402354 Jan 30 09:47 HCC1143_varscan.indel.vcf
-rw-r--r-- 1 root root 2691462 Jan 30 09:47 HCC1143_varscan.snp.vcf

• Of the VarScan2 results, HCC1143_varscan.snp.vcf is filtered with processSomatic and somaticFilter.

• processSomatic: separates somatic mutations into high-confidence (footnote 2) and low-confidence sets.

• somaticFilter: applies filter options of your own choosing, such as --min-coverage, --p-value, and --indel-file.

$ cd
$ java -jar VarScan.v2.3.3.jar processSomatic -help
USAGE: java -jar VarScan.jar process [status-file] OPTIONS

        status-file - The VarScan output file for SNPs or Indels
        OPTIONS
        --min-tumor-freq - Minimum variant allele frequency in tumor [0.10]
        --max-normal-freq - Maximum variant allele frequency in normal [0.05]
        --p-value - P-value for high-confidence calling [0.07]

$ java -jar VarScan.v2.3.3.jar processSomatic HCC1143_varscan.snp.vcf
Reading input from HCC1143_varscan.snp.vcf
Opening output files:
17914 VarScan calls processed
382 were Somatic (102 high confidence)
16048 were Germline (15431 high confidence)
1451 were LOH (1447 high confidence)

• For each of Germline, LOH, and Somatic, processSomatic writes a result file with all calls and a separate file (.hc) with the high-confidence calls.

$ ls
-rw-r--r-- 1 2413169 Jan 30 09:52 HCC1143_varscan.snp.vcf.Germline
-rw-r--r-- 1 2320566 Jan 30 09:52 HCC1143_varscan.snp.vcf.Germline.hc
-rw-r--r-- 1  216574 Jan 30 09:52 HCC1143_varscan.snp.vcf.LOH
-rw-r--r-- 1  215997 Jan 30 09:52 HCC1143_varscan.snp.vcf.LOH.hc
-rw-r--r-- 1   59990 Jan 30 09:52 HCC1143_varscan.snp.vcf.Somatic
-rw-r--r-- 1   17055 Jan 30 09:52 HCC1143_varscan.snp.vcf.Somatic.hc

• In VarScan2's VCF output, multiple ALT alleles are written in a form such as 'G/T', which causes errors in later analysis steps. They are therefore converted to the standard 'G,T' form.

2 High confidence: a minimum variant allele frequency of 0.1 in the tumor and a maximum variant allele frequency of 0.05 in the normal.

$ cd
$ perl -pe 's/\tA\//\tA,/' HCC1143_varscan.snp.vcf.Somatic.hc | \
    perl -pe 's/\tT\//\tT,/' | \
    perl -pe 's/\tG\//\tG,/' | \
    perl -pe 's/\tC\//\tC,/' > HCC1143_varscan_filter.vcf
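The four chained substitutions above rewrite a slash that follows a tab and a single base. An alternative sketch that restricts the change to the ALT column (field 5) and leaves every other column untouched is shown below; the function name fix_alt and the sample record are made up for illustration.

```shell
#!/bin/sh
# Replace "/" with "," in the ALT column (field 5) of VCF data lines only.
fix_alt() {
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ { gsub("/", ",", $5) } { print }' "$1"
}

# Made-up record with a VarScan-style ALT value and a "/" elsewhere in INFO.
tmp_vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp_vcf"
printf '17\t100\t.\tA\tG/T\t.\tPASS\tDP=10;X=1/2\n' >> "$tmp_vcf"

fixed=$(fix_alt "$tmp_vcf")
fixed_alt=$(printf '%s\n' "$fixed" | awk -F'\t' '!/^#/ { print $5 }')
fixed_info=$(printf '%s\n' "$fixed" | awk -F'\t' '!/^#/ { print $8 }')
echo "ALT: $fixed_alt  INFO: $fixed_info"
rm -f "$tmp_vcf"
```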

• Number of mutations after filtering

$ cd
$ grep -v "#" HCC1143_varscan_filter.vcf | wc -l
102

2.3 Running MuTect and Applying the Pre-filter (18 min)

MuTect is a tool developed at the Broad Institute that uses Bayesian probability with pre- and post-filtering. In particular, it detects sSNVs sensitively at low allelic fractions.

• MuTect runs only on Java 1.6, so check the current Java version and, if necessary, switch versions with update-alternatives.

$ cd
$ ln -s /somatic_bench/app/mutect/muTect-1.1.4.jar ./
$ samtools index normal.bam
$ samtools index tumor.bam
$ cp /somatic_bench/reference/ccle.gatk.bed ./
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                      Priority
------------------------------------------------------------
  0            /usr/lib/jvm/java-7-oracle/jre/bin/java   2
  1            /usr/lib/jvm/java-6-oracle/jre/bin/java   1
* 2            /usr/lib/jvm/java-7-oracle/jre/bin/java   2

Press enter to keep the current choice [*], or type selection number: 1
update-alternatives: using /usr/lib/jvm/java-6-oracle/jre/bin/java
$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
$ java -jar muTect-1.1.4.jar --analysis_type MuTect \
    --reference_sequence /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --cosmic /somatic_bench/reference/b37_cosmic_v54_120711.vcf \
    --dbsnp /somatic_bench/reference/dbsnp_132_b37.leftAligned.vcf \
    --input_file:normal normal.bam \
    --input_file:tumor tumor.bam \
    --out HCC1143_mutect.out \
    --vcf HCC1143_mutect.vcf \
    --coverage_file HCC1143.mutect.cov.wig.txt \
    --normal_sample_name HCC1143_Normal \
    --tumor_sample_name HCC1143_Tumor \
    -L ccle.gatk.bed

• (Filter options) Predictions not labeled as 'REJECT' were accepted as confident somatic mutation predictions and carried into the subsequent downstream validation and analysis steps.

• The GATK used for filtering requires Java 1.7, so switch the Java version with update-alternatives.

• Use GATK's SelectVariants to pick out only the variants whose VCF FILTER field is PASS (that is, excluding REJECT).

$ cd
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                      Priority
------------------------------------------------------------
  0            /usr/lib/jvm/java-7-oracle/jre/bin/java   2
  1            /usr/lib/jvm/java-6-oracle/jre/bin/java   1
* 2            /usr/lib/jvm/java-7-oracle/jre/bin/java   2

Press enter to keep the current choice [*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-7-oracle/jre/bin/java
$ java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
$ java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_mutect.vcf \
    -o HCC1143_mutect_filter.vcf \
    -sn HCC1143_Tumor -sn HCC1143_Normal \
    -select 'vc.isNotFiltered()'

• The same selection (FILTER field PASS, REJECT excluded) can also be made with the --excludeFiltered option of SelectVariants.

$ cd
$ java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_mutect.vcf \
    -o HCC1143_mutect_filter.vcf \
    -sn HCC1143_Tumor -sn HCC1143_Normal \
    --excludeFiltered

• Number of mutations after filtering

$ cd
$ grep -v "#" HCC1143_mutect_filter.vcf | wc -l
109

2.4 Summary

• Programs used: VarScan2, SomaticSniper, MuTect, GATK

• Outputs: the filtered somatic mutations from each tool (161, 102, 112)

3 Finding Full Consensus / Partial Consensus sSNVs

We look for the full consensus calls of the three SNV detection tools: SomaticSniper, VarScan2, and MuTect. First, multi-allelic sites and indels are removed.

3.1 Extracting Bi-allelic SNPs Only

• From the pre-filtered results, remove multi-allelic sites and extract SNPs only.

• With GATK's SelectVariants, set -selectType to SNP (choices: INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION) and restrict -restrictAllelesTo to BIALLELIC (choices: MULTIALLELIC or BIALLELIC).

$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_mutect_filter.vcf \
    -o HCC1143_mutect_1.vcf \
    -selectType SNP \
    -restrictAllelesTo BIALLELIC

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_somaticsniper_filter.vcf \
    -o HCC1143_somaticsniper_1.vcf \
    -selectType SNP \
    -restrictAllelesTo BIALLELIC

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_varscan_filter.vcf \
    -o HCC1143_varscan_1.vcf \
    -selectType SNP \
    -restrictAllelesTo BIALLELIC

3.2 Finding the Full Consensus / Partial Consensus

• Compute each partial consensus (SomaticSniper/MuTect, MuTect/VarScan2, VarScan2/SomaticSniper) and the full consensus across all three somatic callers.

$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_somaticsniper_1.vcf \
    --concordance HCC1143_mutect_1.vcf \
    -o HCC1143_SM.vcf

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_mutect_1.vcf \
    --concordance HCC1143_varscan_1.vcf \
    -o HCC1143_MV.vcf

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_varscan_1.vcf \
    --concordance HCC1143_somaticsniper_1.vcf \
    -o HCC1143_VS.vcf

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM.vcf \
    --concordance HCC1143_varscan_1.vcf \
    -o HCC1143_SMV.vcf

3.3 Counting Full Consensus / Partial Consensus Calls

• Count the full consensus and partial consensus calls.

$ cd
$ grep -v "#" HCC1143_SM.vcf | wc -l
45
$ grep -v "#" HCC1143_MV.vcf | wc -l
38
$ grep -v "#" HCC1143_VS.vcf | wc -l
42
$ grep -v "#" HCC1143_SMV.vcf | wc -l
32
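The idea behind the partial and full consensus sets can also be illustrated without GATK, as plain set intersections over CHROM:POS:REF:ALT keys. The sketch below is not a substitute for SelectVariants --concordance, whose matching rules may differ; the three call sets, the file names, and the helper vcf_keys are all made up for the example.

```shell
#!/bin/sh
# Print a sorted, unique CHROM:POS:REF:ALT key for every record in a VCF.
vcf_keys() {
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | sort -u
}

workdir=$(mktemp -d)
# Made-up call sets standing in for three callers (s, m, v).
printf '17\t100\t.\tA\tG\n17\t200\t.\tC\tT\n17\t300\t.\tG\tA\n' > "$workdir/s.vcf"
printf '17\t100\t.\tA\tG\n17\t200\t.\tC\tT\n' > "$workdir/m.vcf"
printf '17\t100\t.\tA\tG\n17\t300\t.\tG\tA\n' > "$workdir/v.vcf"

for c in s m v; do
    vcf_keys "$workdir/$c.vcf" > "$workdir/$c.keys"
done

# comm -12 prints only the lines common to two sorted inputs.
comm -12 "$workdir/s.keys" "$workdir/m.keys" > "$workdir/sm.keys"     # partial consensus
comm -12 "$workdir/sm.keys" "$workdir/v.keys" > "$workdir/smv.keys"   # full consensus

sm_count=$(wc -l < "$workdir/sm.keys" | tr -d ' ')
smv_count=$(wc -l < "$workdir/smv.keys" | tr -d ' ')
echo "S/M partial consensus: $sm_count, full consensus: $smv_count"
rm -rf "$workdir"
```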

3.4 Summary

• Programs used: GATK

• Outputs: consensus / partial consensus data

4 Applying an Additional Filter

Specificity can be increased by using the GATK UnifiedGenotyper.

4.1 Calling Normal and Tumor Variants with UnifiedGenotyper (8 min)

• Use GATK UnifiedGenotyper to call SNPs separately for the normal and tumor samples.

$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -o HCC1143_gatk.tumor.vcf \
    -I tumor.bam \
    --genotype_likelihoods_model SNP \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -L ccle.gatk.bed

$ java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -o HCC1143_gatk.normal.vcf \
    -I normal.bam \
    --genotype_likelihoods_model SNP \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -L ccle.gatk.bed

4.2 Filtering SNVs - Full Consensus (optional)

• Using the normal and tumor variants generated by GATK UnifiedGenotyper, filter for SNVs predicted in the tumor but not in the germline.

$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SMV.vcf \
    --discordance HCC1143_gatk.normal.vcf \
    -o HCC1143_SMV_discordance_normal.vcf

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SMV_discordance_normal.vcf \
    --concordance HCC1143_gatk.tumor.vcf \
    -o HCC1143_final_filter_concordance.vcf
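The two-step logic above (drop what also appears in the normal calls, then keep what also appears in the tumor calls) amounts to a set difference followed by a set intersection. A minimal sketch on made-up variant keys follows; the file names are hypothetical, and GATK's actual --discordance and --concordance matching may be stricter than simple key equality.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Made-up CHROM:POS:REF:ALT keys, one per line.
printf '17:100:A:G\n17:200:C:T\n17:300:G:A\n' | sort > "$workdir/consensus.keys"
printf '17:200:C:T\n'                         | sort > "$workdir/normal.keys"
printf '17:100:A:G\n17:200:C:T\n'             | sort > "$workdir/tumor.keys"

# Step 1: keep consensus calls that are NOT among the normal-sample calls.
comm -23 "$workdir/consensus.keys" "$workdir/normal.keys" > "$workdir/not_in_normal.keys"

# Step 2: of those, keep calls that ARE among the tumor-sample calls.
comm -12 "$workdir/not_in_normal.keys" "$workdir/tumor.keys" > "$workdir/final.keys"

final_calls=$(cat "$workdir/final.keys")
echo "$final_calls"
rm -rf "$workdir"
```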

4.3 Filtering SNVs - partial consensus (SomaticSniper/MuTect)

$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM.vcf \
    --discordance HCC1143_gatk.normal.vcf \
    -o HCC1143_SM_discordance_normal.vcf

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM_discordance_normal.vcf \
    --concordance HCC1143_gatk.tumor.vcf \
    -o HCC1143_SM_final_filter_concordance.vcf

4.4 Counting Full Consensus / Partial Consensus Calls After the GATK Filter

• Count the consensus and partial consensus calls that remain after the GATK filter.

$ cd
$ grep -v "#" HCC1143_final_filter_concordance.vcf | wc -l
32
$ grep -v "#" HCC1143_SM_final_filter_concordance.vcf | wc -l
45

4.5 Summary

• Programs used: GATK

• Outputs: consensus / partial consensus data with the GATK filter applied

5 Validation

Using the list of variants reported for the HCC1143 sample in COSMIC and CCLE, we check how well our calls agree with it. The validation.list file is either the copy stored on the server or a downloaded copy (https://gist.github.com/hongiiv/42194181ce6402d8b629).

5.1 Preparing the COSMIC and CCLE Data

• Copy the list of variants for the HCC1143 sample from COSMIC and CCLE (103 in total).

$ cd
$ cp /somatic_bench/reference/validation.list ./
$ cat validation.list | wc -l
103

5.2 Running the Validation - Consensus / Partial Consensus

• Check how many of the final filtered consensus and partial consensus (SomaticSniper/MuTect) calls match the validation list.

$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_final_filter_concordance.vcf \
    -o all.val.filter.vcf \
    -L validation.list

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM_final_filter_concordance.vcf \
    -o sm.val.filter.vcf \
    -L validation.list

$ grep -v "#" all.val.filter.vcf | wc -l
6
$ grep -v "#" sm.val.filter.vcf | wc -l
9

• In addition, check how many of the consensus variants from before the GATK filter match the validation list.

$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SMV.vcf \
    -o all.val.vcf \
    -L validation.list

$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM.vcf \
    -o sm.val.vcf \
    -L validation.list

$ grep -v "#" all.val.vcf | wc -l
6
$ grep -v "#" sm.val.vcf | wc -l
9

• consensus: before GATK filter (32/6) - after GATK filter (32/6)

• partial consensus-SM: before GATK filter (45/9) - after GATK filter (45/9)
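Reading each pair above as (number of calls / number of calls found in the validation list), the matched fraction can be worked out with a one-line calculation. The sketch below assumes that reading of the pairs; the helper name pct is made up.

```shell
#!/bin/sh
# Percentage of calls that were found in the validation list.
pct() {
    awk -v hit="$1" -v total="$2" 'BEGIN { printf "%.2f", 100 * hit / total }'
}

consensus_pct=$(pct 6 32)   # full consensus: 6 of 32 calls matched
sm_pct=$(pct 9 45)          # SomaticSniper/MuTect: 9 of 45 calls matched
echo "full consensus: ${consensus_pct}%  partial consensus-SM: ${sm_pct}%"
```

Under this reading the partial consensus recovers more validated variants (9 versus 6) at a similar matched fraction.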

5.3 Summary

• Programs used: GATK

• Outputs: the number of final consensus / partial consensus calls that match COSMIC and CCLE

6 Other Somatic Mutation Callers - Strelka, Virmid

6.1 Strelka (1 min 38 sec)

Strelka is a somatic mutation caller that uses Bayesian probability with posterior filtering; it was released by Illumina in 2012. Besides Illumina's own aligners, Isaac and ELAND, it also supports BWA. It is run somewhat differently from typical programs, in the same way as Illumina's more recently announced Isaac: to manage a project efficiently and consistently, it relies on make, a tool driven by a file format called a Makefile.

• Using Strelka requires a file that stores Strelka's options. Default option files are provided for each of three aligners: bwa, eland, and isaac.

• For exome or targeted sequencing, set isSkipDepthFilters = 1 in the default options.

$ ll /somatic_bench/app/strelka-1.0.14/etc/
total 20
drwxrwxr-x 2 viz  viz  4096 Jul 10  2014 ./
drwxr-xr-x 7 root root 4096 Jan 30 11:06 ../
-rw-rw-r-- 1 viz  viz  3658 Jul 10  2014 strelka_config_bwa_default.ini
-rw-rw-r-- 1 viz  viz  3683 Jul 10  2014 strelka_config_eland_default.ini
-rw-rw-r-- 1 viz  viz  3821 Jul 10  2014 strelka_config_isaac_default.ini

• Set variables for the directory where Strelka is installed and the directory where the analysis results will be stored.

• Copy the default option file and generate the analysis commands with the configureStrelkaWorkflow.pl command.

• Run the generated analysis commands with make. The -j option sets the number of threads (CPUs) to use for the analysis.

• INDELs and SNPs are written to separate VCF files, each as a passed set and a raw (all) somatic set, for a total of four result files.

$ STRELKA_INSTALL_DIR=/somatic_bench/app/strelka-1.0.14/
$ echo $STRELKA_INSTALL_DIR
/somatic_bench/app/strelka-1.0.14/
$ WORK_DIR=/root/myWork
$ cp $STRELKA_INSTALL_DIR/etc/strelka_config_isaac_default.ini config.ini
$ $STRELKA_INSTALL_DIR/bin/configureStrelkaWorkflow.pl \
    --normal=/root/normal.bam \
    --tumor=/root/tumor.bam \
    --ref=/somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --config=config.ini --output-dir=./myAnalysis
$ cd ./myAnalysis
$ make -j 8
$ ll myAnalysis/results/
total 88
drwxr-xr-x 2 root root  4096 Jan 30 11:39 ./
drwxr-xr-x 5 root root  4096 Jan 30 11:37 ../
-rw-r--r-- 1 root root 13452 Jan 30 11:37 all.somatic.indels.vcf
-rw-r--r-- 1 root root 36736 Jan 30 11:37 all.somatic.snvs.vcf
-rw-r--r-- 1 root root  7098 Jan 30 11:37 passed.somatic.indels.vcf
-rw-r--r-- 1 root root 16070 Jan 30 11:37 passed.somatic.snvs.vcf

• Check the number of somatic SNPs that finally passed.

$ cd myAnalysis/results/
$ grep -v "#" passed.somatic.snvs.vcf | wc -l
62

6.2 Virmid (33 min)

Virmid is software created in 2013 by Prof. Sangwoo Kim of Yonsei University. Through sampling, it makes use of the proportion of normal sample within the tumor (α).

• Check the number of somatic SNPs that finally passed.

$ java -jar /somatic_bench/app/Virmid-1.1.1/Virmid.jar \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -D /root/tumor.bam \
    -N /root/normal.bam \
    -t 8 \
    -w /root/virmid

$ cd /root/virmid
$ ls -al
total 98024
drwxr-xr-x 2 root     4096 Jan 30 16:00 ./
drwxr-xr-x 8 root     8192 Jan 30 15:32 ../
-rw-r--r-- 1 root  1252161 Jan 30 16:03 tumor.bam.virmid.germ.all.vcf
-rw-r--r-- 1 root   955213 Jan 30 16:03 tumor.bam.virmid.germ.passed.vcf
-rw-r--r-- 1 root      262 Jan 30 16:00 tumor.bam.virmid.gm
-rw-r--r-- 1 root    36564 Jan 30 16:03 tumor.bam.virmid.loh.all.vcf
-rw-r--r-- 1 root     2233 Jan 30 16:01 tumor.bam.virmid.loh.passed.vcf
-rw-r--r-- 1 root      992 Jan 30 16:03 tumor.bam.virmid.report
-rw-r--r-- 1 root  1364144 Jan 30 15:29 tumor.bam.virmid.sample.control.bai
-rw-r--r-- 1 root 53107377 Jan 30 15:29 tumor.bam.virmid.sample.control.bam
-rw-r--r-- 1 root  1364104 Jan 30 15:29 tumor.bam.virmid.sample.disease.bai
-rw-r--r-- 1 root 41746178 Jan 30 15:29 tumor.bam.virmid.sample.disease.bam
-rw-r--r-- 1 root    84053 Jan 30 16:03 tumor.bam.virmid.som.all.vcf
-rw-r--r-- 1 root     6883 Jan 30 16:03 tumor.bam.virmid.som.passed.vcf
$ grep -v "#" tumor.bam.virmid.som.passed.vcf | wc -l
78

7 Linux for Genome Research

7.1 The Practice Linux Server

• Server address: xxx.xxx.xxx.xxx
• ID: edu01, edu02
• Password: kogo2015
• Web access: http://xxx.xxx.xxx.xxx:8787

7.2 Connecting to the Practice Linux Server - Windows Users

• Go to http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

• Download putty.exe for Intel x86.

• Host Name: xxx.xxx.xxx.xxx / Port: xx

• If a Security Alert window appears, choose 'Yes (Y)'.

• Login ID: use the ID and password assigned to you.

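7.3 Connecting to the Practice Linux Server - Mac or Linux Users

• On a Mac (OS X), open Applications and run the Terminal app. On Linux, start a terminal from the menu or from the program menu of your own distribution.

$ ssh user_id@host_name
$ ssh root@127.0.0.1

• Connect to the practice Linux server with the ssh command. On the first connection, answer yes; a password prompt then appears, so enter the password you were given to log in.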

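7.4 Finding Out Linux System Information

This document is written with Ubuntu, one of the Linux distributions (footnote 3), as its reference. Unless noted otherwise, everything applies to all kinds of Linux. Linux is an operating system that comes in many distributions and runs on many kinds of hardware. You need to know what environment your Linux runs in so that, when installing software, you can install a build that suits your system.

• How to identify which Linux distribution you are currently using. Ubuntu is a freely distributed Linux operating system; its latest version at the time of writing is 14.04 LTS (Long Term Support) (footnote 4).

$ cat /etc/issue.net
Ubuntu 12.04.1 LTS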

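• Linux runs in many hardware environments, and software that supports Linux provides separate executables for each kind of hardware. Knowing the hardware you are using therefore lets you download and use the software build that fits it. The hardware type of a Linux server can be identified with the '-m' (machine) option. 'x86' means an Intel-based CPU, and '64' means 64-bit hardware (footnote 5).

$ uname -m
x86_64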
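When a download page offers several builds, the uname -m output can be mapped to a coarse label in a script. The following is a minimal sketch; the function name arch_label and the label wording are made up, and only a few common values are handled.

```shell
#!/bin/sh
# Map `uname -m` output to a coarse label for choosing a binary download.
arch_label() {
    case "$1" in
        x86_64|amd64) echo "64-bit x86" ;;
        i?86)         echo "32-bit x86" ;;
        *)            echo "other ($1)" ;;
    esac
}

echo "This machine: $(arch_label "$(uname -m)")"
```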

3 Linux distributions fall broadly into the Red Hat family and the Debian family, and each family contains many distributions.

4 Its code name is Trusty Tahr.

5 This is also abbreviated as x64.

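• The kernel is the core of the Linux operating system; it carries out the user's commands on the actual hardware. Different distributions use different versions of the Linux kernel. The most recent Linux kernel at the time of writing is 3.14.3, released on May 6, 2014. Each Linux distribution is built on a kernel released in this way. Let us check the kernel information of our Linux system.

$ uname -r
3.2.0-32-virtual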

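• The shell is the environment that accepts Linux commands and runs them. 'PATH' is one of the environment variables, which are values that affect how processes behave. export is the command that sets the value of such an environment variable. When you type a command, Linux first searches the directories listed in PATH to see whether the command exists there and then runs it. So if you install software of your own and want to run it on Linux, you must add it to PATH to be able to run it from anywhere; otherwise it can be run only from inside the directory where it is installed. The shell's environment variables can be inspected with the 'env' command, and PATH is set with 'export'.

$ env | grep PATH
MANPATH=/usr/local/texlive/2013/texmf/doc/man:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
INFOPATH=/usr/local/texlive/2013/texmf/doc/info:
$ export PATH=/BIO/app/bwa-0.7.5a/:$PATH
$ env | grep PATH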
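Running the export line above repeatedly, for example from a login script, adds the same directory to PATH each time. One way to avoid that is sketched below; the function name path_prepend is made up, and the directory is the one from the example above, written without the trailing slash.

```shell
#!/bin/sh
# Prepend a directory to PATH only if it is not already listed.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                          # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}

path_prepend /BIO/app/bwa-0.7.5a
path_prepend /BIO/app/bwa-0.7.5a    # the second call leaves PATH unchanged
echo "$PATH"
```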

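7.5 The Linux File System

A Linux partition logically divides one physical disk into several areas so that they can be managed separately. A file system is created on each partition to manage its files and directories.

• A Linux system is shared by many users, and each user has a home directory as a private area. Inside the home directory you can create and delete files. The 'cd' command moves to the home directory, and the 'pwd' command shows the path of the current directory.

$ cd
$ pwd
/home/hongiiv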

• Creating a directory and moving into it

$ cd
$ mkdir sample_data
$ ls -la
total 2203488
drwxr-xr-x 16 hongiiv hongiiv 4096 May 29 10:34 .
drwxr-xr-x  3 root    root    4096 May  7 13:14 ..
-rw-------  1 hongiiv hongiiv 1908 May 10 11:59 .bash_history
-rw-r--r--  1 hongiiv hongiiv  220 May  7 13:14 .bash_logout
-rw-r--r--  1 hongiiv hongiiv 3763 May 10 17:06 .bashrc
drwxr-xr-x  2 root    root    4096 May 29 10:34 sample_data
$ cd sample_data
$ pwd
/home/hongiiv/sample_data

• Deleting directories and files


$ cd
$ rm -rf sample_data
$ ls -la
total 2203488
drwxr-xr-x 16 hongiiv hongiiv 4096 May 29 10:34 .
drwxr-xr-x  3 root    root    4096 May  7 13:14 ..
-rw-------  1 hongiiv hongiiv 1908 May 10 11:59 .bash_history
-rw-r--r--  1 hongiiv hongiiv  220 May  7 13:14 .bash_logout
-rw-r--r--  1 hongiiv hongiiv 3763 May 10 17:06 .bashrc
$

• Viewing the Linux file systems

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       19G   14G  4.8G  74% /
udev            3.9G  4.0K  3.9G   1% /dev
tmpfs           1.6G  188K  1.6G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            3.9G     0  3.9G   0% /run/shm
/dev/xvdb1       79G   38G   38G  50% /home/hongiiv/test

• Viewing the partition information of a physical disk - the 21.5 GB physical disk /dev/xvda consists of two partitions, xvda1 and xvda2, which hold a Linux file system and a Linux swap area respectively.

$ fdisk -l
Disk /dev/xvda: 21.5 GB, 21474836480 bytes
255 heads, 63 sectors/track, 2610 cylinders, total 41943040 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00034212

    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1            2048    40038399    20018176   83  Linux
/dev/xvda2        40038400    41940991      951296   82  Linux swap / Solaris

Disk /dev/xvdb: 300.6 GB, 300647710720 bytes
171 heads, 35 sectors/track, 98112 cylinders, total 587202560 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x3459a991

    Device Boot      Start         End      Blocks   Id  System
/dev/xvdb1            2048   587202559   293600256   8e  Linux LVM

• Checking the file system mount information

$ cat /etc/fstab
proc         /proc   proc   nodev,noexec,nosuid   0   0
/dev/xvda1   /       ext3   errors=remount-ro     0   1
/dev/xvda2   none    swap   sw                    0   0


7.6 Adding a Disk in Linux

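• After confirming the newly added disk with fdisk, the disk becomes usable after three steps: partitioning, creating a file system, and mounting. To make Linux recognize a USB device, only the mount step is needed.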

$ fdisk /dev/xvdb
$ mkfs.ext3 /dev/xvdb1
$ mkdir /new_hdd
$ mount /dev/xvdb1 /new_hdd
$ cd /new_hdd
$ df -h

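7.7 File-Related Commands

• touch - creates a new file of size 0, or changes the timestamp of an existing file. It is occasionally needed when installing or working with bioinformatics software, so it is worth remembering.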

$ touch a
$ ls -al
-rw-r--r-- 1 root root 0 Jun 18 10:04 a
$ date
Wed Jun 18 10:05:10 KST 2014
$ touch -c a
$ ls -al
-rw-r--r-- 1 root root 0 Jun 18 10:05 a

• cat - shows the contents of a file, and is also handy for writing a short script. The command 'cat > test' creates a file named test and lets you type its contents. When you have finished typing, press 'ctrl+D' to exit.

$ cat > test
hi there
my name is hong
$ cat test
hi there
my name is hong
$ ls -al
-rw-r--r-- 1 root root 25 Jun 18 10:09 test

• Counting the files in a given directory

$ ls -l . | grep ^- | wc -l
50

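• Printing a file without the lines that begin with a particular string. In a file such as VCF, where lines beginning with '#' are comments, this leaves only the actual list of variants. Conversely, you can print only the comment lines.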

$ cd /BIO/data/gatk
$ grep -v "#" dbsnp_138.hg19.vcf | wc -l
8087914
$ grep -F "#" dbsnp_138.hg19.vcf | wc -l
165
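One detail worth knowing: the pattern "#" matches a '#' anywhere in a line, so a variant line whose INFO field happens to contain '#' would be dropped as well. Anchoring the pattern with '^' restricts the match to the start of the line. The sketch below uses a tiny made-up VCF (the records and the NOTE tag are hypothetical) to show the difference.

```shell
# Build a tiny, made-up VCF: two header lines and two variant lines.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chrM\t16390\trs1\tG\tA\t.\t.\tNOTE=has#hash\n'
  printf 'chr1\t10019\trs2\tTA\tT\t.\t.\t.\n'
} > "$vcf"

# Anchored: count lines that start with '#', and lines that do not.
header_lines=$(grep -c '^#' "$vcf")
variant_lines=$(grep -vc '^#' "$vcf")

# Unanchored: also discards the variant whose INFO field contains '#'.
unanchored=$(grep -vc '#' "$vcf")

echo "header=$header_lines variants=$variant_lines unanchored=$unanchored"
rm -f "$vcf"
```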


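• Printing only a specific column. The values of that column can then be sorted in dictionary order with 'sort -d', and consecutive repeats can be counted with 'uniq -c'.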

$ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | more
chrM
chrM
chrM
chrM
chrM
chrM
chrM
chrM
chrM
$ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | sort -d
chr1
chr2
$ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | uniq -c
    475 chrM
4723878 chr1
3363561 chr2
$ grep -v "#" dbsnp_138.hg19.vcf | \
  awk '{if ($1 == "chrM") printf "chrM is: %s\n", $2}'
chrM is: 16390
chrM is: 16391
chrM is: 16429
chrM is: 16445
chrM is: 16499
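The 'uniq -c' count above works because the VCF is already grouped by chromosome: uniq only merges lines that are next to each other. For input that is not grouped, sort it first. A small sketch with made-up chromosome names:

```shell
# Made-up, ungrouped list of chromosome names.
chroms=$(printf 'chr1\nchrM\nchr1\nchr2\nchr1\n')

# Without sorting, uniq sees five separate runs.
adjacent_only=$(printf '%s\n' "$chroms" | uniq -c | wc -l | tr -d ' ')

# Sorting first groups identical names, so each name is counted once.
# LC_ALL=C keeps the sort order independent of the locale.
sorted_counts=$(printf '%s\n' "$chroms" | LC_ALL=C sort | uniq -c |
  awk '{printf "%s%s=%s", sep, $2, $1; sep=" "}')

echo "runs without sort: $adjacent_only"
echo "counts with sort:  $sorted_counts"
```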

• Saving what is written to standard output into a file

$ grep -v "#" dbsnp_138.hg19.vcf | \
  awk '{if ($1 == "chrM") printf "chrM is: %s\n", $2}' > ~/chr_pos.txt
$ grep -v "#" dbsnp_138.hg19.vcf | \
  awk '{if ($1 == "chr1") printf "chr1 is: %s\n", $2}' >> ~/chr_pos.txt
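The two redirection operators behave differently: '>' creates the file or overwrites it, while '>>' appends to it. A minimal sketch on a temporary file (the file and its contents are made up for illustration):

```shell
out=$(mktemp)

echo "first"  >  "$out"   # '>' creates the file or truncates it
echo "second" >> "$out"   # '>>' appends to the existing contents
lines_after_append=$(wc -l < "$out" | tr -d ' ')

echo "third"  >  "$out"   # '>' again: the earlier two lines are gone
lines_after_overwrite=$(wc -l < "$out" | tr -d ' ')
last_content=$(cat "$out")

echo "after append: $lines_after_append, after overwrite: $lines_after_overwrite"
rm -f "$out"
```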

7.8 Linux Network Information

• This shows information about the network interfaces. The inet addr of eth0 is the address[6] at which this Linux machine can currently be reached from outside.

$ ifconfig
eth0      Link encap:Ethernet  HWaddr 02:00:5b:73:00:33
          inet addr:172.27.252.234  Bcast:172.27.255.255
          inet6 addr: fe80::5bff:fe73:33/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:501386 errors:0 dropped:0 overruns:0 frame:0
          TX packets:346879 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:19357734604 (1 GB)  TX bytes:2720265191 (2 GB)
          Interrupt:68

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:4337 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4337 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2203478 (2.2 MB)  TX bytes:2203478 (2.2 MB)

[6] The address of the Linux server here is 172.27.252.234; it will differ depending on each participant's exercise environment.

7.9 Understanding Compression in Linux

Linux supports many compression formats, and software and data are usually distributed as compressed files.

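• These are the ways to decompress the various compression formats used on Linux. The decompressed files contain a puzzle, and there is a prize for the first person to solve it.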

$ cd
$ cp -R /BIO/data/compress ./compress
$ cd compress
$ gzip -d compress01.gz
$ tar xvfz compress02.tar.gz
$ unzip compress03.zip
$ bzip2 -d compress04.bz2
$ tar xvfz compress05.tar.gz
$ tar xvf compress06.tar.bz2

• gzip: Recommended for fast network connections
• bzip2: Recommended for slower network connections (smaller size but takes longer to compress)
• zip: Not recommended but is provided as an option for those who cannot open the above formats
• This is how to look at the contents of large compressed bioinformatics data directly, without decompressing the file first. It is useful for checking files such as FASTQ.

$ gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf.gz | more
$ gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.tar.gz | tar -tvf -
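The same streaming approach works for FASTQ. The sketch below builds a tiny made-up gzipped FASTQ file (read names and sequences are hypothetical), prints the first record, and counts the reads, relying on the fact that each FASTQ record occupies four lines.

```shell
# Create a tiny, made-up gzipped FASTQ file with two reads.
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$fq.gz"

# Look at the first record (4 lines) without writing a decompressed copy.
first_record=$(gzip -dc "$fq.gz" | head -n 4)
echo "$first_record"

# Each record is 4 lines, so the read count is the line count divided by 4.
read_count=$(( $(gzip -dc "$fq.gz" | wc -l) / 4 ))
echo "reads: $read_count"

rm -f "$fq" "$fq.gz"
```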

7.10 Installing Software on Linux

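In general there are three ways to install software on Linux. The first applies to software distributed as compressed binary (executable) files: simply decompress the archive and the program is ready to use. The second uses the packages provided by the Linux distribution; on Ubuntu this is done with the package management program APT. The third is to install from the source files.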

7.10.1 Installing Software with APT

• Updating and installing packages with APT

$ apt-get update
$ apt-get install bwa
Reading package lists... Done
Building dependency tree
Reading state information... Done
Use 'apt-get autoremove' to remove them.
Suggested packages:
  samtools
The following NEW packages will be installed:
  bwa
0 upgraded, 1 newly installed, 0 to remove and 153 not upgraded.
Need to get 135 kB of archives.
After this operation, 286 kB of additional disk space will be used.
Fetched 135 kB in 3s (40.1 kB/s)
Selecting previously unselected package bwa.
(Reading database ...17 files and directories currently installed.)
Unpacking bwa (from .../archives/bwa_0.6.1-1_amd64.deb) ...
Processing triggers for man-db ...
Setting up bwa (0.6.1-1) ...
$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.6.1-r104
Contact: Heng Li <[email protected]>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries
         fastmap       identify super-maximal exact matches

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ
         pac2cspac     convert PAC to color-space PAC
         stdsw         standard SW/NW alignment

• This is the list of packages that should be installed in advance, as a base for installing NGS-related software.

$ apt-get update -y
$ apt-get install gcc -y
$ apt-get install make -y
$ apt-get install zlib1g-dev -y
$ apt-get install libncurses5-dev -y
$ apt-get install g++ -y
$ apt-get install tcl tk -y
$ apt-get install tcl-dev -y
$ apt-get install unzip -y
$ apt-get install curl -y
$ apt-get install screen -y
$ apt-get install python-dev -y
$ apt-get install python-software-properties -y
$ add-apt-repository ppa:webupd8team/java
$ apt-get update -y
$ apt-get install oracle-java7-installer -y


7.10.2 Installing Software by Compiling Source Code

• Installing from source

$ cd
$ cp /BIO/app/bwa-0.7.4.tar.bz2 ./
$ tar xvf bwa-0.7.4.tar.bz2
$ cd bwa-0.7.4
$ make
$ ./bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.4-r385
Contact: Heng Li <[email protected]>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.6.2-r126
Contact: Heng Li <[email protected]>

Usage:   bwa <command> [options]