Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
1
GBS Bioinformatics Pipeline ...or, “Where Your Data Go After Sequencing”
James Harriman Ed Buckler Jeff Glaubitz
Qseq
QseqToTagCount
TagCounts per lane
TagCounts for species
(Master Tags)
Fastq
SAM alignment
TagsOnPhysical Map
Key files
TagsByTaxa files (1 per lane)
TagsByTaxa for species
HapMap
Merge TagsByTaxa
SAM convertor
BWA (Burrows-Wheeler Aligner)
TagCountsTo FASTQ
Merge TagsCounts
TagsToSNP ByAlignment Process
QseqToTBT
File (data structure)
Reference Genome Pipeline
2
Qseq
QseqToTagCount
TagCounts per lane
TagCounts for species
(Master Tags)
Key files
TagsByTaxa files (1 per lane)
TagsByTaxa for species
HapMap
Merge TagsByTaxa
Merge TagsCounts
TagHomology PhaseNoAnchor
Process
QseqToTBT
File (data structure)
Non-Reference Genome Pipeline
HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGCTGAGATCGGAAGAGCGGTTCAGCAGG ggggggggggggdeggggggggggg^dccebabcbde_Tb[\`T_baa``ddcdd^c^^b_ab[K]]SO_\_\T\VVVW\`dc\``b`b`]U]UZWXW[T 1 HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGCCAGAGCTTGACCAGCTGAGATCGGA fefefffefe_ddd]feeeceefffeeafbd`d`dfefedccffeeefccedfea^deddcZfae[[ba`bccb_da`^bQccc`c\`c`daT`W]T_[] 1 HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATTTTCAGGTGATTAGGAGCGTAAAAAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCATTGCTGTCCATGCCACCATATCCTT ggcggggfgggggggggffgggggdeeggdffgfggggggggcgggggggggfgggceggeggcbgggggedebeefccfgcggegeddeegdfde`fef 1 HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAGAGCGGTTCAGCGGGACTGCCGAGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGTTAACGTGAGGACGGGCTTTGAAGGA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTGCAGAGATCGGAAGAGCGGTTCAGCAG fffffffdefegfdffbcfeffeffffedegdggdgefbfe^cfcf\ded\]_]_Ya`]KW`cc`cYdd`Q[XYXabWaLa]_aZ]Y]TWddd^Y```Tc 1 HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG VTMKOJJMUGVZN[V_`_`YWZYPWILQSNYbb\Y_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15786 200078 0 1 GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCCCGAATCAAATGGTGCCATTGCCACTG fceffcddbeabbdcaebeccaedZc`bc`^``X`\cbbc^d`b[c\[dcdbcd^`^]`bZdabZaVa\dZZ[SQccS``c^_^c^bTa^\b\b_BBBBB 1 HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTCCAGGGTTTTAAGAGCCTAACAAAG effffefffffYfeedbdddad\ddbTb^^abMb`aedeedaedacdea`cWdaadcYdXdabb`df]efd^db`dddddfefff``Z`cc^ac^`BBBB 1 HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTA ggggggggggf`ggfggggdggggf^gfgdfffffggdgeggeggggcgbaedaebdd`debccc^aVccbabK_`_b_`d_beeTMMSM``Z]^X``_B 1 HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT ^YTLX]X]ZX]\^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC gggggggfgggggggggggggggggegggg`gggggegggggggdeggggggeggggegegggeead^dddaa]bXYZ^`bdbbb_\Yd_`cM``bb^aW 1 HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCAGCAGGAGTGCCGAGACCGATCTCGTATGC fdedffdc^ddbebececed[acde_^OTN__`a_ed^^dadaabK][^^X]NYMcc^[WOUSSE]`S[[U`ZU`^_I`a`Z_b`Q[]_]]]V[R^^`[` 1 HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACGATTGGGAAGCCCTTGTTGGAAGGAAAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCC eeeb`ebbeeb^]^bebeeb[`X^`^dXcabZb]bb`bbb_dd]]_]_bbd\_\_cc[___a__Vca_Sc`b]Y\c__]b`V[\^[_PSbBBBBBBBBBB 1 HWI-ST397 0 3 68 15842 200099 0 1 TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCGGAGACCG dddddbffffffdfaadaeeffff[M[QUJda`[VRIQR\cX[cddc[df^cX^ISS^QZ`Y^TYEPTZS\`T\`MSTTSNVYQQ]]]T`\`QY_BBBBB 1 HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGCGCGCGCTGAGATCGGAAGAGGGGTTCAG XVKYYIUWWM\O__BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGCCTAGAAGTTTCGCCCCATCACCCTTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG \c\ededdceeeVeedffddaceedeeeWS[O``\cccccdadZd\daddeaacfefeee\^d_d_ccbUQVWHUIKPFGUY\UY[T[[XT`YT\]_b^] 1 HWI-ST397 0 3 68 15765 200113 0 1 CCAGCTCAGCATGGATCTCTCCTTGATGGACTGAAAGCGCGTGTGCTCCCCTGTGTGATGGAAAGTGGCAGTGATCGGAAGTTCGGTACGGAAGGAGTGC NQHLGNUSTUIQUTX_O_X][_YX[K\\U[O^KJ\[SYW]^[[^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15912 200114 0 1 CCAGCTCAGCTCAAGCATTGGCTTCCGCTTTGGCATCCTGGAGGGTAAGCTTCTGCTCTTCTCACTAGAGGAGGATCCTGGCTATCGAAACCTGTTGTAT a\YaY^^[^R^a^`[c[c\`VS^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1 HWI-ST397 0 3 68 15791 200127 0 1 ACAAACAGCAGAGGTCGCATTGTAGTTAGTCCGGGACTTGCCCAGTTCATTGCTGAGATCGGAAGAGCGGTTCAGCAGGCATGCCGAGACCGATCTCGTA QY^\^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15831 200117 0 1 GCTCTACAGCTTCTGGCCAGAATGCTTTTGGCACTTGTTTGTCACAAAGCATGCACTGAACCATATTCATGATAGTTCGATTTTTGCGTTCAGCTACCCC b_`ba\X\ZZ__X^XUabbaKKK\V\_\XbUHMXURRNURTT`PVQQTSXSV]NIVXVSHWTTSI[VYW^Wa`_`\^BBBBBBBBBBBBBBBBBBBBBBB 1 HWI-ST397 0 3 68 15848 200124 0 1 TTCTCCAGCTGCTACATGCACCGTGGGAAGAAGGTCTGCCCCACATACCCACCAGCCATCGCCCTTCTCACATTCGATTCAAACATCTTTGGGTTATCCC a__^a^d_dda_edeeec]ed]c^da^`^c`ca```cac]aa`Y_acc_`c[V^bc^``b`cG\`_]^VYUabX^UIXXX[OPVUW_]``HU[X]aS`bZ 1 HWI-ST397 0 3 68 15891 200120 0 1 GAGATACAGCTGCGAATTGGGGGTTCCTGTGTTGCGAAGTGGCACTCGTGTGCCAAACTTGGCTACGCAGAGATCGGTAGAGCGGTTAAGCAGGAATGCC eadcdeddffffefbbffdffefdfcedcdfffff]abcbdYddebde_dadXc`cMccbc]^`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1 HWI-ST397 0 3 68 15931 200128 0 1 AAAAGTTCAGCAATACCTGTTGAAGCCAAGCCCTTGTGGTGATTGCCTCGTTCATTGCTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG ddddNcb\_bdddddeaeceab`_]VX]_\___cbZT[ZQ[[[Y]`^cZ]bcbYbcYcU^K]\S[^`Q]]VVPIUIUQQRMQ[NWbcb\L__LZ`BBBBB 1 HWI-ST397 0 3 68 15991 200121 0 1 GAATCTGCTACTAGTGAGCCTTTGTATGGGGACCGAGTTCAGAAGCTCTAACCCTCGTTTTCCCATCTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATG [\W\YN]XTX\\URV]]T[[O^_[NPLWWT]XXRRYYTXR]T]XX^a`^Y\WJX]Y[GTTUULRKIMPEMT[O[Zaa^Ra`BBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15765 200133 0 1 TAGCATGCCTGCTGCAGGAGTTGGTGCCCAGCATTCTCAGGTGTAGTCCAAATTCTGTCTGATACTTATTGTTTATGCGATTTTGCCATCACATATGGAG d]dcd]ddeeeeceZ`ddaUY]^aX_^[[^acRda`^c``c_eebbbda]X`c^`ce``debc[[_Q_][`cd_Zee`a[eeeda_^_ZX]^[^Kb`^X^ 1 HWI-ST397 0 3 68 15810 200133 0 1 TTCAGACAGATGATGCTTGTCAAGGGTCACCATCTTGCATTGCGCTGCGTCACATCCTTAGTGGGAATAGGGGATCAGCCTCGCTTTTTGAAGCTGAGTT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15871 200135 0 1 CTTGCTTCAGCCATGTAGAGTGGTGTTGCTCCTTTACTACCACGAATCATTGGTAACTCCCTGTTCTTATTCACCATTACCTCAGCAATCCTTGTGATTC gggggggggggggggggfggdff^ebbcccgeggggegggggggdggeggefgd^bbeeeggeegegfgfeaee`ccbccUedeedfdfe`caccT[baa 1 HWI-ST397 0 3 68 15974 200136 0 1 TTCAGACAGCCAAACGACGTCTTAGTGGAGAAAATACCTGAGAAAAGTCAAGAAACCAAAACACTAAAAAATGACCAAGAAATAGAGCAGAGATCGGAAG ggggggggggggggggggggeggggeggggegggdagedegggggdgbeddgeedgegdedggdedece^^baXc`f`dde[c[cbcIceddRda\_c\` 1 HWI-ST397 0 3 68 15909 200147 0 1 AGCCTCAGCTTGGTTGCTTGTGGTTGGGGGTGAGGGGGCGGGCGGGAACTTATGTTTGCGCCCCGAGGCGGAGCCGCTCTAGGCAGAGATCGGAAGAAGG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15946 200152 0 1 CTTGACTGGGCGTGGTGCTGAGGCTACTGCGGAATTGAGGTGTTGTCATCCACCGGATTGGGTCGTAGGGCGTGGCTTTGAGCGGAGATCGGATGAGCGG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15774 200153 0 1 TTCAGACAGCCAACTGAGATGACTCTCATTCTTGGTAGGAACCAATTTCTGAGAGCTTCGTAATGACATCAACTACAGCTGTGATCGGAAGAGCGGTTCA U[P[VZ\NWR]M]T\[J``[_[I^RY[Q\\^Z]^KSSUMGOZIOI]WPTT_UNVVTWQMW[^O^\^]]aR][aBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15814 200155 0 1 GAGATACAGCAACAAATGATGTCATTCCTTGCAAAAGCTGTACAAAGCCCTGGTTTCTTAGCTCAGCTGGTACAGCAGAGATCGGACGAGCGGTTCGCCA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15850 200154 0 1 GTGTTTGGTCGTGAAAGTGGACCTCTTTCAGGTGCAGGTGCGAGTAGAAGGAGGTCCCAGAGACGTGCGGCTGGAGATCGGAAGGGTGCTGAGGCGGGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15870 200157 0 1 GAGAAACCGCAGAATGATAGCAAAAAGCGCGTTACAGGAGATATTAAGAAAAGGAGACTTGCAATGCAGGAGTAAGAGATCCATTCTGCTATGCATGCTC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15984 200158 0 1 CGTCAACTGCATGAAGGAGGTTGTCTGGCCGTTGGAGGAGTGATTTTGGAAGGCTGAGATCGGAAGAAAGGTTCAGCAGGAATGCAGAGACCGATCTCGT N[a`cY^Y\X\MXZX\WRPTc`cab[[T_[\]MZ\GPPWQ\_N]_[X\[]TcNcaa\URaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1 HW
Raw Sequence (Qseq)
3
Assignment to Samples
Barcode sequences from the plate map are compared to barcode sequences in the reads, in order to associate reads with the samples from which they originate. Parameters:
Users supply a plate map and staff members supply DNA barcodes. These are combined into a table of barcodes by sample.
Project Details Sample Details Organism Details Origin Details Project Name Source Lab Plate Name Well Sample Name Pedigree Population Stock Number Sample DNA Concentration Sample Volume Sample DNA mass Preparer Kingdom Phylum Class Order Family Genus Species Subspecies Variety Location Name Country BREAD Buckler BREAD-Maize-A A01 PI597982 inbred 04A0160A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A B01 blank 0 0 0 plantae zea mays BREAD Buckler BREAD-Maize-A C01 PI576130 inbred 04A0191B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A D01 PI655991 inbred 04A0165A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A E01 PI656059 inbred 04A0193B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A F01 CML91 inbred 04A0005BA 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A G01 CML311 inbred 04A0301A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A H01 CML311 inbred 04A0200A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A A02 MR_0011.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0281A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A B02 MR_0013.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0279B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A C02 MR_0014.3 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0164B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A D02 MR_0015.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0163A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A E02 MR_0016.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0315B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A F02 MR_0018.2 (PI655994 x PI655998)S4 PI655994 x PI655998 02F146114A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A G02 MR_0020.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0289B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A H02 MR_0022.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0171A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A A03 MR_0025.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0170B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A B03 MR_0027.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0381B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A C03 MR_0028.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0258A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A D03 MR_0029.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0304B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada Cacao Buckler BREAD-Maize-A E03 Tc1536 Catie F1 04A0216A 10 100 1000 Jemmy Takrama plantae theobroma cacao gha Tafo Cacao Buckler BREAD-Maize-A F03 Tc7959 Brazil F2 04A0255A 10 100 1000 Jemmy Takrama plantae theobroma cacao gha Tafo BREAD Buckler BREAD-Maize-A G03 PI542406 inbred 04A0217A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A H03 PI655981 inbred 04A0167A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A A04 PI656007 inbred 04P160451A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A B04 PI17548 inbred 04A0258B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A C04 PI564163 inbred 04A0244A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A D04 PI651492 inbred 04A0298A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A E04 PI656008 inbred 04A0293B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A F04 PI655985 inbred 04A0296A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada
Plate Map
4
Flowcell Lane barcode sample Plate# Row Column PlateName 434GFAAXX 2 CTCC M0001 1 A 1 IBM1 1A01 434GFAAXX 2 TGCA M0012 1 A 2 IBM1 1A02 434GFAAXX 2 ACTA M0021 1 A 3 IBM1 1A03 434GFAAXX 2 GTCT M0029 1 A 4 IBM1 1A04 434GFAAXX 2 GAAT M0038 1 A 5 IBM1 1A05 434GFAAXX 2 GCGT M0046 1 A 6 IBM1 1A06 434GFAAXX 2 TGGC M0057 1 A 7 IBM1 1A07 434GFAAXX 2 CGAT M0067 1 A 8 IBM1 1A08 434GFAAXX 2 CTTGA M0080 1 A 9 IBM1 1A09 434GFAAXX 2 TCACC M0090 1 A 10 IBM1 1A10 434GFAAXX 2 CTAGC M0099 1 A 11 IBM1 1A11 434GFAAXX 2 ACAAA M0113 1 A 12 IBM1 1A12 434GFAAXX 2 TTCTC M0003 1 B 1 IBM1 1B01 434GFAAXX 2 AGCCC M0013 1 B 2 IBM1 1B02 434GFAAXX 2 GTATT M0022 1 B 3 IBM1 1B03 434GFAAXX 2 CTGTA M0030 1 B 4 IBM1 1B04 434GFAAXX 2 AGCAT M0039 1 B 5 IBM1 1B05 434GFAAXX 2 ACTAT M0047 1 B 6 IBM1 1B06 434GFAAXX 2 GAGAAT M0058 1 B 7 IBM1 1B07 434GFAAXX 2 CCAGCT M0068 1 B 8 IBM1 1B08 434GFAAXX 2 TTCAGA M0081 1 B 9 IBM1 1B09 434GFAAXX 2 TAGGAA unknown 1 B 10 IBM1 1B10
Example DNA Barcode Key
Notes on Names & Chromosomes
• Chromosomes (or contigs MUST be integers)
• Samples Names some Advice: • NO spaces • NO “:” • Try to avoid weird characters.
5
Qseq
QseqToTagCount
TagCounts per lane
TagCounts for species
(Master Tags)
Fastq
SAM alignment
TagsOnPhysical Map
Key files
TagsByTaxa files (1 per lane)
TagsByTaxa for species
HapMap
Merge TagsByTaxa
SAM convertor
BWA (Burrows-Wheeler Aligner)
TagCountsTo FASTQ
Merge TagsCounts
TagsToSNP ByAlignment Process
QseqToTBT
File (data structure)
Reference Genome Pipeline
QSeqToTagCounts
Processes a Qseq file so we know what alleles (tags) are present in the the sample • Handles sequence quality issue • Identifies the barcodes • Removes problem tags • Counts tags
6
Cut site Read Barcode adapter Cut site
Cut site Read Cut site Sequence Barcode adapter
Cut site Read Cut site
GBS Restriction Fragment Structure
Potential chimeric sequence Rejected or Trimmed reads
Common adapter
Accepted read Read Barcode adapter Cut site
Short sequence
Common adapter
Adapter dimer
Barcode adapter Common adapter Cut site
Sequence Processing Raw sequence data is processed into unique 64-bp sequences. For example: CTCCCAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC GTTGAACAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC
Becomes: CAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC 64 2
Parameters: Restriction enzyme Different enzymes will create different sequence motifs, such as overlapping cut sites, palindromes or wobble bases.
Barcode Barcode sequences must be provided to identify acceptable reads.
Number of identical sequences accepted This gives investigators the option to ignore repetitive sequences or singleton reads.
7
26442466 2 CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT 64 1 CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC 64 1 CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT 64 2 CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTTGGCACTCAAGCCCAAAACCACAGATCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGTAATTTGTTGTCTCATACCTCATACCACAGAACTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTTTTCCAACCCCAAAACCCAAGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTTTCCCAAACCCCAAACCCCAGGCTTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAGGGATAGGGAAGATGGGGGAGAGTGGCGGCCACGCATGGAA 64 1 CAGCAAAAAAAAAAAAAAAAAACAACAAGGAATTTGGGTATTCATTCCCCATACCCCAGGATTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACAAAAAAATTTGTTTTCTCAACCCCAAAACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 2 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGGAATTGAATCTCTCACACCTTAAAACACCGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAATTATTTGAAAGATCATTACCCTATACCACGGGGTTC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTGATGTCTCATACCCCATACCACAGGACTCCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTTATTTCTCATACCCCAAACCCCAGGACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAAGAATTTTATGTCTCATACCTCAAACCAAAGGACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAATAAATTTGTTGCTCATACCCCAAACCACAGGGCTTTC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAGCAATTTGATTCCACTTAATCTATCCCACAGAACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTGTTTCCCTAACCCCAAAACCACGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAACCCAATGAATTTGTAGTGCCAAACCCCAAACCAACGGACTTT 64 1 CAGCAAAAAAAAAAAAAAAAAACCCCAAGAAATTTGATGTCTCATACCCCAAACCCCAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAGACCAGGTAATTATTGCTCACATACATCAAACTCCAATTGCC 64 1 CAGCAAAAAAAAAAAAAAAAAAGCGCCTAACGTTTCAAAATGAATGAGTTGCCAACCAAGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAAGGGTTAGGAAAGATGGGTGGGAGGGGCGGGCCTGCTTGAAAT 64 1
TagCounts File Number of Tags
Max Size of Tag x 32bp
Tag Sequence Length (bp) Count
Qseq
QseqToTagCount
TagCounts per lane
TagCounts for species
(Master Tags)
Fastq
SAM alignment
TagsOnPhysical Map
Key files
TagsByTaxa files (1 per lane)
TagsByTaxa for species
HapMap
Merge TagsByTaxa
SAM convertor
BWA (Burrows-Wheeler Aligner)
TagCountsTo FASTQ
Merge TagsCounts
TagsToSNP ByAlignment Process
QseqToTBT
File (data structure)
Reference Genome Pipeline
8
Unique Reads (FASTQ) @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=2 CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
BWA (Burrows-Wheeler Aligner) • Aligns the tags in FASTA format to the
reference genome • Parameters:
• Similarity of read sequence and genome sequence. This controls the tradeoff between number of SNPs and confidence in the alignment. Default is 4 edits per sequence.
• Gap penalty. This controls sensitivity to indels. Default is no indels within 5bp of the read ends.
• Outputs a SAM Alignment • There are many other aligners. BWA is fast
and memory efficient, but may not be appropriate for your species
9
Generic Alignment (SAM)
length=64count=1 0 7 6994125 37 55M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:29T32 length=64count=2 0 7 6994125 37 54M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:29T32 length=64count=1 0 7 6994125 37 53M2I9M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:29T32 length=64count=1 0 7 6994125 37 54M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:62 length=64count=1 0 7 6994125 37 55M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:38G23 length=64count=4 0 7 6994125 37 4M3D47M2I11M * 0 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:5 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:2 MD:Z:4^A length=64count=1 16 17 14761759 25 64M * 0 0 CCTTTCTTGGCCTGGTTCTCACTCATCTGGGCTTGGAATTGAGAACCGTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:8T1 length=64count=7 16 18 1517944 25 64M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCCCGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:14G12A12T0G length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACACGTTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:13C0G25T0G2 length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACAGGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:11C2G25T0G2 length=64count=4 16 18 1517944 25 64M * 0 0 GCCCGTCTACCCGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:10A3G25T0G2 length=64count=2 16 18 1517944 25 64M * 0 0 GCCCGTCTCCACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:8A5G25T0G22 length=64count=53 16 18 1517944 37 64M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:14G25T0G22 length=64count=1 16 18 1517944 25 64M * 0 0 CCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:0G13G25T0G2 length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACACCCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:12G1G25T0G2 length=64count=1 0 10 10388735 37 58M1I5M * 0 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:1 MD:Z:59A length=64count=1 0 2 714861 37 64M * 0 0 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:2A61 length=64count=11 16 19 13463035 37 49M1I14M * 0 0 TGCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 length=58count=1 0 2 14032437 37 4M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 MD:Z:57C length=64count=1 0 2 14032437 37 4M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 MD:Z:57C length=64count=1 16 19 13463036 37 48M2I14M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCCTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:2 length=64count=1 0 6 20542400 37 64M * 0 0 CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:13C length=64count=1 16 5 15019027 37 49M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGACTCCCCCTCATCACTCTTTCTGCATTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:1 length=64count=3 16 5 15019027 37 49M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGACTCACCCTCATCACTCTTTCTGCATTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 length=64count=1 16 5 15019027 37 49M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGACTCACCCCCATCACTCTTTCTGCATTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:1 length=64count=1 16 5 15019027 37 49M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGACTCTGCCTCATCACTCGTTCTGCATTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 length=64count=1 0 6 20542400 37 4M1I59M * 0 0 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 MD:Z:63 length=64count=1 0 8 18851188 37 64M * 0 0 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:8G3 length=64count=5 16 19 13463034 23 64M * 0 0 CTGCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:1 XM:i:3 XO:i:0 XG:i:0 MD:Z:1C1 length=64count=1 0 5 6176480 37 64M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:26G37 length=64count=7 0 5 6176480 37 64M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:64 length=57count=31 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAAA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:57C length=64count=4 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGAGAT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:57C length=64count=1 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGTTGCAGAGAA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:54C length=64count=1 16 5 15019027 37 16M1I47M * 0 0 TCCATTGTTGTATCTTCGATTGCAGACTCACCCTCATCACTCTTTCCGCCTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1
SAMConverter & TagsOnPhysicalMap (TOPM)
• TOPM is the key file to interpret tags present in a species. Contains:
• Tag Sequence • Position • Divergence from reference • Polymorphisms • Genetic mapping support
10
6040401 2 4 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG 64 0 17 0 14761759 1476182 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG 64 0 10 1 10388735 1038879 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA 64 0 2 1 714861 714925 0 CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA 64 0 19 0 13463035 1346309 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA 64 0 2 1 14032437 1403250 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA 64 0 2 1 14032437 1403250 CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 19 0 13463036 1346310 CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT 64 0 6 1 20542400 2054246 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG 64 0 5 0 15019027 1501909 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT 64 0 5 0 15019027 1501909 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG 64 0 5 0 15019027 1501909 CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT 64 0 5 0 15019027 1501909 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT 64 0 6 1 20542400 2054246 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG 64 0 8 1 18851188 1885125 CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG 64 0 19 0 13463034 1346309 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 5 1 6176480 6176544 0 CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 5 1 6176480 6176544 0
TagsOnPhysicalMap File
BWA sensitivity is pretty poor
Alignment Class BWA Bowtie2 Single Best
Mapping 57% 69%
Multiple Mapping 17% 17% Unmapped 26% 14%
BLAST about the same as Bowtie2. Code needs to be updated to parse Bowtie2. Many of the multiple mapping do NOT map with 100% identity, which suggests they can be genetically mapped.
11
Qseq
QseqToTagCount
TagCounts per lane
TagCounts for species
(Master Tags)
Fastq
SAM alignment
TagsOnPhysical Map
Key files
TagsByTaxa files (1 per lane)
TagsByTaxa for species
HapMap
Merge TagsByTaxa
SAM convertor
BWA (Burrows-Wheeler Aligner)
TagCountsTo FASTQ
Merge TagsCounts
TagsToSNP ByAlignment Process
QseqToTBT
File (data structure)
Reference Genome Pipeline
6040401 2 88 08.0731-5 chardonnay 08.0731-19 08.0731-29 08.0731-6 08.0731-24 08.0731-37 08.0731-15 08.0731-38 08.0731-44 08.0731-23 08.0731-46 08.0731-31 08.0731-10 08.0731-49 08.0731-47 08.0731-7 08.0731-48 08.0731-21 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC 64 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC 64 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT 64 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG 64 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT 64 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG 64 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT 64 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAAA 64 1 1 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGAGAT 64 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGTTGCAGAGAA 64 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Tags by Taxa
12
Qseq
QseqToTagCount
TagCounts per lane
TagCounts for species
(Master Tags)
Fastq
SAM alignment
TagsOnPhysical Map
Key files
TagsByTaxa files (1 per lane)
TagsByTaxa for species
HapMap
Merge TagsByTaxa
SAM convertor
BWA (Burrows-Wheeler Aligner)
TagCountsTo FASTQ
Merge TagsCounts
TagsToSNP ByAlignment Process
QseqToTBT
File (data structure)
Reference Genome Pipeline
Tags that align to the same region are aligned against one another and SNPs and small indels are identified. Based on the alignments SNPs are propagated to specific lines having that tag into a HapMap file. Parameters: • chromosomes to search for SNPs • bi or tri-allelic SNPs • Indels • Genetic mapping support • Max markers on a chromosome
TagsToSNPByAlignment
13
rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3 SgSBRIL061:633Y5AAXX:2:E8 SgSBRIL013:633Y5AAXX:2:E2 SgSBRIL062:633Y5AAXX: S1_2100 A/G 1 2100 + N N N N N N N R N A N A N N N N N N N S1_2163 T/C 1 2163 + N N N N N N T C T T N N C N T N T N C S1_13837 T/G 1 13837 + N N N N N N N G N N T N N N N N N N S1_14606 C/T 1 14606 + N N C N N N T T T T C N C N N C N N S1_20601 T/A 1 20601 + T N N N N N N A N N N N N N N N N N S1_68332 C/T 1 68332 + N N N N N N N N N N N N N N N N N T S1_68596 A/T 1 68596 + A N N N N N N N N A N N A N N N N N S1_69309 G/A 1 69309 + N G N N N N N A N N N N A N N N N N S1_79955 T/G 1 79955 + N T G T T N T T N N N T T T N N T T S1_79961 T/G 1 79961 + N T T T T N T T N N N T T T N N T T S1_80584 G 1 80584 + N N N N N N N N N N G N G N A G N N S1_80647 C/T 1 80647 + N N N N N N N C N N C T N N N C T N S1_81274 T/G 1 81274 + N N N N N N T G N N N N N N N N N N S1_108834 G/A 1 108834 + N N N N N N N N N N N N N N N N N N S1_112345 T/G 1 112345 + N N N N N N K T N N N N N K G G G N S1_115359 C/T 1 115359 + N N N N N N T C N T N N N N N N N N S1_115362 T/C 1 115362 + N N N N N N N C N N N N T N N N C N S1_115405 G/A 1 115405 + G G A N N G G G G N R N G N G N N N S1_115516 T/G 1 115516 + N N T N N N T T N N T N T T T T N N S1_116694 A/G 1 116694 + N A G N N N G A N N N N A N G A G N S1_119016 C/T 1 119016 + N N N N C N N C N N N N N N N N N N S1_155366 T/C 1 155366 + N T N N N N N Y N N N N C N N T N N
HapMap Format
Why another pipeline?
• The last maize build (21000 taxa) with the discovery pipeline took over 2 weeks.
• Most common alleles have been idenBfied aDer the first few discovery builds
• Use the informaBon from the discovery pipeline to call SNPs in new runs quickly.
• Improve efficiency and automate.
14
GBS bioinformaBcs pipeline Discovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
Fastq
TOPM
GBS bioinformaBcs pipeline Discovery
Tag Counts
SNP Caller
Tags by Taxa
Fastq
TOPM
Genotypes
Filtered Genotypes
15
GBS bioinformaBcs pipeline Discovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
Fastq
ProducCon
TOPM
Fastq
Discovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
Fastq
ProducCon
TOPM
Fastq
TagsOnPhysicalMap (TOPM)
16
GBS bioinformaBcs pipeline Discovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
Fastq
ProducCon
Filtered Genotypes
TOPM
Fastq
GBS bioinformaBcs pipeline Discovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
Fastq
ProducCon
Fastq
Filtered Genotypes
TOPM TOPM
17
GBS bioinformaBcs pipeline Discovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
Fastq
ProducCon
Fastq
Filtered Genotypes
TOPM TOPM
GBS bioinformaBcs pipeline Discovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
Fastq
ProducCon
Fastq
Filtered Genotypes
TOPM TOPM
Genotypes
18
Running the ProducBon Pipeline
• Required Files: – Sequence file (fastq or qseq) – Key file – ProducBon TOPM
• TASSEL 3 Standalone & RawReadsToHapMapPlugin
• Running the Pipeline: – One lane processed at a Bme – HapMap files by chromosome
• ~7 minutes
TesBng ProducBon Pipeline
• Compared HapMap files produced by Discovery Pipeline and ProducBon Pipeline
• Site Comparison: – Discovery 48,139 – ProducBon 47,676 – Difference due to maximum 8 alleles
• 99.98% correlaBon of geneBc distance matrices
19
Shifting to HDF5 • Hierarchical Data Format – supports very
large data sets and complex data structures.
• Widely used in climate and astromonomy communities
• TBT – files can approach 2 Tb in size • Compressed HDF5 can be 40 times smaller • Access times looks very good • Working to fuse TOPM, TBT, and Keyfile
into one HDF5 repository
Edward Buckler USDA-ARS
Cornell University http://www.maizegenetics.net
Why can GBS be complicated? Tools
for filtering, error correction and
imputation.
20
Maize has more molecular diversity than humans and apes combined
Silent Diversity (Zhao PNAS 2000; Tenallion et al, PNAS 2001)
1.34% 0.09%
1.42%
Only 50% of the maize genome is shared between two varieties
Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005 Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010
50%
Plant 1
Plant 2 Plant 3
99%
Person 1
Person 2 Person 3
Maize Humans
21
Maize genetic variation has been evolving for 5 million years
Modern Variation Begins Evolving
Sister Genus Diverges
Zea species begin diverging
Maize domesticated
5mya
4mya
3mya
2mya
1mya
War
m
Plio
cene
C
old
Plei
stoc
ene
Divergence from Chimps
Ardipithecus
Homo erectus
Modern Humans Modern Variation Begins
Australopithecus
What are our expectations with GBS?
22
High Diversity Ensures High Return on Sequencing
• Proportion of informative markers – Highly repetitive – 15% not easily
informative – Half the genome is not shared between two
maize line • Potentially all of these are informative with a
large enough database – Low copy shared proportion (1% diversity)
• Bi-parental information = (1-0.01)^64bp = 48% informative
• Association information = (1-0.05)^64bp= 97% informative
Expectation of marker distribution
Biallelic, 17%
Too Repetitive, 15%
Non-polymor
phic; 18%
Presense/
Absense, 50%
Multiallelic, 34%
Too Repetitive, 15%
Non-polymorphic; 1%
Presense/
Absense, 50%
Biparental population Across the species
23
Sequencing Error
Illumina Basic Error Rate is ~1%
• Error rates are associated with distance from start of sequence – Bad – GBS puts these all at the same
position – Good – Reverse reads can correct – Good – Error are consistent and modelable
24
Reads with errors
• Perfect sequences: 0.9964=52.5% of the 64bp sequences are
perfect 47.5 are NOT perfect The errors are autocorrelated so the proportion of perfect sequence is a little higher, and those with 2 or more is also higher.
Do we see these errors? • Assume 10,000 lines genotyped at
0.5X coverage
Base Type Read # (no SNP)
Read # (w/ SNP)
A Major 4950 4900
C Minor 17 67 (50 real)
G Error 17 17
T Error 17 17
25
Do Errors Matter? • Yes –Imputation, Haplotype
reconstruction • Maybe – GWAS for low frequency
SNPs • No – GS, genetic distance, mapping
on biparental populations
Expectations of Real SNPs
• Vast majority are biallelic • Homozygosity is predicted by
inbreeding coefficient • Allele frequency is constrained in
structured populations • In linkage disequilibrium with
neighboring SNPs
26
HapMap
Process
File (data structure)
Clean Up and Imputation
HapMap
GBSHapMapFiltersPlugin Site Coverage, Taxa Coverage, Inbreeding
Coefficient, LD
Imputation Imputation &
Phasing
HETEROZYGOUS NOT SOLVED YET
INBREDS PARTIALLY SOLVED
Kinship Distance
Phylogeny LD GS
GWAS
MergeDuplicateSNPsPlugin Merge reads from opposite sides
BiParentalErrorCorrectionPlugin Error rate estimation, LD filters
MergeIdenticalTaxaPlugin Error rate estimation, LD filters
Filters in TagsToSNPByAlignmentMTPlugin • Only calls bi-allelic (hard coded now)
– Two most common alleles used • Inbreeding coefficient (-mnF)
– If have inbred samples definitely use, very powerful for errors and paralogues
• Minimum minor allele frequency (-mnMAF) – Very important if do not have other tools for
filtering (bi-parental populations or LD) – Set for >=1% if no other filter method present
27
MergeDuplicateSNPsPlugin
• When restriction sites are less than 128bp apart, we may read SNP from both directions (strands)
• ~13% of all sites • Fusing increases coverage • Fixes errors • -misMat = set maximum mismatch rate • -callHets = mismatch set to hets or not
GBSHapMapFiltersPlugin
• Basic filters for coverage of sites, taxa inbreeding coefficient, and LD
• -mnTCov = minimum taxa coverage (e.g.0.05)
• -mnSCov = minimum site coverage, proportion of taxa with call (e.g. 0.10)
• -mnMAF = minimum minor allele frequency (e.g. 0.01)
28
GBSHapMapFiltersPlugin
• -mnF = minimum inbreeding coefficient (e.g. 0.9) – Don’t use with outcrossers
• -hLD = require that sites are in high local LD, currently parameters are hard coded, so difficult to tune without using the code. – Tests a sliding window of 100 surrounding
sites, and looks for a Bonferonni corrected P<0.01
– Useful but can be slow option. – More work needed here.
Biparental populations Limited range of alleles,
expected allele frequencies, high LD
29
Maize RIL population expectations
• Allele frequency 0% or 50% • Nearby sites should be in very high
LD (r2>50%) • Most sites can be tested if multiple
populations are available
Bi-parental populations allow identification of error, and non-Mendelian segregation
Error
Non-segregating
Segregating
30
Bi-parental populations allow identification of error, and non-Mendelian segregation
Error
Median error rate is 0.004, but there is a long tail of some high error sites
Median
31
BiParentalErrorCorrectionPlugin
• -popM = REGEX population identification(e.g. “Z[0-9]{3}”)
• -popF = population File (not implemented) instead of popM option
• -mxE = maximum error rate (e.g. 0.01); calculated from non-segregating populations
BiParentalErrorCorrectionPlugin • -mnD = distortion from expectation (e.g.
2.0); the test uses both the binomial distribution and this distortion to classify segregation.
• -mnPLD = minimum linkage disequilibrum r2= 0.5; this is calculated within each population, and then the median across segregating populations is used
32
MergeIdenticalTaxaPlugin
• Fuse taxa with the same name. Useful for checks and duplicated runs. Also useful in determining error rates
• -xHets = exclude heterozygotes calls (e.g. true)
• -hetFreq= frequency between hets and homozygous calls (e.g. 0.76)
Product of Filtering
• After filters, in maize we find 0.0018 error rate – AA<>aa = < 0.0018 – AA<>Aa = 0.8 at low coverage
• SNPs in wrong location <~1%. Lower in other species.
33
HapMap
Process
File (data structure)
Clean Up and Imputation
HapMap
GBSHapMapFiltersPlugin Site Coverage, Taxa Coverage, Inbreeding
Coefficient, LD
Imputation Imputation &
Phasing
HETEROZYGOUS Partially SOLVED
INBREDS PARTIALLY SOLVED
Kinship Distance
Phylogeny LD GS
GWAS
MergeDuplicateSNPsPlugin Merge reads from opposite sides
BiParentalErrorCorrectionPlugin Error rate estimation, LD filters
MergeIdenticalTaxaPlugin Error rate estimation, LD filters
Missing Data Two major sources: • Sampling
• Low coverage often used in big genomes with inbred lines
• Differential coverage caused by fragment size biases
• Biological • Region on genome not shared between lines • Cut site polymorphisms
We want to impute the missing sampling but not the biological
34
Standard Imputation
Lots of algorithms: FastPhase, NPUTE, BEAGLE, etc.
These are appropriate for high coverage loci, inbreds, and regions where biological missing is a rare condition
Some can be slow for sample sizes that we have.
FastImputationBitFixedWindow
• Imputation approach focused on speed and large sets of taxa with some closely related individuals.
• Nearest neighbor approach, fixed window sizes
• Strengths: Very accurate <1% error, much faster than other algorithms 100X
• Weakness: Not good a recombination junctions, heterozgyosity
• Code in TASSEL – not plugin, but available
35
Hidden Markov Model TASSEL GBS Imputation
• Developed by Peter Bradbury • Aimed a GBS and biparental populations • Hidden Markov Model • Very accurate at determining boundaries • Works well on Maize NAM inbred lines, and
probably others. • AA <> BB error rate– 0.00005 • AB > AA – 0.0278
• Most problem appears in faulty populations • Available as TASSEL 4.0 plugin
Only 50% of the maize genome is shared between two varieties
Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005 Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010
50%
Plant 1
Plant 2 Plant 3
99%
Person 1
Person 2 Person 3
Maize Humans
36
Mapping all the alleles (TagCallerAgainstAnchor) • Most maize alleles have no position on
the reference map • Map allele presence (TagsByTaxa)
versus a anchor SNP map (HapMap) • 8.7M alleles were mapped in <24 hours
using 100 CPU cluster
Physical and genetic mapping of 8.7 million GBS alleles
Gene$c&and&Physical&Agree&
Gene$c&and&Physical&Disagree&
Not&in&Physical,&Gene$cally&mapped&
Complex&mapping&or&modest&power¤tly&
Consistent&Error&or&Evenly&repe$$ve&
Reads&with&strong&gene/c&and/or&BLAST&posi/on&
Reads&with&weaker&posi/on&hypothesis&
Reads&with&no&hypothesis&(Error&or&even&repe//ve)&
• Only 29% of alleles are simple - physical and genetic agree
• 55% of alleles are easily genetically mappable
• Many complex alleles are rarer, so 71% of alleles are genetic and/or physically interpretable.
• With more samples and better error models perhaps 90% will be useable
Alle
les
Rea
ds
37
Using the Presence/Absence Variants
• In species like maize, this is the majority of the data
• Less subject to sequencing error • Need imputation methods to
differentiate between missing from sampling and biologically missing
Future • Need better integration of Whole Genome
Sequence data with pipeline – Add information on premature cut sites or
mutated cut sites • Use paired-end read information • Full incorporation of presence/absence
variants • Increase range of imputation tools and
phasing for structure populations • Quantitative genotype tools for polyploids/
GS