Sequence File Formats

Page 1: Sequence  File Formats

Sequence File Formats

Page 2: Sequence  File Formats

Sequencing – the old way

• di-deoxy chain terminators (G, A, T, C)

• 4 different reactions

• 35S dCTP

• Electrophoresis through an acrylamide gel

• Transfer gel to blotting paper

• Expose to X-ray film

• Develop film and read sequence

Figure: sequencing gel with lanes G, A, T, C

Page 3: Sequence  File Formats

Chromatograms – Sanger sequencing

• Fluorescent di-deoxy chain terminators
  – Four nucleotides, four “colors”

• Electrophoresis through a polymer

• Read the colors as they pass through a laser/detector

Page 4: Sequence  File Formats

Flowgrams, Ionograms

• Flow nucleotides through a reaction cell, one at a time

• Detect byproducts of incorporation
  – 454 sequencing: pyrophosphate (light)
  – Ion Torrent: hydrogen ions (pH)

Page 5: Sequence  File Formats

Colorspace – SOLiD sequencing

• Sequence by ligation (detects 2 bases/cycle)
  – Flow 4 pools of 4 oligonucleotides over the reaction wells (each pool is labeled with a different fluorescent dye)
  – Detect dye, cleave off oligo-dye adaptor and repeat

Page 6: Sequence  File Formats

Process is repeated using nested primers

Page 7: Sequence  File Formats

SOLiD color codes


Page 8: Sequence  File Formats

Colorspace csfasta

Figure: color-code matrix (axes: 1st base, 2nd base)

Page 9: Sequence  File Formats

fasta, multifasta

• Fasta (.fasta, .fa, .fas, .fsa, .fna)

>Sequence1
CAATCATAGAGACAGCTGTTGTATCGTTACGTCATTCATGCAAGACCGCATTTAACGGCCAAGGCATTTCGCTACCTTAG

• Multifasta (.mfa, .mpfa)

>Sequence1
CAATCATAGAGACAGCTGTTGTATCGTTACGTCATTCATGCAAGACCGCATTTAACGGCCAAGGCATTTCGCTACCTTAG
>Sequence2
ACCAGGAAGGTGGCCGACGCCAGCCGCTGATGCCACTCCACCCGCCGCGCACCGAGTCCAGGAGCGCGGACAAGGGGATT
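As a rough illustration of how little structure these files have, here is a minimal Python sketch that reads fasta or multifasta text into a dictionary. The function name `parse_fasta` is made up for this example, and it assumes every record starts with a ">" header line and that sequence lines may be wrapped.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect fasta/multifasta records into {header: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # header line: drop the leading ">"
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence may span several lines
    return records


if __name__ == "__main__":
    example = [">Sequence1", "CAATCATAGA", "GACAGCTGTT", ">Sequence2", "ACCAGGAAGG"]
    for header, sequence in parse_fasta(example).items():
        print(header, len(sequence), sequence)
```

A single-record fasta file is just the special case where the dictionary ends up with one entry.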

Page 10: Sequence  File Formats

Colorspace fasta

• .csfasta

>Sequence1
T0123020301120301012020212330213230
>Sequence2
T2130322221303120001320310030123123

Color codes (1st base in rows, 2nd base in columns):

               2nd base
               A  C  G  T
1st base   A   0  1  2  3
           C   1  0  3  2
           G   2  3  0  1
           T   3  2  1  0
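The color-code matrix can be turned into a small decoder. The sketch below is illustrative only (the name `decode_colorspace` is invented here): it relies on the observation that, with A=0, C=1, G=2, T=3, each color in the matrix equals the XOR of the two base codes, so the next base is the previous base XOR the color. The leading letter of a csfasta read is treated as the known primer base and is not included in the output.

```python
BASES = "ACGT"  # index in this string is the 2-bit code used below


def decode_colorspace(read: str) -> str:
    """Translate a csfasta read (primer base + color digits) into bases."""
    previous = BASES.index(read[0])  # known primer base, not part of the read
    decoded = []
    for color in read[1:]:
        previous ^= int(color)  # color = code(first base) XOR code(second base)
        decoded.append(BASES[previous])
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_colorspace("T0123"))  # TGAT
```

Because every base depends on the one before it, a single wrong color changes all downstream bases in a naive translation like this one, so treat it as a teaching aid rather than a conversion tool.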

Page 11: Sequence  File Formats

Sequence Quality

• Some sequence calls have better quality than others

Page 12: Sequence  File Formats

Quality values

• Q = -10 × log10(P)

Q = quality value
P = probability of error

Phred quality score   Probability of incorrect call   Base call accuracy
10                    1 in 10                         90%
20                    1 in 100                        99%
30                    1 in 1000                       99.9%
40                    1 in 10000                      99.99%
50                    1 in 100000                     99.999%
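The formula and the table can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are chosen for this example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q{q}: P(error) = {p:g}, accuracy = {100 * (1 - p):g}%")
```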

Page 13: Sequence  File Formats

.sff files

Viewed as text, a binary .sff file is mostly unprintable bytes (^@ is a NUL byte), with the flow order, read names and base calls embedded as readable text. Excerpt, with long runs replaced by "...":

.sff^@^@^@^A^@^@^@^@ ... ^@^Vp<D8>^B0^@^D^B^H^ATACGTACGTCTGAGCATCGATCGATGTACAGC ... TACGTACGTCAG^@^@^@^@^@^@ ^@^K^@^@^@4^@^E^@%^@^@^@^@PXDEO:13:42^@^@^@^@^@^@4^@^A^@q ... TCAGGCAGATTGTGACGAGGCTGAGACTGCCCAAGGCACACAGGGGGTAGGG^M^L^Q^Z^S^Y ...

Page 14: Sequence  File Formats

Roche 454 .sff Files – common header

Magic Number: 0x2E736666
Version: 0001
Index Offset: 110544
Index Length: 3173
# of Reads: 35
Header Length: 840
Key Length: 4
# of Flows: 800
Flowgram Code: 1
Flow Chars: TACGTACGTACGTACG... (TACG repeated for all 800 flows)
Key Sequence: TCAG
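The fields listed above sit at the very start of the file, so they can be read with Python's `struct` module. The sketch below assumes the commonly documented SFF layout (big-endian, a 31-byte fixed part followed by the flow characters and the key sequence); the function name and the dictionary keys are made up for this example, and it is not a replacement for sffinfo.

```python
import struct

# Fixed part of the common header, big-endian (assumed layout):
# magic (uint32), version (4 bytes), index offset (uint64), index length (uint32),
# number of reads (uint32), header length (uint16), key length (uint16),
# number of flows (uint16), flowgram format code (uint8)
COMMON_HEADER = struct.Struct(">I4sQIIHHHB")
SFF_MAGIC = 0x2E736666  # the ASCII bytes ".sff"


def parse_sff_common_header(data: bytes) -> dict:
    """Decode the common header from the first bytes of an .sff file."""
    (magic, version, index_offset, index_length, n_reads,
     header_length, key_length, n_flows, flowgram_code) = COMMON_HEADER.unpack_from(data, 0)
    if magic != SFF_MAGIC:
        raise ValueError("not an SFF file (bad magic number)")
    start = COMMON_HEADER.size
    flow_chars = data[start:start + n_flows].decode("ascii")
    key_start = start + n_flows
    key_sequence = data[key_start:key_start + key_length].decode("ascii")
    return {
        "version": version.hex(),
        "index_offset": index_offset,
        "index_length": index_length,
        "number_of_reads": n_reads,
        "header_length": header_length,
        "key_length": key_length,
        "number_of_flows": n_flows,
        "flowgram_code": flowgram_code,
        "flow_chars": flow_chars,
        "key_sequence": key_sequence,
    }
```

Usage would look like `parse_sff_common_header(open("reads.sff", "rb").read(4096))`, where `reads.sff` is a placeholder file name and 4096 bytes is simply assumed to be enough to cover the header.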

Page 15: Sequence  File Formats

.sff files - sequence specific information

• >F7K88GK01BMPI0
  Run Prefix: R_2009_12_18_15_27_42_
  Region #: 1
  XY Location: 0551_2346

• Run Name: R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname
  Analysis Name: D_2009_12_19_01_11_43_XX_fullProcessing
  Full Path: /data/R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname/D_2009_12_19_01_11_43_XX_fullProcessing/

• Read Header Len: 32
  Name Length: 14
  # of Bases: 500
  Clip Qual Left: 15
  Clip Qual Right: 490
  Clip Adap Left: 0
  Clip Adap Right: 0

• Flowgram: 1.03 0.00 1.01 0.02 0.00 0.96 0.00 1.00 0.00 1.04 0.00 0.00 0.97 0.00 0.96 0.02 0.00 1.04 0.01 1.04 0.00 0.97 0.96 0.02 0.00 1.00 0.95 1.04 0.00 0.00 2.04 0.02 0.03 1.05 0.99 0.01 2.84 0.03 0.05 0.97 0.12 0.00 1.01 0.05 0.97 0.01 2.89 0.04 0.09 1.05 0.15 0.00 2.84 0.06 1.00 0.01 0.13 1.01 0.09 0.98 0.01 0.05 1.01 0.06 0.00 1.04 3.72 0.03 0.00 0.96 1.97 0.04 0.01 1.97 0.12 0.98 0.02 0.08 0.95 0.12 ...

  Flow Indexes: 1 3 6 8 10 13 15 18 20 22 23 26 27 28 31 31 34 35 37 37 37 40 43 45 47 47 47 50 53 53 53 55 58 60 63 66 67 67 67 67 70 71 71 74 74 76 79 82 83 86 86 88 88 91 93 96 97 99 102 105 ...

  Bases: tcagatcagacacgCCACTTTGCTCCCATTTCAGCACCCCACCAAGCACAAGGCTGTCATCCCAATTGGACGGACAGATATGAGGTTAGCATTGGAAACCAATTCAGTCCCTAATTATTCACGACTGAACCCAGCGACAATTGGACATGGATTCATTTTTCAACTTGATTTGTTGTTGTAAAAGCACTGAAGAAGATGCCGCAACAAGAGCTTCCAAAGTTTCCCACCGGATCGACGGTACCCTTTCCCTATGAATCTCCTTATCCTCAGCAGACAGCTTTGATGGACACGCTGCTCGAGTGTTTGCAGCAAAAGGATCACGATGATTCAACATGGCGCCAAACCAATGACAGCCATAGCAAGAACAAGAAGAAACCCCGTGCGGCCGTGATGATGTTGGAGTCTCCTACCGGCACTGGCAAGTCTCTATCTTTGGCGTGTAGTGCCATGGCGTGGCTCAAGTACTGCGAACAACGAGATTTGACTGCAGaagaagaatc

  Quality Scores: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 38 38 40 40 40 39 39 39 40 34 34 34 40 40 40 40 39 26 26 26 26 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 ...
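To see how the flowgram relates to the base calls, the following sketch rounds each flow value to a whole number of incorporated nucleotides and emits that many copies of the nucleotide flowed in that step. It assumes the TACG flow order from the common header; real base callers are more sophisticated than simple rounding, and the function name is invented for this example.

```python
def flowgram_to_bases(flow_values, flow_order="TACG"):
    """Naive base calling: round each flow signal to a homopolymer length."""
    bases = []
    for i, value in enumerate(flow_values):
        count = int(value + 0.5)  # nearest whole number of incorporations
        nucleotide = flow_order[i % len(flow_order)]  # flows cycle through T, A, C, G
        bases.append(nucleotide * count)
    return "".join(bases)


if __name__ == "__main__":
    # First 13 flow values of read F7K88GK01BMPI0 shown above
    flows = [1.03, 0.00, 1.01, 0.02, 0.00, 0.96, 0.00, 1.00,
             0.00, 1.04, 0.00, 0.00, 0.97]
    print(flowgram_to_bases(flows))  # TCAGAT
```

The result matches the first six bases of the read (tcagat), and the flows that produced a base (1, 3, 6, 8, 10, 13) match the start of the Flow Indexes line.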

Page 16: Sequence  File Formats

Quality Files

.phd

BEGIN_SEQUENCE CLS_AGTC_1a73_1_x_C10_FLCN12R_x_A08

BEGIN_COMMENT

CHROMAT_FILE: CLS_AGTC_1a73_1_x_C10_FLCN12R_x_A08.ab1
BASECALLER_VERSION: KB 1.2
TRACE_PROCESSOR_VERSION: KB 1.2
QUALITY_LEVELS: 99
TIME: Wed Dec 07 19:41:26 2011
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 16022
TRIM: -1 -1 -1.000000e+000
TRACE_PEAK_AREA_RATIO: -1.000000e+000
CHEM: term
DYE: big

END_COMMENT

BEGIN_DNA

T 3 7
G 3 28
G 4 44
A 6 57
A 5 70
G 3 81
C 4 101

.qual

>contig00016 length=237 numreads=9
20 3 10 64 14 64 9 64 4 19 64 64 4 64 64 64 64 21 64 37 64 64 64 64 64 64 41 12 64 64 64 64 64 32 64 64 64 64 64 64 64 13 37 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 45 64 64 64 64 64 64 29 64 64 64 64 64 20 64 64 64 64 64 39 64 64 64 64 64 64 64 64 64 64 64 20 64 64 64 64 64 64 64 64 64 64 64 40 64 64 64 64 64 20 64 64 64 64 64 20 64 64 64 64 64 23 64 64 64 64 64 64 64 64 64 64 64 39 64 64 64 64 64 64 16 64 64 64 64 64 64 5 64 64 64 64 64 64 64 64 64 64 64 33 64 64 64 64 64 64 4 64 64 64 64 64 64 64 64 64 64 64 64 29 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 39 64 64 64 64 64 64 48 64 64 64 64 64 64 64 48 64 64 64 50 64 64 64 64 64 64 64 32 64 64 34 64 64 33 64 64 64 48 64 64 64 64
>contig00017 length=161 numreads=9
64 64 64 18 17 64 64 43 20 64 24 41 64 64 30 2 53 64 64 64 64 35 64 64 64 64 64 12 64 64 64 64 31 64 64 28 64 16 64 64 64 64 64 41 64 64 17 64 64 34 25 64 30 64 64 64 64 64 64 64 64 20 64 64 64 64 47 64 64 64 40 64 61 64 64 34 64 64 64 64 64 22 64 64 64 64 64 64 64 64 64 64 64 64 64 58 64 64 64 64 64 64 64 64 64 64 64 58 64 64 64 64 64 64 64 64 64 64 37 43 64 64 52 64 64 64 64 64 64 64 64 64 64 64 60 64 64 49 64 64 64 64 64 64 64 64 20 29 64 64 64 64 64 17 64 21 3 64 21 21 3

Page 17: Sequence  File Formats

.fastq, .fsq, .fq

• Incorporates sequence calls and quality values into a single file:

@PXDEO:18:45
ATATATATAAAATATAAAAAGGGTTTTTTTTAAAAAAAATTAATCCAGCAATAATTCCAAATTATTTTGAGGCCGAATCGGATGGGTTATTTTTTTTTTTATAAAAAATTATTTGCAACGAGCCATTATATAACAAA
+
9=?>??AAAB@-:0+,000&0:.:;===;=(<<<<677(5151552766>;:>9@8=7:>2=7===>.>=?7=6:<7::<4:<99'0(0*---------%*-*4566)60133,366035665)+0/488443+...

Page 18: Sequence  File Formats

Quality scores in ASCII format

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger         Phred+33,  raw reads typically (0, 40)
X - Solexa         Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+  Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+  Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+  Phred+33,  raw reads typically (0, 41)
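Converting a quality string back to numbers is a matter of subtracting the offset from each character's ASCII code. The sketch below shows this for the Phred+33 and Phred+64 encodings in the chart, plus a rough offset guess based on the chart's observation that the +64 encodings never use characters below ASCII 59. Both function names are invented for this example, and the guess is a heuristic that can be fooled by uniformly high-quality Phred+33 data.

```python
def decode_quality(quality_string, offset=33):
    """Turn an ASCII-encoded quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Heuristic: any character below ASCII 59 can only occur with offset 33."""
    return 33 if min(ord(ch) for ch in quality_string) < 59 else 64


if __name__ == "__main__":
    print(decode_quality("9=?>"))             # [24, 28, 30, 29]
    print(decode_quality("hhhh", offset=64))  # [40, 40, 40, 40]
    print(guess_offset("9=?>??AAAB@-:0+,"))   # 33
```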

Page 19: Sequence  File Formats

Why does the encoding start at 33?
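One way to explore this question is to ask Python which ASCII codes produce a visible character. The snippet below is only an exploration aid; it treats a character as "visible" when it is printable and not whitespace, which is an assumption made for this example.

```python
def first_visible_ascii():
    """Return the lowest ASCII code whose character is printable and not whitespace."""
    return next(code for code in range(128)
                if chr(code).isprintable() and not chr(code).isspace())


if __name__ == "__main__":
    code = first_visible_ascii()
    print(code, repr(chr(code)))  # 33 '!'
    # Codes 0-31 are control characters and 32 is the space character
    print([chr(c).isprintable() for c in (0, 9, 10, 31)])  # [False, False, False, False]
```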

Page 20: Sequence  File Formats

File converters

• .sff → fasta, qual
  – sffinfo (Newbler tool, www.Roche.com)
  – sff_extract (bioinf.comav.upv.es/sff_extract/)

• .sff → .fastq
  – SFF workbench (www.dnabaser.com/download/)
  – sff_extract (bioinf.comav.upv.es/sff_extract/)
  – Sff2fastq (github.com/indraniel/sff2fastq)

• .fastq → .fasta (& .qual, if desired)
  – Prinseq-lite.pl

or

cat file_in.fastq | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > file_out.fasta
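For readers who find the Perl one-liner hard to follow, here is the same idea as a Python sketch. Like the one-liner, it assumes every FASTQ record occupies exactly four lines (header, sequence, "+", quality); the function name is invented for this example, and the file names in the usage note are the placeholders used above.

```python
def fastq_to_fasta(lines):
    """Yield fasta lines from an iterable of 4-line FASTQ records."""
    for i, raw in enumerate(lines):
        line = raw.rstrip("\n")
        position = i % 4  # 0 = header, 1 = sequence, 2 = "+", 3 = quality
        if position == 0:
            yield ">" + line[1:]  # swap the leading "@" for ">"
        elif position == 1:
            yield line
        # the "+" line and the quality line are dropped


if __name__ == "__main__":
    record = ["@PXDEO:18:45\n", "ATATATATAAAA\n", "+\n", "9=?>??AAAB@-\n"]
    print("\n".join(fastq_to_fasta(record)))
```

To convert a file, something like `open("file_out.fasta", "w").write("\n".join(fastq_to_fasta(open("file_in.fastq"))) + "\n")` would do. Counting lines rather than testing for "@" matters because a quality line can itself begin with "@".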

Page 21: Sequence  File Formats

Quality Metrics

• Was the sequencing run successful?
  – Number of Phred 20 (or 30) bases
  – Average read length (@Q20)

• How much usable data?
  – Genome assembly
    • Total high quality bases
  – RNA-seq/ChIP-seq
    • Total number of mappable reads
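Metrics of this kind are simple counts over the per-base quality scores. The sketch below computes a few of them for a list of reads, where each read is given as a list of integer quality scores; the function name, the dictionary keys and the default threshold of 20 are choices made for this example and not the output of any particular tool.

```python
def quality_summary(reads, threshold=20):
    """Summarise a run from per-read lists of Phred quality scores."""
    total_bases = sum(len(scores) for scores in reads)
    good_bases = sum(1 for scores in reads for s in scores if s >= threshold)
    return {
        "reads": len(reads),
        "total_bases": total_bases,
        "bases_at_or_above_threshold": good_bases,
        "fraction_at_or_above_threshold": good_bases / total_bases if total_bases else 0.0,
        "mean_read_length": total_bases / len(reads) if reads else 0.0,
    }


if __name__ == "__main__":
    example = [[40, 40, 38, 12], [34, 34, 19, 8, 40]]
    print(quality_summary(example))
```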

Page 22: Sequence  File Formats

Sequence Trimming/Masking/Filtering

• Trimming
  – Barcode & adapter sequences
  – Poor quality sequence at the starts/ends of reads

• Masking
  – Poor quality sequence in the middles of reads

• Filtering
  – Sequence reads that are shorter than a pre-defined threshold
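End trimming and length filtering can be sketched in a few lines. The version below simply walks in from each end until it meets a base at or above the threshold; real trimmers often use sliding windows or running sums instead, and the function names and default thresholds here are assumptions made for illustration.

```python
def trim_ends(qualities, threshold=20):
    """Return (start, end) slice bounds that drop low-quality bases from both ends."""
    start, end = 0, len(qualities)
    while start < end and qualities[start] < threshold:
        start += 1  # trim from the 5' end
    while end > start and qualities[end - 1] < threshold:
        end -= 1  # trim from the 3' end
    return start, end


def passes_length_filter(start, end, min_length=50):
    """Keep a read only if enough bases remain after trimming."""
    return end - start >= min_length


if __name__ == "__main__":
    scores = [16, 16, 21, 9, 30, 30, 12]
    start, end = trim_ends(scores)
    print(start, end, scores[start:end])  # 2 6 [21, 9, 30, 30]
    print(passes_length_filter(start, end, min_length=5))  # False
```

Note that the low score in the middle (9) survives trimming; that is the case masking is meant to handle.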

Page 23: Sequence  File Formats

Quality Trimming/Masking

>Sequence1
16 16 21 9 10 13 14 12 8 8 9 16 24 21 19 19 19 25 25 33 35 35 34 34 34 34 34 34 34 40 45 45 56 56 56 51 51 40 45 37 37 37 40 40 40 40 40 40 39 39 39 40 40 40 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 51 45 39 39 39 39 39 39 39 39 39 39 39 39 39 39 40 51 51 51 51 51 56 56 56 56 56 56 56 40 35 35 35 35 35 39 40 40 45 45 45 45 45 51 56 40 39 39 39 39 39 45 45 45 45 40 40 40 39 35 35 40 40 40 40 40 40 38 38 38 39 25 23 18 10 8 9 23 31 51 51 45 43 43 43 43 43 43 43 43 43 43 43 43 56 56 56 56 56 56 56 56 43 43 43 43 43 43 45 45 45 51 51 56 51 51 51 51 51 51 56 51 45 43 43 56 56 56 56 56 56 51 51 51 45 45 45 40 40 40 44 42 38 38 40 40 40 51 56 56 56 56 56 46 46 51 51 51 51 56 56 56 56 51 40 45 45 40 40 40 42 42 42 45 45 42 42 42 42 56 42 42 40 40 34 37 33 40 40 40 44 48 48 48 29 29 29 26 32 29 32 32 32 32 33 44 48 56 40 40 40 40 40 40 40 40 40 37 34 34 37 40 40 40 40 37 34 34 48 40 32 28 25 25 25 34 48 48 48 40 40 32 29 24 25 29 40 40 40 40 40 40 33 33 37 40 40 40 43 43 42 42 42 44 44 56 56 56 56 56 40 35 34 33 33 40 40 40 40 40 40 29 29 34 29 29 29 29 40 29 34 25 27 23 23 21 23 18 20 25 25 25 32 32 32 32 29 18 20 14 16 16 17 17 16 22 20 18 25 19 14 16 15 26 27

Page 24: Sequence  File Formats

Masked sequence

>Sequence1

CCAGAAACTACGCGGTGGCGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGA

TCGTCCGCCAGACTAAAGAAGTCCAAGAGTTGGCTCGCCAAAACGCGCTAAAAACG

CAAAAAGCGGCGACCAGTAGANNNNAGGCGAGGCAGGAAGAACAAGCCAACTTTTG

GGGTTAACGACTATGTTTTCGTCAAGAAAAAAGGGTTTCCGACGACCGCACCGACG

ACCAGATTGGATTCACAGTGGACCGGACCATGGCAGATTCTAGAAGAACGAGGATA

TAGCTATGTTTTGGACGTACCTGAATCGTTTAAAGGAAAAAATTTGTTCCACGCAG

ACCGCCTCCGCAAAGCCGCAATGGACCCATTACCACAACAGAAAAGAGAGCCGCCT

CCGCCAGAAGAGATCAACGCCAGAGTTTGTGGTCGATAAAGTTTTAGCGTCCCGAT

TATTTGGCCGGAGTAAGATATTGCAATACCAGGTCGCATGGCAAGGATGTGATCCA

GACGACACGTGGTACCCGGCTGAAAACTTCAAGAATTCAGCGACAGCCCTTGACGA

CTTCCACAAGAAGTAC
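Masking replaces low-quality calls with N while leaving the read length unchanged, as in the NNNN stretch above. A minimal sketch, assuming the sequence and its quality scores are already paired up; the function name and the threshold of 20 are illustrative choices.

```python
def mask_low_quality(sequence, qualities, threshold=20, mask_char="N"):
    """Replace every base whose quality is below the threshold with mask_char."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality lists must have the same length")
    return "".join(base if q >= threshold else mask_char
                   for base, q in zip(sequence, qualities))


if __name__ == "__main__":
    print(mask_low_quality("GTAGAGCTTAGG", [39, 25, 23, 18, 10, 8, 9, 23, 31, 51, 51, 45]))
    # GTANNNNTTAGG
```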

Page 25: Sequence  File Formats

Sequence header information (Illumina)

@M01478:6:000000000-A40C5:1:1101:16859:1439 1:N:0:1
ATCGTTTCGGAGCAAGGCAACTGTNTCAGGCACCATGAAGTTGAGCTATTCTACTGCGCCAACCTTTGCGAGATAAATCGTCNTGCCNTNNTTATCANCGTCAATTGGAANTCAGATGTGCCACCNNAAN
+
ABBBAABFBBBBGEGGFGGGGGHF#AAFF2AGFGHGHHHHHHFHFFDGFGHHHHGHEGGGGCGGGHFABEEGFFHHEGHEGE#BBFG#?##???FFH#??FEFGHHEHHG#??FFEDGGGFFHFH##??#

Fields of the header line, in order:

• Machine name (M01478)
• Run number (6)
• Flowcell ID (000000000-A40C5)
• Flowcell lane (1)
• Tile in flowcell (1101)
• X-coordinate in tile (16859)
• Y-coordinate in tile (1439)
• Member of pair, 1 or 2 (1)
• Read filtered? Y or N (N)
• Control bits: 0 if none are on, otherwise an even number (0)
• Index sequence used (1)
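Since the header is colon-delimited, with a single space between the read name and the pair/filter fields, it can be split into named fields in a few lines. The sketch below assumes exactly the eleven-field layout listed above; the function name and dictionary keys are invented for this example, and headers from other instruments or software versions may not fit.

```python
def parse_illumina_header(header):
    """Split an Illumina FASTQ header line into named fields."""
    name, _, description = header.lstrip("@").strip().partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    pair, filtered, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "pair_member": int(pair),
        "filtered": filtered == "Y",
        "control_bits": int(control),
        "index": index,
    }


if __name__ == "__main__":
    print(parse_illumina_header("@M01478:6:000000000-A40C5:1:1101:16859:1439 1:N:0:1"))
```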

Page 26: Sequence  File Formats

Today’s Exercises

• Convert different file formats

• Evaluate sequence data quality using FastQC

• Trim sequence reads to improve data quality

• Re-test trimmed data using FastQC

Page 27: Sequence  File Formats

Tips For a Productive Time

• Practice using tab-completion

• Make sure you execute all of the steps preceded by check boxes

• Tick off/fill-in the check boxes after you have (successfully) completed each command

• Do not skip over the text between the check boxes
  – It provides information designed to aid your understanding of what you are doing

• ASK QUESTIONS