Sequence data formats
A short guide on sequencing data formats
Data formats
Sequence and Quality
• Base calls
• Quality of base calls
A T G T A G C A C G
29 28 33 18 26 31 18 34 32 39
• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
Q10 probability of error = 10% (one in every 10 bases)
Q20 probability of error = 1%
Q30 probability of error = 0.1%
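The three values above follow from the standard Phred definition Q = -10 log10(p). A minimal sketch of the conversion in both directions (the function names are illustrative, not taken from any library):

```python
import math

def phred_to_error_prob(q):
    """Error probability p for a Phred quality value Q: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality value for an error probability p, rounded to an integer."""
    return round(-10 * math.log10(p))

for q in (10, 20, 30):
    # Q10 -> 10%, Q20 -> 1%, Q30 -> 0.1%
    print(f"Q{q}: probability of error = {phred_to_error_prob(q):.1%}")
```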
Plain old FASTA
Fasta sequence
>identifier description
atcgtaggctttcggctata
gctaatgtagctatattgtc
Fasta qual
>identifier description
21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40
29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28
A few notes in advance
Numbers can be represented by letters through ASCII codes
http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters
Number code
33 !
34 "
35 #
36 $
37 %
... ...
64 @
65 A
66 B
... ...
121 y
122 z
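The same mapping can be checked in Python with the built-in ord() and chr() functions. This is only a small illustration of the table above:

```python
# ord() gives the ASCII code of a character, chr() goes the other way.
characters = '!"#$%@AB'
codes = [ord(c) for c in characters]

for code, char in zip(codes, characters):
    print(code, char)

# Converting the codes back reproduces the original characters.
round_trip = "".join(chr(code) for code in codes)
```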
FastQ
• One name, multiple formats
• Stores sequence and quality per base in the same file
• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
• Line 2 is the raw sequence letters.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID
GATTTGGGGTTCAAAGCAGTAT
CGATCAAATAGTAAATCCATTT
GTTCAACTCACAGTTT
+SEQ_ID
!''*((((***+))%%%++)(%
%%%).1***-+*''))**55CCF>
>>>>>CCCCCCC65
@SEQ_ID
GATTTGGGGTTCAAAGCAGTAT
CGATCAAATAGTAAATCCATTT
GTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%
%%%).1***-+*''))**55CCF>
>>>>>CCCCCCC65
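A minimal sketch of a reader for the record layout described above, including records whose sequence and quality are wrapped over several lines. Because a quality line may itself start with '@' or '+', the sketch reads quality lines until it has as many symbols as the sequence has letters. The function names parse_fastq and wrap are illustrative; for real data an established parser (for example the one in Biopython) is the safer choice.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of lines."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' title line, got {header!r}")
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("record ended before the '+' line")
        seq = "".join(seq_parts)
        # Quality lines run until they cover the whole sequence.
        qual_parts, qual_len = [], 0
        while qual_len < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError("file ended inside the quality block")
            qual_parts.append(line)
            qual_len += len(line)
        qual = "".join(qual_parts)
        if len(qual) != len(seq):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual

def wrap(text, width=22):
    """Split text into chunks of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]

seq = "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT"
qual = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"

single_line = ["@SEQ_ID", seq, "+", qual]
wrapped = ["@SEQ_ID", *wrap(seq), "+SEQ_ID", *wrap(qual)]

# Both layouts describe the same record.
records_single = list(parse_fastq(single_line))
records_wrapped = list(parse_fastq(wrapped))
```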
Illumina output formats
.seq.txt
.prb.txt
Illumina FASTQ (ASCII – 64 is Illumina score)
Qseq (ASCII – 64 is Phred score)
Illumina single line format
SCARF
FastQ Quality
• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
• Two different equations have been in use. The first is the standard Sanger variant to assess the reliability of a base call, otherwise known as the Phred quality score: Q = -10 log10(p)
• Solexa (now Illumina) originally used a different mapping, encoding the odds p/(1-p) instead of the probability p: Q = -10 log10(p/(1-p))
• The two scores differ at low quality values
• From about Q20 upward the difference is negligible
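A short numerical sketch of the two mappings, assuming the usual definitions Q = -10 log10(p) for the Phred score and Q = -10 log10(p/(1-p)) for the Solexa score. The function names are illustrative.

```python
import math

def phred_quality(p):
    """Phred (Sanger) score from the error probability p."""
    return -10 * math.log10(p)

def solexa_quality(p):
    """Solexa score from the odds p / (1 - p)."""
    return -10 * math.log10(p / (1 - p))

def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

# The gap between the two scores shrinks as the error probability drops.
for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p={p}: Phred={phred_quality(p):.2f}  Solexa={solexa_quality(p):.2f}")
```

At p = 0.5 the Solexa score is 0 while the Phred score is about 3; at p = 0.01 the two differ by less than 0.05.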
Different mappings
Platform                         Quality score   ASCII codes
Sanger                           0 to 93         33 to 126
Solexa/Illumina 1.0              -5 to 62        59 to 126
Illumina 1.3 to 1.7              0 to 62         64 to 126
Illumina 1.5 to 1.7              0 and 1 are no longer used; 2 marks the end of the high-quality part of a read (but may occur in the middle of a read as well)
Illumina 1.8 (Sanger encoding)   0 to 93         33 to 126
PacBio                           0 to 93         33 to 126
Ion Torrent                      0 to 93         33 to 126
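A minimal sketch of decoding a quality string with the two ASCII offsets in use (33 for Sanger-style encodings, 64 for the older Illumina ones). The helper guess_offset is a rough heuristic added for illustration only: characters with codes below 59 can only occur with offset 33, but a read made up entirely of high Sanger qualities would be misclassified, so it is best applied to many reads at once.

```python
def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of integer scores."""
    return [ord(char) - offset for char in quality_string]

def guess_offset(quality_string):
    """Heuristic guess of the ASCII offset (33 or 64) from the lowest character."""
    lowest = min(ord(char) for char in quality_string)
    # Codes below 59 are not valid in any offset-64 encoding.
    return 33 if lowest < 59 else 64

sanger_scores = decode_quality("!5I", offset=33)
illumina_scores = decode_quality("@Th", offset=64)
```

Both example strings decode to the same scores (0, 20 and 40), which shows why the encoding has to be known before quality filtering.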
There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.
Standard flowgram format (SFF)
The 454 equivalent of the ABI chromatogram files. An SFF file contains:
• the flowgram,
• the called sequence,
• the qualities,
• the recommended quality and adaptor clipping.
• SFF files are binary; several tools can extract the sequences from them
• Extracted output is either fasta + fasta.qual or fastq
• Qualities use the Sanger encoding
SAM (BAM) format
• Text format that stores sequence and quality data in a series of tab-delimited ASCII columns
• Stores alignment information against a given reference
• SAM is the human-readable version of BAM (which is compressed and indexed for fast parsing)
• Can be converted into each other with SAMtools
• Can be converted to FastQ or even Fasta
• Common output format of workflows
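A minimal sketch of reading one SAM alignment line and writing it back out as a FASTQ record. The eleven mandatory column names follow the SAM specification; the example line is made up for illustration. Header lines (starting with '@') are not handled here, and reads aligned to the reverse strand are stored reverse-complemented in SAM, so a real conversion (for example with SAMtools) has to undo that.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INTEGER_FIELDS = ("flag", "pos", "mapq", "pnext", "tlen")

def parse_sam_line(line):
    """Split one alignment line into a dict of the 11 mandatory fields."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns))
    for key in INTEGER_FIELDS:
        record[key] = int(record[key])
    record["tags"] = columns[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record

def sam_record_to_fastq(record):
    """Format a parsed record as a four-line FASTQ entry (forward-strand reads only)."""
    return f"@{record['qname']}\n{record['seq']}\n+\n{record['qual']}\n"

# Made-up alignment line, used only to exercise the functions.
example = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tGATTTGGGGT\t!''*((((**\tNM:i:0"
record = parse_sam_line(example)
fastq_text = sam_record_to_fastq(record)
```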
Information on SAM/BAM
http://samtools.github.io/
https://github.com/samtools/hts-specs
http://genome.sph.umich.edu/wiki/SAM
PacBio - SMRT Cell
• A PacBio SMRT-Cell run is packaged as .tgz file (gzipped tar format)
• Big (4-14 GB / SMRT Cell)
• Check the required folder structure, otherwise the file cannot be loaded in the SMRT-Portal database
• Contains several .h5 (HDF5 format) files
• Contains metadata in XML format
• Should be read as one package by SMRT-Portal
• SMRT Portal can convert the files to standard FASTQ
Documentation: https://github.com/PacificBiosciences/SMRT-Analysis/wiki
File format: https://s3.amazonaws.com/files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf
Tools for converting formats
• FASTX-Toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
• BioPython
http://www.biopython.org
• SFF extract or now seq_crumbs
http://bioinf.comav.upv.es/seq_crumbs/
• Other: PrinSeq, ea-utils, bedtools, sambamba
MD5 Checksums
• Large files can be damaged during file transfers
• MD5 checksums may help to detect corrupt files
• If possible ask for MD5 checksums
Example (on Linux):
md5sum -b largefile.fastq.gz > largefile.fastq.gz.md5
The .md5 file then contains:
a3672a3d4185acc49c7fa4460f1167ab *largefile.fastq.gz
To verify the file after the transfer:
md5sum -c largefile.fastq.gz.md5
largefile.fastq.gz: OK
Windows tool: WinMD5 (http://winmd5.com/)
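Where md5sum is not available, the same checksum can be computed with Python's standard hashlib module. A minimal sketch; the function name md5_of_file is illustrative. The file is read in chunks so that large sequence files do not have to fit in memory.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be compared with the checksum supplied by the sequence provider, for example md5_of_file("largefile.fastq.gz").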
Alternative fingerprinting: Pfff
http://biit.cs.ut.ee/pfff/
Compression formats
Compression   Extension   Command line to decompress
GZIP          .gz         gunzip <file>    or   gzip -d <file>
BZIP2         .bz2        bunzip2 <file>   or   bzip2 -d <file>
zip           .zip        unzip <file>
rar           .rar        unrar x <file>
7z            .7z         7za e <file>
Files come in various compression formats. All of them can be read under Linux;
Windows might need extra software to extract the files.
The most common formats are .gz and .bz2.
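Files in the two most common formats can also be read directly, without unpacking them first. A minimal sketch using Python's standard gzip and bz2 modules; the helper name open_maybe_compressed is illustrative and the format is chosen from the file extension only.

```python
import bz2
import gzip

def open_maybe_compressed(path):
    """Open a plain, .gz or .bz2 file for reading as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")
```

Used as a context manager, for example `with open_maybe_compressed("largefile.fastq.gz") as handle:`, it yields the decompressed lines one at a time.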
Conclusions
• Be aware of the different file formats for sequence and quality data
• Data integrity: ask for MD5 checksums from your sequence provider
• If possible convert older files to standard Sanger encoded quality values