Sequence data formats - KNAW

Sequence data formats: a short guide on sequencing data formats

Data formats

Sequence and Quality

• Base calls

• Quality of base calls

A T G T A G C A C G

29 28 33 18 26 31 18 34 32 39

• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).

Q10 one error in every 10 bases (probability of error = 10%)

Q20 probability of error = 1%

Q30 probability of error = 0.1%
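The Q10, Q20 and Q30 values above follow from the standard Phred definition Q = -10 log10(p), which the slide does not spell out at this point and is assumed here. A minimal sketch of the conversion in both directions (the function names are illustrative only):

```python
import math


def phred_to_error_probability(q):
    """Return the error probability p for a Phred quality value Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Return the Phred quality value Q, rounded to an integer, for probability p."""
    return round(-10 * math.log10(p))


# Print the error probability for the three quality values on the slide
for q in (10, 20, 30):
    print(f"Q{q}: p = {phred_to_error_probability(q):.4f}")
```

Quality values are stored as integers, which is why the second function rounds its result.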

Plain old FASTA

Fasta sequence

>identifier description

atcgtaggctttcggctata

gctaatgtagctatattgtc

Fasta qual

>identifier description

21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40

29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28
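The FASTA sequence file and the FASTA qual file share the same layout: a ">" title line followed by one or more wrapped data lines. The sketch below reads both and pairs them by identifier. It assumes that the first word of the title line is the identifier; the function names are made up for this example.

```python
def read_records(lines):
    """Yield (identifier, body_lines) for each '>' record in FASTA-style text."""
    identifier, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, body
            # Keep the first word as identifier; the rest is the description
            words = line[1:].split()
            identifier, body = (words[0] if words else ""), []
        elif identifier is not None:
            body.append(line)
    if identifier is not None:
        yield identifier, body


def read_fasta(lines):
    """Return {identifier: sequence}, joining wrapped sequence lines."""
    return {name: "".join(body) for name, body in read_records(lines)}


def read_qual(lines):
    """Return {identifier: [quality values]}, joining wrapped quality lines."""
    return {name: [int(v) for v in " ".join(body).split()]
            for name, body in read_records(lines)}


fasta_text = """>identifier description
atcgtaggctttcggctata
gctaatgtagctatattgtc
"""
qual_text = """>identifier description
21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40
29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28
"""

sequences = read_fasta(fasta_text.splitlines())
qualities = read_qual(qual_text.splitlines())
for name, sequence in sequences.items():
    # Each base must have exactly one quality value
    print(name, len(sequence), len(qualities[name]))
```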

A few notes in advance

Numbers can be represented by letters through ASCII codes

http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters

Number code

33 !

34 "

35 #

36 $

37 %

... ...

64 @

65 A

66 B

... ...

121 y

122 z
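The mapping between numbers and characters can be checked with Python's built-in `chr` and `ord`. A small sketch (the helper name `ascii_table` is made up for this example):

```python
def ascii_table(codes):
    """Return (number, character) pairs for the given ASCII codes."""
    return [(code, chr(code)) for code in codes]


# Number -> character, for some of the codes listed in the table
for code, symbol in ascii_table([33, 34, 35, 36, 37, 64, 65, 66, 121, 122]):
    print(code, symbol)

# Character -> number is the reverse operation
print(ord("!"), ord("@"))
```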


FastQ

• One name, multiple formats

• Stores sequence and quality per base in the same file

• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).

• Line 2 is the raw sequence letters.

• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.

• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

@SEQ_ID

GATTTGGGGTTCAAAGCAGTAT

CGATCAAATAGTAAATCCATTT

GTTCAACTCACAGTTT

+SEQ_ID

!''*((((***+))%%%++)(%

%%%).1***-+*''))**55CCF>

>>>>>CCCCCCC65

@SEQ_ID

GATTTGGGGTTCAAAGCAGTAT

CGATCAAATAGTAAATCCATTT

GTTCAACTCACAGTTT

+

!''*((((***+))%%%++)(%

%%%).1***-+*''))**55CCF>

>>>>>CCCCCCC65

http://en.wikipedia.org/wiki/FASTA_format
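Because sequence and quality lines may be wrapped, and because '@' and '+' are also valid quality symbols, a reader cannot rely on "four lines per record" or on the first character of a line alone. One common approach, sketched here under the assumption that the sequence letters themselves never start a line with '+', is to read quality lines until as many symbols as sequence letters have been collected. The function name is illustrative.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of text lines.

    Wrapped records are handled by reading quality lines until the number
    of quality symbols equals the number of sequence letters.
    """
    stripped = (line.rstrip("\n") for line in lines)
    stream = (line for line in stripped if line)  # skip blank lines
    for header in stream:
        if not header.startswith("@"):
            raise ValueError(f"expected '@' title line, got: {header!r}")
        identifier = header[1:]
        # Sequence lines run until the '+' separator line
        sequence_parts = []
        for line in stream:
            if line.startswith("+"):
                break
            sequence_parts.append(line)
        sequence = "".join(sequence_parts)
        # Quality lines run until enough symbols have been read
        quality_parts, collected = [], 0
        while collected < len(sequence):
            line = next(stream, None)
            if line is None:
                raise ValueError(f"truncated quality string for {identifier}")
            quality_parts.append(line)
            collected += len(line)
        quality = "".join(quality_parts)
        if len(quality) != len(sequence):
            raise ValueError(f"length mismatch in record {identifier}")
        yield identifier, sequence, quality


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTAT",
    "CGATCAAATAGTAAATCCATTT",
    "GTTCAACTCACAGTTT",
    "+SEQ_ID",
    "!''*((((***+))%%%++)(%",
    "%%%).1***-+*''))**55CCF>",
    ">>>>>CCCCCCC65",
]
for identifier, sequence, quality in parse_fastq(record):
    print(identifier, len(sequence), len(quality))
```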

Illumina output formats

.seq.txt

.prb.txt

Illumina FASTQ (ASCII – 64 is Illumina score)

Qseq (ASCII – 64 is Phred score)

Illumina single line format

SCARF

FastQ Quality

• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).

• Two different equations have been in use. The first is the standard Sanger variant for assessing the reliability of a base call, known as the Phred quality score: Q = -10 log10(p)

• In the old days Solexa (now Illumina) used a different mapping, encoding the odds p/(1-p) instead of the probability p: Q = -10 log10(p/(1-p))

• The two scores differ at low quality values

• From about Q20 upward the difference is negligible
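To see where the two mappings diverge, both can be evaluated for the same error probability. This sketch assumes the usual definitions, Phred Q = -10 log10(p) and Solexa Q = -10 log10(p/(1-p)); the conversion function follows from solving the Solexa equation for p. All function names are illustrative.

```python
import math


def phred_quality(p):
    """Phred (Sanger) quality for error probability p."""
    return -10 * math.log10(p)


def solexa_quality(p):
    """Solexa quality for error probability p, based on the odds p/(1-p)."""
    return -10 * math.log10(p / (1 - p))


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality value to the equivalent Phred quality value."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# Compare the two scores from poor to good base calls
for p in (0.5, 0.2, 0.1, 0.01, 0.001):
    print(f"p={p:<6} Phred={phred_quality(p):6.2f} Solexa={solexa_quality(p):6.2f}")
```

At p = 0.5 the Solexa score is 0 while the Phred score is about 3; for smaller p the Solexa score can also become negative, which is why Solexa files allow values down to -5.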

Different mappings

Platform                         Quality score       ASCII codes
Sanger                           Phred, 0 to 93      33 to 126
Solexa / Illumina 1.0            Solexa, -5 to 62    59 to 126
Illumina 1.3 – 1.7               Phred, 0 to 62      64 to 126
Illumina 1.8 (Sanger encoding)   Phred, 0 to 93      33 to 126
PacBio                           Phred, 0 to 93      33 to 126
Ion Torrent                      Phred, 0 to 93      33 to 126

Illumina 1.5 – 1.7: scores 0 and 1 are no longer used; score 2 marks the end of the high-quality part of a read (but may occur in the middle of a read as well).
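Decoding a quality string comes down to subtracting the platform's ASCII offset from each character code. The sketch below assumes offset 33 for Sanger-style files and offset 64 for the older Solexa/Illumina files. The `guess_offset` helper is only a heuristic invented for this example: characters below code 59 cannot occur in an offset-64 file, but a file without such characters is not guaranteed to use offset 64.

```python
def decode_quality(quality_string, offset=33):
    """Return the list of quality scores encoded in a FASTQ quality string."""
    return [ord(symbol) - offset for symbol in quality_string]


def guess_offset(quality_string):
    """Heuristically guess the ASCII offset (33 or 64) of a quality string.

    Codes below 59 only occur with offset 33. Anything else is assumed to
    be offset 64, which can be wrong for uniformly high-quality Sanger reads.
    """
    lowest = min(ord(symbol) for symbol in quality_string)
    return 33 if lowest < 59 else 64


sanger_like = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
print(guess_offset(sanger_like))
print(decode_quality(sanger_like, offset=33)[:10])
```

In practice the guess should be made on many reads at once rather than on a single quality string.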

There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.

Standard flowgram format (SFF)

The 454 equivalent of the ABI chromatogram files. An SFF file contains:

• the flowgram,

• the called sequence,

• the qualities,

• recommended quality and adaptor clipping.

• SFF files are binary. There are several tools to extract the sequences from them

• Extracted output is either fasta + fasta.qual or fastq

• Qualities use the Sanger quality encoding

SAM (BAM) format

• Text format for storing sequence and quality data in a series of tab-delimited ASCII columns

• Stores alignment information against a given reference

• SAM is the human-readable version of BAM (which is compressed and indexed for fast parsing)

• The two can be converted into each other with SAMtools

• Can be converted to FastQ or even Fasta

• Common output format of workflows
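To illustrate how a SAM record maps back to FASTQ, the sketch below converts a single alignment line. It relies on the column order of the SAM specification (read name in column 1, flag in column 2, sequence in column 10, quality in column 11) and on flag bit 0x10, which marks reads stored as the reverse complement. The function name is hypothetical, and for real data SAMtools remains the usual route.

```python
# Translation table for complementing DNA letters (N stays N)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_record_to_fastq(line):
    """Convert one tab-delimited SAM alignment line to a four-line FASTQ record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    name, flag = fields[0], int(fields[1])
    sequence, quality = fields[9], fields[10]
    if flag & 0x10:
        # Read was aligned to the reverse strand: restore the original read
        sequence = sequence.translate(_COMPLEMENT)[::-1]
        quality = quality[::-1]
    return f"@{name}\n{sequence}\n+\n{quality}\n"


forward = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIH#"
reverse = "read2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tAACG\tIIH#"
print(sam_record_to_fastq(forward), end="")
print(sam_record_to_fastq(reverse), end="")
```

Header lines (starting with '@') and records whose sequence is stored as '*' would need to be skipped before calling this function.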

Information on SAM/BAM

http://samtools.github.io/

https://github.com/samtools/hts-specs

http://genome.sph.umich.edu/wiki/SAM

PacBio - SMRT Cell

• A PacBio SMRT-Cell run is packaged as a .tgz file (gzipped tar format)

• Big (4-14 GB / SMRT Cell)

• Check the required folder structure, otherwise the file cannot be loaded in the SMRT-Portal database

• Contains several .h5 (HDF5 format) files

• Contains metadata in XML format

• Should be read as one package by SMRT-Portal

• SMRT Portal can convert the files to standard FASTQ

documentation:

https://github.com/PacificBiosciences/SMRT-Analysis/wiki

format https://s3.amazonaws.com/files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf





Tools for converting formats

• FASTX-Toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

• BioPython

http://www.biopython.org

• SFF extract (now seq_crumbs)

http://bioinf.comav.upv.es/seq_crumbs/

• Other: PrinSeq, ea-utils, bedtools, sambamba

MD5 Checksums

• Large files can be damaged during file transfers

• MD5 checksums may help to detect corrupt files

• If possible ask for MD5 checksums

Example (on Linux):

Create the checksum file:

md5sum -b largefile.fastq.gz > largefile.fastq.gz.md5

The .md5 file then contains:

a3672a3d4185acc49c7fa4460f1167ab *largefile.fastq.gz

Verify the file after transfer:

md5sum -c largefile.fastq.gz.md5

largefile.fastq.gz: OK

Windows tool: WinMD5 (http://winmd5.com/)
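Where `md5sum` is not available, the same checksum can be computed with Python's standard `hashlib` module. The sketch reads the file in chunks so that large sequence files do not have to fit in memory; the function name and the chunk size are arbitrary choices for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as `matches_checksum("largefile.fastq.gz", "a3672a3d4185acc49c7fa4460f1167ab")` would then play the role of `md5sum -c`.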

Pfff: alternative fingerprinting

http://biit.cs.ut.ee/pfff/

Compression formats

Compression   Extension   Command line to decompress
GZIP          .gz         gunzip <file>  or  gzip -d <file>
BZIP2         .bz2        bunzip2 <file>  or  bzip2 -d <file>
zip           .zip        unzip <file>
rar           .rar        unrar x <file>
7z            .7z         7za e <file>

Files come in various compression formats. All of them can be read under Linux; Windows might need extra software to extract the files.

The most common formats are .gz and .bz2.
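Files in the two most common formats do not have to be decompressed on disk before they are read. The sketch below picks the opener from the file extension using Python's standard `gzip` and `bz2` modules; the function name is made up for this example, and selecting by extension (rather than by file content) is a simplifying assumption.

```python
import bz2
import gzip


def open_maybe_compressed(path):
    """Open a plain, .gz or .bz2 file for reading as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def count_lines(path):
    """Count the lines of a possibly compressed text file."""
    with open_maybe_compressed(path) as handle:
        return sum(1 for _ in handle)
```

For a FASTQ file with unwrapped records, `count_lines(path) // 4` gives the number of reads.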

Conclusions

• Be aware of the different file formats for sequence and quality data

• Data integrity: ask for MD5 checksums from your sequence provider

• If possible convert older files to standard Sanger encoded quality values
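For files that store Phred scores with ASCII offset 64, converting to the Sanger encoding means shifting every quality character down by 31 (64 minus 33). The sketch below covers that case only; old Solexa-scored files would additionally need a Solexa-to-Phred score conversion, which is not shown. The function name is illustrative.

```python
def illumina64_to_sanger(quality_string):
    """Re-encode a Phred+64 quality string as a Sanger (Phred+33) string."""
    converted = []
    for symbol in quality_string:
        score = ord(symbol) - 64
        if score < 0:
            # Such characters cannot occur in a Phred+64 file
            raise ValueError(f"{symbol!r} is not a valid Phred+64 quality symbol")
        converted.append(chr(score + 33))
    return "".join(converted)


print(illumina64_to_sanger("hhhhBBB"))
```

The sequence and title lines of each record stay unchanged; only the quality line is rewritten.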
