Assembly group Presentation II

Assembly group Presentation II

Robert ArthurKevin Lee

Xing LiuPushkar Pande

Gena TangRacchit Thapliyal

Tianjun Ye

Two problematic libraries

M19107 & M21639 were resequenced due to low coverage

M19107 still had low coverage after re-seq M21639 coverage increased to ~80X

De novo assembly – exponential dec.

0 10 20 30 40 50 60 70 80 900

200

400

600

800

1000

1200

1400

Effect of coverage depth on contig number

De novoReference2 sample de novo

Coverage

Tota

l C

onti

g C

ount

M21639 still bad

0 10 20 30 40 50 60 70 80 900

200

400

600

800

1000

1200

1400



Coverage

Tota

l C

onti

g C

ount

Mixed culture?

Is this a mixed culture? What does a mixed culture assembly look

like? newProject, addRun, addRun, runProject … M19501 & M21127 M21127 & M21621

Mixed culture

0 10 20 30 40 50 60 70 80 900

200

400

600

800

1000

1200

1400



Coverage

Tota

l C

onti

g C

ount

Possible explanations for poor assembly

About 20% larger genome◦ Recent plasmid or other large genome insertion

Also lost hemolytic ability and H2S production

De novo assembly – 36 is the best

0 10 20 30 40 50 60 70 80 900

200

400

600

800

1000

1200

1400



Coverage

Tota

l C

onti

g C

ount

De novo assembly

Limited by our data, 36 contigs is lower limit◦ Number of repeat elements◦ rRNA

20X coverage is sufficient

General assembly stats

GenomeNewbler de novo

Newbler reference

Mira3 de novo

AMOScmp reference

minimus2 - newbler/mira

minimus2 - newbler/AMOScmp

M19107.sff 217 1260 208 471 123 109

M19501.sff 75 988 181 521 22 27

M21127.sff 59 1013 89 538 38 28

M21621.sff 50 986 67 515 28 23

M21639.sff 175 1272 175 573 54 69

M21709.sff 52 313 83 187 37 32

M19107_1.sff 1336 1361

M19107_2.sff 450 1006

M21639_1.sff 266 1165

M21639_2.sff 147 1282

Assembler Evaluation Strategy

Single assembler evaluation Minimus2 assembler evaluation Feedback by gene prediction group

Single Assembler Evaluation (“Hard Measurement”)

Single assembler evaluation

“Soft” measurement: satisfy gene prediction group's requirement.

RNA prediction group requires a file which can trace back the depth of the reads. For this, we use the .tsv file in the Newbler output.

• 454AlignmentInfo.tsv (-infoall/-info/-noinfo) base consensus, quality, depth and flow-signal, at each position in each contig. A very useful file.

• eg:

Position Consensus Quality Unique Align Signal Signal Score Depth Depth StdDev (incl. duplicates)

>contig000081 G 64 26 32 0.98 0.052 A 64 27 33 0.94 0.133 T 64 27 33 1.97 0.144 T 64 27 33 1.97 0.145 G 64 27 33 0.97 0.06...etc...

Newbler Output

Single Assembler Evaluation

For “hard” measurement: we focused mainly on “Total Contigs” “N50 Contigs Bases”, “Total Big Contigs”, “Big Contigs Percent Bases”, “Big Contig Reads”, “Singleton Reads”.

For “soft” measurement: we focused on the trace back of depth for each base pair.

Final Rank of Single Assembler

Combine the “hard” and “soft” measurement manually. We get as a result:

1. Newbler De Novo 2. Mira3 3. Amos Eliminated : Newbler reference & velvet

Minimus2 Evaluation

Since our top choice is Newbler, we want to include Newbler’s results in the merged contigs. Thus, we analysed the statistics of:

1. Newbler merged with Amos 2. Newbler merged with Mira Visulization tool: hawkeye

Minimus2 Evaluation

Minimus2 Evaluation

Gene prediction results feedback

Single assembler evaluation (Sample Data: contigs)

Single assembler evaluation (Sample Data: Big contigs)

Final Rankings of Single Assembler

Combined the “hard” and “soft” measurement manually and we got:

1. Newbler De Novo 2. Mira3 3. Amos Eliminated : Newbler reference & velvet

Minimus2 Evaluation (sample data: contigs)

Minimus2 Evaluation (sample data: reads)

Minimus2 Evaluation

Using the same strategy as above, our rankings are:

1. Newbler merged with Mira; 2. Newbler merged with Amos;

Final Recommendation

1. Merged Contigs: (1) Newbler merged with Mira (2) Newbler merged with Amos 2. Single Assembler: (1) Newbler (2) Mira (3) Amos

Feedback Strategy

Gene Prediction Group may use predicted genes and RNAs to evaluate our assembly results.

Minimus2 Overview

Minimus2 is a modified minimus pipeline

It is designed to merge one or two sequence sets hereafter referred to as S1 or S2

Uses Nucmer based Overlap Detector instead of the Smith-Waterman hash overlap Minimus1 uses (much faster)

Minimus2 Usage

minimus2 prefix \ -D REFCOUNT=n \ # Number of sequences is the first

set -D OVERLAP=n \ # Minimum overlap (Default 40bp) -D CONSERR=f \ # Maximum consensus error (0..1)

(Def 0.06) -D MINID=n \ # Minimum overlap %id for align. (Def 94) -D MAXTRIM=n # Maximum sequence trimming length

(Def 20bp)

◦ Prefix refers to an .AFG filename

Minimus2 Usage, Cont’d.

REFCOUNT should be set to number of sequences in the first set

“all vs all” alignment is ran by default and sets REFCOUNT to 0 unless user-specified

S1 & S2 should be merged and converted to AMOS format using toAmos command

toAmos –s S1-S2.seq –o S1-S2.afg

Minimus2 Usage, Cont’d.

After you merge the data, you actually run minimus on it

Minimus2 S1-S2 –D REFCOUNT=## Input

S1-S2.afg Output

S1-S2.fasta (contig) S1-S2.singletons.seq (single)

Nucmer Algorithm

A modification of the MUMmer package matching algorithm

Operates via building and then searching a suffix tree data structure

This is a significant upgrade from the minimus1 approach as searching using suffix trees is O(n) and minimus1’s method is O(n2)

Linear time versus polynomial time

MUMmer link : http://mummer.sourceforge.net/

http://mummer.sourceforge.net/

Nucmer Algorithm The Nucmer strategy uses approximately 17 bytes

of memory for each basepair in the reference sequence

The query supplied by the user is streamed past the reference suffix tree so that the memory requirements do not depend on the size of the query sequence

In English: Bigger query does not mean order of magnitude longer operating time

Unique algorithm that can be found and analyzed as it is open source on Sourceforge

MIRA

Based on OLC approach Strategies:

◦ Preprocessing: high confidence region (HCR) ◦ Use a quick heuristic algorithm to alignment the HCR of reads◦ Overlaps are reviewed with Smith-Waterman alignment algorithm ◦ Contigs can be be optionally analysed and corrected by an incorporated

version of an automatic editor◦ Repeats are resolved by searching for typical mis-assembly patterns◦ Optional pre-assembly read extension step: the assembler can try to

extend HCRs of reads by analysing the overlap pairs from the previous alignments.

MIRA outputs

d_results: this directory contains all the output files of the assembly in different formats.

d_info: this directory contains information files of the final assembly.

d_log: this directory contains log files and temporary assembly files.

d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped (not implemented yet, but soon)

d_results

out.padded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process. Positions in the consensus containing gaps (also called 'pads', denoted by an asterisk) are still present.

out.unpadded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process, put positions in the consensus containing gaps were removed.

qual files Outputs with other formats

◦ caf, ace, gap4d

d_info

info_assembly.txt: some statistics as well as whether or not problematic areas remain in the result.

info_callparameters.txt: This file contains the parameters as given on the mira command line when the assembly was started.

info_contigstats.txt: This file contains in tabular format statistics info_contigreadlist.txt: This file contains information which reads have

been assembled into which contigs. info_readstooshort: A list containing the names of those reads that have

been sorted out of the assembly only due to the fact that they were too short, before any processing started.

error_reads_invalid: A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).

Newbler

Another OLC assembler◦ Starts with `indexing`

Scans the .sff file, trims the reads, Performs some checks for possible 3’ and 5’ primers.

◦ Finds overlaps between reads Splits the phase between long reads and short reads. Alignments proceed using seed and extend.

◦ Simplifies overlap graph and generates consensus contigs. Uses the quality information for base calling.

Newbler: Metrics file

454NewblerMetrics.txt◦ runData

Total number of reads and bases in the file, also the number of reads and bases after trimming.

◦ runMetrics Number of searches, seeds and overlaps during the

alignment phase of assembly◦ readAlignmentResults

Number of reads and bases aligned to other reads, Inferred read error – No. of errors, mainly indels, between

the contigs and the reads.◦ consensusDistribution

This section deals with base calling of the consensus contigs.

◦ consensusResults A summary of the read alignments and assembly statistics

Newbler: Contigs 454AllContigs.fna (>100 bp default) 454LargeContigs.fna (>500 bp default)

◦ The fasta header: Gives the the unique contig number, its length in bp and the number of

reads in the alignment used to build this contig

◦ lower case bases correspond to quality values below 40. 454Contigs.ace

◦ All contigs in .ace format. Useful for visualization using for e.g. Eagleview

>contig00001 length=381 numreads=158

>contig00002 length=144 numreads=560GGGAGAACTCATCTCTTGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCGCACTTAGCTACCCGGCAATGCGTCTGGCGACACAACCGGAACACCAGTGaTGCGTCCACTCCGGTCCTCTCGTACTAGGAGCAG

>contig00002 length=144 numreads=56064 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 6464 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 20 64 64 64 64 64 64 64 64 64 6464 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64

Newbler: The status files 454TrimStatus.txt: This file describes what

(trimmed) part of the read was considered for alignment. ◦ read Id, trim points used, used trimmed length, raw length

454ReadStatus.txt: This file describes where reads ended up after assembly was complete.◦ Id, Status (assembled, partially assembled, singleton,

outlier, too short).

454AlignmentInfo.tsv: This file gives a consensus alignment overview for each position in each contig.◦ Position, consensus, quality score, depth, signal, std.

deviation

Automating assembly

Motivation◦ Troublesome installation, number of dependencies◦ Difficult to remember command line parameters

Automation◦ install.sh : A script to install assemblers and their

dependencies◦ assembler.sh: Script to run assemblers with default

arguments.

Future work

Continue to dialog with G.P. to determine assembly of choice.

Metrics:◦ tRNA count◦ rRNA count◦ Protein coding regions

Questions?

Documents

Assembly group Presentation II