BSc Thesis: Sequence variance analysis in Pepino mosaic virus quasi-species through next-generation sequencing


Sequence variance analysis in Pepino mosaic virus quasi-species

through next-generation sequencing

Maxime R. Tijdink

Laboratory of Virology, Wageningen University, The Netherlands

Pepino mosaic virus (PepMV) is a potexvirus that commonly infects members of the Solanaceae family and causes symptoms varying from necrotic leaves to yellow mosaic. It has spread to Europe, where it infects tomato plants in greenhouses. An infectious clone of a necrotic PepMV isolate (PepMV-WUR48) belonging to the Chilean-2 strain was passaged through different host plants, creating successive generations, to discover patterns of viral quasi-species evolution. A sequence variance analysis was performed on paired-end reads obtained by deep sequencing on the Illumina MiSeq platform. The steps for handling these plant virus datasets are explained as we go through quality control, read filtering, contig assembly and identification of variations. Single nucleotide polymorphisms (SNPs) at fixed positions were found in the first three generations; after that, the SNPs stayed unchanged. Some nucleotides in the infectious clone that deviate from the PepMV-WUR48 isolate reverted to the original nucleotide, while changes at other positions remained.

INTRODUCTION

Pepino mosaic virus (PepMV) is a positive-sense RNA (+RNA) potexvirus discovered in Peru, where it was originally found in pepino plants (Solanum muricatum) [9]. The virus most commonly infects members of the Solanaceae family, but can also infect some other species [12]. The symptoms vary from necrotic leaves to yellow mosaic in young leaves and enations on the lower surface of leaves [3,6]. PepMV is transmitted by plant contact and also through seeds to the next generation [4]. From wild Solanum species, PepMV eventually spread to tomato and other Solanum crops [11].

Four different strains of PepMV are distinguished, based on their genotype [5]: the Peruvian strain (LP), the US1 strain, the European strain (EU) and the Chilean-2 strain (CH2). PepMV has appeared in Europe and now infects tomato plants in greenhouses [13]. The EU strain was first discovered in the Netherlands in 1999, and phylogenetic analysis showed that it is related to the LP strain. Currently, the CH2 strain is the most common strain in Europe. Phylogenetic analyses show that CH2 is more closely related to US1 and does not originate from EU or LP [11].

The single-stranded positive-sense RNA genome of PepMV is around 6.4 kb long and has a 5’ cap and a 3’ poly-A tail. The genome has 5 open reading frames (ORFs), as well as 5’ and 3’ untranslated regions (UTRs). There is also a UTR between ORF1 and ORF2 and one between ORF4 and ORF5. ORF1 encodes a replication enzyme that consists of an mRNA capping enzyme domain, an RNA helicase domain and an RNA-dependent RNA polymerase domain. ORF2, ORF3 and ORF4 are overlapping genes that together form the triple gene block (TGB); they code for the movement proteins TGB1, TGB2 and TGB3. ORF5 encodes the coat protein (CP) [1].

A range of point mutations in the CH2 strain have a large influence on the symptomatology of the virus. A single point mutation in the TGB3 protein can change the symptoms from mild to necrotic, and a single point mutation in the coat protein can cause yellowing [7,8]. PepMV isolates from populations of different wild tomato species in different ecological environments in Peru differ by only three single nucleotide polymorphisms (SNPs) at fixed locations in the CP gene [11].

To discover more of those fixed polymorphic locations in the PepMV genome, a passaging experiment and a SNP analysis were carried out. The aim is to get a better understanding of the evolution of PepMV under selection and in different environments.

Because of the high mutation rate of RNA, one RNA

virus population is a ‘cloud’ of different mutants, so

RNA viruses can be seen as quasi-species [2].

The RNA of the necrotic PepMV-WUR48 isolate from the CH2 strain was multiplied through replication of a plasmid DNA clone and used to infect a host plant. From that host plant, virus particles in various dilutions were used to infect different kinds of plants. The virus was then transferred from the plant infected with the highest dilution that still showed severe symptoms to the next range of plants, again in different dilutions. In this way, a bottleneck event is created in which only a small portion of the viral quasi-species population is transferred to the next plant. At the end, plant material of six plants from five generations was sequenced on the Illumina MiSeq platform. This resulted in six libraries, each with more than 10 million 300 bp sequence reads, a large amount of data that has to be analysed.

New developments in next-generation sequencing technologies have made it possible to obtain huge quantities of data and have made it easier to detect known and unknown sequences [10]. The standard bioinformatics workflow for processing these data has four steps: quality control, contig assembly, contig annotation and identification of variations [10]. There are various methods for each step, and analysing these data is still difficult. It is therefore worthwhile to experiment with those methods, because the resulting knowledge could make data analysis faster and more reliable.

This article shows how the data were processed and provides insight into analysing viral sequence data from plants. We go through the steps used to clean up the data sets and explain two ways to assemble the PepMV genome.

Finally, the viral RNA sequences of the different generations are compared for SNPs in the genome by means of variant detection. This results in the detection of probable fixed polymorphic locations in the PepMV genome.

MATERIALS AND METHODS

PepMV-WUR48 is a highly necrotic isolate of CH2 that was found in a Dutch greenhouse. The single-stranded RNA was previously amplified using polymerase chain reaction and cloned into a plasmid vector with a T7 RNA polymerase promoter. This clone was transcribed to single-stranded capped RNA, which was subsequently used to infect a Nicotiana benthamiana host.

From the host plant, viral particles were extracted and used to infect old tomato plants, young tomato plants and N. benthamiana plants in five dilutions: 128000x, 64000x, 32000x, 16000x and 8000x. The plant infected with the highest dilution that still showed signs of infection was used to infect the next range of the same kinds of plants. This next range of plants was infected with dilutions of 1250x, 6250x, 31250x, 156250x and 181250x. In the following range, different kinds of tomato strains were infected with the same dilutions, except 181250x. The last step was infecting Money Maker tomato plants and N. benthamiana plants (Figure 1). The analysed data are 6 libraries of 300 bp paired-end reads obtained by deep sequencing on the Illumina MiSeq platform (http://www.illumina.com/systems/miseq.ilmn). The MiSeq libraries contain all the RNA (including the viral RNA) purified from PepMV-infected plants, except for ribosomal RNA, which was removed during library preparation.

Figure 1 Overall schema of the experiment. First N. benthamiana is infected with the necrotic strain (PepMV-WUR48), then the virus is transferred in different dilutions to the other plants. The fractions show how many plants show clear symptoms of infection. The plants marked with a circle or a stripe are eventually sequenced.

CLC Genomics Workbench 7 is used for the analysis

of the data. De Novo assemblies and Reference

assemblies are used to build contigs of the sequences

in the libraries. Contigs are identified by BLASTN

searches at NCBI and BLAST searches against a

PepMV genome in CLC Genomics Workbench.

RESULTS

The paired-end reads from the six selected plants, which are 250 to 300 nt long, are stored in six libraries. All the RNA that was present in the plant specimen, including plant mRNA and viral RNA, is likely to have been sequenced. The idea is to obtain the sequence of the viral genome, which will be a consensus of all the viral reads.

Quality assessment

As a start, a quality assessment was done on the libraries. For all the libraries, the median PHRED score is around 38, which corresponds to a chance of an incorrect base call of roughly 1 in 6,300; a score of 40 would correspond to 1 in 10,000 (Figure 2). Starting from base position 250, the median quality and the coverage drop.
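The relation between a PHRED score Q and the probability P of an incorrect base call is P = 10^(-Q/10). The short sketch below only illustrates this conversion and is not part of the original analysis; the function name `phred_to_error_prob` is a hypothetical helper.

```python
def phred_to_error_prob(q):
    """Convert a PHRED quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Print the error probability for a few common scores, including the
# median score of about 38 reported for the libraries.
for q in (20, 30, 38, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P(error) = {p:.2e}, about 1 in {round(1 / p):,}")
```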

Figure 2 PHRED score of library 1.

Trim sequences

First, the sequences in the libraries were trimmed. Only reads with a maximum of 3 ambiguous nucleotides were kept. Quality trimming was done with a quality limit of 0.05, and reads had to be at least 250 nt long. After this step, library 4 had the highest fraction of discarded reads (17%) and library 6 the lowest (9%) (Table 1).
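The quality limit of 0.05 is expressed as an error probability. One common way to apply such a limit is a running-sum trim in the style of the modified Mott algorithm: each base contributes (limit - error probability), the running sum is reset when it drops to zero or below, and the region ending at the highest running sum is kept. The sketch below assumes this approach for illustration; it is not claimed to reproduce the exact behaviour of CLC Genomics Workbench, and `quality_trim` is a hypothetical helper.

```python
def quality_trim(quals, limit=0.05):
    """Return (start, end) of the region kept by a running-sum quality trim.

    quals is a list of PHRED scores; the kept region is quals[start:end].
    An empty region (0, 0) means the whole read is discarded.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    running, start = 0.0, 0
    for i, q in enumerate(quals):
        # Good bases add a positive value, poor bases a negative one.
        running += limit - 10 ** (-q / 10)
        if running <= 0:
            # Restart the candidate region after this base.
            running, start = 0.0, i + 1
        elif running > best_sum:
            best_sum, best_start, best_end = running, start, i + 1
    return best_start, best_end


# Made-up quality string: two poor bases at each end, good bases in between.
start, end = quality_trim([2, 2, 38, 38, 38, 38, 2, 2])
print(start, end)
```

A length filter, such as the 250 nt minimum used here, would then be applied to `end - start`.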

Removal of plant reads

To lower the chance of forming wrong contigs, it is better to filter out as many non-viral reads as possible. To filter out reads from plant RNA, a reference mapping was done against the genome of Nicotiana benthamiana. Although the complete genome has not yet been sequenced, scaffolds made of contigs longer than 500 bp were used as a reference. The reference consisted of 71,057 of those contigs. The default values of a mismatch cost of 2, an insertion cost of 3, a deletion cost of 3 and a length fraction of 0.5 were used. The similarity fraction was lowered from the default value of 0.8 to 0.5, in order to get more hits with the plant sequences and thus filter out more plant reads. Reads that matched multiple locations were mapped randomly, and the unmapped reads were collected and used for further analysis. Reference mapping against the N. benthamiana genome also worked for the tomato libraries and filtered out a large number of plant reads (Table 1). The most plant reads were removed in library 4, a fraction of 81%. The fewest plant reads were removed in library 6, a fraction of 18%.
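The percentages quoted for the trimming and plant-read removal steps follow directly from the read counts in Table 1. The sketch below recomputes them as a worked example; the counts are copied from Table 1, while the names `LIBRARIES` and `percent_removed` are hypothetical.

```python
# Read counts from Table 1: (host plant, raw reads, after trimming,
# unmapped to the N. benthamiana reference).
LIBRARIES = {
    1: ("N. benthamiana", 14_982_940, 13_439_350, 10_341_990),
    2: ("Old tomato", 10_410_420, 8_451_380, 4_810_530),
    3: ("Old tomato", 13_963_380, 12_224_720, 7_251_770),
    4: ("Money Maker", 15_831_160, 13_172_610, 2_499_870),
    5: ("N. benthamiana", 14_383_650, 12_355_650, 6_126_390),
    6: ("Money Maker", 14_325_120, 13_016_880, 10_717_330),
}


def percent_removed(before, after):
    """Percentage of reads lost between two consecutive clean-up steps."""
    return 100 * (before - after) / before


for lib, (host, raw, trimmed, unmapped) in LIBRARIES.items():
    print(
        f"library {lib} ({host}): "
        f"{percent_removed(raw, trimmed):.1f}% discarded by trimming, "
        f"{percent_removed(trimmed, unmapped):.1f}% removed as plant reads"
    )
```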

Remove Duplicates

Removing duplicates was the next logical step to clean up the libraries. Even after sequence trimming and reference mapping, the libraries contained a massive number of duplicate reads. After removing duplicates from the first N. benthamiana library, only 16% of the reads were left, meaning that 84% of the trimmed and filtered reads were duplicates. In library 4, only 12% of the reads were duplicates (Table 1).
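The idea of duplicate removal can be illustrated with a few lines of code. The sketch below assumes that a duplicate is a read pair whose forward and reverse sequences are both identical to those of an earlier pair; the tool used in the analysis may apply a different or more tolerant definition. `remove_duplicates` and the example reads are hypothetical.

```python
def remove_duplicates(read_pairs):
    """Keep the first occurrence of every distinct (forward, reverse) pair.

    read_pairs is an iterable of (name, forward_seq, reverse_seq) tuples.
    """
    seen = set()
    unique = []
    for name, fwd, rev in read_pairs:
        key = (fwd, rev)
        if key not in seen:
            seen.add(key)
            unique.append((name, fwd, rev))
    return unique


# Made-up example: r2 is an exact duplicate of r1, r3 differs in the reverse read.
pairs = [
    ("r1", "ACGTAC", "TTGACA"),
    ("r2", "ACGTAC", "TTGACA"),
    ("r3", "ACGTAC", "GGGACA"),
]
kept = remove_duplicates(pairs)
print(f"{len(kept)} of {len(pairs)} read pairs kept")
```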

Library  Host plant      Sequences   Trimmed     Unmapped (Nicotiana reference)  After removing duplicates  Fraction left
1        N. benthamiana  14,982,940  13,439,350  10,341,990                      1,662,874                  11%
2        Old tomato      10,410,420  8,451,380   4,810,530                       2,545,414                  24%
3        Old tomato      13,963,380  12,224,720  7,251,770                       2,635,966                  19%
4        Money Maker     15,831,160  13,172,610  2,499,870                       2,200,438                  14%
5        N. benthamiana  14,383,650  12,355,650  6,126,390                       1,243,204                  9%
6        Money Maker     14,325,120  13,016,880  10,717,330                      2,253,854                  16%

Table 1 Overview of the number of reads in the libraries after the clean-up steps

De Novo Assembly

In order to find PepMV sequences, a De Novo Assembly is used to generate contigs of the library reads. Default settings of a word size of 20 and a bubble size of 50 were used. To spare unnecessary computations, the assembly only searched for contigs with a minimum length of 1000. To map the reads, the following default values were used: a mismatch cost of 2, an insertion cost of 3, a deletion cost of 3, a length fraction of 0.5 and a similarity fraction of 0.8.

The range of the number of contigs found per library

appeared to be very broad, from 5 in the first N.

benthamiana library to 1,308 in the Money Maker

tomato library.

The 5 contigs with the highest number of reads were BLASTed one at a time at NCBI. Because the virus is very abundant in the host plant, many reads from viral RNA were expected. The BLAST results show that in all the libraries, after the clean-up steps, the viral reads were by far the most abundant (Table 2). This is also a good method to detect RNA from other possible viruses or organisms in the host plant. Here, the bacteriophage used as a sequencing quality control for the MiSeq run was found. Apart from that, only plant mRNA was found, mostly originating from chloroplasts.

A different method to identify the PepMV sequences is to BLAST all the contigs against the PepMV-WUR48 consensus sequence. This approach is quick and easy, because it uses only one target sequence. The contigs with the highest scores were the PepMV contigs.

Library  Number of contigs  Contig length  Read count  Average coverage  BLAST hit                                 Identities
1        5                  6435           1636779     74487.93          PepMV CH2                                 99%
                            5371           6680        359.76            Bacteriophage S13                         98%
                            1175           1290        309.72            Nicotiana chloroplast                     99%
                            1027           40          11.29             Solanum tuberosum kinase                  93%
                            1175           40          9.98              Solanum tuberosum thiolase                92%
2        272                5541           1227074     64256.64          PepMV CH2                                 99%
                            5366           48800       2634.44           Coliphage phi-X174                        99%
                            1114           15425       3749.55           Solanum lycopersicum chloroplast          100%
                            1748           5115        809.80            Solanum lycopersicum PR protein           99%
                            1480           2735        448.58            Solanum lycopersicum chromosome 6         86%
3        476                6402           2207781     101427.90         PepMV CH2                                 99%
                            1635           12600       2186.53           Solanum lycopersicum PR protein           99%
                            5370           11780       634.19            Bacteriophage S13                         98%
                            1053           3365        856.11            Solanum lycopersicum chloroplast          100%
                            5590           2590        118.42            Solanum lycopersicum chromosome 12        86%
4        1,308              6454           376038      16706.28          PepMV CH2                                 99%
                            5366           81515       4392.57           Bacteriophage S13                         98%
                            1304           43370       8727.05           Solanum lycopersicum chloroplast          99%
                            1785           18775       2662.83           Solanum lycopersicum strain chromosome 1  100%
                            1677           15640       2571.09           Solanum lycopersicum cDNA                 94%
5        12                 6679           1184539     51455.91          PepMV CH2                                 99%
                            5366           11595       623.16            Coliphage phi-X174                        99%
                            1274           7235        1587.77           Nicotiana sylvestris chloroplast          99%
                            1121           130         33.02             Nicotiana tabacum mRNA                    97%
                            1072           110         28.86             N. tabacum isocitrate dehydrogenase mRNA  97%
6        29                 6744           2170794     94174.95          PepMV CH2                                 99%
                            5365           16400       879.96            Bacteriophage S13                         98%
                            1908           1530        214.09            Nicotiana tabacum mitochondrial DNA       99%
                            1480           525         93.16             Solanum lycopersicum chromosome 1         99%
                            1493           300         56.94             Solanum lycopersicum chromosome 10        88%

Table 2 Results of the De Novo assembly and BLAST results

The six PepMV contigs were aligned to the PepMV-WUR48 sequence, which at first did not result in sufficient overlap. The De Novo assembler sometimes uses the reverse complement of the reads to assemble contigs. After using the reverse complements of the contigs, they aligned properly against PepMV-WUR48. However, parts of the 5’ and 3’ untranslated regions did not align. Except for the PepMV contig from library 2, the contigs share a common matching part of 6376 nt. After removing the non-aligning parts, the trimmed contigs were aligned to the PepMV-WUR48 consensus. The trimmed contigs, apart from that of library 2, align to PepMV-WUR48 from nt 24 to nt 6399 of the reference (Figure 3). The trimmed contig from library 2 aligns from nt 870 to nt 6399 and is 5529 nt long.
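Taking the reverse complement of a contig, as was needed before the De Novo contigs aligned to the reference, is a simple operation. The sketch below is a minimal illustration and not the procedure used inside CLC Genomics Workbench; `reverse_complement` is a hypothetical helper and the example sequence is made up.

```python
# Translation table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


contig = "ATGCCGTNA"  # made-up sequence
print(reverse_complement(contig))
```

Applying the function twice returns the original sequence, which is a quick sanity check when deciding which orientation of a contig matches the reference.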

The read mappings show that the reads are not distributed randomly, but are ‘stacked’ on top of each other (Figure 4). The coverage is very high at the 3’ end of the contig and also at the 5’ end. In the middle, the coverage is much lower. The same pattern is seen for all the PepMV contigs assembled by De Novo (Figure 5).

Figure 4 Read mappings of De Novo contig 1

Variant detections of all the PepMV contigs have

been done in order to detect variants with a

minimum variant frequency of 10%, so only

significant variants will be shown. All the other

parameters were on default.

In library 1, five variants with high coverage were

detected (Table 3). The frequencies are the reads that

cover the allele, divided by the total coverage of that

nucleotide position.
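As a worked example of this definition, the sketch below recomputes the frequencies for position 777 from the counts listed in Table 3 and applies the 10% minimum variant frequency. The function names are hypothetical, and the variant caller itself applies further default filters that are not modelled here.

```python
def variant_frequency(count, coverage):
    """Allele frequency in percent: reads carrying the allele / total coverage."""
    return 100 * count / coverage


def passes_threshold(count, coverage, min_freq=10.0):
    """True if the allele reaches the minimum variant frequency (in percent)."""
    return variant_frequency(count, coverage) >= min_freq


# Position 777 in library 1 (counts from Table 3): reference G and variant C.
coverage = 31860
for allele, count in (("G", 19815), ("C", 11985)):
    freq = variant_frequency(count, coverage)
    print(f"777 {allele}: {freq:.2f}% (reported: {passes_threshold(count, coverage)})")
```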

In library 2, two variants with very low coverage were detected, while libraries 3 to 6 show no variants, meaning there are no heterozygous positions where the least abundant allele has a frequency of more than 10%.

Reference assembly

A reference assembly was done, starting with the PepMV-WUR48 genome for the first library. The reads of library 1 were first mapped to the PepMV-WUR48 genome. To map the reads, the following default values were used: a mismatch cost of 2, an insertion cost of 3, a deletion cost of 3, a length fraction of 0.5 and a similarity fraction of 0.8. The consensus from the first mapping was then used as the reference to map the next library, and so on, in the order of the transmissions from plant to plant in the experiment. This was done to simulate the experiment and to come as close as possible to the viral sequences that were used to infect the plants. The consensus has a noise threshold of 0.1 for ambiguities, so heterozygous allele variants are also recorded.
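A noise threshold of this kind explains why ambiguity codes such as S, R, Y and M appear in the consensus of library 1. The sketch below shows one way such a consensus base could be called from allele counts, assuming that every allele with a frequency of at least the threshold is included and the resulting set is written as an IUPAC code. The exact rule used by CLC Genomics Workbench may differ; `consensus_base` is a hypothetical helper, and the counts are taken from Table 3.

```python
# IUPAC nucleotide codes for every non-empty set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus_base(counts, noise_threshold=0.1):
    """Call a consensus base from allele counts (keys A, C, G, T).

    Alleles below the noise threshold are ignored; the remaining alleles
    are combined into a single IUPAC code.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"  # no coverage at this position
    kept = frozenset(b for b, n in counts.items() if n / total >= noise_threshold)
    return IUPAC[kept]


# Allele counts of library 1 at the five variable positions (Table 3).
table3 = {
    777: {"G": 19815, "C": 11985},
    1358: {"A": 17980, "G": 8545},
    1416: {"C": 18560, "T": 4260},
    1580: {"G": 21130, "A": 2930},
    4098: {"A": 19185, "C": 6330},
}
for position, counts in table3.items():
    print(position, consensus_base(counts))
```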

Another reference assembly was done with a mismatch cost of 1 and a similarity fraction of 0.5, in order to find more SNPs. Two more SNPs were found in the variant detection of library 2, at positions 6412 and 6414; those positions are in the poly-A tail and both had a coverage of only 25. Changing these parameter settings did not result in the discovery of significantly more SNPs.

Figure 3 Alignment of the trimmed contigs with the WUR48 consensus sequence

Figure 5 Coverage of the read mappings of De Novo contig 1

Region  Reference  Allele  Count  Coverage  Frequency (%)  Codon  AA
777     G          G       19815  31860     62.19397       GGC    G
777     G          C       11985  31860     37.6177        GCG    A
1358    A          A       17980  26540     67.7468        AAA    K
1358    A          G       8545   26540     32.19668       GAA    E
1416    C          C       18560  22820     81.33216       GCA    A
1416    C          T       4260   22820     18.66784       GTA    V
1580    G          G       21130  24060     87.82211       GAA    E
1580    G          A       2930   24060     12.17789       AAA    K
4098    A          A       19185  25520     75.17633       AAA    K
4098    A          C       6330   25520     24.80408       ACA    T

Table 3 Variant detection of the mapped reads of the PepMV contig made with the De Novo assembly of library 1.

All the consensus sequences were aligned and have about the same length as the original WUR48 consensus. In both library 5 and library 6, a T insertion is found at position 4 in the untranslated region. From library 3 to library 6, the last 12 nucleotides are missing. In library 3, that part of the poly-A tail was lost from the consensus. Because the shorter consensus was used as the reference for the subsequent libraries, their reads could not map to the lost region.

The read mappings of the contigs show that the reads

were unequally mapped to the reference sequence.

Especially at the 5’ side, far more reads were mapped

than at the rest of the reference sequence (Figure 6).

Position  Reference  Allele  Count   Coverage  Frequency (%)  Codon change  Amino acid difference
255       C          T       236770  236895    99.95          ACA - ATA     T - I
293       G          T       235850  235930    99.97          GCA - TCA     A - S
777       G          G       13680   22070     61.98          GGC - GCG     G - A
777       G          C       8340    22070     37.79
880       A          G       26690   26715     99.91          ACA - ACG     T - T
1358      A          A       14090   20915     67.37          AAA - GAA     K - E
1358      A          G       6815    20915     32.58
1416      C          C       13945   17255     80.82          GCA - GTA     A - V
1416      C          T       3310    17255     19.18
1580      G          G       16100   18345     87.76          GAA - AAA     E - K
1580      G          A       2245    18345     12.24
1868      C          T       10745   10770     99.77          CGG - TGG     R - W
2297      A          G       11230   11230     100.00         ATA - GTA     I - V
2550      S          C       13785   13795     99.93          TSA - TCA     Stop/S - S
2748      T          C       18145   18165     99.89          GTT - GCT     V - A
3829      Y          T       17325   17350     99.86          GAY - GAT     D - D
4098      A          A       14720   19805     74.32          AAA - ACA     K - T
4098      A          C       5080    19805     25.65          AAA - ACA     K - T
4484      A          G       14935   15010     99.50          ATA - ATG     I - M
6063      G          C       87107   87177     99.92          CGT - CCT     R - P

Table 4 Variant detection of the mapped reads of the PepMV contig made with the reference assembly of library 1.

Position  Reference  Allele  Count  Coverage  Frequency (%)  Codon difference  Amino acid difference
777       S          G       65945  66125     99.73          GGC - GCG         G - A
1358      R          A       51460  51495     99.93          AAA - GAA         K - E
1416      Y          C       46705  46750     99.90          GCA - GTA         A - V
1580      R          G       42195  42245     99.88          GAA - AAA         E - K
4098      M          C       45210  45355     99.68          AAA - ACA         K - T
4483      T          C       39260  39365     99.73          ATG - ACG         M - T

Table 5 Variant detection of the mapped reads of the PepMV contig made with the reference assembly of library 2.

Figure 6 Coverage of the read mappings from the reference assembly of library 3.

Variant detections of the reference read mappings

from all the libraries have been done to detect

variants with a minimum variant frequency of 10%, so

only significant variants are shown. All the other

parameters were on default.

In the variant detection of the first library, the PepMV-WUR48 genome is compared to the alleles in the mapped reads of the first library, because the first library was mapped against the PepMV-WUR48 sequence. The sequence that infected the first N. benthamiana originates from just one viral RNA strand in a whole PepMV-WUR48 population, which is a huge bottleneck. Comparing the consensus of library 1 to the PepMV-WUR48 consensus is therefore a way to detect SNP positions that vary in the original population (Table 4). Because the consensus of the previous library is used as the reference for the current one, the differences (SNPs) between consecutive libraries are also shown in the variant tables. SNPs were found only in libraries 1, 2 and 3, while the later libraries conserved the same nucleotides as library 3. In library 1, 10 SNPs were found that are present in almost 100% of the reads and differ from the reference genome. The only silent mutations come from ambiguous nucleotides.

Only five variants in which the minor allele has a frequency of at least 10% were found in library 1, at positions 777, 1358, 1416, 1580 and 4098. In all those heterozygous variants, the reference nucleotide is more abundant than the other allele. These findings match the variant detection of the De Novo assembly.

In the variant detection of the second library, the consensus genome from library 1 is used as the reference and is compared to the alleles in the mapped reads of library 2. Here, the heterozygous variations at positions 777, 1358, 1416 and 1580 in the first library have all changed back to the nucleotides of the PepMV-WUR48 consensus. Those variations have disappeared, together with the less common alleles. The variation at position 4098 in Tables 3 and 4 consisted of about 74% A and 26% C. In the second library, the frequency has changed dramatically to almost 100% C, while the A’s have almost completely disappeared (Table 5). Another SNP is detected at position 4483, where a T has changed to a C in almost 100% of the reads.

The consensus genome from the second library is used as the reference to map the third library and is compared to the alleles in the mapped reads of the third library. Here, 4 clear SNPs were found, each changed in almost 100% of the reads (Table 6). Whereas position 4098 changed from 74% A to almost 100% C between library 1 and library 2, it changes back to almost 100% A in library 3. This means that the least abundant allele found in library 1 became the only allele in the next plant, after a host switch from N. benthamiana to old tomato. In the next transfer, to another old tomato plant, the allele changed back to the nucleotide of the PepMV-WUR48 consensus in almost 100% of the reads. This corresponds to an amino acid change from lysine to threonine and back to lysine.

In library 3 (Table 6), the C at position 4483 changed to an A, while in library 2 a T had changed to a C at the same position (Table 5). Table 4 shows that a divergence from the PepMV-WUR48 consensus is found at position 4484, which is in the same codon, codon 19 of the TGBp1 coding region. The codon changes from ATA in the PepMV-WUR48 consensus to ATG in 99.5% of the reads in library 1. In library 2, the codon changes to ACG, and finally in library 3 it changes to AAG. The amino acid thus changes from isoleucine to methionine in library 1, to threonine in library 2 and to lysine in library 3 (Figure 7).
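The amino acid changes at this codon can be checked by translating the four codons with the standard genetic code. The sketch below builds the standard codon table and is only an illustration; `translate_codon` and `CODON_TABLE` are hypothetical names, and the codons are the ones reported in the text.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Translate one DNA or RNA codon to its one-letter amino acid ('*' = stop)."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


# Codon 19 of the TGB1 coding region in the reference and in libraries 1 to 3.
history = [
    ("PepMV-WUR48 consensus", "ATA"),
    ("library 1", "ATG"),
    ("library 2", "ACG"),
    ("library 3", "AAG"),
]
for label, codon in history:
    print(f"{label}: {codon} -> {translate_codon(codon)}")
```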

Region  Reference  Allele  Count  Coverage  Frequency (%)  Codon difference  Amino acid difference
2236    C          T       70595  70750     99.78          TCC - TCT         S - S
4098    C          A       57030  57065     99.94          ACA - AAA         T - K
4296    C          T       42540  42665     99.71          CCC - CTC         P - L
4483    C          A       53245  53275     99.94          ACG - AAG         T - K

Table 6 Variant detection of the mapped reads of the PepMV contig made with the reference assembly of library 3.

Figure 7 SNPs in the 19th TGB1 codon

DISCUSSION

The PepMV-WUR48 consensus is probably different from the viral sequence that infected the first N. benthamiana, because it is a consensus sequence of a viral population. Because PepMV is a quasi-species, meaning that it has a broad variety of different genomes in one population, the plant was probably infected with a deviating sequence. The first infection already formed a bottleneck: a single sequence was cloned and multiplied using a vector, and its transcripts led to the infection, so the first host plant was infected with one and the same RNA sequence. That explains the SNPs found in Table 4, which represent the differences between the infective sequence and the consensus of the original PepMV-WUR48 population.

There were 15 differences from the reference consensus. Ten of those SNPs stayed unchanged in the other generations, but 5 SNPs were present at different frequencies in the first library. The viral population of library 1 consisted of different genotypes, and those fractions of allele differences can be detected by the variant detection. Four of those five variable SNPs changed to the nucleotides of the PepMV-WUR48 consensus sequence in library 2, and the last one changed back in library 3. That SNP, at position 4098, changed from about 25% C in library 1 to almost 100% C in library 2, probably by chance because of the bottleneck: a dilution was used to infect the next plant, and it probably contained only virus particles with a C at position 4098. The change from almost 100% C back to almost 100% A between library 2 and library 3 is likely a selection event. None of these five mutations were silent, so they all result in amino acid changes.

The ten differences between library 1 and the PepMV-WUR48 consensus that stayed fixed include some silent mutations and some changes to nucleotides that are also found in other PepMV-CH2 isolates.

One hypothesis is that the infectious clone, which differs from the average PepMV-WUR48 population, carried some mutations that were unfavourable and others that had no impact on its fitness. There were 5 variable nucleotide positions in the first library, which eventually all changed back to the nucleotides found in the PepMV-WUR48 consensus at those positions. The hypothesis is that mutations which changed back to the nucleotides of the PepMV-WUR48 consensus or of a common PepMV-CH2 genome were unfavourable. Another hypothesis is that silent mutations have less impact on fitness and therefore do not have to change back. The fixed changes comprise 2 silent mutations and 8 amino acid changes, of which 5 change to the amino acid found at the same position in the PepMV-CH2 strain. That means that 7 out of 10 of those fixed mutations were either silent or changed to the nucleotides found at the same positions in another PepMV-CH2 isolate. These hypotheses need further testing before they can be considered significant, but they could be a clue to some of the evolutionary dynamics of PepMV.

To see whether these nucleotide changes are connected with each other, that is, whether there is some form of epistasis, it has to be checked whether the SNPs occur on the same viral genome. In this case it cannot be determined which SNPs lie on the same viral sequence, because the reference mapping is built up from random reads, so the genome from which each read originates is unknown. The mapping could be inspected to see which SNPs are found on the same read, but the reads are only 300 bp long and the polymorphic positions are too far apart. It is therefore impossible to determine whether certain nucleotide changes influence one another.
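
The read-length limitation can be illustrated with a minimal sketch. It only considers whether a single 300 bp read could cover two SNP positions at once and ignores any linkage information from read pairs. The positions used here are hypothetical and are not the positions found in this study.

```python
from itertools import combinations

READ_LENGTH = 300  # read length stated in the text (bp)


def linkable_pairs(snp_positions, read_length=READ_LENGTH):
    """Return the pairs of SNP positions that one read could cover together."""
    pairs = []
    for a, b in combinations(sorted(snp_positions), 2):
        # A read covering both SNPs must span b - a + 1 bases.
        if b - a + 1 <= read_length:
            pairs.append((a, b))
    return pairs


# Hypothetical genome positions, for illustration only.
example_positions = [1200, 1350, 4100, 5800]
print(linkable_pairs(example_positions))
```

Only SNP pairs returned by such a check could be examined for co-occurrence on the same read; all other pairs remain unlinked in this kind of data.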

Because mutations arise constantly during the copying of the viral RNA, a lot of variation was expected. The variant detection shows the contrary: variants with a frequency of more than 10% were found only in library 1, and in library 2 at positions with very little coverage. It could be that selection is very strict, since the same amino acids are found in almost all the virus reads. However, many silent mutations should have little impact on fitness, and given the high mutation rate, a lot of variation would be expected at the third nucleotide of codons, where silent mutations can arise. Still, only one silent mutation was found in the plant-to-plant infections.
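
As an illustration of the 10% criterion, the sketch below computes the frequency of non-reference bases at one position and reports those above the threshold. The function name, the `min_depth` parameter and the example counts are assumptions made for this illustration; this is not the variant caller used in the analysis.

```python
from collections import Counter


def variants_above_threshold(bases, reference_base, threshold=0.10, min_depth=1):
    """Return {base: frequency} for non-reference bases above the threshold.

    `bases` holds the base observed in each read covering one position.
    Positions with fewer than `min_depth` reads are not reported.
    """
    counts = Counter(bases)
    depth = sum(counts.values())
    if depth < min_depth:
        return {}
    return {
        base: count / depth
        for base, count in counts.items()
        if base != reference_base and count / depth > threshold
    }


# Hypothetical column of a read mapping: 80 reads show G, 20 reads show A.
column = "G" * 80 + "A" * 20
print(variants_above_threshold(column, reference_base="G"))
```

Raising `min_depth` is one way to avoid reporting variants at positions with very little coverage, such as those mentioned for library 2.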

Hasiów-Jaroszewska et al. (2011) show that certain point mutations can convert PepMV from a mild pathotype into a necrotic one. In the TGB3 coding region, amino acid 67 must change from K to E for the virus to become necrotic. In the PepMV-CH2 genome that gave the highest BLAST hit at NCBI, the homologous codon is AAA, which codes for K. In the WUR48 consensus and in the consensus sequences of the libraries, that codon is GAA, which codes for E. That may explain why this isolate was already necrotic.
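
The distinction between silent mutations and amino acid changes can be checked with a small sketch based on the standard genetic code. The helper name `classify_change` is an assumption made for this illustration; the AAA and GAA codons are the ones discussed above.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_before, codon_after):
    """Translate both codons and report whether the change is silent."""
    aa_before = CODON_TABLE[codon_before.upper().replace("U", "T")]
    aa_after = CODON_TABLE[codon_after.upper().replace("U", "T")]
    kind = "silent" if aa_before == aa_after else "amino acid change"
    return aa_before, aa_after, kind


print(classify_change("AAA", "GAA"))  # the TGB3 codon 67 difference
print(classify_change("AAA", "AAG"))  # a third-position change in the same codon
```

The first call reproduces the K to E difference in TGB3, while the second shows a third-position change that leaves the amino acid unchanged.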

Another article by Hasiów-Jaroszewska et al. (2013) describes another interesting point mutation, in the CP coding region at codon 155. If the codon codes for R or K, there is yellowing, and if it codes for E, there is none. No variation was found at that codon in the PepMV-WUR48 genome or in the libraries: all the codons at position 155 in the CP coding region code for E. The only variation found in the CP coding region is at codon position 145, between library 1 and the PepMV-WUR48 consensus.

Three more polymorphic locations in the CP gene were found in PepMV from wild Solanum species in Peru. PepMV was isolated from populations of different wild tomato species in different ecological environments. The CP gene of those isolates was amplified and its nucleotide sequence was determined. SNPs were found at nt positions 468, 495 and 712 of the CP gene, and these were the only variations found among all the isolates [11]. The only SNP found in the CP gene in my analysis does not correspond to any of these positions.
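
The comparison between nucleotide positions and codon positions can be made explicit with a short calculation. This sketch assumes that the nt positions are counted from 1 at the first nucleotide of the CP start codon; the numbering used in [11] is not stated here, so the output is only valid under that assumption.

```python
def codon_index(nt_position):
    """Return the 1-based codon number containing a 1-based nt position."""
    if nt_position < 1:
        raise ValueError("nt positions are 1-based")
    return (nt_position - 1) // 3 + 1


def position_in_codon(nt_position):
    """Return 1, 2 or 3: the place of the nucleotide within its codon."""
    if nt_position < 1:
        raise ValueError("nt positions are 1-based")
    return (nt_position - 1) % 3 + 1


def codon_span(codon_number):
    """Return the first and last 1-based nt position of a codon."""
    return 3 * (codon_number - 1) + 1, 3 * codon_number


# SNP positions reported in [11]; numbering convention assumed as described.
for nt in (468, 495, 712):
    print(nt, codon_index(nt), position_in_codon(nt))

print(codon_span(145))  # nt range of the codon that varies in this analysis
```

Under this assumption, codon 145 covers nt 433 to 435, which does not include any of the three reported positions.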

An interesting change occurs in the 19th codon of the TGB1 coding region. The PepMV-WUR48 consensus and library 1 differ at the third nucleotide of that codon. In the following two generations, the second nucleotide of the same codon changes, and this results in two amino acid changes. All of this is found in almost 100% of the reads. Compared with two other PepMV-CH2 sequences, PepMV-WUR48 differs at the third nucleotide, which is a silent difference. Nothing is known about this codon, and so far no experiments have been done to assess the impact of changes in it.

Normally, the first step of the library cleaning process is removing the duplicate reads. Here it was the last step, because the computers did not have enough storage capacity to hold the duplicate reads before discarding them; only after the other cleaning steps did they have the capacity.

In the end, the libraries turned out to contain a massive number of duplicate reads. A possible explanation is a mistake in the sequencing procedure: the MiSeq reads were not sequenced from the large cDNA strands, but from small strands of 500-800 nt. That means that not the whole genome of the virus was sequenced, but only broken RNA strands. This would explain why the libraries contained a disproportionate number of duplicate reads. After removing the duplicates and the other ‘clean-up’ steps, only 9% to 24% of the reads were left.
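
A minimal sketch of duplicate removal is given below. It is not the tool used in this analysis: it assumes that only exact duplicate read pairs are removed, and it stores a short digest of each pair instead of the sequences themselves to limit memory use. The function name and the example reads are hypothetical.

```python
import hashlib


def remove_duplicate_pairs(read_pairs):
    """Yield each (forward, reverse) read pair the first time it is seen."""
    seen = set()
    for forward, reverse in read_pairs:
        # Keep a 16-byte digest rather than the full sequences of the pair.
        key = hashlib.blake2b(
            f"{forward}\t{reverse}".encode(), digest_size=16
        ).digest()
        if key not in seen:
            seen.add(key)
            yield forward, reverse


# Hypothetical read pairs, for illustration only.
pairs = [("ACGT", "TTGA"), ("ACGT", "TTGA"), ("ACGT", "TTGC")]
unique_pairs = list(remove_duplicate_pairs(pairs))
print(len(unique_pairs) / len(pairs))  # fraction of read pairs retained
```

Because the function is a generator, the reads can be streamed from a file without loading a whole library into memory.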

The read mappings of the de novo assembly were not evenly distributed: some parts in the middle of the mapping showed very low coverage, while coverage was high at the 5’ and 3’ ends. This is probably also due to the fragmentation. Maybe the primers could attach better at those ends, or maybe the broken pieces of RNA are more abundant in the 3’ and 5’ regions.
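
Uneven coverage of this kind can be quantified per position. The sketch below assumes that each mapped read is described by a 0-based start position and a length; the genome length and the reads in the example are hypothetical.

```python
def coverage(genome_length, alignments):
    """Return the read depth at every position of the reference.

    `alignments` is an iterable of (start, length) tuples with 0-based
    starts that lie inside the reference; reads running past the end
    are clipped.
    """
    changes = [0] * (genome_length + 1)
    for start, length in alignments:
        end = min(start + length, genome_length)
        changes[start] += 1  # depth goes up where a read begins
        changes[end] -= 1    # and down just after it ends
    depth = []
    running = 0
    for change in changes[:genome_length]:
        running += change
        depth.append(running)
    return depth


# Hypothetical example: a 10 nt reference with reads near both ends.
print(coverage(10, [(0, 4), (2, 4), (8, 5)]))
```

Positions with a depth of zero or close to zero in such a profile correspond to the low-coverage regions in the middle of the mapping.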

Another risk of working with these data is that the broken RNA strands may share sequence features that made them non-functional in the first place. That means that the sequences used for the reference assembly could differ from the intact genomes, which were not sequenced.

REFERENCES

1. Aguilar, J.M., Hernandez-Gallardo, M.D., Cenis, J.L., Lacasa, A. and Aranda, M.A. (2002) Complete sequence of the Pepino mosaic virus RNA genome. Arch. Virol. 147, 2009–2015.

2. Eigen, M. (1993) Viral quasispecies. Sci. Am. 269, 42–49.

3. Hanssen, I.M., Paeleman, A., Vandewoestijne, E., Van Bergen, L., Bragard, C., Lievens, B., Vanachter, A.C.R.C. and Thomma, B.P.H.J. (2009) Pepino mosaic virus isolates and differential symptomatology in tomato. Plant Pathol. 58, 450–460.

4. Hanssen, I.M., Mumford, R., Blystad, D.R., Cortez, I., Hasiów-Jaroszewska, B., Hristova, D., Pagán, I., Pereira, A.M., Peters, J., Pospieszny, H., Ravnikar, M., Stijger, I., Tomassoli, L., Varveri, C., van der Vlugt, R. and Nielsen, S.L. (2010) Seed transmission of Pepino mosaic virus in tomato. Eur. J. Plant Pathol. 126, 145–152.

5. Hanssen, I. and Thomma, B. (2010) Pepino mosaic virus: a successful pathogen that rapidly evolved from emerging to endemic in tomato crops. Mol. Plant Pathol. 11, 179–189.

6. Hasiów-Jaroszewska, B., Pospieszny, H. and Borodynko, N. (2009) New necrotic isolates of Pepino mosaic virus representing the CH2 genotype. J. Phytopathol. 157, 494–496.

7. Hasiów-Jaroszewska, B., Borodynko, N., Jackowiak, P., Figlerowicz, M. and Pospieszny, H. (2011) A single mutation in TGB3 converts mild pathotype of Pepino mosaic virus into necrotic one. Virus Res. 159, 57–61.

8. Hasiów-Jaroszewska, B. and Borodynko, N. (2012) Characterization of the necrosis determinant of the European genotype of pepino mosaic virus by site-specific mutagenesis of an infectious cDNA clone. Arch. Virol. 157, 337–341.

9. Jones, R.A.C., Koenig, R. and Lesemann, D.E. (1980) Pepino mosaic virus, a new potexvirus from pepino (Solanum muricatum). Ann. Appl. Biol. 94, 61–68.

10. Massart, S., Olmos, A., Jijakli, H. and Candresse, T. (2014) Current impact and future directions of high throughput sequencing in plant virus diagnostics. Virus Res. 188, 90–96.

11. Moreno-Pérez, M.G., Pagán, I., Aragón-Caballero, L., Cáceres, F., Fraile, A. and García-Arenal, F. (2014) Ecological and genetic determinants of Pepino mosaic virus emergence. J. Virol. 88(6), 3359–3368.

12. Salomone, A. and Roggero, P. (2002) Host range, seed transmission and detection by ELISA and lateral flow of an Italian isolate of Pepino mosaic virus. J. Plant Pathol. 84, 65–68.

13. van der Vlugt, R.A.A., Stijger, C.C.M.M., Verhoeven, J.T.J. and Lesemann, D.E. (2000) First report of Pepino mosaic virus on tomato. Plant Dis. 84, 103.