Section B. Comparative Genomics Analysis of 2013 H7N9 ......4! Created by the ViPR/IRD team and licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License Section

Created by the ViPR/IRD team and licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License 4

Section B. Comparative Genomics Analysis of 2013 H7N9 Influenza A Viruses

Objective

Upon completion of this exercise, you will be able to use the Influenza Research Database (IRD; http://www.fludb.org/) to:

• Search for virus sequences and view detailed information about these sequences in IRD

• Save selected sequences as a working set in your private Workbench space

• Combine multiple working sets

• Build a phylogenetic tree on a set of sequences to infer their evolutionary relationships

• Use Meta-CATS to identify nucleotide or amino acid positions that significantly differ between groups of virus sequences

• Perform a multiple sequence alignment to observe sequence conservation and variations

• Determine if significant positions are located in viral protein Sequence Features and examine Sequence Feature Variant Type reports

• Search for 3D protein structures and highlight Sequence Features and custom positions on a structure

Background

H7 viruses normally circulate in birds and horses. Before March 2013, a search for H7 influenza strains in IRD returned a total of 1485 strains in IRD, with 1306 from birds, 102 from environmental samples (usually bird droppings), 33 from horses, and only 15 from humans (11 H7N7, H7N1, H7N2, 2 H7N3). No human isolates were H7N9.

In March 2013, several cases of Influenza virus A H7N9 subtype were identified in Shanghai, China and surrounding provinces. From February to May, 2013, a total of 133 cases have been confirmed, including 37 deaths. Since October 2013, a second wave of human cases has been occurring, a total of 74 as of January 21, 2014. Fortunately, no evidence of ongoing human-to-human transmission for the current H7N9 outbreak has yet been found. This suggests that the virus is undergoing “stuttering transmission” in which a virus that normally circulates in an animal reservoir infects a person, but further human-to-human transmission does not occur. In general, viruses capable of stuttering transmission have acquired novel sequence variations that allow them to infect humans (human adaptation), but have yet to acquire sequence variations that allow them to sustain efficient transmission between humans.

We will perform an in-depth statistical analysis using sequence records and analysis tools available in IRD (www.fludb.org) to clarify the origin of the HA segment and to identify candidate sequence variations that might be involved in this type of human adaptation.


Analysis Workflow

Highlight significant positions and Sequence Features on protein structure: - search for H7 HA 3D protein structures - highlight Meta-CATS positions and Sequence Features on a structure

Determine if the significant positions are located in Sequence Features: - follow the Sequence Feature linkage on the Meta-CATS report

- examine Sequence Features containing the significant positions

Visualize multiple sequence alignment: - align HA protein sequences

- observe the variant positions on the alignment

Run Metadata-driven Comparative Analysis Tool (Meta-CATS): - input the protein working set (4) to Meta-CATS - group the sequences into two groups: human H7N9 2013 outbreak isolates, and similar older H7 isolates - identify positions that are significantly different between the two sets - convert position coordinates on the H7 numbering scheme into coordinates on the H3 numbering scheme

Construct nucleotide phylogenetic tree: - construct phylogenetic tree using working set (3) - color tree to reveal host and subtype specific branching patterns

Search for sequences and save sequences into working sets: - search for H7N9 HA nucleotide sequences and save sequences as working set (1) - BLAST for HA nucleotide sequences similar to H7N9 2013 outbreak sequences and save them as working set (2) - combine working sets (1)-(2) into one (3) - convert nucleotide working set (3) into protein working set (4)


I. Search for sequences and save matching sequences into working sets

1. Search for H7N9 HA sequences using structured search interfaces

a. Go to the IRD homepage (http://www.fludb.org/), mouse-over “Search Data” in the grey navigation bar, then “Search Sequences” and click “Nucleotide Sequences”.

b. The Nucleotide Sequence Search page allows you to search for sequences based on data type, virus type, subtype, strain name, segment, host, geographical region, complete sequences or not, H1N1 pandemic sequences or not, and date range.

c. For this exercise, we are going to search for HA segment sequences from H7N9 strains. Select the following criteria and click the orange “Search” button to run the query. Virus Type: ý A Select Segments: ý 4 HA Sub Type: H7N9

Complete Sequences: ý Complete Sequences Only Advanced Options: ý Remove Duplicate Sequences

d. Note that IRD shows instant count of search results here to help you search quickly and efficiently. When you select search criteria on search pages, you will instantly know how many records match your search criteria without clicking the “Search” button and actually running the search.

e. The Search Results page will be displayed. Here you can:

i. Save the search query to your Workbench and rerun the search again later.


ii. Download the sequences (gene, CDS, protein) by clicking “Download”.

iii. Select records and run an analysis on the selected records by mousing-over the “Run Analysis” button and clicking a desired analysis option.

iv. Store selected sequences as a working set in the Workbench so that you can run various analyses on the working set.

v. View the details for any item in the results table by clicking on “View” next to any row.

f. Now click the “Collection Date” header in the results table to sort records by date. If you need advanced sorting options and want to display additional fields, click the “Display Settings” button.

How many HA segment sequences did you find from the current outbreak?

g. To analyze these sequences, we will select records by ticking the checkbox and adding them to a working set by clicking the “Add to working set” button. This way, we will be able to retrieve the data from the Workbench later and run various analyses on the same data set.

h. You’ll be prompted to log in to your Workbench account in order to save data to a working set. If you don’t have an account already, simply register for an account for free by choosing the “Register for a new account” option and following the prompts.

i. A lightbox of “Add to Working Set” will pop up. Now create a new working set and name it “H7N9 HA complete sequences”. Click “Add to Working Set” to save the sequences to a working set.

2. BLAST for HA sequences similar to H7N9 2013 outbreak sequences

Now we are going to expand the sequence set by including HA sequences that are highly similar to the outbreak sequences. We will select a representative isolate and perform a BLAST search of nucleotide sequences. The IRD BLAST tool utilizes the NCBI BLAST program set and has a collection of custom influenza sequence databases to search against.

a. Select A/Shanghai/02/2013 from the results table, mouse over “Run Analysis” and click “BLAST”.

2/10/14 Influenza Research Database - Nucleotide Sequence Search Results

www.fludb.org/brc/influenza_sequence_search_segment_display.do 1/2

Loading Influenza Research Database...

Add to Working Set Save Search Download

1 2 Next > Page: 1 of 2

Your search returned 95 segments. Search Criteria Displaying 50 records per page , sorted by Strain Name inascending order.

Display Settings

Nucleotide Sequence Search Results

Your Selected Items: 95 items selected | Deselect All

Select all 95 segments

More columns were returned than can be displayed without scrolling. Use scroll bars at top and bottom of display to move right and left or reduce the number ofcolumns displayed by using the Display Settings link above.

Segment ProteinName

SequenceAccession

CompleteGenome

SegmentLength Subtype * Collection

Date Host Species Country State/ProvinceFlu

Season (SOP)

Strain Name

4 HA HQ244407 No 1698 H7N9 01/26/2008 *EurasianTeal/Avian

Spain -N/A- 07-08 *A/Anas crecca/Spain/1460/2008(H7N9)

4 HA FN386467 No 1683 H7N9 01/26/2008 *EurasianTeal/Avian

Spain -N/A- 07-08 *A/Anas crecca/Spain/1460/2008(H7N9)

4 HA KF768181 No 1683 H7N9 2013 null China -N/A- -N/A- A/Anhui/PA-1/2013

4 HA CY067670 Yes 1731 H7N9 02/07/2008 *Blue-WingedTeal/Avian

Guatemala -N/A- -N/A- A/blue-winged teal/Guatemala/CIP049-01/2008

4 HA CY067678 Yes 1731 H7N9 03/05/2008 *Blue-WingedTeal/Avian

Guatemala -N/A- -N/A- A/blue-winged teal/Guatemala/CIP049-02/2008

4 HA CY024818 Yes 1697 H7N9 2006 Blue-WingedTeal/Avian

USA Ohio -N/A- *A/blue-winged teal/Ohio/566/2006(H7N9)

4 HA KF420296 Yes 1710 H7N9 03/22/2013 Human China -N/A- -N/A- *A/Changsha/1/2013(H7N9)

4 HA KF420297 Yes 1710 H7N9 04/29/2013 Human China -N/A- -N/A- *A/Changsha/2/2013(H7N9)

4 HA CY146908 * Yes 1683 H7N9 05/03/2013 Chicken/Avian China -N/A- -N/A- *A/chicken/Guangdong/SD641/2013(H7N9)

4 HA KF938946 * No 1696 H7N9 04/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Jiangsu/1021/2013

4 HA CY146916 Yes 1683 H7N9 04/16/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Jiangsu/S002/2013

4 HA CY146924 Yes 1683 H7N9 04/16/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Jiangsu/SC035/2013



4 HA CY146948 Yes 1683 H7N9 05/03/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Jiangxi/SD001/2013

4 HA KF259042 * No 1683 H7N9 2013 Chicken/Avian China -N/A- -N/A- *A/chicken/Rizhao/515/2013(H7N9)

4 HA KF259058 * No 1683 H7N9 2013 Chicken/Avian China -N/A- -N/A- A/chicken/Rizhao/713/2013

4 HA KF297293 * No 1683 H7N9 2013 Chicken/Avian China -N/A- -N/A- A/chicken/Rizhao/719b/2013

4 HA KF542876 Yes 1683 H7N9 04/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Shanghai/017/2013

4 HA CY146956 Yes 1683 H7N9 04/03/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Shanghai/S1053/2013


4 HA CY146972 No 1683 H7N9 04/03/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Shanghai/S1076/2013





4 HA KC899669 No 1683 H7N9 04/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Zhejiang/DTID-ZJU01/2013

4 HA KF768178 * No 1683 H7N9 2013 Chicken/Avian -N/A- -N/A- -N/A- A/chicken/Zhejiang/PA-DTID-ZJU01/2013

4 HA CY147036 Yes 1683 H7N9 04/22/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Zhejiang/SD007/2013

4 HA CY147052 Yes 1683 H7N9 04/11/2013 Chicken/Avian China -N/A- -N/A- A/chicken/Zhejiang/SD033/2013

4 HA CY147060 Yes 1683 H7N9 04/16/2013 Duck/Avian China -N/A- -N/A- A/duck/Anhui/SC702/2013

Run Analysis ▼

Home Nucleotide Sequence Search Results

SEARCH DATA ANALYZE & VISUALIZE WORKBENCH SUBMIT DATA

About Us Community Announcements Links Resources Support Workbench Sign In


b. In the Select Sequence Type lightbox, select “Nucleic Acid (Segment)” and click “Continue”.

c. Now the BLAST setting page is loaded. IRD provides a collection of custom influenza sequence databases to search against using BLAST. Select “Blastn”, “Nucleotides for segment 4 HA”, and “100” results to display. For the other parameters keep the default settings. Click “Run”.

d. On the BLAST Report page, all nearest hits are listed in the table. Click a hit to view its alignment. Click the IRD link (e.g., ird|982104) to view the hit’s segment/ protein details page in IRD. What are the host and subtype of the sequences that are most similar to the H7N9 outbreak sequences?

e. Return to the BLAST Report page by clicking the Results breadcrumb. Now select all hits by ticking the checkbox in the column header and click “Add to Working Set” to add these sequences to a new working set named “A/Shanghai/02/2013 BLAST hits”.

3. Construct segment and protein working sets

a. Now we are going to combine the working sets of H7N9 HA complete sequences and A/Shanghai/02/2013 BLAST hits and use the combined working set for downstream analyses.

2/10/14 Influenza Research Database - Sequence Similarity Search (BLAST)

www.fludb.org/brc/blast.do 1/1


Cite IRD Tutorials Glossary of Terms Report a Bug Request Web Training Contact Us Release Date: Jan 24, 2014

This project is funded by the National Institute of Allergy and Infectious Diseases (NIH / DHHS) under Contract No. HHSN266200400041C and is a collaboration between NorthropGrumman Health IT, J. Craig Venter Institute , Vecna Technologies, SAGE Analytica and Los Alamos National Laboratory.

Blastn (nucleotide)

Blastx (search protein using nucleotidetranslated in 6 reading frames)

FORMAT OF SEQUENCES PROVIDED

Nucleotides for segment 4 H1Nucleotides for segment 4 H3Nucleotides for segment 4 H5Nucleotides for segment 4 HANucleotides for segment 5 NPNucleotides for segment 6 N1

SELECT DATABASE TO SEARCH * Use generated database

Tip: To select multiple or deselect, Ctrl-click (Windows)or Cmd-click (MacOS)

Use working set to run blast

1 record was previously selected from searchresults

INPUT SEQUENCES

102550100150

OUTPUT FORMAT

Remove data from genetically manipulated strains from the resultNumber of Results to Display for each Input Sequence

ADVANCED OPTIONS

RemoveSelect Advanced Option

Select An Advanced Option

Both(Default)TopBottom

Strand

Existence: 0 Extension: 4Existence: 3 Extension: 3Existence: 6 Extension: 2Existence: 5 Extension: 2Existence: 4 Extension: 2

Gap Costs

TrueFalse

Apply Low Complexity Filter

Perform ungapped alignment

OTHER PARAMETERS

RunClear

ANALYSIS NAME

Show All

Add Another Advanced Option

Identify Similar Sequences (BLAST) Compare sequences you provide or select from the IRD database consisting of selected sets of sequences from the IRD database or create your own database forcomparison from a working set on your workbench.Note: An asterisk (*) = required field

Home Nucleotide Sequence Search Results Identify Similar Sequences (BLAST)



Save Analysis Add to Working Set

BLAST ReportBLASTN 2.2.22 [Sep-27-2009] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query:gb:KF021597|Organism:Influenza#A#virus#A/Shanghai/02/2013|Segme Database: Influenza nucleotide sequences Seg4 70,570 sequences; 95,790,805 total letters

Id Sequence header BitScore

EValue

982104 >ird|982104| Country:China| Influenza A virus (A/Shanghai/02/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KF021597

3386 0.0

970288 >ird|970288| Country:China| Influenza A virus (A/Zhejiang/DTID-ZJU01/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KC885956

3378 0.0

973128 >ird|973128| Country:China| Influenza A virus (A/Fujian/1/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete cds.|gb|KC994453 3328 0.0

979408 >ird|979408| Country:China| Influenza A virus (A/environment/Hangzhou/34/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KF001519

3320 0.0

978376 >ird|978376| Country:China| Influenza A virus (A/Hangzhou/1/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KC853766

3313 0.0


cds.|gb|KF001513

3305 0.0

985111 >ird|985111| Country:China| Influenza A virus (A/Nanjing/1/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete cds.|gb|KC896774 3305 0.0

985329 >ird|985329| Country:China| Influenza A virus (A/environment/Nanjing/2913/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KC896763

3305 0.0

974693 >ird|974693| Country:China| Influenza A virus (A/Shanghai/4664T/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KC853228

3303 0.0

986173 >ird|986173| Country:China| Influenza A virus (A/Zhejiang/HZ1/2013(H7N9)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KF055470

3297 0.0


cds.|gb|KF001516

3297 0.0

985117 >ird|985117| Country:Taiwan| Influenza A virus (A/Taiwan/T02081/2013(H7N9)) segment 4 hemagglutinin (HA) gene, partial

cds.|gb|KF018053

3253 0.0

984540 >ird|984540| Country:Taiwan| Influenza A virus (A/Taiwan/S02076/2013(H7N9)) segment 4 hemagglutinin (HA) gene, partial

cds.|gb|KF018045

3241 0.0

975767 >ird|975767| Country:China| Influenza A virus (A/chicken/Zhejiang/DTID-ZJU01/2013(H7N9)) segment 4 hemagglutinin (HA) gene,

complete cds.|gb|KC899669

3241 0.0

863382 >ird|863382| Country:China| Influenza A virus (A/duck/Zhejiang/12/2011(H7N3)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|JQ906576

2781 0.0


cds.|gb|JQ906575

2781 0.0


cds.|gb|JQ906573

2750 0.0


cds.|gb|JQ906574

2726 0.0

984601 >ird|984601| Country:China| Influenza A virus (A/duck/Zhejiang/DK16/2013(H7N3)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KC961629

2694 0.0

984375 >ird|984375| Country:China| Influenza A virus (A/duck/Zhejiang/DK10/2013(H7N3)) segment 4 hemagglutinin (HA) gene, complete

cds.|gb|KC961628

2694 0.0

328330 >ird|328330| Country:South Korea| Influenza A virus (A/wild bird feces/Korea/HDR22/2006(H7N7)) segment 4 hemagglutinin (HA) gene,

partial cds.|gb|FJ750875

2680 0.0

377511 >ird|377511| Country:Japan| Influenza A virus (A/duck/Shiga/B149/2007(H7N7)) HA gene for hemagglutinin, complete cds.|gb|AB558257 2662 0.0

914430 >ird|914430| Country:Mongolia| Influenza A virus (A/duck/Mongolia/867/2002(H7N1)) genomic RNA, segment 4, complete

sequence.|gb|AB473543

2662 0.0

193728 >ird|193728| Country:China| Influenza A virus (A/duck/Jiangxi/1742/03(H7N7)) hemagglutinin (HA) gene, partial cds.|gb|EU158108 2656 0.0



Home Nucleotide... Results Identify Similar Sequences (BLAST) Results




Click the “Workbench” tab from the grey navigation bar to go to your Workbench. You’ll see the saved working set listed at the top of the Workbench table.

b. Click the checkboxes for the two working sets we just saved. Click “More Actions” then “Combine”. Name the combined working set to “H7N9 HA complete seqs + blastn hits”. This working set contains the HA segment sequences from H7N9 isolates and sequences that are most similar to the HA segment sequences of the current H7N9 outbreak. You will notice the duplicate sequences are automatically removed.

c. Next, we are going to convert the combined segment working set into a protein sequence working set. To do so, select the combined working set. Click “More Actions” then “Convert”.

d. In the Convert Working Set lightbox, select Protein as data type and name the working set to: H7N9 HA complete seqs + blastn hits protein. Click “Convert”.

e. Access your Workbench by clicking the “Workbench” tab. You will see the newly created working sets at the top of the content list.

4. (Optional) Combine custom sequences with IRD sequences

When you have custom sequences and want to analyze your sequences along with IRD sequences, you can upload your sequence file to your Workbench and then combine the uploaded file with a saved working set.

a. Prepare your sequences in FASTA format.

b. Go to the Workbench, click “More Actions” then “Upload File”.

c. Next, name your file, choose File Type (i.e., “Unaligned FASTA”) and Data Type, click “Browse” and select the desired FASTA file on your computer. Choose the desired Fasta File Definition Line format. Click “Upload”. The uploaded file will be displayed in the Workbench table.

d. Now select desired working set. Click “More Actions” then “Combine with Uploaded File”.

e. In the pop-up lightbox, choose the desired sequence file from the list of uploaded files, name the combined working set, and click “Combine”. You will see the newly combined working sets at the top of the content list.

II. Construct an HA segment phylogenetic tree

a. Now we will construct a phylogenetic tree using HA segment sequences from H7N9 isolates and segment sequences that are most similar to the current H7N9 outbreak isolates.

b. We are going to add HA sequence from A/ruddy turnstone/DE/2378/1988 as an outlier to the H7N9 HA and closely related sequences.


i. Mouse over the “Search Data” tab and click “Nucleotide Sequences”.

ii. On the Nucleotide Sequence Search page, select “4 HA” from the “Select Segments” list, type “A/ruddy turnstone/DE/2378/1988” in the strain name box, and click “Search” to run the search.

iii. The search result page will show 1 segment. Select the segment by ticking the checkbox next to it. Then click the “Add to working set” button and add it to the working set: “H7N9 HA complete seqs + blastn hits”.

c. On the Workbench page, click “View” next to H7N9 HA complete seqs + blastn hits.

d. The working set details page displays the sequence records saved in the working set. Select all records by clicking the checkbox above the table. Mouse over “Run Analysis” and click “Generate Phylogenetic Tree”.

e. On the Tree setting page, select “Quick Tree”, choose strain name and date as tree tip label, and click “Build Tree”.

f. While the analysis is running, you can save the analysis to your Workbench by entering a name and then clicking “Save to Workbench”. Once it is saved, you can come back to the Workbench at any time to retrieve the analysis results.

g. After the analysis is finished, a View Phylogenetic Tree page will be loaded. Here you can save the phylogenetic file in Newick or PhyloXML format to your computer. Click “View Tree” to load the Archaeopteryx Tree Viewer window.



INPUT146 SEGMENTS SELECTED FOR TREE

LABEL TREE TIPS (ENDS) WITH

Strain Name

Specify custom format of tip label (max 4) Strain Name Accession Number Date Country USA State Segment Protein Symbol Season Type SubType Host Species 2009 pH1N1-like Phenotype Markers

ANALYSIS NAME

H7N9 2013 HA segment phylogeny

TREE GENERATION

Quick Tree (Let IRD set all parameters - view all parameters)

Custom Tree (I want to set my own parameters and/or I have a large

dataset)

Exclude 1 segments with poorly pre-aligned sequences

Use all 146 segments and run MUSCLE to align them

Build TreeClear

Generate Phylogenetic Tree Tutorial

The "Quick Tree" option uses PhyML [ Guindon, S. and Gascuel, O., (2003) Syst Biol. 52: 696-704 ] and IRD-defined settings to infer phylogenies based on sequences fordatasets of at most 1000 sequences. The "Custom Tree" option offers a choice between the PhyML or RaxML [Stamatakis, A. et al. (2005) Bioinformatics 21: 456-463]algorithms, and the ability to define parameter settings. The Custom Tree option, using RaxML, must be used for datasets exceeding 1000 sequences. Click here to view atutorial on generating a phylogenetic tree using IRD tools.

Home My Workbench Working Set (H7N9 2013 HA nr + BLAST) Generate Phylogenetic Tree


About Us Community Announcements Links Resources Support Sign Out

You are logged in as [email protected]

Influenza Research Database - Phylogenetic Tree http://www.fludb.org/brc/tree.do?method=RunFromDataList&t...

1 of 1 2/10/14 7:42 PM


h. A Tree Viewer window will pop up. Many tree customization options exist including: reroot the tree, collapse/expand/display subtree, swap descendants, decorate (color) the tree leaves by any associated metadata (e.g. host or year of isolation, etc.), resize the tree, zoom in/out, fit the tree to window, change the font size, etc.

i. In the “Tree Decorations” section, select “HA&NA Subtype” from the “Basic Decoration Options”. Click “Show Legend” to display the color code for different subtypes.

ii. The default colors may or may not be ideal for your purpose. You can change the color by using the “Advanced Decoration”. In the Advanced Decoration Options dialog box, select “HA&NA Subtype”, click the Manual Decoration checkbox and click “Go”.

iii. Check H7N9 and choose red in the color palette, then click “Apply”. Now the H7N9 strains are colored in red.

iv. The tree shows that the HA sequences from the H7N9 outbreak are similar to the HA sequences from a series of duck H7N3 isolates from Zhejiang and a series of poultry H7N2/3/7 isolates mostly from Wenzhou, suggesting that at least two independent interspecies transmissions occurred and the HA segment of the new H7N9 isolates was likely derived from an H7N3 ancestor.

v. You can save the tree image by clicking the “File” menu and then a file format.

i. Return to the Tree Results page. Save the tree analysis to your Workbench by clicking “Save Analysis”. Rename the analysis so that you can recognize it later, for example, “H7N9 HA segment phylogeny”. Then click “Save”.



Save Analysis Newick File PhyloXml File Phylip File Tree Parameters PhyML Log Tree Build Parameters

View Tree

The IRD Tree Decorator is a custom-enhancement of Archaeopteryx . The original FORESTER/ATV library is freely available from SourceForge . Credits: Zmasek C.M. and Eddy S.R.

(2001) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics, 17, 383-384.

Click the "View Tree" button below to launch the tree viewer software in a new window. If you prefer other viewing software, the tree data is available fordownload in Newick or PhyloXml format using the buttons above.

Due to security concerns, certain browsers (e.g. Safari and Firefox) have disabled Java plug-ins by default. If the Tree Viewer takes a long time toload, please test your browser's Java plug-in to make sure it can display Java Applets properly.Safari has recently tightened the security settings on Java Applets, which may affect image export functions of the Tree Viewer. Click Here forinstructions on how to fix this.

ENHANCED TREE VIEWER

The IRD team provides software that allows 'decoration' of your tree by features such as host species, year, country, and subtype. This custom software isbased on Archaeopteryx . In the tree viewer, use the drop-down menu for basic decoration or advanced decoration to select the feature for coloring. Thedecorated tree and corresponding legend can be exported using options in the File drop-down menu.

A user's guide is available. How to create a publication quality tree image

View Phylogenetic TreeHome My Workbench Working... Generate Phylogenetic Tree Results




Influenza Research Database - Phylogenetic Tree Viewer http://www.fludb.org/brc/tree.do?decorator=influenza&method...

1 of 1 2/10/14 7:45 PM


j. Go to your Workbench. You can see the tree is listed at the top of the Workbench table. Click “View” to retrieve the tree analysis result. The parameters used to generate the tree are also saved.

III. Metadata-driven Comparative Analysis Tool for Sequences (Meta-CATS)

Metadata-driven Comparative Analysis Tool for Sequences (Meta-CATS) • A unique comparative genomics analysis tool in IRD to identify nucleotide

/amino acid positions that significantly differ between two or more groups of virus sequences.

• Meta-CATS consists of three parts: a multiple sequence alignment (using MUSCLE), a chi-square goodness of fit test to identify positions (columns) of the multiple sequence alignment that significantly differ from the expected (random) distribution of residues between all metadata groups, and a Pearson's chi-square test to identify the specific pairs of metadata groups that contribute to the observed statistical difference.

• Picket BE, et al. (2013) "Metadata-driven Comparative Analysis Tool for Sequences (meta-CATS): an Automated Process for Identifying Significant Sequence Variations Dependent on Differences in Viral Metadata." Virology, 447(1-2):45-51. doi: 10.1016/j.virol.2013.08.021.


Now we will use Meta-CATS to identify amino acid positions on the HA protein that significantly differ between the human isolates from the current outbreak and older H7 isolates.

a. We are going to analyze the protein sequences we saved previously in the working set: H7N9 HA complete seqs + H7N9 blastn hits protein. So go to your Workbench, find the working set and click “View” to display the sequences.

b. Sort the list by the “Date” column. Now select the following protein sequence records:

i. All human H7N9 from 2013

ii. All non-human H7 isolated before 2012

iii. Mouse over “Run Analysis” and click “Metadata-driven Comparative Analysis Tool”.

c. You are taken to the meta-CATS tool setting page. Here we will separate these sequences into two groups according to the phylogenetic tree analysis: 2013 H7N9 isolates as a group and the rest sequences as another group. We can do so by grouping the sequences by year. Select “Auto Grouping”, and then select “Year” from the dropdown list. Now enter year break point “2012” to get groups of: (1) 2012 & before (Eurasian H7 ancestral isolates), and (2) > 2012 (human H7N9 2013 human outbreak isolates). Select “Unaligned FASTA”, use “0.05” as our significance cutoff value and click “Continue”.

d. On the next page, you will see the sequences are separated into two groups: Group 1 containing <=2012 sequences, and Group 2 containing >2012 sequences. Click the “Run” button.

e. This analysis may take a few minutes to finish. You can save the analysis to your Workbench and retrieve it later. To do so, enter in a name (Ex., human H7N9 2013 HA vs. older H7) and click “Save to Workbench”.

2/10/14 Influenza Research Database - Metadata-driven Comparative Analysis Tool (meta-CATS) - Setup Subset

www.fludb.org/brc/mgc.do 1/1




33 sequencesAdd | Remove

32 sequencesAdd | Remove

RunClear

MAIN LIST OF SEQUENCES

LIST OF SEQUENCES GROUP 1

LIST OF SEQUENCES GROUP 2

-N/A-|-N/A-|Q3KRH9|A/mallard/Sweden/91/02|AY999981-N/A-|-N/A-|Q0A3N8|A/turkey/Minnesota/1/1988|CY014786-N/A-|-N/A-|A7IRT3|A/blue-winged teal/Ohio/566/2006|CY024818-N/A-|-N/A-|B9X1H5|A/duck/Mongolia/119/2008|AB481212-N/A-|-N/A-|C7C2V7|A/Anas crecca/Spain/1460/2008|FN386467HA|-N/A-|E2JIX6|A/goose/Czech Republic/1848-T14/2009(H7N9)|HQ244415HA|-N/A-|D0FI67|A/goose/Czech Republic/1848-K9/2009(H7N9)|GU060482

HA|-N/A-|M9TFV7|A/Zhejiang/DTID-ZJU01/2013|KC885956HA|-N/A-|N0D493|A/Fujian/1/2013|KC994453HA|-N/A-|M4YV75|A/Hangzhou/1/2013|KC853766HA|-N/A-|R4JCY8|A/Hangzhou/3/2013|KF001516HA|-N/A-|R4JC40|A/Hangzhou/2/2013|KF001513HA|-N/A-|R4SCY6|A/Taiwan/S02076/2013|KF018045HA|-N/A-|R4J9D5|A/Nanjing/1/2013|KC896774

Metadata-driven Comparative Analysis Tool (meta-CATS) - Setup Subset Mouse click on the sequences to select them, Ctrl and/or shift key enable the multiple selection. After selecting the sequences, click on "Add" to move into the group."Remove" will send the selected sequences back to the main list.

Auto grouping on: YearGroup1: <=2012 33

Group2: >2012 32

Home My Workbench Working Set (H7N9 2013 HA nr + BLAST protein) Metadata-driven Comparative Analysis Tool for Sequences (meta-CATS)





f. The Meta-CATS analysis result has two reports: a Chi-square Test of Independence result table listing the positions that have a significant non-random distribution between your specified groups, and a Pearson's chi-square test result table listing the specific pairs of groups that contribute to the observed statistical difference. Since this analysis only deals with two groups of sequences, we will primarily focus on the first result table.

g. Review the Chi-square test results to see the positions that differ significantly between the current H7N9 outbreak isolates and other isolates. The residue diversity column lists the counts for each residue within a group. Now sort the results by the Chi-square value to push the most different positions to the top of the table. What is the position number with the highest Chi-square value?

h. Some publications record H7 positions on an H3 numbering scheme. To compare the variant positions identified by meta-CATS with positions reported in publications, we will convert the position numbers of the current analysis to the H3 numbering scheme by selecting the “H3 A/Aichi/2/1968(H3N2)” reference strain from the “Reference Coordinate” dropdown menu.

i. The numbering scheme from the custom alignment of the input sequences are now mapped to both the H3 HA0 (uncleaved) and H3 HA1/HA2 (cleaved) coordinates.

j. Save the analysis result to your Workbench by clicking the “Save Analysis” button.

2/10/14 Influenza Research Database - Metadata-driven Comparative Analysis Tool (meta-CATS) Report

www.fludb.org/brc/mgc.do?method=RetrieveResults&ticketNumber=MG_599805474639&selectionContext=1392082613344&sortBy=cs.csvalue&sortOrder=desc 1/5


Save Analysis Generate Phylogenetic Tree Visualize Aligned Sequences

Metadata-driven Comparative Analysis Report (Ticket# MG_599805474639)

The Metadata-driven Comparative Analysis Tool (meta-CATS) (SOP) consists of three parts: a multiple sequence alignment (using MUSCLE), a chi-square goodness of fit testto identify positions (columns) of the multiple sequence alignment that significantly differ from the expected (random) distribution of residues between all metadata groups,and a Pearson's chi-square test to identify the specific pairs of metadata groups that contribute to the observed statistical difference. When 3 or more groups are included in the analysis, the P-value from the Goodness of Fit test will identify columns having significant variation between all groups, while thePearson's test will identify the specific pair(s) of groups that make the column significant (i.e. if those groups were not included in the analysis, the column would no longer beidentified as significant).

The link (ViewSF) of each position to Sequence Feature page is visible only if the SF has been defined AND the set of PROTEINS data with the same type: subtype orortholog.

Reference CoordinateSelect from the drop-down list to convert the existing position numbers to a different numbering (coordinate) scheme using a Reference Sequence (e.g. convert from H1 to H3numbering)

Select a Reference to Calculate Ref CoordinateSegment_Number:Protein_Name Strain_Name(Subtype)

Chi-square Test of Independence Result

There are 92 positions that have a significant non-random distribution between the specified groups."*" in position column indicates fewer than 5 non-zero residues in any cell of the contingency table.

Position Chi-square Value P-value Degree Freedom Residue Diversity Sequence Feature

235* 64.994 7.704E-15 2 group1(33 Q)group2(1 I, 31 L)

View SF

195* 64.994 7.704E-15 2 group1(1 A, 32 G)group2(32 V)

View SF

183 64.994 7.704E-15 2 group1(17 D, 16 K)group2(32 S)

N/A

541 61.057 5.545E-15 1 group1(33 A)group2(32 V)

N/A

455 61.057 5.545E-15 1 group1(33 N)group2(32 D)

N/A

410* 57.463 3.327E-13 2 group1(2 N, 14 S, 17 T)group2(32 N)

N/A

321* 50.765 2.499E-10 4 group1(12 E, 1 K, 1 P, 4 R, 15 T)group2(32 R)

N/A

198* 47.694 4.4E-11 2 group1(5 A, 2 N, 26 T)group2(32 A)

View SF

130 47.694 4.4E-11 2 group1(5 A, 16 S, 12 T)group2(32 A)

N/A

11* 47.692 2.477E-10 3 group1(16 C, 5 I, 1 M, 11 V)group2(32 I)

N/A

188* 44.78 1.889E-10 2 group1(1 A, 26 I, 6 V)group2(32 V)

N/A

307 44.298 2.82E-11 1 group1(5 D, 28 N)group2(32 D)

N/A

211 44.298 2.82E-11 1 group1(28 I, 5 V)group2(32 V)

N/A

427 38.799 4.698E-10 1 group1(7 I, 26 M)group2(32 I)

N/A

335* 30.074 4.727E-6 4 group1(12 I, 4 K, 4 L, 12 N, 1 S)group2(32 I)

N/A

285* 29.31 1.927E-6 3 group1(4 D, 13 G, 11 N, 5 S)group2(1 D, 31 N)

N/A

462 25.238 5.066E-7 1 group1(13 K, 20 R)group2(32 K)

N/A

414* 24.135 5.743E-6 2 group1(15 K, 16 Q, 2 R) N/A

Home My Workbench Working... Metadata-driven Comparative Analysis Tool for Sequences (meta-CATS) Results





IV. Visualize protein sequence alignment

Now we are going to view the protein sequence alignment to confirm the meta-CATS results and to verify clade relationships inferred from the phylogenetic analysis.

a. From the meta-CATS report page, click “Visualize Aligned Sequences” at the top of the page.

Note: You can also run an alignment on the saved sequence working set by navigating to the working set in your Workbench area and then clicking the “Visualize Aligned Sequences” option from the “Run Analysis” pull down menu.

b. The alignment is presented in the JalView visualization window. The window is interactive.

i. The consensus sequence is shown at the bottom of the window. You can choose to show sequence logos by right-clicking on consensus and then selecting "Show logo".

ii. You can manually adjust the alignment and display using various gray menu options.

iii. Scroll right up to the region of 185-235. Several amino acid substitutions, including G195V and Q235L/I are observed in all recent H7N9 isolates, but are absent from the older H7 proteins.

iv. We can change the View Option to “Conserved vs. reference” such that only the first sequence shows full characters, for the remaining sequences, only the nucleotides/residues differing from the reference sequence are shown as full characters.

v. You can download the input sequences or alignment in various formats, or save the alignment to your Workbench. Click “Save Analysis”, give it a name, and click “Save”.

V. Determine if the significant positions are located in Sequence Features

Sequence Features (SFs) are defined as interesting protein regions with known structural or functional properties. They are obtained from literature and other databases and validated by domain experts. Once a Sequence Feature region has been defined, the number of distinct amino acid sequences


observed in the sequence database are determined and each defined as a unique variant type. The reference strain is always Variant Type 1.

The Sequence Feature (SF) column in the meta-CATS table provides a convenient linkout to a list of all Sequence Features that contain that amino acid position.

a. Return to the meta-CATS report page. Click the “View SF” link for residue position 195. How many Sequence Features are mapped to this position?

b. Click “View” for SF4 to get to the Sequence Feature (SF) Details page.

i. This SF is determinants of receptor binding. It begins at residue 194 and is 5 amino acids long. What is the reference strain used to define the position coordinates of this SF? What is the position range on the reference strain?

ii. How many Variant Types does this SF have?

iii. All 2013 outbreak isolates have a Valine at position 195 and an Alanine at position 198. Now we are going to search for all strains harboring V195 and A198. Click “Find a VT”, fill all positions with wildcards and fill in V at 195 and A at 198. Then click “Search”.

iv. Did you find a VT harboring SVSTA in the 194-198 loop? Click strain count for the VT. Are they from the current outbreak?

2/10/14 Influenza Research Database - Sequence Feature Variant Types

www.fludb.org/brc/influenza_sequenceFeatureVariantTypes_search.do?method=SubmitForm&sourcePosition=195&virusType=3&virusASegmentsProtein=4 HA… 1/1




Download Save Search

Download Save Search

Your search returned 1 record. Search Criteria Displaying 50 records per page , sorted by Feature Identifier inascending order.

Display Settings

Sequence Feature Variant Types

Your Selected Items: 0 items selected

Select the 1 results

FeatureIdentifier Sequence Feature Name

SequenceFeatureLength

VariantTypes Category Amino Acid

Position Evidence Comments

InfluenzaA_H7_SF4

InfluenzaA_H7_determinants-of-receptor-binding_194(5)

5 35 functional 194(178 HA1)-198 PubMed:22345462 Atypical European viruses showG186V whereas the N. Americanstrains show G186A or G186E.

Your Selected Items: 0 items selected

Top* Please mouse-over the asterisk to see additional information

Custom Amino Acid Position Amino Acid Position in SF Reference Strain

195 195

It is possible that the numbered positions from the custom alignment and the numbered positions in the Sequence Featuare Reference Strain are different - even though theyrefere to the equivalent sequence posiiton. Below is a table showing how the custom position relates to the homologous position in the reference strain.

Home Sequence Feature




Cite IRD Tutorials Glossary of Terms Report a Bug Request Web Training Contact Us Release Date: Jul 3, 2013


SEQUENCE FEATURE DEFINITION

Protein Name HASequence Feature Name Influenza A_H7_determinants-of-receptor-binding_194(5)Sequence Feature ID Influenza A_H7_SF4VT-1 Strain (reference strain) A/turkey/Italy/220158/2002(H7N3)Reference Sequence Accession AY586409Reference Position 194(178 HA1)-198

SOURCE STRAIN(S)

Source Strain VTNumber

SourcePosition

SourceAccession

3D ProteinStructure Publication Evidence

Codes Comment

A/duck/HONGKONG/293

/1978(H7N2)

VT-1 185 -189 U20461 -N/A- PubMed:22345462 EXP Atypical European viruses show G186Vwhereas the N. American strains show

G186A or G186E.

Excel Download Find a VT(s)

Search

Reset

Fill wildcards

Strain Count

VARIANT TYPES

Phylogenetic tree view disabled because there are not enough variant types to generate the tree.

Edit specific positions in this VT-1 sequence with IUPAC symbols or use "?" as a wild-card. If necessary, use the horizontal scroll bar to access the entire SF.Click Search to find VT(s) conforming to the edited sequence. Click Reset to restore this panel to the default VT-1 sequence.

Enter Sequence Variation to Find

194

?

195

V

196

?

197

?

198

A

Variant Type388 VT-114 VT-11

Sequence Variation

194 195 196 197 198S G S T T

V A

Total Variations02

"?" : A question mark indicates the sequence is unknown."–" : Indicates a deletion (gap) relative to the VT-1 sequence."•" : Indicates the same letter as in the VT-1 sequence."[ ]" : Indicates an insertion.

Sequence Feature Details (SOP)For more information about using SFVTs, click here . For a detailed description of the development and application of the SFVT approach for the study of influenza virus, pleaseread this scientific article: Noronha JM, Liu M, Squires RB, Pickett BE, Hale BG, Air GM, Galloway SE, Takimoto T, Schmolke M, Hunt V, Klem E, García-Sastre A, McGee M,Scheuermann RH. (2012) Influenza Sequence Feature Variant Type (Flu-SFVT) analysis: evidence for a role of NS1 in influenza host range restriction. J Virol, 86: 5857-5866.doi: 10.1128/JVI.06901-11. PMID: 22398283

Home My Workbench Analysis... Sequence... Sequence Feature Details (Influenza A_H7_determinants-of-receptor-binding_194(5)) Sequence Feature Details (Influenza A_H7_d





VI. Highlight variant positions and Sequence Features on protein structure

IRD imports experimentally-determined virus protein structures from the Protein Data Bank, integrates data from IEDB and UniProt and provides various visualization options. To investigate the structural implications of the sequence variants identified by the IRD meta-CATS analysis, we are going to highlight the positions on a related H7 protein structure.

a. From the grey navigation bar, mouse over “Search Data” and click “3D Protein Structures”.

b. Search for the 3D structures of influenza A HA subtype H7. Virus Type: ý A Subtype: H7 Select Proteins to search: ý 4 HA

c. The Search Results page displays a list of matching structures. We are going to examine the HA structure of H7N3 subtype, so click “View Structure” for 1TI8 to display the structure.

d. Now we are on the 3D protein structure viewer page. Click and drag with your mouse in display window to change the focus point.

i. In the “Display Options” section, you can change the Display Type to line, stick, space, primary structure, secondary structure, etc. We are going to select “Space”.

ii. You can overlay the structure with a sequence conservation heat map by selecting “Sequence conservation computed using all sequences” in the “Highlight Sequence Conservation” section.

iii. This structure is obtained from A/turkey/Italy/214845/02 and the position numbering in our meta-CATS analysis is the same as this strain. Now select Chain A and type in “195, 235” in the “Highlight by SWISS-PROT Position” section to highlight these binding determinants on the structure. G195V and Q235L could increase the binding of avian H5 and H7 viruses to human-type receptors (Yamada, 2006; Srinivasan, 2013). Alternatively, you can highlight the Sequence Feature of receptor binding determinants by checking “Influenza A_H7_determinants-of-receptor-binding_194(5)” from the functional dropdown list in the “Highlight Sequence Features” section.

iv. Variant position 235 is also located within an experimentally determined epitope. Highlight this epitope on the structure by checking “Influenza A_H7_experimentally-determined-epitope_226(14)” from the epitope dropdown list in the “Highlight Sequence Features” section.

v. Click “Spin” to view the structure spinning. Then click “Rock” to rock the structure back and forth.

vi. The custom highlighted protein structure can be downloaded as an image by clicking “Save View As Image” beneath the image, or a 3D movie of either a spinning structure or a rocking structure by clicking “Generate Video”.


References

Noronha JM, et al. Influenza virus sequence feature variant type analysis: evidence of a role for NS1 in influenza virus host range restriction. J Virol. 2012 May;86(10):5857-66. doi: 10.1128/JVI.06901-11. PMID:22398283

Picket BE, et al. Metadata-driven Comparative Analysis Tool for Sequences (meta-CATS): an Automated Process for Identifying Significant Sequence Variations Dependent on Differences in Viral Metadata. Virology, 2013, 447(1-2):45-51. doi: 10.1016/j.virol.2013.08.021. PMID: 24210098

Sorrel EM, et al. Minimal molecular constraints for respiratory droplet transmission of an avian-human H9N2 influenza A virus. Proc Natl Acad Sci U S A. 2009 May 5;106(18):7565-70. doi: 10.1073/pnas.0900877106. PMID:19380727

Srinivasan K, et al. Quantitative description of glycan-receptor binding of influenza a virus h7 hemagglutinin. PLoS One. 2013;8(2):e49597. doi: 10.1371/journal.pone.0049597. PMID:23437033

Yamada S, et al. Haemagglutinin mutations responsible for the binding of H5N1 influenza A viruses to human-type receptors. Nature. 2006 Nov 16;444(7117):378-82. PMID:17108965

Reset View Save View As Image

Type JMol command here. Run

Show JMol Console

Spin Rock

Generate mp4 movie (default=avi) Generate Video

Description: H7 HAEMAGGLUTININ

Strain Name: A/turkey/Italy/214845/02

PDB Link: 1TI8

JMOL COMMAND LINE:Advanced users can enter a JMol command directly using JMol Interactive Script .

SAVE/RESTORE STATE You can save the current state of the image (Display Options, Zoom, Highlight Epitopes,etc.) for later use. Click the Show JMol Console button below, click State, andcopy/paste/save the contents of the upper panel into any text document. When youreopen this PDB file in a later session, you can restore the image to the saved state.Click the Show JMol Console button, paste the text into the lower panel, and clickExecute.

GENERATE 3D MOVIEGenerate a mp4/avi movie, starting with the rendering in above window and spinningaround Y axis by 360 deg. It takes about 30 seconds to generate the movie.

Display Type: Space

Details

Show:

Turn off sequence conservation display

Clear

Highlight Clear

epitope Clear

DISPLAY OPTIONS:These options control the general appearance of the protein structure inthe viewer.

HIGHLIGHT SEQUENCE CONSERVATION Overlay the structure with a sequence conservation "heat map" (see SOPfor details). Blue represents conserved regions and red representsnon-conserved regions

HIGHLIGHT LIGANDS Highlight Ligands in

HIGHLIGHT EPITOPES Highlight epitopes on the structure in . First, select an epitope type

from the list. Then check epitopes to highlight. Flash epitope during rocking

HIGHLIGHT BY SWISS-PROT POSITION Highlight in in this structure corresponding to defined SwissProt

positions. With the dropdown, select a chain and view the positions presentin this file. Enter one or more comma-delimited positions (15,30) or a range(15-30) then click Highlight. If no chain is selected, these positions will behighlighted on all chains which include them.

Select Chain:

A - Q6GYW3 (21 - 334) 195,235

PDB Sequence/Structure Details

HIGHLIGHT SEQUENCE FEATURESHighlight sequence features on the structures in

Sequence Feature Feature Name SequenceFeaturePositions

Zoom: 100%

Spin

Rock Angle: 12 Speed: 10 Pause: 0.0

Influenza A_H7_SF3

InfluenzaA_H7_experimentally-determined-epitope_340(14)

340-353

Influenza A_H7_SF5

InfluenzaA_H7_experimentally-determined-

226-239

Home 3D Protein Structure Search Results Protein Structure Viewer (1TI8)



Influenza Research Database - Protein Structure http://www.fludb.org/brc/structureOperation.do?accession=1TI...

1 of 2 2/11/14 4:30 PM

Documents

Section B. Comparative Genomics Analysis of 2013 H7N9 ......4! Created by the ViPR/IRD team and licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License Section