Galaxy for Bioinformatics Analysis An Introduction TCD Bioinformatics Support Team Fiona Roche, PhD Date: 31/08/15

Galaxy for Bioinformatics Analysis
An Introduction

TCD Bioinformatics Support Team
Fiona Roche, PhD
Date: 31/08/15

Overview

• What is Galaxy?

• Why is it useful?

• Command-line vs Galaxy

• A Basic Analysis with Galaxy

• Resources for Learning

What is Galaxy?

A web-based genome analysis platform designed for experimental biologists

www.galaxyproject.org

Why is it useful to a biologist?

• Easy to use!
• Allows data import from popular resources
• Provides access to best practice bioinformatics tools
• Allows you to build analysis pipelines and share them
• Provides multiple ways to visualise your data

Trinity College Dublin, The University of Dublin

Case Study: ChIP-seq Analysis Pipeline

• Sequencing
• Pre-processing of raw reads
• Quality control
• Map reads to reference genome
• Peak calling
• Enriched regions

Case Study: ChIP-seq Analysis Pipeline

• Sequencing
• Pre-processing of raw reads
• Quality control
• Map reads to reference genome
• Peak calling
• Enriched regions
• Visualisation with genome browser
• Motif discovery
• Relationship with gene structure
• Gene set analysis
• Differential profile analysis

Question?

Which promoter regions of genes do these enriched regions map to?

Command-line approach

1. Extract gene coordinates from UCSC
2. Extract 1kb upstream coordinates from UCSC
3. Merge upstream coordinates and gene annotation
4. Clean files
5. Join the input files
6. Create user track for UCSC
7. Import to UCSC
8. Run a wrapper script to enable a re-run of this pipeline with different parameters.

Galaxy Approach

The Galaxy Interface

Datasources and Tools

Main Analysis window

History of commands

Main Menus

Overview of Analysis

Question: Which gene promoter regions do these enriched regions map to?

Analysis steps:

• Import two datasets into Galaxy
  1. Genomic coordinates of enriched peaks
  2. Genomic coordinates of genes
• Extract upstream regions of genes
• Data cleaning
• Identify overlap between promoter regions and enriched regions
• Visualise on a genome browser

Let’s begin!

Register an account

http://bioinf.gen.tcd.ie/workshops/Galaxy/

Let’s begin!

Step 1: Get Data into Galaxy

Step 1: Get data #1 TAF1 peaks

Get Data -> Upload File -> Paste/Fetch -> Enter URL -> Start

1. Click Upload File
2. Click Paste/Fetch to display the URL box
3. Paste in the URL containing your data: http://bioinf.gen.tcd.ie/workshops/Galaxy/TAF1_peaks.txt
4. Select ‘tabular’ file type
5. Type hg19 and specify Human Feb. 2009 (GRCh37/hg19) (hg19)
6. Click Start to upload the data to your history!
7. Click Close

Data uploaded to your history!

The file was sent to your history and given a number

The history keeps track of all steps in your analysis

Step 2: Rename your History

1. Click here to rename your history
2. Click the cog wheel if you want to create a new history or see a list of your saved histories

You can have multiple histories with different names

Step 3: Review your dataset

1. Click on the dataset name to expand/collapse the metadata and mini view of the file content
2. Click the eye icon to see the file contents in the main analysis window
3. Click the pencil icon to edit the file attributes
4. Click the x to delete the file

Step 4a: Edit dataset

Change File name to a shorter name

1. Click the pencil icon to edit the file attributes
2. There are four tabs in edit mode. To change the file name, click Attributes
3. First rename the file
4. Copy and paste the old name into the info to keep a record of it
5. Click save

Step 4b: Edit dataset

Change File format so Galaxy knows where to find chr, start, end

1. Click Datatype to change the file format
2. Select interval from drop down and then click save
3. Define which columns of your TAF1 file are “chrom”, “start” and “end”. Look at the mini view image to see your TAF1 file
4. Click save
5. Format changed to interval. Galaxy now knows where chr, start and end are.
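
To make the idea of an interval format concrete, here is a minimal Python sketch (not Galaxy's own code) of reading a tab-separated file once the chrom, start and end columns have been declared. The default column positions (1, 2, 3) and the sample rows are assumptions for illustration only; check the mini view of your own TAF1 file for the real layout.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_intervals(text, chrom_col=1, start_col=2, end_col=3):
    """Parse tab-separated lines into Interval records.

    Column numbers are 1-based in this sketch.
    """
    intervals = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        intervals.append(Interval(fields[chrom_col - 1],
                                  int(fields[start_col - 1]),
                                  int(fields[end_col - 1])))
    return intervals


# Hypothetical rows, not taken from TAF1_peaks.txt
sample = "chr1\t100\t500\tpeak1\nchr1\t1000\t1200\tpeak2\n"
peaks = parse_intervals(sample)
```

If the coordinates sat in other columns, only the three column arguments would change, which is roughly what the Datatype form records.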

Step 5: Get data #2 -> Genes

Get Data -> UCSC Main Table Browser

Step 5: Get data #2 -> Genes

Ensure all drop downs are selected as shown below

1. Select all fields from drop downs as shown above, then click get output

2. Click Send query to Galaxy

Step 6: Edit dataset

Click the pencil icon to edit the file name

Change File name to a shorter name

File name changed

File format = bed. Galaxy already knows where chr, start and end are

Step 7: Get Promoter Regions

Tool: Operate on Genomic Intervals -> Get Flanks

1. Select Genes dataset
2. Select upstream
3. Select 1000bp upstream
4. Click Execute
5. Output sent to history! Same file content as ‘Genes’ but start and end coordinates are replaced with promoter regions
6. Rename file to ‘Promoters’
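
As a rough illustration of what taking an upstream flank means, the Python sketch below computes a promoter region from a gene's coordinates and strand. It is an assumption-laden sketch, not the Get Flanks implementation: it assumes BED-style coordinates (0-based start, end not included), and the function name and example gene are hypothetical.

```python
def upstream_flank(chrom, start, end, strand, length=1000):
    """Return (chrom, start, end, strand) of the region upstream of a gene.

    Assumes BED-style coordinates: 0-based start, end not included.
    """
    if strand == "-":
        # On the minus strand the gene begins at `end`, so upstream lies beyond it
        return (chrom, end, end + length, strand)
    # On the plus strand upstream lies before `start`; never go below 0
    return (chrom, max(0, start - length), start, strand)


# Hypothetical gene at chr1:5000-9000, once on each strand
promoter_plus = upstream_flank("chr1", 5000, 9000, "+")
promoter_minus = upstream_flank("chr1", 5000, 9000, "-")
```

Note that "upstream" depends on the strand, which is why the strand column has to survive any later cleaning step. The sketch does not clip regions at the end of a chromosome.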

Step 8: Clean dataset

Tool: Text Manipulation -> Cut

1. Cut out the specific columns we want from the ‘Promoters’ file

2. Click Execute

3. Rename the output file to ‘Clean Promoters’
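
Cutting columns can be pictured with a few lines of Python. This is only a sketch of the idea: the slide does not say which columns are kept, so the column list and the sample rows below are hypothetical.

```python
def cut_columns(text, columns):
    """Keep only the given 1-based columns from tab-separated text."""
    out = []
    for line in text.splitlines():
        fields = line.split("\t")
        out.append("\t".join(fields[c - 1] for c in columns))
    return "\n".join(out)


# Hypothetical promoter rows: chrom, start, end, name, score, strand
promoters = "chr1\t4000\t5000\tgeneA\t0\t+\nchr2\t9000\t10000\tgeneB\t0\t-"

# Drop the score column, keep everything else
clean = cut_columns(promoters, [1, 2, 3, 4, 6])
```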

Datasets ready for analysis!

Both files are associated with human hg19

Galaxy knows for each file where chr, start and end are.

Now, we are ready to join these files and see which promoters have TAF1 peaks!

Dataset #1 Dataset #2

How do we Join Genomic Intervals?

Example

Interval file #1

Chr   Start  End   Name    Strand
Chr1  100    500   int1    +
Chr1  1000   1200  int2    +

Interval file #2

Chr   Start  End   Name    Strand
Chr1  200    400   cloneA  +
Chr1  900    1000  cloneB  +

Intervals that overlap!

Chr1  100  500  int1  +    Chr1  200  400  cloneA  +

(Diagram: #1 spans 100-500 and 1000-1200; #2 spans 200-400 and 900-1000.)
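
The overlap test behind this example can be sketched in a few lines of Python. The sketch assumes half-open coordinates (end not included), under which 900-1000 and 1000-1200 touch but do not overlap, matching the single overlapping pair shown on the slide. The function name is illustrative and this is not the Join tool's own code.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, ...) intervals share at least one base.

    Assumes half-open coordinates (end not included), so intervals that
    merely touch, such as 900-1000 and 1000-1200, do not overlap.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


# The two interval files from the example
file1 = [("Chr1", 100, 500, "int1"), ("Chr1", 1000, 1200, "int2")]
file2 = [("Chr1", 200, 400, "cloneA"), ("Chr1", 900, 1000, "cloneB")]

# Compare every interval in file #1 with every interval in file #2
pairs = [(a[3], b[3]) for a in file1 for b in file2 if overlaps(a, b)]
print(pairs)  # [('int1', 'cloneA')]
```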

Step 9: Join on Genomic Intervals

Tool: Operate on Genomic Intervals -> Join

The first dataset is the one we want to filter (i.e. the large dataset containing all of the promoter regions)

The second dataset is the one we use for the filter (i.e. we want to filter the promoter dataset for just those regions that contain the TAF1 peaks)

Inner join returns only the genomic regions that overlap in both files

Click Execute
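
A minimal Python sketch of an inner join on intervals is given below, assuming half-open coordinates and hypothetical promoter and peak rows. It illustrates the concept rather than Galaxy's implementation: each row of the first dataset is kept only if some row of the second dataset overlaps it, and the two rows are written side by side.

```python
def inner_join(first, second):
    """Return rows of `first` joined to every overlapping row of `second`.

    Rows are (chrom, start, end, name) tuples with half-open coordinates.
    Rows of `first` with no overlapping partner are dropped (inner join).
    """
    joined = []
    for a in first:
        for b in second:
            if a[0] == b[0] and a[1] < b[2] and b[1] < a[2]:
                joined.append(a + b)  # concatenate the two rows
    return joined


# Hypothetical data: two promoter regions, one TAF1 peak
promoters = [("chr1", 4000, 5000, "geneA"), ("chr1", 20000, 21000, "geneB")]
peaks = [("chr1", 4800, 5200, "peak1")]

overlap = inner_join(promoters, peaks)
```

Here only the geneA promoter survives, because it is the only one with a peak overlapping it. The nested loop is fine for a toy example but would be slow on tens of thousands of promoters.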

Step 9: Join on Genomic Intervals

Output

We have reduced the promoters from >54,000 to 154! All of these promoter regions contain a TAF1 peak region.

Rename the output file to ‘Overlap’

Step 10: Build Custom Tracks for UCSC

Tool: Graph/Display Data -> Build custom track

Click ‘Insert Track’ to open the track information.

We will add three tracks to UCSC:

1. TAF1 peaks
2. Promoter regions
3. TAF1 peaks in promoter regions

Step 10: Build Custom Tracks for UCSC

Track 1: TAF1 peaks

• Select dataset
• Label the track
• Describe the track
• Select the colour of the track
• Click ‘Insert Track’ to open another track

Step 10: Build Custom Tracks for UCSC

Tracks 2 and 3:

Click Execute when all three tracks are filled in

Step 10: Build Custom Tracks for UCSC

Output

This single output file contains the information to visualise three tracks on UCSC Genome Browser

Click here to visualise your three tracks on UCSC Genome Browser
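
For readers curious what such a file might look like, the Python sketch below assembles three tracks in the UCSC custom track text format, where each track starts with a `track` line carrying `name`, `description` and `color` attributes and is followed by BED rows. The coordinates, descriptions and colours are hypothetical, and the file Galaxy actually writes may contain further attributes.

```python
def custom_track(name, description, color, rows):
    """Format one UCSC custom track: a 'track' header line plus BED rows.

    `color` is an (R, G, B) tuple; `rows` are (chrom, start, end) tuples.
    """
    header = 'track name="{}" description="{}" color={},{},{}'.format(
        name, description, *color)
    lines = [header] + ["{}\t{}\t{}".format(*row) for row in rows]
    return "\n".join(lines)


# Hypothetical coordinates and colours for the three tracks
tracks = [
    custom_track("TAF1 peaks", "TAF1 ChIP-seq peaks",
                 (255, 0, 0), [("chr1", 4800, 5200)]),
    custom_track("Promoters", "1kb upstream of genes",
                 (0, 0, 255), [("chr1", 4000, 5000)]),
    custom_track("Overlap", "Promoters with a TAF1 peak",
                 (0, 128, 0), [("chr1", 4000, 5000)]),
]

# One text file holds all three tracks, one after the other
track_file = "\n".join(tracks)
```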

Visualisation on UCSC Genome Browser

The three tracks

Zoom out to see a larger genomic context

Extract Workflow from History

Want to rerun your analysis but extract 3kb upstream?

Click the cog wheel and select ‘Extract Workflow’ from the drop down menu

Extract Workflow from History

Create a workflow name

Lists all the tools used to create your history

Click Create workflow

Extract Workflow from History

Click edit workflow

Or access your workflows from the top menu

Editing Workflows

Click on a box and you can edit the variables of that step in the Details section on the right (in orange)

Each box is a step of the analysis

Noodles connect the steps

Use blue window to move around the workflow

Editing Workflows

This input dataset is the transcription factor dataset. Label this dataset in the details box on the right

Editing Workflows

This input dataset is the Gene dataset. Label this dataset in the details box on the right

Editing Workflows

1. Click on Get Flanks tool to edit the upstream promoter region

2. Change the upstream promoter region to 3000

3. Click the cog wheel to save the workflow. Then click the cog wheel again to run the workflow
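
The benefit of a workflow with an editable parameter can be sketched in Python as a single function whose `upstream` argument plays the role of the Get Flanks setting. This is an illustrative sketch with hypothetical data and half-open coordinates assumed; it is not how Galaxy stores or runs workflows.

```python
def promoters_with_peaks(genes, peaks, upstream=1000):
    """Return names of genes whose upstream region overlaps any peak.

    genes: (chrom, start, end, name, strand); peaks: (chrom, start, end).
    Coordinates are assumed half-open (end not included).
    """
    hits = []
    for chrom, start, end, name, strand in genes:
        # Strand-aware promoter region of `upstream` bases
        if strand == "-":
            p_start, p_end = end, end + upstream
        else:
            p_start, p_end = max(0, start - upstream), start
        if any(c == chrom and s < p_end and p_start < e for c, s, e in peaks):
            hits.append(name)
    return hits


# Hypothetical gene and peak
genes = [("chr1", 5000, 9000, "geneA", "+")]
peaks = [("chr1", 2500, 2800)]

hits_1kb = promoters_with_peaks(genes, peaks)                 # 1 kb upstream
hits_3kb = promoters_with_peaks(genes, peaks, upstream=3000)  # 3 kb rerun
```

With this toy data the peak lies about 2.2 kb upstream of the gene, so it is missed at 1 kb and found at 3 kb. Changing one value and rerunning every step is the point of the workflow.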

Running Workflows

1. Select Transcription factor file (e.g. TAF1_peaks)
2. Select Genes file (e.g. Genes)
3. Send output to a new history
4. Run workflow and go for a coffee!!

Your new History!

Summary

What you learned today

– Getting data into Galaxy

– How to review and edit datasets

– Running Common Galaxy Tools

– How to visualise your data in UCSC genome browser

– How to extract workflows from a history

Large Tool Repository

Data Visualisations

UCSC Genome Browser

Clustered Heatmaps

Visualisation of ChIP-seq data

Charts

Circster – structural variation

Galaxy Learning Resources

Thank You

Please fill in the online survey at bioinf.gen.tcd.ie/surveys/Galaxy