Galaxy for Bioinformatics Analysis: An Introduction
TCD Bioinformatics Support Team
Fiona Roche, PhD
Date: 31/08/15
Overview
• What is Galaxy?
• Why is it useful?
• Command-line vs Galaxy
• A Basic Analysis with Galaxy
• Resources for Learning
What is Galaxy?
A web-based genome analysis platform designed for experimental biologists
www.galaxyproject.org
Why is it useful to a biologist?
• Easy to use!
• Allows data import from popular resources
• Provides access to best practice bioinformatics tools
• Allows you to build analysis pipelines and share them
• Provides multiple ways to visualise your data
Trinity College Dublin, The University of Dublin
Case Study: ChIP-seq Analysis Pipeline
1. Sequencing
2. Pre-processing of raw reads
3. Quality control
4. Map reads to reference genome
5. Peak calling
6. Enriched regions
Downstream analysis of enriched regions:
• Visualisation with genome browser
• Motif discovery
• Relationship with gene structure
• Gene set analysis
• Differential profile analysis
Question
Which gene promoter regions do these enriched regions map to?
Command-line approach
1. Extract gene coordinates from UCSC
2. Extract 1kb upstream coordinates from UCSC
3. Merge upstream coordinates and gene annotation
4. Clean files
5. Join the input files
6. Create user track for UCSC
7. Import to UCSC
8. Run a wrapper script to enable a re-run of this pipeline with different parameters
Galaxy Approach
The Galaxy Interface
Datasources and Tools
Main Analysis window
History of commands
Main Menus
Overview of Analysis
Question: Which gene promoter regions do these enriched regions map to?
Analysis steps:
• Import two datasets into Galaxy
  1. Genomic coordinates of enriched peaks
  2. Genomic coordinates of genes
• Extract upstream regions of genes
• Data cleaning
• Identify overlap between promoter regions and enriched regions
• Visualise on a genome browser
Let’s begin!
Register an account
http://bioinf.gen.tcd.ie/workshops/Galaxy/
Step 1: Get Data into Galaxy
Get data #1: TAF1 peaks
Get Data -> Upload File -> Paste/Fetch -> Enter URL -> Start
1. Click Upload File
2. Click Paste/Fetch to display the URL box
3. Paste in the URL containing your data:
   http://bioinf.gen.tcd.ie/workshops/Galaxy/TAF1_peaks.txt
4. Select ‘tabular’ file type
5. Type hg19 and specify Human Feb. 2009 (GRCh37/hg19) (hg19)
6. Click Start to upload the data to your history!
7. Click Close
Data uploaded to your history!
The file was sent to your history and given a number
The history keeps track of all steps in your analysis
Step 2: Rename your History
1. Click the history name to rename your history
2. Click the cog wheel if you want to create a new history or see a list of your saved histories
You can have multiple histories with different names
Step 3: Review your dataset
1. Click on the dataset name to expand/collapse the metadata and mini view of the file content
2. Click the eye icon to see the file contents in the main analysis window
3. Click the pencil icon to edit the file attributes
4. Click the x to delete the file
Step 4a: Edit dataset
Change File name to a shorter name
1. Click the pencil icon to edit the file attributes
2. There are four tabs in edit mode. To change the file name, click Attributes
3. First rename the file
4. Copy and paste the old name into the info to keep a record of it
5. Click save
Step 4b: Edit dataset
Change File format so Galaxy knows where to find chr, start, end
1. Click Datatype to change the file format
2. Select interval from the drop-down and then click save
3. Define which columns of your TAF1 file are “chrom”, “start” and “end”. Look at the mini view to see your TAF1 file
4. Click save
5. Format changed to interval. Galaxy now knows where chr, start and end are.
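What "knowing where chr, start and end are" means can be sketched in a few lines of Python. This is only an illustration, not Galaxy's code: the function name `parse_interval` and the example row are hypothetical, and the real column layout of TAF1_peaks.txt may differ. Column numbers are 1-based here, matching how Galaxy labels columns.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_interval(line: str, chrom_col: int, start_col: int, end_col: int) -> Interval:
    """Read one tab-separated line using 1-based column numbers."""
    fields = line.rstrip("\n").split("\t")
    return Interval(
        chrom=fields[chrom_col - 1],
        start=int(fields[start_col - 1]),
        end=int(fields[end_col - 1]),
    )


# Hypothetical row: an identifier, then chrom/start/end, then a score.
row = "peak_1\tchr1\t100\t500\t37.5"
print(parse_interval(row, chrom_col=2, start_col=3, end_col=4))
```

Once the three columns are declared, any tool that works on genomic intervals can read the file without guessing its layout.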
Step 5: Get data #2 -> Genes
Get Data -> UCSC Main Table Browser
1. Ensure all drop-downs are selected as shown on the slide, then click get output
2. Click Send query to Galaxy
Step 6: Edit dataset
Change File name to a shorter name
Click the pencil icon to edit the file name
File name changed
File format = bed. Galaxy already knows where chr, start and end are
Step 7: Get Promoter Regions
Tool: Operate on Genomic Intervals -> Get Flanks
1. Select Genes dataset
2. Select upstream
3. Select 1000bp upstream
4. Click Execute
5. Output sent to history! Same file content as ‘Genes’ but start and end coordinates are replaced with promoter regions
6. Rename file to ‘Promoters’
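The idea behind replacing gene coordinates with upstream promoter coordinates can be sketched as below. This is a minimal illustration, not the Get Flanks implementation: the function name `upstream_flank` is hypothetical, and it assumes 0-based, half-open (BED-style) coordinates, with "upstream" meaning before the start on the + strand and after the end on the - strand.

```python
from typing import Tuple


def upstream_flank(start: int, end: int, strand: str, size: int = 1000) -> Tuple[int, int]:
    """Return (start, end) of the region `size` bp upstream of a gene."""
    if strand == "-":
        # On the minus strand the gene begins at `end`, so upstream lies beyond it.
        return end, end + size
    # On the plus strand upstream lies before `start`; clamp at the chromosome start.
    return max(0, start - size), start


print(upstream_flank(5000, 8000, "+"))  # plus-strand gene
print(upstream_flank(5000, 8000, "-"))  # minus-strand gene
```

Changing `size` to 3000 gives the 3kb promoter region used later in the workflow section.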
Step 8: Clean dataset
Tool: Text Manipulation -> Cut
1. Cut out the specific columns we want from the ‘Promoters’ file
2. Click Execute
3. Rename the output file to ‘Clean Promoters’
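Cutting columns amounts to keeping a chosen subset of tab-separated fields in a chosen order. The sketch below is illustrative only: `cut_columns` is a hypothetical helper, and the example row and the columns kept are assumptions, since the slide does not list which columns were selected.

```python
from typing import List


def cut_columns(line: str, columns: List[int]) -> str:
    """Keep the given 1-based columns of a tab-separated line, in the order listed."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[c - 1] for c in columns)


# Hypothetical promoter row: chrom, start, end, name, score, strand, plus two extra fields.
row = "chr1\t4000\t5000\tgeneA\t0\t+\t5000\t8000"
print(cut_columns(row, [1, 2, 3, 4, 6]))
```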
Datasets ready for analysis!
Both files are associated with human hg19
Galaxy knows for each file where chr, start and end are.
Now, we are ready to join these files and see which promoters have TAF1 peaks!
Dataset #1 Dataset #2
How do we Join Genomic Intervals?
Example

Interval file #1
Chr    Start   End    Name     Strand
Chr1   100     500    int1     +
Chr1   1000    1200   int2     +

Interval file #2
Chr    Start   End    Name     Strand
Chr1   200     400    cloneA   +
Chr1   900     1000   cloneB   +

Intervals that overlap!
Chr1   100     500    int1     +    Chr1   200     400    cloneA   +

Diagram: file #1 spans 100-500 and 1000-1200; file #2 spans 200-400 and 900-1000.
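The example above can be reproduced with a small Python sketch. It is not the Galaxy Join tool: `overlaps` and `inner_join` are hypothetical names, and the code assumes half-open coordinates, under which 900-1000 and 1000-1200 only touch and therefore do not count as overlapping, which matches the single overlapping pair shown in the example.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    strand: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def inner_join(first: List[Interval], second: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return every pair (a, b) where a from `first` overlaps b from `second`."""
    return [(a, b) for a in first for b in second if overlaps(a, b)]


file1 = [Interval("Chr1", 100, 500, "int1", "+"), Interval("Chr1", 1000, 1200, "int2", "+")]
file2 = [Interval("Chr1", 200, 400, "cloneA", "+"), Interval("Chr1", 900, 1000, "cloneB", "+")]

for a, b in inner_join(file1, file2):
    print(a.name, b.name)  # prints: int1 cloneA
```

The nested loop compares every pair, which is fine for a toy example; real tools use sorted or indexed intervals to cope with tens of thousands of regions.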
Step 9: Join on Genomic Intervals
Tool: Operate on Genomic Intervals -> Join
The first dataset is the one we want to filter (i.e. the large dataset containing all of the promoter regions)
The second dataset is the one we use for the filter (i.e. we want to filter the promoter dataset for just those regions that contain the TAF1 peaks)
Inner join returns only the genomic regions that overlap in both files
Click Execute
Step 9: Join on Genomic Intervals
Output
We have reduced the promoters from >54,000 to 154! All of these promoter regions contain a TAF1 peak region.
Rename the output file to ‘Overlap’
Step 10: Build Custom Tracks for UCSC
Tool: Graph/Display Data -> Build custom track
Click ‘Insert Track’ to open the track information.
We will add three tracks to UCSC:
1. TAF1 peaks
2. Promoter regions
3. TAF1 peaks in promoter regions
Step 10: Build Custom Tracks for UCSC
Track 1: TAF1 peaks
• Select dataset
• Label the track
• Describe the track
• Select the colour of the track
Click ‘Insert Track’ to open another track
Step 10: Build Custom Tracks for UCSC
Tracks 2 and 3:
Click Execute when all three tracks are filled in
This single output file contains the information to visualise three tracks on UCSC Genome Browser
Click the output's UCSC display link to visualise your three tracks on UCSC Genome Browser
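A single file can hold several tracks because the UCSC custom track format starts each track with a `track` header line followed by its data rows. The sketch below shows that layout under stated assumptions: `track_block` is a hypothetical helper, the labels, colours and coordinates are made up for illustration, and the output of the Galaxy tool may contain additional attributes.

```python
from typing import Iterable, Sequence


def track_block(name: str, description: str, color: str, rows: Iterable[Sequence]) -> str:
    """Format one custom track: a 'track' header line followed by tab-separated BED rows."""
    header = f'track name="{name}" description="{description}" color={color}'
    body = ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join([header] + body)


# Hypothetical labels, colours and coordinates for the three tracks.
tracks = [
    track_block("TAF1 peaks", "TAF1 ChIP-seq peaks", "255,0,0", [("chr1", 100, 500)]),
    track_block("Promoters", "1kb upstream regions", "0,0,255", [("chr1", 0, 1000)]),
    track_block("Overlap", "TAF1 peaks in promoters", "0,128,0", [("chr1", 100, 500)]),
]
print("\n".join(tracks))
```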
Step 10: Build Custom Tracks for UCSC
Output
Visualisation on UCSC Genome Browser
The three tracks
Zoom out to see a larger genomic context
Extract Workflow from History
Want to rerun your analysis but extract 3kb upstream?
Click the cog wheel and select ‘Extract Workflow’ from the drop-down menu
Extract Workflow from History
Create a workflow name
Lists all the tools used to create your history
Click Create workflow
Extract Workflow from History
Click edit workflow
Or access your workflows from the top menu
Editing Workflows
Click on a box and you can edit the variables of that step in the Details section on the right (in orange)
Each box is a step of the analysis
Noodles connect the steps
Use blue window to move around the workflow
Editing Workflows
This input dataset is the transcription factor dataset. Label this dataset in the details box on the right
Editing Workflows
This input dataset is the Gene dataset. Label this dataset in the details box on the right
Editing Workflows
1. Click on Get Flanks tool to edit the upstream promoter region
2. Change the upstream promoter region to 3000
3. Click the cog wheel to save the workflow. Then click the cog wheel again to run the workflow
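The benefit of a workflow is that the whole analysis becomes one parameterised unit, so changing the upstream distance means changing a single value. A rough Python analogy is sketched below; it is not how Galaxy runs workflows, the names `promoter` and `promoters_with_peaks` are hypothetical, and the genes and peaks are made-up data with BED-style half-open coordinates.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int, str, str]  # chrom, start, end, name, strand
Peak = Tuple[str, int, int]            # chrom, start, end


def promoter(gene: Gene, upstream: int) -> Tuple[str, int, int, str]:
    """Return the strand-aware upstream region of a gene as (chrom, start, end, name)."""
    chrom, start, end, name, strand = gene
    if strand == "-":
        return chrom, end, end + upstream, name
    return chrom, max(0, start - upstream), start, name


def promoters_with_peaks(genes: List[Gene], peaks: List[Peak], upstream: int = 1000) -> List[str]:
    """Names of genes whose promoter region overlaps at least one peak."""
    hits = []
    for gene in genes:
        chrom, p_start, p_end, name = promoter(gene, upstream)
        if any(c == chrom and s < p_end and p_start < e for c, s, e in peaks):
            hits.append(name)
    return hits


# Made-up example data.
genes = [("chr1", 5000, 8000, "geneA", "+"), ("chr1", 20000, 25000, "geneB", "-")]
peaks = [("chr1", 2500, 2800), ("chr1", 25200, 25400)]

print(promoters_with_peaks(genes, peaks, upstream=1000))  # prints: ['geneB']
print(promoters_with_peaks(genes, peaks, upstream=3000))  # prints: ['geneA', 'geneB']
```

In this toy data the peak at 2500-2800 lies more than 1kb but less than 3kb upstream of geneA, so it is only picked up on the 3kb re-run.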
Running Workflows
1. Select Transcription factor file (e.g. TAF1_peaks)
2. Select Genes file (e.g. Genes)
3. Send output to a new history
4. Run workflow and go for a coffee!
Your new History!
Summary
What you learned today
– Getting data into Galaxy
– How to review and edit datasets
– Running Common Galaxy Tools
– How to visualise your data in UCSC genome browser
– How to extract workflows from a history
Large Tool Repository
Data Visualisations
UCSC Genome Browser
Clustered Heatmaps
Visualisation of ChIP-seq data
Charts
Circster – structural variation
Galaxy Learning Resources
Thank You
Please fill in the online survey at bioinf.gen.tcd.ie/surveys/Galaxy