Workflows
Galaxy Workflows
GBCS has been building and maintaining several pipelines for genomic data analysis.
Access : The workflows are available for use and/or further optimisation and improvement on the Galaxy production server, under Shared data → Published workflows (top of the page).
The user-friendly Galaxy interface eases the use and chaining of bioinformatics tools. These workflows allow computational biologists to conduct complex analyses, from raw data processing to processed data visualisation. They are predesigned (but still editable), combining Galaxy's advanced features with internally developed tools to achieve efficient analysis. Unlike quickly and carelessly assembled workflows, the pipelines created by GBCS support meaningful dataset naming throughout the whole analysis, as well as automatic transfer of files to the EMBL file server (see NFS transfer). A copy of intermediate and/or fully processed files is therefore stored on your file server for further or complementary analysis (and kept on the Galaxy server until you delete it). Moreover, the workflows' output file storage structure is designed to be intuitive and easy to update. Please contact us with any suggestions or improvements on this point.
BamFingerprint/Correlate for ChIP-Seq
Very useful tips!
Tip 1 : Rename your files, especially after demultiplexing!
A well-known issue in Galaxy is the poor naming of the datasets it produces. And, in line with the garbage-in-garbage-out principle, this is even worse if you start with terrible names...
So before you use Tip 2 below, make sure to edit and rename your files to short and meaningful names! This is especially true after demultiplexing.
Doing so, you'll quickly realize that Galaxy is much more pleasant to work with!
Tip 2 : Automatically run workflows on multiple files
You can run Galaxy workflows on multiple files at once. To do so, first upload your files to Galaxy and import them into a new “History”. Then, at workflow execution, click the "toggle multiple file selection" icon (see the pictures below) wherever file selection occurs (usually in the first steps).
Once multiple file selection is enabled, the file selection appears as a list where you can:
- Filter files by name
- Select multiple files with Ctrl + Click
- Select a run of successive files by clicking on the first file of the run and Shift + Click on the last one.
Finally, make sure to send each workflow run into its own history by ticking the "Send results to a new history" checkbox. Note the text box that appears when the checkbox is selected. Enter a short name reflecting the workflow you are executing (the default name is usually too long). This way, each workflow run executes in its own history, named after the short text you entered suffixed with the input dataset name (hence the good idea of giving short and meaningful names to your input datasets!). For example, if you typed "MappingFilteringAndQC" in the text field, the history running the workflow on e.g. "dataset1.fastq" will be named:
"MappingFilteringAndQC on dataset1.fastq"
Tip 3 : Propagate good dataset names within your workflows
Note : This tip will ONLY work when the input dataset(s) of a workflow have meaningful names, i.e. when you are following Tip 1!
In any step of a workflow, the datasets that the step produces can be renamed using the "Edit Step Actions" box available in the right tab. Choose the "Rename dataset" action, select the output dataset you want the rename action to apply to, and click Create. This creates a new box right below, with a text input field (see picture below).
Of course you can rename the dataset with a predefined name like "my-cool-and-more-meaningful-name", but you usually want more dynamic renaming, based on the step's input file name.
Using a step input parameter with the #{NAME} syntax
This can be achieved with the #{NAME} syntax, in which NAME must be the exact spelling of the input parameter as indicated at the top of the step. In our example, the Bowtie2 step indicates that the FASTQ file input parameter is internally named input_1. The input FASTQ file name can therefore be injected into the BAM output file name by using #{input_1} in the renaming text.
Example:
Assume the input FASTQ file is named "My_TF_Rep1.fastqsanger.gz"; renaming the output to "#{input_1}.bam" will result in a BAM file named:
"My_TF_Rep1.fastqsanger.gz.bam". Note how the text outside the #{} was kept as given.
The #{} syntax comes with internal string manipulation functions that can be pretty handy:
- basename : clips off the file extension
- lowercase and uppercase : turn the whole name to lower or upper case
So using the renaming string "#{input_1 | basename }.bam" (note the use of the "pipe" character) would give:
"My_TF_Rep1.fastqsanger.bam" as the result dataset name... we are on the right track.
These functions can also be chained, i.e. "#{input_1 | basename | basename | lowercase }.bam" would result in
"my_tf_rep1.bam" ... now we have it right!
Using a workflow input parameter with the ${NAME} syntax
Global workflow parameters are automatically created and can be reused in your dataset renaming with the ${PARAM_NAME} syntax. It works like the #{NAME} syntax, except that no string manipulation functions can be used, only the plain parameter value.
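To make the renaming filters concrete, here is a small Python sketch that mimics how the pipe-chained functions compose (an illustration only, not Galaxy's actual implementation):

```python
import os

def basename(name):
    # Clip off the last file extension, like Galaxy's "basename" filter
    return os.path.splitext(name)[0]

FILTERS = {
    "basename": basename,
    "lowercase": str.lower,
    "uppercase": str.upper,
}

def apply_rename(value, filter_names):
    # Apply each filter left to right, as in "#{input_1 | basename | lowercase}"
    for name in filter_names:
        value = FILTERS[name](value)
    return value

new_name = apply_rename("My_TF_Rep1.fastqsanger.gz",
                        ["basename", "basename", "lowercase"]) + ".bam"
print(new_name)  # -> my_tf_rep1.bam
```

This reproduces the chained example above: two basename passes strip ".gz" then ".fastqsanger", and lowercase finishes the job.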
Workflows for ChIP–seq data analysis
We have developed several Galaxy workflows for ChIP-seq analysis; these are available under the "Shared Data" menu, then "Published Workflows".
The first, optional, step is to demultiplex your data.
For this, use the home-made, flexible Jemultiplexer (see below for a screenshot of the Jemultiplexer tool in Galaxy).
And don't forget to rename result files if needed!
Workflow 1 : ChIP-seq data Quality Control, Read Alignment and Filtering
This workflow aligns raw reads (i.e. “out of the sequencer”, in fastqsanger format, possibly gzipped) to a user-defined reference genome and produces a raw BAM file using the Bowtie2 alignment tool.
In addition, the workflow performs several filtering steps (unmapped read removal, ambiguous alignment removal, duplicated alignment removal or marking) and a first-level quality analysis using the FastQC and SPP tools.
Expected Inputs :
- A fastqsanger file ( .txt or .txt.gz )
Parameters :
- The reference genome for alignment
- The transfer directory
NB. If you want to run this workflow on another reference genome, simply modify the appropriate parameter above before execution.
The transfer directory is an absolute path to the folder on the file server that you want to use as the root for storing the results of all your analyses. For example : /g/mygroup/path/to/my/result/folder
Outputs :
- All-reads BAM file, duplicates marked
- Filtered BAM file, duplicates removed
- Associated BED file containing signal intervals (see replicate analysis workflow)
- FastQC reports on the duplicates-removed BAM file
- DuplicationMetrics HTML files
- AlignmentMetrics HTML files
- Strand cross-correlation results on both BAM files
- As of version 1.3, the workflow also creates a bigwig coverage file normalised to 1X coverage (i.e. RPGC). The key default parameters are the bin size (50 bp) and the fragment length (200 bp). This file is transferred to a signal/rpgc sub-folder
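For reference, 1X (RPGC) normalisation scales the coverage so that the genome-wide mean is 1X; the scale factor used by tools such as deepTools is the reciprocal of the estimated sequencing depth. A back-of-the-envelope sketch (the read count and genome size below are made-up numbers):

```python
def rpgc_scale_factor(mapped_reads, fragment_length, effective_genome_size):
    # Estimated depth = mapped_reads * fragment_length / effective_genome_size;
    # multiplying the coverage by 1/depth makes the genome-wide mean 1X (RPGC)
    depth = mapped_reads * fragment_length / effective_genome_size
    return 1.0 / depth

# Hypothetical library: 20 M mapped reads, 200 bp fragments (the workflow
# default), effective genome size of 1.2e8 bp
print(rpgc_scale_factor(20_000_000, 200, 1.2e8))  # -> approximately 0.03
```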
Workflow 2 : Peak Calling and construction of BigWig files for data visualisation
Using the BAM files generated by the alignment workflow above, the analysis can be taken further by analysing protein-DNA interactions on the Drosophila melanogaster genome, calling peaks with MACS 1.4. Peaks are quantitative representations of observed protein-DNA interactions, expressed as continuous “signal” values on genomic intervals. This workflow generates several scaled and normalised wig files suitable for visualisation in genome browsers.
Expected input :
- The sample and control BAM files (possibly obtained through the alignment workflow above)
Parameters :
- Transfer directory (same as above)
- Tag size
- Bandwidth
- Shift size
- Resolution of generated BigWig files
NB. These parameters are set at workflow execution; all other parameters can be changed by importing the workflow and modifying it, although this is advised only if you are sure of the modifications you make.
Outputs :
- Three kinds of BigWig files :
  - Scaled files : both replicates' files, scaled by library size
  - Normalised files, normalised against the control by subtraction :
    - Subtracted signal ( Sample - Control ), giving the absolute value of enrichment
    - Log-transformed then subtracted ( log(Sample)-log(Control)=log(Sample/Control) ), giving the relative enrichment
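The identity behind the log-transformed track, log(Sample) - log(Control) = log(Sample/Control), is easy to check numerically (the signal values below are made up):

```python
import math

sample, control = 64.0, 16.0   # hypothetical signal values on one interval

absolute_enrichment = sample - control               # plain subtraction
log_ratio = math.log2(sample) - math.log2(control)   # same as log2(sample / control)

print(absolute_enrichment)  # -> 48.0
print(log_ratio)            # -> 2.0, i.e. the sample is 2^2 = 4 times the control
assert math.isclose(log_ratio, math.log2(sample / control))
```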
Workflow 3 : Replicate Analysis
This workflow estimates the correlation between two samples by performing a linear regression of the average signal values of both replicates over the same intervals. It gives an idea of experiment quality and reproducibility.
Expected Inputs:
- Both replicates BigWig files
- Both replicates BED files
Parameters:
- Transfer directory
- A unique experiment name (e.g. possibly containing both replicates names)
NB. Be careful: this workflow cannot be executed on multiple files at once (see tips), as an appropriate unique filename cannot be determined during execution. You will therefore have to run it multiple times, indicating a unique name as a parameter each time.
Outputs:
- Raw and NaN filtered txt table files of mean values on intervals
- Linear regression result files (txt table of values and PDF scatterplot)
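The regression step amounts to fitting a line through the paired interval means after discarding NaN intervals; a dependency-free sketch with made-up values:

```python
import math

# Hypothetical mean signal values of two replicates over the same intervals;
# NaN mimics intervals that the workflow filters out before the regression
rep1 = [1.0, 2.0, 3.0, float("nan"), 5.0]
rep2 = [1.1, 1.9, 3.2, 4.0, float("nan")]

# Keep only intervals where both replicates have a value
pairs = [(x, y) for x, y in zip(rep1, rep2)
         if not (math.isnan(x) or math.isnan(y))]
xs, ys = zip(*pairs)

# Ordinary least-squares slope and intercept
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
var = sum((x - mean_x) ** 2 for x in xs)
slope = cov / var
intercept = mean_y - slope * mean_x

print(round(slope, 3), round(intercept, 3))
```

A slope near 1 and intercept near 0, with little scatter around the fitted line, indicate well-correlated replicates.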
Workflows for RNA–seq data analysis
We have developed several Galaxy workflows for RNA-seq analysis; these are available under the "Shared Data" menu, then "Published Workflows".
Workflow 1 : Quality Control, Read Alignment and Read Count on RNA-seq read data--"Tophat-htseqcount Workflow"
This workflow generates a count table with the "htseq-count" program. It takes RNA-seq FASTQ (fastqsanger) paired-end reads as input files and maps them with Tophat2. FastQC and htseq-count are then run on the BAM file output by Tophat2. Output files are automatically transferred to a user-specified directory.
Expected Inputs :
- Paired-end fastqsanger read files from an RNA-seq experiment
- Gene annotation file, which can be obtained from "UCSC Main table browser" on Galaxy tool list (Output Format: "GTF-gene transfer format").
Parameters :
- The transfer directory: an absolute path to the folder on the file server that you want to use as the root for storing the results of all your analyses. For example : /g/mygroup/path/to/my/result/folder
- The reference genome for mapping
Transferred Output Files:
- The output bam files from Tophat2 are transferred to {transferdirectory}/data/bam/
- The output count tables from htseq-count are transferred to {transferdirectory}/analysis/htseqCountTable/
- The output fastQC files are transferred to {transferdirectory}/qc/fastqc/
Workflow 2 : Deseq2 analysis on read count tables: 2 replicates with 1 factor model--"deseq2-2replicates-1factor Worflow"
This workflow conducts a DESeq2 analysis for the case of 2 replicates with a 1-factor model. It takes the count table produced by htseq-count for each sample as input, and combines these count tables. DESeq2 is then executed on the combined count table.
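htseq-count writes one two-column table (gene identifier, count) per sample, so the combining step is essentially a join on gene identifiers. A minimal stdlib sketch of that idea (the actual file parsing in the workflow may differ):

```python
def combine_count_tables(tables, sample_names):
    # tables: one {gene_id: count} dict per sample, in the same order
    # as sample_names; missing genes default to a count of 0
    genes = sorted(set().union(*tables))
    lines = ["\t".join(["gene_id"] + list(sample_names))]
    for gene in genes:
        counts = [str(table.get(gene, 0)) for table in tables]
        lines.append("\t".join([gene] + counts))
    return "\n".join(lines)

# Hypothetical two-sample example
ctrl_counts = {"geneA": 10, "geneB": 0}
case_counts = {"geneA": 25, "geneB": 3}
print(combine_count_tables([ctrl_counts, case_counts],
                           ["control_rep1", "case_rep1"]))
```

The column order of the combined table is what the sample annotation parameters below must match.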
Expected Inputs :
- Annotated gene read count table from htseq-count for each sample
Parameters :
- Transfer directory
- A unique experiment name: where output files from deseq2 are transferred
- Parameters for Deseq2 sample annotation:
- Sample name (ordered distinct column names in the count table): must be in the same order as the workflow's input datasets
- Factor name: the effect of this factor is tested in the model (e.g. treatment)
- List of factor values: must be in the same order as the sample name list and the input datasets (e.g.: control, case, control, case)
Transferred Output Files:
The following output files are automatically transferred to a directory {transferdirectory}/analysis/deseq2_{expanse}/ :
- A diagnostic pdf file from deseq2
- A tabular file listing gene identifiers and corresponding statistics
- A tabular file listing gene identifiers and corresponding statistics after filtering out low-count reads
- A tabular file containing combined count tables used by deseq2
Workflow 3 : Deseq2 analysis on read count tables: 2 replicates with 2 factor model--"deseq2-2replicates-2factor Workflow"
Similar to Workflow 2, apart from the input parameters.
Parameters :
- Transfer directory
- A unique experiment name: where output files from deseq2 are transferred
- Parameters for Deseq2 sample annotation:
- Sample name (ordered distinct column names in the count table): must be in the same order as the workflow's input datasets
- First factor name: the model controls for the effect of this factor when testing the second factor (e.g. patient)
- List of first factor values: must be in the same order as the sample name list and the input datasets (e.g.: 1,1,2,2)
- Second factor name: the effect of this factor is tested in the model (e.g. treatment)
- List of second factor values: must be in the same order as the sample name list and the input datasets (e.g.: control, case, control, case)
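To illustrate the ordering requirement, here is a hypothetical sample annotation for this 2-factor design (sample and factor values are made up); in DESeq2 terms it corresponds to a design along the lines of ~ patient + treatment, where patient is controlled for while treatment is tested:

```python
# Order must match the column order of the combined count table
samples   = ["s1", "s2", "s3", "s4"]
patient   = ["1", "1", "2", "2"]                    # first factor
treatment = ["control", "case", "control", "case"]  # second factor (tested)

annotation = list(zip(samples, patient, treatment))
for row in annotation:
    print("\t".join(row))
```

Each sample must line up with one value from each factor list; shuffling any of the three lists independently would silently mislabel the samples.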