Handling Partially Demultiplexed Fastq Files
Situation...
You just received the typical email from the GCBridge... but the referenced fastq files are partially demultiplexed, because you combined internal (or inline) barcode indexing (barcode in the read sequence) with Illumina indexing (dedicated read).
As of today, the GCBridge has no way to detect such a situation automatically (it assumes the data is fully demultiplexed). It is therefore up to you to let us know about such cases and to provide us with the necessary additional information. Here is what you need to do...
What is a partially demultiplexed file ?
Some groups (e.g. Steinmetz) have started combining internal barcode indexing (barcode in the read sequence) with Illumina indexing. This means that the fastq files transferred by GeneCore for such libraries are only partially demultiplexed, i.e. they have been demultiplexed according to the Illumina index only and must be further demultiplexed before the GeneCore Bridge can handle them correctly.
Procedure to follow
Step 1 : Tell GBCS about the concerned files
Simply send us an email clearly stating the affected files.
We will create a ticket in our ticket system that you can use for further correspondence with us (you will be notified by email).
Step 2 : Prepare all the barcode files
For each file that needs to be further demultiplexed, we need to know the list of internal barcodes to use. Prepare one barcode file per fastq file (or fastq file pair) and give each of these barcode files a different name.
The format of the barcode files is as usual, i.e. each contains 2 tab-delimited columns:
- Column 1 (optional header SampleName) : the sample name as you want it in the final database ; all names are expected to be different within the file
- Column 2 (optional header Barcode) : the barcode ; all barcodes are expected to be different within the file
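Before sending a barcode file, it can be worth checking it against the rules above. The following is a minimal sketch of such a check, not a GBCS tool: the function name read_barcode_file is hypothetical, and it assumes that the optional header, when present, is spelled exactly SampleName / Barcode.

```python
import csv
import io


def read_barcode_file(text):
    """Parse a two-column, tab-delimited barcode file and check uniqueness.

    Returns a list of (sample_name, barcode) tuples.
    """
    # Blank lines are ignored.
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    # Skip the optional header line (assumed spelling: SampleName / Barcode).
    if rows and [c.strip() for c in rows[0]] == ["SampleName", "Barcode"]:
        rows = rows[1:]
    pairs = []
    for line_no, row in enumerate(rows, start=1):
        if len(row) != 2:
            raise ValueError(f"row {line_no}: expected 2 columns, got {len(row)}")
        pairs.append((row[0].strip(), row[1].strip()))
    names = [name for name, _ in pairs]
    barcodes = [barcode for _, barcode in pairs]
    if len(set(names)) != len(names):
        raise ValueError("sample names must all be different within the file")
    if len(set(barcodes)) != len(barcodes):
        raise ValueError("barcodes must all be different within the file")
    return pairs
```

Calling read_barcode_file(open("my_barcodes.txt").read()) on a valid file returns the parsed pairs; a duplicated sample name or barcode raises a ValueError.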
Step 3 : Prepare a "fastq-files-to-barcode-files" mapping file
The role of this simple text file is to describe which barcode file should be used for each fastq file that needs further demultiplexing.
This file is a tab-delimited file with 3 headed columns (headers are mandatory and their spelling must match exactly what follows):
- LaneFile1 : name of the fastq file for read 1 ; mandatory
- LaneFile2 : name of the fastq file for read 2 ; optional, i.e. only for PE. For SE, simply leave the column empty but keep the column header.
- InlineBarcodeFileName : name of the barcode file listing the sample names / barcodes for this fastq file. These are the files prepared at step 2. Only give the file name here, i.e. no path please.
Finally, save this file as a tab-delimited text file and name it 'seq_files-to-barcode_files-mapping.txt'
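If you prefer to generate the mapping file rather than type it by hand, something like the following sketch can be used. It only illustrates the layout described above; the function name build_mapping and the fastq / barcode file names in the usage note are hypothetical.

```python
import csv
import io

# Mandatory headers, spelled exactly as required by the procedure.
HEADERS = ["LaneFile1", "LaneFile2", "InlineBarcodeFileName"]


def build_mapping(rows):
    """Return the mapping file content as a string.

    rows: iterable of (read1_file, read2_file_or_None, barcode_file_name).
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADERS)
    for read1, read2, barcode_file in rows:
        if not read1:
            raise ValueError("LaneFile1 is mandatory")
        if "/" in barcode_file:
            raise ValueError("give the barcode file name only, without a path")
        # For SE data the LaneFile2 column stays empty but is still present.
        writer.writerow([read1, read2 or "", barcode_file])
    return out.getvalue()
```

For example, build_mapping([("lane1_read1.txt.gz", "lane1_read2.txt.gz", "barcodes_lane1.txt")]) returns the text to save as 'seq_files-to-barcode_files-mapping.txt'.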
Step 4 : Send all files to GBCS
Put all files created at steps 2 and 3 into a new directory and zip it.
Send us an email with the zip archive as an attachment and tell us which demultiplexing options we should use (see the Jemultiplexer page for this), in particular :
- Barcode position : in case of paired-end sequencing, are barcodes found in both fwd and rev reads ? If not, is the barcode in the fwd or the rev read ?
- Barcode matching options : the defaults are to allow up to one mismatch (MM=1), to require that the difference between the best and second-best barcode match is at least one base (MMD=1), and to consider only bases with a sequencing quality >= 10 (Q=10)
- Should we clip barcodes from sequences (default yes, CLIP=true) and trim the first base after the barcode (usually an 'A' due to ligation) ? (default true, XT=1)
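To help decide whether the default matching options suit your barcodes, here is a small sketch of how the MM, MMD and Q rules can be read. It is not Jemultiplexer's actual code: the function name match_barcode is hypothetical, and counting a base below the quality cutoff as a mismatch is an assumption about how Q is applied.

```python
def match_barcode(read_bc, quals, barcodes, mm=1, mmd=1, q=10):
    """Return the sample name of the best matching barcode, or None.

    read_bc  : bases read at the barcode position
    quals    : Phred quality of each base (same length as read_bc)
    barcodes : dict mapping sample name -> expected barcode sequence
    """
    def mismatches(expected):
        # ASSUMPTION: a base with quality below q counts as a mismatch.
        return sum(
            1
            for base, qual, exp in zip(read_bc, quals, expected)
            if qual < q or base != exp
        )

    scored = sorted((mismatches(bc), name) for name, bc in barcodes.items())
    if not scored:
        return None
    best_mm, best_name = scored[0]
    if best_mm > mm:
        return None  # too many mismatches (MM rule)
    if len(scored) > 1 and scored[1][0] - best_mm < mmd:
        return None  # runner-up is too close to the best match (MMD rule)
    return best_name
```

With barcodes {"s1": "ACGTAC", "s2": "TGCATG"}, a read starting with ACGTAA at good quality is assigned to s1 (one mismatch), while ACGAAA is left unassigned (two mismatches).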
Step 5 : Validate data transfer as usual
You will receive the usual GCBridge email once files are fully demultiplexed.
Since this usually ends up in dozens of files, it is very important to give us good final sample names so that we can provide the right names in the GCBridge validation form.
Important (transfer form validation):
- Give the form time to load, it can take a couple of minutes ; especially if demultiplexing produced a lot of files
- If the names are NOT the ones you expect, try reloading the page with "&forcesname=1" (without quotes!) appended to the URL
- Check that the sample names proposed in the form LOOK CORRECT (i.e. as you gave them)
- Then simply set the experiment, organism and unix group where appropriate. Use the "Same for all" links to copy the values all over the form.
- Submit and be patient
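Appending the forcesname parameter by hand is easy to get wrong (missing "&", parameter added twice). As a small illustration, the sketch below adds it to an existing URL; the function name add_forcesname and the example URL in the usage note are hypothetical, only the parameter forcesname=1 comes from the procedure above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_forcesname(url):
    """Return url with forcesname=1 set exactly once in its query string."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous forcesname value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "forcesname"
    ]
    query.append(("forcesname", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For instance, add_forcesname("https://example.org/validate?ngid=42") gives "https://example.org/validate?ngid=42&forcesname=1".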
Actions to be taken by GBCS Staff
Upon reception of "Step 1 email"
- Open a ticket.
- Inactivate the notification(s) using a SQL query like :
update notification_group set invalidated = 1 where id = the_ngid
and
update notification set invalidated = 1 where notification_group_id = the_ngid
NB: the ngid can be found in the URL the user got in the notification email
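Since the two updates belong together, it can help to run them in a single transaction with the ngid passed as a bound parameter. The sketch below shows the idea with Python's built-in sqlite3 module purely for illustration: the actual database engine and driver used by the GCBridge are not stated here, the function name invalidate_notifications is hypothetical, and other drivers use a different placeholder than "?".

```python
import sqlite3


def invalidate_notifications(conn, ngid):
    """Invalidate a notification group and all notifications attached to it."""
    # "with conn" commits if both updates succeed and rolls back otherwise.
    with conn:
        conn.execute(
            "UPDATE notification_group SET invalidated = 1 WHERE id = ?",
            (ngid,),
        )
        conn.execute(
            "UPDATE notification SET invalidated = 1 "
            "WHERE notification_group_id = ?",
            (ngid,),
        )
```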
Upon reception of "Step 4 email"
- Log in as galaxy on gbcs server and create a subfolder in the NGS run directory containing the data
- note that you might want to proceed lane-by-lane to limit the number of files in each email (I strongly recommend this actually). In this case, create a sub-folder per lane
- Unzip the file archive in the above created dir (we'll refer to the abs path to this dir as DIR_PATH )
- make sure all requested files (see step 3 and 4) are provided and in adequate format
- if you proceed lane-by-lane, split these files by lane where necessary
- Adapt values in the java class partialdemultiplex.PrepareBarcodeFiles.java (CurrentWork project in CVS) to reflect :
- path to the RUN_DIR (where fastq files are)
- DIR_PATH
- the name of the "seq_files-to-barcode_files-mapping.txt" found in the above DIR_PATH
- the options to use for demultiplexing
- whether it is a PE situation
- your email (to get LSF email upon demultiplexing completion)
- Run the java code (once for each sub-dir you created).
- If you do this from your Eclipse (most likely the case), you need the target fileserver to be mounted as 'galaxy' on your desktop.
- Result of previous step : You should now see new things in DIR_PATH:
- A list of new directories, one for each partially demultiplexed fastq file set and named after the "file1" name (simply with the extension .txt.gz removed). Each of these directories contains the following 2 files :
- A new barcode file.
This new barcode file is named after the original barcode file (as indicated in seq_files-to-barcode_files-mapping.txt in the row corresponding to this file set) with the additional "_with_filenames.txt" token in its name e.g. "Barcodes_plate1_A1_A2_with_filenames.txt"
- An lsf_sub.com file : an LSF job description file for running jemultiplexer on the cluster using the command (on spinoza or schroedinger)
bsub < /path/to/lsf_sub.com
- jemultiplexer_cmdlines.sh : a shell script containing all jemultiplexer commands in case you want to run them sequentially on a server (ie not on the cluster)
- launch_all_LSF_jobs.sh : a shell script to easily submit all jemultiplexer jobs to the LSF cluster
- post_processing.sh : a shell script to easily finalize re-organization of files once all jemultiplexer jobs have successfully completed.
- Run launch_all_LSF_jobs.sh
find . -name "*com" -exec sh -c 'bsub < "$1"' _ {} \;
- Check the demultiplexing output and launch post_processing.sh
- check the read numbers to spot potential barcode errors.
- check logs in "jemultiplexer_job_blah.out" LSF output
- if all looks good, launch post_processing.sh (as galaxy on gbcs)
- Relaunch the GCBridge on each lane separately to limit the number of filesets.
- This is simply done by relaunching the GCBridge (i.e. the GeneCoreBridge_startJavaPipelineManual.sh fullpath/demultiplexed_lanex.fastq.gz script) on one file of a given lane (the other files of the same lane will be picked up automatically)
- this will NOT work if you are not galaxy on gbcs !!
Important :
For cases where similar samples were sequenced under different barcodes in the same lane, please make sure to use the trick of adding forcesname=1 to the URL.