NGS Data Transfer using the GCBridge : Tutorial
The GeneCore Bridge (GCB) is a pipeline that automates data transfer from GeneCore to emBASE and Galaxy as soon as the data is released by the GeneCore facility.
- Forenotes on partially demultiplexed data
- GCBridge Tutorial Overview
- Filling in the GCB form, a step-by-step tutorial
Forenotes on partially demultiplexed data
Some groups combine internal (or inline) barcode indexing (barcode in the read sequence) with Illumina indexing (dedicated read). The fastq files transferred by GeneCore for such libraries are therefore only partially demultiplexed, i.e. they have been demultiplexed according to the Illumina index only, and must be further demultiplexed before the GeneCore Bridge can handle them correctly. We will improve the handling of these files; meanwhile, here is the procedure that you must follow so we can handle them through the usual pipeline.
GCBridge Tutorial Overview
When a sequencing run is done, the files are copied to your group NGS library folder (on your file server), and a first program creates an e-mail for each person who uses lanes in the flowcell. The e-mail will look like this:
The big URL in the e-mail is a link to the second step of the pipeline: a web form. There the necessary information must be provided. The form will look like this:
- Each box corresponds to one lane. The only exception is when your data has been demultiplexed by GeneCore, which is always the case when using standard Illumina barcoding. In this situation, one box per demultiplexed sample is created.
- In case of paired-end data, the mate files are processed together in the same box.
- In the very unlikely situation where too many boxes are on the page and you would like to have only some of these displayed (so you can process them separately), please get in touch with us.
- In case of multiplexed data, MAKE SURE THE 'Multiplexed' checkbox is checked and provide the needed information.
- Make the effort to use meaningful sample names at this stage of the process (do not think "fine, I'll change them later").
- Indicate the total number of samples/libraries found in the lane (including those from other users when applicable) and how many of these are yours.
- If you are sharing the lane with other users, please describe only the samples that belong to you in the multiplexed text area AND select ALL users sharing the lane.
- In case the lane is multiplexed and assuming your samples do not all belong to the same project, you can submit the form several times provided that a given sample is overall submitted only once (GCBridge records the number of submitted samples/libraries and will stop displaying the form once you submitted all your samples)
When the form is validated, the information is stored in a database. Two additional daemon programs then read the database: the first one loads the data into Galaxy and the second one into emBASE. When the data processing is complete, you are notified by e-mail.
Filling in the GCB form, a step-by-step tutorial
The emBASE experiment selector:
On the left are listed the emBASE experiments you have access rights to. If no existing experiment corresponds to the data you are uploading, you can create a new one by filling in the text fields on the right:
- one for the experiment name; please use only unaccented Latin letters (a-z, A-Z; no Greek or accented letters), numbers, and the - and _ (hyphen, underscore) characters,
- one for the optional but highly recommended experiment prefix: this prefix, which should really be regarded as a project key, will be prepended to the names of all items (e.g. sample, extract, library and bioassay names) created when files are loaded in emBASE. This will later make your life easier by ensuring that all items related to the same experiment/project can easily be searched for and recognized by the beginning of their name.
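The two naming rules above can be illustrated with a minimal Python sketch. It is not GCBridge or emBASE code: the function names are hypothetical, and the underscore used to join the prefix and the item name is an assumption, since this tutorial does not say how emBASE joins them.

```python
import re

# Allowed characters per the rule above: ASCII letters, digits, hyphen, underscore.
_VALID_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_experiment_name(name: str) -> bool:
    """Return True if the name uses only ASCII letters, digits, '-' and '_'."""
    return _VALID_NAME.fullmatch(name) is not None


def prefixed_item_name(prefix: str, item_name: str, sep: str = "_") -> str:
    """Prepend the experiment prefix to an item name.

    The separator is an assumption made for this sketch; emBASE may join
    the prefix and the name differently.
    """
    return f"{prefix}{sep}{item_name}" if prefix else item_name
```

With a hypothetical prefix `PRJ1`, `prefixed_item_name("PRJ1", "sampleA")` gives `PRJ1_sampleA`, so all items of the project can be found by the start of their name.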
Mouse over the question mark icons for more information on the adequate values. The Same experiment for all link on the right copies the value(s) of this field and pastes them into all the other experiment fields of the form. This is helpful if you used many lanes in the flowcell, all for the same experiment.
The unix group selector (i.e. the group the files must belong to):
This is the list of the unix groups you belong to. It only concerns you if your lab's disk space has quotas associated with unix groups. Otherwise you should have just one choice in the list and can simply ignore this item.
The genome build selector
This is a list of the genome builds that are available on Galaxy. If you need other/newer builds, please contact the GBCS so we can add them for you.
The last fields concern the description of the library(-ies) present in the file (i.e. lane)
The fields to fill in differ between the multiplexed and non-multiplexed situations.
- If your data is not multiplexed (multiplexed checkbox unchecked): you can indicate/rename your sample in the Sample text box shown below :
On the left, the sample name field is pre-filled with a sample name automatically extracted from the file name. Please change it if it is not appropriate (the name can be reset using the Reset Name link on the far right).
By default, GCBridge uses the provided name to create a new sample (and a new extract and a new library) in emBASE. Alternatively, you can re-use an existing sample or library from emBASE, using the drop-down on the right. Note that this is the right way to accurately describe technical replicates at the right protocol level. Upon selection of either Yes, match existing samples or Yes, match existing libraries, emBASE is queried using the sample name field and all matching samples (or libraries) are displayed in a drop-down list. Note that the experiment prefix is ignored during this search, i.e. GCBridge searches for all samples whose name ends with the sample name given in the text field. For this reason, GCBridge might report matches that seem inappropriate to you. If your sample is not listed in the results, please double-check the spelling of your sample in the left box, or first check how you named your sample in emBASE. Still having trouble finding it? Contact us!
- If your data is multiplexed: you must check the multiplexed checkbox (if not automatically done) and the following fields will become available.
First, you must indicate the total number of libraries contained in the lane and how many of these belong to you, as shown below:
- These 2 numbers differ only if you share this lane with other users (the lane mates)
- The total number is only asked of the lane owner, i.e. the user who first received the email: the lane mates only indicate the number of samples that belong to them (and to them only)
- These numbers let GCBridge know when a submission is complete (mandatory information for executing demultiplexing, see below)
- In the special situation of already demultiplexed data, you still have to check the multiplexed checkbox (if not automatically done) and provide the only sample corresponding to the file(s). If you know the barcode used, please provide it, or indicate the keyword STD_ILLU in place of the barcode (this should be done automatically). This procedure ensures that emBASE correctly understands that the demultiplexed files came from the same sequencing run and lane.
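The bookkeeping described above (a lane-wide total given by the lane owner, per-user counts, and a submission that is complete once all libraries are accounted for) can be pictured with the following Python sketch. It only illustrates the idea: the class and its methods are hypothetical, and how GCBridge actually stores and checks these numbers is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LaneSubmission:
    """Tracks how many libraries of one lane have been described so far."""

    total_libraries: int  # total number, given by the lane owner
    submitted: Dict[str, int] = field(default_factory=dict)  # user -> count

    def submit(self, user: str, n_libraries: int) -> None:
        """Record that a user described n_libraries of this lane."""
        already = sum(self.submitted.values())
        if already + n_libraries > self.total_libraries:
            raise ValueError("more libraries submitted than declared for the lane")
        self.submitted[user] = self.submitted.get(user, 0) + n_libraries

    def is_complete(self) -> bool:
        """True once every library of the lane has been described."""
        return sum(self.submitted.values()) == self.total_libraries
```

For a lane of 6 libraries shared between an owner (4 libraries) and one lane mate (2 libraries), `is_complete()` becomes true only after both have submitted, which is the point at which demultiplexing could start.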
Second, adjust the demultiplexing options that Je will use (Jemultiplexer has been replaced by the newer Je tool):
Je (see its help page) knows everything about emBASE and will directly save the demultiplexed files in the right place in your group NGS library (on your group file share; ask us if you don't know what we are talking about here). Please see the emBASE documentation to learn more about this.
Important notes:
- All lane mates automatically receive an email upon demultiplexing completion
- Demultiplexing is now the default option, but you can turn it off (uncheck the checkbox) if your barcoding situation is not handled by Je. In such a situation, please get in touch with us: Je is developed by GBCS, so we can always improve the tool to match your needs.
- The proposed options are usually good for standard single-end and paired-end situations; if you need to change them, please consult the Je help page or get in touch with GBCS:
- Single end: 1 mismatch allowed (MM=1), minimum distance with second best match = 1 (MMD=1), min. base quality score for a base to be considered = 10 (Q=10), remove the barcode from reads (C=TRUE) plus 1 extra base (XT=1), add the barcode to the fastq header (ADD=TRUE), no extra trimming at the 3' end (ZT=0)
- Paired end: same options (MM=1 MMD=1 Q=10 XT=1 ZT=0 C=TRUE ADD=TRUE), plus additional ones defining the barcode position(s) in both reads (BPOS=BOTH), specifying that barcodes are expected to be the same in both reads (BRED=TRUE) and that either can be used to identify the sample (BM=BOTH S=FALSE)
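As a compact summary of the defaults listed above, the following Python sketch stores them as dictionaries and renders them in the KEY=VALUE notation used in this tutorial. The option names and values are those quoted above; the dictionary and function names are hypothetical, and the sketch does not show how GCBridge actually invokes Je.

```python
# Default single-end options quoted in the list above.
SINGLE_END_DEFAULTS = {
    "MM": 1,      # mismatches allowed
    "MMD": 1,     # minimum distance with the second best match
    "Q": 10,      # minimum base quality score
    "XT": 1,      # extra base trimmed after the barcode
    "ZT": 0,      # no extra trimming at the 3' end
    "C": True,
    "ADD": True,
}

# Paired-end runs use the same values plus the barcode-position options.
PAIRED_END_DEFAULTS = {
    **SINGLE_END_DEFAULTS,
    "BPOS": "BOTH",
    "BRED": True,
    "BM": "BOTH",
    "S": False,
}


def format_options(options: dict) -> str:
    """Render options as space-separated KEY=VALUE tokens."""
    def render(value):
        # Booleans are written TRUE/FALSE, as in the option lists above.
        if isinstance(value, bool):
            return "TRUE" if value else "FALSE"
        return str(value)

    return " ".join(f"{key}={render(value)}" for key, value in options.items())


print(format_options(PAIRED_END_DEFAULTS))
# MM=1 MMD=1 Q=10 XT=1 ZT=0 C=TRUE ADD=TRUE BPOS=BOTH BRED=TRUE BM=BOTH S=FALSE
```

Keeping the paired-end set as an extension of the single-end set makes it explicit that only the four barcode-position options differ between the two situations.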
Third, paste the list of samples/libraries and their barcodes in the big textarea, with one sample name / barcode pair per line (the sample name and barcode must be separated by a space).
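To check a sample list before pasting it, a small parser along the following lines can help. This is a sketch under assumptions, not the GCBridge parser: it treats the last whitespace-separated token of each line as the barcode (so a sample name may itself contain spaces), it ignores blank lines, and it handles only the two-column layout. The sample names and barcodes in the usage line are made up.

```python
from typing import List, Tuple


def parse_sample_sheet(text: str) -> List[Tuple[str, str]]:
    """Parse 'sample_name barcode' lines into (sample_name, barcode) pairs."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last run of whitespace: the barcode is assumed
        # to be the final token of the line.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'sample_name barcode'")
        rows.append((parts[0], parts[1]))
    return rows
```

For example, `parse_sample_sheet("liver_rep1 ACGTAC\nliver_rep2 TGCATG")` returns two pairs, and the length of the result can be compared with the number of libraries you declared as yours.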
Then select the users sharing the lane with you (the lane mates):
By default, GCBridge uses the provided names to create one new sample per row (and a new extract and a new library) in emBASE. Alternatively, you can re-use existing samples or libraries from emBASE; this is the right way to accurately describe technical replicates at the right protocol level. To re-use existing emBASE items, proceed as follows:
- Select the adequate level from the drop down list, below the sample/barcode textarea
- Yes, re-use existing Samples: tells GCBridge to search for samples in emBASE. Re-using a sample means creating a new extract and a new library that will be linked to the selected sample. This situation corresponds to preparing a new library from an existing sample.
- Yes, re-use existing Libraries: tells GCBridge to search for libraries in emBASE. Re-using a library means linking the results to the existing library (no sample, extract or library is created). This situation corresponds to re-sequencing an existing library.
- Yes, re-use items as indicated in 3d column : special option in which you can have a mixture of sample and library matching, guided by a 'sample' or 'library' key given in a third column.
Upon selection, the sample/barcode textarea is converted into (here we selected Yes, re-use existing Samples) :
Upon selection, GCBridge queries emBASE using the sample name values (first column) and all matching samples (or libraries) are displayed in drop down lists. Results are displayed in :
- green when a single match is found
- orange when multiple matches are found
- red when NO match is found
Understanding how items are searched:
- The sample name is used as given (with spaces ...) and the case is not considered during the search
- The experiment prefix is ignored during this search, i.e. GCBridge searches for all samples whose name ends with the sample name given in the text field. For this reason, GCBridge might report matches that seem inappropriate to you.
- When searching for libraries, the provided barcode is used in the query.
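The search rules above (case-insensitive, matching on the end of the name so that the experiment prefix is ignored) and the colour coding of the results can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the emBASE names are made up, and the additional barcode criterion used for library searches is not modelled.

```python
from typing import List


def find_matches(query: str, embase_names: List[str]) -> List[str]:
    """Return the names that end with the query, ignoring case."""
    q = query.lower()
    return [name for name in embase_names if name.lower().endswith(q)]


def result_colour(matches: List[str]) -> str:
    """Map the number of matches to the colour used in the form."""
    if len(matches) == 1:
        return "green"   # a single match
    if len(matches) > 1:
        return "orange"  # multiple matches
    return "red"         # no match


# Made-up emBASE sample names carrying two different experiment prefixes.
names = ["PRJ1_liver_rep1", "PRJ2_Liver_rep1", "PRJ1_kidney_rep1"]
print(result_colour(find_matches("liver_rep1", names)))   # orange
print(result_colour(find_matches("kidney_rep1", names)))  # green
print(result_colour(find_matches("brain", names)))        # red
```

The first query shows why matches can look inappropriate: the same sample name exists under two different prefixes, and both are reported.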
If your sample is not listed in results:
- please double check the spelling of your sample/library (switch back to the textarea view using the re-use dropdown if you need to correct a name), or
- first check how you named your sample/library in emBASE, or
- for libraries, check what barcode was associated to it.
Still having trouble finding it? Contact us at gbcs at embl.de! You can also change and adapt the emBASE search level for each row using the emBASE Query level, as shown in the picture below:
Fourth, tell GCBridge if your file(s) should be made available in Galaxy (true by default) :
The files will end up in a folder named after the emBASE experiment, in your group data library. Note that you can also synchronize emBASE experiment content with Galaxy, right from emBASE.
Fifth, click save all and, if everything goes well, you should see :
Sixth, if you try to go to the form again, GCBridge checks that you still have samples to submit. Once you have submitted all your samples/libraries (i.e. according to the number you provided), you will see a page like this:
Good to Know And Tricks
In case of demultiplexed data
In case of demultiplexed data, the GCBridge always tries to make your life easy and proposes a sample name that is extracted from the file names (which is why it is important to come up with good names right at GC ordering time).
Sample names are normally extracted from the beginning of the file name as the text found between the flowcell ID and the reaction ID.
Consider the file name :
=> The considered sample name will be SK1xRM11_9D (C5U0DACXX is the flowcell ID and 14s008050-2-1 is the GeneCore Reaction ID)
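The extraction rule described above can be sketched in Python as follows. This is not the GCBridge implementation: the function name is hypothetical, the underscore separator and the position of the flowcell ID at the very start of the file name are assumptions, and the file name used in the example is made up around the IDs quoted above.

```python
from typing import Optional


def extract_sample_name(file_name: str, flowcell_id: str,
                        reaction_id: str, sep: str = "_") -> Optional[str]:
    """Return the text between the flowcell ID and the reaction ID, or None."""
    start_token = flowcell_id + sep
    end_token = sep + reaction_id
    if not file_name.startswith(start_token):
        return None  # the flowcell ID is assumed to open the file name
    end = file_name.find(end_token, len(start_token))
    if end == -1:
        return None  # reaction ID not found after the flowcell ID
    return file_name[len(start_token):end]


# Hypothetical file name, built only to illustrate the rule.
name = "C5U0DACXX_SK1xRM11_9D_14s008050-2-1_hypothetical_suffix.txt.gz"
print(extract_sample_name(name, "C5U0DACXX", "14s008050-2-1"))  # SK1xRM11_9D
```

Note that the sample name may itself contain the separator, which is why the sketch searches for the reaction ID rather than splitting the file name on underscores.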
When these sample names are not unique amongst the demultiplexed file names, the form automatically tries to extract sample names from the last part of the file name, in the "lane information part". If these alternative sample names turn out to be unique, they will be used as default names.
From the example above, this is the text following the lane4 token up to the first '_' (underscore) character (indicated in bold):
You can force the GCBridge to use the sample from the first part when they are not unique by adding the following text to the URL :
and reloading the page.
For example :
Importantly, non-unique names across the form (in a given lane) should reflect a well-defined situation : the situation when identical samples have been sequenced with different barcodes.
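The choice of default names described in this section can be summarised by the following Python sketch. It is an illustration under assumptions: the function names and the `force_primary` flag (standing in for the URL option mentioned above) are hypothetical, and falling back to the first-part names when neither set is unique is a guess, since that case is not described here.

```python
from typing import List


def all_unique(names: List[str]) -> bool:
    """True if no name appears twice."""
    return len(set(names)) == len(names)


def choose_default_names(primary: List[str], alternative: List[str],
                         force_primary: bool = False) -> List[str]:
    """Pick the default sample names proposed in the form.

    primary: names taken from the first part of the file names.
    alternative: names taken from the lane information part.
    """
    if force_primary or all_unique(primary):
        return primary
    if all_unique(alternative):
        return alternative
    return primary  # assumption: keep the first-part names otherwise
```

With `force_primary=True`, repeated names are kept on purpose, which corresponds to the well-defined situation in which identical samples have been sequenced with different barcodes.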