Je
Je is a suite to handle barcoded fastq files with (or without) Unique Molecular Identifiers (UMIs) and filter read duplicates using these UMIs.
If you have barcodes and/or UMIs in your fastq files, you'll most likely enjoy Je. Je currently offers 4 tools:
- demultiplex to demultiplex multi-sample fastq files whose reads contain barcodes and UMIs (or not)
- demultiplex-illu to demultiplex fastq files according to associated index files (which contain the sample-encoding barcodes). Reads can additionally contain UMIs (inline)
- clip to remove UMIs contained in reads of fastq files that do not need sample demultiplexing
- markdupes to filter BAM files for read duplicates taking UMIs into account
In short, Je demultiplex, demultiplex-illu and clip add extracted barcodes and UMIs to the read headers and reformat read headers to fulfill the requirements of read mappers. Indeed, most read mappers (bowtie, bwa...) expect headers for read_1 and read_2 to be strictly identical. After mapping, markdupes identifies PCR (and optical) read duplicates based on their mapping positions and the UMIs found in read headers.
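The idea behind UMI-aware duplicate detection can be illustrated with a minimal Python sketch. It is not Je's implementation: the flat record layout `(name, chrom, pos, strand, umi)`, the exact UMI comparison and the choice to keep the first read of each group are all simplifying assumptions made for this example.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Return the names of reads that share mapping position, strand and UMI
    with an earlier read.

    `reads` is a list of (name, chrom, pos, strand, umi) tuples; this layout
    is hypothetical and only serves the illustration.
    """
    groups = defaultdict(list)
    for name, chrom, pos, strand, umi in reads:
        groups[(chrom, pos, strand, umi)].append(name)
    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])  # assumption: keep the first read of each group
    return duplicates

reads = [
    ("r1", "chr1", 100, "+", "ACGT"),
    ("r2", "chr1", 100, "+", "ACGT"),  # same position and same UMI as r1
    ("r3", "chr1", 100, "+", "TTGA"),  # same position but different UMI
    ("r4", "chr2", 100, "+", "ACGT"),  # same UMI but different position
]
print(sorted(mark_duplicates(reads)))  # ['r2']
```

Without UMIs, r3 would also be flagged as a duplicate of r1; the UMI is what tells the two molecules apart.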
Source Code & Executables
Je source code (Java) is available in git at EMBL GitLab and executables in the repository dist directory. In addition, we provide the je-suite through bioconda and galaxy.
Installation
Using conda
"conda install -c bioconda je-suite"
Manual
- Download the latest Je tar.gz (e.g. je_1.2.tar.gz) from the code repository: simply go to EMBL GitLab and navigate to the dist directory (Files menu)
- Unpack
- Call "~/jesuite/je -h" to see the available options
Note that Java 7 is expected.
For Galaxy
As an admin of your local Galaxy installation, simply look for "suite_je" in the toolshed and click install!
Getting help
- Read this page
- Subscribe to Je's mailing list (je -at- embl.de) or read its archive.
- Consider the command line help i.e. -h
Citing Je
Je, a versatile suite to handle multiplexed NGS libraries with unique molecular identifiers.
Girardot C, Scholtalbers J, Sauer S, Su SY, Furlong EE.
BMC Bioinformatics. 2016 Oct 8;17(1):419.
Selecting correct demultiplexing options (graphical decision trees)
Options about barcoding configuration / sample resolution :
- Where are the barcodes (BPOS option) : READ_1, READ_2 or BOTH
- Which barcode should be used for sample look up (BM option) : READ_1, READ_2 or BOTH
- If BOTH, are the barcodes REDUNDANT i.e. do they both resolve to the same sample (BRED option)?
- If BOTH and REDUNDANT, should both barcodes be required to resolve to the same sample or not (S option)?
Options about barcode matching :
- How many mismatches (MM option)
- Minimum quality for the base (Q option)
- Minimum mismatch delta with the second best match (when mismatches are present, MMD option)
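How the three matching options interact can be sketched in Python as follows. This is an illustration, not Je's code: the function names are hypothetical, and two behaviours are assumptions made for the example, namely that a base below the quality cutoff counts as a mismatch and that the mismatch delta is only checked when the best match has at least one mismatch.

```python
def count_mismatches(observed, qualities, barcode, min_quality):
    """Count mismatching positions; a base whose quality is below
    min_quality is counted as a mismatch (assumption)."""
    return sum(
        1
        for base, quality, expected in zip(observed, qualities, barcode)
        if base != expected or quality < min_quality
    )

def best_barcode(observed, qualities, barcodes, mm=1, mmd=1, q=10):
    """Return the best matching barcode, or None if the match is rejected."""
    scored = sorted(
        (count_mismatches(observed, qualities, barcode, q), barcode)
        for barcode in barcodes
    )
    best_count, best = scored[0]
    if best_count > mm:
        return None  # too many mismatches (MM)
    if best_count > 0 and len(scored) > 1 and scored[1][0] - best_count < mmd:
        return None  # second best match is too close (MMD)
    return best

barcodes = ["ATATAT", "GACGAC"]
good_quality = [30] * 6
print(best_barcode("ATATAA", good_quality, barcodes))              # ATATAT
print(best_barcode("ATTTAA", good_quality, barcodes))              # None
print(best_barcode("ATATAT", [30, 30, 5, 5, 30, 30], barcodes))    # None
```

The last call shows the effect of Q: the sequence is a perfect match, but two low-quality bases push it over the MM=1 limit.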
Options about read processing :
- Should the barcode be removed (C option)?
- Should extra bases at beginning (XT option) and/or end (ZT option) of the read be removed?
- => important to control read length (e.g. when the barcode is at one end only)
- => these values can be different for both ends using the syntax e.g. ZT=2:5 i.e. trim 2 and 5 bases from the end of read 1 and read 2, respectively.
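The X:Y syntax can be pictured with a small Python sketch. The helper names are hypothetical, and treating a single value as applying to both reads is an assumption of this example, not a statement about Je's parser.

```python
def parse_per_read(value):
    """Parse 'X' or 'X:Y' into a (read_1_value, read_2_value) pair.
    A single value is applied to both reads (assumption)."""
    parts = [int(part) for part in value.split(":")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"expected 'X' or 'X:Y', got {value!r}")

def trim(sequence, start=0, end=0):
    """Remove `start` bases from the beginning and `end` bases from the end."""
    return sequence[start:len(sequence) - end]

# ZT=2:5 on two 50-base reads
zt_read_1, zt_read_2 = parse_per_read("2:5")
read_1 = trim("A" * 50, end=zt_read_1)
read_2 = trim("C" * 50, end=zt_read_2)
print(len(read_1), len(read_2))  # 48 45
```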
Options about input and output :
- All output file names and paths can be given by the user; these names/paths should be provided in the barcode file (extra columns)
- Files can be read and written in compressed (gzip) format to save space (default)
- MD5 file can be generated (CREATE_MD5_FILE=true)
Below are decision workflows to help you pick the right key options depending on your situation.
Je's demultiplex module (no Illumina index files)
Je's demultiplex-illu module (with Illumina index files)
Practical demultiplexing examples
Single End
Simply call (will use defaults) :
> je demultiplex F1=file1.txt.gz BF=barcodes.txt
Paired End without UMIs
Scenario 1: the same barcode is found in both reads
Maximize the number of reads: reads are attributed to a sample even if only one of the two barcodes resolves to that sample (S=false, the default); reads are ignored if the barcodes resolve to different samples
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt
Same as above but a bit more stringent i.e. keep reads only if both barcodes resolve to the same sample
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt S=true
Note that in both situations:
- Input FASTQ files are expected to be encoded using the phred + 33 scale (V option)
- the output files are automatically gzipped (GZ=true), named after a pattern like samplename_barcode.txt.gz (can be overridden by providing file names in extra columns of the barcode file) and placed in the current dir (output dir can be adapted using O=/path/to/dir).
- Unassigned reads are saved (UN=true) in unassigned_1.txt and unassigned_2.txt files (current dir) as well as a summary file jemultiplexer_out_stats.txt (all these file names and paths can be adapted if needed).
- the barcode lookup uses a maximum of 1 mismatch (MM=1), a minimal mismatch delta of 1 with the second best barcode match (MMD=1) and a minimum base quality of 10 (Q=10)
- the barcodes are removed from reads (C=true) and added to the FASTQ header (ADD=true)
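The difference between S=false and S=true for redundant barcodes can be summarised with a short Python sketch. The function below is hypothetical and simply encodes the rule stated in this scenario; it takes the sample resolved from each read's barcode (or None when the barcode could not be resolved).

```python
def resolve_sample(sample_1, sample_2, strict=False):
    """Combine the samples resolved from the READ_1 and READ_2 barcodes.

    strict=False mimics S=false, strict=True mimics S=true.
    """
    if sample_1 is not None and sample_2 is not None:
        # Conflicting barcodes are never assigned
        return sample_1 if sample_1 == sample_2 else None
    if strict:
        return None  # S=true: both barcodes must resolve
    return sample_1 if sample_1 is not None else sample_2

print(resolve_sample("sample1", None))               # sample1
print(resolve_sample("sample1", None, strict=True))  # None
print(resolve_sample("sample1", "sample2"))          # None
```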
Keep reads only if both barcodes resolve to the same sample, adapt the output dir (all files will be now saved in this dir) and provide demultiplexed file names (unassigned read files will still be named automatically)
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt S=true O=/tmp/jemultiplexer_out/
with a barcode file like:
sample1 ATATAT sample1.txt.gz
sample2 GACGAC sample2.txt.gz
Scenario 2: the barcode is found in only one read
The barcode is in READ_2
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt O=/tmp/jemultiplexer_out/ BPOS=READ_2
N.B.: BM will be automatically set to READ_2, BRED and S will be ignored.
N.B. 2: reads will have different lengths as only one read will have the barcode sequence removed; see the example below to make sure reads have the same length.
The barcode is in READ_2 => make sure read 1 is of the same resulting size by using unbalanced trimming options
We need to compensate for 6 bases i.e. assuming the barcode length is 6, READ_2 will be trimmed by 6 bases
- Example 1: also remove 6 bases at start of READ_1 (XT=6:0)
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt O=/tmp/jemultiplexer_out/ BPOS=READ_2 XT=6:0
- Example 2: remove 6 bases at the end of READ_1 and none at the end of read 2 (ZT=6:0)
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt O=/tmp/jemultiplexer_out/ BPOS=READ_2 ZT=6:0
- Example 3: or use a mix ...
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt O=/tmp/jemultiplexer_out/ BPOS=READ_2 ZT=6:0 XT=1:1
Scenario 3: a different barcode is found in each read, with both barcodes being needed for sample lookup
The barcode file must be like:
sample1 ATATAT:CGCGCG
sample2 GACGAC:TACGTT
i.e. sample 1 has barcode ATATAT in READ_1 and CGCGCG in READ_2 ; and
sample 2 has barcode GACGAC in READ_1 and TACGTT in READ_2
Then :
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt BPOS=BOTH BM=BOTH BRED=false O=/tmp/jemultiplexer_out/
Note that you could use different barcode lookup parameters (if this makes sense!) using the syntax MM=1:2 MMD=2:1 Q=30:10
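The barcode file layouts used in the scenarios above (sample name, one barcode or a READ_1:READ_2 pair, optional output file names) can be read with a few lines of Python. This is only a sketch for inspecting such files: the function name is hypothetical and splitting columns on any whitespace is an assumption, since the column delimiter is not stated here.

```python
def parse_barcode_line(line):
    """Split one barcode file line into (sample, barcodes, output_files).

    Columns are split on whitespace (assumption); a 'XXX:YYY' barcode
    column yields one barcode per read.
    """
    fields = line.split()
    sample, barcode_field = fields[0], fields[1]
    barcodes = tuple(barcode_field.split(":"))
    output_files = tuple(fields[2:])
    return sample, barcodes, output_files

print(parse_barcode_line("sample1 ATATAT sample1.txt.gz"))
print(parse_barcode_line("sample2 GACGAC:TACGTT"))
```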
Paired-end with UMIs or degenerate positions
Scenario 1: one barcode encodes the sample identity (e.g. READ_1) while the other barcode (in READ_2) is a Unique Molecular Identifier
In this example we'll consider that the sample encoding barcode is 6 bases long and the UMI is 10 bases long.
The barcode file will look like:
sample1 ATATAT
sample2 GACGAC
To specify that we have 2 barcodes, we need BPOS=BOTH
To tell Je that only the first read encodes sample identity, we need to explicitly mention BM=READ_1
We must also specify the individual barcode lengths, i.e. 6 and 10: LEN=6:10
Due to the barcode length difference, we'll also compensate by trimming 4 bases from the end of READ_1 only (i.e. to end up with reads of identical length), so we'll add ZT=4:0
> je demultiplex F1=file1.txt.gz F2=file2.txt.gz BF=barcodes.txt BPOS=BOTH BM=READ_1 LEN=6:10 ZT=4:0 O=/tmp/jemultiplexer_out/
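The length arithmetic behind LEN=6:10 and ZT=4:0 can be checked with a minimal Python sketch. The function is hypothetical and only mimics the clipping described in this scenario on two made-up 50-base reads.

```python
def clip_pair(read_1, read_2, len_1=6, len_2=10, zt_1=4, zt_2=0):
    """Clip the sample barcode from read 1 and the UMI from read 2, then
    trim zt_1 / zt_2 bases from the read ends."""
    sample_barcode, rest_1 = read_1[:len_1], read_1[len_1:]
    umi, rest_2 = read_2[:len_2], read_2[len_2:]
    rest_1 = rest_1[:len(rest_1) - zt_1]
    rest_2 = rest_2[:len(rest_2) - zt_2]
    return sample_barcode, umi, rest_1, rest_2

read_1 = "ATATAT" + "C" * 44      # 6-base sample barcode + 44 bases
read_2 = "ACGTACGTAC" + "G" * 40  # 10-base UMI + 40 bases
sample_barcode, umi, rest_1, rest_2 = clip_pair(read_1, read_2)
print(sample_barcode, umi, len(rest_1), len(rest_2))  # ATATAT ACGTACGTAC 40 40
```

Read 1 loses 6 + 4 bases and read 2 loses 10 + 0 bases, so both end up 40 bases long.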
Scenario 2 : iCLIP barcodes i.e. barcodes mixed with random sequence
For iCLIP experiments, barcodes contain random letters (A, T, G or C). For example, a barcode may look like NNNNATTCNN, where N can be A, T, G, C or N. Only the barcode letters from the 5th to the 8th position are used to resolve a sample i.e. Ns are ignored.
The barcode file for an iCLIP may look like:
sample1 NNNNATATATNNN
sample2 NNNNGACGACNNN
PS: All other command line options are like in the other scenarios.
Important: when barcodes (defined in the barcode file) contain Ns, Je demultiplex will copy the extracted sequence at the end of the header line (CLIP and ADD options set to true) instead of the barcode (as found in the barcode file, which is the default behavior). Note that many tools (bowtie2, ...) will clip this information from the read header (due to the space separator). Because you might need to have this information available further down your workflow (when dealing with duplicates), we suggest using the RCHAR option (e.g. add RCHAR=':' in the command line) so all spaces are replaced with a ':'.
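Matching against barcodes with degenerate positions can be sketched in Python as below. This is an illustration under assumptions, not Je's code: the function name, the example read and the header layout "@read_42 <extracted sequence>" are hypothetical; only the idea that N positions are ignored and that spaces can be replaced with ':' comes from the text above.

```python
def matches(observed, pattern, max_mismatches=0):
    """Compare a read prefix to a barcode pattern, ignoring N positions."""
    if len(observed) != len(pattern):
        return False
    mismatches = sum(
        1
        for base, expected in zip(observed, pattern)
        if expected != "N" and base != expected
    )
    return mismatches <= max_mismatches

patterns = {"sample1": "NNNNATATATNNN", "sample2": "NNNNGACGACNNN"}
read = "ACGTATATATGGATTTTCCCCGGGG"
extracted = read[:13]  # random bases + sample barcode + random bases
sample = next(
    (name for name, pattern in patterns.items() if matches(extracted, pattern)),
    None,
)

header = "@read_42 " + extracted          # hypothetical header layout
print(sample, header.replace(" ", ":"))   # sample1 @read_42:ACGTATATATGGA
```

The full 13-base extracted sequence, random bases included, is what ends up in the header, which is why it can later serve as a UMI.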
Scenario 3 : composite barcodes
In case of composite barcodes i.e. sample-encoding barcode followed or prefixed with a UMI, two approaches could be considered.
The first approach to deal with such a scenario is to define a barcode with degenerate positions, using Ns, as explained in the previous section, e.g. (6-base UMI first, then sample barcode):
sample1 NNNNNNATATAT
sample2 NNNNNNGACGAC
Importantly, this approach results in extracting the 12 bases as a unique barcode i.e. the read name would hold these 12 bases and Je’s markdupes module would then use these 12 bases as UMIs. This is normally not an issue for markdupes unless you run markdupes with a pre-defined list of molecular barcodes. In such a situation, you should make sure that each UMI of the predefined list also includes the sample-encoding barcode.
An alternative approach is to combine demultiplex (or demultiplex-illu) with the clip module i.e.
- When UMIs are found before the sample-encoding barcode, clip is run first to extract the UMIs, followed by demultiplex
- When UMIs are found after the sample-encoding barcode, demultiplex is run first, followed by clip to extract the UMIs
In both cases, the sample barcode file is the same and does not specify the UMIs:
sample1 ATATAT
sample2 GACGAC
Je USE_EMBASE run mode
In this running mode, simply call Je demultiplex on the lane file(s) and demultiplex will use information stored in emBASE to demultiplex and place the demultiplexed files directly in your group NGS library (an example of such a group library is shown below and described here).
- No barcode file : barcodes are fetched from emBASE
- File location is used to look up emBASE
- User calling Je demultiplex is used for authentication and rights
- Demultiplexed files are named according to a pre-defined naming scheme and directly placed where they should be => emBASE will automatically know about them
- Demultiplexing options right at GCBridge form validation
Comparison with other demultiplexers
Single end mode : fastx barcode splitter
- Matched options for comparison
- Exactly the same results
Paired-end mode : fastq-multx barcode splitter
- Matched options for comparison
- Can only compare in the ‘standard situation’ (one sample == one barcode)
- Exactly the same results