
JemBASE API

Command line utilities to access NGS data stored in emBASE

Utilities are available as executable jars in  /g/funcgen/gbcs-tools/embase-cmdline/

Options always available

  • Help can be obtained by adding -h (or --help) to the command line options.
  • The unix user executing the command is used to log in to emBASE. If the user is valid and has the credentials to perform the operations, the command is executed. In any case, only data readable by the user is returned to stdout.
  • If you need to execute the command as a user other than the one you are logged in as, use the --username username option in combination with the --password option; you will be prompted for the password if it is left empty on the command line, i.e.:

Either use:

 command line ... --username girardot --password 12345678 

Or

 command line ... --username girardot --password 

and you'll get a prompt asking for the password (what you type in will be masked by stars)

 > emBASE password for --username: ****** 

Options available in all GetNGS... programs, i.e. programs extracting information connected to sequencing assays

--cols, -c : gives the list of column names to include in the output (order is preserved in output). In addition to these, the sample annotation columns are always included unless you give the --nosa (-n) switch. For example:

 -c NGSAssayName, LibraryName, RBAFastqDir 

will only print the assay name, library name and the path to the fastq storage dir (and sample annotations if available)

--nosa, -n : do NOT include sample annotation columns in output

If sample annotations are available, the example above should be rewritten as

 -c NGSAssayName, LibraryName, RBAFastqDir --nosa 

to only print the assay name, library name and the path to the fastq storage dir

N.B.: The list of available header names can be obtained by calling --help

--split, -s : duplicates output lines so that a single file path is listed in each RBAFastqFiles or RBABamFiles field (by default these fields hold comma-separated lists of paths)

This is particularly convenient if you want to work directly with the file paths, e.g. in a unix pipe.

For example, if you wish to create symbolic links in the current directory (imagine this is your project directory) to all demultiplexed fastq files for a new lane that just arrived in your NGS data library, you could simply run the following command, which works in both paired-end and single-end situations thanks to --split:

 java -jar GetNGSByFlowcellLane.jar -f C3WEMACXX -l 6 --split -n -c RBAFastqFiles | awk 'NR!=1{system("ln -s " $1)}'  

N.B. : the awk NR!=1 condition is used to ignore the first output line that contains header(s)
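
The awk one-liner above splits on any whitespace and hands an unquoted string to a shell, so it breaks on paths containing spaces. A minimal sketch of a more defensive variant follows, assuming the output has one header line and the file path in the first tab-delimited column (as with --split -n -c RBAFastqFiles). The function name link_files is made up for illustration and is not part of the emBASE utilities:

```shell
# Hypothetical helper: read tool output on stdin, skip the header line,
# and symlink each listed file into the target directory (default: .).
link_files() {
    target_dir="${1:-.}"
    # tail -n +2 drops the header; IFS is a tab so spaces in paths survive.
    tail -n +2 | while IFS="$(printf '\t')" read -r path _rest; do
        [ -n "$path" ] || continue          # ignore empty lines
        ln -s -- "$path" "$target_dir/"     # the link keeps the original file name
    done
}

# Usage (the java command is the one from the example above):
# java -jar GetNGSByFlowcellLane.jar -f C3WEMACXX -l 6 --split -n -c RBAFastqFiles | link_files .
```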

Fetch NGS information for a lane

These utilities extract information about all samples (well, libraries to be exact) found in a given lane of a particular run. Extracted information is formatted as tab-delimited columns with one library per row (unless --split is used) and is printed directly to stdout. Headers should be self-explanatory.

The lane can be specified either using the fully qualified path to the fastq lane file or by specifying the flowcell ID and lane number.

Using the path to the fastq lane file

Example (log on e.g. spinoza):

    java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByLaneFilepath.jar -f /g/furlong/incoming/2013-07-02-D2731ACXX/D2731ACXX_4C_18_13s004570-1-1_Ghavi-helm_lane113s004570_sequence.txt.gz

Important: for paired data, only the path to the first read file must be given

Using the flowcell id and lane number

Example (log on e.g. spinoza):

    java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByFlowcellLane.jar -f D26W9ACXX -l 8
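
Since the output is tab-delimited with a header line, a column can be picked by name rather than by position in a downstream pipe. A minimal sketch, assuming the header names match those listed by --help exactly; pick_col is a hypothetical helper, not one of the emBASE utilities:

```shell
# Hypothetical helper: print the values of one named column from
# tab-delimited input whose first line holds the column headers.
pick_col() {
    awk -F '\t' -v name="$1" '
        NR == 1 {
            # locate the requested header in the first line
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "no such column: " name > "/dev/stderr"; exit 1 }
            next
        }
        { print $col }
    '
}

# Usage:
# java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByFlowcellLane.jar -f D26W9ACXX -l 8 | pick_col LibraryName
```

Looking the column up by name keeps the pipe working if the column order changes, for instance when sample annotation columns are added to the output.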

Fetch NGS information for all lanes connected to an experiment or to a project

These utilities behave similarly to GetNGSByFlowcellLane.jar and GetNGSByLaneFilepath.jar but extract information for more than one NGS Assay.

Get information for all assays connected to an experiment (or more)

You can either use the experiment name or its emBASE internal ID. Example (log on e.g. spinoza):

 java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByExperiment.jar -e  MYEXP 

If your experiment name contains spaces, you need to quote it :

 java -jar GetNGSByExperiment.jar -e BiTSHeart FAIRE 

will NOT work while

 java -jar GetNGSByExperiment.jar -e "BiTSHeart FAIRE" 

will.

Or directly using the internal experiment id, for example experiment 269:

 java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByExperiment.jar -e  269 

Finally, you can pass as many exp names or ids as you wish with the following equivalent syntaxes :

 java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByExperiment.jar -e  269  -e 137 

or

 java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByExperiment.jar -e  269 137 

or

 java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByExperiment.jar -e  269,137 
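
To see why the unquoted experiment name fails: the shell splits arguments on whitespace before the program ever sees them, so an unquoted BiTSHeart FAIRE reaches the tool as two separate values (presumably read as two experiment names, as in the space-separated syntax above). A small sketch that makes this visible without calling emBASE; show_args is a hypothetical stand-in for the jar:

```shell
# Hypothetical stand-in for the jar: print each received argument on its own line.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

show_args -e BiTSHeart FAIRE     # three arguments: -e, BiTSHeart, FAIRE
show_args -e "BiTSHeart FAIRE"   # two arguments: -e, BiTSHeart FAIRE
```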

Get information for all assays connected to a project

The GetNGSByProject.jar utility works exactly like GetNGSByExperiment.jar but uses the --project (or -pr) option to accept one project name or id:

 java -jar /g/funcgen/gbcs-tools/embase-cmdline/GetNGSByProject.jar -pr MYPROJECT  

 

Adding demultiplexed files in your NGS library

When files have been demultiplexed outside your group NGS library (or before the Jemultiplexer USE_EMBASE option was available), it is a good idea to push these files back into your group NGS library. 

Here we are talking about both fastq and BAM files.

At this point, I can hear you thinking "Why should I do this?". There are a number of reasons for doing this :

  1. To allow everyone (in your lab) to access these files if needed without re-demultiplexing the fastq, re-mapping reads, etc. (this latter point is particularly true for BAM files)
    • Key aspect for BAM files: since we are still talking about library-specific files (i.e. prior to merging of replicates), we strongly suggest that you store BAM files with all reads, i.e. not filtered (remember we have no way to actually control the content of the BAM files you copy into the NGS lib, so make sure to store the good versions!)
    • We also suggest that you store BAM files that are sorted by genomic coordinates and in which duplicates are flagged, and that you also copy the BAM index. This will allow for quick processing and visualization of the data
    • This will ultimately save space as everyone links against the right data in the group repos. When data lies in personal folders, people always tend to copy files instead of linking them (simply because they have no guarantee on what will happen to the file)
  2. To delete lane files and free costly space. Space usage is a growing problem. An easy way to save space is to avoid storing both the lane file AND the demultiplexed versions of it.
    • EmBASE won't let you delete lane files unless you have provided all the demultiplexed files and locked them.
    • So once you have checked that the demultiplexing is OK, move the files back into your group NGS lib, lock the directories for the lane and delete the lane file.
  3. To enable GBCS to transfer your data upon publication of your research. Indeed, public data repositories require the demultiplexed data to be provided. If this data is already stored in the right place, we can easily support you.

 

To easily deposit your demultiplexed files in your group NGS lib, we have developed the AddLibraryFilesToNGSRepository tool. You can find the latest version of the tool in the usual emBASE API dir, and help is displayed by issuing:

 java -jar /g/funcgen/gbcs-tools/embase-cmdline/AddLibraryFilesToNGSRepository.jar -h

 

A typical and minimal command line would be :

 java -jar /g/funcgen/gbcs-tools/embase-cmdline/AddLibraryFilesToNGSRepository.jar -d /path/to/dir/with_files_to_add -s inread

where:

  •  -d is the path to the directory containing the demultiplexed files. The directory should only contain demultiplexed files and all files should be of the same type, i.e. bam or fastq (gz allowed) 
  • -s is the rematching strategy, i.e. it tells emBASE which strategy should be used to find out the library for each provided file. There are currently 5 different strategies:

     

    • 'bcfile' : a barcode file is provided. Automatically set if -bc is given; conversely, --barcodes or -bc is expected when this value is given

    • 'inread' : the barcode is found in demultiplexed file as the last field of the read header (fastq) or name (bam)

    • 'name-match' : use emBASE library name to match the file name. Note that a case-insensitive match is performed.

    • 'barcode-match' : use emBASE library associated barcode to match the file name. Note that a case-insensitive match is performed.

    • 'ID-match' : use emBASE Library internal ID to match the file name, which must hold a 'LIB' token. Note that a case-insensitive match is performed.
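
Before relying on the 'inread' strategy, it can be worth checking what the last header field of your fastq files actually contains. The sketch below is only an illustration and not part of the tool: it ASSUMES that header fields are separated by ':' (adapt the separator to your files), and header_barcodes is a hypothetical helper name. It reads uncompressed fastq on stdin and prints one line per distinct value with its read count; a properly demultiplexed file should yield a single barcode.

```shell
# Hypothetical check: count the distinct values found in the last
# ':'-separated field of each read header (every 4th line, starting at line 1).
# ASSUMPTION: header fields are ':'-separated; change -F if yours differ.
header_barcodes() {
    awk -F ':' 'NR % 4 == 1 { count[$NF]++ }
                END { for (bc in count) print count[bc] "\t" bc }' | sort -k2
}

# Usage on a gzipped fastq file (the file name is a placeholder):
# gzip -dc /path/to/dir/with_files_to_add/sample.fastq.gz | header_barcodes
```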

By default:

  • a dry run is performed and the tool only evaluates the operations to perform (and gives you feedback). Give the --execute (or -e) flag to tell the tool to execute all operations. We strongly encourage users to do a dry run first to check that everything looks fine.
  • The program tries to 'move' files but this usually fails (java issue I need to fix); therefore a bunch of shell commands are printed, which you can then execute to move all the files to their final location (and create symbolic links to them if you specified the -S option). If you prefer to copy files into the NGS lib, use the -C option
  • the process edits each file name to inject the LIB_RBA unique code in the file name (right before the extension). Add the --keep-names (or -K) flag to keep the file names untouched. This is not encouraged as it results in a lack of traceability.
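
The dry-run-first habit described above can be sketched as a tiny wrapper that builds the command line and adds --execute only on request. This is an illustration under assumptions: add_files_cmd is a hypothetical helper, it only prints the command (so it can be reviewed and then run by hand), and it uses just the -d, -s and --execute options described in this page.

```shell
# Hypothetical wrapper: print the dry-run command, or the real-run command
# when the third argument is "execute". Nothing is executed here.
JAR=/g/funcgen/gbcs-tools/embase-cmdline/AddLibraryFilesToNGSRepository.jar

add_files_cmd() {
    dir="$1"; strategy="$2"; mode="${3:-dry}"
    set -- java -jar "$JAR" -d "$dir" -s "$strategy"
    [ "$mode" = "execute" ] && set -- "$@" --execute
    echo "$@"
}

add_files_cmd /path/to/dir/with_files_to_add inread            # dry run (default)
add_files_cmd /path/to/dir/with_files_to_add inread execute    # adds --execute
```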

Additionally, you can specify:

  •  --replace-libfiles or -R : to replace existing files, i.e. files found in the NGS repository for the same library (identified by barcode match) will be trashed even when they don't have the same name. Replacing existing files requires that you have write access on the individual raw bio assay (i.e. in emBASE) AND that the directory is not locked. This option is mutually exclusive with the -I option. To limit the potential damage of batch overwriting files, this option can be turned on ONLY when a flowcell (-f) and lane (-l) are provided.
  • --ignore-existing or -I : to skip over existing files when copying (cannot be used together with the -R option). This is particularly useful if something went wrong and you want to restart a command.
  • --help or -h : to see other available options