Overview of IT resources at EMBL and in the GB department: A guide for new GB members
When starting a new job somewhere, we all have the same questions: What servers can I use? Is there an HPC cluster for running long jobs or large arrays of parallel jobs? Can I work from home or on the go? Can I expose a web server to the outside? How do I transfer data to collaborators? Where can I store my data?
The goal of this page is not to repeat what is already available elsewhere (in particular on the ITS pages) but rather to point you to these resources and give you a feel for who might be able to help you solve your IT issues. When in doubt, simply ask GBCS.
What GBCS can do for you
The Genome Biology Computational Support (GBCS) team is part of the Genome Biology Unit. GBCS offers various kinds of computational support to all researchers across the GB Unit, including:
- Maintenance of servers
- Interactive compute servers (schroedinger, spinoza)
- Web application, database hosting servers
- If you maintain your own server, we might also be able to help you solve issues; however, the ITS page is most likely your first point of contact.
- Installation of tools centrally for broad availability on EMBL servers/HPC cluster.
- software we install is then available under /g/funcgen/bin/
- Web Applications and Application Set-up Service
- emBASE for NGS & microarray storage (& analysis). You can find emBASE at emBASE (please request a login from us) and more information at the emBASE project page
- Galaxy for web-based data analysis. You can log in to Galaxy with your usual EMBL account, but please send us an email the first time you log in so we can complete the setup of your account. More information at the Galaxy project page.
- GCBridge for automatically transferring NGS data from GeneCore server to emBASE and Galaxy. More information at GC Bridge page.
- RStudio Server and Shiny Server for using R on spinoza
- Help with analyzing your data (not only NGS!), from simple advice to doing it for you
- Help with managing your data (not only NGS!), including database design, web application development, and consulting on storage strategy
- Version and share your work using git.embl.de
- Chat with us, IT, or your group using dedicated channels on chat.embl.org (e.g. the GBCS channel, or the cluster channel if you are using the HPC cluster, ...)
See also https://gbservices.embl.de
Feel free to contact us if you have any questions (email@example.com).
EMBL IT Services
The EMBL IT group provides a broad spectrum of information technology services, solutions and support to EMBL users in Heidelberg and Monterotondo, such as EMBL user accounts, network access, e-mail, printing, software, data sharing, central data storage / file systems, and compute resources.
If you need assistance with "pure" IT problems, you should contact them.
More information at EMBL-IT
Bio-IT
Bio-IT is the EMBL portal for computational biology: a focal point for EMBL biocomputing resources that promotes social and professional interactions within the bioinformatics community at EMBL. It also organizes computational training courses and seminars.
Very useful information can be found on the Bio-IT site (please note that you might need to create an account with them to access specific content). Also note the Bio-IT chat channel.
Central Data Storage
IT Services provides a centrally managed data storage infrastructure for all groups at EMBL. A tiered storage model has been implemented with three categories of storage: Tier-1 and Tier-2 storage, plus a Large Volume Archiving tier. EMBL groups can choose space from these storage tiers based on their individual needs for performance, availability, pricing, etc. If your group needs central disk space, please contact IT. More information at IT Service-Data Storage and Storage.
Compute Cluster / Big-mem GB Servers
GB Unit big-mem Interactive Servers
These are high-memory machines bought by the GB Unit where you can log in to perform interactive computing. They are especially well suited when your resource needs change as you work, or when you need a lot of RAM; the second situation is less of a problem since the IT-managed HPC cluster now has similarly large machines. A good example is when you sporadically need, say, 20 cores and hundreds of GB of RAM to quickly perform matrix operations in R. To do such work on the cluster, you would need to book the maximum resources you might need during your whole work session, resulting in a massive waste of resources.
spinoza:
- log in from within EMBL with e.g. ssh spinoza
- 40 CPUs, 1024 GB RAM, CentOS 7
- 9 TB of local space in /tmpdata, please see storage policies
- registered as a SLURM submitter, i.e. you can directly submit jobs to the HPC cluster (no need to log in to login.cluster.embl.de)
schroedinger:
- log in from within EMBL with e.g. ssh schroedinger
- 40 CPUs, 1024 GB RAM, CentOS 6
- 9 TB of local space in /tmpdata, please see storage policies
- registered as an LSF submitter, i.e. you can bsub to the old cluster, which is to be retired soon (if not already retired as you read this!)
- schroedinger also acts as a backup machine, providing the old environment where one can finish old projects using tools available under /g/software/bin
Using the interactive servers allows you to run jobs without specifying memory usage. However, VERY IMPORTANTLY, please check the capacity left on the machine before running your jobs, and always use a fair share of the resources. Long-lasting jobs with predictable resource usage should always go to the HPC cluster.
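A quick way to check what is free before launching heavy work (a sketch; /tmpdata only exists on the GB servers, and on other machines the line is skipped):

```shell
# Quick capacity check before launching heavy work on spinoza/schroedinger:
nproc                               # number of CPU cores on the machine
cat /proc/loadavg                   # load averages: compare the 1st field to the core count
grep MemAvailable /proc/meminfo     # RAM still available for new processes
df -h /tmpdata 2>/dev/null || true  # space left on the local /tmpdata disk (GB servers only)
# or simply run "htop" for an interactive overview of CPU/RAM per process
```

If the load average already approaches the core count, or MemAvailable is low, wait or move your job to the HPC cluster instead.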
EMBL LSF/SLURM Clusters (LSF was replaced by SLURM in Jan 2017)
- More information about EMBL SLURM cluster at EMBL-cluster
- More information about EMBL LSF cluster at EMBL-cluster
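As a minimal sketch of SLURM usage (resource values are illustrative, and job names/commands are placeholders; check the EMBL-cluster page for site specifics):

```shell
# Write a minimal SLURM batch script (values are illustrative placeholders):
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=01:00:00
echo "running on $(hostname)"
EOF
# From spinoza (a registered submitter) or login.cluster.embl.de:
#   sbatch job.sh        # submit the batch script
#   squeue -u $USER      # list your pending/running jobs
#   scancel <jobid>      # cancel a job
```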
LSF Submission System
- Software for managing and scheduling batch workloads, distributing jobs across the cluster nodes
- To run jobs on the EMBL clusters, log in to the submaster machine (or schroedinger/spinoza) and use "bsub" to submit jobs
- For more information, please check IT-LSF and EMBL-cluster
When a job has problems, you can use "bjobs -l" to check detailed information about it; for example, it shows on which node the job is being executed. You can then ssh to that node (such as computen-036) via the submaster1 machine (not from schroedinger/spinoza) to check the job's status in detail.
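Putting the LSF workflow together (a hedged sketch: the job name, resources, and the command are placeholders, and memory units for -M are site-configured, MB is assumed here):

```shell
# Hypothetical LSF job script; all names and values are placeholders:
cat > myjob.lsf <<'EOF'
#!/bin/bash
#BSUB -J align          # job name
#BSUB -n 4              # request 4 cores
#BSUB -M 8000           # memory limit (units are site-configured; MB assumed)
#BSUB -o align.%J.out   # stdout file, %J expands to the job id
./run_alignment.sh      # placeholder for your actual command
EOF
# From submaster (or schroedinger/spinoza):
#   bsub < myjob.lsf     # submit the job
#   bjobs -l <jobid>     # detailed status, including the execution node
#   ssh computen-036     # then inspect that node directly (via submaster1)
```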
Cluster Storage solutions
See also: Storage
When working with huge data files, where to store your input/output files becomes an issue. The only group file servers accessible from the cluster nodes are the tier1 ones; other kinds of storage (node-local /tmp, /scratch) are also accessible from the nodes, and you should pick the right one for each situation. You could store everything on tier1, but this would cost a lot, and you should avoid situations where many processes concurrently access the same file (e.g. an index file).
So what to do?
- Use the /tmp space of the cluster node to store intermediate files that you won't need later. You can make sure you have enough tmp space by requesting a minimum amount in the SLURM submission options.
- Advantage: very fast;
- Disadvantage: temporary and only available to the current node.
- Use the space available under /scratch (check this IT post).
- The huge advantages of /scratch: it is available from all cluster nodes, it supports parallel file access (FhGFS), and it is also accessible from the spinoza and schroedinger GB servers.
- Disadvantage: no backup and temporary.
- Use tier2 space (4 times cheaper than tier1!) to store your files. tier2 is reasonably safe (RAID6 redundancy), but it is not backed up. Since tier2 is not mounted on the cluster nodes, stage your files in at the beginning of your job using scp, i.e. copy the needed data either to tier1 space or to /scratch.
- Advantage: cheap, permanent;
- Disadvantage: no backup, not accessible from the cluster.
See? Plenty of solutions to avoid wasting your expensive tier1 space!
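The three options can be combined in a single job script; below is a sketch under stated assumptions: all paths, the tier2 share name, the spinoza staging host, and the resource values are placeholders, and $TMPDIR is assumed to point at the node-local space allocated via --tmp (check the EMBL cluster docs for the exact convention).

```shell
# Sketch of a SLURM job combining node-local tmp, /scratch, and tier2 staging.
# Every path, host name, and resource value below is a placeholder.
cat > storage_demo.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=storage-demo
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --tmp=50G                      # option 1: guarantee node-local tmp space

WORK="/scratch/$USER/myproject"        # option 2: shared, parallel scratch
mkdir -p "$WORK"

# option 3: tier2 is not mounted on the nodes, so stage the input in
# via a host that mounts it (spinoza here is an assumption):
scp spinoza:/g/mytier2share/input.bam "$WORK/"

# keep heavy intermediate I/O on the node-local disk:
samtools sort -T "$TMPDIR/sort" -o "$WORK/sorted.bam" "$WORK/input.bam"

# stage the result back out to tier2, then clean up scratch:
scp "$WORK/sorted.bam" spinoza:/g/mytier2share/
rm -rf "$WORK"
EOF
# submit with: sbatch storage_demo.sh
```

The pattern to take away: request --tmp for node-local intermediates, use /scratch for files other nodes or the GB servers need to see, and touch tier1/tier2 only at the stage-in and stage-out steps.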