Overview of IT resources at EMBL and in the GB department: A guide for new GB members
When starting a new job, we all have the same questions: which server can I work on? Is there a cluster for long jobs or large arrays of parallel jobs? Can I work remotely? Can I expose a web server to the outside? How do I transfer data to collaborators? Where can I store my data?
The goal of this page is not to repeat what is already available elsewhere, but to point you to these resources and give you a feel for who can help you solve your IT issues. If in doubt, simply ask the GBCS crew.
What GBCS can do for you
Genome Biology Computational Support (GBCS) is part of the Genome Biology Unit. GBCS offers computational support to all researchers across the GB unit, including:
- Maintenance of servers
- Interactive compute servers (schroedinger, spinoza, etc.)
- Web application and database hosting servers (gbcs, gbcs-dev, ouzhou)
- If you maintain your own server, we might also be able to help you solve issues, but IT should be your first point of contact.
- Installation of tools under SEPP for broad availability; the software is available EMBL-wide under /g/software/bin/. For more information and how to use this software, please check SEPP-softwares
- Web Applications and Application Set-up Service
- emBASE for NGS & microarray storage (& analysis). You can find emBASE at emBASE (please request a login from us) and more information at the emBASE project page
- Galaxy for web-based data analysis. You can log in to Galaxy with your EMBL account; the first time you log in, send an email to email@example.com so that we can complete the setup of your account. More information at the Galaxy project page.
- GCBridge for automatically transferring NGS data from GeneCore server to emBASE and Galaxy. More information at GC Bridge page.
- R Studio Server for using R on schroedinger and spinoza
- BatchPrimer3 for designing hundreds of primers with primer3 in a single click!
- Genome browsers such as JBrowse ...
- Help you analyze your data (not only NGS!) from simple advice to doing-it-for-you
- Help you manage your data (not only NGS!), including database design and web application development or consulting on storage strategy
Feel free to contact us if you have any questions (firstname.lastname@example.org).
- Version and share your work using git.embl.de
- Chat with us, IT, or your group using dedicated channels on chat.embl.org
The EMBL-IT group provides a broad spectrum of information technology services, solutions and support to EMBL users at the labs in Heidelberg and Monterotondo, such as EMBL user accounts, network access, email, printing, software, data sharing, central data storage / file systems, and compute resources.
If you need assistance with "pure" IT problems, you should contact them.
More information at EMBL-IT
Bio-IT is the EMBL portal for computational biology. It provides a focal point for EMBL biocomputing resources and promotes social and professional interaction within the bioinformatics community at EMBL. It also organizes computational training courses and seminars.
Very useful information can be found at the Bio-IT site.
IT Services provides a centrally managed data storage infrastructure for all groups at EMBL. A tiered storage model has been implemented with three different categories of storage: Tier-1 and Tier-2 storage as well as a Large Volume Archiving tier. EMBL groups can choose space from these storage tiers based on their individual needs for performance, availability, pricing, etc. If your group needs central disk space, please contact IT. More information at IT Service-Data Storage and Storage
Compute Cluster / Big-mem GB Servers
GB Unit big-mem Interactive Servers
These are high-memory machines bought by the Unit, which you can log in to for interactive computing. They are especially well suited when your resource needs change as you work, or when you need a lot of RAM; the second situation is less of a problem now that the IT-managed LSF cluster has 8 similar beasts (bigmem queue). A good example is when you sporadically need e.g. 20 cores and hundreds of GB of RAM to quickly perform matrix operations in R. To do this on the cluster, you would have to book the maximum resources you might need for your whole work session, resulting in a massive waste of resources.
schroedinger
- log in from within EMBL with e.g. ssh schroedinger
- 40 CPUs, 1024 GB RAM, CentOS6
- 9 TB of local space in /tmpdata, please see storage policies
- registered as LSF submitter i.e. you can bsub to the cluster
spinoza
- log in from within EMBL with e.g. ssh spinoza
- 40 CPUs, 1024 GB RAM, CentOS7
- 9 TB of local space in /tmpdata, please see storage policies
- registered as SLURM submitter i.e. you can sbatch to the cluster
Using the interactive servers allows you to run jobs without specifying memory usage. However, VERY IMPORTANTLY, please check the capacity left on the machine before running jobs and always use only a fair share of the resources. Long-lasting jobs with predictable resource usage should always go to the LSF/SLURM clusters.
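As a rough illustration of "checking the capacity left", here is a minimal Python sketch that estimates how many cores and how much memory are currently free on the machine you are logged in to. The helper names and the 25% reserve are assumptions made for this example, not a GBCS policy; tools such as top or free give you the same information interactively.

```python
import os


def spare_cores(total_cores, load_1min, reserve_fraction=0.25):
    """Estimate a polite core budget: idle cores minus a reserve left for others.

    reserve_fraction is an arbitrary choice for this sketch, not an official rule.
    """
    idle = total_cores - load_1min
    return max(0, int(idle - reserve_fraction * total_cores))


def available_gb(meminfo_text):
    """Return the memory that is readily available, in GB, from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # memory values are given in kB
    if "MemAvailable" in fields:
        kb = fields["MemAvailable"]
    else:
        # Older kernels do not report MemAvailable; approximate it instead.
        kb = fields.get("MemFree", 0) + fields.get("Buffers", 0) + fields.get("Cached", 0)
    return kb / 1024 ** 2


cores = os.cpu_count() or 1
load_1min = os.getloadavg()[0]
print(f"{cores} cores, 1-min load {load_1min:.1f}, "
      f"suggested core budget: {spare_cores(cores, load_1min)}")
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        print(f"memory available: {available_gb(handle.read()):.0f} GB")
```

If the suggested budget is well below what your job needs, that is a good hint to wait or to submit to the cluster instead.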
EMBL LSF/SLURM Clusters (LSF will soon be replaced by a new SLURM-based cluster)
- More information about EMBL SLURM cluster at EMBL-cluster
- More information about EMBL LSF cluster at EMBL-cluster
LSF Submission System
- Software for managing and accelerating batch workloads by distributing jobs across the cluster
- To run jobs on the EMBL clusters, log in to the submaster machine (or schroedinger/spinoza) and submit jobs with "bsub"
- For more information, please check IT-LSF and EMBL-cluster
When a job has problems, you can use "bjobs -l" to get detailed information about it, for example which compute node it is running on. You can then ssh to that node (such as computen-036) via the submaster1 machine (not from schroedinger/spinoza) to check the detailed status of the running job.
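If you drive your submissions from a script, the bsub / bjobs workflow described above could look roughly like the following Python sketch. The function names are made up for this example, and the queue and log file names are placeholders; only the bsub options -o and -q and the "bjobs -l" call are taken from LSF itself. The script must run on a machine registered as an LSF submitter for submit() and job_details() to work.

```python
import re
import subprocess


def parse_job_id(bsub_output):
    """Extract the numeric job ID from bsub's confirmation message, or None."""
    match = re.search(r"Job <(\d+)> is submitted", bsub_output)
    return int(match.group(1)) if match else None


def submit(command, queue=None, log_file="job_%J.log"):
    """Submit a shell command string with bsub and return its job ID.

    LSF replaces %J in the log file name with the job ID.
    """
    args = ["bsub", "-o", log_file]
    if queue:
        args += ["-q", queue]
    args.append(command)
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return parse_job_id(result.stdout)


def job_details(job_id):
    """Return the long description printed by `bjobs -l <job_id>`."""
    result = subprocess.run(["bjobs", "-l", str(job_id)],
                            check=True, capture_output=True, text=True)
    return result.stdout


# Parsing can be tried anywhere, without access to the cluster:
print(parse_job_id("Job <4711> is submitted to queue <normal>."))  # prints 4711
```

On a submitter host, a session might then be: job_id = submit("Rscript analysis.R", queue="bigmem"), followed by print(job_details(job_id)) to see which node the job landed on (analysis.R is a hypothetical script name).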
Cluster Storage solutions
See also: Storage
When working with huge data files, where to store your input/output files becomes an issue. The only file servers accessible from the cluster nodes are the ones on tier1. Other kinds of storage are also accessible from the cluster nodes, and you should pick the right one for the situation. You could store everything on tier1, but this would cost a lot, and you should avoid having many processes access the same file (e.g. an index file) concurrently.
So what to do?
- Use the /tmp space of the cluster node to store intermediate files that you won't need later. You can make sure there is enough tmp space by requesting a minimum amount in the bsub options, e.g. 'bsub -R "select[tmp>40000]"'. Advantage: very fast. Disadvantage: temporary and only available on the current node.
- Use the 100 TB of space available under /scratch (check this IT post). The huge advantages of /scratch: it is available from all cluster nodes, it supports parallel file access (FHFS), and it is also accessible from the spinoza and schroedinger GB servers. Disadvantage: no backup and temporary.
- Use tier2 space (4 times cheaper than tier1!) to store your files. Remember that tier2 is also safe: it has RAID6 redundancy, but there is no backup. You can then stage in your files at the beginning of your job using scp, i.e. you scp the needed data to tier1 space, local /tmp or /scratch. Advantage: cheap, permanent. Disadvantage: no backup, not accessible from the cluster.
See? Plenty of solutions not to waste your expensive tier1 space!
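To make the stage-in idea from the list above more concrete, here is a minimal Python sketch: it creates a private working directory (under /tmp or /scratch), copies the input there with scp, and cleans up when the job is done. The helper names, the host name fileserver.example and the file path are all hypothetical placeholders; replace them with a host that can actually see your tier2 space.

```python
import shutil
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def job_workdir(root=None):
    """Create a private working directory under root and remove it on exit.

    root could be "/tmp" or "/scratch"; None uses the system default temp dir.
    """
    workdir = Path(tempfile.mkdtemp(prefix="job_", dir=root))
    try:
        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)


def scp_command(host, remote_path, local_dir):
    """Build the scp argument list that copies host:remote_path into local_dir."""
    return ["scp", f"{host}:{remote_path}", str(local_dir)]


def stage_in(host, remote_path, local_dir):
    """Copy one input file into local_dir and return its local path.

    Raises subprocess.CalledProcessError if scp fails.
    """
    subprocess.run(scp_command(host, remote_path, local_dir), check=True)
    return Path(local_dir) / Path(remote_path).name


# Dry run: show what would be copied, without contacting any server.
with job_workdir() as workdir:
    cmd = scp_command("fileserver.example", "/path/on/tier2/reads.fastq.gz", workdir)
    print("working in", workdir)
    print("would run:", " ".join(cmd))
```

In a real job you would call stage_in(...) inside the with block, run your analysis on the local copy, and copy the results you want to keep back to permanent storage before the block ends, since the working directory is deleted on exit.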