Frequently Asked Questions

Frequently Asked Questions – Access

User Account

While logged in on one of the login nodes, you can use the command account_expire to see your user account's end-of-validity date.

You will get an automatic email reminder 4 weeks before expiry of your account, and another one on the day of expiry.

To extend your account, fill in and sign the “Nutzungsantrag” again, and send it to us fully signed via internal mail (or via your local contact person).

There are two basic causes: either the login node is not answering at all (it may be down, or there may be network problems), or it answers but denies you access.

  • Try another login node (the one you tried may be down).
  • Read the (ssh) error message in its entirety. Sometimes it even explains how to fix the actual problem.
  • Try to log in explicitly with IPv4 or IPv6:
    ssh -XC -4 <tu-id@>lcluster15.hrz.tu-darmstadt.de
    ssh -XC -6 <tu-id@>lcluster17.hrz.tu-darmstadt.de

If none of the above works: open a support ticket.

Projects

  • Director of the institute: Most departments are organized into institutes (Fachgebiete). If this does not apply to your organization, please insert the dean, or a person with staff responsibility for the main research group.
  • PI: The principal investigator is responsible for the scientific aspects of the project. This can be the director as well as a junior professor or postdoc.
  • Main researcher / project manager: In general, this is the person who does the main part of the work in this project. The PM is responsible for the administrative aspects of the scientific project. He or she is also the “technical contact” the HRZ communicates with.
  • Additional researchers: All other researchers who can compute on this project account. This includes other PhD students as well as students who are working for the project.

We distinguish between the project classes “SMALL”, “MIDDLE”, and “LARGE”.

SMALL includes projects which use up to 204,000 core-hours.
For a 12-month project, this translates to 17,000 core-hours per month, which is roughly equivalent to the continuous use of 24 cores = one compute node.

MIDDLE includes projects between 204,000 and 6,720,000 core-hours, translating to 17,000 to 560,000 core-hours per month in a 12-month project, equivalent to steady use of one island = 32 compute nodes.

LARGE projects are from 6,720,000 core-hours up to 24,500,000 core-hours, translating to 560,000 to 2,040,000 core-hours per month in a 12-month project, equivalent to steady use of ~10% of the total capacity of the Lichtenberg cluster.

The category “EDUCATION” is reserved for courses and trainings, while computing time grants for bachelor and master theses are handled via the above mentioned project classes.

More information about “EDUCATION” can be found at “Lehre und Workshops”.

In general, one main researcher (PhD student or postdoc) owns a project, i.e. is its project manager. This main researcher or PM can decide to add others to his or her project, for instance bachelor or master students, or a colleague he or she is collaborating with on this project.

All these coworkers need to have their own user account on the HLR before being added to a project.

Beware: while sharing your project account is explicitly allowed, sharing your user account is strictly prohibited!

In general, a Lichtenberg project should have the scope and size of a PhD project. For longer research and scientific endeavours, recurrent follow-up projects are needed.

In the initial proposal, however, try to outline the whole scientific goal, not just your first year's targets.

The project manager is responsible for applying and (after completion) for reporting on the project.
He or she works with the HRZ on the (technical) reviews, and hands in the original of the signed proposal to the HRZ.

The proposal has to be signed by the PM and by the principal investigator, who needs to be a professor or postdoc.

For a “SMALL” project, only the web form needs to be completed, printed out, signed and sent to the HRZ.
This form mainly asks for technical details and a small abstract (150-300 words) of the scientific goals.

For “MIDDLE” and “LARGE” projects, a detailed scientific project description (DPD) also has to be provided by the applicant.

“SMALL” and “MIDDLE” projects can be submitted at any time, and the proposals will be processed as they are received.
In case of ambiguous reviews, the proposal is postponed until the decision of the steering committee (Forschungsrechnerbeirat TU Darmstadt).

“LARGE” project proposals are due on the fourth of February, May, August, or November.

All projects are subject to a technical review by the HRZ. “MIDDLE” and “LARGE” project proposals are subject to an extended technical review, e.g. regarding the scalability of the code.

As to scientific metrics: “MIDDLE” projects will be reviewed by two TU Darmstadt experts. For “LARGE” projects, one of the two experts needs to be an external reviewer (from outside TU Darmstadt).

Based on these reviews, the steering committee allocates the resources.

The maximum grant period for any given project is one year, regardless of the S, M or L classes.

If you know that your project needs less than a year, we suggest writing your proposal accordingly. As the computational resources are allotted evenly over the granted time period, shorter projects get a greater resource share per month, resulting in a higher priority per job.

In well-reasoned cases, a project can be extended for one or two months beyond one year.

If your research project will take much longer than a year to complete, you will need to apply for follow-up projects every year.

Nonetheless, in the initial proposal's abstract and description (free wording), we suggest sketching the whole scientific endeavour and the total time it is likely to need.

Just as you reference your coworkers' substantial contributions in your research publications, you should also reference the computational time grants from the Lichtenberg cluster. Properly communicating them improves public understanding of how research funds for HPC are spent and how we are working to support your research.

We thus kindly ask for an acknowledgment in all your publications that arose out of, or used, calculations on the Lichtenberg:

Calculations for this research were conducted on the Lichtenberg high performance computer of the TU Darmstadt.

If you have been supported by the HKHLR, you could add:

For this research, extensive calculations have been conducted on the Lichtenberg high performance computer of the Technische Universität Darmstadt. The authors would like to thank the Hessian Competence Center for High Performance Computing--funded by the Hessen State Ministry of Higher Education, Research and The Arts--for helpful advice.

For all TU Biblio publications, the category „Hochleistungsrechner“ (as a subcategory of „Hochschulrechenzentrum“) has been added to the „Divisions“ list in TU Biblio.

Please use this category for your research publications related to the Lichtenberg Cluster, then your publication will be listed here accordingly.

Frequently Asked Questions – batch scheduling system

Preparing Jobs

The batch scheduler needs to know some minimal properties of a job to decide which nodes it should be started on.

If, for example, you did not specify --mem-per-cpu=, a task requiring very large main memory might be scheduled to a node with too little RAM and would thus crash.

To put it another way: with the resource requirements of all user jobs, the scheduler needs to play a kind of “multidimensional Tetris”. At least along the dimensions runtime, memory size and number of CPU cores, the scheduler places your jobs as efficiently and as gap-free as possible into the cluster. (In the background, many more parameters are used.)

These three properties of a job are thus the bare minimum to give the scheduler something to schedule with.

Before submitting jobs, you need to determine how many CPU cores your job can use best, how much main memory your scientific program will need, and how long the calculation will take.

If your scientific program is already used in your group for problems like yours, you can ask your colleagues about their lessons learned.

If you start afresh with a new scientific program package or a new class of problems: prepare a comparatively small test case (no more than 30 minutes runtime), and run it on one of the login nodes (with the desired number of cores) under the control of the UNIX “time” command as follows:

/bin/time --format='MaxMem: %Mkb, WCT: %E' myProgram <testcase>

After the run, you get for example

  • MaxMem: 942080kb, WCT: 1:16.00

on your STDERR channel.

After dividing “MaxMem” by 1024 (to get MBytes), you can determine your #SBATCH --mem-per-cpu= for that test case as

MaxMem in MByte
----------------- (plus a safety margin)
# of cores used

Your #SBATCH -t d-hh:mm:ss is then the “WCT” from above (plus a safety margin).

In our example and if you have used 4 cores:

 942080 / 1024 / 4 =
--mem-per-cpu=230

When you have run your test case with 2, 4, 8 and 16 CPU cores, you can roughly estimate the scalability of your problem and size your real job runs accordingly.
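Putting the measured values together, the header of a job script could look like the following sketch (the numbers are those of the example above; “myProgram” is a placeholder for your real program call):

```shell
#!/bin/bash
#SBATCH -n 4                    # number of cores, as in the timing run above
#SBATCH --mem-per-cpu=300       # 230 MB per core, plus a safety margin
#SBATCH -t 0-00:10:00           # measured WCT of 1:16, plus a safety margin

# myProgram <testcase>          # placeholder for the real program call

# the --mem-per-cpu value derives from the measured MaxMem:
echo "$(( 942080 / 1024 / 4 )) MB per core (before safety margin)"
```

Running the arithmetic confirms the 230 MB per core from the example calculation.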

In a short hierarchy: The HPC cluster consists of

  • compute nodes
    single, independent computers like your PC/Laptop (just more hardware and performance)
    A node consists of
    • two or more CPUs (central processing units, or processors), placed in a socket
      CPUs are the “program executing” part of a node.
      A CPU consists of
      • several cores, which can be understood as distinct execution units inside a CPU.
        The more cores, the more independent processes or threads can be executed concurrently.
        Each core can be used either by
        • a process = task (MPI)
          or
        • a thread (“multi-threading”), e.g. POSIX threads or most commonly OpenMP

A pure MPI application would start as many distinct processes=tasks as there are cores configured for it. All processes/tasks communicate with each other by means of MPI.

Such applications can use one node, or can be distributed over several nodes, the MPI communication then being routed via InfiniBand.

A pure multi-threaded application starts one single process, and from that, it can use several or all cores of a node with separate, (almost) independent execution threads. Each thread will optimally be allocated to one core.

Most recent programs use OpenMP (see $OMP_NUM_THREADS in the documentation of your application).

Such applications cannot be distributed across nodes, but could make use of all cores on a given node.

Hybrid applications mix both parallelization models, e.g. by running as many processes (tasks) as there are nodes available, and spawning as many threads as there are cores on each node.
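As a sketch, the Slurm directives for such a hybrid job could look like this (node and core counts are hypothetical, and “myHybridProgram” is a placeholder; check your application's documentation for its actual thread control):

```shell
#!/bin/bash
#SBATCH --nodes=2                 # one MPI process (task) per node...
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24        # ...each spawning 24 OpenMP threads

# tell the OpenMP runtime how many threads to use per task
# (outside Slurm, SLURM_CPUS_PER_TASK is unset, so fall back to 24 here)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-24}
echo "threads per task: $OMP_NUM_THREADS"

# srun ./myHybridProgram         # hypothetical binary; srun starts the tasks
```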


Important in this context:

For historical reasons dating from the pre-multicore era, SLURM has parameters referring to CPUs (e.g. --mem-per-cpu=).

Today, this means cores instead of CPUs! Even if that is confusing, the rule simply is to calculate “--mem-per-cpu” as if it were named “--mem-per-core”.

For running a lot of similar jobs, we strongly discourage fiddling with shell script loops around sbatch / squeue. For any number of jobs beyond 30 to 50, use Slurm's Job Array feature instead.

Using job arrays not only relieves the Slurm scheduler from unnecessary overhead, but also allows you to submit many more ArrayTasks than distinct jobs!

Example use cases are:

  • the same program, the same parameters, but lots of different input files
  • the same program, the same input file, but lots of different parameter sets
  • a serial program (unable to utilize multiple cores [multi-threading] or even several nodes [MPI]), but a lot of input files to analyze, and none of the runs depends on results of any other, i.e. High-Throughput Computing

Rename the “numerous” parts of your job input with consecutive numbers, e.g. image1.png, image2.png or paramFile1.conf, paramFile2.conf etc.

Let's say you have 3124 sets, then set up a job script with

#!/bin/bash
#SBATCH -a 1-3124
myProgram image$SLURM_ARRAY_TASK_ID.png > image$SLURM_ARRAY_TASK_ID.png.out

and submit it via sbatch. Slurm will now start one job with 3124 ArrayTasks, each one reading its own input image and writing to its own output file.

If you need to limit the number of parallel running ArrayTasks, use

#SBATCH -a 1-3124%10

Slurm will then run at most 10 tasks concurrently.

Further details can be found in 'man sbatch' under “--array=”.
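Outside Slurm, you can dry-run what a single ArrayTask would do by setting the variable yourself (purely illustrative; under Slurm the variable is set automatically for each task):

```shell
# simulate ArrayTask no. 7 on a login node (Slurm normally sets this variable)
SLURM_ARRAY_TASK_ID=7
echo "would run: myProgram image${SLURM_ARRAY_TASK_ID}.png > image${SLURM_ARRAY_TASK_ID}.png.out"
```

This is a quick way to verify that your file naming scheme expands as intended before submitting thousands of tasks.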

Running Jobs

Check whether all directories mentioned in your job script are in fact there and writable for you.

In particular, the directory specified with

#SBATCH -e /path/to/error/directory/%j.err

for the STDERR of your jobs needs to exist beforehand and must be writable for you.

SLURM ends the job immediately if it is unable to write e.g. the error file (due to a missing target directory).

This is a “chicken and egg” problem: a construct inside the job script like

#SBATCH -e /path/to/error/directory/%j.err
mkdir -p /path/to/error/directory/

cannot work either: for Slurm, the “mkdir” command is already part of the job, so any of “mkdir”'s potential output (STDOUT or STDERR) would have to be written to a directory which does not yet exist at the start of the job.
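The safe order is therefore to create the directory on the login node first, and only then to submit (the directory name “joblogs” and the path in the comment are just examples):

```shell
# On the login node, before submitting:
mkdir -p "$HOME/joblogs"          # the directory "#SBATCH -e …" will point to

# sbatch myJobScript              # submit only once the directory exists;
#                                 # the script then contains a line like:
#                                 #   SBATCH -e /home/<tu-id>/joblogs/%j.err

[ -d "$HOME/joblogs" ] && echo "log directory ready"
```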

Make sure the relevant modules are loaded in your job script.

While you can load those modules right when logging in on the login node (they are inherited by “sbatch myJobScript”), this is in fact not reliable: it renders your jobs dependent on whatever modules you happened to have loaded in your login session.

We thus recommend beginning each job script with

module purge
module load <each and every relevant module>
myScientificProgram …

to have exactly those modules loaded which are needed, and not more.

This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.

This is usually caused by nested calls to either srun or mpirun within the same job. The second or “inner” instance of srun/mpirun tries to allocate the same resources as the “outer” one already did, and thus cannot complete.

If you have

srun /path/to/myScientificProgram

in your job script, check whether “/path/to/myScientificProgram” is in fact an MPI-capable binary. If so, the above syntax is correct.

But if myScientificProgram turns out to be a script, calling srun or mpirun by itself, then remove the srun in front of myScientificProgram and run it directly.

Example of such error:

srun: Job XXX step creation temporarily disabled, retrying
srun: error: Unable to create step for job XXX: Job/step already completing or completed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP XXX.0 ON hpb0560 CANCELLED AT 2020-01-08T14:53:33 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB XXX ON hpb0560 CANCELLED AT 2020-01-08T14:53:33 DUE TO TIME LIMIT ***

There is no magic by which Slurm could know the really important part of your job script. The only way for Slurm to detect success or failure is the exit code of your job script, not the real success or failure of any program or command within it.

The exit code of well-written programs is zero in case everything went well, and >0 if an error has occurred.
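You can inspect a command's exit code in the shell via the special variable $?:

```shell
true                   # a command that succeeds
echo "exit code: $?"   # prints: exit code: 0
false                  # a command that fails
echo "exit code: $?"   # prints: exit code: 1
```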

Imagine the following job script:

#!/bin/bash
#SBATCH …
myScientificProgram …

Here, the last command executed is in fact your scientific program, so the whole job script exits with the exit code of “myScientificProgram” as desired. Thus, Slurm will assign COMPLETED if “myScientificProgram” has had an exit code of 0, and will assign FAILED if not.

If you issue just one simple command after “myScientificProgram”, this will overwrite the exit code of “myScientificProgram” with its own:

#!/bin/bash
#SBATCH …
myScientificProgram …
cp resultfile $HOME/jobresults/

Now, the “cp” command's exit code will be the whole job's exit code, since “cp” is the last command of the job script. If the “cp” command succeeds, Slurm will assign COMPLETED even though “myScientificProgram” might have failed – “cp”s success covers the failure of “myScientificProgram”.

To avoid that, save the exit code of your important program before executing any additional commands:

#!/bin/bash
#SBATCH …
myScientificProgram …
EXITCODE=$?
cp resultfile $HOME/jobresults/
/any/other/job/closure/cleanup/commands …
exit $EXITCODE

Immediately after executing myScientificProgram, its exit code is saved to $EXITCODE, and as a last line, your job script re-sets this exit code (the one of the real payload).
That way, Slurm gets the “real” exit code of “myScientificProgram”, not just that of whatever command happens to be last in your job script, and will set COMPLETED or FAILED appropriately.

Only during runtime of your own job(s).

In general, a direct login to the compute nodes is not possible.

However, during execution of your job(s), you are entitled to log in to the executing compute nodes (from a login node, not from the internet). To find out which nodes are running your jobs, see the NODELIST column of the squeue output.

This allows you to run top or similar utilities, and in general to observe the behaviour of your job(s) at first hand.

In the case of multi-node MPI jobs, the node list from the squeue output needs to be decomposed to get the distinct host names. For example, the node list hpa0[301,307,412-413] expands to

hpa0301
hpa0307
hpa0412
hpa0413

to which you are all entitled to log in with “ssh hpa0…” while these are executing your job. (The command “scontrol show hostnames 'hpa0[301,307,412-413]'” performs this expansion for you.)

If your job ends while you are still logged into one of its compute nodes, you will be logged out automatically, ending up back on the login node.

Pending Jobs

The priority values shown by slurm commands like “squeue” or “sprio” are always to be understood as relative to each other, and in relation to the current demand on the cluster. There is no absolute priority value or priority “threshold”, from which jobs will start to run unconditionally.

During light load (=demand) on cluster resources, a low priority value might be sufficient to get the jobs to run immediately (on free resources). On the other hand, even a very high priority value might not be sufficient, if cluster resources are scarce or completely occupied.

Since most cluster resources are dedicated to the default job runtime of 24 hours, you should always factor in a minimum pending time of one day…

With the command “squeue --start”, you can ask the scheduler for an estimate of when it deems your pending jobs startable.

Do not worry if you get back “N/A” for quite a while, as that is to be expected. Since the scheduler does not touch every job in every scheduling cycle, it may take some time to arrive even at this “educated guess” for your pending jobs.

In general, your jobs' time spent in PENDING depends not only on your jobs' priority value, but mainly on the total usage of the whole cluster. Hence, there is no 1:1 relationship between your jobs' priority and their prospective PENDING period.

On the Lichtenberg HPC, the scheduler dispatches the jobs in the so-called “Fair Share” mode: the more computing power you use (especially in excess of your monthly project budget), the lower will be your next jobs' priority.

However, this priority degradation has a half-life of roughly a fortnight, so your priority will recover over time.

Your best bet is thus to use your computing budget evenly over the project's total runtime (see 'csreport'). This renders your priority degradation to be quite moderate.

For a planned downtime, we tell the batch scheduler in advance when to end job execution. Based on your job's runtime statement (#SBATCH -t d-hh:mm:ss in the job script), the scheduler decides whether a given job will safely be finished before the downtime, and will start it.

Pending jobs not fitting in the time frame until the downtime will not be started, and simply remain pending.

All pending jobs in all queues will survive (planned) downtimes or outages, and will recommence being scheduled as usual, according to their priorities.

Miscellaneous

Similar to our compute nodes, the login nodes are not installed the usual way on hard disks. Instead, on each reboot (thus also after downtimes), they fetch an OS image from the network and extract the OS “root” image into their main memory.

That ensures these nodes are in a clean, defined (and tested) condition after each reboot.

Since “cron” and “at” entries are stored in the system area that is part of that OS image, these entries would not be permanent and are thus unreliable.

To prevent knowledgeable users from creating “cron” or “at” jobs nonetheless (and inherently trusting their function for e.g. backup purposes), we have switched off “cron” and “at”.

Since June 2020, we have established a new password synchronisation: as soon as you change your TUID password in the IDM system of the TU Darmstadt, the login password to the Lichtenberg HPC will be in sync with it.

For new users, this is in effect from the get-go, i.e. their first login to the Lichtenberg HPC will work with their central TUID password.

Existing users will keep their current password (the one last set on the “HLR” tab of the former “ANDO” portal).
As soon as they change their TUID password on the IDM portal, it will overwrite the last HLR password.

The same holds true for guest TUIDs.

Directory permissions with setGID and sticky bits

In these directories, quotas are managed using the UNIX groups da_p<ProjID> or da_<Institutskürzel> (in the following symbolized as da_XXX).

Files (and directories) not belonging to the pertaining group will be accounted against the creating user's personal quota. As persons/TUIDs intentionally have only a small quota on those group/shared folders, such mis-assigned files will cause the dreaded “quota exceeded” errors.

Directories and files somewhere below /work/projects/ and /work/groups/:

  • need to have the right group membership of da_XXX (and may not belong to your TUID group)
  • directories need to have permissions as follows: drwxrws---
    The setGID bit on group level (the “s”) makes new files automatically inherit the group of the parent directory (not the primary group of the creating user)

Wrong: drwx------ 35 tuid tuid 8192 Jun 17 23:19 /work/groups/…/myDir

Right: drwxrws--- 35 tuid da_XXX 8192 Jun 17 23:19 /work/groups/…/myDir

Solution:

Change into the parent directory of the problematic one, and check its permissions as described above, using

ls -ld myDir

In case these are not correct and you are the owner:

chgrp -R da_<XXX> myDir

chmod 3770 myDir

In case you are not the owner, ask the owner to execute the above commands.

From time to time, we will revise and edit this web page.

Please send us your question via email, and if question & answer are of general interest, we will amend this FAQ accordingly.

Frequently Asked Questions – “Software”

Installation

You can list all installed programs and versions with the command module avail. The module command loads and unloads paths and environment variables, and also displays the available software. A detailed description is available here.

Our module system is built as a hierarchical tree with respect to compiler(s) and MPI version(s). Many (open source) software packages thus do not appear at first, and become available only after you load a (suitable) compiler (and a (suitable) MPI module, respectively).

In case you didn't load one or the other yet, many packages don't show up in the output of “module avail”.

If a required software seems to be missing, please try

module spider <mySW> or
module spider | grep -i mySW or
module --show-hidden avail <mySW>

before installing it yourself or opening a ticket.

Please send us an email. If the requested program is of interest to several users or groups, we will install it in our module tree to make it available for all. Otherwise we will support you (to a certain extent) in attempting a local installation, e.g. in your /home/ folder.

Licenses

No, you first have to check whether the software requires a license. If it does, you have to prove that you have the right to use it (in sufficient amount), for example because your institute/department contributes to the yearly costs of a TU license, or has purchased its own.

Please read also the comments to licenses in this list.

In general: not everything that is possible is also allowed.

It depends. In general, our modules for commercial software fetch their licenses from license servers of the TU Darmstadt. These licenses are dedicated exclusively to members of TU Darmstadt contributing to the license costs. Please send us an email if you have license questions. We can support you in configuring your software to fetch license tokens from e.g. your institute's license servers.

In general, not everything that is possible is also allowed.

Runtime Issues

Remove all unnecessary modules. A job script should always start with

module purge
module load <only modules really required for this job>

to ensure a clean environment.

Remember: whenever you load modules while you are on a login node, any job submitted from this modified environment will inherit these modules' settings!
Therefore, it is strongly recommended to use the above purging/loading statement in all job scripts.

Next, scrutinize your program's runtime libraries (“shared objects”) with

ldd -v /path/to/binary

Your $LD_LIBRARY_PATH might contain an unwanted directory, causing your program to load wrong or outdated libraries, which in fact should rather be coming from the modules you have loaded.
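To review the search order at a glance, you can split the variable into one directory per line (an empty output simply means no extra directories are set):

```shell
# print the library search path one directory per line, in search order
echo "$LD_LIBRARY_PATH" | tr ':' '\n'
```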

Particularly, the infamous “Bus error” can be caused by non-matching arguments or return values between the calling binary and the called library, causing “unaligned” memory access and crashes.

A crashing program usually causes a memory dump of its process to be created (in Linux, a file called core.<PID> in the directory where the program was started).

Unfortunately, some user jobs repeatedly crashed in a loop, causing lots of coredumps being created on our cluster-wide GPFS filesystem. As this adversely affected its performance and availability, we had to switch off the creation of coredumps by default.

However, to let you debug your software, we did not prohibit core dumps entirely, and writing them can be enabled again:

ulimit -Sc unlimited
<my crashing program call>
gdb /path/to/programBinary core.<PID>

If it is in fact the very same binary (and not just the same “program”), compare between the two cases:

  • your job scripts
  • the modules you have loaded before submitting the job:
    module list
    (because these are inherited by the job!)
  • your respective shell environment:
    env | sort > myEnv
  • your respective $LD_LIBRARY_PATH setting and the libraries effectively loaded at runtime:
    ldd /path/to/same/binary
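To compare the environments systematically, you can capture and diff them (the file names are arbitrary examples):

```shell
# capture the shell environment of each (differently behaving) session...
env | sort > run_A.env
env | sort > run_B.env            # ...run this line in the second session

# any differences may explain the diverging behaviour:
diff run_A.env run_B.env && echo "environments identical"
```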

Yes, that's possible, by using the so-called “collection” feature of our module system LMod.

More details can be found in our Tips and Tricks section, and inside “man module”.

Machine Architecture

The Lichtenberg HPC runs only Linux, and thus will not run Windows (or macOS) executables natively.

Ask your scientific software vendor or provider for a native Linux version: if it is in fact a scientific application, there's a very good chance they have one…

Due to the disproportionate administrative effort (and the missing Windows licenses), we are sorry to have to deny all requests like “virtual Windows machines on the cluster”, or to install WINE just like that.

Since the Lichtenberg HPC runs CentOS (a RedHat compatible distribution), the native application packaging format is RPM, not .deb.

Though there are ways to convert .deb packages to .rpm, or even to install .deb packages on RPM-based distributions (see the “alien” command's information on the web), we cannot support installing or even converting them. Check with the vendor/supplier to get .rpm packages, or try to compile the program yourself (if the source code is available).

Miscellaneous

From time to time, we will revise and edit this web page.

Please send us your question via email, and if question & answer are of general interest, we will amend this FAQ accordingly.