Tips and Tricks

Tips

Defining module loads in the job script

To make job submission easier and more fault-tolerant for you, Slurm by default passes on the complete environment (variables) and all loaded modules of the (login) session from which you submit the job.

Thus, for better reproducibility, it is recommended to begin each job script with module purge, followed by only those module load … lines necessary for this particular job. Submitted that way, the job's main program runs with only the required and desired software (versions).

This is especially important if you use, for example, module initadd to load certain modules from your ~/.bashrc (because you need them in every login session anyway).
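A minimal job script sketch following this recommendation (the module names, resource requests and program name below are placeholders, not a prescription):

#!/bin/bash
#SBATCH -J myjob                  # job name (placeholder)
#SBATCH -n 4                      # number of tasks (placeholder)
#SBATCH -t 00:30:00               # run time limit (placeholder)

# Start from a clean environment, then load only what this job needs.
module purge
module load gcc openmpi           # placeholder module names

srun ./my_program                 # placeholder program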

Archive decompression in /work/scratch – attention: automatic file cleanup

Extracting archives (e.g. *.zip, *.tar) often preserves the original modification timestamps of all files. If the modification time of an extracted file is too old, e.g. older than 8 weeks, the freshly extracted files may be deleted by the automatic cleaning policy of the scratch area (which runs daily).

To avoid such cleaning, you can often use an additional tool parameter, e.g. for tar the parameter -m. Alternatively, you can use the touch command afterwards to set an updated modification time.
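Both variants could look like this (the archive and directory names are placeholders):

tar -xf results.tar -m                      # -m: do not restore the archived modification times
find ./results -type f -exec touch {} +     # alternative: update the modification times after extraction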

Attention: starting April 18th 2017, the scratch cleaning cycle is changed from being based on the 'modification time' to being based on the 'creation time' of all files. After this change, an update of the modification time (via additional archive parameters or touch) is no longer needed. In other words: after the change, updating the modification time of a file is pointless and will no longer prevent your file(s) from being deleted.

Missing Slurm support in MPI applications

Many applications have problems using the correct number of cores within the batch system. This may be due to missing Slurm support: such applications generally use their own MPI version and have to be told explicitly the right number of cores and the hostfile.

First you have to generate a current hostfile. The following two lines replace the usual call “mpirun <MPI-Program>”:

srun hostname > hostfile.$SLURM_JOB_ID
mpirun  -n 64  -hostfile hostfile.$SLURM_JOB_ID  <MPI-Program>

The first line (above) generates the hostfile; the second line additionally gives MPI the number of requested cores (here 64) and the name of the hostfile.
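Embedded into a job script, this could look as follows (a sketch; core count and run time are placeholders):

#!/bin/bash
#SBATCH -n 64                     # number of cores (placeholder)
#SBATCH -t 01:00:00               # run time limit (placeholder)

# Generate a hostfile with the nodes Slurm has assigned to this job ...
srun hostname > hostfile.$SLURM_JOB_ID

# ... and pass core count and hostfile explicitly to the application's own mpirun.
mpirun -n 64 -hostfile hostfile.$SLURM_JOB_ID <MPI-Program>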

Migration from LSF to Slurm

A migration guide listing the most important LSF commands and parameters together with their Slurm equivalents is available here.

Important: The choice of the right partition (formerly called queue under LSF) is mostly done automatically (by commands like sbatch or salloc). That does not apply to special cases like the “kurs*” or “extension*” queues under LSF – in Slurm, these are special partitions, reservations or project accounts and need to be requested explicitly in your job scripts.
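Such an explicit request in a job script could look like this (a sketch; the partition, reservation and account names are placeholders to be replaced with those assigned to you):

#SBATCH --partition=<special_partition>     # e.g. a course partition
#SBATCH --reservation=<reservation_name>    # e.g. a course reservation
#SBATCH --account=<project_account>         # your project account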

Setting up password-less ssh communication between compute nodes

Parallel computation between different nodes requires mutual password-less logins. By default, this is not allowed.

You can, however, change this in your own home folder. Run the following commands while logged in to any of our login nodes:

To generate a key (you can accept the default storage location by pressing <ENTER>):

ssh-keygen -P "" -t rsa -C "$LOGNAME@lcluster"

To authorize the generated key for logins: you will be asked for your login password (please enter it), and the ssh configuration will be updated accordingly.

ssh-copy-id -i .ssh/id_rsa.pub localhost

To verify your configuration: you should now be able to log in without a password (e.g. from lcluster2 to lcluster4).

ssh lcluster4

The set-up is finished.

Job details after completion

After your job has finished, the following command reports on the CPU and memory efficiency of the job:

seff <JobID>

Even more details are shown by the following commands:

sacct -l -j <JobID>
tuda-seff <JobID>

Expiry date of your user account

To see the expiry date of your own user account, use the script /shared/bin/account_expire.

Your user account's validity period is independent of the term or validity of any projects you might be associated with.

File transfer to and from the Lichtenberg HPC

Before and after calculations, your input data needs to get onto the Lichtenberg filesystems, and your results need to get off them.

We recommend the following tools:

One-off: scp

As you can log in via ssh to the login nodes, you can also use SSH's scp tool to copy files and directories from or to the Lichtenberg.

Use the login nodes for your scp transfers, as these have high-bandwidth network ports also to the TU campus network (we do not have any other special in/out nodes).

In case of (large) text/ASCII files, you should use the optional compression (-C) built into the SSH protocol, in order to save network bandwidth and possibly speed up your transfers.
Omit compression when copying already compressed data like JPG images or videos in modern container formats (mp4, OGG).

Example:

tuid@hla0003:~ $ scp -Cpr myResultDir mylocalworkstation.inst.tu-darmstadt.de:/path/to/my/local/resultdir

Details: man scp.

Repeatedly: rsync

Cases like “I need my calculations' results also on my local workstation's hard disk, for analysis with graphical tools” or “my local experiment's raw data needs to hop to the Lichtenberg as soon as it is generated” are not well covered by scp. As soon as you have to keep (one of) your Lichtenberg directories “in sync” with one on your institute's (local) infrastructure, running scp more than once would be inefficient, as it is not aware of “changes” and would blindly copy the same files over and over again.

That's where rsync steps in. Like scp, it is a command line tool, transferring files from any (remote) “SRC” to any other (remote) “DEST”ination. In contrast to scp, however, it has a notion of “changes” and can find out whether a file in “SRC” has been changed and needs to be transferred at all. New as well as small files are simply transmitted in full; for large files, however, rsync transfers only their changed blocks (safeguarded by checksums).

In essence: unchanged files are not transferred again; new and changed files are, but for large changed files, only the changed portions (delta) are transferred.

Example:

tuid@hla0003:~ $ rsync -aH myResultDir mylocalworkstation.inst.tu-darmstadt.de:/path/to/my/local/resultdir

Details: man rsync.

Remember: both scp and rsync are “one way” tools only! If a file is changed in “DEST” between transfers, the next transfer will overwrite it with the (older) version from “SRC”.

Not available on the Lichtenberg:

FTP(S), sFTP, rcp and other older, clear-text protocols.