Next: 9. Maintaining the package
Up: 8. Installation
Previous: 8.3 Building MOLCAS
Installation of MOLCAS for execution in multi-processor environments can be a
bit more involved than the standard installation. This chapter covers those
particulars not covered previously.
The parallelization of MOLCAS uses an internal PGAS framework built upon MPI-2.
The currently supported MPI-2.2 implementations are listed below, each with its -parallel flag:
- MPICH2/MPICH3: -parallel mpich
- MVAPICH2: -parallel mvapich
- OpenMPI: -parallel ompi
- Intel MPI: -parallel impi
When one wants to use an external GA library, it has to be configured and
compiled separately. In that case, please read the section on using an
external GA installation to configure and install GA properly.
Use the ./configure -setup command to see suggestions for
recommended flags for a parallel installation.
IMPORTANT: not all modules support distribution of work and/or
resources through parallel execution, and even if they do, some
functionality may be limited to serial performance. The following core modules
can benefit from parallel execution: gateway, seward, scf, rasscf,
caspt2. More detailed information regarding parallel behaviour can be found in
the documentation of the respective module and in the table at the beginning of
the manual about supported parallelism. If no information is available, you
should conclude that there is nothing to be gained from parallel execution.
The caspt2 module still relies on specific features present in the "Global
Arrays" (GA) toolkit, developed by Jarek Nieplocha and coworkers at the
Pacific Northwest National Laboratory (http://hpc.pnl.gov/globalarrays).
If you need to use CASPT2 in parallel to be able to perform very demanding
single-point energy calculations, then you need to use the GA library. For
more information, see the section on using an external GA installation. If you
use caspt2 only for numerical gradients, you don't need the GA library.
In the simplest case, the parallel version of MOLCAS may be installed
simply by specifying an appropriate message-passing system as an argument
to configure. For example:
./configure -parallel ompi
When the locations of the MPI lib and include directories are set
incorrectly, you can specify them by setting their common root directory with
the par_root flag or, if they are in different directories, by using the
separate par_inc and par_lib flags:
./configure -parallel ompi -par_root /usr/lib/openmpi
./configure -parallel ompi -par_inc /usr/lib/openmpi/include -par_lib /usr/lib/openmpi/lib
Parallel execution of MOLCAS is then achieved by exporting the environment
variable MOLCAS_CPUS, set to the number of processes (for example 4 when
running on 4 nodes), and continuing as usual.
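As a minimal sketch of this step, assuming a POSIX-compatible shell and taking 4 processes purely as an illustrative value:

```shell
# Request 4 parallel processes for subsequent MOLCAS runs (illustrative value).
export MOLCAS_CPUS=4
```

Any later molcas invocation started from the same shell then inherits this setting.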
More likely, some individual tailoring will be required. The following
summarizes the necessary steps:
1. Choose message passing model (candidates are: ompi, mpich, mvapich, impi).
2. Check that the correct wrapper compilers were detected, as specified in $MOLCAS/Symbols.
3. Install (and test) the Global Arrays package (see below).
4. Check the command for executing binaries in parallel, as specified by $RUNBINARY in
5. Install (and test) MOLCAS.
Provided that steps 1-4 can be successfully accomplished, the installation
of MOLCAS itself is unlikely to present many difficulties.
The remainder of this chapter is devoted to a more detailed description of
MOLCAS's parallel setup.
The installation instructions may be found at the Global Arrays home page
Note that any problems with installation or other issues specific to GA are
best resolved by contacting the GA authors directly, rather than the
A typical problem with installation of MOLCAS in parallel is thus related to
the Global Arrays (GA) library. It is therefore a very good idea to run the GA
testing code as a job on the cluster where you want to use Molcas to make sure
that it works properly.
After installing GA, pass the location of this installation to MOLCAS configure:
./configure -parallel ompi -ga /path/to/ga
This is the required way of using GA version 5. When configuring GA version 5,
one has to take care that the correct integer sizes are used. For 64-bit
installations, this means passing the flags --enable-i8 --with-blas8
to the GA configure script. Also make sure that if you are using an external
BLAS library, it uses 8-byte integers. When building GA 5, run make
check before make install to verify your installation.
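The GA build described above might be scripted roughly as follows. This is only a sketch: the helper name build_ga and the variables GA_SRC and GA_PREFIX are hypothetical placeholders (not part of GA or MOLCAS), and --prefix is the usual autoconf installation-directory option rather than something this manual prescribes.

```shell
# Sketch of a GA 5 build with 8-byte integers.
# Nothing is built until build_ga is called explicitly.
# GA_SRC and GA_PREFIX are hypothetical placeholders chosen for this example.
build_ga() (
    # Abort (inside this subshell only) if the placeholders are not set.
    : "${GA_SRC:?set GA_SRC to the unpacked GA source tree}"
    : "${GA_PREFIX:?set GA_PREFIX to the installation directory}"
    cd "$GA_SRC" &&
    ./configure --enable-i8 --with-blas8 --prefix="$GA_PREFIX" &&
    make &&
    make check &&      # verify the build before installing
    make install
)
```

After a successful build, the directory given as GA_PREFIX would be the path passed to the -ga flag of the MOLCAS configure script.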
Most probably, you will use a free MPI-2 implementation such as MPICH2/MPICH3,
MVAPICH2, or Open MPI.
Open MPI: http://www.open-mpi.org/
NOTE: Open MPI versions older than v1.6.5 are not supported. More specifically,
only Open MPI v1.6.5 and v1.8.1 are tested and known to work correctly with MOLCAS.
To use one of these implementations, pass the correct message passing interface
to the -parallel flag of ./configure, i.e. either
mpich for MPICH2/MPICH3, mvapich for MVAPICH2, or
ompi for Open MPI. These implementations come with FORTRAN 77 and C
wrappers for the compiler that was used for building the library (mpif77/mpif90
and mpicc respectively). These are automatically detected by the configure
script and used to build GA and Molcas.
It is a very good idea to verify that the correct compiler environment is
present before configuring MOLCAS. You should therefore check that the
backend compiler of the wrappers is correct by running /path/to/mpif77
-show (MPICH2/MPICH3 and MVAPICH2) or /path/to/mpif77 -showme (Open MPI),
which will list the actual executed command. If the backend compiler seems to
be correct, also try to run it to see if it is properly detected (on some
clusters you will need to load the appropriate module for the compiler). If all
is well, you should be able to configure MOLCAS without any problems.
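The wrapper check described above can be sketched as a small shell helper. The function name show_backend is hypothetical; it simply tries -show first and falls back to -showme, and it assumes the wrapper exits with a non-zero status when it does not understand an option.

```shell
# Print the backend compiler command behind an MPI compiler wrapper.
# Tries -show (MPICH2/MPICH3, MVAPICH2) first, then -showme (Open MPI).
# show_backend is a hypothetical helper, not part of MOLCAS or any MPI library.
show_backend() {
    wrapper=$1
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "wrapper not found: $wrapper" >&2
        return 1
    fi
    "$wrapper" -show 2>/dev/null || "$wrapper" -showme 2>/dev/null
}

# Example call: mpif77 is looked up on PATH here; use /path/to/mpif77 if needed.
show_backend mpif77 || echo "could not query mpif77"
```

The printed command line should start with the compiler you expect; running that compiler directly then confirms that it is available in the current environment.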
It is highly recommended to build GA and Molcas with the same compiler that
was used for the MPI library, to avoid compatibility issues. However, if you
really want to use a different compiler than the one used for building
the MPI library, you can do so by passing the -fc and -cc
command line arguments (MPICH2/MPICH3 and MVAPICH2) to the wrappers, or by setting the
environment variables OMPI_F77/OMPI_F90 and OMPI_CC (Open MPI). In this
case, you should change the F77/F90 and CC variables in the Symbols file to
include these flags.
A few comments on running on a cluster:
Very old MPICH versions sometimes need a file with a list of the nodes the job at hand is allowed
to use. By default the file is static and located in the MPICH installation
tree. This will not work on a workstation cluster, though, because then all
jobs would use the same nodes.
Instead, the queue system sets up a temporary file which contains a list of the
nodes to be used for the current task. You have to make sure that this filename
is transferred to $mpirun. This is done with the '-machinefile' flag. On a
Beowulf cluster using PBS as queue system, the $RUNBINARY variable in
$MOLCAS/molcas.rte should look something like:
RUNBINARY='/path/to/mpirun -machinefile $PBS_NODEFILE -np $MOLCAS_CPUS
The newer MPICH2/MPICH3 and MVAPICH2, which work through the HYDRA daemons and do not need
this command line argument, as well as Open MPI, most likely only need the -np
$MOLCAS_CPUS command line option. They use mpiexec instead of mpirun.
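The two launcher variants described above can be contrasted with a short sketch. The helper launcher_prefix is hypothetical and only assembles the launcher part of the command line; the fallback of 1 process when MOLCAS_CPUS is unset is an assumption made for this example, not MOLCAS behaviour.

```shell
# Assemble the launcher part of a parallel command line.
# launcher_prefix is a hypothetical helper, not part of MOLCAS.
#   $1: launcher, e.g. /path/to/mpirun or /path/to/mpiexec
#   $2: "yes" to pass the PBS node list via -machinefile (very old MPICH)
launcher_prefix() {
    launcher=$1
    use_machinefile=$2
    if [ "$use_machinefile" = "yes" ] && [ -n "${PBS_NODEFILE:-}" ]; then
        printf '%s -machinefile %s -np %s\n' \
            "$launcher" "$PBS_NODEFILE" "${MOLCAS_CPUS:-1}"
    else
        # HYDRA-based MPICH/MVAPICH2 and Open MPI: -np alone is usually enough.
        printf '%s -np %s\n' "$launcher" "${MOLCAS_CPUS:-1}"
    fi
}
```

Inside a PBS job, calling launcher_prefix /path/to/mpirun yes prints the first form, while launcher_prefix /path/to/mpiexec no prints the shorter one.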
Several commercial MPI implementations exist, such as HP-MPI, IBM's MPI-F, Intel
MPI, and SGI's MPT. Those that are supported are listed below. For the others that
are not (yet) supported, it is recommended to configure Molcas without parallel
options and to change the Symbols file after the serial configuration by altering
the F77/F90 and CC variables to point to the wrappers.
Please refer to the documentation of your MPI implementation for details on how
to build programs, i.e. which wrappers to use and, if necessary, which libraries
you need to link in.
Supported -parallel flags for commercial MPI implementations:
In this section, we assume you will be using PBS on a cluster in order to
submit jobs. If you don't use PBS, please ask your system administrator or
consult the cluster documentation for equivalent functionality.
#PBS -l walltime=10:00:00
#PBS -l nodes=4
#PBS -l pmem=3000mb
######## Job settings ###########
######## modules ###########
module load intel/11.1
module load openmpi/1.4.1/intel/11.1
######## molcas settings ###########
######## run ###########
molcas $Project.input -f
The maximum available memory is set using the PBS option pmem. Typically,
MOLCASMEM will then be set to around 75% of the available physical
memory. So for a parallel run, just divide the total physical memory by the
number of processes you will use and take a bit less. For example, for a system
with 2 sockets per node and 64 GB of memory, running 1 process per socket, we
would set pmem to 30000 MB.
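The arithmetic behind this example can be written out as a small sketch. The 2000 MB safety margin is a hypothetical choice that reproduces the 30000 MB figure, 64 GB is taken as 64000 MB for round numbers, and applying the 75% rule to the per-process allowance is one reading of the text above.

```shell
# Per-process memory estimate for a PBS job (illustrative numbers from the text).
node_mem_mb=64000      # 64 GB node, taken as 64000 MB for round figures
procs_per_node=2       # one process per socket, two sockets per node
margin_mb=2000         # "a bit less": hypothetical safety margin

# Memory allowance per process, to be requested through pmem.
pmem_mb=$(( node_mem_mb / procs_per_node - margin_mb ))

# Around 75% of that allowance as a candidate MOLCASMEM value (assumed to be in MB).
molcasmem_mb=$(( pmem_mb * 75 / 100 ))

echo "pmem=${pmem_mb}mb MOLCASMEM=${molcasmem_mb}"
```

With these inputs the script prints pmem=30000mb MOLCASMEM=22500.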
The important thing to consider for I/O is to have enough scratch space
available and enough bandwidth to the scratch space. If local disk is large
enough, this is usually preferred over network-attached storage. MOLCAS requires the absolute pathname of the scratch directory to be the same across
Process pinning is sometimes required to achieve maximum performance. For CASPT2
for example, processes need to be pinned to their socket or NUMA domain.
The pinning configuration can usually be given as an option to the MPI runtime.
With Intel MPI for example, one would set the I_MPI_PIN_DOMAIN
variable to socket. Alternatively, you can use a third-party program
to intervene on your behalf, e.g. https://code.google.com/p/likwid/.
Please ask your system administrator how to correctly pin your processes.
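For the Intel MPI case mentioned above, a minimal sketch of the setting in a job script could look like this (assuming a POSIX-compatible shell; whether socket-level pinning is appropriate depends on your cluster):

```shell
# Pin each MPI process to one socket when running under Intel MPI.
export I_MPI_PIN_DOMAIN=socket
```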
When using GA, several problems can occur when trying to run jobs with a large
amount of memory per process. A few example error messages are given here with
their proposed solutions.
(rank:0 hostname:node1011 pid:65317):ARMCI DASSERT fail.
The error output in the Molcas errfile (stderr) then says:
Last System Error Message from Task 2:: Cannot allocate memory
Related messages that display a problem with armci_server_register_region
instead of armci_pin_contig_hndl can also occur, and point to similar problems.
This can have two causes:
- Some parameters of the Mellanox mlx4_core kernel module were
set too low, i.e., log_num_mtt and log_mtts_per_seg.
These should be set according to the instructions on
http://community.mellanox.com/docs/DOC-1120. Values of 25 and 0
respectively, or 24 and 1 should be fine.
- The 'max locked memory' process limit was set too low. You can check
this value by running ulimit -a or ulimit -l. Make
sure you check this through an actual job! Easiest is to start an
interactive job and then execute the command. The value should be set
to unlimited, or at least to the amount of physical memory available.
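The locked-memory check can be placed at the top of a job script, so that it reports the limit seen by the job itself rather than by a login shell. This is a sketch only; the wording of the messages is arbitrary, and the unit shown assumes a shell that reports this limit in kB, as bash does.

```shell
# Report the 'max locked memory' limit as seen by the current shell.
limit=$(ulimit -l)
if [ "$limit" = "unlimited" ]; then
    echo "max locked memory: unlimited"
else
    echo "max locked memory: ${limit} kB (compare with the physical memory)"
fi
```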
0: error ival=4 (rank:0 hostname:node1011 pid:19142):ARMCI DASSERT fail.
This error is related to the value of the variable
ARMCI_DEFAULT_SHMMAX. Try setting it to at least 2048. If this is
still too low, you should consider patching GA to allow higher values.
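A minimal sketch of this setting in a job script, using the lower bound suggested above:

```shell
# Raise the ARMCI shared-memory limit to the suggested minimum.
export ARMCI_DEFAULT_SHMMAX=2048
```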