MOLCAS manual:



8.4 Installing and running in parallel environments

Installing MOLCAS for execution in multi-processor environments can be somewhat more involved than the standard installation. This chapter covers the particulars not addressed previously.

The parallelization of MOLCAS uses an internal PGAS framework built upon MPI-2.

The currently supported MPI-2.2 implementations, with the corresponding configure flag, are listed below:

  • MPICH2/MPICH3: -parallel mpich
  • MVAPICH2: -parallel mvapich
  • OpenMPI: -parallel ompi
  • Intel MPI: -parallel impi

If you want to use an external GA library, it has to be configured and compiled separately. In that case, please read the section on using an external GA installation to configure and install GA properly.

Use the ./configure -setup command to see the recommended flags for a parallel installation.

IMPORTANT: not all modules support distribution of work and/or resources through parallel execution, and even those that do may have some functionality limited to serial performance. The core modules that can benefit from parallel execution are: gateway, seward, scf, rasscf, caspt2. More detailed information on parallel behaviour can be found in the documentation of the respective module and in the table on supported parallelism at the beginning of the manual. If no information is available, you should conclude that there is nothing to be gained from parallel execution.

The caspt2 module still relies on specific features present in the "Global Arrays" (GA) toolkit, developed by Jarek Nieplocha and coworkers at the Pacific Northwest National Laboratory. If you need to run CASPT2 in parallel to perform very demanding single-point energy calculations, then you need to use the GA library. For more information, see the section on using an external GA installation. If you use caspt2 only for numerical gradients, you don't need the GA library.

8.4.1 General overview of the procedure

In the simplest case, the parallel version of MOLCAS may be installed simply by specifying an appropriate message-passing system as an argument to configure. For example:

./configure -parallel ompi

If the locations of the MPI lib and include directories are detected incorrectly, you can specify them by setting their common root directory with the par_root flag or, if they are in different directories, with the separate par_inc and par_lib flags:

./configure -parallel ompi -par_root /usr/lib/openmpi
./configure -parallel ompi -par_inc /usr/lib/openmpi/include -par_lib /usr/lib/openmpi/lib

Parallel execution of MOLCAS is then achieved by exporting the environment variable MOLCAS_CPUS. For example, to run with 4 processes use:

export MOLCAS_CPUS=4

and continuing as usual.

More likely, some individual tailoring will be required. The following summarizes the necessary steps:

  1. Choose message passing model (candidates are: ompi, mpich, mvapich, impi).
  2. Check that the correct wrapper compilers were detected, as specified in $MOLCAS/Symbols.
  3. Install (and test) the Global Arrays package (see below).
  4. Check the command for executing binaries in parallel, as specified by $RUNBINARY in $MOLCAS/molcas.rte.
  5. Install (and test) MOLCAS.
Provided that steps 1-4 can be successfully accomplished, the installation of MOLCAS itself is unlikely to present many difficulties.
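Steps 2 and 4 amount to inspecting two files. The following is a minimal sketch, assuming both files store their settings as NAME=value lines (as the RUNBINARY example later in this chapter suggests; the exact layout of Symbols may differ). The helper name show_setting is hypothetical and not part of MOLCAS.

```shell
#!/bin/sh
# show_setting FILE NAME...: print the lines of FILE that assign NAME.
# Assumes settings are stored as NAME=value, one per line.
show_setting() {
    file=$1
    shift
    for name in "$@"; do
        grep -E "^[[:space:]]*(export[[:space:]]+)?${name}[[:space:]]*=" "$file" ||
            echo "$name: not set in $file" >&2
    done
}

# Typical use after running configure:
#   show_setting "$MOLCAS/Symbols" F77 F90 CC
#   show_setting "$MOLCAS/molcas.rte" RUNBINARY
```

If a wrapper or the run command is not what you expect, correct it before building.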

The remainder of this chapter is devoted to a more detailed description of MOLCAS's parallel setup.

8.4.2 Using an external Global Arrays installation

The installation instructions may be found at the Global Arrays home page. Note that any problems with installation or other issues specific to GA are best resolved by contacting the GA authors directly, rather than the MOLCAS group.

Problems with a parallel installation of MOLCAS are typically related to the Global Arrays (GA) library. It is therefore a very good idea to run the GA testing code as a job on the cluster where you want to use MOLCAS, to make sure that it works properly.

After installing GA, pass the location of this installation to MOLCAS configure:

./configure -parallel ompi -ga /path/to/ga

This is the required way of using GA version 5. When configuring GA version 5, take care that the correct integer sizes are used. For 64-bit installations, this means passing the flags --enable-i8 --with-blas8 to the GA configure script. Also make sure that, if you are using an external BLAS library, it uses 8-byte integers. When building GA 5, run make check before make install to verify your installation.
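The GA build described above can be sketched as follows, to be run from the top of the GA 5 source tree. GA's configure is an autoconf script, so its options are written with two leading dashes here. GA_PREFIX, the run helper and the DRY_RUN switch are assumptions made for illustration; any further options (compilers, network) depend on your system and are omitted.

```shell
#!/bin/sh
# Sketch: build GA 5 with 8-byte integers, testing before installing.
# GA_PREFIX is a hypothetical install location; adjust it to your system.
GA_PREFIX="${GA_PREFIX:-$HOME/opt/ga}"

# Print each command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

build_ga() {
    run ./configure --enable-i8 --with-blas8 --prefix="$GA_PREFIX" &&
    run make &&
    run make check &&
    run make install
}

# Preview the commands first:  DRY_RUN=1 build_ga
# Then build for real:         build_ga
```

The resulting GA_PREFIX directory is what would then be passed to the MOLCAS configure script with -ga.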

8.4.3 Free MPI implementations

Most probably, you will use a free MPI-2 implementation such as MPICH2/MPICH3, MVAPICH2, or Open MPI.

Open MPI:

NOTE: Open MPI versions older than v1.6.5 are not supported. More specifically, only Open MPI v1.6.5 and v1.8.1 are tested and known to work correctly with MOLCAS.

To use one of these implementations, pass the corresponding name to the -parallel flag of ./configure, i.e. mpich for MPICH2/MPICH3, mvapich for MVAPICH2, or ompi for Open MPI. These implementations come with Fortran and C wrappers for the compiler that was used to build the library (mpif77/mpif90 and mpicc respectively). These are automatically detected by the configure script and used to build GA and MOLCAS.

It is a very good idea to verify that the correct compiler environment is present before configuring MOLCAS. You should therefore check that the backend compiler of the wrappers is correct by running /path/to/mpif77 -show (MPICH2/MPICH3 and MVAPICH2) or /path/to/mpif77 -showme (Open MPI), which will list the command that is actually executed. If the backend compiler seems to be correct, also try to run it to see if it is properly detected (on some clusters you will need to load the appropriate module for the compiler). If all is well, you should be able to configure MOLCAS without any problems.
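The wrapper check can be scripted. This is a minimal sketch covering only the implementations named in the text; the function names are hypothetical, and calling the backend with --version is an assumption that holds for many compilers but not all (some use -V instead).

```shell
#!/bin/sh
# wrapper_show_flag FLAVOR: print the option that makes the wrapper reveal
# its backend command. FLAVOR is the value passed to -parallel.
wrapper_show_flag() {
    case $1 in
        mpich|mvapich) echo "-show" ;;
        ompi)          echo "-showme" ;;
        *) echo "unknown MPI flavor: $1" >&2; return 1 ;;
    esac
}

# check_wrapper WRAPPER FLAVOR: show the backend command line, then ask the
# backend compiler (the first word of that line) for its version.
check_wrapper() {
    flag=$(wrapper_show_flag "$2") || return 1
    cmdline=$("$1" "$flag") || return 1
    echo "backend command: $cmdline"
    backend=${cmdline%% *}
    "$backend" --version
}

# Example: check_wrapper /path/to/mpif77 ompi
```

If the second step fails with "command not found", the compiler module is probably not loaded in the current environment.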

It is highly recommended to build GA and MOLCAS with the same compiler that was used for the MPI library, to avoid compatibility issues. However, if you really want to use a different compiler, you can do so by passing the -fc and -cc command line arguments to the wrappers (MPICH2/MPICH3 and MVAPICH2), or by setting the environment variables OMPI_F77/OMPI_F90 and OMPI_CC (Open MPI). In this case, you should change the F77/F90 and CC variables in the Symbols file to include these flags.

A few comments on running on a cluster:

Very old MPICH versions sometimes need a file listing the nodes that the job at hand is allowed to use. By default this file is static and located in the MPICH installation tree. This will not work on a workstation cluster, though, because then all jobs would use the same nodes.

Instead, the queue system sets up a temporary file containing a list of the nodes to be used for the current task. You have to make sure that this filename is passed to mpirun, which is done with the -machinefile flag. On a Beowulf cluster using PBS as the queue system, the $RUNBINARY variable in $MOLCAS/molcas.rte should look something like:

RUNBINARY='/path/to/mpirun -machinefile $PBS_NODEFILE -np $MOLCAS_CPUS $program'

The newer MPICH2/MPICH3 and MVAPICH2, which work through the HYDRA daemons, do not need this command line argument. They, as well as Open MPI, most likely only need the -np $MOLCAS_CPUS command line option, and use mpiexec instead of mpirun.
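For these newer implementations, the corresponding setting in $MOLCAS/molcas.rte might look as follows. This is a sketch derived from the PBS example above with the -machinefile option dropped; /path/to/mpiexec is a placeholder for the launcher of your MPI installation.

```shell
# $MOLCAS/molcas.rte (fragment)
# Single quotes store $MOLCAS_CPUS and $program literally, as in the PBS example.
RUNBINARY='/path/to/mpiexec -np $MOLCAS_CPUS $program'
```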

8.4.4 Commercial MPI implementations

Several commercial MPI implementations exist, such as HP-MPI, IBM's MPI-F, Intel MPI, and SGI's MPT. Those that are supported are listed below. For the others, which are not (yet) supported, it is recommended to configure MOLCAS without parallel options and then edit the Symbols file, changing the F77/F90 and CC variables to point to the wrappers.

Please refer to the documentation of your MPI implementation for details on how to build programs, i.e. which wrappers to use and if necessary what libraries you need to link in.

Supported -parallel flags for commercial MPI implementations:

  • Intel MPI: impi

8.4.5 Running MOLCAS in parallel

In this section, we assume you will be using PBS on a cluster to submit jobs. If you don't use PBS, please ask your system administrator or consult the cluster documentation for equivalent functionality.

Example of a submit script:

#PBS -l walltime=10:00:00   
#PBS -l nodes=4   
#PBS -l pmem=3000mb  
######## Job settings ###########   
export MOLCAS_MEM=800 
export SUBMIT=/home/molcasuser/project/test/  
export Project=test000
export MOLCAS_CPUS=4  
######## modules ###########   
. use_modules  
module load intel/11.1  
module load openmpi/1.4.1/intel/11.1   
######## molcas settings ###########   
export MOLCAS=/usr/local/molcas76.par/  
export WorkDir=/disk/local/

######## run ###########     
cd $SUBMIT  
molcas $Project.input -f

Memory

The maximum available memory is set using the PBS option pmem. Typically, MOLCAS_MEM will then be set to around 75% of the available physical memory. For a parallel run, divide the total physical memory by the number of processes you will use and take a bit less. For example, for a system with 2 sockets per node and 64 GB of memory, running 1 process per socket, we would set pmem to 30000 MB.

I/O

The important thing to consider for I/O is to have enough scratch space available and enough bandwidth to the scratch space. If the local disk is large enough, it is usually preferred over network-attached storage. MOLCAS requires the absolute pathname of the scratch directory to be the same across nodes.

Pinning

Process pinning is sometimes required to achieve maximum performance. For CASPT2 for example, processes need to be pinned to their socket or NUMA domain.

The pinning configuration can usually be given as an option to the MPI runtime. With Intel MPI for example, one would set the I_MPI_PIN_DOMAIN variable to socket. Alternatively, you can use a third-party program to intervene on your behalf. Please ask your system administrator how to correctly pin your processes.

GA specific issues

When using GA, several problems can occur when trying to run jobs with a large amount of memory per process. A few example error messages are given here with their proposed solution.

(rank:0 hostname:node1011 pid:65317):ARMCI DASSERT fail.
 cond:(memhdl->memhndl!=((void *)0))

The error output in the Molcas errfile (stderr) then says:

Last System Error Message from Task 2:: Cannot allocate memory

Related messages that display a problem with armci_server_register_region instead of armci_pin_contig_hndl can also occur, and point to similar problems.

This can have two causes:

  • Some parameters of the Mellanox mlx4_core kernel module, namely log_num_mtt and log_mtts_per_seg, were set too low. Values of 25 and 0 respectively, or 24 and 1, should be fine.
  • The 'max locked memory' process limit was set too low. You can check this value by running ulimit -a or ulimit -l. Make sure you check this through an actual job! The easiest way is to start an interactive job and then execute the command. The value should be set to unlimited, or at least to the amount of physical memory available.
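The second check can be placed at the top of a job script. A minimal sketch, assuming ulimit -l reports its value in kilobytes (as bash does); the function name and NODE_MEM_KB are hypothetical, with 64 GB used as an example value.

```shell
#!/bin/sh
# locked_mem_ok LIMIT NEEDED_KB: succeed if LIMIT (as printed by `ulimit -l`)
# is "unlimited" or at least NEEDED_KB.
locked_mem_ok() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

# Inside a job: compare the limit with the node's physical memory.
# NODE_MEM_KB is a hypothetical value (64 GB here); set it for your nodes.
NODE_MEM_KB=${NODE_MEM_KB:-67108864}
if locked_mem_ok "$(ulimit -l)" "$NODE_MEM_KB"; then
    echo "max locked memory is sufficient"
else
    echo "max locked memory too low: $(ulimit -l)"
fi
```

Running this inside the job matters because the limit seen on a login node can differ from the one applied by the queue system.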

0: error ival=4 (rank:0 hostname:node1011 pid:19142):ARMCI DASSERT fail.

This error is related to the value of the variable ARMCI_DEFAULT_SHMMAX. Try setting it to at least 2048. If this is still too low, you should consider patching GA to allow higher values.
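In a job script, the variable would be exported before MOLCAS is started. A sketch using the minimum value suggested above; whether a larger value is accepted depends on your GA build.

```shell
# Raise the ARMCI shared-memory limit before starting MOLCAS.
export ARMCI_DEFAULT_SHMMAX=2048
```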
