NWChem - computational chemistry on parallel computers


Notes on running NWChem on workstations and clusters

Since these notes are generic to most workstations, it is impossible to be completely specific about the locations of files, etc. You will want to know where the nwchem and parallel executables reside. You may want to add this location to your shell's executable search path.

You may need to be aware of where the code expects to find the standard basis set library. The location is fixed at compile time and should have been set to something appropriate for your site. Without any special configuration, NWChem will look for the standard library in the source directory tree. This means that moving the source tree may confuse an existing executable. For most installations, however, we are configuring the library to live in the same directory as the nwchem and parallel executables. Should you run into problems in which NWChem cannot locate the standard basis library, you can easily work around it by using the "file" option on the basis set entry for each library basis set you need.
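
For example, a fragment like the following in the input deck points each library basis set at an explicit file (the path and basis set name here are purely illustrative; see the User Manual for the precise syntax of the file option):

  basis
    o library 3-21g file /usr/local/nwchem/libraries/3-21g
    h library 3-21g file /usr/local/nwchem/libraries/3-21g
  end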

Since workstations vary widely in how much memory is available, the defaults may not be appropriate to your situation. It is advisable to check the defaults given in the manual and if necessary adjust them in the input deck. Remember that the memory specification is per process, so if you set the limit to 32 MB and run four processes on the machine, you'll use 128 MB in total.
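
For example, to limit each process to 32 MB, a directive like the following near the top of the input deck should do it (a sketch; see the memory directive in the User Manual for the full syntax and the defaults on your platform):

  memory 32 mb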

Single process

Single process execution is easy -- just invoke nwchem with the name of the input file as an argument: "nwchem input.nw".

The "parallel" command for multiprocess execution

The "parallel" command is part of the TCGMSG (message passing) package, which NWChem uses to run jobs in parallel. The following description is largely cribbed from the TCGMSG README file (which can be found in the NWChem source tree). Explanations specific to NWChem follow it.

An auxiliary "process group" (aka PROCGRP) file controls the parallel execution. It is usually named with a ".p" suffix. The PROCGRP file can contain multiple lines, and comments are denoted by a "#" sign. Non-comment lines consist of the following fields, separated by white space:

  userid     The username on the machine that will be executing the
             process.

  hostname   The hostname of the machine to execute this process.
             If it is the same machine on which parallel was invoked,
             the name must match the value returned by the command
             hostname. If it is a remote machine, it must allow remote
             execution from this machine (see the man pages for rlogin,
             rsh).

  nslave     The total number of copies of this process to be executing
             on the specified machine. Only 'clusters' of identical
             processes specified in this fashion can use shared memory
             to communicate. If shared memory is not supported on machine
             <hostname>, then only the value one (1) is valid (e.g. on
             the Cray).

  executable Full path name on the host <hostname> of the image to
             execute. If <hostname> is the local machine then a local
             path will suffice.

  workdir    Full path name on the host <hostname> of the directory to
             work in. Processes execute a chdir() to this directory before
             returning from pbegin(). If specified as a '.' then remote
             processes will use the login directory on that machine and
             local processes (relative to where parallel was invoked) will
             use the current directory of parallel.

e.g.
  harrison boys      3  /home/harrison/c/ipc/testf.x  /tmp      # my sun 4
  harrison dirac     3  /home/harrison/c/ipc/testf.x  /tmp      # ron's sun4
  harrison eyring    8  /usr5/harrison/c/ipc/testf.x  /scratch  # alliant fx/8

The above PROCGRP file would put processes 0-2 on boys (executing testf.x in /tmp), 3-5 on dirac (executing testf.x in /tmp) and 6-13 on eyring (executing testf.x in /scratch). Processes on each machine use shared memory to communicate with each other, sockets otherwise.

To run NWChem using the parallel command, the command line is "parallel procgrp input.nw". The first argument of parallel is the name of the PROCGRP file. Parallel automatically appends ".p", so in this case it would look for a file named "procgrp.p". A common convention is to name the PROCGRP file "nwchem.p", but remember that the actual executable to be invoked is specified within the PROCGRP file, not on the parallel command line. Remaining arguments to parallel are passed to the program being invoked, so here we give the NWChem input deck "input.nw".
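
For example, with a PROCGRP file nwchem.p and an input deck h2o.nw in the current directory (both names illustrative):

  parallel nwchem h2o.nw

Note that "nwchem" here names the PROCGRP file nwchem.p; the executable actually run is whatever path appears in that file.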

Execution on remote workstations is initiated using the rsh/rexec protocol. Users must have remote execution privileges enabled for parallel to work; this requires that the master workstation's hostname appear in each slave's .rhosts file (see the man page rsh(1)).
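
For example, if the master workstation is fermi and the username is gg502 (names borrowed from the examples in these notes; substitute your own), each slave machine would need a line like the following in that user's ~/.rhosts file:

  fermi gg502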

Multiple processes -- single machine

If you have a multiprocessor workstation, you can run multiple processes that communicate through shared memory regions. In this case, your PROCGRP file would have a single (non-comment) line, with the nslave field indicating the number of processes you want to run. For example

  gg502	bohr	12	/disk1/gg502/hpcci/nwchem	.

which would run twelve processes of nwchem working in the current directory ("." in Unix shorthand).

Clustered workstations

In this case, your PROCGRP file will have multiple lines, generally one for each machine in your cluster. You must be able to rsh to each username/host you specify. Processes on different lines communicate via TCP/IP sockets. It is also possible to run multiple processes on individual nodes in a cluster; in that case, processes sharing a single host communicate internally via shared memory and with processes on other machines via sockets.

The Global Array Toolkit sets up a process on each node in the cluster to act as a data server, which answers off-processor requests for data. Consequently, the nslave value specified on each line must be ONE MORE than the number of compute processes you want. For example,

  gg502	bohr	8	/disk1/gg502/hpcci/nwchem	/disk1/gg502/wrk
  gg502	coho	4	/scr/gg502/hpcci/nwchem	/scr/gg502/wrk

would start 7 compute processes on bohr and 3 on coho, along with one data server on each, for a total of twelve processes.

Common problems

When running in parallel on workstations, NWChem uses shared memory and may use semaphores. If the run terminates abnormally (errors not trapped by the code itself, interrupted by the user, etc.) it may not release these resources back to the system. These are global resources, and it is possible for you and/or other users to exhaust them.

To see if you have any of these resources allocated, use the command "ipcs". You will see a table subdivided into "Message Queues", "Shared Memory" and "Semaphores". The second column lists an id number, which you can use to remove your claim on a resource with the "ipcrm" command. For example, user gg502 would deallocate the resources described by the output of the ipcs command

IPC status from fermi as of Wed Aug  9 15:50:32 1995
T     ID     KEY        MODE       OWNER    GROUP
Message Queues:
Shared Memory:
m    600 0x00000000 --rw-rw-rw-   d3g681      101
m   1302 0x00000000 --rw-------    gg502      101
m    903 0x00000000 --rw-------    gg502      101
m   1104 0x00000000 --rw-------    gg502      101
m    306 0x00000000 --rw-------    gg502      101
m    107 0x00000000 --rw-------    gg502      101
m      9 0x00000000 --rw-------    gg502      101
Semaphores:
s    131 0x00000000 --ra-------    gg502      101
s     92 0x00000000 --ra-------    gg502      101
s    113 0x00000000 --ra-------    gg502      101
s     35 0x00000000 --ra-------    gg502      101
by using the command
"ipcrm -m 1302 -m 903 -m 1104 -m 306 -m 107 -m 9 -s 131 -s 92 -s 113 -s 35"
A script (ipcreset) to simplify this procedure is provided by TCGMSG (see its README file).
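
If the script is not at hand, a rough one-liner in the same spirit is (this assumes the ipcs output format shown above, which varies between systems, and uses gg502 as the username; substitute your own):

  ipcs | awk '$5 == "gg502" && $1 == "m" {print "-m", $2} $5 == "gg502" && $1 == "s" {print "-s", $2}' | xargs ipcrm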

You will not be able to remove someone else's resources unless you have sufficient privilege (i.e. root access).

Caveats

Note that running multiple processes on a single-processor machine is useful only for debugging. You'll get faster turnaround on production jobs by running them as a single process if you have only one processor available.

Running multiple processes on a single machine via a multi-line PROCGRP file (forcing them to communicate over sockets) is likewise less efficient than using the shared memory facilities with a one-line PROCGRP file specifying the desired number of processes in the nslave field.

Heterogeneous clusters (where the nodes are not all the same type of hardware) are not generally supported by the current release of the Global Array Toolkit, and consequently not by NWChem. If all the machines involved use the same representation for data (big- vs. little-endian byte order, IEEE or other floating-point format, etc.) it will probably work, but otherwise it will not.


Prepared by RJ Harrison: Email: nwchem-support@emsl.pnl.gov.