The scaling of the distributed-data SCF is illustrated below, including some very fast in-core calculations on the CRAY-T3D (e.g., the total time for a 324-function, DZP-basis SCF on C14F6H8 is just 5.5 minutes). The excellent scaling of large in-core calculations also demonstrates that the direct algorithm will benefit fully from the introduction of faster integral technology into NWChem.
As part of our High-Performance Computing and Communications Collaboration we have collected a set of benchmark calculations to help establish design goals and to track performance. A paper is in preparation comparing the performance of NWChem with other computational chemistry packages on a variety of parallel computers. The current document picks out some of the more interesting highlights.
This system was of interest to PNNL researchers as an early cluster model for reactions inside minerals. It is a fragment of the zeolite ZSM-5, terminated with hydrogen atoms and with an embedded pyridine molecule. The high sparsity and small basis caused the previous version of NWChem to spend most of its time accessing global arrays, but the current version spends less than 1.6% of the time in global array accesses (CRAY-T3D, 256 nodes).
This calculation would not be possible with a replicated-data algorithm on many massively parallel computers, since the Fock and density matrices require at least 34 Mbytes (storing only the lower triangles). The distributed-data algorithm requires just 0.26 Mbytes/node (on 256 processors, storing full square matrices), leaving much more memory available to optimize the speed of integral evaluation.
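As a rough check on those figures, the short Python sketch below reproduces the arithmetic. It assumes double precision (8 bytes per element) and roughly 2050 basis functions; that basis size is our assumption, chosen because it is consistent with the quoted numbers, and is not stated above.

    # Back-of-the-envelope check of the memory figures above (not NWChem code).
    # Assumes double precision (8 bytes/element) and ~2050 basis functions, a
    # hypothetical basis size consistent with the quoted numbers.

    def replicated_mb(nbf, nmat=2):
        """Per-node Mbytes when every node holds lower-triangular copies."""
        return nmat * nbf * (nbf + 1) // 2 * 8 / 1e6

    def distributed_mb(nbf, nprocs, nmat=2):
        """Per-node Mbytes when full square matrices are spread over nprocs nodes."""
        return nmat * nbf * nbf * 8 / nprocs / 1e6

    nbf = 2050                                            # illustrative only
    print(f"replicated : {replicated_mb(nbf):.1f} MB/node")        # compare: at least 34 Mbytes quoted above
    print(f"distributed: {distributed_mb(nbf, 256):.2f} MB/node")  # compare: 0.26 Mbytes/node quoted above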
The following table presents wall times in hours for complete SCF calculations on this system on several parallel computers. The SP1 at Argonne National Laboratory is unusual since it has SP1 compute nodes but uses an SP2 switch.
No. of proc.s    T3D    KSR2    SP1
-------------    ---    ----    ---
      64          -      -       5
      72          -     4.4      -
     128         2.9     -       -
This medium-sized molecule is representative of many routine calculations and is also about the largest molecule without symmetry for which it is practical to cache all integrals in memory. The combination of massive parallelism and the in-core algorithm makes for very high performance. The distributed-data method makes over 200 Mbytes more memory available for caching integrals than would be available using a replicated-data approach.
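The sketch below gives one plausible back-of-the-envelope reading of that 200 Mbyte figure; it is our assumption about how the number arises, not a statement taken from the text. With 324 basis functions on 256 T3D nodes, replicating even a single dense double-precision matrix on every node ties up over 200 Mbytes of aggregate memory that the distributed-data code can instead devote to the integral cache.

    # One plausible reading of the ">200 Mbytes" figure (an assumption, not
    # taken from the text): aggregate memory tied up by one replicated matrix.

    nbf, nprocs = 324, 256
    matrix_mb = nbf * nbf * 8 / 1e6          # one dense N x N matrix, ~0.84 MB
    replicated_total = matrix_mb * nprocs    # a copy on every node, ~215 MB
    distributed_total = matrix_mb            # one copy spread across the machine
    print(f"aggregate memory freed per matrix: {replicated_total - distributed_total:.0f} MB")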
The following table presents total wall times in seconds for a complete SCF calculation on this system using a fully in-core algorithm on the NERSC CRAY-T3D, which has 64 Mbytes of memory per node.
T3D Processors    New (in-core)
--------------    -------------
     128               605
     256               334
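For reference, the parallel efficiency implied by the table can be computed directly; the snippet below is purely illustrative arithmetic on the timings above.

    # Illustrative arithmetic on the timings above: parallel efficiency when the
    # T3D partition is doubled from 128 to 256 nodes for the in-core SCF.

    t128, t256 = 605.0, 334.0        # wall times in seconds, from the table
    speedup = t128 / t256            # observed speedup for a 2x larger partition
    efficiency = speedup / 2.0       # relative to the ideal speedup of 2.0
    print(f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
    # -> speedup 1.81x, parallel efficiency 91%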
This cluster was provided by our industrial collaborators as being representative of systems being considered as catalysts. Due to bugs, previous versions of NWChem obtained only first-order convergence for this system. This problem has been addressed, and full, stable second-order convergence is now obtained. In addition, optimizations in the use of linear algebra have significantly improved the scalability of this calculation.
The following table presents total wall times in minutes for this problem for the old and new versions of the code running on the KSR-2 and CRAY T3D. The drop-off in performance is typical for small molecules and is due to poor load balancing: the cobalt atom has many more functions on it than the other atoms. The performance could be enhanced by splitting large atoms into smaller blocks of functions (illustrated in the sketch after the table), but this is only necessary for small molecules.
                       KSR                              T3D
          -------------------------       -------------------------
No. of      Old      New       New          Old      New       New
proc.s     Direct   Direct   In-core       Direct   Direct   In-core
------     ------   ------   -------       ------   ------   -------
    1         -      406.0      -             -        -        -
    8         -       53.2     7.5            -        -        -
   16       38.0      29.6     4.5            -      31.0       -
   32       31.8*     18.8     2.9           23      17.7      2.3
   64       21.3**    12.3     2.8           15      10.9       -
  128         -         -       -            11       7.9       -

 * = 36 processor time
** = 72 processor time
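The load-balancing point can be illustrated with a toy model. The sketch below is not NWChem's actual task decomposition; using made-up basis-function counts, it simply shows how one centre with many functions produces a few tasks much larger than the average when work is blocked over atom pairs, and how splitting that centre's functions into smaller blocks reduces the imbalance.

    # Toy model of the load-balancing problem (not NWChem's task decomposition).
    # Basis-function counts per centre are made up: one cobalt-like atom with
    # many functions plus several light atoms.

    heavy_first = [40, 6, 6, 6, 6, 6, 6, 6]

    def pair_task_costs(blocks):
        # Crude cost model: work for block pair (i, j) ~ product of the block sizes.
        return [bi * bj for i, bi in enumerate(blocks) for bj in blocks[: i + 1]]

    def imbalance(blocks):
        costs = pair_task_costs(blocks)
        return max(costs) / (sum(costs) / len(costs))   # largest task / mean task

    print(f"one big block     : {imbalance(heavy_first):.1f}")                 # ~13x the mean task
    print(f"big block split 4x: {imbalance([10] * 4 + heavy_first[1:]):.1f}")  # ~1.8x the mean task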