6.3. OpenMP Parallelization
This section describes how \({\mathcal H}\Phi\) uses OpenMP for shared-memory parallelization.
6.3.1. Overview
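Within each MPI process, \({\mathcal H}\Phi\) uses OpenMP threads to parallelize operations over that process's local part of the Hilbert space. This hybrid MPI+OpenMP approach lets the threads of one process share memory on a multi-core node, while MPI distributes the Hilbert space across processes.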
6.3.2. Parallelized Operations
The following operations are parallelized using OpenMP:
- Hamiltonian-vector multiplication
The main computational kernel in iterative eigensolvers. Each thread processes a portion of the local state vector.
- Diagonal matrix elements
Evaluation of on-site interactions and chemical potentials.
- Inner products and norms
Vector operations required by the Lanczos and LOBPCG algorithms.
- Expectation values
Physical observable calculations such as energy, spin correlations, and charge correlations.
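As a rough illustration of the first and third items, the sketch below parallelizes a matrix-vector product and an inner product with OpenMP. It is not taken from the \({\mathcal H}\Phi\) source: the names toy_matvec and toy_dot, the tridiagonal toy operator, and the use of real-valued vectors are all assumptions made to keep the example short.

```c
#include <assert.h>

/* Hypothetical toy kernel: y = H x for a tridiagonal H with diagonal
 * entries diag[i] and a constant off-diagonal coupling t.
 * Each iteration writes only y[i], so the iterations can be divided
 * among threads without any synchronization. */
void toy_matvec(long n, const double *diag, double t,
                const double *x, double *y)
{
    /* In-place use would let one thread overwrite an x[i] that a
     * neighbouring iteration still needs to read. */
    assert(x != y);

#pragma omp parallel for
    for (long i = 0; i < n; i++) {
        double v = diag[i] * x[i];
        if (i > 0)
            v += t * x[i - 1];
        if (i + 1 < n)
            v += t * x[i + 1];
        y[i] = v;
    }
}

/* Hypothetical inner product of two real vectors. The reduction clause
 * gives each thread a private partial sum and adds the partial sums
 * together when the loop ends. */
double toy_dot(long n, const double *x, const double *y)
{
    double sum = 0.0;

#pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

With these two helpers, an expectation value of the toy operator in the state x can be obtained by calling toy_matvec(n, diag, t, x, y) followed by toy_dot(n, x, y). If the code is compiled without OpenMP support, the pragmas are ignored and both functions run serially with the same results.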
6.3.3. Thread-safe Implementation
Care must be taken to avoid race conditions when multiple threads update shared data structures. \({\mathcal H}\Phi\) uses the following strategies:
- Separate accumulation
Each thread accumulates its results in separate memory locations, which are then combined either in a serial section or by an OpenMP reduction clause.
- Read-only shared data
Hamiltonian parameters and state indexing arrays are read-only during parallel regions.
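The separate-accumulation strategy can be sketched as follows. This is a minimal example under stated assumptions, not the \({\mathcal H}\Phi\) implementation: the function scatter_add and its arguments are hypothetical. Several input elements may map to the same output element, so a naive parallel loop would race on out[]; instead each thread adds into its own buffer, and the buffers are combined in a serial section. The arrays target and val are only read inside the parallel region.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: out[target[i]] += val[i] for i = 0..n-1, where
 * every target[i] lies in 0..m-1. Returns 0 on success and -1 if the
 * temporary buffers cannot be allocated. */
int scatter_add(long n, const long *target, const double *val,
                long m, double *out)
{
    int nthreads = 1;
    assert(m > 0);

#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif

    /* One zero-initialized accumulation buffer of length m per thread. */
    double *buf = calloc((size_t)nthreads * (size_t)m, sizeof *buf);
    if (buf == NULL)
        return -1;

#pragma omp parallel num_threads(nthreads)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* This thread's private slice of buf. */
        double *mine = buf + (size_t)tid * (size_t)m;

#pragma omp for
        for (long i = 0; i < n; i++)
            mine[target[i]] += val[i];
    }

    /* Serial section: combine the per-thread buffers into out. */
    for (int t = 0; t < nthreads; t++)
        for (long j = 0; j < m; j++)
            out[j] += buf[(size_t)t * (size_t)m + j];

    free(buf);
    return 0;
}
```

The per-thread buffers cost nthreads times m doubles of extra memory, which is the usual price of this approach. When the result is a single scalar, a reduction clause achieves the same effect with less code.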
6.3.4. Setting the Number of Threads
The number of OpenMP threads used by each MPI process is controlled by the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=4
mpirun -np 2 ./HPhi -e namelist.def
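For optimal performance, the number of MPI processes multiplied by the number of OpenMP threads per process should match the number of available CPU cores. The example above runs 2 processes with 4 threads each and therefore occupies 2 × 4 = 8 cores.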
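To check which thread count a program actually sees, the OpenMP runtime can be queried directly. The helpers below are hypothetical and are not part of \({\mathcal H}\Phi\); they are a small sketch of how the setting and the core-count rule could be verified.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Upper bound on the number of threads in the next parallel region.
 * This reflects OMP_NUM_THREADS when the variable is set. Returns 1 if
 * the code was compiled without OpenMP support. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Returns 1 if a hybrid job with the given layout asks for more threads
 * in total than there are cores, and 0 otherwise. */
int oversubscribed(long mpi_procs, long threads_per_proc, long cores)
{
    assert(mpi_procs > 0 && threads_per_proc > 0 && cores > 0);
    return mpi_procs * threads_per_proc > cores;
}
```

For the command shown above, oversubscribed(2, 4, 8) returns 0, while the same layout on a 4-core machine, oversubscribed(2, 4, 4), returns 1.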
6.3.5. Performance Considerations
- Load balancing
The local Hilbert space is divided evenly among threads by default. For systems with symmetry restrictions, the work may not be spread evenly, and some threads may have more work than others.
- Memory bandwidth
Hamiltonian-vector multiplication is often memory-bound. Using fewer threads with better cache locality may sometimes be more efficient than maximizing thread count.
- NUMA effects
On NUMA systems, memory affinity can significantly impact performance. Consider using the OMP_PROC_BIND and OMP_PLACES environment variables for thread pinning.
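One common way to exploit memory affinity, shown here only as an illustration and not as a description of what \({\mathcal H}\Phi\) does, is first-touch initialization. On operating systems that place a memory page on the NUMA node of the thread that first writes to it (the default policy on Linux), initializing a vector with the same static loop schedule as the later compute loop tends to keep each thread's data in nearby memory, provided the threads are pinned. The function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Zero a vector in parallel. With schedule(static), each thread touches
 * one contiguous chunk of v first. */
void first_touch_zero(long n, double *v)
{
    assert(v != NULL || n == 0);

#pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        v[i] = 0.0;
}

/* y += a * x, using the same static schedule so that, for the same n and
 * thread count, each thread works on the chunk it initialized. */
void scale_add(long n, double a, const double *x, double *y)
{
#pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Whether this helps depends on the operating system's page placement policy and on the threads staying where they are, which is what pinning with OMP_PROC_BIND and OMP_PLACES is meant to ensure.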