6.3. OpenMP Parallelization

This section describes how \({\mathcal H}\Phi\) uses OpenMP for shared-memory parallelization.

6.3.1. Overview

\({\mathcal H}\Phi\) uses OpenMP to parallelize operations over the local Hilbert space held by each MPI process. In this hybrid MPI+OpenMP scheme, MPI distributes the work across processes and OpenMP threads share the work within each process, so that all cores of a multi-core node can be used.

6.3.2. Parallelized Operations

The following operations are parallelized using OpenMP:

Hamiltonian-vector multiplication: The main computational kernel in iterative eigensolvers. Each thread processes a portion of the local state vector.
Diagonal matrix elements: Evaluation of on-site interactions and chemical potentials.
Inner products and norms: Vector operations required by the Lanczos and LOBPCG algorithms.
Expectation values: Physical observable calculations such as energy, spin correlations, and charge correlations.
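
As a rough illustration of the first and third items, the following C sketch shows a loop-parallel matrix-vector product and a reduction-based inner product. It is not taken from the \({\mathcal H}\Phi\) source: the function names toy_hamiltonian_multiply and toy_inner_product, the tridiagonal toy matrix, and the use of real rather than complex amplitudes are all assumptions made for brevity.

```c
/* Hypothetical helper, not an HPhi function: computes y = H x for a toy
 * tridiagonal H with constant diagonal d and constant off-diagonal t. */
void toy_hamiltonian_multiply(long n, double d, double t,
                              const double *x, double *y)
{
    long i;

    /* Each iteration writes only y[i], so no two threads ever write
     * to the same output element; x is only read. */
#pragma omp parallel for
    for (i = 0; i < n; i++) {
        double acc = d * x[i];
        if (i > 0)
            acc += t * x[i - 1];
        if (i < n - 1)
            acc += t * x[i + 1];
        y[i] = acc;
    }
}

/* Hypothetical helper: inner product of two real vectors. Each thread
 * sums into a private copy of sum; OpenMP combines the copies at the
 * end of the loop. */
double toy_inner_product(long n, const double *a, const double *b)
{
    double sum = 0.0;
    long i;

#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

If the file is compiled without OpenMP support, the pragmas are ignored and both functions run serially, giving the same results up to floating-point rounding in the reduction.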

6.3.3. Thread-safe Implementation

Care must be taken to avoid race conditions when multiple threads update shared data structures. \({\mathcal H}\Phi\) uses the following strategies:

Separate accumulation: Each thread accumulates its results in its own memory location. The partial results are then combined in a serial section or by an OpenMP reduction clause.
Read-only shared data: Hamiltonian parameters and state indexing arrays are read-only during parallel regions.
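
The two strategies above can be sketched for the case where several loop iterations add to the same output element, which would be a race if threads wrote to the shared array directly. This is a minimal sketch under stated assumptions, not the \({\mathcal H}\Phi\) implementation: scatter_add and its arguments are hypothetical names, and the combine step uses a critical section as one possible form of serialized combination.

```c
#include <stdlib.h>

/* Hypothetical helper (not from HPhi): performs out[target[i]] += val[i]
 * for i = 0..n-1, where 0 <= target[i] < m. target and val are read-only
 * shared data. Returns 0 on success, or -1 if a thread-local buffer could
 * not be allocated (out is then incomplete). */
int scatter_add(long n, const long *target, const double *val,
                long m, double *out)
{
    int status = 0;

#pragma omp parallel
    {
        long i, j;
        /* Separate accumulation: one zero-initialized buffer per thread. */
        double *local = calloc((size_t)m, sizeof *local);

        if (local == NULL) {
#pragma omp critical(scatter_status)
            status = -1;
        }

        /* Every thread must reach the worksharing loop, even one whose
         * allocation failed, so the NULL check sits inside the loop. */
#pragma omp for
        for (i = 0; i < n; i++) {
            if (local != NULL)
                local[target[i]] += val[i];
        }

        /* Combine step: one thread at a time adds its buffer to out. */
        if (local != NULL) {
#pragma omp critical(scatter_combine)
            for (j = 0; j < m; j++)
                out[j] += local[j];
            free(local);
        }
    }

    return status;
}
```

For a single scalar result, such as a norm or an energy, a reduction clause achieves the same effect with less code. The per-thread buffer costs m elements of extra memory per thread, so it is only attractive when m is small compared with the available memory.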

6.3.4. Setting the Number of Threads

The number of OpenMP threads is controlled by the OMP_NUM_THREADS environment variable:

export OMP_NUM_THREADS=4
mpirun -np 2 ./HPhi -e namelist.def

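For optimal performance, the number of MPI processes multiplied by the number of OpenMP threads per process should match the number of available CPU cores. The example above runs 2 processes with 4 threads each and therefore occupies 2 × 4 = 8 cores.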

6.3.5. Performance Considerations

Load balancing: By default, the local Hilbert space is divided evenly among threads. For systems with symmetry restrictions, an even division of states does not guarantee an even division of work, so some threads may have more work than others.
Memory bandwidth: Hamiltonian-vector multiplication is often memory-bound, so adding threads beyond the point where memory bandwidth saturates yields little speedup. Using fewer threads with better cache locality can be more efficient than maximizing the thread count.
NUMA effects: On NUMA systems, memory affinity can significantly impact performance. Consider pinning threads with the OMP_PROC_BIND and OMP_PLACES environment variables, for example OMP_PROC_BIND=close and OMP_PLACES=cores.
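
Two of the points above can be illustrated in code. The sketch below is not \({\mathcal H}\Phi\) code, and the names alloc_first_touch and scale_rows_uneven, the row_start layout, and the chunk size of 64 are assumptions chosen for illustration. The first function initializes a vector in parallel, which helps on systems that place each memory page near the thread that first writes it (commonly the default on Linux). The second uses a dynamic schedule so that threads finishing short rows pick up further work.

```c
#include <stdlib.h>

/* Hypothetical helper: allocates n doubles and zeroes them in parallel.
 * With a first-touch page placement policy, each page ends up close to
 * the thread that initialized it. Later loops over the vector should use
 * the same static schedule to benefit. Returns NULL on failure. */
double *alloc_first_touch(long n)
{
    double *v = malloc((size_t)n * sizeof *v);
    long i;

    if (v == NULL)
        return NULL;

#pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        v[i] = 0.0;

    return v;
}

/* Hypothetical helper: multiplies every element of data by factor, where
 * row r owns the elements row_start[r] .. row_start[r + 1] - 1 and rows
 * may differ widely in length. Chunks of 64 rows are handed out on
 * demand, so a thread that drew short rows does not sit idle. */
void scale_rows_uneven(long nrows, const long *row_start,
                       double factor, double *data)
{
    long r;

#pragma omp parallel for schedule(dynamic, 64)
    for (r = 0; r < nrows; r++) {
        long k;
        for (k = row_start[r]; k < row_start[r + 1]; k++)
            data[k] *= factor;
    }
}
```

The two techniques pull in opposite directions: a dynamic schedule improves load balance but no longer guarantees that a thread works on the pages it initialized. Which one pays off depends on the system and problem size, so it is worth measuring both.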