6.3. OpenMP Parallelization

This section describes how \({\mathcal H}\Phi\) utilizes OpenMP for shared-memory parallelization.

6.3.1. Overview

\({\mathcal H}\Phi\) uses OpenMP to parallelize operations over the part of the Hilbert space held locally by each MPI process. This hybrid MPI+OpenMP approach makes efficient use of modern multi-core computing nodes.

6.3.2. Parallelized Operations

The following operations are parallelized using OpenMP:

Hamiltonian-vector multiplication

The main computational kernel in iterative eigensolvers. Each thread processes a portion of the local state vector.

Diagonal matrix elements

Evaluation of on-site interactions and chemical potentials.

Inner products and norms

Vector operations required by the Lanczos and LOBPCG algorithms.

Expectation values

Calculation of physical observables such as the energy, spin correlations, and charge correlations.
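As an illustration of how such operations map onto OpenMP loops, the following is a minimal sketch and not the actual \({\mathcal H}\Phi\) kernel. It assumes a toy transverse-field Ising chain with real coefficients; the function names (diag_energy, hamiltonian_multiply, inner_product) and the parameters J and h are hypothetical. Each thread writes only its own output elements, and the inner product uses a reduction clause.

```c
#include <assert.h>

/* Hypothetical toy model: open Ising chain, basis states are bit strings.
   Diagonal energy of state s: -J for each aligned neighbor pair, +J otherwise. */
static double diag_energy(unsigned long s, int nsites, double J)
{
    double e = 0.0;
    for (int i = 0; i + 1 < nsites; i++) {
        int a = (int)((s >> i) & 1UL);
        int b = (int)((s >> (i + 1)) & 1UL);
        e += (a == b) ? -J : J;
    }
    return e;
}

/* out = H * in, with H = diagonal part - h * (sum of single-spin flips).
   Gather form: iteration s writes only out[s], so no two threads
   ever update the same element. */
void hamiltonian_multiply(int nsites, double J, double h,
                          const double *in, double *out)
{
    assert(nsites > 0 && nsites < 31);
    const long dim = 1L << nsites;

    #pragma omp parallel for schedule(static)
    for (long s = 0; s < dim; s++) {
        double acc = diag_energy((unsigned long)s, nsites, J) * in[s];
        for (int i = 0; i < nsites; i++)
            acc -= h * in[s ^ (1L << i)];   /* flip spin i */
        out[s] = acc;
    }
}

/* Inner product of two real vectors; per-thread partial sums are
   combined by the OpenMP reduction clause. */
double inner_product(long dim, const double *a, const double *b)
{
    assert(dim > 0);
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < dim; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Compiled without OpenMP support, the pragmas are ignored and the same code runs serially, which is convenient for checking results against a threaded build.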

6.3.3. Thread-safe Implementation

Care must be taken to avoid race conditions when multiple threads update shared data structures. \({\mathcal H}\Phi\) uses the following strategies:

Separate accumulation

Each thread accumulates its results in separate memory locations. The partial results are then combined either in a serial section or by an OpenMP reduction clause.

Read-only shared data

Hamiltonian parameters and state indexing arrays are read-only during parallel regions.
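The two strategies above can be sketched as follows. This is a minimal example under stated assumptions, not code from \({\mathcal H}\Phi\): the function site_magnetization, the constant MAX_SITES, and the bit-string basis with real coefficients are hypothetical. The state vector is only read inside the parallel region, each thread accumulates into its own private array, and the partial results are merged one thread at a time.

```c
#include <assert.h>

#define MAX_SITES 30   /* hypothetical upper bound for this sketch */

/* m[i] = sum_s v[s]^2 * sz_i(s), with sz_i = +1 if bit i of s is set, else -1.
   v is read-only shared data; m is written only in the guarded merge step. */
void site_magnetization(int nsites, const double *v, double *m)
{
    assert(nsites > 0 && nsites <= MAX_SITES);
    const long dim = 1L << nsites;

    for (int i = 0; i < nsites; i++)
        m[i] = 0.0;

    #pragma omp parallel
    {
        double local[MAX_SITES] = {0.0};   /* one private copy per thread */

        #pragma omp for schedule(static) nowait
        for (long s = 0; s < dim; s++) {
            const double w = v[s] * v[s];
            for (int i = 0; i < nsites; i++)
                local[i] += ((s >> i) & 1L) ? w : -w;
        }

        /* Merge step: only one thread at a time updates the shared array. */
        #pragma omp critical
        for (int i = 0; i < nsites; i++)
            m[i] += local[i];
    }
}
```

Updating m[i] directly inside the loop over s would be a race condition, because different threads handle different states s but all contribute to the same m[i].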

6.3.4. Setting the Number of Threads

The number of OpenMP threads is controlled by the OMP_NUM_THREADS environment variable:

export OMP_NUM_THREADS=4
mpirun -np 2 ./HPhi -e namelist.def

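For best performance, the number of MPI processes multiplied by the number of OpenMP threads per process should generally equal the number of available CPU cores. The example above runs 2 processes with 4 threads each and therefore occupies 8 cores.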
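To confirm the thread count a process actually sees, a small check such as the following can help. It is a sketch only: the helper names max_openmp_threads and cores_requested are hypothetical and are not part of \({\mathcal H}\Phi\). The only library call is omp_get_max_threads from the standard OpenMP runtime, guarded so that the code also builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Upper bound on the number of threads a parallel region would use;
   this reflects OMP_NUM_THREADS when it is set. Returns 1 in a build
   without OpenMP. */
int max_openmp_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Total cores requested by a hybrid run: processes times threads per process.
   The number of MPI processes is passed in by the caller. */
int cores_requested(int mpi_processes)
{
    assert(mpi_processes > 0);
    return mpi_processes * max_openmp_threads();
}
```

Comparing cores_requested with the core count of the allocation is a quick way to spot accidental oversubscription.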

6.3.5. Performance Considerations

Load balancing

By default, the local Hilbert space is divided evenly among the threads. For systems with symmetry restrictions, the work per thread can still be uneven, so some threads may finish later than others.

Memory bandwidth

Hamiltonian-vector multiplication is often memory-bound, so adding threads beyond the point where memory bandwidth saturates yields little speedup. Using fewer threads with better cache locality can sometimes be more efficient than maximizing the thread count.

NUMA effects

On NUMA systems, memory affinity can significantly affect performance. Consider pinning threads with the OMP_PROC_BIND and OMP_PLACES environment variables, for example OMP_PROC_BIND=close together with OMP_PLACES=cores.
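One common way to address memory affinity in application code is parallel initialization. The sketch below assumes an operating system with a first-touch page placement policy, where a memory page is placed on the NUMA node of the thread that first writes to it. The function alloc_first_touch is hypothetical, and this is not a description of how \({\mathcal H}\Phi\) allocates its vectors.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a vector and zero it in parallel. With a static schedule,
   each thread initializes the same contiguous index range that it will
   later process in compute loops using the same schedule, so under a
   first-touch policy the pages tend to end up near that thread.
   Returns NULL if the allocation fails; the caller frees the result. */
double *alloc_first_touch(long dim)
{
    assert(dim > 0);
    double *v = malloc((size_t)dim * sizeof *v);
    if (v == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < dim; i++)
        v[i] = 0.0;

    return v;
}
```

This only helps if threads stay on the same cores, which is why it is usually combined with thread pinning.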