OpenMP Parallelization
======================

This section describes how :math:`{\mathcal H}\Phi` uses OpenMP for shared-memory parallelization.

Overview
--------

:math:`{\mathcal H}\Phi` uses OpenMP to parallelize operations over the local Hilbert space
on each MPI process. This hybrid MPI+OpenMP approach enables efficient use of modern
multi-core computing nodes.

Parallelized Operations
-----------------------

The following operations are parallelized using OpenMP:

**Hamiltonian-vector multiplication**
  The main computational kernel in iterative eigensolvers. Each thread processes
  a portion of the local state vector.

**Diagonal matrix elements**
  Evaluation of on-site interactions and chemical potentials.

**Inner products and norms**
  Vector operations required by the Lanczos and LOBPCG algorithms.

**Expectation values**
  Physical observable calculations such as energy, spin correlations,
  and charge correlations.
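
As an illustration of how a Hamiltonian-vector product can be split among threads,
the following minimal sketch applies a toy spin-1/2 Hamiltonian (a diagonal term plus
a transverse field of strength ``h``) to a vector. It is not the kernel used in
:math:`{\mathcal H}\Phi`; the function name ``toy_hamiltonian_multiply``, its arguments,
and the bit-string encoding of basis states are assumptions made for this example.

.. code-block:: c

   /* Toy example (hypothetical, not HPhi code): y = H x with
    * H = diag + h * sum_i sigma^x_i on nsites spin-1/2 sites.
    * Basis state j is a bit string; flipping spin i maps j to j ^ (1 << i). */
   void toy_hamiltonian_multiply(int nsites, double h, const double *diag,
                                 const double *x, double *y)
   {
       const long dim = 1L << nsites;  /* dimension of the toy Hilbert space */

       /* Each iteration reads x and writes only y[j], so the iterations
        * are independent and can be distributed among threads. */
       #pragma omp parallel for
       for (long j = 0; j < dim; j++) {
           double acc = diag[j] * x[j];        /* diagonal contribution */
           for (int i = 0; i < nsites; i++) {
               acc += h * x[j ^ (1L << i)];    /* spin-flip contribution */
           }
           y[j] = acc;
       }
   }

Because every thread writes to a distinct element of ``y``, this gather-style loop needs
no locks or atomic updates. Compiled without OpenMP support, the pragma is ignored and
the loop runs serially with the same result.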

Thread-safe Implementation
--------------------------

Care must be taken to avoid race conditions when multiple threads update
shared data structures. :math:`{\mathcal H}\Phi` uses the following strategies:

**Separate accumulation**
  Each thread accumulates its results in a separate memory location.
  The partial results are then combined in a serial section or by an OpenMP ``reduction`` clause.

**Read-only shared data**
  Hamiltonian parameters and state indexing arrays are read-only during
  parallel regions.
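
The two ways of combining separately accumulated results can be sketched with an inner
product. This is a minimal example under stated assumptions, not code taken from
:math:`{\mathcal H}\Phi`: the function names ``dot_reduction`` and ``dot_partial`` are
hypothetical, and real-valued vectors are used for brevity.

.. code-block:: c

   #include <stdlib.h>
   #ifdef _OPENMP
   #include <omp.h>
   #endif

   /* Variant 1: the reduction clause gives each thread a private copy
    * of sum and adds the copies together at the end of the loop. */
   double dot_reduction(const double *a, const double *b, long n)
   {
       double sum = 0.0;
       #pragma omp parallel for reduction(+:sum)
       for (long i = 0; i < n; i++) {
           sum += a[i] * b[i];
       }
       return sum;
   }

   /* Variant 2: each thread stores its partial sum in its own slot;
    * the slots are added up serially after the parallel region. */
   double dot_partial(const double *a, const double *b, long n)
   {
       int nthreads = 1;
   #ifdef _OPENMP
       nthreads = omp_get_max_threads();  /* upper bound on the team size */
   #endif
       double *partial = calloc((size_t)nthreads, sizeof *partial);
       if (partial == NULL) {
           return dot_reduction(a, b, n);  /* fall back if allocation fails */
       }

       #pragma omp parallel
       {
           int tid = 0;
   #ifdef _OPENMP
           tid = omp_get_thread_num();
   #endif
           double local = 0.0;             /* private to this thread */
           #pragma omp for
           for (long i = 0; i < n; i++) {
               local += a[i] * b[i];
           }
           partial[tid] = local;           /* one write per thread, no race */
       }

       double sum = 0.0;
       for (int t = 0; t < nthreads; t++) {  /* serial combination */
           sum += partial[t];
       }
       free(partial);
       return sum;
   }

Both variants read ``a`` and ``b`` without modifying them, which corresponds to the
read-only shared data strategy. Floating-point sums can differ in the last digits
between runs with different thread counts because the order of additions changes.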

Setting the Number of Threads
-----------------------------

The number of OpenMP threads is controlled by the ``OMP_NUM_THREADS`` environment variable:

.. code-block:: bash

   export OMP_NUM_THREADS=4
   mpirun -np 2 ./HPhi -e namelist.def

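As a starting point, the number of MPI processes multiplied by the number of OpenMP threads
per process should match the number of available CPU cores. The example above runs
2 MPI processes with 4 threads each and therefore occupies 8 cores.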
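
The arithmetic behind this rule can be written as a small helper. The function
``threads_per_process`` is hypothetical and exists only for this illustration; it is
not part of :math:`{\mathcal H}\Phi`.

.. code-block:: c

   /* Hypothetical helper: number of OpenMP threads per MPI process such that
    * processes * threads does not exceed the number of available cores. */
   int threads_per_process(int total_cores, int mpi_processes)
   {
       if (total_cores < 1 || mpi_processes < 1) {
           return 1;                       /* invalid input: use one thread */
       }
       int threads = total_cores / mpi_processes;  /* integer division */
       return threads > 0 ? threads : 1;   /* at least one thread per process */
   }

For 8 cores and 2 MPI processes the helper returns 4, which corresponds to the
``OMP_NUM_THREADS=4`` setting used with ``mpirun -np 2``.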

Performance Considerations
--------------------------

**Load balancing**
  The local Hilbert space is evenly divided among threads by default.
  For systems with symmetry restrictions, some threads may have more work.

**Memory bandwidth**
  Hamiltonian-vector multiplication is often memory-bound, so adding threads
  beyond the point where the memory bandwidth is saturated yields little benefit.
  Using fewer threads with better cache locality can be more efficient
  than maximizing the thread count.

**NUMA effects**
  On NUMA systems, memory affinity can significantly impact performance.
  Consider pinning threads with ``OMP_PROC_BIND`` and ``OMP_PLACES``
  (for example, ``OMP_PROC_BIND=close`` and ``OMP_PLACES=cores``).
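
Thread pinning is most effective when each thread also works on memory located near it.
The sketch below assumes an operating system with a first-touch page placement policy
(the default on Linux): a page is placed on the NUMA node of the thread that first writes
to it. The function names ``alloc_first_touch`` and ``axpy`` are hypothetical, and whether
:math:`{\mathcal H}\Phi` initializes its vectors in this way is not stated here.

.. code-block:: c

   #include <stdlib.h>

   /* Allocate a vector and zero it in parallel so that, under a first-touch
    * policy, each page is placed near the thread that will use it later. */
   double *alloc_first_touch(long n)
   {
       double *v = malloc((size_t)n * sizeof *v);
       if (v == NULL) {
           return NULL;
       }
       #pragma omp parallel for schedule(static)
       for (long i = 0; i < n; i++) {
           v[i] = 0.0;                     /* first write to this element */
       }
       return v;
   }

   /* y <- y + alpha * x, using the same static schedule as the initialization
    * so that each thread works on the index range it touched first. */
   void axpy(long n, double alpha, const double *x, double *y)
   {
       #pragma omp parallel for schedule(static)
       for (long i = 0; i < n; i++) {
           y[i] += alpha * x[i];
       }
   }

The benefit depends on the threads staying on the same cores between the two loops,
which is what the pinning variables above are meant to ensure.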