Bandwidth reduction can be a first step in the computation of eigenvalues and eigenvectors for a wide-banded complex Hermitian (or real symmetric) matrix. We present algorithms for this reduction and the corresponding back-transformation of the eigenvectors. These algorithms rely on blocked Householder transformations, thus enabling level 3 BLAS performance, and they feature two levels of parallelism. The efficiency of our approach is demonstrated with numerical experiments.
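To make the reduction step concrete, here is a minimal, unblocked NumPy sketch of a Householder reduction of a real symmetric matrix to semibandwidth b, with accumulation of the orthogonal factor for the eigenvector back-transformation. The function name reduce_to_band and the choice b = 2 are illustrative only; the algorithms of the paper additionally block the reflectors (for level 3 BLAS) and exploit two levels of parallelism, neither of which is reproduced here.

```python
import numpy as np

def reduce_to_band(A, b):
    """Householder reduction of symmetric A to semibandwidth b (unblocked sketch)."""
    n = A.shape[0]
    B = A.astype(float).copy()
    Q = np.eye(n)
    for j in range(n - b - 1):
        x = B[j + b:, j]
        if np.linalg.norm(x[1:]) == 0.0:
            continue                                  # column already within the band
        v = x.copy()
        alpha = np.linalg.norm(x)
        v[0] += alpha if x[0] >= 0 else -alpha        # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        # Two-sided update B <- H B H with H = I - 2 v v^T acting on rows j+b..n-1
        B[j + b:, :] -= 2.0 * np.outer(v, v @ B[j + b:, :])
        B[:, j + b:] -= 2.0 * np.outer(B[:, j + b:] @ v, v)
        Q[:, j + b:] -= 2.0 * np.outer(Q[:, j + b:] @ v, v)   # accumulate Q <- Q H
    return B, Q

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
A = A + A.T
Bnd, Q = reduce_to_band(A, b=2)
assert np.allclose(np.triu(Bnd, 2 + 1), 0)     # B = Q^T A Q is banded
lam, Y = np.linalg.eigh(Bnd)                   # solve the banded problem (dense here)
X = Q @ Y                                      # back-transformation of the eigenvectors
assert np.allclose(A @ X, X * lam)
```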
We first develop a new variant of Cannon’s algorithm for parallel matrix multiplication on rectangular process grids. Then we tailor it to selected situations where at least one triangular matrix is involved, namely “upper triangle of (full × upper triangular),” “lower triangle of (lower triangular × upper triangular),” and “all of (upper triangular × rectangular).” These operations arise in the transformation of generalized Hermitian positive definite eigenproblems A X = B X Λ to standard form A X = X Λ, and making use of the triangular structure enables savings in arithmetic operations and communication. Numerical results show that the new implementations outperform previously available routines. They are particularly effective if a whole sequence of generalized eigenproblems with the same matrix B must be solved, but they can also be competitive for the solution of a single generalized eigenproblem.
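As a serial illustration of the structure exploitation (the paper's Cannon-type distribution over a process grid is not reproduced here), the following NumPy sketch computes only the upper triangle of a product full × upper triangular, blockwise; the function name and the block size nb are hypothetical. Skipping the structurally zero blocks of U cuts the multiplication count to about a third of a full GEMM, and in the distributed setting the same skipping also removes the corresponding communication.

```python
import numpy as np

def triu_full_times_triu(F, U, nb=32):
    """Upper triangle of F @ U for upper-triangular U, computed blockwise.

    Blocks below the diagonal are never formed, and the inner sum for
    block column j stops at row j because U[k, j] = 0 for k > j.
    """
    n = F.shape[0]
    C = np.zeros_like(F)
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        for j0 in range(i0, n, nb):        # only upper block columns
            j1 = min(j0 + nb, n)
            C[i0:i1, j0:j1] = F[i0:i1, :j1] @ U[:j1, j0:j1]
    return np.triu(C)

rng = np.random.default_rng(1)
n = 100
F = rng.standard_normal((n, n))
U = np.triu(rng.standard_normal((n, n)))
assert np.allclose(triu_full_times_triu(F, U), np.triu(F @ U))
```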
For dense or banded symmetric (Hermitian) matrices, the computation of eigenvalues and eigenvectors, A x = λ B x, is an important task, e.g., in electronic structure calculations. If a large number of eigenvectors is needed, direct solvers are often applied. On parallel architectures the ELPA implementation has proven very efficient, also in comparison with other parallel solvers such as EigenExa or MAGMA. The main improvement enabling ELPA's better parallel efficiency, and the central achievement of the original ELPA project, is the two-step transformation from dense to banded to tridiagonal form. The follow-up project has targeted additional improvements, such as monitoring and autotuning of the ELPA code, optimizing the code for different architectures, developing tailored algorithms for banded A and B, and applying the improved code to typical examples in electronic structure calculations. In this paper we present the outcome of this project.
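As a small serial illustration of the end point of that two-step pipeline, the following SciPy sketch solves a standard eigenproblem stored in LAPACK band storage with scipy.linalg.eig_banded; this is SciPy's LAPACK wrapper, not ELPA's distributed solver, and the band-to-tridiagonal bulge chasing happens inside the library call. The order n = 12 and semibandwidth b = 2 are illustrative.

```python
import numpy as np
from scipy.linalg import eig_banded

rng = np.random.default_rng(2)
n, b = 12, 2
A = np.zeros((n, n))
for d in range(b + 1):             # fill diagonals -b..b symmetrically
    v = rng.standard_normal(n - d)
    A += np.diag(v, d)
    if d:
        A += np.diag(v, -d)

ab = np.zeros((b + 1, n))          # LAPACK upper band storage: ab[b+i-j, j] = A[i, j]
for j in range(n):
    for i in range(max(0, j - b), j + 1):
        ab[b + i - j, j] = A[i, j]

lam, X = eig_banded(ab)            # eigenpairs from the banded representation
assert np.allclose(lam, np.linalg.eigvalsh(A))
```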
We consider generalized eigenvalue problems A x = B x λ with a banded symmetric matrix A and a banded symmetric positive definite matrix B. To reduce the generalized eigenvalue problem to standard form C y = y λ, we apply the algorithm proposed by Crawford, which preserves the banded structure in C. We present a parallel implementation of this method for the ELPA library. A performance analysis demonstrates the capabilities of the approach.
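For reference, the algebra that any such reduction must reproduce can be written down with dense building blocks in a few lines of SciPy; note that this naive route makes C full in general, whereas Crawford's algorithm interleaves the triangular solves with orthogonal transformations so that C keeps the band. The sketch below only fixes the algebra, not the banded algorithm; all names are illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(3)
n = 8
A = rng.standard_normal((n, n)); A = A + A.T                   # symmetric
M = rng.standard_normal((n, n)); B = M @ M.T + n * np.eye(n)   # symmetric positive definite

L = cholesky(B, lower=True)                  # B = L L^T
S = solve_triangular(L, A, lower=True)       # S = L^{-1} A
C = solve_triangular(L, S.T, lower=True)     # C = L^{-1} A L^{-T}  (uses A = A^T)

lam, Y = np.linalg.eigh(C)                   # standard problem C y = y lambda
X = solve_triangular(L.T, Y, lower=False)    # back-transform: x = L^{-T} y
assert np.allclose(A @ X, (B @ X) * lam)     # A x = B x lambda
```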
We first briefly report on the status and recent achievements of the ELPA-AEO (Eigenvalue Solvers for Petaflop Applications – Algorithmic Extensions and Optimizations) and ESSEX II (Equipping Sparse Solvers for Exascale) projects. In both collaborative efforts, scientists from the application areas, mathematicians, and computer scientists work together to develop and make available efficient, highly parallel methods for the solution of eigenvalue problems. Then we focus on a topic addressed in both projects, the use of mixed-precision computations to enhance efficiency. We describe in more detail our approaches for benefiting from either lower or higher precision in three selected contexts, and the results thus obtained.
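As one generic example of the lower-precision direction (not the specific schemes of either project), the following NumPy sketch computes an eigendecomposition in single precision and then refines the eigenvalues in double precision via Rayleigh quotients; for a symmetric matrix the Rayleigh quotient error is bounded by the squared residual norm over the spectral gap, so the refined values are typically far more accurate than the single-precision ones.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
A = rng.standard_normal((n, n)); A = (A + A.T) / 2

# Step 1: cheap eigendecomposition in single precision
w32, V32 = np.linalg.eigh(A.astype(np.float32))

# Step 2: refine the eigenvalues in double precision via Rayleigh quotients
V = V32.astype(np.float64)
w_ref = np.einsum('ij,ij->j', V, A @ V) / np.einsum('ij,ij->j', V, V)

w_exact = np.linalg.eigvalsh(A)
print("float32 error:", np.max(np.abs(np.sort(w32.astype(np.float64)) - w_exact)))
print("refined error:", np.max(np.abs(np.sort(w_ref) - w_exact)))
```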
We present a method for reducing the generalized eigenvalue problem A x = B x λ, with banded Hermitian matrices A, B, and B positive definite, to an equivalent standard eigenvalue problem C y = y λ, such that C again is banded. Our method combines ideas of an algorithm proposed by Crawford in 1973 and of LAPACK's reduction routines _SYGST/_HEGST, and retains their respective advantages, namely being able to rely on matrix-matrix operations (Crawford) and to handle split factorizations and different bandwidths b_A and b_B (LAPACK). In addition, it includes two parameters (the block size nb and the width w of the blocked orthogonal transformations) that can be adjusted to optimize performance. We also present a heuristic for automatically determining suitable values for these parameters. Numerical experiments confirm the efficiency of our new method.
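The paper's own heuristic for choosing nb and w is not reproduced here; as a generic fallback of the same flavor, the parameters can be picked empirically by timing a representative blocked kernel, as in the following hypothetical sketch (best_block_size, the rank-nb update kernel, and the candidate list are all illustrative, not the paper's procedure).

```python
import time
import numpy as np

def best_block_size(n=2048, candidates=(32, 64, 128, 256)):
    """Empirically choose a block size nb by timing a blocked rank-nb update."""
    A = np.random.default_rng(5).standard_normal((n, n))
    timings = {}
    for nb in candidates:
        C = np.zeros((n, n))
        t0 = time.perf_counter()
        for k0 in range(0, n, nb):
            W = A[:, k0:k0 + nb]
            C -= W @ W.T                  # one blocked update of width nb
        timings[nb] = time.perf_counter() - t0
    return min(timings, key=timings.get), timings

nb, timings = best_block_size()
print("chosen nb:", nb, timings)
```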