[1] A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H.-J. Bungartz, and H. Lederer. The ELPA library: Scalable parallel eigenvalue solutions for electronic structure theory and computational science. J. Phys.: Condens. Matter, 26(21):213201, May 2014. [ DOI ]
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N³) with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well-documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding back transformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMP implementation available as well. Scalability beyond 10,000 CPU cores for problem sizes arising in the field of electronic structure theory is demonstrated for current high-performance computer architectures such as Cray or Intel/Infiniband. For a matrix of dimension 260,000, scalability up to 295,000 CPU cores has been shown on BlueGene/P.
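To make the tridiagonalization step concrete, here is a serial NumPy sketch of the one-step reduction by successive Householder reflections (our illustration only; the function name is ours, and ELPA's actual code is blocked and distributed, with the two-step variant going through banded form):

    import numpy as np

    def householder_tridiagonalize(A):
        """Reduce a real symmetric matrix to tridiagonal form T = Q^T A Q
        by successive Householder reflections (serial sketch only)."""
        A = np.array(A, dtype=float)
        n = A.shape[0]
        Q = np.eye(n)
        for k in range(n - 2):
            x = A[k+1:, k]
            alpha = -np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
            v = x.copy()
            v[0] -= alpha                  # v = x - alpha*e1, avoiding cancellation
            nv = np.linalg.norm(v)
            if nv == 0.0:
                continue                   # column already has the desired form
            v /= nv
            H = np.eye(n)
            H[k+1:, k+1:] -= 2.0 * np.outer(v, v)
            A = H @ A @ H                  # symmetric similarity transformation
            Q = Q @ H
        return A, Q

    # Check: for symmetric A0, T, Q = householder_tridiagonalize(A0)
    # gives np.allclose(Q @ T @ Q.T, A0) -> True.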

[2] Andreas Marek, Volker Blum, Rainer Johanni, Ville Havu, Bruno Lang, Thomas Auckenthaler, Alexander Heinecke, Hans-Joachim Bungartz, and Hermann Lederer. The ELPA library -- scalable parallel eigenvalue solutions for electronic structure theory and computational science. Scientific highlight of the month, Ψk (http://www.psi-k.net), December 2013. [ .pdf ]
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N³) with the size of the investigated problem, N, and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the ELPA library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well-documented matrix layout of the ScaLAPACK library but replacing all actual parallel solution steps with subroutines of its own. The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding back transformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMP implementation available as well. Scalability beyond 10,000 CPU cores for problem sizes arising in electronic structure theory is demonstrated for current high-performance computer architectures such as Cray or Intel/Infiniband. For a matrix of dimension 260,000, scalability up to 295,000 CPU cores has been shown on BlueGene/P.
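The generalized problems mentioned here are handled, as in standard dense solvers, by factoring B = LL* and reducing A x = λ B x to the standard problem C y = λ y with C = L^{-1} A L^{-*}. A serial SciPy sketch of that textbook reduction (helper name ours, not an ELPA routine, which performs the same steps with distributed kernels):

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular, eigh

    def generalized_to_standard(A, B):
        """Solve A x = lambda B x (A symmetric, B s.p.d.) via the
        Cholesky-based reduction to a standard eigenproblem."""
        L = cholesky(B, lower=True)                   # B = L L^T
        tmp = solve_triangular(L, A, lower=True)      # L^{-1} A
        C = solve_triangular(L, tmp.T, lower=True).T  # C = L^{-1} A L^{-T}
        lam, Y = eigh(C)                              # standard symmetric problem
        X = solve_triangular(L, Y, trans='T', lower=True)  # x = L^{-T} y
        return lam, X

    # Cross-check against SciPy's generalized driver:
    # np.allclose(lam, eigh(A, B, eigvals_only=True)) -> True.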

[3] Lukas Krämer, Edoardo Di Napoli, Martin Galgon, Bruno Lang, and Paolo Bientinesi. Dissecting the FEAST algorithm for generalized eigenproblems. J. Comput. Appl. Math., 244:1--9, May 2013. [ DOI ]
We analyze the FEAST method for computing selected eigenvalues and eigenvectors of large sparse matrix pencils. After establishing the close connection between FEAST and the well-known Rayleigh--Ritz method, we identify several critical issues that influence the convergence and accuracy of the solver: the choice of the starting vector space, the stopping criterion, the impact of the inner linear systems on the quality of the solution, and the use of FEAST for computing eigenpairs from multiple intervals. We complement the study with numerical examples and hint at possible improvements to overcome the existing problems.
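FEAST's core loop combines a contour-integral spectral filter with Rayleigh--Ritz extraction, which is exactly the connection the paper dissects. A minimal dense NumPy sketch for the standard symmetric case (function name ours; an illustration under simplified assumptions, not Polizzi's implementation; m0 must exceed the number of eigenvalues in the search interval):

    import numpy as np
    from scipy.linalg import eigh

    def feast_sketch(A, lam_min, lam_max, m0=8, n_quad=8, n_iter=4, seed=0):
        """Approximate eigenpairs of symmetric A with eigenvalues in
        (lam_min, lam_max) by filtered subspace iteration."""
        n = A.shape[0]
        X = np.random.default_rng(seed).standard_normal((n, m0))
        c, r = 0.5 * (lam_max + lam_min), 0.5 * (lam_max - lam_min)
        t, w = np.polynomial.legendre.leggauss(n_quad)   # nodes on [-1, 1]
        theta = 0.5 * np.pi * (t + 1.0)                  # upper half circle
        for _ in range(n_iter):
            # Quadrature of (1/(2*pi*i)) \oint (zI - A)^{-1} X dz; for real
            # symmetric A the lower half contour contributes the conjugate.
            Q = np.zeros((n, m0))
            for th, wk in zip(theta, w):
                z = c + r * np.exp(1j * th)
                Y = np.linalg.solve(z * np.eye(n) - A, X)
                Q += 0.5 * wk * (r * np.exp(1j * th) * Y).real
            Q, _ = np.linalg.qr(Q)                       # basis of filtered space
            mu, W = eigh(Q.T @ A @ Q)                    # Rayleigh-Ritz step
            X = Q @ W
        keep = (mu > lam_min) & (mu < lam_max)
        return mu[keep], X[:, keep]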

[4] Paul R. Willems and Bruno Lang. A framework for the MR³ algorithm: Theory and implementation. SIAM J. Sci. Comput., 35(2):A740--A766, 2013. [ DOI ]
This paper provides a streamlined and modular presentation of the MR³ algorithm for computing selected eigenpairs of symmetric tridiagonal matrices, thus disentangling the principles driving MR³ and the (recursive) “core” algorithm from the specific (e.g., twisted) decompositions used to represent the matrices at different recursion depths and from the (dqds) transformations converting between them. Our approach allows a modular full proof of the correctness of the MR³ algorithm. This proof is based on five requirements concerning the interplay between the core algorithm and its subcomponents. These requirements can also guide the implementation of the algorithm, because they expose quantities that can and should be monitored at runtime. Our new implementation XMR, which is based on the above analysis, is described and compared to xSTEMR from LAPACK 3.2.2. Numerical experiments comparing the robustness and performance of both implementations are given.
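Of the subcomponents, the eigenvalue refinement is the easiest to illustrate: MR³-type solvers locate eigenvalues by bisection, counting the eigenvalues below a shift through the inertia of a factorization of T - xI. A plain NumPy sketch (function names ours; a production code such as xSTEMR additionally guards against zero pivots and overflow):

    import numpy as np

    def sturm_count(d, e, x):
        """Number of eigenvalues below x of the symmetric tridiagonal
        matrix with diagonal d and off-diagonal e (Sylvester inertia of
        the LDL^T pivots of T - x*I)."""
        count, q = 0, d[0] - x
        count += q < 0
        for i in range(1, len(d)):
            q = d[i] - x - e[i - 1] ** 2 / q   # next pivot (unguarded)
            count += q < 0
        return count

    def bisect_eigenvalue(d, e, k, lo, hi, tol=1e-14):
        """Refine the k-th smallest eigenvalue, assumed to lie in [lo, hi]."""
        while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
            mid = 0.5 * (lo + hi)
            if sturm_count(d, e, mid) <= k:
                lo = mid           # fewer than k+1 eigenvalues below mid
            else:
                hi = mid
        return 0.5 * (lo + hi)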

[5] Paul R. Willems and Bruno Lang. The MR³-GK algorithm for the bidiagonal SVD. Electron. Trans. Numer. Anal., 39:1--21, 2012. [ .pdf ]
Determining the singular value decomposition of a bidiagonal matrix is a frequent subtask in numerical computations. We shed new light on a long-known way to utilize the algorithm of multiple relatively robust representations, MR³, for this task by casting the singular value problem in terms of a suitable tridiagonal symmetric eigenproblem (via the Golub--Kahan matrix). Just running MR³ “as is” on the tridiagonal problem does not work, as has been observed before (e.g., B. Großer and B. Lang, An O(n²) algorithm for the bidiagonal SVD, Linear Algebra Appl., 358 (2003), pp. 45--70). In this paper we give more detailed explanations for the problems with running MR³ as a black box solver on the Golub--Kahan matrix. We show that, in contrast to standing opinion, MR³ can be run safely on the Golub--Kahan matrix, with just a minor modification. A proof including error bounds is given for this claim.
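The coupling itself is easy to state: for an upper bidiagonal B, the (permuted) Golub--Kahan matrix is symmetric tridiagonal with zero diagonal and off-diagonal entries interleaving those of B, and its eigenvalues are plus/minus the singular values of B. A short numerical check (helper name ours):

    import numpy as np
    from scipy.linalg import eigh_tridiagonal

    def golub_kahan_offdiag(a, b):
        """Interleave diag(B) = a and superdiag(B) = b into the
        off-diagonal (a1, b1, a2, b2, ..., an) of T_GK."""
        e = np.empty(len(a) + len(b))
        e[0::2], e[1::2] = a, b
        return e

    a = np.array([2.0, 1.5, 3.0])          # diagonal of B
    b = np.array([0.5, 0.25])              # superdiagonal of B
    e = golub_kahan_offdiag(a, b)
    lam = eigh_tridiagonal(np.zeros(e.size + 1), e, eigvals_only=True)
    sigma = np.linalg.svd(np.diag(a) + np.diag(b, 1), compute_uv=False)
    # Eigenvalues of T_GK come in +/- pairs equal to the singular values of B.
    assert np.allclose(np.sort(np.abs(lam)), np.sort(np.r_[sigma, sigma]))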

[6] Paul R. Willems and Bruno Lang. Twisted factorizations and qd-type transformations for the MR³ algorithm---new representations and analysis. SIAM J. Matrix Anal. Appl., 33(2):523--553, 2012. [ DOI ]
Bidiagonal and twisted factorizations play a prominent role in the MR³ algorithm for computing partial eigensystems of symmetric tridiagonal matrices. Bidiagonal factorizations of shifted symmetric tridiagonals T - τI also underlie, at least implicitly, the evaluation of the Sturm count for eigenvalue computations. In this paper we propose new representations (e-rep, Z-rep) for bidiagonal and twisted factorizations relying on quantities other than the usually employed nontrivial entries of the bidiagonal factors (N-rep). We show that the qd algorithms used for shifting the factorizations, e.g., LDL* - τI =: L₊D₊(L₊)*, and for converting between them, can be formulated to work with these representations in a mixed stable way. In fact, the Z-representation achieves lower bounds for the necessary relative perturbations of the inputs and outputs than the traditional N-rep does. To obtain sharp bounds for accumulating perturbations we propose a new notation that extends Higham's [N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA, 2nd ed., 2002] by including second-order terms to provide sharper first-order bounds.
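The shift LDL* - τI =: L₊D₊(L₊)* is computed by the stationary qd transform with shift (dstqds); in the traditional N-rep quantities it reads as below (a textbook sketch, without the new representations or the perturbation bookkeeping analyzed in the paper):

    import numpy as np

    def dstqds(d, l, tau):
        """Given T = L D L^T with D = diag(d) and unit lower bidiagonal L
        (subdiagonal l), return (d_plus, l_plus) representing
        L+ D+ (L+)^T = L D L^T - tau*I. No safeguards against element
        growth or zero pivots."""
        n = len(d)
        d_plus, l_plus = np.empty(n), np.empty(n - 1)
        s = -tau
        for i in range(n - 1):
            d_plus[i] = d[i] + s
            l_plus[i] = d[i] * l[i] / d_plus[i]
            s = l_plus[i] * l[i] * s - tau
        d_plus[n - 1] = d[n - 1] + s
        return d_plus, l_plus

    # Check: with L = np.eye(3) + np.diag(l, -1) and Lp built from l_plus,
    # Lp @ np.diag(d_plus) @ Lp.T equals L @ np.diag(d) @ L.T - tau*np.eye(3).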

[7] T. Auckenthaler, V. Blum, H.-J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations. Parallel Comput., 37(12):783--794, December 2011. [ DOI ]
The computation of selected eigenvalues and eigenvectors of a symmetric (Hermitian) matrix is an important subtask in many contexts, for example in electronic structure calculations. If a significant portion of the eigensystem is required, then direct eigensolvers are typically used. The central three steps are: reduce the matrix to tridiagonal form, compute the eigenpairs of the tridiagonal matrix, and transform the eigenvectors back. To better utilize memory hierarchies, the reduction may be effected in two stages: full to banded, and banded to tridiagonal. The back transformation of the eigenvectors then also involves two stages. For large problems, the eigensystem calculations can be the computational bottleneck, in particular with large numbers of processors. In this paper we discuss variants of the tridiagonal-to-banded back transformation, improving the parallel efficiency for large numbers of processors as well as the per-processor utilization. We also modify the divide-and-conquer algorithm for symmetric tridiagonal matrices such that it can compute a subset of the eigenpairs at reduced cost. The effectiveness of our modifications is demonstrated with numerical experiments.
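The three-step pipeline, including the subset computation at reduced cost, can be mimicked serially with SciPy (our illustration; the paper's contribution lies in the parallel, two-stage versions of the first and third steps):

    import numpy as np
    from scipy.linalg import hessenberg, eigh_tridiagonal

    rng = np.random.default_rng(0)
    M = rng.standard_normal((200, 200))
    A = M + M.T                            # symmetric test matrix

    # Step 1: reduction to tridiagonal form (the Hessenberg form of a
    # symmetric matrix); Q accumulates the Householder transformations.
    T, Q = hessenberg(A, calc_q=True)
    d, e = np.diag(T), np.diag(T, 1)

    # Step 2: eigenpairs of the tridiagonal matrix -- only the 10 smallest,
    # i.e., the "subset of the eigenpairs" situation from the abstract.
    lam, W = eigh_tridiagonal(d, e, select='i', select_range=(0, 9))

    # Step 3: back transformation of the eigenvectors.
    V = Q @ W
    assert np.allclose(A @ V, V * lam, atol=1e-8)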

[8] Martin Galgon, Lukas Krämer, and Bruno Lang. The FEAST algorithm for large sparse eigenvalue problems. Proc. Appl. Math. Mech., 11(1):747--748, December 2011. [ DOI ]
We consider the FEAST eigensolver introduced by Polizzi in 2009. This solver, tailored to the (partial) solution of large sparse (generalized) eigenproblems, offers good potential for parallelism and showed good robustness in our experiments. We briefly introduce the algorithm, point out some problems, and give two examples.

[9] T. Auckenthaler, H.-J. Bungartz, T. Huckle, L. Krämer, B. Lang, and P. Willems. Developing algorithms and software for the parallel solution of the symmetric eigenvalue problem. J. Comput. Sci., 2(3):272--278, August 2011. [ DOI ]
Nowadays, the development, maintenance, and ongoing adaptation of simulation software to new algorithmic or hardware developments are highly complex tasks involving large teams, often from different groups and disciplines and located at different places. This requires an increased use of methods and tools from software engineering. At the same time, the high computational demands of the application fields make it necessary to optimize the modules for code performance and scalability in order to fully exploit the potential of modern parallel architectures.

This paper presents a case study on the ongoing endeavor of improving and developing library software for the parallel computation of eigenvalues for dense symmetric matrices, driven by fields of application such as quantum chemistry. A widespread approach is to, first, transform the matrix to tridiagonal form and, second, to solve the tridiagonal eigenvalue problem, before a back transformation provides the eigenvectors of the original matrix. For overall performance, each of these steps must be optimized in a specific way with respect to numerical and parallel efficiency, which shows the importance of involving different experts and of designing the parallel eigensolver in a modular way. Optimizations for the reduction and the back transformation are discussed in this paper, including numerical results demonstrating their effectiveness.

[10] Paul R. Willems and Bruno Lang. Block factorizations and qd-type transformations for the MR³ algorithm. Electron. Trans. Numer. Anal., 38:363--400, 2011. [ .pdf ]
Factorizing symmetric tridiagonal matrices and propagating the factorizations to shifted matrices are central tasks in the MR³ algorithm for computing partial eigensystems. In this paper we propose block bidiagonal factorizations LDL* with 1×1 and 2×2 blocks in D as an alternative to the bidiagonal and twisted factorizations used hitherto. With block factorizations, the element growth can be reduced (or avoided altogether), which is essential for the success of the MR³ algorithm, in particular if the latter is used to determine the singular value decomposition of bidiagonal matrices. We show that the qd algorithm used for shifting bidiagonal factorizations, e.g., LDL* - τI =: L₊D₊(L₊)*, can be extended to work with blocks in a mixed stable way, including criteria for determining a suitable block structure dynamically.
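The element growth that block factorizations suppress is visible on a tiny example: with a near-zero leading pivot, the scalar LDL* factors blow up, while a factorization with 1×1 and 2×2 blocks in D stays bounded. The sketch below uses SciPy's dense Bunch--Kaufman routine scipy.linalg.ldl as a stand-in; it is not the paper's tridiagonal-specific algorithm with dynamic block-structure criteria.

    import numpy as np
    from scipy.linalg import ldl

    # Tridiagonal with a near-zero (1,1) entry: scalar LDL^T produces
    # l1 = 1/1e-14 = 1e14 and d2 ~ -1e14, i.e., massive element growth.
    T = np.array([[1e-14, 1.0, 0.0],
                  [1.0,   2.0, 1.0],
                  [0.0,   1.0, 3.0]])
    l1 = T[1, 0] / T[0, 0]                 # ~ 1e14
    d2 = T[1, 1] - l1 * T[1, 0]            # ~ -1e14

    # Block factorization: a 2x2 pivot (or a symmetric permutation) absorbs
    # the tiny entry, and the factors remain of order one.
    L, D, perm = ldl(T, lower=True)
    assert np.allclose(L @ D @ L.T, T)
    print(np.abs(L).max(), np.abs(D).max())  # modest, no 1e14 growth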