[1] Karsten Kahl and Bruno Lang. Hypergraph edge elimination---a symbolic phase for hermitian eigensolvers based on rank-1 modifications. Electron. Trans. Numer. Anal., 54:51--67, 2021. [ DOI ]
It is customary to identify sparse matrices with the corresponding adjacency or incidence graphs. For the solution of a linear system of equations using Gaussian elimination, the representation by its adjacency graph allows a symbolic factorization that can be used to predict memory footprints and enables the determination of near-optimal elimination orderings based on heuristics. The Hermitian eigenvalue problem on the other hand seems to evade such treatment at first glance due to its inherent iterative nature. In this paper we prove this assertion wrong by revealing a tight connection of Hermitian eigensolvers based on rank-1 modifications with a symbolic edge elimination procedure. A symbolic calculation based on the incidence graph of the matrix can be used in analogy to the symbolic phase of Gaussian elimination to develop heuristics which reduce memory footprint and computations. Yet, we also show that the question of an optimal elimination strategy remains NP-complete, in analogy to the linear systems case.

[2] Martin Puschmann, Daniel Hernangómez-Pérez, Bruno Lang, Soumya Bera, and Ferdinand Evers. Quartic multifractality and finite-size corrections at the spin quantum Hall transition. Phys. Rev. B, 103:235167:1--16, 2021. [ DOI ]
The spin quantum Hall transition (or class C transition in two dimensions) represents one of the few localization-delocalization transitions for which some of the critical exponents are known exactly. Not known, however, is the multifractal spectrum τq, which describes the system-size scaling of inverse participation ratios Pq, i.e., the q moments of critical wave-function amplitudes. We here report simulations based on the class C Chalker-Coddington network and demonstrate that τq is (essentially) a quartic polynomial in q. Analytical results fix all prefactors except the quartic curvature that we obtain as γ= (2.22 0.15) ×10−3. In order to achieve the necessary accuracy in the presence of sizable corrections to scaling, we have analyzed the evolution with system size of the entire Pq-distribution function. As it turns out, in a sizable window of q values this distribution function exhibits a (single-parameter) scaling collapse already in the preasymptotic regime, where finite-size corrections are not negligible. This observation motivates us to propose new, original approach for extracting τq based on concepts borrowed from the Kolmogorov-Smirnov test of mathematical statistics. We believe that our work provides the conceptual means for high-precision investigations of multifractal spectra also near other localization-delocalization transitions of current interest, especially the integer (class A) quantum Hall effect.

[3] Matthias Stosiek, Bruno Lang, and Ferdinand Evers. Self-consistent-field ensembles of disordered Hamiltonians: Efficient solver and application to superconducting films. Phys. Rev. B, 101(14):144503:1--144503:11, April 2020. [ DOI ]
Our general interest is in self-consistent-field (scf) theories of disordered fermions. They generate physically relevant subensembles (“scf ensembles”) within a given Altland-Zirnbauer class. We are motivated to investigate such ensembles (i) by the possibility to discover new fixed points due to (long-range) interactions; (ii) by analytical scf theories that rely on partial self-consistency approximations awaiting a numerical validation; and (iii) by the overall importance of scf theories for the understanding of complex interaction-mediated phenomena in terms of effective single-particle pictures. In this paper we present an efficient, parallelized implementation solving scf problems with spatially local fields by applying a kernel-polynomial approach. Our first application is the Boguliubov-deGennes theory of the attractive-U Hubbard model in the presence of on-site disorder; the sc fields are the particle density n(r) and the gap function Δ(r). For this case, we reach system sizes unprecedented in earlier work. They allow us to study phenomena emerging at scales substantially larger than the lattice constant, such as the interplay of multifractality and interactions or the formation of superconducting islands. For example, we observe that the coherence length exhibits a nonmonotonic behavior with increasing disorder strength already at moderate U. With respect to methodology our results are important because we establish that partial self-consistency (“energy-only”) schemes as typically employed in analytical approaches tend to miss qualitative physics such as island formation.

[4] Lars Karlsson, Daniel Kressner, and Bruno Lang. Optimally packed chains of bulges in multishift QR algorithms. ACM Trans. Math. Software, 40(2):12:1--12:15, February 2014. [ DOI ]
The QR algorithm is the method of choice for computing all eigenvalues of a dense nonsymmetric matrix A. After an initial reduction to Hessenberg form, a QR iteration can be viewed as chasing a small bulge from the top left to the bottom right corner along the subdiagonal of A. To increase data locality and create potential for parallelism, modern variants of the QR algorithm perform several iterations simultaneously, which amounts to chasing a chain of several bulges instead of a single bulge. To make effective use of level 3 BLAS, it is important to pack these bulges as tightly as possible within the chain. In this work, we show that the tightness of the packing in existing approaches is not optimal and can be increased. This directly translates into a reduced chain length by 33% compared to the state-of-the-art LAPACK implementation of the QR algorithm. To demonstrate the impact of our idea, we have modified the LAPACK implementation to make use of the optimal packing. Numerical experiments reveal a uniform reduction of the execution time, without affecting stability or robustness.

[5] Tina Erica Odaka, Vladen V. Melnikov, Per Jensen, Tsuneo Hirano, Bruno Lang, and Peter Langer. Theoretical study of the double Renner effect for A 2Π MgNC/MgCN: Higher excited rovibrational states. J. Chem. Phys., 126(9):094301, March 2007. [ DOI ]
We report here the implementation of a newly developed, highly efficient matrix diagonalization routine in the DR program [T. E. Odaka, P. Jensen, and T. Hirano, J. Mol. Structure 795, 14 (2006)]. The DR program solves the rovibronic Schrödinger equation for a triatomic molecule with a double Renner effect, i.e., with two accessible linear arrangements of the nuclei at which the electronic energy is doubly degenerate. With the new routines, we can extend the DR calculations of rovibronic energies for A2 Π MgNC/MgCN by considering a much larger set of rovibronic states, in particular states at higher J values, than we were able to access previously.

[6] Paul R. Willems, Bruno Lang, and Christof Vömel. Computing the bidiagonal SVD using multiple relatively robust representations. SIAM J. Matrix Anal. Appl., 28(4):907--926, 2006. [ DOI ]
We describe the design and implementation of a new algorithm for computing the singular value decomposition of a real bidiagonal matrix. This algorithm uses ideas developed by Großer and Lang that extend Parlett's and Dhillon's MRRR algorithm for the tridiagonal symmetric eigenproblem. One key feature of our new implementation is, that k singular triplets can be computed using only O(nk) storage units and floating point operations, where n is the dimension of the matrix. The algorithm will be made available as routine xBDSCR in the upcoming new release of the LAPACK library.

[7] Bruno Lang. Out-of-core solution of large symmetric eigenproblems. Z. angew. Math. Mech., 80:S 809--S 810, 2000.
This paper describes a prototype implementation of a solver for dense symmetric eigenproblems, which are too large to fit into the main memory.

[8] Benedikt Großer and Bruno Lang. Using pentangular factorizations for the reduction to banded form. In P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé, L. Giraud, and D. Ruiz, editors, EuroPar'99 Parallel Processing, volume 1685 of LNCS, pages 1088--1095, Berlin, 1999. Springer-Verlag. [ DOI ]
Most methods for computing the singular value decomposition (SVD) first bidiagonalize the matrix. The ScaLAPACK implementation of the blocked reduction of a general dense matrix to bidiagonal form performs about one half of the operations with BLAS3. If we subdivide the task into two stages dense ->banded and banded -> bidiagonal, we can increase the portion of matrix-matrix operations and expect higher performance. We give an overview of different techniques for the first stage.

This note summarizes the results of  [B. Großer. Parallele zweistufige Verfahren zur Reduktion auf Bidiagonalgestalt. Diplomarbeit, Fachbereich Mathematik, Bergische Universität GH Wuppertal, 1997], [B. Großer, B. Lang. Efficient parallel reduction to bidiagonal form. TR BUGHW-SC98/2, Fachbereich Mathematik, Bergische Universität GH Wuppertal, 1998].

[9] Bruno Lang. Using level 3 BLAS in rotation-based algorithms. SIAM J. Sci. Comput., 19(2):626--634, March 1998. [ DOI ]
This paper presents a technique that allows using level 3 BLAS in a number of rotation-based algorithms. In particular, the update of an orthogonal transformation matrix which often involves the vast majority of operations can be done with a matrix--matrix product. As a case study, the technique is applied to the QR and QL algorithms for computing the eigensystem of a symmetric tridiagonal matrix. The modifications do not affect the convergence properties of the algorithms nor do they significantly increase the overall number of operations. Thus, the computations can be sped up by more than 50% on machines with a distinct memory hierarchy, like the Intel i860 or IBM RS/6000, provided the block size is set appropriately. We also present a simple theoretical analysis that allows selecting an almost-optimal block size.

[10] Bruno Lang. Effiziente Orthogonaltransformationen bei der Eigen- und Singulärwertberechnung. Habilitationsschrift, Fachbereich Mathematik, Bergische Universität GH Wuppertal, Germany, 1997.
---

[11] Bruno Lang. Using level 3 BLAS in the QR algorithm. Z. angew. Math. Mech., 77(S2):S 607--S 608, 1997. [ DOI ]
We present a technique that allows speeding up the QR algorithms for symmetric tridiagonal and Hessenberg matrices by doing most of the computations in matrix-matrix products. The reduced memory traffic leads to increased performance.

[12] Bruno Lang. Parallel reduction of banded matrices to bidiagonal form. Parallel Comput., 22(1):1--18, January 1996. [ DOI ]
A parallel algorithm for reducing banded matrices to bidiagonal form is presented. In contrast to the rotation-based “standard approach”, our algorithm is based on Householder transforms, therefore exhibiting considerably higher data locality (BLAS level 2 instead of level 1). The update of the transformation matrices which involves the vast majority of the operations can even be blocked to allow the use of level 3 BLAS. Thus, our algorithm will outperform the standard method on a serial computer with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.

[13] Bruno Lang. Reduction of banded matrices to bidiagonal form. Z. angew. Math. Mech., 76(Supplement 1: ICIAM/GAMM 95 Proceedings):155--158, 1996. [ DOI ]
We present a parallel algorithm for reducing banded matrices to bidiagonal form. This reduction is a major step in the computation of singular values and singular vectors. In contrast to the rotation-based “standard approach”, our algorithm is based on Householder transforms, therefore exhibiting considerably better data locality (BLAS level 2 instead of level 1). The update of the transformation matrix, the most expensive part of the reduction, can be blocked to allow the use of level 3 BLAS. Thus, even a serial implementation of our algorithm will outperform the standard method on a machine with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.

[14] Christian H. Bischof, Bruno Lang, and Xiaobai Sun. Parallel tridiagonalization through two-step band reduction. In Proceedings of the Scalable High-Performance Computing Conference, Knoxville, Tennessee, May 23--25, 1994, pages 23--27, Los Alamitos, CA, 1994. IEEE Computer Society. [ DOI ]
We present a two-step variant of the “successive band reduction” paradigm for the tridiagonalization of symmetric matrices. Here we first reduce a full matrix to narrow-banded form, and from there to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations, which leads to an algorithm that is almost entirely based on BLAS-3 kernels, and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone Delta prototype and the IBM SP/1.

[15] Bruno Lang. Parallele Berechnung von Eigensystemen symmetrischer Bandmatrizen. Z. angew. Math. Mech., 74(6):T541--T544, 1994. [ DOI ]
Es wird ein Verfahren zur Berechnung von Eigenwerten und -vektoren symmetrischer Bandmatrizen auf Parallelrechnern vorgestellt. Zunächst wird die Matrix auf Tridiagonalgestalt reduziert, dann werden mit einem beschleunigten Bisektionsverfahren die Eigenwerte und anschließend die Eigenvektoren berechnet. Das vorgestellte Verfahren ermöglicht in allen drei Phasen die Verteilung der Rechenarbeit auf mehrere Prozessoren und zusätzlich den Einsatz eventuell vorhandener Vektorhardware in den einzelnen Prozessoren. Die Effizienz des Verfahrens wird durch Messungen auf dem iPSC/860 Hypercube-Parallelrechner belegt.

[16] Bruno Lang. A parallel algorithm for reducing symmetric banded matrices to tridiagonal form. SIAM J. Sci. Comput., 14(6):1320--1338, November 1993. [ DOI ]
An algorithm is presented for reducing symmetric banded matrices to tridiagonal form via Householder transformations. The algorithm is numerically stable and is well suited to parallel execution on distributed memory multiple instruction multiple data (MIMD) computers. Numerical experiments on the iPSC/860 hypercube show that the new method yields nearly full speedup if it is run on multiple processors. In addition, even on a single processor the new method usually will be several times faster than the corresponding EISPACK and LINPACK routines.

[17] Bruno Lang. Reducing symmetric banded matrices to tridiagonal form---A comparison of a new parallel algorithm with two serial algorithms on the iPSC/860. In L. Bougé, M. Cosnard, Y. Robert, and D. Trystram, editors, Parallel Processing: CONPAR 92--VAPP V --- Proceedings Second Joint International Conference on Vector and Parallel Processing, Lyon, France, September 1992, volume 634 of LNCS, pages 271--282, Berlin, 1992. Springer-Verlag. [ DOI ]
We compare three algorithms for reducing symmetric banded matrices to tridiagonal form and evaluate their performance on the Intel iPSC/860 hypercube parallel computer. Two of these algorithms, the routines BANDR and _SBTRD from the EISPACK and LAPACK libraries, resp., are serial algorithms with little potential for coarse grain parallelism. The third one, called SBTH, is a new parallel algorithm. Results on the iPSC/860 hypercube show that this new method may be several times faster than the others, even when run on a single processor. In addition, execution on multiple processors yields nearly full speedup.

[18] Bruno Lang. Parallele Reduktion symmetrischer Bandmatrizen auf Tridiagonalgestalt. Dissertation, Fakultät für Mathematik, Universität Karlsruhe, Germany, 1991.
---