[1] Martin Galgon, Lukas Krämer, Bruno Lang, Andreas Alvermann, Holger Fehske, Andreas Pieper, Georg Hager, Moritz Kreutzer, Faisal Shahzad, Gerhard Wellein, Achim Basermann, Melven Röhrig-Zöllner, and Jonas Thies. Improved coefficients for polynomial filtering in ESSEX. In Tetsuya Sakurai, Shao-Liang Zhang, Toshiyuki Imamura, Yusaku Yamamoto, Yoshinobu Kuramashi, and Takeo Hoshi, editors, Eigenvalue Problems: Algorithms, Software and Applications, in Petascale Computing. Proc. EPASA 2015, Tsukuba, Japan, September 2015, volume 117 of LNCSE. Springer International Publishing, 2017. To appear.
The ESSEX project is an ongoing effort to provide exascale-enabled sparse eigensolvers, especially for quantum physics and related application areas. In this paper we first briefly summarize some key achievements that have been made within this project. Then we focus on a projection-based eigensolver with polynomial approximation of the projector. This eigensolver can be used for computing hundreds of interior eigenvalues of large sparse matrices. We describe techniques that allow using lower-degree polynomials than possible with standard Chebyshev expansion of the window function and kernel smoothing. With these polynomials, the degree, and thus the number of matrix-vector multiplications, typically can be reduced by roughly one half, resulting in comparable savings in runtime.

[2] Andreas Pieper, Moritz Kreutzer, Andreas Alvermann, Martin Galgon, Holger Fehske, Georg Hager, Bruno Lang, and Gerhard Wellein. High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations. J. Comput. Phys., 325:226-243, 2016. [ http ]
We study Chebyshev filter diagonalization as a tool for the computation of many interior eigenvalues of very large sparse symmetric matrices. In this technique the subspace projection onto the target space of wanted eigenvectors is approximated with filter polynomials obtained from Chebyshev expansions of window functions. After the discussion of the conceptual foundations of Chebyshev filter diagonalization we analyze the impact of the choice of the damping kernel, search space size, and filter polynomial degree on the computational accuracy and effort, before we describe the necessary steps towards a parallel high-performance implementation. Because Chebyshev filter diagonalization avoids the need for matrix inversion it can deal with matrices and problem sizes that are presently not accessible with rational function methods based on direct or iterative linear solvers. To demonstrate the potential of Chebyshev filter diagonalization for large-scale problems of this kind we include as an example the computation of the 102 innermost eigenpairs of a topological insulator matrix with dimension 109 derived from quantum physics applications.

[3] Moritz Kreutzer, Jonas Thies, Andreas Pieper, Andreas Alvermann, Martin Galgon, Melven Röhrig-Zöllner, Faisal Shahzad, Achim Basermann, Alan R. Bishop, Holger Fehske, Georg Hager, Bruno Lang, and Gerhard Wellein. Performance engineering and energy efficiency of building blocks for large, sparse eigenvalue computations on heterogeneous supercomputers. In Hans-Joachim Bungartz, Philipp Neumann, and Wolfgang E. Nagel, editors, Software for Exascale Computing - SPPEXA 2013-2015, volume 113 of LNCSE, pages 317-338. Springer, Switzerland, 2016.
Numerous challenges have to be mastered as applications in scientific computing are being developed for post-petascale parallel systems. While ample parallelism is usually available in the numerical problems at hand, the efficient use of supercomputer resources requires not only good scalability but also a verifiably effective use of resources on the core, the processor, and the accelerator level. Furthermore, power dissipation and energy consumption are becoming further optimization targets besides time-to-solution. Performance Engineering (PE) is the pivotal strategy for developing effective parallel code on all levels of modern architectures. In this paper we report on the development and use of low-level parallel building blocks in the GHOST library (“General, Hybrid, and Optimized Sparse Toolkit”). We demonstrate the use of PE in optimizing a density of states computation using the Kernel Polynomial Method, and show that reduction of runtime and reduction of energy are literally the same goal in this case. We also give a brief overview of the capabilities of GHOST and the applications in which it is being used successfully.

[4] Jonas Thies, Martin Galgon, Faisal Shahzad, Andreas Alvermann, Moritz Kreutzer, Andreas Pieper, Melven Röhrig-Zöllner, Achim Basermann, Holger Fehske, Georg Hager, Bruno Lang, and Gerhard Wellein. Towards an exascale enabled sparse solver repository. In Hans-Joachim Bungartz, Philipp Neumann, and Wolfgang E. Nagel, editors, Software for Exascale Computing - SPPEXA 2013-2015, volume 113 of LNCSE, pages 295-316. Springer, Switzerland, 2016.
As we approach the exascale computing era, disruptive changes in the software landscape are required to tackle the challenges posed by manycore CPUs and accelerators. We discuss the development of a new `exascale enabled’ sparse solver repository (the ESSR) that addresses these challenges-from fundamental design considerations and development processes to actual implementations of some prototypical iterative schemes for computing eigenvalues of sparse matrices. Key features of the ESSR include holistic performance engineering, tight integration between software layers and mechanisms to mitigate hardware failures.

[5] Lukas Krämer and Bruno Lang. Convergence of integration based methods for the solution of standard and generalized hermitian eigenvalue problems. Preprint BUW-IMACM 14/30, 2016. [ .pdf ]
Recently, methods based on spectral projection and numerical integration came up in the literature as candidates for reliable high performance eigenvalue solvers. The key ingredients of this type of eigenvalue solver are a Rayleigh-Ritz process and a routine to compute an approximation to the desired eigenspace. The latter computation can be performed by numerical integration of the resolvent. In this article we investigate the progress of the Rayleigh-Ritz process and the achievable quality of the computed eigenpairs for the case that an upper bound for the normwise difference between the currently used subspace and the desired eigenspace is available. Then, such bounds are derived for the Gauß-Legendre rule and the trapezoidal rule.

[6] Martin Galgon, Lukas Krämer, Jonas Thies, Achim Basermann, and Bruno Lang. On the parallel iterative solution of linear systems arising in the FEAST algorithm for computing inner eigenvalues. Parallel Comput., 49:153-163, 2015.
Methods for the solution of sparse eigenvalue problems that are based on spectral projectors and contour integration have recently attracted more and more attention. Such methods require the solution of many shifted linear systems of full size. In most of the literature concerning these eigenvalue solvers, only few words are said on the solution of the linear systems, but they turn out to be very hard to solve by iterative linear solvers in practice. In this work we identify a row projection method for the solution of the inner linear systems encountered in the FEAST algorithm and introduce a novel hybrid parallel and fully iterative implementation of the eigenvalue solver. Our approach ultimately aims at achieving extreme parallelism by exploiting the algorithm's potential on several levels. We present numerical examples where graphene modeling is one of the target applications. In this application, several hundred or even thousands of eigenvalues from the interior of the spectrum are required, which is a big challenge for state-of-the-art numerical methods.

[7] Martin Galgon, Lukas Krämer, and Bruno Lang. Adaptive choice of projectors in projection based eigensolvers. Preprint BUW-IMACM 15/07, 2015. [ .pdf ]
We present a general framework for algorithms for the solution of Hermitian eigenvalue problems, where eigenvalues in a given interval are sought. One instance of this framework is Polizzi's FEAST algorithm [E. Polizzi: Density-matrix-based algorithm for solving eigenvalue problems. Phys. Rev. B 2009; 79:115112], which is based on numerical integration. Another instance is based on polynomial approximation. We propose adaptive strategies for the choice of the polynomial degree and tolerance of the linear solver, respectively. Numerical experiments reveal that these strategies are able to improve the robustness of the resulting methods, while at the same time achieving near-to-optimum efficiency.

[8] Lars Karlsson, Daniel Kressner, and Bruno Lang. Optimally packed chains of bulges in multishift QR algorithms. ACM Trans. Math. Software, 40(2):12:1-12:15, February 2014.
The QR algorithm is the method of choice for computing all eigenvalues of a dense nonsymmetric matrix A. After an initial reduction to Hessenberg form, a QR iteration can be viewed as chasing a small bulge from the top left to the bottom right corner along the subdiagonal of A. To increase data locality and create potential for parallelism, modern variants of the QR algorithm perform several iterations simultaneously, which amounts to chasing a chain of several bulges instead of a single bulge. To make effective use of level 3 BLAS, it is important to pack these bulges as tightly as possible within the chain. In this work, we show that the tightness of the packing in existing approaches is not optimal and can be increased. This directly translates into a reduced chain length by 33% compared to the state-of-the-art LAPACK implementation of the QR algorithm. To demonstrate the impact of our idea, we have modified the LAPACK implementation to make use of the optimal packing. Numerical experiments reveal a uniform reduction of the execution time, without affecting stability or robustness.

[9] A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H.-J. Bungartz, and H. Lederer. The ELPA library: Scalable parallel eigenvalue solutions for electronic structure theory and computational science. J. Phys.: Condens. Matter, 26(21):213201, May 2014.
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N3) with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding backtransformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMPI implementation available as well. Scalability beyond 10000 CPU cores for problem sizes arising in the field of electronic structure theory is demonstrated for current high-performance computer architectures such as Cray or Intel/Infiniband. For a matrix of dimension 260000, scalability up to 295000 CPU cores has been shown on BlueGene/P.

[10] Andreas Alvermann, Achim Basermann, Holger Fehske, Martin Galgon, Georg Hager, Moritz Kreutzer, Lukas Krämer, Bruno Lang, Andreas Pieper, Melven Röhrig-Zöllner, Faisal Shahzad, Jonas Thies, and Gerhard Wellein. ESSEX: Equipping sparse solvers for exascale. In Luís Lopes et al., editors, Euro-Par 2014: Parallel Processing Workshops, volume 8806 of LNCS, pages 577-588. Springer, 2014.
The ESSEX project investigates computational issues arising at exascale for large-scale sparse eigenvalue problems and develops programming concepts and numerical methods for their solution. The project pursues a coherent co-design of all software layers where a holistic performance engineering process guides code development across the classic boundaries of application, numerical method, and basic kernel library. Within ESSEX the numerical methods cover widely applicable solvers such as classic Krylov, Jacobi-Davidson, or the recent FEAST methods, as well as domain-specific iterative schemes relevant for the ESSEX quantum physics application. This report introduces the project structure and presents selected results which demonstrate the potential impact of ESSEX for efficient sparse solvers on highly scalable heterogeneous supercomputers.

[11] Martin Galgon, Lukas Krämer, Bruno Lang, Andreas Alvermann, Holger Fehske, and Andreas Pieper. Improving robustness of the FEAST algorithm and solving eigenvalue problems from graphene nanoribbons. Proc. Appl. Math. Mech., 14(1):821-822, December 2014.
We consider the FEAST eigensolver, introduced by Polizzi in 2009 [5]. We describe an improvement concerning the reliability of the algorithm and discuss an application in the solution of eigenvalue problems from graphene modeling.

[12] Paul R. Willems and Bruno Lang. A framework for the MR3 algorithm: Theory and implementation. SIAM J. Sci. Comput., 35(2):A740-A766, 2013.
This paper provides a streamlined and modular presentation of the MR3 algorithm for computing selected eigenpairs of symmetric tridiagonal matrices, thus disentangling the principles driving MR3 and the (recursive) “core” algorithm from the specific (e.g., twisted) decompositions used to represent the matrices at different recursion depths and from the (dqds) transformations converting between them. Our approach allows a modular full proof for the correctness of the MR3 algorithm. This proof is based on five requirements concerning the interplay between the core algorithm and its subcomponents. These requirements can also guide in implementing the algorithm, because they expose quantities that can and should be monitored at runtime. Our new implementation XMR, which is based on the above analysis, is described and compared to xSTEMR from Lapack 3.2.2. Numerical experiments comparing the robustness and performance of both implementations are given.

[13] Lukas Krämer, Edoardo Di Napoli, Martin Galgon, Bruno Lang, and Paolo Bientinesi. Dissecting the FEAST algorithm for generalized eigenproblems. J. Comput. Appl. Math., 244:1-9, May 2013.
We analyze the FEAST method for computing selected eigenvalues and eigenvectors of large sparse matrix pencils. After establishing the close connection between FEAST and the well-known Rayleigh-Ritz method, we identify several critical issues that influence convergence and accuracy of the solver: the choice of the starting vector space, the stopping criterion, how the inner linear systems impact the quality of the solution, and the use of FEAST for computing eigenpairs from multiple intervals. We complement the study with numerical examples, and hint at possible improvements to overcome the existing problems.

[14] Andreas Marek, Volker Blum, Rainer Johanni, Ville Havu, Bruno Lang, Thomas Auckenthaler, Alexander Heinecke, Hans-Joachim Bungartz, and Hermann Lederer. The ELPA library - scalable parallel eigenvalue solutions for electronic structure theory and computational science. Scientific highlight of the month, Ψk (http://www.psi-k.org), December 2013.
[15] Elke Just and Bruno Lang. Success-guided selection of expanded systems for result-verifying nonlinear solvers. Reliab. Comput., 16:73-83, March 2012.
Most result-verifying solvers for nonlinear systems include a branch-and-bound algorithmic backbone, complemented with accelerating methods such as Constraint Propagation, Interval Newton, and many others. Often the effectiveness of the accelerators can be improved if one works not only with the given system but also with expanded systems, which are obtained by introducing additional variables for suitable subterms. While reducing the overall number of boxes that must be considered during the solution, applying the accelerators to the (often much larger) expanded systems can increase significantly the time spent on a single box. Therefore a good strategy for deciding which methods to apply to which expanded system is essential for the performance and robustness of the whole solver.

We propose a heuristic that selects expanded systems based on performance information gathered during the computations. Numerical experiments show that this new dynamical strategy tends to be superior to a fixed small-to-large traversal (with restarts) of the expanded systems, which has been the default strategy in our nonlinear solver and optimizer SONIC.

[16] Paul R. Willems and Bruno Lang. The MR3-GK algorithm for the bidiagonal SVD. Electron. Trans. Numer. Anal., 39:1-21, 2012.
Determining the singular value decomposition of a bidiagonal matrix is a frequent subtask in numerical computations. We shed new light on a long-known way to utilize the algorithm of multiple relatively robust representations, MR3, for this task by casting the singular value problem in terms of a suitable tridiagonal symmetric eigenproblem (via the Golub-Kahan matrix). Just running MR3 “as is” on the tridiagonal problem does not work, as has been observed before (e.g., B. Großer and B. Lang, An O(n2) algorithm for the bidiagonal SVD, Linear Algebra Appl., 358 (2003), pp. 45-70). In this paper we give more detailed explanations for the problems with running MR3 as a black box solver on the Golub-Kahan matrix. We show that, in contrast to standing opinion, MR3 can be run safely on the Golub-Kahan matrix, with just a minor modification. A proof including error bounds is given for this claim.

[17] Paul R. Willems and Bruno Lang. Twisted factorizations and qd-type transformations for the MR3 algorithm-new representations and analysis. SIAM J. Matrix Anal. Appl., 33(2):523-553, 2012.
Bidiagonal and twisted factorizations play a prominent role in the MR3 algorithm for computing partial eigensystems of symmetric tridiagonal matrices. Bidiagonal factorizations of shifted symmetric tridiagonals T - τI also underly, at least implicitly, the evaluation of the Sturm count for eigenvalue computations. In this paper we propose new representations (e-rep, Z-rep) for bidiagonal and twisted factorizations relying on other quantities than the usually employed nontrivial entries of the bidiagonal factors (N-rep). We show that the qd algorithms used for shifting the factorizations, e.g., LDL* - τI =: L+D+(L+)*, and for converting between them, can be formulated to work with these representations in a mixed stable way. In fact, the Z-representation achieves lower bounds for the necessary relative perturbations of the inputs and outputs than the traditional N-rep does. To obtain sharp bounds for accumulating perturbations we propose a new notation that extends Higham's [N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA, 2nd ed., 2002] by including second-order terms to provide sharper first-order bounds.

[18] Paul R. Willems and Bruno Lang. Block factorizations and qd-type transformations for the MR3 algorithm. Electron. Trans. Numer. Anal., 38:363-400, 2011.
Factorizing symmetric tridiagonal matrices and propagating the factorizations to shifted matrices are central tasks in the MR3 algorithm for computing partial eigensystems. In this paper we propose block bidiagonal factorizations LDL* with 1 ×1 and 2 ×2 blocks in D as an alternative to the bidiagonal and twisted factorizations used hitherto. With block factorizations, the element growth can be reduced (or avoided altogether), which is essential for the success of the MR3 algorithm, in particular if the latter is used to determine the singular value decomposition of bidiagonal matrices. We show that the qd algorithm used for shifting bidiagonal factorizations, e.g., LDL* - τI =: L+D+(L+)* can be extended to work with blocks in a mixed stable way, including criteria for determining a suitable block structure dynamically.

[19] Xiaojun Chen, Andreas Frommer, and Bruno Lang. Computational existence proofs for spherical t-designs. Numer. Math., 117(2):289-305, February 2011.
Spherical t-designs provide quadrature rules for the sphere which are exact for polynomials up to degree t. In this paper, we propose a computational algorithm based on interval arithmetic which, for given t, upon successful completion will have proved the existence of a t-design with (t+1)2 nodes on the unit sphere S2 3 and will have computed narrow interval enclosures which are known to contain these nodes with mathematical certainty. Since there is no theoretical result which proves the existence of a t-design with (t+1)2 nodes for arbitrary t, our method contributes to the theory because it was tested successfully for t=1, 2, ..., 100. The t-design is usually not unique; our method aims at finding a well-conditioned one. The method relies on computing an interval enclosure for the zero of a highly nonlinear system of dimension (t+1)2. We therefore develop several special approaches which allow us to use interval arithmetic efficiently in this particular situation. The computations were all done using the MATLAB toolbox INTLAB.

[20] T. Auckenthaler, H.-J. Bungartz, T. Huckle, L. Krämer, B. Lang, and P. Willems. Developing algorithms and software for the parallel solution of the symmetric eigenvalue problem. J. Comput. Sci., 2(3):272-278, August 2011.
Nowadays, the development, maintenance, and ongoing adaptation of simulation software due to new algorithmic or hardware developments are highly complex tasks involving larger teams, often from different groups and disciplines, and located at different places. This requires an increased use of methods and tools from software engineering. At the same time, the high computational demands from the fields of application make it necessary to optimize the modules for code performance and scalability in order to fully exploit the potential of modern parallel architectures.

This paper presents a case study on the ongoing endeavor of improving and developing library software for the parallel computation of eigenvalues for dense symmetric matrices, driven by fields of application such as quantum chemistry. A widespread approach is to, first, transform the matrix to tridiagonal form and, second, to solve the tridiagonal eigenvalue problem, before a back transformation provides the eigenvectors of the original matrix. For overall performance, each of these steps must be optimized in a specific way with respect to numerical and parallel efficiency, which shows the importance of involving different experts and of designing the parallel eigensolver in a modular way. Optimizations for the reduction and the back transformation are discussed in this paper, including numerical results demonstrating their effectiveness.

[21] T. Auckenthaler, V. Blum, H.-J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations. Parallel Comput., 37(12):783-794, December 2011.
The computation of selected eigenvalues and eigenvectors of a symmetric (Hermitian) matrix is an important subtask in many contexts, for example in electronic structure calculations. If a significant portion of the eigensystem is required then typically direct eigensolvers are used. The central three steps are: reduce the matrix to tridiagonal form, compute the eigenpairs of the tridiagonal matrix, and transform the eigenvectors back. To better utilize memory hierarchies, the reduction may be effected in two stages: full to banded, and banded to tridiagonal. Then the back transformation of the eigenvectors also involves two stages. For large problems, the eigensystem calculations can be the computational bottleneck, in particular with large numbers of processors. In this paper we discuss variants of the tridiagonal-to-banded back transformation, improving the parallel efficiency for large numbers of processors as well as the per-processor utilization. We also modify the divide-and-conquer algorithm for symmetric tridiagonal matrices such that it can compute a subset of the eigenpairs at reduced cost. The effectiveness of our modifications is demonstrated with numerical experiments.

[22] Martin Galgon, Lukas Krämer, and Bruno Lang. The FEAST algorithm for large sparse eigenvalue problems. Proc. Appl. Math. Mech., 11(1):747-748, December 2011.
We consider the eigensolver named FEAST that was introduced by Polizzi in 2009 [1]. This solver, tailored to the (partial) solution of large sparse (generalized) eigenproblems offers good potential for parallelism and showed good robustness in our experiments. We briefly introduce the algorithm, point out some problems and give two examples.

[23] Thomas Beelitz, Bruno Lang, Peer Ueberholz, and Paul Willems. Closing the case t = 3 for 3-D spherical t-designs using a result-verifying nonlinear solver. Reliab. Comput., 14:66-77, June 2010.
The question if there exists an N-point spherical t-design is not yet settled for all combinations of t and N. Using our framework SONIC for the solution of nonlinear systems, we were able to close the two remaining open cases for t = 3. More precisely, a computational proof revealed that there are no spherical 3-designs with N = 7 or N = 9 points. We describe how these results were obtained and comment on the open cases for larger values of t.

[24] Thomas Beelitz, Andreas Frommer, Bruno Lang, and Paul Willems. A framework for existence tests based on the topological degree and homotopy. Numer. Math., 111(4):493-507, February 2009.
The invariance of the topological degree under certain homotopies is used to derive a framework for tests to computationally prove the existence of zeros of nonlinear mappings in Rn. These tests use interval arithmetic to enclose the range of a function over a box and are provably more general than many other tests like the Moore-Kioustelidis test, a test based on the Krawczyk operator, and another degree-based test published recently. A numerical example is included.

[25] Martin Mönnigmann, Wolfgang Marquardt, Christian H. Bischof, Thomas Beelitz, Bruno Lang, and Paul Willems. A hybrid approach for efficient robust design of dynamic systems. SIAM Rev., 49(2):236-254, 2007.
We propose a novel approach for the parametrically robust design of dynamic systems. The approach can be applied to system models with parameters that are uncertain in the sense that values for these parameters are not known precisely, but only within certain bounds. The novel approach is guaranteed to find an optimal steady state that is stable for each parameter combination within these bounds.

Our approach combines the use of a standard solver for constrained optimization problems with the rigorous solution of nonlinear systems. The constraints for the optimization problems are based on the concept of parameter space normal vectors that measure the distance of a tentative optimum to the nearest known critical point, i.e., a point where stability may be lost. Such normal vectors are derived using methods from Nonlinear Dynamics. After the optimization, the rigorous solver is used to provide a guarantee that no critical points exist in the vicinity of the optimum, or to detect such points. In the latter case, the optimization is resumed, taking the newly found critical points into account. This optimize-and-verify procedure is repeated until the rigorous nonlinear solver can guarantee that the vicinity of the optimum is free from critical points and therefore the optimum is parametrically robust. In contrast to existing design methodologies, our approach can be automated and does not rely on the experience of the designing engineer.

A simple model of a fermenter is used to illustrate the concepts and the order of activities arising in a typical design process.

[26] J. Bloch, A. Frommer, B. Lang, and T. Wettig. An iterative method to compute the sign function of a non-Hermitian matrix and its application to the overlap Dirac operator at nonzero chemical potential. Comput. Phys. Comm., 177(2):933-943, December 2007.
The overlap Dirac operator in lattice QCD requires the computation of the sign function of a matrix. While this matrix is usually Hermitian, it becomes non-Hermitian in the presence of a quark chemical potential. We show how the action of the sign function of a non-Hermitian matrix on an arbitrary vector can be computed efficiently on large lattices by an iterative method. A Krylov subspace approximation based on the Arnoldi algorithm is described for the evaluation of a generic matrix function. The efficiency of the method is spoiled when the matrix has eigenvalues close to a function discontinuity. This is cured by adding a small number of critical eigenvectors to the Krylov subspace, for which we propose two different deflation schemes. The ensuing modified Arnoldi method is then applied to the sign function, which has a discontinuity along the imaginary axis. The numerical results clearly show the improved efficiency of the method. Our modification is particularly effective when the action of the sign function of the same matrix has to be computed many times on different vectors, e.g., if the overlap Dirac operator is inverted using an iterative method.

[27] Christian H. Bischof, H. Martin Bücker, Arno Rasch, Emil Slusanschi, and Bruno Lang. Automatic differentiation of the general-purpose computational fluid dynamics package FLUENT. ASME J. Fluids Engrg., 129(5):652-658, May 2007.
Derivatives are a crucial ingredient to a broad variety of computational techniques in science and engineering. While numerical approaches for evaluating derivatives suffer from truncation error, automatic differentiation is accurate up to machine precision. The term automatic differentiation comprises a set of techniques for mechanically transforming a given computer program to another one capable of evaluating derivatives. A common misconception about automatic differentiation is that this technique only works on local pieces of fairly simple code. Here, it is shown that automatic differentiation is not only applicable to small academic codes, but scales to advanced industrial software packages. In particular, the general-purpose computational fluid dynamics software package FLUENT is transformed by automatic differentiation.

[28] Tina Erica Odaka, Vladen V. Melnikov, Per Jensen, Tsuneo Hirano, Bruno Lang, and Peter Langer. Theoretical study of the double Renner effect for A 2Π MgNC/MgCN: Higher excited rovibrational states. J. Chem. Phys., 126(9):094301, March 2007.
We report here the implementation of a newly developed, highly efficient matrix diagonalization routine in the DR program [T. E. Odaka, P. Jensen, and T. Hirano, J. Mol. Structure 795, 14 (2006)]. The DR program solves the rovibronic Schrödinger equation for a triatomic molecule with a double Renner effect, i.e., with two accessible linear arrangements of the nuclei at which the electronic energy is doubly degenerate. With the new routines, we can extend the DR calculations of rovibronic energies for A2 Π MgNC/MgCN by considering a much larger set of rovibronic states, in particular states at higher J values, than we were able to access previously.

[29] Andreas Frommer and Bruno Lang. Fast and accurate multi-argument interval evaluation of polynomials. In Proc. SCAN 2006, 12th International Symposium on Scientific Computing, Computer Arithmetic, and Validated Numerics, September 26-29, 2006, Duisburg, Germany, page 31. IEEE Computer Society, 2007.
The verification of the existence of certain spherical t-designs involves the evaluation of a degree-t polynomial Jt at a very large number of (interval) arguments. To make the overall verification process feasible computationally, this evaluation must be fast, and the enclosures for the function values must be affected with only modest over-estimation. We discuss several approaches for multi-argument interval evaluation of the polynomial Jt and show how they can be adapted to other polynomials p. One particularly effective new method is based on expanding the polynomial p around several points ξj and truncating each resulting expansion pξ_j to a lower-degree polynomial.

[30] Andreas Frommer, Fatmir Hoxha, and Bruno Lang. Proving the existence of zeros using the topological degree and interval arithmetic. J. Comput. Appl. Math., 199(2):397-402, February 2007.
The invariance of the topological degree under certain homotopies is used to computationally prove the existence of zeros of nonlinear mappings in Rn. These existence tests use interval arithmetic to enclose the range of a function over a box. We show that our test is more general than a well-known test based on Miranda's theorem, and we show by a numerical example that the new test can be successful on substantially larger boxes.

[31] Jacques Bloch, Tilo Wettig, Andreas Frommer, and Bruno Lang. An iterative method to compute the overlap Dirac operator at nonzero chemical potential. In Gunnar Bali et al., editors, Proc. Lattice 2007, XXVth Intl. Symp. Lattice Field Theory, July 30-August 4, 2007, Regensburg, Germany, Proceedings of Science, page 169, Trieste, Italy, 2007. SISSA.
The overlap Dirac operator at nonzero quark chemical potential involves the computation of the sign of a non-Hermitian matrix. In this talk we present an iterative method, first proposed by us in Ref. [1], which allows for an efficient computation of the operator, even on large lattices. The starting point is a Krylov subspace approximation, based on the Arnoldi algorithm, for the evaluation of a generic matrix function. The efficiency of this method is spoiled when the matrix has eigenvalues close to a function discontinuity. To cure this, a small number of critical eigenvectors are added to the Krylov subspace, and two different deflation schemes are proposed in this augmented subspace. The ensuing method is then applied to the sign function of the overlap Dirac operator, for two different lattice sizes. The sign function has a discontinuity along the imaginary axis and the numerical results show how deflation dramatically improves the efficiency of the method.

[32] Ivo Kabadshow and Bruno Lang. Latency-optimized parallelization of the FMM near-field computations. In Yong Shi, Geert Dick van Albada, Jack Dongarra, and Peter M. A. Sloot, editors, Computational Science, Proc. ICCS 2007, 7th International Conference on Computational Science, May 27-30, 2007, Beijing, China, Part I, volume 4487 of LNCS, pages 716-722, Berlin, 2007. Springer-Verlag.
In this paper we present a new parallelization scheme for the FMM near-field. The parallelization is based on the Global Arrays Toolkit and uses one-sided communication with overlapping. It employs a purely static load-balancing approach to minimize the number of communication steps and benefits from a maximum utilization of data locality. In contrast to other implementations the communication is initiated by the process owning the data via a put call, not the process receiving the data (via a get call).

[33] Paul R. Willems, Bruno Lang, and Christof Vömel. Computing the bidiagonal SVD using multiple relatively robust representations. SIAM J. Matrix Anal. Appl., 28(4):907-926, 2006.
We describe the design and implementation of a new algorithm for computing the singular value decomposition of a real bidiagonal matrix. This algorithm uses ideas developed by Großer and Lang that extend Parlett's and Dhillon's MRRR algorithm for the tridiagonal symmetric eigenproblem. One key feature of our new implementation is, that k singular triplets can be computed using only O(nk) storage units and floating point operations, where n is the dimension of the matrix. The algorithm will be made available as routine xBDSCR in the upcoming new release of the LAPACK library.

[34] Tatjana Eitrich and Bruno Lang. Efficient optimization of support vector machine learning parameters for unbalanced datasets. J. Comput. Appl. Math., 196(2):425-436, November 2006.
Support vector machines are powerful kernel methods for classification and regression tasks. If trained optimally, they produce excellent separating hyperplanes. The quality of the training, however, depends not only on the given training data but also on additional learning parameters, which are difficult to adjust, in particular for unbalanced datasets. Traditionally, grid search techniques have been used for determining suitable values for these parameters. In this paper we propose an automated approach to adjusting the learning parameters using a derivative-free numerical optimizer. To make the optimization process more efficient, a new sensitive quality measure is introduced. Numerical tests with a well-known dataset show that our approach can produce support vector machines that are very well tuned to their classification tasks.

[35] Thomas Beelitz, Bruno Lang, and Christian H. Bischof. Efficient task scheduling in the parallel result-verifying solution of nonlinear systems. Reliab. Comput., 12(2):141-151, April 2006.
Nonlinear systems occur in diverse applications, e.g., in the steady state analysis of chemical processes. If safety concerns require the results to be provably correct then result-verifying algorithms relying on interval arithmetic should be used for solving these systems. Since such algorithms are very computationally intensive, the coarse-grained inter-box parallelism should be exploited to make them feasible in practice. In this paper we briefly describe our framework SONIC for the verified solution of nonlinear systems and give detailed information about its parallelization with OpenMP and MPI. Our numerical results show that the implemented parallelization schemes are indeed successful. The more sophisticated MPI implementation seems to be superior to the easy-to-implement OpenMP version and shows almost linear speedup up to a large number of processors.

[36] Tatjana Eitrich and Bruno Lang. On the efficient implementation of a serial and parallel decomposition algorithm for fast support vector machine training including a multi-parameter kernel. Int. J. Comput. Intell., 3(2):91-98, 2006.
This work deals with aspects of support vector machine learning for large-scale data mining tasks. Based on a decomposition algorithm for support vector machine training that can be run in serial as well as shared memory parallel mode we introduce a transformation of the training data that allows for the usage of an expensive generalized kernel without additional costs. We present experiments for the Gaussian kernel, but usage of other kernel functions is possible, too. In order to further speed up the decomposition algorithm we analyze the critical problem of working set selection for large training data sets. In addition, we analyze the influence of the working set sizes onto the scalability of the parallel decomposition scheme. Our tests and conclusions led to several modifications of the algorithm and the improvement of overall support vector machine learning performance. Our method allows for using extensive parameter search methods to optimize classification accuracy.

[37] Tatjana Eitrich and Bruno Lang. Data mining with parallel support vector machines for classification. In Tatyana Yakhno and Erich J. Neuhold, editors, Proc. ADVIS 2006, 4th Intl. Conf. on Advances in Information Systems, October 18-20, 2006, Izmir, Turkey, volume 4243 of LNCS, pages 197-206, Berlin, 2006. Springer-Verlag.
The increasing amount of data used for classification, as well as the demand for complex models with a large number of well tuned parameters, naturally lead to the search for efficient approaches making use of massively parallel systems. We describe the parallelization of support vector machine learning for shared memory systems. The support vector machine is a powerful and reliable data mining method. Our learning algorithm relies on a decomposition scheme, which in turn uses a special variable projection method, for solving the quadratic program associated with support vector machine learning. By using hybrid parallel programming, our parallelization approach can be combined with the parallelism of a distributed cross validation routine and parallel parameter optimization methods.

[38] Tatjana Eitrich and Bruno Lang. On the advantages of weighted L1-norm support vector learning for unbalanced binary classification problems. In Proc. IS'06, 3rd Intl. IEEE Conf. Intelligent Systems, September 4-6, 2006, University of Westminster, UK, pages 575-580. IEEE Computer Society, September 2006.
In this paper we analyze support vector machine classification using the soft margin approach that allows for errors and margin violations during the training stage. Two models for learning the separating hyperplane do exist. We study the behavior of the resulting optimization algorithms in terms of training time and test accuracy for unbalanced data sets. The main goal of our work is to compare the features of the resulting classification functions, which are mainly defined by the support vectors arising during the support vector machine training.

[39] Tatjana Eitrich and Bruno Lang. On the optimal working set size in serial and parallel support vector machine learning with the decomposition algorithm. In Peter Christen, Paul J. Kennedy, Jiuyong Li, Simeon J. Simoff, and Graham J. Williams, editors, Data Mining and Analytics 2006, Proc. Fifth Australasian Data Mining Conference (AusDM2006), November 29-30, 2006, Sydney, Australia, volume 61 of Conferences in Research and Practice in Information Technology (CRPIT), pages 121-128. Australian Computer Society, Inc., 2006.
The support vector machine (SVM) is a well-established and accurate supervised learning method for the classification of data in various application fields. The statistical learning task - the so-called training - can be formulated as a quadratic optimization problem. During the last years the decomposition algorithm for solving this optimization problem became the most frequently used method for support vector machine learning and is the basis of many SVM implementations today. It is characterized by an internal parameter called working set size. Usually small working sets have been assigned. The increasing amount of data used for classification led to new parallel implementations of the decomposition method with efficient inner solvers. With these solvers larger working sets can be assigned. It was shown, that for parallel training with the decomposition algorithm large working sets achieve good speedup values. However, the choice of the optimal working set size for parallel training is not clear. In this paper, we show how the working set size influences the number of decomposition steps, the number of kernel function evaluations and the overall training time in serial and parallel computation.

[40] Tatjana Eitrich, Wolfgang Frings, and Bruno Lang. HyParSVM-a new hybrid parallel software for support vector machine learning on SMP clusters. In Wolfgang E. Nagel, Wolfgang W. Walter, and Wolfgang Lehner, editors, Parallel Processing, Proc. Euro-Par 2006, 12th European Conference on Parallel Computing, August 29-September 1, 2006, Dresden, Germany, volume 4128 of LNCS, pages 350-359. Springer-Verlag, 2006.
In this paper we describe a new hybrid distributed/shared memory parallel software for support vector machine learning on large data sets. The support vector machine (SVM) method is a well-known and reliable machine learning technique for classification and regression tasks. Based on a recently developed shared memory decomposition algorithm for support vector machine classifier design we increased the level of parallelism by implementing a cross validation routine based on message passing. With this extention we obtained a flexible parallel SVM software that can be used on high-end machines with SMP architectures to process the large data sets that arise more and more in bioinformatics and other fields of research.

[41] Tatjana Eitrich, Bruno Lang, and Achim Streit. Customizing the APPSPACK software for parallel parameter tuning of a hybrid parallel support vector machine. In Giuseppe Di Fatta, Michael R. Berthold, and Srinivasan Parthasarathy, editors, Proc. PDM 2006, Workshop on Parallel Data Mining in conjunction with ECML/PKDD 2006, September 18-22, 2006, Berlin, Germany, pages 38-50, 2006.
We consider the problem of tuning parameters for the learning method support vector machine. The APPSPACK software, an asynchronous parameter tuning method, is well suited for SVM parameter fitting due to several characteristics. No derivative information is needed for bound constrained optimization and the code can be run in parallel mode. Recently, a hybrid parallel support vector machine has been proposed. To couple both parallel packages, the APPSPACK software needs to be customized to allow for a parallel function evaluation in addition to the parallelism provided by APPSPACK. In this paper we describe our customization of the APPSPACK software to facilitate a top down parallelism in SVM parameter tuning.

[42] Tatjana Eitrich and Bruno Lang. Parallel cost-sensitive support vector machine software for classification. In Ulrich H. E. Hansmann, Jan Meinke, Sandipan Mohanty, and Olav Zimmermann, editors, Proc. Workshop From Computational Biophysics to Systems Biology, June 06-09, 2006, Jülich, Germany, volume 34 of NIC Series, pages 141-144. John von Neumann Institute for Computing, Jülich, 2006.
-

[43] Wolfgang Marquardt, Martin Mönnigmann, Thomas Beelitz, Bruno Lang, Paul Willems, and Christan H. Bischof. Robust design of dynamic systems. ECMI Newsletter, 39:15, March 2006.
-

[44] Andreas Frommer and Bruno Lang. Thematic section: Result-verifying computing. ECMI Newsletter, 39:5-18, March 2006.
-

[45] C. H. Bischof, H. M. Bücker, B. Lang, A. Rasch, and E. Slusanschi. Efficient and accurate derivatives for a software process chain in airfoil shape optimization. Future Gen. Comput. Syst., 21(8):1333-1344, October 2005.
When using a Newton-based numerical algorithm to optimize the shape of an airfoil with respect to certain design parameters, a crucial ingredient is the derivative of the objective function with respect to the design parameters. In large-scale aerodynamics, this objective function is an output of a computational fluid dynamics program written in a high-level programming language such as Fortran or C.Numerical differentiation is commonly used to approximate derivatives but is subject to truncation and subtractive cancellation errors. For a particular two-dimensional airfoil, we instead apply automatic differentiation to compute accurate derivatives of the lift and drag coefficients with respect to geometric shape parameters. In automatic differentiation, a given program is transformed into another program capable of computing the original function together with its derivatives. In the problem at hand, the objective function consists of a sequence of programs: a MATLAB program followed by two Fortran 77 programs. It is shown how automatic differentiation is applied to a sequence of programs while keeping the computational complexity within reasonable limits. The derivatives computed by automatic differentiation are compared with approximations based on divided differences.

[46] Andreas Frommer and Bruno Lang. Existence tests for solutions of nonlinear equations using Borsuk's theorem. SIAM J. Numer. Anal., 43(3):1348-1361, 2005.
We show how interval arithmetic can be used in connection with Borsuk's theorem to computationally prove the existence of a solution of a system of nonlinear equations. It turns out that this new test, which can be checked computationally in several different ways, is more general than an existing test based on Miranda's theorem in the sense that it is successful for a larger set of situations. A numerical example is included.

[47] Benedikt Großer and Bruno Lang. On symmetric eigenproblems induced by the bidiagonal SVD. SIAM J. Matrix Anal. Appl., 26(3):599-620, 2005.
The relatively robust representations (RRR) algorithm is the method of choice to compute highly accurate eigenvector approximations for symmetric tridiagonal matrices. The task of computing singular vector pairs for a bidiagonal matrix B = UΣVT is closely connected to the RRR algorithm regarding BTB, BBT, or the Golub-Kahan matrix TGK. Nevertheless, separate application of the RRR algorithm to these matrices leads to poor results regarding either numerical orthogonality or the residual BV - UΣ . It turns out that the coupling strategy proposed in [B. Grosser and B. Lang, Linear Algebra Appl., 358 (2003), pp. 45-70] resolves this problem. This article provides the corresponding perturbation theory: We compare the eigenvalues of the separate and coupled decompositions and explain why singular vector pairs approximated via couplings are of superior quality.

[48] Rainer Steiger, Christian H. Bischof, Bruno Lang, and Walter Thiel. Using automatic differentiation to compute derivatives of a quantum-chemical computer program. Future Gen. Comput. Syst., 21(8):1324-1332, October 2005.
The ADIFOR 2.0 tool for Automatic Differentiation of Fortran programs has been used to generate analytic gradient code for all semiempirical SCF methods available in the MNDO97 program. The correctness and accuracy of the new code have been verified. Its performance has been compared with that of hand-coded analytic derivative routines and with numerical differentiation. From a quantum-chemical point of view, the major advance of this work is the development of previously unavailable analytic gradient code for the recently proposed OM1 and OM2 methods.

[49] Tatjana Eitrich and Bruno Lang. Analysis of support vector machine training costs for large and unbalanced data from pharmaceutical industry. In Proc. 1st ICGST Intl. Conf. Artificial Intelligence and Machine Learning (AIML-05), December 19-21, 2005, Cairo, Egypt, pages 58-64. ICGST, 2005.
Modern classification methods are able to analyze large, complex and sometimes also unbalanced data sets. It is important not only to produce good results but also to control the time to achieve them. High computational costs lead to the exclusion of data mining methods even in case of good accuracy. Our paper deals with support vector machines in view of the consumption of CPU time. We study their learning behavior for unbalanced data sets with increasing size. We also examine the question whether it is necessary and practicable to parallelize this method.

[50] Tatjana Eitrich and Bruno Lang. Parallel tuning of support vector machine learning parameters for large and unbalanced data sets. In Michael R. Berthold, Robert Glen, Kai Diederichs, Oliver Kohlbacher, and Ingrid Fischer, editors, Computational Life Sciences: First International Symposium, CompLife 2005, Konstanz, Germany, September 25-27, Proceedings, volume 3695 of LNBI, pages 253-264, Heidelberg, 2005. Springer-Verlag.
We consider the problem of selecting and tuning learning parameters for support vector machines, especially for the classification of large and unbalanced data sets. We show why and how simple models with few parameters should be refined and propose an automated approach for tuning the increased number of parameters in the extended model. Based on a sensitive quality measure we analyze correlations between the number of parameters, the learning time and the performance of the trained SVM in classifying independent test data. In addition we study the influence of the quality measure on the classification performance and compare the behavior of serial and asynchronous parallel parameter tuning on an IBM p690 cluster.

[51] Jens Gräbel, Bruno Lang, and Peer Ueberholz. Performance optimization for the parallel Gauss-Seidel smoother. Proc. Appl. Math. Mech., 5(1):831-832, 2005.
Simulation, e.g., in the field of computational fluid dynamics, accounts for a major part of the computing time on high-performance systems. Many simulation packages still rely on Gauss-Seidel iteration, either as the main linear solver or as a smoother for multigrid schemes. Straight-forward implementations of this solver have efficiency problems on today's most common high-performance computers, i.e., multiprocessor clusters with pronounced memory hierarchies. In this work we present two simple techniques for improving the performance of the parallel Gauss-Seidel method for the 3D Poisson equation by optimizing cache usage as well as reducing the number of communication steps.

[52] Thomas Beelitz, Andreas Frommer, Bruno Lang, and Paul Willems. Symbolic-numeric techniques for solving nonlinear systems. Proc. Appl. Math. Mech., 5(1):705-708, December 2005.
We describe a multilevel technique combining symbolic and numeric methods to obtain guaranteed enclosures for all solutions of a nonlinear system within a given search box.

[53] A. Frommer, B. Lang, and M. Schnurr. A comparison of the Moore and Miranda existence tests. Computing, 72(3-4):349-354, June 2004.
We compare two computational tests for the existence of a zero of a nonlinear system. One of the tests is based on a theorem by Moore, the other relies on Miranda's theorem. It turns out that the Miranda test is always at least as powerful as the Moore test. We also indicate some conditions under which both tests are equivalent.

[54] Thomas Beelitz, Christian Bischof, Bruno Lang, and Klaus Schulte Althoff. Result-verifying solution of nonlinear systems in the analysis of chemical processes. In René Alt, Andreas Frommer, R. Baker Kearfott, and Wolfram Luther, editors, Numerical Software with Result Verification - Proc. Intl. Dagstuhl Seminar, January 19-24, 2003, Dagstuhl Castle, Germany, number 2991 in LNCS, pages 198-205. Springer, Berlin, 2004.
A framework for the verified solution of nonlinear systems arising in the analysis and design of chemical processes is described. The framework combines a symbolic preprocessing step with an interval-based branch-and-bound solver whose efficiency is increased with several acceleration techniques. One of these methods is based on order-2 Taylor expansion; it is also discussed in this note.

[55] Stefanie Krivsky and Bruno Lang. Using interval arithmetic for determining the structure of convex hulls. Numer. Algorithms, 37(1-4):233-240, December 2004.
We discuss the use of interval arithmetic in the computation of the convex hull of n points in D dimensions. Convex hull algorithms rely on simple geometric tests, such as “does some point p lie in a certain half-space or affine subspace?” to determine the structure of the hull. These tests in turn can be carried out by solving appropriate (not necessarily square) linear systems. If interval-based methods are used for the solution of these systems then the correct hull can be obtained at lower cost than with exact arithmetic. In addition, interval-based methods can determine the topological structure of the convex hull even if the position of the points is not known exactly. In the present paper we compare various interval linear solvers with respect to their ability to handle close-to-pathological situations. This property determines how often interval arithmetic cannot provide the required information and therefore some computations must be redone with exact arithmetic.

[56] Thomas Beelitz, Christian H. Bischof, and Bruno Lang. A hybrid subdivision strategy for result-verifying nonlinear solvers. Proc. Appl. Math. Mech., 4:632-633, 2004.
Many different heuristics have been proposed for selecting the subdivision direction in branch-and-bound result-verifying nonlinear solvers. We investigate the impact of the box-splitting techniques on the overall performance of the solver and propose a new approach combining some of the simple heuristics in a hybrid way. Numerical experiments with medium-sized example problems indicate that our approach is successful.

[57] Andreas Frommer and Bruno Lang. On preconditioners for the Borsuk existence test. Proc. Appl. Math. Mech., 4:638-639, 2004.
Existence or fixed point theorems, combined with interval analytic methods, provide a means to computationally prove the existence of a zero of a nonlinear system in a given interval vector. One such test is based on Borsuk's existence theorem. We discuss preconditioning techniques that are aimed at improving the effectiveness of this test.

[58] H. Martin Bücker, Bruno Lang, Hans-Joachim Pflug, and Andre Vehreschild. Threads in an undergraduate course: A Java example illuminating different multithreading approaches. In A. Laganà, M. L. Gavrilova, V. Kumar, Y. Mun, C. J. K. Tan, and O. Gervasi, editors, Computational Science and Its Applications - ICCSA 2004, Proc. Intl. Conf. Computational Science and its Applications, Assisi, Italy, May 14-17, 2004. Part II, number 3044 in LNCS, pages 882-891. Springer, Berlin, 2004.
Multithreading is a fundamental approach to expressing parallelism in programs. Since Java is emerging as the de facto standard language for platform independent software development in higher education, there is need for teaching multithreading in the context of Java. We use a simple problem from scientific computing to explain two different multithreading approaches to second-year students. More precisely, a simple boundary value problem is considered, which makes visible the differences between native Java threads and the OpenMP interface. So, the students are able to appreciate the respective merits and drawbacks of a thread package that is integrated into the Java programming language and an approach combining compiler support with library calls.

[59] Thomas Beelitz, Christian H. Bischof, Bruno Lang, and Paul Willems. SONIC-A framework for the rigorous solution of nonlinear problems. Report BUW-SC 04/7, Fachbereich Mathematik und Naturwissenschaften, Bergische Universität Wuppertal, 2004. [ Preprint ]
SONIC is a new rigorous nonlinear solver and constrained optimizer based on interval computations. Having been developed primarily to support the parametrically robust design of dynamic systems, SONIC can also be used to solve nonlinear problems arising in other application areas. Since its high-level solvers can be coupled with several different basic interval libraries, it can achieve optimized performance while maintaining a high degree of portability. In addition to the serial code, efficient parallel versions for shared and distributed memory architectures are available.

[60] Benedikt Großer and Bruno Lang. An O( n2 ) algorithm for the bidiagonal SVD. Linear Algebra Appl., 358(1-3):45-70, January 2003.
The RRR algorithm computes the eigendecomposition of a symmetric tridiagonal matrix T with an O(n2) complexity. This article discusses how this method can be extended to the bidiagonal SVD B = U ΣVT. It turns out that using the RRR algorithm as a black box to compute BTB=VΣ2VT and BBT=UΣ2UT separately may give poor results for B V - U Σ. The use of the standard Jordan-Wielandt representation can fail as well if clusters of tiny singular values are present. A solution is to work on BTB and to keep factorizations of BBT implicitly. We introduce a set of coupling transformations which allow us to replace the representation u = (1)/(σ) B v by a more stable representation u = X v, where X is a diagonal matrix. Numerical results of our implementation are compared with the LAPACK routines DSTEGR, DBDSQR and DBDSDC.

[61] Christian H. Bischof, H. Martin Bücker, Bruno Lang, and Arno Rasch. An interactive environment for supporting the paradigm shift from simulation to optimization. Scientific Prog., 11(4):263-272, 2003.
Numerical simulation is a powerful tool in science and engineering, and it is also used for optimizing the design of products and experiments rather than only for reproducing the behavior of scientific and engineering systems. In order to reduce the number of simulation runs, the traditional “trial and error” approach for finding near-to-optimum design parameters is more and more replaced with efficient numerical optimization algorithms. Done by hand, the coupling of simulation and optimization software is tedious and error-prone. In this note we introduce a software environment called EFCOSS (Environment For Combining Optimization and Simulation Software) that facilitates and speeds up this task by doing much of the required work automatically. Our framework includes support for automatic differentiation providing the derivatives required by many optimization algorithms. We describe the process of integrating the widely used computational fluid dynamics package FLUENT and a MINPACK-1 least squares optimizer into EFCOSS and follow a sample session solving a data assimilation problem.

[62] Christian H. Bischof, H. Martin Bücker, Bruno Lang, and Arno Rasch. Automated gradient calculation. In Josef Ballmann, editor, Flow Modulation and Fluid-Structure Interaction at Airplane Wings, number 84 in Notes on Numerical Fluid Mechanics and Multidisciplinary Design, pages 205-224. Springer, Berlin, 2003.
Automatic differentiation is a powerful technique for evaluating derivatives of functions given in the form of a high-level programming language such as Fortran, C, or C++. The program is treated as a sequence of elementary statements to which the chain rule of differential calculus is applied mechanically. A key feature of this technique is its ability to generate accurate derivatives rather than approximations obtained from numerical differentiation like divided differences. We survey automatic differentiation and report on preliminary results that have been obtained applying the automatic differentiation tool ADIFOR to a large-scale computational fluid dynamics solver, TFS, developed at Aerodynamisches Institut, Aachen University of Technology. This solver comprises approximately 24,000 lines of Fortran 77 and is able to resolve steady and unsteady laminar and turbulent flows in two and three dimensions.

[63] C. H. Bischof, H. M. Bücker, B. Lang, A. Rasch, and J. W. Risch. Extending the functionality of the general-purpose finite element package SEPRAN by automatic differentiation. Int. J. Numer. Meth. Engng., 58(14):2225-2238, December 2003.
From an abstract point of view, a numerical simulation implements a mathematical function that produces some output from some given input. Derivatives (or sensitivities) of the function's output with respect to its input can be obtained-free from truncation error-by using a technique called automatic differentiation. Given a computer code in a high-level programming language like Fortran, C, or, C++, automatic differentiation generates another code capable of computing not only the original function but also its derivatives. Thus, the application of automatic differentiation significantly extends the functionality of a simulation package. For instance, automatic differentiation enables, in a completely mechanical fashion, the usage of derivative-based optimization algorithms where the evaluation of the objective function comprises some given large-scale engineering simulation. In this note, the automatic differentiation tool Adifor is used to transform the general purpose finite element package SEPRAN.In doing so, we automatically translate the given 400,000 lines of Fortran 77 into a new program consisting of 600,000 lines of Fortran 77. We compare our approach with a traditional approach based on numerical differentiation and quantify its advantages in terms of accuracy and computational efficiency for a standard fluid flow problem.

[64] H. M. Bücker, B. Lang, and C. H. Bischof. Parallel programming in computational science: An introductory practical training course for computer science at Aachen University. Future Gen. Comput. Syst., 19(8):1309-1319, November 2003.
Parallel programming of high-performance computers has emerged as a key technology for the numerical solution of large-scale problems arising in computational science and engineering (CSE). The authors believe that principles and techniques of parallel programming are among the essential ingredients of any CSE as well as computer science curriculum. Today, opinions on the role and importance of parallel programming are diverse. Rather than seeing it as a marginal beneficial skill optionally taught at the graduate level, we understand parallel programming as crucial basic skill that should be taught as an integral part of the undergraduate computer science curriculum. A practical training course developed for computer science undergraduates at Aachen University is described. Its goal is to introduce young computer science students to different parallel programming paradigms for shared and distributed memory computers as well as to give a first exposition to the field of computational science by simple, yet carefully chosen sample problems.

[65] Christian H. Bischof, H. Martin Bücker, Bruno Lang, and Arno Rasch. Solving large-scale optimization problems with EFCOSS. Advances Engrg. Software, 34(10):633-639, October 2003.
Derivatives play a prominent role in many areas of scientific computing. Traditionally, divided differences are employed to approximate derivatives, leading to results of dubious quality at often great computational expense. Automatic differentiation (AD), by contrast, is a powerful technique for accurately evaluating derivatives of functions described in a high-level programming language. AD requires little human effort and produces derivatives without truncation error. Although there is no conceptual difference between small and large codes, applying AD to programs with hundreds of thousands lines of code is still a challenging task and requires a robust AD tool. We report on recent accomplishments of AD applied to the general purpose finite element package SEPRAN consisting of approximately 400,000 lines of Fortran77 and its integration into a prototype problem solving environment called EFCOSS supporting interoperability of simulation codes with optimization software using AD technology.

[66] Stefanie Krivsky and Bruno Lang. Verified computation of higher-dimensional convex hulls and the solution of linear systems. Electron. J. Math. Comput., 1:21-35, March 2003.
In this paper we discuss the use of interval arithmetic in the computation of the convex hull of n points in D dimensions. If interval-based methods are used for the solution of certain appropriate (not necessarily square) linear systems then the correct topological structure of the convex hull can be obtained at lower cost than with exact rational arithmetic. In addition, interval-based methods can determine the topological structure of the convex hull even if the position of the points is not known exactly. In our experiments we compared various linear solvers with respect to their speed and their ability to handle close-to-pathological situations. The latter property determines how often interval arithmetic cannot provide the required information and therefore some computations must be redone with rational arithmetic.

[67] H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof. Automatic parallelism in differentiation of Fourier transforms. In Chin-Cheng Hung, Henry Hexmoor, et al., editors, Proc. 2003 ACM Symposium on Applied Computing (SAC 2003), March 9-12, 2003, Melbourne, FL, pages 148-152, New York, NY, 2003. Assoc. Comput. Mach.
For functions given in the form of a computer program, automatic differentiation is an efficient technique to accurately evaluate the derivatives of that function. Starting from a given computer program, automatic differentiation generates another program for the evaluation of the original function and its derivatives in a fully mechanical way. While the efficiency of this black box approach is already high as compared to numerical differentiation based on divided differences, automatic differentiation can be applied even more efficiently by taking into account high-level knowledge about the given computer program. We show that, in the case where the function involves a Fourier transform, the degree of parallelism in the program generated by automatic differentiation can be increased leading to a rich set of automatic parallelization strategies that are not available when employing a black box automatic parallelization approach. Experiments of the new automatic parallelization approach are reported on a SunFire 6800 server using up to 20 processors.

[68] H. Martin Bücker, Bruno Lang, Hans-Joachim Pflug, and Andreas Wolf. A hybrid MPI-OpenMP implementation of the conjugate gradient method in Java. In Dieter an Mey, editor, Proc. EWOMP '03, Fifth European Workshop on OpenMP, September 22-26, 2003, Aachen, Germany, pages 205-213, Aachen, Germany, 2003. Center for Computing and Communication, Aachen University.
The message passing paradigm is the dominating programming model for parallel processing on distributed-memory architectures. OpenMP is the most widely used parallel programming model for shared-memory systems. Contemporary parallel computers consist of multiple shared-memory computers connected via a network so that a hybrid parallelization approach combining message passing and OpenMP is often desired. We present a hybrid parallelization of an iterative method for the solution of linear systems with a large sparse symmetric coefficient matrix which is implemented in Java. To this end, we bring together mpiJava and JOMP and compare the performance on a Sun Fire 6800 system with a hybrid parallel version of the same problem implemented in Fortran. The scalability of both implementations is similar while, surprisingly, the performance of the Fortran version with data prefetching is only a factor of about 2 larger than the Java version when up to 22 processors are used.

[69] Christian H. Bischof, Bruno Lang, and Andre Vehreschild. Automatic differentiation for MATLAB programs. Proc. Appl. Math. Mech., 2(1):50-53, March 2003. Proc. GAMM-Jahrestagung, March 25-28, 2002, Augsburg, Germany.
Derivative information is required in numerous applications, including sensitivity analysis and numerical optimization. For simple functions, symbolic differentiation-done either manually or with a computer algebra system-can provide the derivatives, whereas divided differences (DD) have been used traditionally for functions defined by (potentially very complex) computer programs, even if only approximate values can be obtained this way. An alternative approach for such functions is automatic differentiation (AD), yielding exact derivatives at often lower cost than DD, and without restrictions on the program complexity. In this paper we compare the functionality and describe the use of ADMIT/ADMAT and ADiMat. These two AD tools provide derivatives for programs written in the MATLAB language, which is widely used for prototype and production software in scientific and engineering applications. While ADMIT/ADMAT implements a pure operator overloading approach of AD, ADiMat also employes source transformation techniques.

[70] Thomas Beelitz, Christian H. Bischof, and Bruno Lang. Intervals and OpenMP: Towards an efficient parallel result-verifying nonlinear solver. In Dieter an Mey, editor, Proc. EWOMP '03, Fifth European Workshop on OpenMP, September 22-26, 2003, Aachen, Germany, pages 119-125, Aachen, Germany, 2003. Center for Computing and Communication, Aachen University.
Nonlinear systems occur in diverse applications, i.e., in the steady state analysis of chemical processes. If safety concerns require the results to be provably correct then result-verifying algorithms relying on interval arithmetic should be used for solving these systems. Since such algorithms are very computationally intensive, parallelism must be exploited to make them feasible in practice. We describe our framework for the verified solution of nonlinear systems and our approach to parallelizing it with OpenMP. First numerical results show that the parallelization scheme is indeed successful but that the attainable speedup is limited for some unknown reason.

[71] H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof. Computing sensitivities of the electrostatic potential by automatic differentiation. Comput. Phys. Comm., 147(1-2):720-723, August 2002.
Given a computer model for the electrostatic potential in an L-shaped region with media of different dielectric permeabilities in two subregions, we are interested in the robustness of the simulation by identifying the rate of change of the potential with respect to a change in the permeabilities. Such sensitivity analyses, assessing the rate of change of certain model outputs implied by varying certain model inputs, can be carried out by computing the corresponding partial derivatives. In large-scale computational physics, the underlying computer model is typically available as a complicated computer code in a high-level programming language such as Fortran, C, or C++. To obtain accurate and efficient derivatives of functions given in this form, we use a technique called automatic or algorithmic differentiation. Unlike numerical differentiation based on divided differences, derivatives generated by automatic differentiation are free of truncation error. Here, the automatic differentiation tool Adifor is used to transform the given computer model-implemented with the general purpose finite element package SEPRAN-into a new computer code computing the derivatives of the electrostatic potential with respect to the dielectric permeabilities. In doing so, we automatically translate 400,000 lines of Fortran 77 into a new program consisting of 600,000 lines of Fortran 77. We compare our approach with a traditional approach based on numerical differentiation and quantify its advantages in terms of accuracy and computational efficiency.

[72] Christian H. Bischof, H. Martin Bücker, and Bruno Lang. Automatic differentiation for computational finance. In Erricos John Kontoghiorghes, Berç Rustem, and Stavros Siokos, editors, Computational Methods in Decision-Making, Economics and Finance Proc., Neuchâtel, August 16, 2000, chapter 15, pages 297-310. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.
Automatic differentiation (AD) is a powerful technique allowing to compute derivatives of a function given by a (potentially very large) piece of code. The basic principles of AD and some available tools implementing this technology are reviewed. AD is superior to divided differences because AD-generated derivative values are free of approximation errors, and superior to symbolic differentiation because code of very high complexity can be handled, in contrast to computer algebra systems whose applicability is limited to rather simple functions. In addition, the cost for computing gradients of scalar-valued functions with either divided differences or symbolic differentiation grows linearly with the number of variables, whereas the so-called reverse mode of AD can compute such gradients at constant cost.

[73] C. H. Bischof, H. M. Bücker, B. Lang, A. Rasch, and A. Vehreschild. Combining source transformation and operator overloading techniques to compute derivatives for MATLAB programs. In Proc. 2nd IEEE Intl. Workshop on Source Code Analysis and Manipulation (SCAM 2002), October 1, 2002, Montreal, Canada, pages 65-72. IEEE Computer Society, November 2002.
Derivatives of mathematical functions play a key role in various areas of numerical and technical computing. Many of these computations are done in MATLAB, a popular environment for technical computing providing engineers and scientists with capabilities for mathematical computing, analysis, visualization, and algorithmic development. For functions written in the MATLAB language, a novel software tool is proposed to automatically transform a given MATLAB program into another MATLAB program capable of computing not only the original function but also user-specified derivatives of that function. That is, a program transformation known as automatic differentiation is performed to change the semantics of the program in a fashion based on the chain rule of differential calculus. The crucial ingredient of the tool is a combination of source-to-source transformation and operator overloading. The overall design of the tool is described and numerical experiments are reported demonstrating the efficiency of the resulting code for a sample problem.

[74] H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof. Computation of sensitivity information for aircraft design by automatic differentiation. In Peter M. A. Sloot, C. J. Kenneth Tan, Jack J. Dongarra, and Alfons G. Hoekstra, editors, Computational Science - ICCS 2002 International Conference, April 21-24, 2002, Amsterdam, The Netherlands, Proceedings, Part II, number 2330 in LNCS, pages 1069-1076, Berlin, 2002. Springer-Verlag.
Given a numerical simulation of the near wake of an airfoil, automatic differentiation is used to accurately compute the sensitivities of the Mach number with respect to the angle of attack. Such sensitivity information is crucial when integrating a pure simulation code into an optimization framework involving a gradient-based optimization technique. In this note, the ADIFOR system implementing the technology of automatic differentiation for functions written in Fortran 77 is used to mechanically transform a given flow solver called TFS into a new program capable of computing the original simulation and the desired derivatives in a simultaneous fashion. Numerical experiments of derivatives obtained from automatic differentiation and finite differences approximations are reported.

[75] H. Martin Bücker, Bruno Lang, Dieter an Mey, Arno Rasch, and Christian H. Bischof. Explicit loop scheduling in OpenMP for parallel automatic differentiation. In Jalal N. Almhana and C. Bhavsar Virendrakumar, editors, Proc. 16th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2002), Moncton, Canada, June 16-19, 2002, pages 121-126, Los Alamitos, CA, 2002. IEEE Computer Society.
Derivatives of almost arbitrary functions can be evaluated efficiently by automatic differentiation whenever the functions are given in the form of computer programs in a high-level programming language such as Fortran, C, or C++. In contrast to numerical differentiation, where derivatives are only approximated, automatic differentiation generates derivatives that are accurate up to machine precision. Sophisticated software tools implementing the technology of automatic differentiation are capable of automatically generating code for the product of the Jacobian matrix and a so-called seed matrix. It is shown how these tools can benefit from concepts of shared memory programming to parallelize, in a completely mechanical fashion, the gradient operations associated with each statement of the given code. The feasibility of our approach is demonstrated by numerical experiments. They were performed with a code that was generated automatically by the Adifor system and augmented with OpenMP directives.

[76] H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof. Sensitivity analysis of an airfoil simulation using automatic differentiation. In M. H. Hamza, editor, Proc. International IASTED Conference on Modelling, Identification, and Control (MIC 2002), Innsbruck, Austria, February 18-21, 2002, pages 73-76, Anaheim, CA, 2002. ACTA Press.
Derivatives are not only useful in sensitivity analysis but play a vital role in various areas of computational science including parameter optimization of computer models. Among the various methods for obtaining derivatives, automatic differentiation (AD) combines freedom from approximation errors, high performance, and the ability to handle arbitrarily complex codes arising from large-scale scientific investigations. In this note, we show how AD technology can aid in the sensitivity analysis of a computer model of an aircraft. To this end, the software tool ADIFOR, implementing the AD technology for functions written in Fortran 77, was applied to a large-scale computational fluid dynamics solver, TFS, developed at Aerodynamisches Institut, Aachen University of Technology. This solver comprises approximately 24,000 lines of Fortran 77 and is able to resolve steady and unsteady laminar and turbulent flows in two and three dimensions.

[77] H. Martin Bücker, Bruno Lang, Dieter an Mey, and Christian H. Bischof. Bringing together automatic differentiation and OpenMP. In M. Furnari and E. Gallopoulos, editors, Proceedings International Conference on Supercomputing, ICS 2001, Sorrento, Italy, June 17-21, 2001, pages 246-251, New York, NY, 2001. ACM Press.
Derivatives of almost arbitrary functions can be evaluated efficiently by automatic differentiation whenever the functions are given in the form of computer programs in a high-level programming language such as Fortran, C, or C++. Furthermore, in contrast to numerical differentiation where derivatives are approximated, automatic differentiation generates derivatives that are accurate up to machine precision. The so-called forward mode of automatic differentiation computes derivatives by carrying forward a gradient associated with each intermediate variable simultaneously with the evaluation of the function itself. It is shown how software tools implementing the technology of automatic differentiation can benefit from simple concepts of shared memory programming to parallelize the gradient operations. The feasibility of our approach is demonstrated by numerical experiments. They were performed with a code that was generated automatically by the Adifor system and augmented with OpenMP directives.

[78] Bruno Lang. A comparison of techniques for evaluating centered forms. In Ulrich Kulisch, Rudolf Lohner, and Axel Facius, editors, Perspectives on Enclosure Methods, pages 149-155. Springer-Verlag, Wien, 2001.
Second-order enclosures for a function's range can be computed with centered forms, which involve a so-called slope vector. In this paper we discuss several techniques for determining such vectors and compare them with respect to tightness of the resulting enclosure. We advocate that a two-stage slope vector computation with symbolic preprocessing is optimal.

[79] Christian H. Bischof, H. Martin Bücker, Bruno Lang, Arno Rasch, and Jakob W. Risch. A CORBA-based environment for coupling large-scale simulation and optimization software. In H. R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'2001, Las Vegas, Nevada, June 25-28, 2001, volume 1, pages 68-72. CSREA Press, 2001.
Simulation is an essential part in computational science and engineering. Since the underlying physical models become more and more complex, the adjustment of certain model parameters (“parameter identification”) is often a nontrivial task and should be addressed by means of efficient numerical optimization algorithms. We describe an environment for coupling simulation and optimization software. The modular architecture of the system supports easy modification of the problem configuration, including the exchange of software components. Therefore our tool is well-suited for rapid prototyping of optimization problems.

[80] Bruno Lang. Derivative-based subdivision in multi-dimensional verified Gaussian quadrature. In Götz Alefeld, Jiři Rohn, Siegfried Rump, and Tetsuro Yamamoto, editors, Symbolic Algebraic Methods and Verification Methods, pages 145-152, Wien, 2001. Springer-Verlag.
An implementation of verified Gaussian quadrature involves algorithmic parameters whose correct setting is crucial for adequate performance. We identify several of these control parameters and, based on experimental data, we try to give advice for choosing suitable parameter values in order to obtain reasonable average performance for a “black-box” quadrature routine.

[81] H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof. From analytic to automated derivatives: A case study of the electrostatic potential. In Algorithms and Software for Mobile Communications, Proc. 10th Aachen Symposium on Signal Theory, Aachen, Germany, September 20-21, 2001, pages 255-260, Berlin, 2001. VDE Verlag.
Given a large-scale engineering simulation, one is often interested in the derivatives of certain program outputs with respect to certain input parameters, e.g., for sensitivity analysis. From an abstract point of view, the computer program defines a mathematical function whose derivatives are sought. In this note, two cases are investigated where the underlying functions are given by the solution of different two-dimensional electrostatic potential problems. For a rectangular region, the function and its derivatives can be derived analytically. For a function based on an L-shaped region, the function is computed by a simulation code and its derivatives are obtained by automatic differentiation. In general, this powerful technique is applicable if a function is given in the form of a computer program in a high-level programming language such as Fortran, C, or C++. In contrast to numerical differentiation providing approximations based on divided differences, the derivatives computed by automatic differentiation are accurate. Furthermore, automatic differentiation is also shown to be computationally efficient.

[82] Christian H. Bischof, H. Martin Bücker, Jörg Henrichs, and Bruno Lang. Hands-on training for undergraduates in high-performance computing using Java. In Tor Sørevik, Fredrik Manne, Randi Moe, and Assefaw Hadish Gebremedhin, editors, Applied Parallel Computing-New Paradigms for HPC in Industry and Academia, Proceedings 5th International Workshop, PARA 2000, Bergen, Norway, June 2000, volume 1947 of LNCS, pages 306-315, Berlin, 2001. Springer-Verlag.
In recent years, the object-oriented approach has emerged as a key technology for building highly complex scientific codes, as has the use of parallel computers for the solution of large-scale problems. We believe that the paradigm shift towards parallelism will continue and, therefore, principles and techniques of writing parallel programs should be taught to the students at an early stage of their education rather than as an advanced topic near the end of a curriculum. A certain understanding of the practical aspects of numerical modeling is also a useful facet in computer science education. The reason is that, in addition to their traditional prime rôle in computational science and engineering, numerical techniques are also increasingly employed in seemingly non-numerical settings as large-scale data mining and web searching. This paper describes a practical training course for undergraduates, where carefully selected problems of high-performance computing are solved using the programming language Java.

[83] Christian H. Bischof, H. Martin Bücker, Bruno Lang, Arno Rasch, and Jakob W. Risch. On the use of a differentiated finite element package for sensitivity analysis. In Vassil N. Alexandrov, Jack J. Dongarra, Benjoe A. Juliano, René S. Renner, and C. J. Kenneth Tan, editors, Computational Science - ICCS 2001 International Conference San Francisco, CA, May 2001, Proceedings, Part I, volume 2073 of LNCS, pages 795-801, Berlin, 2001. Springer.
Derivatives are ubiquitous in various areas of computational science including sensitivity analysis and parameter optimization of computer models. Among the various methods for obtaining derivatives, automatic differentiation (AD) combines freedom from approximation errors, high performance, and the ability to handle arbitrarily complex codes arising from large-scale scientific investigations. In this note, we show how AD technology can aid in the sensitivity analysis of a computer model by considering a classic fluid flow experiment as an example. To this end, the software tool ADIFOR implementing the AD technology for functions written in Fortran 77 was applied to the large finite element package SEPRAN. Differentiated versions of SEPRAN enable sensitivity analysis for a wide range of applications, not only from computational fluid dynamics.

[84] H. Martin Bücker, Bruno Lang, and Christian H. Bischof. Teaching different parallel programming paradigms using Java. In Proceedings of the Third Annual Workshop on Java for High-Performance Computing, Sorrento, Italy, June 17, 2001, pages 73-81, 2001.
Parallel computing has emerged as a key technology for the solution of large-scale problems arising in computational science and engineering. Therefore, principles and techniques of writing parallel programs should be taught to the students at an early stage of their education rather than as an advanced topic near the end of a curriculum. This note describes a practical training course for undergraduates, where the programming language Java is used to pave the student's way from concurrent programming using native Java threads via OpenMP-like shared memory programming to distributed memory programming based on the message passing interface MPI.

[85] Christian H. Bischof, Bruno Lang, Wolfgang Marquardt, and Martin Mönnigmann. Verified determination of singularities in chemical processes. In Walter Krämer and Jürgen Wolff von Gudenberg, editors, Scientific Computing, Validated Numerics, Interval Methods, pages 305-316, New York, 2001. Kluwer Academic/Plenum Publishers.
Understanding and controlling the behavior of chemical processes are important issues, for safety as well as economical reasons. Some processes can have multiple steady states and even switch between them in a complex way, the reasons for the multiplicity not always being well understood. A singularity theory approach for investigating such behavior leads to nonlinear systems whose solutions correspond to specific singular states of the process. In order to exclude certain types of singularities, rigorous methods must be used to check the solvability of the matching systems. As these systems are highly structured, our solution method combines a symbolic preprocessing phase (term manipulation for utilizing the structure) with a branch-and-bound type rigorous interval-based solver. We report on our experience with this approach for small-to-medium sized example problems.

[86] Christian Bischof, Martin Bücker, Bruno Lang, and Arno Rasch. Automatic differentiation of the FLUENT CFD program: Procedure and first results. Report RWTH-CS-SC-01-22, RWTH Aachen, Institute for Scientific Computing, 2001.
[87] Christian H. Bischof, Bruno Lang, and Xiaobai Sun. A framework for symmetric band reduction. ACM Trans. Math. Software, 26(4):581-601, December 2000.
We develop an algorithmic framework for reducing the bandwidth of symmetric matrices via orthogonal similarity transformations. This framework includes the reduction of full matrices to banded or tridiagonal form and the reduction of banded matrices to narrower banded or tridiagonal form, possibly in multiple steps. Our framework leads to algorithms that require fewer floating-point operations than do standard algorithms, if only the eigenvalues are required. In addition, it allows for space-time tradeoffs and enables or increases the use of blocked transformations.

[88] Christian H. Bischof, Bruno Lang, and Xiaobai Sun. Algorithm 807: The SBR toolbox-software for successive band reduction. ACM Trans. Math. Software, 26(4):602-616, December 2000.
We present a software toolbox for symmetric band reduction via orthogonal transformations, together with a testing and timing program. The toolbox contains drivers and computational routines for the reduction of full symmetric matrices to banded form and the reduction of banded matrices to narrower banded or tridiagonal form, with optional accumulation of the orthogonal transformations, as well as repacking routines for storage rearrangement. The functionality and the calling sequences of the routines are described, with a detailed discussion of the “control” parameters that allow adaptation of the codes to particular machine and matrix characteristics. We also briefly describe the testing and timing program included in the toolbox.

[89] Bruno Lang. Direct solvers for symmetric eigenvalue problems. In Johannes Grotendorst, editor, Modern Methods and Algorithms of Quantum Chemistry, Jülich, February 21-25, 2000, Proceedings, 2nd Edition, pages 231-259, Jülich, 2000. John von Neumann Institute for Computing.
This article reviews classical and recent direct methods for computing eigenvalues and eigenvectors of symmetric full or banded matrices. The ideas underlying the methods are presented, and the properties of the algorithms with respect to accurary and performance are discussed. Finally, pointers to relevant software are given.

[90] Bruno Lang. Out-of-core solution of large symmetric eigenproblems. Z. angew. Math. Mech., 80:S 809-S 810, 2000.
This paper describes a prototype implementation of a solver for dense symmetric eigenproblems, which are too large to fit into the main memory.

[91] Christian H. Bischof, H. Martin Bücker, Jörg Henrichs, and Bruno Lang. Software-Praktikum Paralleles Programmieren mit Java. Report RWTH-CS-SC-00-13, RWTH Aachen, Institute for Scientific Computing, 2000.
-

[92] Bruno Lang. Efficient eigenvalue and singular value computations on shared memory machines. Parallel Comput., 25(7):845-860, July 1999.
We describe two techniques for speeding up eigenvalue and singular value computations on shared memory parallel computers. Depending on the information that is required, different steps in the overall process can be made more efficient. If only the eigenvalues or singular values are sought then the reduction to condensed form may be done in two or more steps to make best use of optimized level-3 BLAS. If eigenvectors and/or singular vectors are required, too, then their accumulation can be speeded up by another blocking technique. The efficiency of the blocked algorithms depends heavily on the values of certain control parameters. We also present a very simple performance model that allows selecting these parameters automatically.

[93] Benedikt Großer and Bruno Lang. Efficient parallel reduction to bidiagonal form. Parallel Comput., 25(8):969-986, September 1999.
Most methods for calculating the SVD (singular value decomposition) require to first bidiagonalize the matrix. The blocked reduction of a general, dense matrix to bidiagonal form, as implemented in ScaLAPACK, does about one half of the operations with BLAS3. By subdividing the reduction into two stages dense -> banded and banded -> bidiagonal with cubic and quadratic arithmetic costs respectively, we are able to carry out a much higher portion of the calculations in matrix-matrix multiplications. Thus, higher performance can be expected.

This paper presents and compares three parallel techniques for reducing a full matrix to banded form. (The second reduction stage is described in another paper [B. Lang. Parallel reduction of banded matrices to bidiagonal form. Parallel Comput. 22:1-18, 1996]). Numerical experiments on the Intel Paragon and IBM SP/1 distributed memory parallel computers demonstrate that the two-stage reduction approach can be significantly superior if only the singular values are required.

[94] Bruno Lang. A comparison of subdivision strategies for verified multi-dimensional Gaussian quadrature. In Tibor Csendes, editor, Developments in Reliable Computing - SCAN-98 Proceedings, pages 67-75, Dordrecht, The Netherlands, 1999. Kluwer Academic Publishers.
This paper compares several strategies for subdividing the domain in verified multi-dimensional Gaussian quadrature. Subdivision may be used in two places in the quadrature algorithm. First, subdividing the box can reduce the over-estimation of the partial derivatives' ranges, which are needed to bound the approximation error. Second, if the required error bound cannot be met with the whole domain then the box is split into subboxes, and the quadrature algorithm is recursively applied to these. Both variants of subdivision are considered in this paper.

[95] Benedikt Großer and Bruno Lang. Using pentangular factorizations for the reduction to banded form. In P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé, L. Giraud, and D. Ruiz, editors, EuroPar'99 Parallel Processing, volume 1685 of LNCS, pages 1088-1095, Berlin, 1999. Springer-Verlag.
Most methods for computing the singular value decomposition (SVD) first bidiagonalize the matrix. The ScaLAPACK implementation of the blocked reduction of a general dense matrix to bidiagonal form performs about one half of the operations with BLAS3. If we subdivide the task into two stages dense ->banded and banded -> bidiagonal, we can increase the portion of matrix-matrix operations and expect higher performance. We give an overview of different techniques for the first stage.

This note summarizes the results of  [B. Großer. Parallele zweistufige Verfahren zur Reduktion auf Bidiagonalgestalt. Diplomarbeit, Fachbereich Mathematik, Bergische Universität GH Wuppertal, 1997], [B. Großer, B. Lang. Efficient parallel reduction to bidiagonal form. TR BUGHW-SC98/2, Fachbereich Mathematik, Bergische Universität GH Wuppertal, 1998].

[96] Rudolf Berrendorf, Christian Bischof, Holger Brunst, Martin Bücker, Ulrich Detert, Rüdiger Esser, Michael Gerndt, Johannes Grotendorst, Inge Gutheil, Hans-Christian Hoppe, Friedel Hoßfeld, Bernd Körfgen, Bruno Lang, Dieter an Mey, Bernd Mohr, Wolfgang E. Nagel, Karl Solchenbach, Godehard Sutmann, Valentina Tikko, and Lothar Wollschläger. Gekoppelte SMP-Systeme im wissenschaftlich-technischen Hochleistungsrechnen - Status und Entwicklungsbedarf (GoSMP). Analyse im Auftrag des BMBF, Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH, Germany, December 1999.
In der Analyse werden der derzeitige Status und die voraussehbare Entwicklung gekoppelter SMP-Systeme bezüglich der Hardware und Software dargestellt sowie Forschungs- und Entwicklungsaufgaben im Softwarebereich identifiziert, deren Lösung für den effizienten Einsatz gekoppelter SMP-Systeme für das wissenschaftlich-technische Hochleistungsrechnen erforderlich ist. Es werden konkrete Themengebiete für eine Projektförderung durch das BMBF empfohlen.

[97] Bruno Lang. Efficient algorithms for reducing banded matrices to bidiagonal and tridiagonal form. In Peter Arbenz, Marcin Paprzycki, Ahmed Sameh, and Vivek Sarin, editors, High Performance Algorithms for Structured Matrix Problems, volume 2 of Advances in the Theory of Computation and Computational Mathematics, pages 75-89. Nova Science Publishers, Commack, NY, 1998.
This paper presents efficient techniques for the orthogonal reduction of banded matrices to bidiagonal and symmetric tridiagonal form. The algorithms are numerically stable and well suited to parallel execution. Experiments on the Intel Paragon show that even on a single processor these methods usually will be several times faster than the corresponding LAPACK routines. In addition, they yield nearly full speedup when run on multiple processors.

[98] Bruno Lang. Using level 3 BLAS in rotation-based algorithms. SIAM J. Sci. Comput., 19(2):626-634, March 1998.
This paper presents a technique that allows using level 3 BLAS in a number of rotation-based algorithms. In particular, the update of an orthogonal transformation matrix which often involves the vast majority of operations can be done with a matrix-matrix product. As a case study, the technique is applied to the QR and QL algorithms for computing the eigensystem of a symmetric tridiagonal matrix. The modifications do not affect the convergence properties of the algorithms nor do they significantly increase the overall number of operations. Thus, the computations can be sped up by more than 50% on machines with a distinct memory hierarchy, like the Intel i860 or IBM RS/6000, provided the block size is set appropriately. We also present a simple theoretical analysis that allows selecting an almost-optimal block size.

[99] Bruno Lang. Verified quadrature in determining Newton's constant of gravitation. J. Univers. Comput. Sci., 4(1):16-24, 1998.
This paper describes the use of interval arithmetic to bound errors in an experiment for determining Newton's constant of gravitation. Using verified Gaussian quadrature we were able to assess the numerical errors as well as the effect of several tolerances in the physical experiment.

[100] Bruno Lang. Effiziente Orthogonaltransformationen bei der Eigen- und Singulärwertberechnung. Habilitationsschrift, Fachbereich Mathematik, Bergische Universität GH Wuppertal, Germany, 1997.
[101] Bruno Lang. Using level 3 BLAS in the QR algorithm. Z. angew. Math. Mech., 77(S2):S 607-S 608, 1997.
We present a technique that allows speeding up the QR algorithms for symmetric tridiagonal and Hessenberg matrices by doing most of the computations in matrix-matrix products. The reduced memory traffic leads to increased performance.

[102] Bruno Lang. Parallel reduction of banded matrices to bidiagonal form. Parallel Comput., 22(1):1-18, January 1996.
A parallel algorithm for reducing banded matrices to bidiagonal form is presented. In contrast to the rotation-based “standard approach”, our algorithm is based on Householder transforms, therefore exhibiting considerably higher data locality (BLAS level 2 instead of level 1). The update of the transformation matrices which involves the vast majority of the operations can even be blocked to allow the use of level 3 BLAS. Thus, our algorithm will outperform the standard method on a serial computer with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.

[103] Oliver Holzmann, Bruno Lang, and Holger Schütt. Newton's constant of gravitation and verified numerical quadrature. Reliab. Comput., 2(3):229-239, November 1996.
In this paper we describe the use of interval arithmetic in an experiment for determining G, Newton's constant of gravitation. Using an interval version of Gaussian quadrature, we bound the effects of numerical errors and of several tolerances in the physical experiment. This allowed to idientify “critical” tolerances which must be reduced in order to obtain G with the desired accuracy.

[104] Bruno Lang. Reduction of banded matrices to bidiagonal form. Z. angew. Math. Mech., 76(Supplement 1: ICIAM/GAMM 95 Proceedings):155-158, 1996.
We present a parallel algorithm for reducing banded matrices to bidiagonal form. This reduction is a major step in the computation of singular values and singular vectors. In contrast to the rotation-based “standard approach”, our algorithm is based on Householder transforms, therefore exhibiting considerably better data locality (BLAS level 2 instead of level 1). The update of the transformation matrix, the most expensive part of the reduction, can be blocked to allow the use of level 3 BLAS. Thus, even a serial implementation of our algorithm will outperform the standard method on a machine with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.

[105] Götz Alefeld, Andreas Frommer, and Bruno Lang, editors. Scientific Computing and Validated Numerics - SCAN-95 Proceedings, volume 90 of Mathematical Research, Berlin, 1996. Akademie Verlag.
-

[106] Bruno Lang. Verifizierte Lösung von Gleichungs- und Ungleichungssystemen. Z. angew. Math. Mech., 75(S II):S541-S542, 1995.
-

[107] Christian H. Bischof, Bruno Lang, and Xiaobai Sun. Parallel tridiagonalization through two-step band reduction. In Proceedings of the Scalable High-Performance Computing Conference, Knoxville, Tennessee, May 23-25, 1994, pages 23-27, Los Alamitos, CA, 1994. IEEE Computer Society.
We present a two-step variant of the “successive band reduction” paradigm for the tridiagonalization of symmetric matrices. Here we first reduce a full matrix to narrow-banded form, and from there to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations, which leads to an algorithm that is almost entirely based on BLAS-3 kernels, and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone Delta prototype and the IBM SP/1.

[108] Bruno Lang. Parallele Berechnung von Eigensystemen symmetrischer Bandmatrizen. Z. angew. Math. Mech., 74(6):T541-T544, 1994.
Es wird ein Verfahren zur Berechnung von Eigenwerten und -vektoren symmetrischer Bandmatrizen auf Parallelrechnern vorgestellt. Zunächst wird die Matrix auf Tridiagonalgestalt reduziert, dann werden mit einem beschleunigten Bisektionsverfahren die Eigenwerte und anschließend die Eigenvektoren berechnet. Das vorgestellte Verfahren ermöglicht in allen drei Phasen die Verteilung der Rechenarbeit auf mehrere Prozessoren und zusätzlich den Einsatz eventuell vorhandener Vektorhardware in den einzelnen Prozessoren. Die Effizienz des Verfahrens wird durch Messungen auf dem iPSC/860 Hypercube-Parallelrechner belegt.

[109] Bruno Lang. A parallel algorithm for reducing symmetric banded matrices to tridiagonal form. SIAM J. Sci. Comput., 14(6):1320-1338, November 1993.
An algorithm is presented for reducing symmetric banded matrices to tridiagonal form via Householder transformations. The algorithm is numerically stable and is well suited to parallel execution on distributed memory multiple instruction multiple data (MIMD) computers. Numerical experiments on the iPSC/860 hypercube show that the new method yields nearly full speedup if it is run on multiple processors. In addition, even on a single processor the new method usually will be several times faster than the corresponding EISPACK and LINPACK routines.

[110] Bruno Lang. Reducing symmetric banded matrices to tridiagonal form-A comparison of a new parallel algorithm with two serial algorithms on the iPSC/860. In L. Bougé, M. Cosnard, Y. Robert, and D. Trystram, editors, Parallel Processing: CONPAR 92-VAPP V - Proceedings Second Joint International Conference on Vector and Parallel Processing, Lyon, France, September 1992, volume 634 of LNCS, pages 271-282, Berlin, 1992. Springer-Verlag.
We compare three algorithms for reducing symmetric banded matrices to tridiagonal form and evaluate their performance on the Intel iPSC/860 hypercube parallel computer. Two of these algorithms, the routines BANDR and _SBTRD from the EISPACK and LAPACK libraries, resp., are serial algorithms with little potential for coarse grain parallelism. The third one, called SBTH, is a new parallel algorithm. Results on the iPSC/860 hypercube show that this new method may be several times faster than the others, even when run on a single processor. In addition, execution on multiple processors yields nearly full speedup.

[111] Bruno Lang. Parallele Reduktion symmetrischer Bandmatrizen auf Tridiagonalgestalt. Dissertation, Fakultät für Mathematik, Universität Karlsruhe, Germany, 1991.
[112] Bruno Lang. Matrix vector multiplication for symmetric matrices on parallel computers and applications. Z. angew. Math. Mech., 71(6):T807-T808, 1991.
Matrix vector products with symmetric matrices arise in several numerical algorithms, e.g. in the Householder method for eigenproblems and in some iterative methods for the solution of linear systems. Here we propose a procedure for computing such a product on a local memory parallel computer, if only one triangle of the matrix is actually stored. Our method neither increases the number of arithmetic operations nor introduces prohibitive communication overhead.

[113] Stephan Abramowski, Bruno Lang, and Heinrich Müller. Moving regular k-gons in contact. In J. van Leeuwen, editor, Graph-Theoretic Concepts in Computer Science, International Workshop WG '88 Proceedings, Amsterdam, June 15-17, 1988, volume 344 of LNCS, pages 229-242, Berlin, 1989. Springer-Verlag.
Given m circles in the plane, contacts between them can be specified by a system of quadratic distance equalities. An approximate solution for the trajectories of the circles for a system of one degree of freedom is given, by replacing the circles by translationally moving regular k-gons. The approximation yields trajectories that are piecewise linear. The next linear generation of the m trajectories are found by an incremental algorithm in O( m2 ) time. Further, an algorithm is presented which finds the next collision between m k-gons moving on lines at constant speed in time O( k m2-x ) for a constant x > 0 using linear space. Finally, more practical collision detection algorithms are sketched based on neighborhood information which, however, do not guarantee a nontrivial worst-case bound.