[1] 
Martin Galgon, Lukas Krämer, Bruno Lang, Andreas Alvermann, Holger
Fehske, Andreas Pieper, Georg Hager, Moritz Kreutzer, Faisal Shahzad, Gerhard
Wellein, Achim Basermann, Melven RöhrigZöllner, and Jonas Thies.
Improved coefficients for polynomial filtering in ESSEX.
In Tetsuya Sakurai, ShaoLiang Zhang, Toshiyuki Imamura, Yusaku
Yamamoto, Yoshinobu Kuramashi, and Takeo Hoshi, editors, Eigenvalue
Problems: Algorithms, Software and Applications, in Petascale Computing.
Proc. EPASA 2015, Tsukuba, Japan, September 2015, volume 117 of
LNCSE. Springer International Publishing, 2017.
To appear.
The ESSEX project is an ongoing effort to provide exascaleenabled sparse eigensolvers, especially for quantum physics and related application areas. In this paper we first briefly summarize some key achievements that have been made within this project. Then we focus on a projectionbased eigensolver with polynomial approximation of the projector. This eigensolver can be used for computing hundreds of interior eigenvalues of large sparse matrices. We describe techniques that allow using lowerdegree polynomials than possible with standard Chebyshev expansion of the window function and kernel smoothing. With these polynomials, the degree, and thus the number of matrixvector multiplications, typically can be reduced by roughly one half, resulting in comparable savings in runtime.

[2] 
Andreas Pieper, Moritz Kreutzer, Andreas Alvermann, Martin Galgon, Holger
Fehske, Georg Hager, Bruno Lang, and Gerhard Wellein.
Highperformance implementation of Chebyshev filter diagonalization
for interior eigenvalue computations.
J. Comput. Phys., 325:226243, 2016.
[ http ]
We study Chebyshev filter diagonalization as a tool for the computation of many interior eigenvalues of very large sparse symmetric matrices. In this technique the subspace projection onto the target space of wanted eigenvectors is approximated with filter polynomials obtained from Chebyshev expansions of window functions. After the discussion of the conceptual foundations of Chebyshev filter diagonalization we analyze the impact of the choice of the damping kernel, search space size, and filter polynomial degree on the computational accuracy and effort, before we describe the necessary steps towards a parallel highperformance implementation. Because Chebyshev filter diagonalization avoids the need for matrix inversion it can deal with matrices and problem sizes that are presently not accessible with rational function methods based on direct or iterative linear solvers. To demonstrate the potential of Chebyshev filter diagonalization for largescale problems of this kind we include as an example the computation of the 10^{2} innermost eigenpairs of a topological insulator matrix with dimension 10^{9} derived from quantum physics applications.

[3] 
Moritz Kreutzer, Jonas Thies, Andreas Pieper, Andreas Alvermann, Martin Galgon,
Melven RöhrigZöllner, Faisal Shahzad, Achim Basermann, Alan R.
Bishop, Holger Fehske, Georg Hager, Bruno Lang, and Gerhard Wellein.
Performance engineering and energy efficiency of building blocks for
large, sparse eigenvalue computations on heterogeneous supercomputers.
In HansJoachim Bungartz, Philipp Neumann, and Wolfgang E. Nagel,
editors, Software for Exascale Computing  SPPEXA 20132015, volume 113
of LNCSE, pages 317338. Springer, Switzerland, 2016.
Numerous challenges have to be mastered as applications in scientific computing are being developed for postpetascale parallel systems. While ample parallelism is usually available in the numerical problems at hand, the efficient use of supercomputer resources requires not only good scalability but also a verifiably effective use of resources on the core, the processor, and the accelerator level. Furthermore, power dissipation and energy consumption are becoming further optimization targets besides timetosolution. Performance Engineering (PE) is the pivotal strategy for developing effective parallel code on all levels of modern architectures. In this paper we report on the development and use of lowlevel parallel building blocks in the GHOST library (“General, Hybrid, and Optimized Sparse Toolkit”). We demonstrate the use of PE in optimizing a density of states computation using the Kernel Polynomial Method, and show that reduction of runtime and reduction of energy are literally the same goal in this case. We also give a brief overview of the capabilities of GHOST and the applications in which it is being used successfully.

[4] 
Jonas Thies, Martin Galgon, Faisal Shahzad, Andreas Alvermann, Moritz Kreutzer,
Andreas Pieper, Melven RöhrigZöllner, Achim Basermann, Holger
Fehske, Georg Hager, Bruno Lang, and Gerhard Wellein.
Towards an exascale enabled sparse solver repository.
In HansJoachim Bungartz, Philipp Neumann, and Wolfgang E. Nagel,
editors, Software for Exascale Computing  SPPEXA 20132015, volume 113
of LNCSE, pages 295316. Springer, Switzerland, 2016.
As we approach the exascale computing era, disruptive changes in the software landscape are required to tackle the challenges posed by manycore CPUs and accelerators. We discuss the development of a new `exascale enabledâ€™ sparse solver repository (the ESSR) that addresses these challengesfrom fundamental design considerations and development processes to actual implementations of some prototypical iterative schemes for computing eigenvalues of sparse matrices. Key features of the ESSR include holistic performance engineering, tight integration between software layers and mechanisms to mitigate hardware failures.

[5] 
Lukas Krämer and Bruno Lang.
Convergence of integration based methods for the solution of standard
and generalized hermitian eigenvalue problems.
Preprint BUWIMACM 14/30, 2016.
[ .pdf ]
Recently, methods based on spectral projection and numerical integration came up in the literature as candidates for reliable high performance eigenvalue solvers. The key ingredients of this type of eigenvalue solver are a RayleighRitz process and a routine to compute an approximation to the desired eigenspace. The latter computation can be performed by numerical integration of the resolvent. In this article we investigate the progress of the RayleighRitz process and the achievable quality of the computed eigenpairs for the case that an upper bound for the normwise difference between the currently used subspace and the desired eigenspace is available. Then, such bounds are derived for the GaußLegendre rule and the trapezoidal rule.

[6] 
Martin Galgon, Lukas Krämer, Jonas Thies, Achim Basermann, and Bruno
Lang.
On the parallel iterative solution of linear systems arising in the
FEAST algorithm for computing inner eigenvalues.
Parallel Comput., 49:153163, 2015.
Methods for the solution of sparse eigenvalue problems that are based on spectral projectors and contour integration have recently attracted more and more attention. Such methods require the solution of many shifted linear systems of full size. In most of the literature concerning these eigenvalue solvers, only few words are said on the solution of the linear systems, but they turn out to be very hard to solve by iterative linear solvers in practice. In this work we identify a row projection method for the solution of the inner linear systems encountered in the FEAST algorithm and introduce a novel hybrid parallel and fully iterative implementation of the eigenvalue solver. Our approach ultimately aims at achieving extreme parallelism by exploiting the algorithm's potential on several levels. We present numerical examples where graphene modeling is one of the target applications. In this application, several hundred or even thousands of eigenvalues from the interior of the spectrum are required, which is a big challenge for stateoftheart numerical methods.

[7] 
Martin Galgon, Lukas Krämer, and Bruno Lang.
Adaptive choice of projectors in projection based eigensolvers.
Preprint BUWIMACM 15/07, 2015.
[ .pdf ]
We present a general framework for algorithms for the solution of Hermitian eigenvalue problems, where eigenvalues in a given interval are sought. One instance of this framework is Polizzi's FEAST algorithm [E. Polizzi: Densitymatrixbased algorithm for solving eigenvalue problems. Phys. Rev. B 2009; 79:115112], which is based on numerical integration. Another instance is based on polynomial approximation. We propose adaptive strategies for the choice of the polynomial degree and tolerance of the linear solver, respectively. Numerical experiments reveal that these strategies are able to improve the robustness of the resulting methods, while at the same time achieving neartooptimum efficiency.

[8] 
Lars Karlsson, Daniel Kressner, and Bruno Lang.
Optimally packed chains of bulges in multishift QR algorithms.
ACM Trans. Math. Software, 40(2):12:112:15, February 2014.
The QR algorithm is the method of choice for computing all eigenvalues of a dense nonsymmetric matrix A. After an initial reduction to Hessenberg form, a QR iteration can be viewed as chasing a small bulge from the top left to the bottom right corner along the subdiagonal of A. To increase data locality and create potential for parallelism, modern variants of the QR algorithm perform several iterations simultaneously, which amounts to chasing a chain of several bulges instead of a single bulge. To make effective use of level 3 BLAS, it is important to pack these bulges as tightly as possible within the chain. In this work, we show that the tightness of the packing in existing approaches is not optimal and can be increased. This directly translates into a reduced chain length by 33% compared to the stateoftheart LAPACK implementation of the QR algorithm. To demonstrate the impact of our idea, we have modified the LAPACK implementation to make use of the optimal packing. Numerical experiments reveal a uniform reduction of the execution time, without affecting stability or robustness.

[9] 
A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke,
H.J. Bungartz, and H. Lederer.
The ELPA library: Scalable parallel eigenvalue solutions for
electronic structure theory and computational science.
J. Phys.: Condens. Matter, 26(21):213201, May 2014.
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N^{3}) with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have realvalued and complexvalued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most timecritical step is the reduction of the matrix to tridiagonal form and the corresponding backtransformation of the eigenvectors. ELPA offers both a onestep tridiagonalization (successive Householder transformations) and a twostep transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPIOpenMPI implementation available as well. Scalability beyond 10000 CPU cores for problem sizes arising in the field of electronic structure theory is demonstrated for current highperformance computer architectures such as Cray or Intel/Infiniband. For a matrix of dimension 260000, scalability up to 295000 CPU cores has been shown on BlueGene/P.

[10] 
Andreas Alvermann, Achim Basermann, Holger Fehske, Martin Galgon, Georg Hager,
Moritz Kreutzer, Lukas Krämer, Bruno Lang, Andreas Pieper, Melven
RöhrigZöllner, Faisal Shahzad, Jonas Thies, and Gerhard Wellein.
ESSEX: Equipping sparse solvers for exascale.
In Luís Lopes et al., editors, EuroPar 2014: Parallel
Processing Workshops, volume 8806 of LNCS, pages 577588. Springer,
2014.
The ESSEX project investigates computational issues arising at exascale for largescale sparse eigenvalue problems and develops programming concepts and numerical methods for their solution. The project pursues a coherent codesign of all software layers where a holistic performance engineering process guides code development across the classic boundaries of application, numerical method, and basic kernel library. Within ESSEX the numerical methods cover widely applicable solvers such as classic Krylov, JacobiDavidson, or the recent FEAST methods, as well as domainspecific iterative schemes relevant for the ESSEX quantum physics application. This report introduces the project structure and presents selected results which demonstrate the potential impact of ESSEX for efficient sparse solvers on highly scalable heterogeneous supercomputers.

[11] 
Martin Galgon, Lukas Krämer, Bruno Lang, Andreas Alvermann, Holger
Fehske, and Andreas Pieper.
Improving robustness of the FEAST algorithm and solving eigenvalue
problems from graphene nanoribbons.
Proc. Appl. Math. Mech., 14(1):821822, December 2014.
We consider the FEAST eigensolver, introduced by Polizzi in 2009 [5]. We describe an improvement concerning the reliability of the algorithm and discuss an application in the solution of eigenvalue problems from graphene modeling.

[12] 
Paul R. Willems and Bruno Lang.
A framework for the MR^{3} algorithm: Theory and
implementation.
SIAM J. Sci. Comput., 35(2):A740A766, 2013.
This paper provides a streamlined and modular presentation of the MR^{3} algorithm for computing selected eigenpairs of symmetric tridiagonal matrices, thus disentangling the principles driving MR^{3} and the (recursive) “core” algorithm from the specific (e.g., twisted) decompositions used to represent the matrices at different recursion depths and from the (dqds) transformations converting between them. Our approach allows a modular full proof for the correctness of the MR^{3} algorithm. This proof is based on five requirements concerning the interplay between the core algorithm and its subcomponents. These requirements can also guide in implementing the algorithm, because they expose quantities that can and should be monitored at runtime. Our new implementation XMR, which is based on the above analysis, is described and compared to xSTEMR from Lapack 3.2.2. Numerical experiments comparing the robustness and performance of both implementations are given.

[13] 
Lukas Krämer, Edoardo Di Napoli, Martin Galgon, Bruno Lang, and Paolo
Bientinesi.
Dissecting the FEAST algorithm for generalized eigenproblems.
J. Comput. Appl. Math., 244:19, May 2013.
We analyze the FEAST method for computing selected eigenvalues and eigenvectors of large sparse matrix pencils. After establishing the close connection between FEAST and the wellknown RayleighRitz method, we identify several critical issues that influence convergence and accuracy of the solver: the choice of the starting vector space, the stopping criterion, how the inner linear systems impact the quality of the solution, and the use of FEAST for computing eigenpairs from multiple intervals. We complement the study with numerical examples, and hint at possible improvements to overcome the existing problems.

[14]  Andreas Marek, Volker Blum, Rainer Johanni, Ville Havu, Bruno Lang, Thomas Auckenthaler, Alexander Heinecke, HansJoachim Bungartz, and Hermann Lederer. The ELPA library  scalable parallel eigenvalue solutions for electronic structure theory and computational science. Scientific highlight of the month, Ψ_{k} (http://www.psik.org), December 2013. 
[15] 
Elke Just and Bruno Lang.
Successguided selection of expanded systems for resultverifying
nonlinear solvers.
Reliab. Comput., 16:7383, March 2012.
Most resultverifying solvers for nonlinear systems include a branchandbound algorithmic backbone, complemented with accelerating methods such as Constraint Propagation, Interval Newton, and many others. Often the effectiveness of the accelerators can be improved if one works not only with the given system but also with expanded systems, which are obtained by introducing additional variables for suitable subterms. While reducing the overall number of boxes that must be considered during the solution, applying the accelerators to the (often much larger) expanded systems can increase significantly the time spent on a single box. Therefore a good strategy for deciding which methods to apply to which expanded system is essential for the performance and robustness of the whole solver.

[16] 
Paul R. Willems and Bruno Lang.
The MR^{3}GK algorithm for the bidiagonal SVD.
Electron. Trans. Numer. Anal., 39:121, 2012.
Determining the singular value decomposition of a bidiagonal matrix is a frequent subtask in numerical computations. We shed new light on a longknown way to utilize the algorithm of multiple relatively robust representations, MR^{3}, for this task by casting the singular value problem in terms of a suitable tridiagonal symmetric eigenproblem (via the GolubKahan matrix). Just running MR^{3} “as is” on the tridiagonal problem does not work, as has been observed before (e.g., B. Großer and B. Lang, An O(n^{2}) algorithm for the bidiagonal SVD, Linear Algebra Appl., 358 (2003), pp. 4570). In this paper we give more detailed explanations for the problems with running MR^{3} as a black box solver on the GolubKahan matrix. We show that, in contrast to standing opinion, MR^{3} can be run safely on the GolubKahan matrix, with just a minor modification. A proof including error bounds is given for this claim.

[17] 
Paul R. Willems and Bruno Lang.
Twisted factorizations and qdtype transformations for the
MR^{3} algorithmnew representations and analysis.
SIAM J. Matrix Anal. Appl., 33(2):523553, 2012.
Bidiagonal and twisted factorizations play a prominent role in the MR^{3} algorithm for computing partial eigensystems of symmetric tridiagonal matrices. Bidiagonal factorizations of shifted symmetric tridiagonals T  τI also underly, at least implicitly, the evaluation of the Sturm count for eigenvalue computations. In this paper we propose new representations (erep, Zrep) for bidiagonal and twisted factorizations relying on other quantities than the usually employed nontrivial entries of the bidiagonal factors (Nrep). We show that the qd algorithms used for shifting the factorizations, e.g., LDL^{*}  τI =: L^{+}D^{+}(L^{+})^{*}, and for converting between them, can be formulated to work with these representations in a mixed stable way. In fact, the Zrepresentation achieves lower bounds for the necessary relative perturbations of the inputs and outputs than the traditional Nrep does. To obtain sharp bounds for accumulating perturbations we propose a new notation that extends Higham's [N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA, 2nd ed., 2002] by including secondorder terms to provide sharper firstorder bounds.

[18] 
Paul R. Willems and Bruno Lang.
Block factorizations and qdtype transformations for the
MR^{3} algorithm.
Electron. Trans. Numer. Anal., 38:363400, 2011.
Factorizing symmetric tridiagonal matrices and propagating the factorizations to shifted matrices are central tasks in the MR^{3} algorithm for computing partial eigensystems. In this paper we propose block bidiagonal factorizations LDL^{*} with 1 ×1 and 2 ×2 blocks in D as an alternative to the bidiagonal and twisted factorizations used hitherto. With block factorizations, the element growth can be reduced (or avoided altogether), which is essential for the success of the MR^{3} algorithm, in particular if the latter is used to determine the singular value decomposition of bidiagonal matrices. We show that the qd algorithm used for shifting bidiagonal factorizations, e.g., LDL^{*}  τI =: L^{+}D^{+}(L^{+})^{*} can be extended to work with blocks in a mixed stable way, including criteria for determining a suitable block structure dynamically.

[19] 
Xiaojun Chen, Andreas Frommer, and Bruno Lang.
Computational existence proofs for spherical tdesigns.
Numer. Math., 117(2):289305, February 2011.
Spherical tdesigns provide quadrature rules for the sphere which are exact for polynomials up to degree t. In this paper, we propose a computational algorithm based on interval arithmetic which, for given t, upon successful completion will have proved the existence of a tdesign with (t+1)^{2} nodes on the unit sphere S^{2} ^{3} and will have computed narrow interval enclosures which are known to contain these nodes with mathematical certainty. Since there is no theoretical result which proves the existence of a tdesign with (t+1)^{2} nodes for arbitrary t, our method contributes to the theory because it was tested successfully for t=1, 2, ..., 100. The tdesign is usually not unique; our method aims at finding a wellconditioned one. The method relies on computing an interval enclosure for the zero of a highly nonlinear system of dimension (t+1)^{2}. We therefore develop several special approaches which allow us to use interval arithmetic efficiently in this particular situation. The computations were all done using the MATLAB toolbox INTLAB.

[20] 
T. Auckenthaler, H.J. Bungartz, T. Huckle, L. Krämer, B. Lang, and
P. Willems.
Developing algorithms and software for the parallel solution of the
symmetric eigenvalue problem.
J. Comput. Sci., 2(3):272278, August 2011.
Nowadays, the development, maintenance, and ongoing adaptation of simulation software due to new algorithmic or hardware developments are highly complex tasks involving larger teams, often from different groups and disciplines, and located at different places. This requires an increased use of methods and tools from software engineering. At the same time, the high computational demands from the fields of application make it necessary to optimize the modules for code performance and scalability in order to fully exploit the potential of modern parallel architectures.

[21] 
T. Auckenthaler, V. Blum, H.J. Bungartz, T. Huckle, R. Johanni,
L. Krämer, B. Lang, H. Lederer, and P. R. Willems.
Parallel solution of partial symmetric eigenvalue problems from
electronic structure calculations.
Parallel Comput., 37(12):783794, December 2011.
The computation of selected eigenvalues and eigenvectors of a symmetric (Hermitian) matrix is an important subtask in many contexts, for example in electronic structure calculations. If a significant portion of the eigensystem is required then typically direct eigensolvers are used. The central three steps are: reduce the matrix to tridiagonal form, compute the eigenpairs of the tridiagonal matrix, and transform the eigenvectors back. To better utilize memory hierarchies, the reduction may be effected in two stages: full to banded, and banded to tridiagonal. Then the back transformation of the eigenvectors also involves two stages. For large problems, the eigensystem calculations can be the computational bottleneck, in particular with large numbers of processors. In this paper we discuss variants of the tridiagonaltobanded back transformation, improving the parallel efficiency for large numbers of processors as well as the perprocessor utilization. We also modify the divideandconquer algorithm for symmetric tridiagonal matrices such that it can compute a subset of the eigenpairs at reduced cost. The effectiveness of our modifications is demonstrated with numerical experiments.

[22] 
Martin Galgon, Lukas Krämer, and Bruno Lang.
The FEAST algorithm for large sparse eigenvalue problems.
Proc. Appl. Math. Mech., 11(1):747748, December 2011.
We consider the eigensolver named FEAST that was introduced by Polizzi in 2009 [1]. This solver, tailored to the (partial) solution of large sparse (generalized) eigenproblems offers good potential for parallelism and showed good robustness in our experiments. We briefly introduce the algorithm, point out some problems and give two examples.

[23] 
Thomas Beelitz, Bruno Lang, Peer Ueberholz, and Paul Willems.
Closing the case t = 3 for 3D spherical tdesigns using a
resultverifying nonlinear solver.
Reliab. Comput., 14:6677, June 2010.
The question if there exists an Npoint spherical tdesign is not yet settled for all combinations of t and N. Using our framework SONIC for the solution of nonlinear systems, we were able to close the two remaining open cases for t = 3. More precisely, a computational proof revealed that there are no spherical 3designs with N = 7 or N = 9 points. We describe how these results were obtained and comment on the open cases for larger values of t.

[24] 
Thomas Beelitz, Andreas Frommer, Bruno Lang, and Paul Willems.
A framework for existence tests based on the topological degree and
homotopy.
Numer. Math., 111(4):493507, February 2009.
The invariance of the topological degree under certain homotopies is used to derive a framework for tests to computationally prove the existence of zeros of nonlinear mappings in R^{n}. These tests use interval arithmetic to enclose the range of a function over a box and are provably more general than many other tests like the MooreKioustelidis test, a test based on the Krawczyk operator, and another degreebased test published recently. A numerical example is included.

[25] 
Martin Mönnigmann, Wolfgang Marquardt, Christian H. Bischof, Thomas
Beelitz, Bruno Lang, and Paul Willems.
A hybrid approach for efficient robust design of dynamic systems.
SIAM Rev., 49(2):236254, 2007.
We propose a novel approach for the parametrically robust design of dynamic systems. The approach can be applied to system models with parameters that are uncertain in the sense that values for these parameters are not known precisely, but only within certain bounds. The novel approach is guaranteed to find an optimal steady state that is stable for each parameter combination within these bounds.

[26] 
J. Bloch, A. Frommer, B. Lang, and T. Wettig.
An iterative method to compute the sign function of a nonHermitian
matrix and its application to the overlap Dirac operator at nonzero
chemical potential.
Comput. Phys. Comm., 177(2):933943, December 2007.
The overlap Dirac operator in lattice QCD requires the computation of the sign function of a matrix. While this matrix is usually Hermitian, it becomes nonHermitian in the presence of a quark chemical potential. We show how the action of the sign function of a nonHermitian matrix on an arbitrary vector can be computed efficiently on large lattices by an iterative method. A Krylov subspace approximation based on the Arnoldi algorithm is described for the evaluation of a generic matrix function. The efficiency of the method is spoiled when the matrix has eigenvalues close to a function discontinuity. This is cured by adding a small number of critical eigenvectors to the Krylov subspace, for which we propose two different deflation schemes. The ensuing modified Arnoldi method is then applied to the sign function, which has a discontinuity along the imaginary axis. The numerical results clearly show the improved efficiency of the method. Our modification is particularly effective when the action of the sign function of the same matrix has to be computed many times on different vectors, e.g., if the overlap Dirac operator is inverted using an iterative method.

[27] 
Christian H. Bischof, H. Martin Bücker, Arno Rasch, Emil Slusanschi, and
Bruno Lang.
Automatic differentiation of the generalpurpose computational fluid
dynamics package FLUENT.
ASME J. Fluids Engrg., 129(5):652658, May 2007.
Derivatives are a crucial ingredient to a broad variety of computational techniques in science and engineering. While numerical approaches for evaluating derivatives suffer from truncation error, automatic differentiation is accurate up to machine precision. The term automatic differentiation comprises a set of techniques for mechanically transforming a given computer program to another one capable of evaluating derivatives. A common misconception about automatic differentiation is that this technique only works on local pieces of fairly simple code. Here, it is shown that automatic differentiation is not only applicable to small academic codes, but scales to advanced industrial software packages. In particular, the generalpurpose computational fluid dynamics software package FLUENT is transformed by automatic differentiation.

[28] 
Tina Erica Odaka, Vladen V. Melnikov, Per Jensen, Tsuneo Hirano, Bruno Lang,
and Peter Langer.
Theoretical study of the double Renner effect for
A ^{2}Π MgNC/MgCN: Higher excited rovibrational states.
J. Chem. Phys., 126(9):094301, March 2007.
We report here the implementation of a newly developed, highly efficient matrix diagonalization routine in the DR program [T. E. Odaka, P. Jensen, and T. Hirano, J. Mol. Structure 795, 14 (2006)]. The DR program solves the rovibronic Schrödinger equation for a triatomic molecule with a double Renner effect, i.e., with two accessible linear arrangements of the nuclei at which the electronic energy is doubly degenerate. With the new routines, we can extend the DR calculations of rovibronic energies for A^{2} Π MgNC/MgCN by considering a much larger set of rovibronic states, in particular states at higher J values, than we were able to access previously.

[29] 
Andreas Frommer and Bruno Lang.
Fast and accurate multiargument interval evaluation of polynomials.
In Proc. SCAN 2006, 12th International Symposium on Scientific
Computing, Computer Arithmetic, and Validated Numerics, September 2629,
2006, Duisburg, Germany, page 31. IEEE Computer Society, 2007.
The verification of the existence of certain spherical tdesigns involves the evaluation of a degreet polynomial J_{t} at a very large number of (interval) arguments. To make the overall verification process feasible computationally, this evaluation must be fast, and the enclosures for the function values must be affected with only modest overestimation. We discuss several approaches for multiargument interval evaluation of the polynomial J_{t} and show how they can be adapted to other polynomials p. One particularly effective new method is based on expanding the polynomial p around several points ξ_{j} and truncating each resulting expansion p_{ξ_j} to a lowerdegree polynomial.

[30] 
Andreas Frommer, Fatmir Hoxha, and Bruno Lang.
Proving the existence of zeros using the topological degree and
interval arithmetic.
J. Comput. Appl. Math., 199(2):397402, February 2007.
The invariance of the topological degree under certain homotopies is used to computationally prove the existence of zeros of nonlinear mappings in R^{n}. These existence tests use interval arithmetic to enclose the range of a function over a box. We show that our test is more general than a wellknown test based on Miranda's theorem, and we show by a numerical example that the new test can be successful on substantially larger boxes.

[31] 
Jacques Bloch, Tilo Wettig, Andreas Frommer, and Bruno Lang.
An iterative method to compute the overlap Dirac operator at
nonzero chemical potential.
In Gunnar Bali et al., editors, Proc. Lattice 2007, XXVth
Intl. Symp. Lattice Field Theory, July 30August 4, 2007, Regensburg,
Germany, Proceedings of Science, page 169, Trieste, Italy, 2007. SISSA.
The overlap Dirac operator at nonzero quark chemical potential involves the computation of the sign of a nonHermitian matrix. In this talk we present an iterative method, first proposed by us in Ref. [1], which allows for an efficient computation of the operator, even on large lattices. The starting point is a Krylov subspace approximation, based on the Arnoldi algorithm, for the evaluation of a generic matrix function. The efficiency of this method is spoiled when the matrix has eigenvalues close to a function discontinuity. To cure this, a small number of critical eigenvectors are added to the Krylov subspace, and two different deflation schemes are proposed in this augmented subspace. The ensuing method is then applied to the sign function of the overlap Dirac operator, for two different lattice sizes. The sign function has a discontinuity along the imaginary axis and the numerical results show how deflation dramatically improves the efficiency of the method.

[32] 
Ivo Kabadshow and Bruno Lang.
Latencyoptimized parallelization of the FMM nearfield
computations.
In Yong Shi, Geert Dick van Albada, Jack Dongarra, and Peter M. A.
Sloot, editors, Computational Science, Proc. ICCS 2007, 7th
International Conference on Computational Science, May 2730, 2007, Beijing,
China, Part I, volume 4487 of LNCS, pages 716722, Berlin, 2007.
SpringerVerlag.
In this paper we present a new parallelization scheme for the FMM nearfield. The parallelization is based on the Global Arrays Toolkit and uses onesided communication with overlapping. It employs a purely static loadbalancing approach to minimize the number of communication steps and benefits from a maximum utilization of data locality. In contrast to other implementations the communication is initiated by the process owning the data via a put call, not the process receiving the data (via a get call).

[33] 
Paul R. Willems, Bruno Lang, and Christof Vömel.
Computing the bidiagonal SVD using multiple relatively robust
representations.
SIAM J. Matrix Anal. Appl., 28(4):907926, 2006.
We describe the design and implementation of a new algorithm for computing the singular value decomposition of a real bidiagonal matrix. This algorithm uses ideas developed by Großer and Lang that extend Parlett's and Dhillon's MRRR algorithm for the tridiagonal symmetric eigenproblem. One key feature of our new implementation is, that k singular triplets can be computed using only O(nk) storage units and floating point operations, where n is the dimension of the matrix. The algorithm will be made available as routine xBDSCR in the upcoming new release of the LAPACK library.

[34] 
Tatjana Eitrich and Bruno Lang.
Efficient optimization of support vector machine learning parameters
for unbalanced datasets.
J. Comput. Appl. Math., 196(2):425436, November 2006.
Support vector machines are powerful kernel methods for classification and regression tasks. If trained optimally, they produce excellent separating hyperplanes. The quality of the training, however, depends not only on the given training data but also on additional learning parameters, which are difficult to adjust, in particular for unbalanced datasets. Traditionally, grid search techniques have been used for determining suitable values for these parameters. In this paper we propose an automated approach to adjusting the learning parameters using a derivativefree numerical optimizer. To make the optimization process more efficient, a new sensitive quality measure is introduced. Numerical tests with a wellknown dataset show that our approach can produce support vector machines that are very well tuned to their classification tasks.

[35] 
Thomas Beelitz, Bruno Lang, and Christian H. Bischof.
Efficient task scheduling in the parallel resultverifying solution
of nonlinear systems.
Reliab. Comput., 12(2):141151, April 2006.
Nonlinear systems occur in diverse applications, e.g., in the steady state analysis of chemical processes. If safety concerns require the results to be provably correct then resultverifying algorithms relying on interval arithmetic should be used for solving these systems. Since such algorithms are very computationally intensive, the coarsegrained interbox parallelism should be exploited to make them feasible in practice. In this paper we briefly describe our framework SONIC for the verified solution of nonlinear systems and give detailed information about its parallelization with OpenMP and MPI. Our numerical results show that the implemented parallelization schemes are indeed successful. The more sophisticated MPI implementation seems to be superior to the easytoimplement OpenMP version and shows almost linear speedup up to a large number of processors.

[36] 
Tatjana Eitrich and Bruno Lang.
On the efficient implementation of a serial and parallel
decomposition algorithm for fast support vector machine training including a
multiparameter kernel.
Int. J. Comput. Intell., 3(2):9198, 2006.
This work deals with aspects of support vector machine learning for largescale data mining tasks. Based on a decomposition algorithm for support vector machine training that can be run in serial as well as shared memory parallel mode we introduce a transformation of the training data that allows for the usage of an expensive generalized kernel without additional costs. We present experiments for the Gaussian kernel, but usage of other kernel functions is possible, too. In order to further speed up the decomposition algorithm we analyze the critical problem of working set selection for large training data sets. In addition, we analyze the influence of the working set sizes onto the scalability of the parallel decomposition scheme. Our tests and conclusions led to several modifications of the algorithm and the improvement of overall support vector machine learning performance. Our method allows for using extensive parameter search methods to optimize classification accuracy.

[37] 
Tatjana Eitrich and Bruno Lang.
Data mining with parallel support vector machines for classification.
In Tatyana Yakhno and Erich J. Neuhold, editors, Proc. ADVIS
2006, 4th Intl. Conf. on Advances in Information Systems, October 1820,
2006, Izmir, Turkey, volume 4243 of LNCS, pages 197206, Berlin,
2006. SpringerVerlag.
The increasing amount of data used for classification, as well as the demand for complex models with a large number of well tuned parameters, naturally lead to the search for efficient approaches making use of massively parallel systems. We describe the parallelization of support vector machine learning for shared memory systems. The support vector machine is a powerful and reliable data mining method. Our learning algorithm relies on a decomposition scheme, which in turn uses a special variable projection method, for solving the quadratic program associated with support vector machine learning. By using hybrid parallel programming, our parallelization approach can be combined with the parallelism of a distributed cross validation routine and parallel parameter optimization methods.

[38] 
Tatjana Eitrich and Bruno Lang.
On the advantages of weighted L_{1}norm support vector learning
for unbalanced binary classification problems.
In Proc. IS'06, 3rd Intl. IEEE Conf. Intelligent Systems,
September 46, 2006, University of Westminster, UK, pages 575580. IEEE
Computer Society, September 2006.
In this paper we analyze support vector machine classification using the soft margin approach that allows for errors and margin violations during the training stage. Two models for learning the separating hyperplane do exist. We study the behavior of the resulting optimization algorithms in terms of training time and test accuracy for unbalanced data sets. The main goal of our work is to compare the features of the resulting classification functions, which are mainly defined by the support vectors arising during the support vector machine training.

[39] 
Tatjana Eitrich and Bruno Lang.
On the optimal working set size in serial and parallel support vector
machine learning with the decomposition algorithm.
In Peter Christen, Paul J. Kennedy, Jiuyong Li, Simeon J. Simoff, and
Graham J. Williams, editors, Data Mining and Analytics 2006, Proc. Fifth Australasian Data Mining Conference (AusDM2006), November 2930, 2006,
Sydney, Australia, volume 61 of Conferences in Research and Practice in
Information Technology (CRPIT), pages 121128. Australian Computer Society,
Inc., 2006.
The support vector machine (SVM) is a wellestablished and accurate supervised learning method for the classification of data in various application fields. The statistical learning task  the socalled training  can be formulated as a quadratic optimization problem. During the last years the decomposition algorithm for solving this optimization problem became the most frequently used method for support vector machine learning and is the basis of many SVM implementations today. It is characterized by an internal parameter called working set size. Usually small working sets have been assigned. The increasing amount of data used for classification led to new parallel implementations of the decomposition method with efficient inner solvers. With these solvers larger working sets can be assigned. It was shown, that for parallel training with the decomposition algorithm large working sets achieve good speedup values. However, the choice of the optimal working set size for parallel training is not clear. In this paper, we show how the working set size influences the number of decomposition steps, the number of kernel function evaluations and the overall training time in serial and parallel computation.

[40] 
Tatjana Eitrich, Wolfgang Frings, and Bruno Lang.
HyParSVMa new hybrid parallel software for support vector machine
learning on SMP clusters.
In Wolfgang E. Nagel, Wolfgang W. Walter, and Wolfgang Lehner,
editors, Parallel Processing, Proc. EuroPar 2006, 12th European
Conference on Parallel Computing, August 29September 1, 2006, Dresden,
Germany, volume 4128 of LNCS, pages 350359. SpringerVerlag, 2006.
In this paper we describe a new hybrid distributed/shared memory parallel software for support vector machine learning on large data sets. The support vector machine (SVM) method is a wellknown and reliable machine learning technique for classification and regression tasks. Based on a recently developed shared memory decomposition algorithm for support vector machine classifier design we increased the level of parallelism by implementing a cross validation routine based on message passing. With this extention we obtained a flexible parallel SVM software that can be used on highend machines with SMP architectures to process the large data sets that arise more and more in bioinformatics and other fields of research.

[41] 
Tatjana Eitrich, Bruno Lang, and Achim Streit.
Customizing the APPSPACK software for parallel parameter tuning of
a hybrid parallel support vector machine.
In Giuseppe Di Fatta, Michael R. Berthold, and Srinivasan
Parthasarathy, editors, Proc. PDM 2006, Workshop on Parallel Data
Mining in conjunction with ECML/PKDD 2006, September 1822, 2006, Berlin,
Germany, pages 3850, 2006.
We consider the problem of tuning parameters for the learning method support vector machine. The APPSPACK software, an asynchronous parameter tuning method, is well suited for SVM parameter fitting due to several characteristics. No derivative information is needed for bound constrained optimization and the code can be run in parallel mode. Recently, a hybrid parallel support vector machine has been proposed. To couple both parallel packages, the APPSPACK software needs to be customized to allow for a parallel function evaluation in addition to the parallelism provided by APPSPACK. In this paper we describe our customization of the APPSPACK software to facilitate a top down parallelism in SVM parameter tuning.

[42] 
Tatjana Eitrich and Bruno Lang.
Parallel costsensitive support vector machine software for
classification.
In Ulrich H. E. Hansmann, Jan Meinke, Sandipan Mohanty, and Olav
Zimmermann, editors, Proc. Workshop From Computational Biophysics to
Systems Biology, June 0609, 2006, Jülich, Germany, volume 34 of
NIC Series, pages 141144. John von Neumann Institute for Computing,
Jülich, 2006.


[43] 
Wolfgang Marquardt, Martin Mönnigmann, Thomas Beelitz, Bruno Lang, Paul
Willems, and Christan H. Bischof.
Robust design of dynamic systems.
ECMI Newsletter, 39:15, March 2006.


[44] 
Andreas Frommer and Bruno Lang.
Thematic section: Resultverifying computing.
ECMI Newsletter, 39:518, March 2006.


[45] 
C. H. Bischof, H. M. Bücker, B. Lang, A. Rasch, and E. Slusanschi.
Efficient and accurate derivatives for a software process chain in
airfoil shape optimization.
Future Gen. Comput. Syst., 21(8):13331344, October 2005.
When using a Newtonbased numerical algorithm to optimize the shape of an airfoil with respect to certain design parameters, a crucial ingredient is the derivative of the objective function with respect to the design parameters. In largescale aerodynamics, this objective function is an output of a computational fluid dynamics program written in a highlevel programming language such as Fortran or C.Numerical differentiation is commonly used to approximate derivatives but is subject to truncation and subtractive cancellation errors. For a particular twodimensional airfoil, we instead apply automatic differentiation to compute accurate derivatives of the lift and drag coefficients with respect to geometric shape parameters. In automatic differentiation, a given program is transformed into another program capable of computing the original function together with its derivatives. In the problem at hand, the objective function consists of a sequence of programs: a MATLAB program followed by two Fortran 77 programs. It is shown how automatic differentiation is applied to a sequence of programs while keeping the computational complexity within reasonable limits. The derivatives computed by automatic differentiation are compared with approximations based on divided differences.

[46] 
Andreas Frommer and Bruno Lang.
Existence tests for solutions of nonlinear equations using Borsuk's
theorem.
SIAM J. Numer. Anal., 43(3):13481361, 2005.
We show how interval arithmetic can be used in connection with Borsuk's theorem to computationally prove the existence of a solution of a system of nonlinear equations. It turns out that this new test, which can be checked computationally in several different ways, is more general than an existing test based on Miranda's theorem in the sense that it is successful for a larger set of situations. A numerical example is included.

[47] 
Benedikt Großer and Bruno Lang.
On symmetric eigenproblems induced by the bidiagonal SVD.
SIAM J. Matrix Anal. Appl., 26(3):599620, 2005.
The relatively robust representations (RRR) algorithm is the method of choice to compute highly accurate eigenvector approximations for symmetric tridiagonal matrices. The task of computing singular vector pairs for a bidiagonal matrix B = UΣV^{T} is closely connected to the RRR algorithm regarding B^{T}B, BB^{T}, or the GolubKahan matrix T_{GK}. Nevertheless, separate application of the RRR algorithm to these matrices leads to poor results regarding either numerical orthogonality or the residual BV  UΣ . It turns out that the coupling strategy proposed in [B. Grosser and B. Lang, Linear Algebra Appl., 358 (2003), pp. 4570] resolves this problem. This article provides the corresponding perturbation theory: We compare the eigenvalues of the separate and coupled decompositions and explain why singular vector pairs approximated via couplings are of superior quality.

[48] 
Rainer Steiger, Christian H. Bischof, Bruno Lang, and Walter Thiel.
Using automatic differentiation to compute derivatives of a
quantumchemical computer program.
Future Gen. Comput. Syst., 21(8):13241332, October 2005.
The ADIFOR 2.0 tool for Automatic Differentiation of Fortran programs has been used to generate analytic gradient code for all semiempirical SCF methods available in the MNDO97 program. The correctness and accuracy of the new code have been verified. Its performance has been compared with that of handcoded analytic derivative routines and with numerical differentiation. From a quantumchemical point of view, the major advance of this work is the development of previously unavailable analytic gradient code for the recently proposed OM1 and OM2 methods.

[49] 
Tatjana Eitrich and Bruno Lang.
Analysis of support vector machine training costs for large and
unbalanced data from pharmaceutical industry.
In Proc. 1st ICGST Intl. Conf. Artificial Intelligence and
Machine Learning (AIML05), December 1921, 2005, Cairo, Egypt, pages
5864. ICGST, 2005.
Modern classification methods are able to analyze large, complex and sometimes also unbalanced data sets. It is important not only to produce good results but also to control the time to achieve them. High computational costs lead to the exclusion of data mining methods even in case of good accuracy. Our paper deals with support vector machines in view of the consumption of CPU time. We study their learning behavior for unbalanced data sets with increasing size. We also examine the question whether it is necessary and practicable to parallelize this method.

[50] 
Tatjana Eitrich and Bruno Lang.
Parallel tuning of support vector machine learning parameters for
large and unbalanced data sets.
In Michael R. Berthold, Robert Glen, Kai Diederichs, Oliver
Kohlbacher, and Ingrid Fischer, editors, Computational Life Sciences:
First International Symposium, CompLife 2005, Konstanz, Germany, September
2527, Proceedings, volume 3695 of LNBI, pages 253264, Heidelberg,
2005. SpringerVerlag.
We consider the problem of selecting and tuning learning parameters for support vector machines, especially for the classification of large and unbalanced data sets. We show why and how simple models with few parameters should be refined and propose an automated approach for tuning the increased number of parameters in the extended model. Based on a sensitive quality measure we analyze correlations between the number of parameters, the learning time and the performance of the trained SVM in classifying independent test data. In addition we study the influence of the quality measure on the classification performance and compare the behavior of serial and asynchronous parallel parameter tuning on an IBM p690 cluster.

[51] 
Jens Gräbel, Bruno Lang, and Peer Ueberholz.
Performance optimization for the parallel GaussSeidel smoother.
Proc. Appl. Math. Mech., 5(1):831832, 2005.
Simulation, e.g., in the field of computational fluid dynamics, accounts for a major part of the computing time on highperformance systems. Many simulation packages still rely on GaussSeidel iteration, either as the main linear solver or as a smoother for multigrid schemes. Straightforward implementations of this solver have efficiency problems on today's most common highperformance computers, i.e., multiprocessor clusters with pronounced memory hierarchies. In this work we present two simple techniques for improving the performance of the parallel GaussSeidel method for the 3D Poisson equation by optimizing cache usage as well as reducing the number of communication steps.

[52] 
Thomas Beelitz, Andreas Frommer, Bruno Lang, and Paul Willems.
Symbolicnumeric techniques for solving nonlinear systems.
Proc. Appl. Math. Mech., 5(1):705708, December 2005.
We describe a multilevel technique combining symbolic and numeric methods to obtain guaranteed enclosures for all solutions of a nonlinear system within a given search box.

[53] 
A. Frommer, B. Lang, and M. Schnurr.
A comparison of the Moore and Miranda existence tests.
Computing, 72(34):349354, June 2004.
We compare two computational tests for the existence of a zero of a nonlinear system. One of the tests is based on a theorem by Moore, the other relies on Miranda's theorem. It turns out that the Miranda test is always at least as powerful as the Moore test. We also indicate some conditions under which both tests are equivalent.

[54] 
Thomas Beelitz, Christian Bischof, Bruno Lang, and Klaus Schulte Althoff.
Resultverifying solution of nonlinear systems in the analysis of
chemical processes.
In René Alt, Andreas Frommer, R. Baker Kearfott, and Wolfram
Luther, editors, Numerical Software with Result Verification  Proc. Intl. Dagstuhl Seminar, January 1924, 2003, Dagstuhl Castle, Germany,
number 2991 in LNCS, pages 198205. Springer, Berlin, 2004.
A framework for the verified solution of nonlinear systems arising in the analysis and design of chemical processes is described. The framework combines a symbolic preprocessing step with an intervalbased branchandbound solver whose efficiency is increased with several acceleration techniques. One of these methods is based on order2 Taylor expansion; it is also discussed in this note.

[55] 
Stefanie Krivsky and Bruno Lang.
Using interval arithmetic for determining the structure of convex
hulls.
Numer. Algorithms, 37(14):233240, December 2004.
We discuss the use of interval arithmetic in the computation of the convex hull of n points in D dimensions. Convex hull algorithms rely on simple geometric tests, such as “does some point p lie in a certain halfspace or affine subspace?” to determine the structure of the hull. These tests in turn can be carried out by solving appropriate (not necessarily square) linear systems. If intervalbased methods are used for the solution of these systems then the correct hull can be obtained at lower cost than with exact arithmetic. In addition, intervalbased methods can determine the topological structure of the convex hull even if the position of the points is not known exactly. In the present paper we compare various interval linear solvers with respect to their ability to handle closetopathological situations. This property determines how often interval arithmetic cannot provide the required information and therefore some computations must be redone with exact arithmetic.

[56] 
Thomas Beelitz, Christian H. Bischof, and Bruno Lang.
A hybrid subdivision strategy for resultverifying nonlinear solvers.
Proc. Appl. Math. Mech., 4:632633, 2004.
Many different heuristics have been proposed for selecting the subdivision direction in branchandbound resultverifying nonlinear solvers. We investigate the impact of the boxsplitting techniques on the overall performance of the solver and propose a new approach combining some of the simple heuristics in a hybrid way. Numerical experiments with mediumsized example problems indicate that our approach is successful.

[57] 
Andreas Frommer and Bruno Lang.
On preconditioners for the Borsuk existence test.
Proc. Appl. Math. Mech., 4:638639, 2004.
Existence or fixed point theorems, combined with interval analytic methods, provide a means to computationally prove the existence of a zero of a nonlinear system in a given interval vector. One such test is based on Borsuk's existence theorem. We discuss preconditioning techniques that are aimed at improving the effectiveness of this test.

[58] 
H. Martin Bücker, Bruno Lang, HansJoachim Pflug, and Andre Vehreschild.
Threads in an undergraduate course: A Java example illuminating
different multithreading approaches.
In A. Laganà, M. L. Gavrilova, V. Kumar, Y. Mun, C. J. K. Tan,
and O. Gervasi, editors, Computational Science and Its Applications 
ICCSA 2004, Proc. Intl. Conf. Computational Science and its Applications,
Assisi, Italy, May 1417, 2004. Part II, number 3044 in LNCS, pages
882891. Springer, Berlin, 2004.
Multithreading is a fundamental approach to expressing parallelism in programs. Since Java is emerging as the de facto standard language for platform independent software development in higher education, there is need for teaching multithreading in the context of Java. We use a simple problem from scientific computing to explain two different multithreading approaches to secondyear students. More precisely, a simple boundary value problem is considered, which makes visible the differences between native Java threads and the OpenMP interface. So, the students are able to appreciate the respective merits and drawbacks of a thread package that is integrated into the Java programming language and an approach combining compiler support with library calls.

[59] 
Thomas Beelitz, Christian H. Bischof, Bruno Lang, and Paul Willems.
SONICA framework for the rigorous solution of nonlinear
problems.
Report BUWSC 04/7, Fachbereich Mathematik und Naturwissenschaften,
Bergische Universität Wuppertal, 2004.
[ Preprint ]
SONIC is a new rigorous nonlinear solver and constrained optimizer based on interval computations. Having been developed primarily to support the parametrically robust design of dynamic systems, SONIC can also be used to solve nonlinear problems arising in other application areas. Since its highlevel solvers can be coupled with several different basic interval libraries, it can achieve optimized performance while maintaining a high degree of portability. In addition to the serial code, efficient parallel versions for shared and distributed memory architectures are available.

[60] 
Benedikt Großer and Bruno Lang.
An O( n^{2} ) algorithm for the bidiagonal SVD.
Linear Algebra Appl., 358(13):4570, January 2003.
The RRR algorithm computes the eigendecomposition of a symmetric tridiagonal matrix T with an O(n^{2}) complexity. This article discusses how this method can be extended to the bidiagonal SVD B = U ΣV^{T}. It turns out that using the RRR algorithm as a black box to compute B^{T}B=VΣ^{2}V^{T} and BB^{T}=UΣ^{2}U^{T} separately may give poor results for B V  U Σ. The use of the standard JordanWielandt representation can fail as well if clusters of tiny singular values are present. A solution is to work on B^{T}B and to keep factorizations of BB^{T} implicitly. We introduce a set of coupling transformations which allow us to replace the representation u = (1)/(σ) B v by a more stable representation u = X v, where X is a diagonal matrix. Numerical results of our implementation are compared with the LAPACK routines DSTEGR, DBDSQR and DBDSDC.

[61] 
Christian H. Bischof, H. Martin Bücker, Bruno Lang, and Arno Rasch.
An interactive environment for supporting the paradigm shift from
simulation to optimization.
Scientific Prog., 11(4):263272, 2003.
Numerical simulation is a powerful tool in science and engineering, and it is also used for optimizing the design of products and experiments rather than only for reproducing the behavior of scientific and engineering systems. In order to reduce the number of simulation runs, the traditional “trial and error” approach for finding neartooptimum design parameters is more and more replaced with efficient numerical optimization algorithms. Done by hand, the coupling of simulation and optimization software is tedious and errorprone. In this note we introduce a software environment called EFCOSS (Environment For Combining Optimization and Simulation Software) that facilitates and speeds up this task by doing much of the required work automatically. Our framework includes support for automatic differentiation providing the derivatives required by many optimization algorithms. We describe the process of integrating the widely used computational fluid dynamics package FLUENT and a MINPACK1 least squares optimizer into EFCOSS and follow a sample session solving a data assimilation problem.

[62] 
Christian H. Bischof, H. Martin Bücker, Bruno Lang, and Arno Rasch.
Automated gradient calculation.
In Josef Ballmann, editor, Flow Modulation and FluidStructure
Interaction at Airplane Wings, number 84 in Notes on Numerical Fluid
Mechanics and Multidisciplinary Design, pages 205224. Springer, Berlin,
2003.
Automatic differentiation is a powerful technique for evaluating derivatives of functions given in the form of a highlevel programming language such as Fortran, C, or C++. The program is treated as a sequence of elementary statements to which the chain rule of differential calculus is applied mechanically. A key feature of this technique is its ability to generate accurate derivatives rather than approximations obtained from numerical differentiation like divided differences. We survey automatic differentiation and report on preliminary results that have been obtained applying the automatic differentiation tool ADIFOR to a largescale computational fluid dynamics solver, TFS, developed at Aerodynamisches Institut, Aachen University of Technology. This solver comprises approximately 24,000 lines of Fortran 77 and is able to resolve steady and unsteady laminar and turbulent flows in two and three dimensions.

[63] 
C. H. Bischof, H. M. Bücker, B. Lang, A. Rasch, and J. W. Risch.
Extending the functionality of the generalpurpose finite element
package SEPRAN by automatic differentiation.
Int. J. Numer. Meth. Engng., 58(14):22252238, December
2003.
From an abstract point of view, a numerical simulation implements a mathematical function that produces some output from some given input. Derivatives (or sensitivities) of the function's output with respect to its input can be obtainedfree from truncation errorby using a technique called automatic differentiation. Given a computer code in a highlevel programming language like Fortran, C, or, C++, automatic differentiation generates another code capable of computing not only the original function but also its derivatives. Thus, the application of automatic differentiation significantly extends the functionality of a simulation package. For instance, automatic differentiation enables, in a completely mechanical fashion, the usage of derivativebased optimization algorithms where the evaluation of the objective function comprises some given largescale engineering simulation. In this note, the automatic differentiation tool Adifor is used to transform the general purpose finite element package SEPRAN.In doing so, we automatically translate the given 400,000 lines of Fortran 77 into a new program consisting of 600,000 lines of Fortran 77. We compare our approach with a traditional approach based on numerical differentiation and quantify its advantages in terms of accuracy and computational efficiency for a standard fluid flow problem.

[64] 
H. M. Bücker, B. Lang, and C. H. Bischof.
Parallel programming in computational science: An introductory
practical training course for computer science at Aachen University.
Future Gen. Comput. Syst., 19(8):13091319, November 2003.
Parallel programming of highperformance computers has emerged as a key technology for the numerical solution of largescale problems arising in computational science and engineering (CSE). The authors believe that principles and techniques of parallel programming are among the essential ingredients of any CSE as well as computer science curriculum. Today, opinions on the role and importance of parallel programming are diverse. Rather than seeing it as a marginal beneficial skill optionally taught at the graduate level, we understand parallel programming as crucial basic skill that should be taught as an integral part of the undergraduate computer science curriculum. A practical training course developed for computer science undergraduates at Aachen University is described. Its goal is to introduce young computer science students to different parallel programming paradigms for shared and distributed memory computers as well as to give a first exposition to the field of computational science by simple, yet carefully chosen sample problems.

[65] 
Christian H. Bischof, H. Martin Bücker, Bruno Lang, and Arno Rasch.
Solving largescale optimization problems with EFCOSS.
Advances Engrg. Software, 34(10):633639, October 2003.
Derivatives play a prominent role in many areas of scientific computing. Traditionally, divided differences are employed to approximate derivatives, leading to results of dubious quality at often great computational expense. Automatic differentiation (AD), by contrast, is a powerful technique for accurately evaluating derivatives of functions described in a highlevel programming language. AD requires little human effort and produces derivatives without truncation error. Although there is no conceptual difference between small and large codes, applying AD to programs with hundreds of thousands lines of code is still a challenging task and requires a robust AD tool. We report on recent accomplishments of AD applied to the general purpose finite element package SEPRAN consisting of approximately 400,000 lines of Fortran77 and its integration into a prototype problem solving environment called EFCOSS supporting interoperability of simulation codes with optimization software using AD technology.

[66] 
Stefanie Krivsky and Bruno Lang.
Verified computation of higherdimensional convex hulls and the
solution of linear systems.
Electron. J. Math. Comput., 1:2135, March 2003.
In this paper we discuss the use of interval arithmetic in the computation of the convex hull of n points in D dimensions. If intervalbased methods are used for the solution of certain appropriate (not necessarily square) linear systems then the correct topological structure of the convex hull can be obtained at lower cost than with exact rational arithmetic. In addition, intervalbased methods can determine the topological structure of the convex hull even if the position of the points is not known exactly. In our experiments we compared various linear solvers with respect to their speed and their ability to handle closetopathological situations. The latter property determines how often interval arithmetic cannot provide the required information and therefore some computations must be redone with rational arithmetic.

[67] 
H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof.
Automatic parallelism in differentiation of Fourier transforms.
In ChinCheng Hung, Henry Hexmoor, et al., editors, Proc. 2003
ACM Symposium on Applied Computing (SAC 2003), March 912, 2003, Melbourne,
FL, pages 148152, New York, NY, 2003. Assoc. Comput. Mach.
For functions given in the form of a computer program, automatic differentiation is an efficient technique to accurately evaluate the derivatives of that function. Starting from a given computer program, automatic differentiation generates another program for the evaluation of the original function and its derivatives in a fully mechanical way. While the efficiency of this black box approach is already high as compared to numerical differentiation based on divided differences, automatic differentiation can be applied even more efficiently by taking into account highlevel knowledge about the given computer program. We show that, in the case where the function involves a Fourier transform, the degree of parallelism in the program generated by automatic differentiation can be increased leading to a rich set of automatic parallelization strategies that are not available when employing a black box automatic parallelization approach. Experiments of the new automatic parallelization approach are reported on a SunFire 6800 server using up to 20 processors.

[68] 
H. Martin Bücker, Bruno Lang, HansJoachim Pflug, and Andreas Wolf.
A hybrid MPIOpenMP implementation of the conjugate gradient
method in Java.
In Dieter an Mey, editor, Proc. EWOMP '03, Fifth European
Workshop on OpenMP, September 2226, 2003, Aachen, Germany, pages 205213,
Aachen, Germany, 2003. Center for Computing and Communication, Aachen
University.
The message passing paradigm is the dominating programming model for parallel processing on distributedmemory architectures. OpenMP is the most widely used parallel programming model for sharedmemory systems. Contemporary parallel computers consist of multiple sharedmemory computers connected via a network so that a hybrid parallelization approach combining message passing and OpenMP is often desired. We present a hybrid parallelization of an iterative method for the solution of linear systems with a large sparse symmetric coefficient matrix which is implemented in Java. To this end, we bring together mpiJava and JOMP and compare the performance on a Sun Fire 6800 system with a hybrid parallel version of the same problem implemented in Fortran. The scalability of both implementations is similar while, surprisingly, the performance of the Fortran version with data prefetching is only a factor of about 2 larger than the Java version when up to 22 processors are used.

[69] 
Christian H. Bischof, Bruno Lang, and Andre Vehreschild.
Automatic differentiation for MATLAB programs.
Proc. Appl. Math. Mech., 2(1):5053, March 2003.
Proc. GAMMJahrestagung, March 2528, 2002, Augsburg, Germany.
Derivative information is required in numerous applications, including sensitivity analysis and numerical optimization. For simple functions, symbolic differentiationdone either manually or with a computer algebra systemcan provide the derivatives, whereas divided differences (DD) have been used traditionally for functions defined by (potentially very complex) computer programs, even if only approximate values can be obtained this way. An alternative approach for such functions is automatic differentiation (AD), yielding exact derivatives at often lower cost than DD, and without restrictions on the program complexity. In this paper we compare the functionality and describe the use of ADMIT/ADMAT and ADiMat. These two AD tools provide derivatives for programs written in the MATLAB language, which is widely used for prototype and production software in scientific and engineering applications. While ADMIT/ADMAT implements a pure operator overloading approach of AD, ADiMat also employes source transformation techniques.

[70] 
Thomas Beelitz, Christian H. Bischof, and Bruno Lang.
Intervals and OpenMP: Towards an efficient parallel
resultverifying nonlinear solver.
In Dieter an Mey, editor, Proc. EWOMP '03, Fifth European
Workshop on OpenMP, September 2226, 2003, Aachen, Germany, pages 119125,
Aachen, Germany, 2003. Center for Computing and Communication, Aachen
University.
Nonlinear systems occur in diverse applications, i.e., in the steady state analysis of chemical processes. If safety concerns require the results to be provably correct then resultverifying algorithms relying on interval arithmetic should be used for solving these systems. Since such algorithms are very computationally intensive, parallelism must be exploited to make them feasible in practice. We describe our framework for the verified solution of nonlinear systems and our approach to parallelizing it with OpenMP. First numerical results show that the parallelization scheme is indeed successful but that the attainable speedup is limited for some unknown reason.

[71] 
H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof.
Computing sensitivities of the electrostatic potential by automatic
differentiation.
Comput. Phys. Comm., 147(12):720723, August 2002.
Given a computer model for the electrostatic potential in an Lshaped region with media of different dielectric permeabilities in two subregions, we are interested in the robustness of the simulation by identifying the rate of change of the potential with respect to a change in the permeabilities. Such sensitivity analyses, assessing the rate of change of certain model outputs implied by varying certain model inputs, can be carried out by computing the corresponding partial derivatives. In largescale computational physics, the underlying computer model is typically available as a complicated computer code in a highlevel programming language such as Fortran, C, or C++. To obtain accurate and efficient derivatives of functions given in this form, we use a technique called automatic or algorithmic differentiation. Unlike numerical differentiation based on divided differences, derivatives generated by automatic differentiation are free of truncation error. Here, the automatic differentiation tool Adifor is used to transform the given computer modelimplemented with the general purpose finite element package SEPRANinto a new computer code computing the derivatives of the electrostatic potential with respect to the dielectric permeabilities. In doing so, we automatically translate 400,000 lines of Fortran 77 into a new program consisting of 600,000 lines of Fortran 77. We compare our approach with a traditional approach based on numerical differentiation and quantify its advantages in terms of accuracy and computational efficiency.

[72] 
Christian H. Bischof, H. Martin Bücker, and Bruno Lang.
Automatic differentiation for computational finance.
In Erricos John Kontoghiorghes, Berç Rustem, and Stavros
Siokos, editors, Computational Methods in DecisionMaking, Economics and
Finance Proc., Neuchâtel, August 16, 2000, chapter 15, pages 297310.
Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.
Automatic differentiation (AD) is a powerful technique allowing to compute derivatives of a function given by a (potentially very large) piece of code. The basic principles of AD and some available tools implementing this technology are reviewed. AD is superior to divided differences because ADgenerated derivative values are free of approximation errors, and superior to symbolic differentiation because code of very high complexity can be handled, in contrast to computer algebra systems whose applicability is limited to rather simple functions. In addition, the cost for computing gradients of scalarvalued functions with either divided differences or symbolic differentiation grows linearly with the number of variables, whereas the socalled reverse mode of AD can compute such gradients at constant cost.

[73] 
C. H. Bischof, H. M. Bücker, B. Lang, A. Rasch, and A. Vehreschild.
Combining source transformation and operator overloading techniques
to compute derivatives for MATLAB programs.
In Proc. 2nd IEEE Intl. Workshop on Source Code Analysis and
Manipulation (SCAM 2002), October 1, 2002, Montreal, Canada, pages 6572.
IEEE Computer Society, November 2002.
Derivatives of mathematical functions play a key role in various areas of numerical and technical computing. Many of these computations are done in MATLAB, a popular environment for technical computing providing engineers and scientists with capabilities for mathematical computing, analysis, visualization, and algorithmic development. For functions written in the MATLAB language, a novel software tool is proposed to automatically transform a given MATLAB program into another MATLAB program capable of computing not only the original function but also userspecified derivatives of that function. That is, a program transformation known as automatic differentiation is performed to change the semantics of the program in a fashion based on the chain rule of differential calculus. The crucial ingredient of the tool is a combination of sourcetosource transformation and operator overloading. The overall design of the tool is described and numerical experiments are reported demonstrating the efficiency of the resulting code for a sample problem.

[74] 
H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof.
Computation of sensitivity information for aircraft design by
automatic differentiation.
In Peter M. A. Sloot, C. J. Kenneth Tan, Jack J. Dongarra, and
Alfons G. Hoekstra, editors, Computational Science  ICCS 2002
International Conference, April 2124, 2002, Amsterdam, The Netherlands,
Proceedings, Part II, number 2330 in LNCS, pages 10691076, Berlin, 2002.
SpringerVerlag.
Given a numerical simulation of the near wake of an airfoil, automatic differentiation is used to accurately compute the sensitivities of the Mach number with respect to the angle of attack. Such sensitivity information is crucial when integrating a pure simulation code into an optimization framework involving a gradientbased optimization technique. In this note, the ADIFOR system implementing the technology of automatic differentiation for functions written in Fortran 77 is used to mechanically transform a given flow solver called TFS into a new program capable of computing the original simulation and the desired derivatives in a simultaneous fashion. Numerical experiments of derivatives obtained from automatic differentiation and finite differences approximations are reported.

[75] 
H. Martin Bücker, Bruno Lang, Dieter an Mey, Arno Rasch, and Christian H.
Bischof.
Explicit loop scheduling in OpenMP for parallel automatic
differentiation.
In Jalal N. Almhana and C. Bhavsar Virendrakumar, editors,
Proc. 16th Annual International Symposium on High Performance Computing
Systems and Applications (HPCS 2002), Moncton, Canada, June 1619, 2002,
pages 121126, Los Alamitos, CA, 2002. IEEE Computer Society.
Derivatives of almost arbitrary functions can be evaluated efficiently by automatic differentiation whenever the functions are given in the form of computer programs in a highlevel programming language such as Fortran, C, or C++. In contrast to numerical differentiation, where derivatives are only approximated, automatic differentiation generates derivatives that are accurate up to machine precision. Sophisticated software tools implementing the technology of automatic differentiation are capable of automatically generating code for the product of the Jacobian matrix and a socalled seed matrix. It is shown how these tools can benefit from concepts of shared memory programming to parallelize, in a completely mechanical fashion, the gradient operations associated with each statement of the given code. The feasibility of our approach is demonstrated by numerical experiments. They were performed with a code that was generated automatically by the Adifor system and augmented with OpenMP directives.

[76] 
H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof.
Sensitivity analysis of an airfoil simulation using automatic
differentiation.
In M. H. Hamza, editor, Proc. International IASTED Conference
on Modelling, Identification, and Control (MIC 2002), Innsbruck, Austria,
February 1821, 2002, pages 7376, Anaheim, CA, 2002. ACTA Press.
Derivatives are not only useful in sensitivity analysis but play a vital role in various areas of computational science including parameter optimization of computer models. Among the various methods for obtaining derivatives, automatic differentiation (AD) combines freedom from approximation errors, high performance, and the ability to handle arbitrarily complex codes arising from largescale scientific investigations. In this note, we show how AD technology can aid in the sensitivity analysis of a computer model of an aircraft. To this end, the software tool ADIFOR, implementing the AD technology for functions written in Fortran 77, was applied to a largescale computational fluid dynamics solver, TFS, developed at Aerodynamisches Institut, Aachen University of Technology. This solver comprises approximately 24,000 lines of Fortran 77 and is able to resolve steady and unsteady laminar and turbulent flows in two and three dimensions.

[77] 
H. Martin Bücker, Bruno Lang, Dieter an Mey, and Christian H. Bischof.
Bringing together automatic differentiation and OpenMP.
In M. Furnari and E. Gallopoulos, editors, Proceedings
International Conference on Supercomputing, ICS 2001, Sorrento, Italy, June
1721, 2001, pages 246251, New York, NY, 2001. ACM Press.
Derivatives of almost arbitrary functions can be evaluated efficiently by automatic differentiation whenever the functions are given in the form of computer programs in a highlevel programming language such as Fortran, C, or C++. Furthermore, in contrast to numerical differentiation where derivatives are approximated, automatic differentiation generates derivatives that are accurate up to machine precision. The socalled forward mode of automatic differentiation computes derivatives by carrying forward a gradient associated with each intermediate variable simultaneously with the evaluation of the function itself. It is shown how software tools implementing the technology of automatic differentiation can benefit from simple concepts of shared memory programming to parallelize the gradient operations. The feasibility of our approach is demonstrated by numerical experiments. They were performed with a code that was generated automatically by the Adifor system and augmented with OpenMP directives.

[78] 
Bruno Lang.
A comparison of techniques for evaluating centered forms.
In Ulrich Kulisch, Rudolf Lohner, and Axel Facius, editors,
Perspectives on Enclosure Methods, pages 149155. SpringerVerlag, Wien,
2001.
Secondorder enclosures for a function's range can be computed with centered forms, which involve a socalled slope vector. In this paper we discuss several techniques for determining such vectors and compare them with respect to tightness of the resulting enclosure. We advocate that a twostage slope vector computation with symbolic preprocessing is optimal.

[79] 
Christian H. Bischof, H. Martin Bücker, Bruno Lang, Arno Rasch, and
Jakob W. Risch.
A CORBAbased environment for coupling largescale simulation and
optimization software.
In H. R. Arabnia, editor, Proceedings of the International
Conference on Parallel and Distributed Processing Techniques and
Applications, PDPTA'2001, Las Vegas, Nevada, June 2528, 2001, volume 1,
pages 6872. CSREA Press, 2001.
Simulation is an essential part in computational science and engineering. Since the underlying physical models become more and more complex, the adjustment of certain model parameters (“parameter identification”) is often a nontrivial task and should be addressed by means of efficient numerical optimization algorithms. We describe an environment for coupling simulation and optimization software. The modular architecture of the system supports easy modification of the problem configuration, including the exchange of software components. Therefore our tool is wellsuited for rapid prototyping of optimization problems.

[80] 
Bruno Lang.
Derivativebased subdivision in multidimensional verified Gaussian
quadrature.
In Götz Alefeld, Jiři Rohn, Siegfried Rump, and Tetsuro
Yamamoto, editors, Symbolic Algebraic Methods and Verification Methods,
pages 145152, Wien, 2001. SpringerVerlag.
An implementation of verified Gaussian quadrature involves algorithmic parameters whose correct setting is crucial for adequate performance. We identify several of these control parameters and, based on experimental data, we try to give advice for choosing suitable parameter values in order to obtain reasonable average performance for a “blackbox” quadrature routine.

[81] 
H. Martin Bücker, Bruno Lang, Arno Rasch, and Christian H. Bischof.
From analytic to automated derivatives: A case study of the
electrostatic potential.
In Algorithms and Software for Mobile Communications, Proc. 10th Aachen Symposium on Signal Theory, Aachen, Germany, September 2021,
2001, pages 255260, Berlin, 2001. VDE Verlag.
Given a largescale engineering simulation, one is often interested in the derivatives of certain program outputs with respect to certain input parameters, e.g., for sensitivity analysis. From an abstract point of view, the computer program defines a mathematical function whose derivatives are sought. In this note, two cases are investigated where the underlying functions are given by the solution of different twodimensional electrostatic potential problems. For a rectangular region, the function and its derivatives can be derived analytically. For a function based on an Lshaped region, the function is computed by a simulation code and its derivatives are obtained by automatic differentiation. In general, this powerful technique is applicable if a function is given in the form of a computer program in a highlevel programming language such as Fortran, C, or C++. In contrast to numerical differentiation providing approximations based on divided differences, the derivatives computed by automatic differentiation are accurate. Furthermore, automatic differentiation is also shown to be computationally efficient.

[82] 
Christian H. Bischof, H. Martin Bücker, Jörg Henrichs, and Bruno
Lang.
Handson training for undergraduates in highperformance computing
using Java.
In Tor Sørevik, Fredrik Manne, Randi Moe, and Assefaw Hadish
Gebremedhin, editors, Applied Parallel ComputingNew Paradigms for HPC
in Industry and Academia, Proceedings 5th International Workshop, PARA 2000,
Bergen, Norway, June 2000, volume 1947 of LNCS, pages 306315,
Berlin, 2001. SpringerVerlag.
In recent years, the objectoriented approach has emerged as a key technology for building highly complex scientific codes, as has the use of parallel computers for the solution of largescale problems. We believe that the paradigm shift towards parallelism will continue and, therefore, principles and techniques of writing parallel programs should be taught to the students at an early stage of their education rather than as an advanced topic near the end of a curriculum. A certain understanding of the practical aspects of numerical modeling is also a useful facet in computer science education. The reason is that, in addition to their traditional prime rôle in computational science and engineering, numerical techniques are also increasingly employed in seemingly nonnumerical settings as largescale data mining and web searching. This paper describes a practical training course for undergraduates, where carefully selected problems of highperformance computing are solved using the programming language Java.

[83] 
Christian H. Bischof, H. Martin Bücker, Bruno Lang, Arno Rasch, and
Jakob W. Risch.
On the use of a differentiated finite element package for sensitivity
analysis.
In Vassil N. Alexandrov, Jack J. Dongarra, Benjoe A. Juliano,
René S. Renner, and C. J. Kenneth Tan, editors, Computational
Science  ICCS 2001 International Conference San Francisco, CA, May 2001,
Proceedings, Part I, volume 2073 of LNCS, pages 795801, Berlin,
2001. Springer.
Derivatives are ubiquitous in various areas of computational science including sensitivity analysis and parameter optimization of computer models. Among the various methods for obtaining derivatives, automatic differentiation (AD) combines freedom from approximation errors, high performance, and the ability to handle arbitrarily complex codes arising from largescale scientific investigations. In this note, we show how AD technology can aid in the sensitivity analysis of a computer model by considering a classic fluid flow experiment as an example. To this end, the software tool ADIFOR implementing the AD technology for functions written in Fortran 77 was applied to the large finite element package SEPRAN. Differentiated versions of SEPRAN enable sensitivity analysis for a wide range of applications, not only from computational fluid dynamics.

[84] 
H. Martin Bücker, Bruno Lang, and Christian H. Bischof.
Teaching different parallel programming paradigms using Java.
In Proceedings of the Third Annual Workshop on Java for
HighPerformance Computing, Sorrento, Italy, June 17, 2001, pages 7381,
2001.
Parallel computing has emerged as a key technology for the solution of largescale problems arising in computational science and engineering. Therefore, principles and techniques of writing parallel programs should be taught to the students at an early stage of their education rather than as an advanced topic near the end of a curriculum. This note describes a practical training course for undergraduates, where the programming language Java is used to pave the student's way from concurrent programming using native Java threads via OpenMPlike shared memory programming to distributed memory programming based on the message passing interface MPI.

[85] 
Christian H. Bischof, Bruno Lang, Wolfgang Marquardt, and Martin
Mönnigmann.
Verified determination of singularities in chemical processes.
In Walter Krämer and Jürgen Wolff von Gudenberg, editors,
Scientific Computing, Validated Numerics, Interval Methods, pages
305316, New York, 2001. Kluwer Academic/Plenum Publishers.
Understanding and controlling the behavior of chemical processes are important issues, for safety as well as economical reasons. Some processes can have multiple steady states and even switch between them in a complex way, the reasons for the multiplicity not always being well understood. A singularity theory approach for investigating such behavior leads to nonlinear systems whose solutions correspond to specific singular states of the process. In order to exclude certain types of singularities, rigorous methods must be used to check the solvability of the matching systems. As these systems are highly structured, our solution method combines a symbolic preprocessing phase (term manipulation for utilizing the structure) with a branchandbound type rigorous intervalbased solver. We report on our experience with this approach for smalltomedium sized example problems.

[86]  Christian Bischof, Martin Bücker, Bruno Lang, and Arno Rasch. Automatic differentiation of the FLUENT CFD program: Procedure and first results. Report RWTHCSSC0122, RWTH Aachen, Institute for Scientific Computing, 2001. 
[87] 
Christian H. Bischof, Bruno Lang, and Xiaobai Sun.
A framework for symmetric band reduction.
ACM Trans. Math. Software, 26(4):581601, December 2000.
We develop an algorithmic framework for reducing the bandwidth of symmetric matrices via orthogonal similarity transformations. This framework includes the reduction of full matrices to banded or tridiagonal form and the reduction of banded matrices to narrower banded or tridiagonal form, possibly in multiple steps. Our framework leads to algorithms that require fewer floatingpoint operations than do standard algorithms, if only the eigenvalues are required. In addition, it allows for spacetime tradeoffs and enables or increases the use of blocked transformations.

[88] 
Christian H. Bischof, Bruno Lang, and Xiaobai Sun.
Algorithm 807: The SBR toolboxsoftware for successive band
reduction.
ACM Trans. Math. Software, 26(4):602616, December 2000.
We present a software toolbox for symmetric band reduction via orthogonal transformations, together with a testing and timing program. The toolbox contains drivers and computational routines for the reduction of full symmetric matrices to banded form and the reduction of banded matrices to narrower banded or tridiagonal form, with optional accumulation of the orthogonal transformations, as well as repacking routines for storage rearrangement. The functionality and the calling sequences of the routines are described, with a detailed discussion of the “control” parameters that allow adaptation of the codes to particular machine and matrix characteristics. We also briefly describe the testing and timing program included in the toolbox.

[89] 
Bruno Lang.
Direct solvers for symmetric eigenvalue problems.
In Johannes Grotendorst, editor, Modern Methods and Algorithms
of Quantum Chemistry, Jülich, February 2125, 2000, Proceedings, 2nd
Edition, pages 231259, Jülich, 2000. John von Neumann Institute for
Computing.
This article reviews classical and recent direct methods for computing eigenvalues and eigenvectors of symmetric full or banded matrices. The ideas underlying the methods are presented, and the properties of the algorithms with respect to accurary and performance are discussed. Finally, pointers to relevant software are given.

[90] 
Bruno Lang.
Outofcore solution of large symmetric eigenproblems.
Z. angew. Math. Mech., 80:S 809S 810, 2000.
This paper describes a prototype implementation of a solver for dense symmetric eigenproblems, which are too large to fit into the main memory.

[91] 
Christian H. Bischof, H. Martin Bücker, Jörg Henrichs, and Bruno
Lang.
SoftwarePraktikum Paralleles Programmieren mit Java.
Report RWTHCSSC0013, RWTH Aachen, Institute for Scientific
Computing, 2000.


[92] 
Bruno Lang.
Efficient eigenvalue and singular value computations on shared memory
machines.
Parallel Comput., 25(7):845860, July 1999.
We describe two techniques for speeding up eigenvalue and singular value computations on shared memory parallel computers. Depending on the information that is required, different steps in the overall process can be made more efficient. If only the eigenvalues or singular values are sought then the reduction to condensed form may be done in two or more steps to make best use of optimized level3 BLAS. If eigenvectors and/or singular vectors are required, too, then their accumulation can be speeded up by another blocking technique. The efficiency of the blocked algorithms depends heavily on the values of certain control parameters. We also present a very simple performance model that allows selecting these parameters automatically.

[93] 
Benedikt Großer and Bruno Lang.
Efficient parallel reduction to bidiagonal form.
Parallel Comput., 25(8):969986, September 1999.
Most methods for calculating the SVD (singular value decomposition) require to first bidiagonalize the matrix. The blocked reduction of a general, dense matrix to bidiagonal form, as implemented in ScaLAPACK, does about one half of the operations with BLAS3. By subdividing the reduction into two stages dense > banded and banded > bidiagonal with cubic and quadratic arithmetic costs respectively, we are able to carry out a much higher portion of the calculations in matrixmatrix multiplications. Thus, higher performance can be expected.

[94] 
Bruno Lang.
A comparison of subdivision strategies for verified multidimensional
Gaussian quadrature.
In Tibor Csendes, editor, Developments in Reliable Computing 
SCAN98 Proceedings, pages 6775, Dordrecht, The Netherlands, 1999. Kluwer
Academic Publishers.
This paper compares several strategies for subdividing the domain in verified multidimensional Gaussian quadrature. Subdivision may be used in two places in the quadrature algorithm. First, subdividing the box can reduce the overestimation of the partial derivatives' ranges, which are needed to bound the approximation error. Second, if the required error bound cannot be met with the whole domain then the box is split into subboxes, and the quadrature algorithm is recursively applied to these. Both variants of subdivision are considered in this paper.

[95] 
Benedikt Großer and Bruno Lang.
Using pentangular factorizations for the reduction to banded form.
In P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé,
L. Giraud, and D. Ruiz, editors, EuroPar'99 Parallel Processing, volume
1685 of LNCS, pages 10881095, Berlin, 1999. SpringerVerlag.
Most methods for computing the singular value decomposition (SVD) first bidiagonalize the matrix. The ScaLAPACK implementation of the blocked reduction of a general dense matrix to bidiagonal form performs about one half of the operations with BLAS3. If we subdivide the task into two stages dense >banded and banded > bidiagonal, we can increase the portion of matrixmatrix operations and expect higher performance. We give an overview of different techniques for the first stage.

[96] 
Rudolf Berrendorf, Christian Bischof, Holger Brunst, Martin Bücker,
Ulrich Detert, Rüdiger Esser, Michael Gerndt, Johannes Grotendorst,
Inge Gutheil, HansChristian Hoppe, Friedel Hoßfeld, Bernd Körfgen,
Bruno Lang, Dieter an Mey, Bernd Mohr, Wolfgang E. Nagel, Karl Solchenbach,
Godehard Sutmann, Valentina Tikko, and Lothar Wollschläger.
Gekoppelte SMPSysteme im wissenschaftlichtechnischen
Hochleistungsrechnen  Status und Entwicklungsbedarf (GoSMP).
Analyse im Auftrag des BMBF, Zentralinstitut für Angewandte
Mathematik, Forschungszentrum Jülich GmbH, Germany, December 1999.
In der Analyse werden der derzeitige Status und die voraussehbare Entwicklung gekoppelter SMPSysteme bezüglich der Hardware und Software dargestellt sowie Forschungs und Entwicklungsaufgaben im Softwarebereich identifiziert, deren Lösung für den effizienten Einsatz gekoppelter SMPSysteme für das wissenschaftlichtechnische Hochleistungsrechnen erforderlich ist. Es werden konkrete Themengebiete für eine Projektförderung durch das BMBF empfohlen.

[97] 
Bruno Lang.
Efficient algorithms for reducing banded matrices to bidiagonal and
tridiagonal form.
In Peter Arbenz, Marcin Paprzycki, Ahmed Sameh, and Vivek Sarin,
editors, High Performance Algorithms for Structured Matrix Problems,
volume 2 of Advances in the Theory of Computation and Computational
Mathematics, pages 7589. Nova Science Publishers, Commack, NY, 1998.
This paper presents efficient techniques for the orthogonal reduction of banded matrices to bidiagonal and symmetric tridiagonal form. The algorithms are numerically stable and well suited to parallel execution. Experiments on the Intel Paragon show that even on a single processor these methods usually will be several times faster than the corresponding LAPACK routines. In addition, they yield nearly full speedup when run on multiple processors.

[98] 
Bruno Lang.
Using level 3 BLAS in rotationbased algorithms.
SIAM J. Sci. Comput., 19(2):626634, March 1998.
This paper presents a technique that allows using level 3 BLAS in a number of rotationbased algorithms. In particular, the update of an orthogonal transformation matrix which often involves the vast majority of operations can be done with a matrixmatrix product. As a case study, the technique is applied to the QR and QL algorithms for computing the eigensystem of a symmetric tridiagonal matrix. The modifications do not affect the convergence properties of the algorithms nor do they significantly increase the overall number of operations. Thus, the computations can be sped up by more than 50% on machines with a distinct memory hierarchy, like the Intel i860 or IBM RS/6000, provided the block size is set appropriately. We also present a simple theoretical analysis that allows selecting an almostoptimal block size.

[99] 
Bruno Lang.
Verified quadrature in determining Newton's constant of
gravitation.
J. Univers. Comput. Sci., 4(1):1624, 1998.
This paper describes the use of interval arithmetic to bound errors in an experiment for determining Newton's constant of gravitation. Using verified Gaussian quadrature we were able to assess the numerical errors as well as the effect of several tolerances in the physical experiment.

[100]  Bruno Lang. Effiziente Orthogonaltransformationen bei der Eigen und Singulärwertberechnung. Habilitationsschrift, Fachbereich Mathematik, Bergische Universität GH Wuppertal, Germany, 1997. 
[101] 
Bruno Lang.
Using level 3 BLAS in the QR algorithm.
Z. angew. Math. Mech., 77(S2):S 607S 608, 1997.
We present a technique that allows speeding up the QR algorithms for symmetric tridiagonal and Hessenberg matrices by doing most of the computations in matrixmatrix products. The reduced memory traffic leads to increased performance.

[102] 
Bruno Lang.
Parallel reduction of banded matrices to bidiagonal form.
Parallel Comput., 22(1):118, January 1996.
A parallel algorithm for reducing banded matrices to bidiagonal form is presented. In contrast to the rotationbased “standard approach”, our algorithm is based on Householder transforms, therefore exhibiting considerably higher data locality (BLAS level 2 instead of level 1). The update of the transformation matrices which involves the vast majority of the operations can even be blocked to allow the use of level 3 BLAS. Thus, our algorithm will outperform the standard method on a serial computer with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.

[103] 
Oliver Holzmann, Bruno Lang, and Holger Schütt.
Newton's constant of gravitation and verified numerical quadrature.
Reliab. Comput., 2(3):229239, November 1996.
In this paper we describe the use of interval arithmetic in an experiment for determining G, Newton's constant of gravitation. Using an interval version of Gaussian quadrature, we bound the effects of numerical errors and of several tolerances in the physical experiment. This allowed to idientify “critical” tolerances which must be reduced in order to obtain G with the desired accuracy.

[104] 
Bruno Lang.
Reduction of banded matrices to bidiagonal form.
Z. angew. Math. Mech., 76(Supplement 1: ICIAM/GAMM 95
Proceedings):155158, 1996.
We present a parallel algorithm for reducing banded matrices to bidiagonal form. This reduction is a major step in the computation of singular values and singular vectors. In contrast to the rotationbased “standard approach”, our algorithm is based on Householder transforms, therefore exhibiting considerably better data locality (BLAS level 2 instead of level 1). The update of the transformation matrix, the most expensive part of the reduction, can be blocked to allow the use of level 3 BLAS. Thus, even a serial implementation of our algorithm will outperform the standard method on a machine with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.

[105] 
Götz Alefeld, Andreas Frommer, and Bruno Lang, editors.
Scientific Computing and Validated Numerics  SCAN95
Proceedings, volume 90 of Mathematical Research, Berlin, 1996.
Akademie Verlag.


[106] 
Bruno Lang.
Verifizierte Lösung von Gleichungs und
Ungleichungssystemen.
Z. angew. Math. Mech., 75(S II):S541S542, 1995.


[107] 
Christian H. Bischof, Bruno Lang, and Xiaobai Sun.
Parallel tridiagonalization through twostep band reduction.
In Proceedings of the Scalable HighPerformance Computing
Conference, Knoxville, Tennessee, May 2325, 1994, pages 2327, Los
Alamitos, CA, 1994. IEEE Computer Society.
We present a twostep variant of the “successive band reduction” paradigm for the tridiagonalization of symmetric matrices. Here we first reduce a full matrix to narrowbanded form, and from there to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations, which leads to an algorithm that is almost entirely based on BLAS3 kernels, and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone Delta prototype and the IBM SP/1.

[108] 
Bruno Lang.
Parallele Berechnung von Eigensystemen symmetrischer
Bandmatrizen.
Z. angew. Math. Mech., 74(6):T541T544, 1994.
Es wird ein Verfahren zur Berechnung von Eigenwerten und vektoren symmetrischer Bandmatrizen auf Parallelrechnern vorgestellt. Zunächst wird die Matrix auf Tridiagonalgestalt reduziert, dann werden mit einem beschleunigten Bisektionsverfahren die Eigenwerte und anschließend die Eigenvektoren berechnet. Das vorgestellte Verfahren ermöglicht in allen drei Phasen die Verteilung der Rechenarbeit auf mehrere Prozessoren und zusätzlich den Einsatz eventuell vorhandener Vektorhardware in den einzelnen Prozessoren. Die Effizienz des Verfahrens wird durch Messungen auf dem iPSC/860 HypercubeParallelrechner belegt.

[109] 
Bruno Lang.
A parallel algorithm for reducing symmetric banded matrices to
tridiagonal form.
SIAM J. Sci. Comput., 14(6):13201338, November 1993.
An algorithm is presented for reducing symmetric banded matrices to tridiagonal form via Householder transformations. The algorithm is numerically stable and is well suited to parallel execution on distributed memory multiple instruction multiple data (MIMD) computers. Numerical experiments on the iPSC/860 hypercube show that the new method yields nearly full speedup if it is run on multiple processors. In addition, even on a single processor the new method usually will be several times faster than the corresponding EISPACK and LINPACK routines.

[110] 
Bruno Lang.
Reducing symmetric banded matrices to tridiagonal formA
comparison of a new parallel algorithm with two serial algorithms on the
iPSC/860.
In L. Bougé, M. Cosnard, Y. Robert, and D. Trystram, editors,
Parallel Processing: CONPAR 92VAPP V  Proceedings Second Joint
International Conference on Vector and Parallel Processing, Lyon, France,
September 1992, volume 634 of LNCS, pages 271282, Berlin, 1992.
SpringerVerlag.
We compare three algorithms for reducing symmetric banded matrices to tridiagonal form and evaluate their performance on the Intel iPSC/860 hypercube parallel computer. Two of these algorithms, the routines BANDR and _SBTRD from the EISPACK and LAPACK libraries, resp., are serial algorithms with little potential for coarse grain parallelism. The third one, called SBTH, is a new parallel algorithm. Results on the iPSC/860 hypercube show that this new method may be several times faster than the others, even when run on a single processor. In addition, execution on multiple processors yields nearly full speedup.

[111]  Bruno Lang. Parallele Reduktion symmetrischer Bandmatrizen auf Tridiagonalgestalt. Dissertation, Fakultät für Mathematik, Universität Karlsruhe, Germany, 1991. 
[112] 
Bruno Lang.
Matrix vector multiplication for symmetric matrices on parallel
computers and applications.
Z. angew. Math. Mech., 71(6):T807T808, 1991.
Matrix vector products with symmetric matrices arise in several numerical algorithms, e.g. in the Householder method for eigenproblems and in some iterative methods for the solution of linear systems. Here we propose a procedure for computing such a product on a local memory parallel computer, if only one triangle of the matrix is actually stored. Our method neither increases the number of arithmetic operations nor introduces prohibitive communication overhead.

[113] 
Stephan Abramowski, Bruno Lang, and Heinrich Müller.
Moving regular kgons in contact.
In J. van Leeuwen, editor, GraphTheoretic Concepts in Computer
Science, International Workshop WG '88 Proceedings, Amsterdam, June 1517,
1988, volume 344 of LNCS, pages 229242, Berlin, 1989.
SpringerVerlag.
Given m circles in the plane, contacts between them can be specified by a system of quadratic distance equalities. An approximate solution for the trajectories of the circles for a system of one degree of freedom is given, by replacing the circles by translationally moving regular kgons. The approximation yields trajectories that are piecewise linear. The next linear generation of the m trajectories are found by an incremental algorithm in O( m^{2} ) time. Further, an algorithm is presented which finds the next collision between m kgons moving on lines at constant speed in time O( k m^{2x} ) for a constant x > 0 using linear space. Finally, more practical collision detection algorithms are sketched based on neighborhood information which, however, do not guarantee a nontrivial worstcase bound.
