The ESSEX project is an ongoing effort to provide exascale-enabled sparse eigensolvers, especially for quantum physics and related application areas. In this paper we first briefly summarize some key achievements that have been made within this project. Then we focus on a projection-based eigensolver with polynomial approximation of the projector. This eigensolver can be used for computing hundreds of interior eigenvalues of large sparse matrices. We describe techniques that allow using lower-degree polynomials than possible with standard Chebyshev expansion of the window function and kernel smoothing. With these polynomials, the degree, and thus the number of matrix-vector multiplications, typically can be reduced by roughly one half, resulting in comparable savings in runtime.
We study Chebyshev filter diagonalization as a tool for the computation of many interior eigenvalues of very large sparse symmetric matrices. In this technique the subspace projection onto the target space of wanted eigenvectors is approximated with filter polynomials obtained from Chebyshev expansions of window functions. After the discussion of the conceptual foundations of Chebyshev filter diagonalization we analyze the impact of the choice of the damping kernel, search space size, and filter polynomial degree on the computational accuracy and effort, before we describe the necessary steps towards a parallel high-performance implementation. Because Chebyshev filter diagonalization avoids the need for matrix inversion it can deal with matrices and problem sizes that are presently not accessible with rational function methods based on direct or iterative linear solvers. To demonstrate the potential of Chebyshev filter diagonalization for large-scale problems of this kind we include as an example the computation of the 102 innermost eigenpairs of a topological insulator matrix with dimension 109 derived from quantum physics applications.
Numerous challenges have to be mastered as applications in scientific computing are being developed for post-petascale parallel systems. While ample parallelism is usually available in the numerical problems at hand, the efficient use of supercomputer resources requires not only good scalability but also a verifiably effective use of resources on the core, the processor, and the accelerator level. Furthermore, power dissipation and energy consumption are becoming further optimization targets besides time-to-solution. Performance Engineering (PE) is the pivotal strategy for developing effective parallel code on all levels of modern architectures. In this paper we report on the development and use of low-level parallel building blocks in the GHOST library (“General, Hybrid, and Optimized Sparse Toolkit”). We demonstrate the use of PE in optimizing a density of states computation using the Kernel Polynomial Method, and show that reduction of runtime and reduction of energy are literally the same goal in this case. We also give a brief overview of the capabilities of GHOST and the applications in which it is being used successfully.
As we approach the exascale computing era, disruptive changes in the software landscape are required to tackle the challenges posed by manycore CPUs and accelerators. We discuss the development of a new `exascale enabled’ sparse solver repository (the ESSR) that addresses these challenges-from fundamental design considerations and development processes to actual implementations of some prototypical iterative schemes for computing eigenvalues of sparse matrices. Key features of the ESSR include holistic performance engineering, tight integration between software layers and mechanisms to mitigate hardware failures.
Recently, methods based on spectral projection and numerical integration came up in the literature as candidates for reliable high performance eigenvalue solvers. The key ingredients of this type of eigenvalue solver are a Rayleigh-Ritz process and a routine to compute an approximation to the desired eigenspace. The latter computation can be performed by numerical integration of the resolvent. In this article we investigate the progress of the Rayleigh-Ritz process and the achievable quality of the computed eigenpairs for the case that an upper bound for the normwise difference between the currently used subspace and the desired eigenspace is available. Then, such bounds are derived for the Gauß-Legendre rule and the trapezoidal rule.
Methods for the solution of sparse eigenvalue problems that are based on spectral projectors and contour integration have recently attracted more and more attention. Such methods require the solution of many shifted linear systems of full size. In most of the literature concerning these eigenvalue solvers, only few words are said on the solution of the linear systems, but they turn out to be very hard to solve by iterative linear solvers in practice. In this work we identify a row projection method for the solution of the inner linear systems encountered in the FEAST algorithm and introduce a novel hybrid parallel and fully iterative implementation of the eigenvalue solver. Our approach ultimately aims at achieving extreme parallelism by exploiting the algorithm's potential on several levels. We present numerical examples where graphene modeling is one of the target applications. In this application, several hundred or even thousands of eigenvalues from the interior of the spectrum are required, which is a big challenge for state-of-the-art numerical methods.
We present a general framework for algorithms for the solution of Hermitian eigenvalue problems, where eigenvalues in a given interval are sought. One instance of this framework is Polizzi's FEAST algorithm [E. Polizzi: Density-matrix-based algorithm for solving eigenvalue problems. Phys. Rev. B 2009; 79:115112], which is based on numerical integration. Another instance is based on polynomial approximation. We propose adaptive strategies for the choice of the polynomial degree and tolerance of the linear solver, respectively. Numerical experiments reveal that these strategies are able to improve the robustness of the resulting methods, while at the same time achieving near-to-optimum efficiency.
The QR algorithm is the method of choice for computing all eigenvalues of a dense nonsymmetric matrix A. After an initial reduction to Hessenberg form, a QR iteration can be viewed as chasing a small bulge from the top left to the bottom right corner along the subdiagonal of A. To increase data locality and create potential for parallelism, modern variants of the QR algorithm perform several iterations simultaneously, which amounts to chasing a chain of several bulges instead of a single bulge. To make effective use of level 3 BLAS, it is important to pack these bulges as tightly as possible within the chain. In this work, we show that the tightness of the packing in existing approaches is not optimal and can be increased. This directly translates into a reduced chain length by 33% compared to the state-of-the-art LAPACK implementation of the QR algorithm. To demonstrate the impact of our idea, we have modified the LAPACK implementation to make use of the optimal packing. Numerical experiments reveal a uniform reduction of the execution time, without affecting stability or robustness.
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N3) with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding backtransformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMPI implementation available as well. Scalability beyond 10000 CPU cores for problem sizes arising in the field of electronic structure theory is demonstrated for current high-performance computer architectures such as Cray or Intel/Infiniband. For a matrix of dimension 260000, scalability up to 295000 CPU cores has been shown on BlueGene/P.
The ESSEX project investigates computational issues arising at exascale for large-scale sparse eigenvalue problems and develops programming concepts and numerical methods for their solution. The project pursues a coherent co-design of all software layers where a holistic performance engineering process guides code development across the classic boundaries of application, numerical method, and basic kernel library. Within ESSEX the numerical methods cover widely applicable solvers such as classic Krylov, Jacobi-Davidson, or the recent FEAST methods, as well as domain-specific iterative schemes relevant for the ESSEX quantum physics application. This report introduces the project structure and presents selected results which demonstrate the potential impact of ESSEX for efficient sparse solvers on highly scalable heterogeneous supercomputers.
We consider the FEAST eigensolver, introduced by Polizzi in 2009 . We describe an improvement concerning the reliability of the algorithm and discuss an application in the solution of eigenvalue problems from graphene modeling.
This paper provides a streamlined and modular presentation of the MR3 algorithm for computing selected eigenpairs of symmetric tridiagonal matrices, thus disentangling the principles driving MR3 and the (recursive) “core” algorithm from the specific (e.g., twisted) decompositions used to represent the matrices at different recursion depths and from the (dqds) transformations converting between them. Our approach allows a modular full proof for the correctness of the MR3 algorithm. This proof is based on five requirements concerning the interplay between the core algorithm and its subcomponents. These requirements can also guide in implementing the algorithm, because they expose quantities that can and should be monitored at runtime. Our new implementation XMR, which is based on the above analysis, is described and compared to xSTEMR from Lapack 3.2.2. Numerical experiments comparing the robustness and performance of both implementations are given.
We analyze the FEAST method for computing selected eigenvalues and eigenvectors of large sparse matrix pencils. After establishing the close connection between FEAST and the well-known Rayleigh-Ritz method, we identify several critical issues that influence convergence and accuracy of the solver: the choice of the starting vector space, the stopping criterion, how the inner linear systems impact the quality of the solution, and the use of FEAST for computing eigenpairs from multiple intervals. We complement the study with numerical examples, and hint at possible improvements to overcome the existing problems.
Most result-verifying solvers for nonlinear systems include a branch-and-bound algorithmic backbone, complemented with accelerating methods such as Constraint Propagation, Interval Newton, and many others. Often the effectiveness of the accelerators can be improved if one works not only with the given system but also with expanded systems, which are obtained by introducing additional variables for suitable subterms. While reducing the overall number of boxes that must be considered during the solution, applying the accelerators to the (often much larger) expanded systems can increase significantly the time spent on a single box. Therefore a good strategy for deciding which methods to apply to which expanded system is essential for the performance and robustness of the whole solver.
We propose a heuristic that selects expanded systems based on performance information gathered during the computations. Numerical experiments show that this new dynamical strategy tends to be superior to a fixed small-to-large traversal (with restarts) of the expanded systems, which has been the default strategy in our nonlinear solver and optimizer SONIC.
Determining the singular value decomposition of a bidiagonal matrix is a frequent subtask in numerical computations. We shed new light on a long-known way to utilize the algorithm of multiple relatively robust representations, MR3, for this task by casting the singular value problem in terms of a suitable tridiagonal symmetric eigenproblem (via the Golub-Kahan matrix). Just running MR3 “as is” on the tridiagonal problem does not work, as has been observed before (e.g., B. Großer and B. Lang, An O(n2) algorithm for the bidiagonal SVD, Linear Algebra Appl., 358 (2003), pp. 45-70). In this paper we give more detailed explanations for the problems with running MR3 as a black box solver on the Golub-Kahan matrix. We show that, in contrast to standing opinion, MR3 can be run safely on the Golub-Kahan matrix, with just a minor modification. A proof including error bounds is given for this claim.
Bidiagonal and twisted factorizations play a prominent role in the MR3 algorithm for computing partial eigensystems of symmetric tridiagonal matrices. Bidiagonal factorizations of shifted symmetric tridiagonals T - τI also underly, at least implicitly, the evaluation of the Sturm count for eigenvalue computations. In this paper we propose new representations (e-rep, Z-rep) for bidiagonal and twisted factorizations relying on other quantities than the usually employed nontrivial entries of the bidiagonal factors (N-rep). We show that the qd algorithms used for shifting the factorizations, e.g., LDL* - τI =: L+D+(L+)*, and for converting between them, can be formulated to work with these representations in a mixed stable way. In fact, the Z-representation achieves lower bounds for the necessary relative perturbations of the inputs and outputs than the traditional N-rep does. To obtain sharp bounds for accumulating perturbations we propose a new notation that extends Higham's [N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA, 2nd ed., 2002] by including second-order terms to provide sharper first-order bounds.
Factorizing symmetric tridiagonal matrices and propagating the factorizations to shifted matrices are central tasks in the MR3 algorithm for computing partial eigensystems. In this paper we propose block bidiagonal factorizations LDL* with 1 ×1 and 2 ×2 blocks in D as an alternative to the bidiagonal and twisted factorizations used hitherto. With block factorizations, the element growth can be reduced (or avoided altogether), which is essential for the success of the MR3 algorithm, in particular if the latter is used to determine the singular value decomposition of bidiagonal matrices. We show that the qd algorithm used for shifting bidiagonal factorizations, e.g., LDL* - τI =: L+D+(L+)* can be extended to work with blocks in a mixed stable way, including criteria for determining a suitable block structure dynamically.
Spherical t-designs provide quadrature rules for the sphere which are exact for polynomials up to degree t. In this paper, we propose a computational algorithm based on interval arithmetic which, for given t, upon successful completion will have proved the existence of a t-design with (t+1)2 nodes on the unit sphere S2 3 and will have computed narrow interval enclosures which are known to contain these nodes with mathematical certainty. Since there is no theoretical result which proves the existence of a t-design with (t+1)2 nodes for arbitrary t, our method contributes to the theory because it was tested successfully for t=1, 2, ..., 100. The t-design is usually not unique; our method aims at finding a well-conditioned one. The method relies on computing an interval enclosure for the zero of a highly nonlinear system of dimension (t+1)2. We therefore develop several special approaches which allow us to use interval arithmetic efficiently in this particular situation. The computations were all done using the MATLAB toolbox INTLAB.
Nowadays, the development, maintenance, and ongoing adaptation of simulation software due to new algorithmic or hardware developments are highly complex tasks involving larger teams, often from different groups and disciplines, and located at different places. This requires an increased use of methods and tools from software engineering. At the same time, the high computational demands from the fields of application make it necessary to optimize the modules for code performance and scalability in order to fully exploit the potential of modern parallel architectures.
This paper presents a case study on the ongoing endeavor of improving and developing library software for the parallel computation of eigenvalues for dense symmetric matrices, driven by fields of application such as quantum chemistry. A widespread approach is to, first, transform the matrix to tridiagonal form and, second, to solve the tridiagonal eigenvalue problem, before a back transformation provides the eigenvectors of the original matrix. For overall performance, each of these steps must be optimized in a specific way with respect to numerical and parallel efficiency, which shows the importance of involving different experts and of designing the parallel eigensolver in a modular way. Optimizations for the reduction and the back transformation are discussed in this paper, including numerical results demonstrating their effectiveness.
The computation of selected eigenvalues and eigenvectors of a symmetric (Hermitian) matrix is an important subtask in many contexts, for example in electronic structure calculations. If a significant portion of the eigensystem is required then typically direct eigensolvers are used. The central three steps are: reduce the matrix to tridiagonal form, compute the eigenpairs of the tridiagonal matrix, and transform the eigenvectors back. To better utilize memory hierarchies, the reduction may be effected in two stages: full to banded, and banded to tridiagonal. Then the back transformation of the eigenvectors also involves two stages. For large problems, the eigensystem calculations can be the computational bottleneck, in particular with large numbers of processors. In this paper we discuss variants of the tridiagonal-to-banded back transformation, improving the parallel efficiency for large numbers of processors as well as the per-processor utilization. We also modify the divide-and-conquer algorithm for symmetric tridiagonal matrices such that it can compute a subset of the eigenpairs at reduced cost. The effectiveness of our modifications is demonstrated with numerical experiments.
We consider the eigensolver named FEAST that was introduced by Polizzi in 2009 . This solver, tailored to the (partial) solution of large sparse (generalized) eigenproblems offers good potential for parallelism and showed good robustness in our experiments. We briefly introduce the algorithm, point out some problems and give two examples.
The question if there exists an N-point spherical t-design is not yet settled for all combinations of t and N. Using our framework SONIC for the solution of nonlinear systems, we were able to close the two remaining open cases for t = 3. More precisely, a computational proof revealed that there are no spherical 3-designs with N = 7 or N = 9 points. We describe how these results were obtained and comment on the open cases for larger values of t.
The invariance of the topological degree under certain homotopies is used to derive a framework for tests to computationally prove the existence of zeros of nonlinear mappings in Rn. These tests use interval arithmetic to enclose the range of a function over a box and are provably more general than many other tests like the Moore-Kioustelidis test, a test based on the Krawczyk operator, and another degree-based test published recently. A numerical example is included.
We propose a novel approach for the parametrically robust design of dynamic systems. The approach can be applied to system models with parameters that are uncertain in the sense that values for these parameters are not known precisely, but only within certain bounds. The novel approach is guaranteed to find an optimal steady state that is stable for each parameter combination within these bounds.
Our approach combines the use of a standard solver for constrained optimization problems with the rigorous solution of nonlinear systems. The constraints for the optimization problems are based on the concept of parameter space normal vectors that measure the distance of a tentative optimum to the nearest known critical point, i.e., a point where stability may be lost. Such normal vectors are derived using methods from Nonlinear Dynamics. After the optimization, the rigorous solver is used to provide a guarantee that no critical points exist in the vicinity of the optimum, or to detect such points. In the latter case, the optimization is resumed, taking the newly found critical points into account. This optimize-and-verify procedure is repeated until the rigorous nonlinear solver can guarantee that the vicinity of the optimum is free from critical points and therefore the optimum is parametrically robust. In contrast to existing design methodologies, our approach can be automated and does not rely on the experience of the designing engineer.
A simple model of a fermenter is used to illustrate the concepts and the order of activities arising in a typical design process.
The overlap Dirac operator in lattice QCD requires the computation of the sign function of a matrix. While this matrix is usually Hermitian, it becomes non-Hermitian in the presence of a quark chemical potential. We show how the action of the sign function of a non-Hermitian matrix on an arbitrary vector can be computed efficiently on large lattices by an iterative method. A Krylov subspace approximation based on the Arnoldi algorithm is described for the evaluation of a generic matrix function. The efficiency of the method is spoiled when the matrix has eigenvalues close to a function discontinuity. This is cured by adding a small number of critical eigenvectors to the Krylov subspace, for which we propose two different deflation schemes. The ensuing modified Arnoldi method is then applied to the sign function, which has a discontinuity along the imaginary axis. The numerical results clearly show the improved efficiency of the method. Our modification is particularly effective when the action of the sign function of the same matrix has to be computed many times on different vectors, e.g., if the overlap Dirac operator is inverted using an iterative method.
Derivatives are a crucial ingredient to a broad variety of computational techniques in science and engineering. While numerical approaches for evaluating derivatives suffer from truncation error, automatic differentiation is accurate up to machine precision. The term automatic differentiation comprises a set of techniques for mechanically transforming a given computer program to another one capable of evaluating derivatives. A common misconception about automatic differentiation is that this technique only works on local pieces of fairly simple code. Here, it is shown that automatic differentiation is not only applicable to small academic codes, but scales to advanced industrial software packages. In particular, the general-purpose computational fluid dynamics software package FLUENT is transformed by automatic differentiation.
We report here the implementation of a newly developed, highly efficient matrix diagonalization routine in the DR program [T. E. Odaka, P. Jensen, and T. Hirano, J. Mol. Structure 795, 14 (2006)]. The DR program solves the rovibronic Schrödinger equation for a triatomic molecule with a double Renner effect, i.e., with two accessible linear arrangements of the nuclei at which the electronic energy is doubly degenerate. With the new routines, we can extend the DR calculations of rovibronic energies for A2 Π MgNC/MgCN by considering a much larger set of rovibronic states, in particular states at higher J values, than we were able to access previously.
The verification of the existence of certain spherical t-designs involves the evaluation of a degree-t polynomial Jt at a very large number of (interval) arguments. To make the overall verification process feasible computationally, this evaluation must be fast, and the enclosures for the function values must be affected with only modest over-estimation. We discuss several approaches for multi-argument interval evaluation of the polynomial Jt and show how they can be adapted to other polynomials p. One particularly effective new method is based on expanding the polynomial p around several points ξj and truncating each resulting expansion pξ_j to a lower-degree polynomial.
The invariance of the topological degree under certain homotopies is used to computationally prove the existence of zeros of nonlinear mappings in Rn. These existence tests use interval arithmetic to enclose the range of a function over a box. We show that our test is more general than a well-known test based on Miranda's theorem, and we show by a numerical example that the new test can be successful on substantially larger boxes.
The overlap Dirac operator at nonzero quark chemical potential involves the computation of the sign of a non-Hermitian matrix. In this talk we present an iterative method, first proposed by us in Ref. , which allows for an efficient computation of the operator, even on large lattices. The starting point is a Krylov subspace approximation, based on the Arnoldi algorithm, for the evaluation of a generic matrix function. The efficiency of this method is spoiled when the matrix has eigenvalues close to a function discontinuity. To cure this, a small number of critical eigenvectors are added to the Krylov subspace, and two different deflation schemes are proposed in this augmented subspace. The ensuing method is then applied to the sign function of the overlap Dirac operator, for two different lattice sizes. The sign function has a discontinuity along the imaginary axis and the numerical results show how deflation dramatically improves the efficiency of the method.
In this paper we present a new parallelization scheme for the FMM near-field. The parallelization is based on the Global Arrays Toolkit and uses one-sided communication with overlapping. It employs a purely static load-balancing approach to minimize the number of communication steps and benefits from a maximum utilization of data locality. In contrast to other implementations the communication is initiated by the process owning the data via a put call, not the process receiving the data (via a get call).
We describe the design and implementation of a new algorithm for computing the singular value decomposition of a real bidiagonal matrix. This algorithm uses ideas developed by Großer and Lang that extend Parlett's and Dhillon's MRRR algorithm for the tridiagonal symmetric eigenproblem. One key feature of our new implementation is, that k singular triplets can be computed using only O(nk) storage units and floating point operations, where n is the dimension of the matrix. The algorithm will be made available as routine xBDSCR in the upcoming new release of the LAPACK library.
Support vector machines are powerful kernel methods for classification and regression tasks. If trained optimally, they produce excellent separating hyperplanes. The quality of the training, however, depends not only on the given training data but also on additional learning parameters, which are difficult to adjust, in particular for unbalanced datasets. Traditionally, grid search techniques have been used for determining suitable values for these parameters. In this paper we propose an automated approach to adjusting the learning parameters using a derivative-free numerical optimizer. To make the optimization process more efficient, a new sensitive quality measure is introduced. Numerical tests with a well-known dataset show that our approach can produce support vector machines that are very well tuned to their classification tasks.
Nonlinear systems occur in diverse applications, e.g., in the steady state analysis of chemical processes. If safety concerns require the results to be provably correct then result-verifying algorithms relying on interval arithmetic should be used for solving these systems. Since such algorithms are very computationally intensive, the coarse-grained inter-box parallelism should be exploited to make them feasible in practice. In this paper we briefly describe our framework SONIC for the verified solution of nonlinear systems and give detailed information about its parallelization with OpenMP and MPI. Our numerical results show that the implemented parallelization schemes are indeed successful. The more sophisticated MPI implementation seems to be superior to the easy-to-implement OpenMP version and shows almost linear speedup up to a large number of processors.
This work deals with aspects of support vector machine learning for large-scale data mining tasks. Based on a decomposition algorithm for support vector machine training that can be run in serial as well as shared memory parallel mode we introduce a transformation of the training data that allows for the usage of an expensive generalized kernel without additional costs. We present experiments for the Gaussian kernel, but usage of other kernel functions is possible, too. In order to further speed up the decomposition algorithm we analyze the critical problem of working set selection for large training data sets. In addition, we analyze the influence of the working set sizes onto the scalability of the parallel decomposition scheme. Our tests and conclusions led to several modifications of the algorithm and the improvement of overall support vector machine learning performance. Our method allows for using extensive parameter search methods to optimize classification accuracy.
The increasing amount of data used for classification, as well as the demand for complex models with a large number of well tuned parameters, naturally lead to the search for efficient approaches making use of massively parallel systems. We describe the parallelization of support vector machine learning for shared memory systems. The support vector machine is a powerful and reliable data mining method. Our learning algorithm relies on a decomposition scheme, which in turn uses a special variable projection method, for solving the quadratic program associated with support vector machine learning. By using hybrid parallel programming, our parallelization approach can be combined with the parallelism of a distributed cross validation routine and parallel parameter optimization methods.
In this paper we analyze support vector machine classification using the soft margin approach that allows for errors and margin violations during the training stage. Two models for learning the separating hyperplane do exist. We study the behavior of the resulting optimization algorithms in terms of training time and test accuracy for unbalanced data sets. The main goal of our work is to compare the features of the resulting classification functions, which are mainly defined by the support vectors arising during the support vector machine training.
The support vector machine (SVM) is a well-established and accurate supervised learning method for the classification of data in various application fields. The statistical learning task - the so-called training - can be formulated as a quadratic optimization problem. During the last years the decomposition algorithm for solving this optimization problem became the most frequently used method for support vector machine learning and is the basis of many SVM implementations today. It is characterized by an internal parameter called working set size. Usually small working sets have been assigned. The increasing amount of data used for classification led to new parallel implementations of the decomposition method with efficient inner solvers. With these solvers larger working sets can be assigned. It was shown, that for parallel training with the decomposition algorithm large working sets achieve good speedup values. However, the choice of the optimal working set size for parallel training is not clear. In this paper, we show how the working set size influences the number of decomposition steps, the number of kernel function evaluations and the overall training time in serial and parallel computation.
In this paper we describe a new hybrid distributed/shared memory parallel software for support vector machine learning on large data sets. The support vector machine (SVM) method is a well-known and reliable machine learning technique for classification and regression tasks. Based on a recently developed shared memory decomposition algorithm for support vector machine classifier design we increased the level of parallelism by implementing a cross validation routine based on message passing. With this extention we obtained a flexible parallel SVM software that can be used on high-end machines with SMP architectures to process the large data sets that arise more and more in bioinformatics and other fields of research.
We consider the problem of tuning parameters for the learning method support vector machine. The APPSPACK software, an asynchronous parameter tuning method, is well suited for SVM parameter fitting due to several characteristics. No derivative information is needed for bound constrained optimization and the code can be run in parallel mode. Recently, a hybrid parallel support vector machine has been proposed. To couple both parallel packages, the APPSPACK software needs to be customized to allow for a parallel function evaluation in addition to the parallelism provided by APPSPACK. In this paper we describe our customization of the APPSPACK software to facilitate a top down parallelism in SVM parameter tuning.
When using a Newton-based numerical algorithm to optimize the shape of an airfoil with respect to certain design parameters, a crucial ingredient is the derivative of the objective function with respect to the design parameters. In large-scale aerodynamics, this objective function is an output of a computational fluid dynamics program written in a high-level programming language such as Fortran or C.Numerical differentiation is commonly used to approximate derivatives but is subject to truncation and subtractive cancellation errors. For a particular two-dimensional airfoil, we instead apply automatic differentiation to compute accurate derivatives of the lift and drag coefficients with respect to geometric shape parameters. In automatic differentiation, a given program is transformed into another program capable of computing the original function together with its derivatives. In the problem at hand, the objective function consists of a sequence of programs: a MATLAB program followed by two Fortran 77 programs. It is shown how automatic differentiation is applied to a sequence of programs while keeping the computational complexity within reasonable limits. The derivatives computed by automatic differentiation are compared with approximations based on divided differences.
We show how interval arithmetic can be used in connection with Borsuk's theorem to computationally prove the existence of a solution of a system of nonlinear equations. It turns out that this new test, which can be checked computationally in several different ways, is more general than an existing test based on Miranda's theorem in the sense that it is successful for a larger set of situations. A numerical example is included.
The relatively robust representations (RRR) algorithm is the method of choice to compute highly accurate eigenvector approximations for symmetric tridiagonal matrices. The task of computing singular vector pairs for a bidiagonal matrix B = UΣVT is closely connected to the RRR algorithm regarding BTB, BBT, or the Golub-Kahan matrix TGK. Nevertheless, separate application of the RRR algorithm to these matrices leads to poor results regarding either numerical orthogonality or the residual BV - UΣ . It turns out that the coupling strategy proposed in [B. Grosser and B. Lang, Linear Algebra Appl., 358 (2003), pp. 45-70] resolves this problem. This article provides the corresponding perturbation theory: We compare the eigenvalues of the separate and coupled decompositions and explain why singular vector pairs approximated via couplings are of superior quality.
The ADIFOR 2.0 tool for Automatic Differentiation of Fortran programs has been used to generate analytic gradient code for all semiempirical SCF methods available in the MNDO97 program. The correctness and accuracy of the new code have been verified. Its performance has been compared with that of hand-coded analytic derivative routines and with numerical differentiation. From a quantum-chemical point of view, the major advance of this work is the development of previously unavailable analytic gradient code for the recently proposed OM1 and OM2 methods.
Modern classification methods are able to analyze large, complex and sometimes also unbalanced data sets. It is important not only to produce good results but also to control the time to achieve them. High computational costs lead to the exclusion of data mining methods even in case of good accuracy. Our paper deals with support vector machines in view of the consumption of CPU time. We study their learning behavior for unbalanced data sets with increasing size. We also examine the question whether it is necessary and practicable to parallelize this method.
We consider the problem of selecting and tuning learning parameters for support vector machines, especially for the classification of large and unbalanced data sets. We show why and how simple models with few parameters should be refined and propose an automated approach for tuning the increased number of parameters in the extended model. Based on a sensitive quality measure we analyze correlations between the number of parameters, the learning time and the performance of the trained SVM in classifying independent test data. In addition we study the influence of the quality measure on the classification performance and compare the behavior of serial and asynchronous parallel parameter tuning on an IBM p690 cluster.
Simulation, e.g., in the field of computational fluid dynamics, accounts for a major part of the computing time on high-performance systems. Many simulation packages still rely on Gauss-Seidel iteration, either as the main linear solver or as a smoother for multigrid schemes. Straight-forward implementations of this solver have efficiency problems on today's most common high-performance computers, i.e., multiprocessor clusters with pronounced memory hierarchies. In this work we present two simple techniques for improving the performance of the parallel Gauss-Seidel method for the 3D Poisson equation by optimizing cache usage as well as reducing the number of communication steps.
We describe a multilevel technique combining symbolic and numeric methods to obtain guaranteed enclosures for all solutions of a nonlinear system within a given search box.
We compare two computational tests for the existence of a zero of a nonlinear system. One of the tests is based on a theorem by Moore, the other relies on Miranda's theorem. It turns out that the Miranda test is always at least as powerful as the Moore test. We also indicate some conditions under which both tests are equivalent.
A framework for the verified solution of nonlinear systems arising in the analysis and design of chemical processes is described. The framework combines a symbolic preprocessing step with an interval-based branch-and-bound solver whose efficiency is increased with several acceleration techniques. One of these methods is based on order-2 Taylor expansion; it is also discussed in this note.
We discuss the use of interval arithmetic in the computation of the convex hull of n points in D dimensions. Convex hull algorithms rely on simple geometric tests, such as “does some point p lie in a certain half-space or affine subspace?” to determine the structure of the hull. These tests in turn can be carried out by solving appropriate (not necessarily square) linear systems. If interval-based methods are used for the solution of these systems then the correct hull can be obtained at lower cost than with exact arithmetic. In addition, interval-based methods can determine the topological structure of the convex hull even if the position of the points is not known exactly. In the present paper we compare various interval linear solvers with respect to their ability to handle close-to-pathological situations. This property determines how often interval arithmetic cannot provide the required information and therefore some computations must be redone with exact arithmetic.
Many different heuristics have been proposed for selecting the subdivision direction in branch-and-bound result-verifying nonlinear solvers. We investigate the impact of the box-splitting techniques on the overall performance of the solver and propose a new approach combining some of the simple heuristics in a hybrid way. Numerical experiments with medium-sized example problems indicate that our approach is successful.
Existence or fixed point theorems, combined with interval analytic methods, provide a means to computationally prove the existence of a zero of a nonlinear system in a given interval vector. One such test is based on Borsuk's existence theorem. We discuss preconditioning techniques that are aimed at improving the effectiveness of this test.
Multithreading is a fundamental approach to expressing parallelism in programs. Since Java is emerging as the de facto standard language for platform independent software development in higher education, there is need for teaching multithreading in the context of Java. We use a simple problem from scientific computing to explain two different multithreading approaches to second-year students. More precisely, a simple boundary value problem is considered, which makes visible the differences between native Java threads and the OpenMP interface. So, the students are able to appreciate the respective merits and drawbacks of a thread package that is integrated into the Java programming language and an approach combining compiler support with library calls.
SONIC is a new rigorous nonlinear solver and constrained optimizer based on interval computations. Having been developed primarily to support the parametrically robust design of dynamic systems, SONIC can also be used to solve nonlinear problems arising in other application areas. Since its high-level solvers can be coupled with several different basic interval libraries, it can achieve optimized performance while maintaining a high degree of portability. In addition to the serial code, efficient parallel versions for shared and distributed memory architectures are available.
The RRR algorithm computes the eigendecomposition of a symmetric tridiagonal matrix T with an O(n2) complexity. This article discusses how this method can be extended to the bidiagonal SVD B = U ΣVT. It turns out that using the RRR algorithm as a black box to compute BTB=VΣ2VT and BBT=UΣ2UT separately may give poor results for B V - U Σ. The use of the standard Jordan-Wielandt representation can fail as well if clusters of tiny singular values are present. A solution is to work on BTB and to keep factorizations of BBT implicitly. We introduce a set of coupling transformations which allow us to replace the representation u = (1)/(σ) B v by a more stable representation u = X v, where X is a diagonal matrix. Numerical results of our implementation are compared with the LAPACK routines DSTEGR, DBDSQR and DBDSDC.
Numerical simulation is a powerful tool in science and engineering, and it is also used for optimizing the design of products and experiments rather than only for reproducing the behavior of scientific and engineering systems. In order to reduce the number of simulation runs, the traditional “trial and error” approach for finding near-to-optimum design parameters is more and more replaced with efficient numerical optimization algorithms. Done by hand, the coupling of simulation and optimization software is tedious and error-prone. In this note we introduce a software environment called EFCOSS (Environment For Combining Optimization and Simulation Software) that facilitates and speeds up this task by doing much of the required work automatically. Our framework includes support for automatic differentiation providing the derivatives required by many optimization algorithms. We describe the process of integrating the widely used computational fluid dynamics package FLUENT and a MINPACK-1 least squares optimizer into EFCOSS and follow a sample session solving a data assimilation problem.
Automatic differentiation is a powerful technique for evaluating derivatives of functions given in the form of a high-level programming language such as Fortran, C, or C++. The program is treated as a sequence of elementary statements to which the chain rule of differential calculus is applied mechanically. A key feature of this technique is its ability to generate accurate derivatives rather than approximations obtained from numerical differentiation like divided differences. We survey automatic differentiation and report on preliminary results that have been obtained applying the automatic differentiation tool ADIFOR to a large-scale computational fluid dynamics solver, TFS, developed at Aerodynamisches Institut, Aachen University of Technology. This solver comprises approximately 24,000 lines of Fortran 77 and is able to resolve steady and unsteady laminar and turbulent flows in two and three dimensions.
From an abstract point of view, a numerical simulation implements a mathematical function that produces some output from some given input. Derivatives (or sensitivities) of the function's output with respect to its input can be obtained-free from truncation error-by using a technique called automatic differentiation. Given a computer code in a high-level programming language like Fortran, C, or, C++, automatic differentiation generates another code capable of computing not only the original function but also its derivatives. Thus, the application of automatic differentiation significantly extends the functionality of a simulation package. For instance, automatic differentiation enables, in a completely mechanical fashion, the usage of derivative-based optimization algorithms where the evaluation of the objective function comprises some given large-scale engineering simulation. In this note, the automatic differentiation tool Adifor is used to transform the general purpose finite element package SEPRAN.In doing so, we automatically translate the given 400,000 lines of Fortran 77 into a new program consisting of 600,000 lines of Fortran 77. We compare our approach with a traditional approach based on numerical differentiation and quantify its advantages in terms of accuracy and computational efficiency for a standard fluid flow problem.
Parallel programming of high-performance computers has emerged as a key technology for the numerical solution of large-scale problems arising in computational science and engineering (CSE). The authors believe that principles and techniques of parallel programming are among the essential ingredients of any CSE as well as computer science curriculum. Today, opinions on the role and importance of parallel programming are diverse. Rather than seeing it as a marginal beneficial skill optionally taught at the graduate level, we understand parallel programming as crucial basic skill that should be taught as an integral part of the undergraduate computer science curriculum. A practical training course developed for computer science undergraduates at Aachen University is described. Its goal is to introduce young computer science students to different parallel programming paradigms for shared and distributed memory computers as well as to give a first exposition to the field of computational science by simple, yet carefully chosen sample problems.
Derivatives play a prominent role in many areas of scientific computing. Traditionally, divided differences are employed to approximate derivatives, leading to results of dubious quality at often great computational expense. Automatic differentiation (AD), by contrast, is a powerful technique for accurately evaluating derivatives of functions described in a high-level programming language. AD requires little human effort and produces derivatives without truncation error. Although there is no conceptual difference between small and large codes, applying AD to programs with hundreds of thousands lines of code is still a challenging task and requires a robust AD tool. We report on recent accomplishments of AD applied to the general purpose finite element package SEPRAN consisting of approximately 400,000 lines of Fortran77 and its integration into a prototype problem solving environment called EFCOSS supporting interoperability of simulation codes with optimization software using AD technology.
In this paper we discuss the use of interval arithmetic in the computation of the convex hull of n points in D dimensions. If interval-based methods are used for the solution of certain appropriate (not necessarily square) linear systems then the correct topological structure of the convex hull can be obtained at lower cost than with exact rational arithmetic. In addition, interval-based methods can determine the topological structure of the convex hull even if the position of the points is not known exactly. In our experiments we compared various linear solvers with respect to their speed and their ability to handle close-to-pathological situations. The latter property determines how often interval arithmetic cannot provide the required information and therefore some computations must be redone with rational arithmetic.
For functions given in the form of a computer program, automatic differentiation is an efficient technique to accurately evaluate the derivatives of that function. Starting from a given computer program, automatic differentiation generates another program for the evaluation of the original function and its derivatives in a fully mechanical way. While the efficiency of this black box approach is already high as compared to numerical differentiation based on divided differences, automatic differentiation can be applied even more efficiently by taking into account high-level knowledge about the given computer program. We show that, in the case where the function involves a Fourier transform, the degree of parallelism in the program generated by automatic differentiation can be increased leading to a rich set of automatic parallelization strategies that are not available when employing a black box automatic parallelization approach. Experiments of the new automatic parallelization approach are reported on a SunFire 6800 server using up to 20 processors.
The message passing paradigm is the dominating programming model for parallel processing on distributed-memory architectures. OpenMP is the most widely used parallel programming model for shared-memory systems. Contemporary parallel computers consist of multiple shared-memory computers connected via a network so that a hybrid parallelization approach combining message passing and OpenMP is often desired. We present a hybrid parallelization of an iterative method for the solution of linear systems with a large sparse symmetric coefficient matrix which is implemented in Java. To this end, we bring together mpiJava and JOMP and compare the performance on a Sun Fire 6800 system with a hybrid parallel version of the same problem implemented in Fortran. The scalability of both implementations is similar while, surprisingly, the performance of the Fortran version with data prefetching is only a factor of about 2 larger than the Java version when up to 22 processors are used.
Derivative information is required in numerous applications, including sensitivity analysis and numerical optimization. For simple functions, symbolic differentiation-done either manually or with a computer algebra system-can provide the derivatives, whereas divided differences (DD) have been used traditionally for functions defined by (potentially very complex) computer programs, even if only approximate values can be obtained this way. An alternative approach for such functions is automatic differentiation (AD), yielding exact derivatives at often lower cost than DD, and without restrictions on the program complexity. In this paper we compare the functionality and describe the use of ADMIT/ADMAT and ADiMat. These two AD tools provide derivatives for programs written in the MATLAB language, which is widely used for prototype and production software in scientific and engineering applications. While ADMIT/ADMAT implements a pure operator overloading approach of AD, ADiMat also employes source transformation techniques.
Nonlinear systems occur in diverse applications, i.e., in the steady state analysis of chemical processes. If safety concerns require the results to be provably correct then result-verifying algorithms relying on interval arithmetic should be used for solving these systems. Since such algorithms are very computationally intensive, parallelism must be exploited to make them feasible in practice. We describe our framework for the verified solution of nonlinear systems and our approach to parallelizing it with OpenMP. First numerical results show that the parallelization scheme is indeed successful but that the attainable speedup is limited for some unknown reason.
Given a computer model for the electrostatic potential in an L-shaped region with media of different dielectric permeabilities in two subregions, we are interested in the robustness of the simulation by identifying the rate of change of the potential with respect to a change in the permeabilities. Such sensitivity analyses, assessing the rate of change of certain model outputs implied by varying certain model inputs, can be carried out by computing the corresponding partial derivatives. In large-scale computational physics, the underlying computer model is typically available as a complicated computer code in a high-level programming language such as Fortran, C, or C++. To obtain accurate and efficient derivatives of functions given in this form, we use a technique called automatic or algorithmic differentiation. Unlike numerical differentiation based on divided differences, derivatives generated by automatic differentiation are free of truncation error. Here, the automatic differentiation tool Adifor is used to transform the given computer model-implemented with the general purpose finite element package SEPRAN-into a new computer code computing the derivatives of the electrostatic potential with respect to the dielectric permeabilities. In doing so, we automatically translate 400,000 lines of Fortran 77 into a new program consisting of 600,000 lines of Fortran 77. We compare our approach with a traditional approach based on numerical differentiation and quantify its advantages in terms of accuracy and computational efficiency.
Automatic differentiation (AD) is a powerful technique allowing to compute derivatives of a function given by a (potentially very large) piece of code. The basic principles of AD and some available tools implementing this technology are reviewed. AD is superior to divided differences because AD-generated derivative values are free of approximation errors, and superior to symbolic differentiation because code of very high complexity can be handled, in contrast to computer algebra systems whose applicability is limited to rather simple functions. In addition, the cost for computing gradients of scalar-valued functions with either divided differences or symbolic differentiation grows linearly with the number of variables, whereas the so-called reverse mode of AD can compute such gradients at constant cost.
Derivatives of mathematical functions play a key role in various areas of numerical and technical computing. Many of these computations are done in MATLAB, a popular environment for technical computing providing engineers and scientists with capabilities for mathematical computing, analysis, visualization, and algorithmic development. For functions written in the MATLAB language, a novel software tool is proposed to automatically transform a given MATLAB program into another MATLAB program capable of computing not only the original function but also user-specified derivatives of that function. That is, a program transformation known as automatic differentiation is performed to change the semantics of the program in a fashion based on the chain rule of differential calculus. The crucial ingredient of the tool is a combination of source-to-source transformation and operator overloading. The overall design of the tool is described and numerical experiments are reported demonstrating the efficiency of the resulting code for a sample problem.
Given a numerical simulation of the near wake of an airfoil, automatic differentiation is used to accurately compute the sensitivities of the Mach number with respect to the angle of attack. Such sensitivity information is crucial when integrating a pure simulation code into an optimization framework involving a gradient-based optimization technique. In this note, the ADIFOR system implementing the technology of automatic differentiation for functions written in Fortran 77 is used to mechanically transform a given flow solver called TFS into a new program capable of computing the original simulation and the desired derivatives in a simultaneous fashion. Numerical experiments of derivatives obtained from automatic differentiation and finite differences approximations are reported.
Derivatives of almost arbitrary functions can be evaluated efficiently by automatic differentiation whenever the functions are given in the form of computer programs in a high-level programming language such as Fortran, C, or C++. In contrast to numerical differentiation, where derivatives are only approximated, automatic differentiation generates derivatives that are accurate up to machine precision. Sophisticated software tools implementing the technology of automatic differentiation are capable of automatically generating code for the product of the Jacobian matrix and a so-called seed matrix. It is shown how these tools can benefit from concepts of shared memory programming to parallelize, in a completely mechanical fashion, the gradient operations associated with each statement of the given code. The feasibility of our approach is demonstrated by numerical experiments. They were performed with a code that was generated automatically by the Adifor system and augmented with OpenMP directives.
Derivatives are not only useful in sensitivity analysis but play a vital role in various areas of computational science including parameter optimization of computer models. Among the various methods for obtaining derivatives, automatic differentiation (AD) combines freedom from approximation errors, high performance, and the ability to handle arbitrarily complex codes arising from large-scale scientific investigations. In this note, we show how AD technology can aid in the sensitivity analysis of a computer model of an aircraft. To this end, the software tool ADIFOR, implementing the AD technology for functions written in Fortran 77, was applied to a large-scale computational fluid dynamics solver, TFS, developed at Aerodynamisches Institut, Aachen University of Technology. This solver comprises approximately 24,000 lines of Fortran 77 and is able to resolve steady and unsteady laminar and turbulent flows in two and three dimensions.
Derivatives of almost arbitrary functions can be evaluated efficiently by automatic differentiation whenever the functions are given in the form of computer programs in a high-level programming language such as Fortran, C, or C++. Furthermore, in contrast to numerical differentiation where derivatives are approximated, automatic differentiation generates derivatives that are accurate up to machine precision. The so-called forward mode of automatic differentiation computes derivatives by carrying forward a gradient associated with each intermediate variable simultaneously with the evaluation of the function itself. It is shown how software tools implementing the technology of automatic differentiation can benefit from simple concepts of shared memory programming to parallelize the gradient operations. The feasibility of our approach is demonstrated by numerical experiments. They were performed with a code that was generated automatically by the Adifor system and augmented with OpenMP directives.
Second-order enclosures for a function's range can be computed with centered forms, which involve a so-called slope vector. In this paper we discuss several techniques for determining such vectors and compare them with respect to tightness of the resulting enclosure. We advocate that a two-stage slope vector computation with symbolic preprocessing is optimal.
Simulation is an essential part in computational science and engineering. Since the underlying physical models become more and more complex, the adjustment of certain model parameters (“parameter identification”) is often a nontrivial task and should be addressed by means of efficient numerical optimization algorithms. We describe an environment for coupling simulation and optimization software. The modular architecture of the system supports easy modification of the problem configuration, including the exchange of software components. Therefore our tool is well-suited for rapid prototyping of optimization problems.
An implementation of verified Gaussian quadrature involves algorithmic parameters whose correct setting is crucial for adequate performance. We identify several of these control parameters and, based on experimental data, we try to give advice for choosing suitable parameter values in order to obtain reasonable average performance for a “black-box” quadrature routine.
Given a large-scale engineering simulation, one is often interested in the derivatives of certain program outputs with respect to certain input parameters, e.g., for sensitivity analysis. From an abstract point of view, the computer program defines a mathematical function whose derivatives are sought. In this note, two cases are investigated where the underlying functions are given by the solution of different two-dimensional electrostatic potential problems. For a rectangular region, the function and its derivatives can be derived analytically. For a function based on an L-shaped region, the function is computed by a simulation code and its derivatives are obtained by automatic differentiation. In general, this powerful technique is applicable if a function is given in the form of a computer program in a high-level programming language such as Fortran, C, or C++. In contrast to numerical differentiation providing approximations based on divided differences, the derivatives computed by automatic differentiation are accurate. Furthermore, automatic differentiation is also shown to be computationally efficient.
In recent years, the object-oriented approach has emerged as a key technology for building highly complex scientific codes, as has the use of parallel computers for the solution of large-scale problems. We believe that the paradigm shift towards parallelism will continue and, therefore, principles and techniques of writing parallel programs should be taught to the students at an early stage of their education rather than as an advanced topic near the end of a curriculum. A certain understanding of the practical aspects of numerical modeling is also a useful facet in computer science education. The reason is that, in addition to their traditional prime rôle in computational science and engineering, numerical techniques are also increasingly employed in seemingly non-numerical settings as large-scale data mining and web searching. This paper describes a practical training course for undergraduates, where carefully selected problems of high-performance computing are solved using the programming language Java.
Derivatives are ubiquitous in various areas of computational science including sensitivity analysis and parameter optimization of computer models. Among the various methods for obtaining derivatives, automatic differentiation (AD) combines freedom from approximation errors, high performance, and the ability to handle arbitrarily complex codes arising from large-scale scientific investigations. In this note, we show how AD technology can aid in the sensitivity analysis of a computer model by considering a classic fluid flow experiment as an example. To this end, the software tool ADIFOR implementing the AD technology for functions written in Fortran 77 was applied to the large finite element package SEPRAN. Differentiated versions of SEPRAN enable sensitivity analysis for a wide range of applications, not only from computational fluid dynamics.
Parallel computing has emerged as a key technology for the solution of large-scale problems arising in computational science and engineering. Therefore, principles and techniques of writing parallel programs should be taught to the students at an early stage of their education rather than as an advanced topic near the end of a curriculum. This note describes a practical training course for undergraduates, where the programming language Java is used to pave the student's way from concurrent programming using native Java threads via OpenMP-like shared memory programming to distributed memory programming based on the message passing interface MPI.
Understanding and controlling the behavior of chemical processes are important issues, for safety as well as economical reasons. Some processes can have multiple steady states and even switch between them in a complex way, the reasons for the multiplicity not always being well understood. A singularity theory approach for investigating such behavior leads to nonlinear systems whose solutions correspond to specific singular states of the process. In order to exclude certain types of singularities, rigorous methods must be used to check the solvability of the matching systems. As these systems are highly structured, our solution method combines a symbolic preprocessing phase (term manipulation for utilizing the structure) with a branch-and-bound type rigorous interval-based solver. We report on our experience with this approach for small-to-medium sized example problems.
We develop an algorithmic framework for reducing the bandwidth of symmetric matrices via orthogonal similarity transformations. This framework includes the reduction of full matrices to banded or tridiagonal form and the reduction of banded matrices to narrower banded or tridiagonal form, possibly in multiple steps. Our framework leads to algorithms that require fewer floating-point operations than do standard algorithms, if only the eigenvalues are required. In addition, it allows for space-time tradeoffs and enables or increases the use of blocked transformations.
We present a software toolbox for symmetric band reduction via orthogonal transformations, together with a testing and timing program. The toolbox contains drivers and computational routines for the reduction of full symmetric matrices to banded form and the reduction of banded matrices to narrower banded or tridiagonal form, with optional accumulation of the orthogonal transformations, as well as repacking routines for storage rearrangement. The functionality and the calling sequences of the routines are described, with a detailed discussion of the “control” parameters that allow adaptation of the codes to particular machine and matrix characteristics. We also briefly describe the testing and timing program included in the toolbox.
This article reviews classical and recent direct methods for computing eigenvalues and eigenvectors of symmetric full or banded matrices. The ideas underlying the methods are presented, and the properties of the algorithms with respect to accurary and performance are discussed. Finally, pointers to relevant software are given.
This paper describes a prototype implementation of a solver for dense symmetric eigenproblems, which are too large to fit into the main memory.
We describe two techniques for speeding up eigenvalue and singular value computations on shared memory parallel computers. Depending on the information that is required, different steps in the overall process can be made more efficient. If only the eigenvalues or singular values are sought then the reduction to condensed form may be done in two or more steps to make best use of optimized level-3 BLAS. If eigenvectors and/or singular vectors are required, too, then their accumulation can be speeded up by another blocking technique. The efficiency of the blocked algorithms depends heavily on the values of certain control parameters. We also present a very simple performance model that allows selecting these parameters automatically.
Most methods for calculating the SVD (singular value decomposition) require to first bidiagonalize the matrix. The blocked reduction of a general, dense matrix to bidiagonal form, as implemented in ScaLAPACK, does about one half of the operations with BLAS3. By subdividing the reduction into two stages dense -> banded and banded -> bidiagonal with cubic and quadratic arithmetic costs respectively, we are able to carry out a much higher portion of the calculations in matrix-matrix multiplications. Thus, higher performance can be expected.
This paper presents and compares three parallel techniques for reducing a full matrix to banded form. (The second reduction stage is described in another paper [B. Lang. Parallel reduction of banded matrices to bidiagonal form. Parallel Comput. 22:1-18, 1996]). Numerical experiments on the Intel Paragon and IBM SP/1 distributed memory parallel computers demonstrate that the two-stage reduction approach can be significantly superior if only the singular values are required.
This paper compares several strategies for subdividing the domain in verified multi-dimensional Gaussian quadrature. Subdivision may be used in two places in the quadrature algorithm. First, subdividing the box can reduce the over-estimation of the partial derivatives' ranges, which are needed to bound the approximation error. Second, if the required error bound cannot be met with the whole domain then the box is split into subboxes, and the quadrature algorithm is recursively applied to these. Both variants of subdivision are considered in this paper.
Most methods for computing the singular value decomposition (SVD) first bidiagonalize the matrix. The ScaLAPACK implementation of the blocked reduction of a general dense matrix to bidiagonal form performs about one half of the operations with BLAS3. If we subdivide the task into two stages dense ->banded and banded -> bidiagonal, we can increase the portion of matrix-matrix operations and expect higher performance. We give an overview of different techniques for the first stage.
This note summarizes the results of [B. Großer. Parallele zweistufige Verfahren zur Reduktion auf Bidiagonalgestalt. Diplomarbeit, Fachbereich Mathematik, Bergische Universität GH Wuppertal, 1997], [B. Großer, B. Lang. Efficient parallel reduction to bidiagonal form. TR BUGHW-SC98/2, Fachbereich Mathematik, Bergische Universität GH Wuppertal, 1998].
In der Analyse werden der derzeitige Status und die voraussehbare Entwicklung gekoppelter SMP-Systeme bezüglich der Hardware und Software dargestellt sowie Forschungs- und Entwicklungsaufgaben im Softwarebereich identifiziert, deren Lösung für den effizienten Einsatz gekoppelter SMP-Systeme für das wissenschaftlich-technische Hochleistungsrechnen erforderlich ist. Es werden konkrete Themengebiete für eine Projektförderung durch das BMBF empfohlen.
This paper presents efficient techniques for the orthogonal reduction of banded matrices to bidiagonal and symmetric tridiagonal form. The algorithms are numerically stable and well suited to parallel execution. Experiments on the Intel Paragon show that even on a single processor these methods usually will be several times faster than the corresponding LAPACK routines. In addition, they yield nearly full speedup when run on multiple processors.
This paper presents a technique that allows using level 3 BLAS in a number of rotation-based algorithms. In particular, the update of an orthogonal transformation matrix which often involves the vast majority of operations can be done with a matrix-matrix product. As a case study, the technique is applied to the QR and QL algorithms for computing the eigensystem of a symmetric tridiagonal matrix. The modifications do not affect the convergence properties of the algorithms nor do they significantly increase the overall number of operations. Thus, the computations can be sped up by more than 50% on machines with a distinct memory hierarchy, like the Intel i860 or IBM RS/6000, provided the block size is set appropriately. We also present a simple theoretical analysis that allows selecting an almost-optimal block size.
This paper describes the use of interval arithmetic to bound errors in an experiment for determining Newton's constant of gravitation. Using verified Gaussian quadrature we were able to assess the numerical errors as well as the effect of several tolerances in the physical experiment.
We present a technique that allows speeding up the QR algorithms for symmetric tridiagonal and Hessenberg matrices by doing most of the computations in matrix-matrix products. The reduced memory traffic leads to increased performance.
A parallel algorithm for reducing banded matrices to bidiagonal form is presented. In contrast to the rotation-based “standard approach”, our algorithm is based on Householder transforms, therefore exhibiting considerably higher data locality (BLAS level 2 instead of level 1). The update of the transformation matrices which involves the vast majority of the operations can even be blocked to allow the use of level 3 BLAS. Thus, our algorithm will outperform the standard method on a serial computer with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.
In this paper we describe the use of interval arithmetic in an experiment for determining G, Newton's constant of gravitation. Using an interval version of Gaussian quadrature, we bound the effects of numerical errors and of several tolerances in the physical experiment. This allowed to idientify “critical” tolerances which must be reduced in order to obtain G with the desired accuracy.
We present a parallel algorithm for reducing banded matrices to bidiagonal form. This reduction is a major step in the computation of singular values and singular vectors. In contrast to the rotation-based “standard approach”, our algorithm is based on Householder transforms, therefore exhibiting considerably better data locality (BLAS level 2 instead of level 1). The update of the transformation matrix, the most expensive part of the reduction, can be blocked to allow the use of level 3 BLAS. Thus, even a serial implementation of our algorithm will outperform the standard method on a machine with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.
We present a two-step variant of the “successive band reduction” paradigm for the tridiagonalization of symmetric matrices. Here we first reduce a full matrix to narrow-banded form, and from there to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations, which leads to an algorithm that is almost entirely based on BLAS-3 kernels, and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone Delta prototype and the IBM SP/1.
Es wird ein Verfahren zur Berechnung von Eigenwerten und -vektoren symmetrischer Bandmatrizen auf Parallelrechnern vorgestellt. Zunächst wird die Matrix auf Tridiagonalgestalt reduziert, dann werden mit einem beschleunigten Bisektionsverfahren die Eigenwerte und anschließend die Eigenvektoren berechnet. Das vorgestellte Verfahren ermöglicht in allen drei Phasen die Verteilung der Rechenarbeit auf mehrere Prozessoren und zusätzlich den Einsatz eventuell vorhandener Vektorhardware in den einzelnen Prozessoren. Die Effizienz des Verfahrens wird durch Messungen auf dem iPSC/860 Hypercube-Parallelrechner belegt.
An algorithm is presented for reducing symmetric banded matrices to tridiagonal form via Householder transformations. The algorithm is numerically stable and is well suited to parallel execution on distributed memory multiple instruction multiple data (MIMD) computers. Numerical experiments on the iPSC/860 hypercube show that the new method yields nearly full speedup if it is run on multiple processors. In addition, even on a single processor the new method usually will be several times faster than the corresponding EISPACK and LINPACK routines.
We compare three algorithms for reducing symmetric banded matrices to tridiagonal form and evaluate their performance on the Intel iPSC/860 hypercube parallel computer. Two of these algorithms, the routines BANDR and _SBTRD from the EISPACK and LAPACK libraries, resp., are serial algorithms with little potential for coarse grain parallelism. The third one, called SBTH, is a new parallel algorithm. Results on the iPSC/860 hypercube show that this new method may be several times faster than the others, even when run on a single processor. In addition, execution on multiple processors yields nearly full speedup.
Matrix vector products with symmetric matrices arise in several numerical algorithms, e.g. in the Householder method for eigenproblems and in some iterative methods for the solution of linear systems. Here we propose a procedure for computing such a product on a local memory parallel computer, if only one triangle of the matrix is actually stored. Our method neither increases the number of arithmetic operations nor introduces prohibitive communication overhead.
Given m circles in the plane, contacts between them can be specified by a system of quadratic distance equalities. An approximate solution for the trajectories of the circles for a system of one degree of freedom is given, by replacing the circles by translationally moving regular k-gons. The approximation yields trajectories that are piecewise linear. The next linear generation of the m trajectories are found by an incremental algorithm in O( m2 ) time. Further, an algorithm is presented which finds the next collision between m k-gons moving on lines at constant speed in time O( k m2-x ) for a constant x > 0 using linear space. Finally, more practical collision detection algorithms are sketched based on neighborhood information which, however, do not guarantee a nontrivial worst-case bound.