|
[1]
|
Pascal R. Bähr, Bruno Lang, Peer Ueberholz, Marton Ady, and Roberto
Kersevan.
Development of a hardware-accelerated simulation kernel for
ultra-high vacuum with Nvidia RTX GPUs.
Int. J. High Perform. Comput. Appl., 36(2):141--152, 2022.
[ DOI ]
Molflow+ is a Monte Carlo (MC) simulation software for
ultra-high vacuum, mainly used to simulate pressure
in particle accelerators. In this article, we present
and discuss the design choices arising in a new
implementation of its ray-tracing-based simulation
unit for Nvidia RTX Graphics Processing Units (GPUs).
The GPU simulation kernel was designed with Nvidia's
OptiX 7 API to make use of modern hardware-accelerated
ray-tracing units, found in recent RTX series GPUs
based on the Turing and Ampere architectures. Even
with the challenges posed by switching to 32-bit
computations, our kernel runs much faster than on
comparable CPUs, at the expense of a marginal drop in
calculation precision.
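The primitive that the RTX ray-tracing cores accelerate in hardware is ray/triangle intersection. A minimal CPU-side sketch of the standard Möller-Trumbore test, in Python for illustration only (the paper's kernel is written against the OptiX 7 API; all names below are ours):

```python
import numpy as np

def ray_triangle(orig, direc, v0, v1, v2, eps=1e-9):
    """Moeller-Trumbore ray/triangle intersection.
    Returns the distance t along the ray, or None if there is no hit."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direc, e2)
    det = e1 @ p
    if abs(det) < eps:              # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = orig - v0
    u = (s @ p) * inv               # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = (direc @ q) * inv           # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = (e2 @ q) * inv              # hit distance along the ray
    return t if t > eps else None
```

In a Monte Carlo vacuum simulation this test is evaluated for every traced molecule against every candidate facet, which is why offloading it to dedicated hardware pays off.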
|
|
[2]
|
J. Bloch, A. Frommer, B. Lang, and T. Wettig.
An iterative method to compute the sign function of a non-Hermitian
matrix and its application to the overlap Dirac operator at nonzero
chemical potential.
Comput. Phys. Comm., 177(2):933--943, December 2007.
[ DOI ]
The overlap Dirac operator in lattice QCD requires the
computation of the sign function of a matrix. While
this matrix is usually Hermitian, it becomes
non-Hermitian in the presence of a quark chemical
potential. We show how the action of the sign function
of a non-Hermitian matrix on an arbitrary vector can
be computed efficiently on large lattices by an
iterative method. A Krylov subspace approximation
based on the Arnoldi algorithm is described for the
evaluation of a generic matrix function. The
efficiency of the method is spoiled when the matrix
has eigenvalues close to a function discontinuity.
This is cured by adding a small number of critical
eigenvectors to the Krylov subspace, for which we
propose two different deflation schemes. The ensuing
modified Arnoldi method is then applied to the sign
function, which has a discontinuity along the
imaginary axis. The numerical results clearly show the
improved efficiency of the method. Our modification is
particularly effective when the action of the sign
function of the same matrix has to be computed many
times on different vectors, e.g., if the overlap Dirac
operator is inverted using an iterative method.
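The Krylov subspace approximation underlying this method can be sketched in a few lines of NumPy (a language choice for illustration only; the function names are ours, the deflation of critical eigenvectors described in the paper is omitted):

```python
import numpy as np

def arnoldi(A, b, m):
    """Arnoldi process: orthonormal basis V of K_m(A, b) and Hessenberg H."""
    n = b.size
    V = np.zeros((n, m + 1), dtype=complex)
    H = np.zeros((m + 1, m), dtype=complex)
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):          # modified Gram-Schmidt
            H[i, j] = np.vdot(V[:, i], w)
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:         # invariant subspace found
            return V[:, :j + 1], H[:j + 1, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :m], H[:m, :m]

def matfun(M, f):
    """f(M) via eigendecomposition (assumes M diagonalizable)."""
    lam, S = np.linalg.eig(M)
    return S @ np.diag(f(lam)) @ np.linalg.inv(S)

def sign_times_vector(A, b, m):
    """Approximate sign(A) b as ||b|| * V f(H) e1, with sign taken from
    the real part of the eigenvalues (discontinuity on the imaginary axis)."""
    V, H = arnoldi(A, b, m)
    e1 = np.zeros(H.shape[0]); e1[0] = 1.0
    signf = lambda lam: np.sign(lam.real).astype(complex)
    return np.linalg.norm(b) * (V @ (matfun(H, signf) @ e1))
```

The point of the paper is that this plain approximation degrades when eigenvalues of A approach the imaginary axis, which the proposed deflation schemes repair.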
|
|
[3]
|
Jacques Bloch, Tilo Wettig, Andreas Frommer, and Bruno Lang.
An iterative method to compute the overlap Dirac operator at
nonzero chemical potential.
In Gunnar Bali et al., editors, Proc. Lattice 2007, XXVth
Intl. Symp. Lattice Field Theory, July 30--August 4, 2007, Regensburg,
Germany, Proceedings of Science, page 169, Trieste, Italy, 2007. SISSA.
[ DOI ]
The overlap Dirac operator at nonzero quark chemical
potential involves the computation of the sign of a
non-Hermitian matrix. In this talk we present an
iterative method, first proposed by us in Ref. [1],
which allows for an efficient computation of the
operator, even on large lattices. The starting point
is a Krylov subspace approximation, based on the
Arnoldi algorithm, for the evaluation of a generic
matrix function. The efficiency of this method is
spoiled when the matrix has eigenvalues close to a
function discontinuity. To cure this, a small number
of critical eigenvectors are added to the Krylov
subspace, and two different deflation schemes are
proposed in this augmented subspace. The ensuing
method is then applied to the sign function of the
overlap Dirac operator, for two different lattice
sizes. The sign function has a discontinuity along the
imaginary axis and the numerical results show how
deflation dramatically improves the efficiency of the
method.
|
|
[4]
|
Ivo Kabadshow and Bruno Lang.
Latency-optimized parallelization of the FMM near-field
computations.
In Yong Shi, Geert Dick van Albada, Jack Dongarra, and Peter M. A.
Sloot, editors, Computational Science, Proc. ICCS 2007, 7th
International Conference on Computational Science, May 27--30, 2007, Beijing,
China, Part I, volume 4487 of LNCS, pages 716--722, Berlin, 2007.
Springer-Verlag.
[ DOI ]
In this paper we present a new parallelization scheme
for the FMM near-field. The parallelization is based
on the Global Arrays Toolkit and uses one-sided
communication with overlapping. It employs a purely
static load-balancing approach to minimize the number
of communication steps and benefits from a maximum
utilization of data locality. In contrast to other
implementations, the communication is initiated by the
process owning the data via a put call, not the
process receiving the data (via a get call).
|
|
[5]
|
Rainer Steiger, Christian H. Bischof, Bruno Lang, and Walter Thiel.
Using automatic differentiation to compute derivatives of a
quantum-chemical computer program.
Future Gen. Comput. Syst., 21(8):1324--1332, October 2005.
[ DOI ]
The ADIFOR 2.0 tool for Automatic Differentiation of
Fortran programs has been used to generate analytic
gradient code for all semiempirical SCF methods
available in the MNDO97 program. The correctness and
accuracy of the new code have been verified. Its
performance has been compared with that of hand-coded
analytic derivative routines and with numerical
differentiation. From a quantum-chemical point of
view, the major advance of this work is the
development of previously unavailable analytic
gradient code for the recently proposed OM1 and OM2
methods.
|
|
[6]
|
Jens Gräbel, Bruno Lang, and Peer Ueberholz.
Performance optimization for the parallel Gauss-Seidel smoother.
Proc. Appl. Math. Mech., 5(1):831--832, 2005.
[ DOI ]
Simulation, e.g., in the field of computational fluid
dynamics, accounts for a major part of the computing
time on high-performance systems. Many simulation
packages still rely on Gauss--Seidel iteration, either
as the main linear solver or as a smoother for
multigrid schemes. Straightforward implementations of
this solver have efficiency problems on today's most
common high-performance computers, i.e.,
multiprocessor clusters with pronounced memory
hierarchies. In this work we present two simple
techniques for improving the performance of the
parallel Gauss--Seidel method for the 3D Poisson
equation by optimizing cache usage as well as reducing
the number of communication steps.
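For reference, the serial kernel being optimized is a single Gauss-Seidel sweep over the 7-point stencil of the 3D Poisson equation. A minimal Python sketch (illustration only; the paper's cache and communication optimizations, e.g. blocking the sweep, are not shown):

```python
import numpy as np

def gauss_seidel_sweep(u, f, h):
    """One lexicographic Gauss-Seidel sweep for -Laplace(u) = f on a
    uniform grid with spacing h; Dirichlet values live in the boundary
    layer of u. Updates u in place and returns it."""
    nx, ny, nz = u.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                # each unknown is relaxed using already-updated neighbors
                u[i, j, k] = (u[i-1, j, k] + u[i+1, j, k]
                              + u[i, j-1, k] + u[i, j+1, k]
                              + u[i, j, k-1] + u[i, j, k+1]
                              + h * h * f[i, j, k]) / 6.0
    return u
```

The data dependence of the lexicographic ordering is what makes parallelization nontrivial; reordering the updates (e.g. red-black coloring) or pipelining the sweep trades it for independent work per step.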
|
|
[7]
|
H. Martin Bücker, Bruno Lang, Hans-Joachim Pflug, and Andre Vehreschild.
Threads in an undergraduate course: A Java example illuminating
different multithreading approaches.
In A. Laganà, M. L. Gavrilova, V. Kumar, Y. Mun, C. J. K. Tan,
and O. Gervasi, editors, Computational Science and Its Applications --
ICCSA 2004, Proc. Intl. Conf. Computational Science and its Applications,
Assisi, Italy, May 14--17, 2004. Part II, number 3044 in LNCS, pages
882--891. Springer, Berlin, 2004.
[ DOI ]
Multithreading is a fundamental approach to expressing
parallelism in programs. Since Java is emerging as the
de facto standard language for platform independent
software development in higher education, there is a
need for teaching multithreading in the context of
Java. We use a simple problem from scientific
computing to explain two different multithreading
approaches to second-year students. More precisely, a
simple boundary value problem is considered, which
makes visible the differences between native Java
threads and the OpenMP interface. Thus, the students are
able to appreciate the respective merits and drawbacks
of a thread package that is integrated into the Java
programming language and an approach combining
compiler support with library calls.
|
|
[8]
|
H. M. Bücker, B. Lang, and C. H. Bischof.
Parallel programming in computational science: An introductory
practical training course for computer science at Aachen University.
Future Gen. Comput. Syst., 19(8):1309--1319, November 2003.
[ DOI ]
Parallel programming of high-performance computers has
emerged as a key technology for the numerical solution
of large-scale problems arising in computational
science and engineering (CSE). The authors believe
that principles and techniques of parallel programming
are among the essential ingredients of any CSE as well
as computer science curriculum. Today, opinions on the
role and importance of parallel programming are
diverse. Rather than seeing it as a marginally
beneficial skill optionally taught at the graduate
level, we understand parallel programming as a crucial
basic skill that should be taught as an integral part
of the undergraduate computer science curriculum. A
practical training course developed for computer
science undergraduates at Aachen University is
described. Its goal is to introduce young computer
science students to different parallel programming
paradigms for shared and distributed memory computers
as well as to give a first exposition to the field of
computational science by simple, yet carefully chosen
sample problems.
|
|
[9]
|
H. Martin Bücker, Bruno Lang, Hans-Joachim Pflug, and Andreas Wolf.
A hybrid MPI-OpenMP implementation of the conjugate gradient
method in Java.
In Dieter an Mey, editor, Proc. EWOMP '03, Fifth European
Workshop on OpenMP, September 22-26, 2003, Aachen, Germany, pages 205--213,
Aachen, Germany, 2003. Center for Computing and Communication, Aachen
University.
The message passing paradigm is the dominating
programming model for parallel processing on
distributed-memory architectures. OpenMP is the most
widely used parallel programming model for
shared-memory systems. Contemporary parallel computers
consist of multiple shared-memory computers connected
via a network so that a hybrid parallelization
approach combining message passing and OpenMP is often
desired. We present a hybrid parallelization of an
iterative method for the solution of linear systems
with a large sparse symmetric coefficient matrix which
is implemented in Java. To this end, we bring together
mpiJava and JOMP and compare the performance on a
Sun Fire 6800 system with a hybrid parallel version of
the same problem implemented in Fortran. The
scalability of both implementations is similar while,
surprisingly, the performance of the Fortran version
with data prefetching is only a factor of about 2
higher than that of the Java version when up to 22 processors
are used.
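The iterative method in question is the conjugate gradient algorithm. A serial sketch (in Python rather than Java, for brevity; the names are ours) makes clear which operations a hybrid code distributes: the sparse matrix-vector product and the two inner products are parallel across MPI processes and threads, the vector updates are purely local:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxiter=500):
    """Unpreconditioned CG for a symmetric positive definite A.
    In a hybrid code, A @ p and the dot products are the distributed
    operations; the axpy-style vector updates need no communication."""
    x = np.zeros_like(b)
    r = b - A @ x                   # initial residual
    p = r.copy()                    # initial search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p                  # distributed matvec
        alpha = rs / (p @ Ap)       # distributed dot product
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r              # distributed dot product
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # new conjugate direction
        rs = rs_new
    return x
```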
|
|
[10]
|
Christian H. Bischof, H. Martin Bücker, and Bruno Lang.
Automatic differentiation for computational finance.
In Erricos John Kontoghiorghes, Berç Rustem, and Stavros
Siokos, editors, Computational Methods in Decision-Making, Economics and
Finance Proc., Neuchâtel, August 16, 2000, chapter 15, pages 297--310.
Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.
[ DOI ]
Automatic differentiation (AD) is a powerful technique
for computing derivatives of a function given by
a (potentially very large) piece of code. The basic
principles of AD and some available tools implementing
this technology are reviewed. AD is superior to
divided differences because AD-generated derivative
values are free of approximation errors, and superior
to symbolic differentiation because code of very high
complexity can be handled, in contrast to computer
algebra systems whose applicability is limited to
rather simple functions. In addition, the cost for
computing gradients of scalar-valued functions with
either divided differences or symbolic differentiation
grows linearly with the number of variables, whereas
the so-called reverse mode of AD can compute such
gradients at constant cost.
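The constant-cost property of the reverse mode can be demonstrated with a toy tape-based implementation (a didactic sketch in Python; real AD tools such as the ones reviewed operate on source code, and all names here are ours). One backward sweep yields all partial derivatives, regardless of the number of inputs:

```python
import math

class Var:
    """Minimal reverse-mode AD: every operation records its inputs and
    local partial derivatives; backward() sweeps the tape once."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuples (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))
    __rmul__ = __mul__

    def sin(self):
        return Var(math.sin(self.value), ((self, math.cos(self.value)),))

    def backward(self):
        """Accumulate d(self)/d(input) into every input's .grad field."""
        order, seen = [], set()
        def visit(v):                       # topological order via DFS
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):           # single reverse sweep
            for p, d in v.parents:
                p.grad += d * v.grad
```

For f(x, y) = x*y + sin(x), one call to `backward()` produces both df/dx = y + cos(x) and df/dy = x.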
|
|
[11]
|
Christian Bischof, Martin Bücker, Bruno Lang, and Arno Rasch.
Automatic differentiation of the FLUENT CFD program: Procedure
and first results.
Report RWTH-CS-SC-01-22, RWTH Aachen, Institute for Scientific
Computing, 2001.
|
|
[12]
|
Christian H. Bischof, H. Martin Bücker, Jörg Henrichs, and Bruno
Lang.
Hands-on training for undergraduates in high-performance computing
using Java.
In Tor Sørevik, Fredrik Manne, Randi Moe, and Assefaw Hadish
Gebremedhin, editors, Applied Parallel Computing---New Paradigms for HPC
in Industry and Academia, Proceedings 5th International Workshop, PARA 2000,
Bergen, Norway, June 2000, volume 1947 of LNCS, pages 306--315,
Berlin, 2001. Springer-Verlag.
[ DOI ]
In recent years, the object-oriented approach has
emerged as a key technology for building highly
complex scientific codes, as has the use of parallel
computers for the solution of large-scale problems. We
believe that the paradigm shift towards parallelism
will continue and, therefore, principles and
techniques of writing parallel programs should be
taught to the students at an early stage of their
education rather than as an advanced topic near the
end of a curriculum. A certain understanding of the
practical aspects of numerical modeling is also a
useful facet in computer science education. The reason
is that, in addition to their traditional prime
rôle in computational science and engineering,
numerical techniques are also increasingly employed in
seemingly non-numerical settings such as large-scale data
mining and web searching. This paper describes a
practical training course for undergraduates, where
carefully selected problems of high-performance
computing are solved using the programming language
Java.
|
|
[13]
|
H. Martin Bücker, Bruno Lang, and Christian H. Bischof.
Teaching different parallel programming paradigms using Java.
In Proceedings of the Third Annual Workshop on Java for
High-Performance Computing, Sorrento, Italy, June 17, 2001, pages 73--81,
2001.
Parallel computing has emerged as a key technology for
the solution of large-scale problems arising in
computational science and engineering. Therefore,
principles and techniques of writing parallel programs
should be taught to the students at an early stage of
their education rather than as an advanced topic near
the end of a curriculum. This note describes a
practical training course for undergraduates, where
the programming language Java is used to pave the
students' way from concurrent programming using native
Java threads via OpenMP-like shared memory programming
to distributed memory programming based on the message
passing interface MPI.
|
|
[14]
|
Bruno Lang.
Direct solvers for symmetric eigenvalue problems.
In Johannes Grotendorst, editor, Modern Methods and Algorithms
of Quantum Chemistry, Jülich, February 21--25, 2000, Proceedings, 2nd
Edition, pages 231--259, Jülich, 2000. John von Neumann Institute for
Computing.
This article reviews classical and recent direct
methods for computing eigenvalues and eigenvectors of
symmetric full or banded matrices. The ideas
underlying the methods are presented, and the
properties of the algorithms with respect to accuracy
and performance are discussed. Finally, pointers to
relevant software are given.
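One such pointer in practice: LAPACK-backed direct solvers for the full symmetric eigenproblem are available through `numpy.linalg.eigh`. A minimal usage sketch (ours, not from the article):

```python
import numpy as np

# Assemble a random symmetric matrix and solve the full eigenproblem
# with a LAPACK-backed direct solver.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2.0                  # symmetrize

w, V = np.linalg.eigh(A)             # eigenvalues ascending, V orthonormal

# Check the two accuracy properties discussed for direct solvers:
# small residual A V - V diag(w) and orthonormal eigenvectors.
residual = np.linalg.norm(A @ V - V * w)
orthogonality = np.linalg.norm(V.T @ V - np.eye(6))
```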
|
|
[15]
|
Christian H. Bischof, H. Martin Bücker, Jörg Henrichs, and Bruno
Lang.
Software-Praktikum Paralleles Programmieren mit Java.
Report RWTH-CS-SC-00-13, RWTH Aachen, Institute for Scientific
Computing, 2000.
|
|
[16]
|
Rudolf Berrendorf, Christian Bischof, Holger Brunst, Martin Bücker,
Ulrich Detert, Rüdiger Esser, Michael Gerndt, Johannes Grotendorst,
Inge Gutheil, Hans-Christian Hoppe, Friedel Hoßfeld, Bernd Körfgen,
Bruno Lang, Dieter an Mey, Bernd Mohr, Wolfgang E. Nagel, Karl Solchenbach,
Godehard Sutmann, Valentina Tikko, and Lothar Wollschläger.
Gekoppelte SMP-Systeme im wissenschaftlich-technischen
Hochleistungsrechnen --- Status und Entwicklungsbedarf (GoSMP).
Analyse im Auftrag des BMBF, Zentralinstitut für Angewandte
Mathematik, Forschungszentrum Jülich GmbH, Germany, December 1999.
The analysis presents the current status and the
foreseeable development of coupled SMP systems with
respect to hardware and software, and identifies
research and development tasks in the software area
whose solution is required for the efficient use of
coupled SMP systems in scientific and technical
high-performance computing. Concrete topic areas are
recommended for project funding by the BMBF.
|
|
[17]
|
Bruno Lang.
Matrix vector multiplication for symmetric matrices on parallel
computers and applications.
Z. angew. Math. Mech., 71(6):T807--T808, 1991.
[ DOI ]
Matrix vector products with symmetric matrices arise
in several numerical algorithms, e.g. in the
Householder method for eigenproblems and in some
iterative methods for the solution of linear systems.
Here we propose a procedure for computing such a
product on a local memory parallel computer, if only
one triangle of the matrix is actually stored. Our
method neither increases the number of arithmetic
operations nor introduces prohibitive communication
overhead.
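The arithmetic-preserving idea can be sketched as follows (a serial Python illustration of the operation-count argument only; the paper's contribution is the parallel distribution, which is not shown). Each stored entry of the lower triangle contributes to two components of the result, so no entries of the upper triangle are ever needed:

```python
import numpy as np

def symv_from_lower(A_lower, x):
    """y = A x for symmetric A, with only the lower triangle stored.
    Each off-diagonal entry a_ij (i > j) is used twice: once for y_i
    and once, via symmetry a_ji = a_ij, for y_j."""
    n = x.size
    y = np.zeros_like(x, dtype=float)
    for i in range(n):
        y[i] += A_lower[i, i] * x[i]       # diagonal contribution
        for j in range(i):
            y[i] += A_lower[i, j] * x[j]   # a_ij * x_j
            y[j] += A_lower[i, j] * x[i]   # a_ji * x_i by symmetry
    return y
```

The operation count matches that of a full matrix-vector product; on a distributed-memory machine the challenge is that the "mirrored" contribution y_j belongs to the process owning row j, which is where the communication scheme of the paper comes in.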
|
|
[18]
|
Stephan Abramowski, Bruno Lang, and Heinrich Müller.
Moving regular k-gons in contact.
In J. van Leeuwen, editor, Graph-Theoretic Concepts in Computer
Science, International Workshop WG '88 Proceedings, Amsterdam, June 15--17,
1988, volume 344 of LNCS, pages 229--242, Berlin, 1989.
Springer-Verlag.
[ DOI ]
Given m circles in the plane, contacts between them
can be specified by a system of quadratic distance
equalities. An approximate solution for the
trajectories of the circles for a system of one degree
of freedom is given, by replacing the circles by
translationally moving regular k-gons. The
approximation yields trajectories that are piecewise
linear. The next linear generation of the m
trajectories is found by an incremental algorithm in
O(m^2) time. Further, an algorithm is presented
which finds the next collision between m k-gons
moving on lines at constant speed in time
O(k m^(2-x)) for a constant x > 0 using linear
space. Finally, more practical collision detection
algorithms are sketched based on neighborhood
information which, however, do not guarantee a
nontrivial worst-case bound.
|