|
[1]
|
Pascal R. Bähr, Bruno Lang, Peer Ueberholz, Marton Ady, and Roberto
Kersevan.
Development of a hardware-accelerated simulation kernel for
ultra-high vacuum with Nvidia RTX GPUs.
Int. J. High Perform. Comput. Appl., 36(2):141--152, 2022.
[ DOI ]
Molflow+ is a Monte Carlo (MC) simulation software for
ultra-high vacuum, mainly used to simulate pressure
in particle accelerators. In this article, we present
and discuss the design choices arising in a new
implementation of its ray-tracing-based simulation
unit for Nvidia RTX Graphics Processing Units (GPUs).
The GPU simulation kernel was designed with Nvidia's
OptiX 7 API to make use of modern hardware-accelerated
ray-tracing units, found in recent RTX series GPUs
based on the Turing and Ampere architectures. Even
with the challenges posed by switching to 32-bit
computations, our kernel runs much faster than on
comparable CPUs, at the expense of a marginal drop in
calculation precision.
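The primitive that the RTX ray-tracing cores accelerate in hardware is ray/triangle intersection. A minimal CPU-side sketch of the standard Möller-Trumbore test, in Python for illustration only (the paper's kernel is written against the OptiX 7 API; all names below are ours):

```python
import numpy as np

def ray_triangle(orig, direc, v0, v1, v2, eps=1e-9):
    """Moeller-Trumbore ray/triangle intersection.
    Returns the distance t along the ray, or None if there is no hit."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direc, e2)
    det = e1 @ p
    if abs(det) < eps:              # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = orig - v0
    u = (s @ p) * inv               # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = (direc @ q) * inv           # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = (e2 @ q) * inv              # hit distance along the ray
    return t if t > eps else None
```

In a Monte Carlo vacuum simulation this test is evaluated for every traced molecule against every candidate facet, which is why offloading it to dedicated hardware pays off.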
|
|
[2]
|
J. Bloch, A. Frommer, B. Lang, and T. Wettig.
An iterative method to compute the sign function of a non-Hermitian
matrix and its application to the overlap Dirac operator at nonzero
chemical potential.
Comput. Phys. Comm., 177(2):933--943, December 2007.
[ DOI ]
The overlap Dirac operator in lattice QCD requires the
computation of the sign function of a matrix. While
this matrix is usually Hermitian, it becomes
non-Hermitian in the presence of a quark chemical
potential. We show how the action of the sign function
of a non-Hermitian matrix on an arbitrary vector can
be computed efficiently on large lattices by an
iterative method. A Krylov subspace approximation
based on the Arnoldi algorithm is described for the
evaluation of a generic matrix function. The
efficiency of the method is spoiled when the matrix
has eigenvalues close to a function discontinuity.
This is cured by adding a small number of critical
eigenvectors to the Krylov subspace, for which we
propose two different deflation schemes. The ensuing
modified Arnoldi method is then applied to the sign
function, which has a discontinuity along the
imaginary axis. The numerical results clearly show the
improved efficiency of the method. Our modification is
particularly effective when the action of the sign
function of the same matrix has to be computed many
times on different vectors, e.g., if the overlap Dirac
operator is inverted using an iterative method.
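The Krylov subspace approximation underlying this method can be sketched in a few lines of NumPy (a language choice for illustration only; the function names are ours, the deflation of critical eigenvectors described in the paper is omitted):

```python
import numpy as np

def arnoldi(A, b, m):
    """Arnoldi process: orthonormal basis V of K_m(A, b) and Hessenberg H."""
    n = b.size
    V = np.zeros((n, m + 1), dtype=complex)
    H = np.zeros((m + 1, m), dtype=complex)
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):          # modified Gram-Schmidt
            H[i, j] = np.vdot(V[:, i], w)
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:         # invariant subspace found
            return V[:, :j + 1], H[:j + 1, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :m], H[:m, :m]

def matfun(M, f):
    """f(M) via eigendecomposition (assumes M diagonalizable)."""
    lam, S = np.linalg.eig(M)
    return S @ np.diag(f(lam)) @ np.linalg.inv(S)

def sign_times_vector(A, b, m):
    """Approximate sign(A) b as ||b|| * V f(H) e1, with sign taken from
    the real part of the eigenvalues (discontinuity on the imaginary axis)."""
    V, H = arnoldi(A, b, m)
    e1 = np.zeros(H.shape[0]); e1[0] = 1.0
    signf = lambda lam: np.sign(lam.real).astype(complex)
    return np.linalg.norm(b) * (V @ (matfun(H, signf) @ e1))
```

The point of the paper is that this plain approximation degrades when eigenvalues of A approach the imaginary axis, which the proposed deflation schemes repair.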
|
|
[3]
|
Jacques Bloch, Tilo Wettig, Andreas Frommer, and Bruno Lang.
An iterative method to compute the overlap Dirac operator at
nonzero chemical potential.
In Gunnar Bali et al., editors, Proc. Lattice 2007, XXVth
Intl. Symp. Lattice Field Theory, July 30--August 4, 2007, Regensburg,
Germany, Proceedings of Science, page 169, Trieste, Italy, 2007. SISSA.
[ DOI ]
The overlap Dirac operator at nonzero quark chemical
potential involves the computation of the sign of a
non-Hermitian matrix. In this talk we present an
iterative method, first proposed by us in Ref. [1],
which allows for an efficient computation of the
operator, even on large lattices. The starting point
is a Krylov subspace approximation, based on the
Arnoldi algorithm, for the evaluation of a generic
matrix function. The efficiency of this method is
spoiled when the matrix has eigenvalues close to a
function discontinuity. To cure this, a small number
of critical eigenvectors are added to the Krylov
subspace, and two different deflation schemes are
proposed in this augmented subspace. The ensuing
method is then applied to the sign function of the
overlap Dirac operator, for two different lattice
sizes. The sign function has a discontinuity along the
imaginary axis and the numerical results show how
deflation dramatically improves the efficiency of the
method.
|
|
[4]
|
Ivo Kabadshow and Bruno Lang.
Latency-optimized parallelization of the FMM near-field
computations.
In Yong Shi, Geert Dick van Albada, Jack Dongarra, and Peter M. A.
Sloot, editors, Computational Science, Proc. ICCS 2007, 7th
International Conference on Computational Science, May 27--30, 2007, Beijing,
China, Part I, volume 4487 of LNCS, pages 716--722, Berlin, 2007.
Springer-Verlag.
[ DOI ]
In this paper we present a new parallelization scheme
for the FMM near-field. The parallelization is based
on the Global Arrays Toolkit and uses one-sided
communication with overlapping. It employs a purely
static load-balancing approach to minimize the number
of communication steps and benefits from a maximum
utilization of data locality. In contrast to other
implementations, the communication is initiated by the
process owning the data via a put call, not the
process receiving the data (via a get call).
|
|
[5]
|
Rainer Steiger, Christian H. Bischof, Bruno Lang, and Walter Thiel.
Using automatic differentiation to compute derivatives of a
quantum-chemical computer program.
Future Gen. Comput. Syst., 21(8):1324--1332, October 2005.
[ DOI ]
The ADIFOR 2.0 tool for Automatic Differentiation of
Fortran programs has been used to generate analytic
gradient code for all semiempirical SCF methods
available in the MNDO97 program. The correctness and
accuracy of the new code have been verified. Its
performance has been compared with that of hand-coded
analytic derivative routines and with numerical
differentiation. From a quantum-chemical point of
view, the major advance of this work is the
development of previously unavailable analytic
gradient code for the recently proposed OM1 and OM2
methods.
|
|
[6]
|
Jens Gräbel, Bruno Lang, and Peer Ueberholz.
Performance optimization for the parallel Gauss-Seidel smoother.
Proc. Appl. Math. Mech., 5(1):831--832, 2005.
[ DOI ]
Simulation, e.g., in the field of computational fluid
dynamics, accounts for a major part of the computing
time on high-performance systems. Many simulation
packages still rely on Gauss--Seidel iteration, either
as the main linear solver or as a smoother for
multigrid schemes. Straightforward implementations of
this solver have efficiency problems on today's most
common high-performance computers, i.e.,
multiprocessor clusters with pronounced memory
hierarchies. In this work we present two simple
techniques for improving the performance of the
parallel Gauss--Seidel method for the 3D Poisson
equation by optimizing cache usage as well as reducing
the number of communication steps.
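For reference, the serial kernel being optimized is a single Gauss-Seidel sweep over the 7-point stencil of the 3D Poisson equation. A minimal Python sketch (illustration only; the paper's cache and communication optimizations, e.g. blocking the sweep, are not shown):

```python
import numpy as np

def gauss_seidel_sweep(u, f, h):
    """One lexicographic Gauss-Seidel sweep for -Laplace(u) = f on a
    uniform grid with spacing h; Dirichlet values live in the boundary
    layer of u. Updates u in place and returns it."""
    nx, ny, nz = u.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                # each unknown is relaxed using already-updated neighbors
                u[i, j, k] = (u[i-1, j, k] + u[i+1, j, k]
                              + u[i, j-1, k] + u[i, j+1, k]
                              + u[i, j, k-1] + u[i, j, k+1]
                              + h * h * f[i, j, k]) / 6.0
    return u
```

The data dependence of the lexicographic ordering is what makes parallelization nontrivial; reordering the updates (e.g. red-black coloring) or pipelining the sweep trades it for independent work per step.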
|
|
[7]
|
H. Martin Bücker, Bruno Lang, Hans-Joachim Pflug, and Andre Vehreschild.
Threads in an undergraduate course: A Java example illuminating
different multithreading approaches.
In A. Laganà, M. L. Gavrilova, V. Kumar, Y. Mun, C. J. K. Tan,
and O. Gervasi, editors, Computational Science and Its Applications --
ICCSA 2004, Proc. Intl. Conf. Computational Science and its Applications,
Assisi, Italy, May 14--17, 2004. Part II, number 3044 in LNCS, pages
882--891. Springer, Berlin, 2004.
[ DOI ]
Multithreading is a fundamental approach to expressing
parallelism in programs. Since Java is emerging as the
de facto standard language for platform independent
software development in higher education, there is a
need for teaching multithreading in the context of
Java. We use a simple problem from scientific
computing to explain two different multithreading
approaches to second-year students. More precisely, a
simple boundary value problem is considered, which
makes visible the differences between native Java
threads and the OpenMP interface. Thus, the students are
able to appreciate the respective merits and drawbacks
of a thread package that is integrated into the Java
programming language and an approach combining
compiler support with library calls.
|
|
[8]
|
H. M. Bücker, B. Lang, and C. H. Bischof.
Parallel programming in computational science: An introductory
practical training course for computer science at Aachen University.
Future Gen. Comput. Syst., 19(8):1309--1319, November 2003.
[ DOI ]
Parallel programming of high-performance computers has
emerged as a key technology for the numerical solution
of large-scale problems arising in computational
science and engineering (CSE). The authors believe
that principles and techniques of parallel programming
are among the essential ingredients of any CSE as well
as computer science curriculum. Today, opinions on the
role and importance of parallel programming are
diverse. Rather than seeing it as a marginally
beneficial skill optionally taught at the graduate
level, we understand parallel programming as a crucial
basic skill that should be taught as an integral part
of the undergraduate computer science curriculum. A
practical training course developed for computer
science undergraduates at Aachen University is
described. Its goal is to introduce young computer
science students to different parallel programming
paradigms for shared and distributed memory computers
as well as to give a first exposition to the field of
computational science by simple, yet carefully chosen
sample problems.
|
|
[9]
|
H. Martin Bücker, Bruno Lang, Hans-Joachim Pflug, and Andreas Wolf.
A hybrid MPI-OpenMP implementation of the conjugate gradient
method in Java.
In Dieter an Mey, editor, Proc. EWOMP '03, Fifth European
Workshop on OpenMP, September 22-26, 2003, Aachen, Germany, pages 205--213,
Aachen, Germany, 2003. Center for Computing and Communication, Aachen
University.
The message passing paradigm is the dominating
programming model for parallel processing on
distributed-memory architectures. OpenMP is the most
widely used parallel programming model for
shared-memory systems. Contemporary parallel computers
consist of multiple shared-memory computers connected
via a network so that a hybrid parallelization
approach combining message passing and OpenMP is often
desired. We present a hybrid parallelization of an
iterative method for the solution of linear systems
with a large sparse symmetric coefficient matrix which
is implemented in Java. To this end, we bring together
mpiJava and JOMP and compare the performance on a
Sun Fire 6800 system with a hybrid parallel version of
the same problem implemented in Fortran. The
scalability of both implementations is similar while,
surprisingly, the performance of the Fortran version
with data prefetching is only a factor of about 2
higher than that of the Java version when up to 22 processors
are used.
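The iterative method in question is the conjugate gradient algorithm. A serial sketch (in Python rather than Java, for brevity; the names are ours) makes clear which operations a hybrid code distributes: the sparse matrix-vector product and the two inner products are parallel across MPI processes and threads, the vector updates are purely local:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxiter=500):
    """Unpreconditioned CG for a symmetric positive definite A.
    In a hybrid code, A @ p and the dot products are the distributed
    operations; the axpy-style vector updates need no communication."""
    x = np.zeros_like(b)
    r = b - A @ x                   # initial residual
    p = r.copy()                    # initial search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p                  # distributed matvec
        alpha = rs / (p @ Ap)       # distributed dot product
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r              # distributed dot product
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # new conjugate direction
        rs = rs_new
    return x
```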
|
|
[10]
|
Christian H. Bischof, H. Martin Bücker, and Bruno Lang.
Automatic differentiation for computational finance.
In Erricos John Kontoghiorghes, Berç Rustem, and Stavros
Siokos, editors, Computational Methods in Decision-Making, Economics and
Finance Proc., Neuchâtel, August 16, 2000, chapter 15, pages 297--310.
Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.
[ DOI ]
Automatic differentiation (AD) is a powerful technique
for computing derivatives of a function given by
a (potentially very large) piece of code. The basic
principles of AD and some available tools implementing
this technology are reviewed. AD is superior to
divided differences because AD-generated derivative
values are free of approximation errors, and superior
to symbolic differentiation because code of very high
complexity can be handled, in contrast to computer
algebra systems whose applicability is limited to
rather simple functions. In addition, the cost for
computing gradients of scalar-valued functions with
either divided differences or symbolic differentiation
grows linearly with the number of variables, whereas
the so-called reverse mode of AD can compute such
gradients at constant cost.
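The constant-cost property of the reverse mode can be demonstrated with a toy tape-based implementation (a didactic sketch in Python; real AD tools such as the ones reviewed operate on source code, and all names here are ours). One backward sweep yields all partial derivatives, regardless of the number of inputs:

```python
import math

class Var:
    """Minimal reverse-mode AD: every operation records its inputs and
    local partial derivatives; backward() sweeps the tape once."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuples (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))
    __rmul__ = __mul__

    def sin(self):
        return Var(math.sin(self.value), ((self, math.cos(self.value)),))

    def backward(self):
        """Accumulate d(self)/d(input) into every input's .grad field."""
        order, seen = [], set()
        def visit(v):                       # topological order via DFS
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):           # single reverse sweep
            for p, d in v.parents:
                p.grad += d * v.grad
```

For f(x, y) = x*y + sin(x), one call to `backward()` produces both df/dx = y + cos(x) and df/dy = x.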
|
|
[11]
|
Christian Bischof, Martin Bücker, Bruno Lang, and Arno Rasch.
Automatic differentiation of the FLUENT CFD program: Procedure
and first results.
Report RWTH-CS-SC-01-22, RWTH Aachen, Institute for Scientific
Computing, 2001.
|
|
[12]
|
Christian H. Bischof, H. Martin Bücker, Jörg Henrichs, and Bruno
Lang.
Hands-on training for undergraduates in high-performance computing
using Java.
In Tor Sørevik, Fredrik Manne, Randi Moe, and Assefaw Hadish
Gebremedhin, editors, Applied Parallel Computing---New Paradigms for HPC
in Industry and Academia, Proceedings 5th International Workshop, PARA 2000,
Bergen, Norway, June 2000, volume 1947 of LNCS, pages 306--315,
Berlin, 2001. Springer-Verlag.
[ DOI ]
In recent years, the object-oriented approach has
emerged as a key technology for building highly
complex scientific codes, as has the use of parallel
computers for the solution of large-scale problems. We
believe that the paradigm shift towards parallelism
will continue and, therefore, principles and
techniques of writing parallel programs should be
taught to the students at an early stage of their
education rather than as an advanced topic near the
end of a curriculum. A certain understanding of the
practical aspects of numerical modeling is also a
useful facet in computer science education. The reason
is that, in addition to their traditional prime
rôle in computational science and engineering,
numerical techniques are also increasingly employed in
seemingly non-numerical settings such as large-scale data
mining and web searching. This paper describes a
practical training course for undergraduates, where
carefully selected problems of high-performance
computing are solved using the programming language
Java.
|
|
[13]
|
H. Martin Bücker, Bruno Lang, and Christian H. Bischof.
Teaching different parallel programming paradigms using Java.
In Proceedings of the Third Annual Workshop on Java for
High-Performance Computing, Sorrento, Italy, June 17, 2001, pages 73--81,
2001.
Parallel computing has emerged as a key technology for
the solution of large-scale problems arising in
computational science and engineering. Therefore,
principles and techniques of writing parallel programs
should be taught to the students at an early stage of
their education rather than as an advanced topic near
the end of a curriculum. This note describes a
practical training course for undergraduates, where
the programming language Java is used to pave the
students' way from concurrent programming using native
Java threads via OpenMP-like shared memory programming
to distributed memory programming based on the message
passing interface MPI.
|
|
[14]
|
Bruno Lang.
Direct solvers for symmetric eigenvalue problems.
In Johannes Grotendorst, editor, Modern Methods and Algorithms
of Quantum Chemistry, Jülich, February 21--25, 2000, Proceedings, 2nd
Edition, pages 231--259, Jülich, 2000. John von Neumann Institute for
Computing.
This article reviews classical and recent direct
methods for computing eigenvalues and eigenvectors of
symmetric full or banded matrices. The ideas
underlying the methods are presented, and the
properties of the algorithms with respect to accuracy
and performance are discussed. Finally, pointers to
relevant software are given.
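One such pointer in practice: LAPACK-backed direct solvers for the full symmetric eigenproblem are available through `numpy.linalg.eigh`. A minimal usage sketch (ours, not from the article):

```python
import numpy as np

# Assemble a random symmetric matrix and solve the full eigenproblem
# with a LAPACK-backed direct solver.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2.0                  # symmetrize

w, V = np.linalg.eigh(A)             # eigenvalues ascending, V orthonormal

# Check the two accuracy properties discussed for direct solvers:
# small residual A V - V diag(w) and orthonormal eigenvectors.
residual = np.linalg.norm(A @ V - V * w)
orthogonality = np.linalg.norm(V.T @ V - np.eye(6))
```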
|
|
[15]
|
Christian H. Bischof, H. Martin Bücker, Jörg Henrichs, and Bruno
Lang.
Software-Praktikum Paralleles Programmieren mit Java.
Report RWTH-CS-SC-00-13, RWTH Aachen, Institute for Scientific
Computing, 2000.
|
|
[16]
|
Rudolf Berrendorf, Christian Bischof, Holger Brunst, Martin Bücker,
Ulrich Detert, Rüdiger Esser, Michael Gerndt, Johannes Grotendorst,
Inge Gutheil, Hans-Christian Hoppe, Friedel Hoßfeld, Bernd Körfgen,
Bruno Lang, Dieter an Mey, Bernd Mohr, Wolfgang E. Nagel, Karl Solchenbach,
Godehard Sutmann, Valentina Tikko, and Lothar Wollschläger.
Gekoppelte SMP-Systeme im wissenschaftlich-technischen
Hochleistungsrechnen --- Status und Entwicklungsbedarf (GoSMP).
Analyse im Auftrag des BMBF, Zentralinstitut für Angewandte
Mathematik, Forschungszentrum Jülich GmbH, Germany, December 1999.
The analysis presents the current status and the
foreseeable development of coupled SMP systems with
respect to hardware and software, and identifies
research and development tasks in the software area
whose solution is required for the efficient use of
coupled SMP systems in scientific and technical
high-performance computing. Concrete topic areas are
recommended for project funding by the BMBF.
|
|
[17]
|
Bruno Lang.
Matrix vector multiplication for symmetric matrices on parallel
computers and applications.
Z. angew. Math. Mech., 71(6):T807--T808, 1991.
[ DOI ]
Matrix vector products with symmetric matrices arise
in several numerical algorithms, e.g. in the
Householder method for eigenproblems and in some
iterative methods for the solution of linear systems.
Here we propose a procedure for computing such a
product on a local memory parallel computer, if only
one triangle of the matrix is actually stored. Our
method neither increases the number of arithmetic
operations nor introduces prohibitive communication
overhead.
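The arithmetic-preserving idea can be sketched as follows (a serial Python illustration of the operation-count argument only; the paper's contribution is the parallel distribution, which is not shown). Each stored entry of the lower triangle contributes to two components of the result, so no entries of the upper triangle are ever needed:

```python
import numpy as np

def symv_from_lower(A_lower, x):
    """y = A x for symmetric A, with only the lower triangle stored.
    Each off-diagonal entry a_ij (i > j) is used twice: once for y_i
    and once, via symmetry a_ji = a_ij, for y_j."""
    n = x.size
    y = np.zeros_like(x, dtype=float)
    for i in range(n):
        y[i] += A_lower[i, i] * x[i]       # diagonal contribution
        for j in range(i):
            y[i] += A_lower[i, j] * x[j]   # a_ij * x_j
            y[j] += A_lower[i, j] * x[i]   # a_ji * x_i by symmetry
    return y
```

The operation count matches that of a full matrix-vector product; on a distributed-memory machine the challenge is that the "mirrored" contribution y_j belongs to the process owning row j, which is where the communication scheme of the paper comes in.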
|
|
[18]
|
Stephan Abramowski, Bruno Lang, and Heinrich Müller.
Moving regular k-gons in contact.
In J. van Leeuwen, editor, Graph-Theoretic Concepts in Computer
Science, International Workshop WG '88 Proceedings, Amsterdam, June 15--17,
1988, volume 344 of LNCS, pages 229--242, Berlin, 1989.
Springer-Verlag.
[ DOI ]
Given m circles in the plane, contacts between them
can be specified by a system of quadratic distance
equalities. An approximate solution for the
trajectories of the circles for a system of one degree
of freedom is given, by replacing the circles by
translationally moving regular k-gons. The
approximation yields trajectories that are piecewise
linear. The next linear generation of the m
trajectories is found by an incremental algorithm in
O(m^2) time. Further, an algorithm is presented
which finds the next collision between m k-gons
moving on lines at constant speed in time
O(k m^(2-x)) for a constant x > 0 using linear
space. Finally, more practical collision detection
algorithms are sketched based on neighborhood
information which, however, do not guarantee a
nontrivial worst-case bound.
|