Evaluation of High Performance Fortran through Application Kernels
Presented at: HPCN 1997, Vienna, Austria, April 1997.
Hon W Yau
EPCC & NPAC
Edinburgh & Syracuse University
UK & USA
Email: hwyau@epcc.ed.ac.uk
Outline of this Talk
- Motivation for HPF
- Motivation for the NPAC HPF Applications Suite
- The benchmark configurations
- Tour of the codes:
- Embarrassingly Parallel
- Synchronous-stencil operations
- Synchronous-matrix/vector operations
- Selected results
- Discussion
Why HPF
- Largely a benign extension of Fortran 90
- Use of directives to provide information to the compiler
- Portability with Fortran 90 serial code
- Crucial for continued code development
- Only agreed parallel language standard
- Portability amongst machine and compiler vendors
- Single thread of execution
- Clear distinction between distributed and replicated data
- Consistency of data amongst processors is implied
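The distinction between distributed and replicated data can be illustrated with a small sketch. It is written in Python rather than HPF purely for illustration, and it assumes the usual HPF BLOCK layout, in which an n-element array is cut into contiguous chunks of ceil(n/p) elements over p processors. The function names `block_owner` and `local_elements` are hypothetical and do not come from the suite.

```python
import math

def block_owner(i, n, p):
    """0-based processor owning element i (1-based, Fortran style) of an
    n-element array under a BLOCK distribution over p processors."""
    block = math.ceil(n / p)  # contiguous chunk size per processor
    return (i - 1) // block

def local_elements(rank, n, p):
    """Global indices (1-based) held by processor `rank` under BLOCK."""
    block = math.ceil(n / p)
    lo = rank * block + 1
    hi = min(n, (rank + 1) * block)
    return list(range(lo, hi + 1))

n, p = 10, 4

# Distributed data: each processor holds only its own block.
distributed = {rank: local_elements(rank, n, p) for rank in range(p)}

# Replicated data: every processor holds a full copy.
replicated = {rank: list(range(1, n + 1)) for rank in range(p)}

for rank in range(p):
    print(rank, distributed[rank], len(replicated[rank]))
```

With n = 10 and p = 4 the last processor holds a single element, which is one reason small problem sizes can balance poorly under a block layout.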
Why the HPF Applications Suite
- Aims:
- Demonstrate applicability of HPF to common codes
- Measure the maturity of available compilers (features)
- Benchmark available compilers
- Give feedback to the vendors
- Give feedback to the HPF-2.0 Forum meetings
- What it does not aim to do:
- Measure I/O performance
- Consciously test every HPF feature (cf. the new ParkBench effort)
Short History of the Project
- Started at NPAC in early 1995:
- Collection of ~40 short codes from various sources, mostly previous NPAC projects
- TMC, MasPar dialects of data parallel Fortran
- Some message passing variants
- No Fortran 90 versions
- Only the DEC HPF compiler was stable
- Unstable releases from PGI and APR
- Fortran 90 compilers were also quite rare...
- Not obvious which HPF features were implemented well by which compiler
- Many of the codes needed major rewrites
Short History of the Project (cont.)
- Today (April 1997):
- Pruned down to 16 functioning & correct codes
- Many codes were research codes, and required corrections
- About another 4 requiring further work
- Some codes still do not compile, or cause run-time errors
- Concentrated on F90 and HPF versions of codes
- Benchmarked and profiled, with measurements of computation and communication code sections
- No specially written message passing versions
- Fortran 90 compilers are now commonly available
Machine and Compiler Configurations
- Cray T3D at EPCC:
- PGI PGHPF v2.1 (June 1996)
- Cray F90 compiler for serial benchmark
- DEC Alphafarm with Gigaswitch at NPAC:
- DEC F90 & PSE v4.0 (February 1996)
- DEC F90 compiler for serial benchmark
- IBM SP2 at Cornell Theory Center:
- PGI PGHPF v2.1-1 (July 1996)
- PGI PGHPF in `F90' run-time mode for serial benchmark
- IBM SP2 at Cornell Theory Center:
- IBM XLHPF v1.1 (March 1996)
- IBM XLF90 F90 compiler for serial benchmark
Methodology
- Minimum elapsed time over a set of >= 8 runs
- Timing of the codes:
- Fortran 90 intrinsic `SYSTEM_CLOCK()' for wall-clock times
- Use of the PGI graphical profiler:
- Sections of code with communications
- Sections of code which are pure computation
- Measured speed-up with respect to single processor HPF total execution time:
- Serial Fortran 90 runs
- HPF with p=2, 4, 8 (and 16) processor runs
- To determine latency & scalability of HPF code
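The timing recipe above (minimum wall-clock time over at least 8 runs, then speed-up relative to the single-processor HPF time) can be sketched as follows. This is a Python illustration only: `time.perf_counter` stands in for the Fortran 90 `SYSTEM_CLOCK()` intrinsic, `toy_kernel` is a hypothetical stand-in for an application kernel, and the timings in `hpf_times` are invented for the example, not measurements from the talk.

```python
import time

def min_elapsed(kernel, runs=8):
    """Run `kernel` `runs` times and return the smallest wall-clock time.
    Taking the minimum filters out interference from other system load."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        kernel()
        best = min(best, time.perf_counter() - start)
    return best

def speedup(t_one, t_p):
    """Speed-up of a p-processor run relative to the 1-processor HPF run."""
    return t_one / t_p

def toy_kernel():
    # Hypothetical workload standing in for one of the application codes.
    sum(i * i for i in range(10_000))

t = min_elapsed(toy_kernel)
print(f"minimum elapsed time over 8 runs: {t:.6f} s")

# Invented HPF timings in seconds, keyed by processor count.
hpf_times = {1: 12.0, 2: 6.5, 4: 3.4, 8: 2.0}
speedups = {p: speedup(hpf_times[1], tp) for p, tp in hpf_times.items()}
for p, s in speedups.items():
    print(f"p={p}: speed-up {s:.2f}")
```

Measuring against the single-processor HPF run, rather than the serial Fortran 90 run, keeps the scalability figure separate from any fixed overhead the HPF compiler adds.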
The HPF Application Codes
- Embarrassingly Parallel
- Alternating Direction Implicit (ADI)
- 2D FFT
- NPAC's NAS EP Benchmark
- 2D Convolution
- Box-Muller
- Wavelet Image Transformation
- Hough Image Transformation
The HPF Application Codes (cont.)
- Synchronous Algorithms
- Stencil Operation Codes:
- 2D percolation
- Potts model
- Binary phase quenching of the Cahn-Hilliard equation
- Segmented bitonic sort
- Direct N-body simulation
- Binomial stochastic volatility options pricing
- Matrix-vector Operation Codes:
- Gaussian factorisation
- Cholesky factorisation
- Hopfield Neural Network
Benchmark Results
Discussion
- HPF generally good for problems which are:
- Embarrassingly parallel
- Synchronous with stencil operations
- Performance:
- Expect speed-ups of between 4 and 8 on 8 processors
- F90 to HPF latency `speed-down' below 2
- Of the compilers tested:
- PGHPF the most mature
- IBM XLHPF also good, considering its youth
- DEC HPF compiler disappointing compared to its F90 compiler
- IBM SP2 deals better with small problem sizes than the Cray T3D.
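The performance figures quoted above can be turned into a short worked calculation. This Python sketch makes one assumption: the `speed-down' is read as the ratio of the single-processor HPF time to the serial Fortran 90 time. The timings are hypothetical and chosen only to fall inside the quoted ranges.

```python
def parallel_efficiency(speedup, procs):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / procs

def speed_down(t_hpf_one, t_f90_serial):
    """Assumed definition: slowdown of 1-processor HPF versus serial F90."""
    return t_hpf_one / t_f90_serial

# Speed-ups of 4 to 8 on 8 processors correspond to 50%-100% efficiency.
low = parallel_efficiency(4.0, 8)   # 0.5
high = parallel_efficiency(8.0, 8)  # 1.0

# Hypothetical timings in seconds: serial F90, HPF on 1 and on 8 processors.
t_f90, t_hpf1, t_hpf8 = 10.0, 15.0, 2.5
sd = speed_down(t_hpf1, t_f90)  # 1.5, i.e. below 2
net_gain = t_f90 / t_hpf8       # 4.0 relative to the serial F90 code

print(low, high, sd, net_gain)
```

In this invented case the HPF speed-up on 8 processors is 6, but the gain over the serial Fortran 90 code is only 4, because the speed-down of 1.5 has to be paid first.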
Discussion (cont.)
- Feedback on codes which do not compile or run
- Use of distributed data in common blocks?
- Tuning of the code can be tedious
- Feedback from the compiler could be better:
- Memory copy operations
- Data remappings
- Data locality (better control in HPF-2.0)
- Efficiency of a given line of code (sequentialisation?)
- Efficiencies of parallelised intrinsic functions
- Debugging is currently painful
- The time is right to tackle large, complex HPF codes!
Further Information:
- On the NPAC HPFA Project:
- On the NPAC HPFA codes themselves:
- Draft release `HPF for SPMD programming: An Applications Overview':
- To be mirrored on the EPCC web-site:
First Edit: HWY, 15th of May, 1997.