Wednesday, 22 November 2017

Benchmarking HPC systems

At SC17, we celebrated the 50th edition of the Top500 list. With nearly 25,000 list positions published over 25 years, the Top500 is an incredibly rich database of consistently measured performance data with associated system configurations, sites, vendors, etc. Each SC and ISC, the Top500 feeds community gossip, serious debate, the HPC media, and ambitious imaginations of HPC marketing departments. Central to the Top500 list is the infamous HPL benchmark.

Benchmarks are used to answer questions such as (naively posed): “How fast is this supercomputer?”, “How fast is my code?”, “How does my code scale?”, “Which system/processor is faster?”.

In the context of HPC, benchmarking means the collection of quantifiable data on the speed, time, scalability, efficiency, or similar characteristics of a specific combination of hardware, software, configuration, and dataset. In practice, this means running well-understood test case(s) on various HPC platforms/configurations under specified conditions or rules (for consistency) and recording appropriate data (e.g., time to completion).
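As a minimal sketch of "running a test case under specified conditions and recording appropriate data", the Python snippet below times a workload over several repeats and summarises the spread. The names `time_test_case`, `summarise`, and `example_test_case` are hypothetical, and the workload is only a stand-in for a real benchmark code.

```python
import statistics
import time


def time_test_case(test_case, repeats=5):
    """Run test_case() `repeats` times; return wall-clock timings in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        test_case()
        timings.append(time.perf_counter() - start)
    return timings


def summarise(timings):
    """Reduce a list of timings to a few summary statistics."""
    return {
        "runs": len(timings),
        "min_s": min(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


def example_test_case():
    # Hypothetical stand-in workload, not a real HPC benchmark.
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    print(summarise(time_test_case(example_test_case)))
```

Repeating the run and reporting the minimum, median, and maximum, rather than a single number, gives a first indication of run-to-run variability.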

These test cases may be full application codes, or subsets of those codes with representative performance behaviour, or standard benchmarks. HPL falls into the last category, although for some applications it could fall into the second category too. In fact, this is the heart of the debate over the continued relevance of the HPL benchmark for building the Top500 list: how many real-world applications does it provide a meaningful performance guide for? But, even moving away from HPL to “user codes”, selecting a set of benchmark codes is as much a political choice (e.g., reflecting stakeholders) as it is a technical choice.

Once the performance data has been collected, an analysis phase usually follows. This seeks to explore and explain the observed performance behaviour with respect to architecture or other features of either the hardware or the software.

Thus, while benchmarks are only measured for specific scenarios, they are most often used to extrapolate or infer more general behaviour. This might include predicting the performance of a potential hardware upgrade, or of a new algorithm, or identifying a performance bottleneck. This is also an easy area for "cheating" or optimistic assumptions to creep in, and care is needed when making decisions based on extrapolated benchmark data. Comparing results across different systems or scales requires a treatment of architectural and extrapolation issues, which will be complex and depend on in-depth knowledge of the hardware, software or both.
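One common first step when asking "How does my code scale?" is to turn measured timings into speedup and parallel efficiency relative to the smallest run. The sketch below assumes a strong-scaling study (fixed problem size). The function name `strong_scaling` is hypothetical, and the timings in the usage example are made-up illustrative values, not measurements.

```python
def strong_scaling(timings_by_cores):
    """Return (cores, speedup, efficiency) rows relative to the smallest run.

    timings_by_cores maps core count -> time to completion in seconds
    for the same fixed-size problem.
    """
    base_cores = min(timings_by_cores)
    base_time = timings_by_cores[base_cores]
    rows = []
    for cores in sorted(timings_by_cores):
        speedup = base_time / timings_by_cores[cores]
        # Efficiency of 1.0 means perfect scaling from the baseline run.
        efficiency = speedup * base_cores / cores
        rows.append((cores, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Illustrative numbers only (assumed, not measured on any system).
    illustrative = {1: 100.0, 2: 50.0, 4: 40.0}
    for cores, speedup, efficiency in strong_scaling(illustrative):
        print(f"{cores:>4} cores  speedup {speedup:.2f}  efficiency {efficiency:.2f}")
```

A table like this only describes the scales that were measured. Extending the trend to larger core counts is exactly the kind of extrapolation that needs care.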

When undertaking benchmarking, it is important to collect a full set of metadata. The more metadata recorded the better, and a typical list might be: which machine, system details, time, other users/codes on the system, runtime flags, code build details, library versions, input deck, core count, node population, topology used, etc.
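A minimal sketch of capturing such metadata alongside each run is shown below, using only the Python standard library. The function name `collect_metadata` and the keys in the usage example (`input_deck`, `core_count`, `runtime_flags`) are hypothetical. A real record would likely also include build details and library versions gathered from the site's own tooling.

```python
import json
import platform
import socket
import sys
from datetime import datetime, timezone


def collect_metadata(run_settings):
    """Combine automatically detected system details with user-supplied settings."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python_version": sys.version.split()[0],
    }
    # Settings that cannot be detected automatically are passed in by the user.
    record.update(run_settings)
    return record


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    metadata = collect_metadata({
        "input_deck": "example_case.inp",
        "core_count": 64,
        "runtime_flags": ["--example-flag"],
    })
    print(json.dumps(metadata, indent=2))
```

Writing the record as JSON next to the timing output keeps each result tied to the conditions under which it was measured.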

Finally, in addition to measuring the performance achieved, it might be appropriate to measure the effort needed to achieve that performance (e.g., code porting and tuning).

Good benchmarking requires specific skills and experience, plus a persistent, methodical, enquiring attitude. It can be complicated, definitely frustrating, hopefully insightful, and even fun!

A version of this article was originally published in the print edition of Top500 News at SC17.
