Thursday 11 August 2011

Big Data and Supercomputing for Science

It is interesting to note the increasing attention “big data” seems to be getting from the supercomputing community.

Data explosion

We talk about the challenges of the exponential increase in data, or even an “explosion of data”. This is caused by our ever-growing ability to generate data. More powerful computational resources deliver finer resolutions, wider parameter studies, etc. The emergence of individual-scale HPC (GPUs, etc.) that is both cost-viable and effort-viable gives increased data creation capability to the many scientists not using high-end supercomputers. And instrumental sources continue to improve in resolution and speed.

So, we are collecting more data than ever before. We are also increasing our use of multiple data sources – fusing data from various sensors and computer models to form predictions or study scientific phenomena.

It is also common to hear questions such as: are we drowning in the volume of data? Is this growth in data overwhelming our ability to extract useful information or insight? Is the potential value of the increased data lost by our inability to manage and comprehend it? Does having more data mean more information – or less due to analysis overload? Does the diversity of formats, quality, and sources further hinder data use?

It is great, even essential, that data is getting more attention in the supercomputing community. But big data, complex data, whatever – this is not a new story. When I started using computers for research lots of my time was spent doing stuff with data: writing routines to convert from one format to another, manually massaging source data into a form usable by the simulation codes (e.g. cleaning up object models, or calibrating instrumental data to correct limitations of collection), storing model output with meta-data for re-use, attending meetings to tackle the forthcoming curation and provenance challenges, … (over a decade ago).

I’m not sure much has really changed.


We must also remember that challenge can always mean opportunity - new insights are often possible as a direct result of the volume of data – statistics, trends, exceptions, data mining, parameter coverage, etc.

Massive amounts of data enhance the alternative research process of data exploration rather than the more traditional hypothesis validation. They create the “discovery by anomaly” scenarios.

From the business and career side, there is also a broader market for data-led HPC: databases, data analytics, business intelligence, etc. This could – should – be a growth opportunity for the traditional HPC community (I’m sure this very ambition is behind some of the increasing profile of big data!).

Let’s explore two quick examples – each a key driver of HPC usage:

Climate science

The methodology underpinning climate science might be described as “predict and store variations of tens of variables over centuries of model time in pseudo-3D space; include various measurement data; and compare multiple models against each other”. Climate scientists are always among the first to demand greater storage, archival facilities or similar capabilities from HPC centres.

Engineering and design

In engineering, we usually use a computer model (CAD) of the current design (encompassing geometry, materials parameters, etc.), and perform CFD, CEM, structures, or whatever simulations on the model to predict some behaviour. The preparation activity (CAD model creation, meshing, etc.) is often a much greater task (human effort, elapsed time) than the simulation itself. The input data has significant value (e.g. IP) – and often drives memory limits of the computational resource. The output data (e.g. field data) will be large in volume, and often needs an audit trail to be meaningful or usable. There is also a substantial role for post-processing - especially visualization.

And there are many others – intelligence, bioinformatics, astronomy, etc.

Each of these examples is not only a technical driver of HPC, but also a “political” one. Each requires major increases in compute (e.g. they can describe a clear benefit from using exaflops). And the data aspects are at least as important as compute to success in each case.

And yet, in my observations of supercomputer centres over the last decade, “massive data” has been the next big thing ... and usually the last thing on anyone's mind in procurements. HPC procurement conversations usually follow the same pattern. How many CPUs? How fast? What’s the price? (Oh, and it has some disk stuff too? – that’s good.)

So what can we do differently? What should we do differently?
