Data at the core of the business

13 05 2009

Experimental engineering software projects start with some urge to have a prototype where you can “see” or “do” something: see a finite element mesh, do a specific calculation. Eventually the prototype grows into a full-featured application that does a whole lot of things. In the beginning, the prototype would read a simple text file with a parser, or perhaps an XML file. You thought this would be enough for the years to come, but the next time you check, the requirements have grown so much that the application is overwhelmed with data and data needs. On top of that, it’s as slow as if you were back on a 386DX with a math co-processor. Now you have to switch to binary. And then it does not work across Windows and Linux. Developing a full-fledged, platform-independent binary format is not a bad option, but then it does not version on its own; then come large file support, different clients that need to access the data, the business logic you’ve hard-coded against one particular file when you now want another one, 64-bit support and other exotic platforms, multi-threading, and the list goes on and on…

Eventually every such project finds itself revolving around data. And that’s normal: data is a significant and invaluable part of any engineering or scientific software project. It’s not easy to write a good persistence library. I’ve done it in the past and there’s a whole lot of tricks and catches. Home-made solutions usually suffer from poor scalability and robustness, as well as poor performance. It’s natural to ask oneself “what do other people do?” Finite element people, scientific projects with tons of data, research centres and so on. There’s a limited list of Open Source projects to select from. I have tested a few in the past and ended up enjoying working with two: Metakit and HDF5. I plan to test Hadoop at some point as well. For the moment I will stick to HDF5.

Why HDF and why 5?

HDF5 stands for Hierarchical Data Format 5, though I only know of versions 4 and 5, no 1, 2, 3… In times when Open Source projects have all kinds of fancy and exotic names, one can tell the age of the HDF5 project… You can smell the 80’s academia from miles away; you can even see some girls wearing shoulder pads there in the back as well… I myself used it for the first time in the late 90’s while emitting my own fragrance of academia. However, the old-timer is quite up and running. It is now managed by a private company, spun off like everybody else, and to me it seems that now is the time for HDF5 to show its strength. Obviously the people behind HDF5 have put quite some effort over the last decades into coming up with the right solution at the right time.

So what’s HDF5 after all…

HDF5 is a library that provides a convenient API for persisting scientific/engineering data. Scientific data is regarded here as large arrays (containing millions of elements) of primitive or composite types. These arrays can be multidimensional. The HDF5 API provides a way to describe this data, arrange it into a hierarchical structure and access it. In this respect the file is self-describing: the way the data is accessed does not depend on some application-defined object tree. Rather, a format is described and this description is stored along with the data within the file. To make this clearer, one may write an application (and there already are a couple, like HDFView and HDFExplorer) that can open, traverse and access any HDF5-compliant file, created by any other application.
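
As a taste of what this looks like in practice, here is a minimal sketch using the h5py Python bindings (one of the several available bindings; the file and dataset names are made up for illustration). Note that the reading side needs no schema or business objects: the file itself describes what it contains.

```python
import numpy as np
import h5py

# Write: store a large array of doubles under a descriptive name.
with h5py.File("results.h5", "w") as f:
    f.create_dataset("temperatures", data=np.random.rand(1000000))

# Read it back elsewhere, without knowing how it was written.
with h5py.File("results.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, obj))  # list everything in the file
    print(f["temperatures"].shape, f["temperatures"].dtype)
```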

The data within the file is arranged in a tree-like structure consisting of nodes; because of links, the structure can in fact form a closed graph. Each node of this tree is a group, and each group can contain other groups and datasets. Datasets are the detailed description of the data along with the data itself. Simple examples of datasets could be an array of n doubles, or an array of n structures consisting of several primitive (or integral, or atomic) types. The HDF5 API provides the capability to traverse this tree and access each of the datasets. However, one does not need to walk the whole tree to look up a dataset: this is done using persistent addressing, which takes the form of a path. The user simply accesses a dataset by its path to extract the information needed.
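
An “array of structures” maps onto what HDF5 calls a compound type. Sketched again with h5py (the group layout and record fields here are invented for the example), compound records are described with NumPy structured dtypes, and the dataset is later addressed by its path:

```python
import numpy as np
import h5py

# A compound record: node id plus three coordinates.
node_t = np.dtype([("id", np.int32),
                   ("x", np.float64), ("y", np.float64), ("z", np.float64)])
nodes = np.zeros(1000, dtype=node_t)
nodes["id"] = np.arange(1000)

with h5py.File("mesh.h5", "w") as f:
    grp = f.create_group("Mesh/Nodes")   # intermediate groups are created as needed
    grp.create_dataset("coordinates", data=nodes)

# Later, any client can go straight to the dataset by its path.
with h5py.File("mesh.h5", "r") as f:
    dset = f["/Mesh/Nodes/coordinates"]
    print(dset.dtype, dset[0])
```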

If this sounds complicated, consider these analogies. An HDF5 file looks like a Unix file system: there’s a root, directories are called groups and files are called datasets. You can even have soft and hard links. Path composition is identical to Unix paths: the path /Foo/Bar/… gives access to a particular group at a particular level, while the path /Foo/Bar/MyPreciousData could provide access to a particular dataset. Another way to see an HDF5 file is as a binary XML file. It’s just like XML in the way data is hierarchically arranged, only much richer in complexity and functionality. In principle you can map any XML file to HDF5, while the opposite is not always possible.
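
The file-system analogy extends to links. A small sketch of both kinds in h5py (the names are mine):

```python
import h5py

with h5py.File("links.h5", "w") as f:
    f.create_dataset("/Foo/Bar/MyPreciousData", data=[1.0, 2.0, 3.0])
    # Hard link: a second name pointing at the same object.
    f["/Shortcut"] = f["/Foo/Bar/MyPreciousData"]
    # Soft link: stores only the path, resolved at access time.
    f["/Alias"] = h5py.SoftLink("/Foo/Bar/MyPreciousData")
    print(f["/Alias"][:])   # [1. 2. 3.]
```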

Since datasets are primarily multidimensional arrays of numerous elements of identical type, it is usually faster to process them in batches, and HDF5 is optimised for performing I/O on chunks of data. However, it also provides the capability of accessing subsets of a dataset, using the hyperslab concept. Hyperslabs are regular selections (defined by a start, a stride, a count and a block) that can be combined with set operators, therefore providing the capability of extracting particular pieces of information from the dataset. Patterns defined this way may extend to all dimensions of the array.
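
In the Python bindings, hyperslab selection surfaces as NumPy-style slicing, and chunking is requested when the dataset is created. A sketch (dataset name and chunk shape are illustrative only):

```python
import numpy as np
import h5py

with h5py.File("field.h5", "w") as f:
    # A 2-D dataset stored in 100x100 chunks; only the chunks actually
    # touched are read from or written to disk.
    dset = f.create_dataset("pressure", shape=(10000, 10000),
                            dtype=np.float64, chunks=(100, 100))
    dset[0:100, 0:100] = np.random.rand(100, 100)   # write a single block

with h5py.File("field.h5", "r") as f:
    dset = f["pressure"]
    corner = dset[0:10, 0:10]       # contiguous sub-block
    strided = dset[0:1000:100, 0]   # every 100th row of the first column
    print(corner.shape, strided.shape)
```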

How is HDF5 related to a Data Base Management System?

HDF5 is similar to a Data Base Management System (DBMS) in the sense that it provides the means to define the organisation of the data within persistent storage, as well as data structures to deal with very large amounts of data. It satisfies, to some extent, the Atomicity, Consistency, Isolation, Durability (ACID) properties. It also defines an extensible data model that does not depend on any client application and is therefore not affected by application requirements and their changes.

But the similarities stop there. HDF5 differs from a DBMS in that it provides neither a database query language and report writer, nor a security and transaction mechanism that can handle multiple simultaneous accesses to the database. HDF5 operates on single files or sets of files through the library’s API. It does not provide an application that accesses these files; that is for the client to implement, so security and simultaneous access have to be handled at that level. What’s important to understand here is the different scope of the two systems: a database efficiently handles large numbers of transactions, each consisting of small pieces of data, while HDF5 handles one or only a few transactions consisting of large amounts of data. It is important to pick the right tool for the right task.

How does HDF5 compare to serialization?

I would classify HDF5 on the persistence side. But first of all, what is serialization and what is persistence? I would say that:

Serialization is collapsing a multidimensional object tree into a one-dimensional array of bytes. More importantly, a serialization framework can take that array of bytes, reconstruct the object tree and restore the exact state at the moment of serialization. Random access to the data is not possible: it’s all or nothing. The main consumer of serialized data is the application that serialized it in the first place, or some descendant of that application, in which case versioning becomes an important issue.

Persistence is mapping a multidimensional data structure to another multidimensional data structure in an application independent manner in order to persist that part of the state that is needed by other applications in space and time.

HDF5 compares little to serialization frameworks; it is rather used in an orthogonal manner. Its main task is to persist data. It does not generate a one-dimensional mapping but a multidimensional one, and random access to the data is possible. The data is meant to be consumed by applications other than (but not excluding) the one that created it, without the need to share business objects. Versioning is not an issue, as the data is self-describing. Furthermore, serialized data can itself be persisted using HDF5.
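
To illustrate that last point, one could stash a serialized blob into an HDF5 dataset as raw bytes; a sketch, with Python’s pickle standing in for any serialization framework (names invented):

```python
import pickle
import numpy as np
import h5py

state = {"solver": "CG", "tolerance": 1e-8, "iterations": 512}
blob = np.frombuffer(pickle.dumps(state), dtype=np.uint8)

with h5py.File("archive.h5", "w") as f:
    f.create_dataset("app_state", data=blob)    # opaque bytes to other readers

with h5py.File("archive.h5", "r") as f:
    print(pickle.loads(f["app_state"][:].tobytes()))
```
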
Some major advantages

  1. HDF5 operates on single local files that can be transferred using ordinary OS commands.
  2. HDF5 provides the means to arrange information in a hierarchical tree and access it randomly, independently of the application that generated it.
  3. HDF5 provides a dynamic mapping paradigm.
  4. Performance-wise, it is often faster than doing the equivalent I/O with native calls.

So are there any disadvantages?

The main disadvantages are:

  1. It is an open, self-describing format, which makes the data transparent and completely accessible. This can be a drawback for proprietary applications; in fact, it is one of the major ones.
  2. There is no simultaneous-access support at the dataset level. This means that it is not possible to have multiple clients writing to the same dataset, or writing while others read, at the same time.
  3. Locking/unlocking, journaling and composing transactions are either not implemented yet or left completely to the application side.

All in all, HDF5 is an extremely rich library that offers the versatility engineers and scientists need when it comes to writing their data. The data itself is invaluable and has to be safely persisted; sometimes the data is more important than its source. Consider, for example, scientific experiments that cannot be repeated very easily. HDF5 makes sure we find the data where we put it in the first place. You can access it through nice bindings for C, C++, Java, Python, there’s something for Ruby I think, and emm.., what was it now… oh, Fortran! COMMON, EQUIVALENCE, DIMENSION and the like.

Next time I have a little time, I’ll write some examples of how to use HDF5. Its API is pretty intimidating if you haven’t seen OpenCASCADE’s, but one can put it into action fast.