How is HDF5 different from a folder with files?

python persistence metadata hdf5

As someone who developed a scientific project that went from using folders of files to HDF5, I think I can shed some light on the advantages of HDF5.

When I began my project, I was operating on small test datasets, and producing small amounts of output, in the range of kilobytes. I began with the easiest data format, tables encoded as ASCII. For each object I processed, I produced on ASCII table.

I began applying my code to groups of objects, which meant writing multiple ASCII tables at the end of each run, along with an additional ASCII table containing output related to the entire group. For each group, I now had a folder that looked like:

+ group|    |-- object 1|    |-- object 2|    |-- ...|    |-- object N|    |-- summary

At this point, I began running into my first difficulties. ASCII files are very slow to read and write, and they don't pack numeric information very efficiently, because each digit takes a full Byte to encode, rather than ~3.3 bits. So I switched over to writing each object as a custom binary file, which sped up I/O and decreased file size.

As I scaled up to processing large numbers (tens of thousands to millions) of groups, I suddenly found myself dealing with an extremely large number of files and folders. Having too many small files can be a problem for many filesystems (many filesystems are limited in the number of files they can store, regardless of how much disk space there is). I also began to find that when I would try to do post-processing on my entire dataset, the disk I/O to read many small files was starting to take up an appreciable amount of time. I tried to solve these problems by consolidating my files, so that I only produced two files for each group:

+ group 1|    |-- objects|    |-- summary+ group 2|    |-- objects|    |-- summary...

I also wanted to compress my data, so I began creating .tar.gz files for collections of groups.

At this point, my whole data scheme was getting very cumbersome, and there was a risk that if I wanted to hand my data to someone else, it would take a lot of effort to explain to them how to use it. The binary files that contained the objects, for example, had their own internal structure that existed only in a README file in a repository and on a pad of paper in my office. Whoever wanted to read one of my combined object binary files would have to know the byte offset, type and endianness of each metadata entry in the header, and the byte offset of every object in the file. If they didn't, the file would be gibberish to them.

The way I was grouping and compressing data also posed problems. Let's say I wanted to find one object. I would have to locate the .tar.gz file it was in, unzip the entire contents of the archive to a temporary folder, navigate to the group I was interested in, and retrieve the object with my own custom API to read my binary files. After I was done, I would delete the temporarily unzipped files. It was not an elegant solution.

At this point, I decided to switch to a standard format. HDF5 was attractive for a number of reasons. Firstly, I could keep the overall organization of my data into groups, object datasets and summary datasets. Secondly, I could ditch my custom binary file I/O API, and just use a multidimensional array dataset to store all the objects in a group. I could even create arrays of more complicated datatypes, like arrays of C structs, without having to meticulously document the byte offsets of every entry. Next, HDF5 has chunked compression which can be completely transparent to the end user of the data. Because the compression is chunked, if I think users are going to want to look at individual objects, I can have each object compressed in a separate chunk, so that only the part of the dataset the user is interested in needs to be decompressed. Chunked compression is an extremely powerful feature.

Finally, I can just give a single file to someone now, without having to explain much about how it's internally organized. The end user can read the file in Python, C, Fortran, or h5ls on the commandline or the GUI HDFView, and see what's inside. That wasn't possible with my custom binary format, not to mention my .tar.gz collections.

Sure, it's possible to replicate everything you can do with HDF5 with folders, ASCII and custom binary files. That's what I originally did, but it became a major headache, and in the end, HDF5 did everything I was kluging together in an efficient and portable way.

python persistence metadata hdf5

Thanks for asking this interesting question. Is a folder with files portable because I can copy a directory onto a stick on a Mac and then see the same directory and files on a PC? I agree that the file directory structure is portable, thanks to the people that write operating systems, but this is unrelated to the data in the files being portable. Now, if the files in this directory are pdf's, they are portable because there are tools that read and make sense of pdfs in multiple operating systems (thanks to Adobe). But, if those files are raw scientific data (in ASCII or binary doesn't matter) they are not at all portable. The ASCII file would look like a bunch of characters and the binary file would look like gibberish. If the were XML or json files, they would be readable, because json is ASCII, but the information they contain would likely not be portable because the meaning of the XML/json tags may not be clear to someone that did not write the file. This is an important point, the characters in an ASCII file are portable, but the information they represent is not.

HDF5 data are portable, just like the pdf, because there are tools in many operating systems that can read the data in HDF5 files (just like pdf readers, see http://www.hdfgroup.org/products/hdf5_tools/index.html). There are also libraries in many languages that can be used to read the data and present it in a way that makes sense to users – which is what Adobe reader does. There are hundreds of groups in the HDF5 community that do the same thing for their users (see http://www.hdfgroup.org/HDF5/users5.html).

There has been some discussion here of compression as well. The important thing about compressing in HDF5 files is that objects are compressed independently and only the objects that you need get decompressed on output. This is clearly more efficient than compressing the entire file and having to decompress the entire file to read it.

The other critical piece is that HDF5 files are self-describing – so, people that write the files can add information that helps users and tools know what is in the file. What are the variables, what are their types, what software wrote them, what instruments collected them, etc. It sounds like the tool you are working on can read metadata for files. Attributes in an HDF5 file can be attached to any object in the file – they are not just file level information. This is huge. And, of course, those, attributes can be read using tools written in many languages and many operating systems.

python persistence metadata hdf5

I'm currently evaluating HDF5 so had the same question.

This article – Moving Away from HDF5 – asks pretty much the same question. The article raises some good points about the fact that there is only a single implementation of the HDF5 library which is developed in relatively opaque circumstances by modern open-source standards.

As you can tell from the title, the authors decided to move away from HDF5, to a filesystem hierarchy of binary files containing arrays with metadata in JSON files. This was in spite having made a significant investment in HDF5, having had their fingers burnt by data corruption and performance issues.

CodeHunter

How is HDF5 different from a folder with files?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last