Update: Everything below is largely moot, since I found the Stack Overflow answer about HDF5.

The only point I disagree with there is Blaze. I’ve tried it, and it’s clearly still in a raw state; it will need considerable time to become not just stable but genuinely useful.


My current workflow is entirely based on IPython, and I work extensively with Pandas (which, in my opinion, is an example of poor library design).

Recently, I switched to HDF. However, installing PyTables (which Pandas requires for its HDF support) wasn’t as straightforward as I expected.

Now, I convert all my data to HDF.

  • First, this often reduces the space needed to store data by about 2-3 times (though this depends on the dataset; for some, there’s no difference between CSV and HDF).
  • Second, the data is stored in a binary format, so all types are strictly defined – no parsing or type guessing is necessary.
  • As a result, read and write operations are significantly faster, often by orders of magnitude.
  • Additionally, floating-point values are stored bit-for-bit, so nothing is lost to formatting numbers as text and re-parsing them.
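A minimal sketch of the conversion workflow described above, assuming pandas with the PyTables package (`tables`) installed; the file names and the example DataFrame are placeholders:

```python
import os

import numpy as np
import pandas as pd

# Hypothetical example data; any DataFrame with well-defined dtypes works.
df = pd.DataFrame({
    "id": np.arange(100_000, dtype=np.int64),
    "value": np.random.rand(100_000),
})

# Write the same data as CSV and as HDF5 (the latter requires PyTables).
df.to_csv("data.csv", index=False)
df.to_hdf("data.h5", key="df", mode="w")

# Reading back from HDF5 preserves dtypes exactly: no parsing,
# no type guessing, and floats come back bit-for-bit identical.
back = pd.read_hdf("data.h5", "df")
assert (back.dtypes == df.dtypes).all()
assert (back["value"] == df["value"]).all()

# Compare on-disk sizes; the ratio depends heavily on the dataset.
print(os.path.getsize("data.csv"), os.path.getsize("data.h5"))
```

The size win is largest for numeric-heavy data; for short or text-heavy tables, CSV and HDF can end up about the same.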

I hope these are enough reasons to consider switching to HDF.