Why use HDF5?
Update: Everything below is redundant now that I have found the Stack Overflow answer about HDF5.
The only thing I don’t agree with is Blaze. I’ve tried it, and it’s clearly still in a raw state and needs a lot of time to become not just stable but truly useful.
My current workflow is entirely based on IPython, and I work extensively with Pandas (which, in my opinion, is an example of poor library design).
Recently, I transitioned to using HDF. However, installing PyTables (which is required to use HDF with Pandas) wasn’t as straightforward as I expected.
Now, I convert all my data to HDF.
- First, this often reduces the space needed to store data by a factor of about 2-3 (though this depends on the dataset; for some, there’s no difference between CSV and HDF).
- Second, the data is stored in a binary format, so all types are strictly defined – no parsing or type guessing is necessary.
- This makes read/write operations significantly faster (by orders of magnitude).
- Additionally, floats are stored in their binary representation, so no precision is lost to decimal text formatting.
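To make the points about types and floats concrete, here is a minimal sketch that round-trips a small DataFrame through CSV and through HDF and compares what comes back. The helper names (`make_frame`, `roundtrip_csv`, `roundtrip_hdf`), the column names, and the sample values are made up for illustration; the HDF part assumes PyTables is installed, since Pandas needs it for `to_hdf` and `read_hdf`.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_frame():
    """Build a small illustrative frame (hypothetical columns and values)."""
    return pd.DataFrame({
        "count": np.array([1, 2, 3], dtype="int64"),
        "value": np.array([0.1, 1.0 / 3.0, 2.5e-7], dtype="float64"),
        "when": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
    })


def roundtrip_csv(df):
    """Write df to a temporary CSV file and read it back with default settings."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.csv")
        df.to_csv(path, index=False)
        return pd.read_csv(path)


def roundtrip_hdf(df, key="data"):
    """Write df to a temporary HDF5 file and read it back (requires PyTables)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.h5")
        df.to_hdf(path, key=key, mode="w")
        return pd.read_hdf(path, key=key)


if __name__ == "__main__":
    df = make_frame()

    # CSV is text: without extra hints the datetime column comes back as strings.
    print("CSV dtypes:")
    print(roundtrip_csv(df).dtypes)

    try:
        restored = roundtrip_hdf(df)
    except ImportError:
        # Pandas raises ImportError when PyTables is not available.
        print("PyTables is not installed; skipping the HDF round trip.")
    else:
        # HDF is binary: column types are stored alongside the data.
        print("HDF dtypes:")
        print(restored.dtypes)
        same = np.array_equal(restored["value"].to_numpy(), df["value"].to_numpy())
        print("floats identical after HDF round trip:", same)
```

With CSV you can recover the datetime column by passing `parse_dates` to `read_csv`, but you have to remember to do that for every file; with HDF the type information travels with the data.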
I hope these are enough reasons to consider switching to HDF.