Fixed Shape Optimized Local HDF5

Local HDF5 Backend Implementation, Identifier: HDF5_01

Backend Identifiers

  • Backend: 0

  • Version: 1

  • Format Code: 01

  • Canonical Name: HDF5_01

Storage Method

  • This module is meant to handle larger datasets whose samples are of a fixed shape. IO and compression optimizations are achieved by storing arrays at their appropriate top-level index in the shape they naturally assume, and by chunking over the entire subarray domain making up a sample (rather than having to subdivide chunks, as would be required if samples could be variably shaped).

  • Data is written to specific subarray indexes inside an HDF5 “dataset” in a single HDF5 File.

  • In each HDF5 File there are COLLECTION_COUNT “datasets” (named ["0" : "{COLLECTION_COUNT}"]). These are referred to as the "dataset number".

  • Each dataset is a zero-initialized array of:

    • dtype: {schema_dtype}; e.g. np.float32 or np.uint8

    • shape: (COLLECTION_SIZE, *{schema_shape}); e.g. (500, 10, 10) or (500, 512, 512, 320). The first index in the dataset is referred to as the collection index.

  • Compression Filters, Chunking Configuration/Options are applied globally for all datasets in a file at dataset creation time.

  • On every sample read and write, the xxhash64_hexdigest of the raw array bytes is calculated. This ensures that the data read out of the HDF5 files is identical to the data written in: even if a file is manually edited (bypassing the fletcher32 filter check), there is a quick way to tell that things are not as they should be. A sketch of this layout and checksum step appears below.
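
As a minimal sketch of this layout, using made-up values for COLLECTION_COUNT, COLLECTION_SIZE, and the schema (the file name and all numbers below are purely illustrative, not the backend's actual constants), a file might be created and written to roughly as follows:

    import h5py
    import numpy as np
    import xxhash

    # Illustrative placeholders only; the real backend derives these values
    # from its configuration and the column schema.
    COLLECTION_COUNT = 100   # number of "datasets" per HDF5 file
    COLLECTION_SIZE = 500    # number of subarray slots per dataset
    schema_shape = (10, 10)  # fixed shape of every sample
    schema_dtype = np.float32

    with h5py.File('example_data.hdf5', 'w') as hf:
        # One zero-initialized dataset per "dataset number"; each chunk spans
        # exactly one subarray (one sample).
        for dset_num in range(COLLECTION_COUNT):
            hf.create_dataset(
                f'{dset_num}',
                shape=(COLLECTION_SIZE, *schema_shape),
                dtype=schema_dtype,
                chunks=(1, *schema_shape),
                fletcher32=True)

        # Write one sample: checksum the raw bytes, then place the array at a
        # specific (dataset number, collection index) location.
        arr = np.random.rand(*schema_shape).astype(schema_dtype)
        checksum = xxhash.xxh64_hexdigest(arr.tobytes())
        # 'checksum' would be stored in the sample's record specification.
        hf['16'][105, ...] = arr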

Compression Options

Accepts a dictionary containing the keys listed below; a sketch translating such a dictionary into h5py arguments follows the filter descriptions.

  • backend == "01"

  • complib

  • complevel

  • shuffle

Blosc-HDF5

  • complib valid values:

    • 'blosc:blosclz',

    • 'blosc:lz4',

    • 'blosc:lz4hc',

    • 'blosc:zlib',

    • 'blosc:zstd'

  • complevel valid values: [0, 9] where 0 is “no compression” and 9 is “most compression”

  • shuffle valid values:

    • None

    • 'none'

    • 'byte'

    • 'bit'

LZF Filter

  • 'complib' == 'lzf'

  • 'shuffle' one of [False, None, 'none', True, 'byte']

  • 'complevel' one of [False, None, 'none']

GZip Filter

  • 'complib' == 'gzip'

  • 'shuffle' one of [False, None, 'none', True, 'byte']

  • complevel valid values: [0, 9] where 0 is “no compression” and 9 is “most compression”
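
For illustration only, a helper along the following lines could validate such a dictionary and translate it into h5py create_dataset() keyword arguments. The function name, the use of the hdf5plugin package for the Blosc filter, and the exact mapping are assumptions for this sketch, not the backend's actual code:

    # Hypothetical helper; not the backend's real API.
    def filter_opts_to_h5py_kwargs(opts: dict) -> dict:
        assert opts['backend'] == '01'
        complib = opts['complib']
        complevel = opts.get('complevel')
        shuffle = opts.get('shuffle')
        byte_shuffle = shuffle in (True, 'byte')

        if complib == 'lzf':
            # LZF takes no compression level; byte shuffle is optional.
            return {'compression': 'lzf', 'shuffle': byte_shuffle}
        elif complib == 'gzip':
            assert 0 <= complevel <= 9
            return {'compression': 'gzip', 'compression_opts': complevel,
                    'shuffle': byte_shuffle}
        elif complib.startswith('blosc:'):
            # Blosc is an external HDF5 filter; the hdf5plugin package exposes
            # it with cname / clevel / shuffle arguments.
            import hdf5plugin
            assert 0 <= complevel <= 9
            shuffle_map = {None: hdf5plugin.Blosc.NOSHUFFLE,
                           'none': hdf5plugin.Blosc.NOSHUFFLE,
                           'byte': hdf5plugin.Blosc.SHUFFLE,
                           'bit': hdf5plugin.Blosc.BITSHUFFLE}
            return dict(hdf5plugin.Blosc(cname=complib.split(':', 1)[1],
                                         clevel=complevel,
                                         shuffle=shuffle_map[shuffle]))
        raise ValueError(f'unknown complib: {complib!r}')

    # Example: keyword arguments for a blosc:lz4 filter with byte shuffle.
    kwargs = filter_opts_to_h5py_kwargs(
        {'backend': '01', 'complib': 'blosc:lz4', 'complevel': 5, 'shuffle': 'byte'})
    # hf.create_dataset('0', shape=..., dtype=..., chunks=..., **kwargs)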

Record Format

Fields Recorded for Each Array

  • Format Code

  • File UID

  • xxhash64_hexdigest (i.e. checksum)

  • Dataset Number (0:COLLECTION_COUNT dataset selection)

  • Collection Index (0:COLLECTION_SIZE dataset subarray selection)

  • Subarray Shape

Examples

  1. Adding the first piece of data to a file:

    • Array shape (Subarray Shape): (10, 10)

    • File UID: “rlUK3C”

    • xxhash64_hexdigest: 8067007c0f05c359

    • Dataset Number: 16

    • Collection Index: 105

    Record Data => "01:rlUK3C:8067007c0f05c359:16:105:10 10"

  2. Adding a piece of data to the middle of a file (a sketch of composing and parsing these record strings follows the examples):

    • Array shape (Subarray Shape): (20, 2, 3)

    • File UID: “rlUK3C”

    • xxhash64_hexdigest: b89f873d3d153a9c

    • Dataset Number: 8

    • Collection Index: 199

    Record Data => "01:rlUK3C:b89f873d3d153a9c:8:199:20 2 3"
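
For illustration, records like these could be composed and parsed with a helper along the following lines. The names HDF5_01_Spec, spec_to_record, and record_to_spec are hypothetical and not the backend's actual symbols; only the field order and separators are taken from the examples above:

    from typing import NamedTuple, Tuple

    FMT_CODE = '01'

    class HDF5_01_Spec(NamedTuple):
        # Hypothetical container for the fields recorded per array.
        file_uid: str
        checksum: str
        dataset_num: int
        collection_idx: int
        shape: Tuple[int, ...]

    def spec_to_record(spec: HDF5_01_Spec) -> str:
        shape_str = ' '.join(str(dim) for dim in spec.shape)
        return (f'{FMT_CODE}:{spec.file_uid}:{spec.checksum}:'
                f'{spec.dataset_num}:{spec.collection_idx}:{shape_str}')

    def record_to_spec(record: str) -> HDF5_01_Spec:
        fmt, uid, digest, dset, idx, shape_str = record.split(':')
        assert fmt == FMT_CODE
        return HDF5_01_Spec(uid, digest, int(dset), int(idx),
                            tuple(int(dim) for dim in shape_str.split(' ')))

    spec = HDF5_01_Spec('rlUK3C', '8067007c0f05c359', 16, 105, (10, 10))
    assert spec_to_record(spec) == '01:rlUK3C:8067007c0f05c359:16:105:10 10'
    assert record_to_spec(spec_to_record(spec)) == spec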

Technical Notes

  • The majority of methods not directly related to “chunking” and the “raw data chunk cache” are either identical to HDF5_00, or only slightly modified.

  • Files become read-only after their initial creation / writes. Only a write-enabled checkout can open an HDF5 file in "w" or "a" mode; writer checkouts create new files on every checkout and make no attempt to fill in unset locations in previously created files. This is not an issue, as no disk space is used until data is actually written to the initially created “zero-initialized” collection datasets.

  • On write: Single Writer Multiple Reader (SWMR) mode is set to ensure that improper closing (i.e. not calling the .close() method) does not corrupt any data which had previously been flushed to the file.

  • On read: SWMR mode is set to allow multiple readers (in different threads / processes) to read from the same file. File handle serialization is handled via custom Python pickle serialization/reduction logic implemented in the high-level __getstate__() and __setstate__() methods.

  • An optimization is performed in order to increase the read / write performance of fixed-shape datasets. Because an entire HDF5 file is initialized with all datasets pre-created (at the size of the fixed subarray shape), and the size / shape / access pattern of the arrays is known in advance, inefficient IO due to wasted chunk processing is not a concern. It is far more efficient to bypass the raw data chunk cache entirely and chunk each subarray as a single large item (see the sketch at the end of these notes).

    This method of processing tends to have a number of significant effects as compared to chunked storage methods:

    1. Compression ratios improve (by a non-trivial factor). This is simply because a larger amount of raw data is passed into the compressor at a time. While the exact improvement is highly dependent on both the data size and the compressor used, there should be no case where compressing the full tensor uses more disk space than chunking the tensor, compressing each chunk individually, and then saving each chunk to disk.

    2. Read performance improves (so long as a suitable compressor / option set is chosen). Instead of issuing (potentially) many read requests - one for each chunk - to the storage hardware, significantly fewer IOPS are used to retrieve the entire set of compressed raw data from disk. Fewer IOPS means much less time waiting on the hard disk. Moreover, only a single decompression step is needed to reconstruct the numeric array, completely decoupling performance from HDF5’s ability to parallelize internal filter pipeline operations.

      Additionally, since the entire requested chunk is retrieved in a single decompression pipeline run, there is no need for the HDF5 core to initialize an intermediate buffer which holds data chunks as each decompression operation completes. Further, by preinitializing an empty numpy.ndarray container and using the low-level HDF5 read_direct method, the decompressed data buffer is passed directly into the returned ndarray.__array_interface__.data field with no intermediate copy or processing steps.

    3. Shuffle filters are favored. With much more data to work with in a single compression operation, the use of “byte shuffle” filters in the compressor spec has been observed to both markedly decrease read time and increase compression ratios. Shuffling can significantly reduce the disk space required to store a piece of data, further reducing the time spent waiting on hard disk IO while incurring a negligible cost to decompression speed.

    Taking all of these effects into account, there can be up to an order of magnitude increase in read performance as compared to the subarray chunking strategy employed by the HDF5_00 backend.

  • As with all other backends at the time of writing, only ‘C’ ordered arrays are accepted by this method.
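
To illustrate the notes above, here is a minimal sketch of a writer enabling SWMR mode and a reader using h5py's read_direct to decompress a single-chunk subarray straight into a preallocated array. The file name, dataset number, shapes, and index are placeholders chosen for this sketch, not values used by the backend:

    import h5py
    import numpy as np

    # Writer side: create the file and its datasets, then enable SWMR so that
    # an unclean shutdown cannot corrupt previously flushed data and so that
    # readers in other processes may open the file concurrently.
    with h5py.File('example_swmr.hdf5', 'w', libver='latest') as hf:
        hf.create_dataset('16', shape=(500, 10, 10), dtype=np.float32,
                          chunks=(1, 10, 10), fletcher32=True)
        hf.swmr_mode = True
        hf['16'][105, ...] = np.random.rand(10, 10).astype(np.float32)
        hf.flush()

    # Reader side: open read-only in SWMR mode, preallocate the destination
    # ndarray, and let read_direct fill it. Because one chunk covers the whole
    # subarray, a single read request and one decompression pass suffice.
    with h5py.File('example_swmr.hdf5', 'r', libver='latest', swmr=True) as hf:
        out = np.empty((10, 10), dtype=np.float32)
        hf['16'].read_direct(out, source_sel=np.s_[105, :, :],
                             dest_sel=np.s_[:, :])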