Local HDF5 Backend¶
Local HDF5 Backend Implementation, Identifier: HDF5_00
Backend Identifiers¶
- Backend: 0
- Version: 0
- Format Code: 00
- Canonical Name: HDF5_00
Storage Method¶
- Data is written to specific subarray indexes inside an HDF5 "dataset" in a single HDF5 file.
- In each HDF5 file there are COLLECTION_COUNT "datasets" (named ["0" : "{COLLECTION_COUNT}"]). These are referred to by their "dataset number".
- Each dataset is a zero-initialized array of:
  - dtype: {schema_dtype}; ie. np.float32 or np.uint8
  - shape: (COLLECTION_SIZE, *{schema_shape.size}); ie. (500, 10) or (500, 300). The first index in the dataset is referred to as a "collection index". See the technical note below for a detailed explanation of why the flatten operation is performed.
- Compression filters and chunking configuration/options are applied globally to all datasets in a file at dataset creation time.
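The layout above can be sketched with h5py. This is a minimal illustration, not the library's actual implementation: the configuration values (COLLECTION_COUNT, COLLECTION_SIZE, the schema shape) and the chunk/compression settings are assumptions chosen for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative values only; real values come from the arrayset schema.
COLLECTION_COUNT = 4        # number of "datasets" per file
COLLECTION_SIZE = 500       # subarrays stored per dataset
schema_shape = (10, 10, 3)  # max subarray shape; flattened size = 300
flat_size = int(np.prod(schema_shape))

path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(path, "w") as f:
    for dset_num in range(COLLECTION_COUNT):
        # Zero-initialized, chunked, and compressed at creation time.
        # No disk space is consumed until data is actually written.
        f.create_dataset(
            str(dset_num),
            shape=(COLLECTION_SIZE, flat_size),
            dtype=np.float32,
            chunks=(1, flat_size),
            compression="lzf",
        )

with h5py.File(path, "r") as f:
    names = sorted(f.keys(), key=int)
    shapes = [f[n].shape for n in names]
```

Note that the compression filter and chunk shape are passed once per `create_dataset` call at creation time, matching the "applied globally ... at dataset creation time" rule above.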
Record Format¶
Fields Recorded for Each Array¶
- Format Code
- File UID
- Dataset Number (
0:COLLECTION_COUNT
dataset selection) - Collection Index (
0:COLLECTION_SIZE
dataset subarray selection) - Subarray Shape
Separators used¶
- SEP_KEY: ":"
- SEP_HSH: "$"
- SEP_LST: " "
- SEP_SLC: "*"
Examples¶
- Adding the first piece of data to a file:
- Array shape (Subarray Shape): (10)
- File UID: “2HvGf9”
- Dataset Number: “0”
- Collection Index: 0
Record Data => "00:2HvGf9$0 0*10"
- Adding a piece of data to the middle of a file:
- Array shape (Subarray Shape): (20, 2, 3)
- File UID: “WzUtdu”
- Dataset Number: “3”
- Collection Index: 199
Record Data => "00:WzUtdu$3 199*20 2 3"
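The record format can be round-tripped with a few lines of string handling. This is a sketch: `encode_record` and `decode_record` are hypothetical helper names, but the separator characters, field order, and expected strings are taken directly from the spec and examples above.

```python
# Separators from the spec.
SEP_KEY, SEP_HSH, SEP_LST, SEP_SLC = ":", "$", " ", "*"

def encode_record(file_uid, dataset_num, collection_idx, shape):
    """Build a record string like "00:2HvGf9$0 0*10"."""
    shape_str = SEP_LST.join(str(dim) for dim in shape)
    return (f"00{SEP_KEY}{file_uid}{SEP_HSH}"
            f"{dataset_num}{SEP_LST}{collection_idx}{SEP_SLC}{shape_str}")

def decode_record(record):
    """Split a record string back into its five fields."""
    fmt_code, rest = record.split(SEP_KEY)
    file_uid, rest = rest.split(SEP_HSH)
    location, shape_str = rest.split(SEP_SLC)
    dataset_num, collection_idx = location.split(SEP_LST)
    shape = tuple(int(dim) for dim in shape_str.split(SEP_LST))
    return fmt_code, file_uid, dataset_num, int(collection_idx), shape

first = encode_record("2HvGf9", "0", 0, (10,))       # first example above
middle = encode_record("WzUtdu", "3", 199, (20, 2, 3))  # second example
fields = decode_record(middle)
```

Note that because SEP_LST (a space) appears both between the dataset number and collection index and between shape dimensions, the decoder must split off the SEP_SLC-delimited shape portion before splitting the location on spaces.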
Technical Notes¶
Files are read-only after initial creation/writes. Only a write-enabled checkout can open an HDF5 file in "w" or "a" mode; writer checkouts create new files on every checkout and make no attempt to fill in unset locations in previous files. This is not an issue, as no disk space is used until data is written to the initially created "zero-initialized" collection datasets.

On write: Single Writer Multiple Reader (SWMR) mode is set to ensure that improper closing (not calling the .close() method) does not corrupt any data which had previously been flushed to the file.

On read: SWMR is set to allow multiple readers (in different threads / processes) to read from the same file. File handle serialization is handled via custom Python pickle serialization/reduction logic, implemented by the high-level pickle reduction __set_state__(), __get_state__() class methods.

An optimization is performed in order to increase the read / write performance of variable-shaped datasets. Because we initialize an entire HDF5 file with all datasets pre-created (sized to the maximum subarray shape), we need to ensure that storing smaller arrays (in a variable-shaped Hangar Arrayset) remains efficient. Since we use chunked storage, writes along incomplete dimensions could otherwise touch chunks which are primarily empty (worst case for "C" index ordering), degrading read / write speeds significantly.
To overcome this, we create HDF5 datasets with a first dimension of size COLLECTION_SIZE and only ONE second dimension of size schema_shape.size() (ie. the product of all dimensions). For example, an array schema with shape (10, 10, 3) would be stored in an HDF5 dataset of shape (COLLECTION_SIZE, 300). Chunk sizes are chosen to align on the first dimension, with a second dimension sized so that the total data fits in the L2 CPU cache (< 256 KB). On write, we use the np.ravel function to construct a "view" (not a copy) of the array as a 1D array, and on read we reshape the array to the recorded size (also a copyless "view-only" operation). This is part of the reason that we only accept C-ordered arrays as input to Hangar.
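The flatten-on-write / reshape-on-read round trip can be demonstrated with plain NumPy. The backing dataset here is simulated with an in-memory array rather than an HDF5 file; COLLECTION_SIZE and the sample schema shape are illustrative values.

```python
import numpy as np

COLLECTION_SIZE = 500
arr = np.arange(300, dtype=np.float32).reshape(10, 10, 3)  # C-ordered sample

# Flatten to 1D for storage. For a C-contiguous array, np.ravel returns
# a view that shares memory with the original, not a copy.
flat = np.ravel(arr)

# A stand-in for one HDF5 dataset of shape (COLLECTION_SIZE, 300):
dataset = np.zeros((COLLECTION_SIZE, 300), dtype=np.float32)
dataset[0] = flat  # write the flattened subarray at collection index 0

# On read, reshape back to the recorded subarray shape (also a view).
restored = dataset[0].reshape(10, 10, 3)
```

For a Fortran-ordered input, np.ravel would have to copy to produce C-ordered 1D data, which is consistent with the note above that only C-ordered arrays are accepted.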