Local NP Memmap Backend

Local Numpy memmap Backend Implementation, Identifier: NUMPY_10

Backend Identifiers

  • Backend: 1

  • Version: 0

  • Format Code: 10

  • Canonical Name: NUMPY_10

Storage Method

  • Data is written to specific subarray indexes inside a numpy memmapped array on disk.

  • Each file is a zero-initialized array with:

    • dtype: {schema_dtype}; e.g. np.float32 or np.uint8

    • shape: (COLLECTION_SIZE, *{schema_shape}); e.g. (500, 10) or (500, 4, 3). The index along the first axis of the array is referred to as the “collection index”.
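The storage method described above can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: the COLLECTION_SIZE value, the schema shape and dtype, and the file name are all assumptions chosen to match the examples in the list.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

COLLECTION_SIZE = 500        # assumed value, matching the shape examples above
schema_shape = (4, 3)        # hypothetical schema shape
schema_dtype = np.float32    # hypothetical schema dtype

# Hypothetical file location; real file names are derived from the File UID.
path = os.path.join(tempfile.mkdtemp(), "example.npy")

# mode="w+" creates the file on disk; a freshly created memmap reads as zeros.
collection = open_memmap(
    path,
    mode="w+",
    dtype=schema_dtype,
    shape=(COLLECTION_SIZE, *schema_shape),
)

# Write one sample into a single slot along the first axis (the collection index).
sample = np.arange(12, dtype=schema_dtype).reshape(schema_shape)
collection_index = 199
collection[collection_index] = sample
collection.flush()  # push pending changes to the file on disk
```

Slots that have not been written yet remain all zeros, so a slot's contents alone cannot show whether it is in use; that information has to come from the records described below.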

Compression Options

Does not accept any compression options. No compression is applied.

Record Format

Fields Recorded for Each Array

  • Format Code

  • File UID

  • xxhash64_hexdigest

  • Collection Index (0:COLLECTION_SIZE subarray selection)

  • Subarray Shape

Examples

  1. Adding the first piece of data to a file:

    • Array shape (Subarray Shape): (10, 10)

    • File UID: “K3ktxv”

    • xxhash64_hexdigest: 94701dd9f32626e2

    • Collection Index: 488

    Record Data =>  "10:K3ktxv:94701dd9f32626e2:488:10 10"

  2. Adding a piece of data to the middle of a file:

    • Array shape (Subarray Shape): (20, 2, 3)

    • File UID: “Mk23nl”

    • xxhash64_hexdigest: 1363344b6c051b29

    • Collection Index: 199

    Record Data => "10:Mk23nl:1363344b6c051b29:199:20 2 3"
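Judging from the two examples above, the record is a colon-separated string whose last field is the subarray shape with dimensions separated by spaces. A minimal sketch of encoding and decoding that layout follows; the function and class names are hypothetical, and the separators are inferred from the examples rather than taken from the implementation.

```python
from typing import NamedTuple, Tuple

FORMAT_CODE = "10"   # format code of this backend (NUMPY_10)
FIELD_SEP = ":"      # inferred from the examples above
SHAPE_SEP = " "      # inferred from the examples above


class Record(NamedTuple):
    format_code: str
    file_uid: str
    checksum: str
    collection_index: int
    shape: Tuple[int, ...]


def encode_record(file_uid: str, checksum: str,
                  collection_index: int, shape: Tuple[int, ...]) -> str:
    """Build the record string for one stored array."""
    shape_str = SHAPE_SEP.join(str(dim) for dim in shape)
    fields = [FORMAT_CODE, file_uid, checksum, str(collection_index), shape_str]
    return FIELD_SEP.join(fields)


def decode_record(raw: str) -> Record:
    """Parse a record string back into its typed fields."""
    fmt, file_uid, checksum, index, shape_str = raw.split(FIELD_SEP)
    shape = tuple(int(dim) for dim in shape_str.split(SHAPE_SEP)) if shape_str else ()
    return Record(fmt, file_uid, checksum, int(index), shape)


first = encode_record("K3ktxv", "94701dd9f32626e2", 488, (10, 10))
second = decode_record("10:Mk23nl:1363344b6c051b29:199:20 2 3")
```

Here `first` reproduces the record string from example 1, and `second` recovers the fields listed in example 2.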

Technical Notes

  • A typical numpy memmap file persisted to disk does not retain information about its datatype or shape, so both must be supplied again when the file is re-opened after close. To persist a memmap in .npy format instead, we use the open_memmap function imported from np.lib.format, which opens a memmap file and writes the necessary header info to disk in .npy format.

  • On each write, an xxhash64_hexdigest checksum is calculated. It is not used as the primary hash algorithm; it is stored in the local record format itself as a quick way to verify that no disk corruption occurred. This is required because numpy has no built-in data integrity validation when reading from disk.
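Both notes can be illustrated with a short sketch. It assumes the checksum is taken over the raw bytes of the written subarray, which is a guess rather than something stated above. The helper names are hypothetical. If the third-party xxhash package is not installed, the sketch falls back to a 64-bit hashlib.blake2b digest purely as a stand-in so that it still runs; that fallback is not what the backend uses.

```python
import hashlib
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

try:
    import xxhash  # third-party package providing xxh64

    def digest(arr: np.ndarray) -> str:
        # Assumption: the checksum covers the raw bytes of the subarray.
        return xxhash.xxh64(arr.tobytes()).hexdigest()
except ImportError:
    def digest(arr: np.ndarray) -> str:
        # Stand-in only: a 64-bit blake2b digest, NOT xxhash64.
        return hashlib.blake2b(arr.tobytes(), digest_size=8).hexdigest()


path = os.path.join(tempfile.mkdtemp(), "example.npy")  # hypothetical file name

# Write: open_memmap stores dtype and shape in the .npy header.
writer = open_memmap(path, mode="w+", dtype=np.uint8, shape=(500, 10))
data = np.arange(10, dtype=np.uint8)
writer[3] = data
stored_checksum = digest(data)  # would be kept in the record, not in the file
writer.flush()
del writer

# Re-open: no dtype or shape is passed; both come from the header.
reader = open_memmap(path, mode="r")


def read_verified(arr: np.ndarray, index: int, expected: str) -> np.ndarray:
    """Copy one subarray out of the memmap and check it against its checksum."""
    out = np.array(arr[index])
    if digest(out) != expected:
        raise ValueError(f"checksum mismatch at collection index {index}")
    return out


restored = read_verified(reader, 3, stored_checksum)
```

A mismatch between the recomputed digest and the one stored in the record raises an error instead of silently returning corrupted data.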