Local NP Memmap Backend

Local Numpy memmap Backend Implementation, Identifier: NUMPY_10

Backend Identifiers

  • Backend: 1
  • Version: 0
  • Format Code: 10
  • Canonical Name: NUMPY_10

Storage Method

  • Data is written to specific subarray indexes inside a numpy memmapped array on disk.
  • Each file is a zero-initialized array of
    • dtype: {schema_dtype}; e.g. np.float32 or np.uint8
    • shape: (COLLECTION_SIZE, *{schema_shape}); e.g. (500, 10) or (500, 4, 3). The first index in the array is referred to as a “collection index”.
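The storage layout described above can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: the file name, the schema shape and dtype, and COLLECTION_SIZE = 500 (taken from the shape example above) are all assumptions.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

COLLECTION_SIZE = 500        # assumed, matches the (500, ...) shape examples
schema_shape = (4, 3)        # hypothetical schema shape
schema_dtype = np.float32    # hypothetical schema dtype

# Hypothetical file location; a real backend would name the file by its UID.
path = os.path.join(tempfile.mkdtemp(), "example_uid.npy")

# mode="w+" creates a new .npy-backed memmap; untouched entries read as zeros.
mm = open_memmap(path, mode="w+", dtype=schema_dtype,
                 shape=(COLLECTION_SIZE, *schema_shape))

# Write one sample into the subarray selected by its collection index.
sample = np.arange(12, dtype=schema_dtype).reshape(schema_shape)
collection_idx = 2
mm[collection_idx] = sample
mm.flush()
del mm

# Read the file back lazily; only the requested subarray is touched.
stored = np.load(path, mmap_mode="r")
```

Here `stored[collection_idx]` returns the sample that was written, while every other collection index still holds zeros.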

Record Format

Fields Recorded for Each Array

  • Format Code
  • File UID
  • Adler32 Checksum
  • Collection Index (0:COLLECTION_SIZE subarray selection)
  • Subarray Shape

Separators used

  • SEP_KEY: ":"
  • SEP_HSH: "$"
  • SEP_LST: " "
  • SEP_SLC: "*"

Examples

  1. Adding the first piece of data to a file:

    • Array shape (Subarray Shape): (10,)
    • File UID: “NJUUUK”
    • Adler32 Checksum: 900338819
    • Collection Index: 2

    Record Data => '10:NJUUUK$900338819$2*10'

  2. Adding a piece of data to the middle of a file:

    • Array shape (Subarray Shape): (20, 2, 3)
    • File UID: “Mk23nl”
    • Adler32 Checksum: 2546668575
    • Collection Index: 199

    Record Data => "10:Mk23nl$2546668575$199*20 2 3"
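The record layout implied by the two examples can be sketched as a pair of helpers. The function names `encode_record` and `decode_record` are hypothetical and used only for illustration; the separators and field order follow the examples above.

```python
SEP_KEY, SEP_HSH, SEP_LST, SEP_SLC = ":", "$", " ", "*"
FMT_CODE = "10"


def encode_record(uid, checksum, collection_idx, shape):
    """Build a record string from its fields (hypothetical helper)."""
    shape_str = SEP_LST.join(str(dim) for dim in shape)
    return (f"{FMT_CODE}{SEP_KEY}{uid}{SEP_HSH}{checksum}"
            f"{SEP_HSH}{collection_idx}{SEP_SLC}{shape_str}")


def decode_record(record):
    """Split a record string back into its fields (hypothetical helper)."""
    fmt_code, rest = record.split(SEP_KEY, 1)
    uid, checksum, location = rest.split(SEP_HSH)
    collection_idx, shape_str = location.split(SEP_SLC)
    shape = tuple(int(dim) for dim in shape_str.split(SEP_LST))
    return fmt_code, uid, int(checksum), int(collection_idx), shape


record = encode_record("Mk23nl", 2546668575, 199, (20, 2, 3))
fields = decode_record(record)
```

With the values from the second example, `record` reproduces the string shown there, and `decode_record` recovers the original fields from it.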

Technical Notes

  • A typical numpy memmap file persisted to disk does not retain information about its datatype or shape, so both must be provided again when the file is re-opened after being closed. To persist a memmap in .npy format, we use the special function open_memmap, imported from np.lib.format, which opens a memmap file and persists the necessary header information to disk in .npy format.
  • On each write, an adler32 checksum is calculated. It is not used as the primary hash algorithm; rather, it is stored in the local record format itself as a quick way to verify that no disk corruption has occurred. This is required because numpy has no built-in data integrity validation when reading from disk.
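Both notes can be illustrated together in a short sketch. It assumes the checksum is computed with zlib.adler32 over the raw bytes of the written subarray; the exact input the backend hashes is not stated here, and the file name and index are hypothetical.

```python
import os
import tempfile
import zlib

import numpy as np
from numpy.lib.format import open_memmap

path = os.path.join(tempfile.mkdtemp(), "checksum_demo.npy")  # hypothetical

# Write one subarray and record its checksum at write time.
data = np.arange(10, dtype=np.float32)
mm = open_memmap(path, mode="w+", dtype=np.float32, shape=(500, 10))
mm[7] = data
recorded = zlib.adler32(data.tobytes())  # assumed: hash of the raw bytes
mm.flush()
del mm

# Re-open without passing dtype or shape: both come from the .npy header.
reopened = open_memmap(path, mode="r")
read_back = np.array(reopened[7])

# Verify on read by recomputing the checksum and comparing.
is_intact = zlib.adler32(read_back.tobytes()) == recorded

# A modified copy yields a different checksum, which flags the mismatch.
corrupted = read_back.copy()
corrupted[0] += 1
is_corrupt_detected = zlib.adler32(corrupted.tobytes()) != recorded
```

Adler32 is cheap to compute but is not collision resistant, which is consistent with using it only as a quick corruption check and not as the primary hash.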