Part 1: Creating A Repository And Working With Data

This tutorial will review the first steps of working with a hangar repository.

To fit with the beginner’s theme, we will use the MNIST dataset. Later examples will show off how to work with much more complex data.

from hangar import Repository

import numpy as np
import pickle
import gzip
import matplotlib.pyplot as plt

from tqdm import tqdm

Creating & Interacting with a Hangar Repository

Hangar is designed to “just make sense” in every operation you have to perform. As such, there is a single interface which all interaction begins with: the Repository object.

Whether a hangar repository exists at the path you specify or not, just tell hangar where it should live!

Initializing a repository

The first time you want to work with a new repository, the repository init() method must be called. This is where you provide Hangar with your name and email address (to be used in the commit log), as well as implicitly confirming that you do want to create the underlying data files hangar uses on disk.

repo = Repository(path='/Users/rick/projects/tensorwerk/hangar/dev/mnist/')

# First time a repository is accessed only!
# Note: if you feed a path to the `Repository` which does not contain a pre-initialized hangar repo,
# when the Repository object is initialized it will let you know that you need to run `init()`

repo.init(user_name='Rick Izzo', user_email='', remove_old=True)
Hangar Repo initialized at: /Users/rick/projects/tensorwerk/hangar/dev/mnist/.hangar

Checking out the repo for writing

A repository can be checked out in two modes:

  1. write-enabled: applies all operations to the staging area’s current state. Only one write-enabled checkout can be active at a time, and it must be closed after its last use; otherwise manual intervention will be needed to remove the writer lock.
  2. read-only: checks out a commit or branch to view repository state as it existed at that point in time.
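
The single-writer rule above can be pictured with a small toy lock. This is only a sketch: `ToyWriterLock` and its methods are hypothetical names made up for illustration, and this is not how Hangar implements its writer lock internally.

```python
import uuid


class ToyWriterLock:
    """Hypothetical stand-in for a repository-wide writer lock (not Hangar's code)."""

    def __init__(self):
        self._holder = None  # token of the active writer, or None when free

    def acquire(self):
        # A second writer is refused while the first is still open.
        if self._holder is not None:
            raise PermissionError('writer lock is already held')
        self._holder = str(uuid.uuid4())
        return self._holder

    def release(self, token):
        # Only the holder's token frees the lock.
        if token != self._holder:
            raise PermissionError('token does not match the lock holder')
        self._holder = None


lock = ToyWriterLock()
first = lock.acquire()          # first write-enabled checkout succeeds
try:
    lock.acquire()              # a second writer is rejected
    second_writer_allowed = True
except PermissionError:
    second_writer_allowed = False
lock.release(first)             # "closing" the checkout frees the lock
third = lock.acquire()          # a new writer can now start
```

If the holder never calls `release`, every later `acquire` fails, which is the situation the manual lock removal mentioned above is meant for.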

Lots of useful information is in the IPython __repr__

If you’re ever in doubt about the state of the object you’re working on, just call its repr, and the most relevant information will be sent to your screen!

co = repo.checkout(write=True)
Hangar WriterCheckout
    Writer       : True
    Base Branch  : master
    Num Arraysets : 0
    Num Metadata : 0

A checkout allows access to arraysets and metadata

The arraysets and metadata attributes of a checkout provide the interface to working with all of the data on disk!

co.arraysets
Hangar Arraysets
    Writeable: True
    Arrayset Names / Partial Remote References:

co.metadata
Hangar Metadata
    Writeable: True
    Number of Keys: 0

Before data can be added to a repository, an arrayset must be initialized.

We’re going to first load up the MNIST pickled dataset so it can be added to the repo!

# Load the dataset
with gzip.open('/Users/rick/projects/tensorwerk/hangar/dev/data/mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='bytes')

def rescale(array):
    array = array * 256
    rounded = np.round(array)
    return rounded.astype(np.uint8)

sample_trimg = rescale(train_set[0][0])
sample_trlabel = np.array([train_set[1][0]])
trimgs = rescale(train_set[0])
trlabels = train_set[1]
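
To try the load-and-rescale steps without the real MNIST file, here is a minimal sketch on a tiny made-up dataset. It assumes the pickled images are floats of the form k/256 (so they lie in [0, 1)); the temporary file name and the `fake_*` arrays are hypothetical and only exist for this illustration.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np


def rescale(array):
    # Map floats stored as k/256 back to integer pixel values k.
    return np.round(array * 256).astype(np.uint8)


# Tiny fake dataset in the same (images, labels) layout as the MNIST pickle.
fake_images = np.array([[0.0, 0.5, 255 / 256]], dtype=np.float32)
fake_labels = np.array([7], dtype=np.int64)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy.pkl.gz')
    # Write a gzip-compressed pickle, then read it back the same way.
    with gzip.open(path, 'wb') as f:
        pickle.dump((fake_images, fake_labels), f)
    with gzip.open(path, 'rb') as f:
        images, labels = pickle.load(f)

pixels = rescale(images)
```

Note that an input of exactly 1.0 would scale to 256, which does not fit in uint8, so this sketch relies on the stated assumption about the input range.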

Before data can be added to a repository, an arrayset must be initialized.

An “Arrayset” is a named grouping of data samples where each sample shares a number of similar attributes and array properties.

See the docstrings below or in init_arrayset()

Arraysets.init_arrayset(name: str, shape: Union[int, Tuple[int]] = None, dtype: numpy.dtype = None, prototype: numpy.ndarray = None, named_samples: bool = True, variable_shape: bool = False, *, backend_opts: Union[str, dict, None] = None) → hangar.arrayset.ArraysetDataWriter

Initializes an arrayset in the repository.

Arraysets are groups of related data pieces (samples). All samples within an arrayset have the same data type and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.

For fixed dimension sizes, all samples written to the arrayset must have the same size that was initially specified upon arrayset initialization. Variable size arraysets on the other hand, can write samples with dimensions of any size less than a maximum which is required to be set upon arrayset creation.

Parameters:

  • name (str) – The name assigned to this arrayset.
  • shape (Union[int, Tuple[int]]) – The shape of the data samples which will be written in this arrayset. This argument and the dtype argument are required if a prototype is not provided. Defaults to None.
  • dtype (numpy.dtype) – The datatype of this arrayset. This argument and the shape argument are required if a prototype is not provided. Defaults to None.
  • prototype (numpy.ndarray) – A sample array of correct datatype and shape which will be used to initialize the arrayset storage mechanisms. If this is provided, the shape and dtype arguments must not be set. Defaults to None.
  • named_samples (bool, optional) – Whether the samples in the arrayset have names associated with them. If set, all samples must be provided names; if not, no name will be assigned. Defaults to True, which means all samples should have names.
  • variable_shape (bool, optional) – Whether this is a variable sized arrayset. If True, the maximum shape is set from the provided shape or prototype argument. Any sample added to the arrayset can then have dimension sizes <= this initial specification (so long as they have the same rank as what was specified). Defaults to False.
  • backend_opts (Optional[Union[str, dict]], optional) – ADVANCED USERS ONLY, backend format code and filter opts to apply to arrayset data. If None, automatically inferred and set based on data shape and type. Defaults to None.

Returns:

  instance object of the initialized arrayset.

Return type:

  hangar.arrayset.ArraysetDataWriter

Raises:

  • PermissionError – If any enclosed arrayset is opened in a connection manager.
  • ValueError – If the provided name contains any non-ASCII letter characters, or if the string is longer than 64 characters.
  • ValueError – If required shape and dtype arguments are not provided in absence of prototype argument.
  • ValueError – If prototype argument is not a C contiguous ndarray.
  • LookupError – If an arrayset already exists with the provided name.
  • ValueError – If rank of maximum tensor shape > 31.
  • ValueError – If zero sized dimension in shape argument.
  • ValueError – If the specified backend is not valid.

co.arraysets.init_arrayset(name='mnist_training_images', prototype=trimgs[0])
Hangar ArraysetDataWriter
    Arrayset Name             : mnist_training_images
    Schema Hash              : 976ba57033bb
    Variable Shape           : False
    (max) Shape              : (784,)
    Datatype                 : <class 'numpy.uint8'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 0
    Partial Remote Data Refs : False
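
The fixed versus variable shape rule from the docstring can be summarized in a few lines. This is a sketch of the rule as described in the text, using a hypothetical helper named `sample_fits`; it is not Hangar's internal validation code.

```python
def sample_fits(sample_shape, schema_shape, variable_shape=False):
    """Hypothetical check mirroring the docstring; not Hangar's internal code."""
    # Rank (number of dimensions) must always match the schema.
    if len(sample_shape) != len(schema_shape):
        return False
    if variable_shape:
        # Each dimension may be anything up to the declared maximum.
        return all(s <= m for s, m in zip(sample_shape, schema_shape))
    # Fixed-shape arraysets require an exact match.
    return tuple(sample_shape) == tuple(schema_shape)


fixed_ok = sample_fits((784,), (784,))
fixed_bad = sample_fits((100,), (784,))
var_ok = sample_fits((100,), (784,), variable_shape=True)
var_too_big = sample_fits((800,), (784,), variable_shape=True)
rank_mismatch = sample_fits((28, 28), (784,), variable_shape=True)
```

For the arrayset created above, the schema shape is (784,) with a fixed shape, so under this rule only flattened 28 x 28 images would be accepted.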


Through arraysets attribute

When an arrayset is initialized, an arrayset accessor object will be returned; however, depending on your use case, this may or may not be the most convenient way to access an arrayset.

In general, we have implemented a full dict mapping interface on top of all objects. To access the 'mnist_training_images' arrayset you can just use a dict style access like the following (note: if operating in IPython/Jupyter, the arrayset keys will autocomplete for you).

The arrayset objects returned here contain many useful introspection methods which we will review over the rest of the tutorial.

co.arraysets['mnist_training_images']
Hangar ArraysetDataWriter
    Arrayset Name             : mnist_training_images
    Schema Hash              : 976ba57033bb
    Variable Shape           : False
    (max) Shape              : (784,)
    Datatype                 : <class 'numpy.uint8'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 0
    Partial Remote Data Refs : False

train_aset = co.arraysets['mnist_training_images']

# OR an equivalent way using the `.get()` method

train_aset = co.arraysets.get('mnist_training_images')
Hangar ArraysetDataWriter
    Arrayset Name             : mnist_training_images
    Schema Hash              : 976ba57033bb
    Variable Shape           : False
    (max) Shape              : (784,)
    Datatype                 : <class 'numpy.uint8'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 0
    Partial Remote Data Refs : False

Through the checkout object (arrayset and sample access)

In addition to the standard co.arraysets access methods, we have implemented a convenience mapping to arraysets and samples (i.e. data) for both reading and writing from the checkout object itself.

To get the same arrayset object from the checkout, simply use:

train_asets = co['mnist_training_images']
Hangar ArraysetDataWriter
    Arrayset Name             : mnist_training_images
    Schema Hash              : 976ba57033bb
    Variable Shape           : False
    (max) Shape              : (784,)
    Datatype                 : <class 'numpy.uint8'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 0
    Partial Remote Data Refs : False

Though that works as expected, most use cases will take advantage of adding and reading data from multiple arraysets / samples at a time. This is shown in the next section.

Adding Data

To add data to a named arrayset, we can use dict-style setting (refer to the __setitem__, __getitem__, and __delitem__ methods), or the add() method. Sample keys can be either str or int type.

train_aset['0'] = trimgs[0]
train_aset.add(data=trimgs[1], name='1')
train_aset[51] = trimgs[51]

Using the checkout method

co['mnist_training_images', 60] = trimgs[60]
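
To see how an index like `co['name', key]` can be routed to the right arrayset, here is a toy sketch. `ToyCheckout` is a hypothetical class that stores samples in plain dicts; it only illustrates the tuple-key idea (including the list-of-names form used later in this tutorial) and is not Hangar's implementation.

```python
class ToyCheckout:
    """Hypothetical illustration of tuple-key routing; not Hangar's implementation."""

    def __init__(self, arrayset_names):
        # One plain dict per arrayset stands in for on-disk storage.
        self._asets = {name: {} for name in arrayset_names}

    def __setitem__(self, index, value):
        names, key = index
        if isinstance(names, str):
            # co['aset', key] = sample
            self._asets[names][key] = value
        else:
            # co[['a', 'b'], key] = [sample_a, sample_b]
            for name, sample in zip(names, value):
                self._asets[name][key] = sample

    def __getitem__(self, index):
        if isinstance(index, str):
            # co['aset'] returns the whole arrayset
            return self._asets[index]
        names, key = index
        if isinstance(names, str):
            return self._asets[names][key]
        return [self._asets[name][key] for name in names]


toy = ToyCheckout(['images', 'labels'])
toy['images', 60] = [0, 255, 0]
toy[['images', 'labels'], 61] = [[255, 0, 0], 7]
```

Python passes everything inside the square brackets as one tuple, which is why a single `__setitem__` can handle both forms.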

How many samples are in the arrayset?

len(train_aset)


Containment Testing

'hi' in train_aset
'0' in train_aset
60 in train_aset

Dictionary Style Retrieval for known keys

out1 = train_aset['0']
# OR
out2 = co['mnist_training_images', '0']

print(np.allclose(out1, out2))

plt.imshow(out1.reshape(28, 28))
<matplotlib.image.AxesImage at 0x38476ea90>

Dict style iteration supported out of the box

# iterate normally over keys

for k in train_aset:
    # equivalent method: for k in train_aset.keys():
    print(k)

# iterate over items (plot results)

fig, axs = plt.subplots(nrows=1, ncols=4, figsize=(10, 10))

for idx, v in enumerate(train_aset.values()):
    axs[idx].imshow(v.reshape(28, 28))

# iterate over items, store k, v in dict

myDict = {}
for k, v in train_aset.items():
    myDict[k] = v
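
If you are curious how a dict-style interface like this can be built, the standard library's `collections.abc.MutableMapping` provides `keys()`, `values()`, `items()`, `get()` and containment testing once five core methods are defined. The class below is a hypothetical in-memory stand-in written for illustration, not the actual arrayset class.

```python
from collections.abc import MutableMapping


class ToyArrayset(MutableMapping):
    """Hypothetical in-memory stand-in for an arrayset; keys may be str or int."""

    def __init__(self):
        self._samples = {}

    def __getitem__(self, key):
        return self._samples[key]

    def __setitem__(self, key, value):
        if not isinstance(key, (str, int)):
            raise TypeError('sample keys must be str or int')
        self._samples[key] = value

    def __delitem__(self, key):
        del self._samples[key]

    def __iter__(self):
        return iter(self._samples)

    def __len__(self):
        return len(self._samples)


aset = ToyArrayset()
aset['0'] = [1, 2]
aset[51] = [3, 4]
collected = {k: v for k, v in aset.items()}
```

Note that the str key '0' and the int key 0 are different keys in this sketch, just as they are in an ordinary dict.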


Once you’ve completed an interactive exploration, be sure to use the context manager form of the add() and get() methods!

In order to make sure that all your data is always safe in Hangar, the backend diligently ensures that all contexts (operations which can somehow interact with the record structures) are opened and closed appropriately. When you use the context manager form of an arrayset object, we can offload a significant amount of work to the Python runtime, and dramatically increase read and write speeds.

Most arraysets we’ve tested see throughput increase by 250% - 500% for writes and 300% - 600% for reads when using the context manager form instead of the naked form!
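
One way to picture why the context manager form helps: outside a `with` block every call has to open and close the underlying storage on its own, while inside a block that setup cost is paid once. The toy class below only counts "opens" to make the point; `ToyBackend` is hypothetical and does not reflect Hangar's real backend code.

```python
class ToyBackend:
    """Hypothetical backend that counts how often its storage is opened."""

    def __init__(self):
        self.open_count = 0
        self._depth = 0      # > 0 while inside a `with` block
        self.samples = {}

    def _open(self):
        self.open_count += 1

    def __enter__(self):
        if self._depth == 0:
            self._open()     # open once for the whole block
        self._depth += 1
        return self

    def __exit__(self, *exc):
        self._depth -= 1
        return False         # never swallow exceptions

    def add(self, name, data):
        if self._depth == 0:
            # "Naked" call: open (and implicitly close) around every write.
            self._open()
        self.samples[name] = data


naked = ToyBackend()
for i in range(1000):
    naked.add(i, i)

managed = ToyBackend()
with managed:
    for i in range(1000):
        managed.add(i, i)
```

Both objects end up holding the same samples, but the naked form performed 1000 opens where the managed form performed one.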

import time

# ----------------- Non Context Manager Form ----------------------

co = repo.checkout(write=True)
aset_trimgs = co.arraysets.init_arrayset(name='train_images', prototype=sample_trimg)
aset_trlabels = co.arraysets.init_arrayset(name='train_labels', prototype=sample_trlabel)

print(f'beginning non-context manager form')
start_time = time.time()

for idx, img in enumerate(trimgs):
    aset_trimgs.add(data=img, name=idx)
    aset_trlabels.add(data=np.array([trlabels[idx]]), name=str(idx))

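print(f'Finished non-context manager form in: {time.time() - start_time} seconds')

# discard the staged samples and release the writer lock before the next run
co.reset_staging_area()
co.close()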


# ----------------- Context Manager Form --------------------------

co = repo.checkout(write=True)
aset_trimgs = co.arraysets.init_arrayset(name='train_images', prototype=sample_trimg)
aset_trlabels = co.arraysets.init_arrayset(name='train_labels', prototype=sample_trlabel)

print(f'\n beginning context manager form')
start_time = time.time()

with aset_trimgs, aset_trlabels:
    for idx, img in enumerate(trimgs):
        aset_trimgs.add(data=img, name=str(idx))
        aset_trlabels.add(data=np.array([trlabels[idx]]), name=str(idx))

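print(f'Finished context manager form in: {time.time() - start_time} seconds')

# discard the staged samples and release the writer lock before the next run
co.reset_staging_area()
co.close()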


# -------------- Context Manager With Checkout Access -------------

co = repo.checkout(write=True)
co.arraysets.init_arrayset(name='train_images', prototype=sample_trimg)
co.arraysets.init_arrayset(name='train_labels', prototype=sample_trlabel)

print(f'\n beginning context manager form with checkout access')
start_time = time.time()

with co:
    for idx, img in enumerate(trimgs):
        co[['train_images', 'train_labels'], idx] = [img, np.array([trlabels[idx]])]

print(f'Finished context manager with checkout form in: {time.time() - start_time} seconds')

beginning non-context manager form
Finished non-context manager form in: 107.4064199924469 seconds
Hard reset requested with writer_lock: 5b66da6a-51c5-4beb-9b34-964c600957c2

 beginning context manager form
Finished context manager form in: 20.784971952438354 seconds
Hard reset requested with writer_lock: 6bd0f286-e78f-4777-939b-ab9a60c6518e

 beginning context manager form with checkout access
Finished context manager with checkout form in: 20.909255981445312 seconds
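
To put the timings printed above in perspective, the speedup is just the ratio of the naked time to the context manager time. The variable names here are made up for the calculation; the numbers are copied from the output of this run and will differ on your machine.

```python
# Timings copied from the run above (seconds).
naked_s = 107.4064199924469
context_s = 20.784971952438354
checkout_s = 20.909255981445312

# Speedup = time without context manager / time with context manager.
speedup_context = naked_s / context_s
speedup_checkout = naked_s / checkout_s

print(f'arrayset context manager: {speedup_context:.2f}x faster')
print(f'checkout context manager: {speedup_checkout:.2f}x faster')
```

Both context manager forms come out at roughly five times faster than the naked form for this write workload.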

Clearly, the context manager form is far and away superior; however, we feel that for the purposes of interactive use the “Naked” form is valuable to the average user!

Committing Changes

Once you have made a set of changes you want to commit, simply call the commit() method (and pass in a message)!

co.commit('hello world, this is my first hangar commit')

The returned value ('e11d061dc457b361842801e24cbd119a745089d6') is the commit hash of this commit. It may be useful to assign this to a variable and follow this up by creating a branch from this commit!

Don’t Forget to Close the Write-Enabled Checkout to Release the Lock!

We mentioned in Checking out the repo for writing that when a write-enabled checkout is created, it places a lock on writers until it is closed. If for whatever reason the program terminates via a non python SIGKILL or fatal interpreter error without closing the write-enabled checkout, this lock will persist (forever technically, but realistically until it is manually freed).

Luckily, preventing this issue from occurring is as simple as calling close()!

If you forget, normal interpreter shutdown should trigger an atexit hook automatically; however, this behavior should not be relied upon. It is better to just call close().

co.close()
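
In scripts, a simple way to make sure close() always runs is a try/finally block. The sketch below uses a hypothetical `ToyWriterCheckout` stand-in so it can run without a repository; with a real checkout the same pattern would wrap your write operations.

```python
class ToyWriterCheckout:
    """Hypothetical stand-in for a write-enabled checkout."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def risky_write(checkout):
    # Simulate a failure in the middle of a write session.
    raise RuntimeError('something went wrong while writing')


toy_co = ToyWriterCheckout()
try:
    try:
        risky_write(toy_co)
    finally:
        toy_co.close()       # runs even though risky_write raised
except RuntimeError as err:
    error_message = str(err)
```

A finally block covers ordinary Python exceptions, but it cannot run if the process is killed with SIGKILL, which is the case described above where the lock is left behind.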


But if you did forget, and you receive a PermissionError the next time you open a checkout:

PermissionError: Cannot acquire the writer lock. Only one instance of
a writer checkout can be active at a time. If the last checkout of this
repository did not properly close, or a crash occured, the lock must be
manually freed before another writer can be instantiated.

You can manually free the lock with the following method. However!

This is a dangerous operation, and it’s one of the only ways a user can put data in their repository at risk! If another Python process is still holding the lock, do NOT force the release. Kill the process first (that’s totally fine to do at any time), then force the lock release.

repo.force_release_writer_lock()


Inspecting state from the top!

After your first commit, the summary and log methods will begin to work, and you can either print the stream to the console (as shown below), or you can dig deep into the internals of how hangar thinks about your data! (To be covered in an advanced tutorial later on).

The point is, regardless of your level of interaction with a live hangar repository, every level of state is accessible from the top, and in general the repository has been built so that this is the only way to directly access it!

repo.summary()
Summary of Contents Contained in Data Repository

| Repository Info
|  Base Directory: /Users/rick/projects/tensorwerk/hangar/dev/mnist
|  Disk Usage: 67.29 MB

| Commit Details
|  Commit: e11d061dc457b361842801e24cbd119a745089d6
|  Created: Thu Sep  5 23:32:46 2019
|  By: Rick Izzo
|  Email:
|  Message: hello world, this is my first hangar commit

| DataSets
|  Number of Named Arraysets: 3
|  * Arrayset Name: mnist_training_images
|    Num Arrays: 4
|    Details:
|    - schema_hash: 976ba57033bb
|    - schema_dtype: 2
|    - schema_is_var: False
|    - schema_max_shape: (784,)
|    - schema_is_named: True
|    - schema_default_backend: 00
|  * Arrayset Name: train_images
|    Num Arrays: 50000
|    Details:
|    - schema_hash: 976ba57033bb
|    - schema_dtype: 2
|    - schema_is_var: False
|    - schema_max_shape: (784,)
|    - schema_is_named: True
|    - schema_default_backend: 00
|  * Arrayset Name: train_labels
|    Num Arrays: 50000
|    Details:
|    - schema_hash: 631f0f57c469
|    - schema_dtype: 7
|    - schema_is_var: False
|    - schema_max_shape: (1,)
|    - schema_is_named: True
|    - schema_default_backend: 10

| Metadata:
|  Number of Keys: 0

repo.log()
* e11d061dc457b361842801e24cbd119a745089d6 (master) : hello world, this is my first hangar commit
* 7293dded698c41f32434e670841d15d96c1c6f8b : ya