Quick Start Tutorial

A step-by-step guide to Hangar basics: initializing a repository, adding data to it, and committing the changes.

Installation

You can install Hangar via pip:

$ pip install hangar

or via conda:

$ conda install -c conda-forge hangar

Please refer to the Installation page for more information.
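
To confirm that the installation worked, you can ask Python for the installed package version. The snippet below is a minimal sketch that uses only the standard library (Python 3.8+); the helper name installed_version is ours, not part of Hangar.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version("hangar"))
```

If this prints None, the package is not visible to the interpreter you are running, which usually means pip or conda installed it into a different environment.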

Quick Start for the Impatient

The only Hangar import you need:

[1]:
from hangar import Repository

Create and initialize a new Hangar Repository at the given path:

[2]:
!mkdir -p /Volumes/Archivio/tensorwerk/hangar/quick-start

repo = Repository(path="/Volumes/Archivio/tensorwerk/hangar/quick-start")

repo.init(
    user_name="Alessia Marcolini", user_email="alessia@tensorwerk.com", remove_old=True
)
Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar
//anaconda/envs/hangar-tutorial/lib/python3.8/site-packages/hangar/context.py:92: UserWarning: No repository exists at /Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar, please use `repo.init()` method
  warnings.warn(msg, UserWarning)
[2]:
'/Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar'

Check out the Repository in write mode:

[3]:
master_checkout = repo.checkout(write=True)
master_checkout
[3]:
Hangar WriterCheckout
    Writer       : True
    Base Branch  : master
    Num Columns  : 0

Inspect the columns we have (we just started, none so far):

[4]:
master_checkout.columns
[4]:
Hangar Columns
    Writeable         : True
    Number of Columns : 0
    Column Names / Partial Remote References:
      -

Prepare some random data to play with:

[5]:
import numpy as np

dummy = np.random.rand(3,2)
dummy
[5]:
array([[0.17961852, 0.31945355],
       [0.10929027, 0.2681622 ],
       [0.29397449, 0.02659856]])

Create a new column named dummy_column:

[6]:
dummy_col = master_checkout.add_ndarray_column(name="dummy_column", prototype=dummy)
dummy_col
[6]:
Hangar FlatSampleWriter
    Column Name              : dummy_column
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : float64
    Shape                    : (3, 2)
    Number of Samples        : 0
    Partial Remote Data Refs : False

Add data to dummy_column, treating it as a normal Python dictionary:

[7]:
dummy_col[0] = dummy
[8]:
dummy_col[1] = np.random.rand(3,2)

Commit your changes, providing a message:

[9]:
master_checkout.commit("Add dummy_column with 2 samples")
[9]:
'a=c104ef7e2cfe87318e78addd6033028488050cea'

Add more data and commit again:

[10]:
dummy_col[2] = np.random.rand(3,2)
dummy_col
[10]:
Hangar FlatSampleWriter
    Column Name              : dummy_column
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : float64
    Shape                    : (3, 2)
    Number of Samples        : 3
    Partial Remote Data Refs : False

[11]:
master_checkout.commit("Add one more sample to dummy_column")
[11]:
'a=099557d48edebb7607fa3ec648eafa2a1af5e652'

See the master branch history:

[12]:
master_checkout.log()
* a=099557d48edebb7607fa3ec648eafa2a1af5e652 (master) : Add one more sample to dummy_column
* a=c104ef7e2cfe87318e78addd6033028488050cea : Add dummy_column with 2 samples

Close the write-enabled checkout:

[13]:
master_checkout.close()

Inspect the status of the Repository:

[14]:
repo.summary()
Summary of Contents Contained in Data Repository

==================
| Repository Info
|-----------------
|  Base Directory: /Volumes/Archivio/tensorwerk/hangar/quick-start
|  Disk Usage: 237.53 kB

===================
| Commit Details
-------------------
|  Commit: a=099557d48edebb7607fa3ec648eafa2a1af5e652
|  Created: Mon May  4 13:00:43 2020
|  By: Alessia Marcolini
|  Email: alessia@tensorwerk.com
|  Message: Add one more sample to dummy_column

==================
| DataSets
|-----------------
|  Number of Named Columns: 1
|
|  * Column Name: ColumnSchemaKey(column="dummy_column", layout="flat")
|    Num Data Pieces: 3
|    Details:
|    - column_layout: flat
|    - column_type: ndarray
|    - schema_hasher_tcode: 1
|    - data_hasher_tcode: 0
|    - schema_type: fixed_shape
|    - shape: (3, 2)
|    - dtype: float64
|    - backend: 01
|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}

Quick Start - with explanations

1. Create and initialize a “Repository”

Central to Hangar is the concept of a Repository.

A Repository consists of a historically ordered mapping of Commits over time by various Committers across any number of Branches. Though there are many conceptual similarities in what a Git repo and a Hangar repository achieve, Hangar is designed with the express purpose of dealing with numeric data.

To start using Hangar programmatically, simply begin with this import statement:

[1]:
from hangar import Repository

Create the folder where you want to store the Repository:

[2]:
!mkdir -p /Volumes/Archivio/tensorwerk/hangar/quick-start
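
The ! shell escape above only works inside IPython or Jupyter. In a plain Python script, the same step can be sketched with pathlib. The location below is a throwaway temporary directory chosen for illustration; substitute the folder you actually want to use.

```python
import tempfile
from pathlib import Path

# Hypothetical location for this example; replace with your own folder.
base_dir = Path(tempfile.mkdtemp())
repo_dir = base_dir / "hangar" / "quick-start"

# parents=True creates missing parent folders; exist_ok=True makes re-runs safe.
repo_dir.mkdir(parents=True, exist_ok=True)

print(repo_dir.is_dir())
```

You can then pass str(repo_dir) as the path argument when creating the Repository object.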

Create the Repository object by specifying where your repository should live.

Note

If the path you pass to Repository does not contain an initialized Hangar repo, Python emits a UserWarning reminding you to initialize the repo before you start working with it.

[3]:
repo = Repository(path="/Volumes/Archivio/tensorwerk/hangar/quick-start")
//anaconda/envs/hangar-tutorial/lib/python3.8/site-packages/hangar/context.py:92: UserWarning: No repository exists at /Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar, please use `repo.init()` method
  warnings.warn(msg, UserWarning)

Initialize the Repository, providing your name and your email.

Warning

Setting the remove_old parameter to True deletes any existing Hangar repository at the given path before reinitializing it.

[4]:
repo.init(
    user_name="Alessia Marcolini", user_email="alessia@tensorwerk.com", remove_old=True
)
Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar
[4]:
'/Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar'

2. Open the Staging Area for Writing

To start interacting with Hangar, first you need to check out the Repository you want to work on.

A repo can be checked out in two modes: write-enabled or read-only.

We need to check out the repo in write mode in order to initialize the columns and write into them.

[5]:
master_checkout = repo.checkout(write=True)
master_checkout
[5]:
Hangar WriterCheckout
    Writer       : True
    Base Branch  : master
    Num Columns  : 0

A checkout allows access to columns. The columns attribute of a checkout provides the interface for working with all of the data on disk!

[6]:
master_checkout.columns
[6]:
Hangar Columns
    Writeable         : True
    Number of Columns : 0
    Column Names / Partial Remote References:
      -

3. Create some random data to play with

Let’s create a random array to be used as a dummy example:

[7]:
import numpy as np

dummy = np.random.rand(3,2)
dummy
[7]:
array([[0.54631485, 0.26578857],
       [0.74990074, 0.41764666],
       [0.75884524, 0.05547267]])

4. Initialize a column

With a write-enabled checkout, we can now initialize a new column of the repository using the method add_ndarray_column().

All samples within a column have the same data type and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.

You will need to provide a column name and a prototype array, so Hangar can infer the shape and data type of the samples the column will hold. dummy_col will become a column accessor object.

[8]:
dummy_col = master_checkout.add_ndarray_column(name="dummy_column", prototype=dummy)
dummy_col
[8]:
Hangar FlatSampleWriter
    Column Name              : dummy_column
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : float64
    Shape                    : (3, 2)
    Number of Samples        : 0
    Partial Remote Data Refs : False

Verify we successfully added the new column:

[9]:
master_checkout.columns
[9]:
Hangar Columns
    Writeable         : True
    Number of Columns : 1
    Column Names / Partial Remote References:
      - dummy_column / False
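
The schema rules described in this section (same number of dimensions for every sample, with dimension sizes either fixed or variable) can be illustrated with a toy check. This is a sketch for intuition only, not Hangar's implementation; the function name shape_fits is hypothetical, and the assumption that a variable-shape column treats the schema shape as a per-dimension upper bound should be checked against the API documentation.

```python
from typing import Tuple


def shape_fits(
    sample_shape: Tuple[int, ...],
    schema_shape: Tuple[int, ...],
    variable: bool = False,
) -> bool:
    """Toy model of a column shape check; not Hangar's implementation."""
    # The number of dimensions must always match the schema.
    if len(sample_shape) != len(schema_shape):
        return False
    if variable:
        # Assumption: the schema shape acts as an upper bound per dimension.
        return all(s <= m for s, m in zip(sample_shape, schema_shape))
    # Fixed-shape columns require an exact match.
    return sample_shape == schema_shape


print(shape_fits((3, 2), (3, 2)))                    # True
print(shape_fits((2, 2), (3, 2)))                    # False
print(shape_fits((2, 2), (3, 2), variable=True))     # True
print(shape_fits((3, 2, 1), (3, 2), variable=True))  # False
```

In the tutorial, dummy_column is a fixed_shape column with shape (3, 2), so only the first case corresponds to a sample it would accept.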

5. Add data

To add data to a named column, we can use dict-style assignment as follows. Sample keys can be either str or int type.

[10]:
dummy_col[0] = dummy

As we can see, Number of Samples is equal to 1 now!

[11]:
dummy_col
[11]:
Hangar FlatSampleWriter
    Column Name              : dummy_column
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : float64
    Shape                    : (3, 2)
    Number of Samples        : 1
    Partial Remote Data Refs : False

[12]:
dummy_col[1] = np.random.rand(3,2)
[13]:
dummy_col
[13]:
Hangar FlatSampleWriter
    Column Name              : dummy_column
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : float64
    Shape                    : (3, 2)
    Number of Samples        : 2
    Partial Remote Data Refs : False

[14]:
dummy_col[1]
[14]:
array([[0.17590758, 0.26950355],
       [0.88036219, 0.7839301 ],
       [0.87321484, 0.04316646]])

You can also iterate over your column, as you would do with a regular Python dictionary:

[15]:
for key, value in dummy_col.items():
    print('Key:', key)
    print('Value:', value)
    print()
Key: 0
Value: [[0.54631485 0.26578857]
 [0.74990074 0.41764666]
 [0.75884524 0.05547267]]

Key: 1
Value: [[0.17590758 0.26950355]
 [0.88036219 0.7839301 ]
 [0.87321484 0.04316646]]

How many samples are in the column?

[16]:
len(dummy_col)
[16]:
2

Does the column contain that key?

[17]:
0 in dummy_col
[17]:
True
[18]:
5 in dummy_col
[18]:
False
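
To see why a column accessor feels like a dictionary, it can help to look at which methods back each operation used above. The class below is an in-memory stand-in written for this tutorial (ToyColumn is a hypothetical name); it is not how Hangar stores data, and it only mirrors the key rule stated in this section.

```python
from collections.abc import MutableMapping


class ToyColumn(MutableMapping):
    """In-memory stand-in for a column accessor; not Hangar's implementation."""

    def __init__(self):
        self._samples = {}

    def __setitem__(self, key, value):
        # Mirror the rule stated above: sample keys are str or int.
        if not isinstance(key, (str, int)):
            raise TypeError(f"key must be str or int, got {type(key).__name__}")
        self._samples[key] = value

    def __getitem__(self, key):
        return self._samples[key]

    def __delitem__(self, key):
        del self._samples[key]

    def __iter__(self):
        return iter(self._samples)

    def __len__(self):
        return len(self._samples)


col = ToyColumn()
col[0] = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
col["second"] = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

print(len(col))          # 2
print(0 in col)          # True
print(5 in col)          # False
print(list(col.keys()))  # [0, 'second']
```

Item assignment, len(), the in operator, and items() all come from this small set of methods, which is the same surface the tutorial exercises on dummy_col.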

6. Commit changes

Once you have made a set of changes you want to commit, call the commit() method and pass in a message!

[19]:
master_checkout.commit("Add dummy_column with 2 samples")
[19]:
'a=4f42fce2b66476271f149e3cd2eb4c6ba66daeee'

Let’s add another sample in the column:

[20]:
dummy_col[2] = np.random.rand(3,2)
dummy_col
[20]:
Hangar FlatSampleWriter
    Column Name              : dummy_column
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : float64
    Shape                    : (3, 2)
    Number of Samples        : 3
    Partial Remote Data Refs : False

Number of Samples is equal to 3 now, and we want to keep track of the change with another commit:

[21]:
master_checkout.commit("Add one more sample to dummy_column")
[21]:
'a=753e28e27d4b23a0dca0633f90b4513538a98c40'

To view the history of your commits:

[22]:
master_checkout.log()
* a=753e28e27d4b23a0dca0633f90b4513538a98c40 (master) : Add one more sample to dummy_column
* a=4f42fce2b66476271f149e3cd2eb4c6ba66daeee : Add dummy_column with 2 samples

Do not forget to close the write-enabled checkout!

[23]:
master_checkout.close()
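
One way to make sure close() is never skipped, even when something fails halfway through, is to wrap the write in try/finally. The helper below is a sketch under assumptions: add_samples_and_commit is a hypothetical name, repo is expected to be an initialized Repository, and looking up an existing column with checkout.columns[column_name] is assumed rather than shown earlier in this tutorial.

```python
def add_samples_and_commit(repo, column_name, samples, message):
    """Write samples to an existing column, commit, and always close the checkout.

    `samples` is a dict mapping sample keys (str or int) to arrays.
    Returns whatever commit() returns (the commit digest in this tutorial).
    """
    checkout = repo.checkout(write=True)
    try:
        # Assumption: existing columns can be looked up by name.
        column = checkout.columns[column_name]
        for key, value in samples.items():
            column[key] = value
        return checkout.commit(message)
    finally:
        # Runs on success and on error, so the write-enabled checkout is released.
        checkout.close()
```

With the repository from this tutorial, a call might look like add_samples_and_commit(repo, "dummy_column", {3: np.random.rand(3, 2)}, "Add sample 3").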

Check the state of the repository and get useful information about disk usage, the columns you have and the last commit:

[24]:
repo.summary()
Summary of Contents Contained in Data Repository

==================
| Repository Info
|-----------------
|  Base Directory: /Volumes/Archivio/tensorwerk/hangar/quick-start
|  Disk Usage: 237.53 kB

===================
| Commit Details
-------------------
|  Commit: a=753e28e27d4b23a0dca0633f90b4513538a98c40
|  Created: Tue Apr 21 21:50:15 2020
|  By: Alessia Marcolini
|  Email: alessia@tensorwerk.com
|  Message: Add one more sample to dummy_column

==================
| DataSets
|-----------------
|  Number of Named Columns: 1
|
|  * Column Name: ColumnSchemaKey(column="dummy_column", layout="flat")
|    Num Data Pieces: 3
|    Details:
|    - column_layout: flat
|    - column_type: ndarray
|    - schema_hasher_tcode: 1
|    - data_hasher_tcode: 0
|    - schema_type: fixed_shape
|    - shape: (3, 2)
|    - dtype: float64
|    - backend: 01
|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}