Quick Start Tutorial¶
A step-by-step guide to Hangar basics: initializing a repository, adding data to it, and committing the changes.
Installation¶
You can install Hangar via pip:
$ pip install hangar
or via conda:
$ conda install -c conda-forge hangar
Please refer to the Installation page for more information.
Quick Start for the Impatient¶
The only import statement you’ll ever need:
[1]:
from hangar import Repository
Create and initialize a new Hangar Repository at the given path:
[2]:
!mkdir /Volumes/Archivio/tensorwerk/hangar/quick-start
repo = Repository(path="/Volumes/Archivio/tensorwerk/hangar/quick-start")
repo.init(
    user_name="Alessia Marcolini", user_email="alessia@tensorwerk.com", remove_old=True
)
Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar
//anaconda/envs/hangar-tutorial/lib/python3.8/site-packages/hangar/context.py:92: UserWarning: No repository exists at /Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar, please use `repo.init()` method
warnings.warn(msg, UserWarning)
[2]:
'/Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar'
Check out the Repository in write mode:
[3]:
master_checkout = repo.checkout(write=True)
master_checkout
[3]:
Hangar WriterCheckout
Writer : True
Base Branch : master
Num Columns : 0
Inspect the columns we have (we just started, so there are none yet):
[4]:
master_checkout.columns
[4]:
Hangar Columns
Writeable : True
Number of Columns : 0
Column Names / Partial Remote References:
-
Prepare some random data to play with:
[5]:
import numpy as np
dummy = np.random.rand(3,2)
dummy
[5]:
array([[0.17961852, 0.31945355],
[0.10929027, 0.2681622 ],
[0.29397449, 0.02659856]])
Create a new column named dummy_column:
[6]:
dummy_col = master_checkout.add_ndarray_column(name="dummy_column", prototype=dummy)
dummy_col
[6]:
Hangar FlatSampleWriter
Column Name : dummy_column
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : float64
Shape : (3, 2)
Number of Samples : 0
Partial Remote Data Refs : False
Add data to dummy_column, treating it as a normal Python dictionary:
[7]:
dummy_col[0] = dummy
[8]:
dummy_col[1] = np.random.rand(3,2)
Commit your changes providing a message:
[9]:
master_checkout.commit("Add dummy_column with 2 samples")
[9]:
'a=c104ef7e2cfe87318e78addd6033028488050cea'
Add more data and commit again:
[10]:
dummy_col[2] = np.random.rand(3,2)
dummy_col
[10]:
Hangar FlatSampleWriter
Column Name : dummy_column
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : float64
Shape : (3, 2)
Number of Samples : 3
Partial Remote Data Refs : False
[11]:
master_checkout.commit("Add one more sample to dummy_column")
[11]:
'a=099557d48edebb7607fa3ec648eafa2a1af5e652'
See the master branch history:
[12]:
master_checkout.log()
* a=099557d48edebb7607fa3ec648eafa2a1af5e652 (master) : Add one more sample to dummy_column
* a=c104ef7e2cfe87318e78addd6033028488050cea : Add dummy_column with 2 samples
Close the write-enabled checkout:
[13]:
master_checkout.close()
Inspect the status of the Repository:
[14]:
repo.summary()
Summary of Contents Contained in Data Repository
==================
| Repository Info
|-----------------
| Base Directory: /Volumes/Archivio/tensorwerk/hangar/quick-start
| Disk Usage: 237.53 kB
===================
| Commit Details
-------------------
| Commit: a=099557d48edebb7607fa3ec648eafa2a1af5e652
| Created: Mon May 4 13:00:43 2020
| By: Alessia Marcolini
| Email: alessia@tensorwerk.com
| Message: Add one more sample to dummy_column
==================
| DataSets
|-----------------
| Number of Named Columns: 1
|
| * Column Name: ColumnSchemaKey(column="dummy_column", layout="flat")
| Num Data Pieces: 3
| Details:
| - column_layout: flat
| - column_type: ndarray
| - schema_hasher_tcode: 1
| - data_hasher_tcode: 0
| - schema_type: fixed_shape
| - shape: (3, 2)
| - dtype: float64
| - backend: 01
| - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}
Quick Start - with explanations¶
1. Create and initialize a “Repository”¶
Central to Hangar is the concept of a Repository.
A Repository consists of a historically ordered mapping of Commits over time by various Committers across any number of Branches. Though there are many conceptual similarities in what a Git repo and a Hangar repository achieve, Hangar is designed with the express purpose of dealing with numeric data.
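To make the idea of a historically ordered mapping of commits more concrete, here is a minimal pure-Python sketch of a commit log. It is a conceptual model only, not Hangar's storage format or API: the ToyHistory class, its methods, and the hashing scheme are hypothetical names chosen for illustration.

```python
import hashlib


class ToyHistory:
    """Conceptual model of commits and branches (NOT Hangar's implementation)."""

    def __init__(self):
        self.commits = {}                 # digest -> (parent digest, committer, message)
        self.branches = {"master": None}  # branch name -> digest of its latest commit

    def commit(self, branch, committer, message):
        """Record a new commit on `branch` and return its digest."""
        parent = self.branches[branch]
        payload = f"{parent}|{committer}|{message}".encode()
        digest = hashlib.sha1(payload).hexdigest()
        self.commits[digest] = (parent, committer, message)
        self.branches[branch] = digest    # the branch now points at the new commit
        return digest

    def log(self, branch):
        """Return commit messages from newest to oldest by following parents."""
        messages = []
        digest = self.branches[branch]
        while digest is not None:
            parent, _committer, message = self.commits[digest]
            messages.append(message)
            digest = parent
        return messages


history = ToyHistory()
first = history.commit("master", "Alessia Marcolini", "Add dummy_column with 2 samples")
second = history.commit("master", "Alessia Marcolini", "Add one more sample to dummy_column")
print(history.log("master"))
```

Each commit remembers its parent, so walking back from a branch pointer reproduces the newest-first ordering that the log() output shows later in this tutorial.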
To start using Hangar programmatically, simply begin with this import statement:
[1]:
from hangar import Repository
Create the folder where you want to store the Repository:
[2]:
!mkdir /Volumes/Archivio/tensorwerk/hangar/quick-start
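The !mkdir shell escape only works inside a notebook and fails if the folder already exists or a parent folder is missing. As a hedged alternative, the same step can be sketched in plain Python with pathlib. The temporary base directory below is an assumption made so the example does not touch real data; replace it with the location where your repository should live.

```python
from pathlib import Path
import tempfile

# A temporary base directory keeps this example away from real data.
base = Path(tempfile.mkdtemp())
repo_dir = base / "hangar" / "quick-start"

# parents=True creates missing intermediate directories;
# exist_ok=True makes the call safe to repeat.
repo_dir.mkdir(parents=True, exist_ok=True)
repo_dir.mkdir(parents=True, exist_ok=True)  # second call does nothing

print(repo_dir.is_dir())  # True
```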
Instantiate the Repository object by specifying where your repository should live.
Note
If you pass the Repository a path that does not contain an initialized Hangar repo, Python emits a warning saying that you need to initialize the repo before you can start working with it.
[3]:
repo = Repository(path="/Volumes/Archivio/tensorwerk/hangar/quick-start")
//anaconda/envs/hangar-tutorial/lib/python3.8/site-packages/hangar/context.py:92: UserWarning: No repository exists at /Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar, please use `repo.init()` method
warnings.warn(msg, UserWarning)
Initialize the Repository, providing your name and your email.
Warning
Setting the remove_old parameter to True removes any existing Hangar repository at the given path and reinitializes it.
[4]:
repo.init(
    user_name="Alessia Marcolini", user_email="alessia@tensorwerk.com", remove_old=True
)
Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar
[4]:
'/Volumes/Archivio/tensorwerk/hangar/quick-start/.hangar'
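Because remove_old=True discards whatever repository already lives at the path, it can be worth checking for one first. The sketch below relies only on the fact, visible in the output above, that a repository is stored in a .hangar subdirectory. has_hangar_repo is a hypothetical helper, not part of Hangar.

```python
from pathlib import Path
import tempfile


def has_hangar_repo(path):
    """Return True if `path` contains a `.hangar` directory.

    Hypothetical helper: it only checks that the directory exists and
    does not validate its contents.
    """
    return (Path(path) / ".hangar").is_dir()


# Demonstration on a throwaway directory rather than a real repository.
with tempfile.TemporaryDirectory() as tmp:
    before = has_hangar_repo(tmp)
    (Path(tmp) / ".hangar").mkdir()  # stand-in for the directory repo.init() creates
    after = has_hangar_repo(tmp)

print(before, after)  # False True
```

A caller could then pass remove_old=True only when the check returns False, or after confirming that the existing data may be deleted.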
2. Open the Staging Area for Writing¶
To start interacting with Hangar, you first need to check out the Repository you want to work on.
A repo can be checked out in two modes: write-enabled or read-only.
We need to check out the repo in write mode in order to initialize the columns and write into them.
[5]:
master_checkout = repo.checkout(write=True)
master_checkout
[5]:
Hangar WriterCheckout
Writer : True
Base Branch : master
Num Columns : 0
A checkout allows access to columns. The columns attribute of a checkout provides the interface for working with all of the data on disk!
[6]:
master_checkout.columns
[6]:
Hangar Columns
Writeable : True
Number of Columns : 0
Column Names / Partial Remote References:
-
3. Create some random data to play with¶
Let’s create a random array to be used as a dummy example:
[7]:
import numpy as np
dummy = np.random.rand(3,2)
dummy
[7]:
array([[0.54631485, 0.26578857],
[0.74990074, 0.41764666],
[0.75884524, 0.05547267]])
4. Initialize a column¶
With a write-enabled checkout, we can now initialize a new column of the repository using the method add_ndarray_column().
All samples within a column have the same data type and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.
You will need to provide a column name and a prototype, so Hangar can infer the data type and shape of the samples the column will hold. dummy_col will become a column accessor object.
[8]:
dummy_col = master_checkout.add_ndarray_column(name="dummy_column", prototype=dummy)
dummy_col
[8]:
Hangar FlatSampleWriter
Column Name : dummy_column
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : float64
Shape : (3, 2)
Number of Samples : 0
Partial Remote Data Refs : False
Verify we successfully added the new column:
[9]:
master_checkout.columns
[9]:
Hangar Columns
Writeable : True
Number of Columns : 1
Column Names / Partial Remote References:
- dummy_column / False
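The difference between fixed and variable dimension sizes can be sketched without Hangar at all. The function below is a hypothetical illustration of the rule described in this section, not Hangar's validation code. It assumes that in the variable case the schema shape acts as an upper bound on each dimension, which is our reading of the Hangar documentation; check the add_ndarray_column() reference before relying on it.

```python
def shape_is_compatible(sample_shape, schema_shape, variable_shape=False):
    """Check a sample's shape against a column schema (illustrative only)."""
    # The number of dimensions must always match.
    if len(sample_shape) != len(schema_shape):
        return False
    if variable_shape:
        # Assumption: each dimension may be at most the schema's size.
        return all(s <= m for s, m in zip(sample_shape, schema_shape))
    # Fixed shape: every dimension must match exactly.
    return tuple(sample_shape) == tuple(schema_shape)


schema = (3, 2)  # shape inferred from the `dummy` prototype

print(shape_is_compatible((3, 2), schema))                          # True
print(shape_is_compatible((2, 2), schema))                          # False
print(shape_is_compatible((2, 2), schema, variable_shape=True))     # True
print(shape_is_compatible((3, 2, 1), schema, variable_shape=True))  # False
```

With the fixed-shape column created above, only (3, 2) arrays would pass this check, which matches the Schema Type and Shape fields in the column summary.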
5. Add data¶
To add data to a named column, we can use dict-style assignment as follows. Sample keys can be either str or int type.
[10]:
dummy_col[0] = dummy
As we can see, Number of Samples is equal to 1 now!
[11]:
dummy_col
[11]:
Hangar FlatSampleWriter
Column Name : dummy_column
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : float64
Shape : (3, 2)
Number of Samples : 1
Partial Remote Data Refs : False
[12]:
dummy_col[1] = np.random.rand(3,2)
[13]:
dummy_col
[13]:
Hangar FlatSampleWriter
Column Name : dummy_column
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : float64
Shape : (3, 2)
Number of Samples : 2
Partial Remote Data Refs : False
[14]:
dummy_col[1]
[14]:
array([[0.17590758, 0.26950355],
[0.88036219, 0.7839301 ],
[0.87321484, 0.04316646]])
You can also iterate over your column, as you would do with a regular Python dictionary:
[15]:
for key, value in dummy_col.items():
    print('Key:', key)
    print('Value:', value)
    print()
Key: 0
Value: [[0.54631485 0.26578857]
[0.74990074 0.41764666]
[0.75884524 0.05547267]]
Key: 1
Value: [[0.17590758 0.26950355]
[0.88036219 0.7839301 ]
[0.87321484 0.04316646]]
How many samples are in the column?
[16]:
len(dummy_col)
[16]:
2
Does the column contain a given key?
[17]:
0 in dummy_col
[17]:
True
[18]:
5 in dummy_col
[18]:
False
6. Commit changes¶
Once you have made a set of changes you want to commit, simply call the commit() method and pass in a message!
[19]:
master_checkout.commit("Add dummy_column with 2 samples")
[19]:
'a=4f42fce2b66476271f149e3cd2eb4c6ba66daeee'
Let’s add another sample in the column:
[20]:
dummy_col[2] = np.random.rand(3,2)
dummy_col
[20]:
Hangar FlatSampleWriter
Column Name : dummy_column
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : float64
Shape : (3, 2)
Number of Samples : 3
Partial Remote Data Refs : False
Number of Samples is equal to 3 now and we want to keep track of the change with another commit:
[21]:
master_checkout.commit("Add one more sample to dummy_column")
[21]:
'a=753e28e27d4b23a0dca0633f90b4513538a98c40'
To view the history of your commits:
[22]:
master_checkout.log()
* a=753e28e27d4b23a0dca0633f90b4513538a98c40 (master) : Add one more sample to dummy_column
* a=4f42fce2b66476271f149e3cd2eb4c6ba66daeee : Add dummy_column with 2 samples
Do not forget to close the write-enabled checkout!
[23]:
master_checkout.close()
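One way to make sure the checkout is always closed, even if an exception is raised while writing, is to wrap it in a small context manager. This is a sketch, not a Hangar feature: write_checkout is a hypothetical helper that uses only repo.checkout(write=True) and close() as shown in this tutorial, and FakeRepo and FakeCheckout are stand-ins so that the example runs without a repository on disk.

```python
from contextlib import contextmanager


@contextmanager
def write_checkout(repo):
    """Yield a write-enabled checkout and close it on exit (hypothetical helper)."""
    checkout = repo.checkout(write=True)
    try:
        yield checkout
    finally:
        checkout.close()  # runs even if the body raises


class FakeCheckout:
    """Stand-in that records whether close() was called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakeRepo:
    """Stand-in exposing the one method the helper needs."""

    def checkout(self, write=False):
        self.last = FakeCheckout()
        return self.last


fake_repo = FakeRepo()
try:
    with write_checkout(fake_repo) as co:
        raise RuntimeError("simulated failure while writing")
except RuntimeError:
    pass

print(fake_repo.last.closed)  # True
```

With a real repository, the body of the with block would contain the column writes and the commit() call. The standard library's contextlib.closing offers the same guarantee for any object that has a close() method.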
Check the state of the repository and get useful information about disk usage, the columns you have and the last commit:
[24]:
repo.summary()
Summary of Contents Contained in Data Repository
==================
| Repository Info
|-----------------
| Base Directory: /Volumes/Archivio/tensorwerk/hangar/quick-start
| Disk Usage: 237.53 kB
===================
| Commit Details
-------------------
| Commit: a=753e28e27d4b23a0dca0633f90b4513538a98c40
| Created: Tue Apr 21 21:50:15 2020
| By: Alessia Marcolini
| Email: alessia@tensorwerk.com
| Message: Add one more sample to dummy_column
==================
| DataSets
|-----------------
| Number of Named Columns: 1
|
| * Column Name: ColumnSchemaKey(column="dummy_column", layout="flat")
| Num Data Pieces: 3
| Details:
| - column_layout: flat
| - column_type: ndarray
| - schema_hasher_tcode: 1
| - data_hasher_tcode: 0
| - schema_type: fixed_shape
| - shape: (3, 2)
| - dtype: float64
| - backend: 01
| - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}