“Real World” Quick Start Tutorial

This tutorial walks you through the basics of Hangar using some “real world” data:

  • adding data to a repository

  • committing changes

  • reading data from a commit

  • inspecting contents of a commit

Setup

You can install Hangar via pip:

$ pip install hangar

or via conda:

$ conda install -c conda-forge hangar

Other requirements for this tutorial are:

  • pillow - the Python imaging library

  • tqdm - a simple tool to display progress bars (this is installed automatically as it is a requirement for Hangar)

$ pip install pillow

1. Create and Initialize a “Repository”

When working with Hangar programmatically (the CLI is covered in later tutorials), we always start with the following import:

[1]:
from hangar import Repository

Create the folder where you want to store the Hangar Repository:

[2]:
!mkdir /Volumes/Archivio/tensorwerk/hangar/imagenette

Then create the Repository object. When you point it at a folder that does not yet contain a Hangar repository, Python emits a warning reminding you to initialize the repo before you start working on it.

[3]:
repo = Repository(path="/Volumes/Archivio/tensorwerk/hangar/imagenette")
//anaconda/envs/hangar-nested/lib/python3.7/site-packages/hangar-0.5.0.dev1-py3.7-macosx-10.9-x86_64.egg/hangar/context.py:94: UserWarning: No repository exists at /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar, please use `repo.init()` method
  warnings.warn(msg, UserWarning)

Initialize the Repository by providing your name and email.

Warning

Please be aware that setting the remove_old parameter to True deletes any existing Hangar repository at the given path and reinitializes it.

[4]:
repo.init(
    user_name="Alessia Marcolini", user_email="alessia@tensorwerk.com", remove_old=True
)
Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar
[4]:
'/Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar'
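Because remove_old=True wipes an existing repository, it can be worth checking whether one is already present before calling init(). The sketch below relies only on the .hangar subdirectory name visible in the warning and output above. hangar_repo_exists is a hypothetical helper written for this tutorial, not part of Hangar.

```python
import tempfile
from pathlib import Path


def hangar_repo_exists(path):
    """Return True if `path` already contains a `.hangar` directory."""
    return (Path(path) / ".hangar").is_dir()


# Demonstrate on a throwaway directory instead of a real repository.
with tempfile.TemporaryDirectory() as tmp:
    before = hangar_repo_exists(tmp)   # nothing initialized yet
    (Path(tmp) / ".hangar").mkdir()    # simulate an initialized repo
    after = hangar_repo_exists(tmp)

print(before, after)  # False True
```

A guard such as `if not hangar_repo_exists(path): repo.init(...)` would then let you drop remove_old=True when you want to keep existing data.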

2. Open the Staging Area for Writing

A Repository can be checked out in two modes: write-enabled and read-only. We need to check out the repo in write mode in order to initialize the columns and write to them.

[5]:
master_checkout = repo.checkout(write=True)

A checkout allows access to columns. The columns attribute of a checkout provides the interface to working with all of the data on disk!

[6]:
master_checkout.columns
[6]:
Hangar Columns
    Writeable         : True
    Number of Columns : 0
    Column Names / Partial Remote References:
      -

3. Download and Prepare Some Conventionally Stored Data

To start playing with Hangar, let’s get some data to work on. We’ll be using the Imagenette dataset.

The following commands download a ~94 MB tarball into the data folder in the current working directory and decompress it there. The archive contains more than 13,000 .jpeg images.

[7]:
!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz -P data
--2020-04-04 13:25:37--  https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
Resolving s3.amazonaws.com... 52.216.238.197
Connecting to s3.amazonaws.com|52.216.238.197|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98948031 (94M) [application/x-tar]
Saving to: ‘data/imagenette2-160.tgz’

imagenette2-160.tgz 100%[===================>]  94.36M  4.52MB/s    in 22s

2020-04-04 13:26:00 (4.31 MB/s) - ‘data/imagenette2-160.tgz’ saved [98948031/98948031]

[8]:
!tar -xzf data/imagenette2-160.tgz -C data
[9]:
!wget http://image-net.org/archive/words.txt -P data/imagenette2-160
--2020-04-04 13:26:24--  http://image-net.org/archive/words.txt
Resolving image-net.org... 171.64.68.16
Connecting to image-net.org|171.64.68.16|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2655750 (2.5M) [text/plain]
Saving to: ‘data/imagenette2-160/words.txt’

words.txt           100%[===================>]   2.53M   884KB/s    in 2.9s

2020-04-04 13:26:27 (884 KB/s) - ‘data/imagenette2-160/words.txt’ saved [2655750/2655750]
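If wget and tar are not available on your system, the same download-and-extract steps can be approximated with the Python standard library. This is a minimal sketch: download and extract are hypothetical helper names, and only the URLs already shown above are assumed.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url, dest_dir):
    """Download `url` into `dest_dir` and return the local file path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]  # keep the remote file name
    urllib.request.urlretrieve(url, str(target))
    return target


def extract(archive, dest_dir):
    """Extract a gzip-compressed tarball into `dest_dir`."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)


# Usage (left commented out because it downloads the full archive):
# archive = download(
#     "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz", "data"
# )
# extract(archive, "data")
# download("http://image-net.org/archive/words.txt", "data/imagenette2-160")
```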

The dataset directory structure on disk is as follows:

Each subdirectory in the train / val folders (named starting with "n0") contains a few hundred images which feature objects/elements of a common classification (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute). The image file names follow a convention specific to the ImageNet project, but can be thought of as essentially random (so long as they are unique).

imagenette2-160
├── train
│   ├── n01440764
│   ├── n02102040
│   ├── n02979186
│   ├── n03000684
│   ├── n03028079
│   ├── n03394916
│   ├── n03417042
│   ├── n03425413
│   ├── n03445777
│   └── n03888257
└── val
    ├── n01440764
    ├── n02102040
    ├── n02979186
    ├── n03000684
    ├── n03028079
    ├── n03394916
    ├── n03417042
    ├── n03425413
    ├── n03445777
    └── n03888257
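To confirm that the extracted layout matches the tree above, you could count the files in each class subdirectory. The sketch below is a hypothetical helper (count_files_per_class is not part of Hangar) and is demonstrated on a small temporary tree with placeholder file names rather than the real dataset.

```python
import tempfile
from pathlib import Path


def count_files_per_class(split_dir):
    """Map each class subdirectory name to the number of files it holds."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir() if p.is_file()
            )
    return counts


# Build a tiny stand-in for the `train` folder and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    for synset, n_files in [("n01440764", 2), ("n02102040", 1)]:
        (train / synset).mkdir(parents=True)
        for k in range(n_files):
            (train / synset / f"placeholder_{k}.JPEG").touch()
    counts = count_files_per_class(train)

print(counts)  # {'n01440764': 2, 'n02102040': 1}
```

On the real data, `count_files_per_class("data/imagenette2-160/train")` should return ten entries, one per synset directory.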

Classification/Label Data

The labels associated with each image are contained in a separate .txt file, words.txt, which we downloaded to the directory the images were extracted into.

Reviewing the contents of this file, we find a mapping of classification codes (subdirectory names starting with "n0") to human-readable descriptions of the contents. A small selection of the file is provided below as an illustration.

n01635343   Rhyacotriton, genus Rhyacotriton
n01635480   olympic salamander, Rhyacotriton olympicus
n01635659   Plethodontidae, family Plethodontidae
n01635964   Plethodon, genus Plethodon
n01636127   lungless salamander, plethodont
n01636352   eastern red-backed salamander, Plethodon cinereus
n01636510   western red-backed salamander, Plethodon vehiculum
n01636675   Desmograthus, genus Desmograthus
n01636829   dusky salamander
n01636984   Aneides, genus Aneides
n01637112   climbing salamander
n01637338   arboreal salamander, Aneides lugubris
n01637478   Batrachoseps, genus Batrachoseps
n01637615   slender salamander, worm salamander
n01637796   Hydromantes, genus Hydromantes

Mapping Classification Codes to Meaningful Descriptors

We begin by reading each line of this file and creating a dictionary to store the correspondence between each ImageNet synset name and a human-readable label.

[10]:
from pathlib import Path

dataset_dir = Path("./data/imagenette2-160")

synset_label = {}
with open(dataset_dir / "words.txt", "r") as f:
    for line in f:
        # each line is "<synset>\t<label>"; split only on the first tab
        synset, label = line.split("\t", 1)
        synset_label[synset] = label.rstrip()
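The parsing step can be tried without downloading anything by feeding it a couple of tab-separated lines copied from the excerpt above. This is a minimal sketch. parse_synset_lines and its first_name_only option are hypothetical additions for illustration, the latter keeping only the first of several comma-separated names.

```python
def parse_synset_lines(lines, first_name_only=False):
    """Build a {synset: label} dict from tab-separated lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        synset, label = line.split("\t", 1)
        if first_name_only:
            label = label.split(",")[0].strip()
        mapping[synset] = label
    return mapping


sample = [
    "n01635480\tolympic salamander, Rhyacotriton olympicus\n",
    "n01636829\tdusky salamander\n",
]

full = parse_synset_lines(sample)
short = parse_synset_lines(sample, first_name_only=True)

print(full["n01635480"])   # olympic salamander, Rhyacotriton olympicus
print(short["n01635480"])  # olympic salamander
```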

Read training data (images and labels) from disk and store them in NumPy arrays.

[11]:
import os
from tqdm import tqdm

import numpy as np
from PIL import Image
[12]:
train_images = []
train_labels = []

for synset in tqdm(os.listdir(dataset_dir / "train")):
    label = synset_label[synset]

    for image_filename in os.listdir(dataset_dir / "train" / synset):
        image = Image.open(dataset_dir / "train" / synset / image_filename)
        image = image.resize((163, 160))
        data = np.asarray(image)

        if len(data.shape) == 2:  # skip grayscale images (no channel axis)
            continue

        train_images.append(data)
        train_labels.append(label)

train_images = np.array(train_images)
100%|██████████| 10/10 [00:31<00:00,  3.12s/it]
[13]:
train_images.shape
[13]:
(9296, 160, 163, 3)

Note

Here we read the images from disk into one big Python list and then convert it to a NumPy array, so the whole dataset is held in memory at once. This can be impractical for larger datasets, where you might want to read the files in batches instead.
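One way to read files in batches is to split the list of file paths into fixed-size chunks, then load, store, and release one chunk at a time. The helper below is a minimal, standard-library sketch. batched is a hypothetical name and the file names are placeholders.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


paths = [f"img_{i}.JPEG" for i in range(5)]  # placeholder file names
batches = list(batched(paths, 2))

print([len(b) for b in batches])  # [2, 2, 1]
```

Inside the loop over batches you would open and resize only the images of the current batch, so memory use is bounded by the batch size rather than by the dataset size.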

Read validation data (images and labels) from disk and store them in NumPy arrays, same as before.

[14]:
val_images = []
val_labels = []

for synset in tqdm(os.listdir(dataset_dir / "val")):
    label = synset_label[synset]

    for image_filename in os.listdir(dataset_dir / "val" / synset):
        image = Image.open(dataset_dir / "val" / synset / image_filename)
        image = image.resize((163, 160))
        data = np.asarray(image)

        if len(data.shape) == 2:  # skip grayscale images (no channel axis)
            continue

        val_images.append(data)
        val_labels.append(label)

val_images = np.array(val_images)
100%|██████████| 10/10 [00:12<00:00,  1.22s/it]
[15]:
val_images.shape
[15]:
(3856, 160, 163, 3)

4. Column Initialization

With a write-enabled checkout, we can now initialize a new column of the repository using the method add_ndarray_column().

All samples within a column have the same data type and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.

You will need to provide a column name and a prototype, so Hangar can infer the shape and data type of the samples stored in the column. The returned object, train_im_col, is a column accessor.

[16]:
train_im_col = master_checkout.add_ndarray_column(
    name="training_images", prototype=train_images[0]
)
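The difference between fixed and variable shapes can be pictured with a small check on sample shapes. This sketch assumes that, for variable-shape columns, the shape of the prototype acts as an upper bound on each dimension. It is an illustration only, not Hangar's actual validation code, and shape_fits is a hypothetical name.

```python
def shape_fits(sample_shape, schema_shape, variable_shape=False):
    """Check whether a sample shape is acceptable for a column schema."""
    if len(sample_shape) != len(schema_shape):
        return False  # the number of dimensions must always match
    if variable_shape:
        # assumed rule: every dimension may be at most the schema size
        return all(s <= m for s, m in zip(sample_shape, schema_shape))
    return tuple(sample_shape) == tuple(schema_shape)


schema = (160, 163, 3)
fixed_ok = shape_fits((160, 163, 3), schema)
fixed_bad = shape_fits((120, 163, 3), schema)
variable_ok = shape_fits((120, 163, 3), schema, variable_shape=True)
wrong_rank = shape_fits((160, 163), schema, variable_shape=True)

print(fixed_ok, fixed_bad, variable_ok, wrong_rank)  # True False True False
```

This is why every image was resized to the same size before being stored: the default fixed-shape column only accepts samples whose shape matches the prototype exactly.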

Verify we successfully added the new column:

[17]:
master_checkout.columns
[17]:
Hangar Columns
    Writeable         : True
    Number of Columns : 1
    Column Names / Partial Remote References:
      - training_images / False

Get useful information about the new column simply by inspecting train_im_col …

[18]:
train_im_col
[18]:
Hangar FlatSampleWriter
    Column Name              : training_images
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : uint8
    Shape                    : (160, 163, 3)
    Number of Samples        : 0
    Partial Remote Data Refs : False

… or by using dict-style column access through the checkout object. Both provide the same information.

[19]:
master_checkout.columns["training_images"]
[19]:
Hangar FlatSampleWriter
    Column Name              : training_images
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : uint8
    Shape                    : (160, 163, 3)
    Number of Samples        : 0
    Partial Remote Data Refs : False

Since Hangar 0.5, it’s possible to have a column with string datatype, and we will be using it to store the labels of our dataset.

[20]:
train_lab_col = master_checkout.add_str_column(name="training_labels")
[21]:
train_lab_col
[21]:
Hangar FlatSampleWriter
    Column Name              : training_labels
    Writeable                : True
    Column Type              : str
    Column Layout            : flat
    Schema Type              : variable_shape
    DType                    : <class 'str'>
    Shape                    : None
    Number of Samples        : 0
    Partial Remote Data Refs : False

5. Adding Data

To add data to a named column, we can use dict-style access (refer to the __setitem__, __getitem__, and __delitem__ methods) or the update() method. Sample keys can be of either str or int type.

[22]:
train_im_col[0] = train_images[0]
train_lab_col[0] = train_labels[0]

As we can see, Number of Samples is equal to 1 now.

[23]:
master_checkout.columns["training_labels"]
[23]:
Hangar FlatSampleWriter
    Column Name              : training_labels
    Writeable                : True
    Column Type              : str
    Column Layout            : flat
    Schema Type              : variable_shape
    DType                    : <class 'str'>
    Shape                    : None
    Number of Samples        : 1
    Partial Remote Data Refs : False

[24]:
data = {1: train_images[1], 2: train_images[2]}
[25]:
train_im_col.update(data)
[26]:
train_im_col
[26]:
Hangar FlatSampleWriter
    Column Name              : training_images
    Writeable                : True
    Column Type              : ndarray
    Column Layout            : flat
    Schema Type              : fixed_shape
    DType                    : uint8
    Shape                    : (160, 163, 3)
    Number of Samples        : 3
    Partial Remote Data Refs : False

Let’s add the remaining training images:

[27]:
with train_im_col:
    for i, img in tqdm(enumerate(train_images), total=train_images.shape[0]):
        if i not in [0, 1, 2]:
            train_im_col[i] = img
100%|██████████| 9296/9296 [00:36<00:00, 257.92it/s]
[28]:
with train_lab_col:
    for i, label in tqdm(enumerate(train_labels), total=len(train_labels)):
        if i != 0:
            train_lab_col[i] = label
100%|██████████| 9296/9296 [00:01<00:00, 5513.23it/s]
[29]:
train_lab_col
[29]:
Hangar FlatSampleWriter
    Column Name              : training_labels
    Writeable                : True
    Column Type              : str
    Column Layout            : flat
    Schema Type              : variable_shape
    DType                    : <class 'str'>
    Shape                    : None
    Number of Samples        : 9296
    Partial Remote Data Refs : False

Both the training_images and the training_labels columns now have 9296 samples. Great!

Note

To get an overview of the different ways you could add data to a Hangar repository (also from a performance point of view), please refer to the Performance section of the Hangar Tutorial Part 1.

6. Committing Changes

Once you have made a set of changes you want to commit, simply call the commit() method and specify a message.

The return value (a=ecc943c89b9b09e41574c9849f11937828fece28) is the hash of the new commit.

[30]:
master_checkout.commit("Add Imagenette training images and labels")
[30]:
'a=ecc943c89b9b09e41574c9849f11937828fece28'

Let’s add the validation data to the repository …

[31]:
val_im_col = master_checkout.add_ndarray_column(
    name="validation_images", prototype=val_images[0]
)
val_lab_col = master_checkout.add_str_column(name="validation_labels")
[32]:
with val_im_col, val_lab_col:
    # enumerate() supplies a distinct integer key for every sample
    for i, (img, label) in tqdm(
        enumerate(zip(val_images, val_labels)), total=len(val_labels)
    ):
        val_im_col[i] = img
        val_lab_col[i] = label
100%|██████████| 3856/3856 [00:08<00:00, 474.25it/s]
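Each sample has to be written under its own key, and an image and its label should share the same key so they can be paired again when reading. The pattern can be checked without Hangar by letting plain dicts stand in for the two column objects. The values below are placeholders, not real data.

```python
images = ["img_a", "img_b", "img_c"]   # placeholders for image arrays
labels = ["tench", "church", "tench"]  # placeholder labels

image_col, label_col = {}, {}          # stand-ins for Hangar columns

# enumerate(zip(...)) yields one fresh key per (image, label) pair
for i, (img, label) in enumerate(zip(images, labels)):
    image_col[i] = img
    label_col[i] = label

print(len(image_col), len(label_col))       # 3 3
print(image_col.keys() == label_col.keys())  # True
```

If the key did not change between iterations, every assignment would overwrite the previous one and each column would end up holding a single sample.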

… and commit!

[33]:
master_checkout.commit("Add Imagenette validation images and labels")
[33]:
'a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba'

To view the history of your commits:

[34]:
master_checkout.log()
* a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba (master) : Add Imagenette validation images and labels
* a=ecc943c89b9b09e41574c9849f11937828fece28 : Add Imagenette training images and labels

Do not forget to close the write-enabled checkout!

[35]:
master_checkout.close()
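With the data committed, it can be read back through a read-only checkout. The helper below is a minimal sketch that uses only calls already seen in this tutorial (checkout(), the columns attribute, dict-style access, and close()). read_sample itself is a hypothetical convenience function, and the assumption is that passing write=False opens the latest commit read-only.

```python
def read_sample(repo, column_name, key):
    """Read one sample from a column through a read-only checkout."""
    checkout = repo.checkout(write=False)  # assumed: read-only mode
    try:
        return checkout.columns[column_name][key]
    finally:
        checkout.close()  # release the checkout even if the lookup fails


# Usage with the repository created in this tutorial:
# image = read_sample(repo, "training_images", 0)
# label = read_sample(repo, "training_labels", 0)
# print(image.shape, label)
```

Wrapping the lookup in try/finally mirrors the reminder above about closing checkouts, so the checkout is closed even when a key is missing.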

Let’s inspect the repository state! This will show disk usage information, the details of the last commit and all the information about the dataset columns.

[36]:
repo.summary()
Summary of Contents Contained in Data Repository

==================
| Repository Info
|-----------------
|  Base Directory: /Volumes/Archivio/tensorwerk/hangar/imagenette
|  Disk Usage: 862.09 MB

===================
| Commit Details
-------------------
|  Commit: a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba
|  Created: Sat Apr  4 11:29:12 2020
|  By: Alessia Marcolini
|  Email: alessia@tensorwerk.com
|  Message: Add Imagenette validation images and labels

==================
| DataSets
|-----------------
|  Number of Named Columns: 4
|
|  * Column Name: ColumnSchemaKey(column="training_images", layout="flat")
|    Num Data Pieces: 9296
|    Details:
|    - column_layout: flat
|    - column_type: ndarray
|    - schema_type: fixed_shape
|    - shape: (160, 163, 3)
|    - dtype: uint8
|    - backend: 01
|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}
|
|  * Column Name: ColumnSchemaKey(column="training_labels", layout="flat")
|    Num Data Pieces: 9296
|    Details:
|    - column_layout: flat
|    - column_type: str
|    - schema_type: variable_shape
|    - dtype: <class'str'>
|    - backend: 30
|    - backend_options: {}
|
|  * Column Name: ColumnSchemaKey(column="validation_images", layout="flat")
|    Num Data Pieces: 1
|    Details:
|    - column_layout: flat
|    - column_type: ndarray
|    - schema_type: fixed_shape
|    - shape: (160, 163, 3)
|    - dtype: uint8
|    - backend: 01
|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}
|
|  * Column Name: ColumnSchemaKey(column="validation_labels", layout="flat")
|    Num Data Pieces: 1
|    Details:
|    - column_layout: flat
|    - column_type: str
|    - schema_type: variable_shape
|    - dtype: <class'str'>
|    - backend: 30
|    - backend_options: {}

==================
| Metadata:
|-----------------
|  Number of Keys: 0

Great! You’ve made it to the end of the “Real World” Quick Start Tutorial! 👏🏼

Please check out the other tutorials for more advanced topics such as branching and merging, conflict resolution, and data loaders for TensorFlow and PyTorch!