“Real World” Quick Start Tutorial¶
This tutorial will guide you through the basics of Hangar, while playing with some “real world” data:
adding data to a repository
committing changes
reading data from a commit
inspecting the contents of a commit
Setup¶
You can install Hangar via pip:
$ pip install hangar
or via conda:
$ conda install -c conda-forge hangar
Other requirements for this tutorial are:
pillow - the Python Imaging Library fork
tqdm - a simple tool to display progress bars (installed automatically, as it is a requirement of Hangar)
$ pip install pillow
1. Create and Initialize a “Repository”¶
When working with Hangar programmatically (the CLI is covered in later tutorials), we always start with the following import:
[1]:
from hangar import Repository
Create the folder where you want to store the Hangar Repository:
[2]:
!mkdir /Volumes/Archivio/tensorwerk/hangar/imagenette
and create the Repository object. Note that when you specify a new folder for a Hangar repository, Python shows a warning saying that you will need to initialize the repo before starting to work on it.
[3]:
repo = Repository(path="/Volumes/Archivio/tensorwerk/hangar/imagenette")
//anaconda/envs/hangar-nested/lib/python3.7/site-packages/hangar-0.5.0.dev1-py3.7-macosx-10.9-x86_64.egg/hangar/context.py:94: UserWarning: No repository exists at /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar, please use `repo.init()` method
warnings.warn(msg, UserWarning)
Initialize the Repository by providing your name and your email.
Warning
Please be aware that setting the remove_old parameter to True removes and reinitializes any Hangar repository at the given path.
[4]:
repo.init(
    user_name="Alessia Marcolini", user_email="alessia@tensorwerk.com", remove_old=True
)
Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar
[4]:
'/Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar'
2. Open the Staging Area for Writing¶
A Repository can be checked out in two modes: write-enabled and read-only. We need to check out the repo in write mode in order to initialize the columns and write into them.
[5]:
master_checkout = repo.checkout(write=True)
A checkout allows access to columns. The columns attribute of a checkout provides the interface for working with all of the data on disk!
[6]:
master_checkout.columns
[6]:
Hangar Columns
Writeable : True
Number of Columns : 0
Column Names / Partial Remote References:
-
3. Download and Prepare Some Conventionally Stored Data¶
To start playing with Hangar, let’s get some data to work on. We’ll be using the Imagenette dataset.
The following commands will download ~96 MB of data to the local directory and decompress the tarball containing ~9,200 .jpeg images into the data folder in the current working directory.
[7]:
!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz -P data
--2020-04-04 13:25:37-- https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
Resolving s3.amazonaws.com... 52.216.238.197
Connecting to s3.amazonaws.com|52.216.238.197|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98948031 (94M) [application/x-tar]
Saving to: ‘data/imagenette2-160.tgz’
imagenette2-160.tgz 100%[===================>] 94.36M 4.52MB/s in 22s
2020-04-04 13:26:00 (4.31 MB/s) - ‘data/imagenette2-160.tgz’ saved [98948031/98948031]
[8]:
!tar -xzf data/imagenette2-160.tgz -C data
[9]:
!wget http://image-net.org/archive/words.txt -P data/imagenette2-160
--2020-04-04 13:26:24-- http://image-net.org/archive/words.txt
Resolving image-net.org... 171.64.68.16
Connecting to image-net.org|171.64.68.16|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2655750 (2.5M) [text/plain]
Saving to: ‘data/imagenette2-160/words.txt’
words.txt 100%[===================>] 2.53M 884KB/s in 2.9s
2020-04-04 13:26:27 (884 KB/s) - ‘data/imagenette2-160/words.txt’ saved [2655750/2655750]
The dataset directory structure on disk is as follows:¶
Each subdirectory in the train / val folders (named starting with "n0") contains a few hundred images which feature objects/elements of a common classification (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute). The image file names follow a convention specific to the ImageNet project, but can be thought of as essentially random (so long as they are unique).
imagenette2-160
├── train
│ ├── n01440764
│ ├── n02102040
│ ├── n02979186
│ ├── n03000684
│ ├── n03028079
│ ├── n03394916
│ ├── n03417042
│ ├── n03425413
│ ├── n03445777
│ └── n03888257
└── val
├── n01440764
├── n02102040
├── n02979186
├── n03000684
├── n03028079
├── n03394916
├── n03417042
├── n03425413
├── n03445777
└── n03888257
Classification/Label Data¶
The labels associated with each image are contained in a separate .txt file; we download words.txt into the directory the images were extracted to.
Reviewing the contents of this file, we find a mapping of classification codes (subdirectory names starting with "n0") to human-readable descriptions of the contents. A small selection of the file is provided below as an illustration.
n01635343 Rhyacotriton, genus Rhyacotriton
n01635480 olympic salamander, Rhyacotriton olympicus
n01635659 Plethodontidae, family Plethodontidae
n01635964 Plethodon, genus Plethodon
n01636127 lungless salamander, plethodont
n01636352 eastern red-backed salamander, Plethodon cinereus
n01636510 western red-backed salamander, Plethodon vehiculum
n01636675 Desmograthus, genus Desmograthus
n01636829 dusky salamander
n01636984 Aneides, genus Aneides
n01637112 climbing salamander
n01637338 arboreal salamander, Aneides lugubris
n01637478 Batrachoseps, genus Batrachoseps
n01637615 slender salamander, worm salamander
n01637796 Hydromantes, genus Hydromantes
Mapping Classification Codes to Meaningful Descriptors¶
We begin by reading each line of this file and creating a dictionary to store the correspondence between ImageNet synset name and a human-readable label.
[10]:
from pathlib import Path
dataset_dir = Path("./data/imagenette2-160")
synset_label = {}
with open(dataset_dir / "words.txt", "r") as f:
    for line in f.readlines():
        synset, label = line.split("\t")
        synset_label[synset] = label.rstrip()
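As a quick check of the parsing logic, here it is applied to two entries copied from the selection shown above (inlined so the snippet is self-contained):

```python
# Parse two lines copied from words.txt; the dict maps synset -> description.
sample_lines = [
    "n01635343\tRhyacotriton, genus Rhyacotriton",
    "n01636829\tdusky salamander",
]

mapping = {}
for line in sample_lines:
    synset, label = line.split("\t")
    mapping[synset] = label.rstrip()

print(mapping["n01636829"])  # dusky salamander
```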
Read training data (images and labels) from disk and store them in NumPy arrays.
[11]:
import os
from tqdm import tqdm
import numpy as np
from PIL import Image
[12]:
train_images = []
train_labels = []

for synset in tqdm(os.listdir(dataset_dir / "train")):
    label = synset_label[synset]
    for image_filename in os.listdir(dataset_dir / "train" / synset):
        image = Image.open(dataset_dir / "train" / synset / image_filename)
        image = image.resize((163, 160))
        data = np.asarray(image)
        if len(data.shape) == 2:  # discard grayscale images
            continue
        train_images.append(data)
        train_labels.append(label)

train_images = np.array(train_images)
100%|██████████| 10/10 [00:31<00:00, 3.12s/it]
[13]:
train_images.shape
[13]:
(9296, 160, 163, 3)
Note
Here we are reading the images from disk, storing them in a big Python list, and then converting the list to a NumPy array. Note that this can be impractical for larger datasets. You might want to consider reading the files in batches instead.
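A batched variant of the loop above could be sketched as follows. This is a minimal sketch: iter_image_batches, load_fn, and the batch size are illustrative names, not part of Hangar or of this tutorial's code.

```python
def iter_image_batches(paths, load_fn, batch_size=256):
    """Yield small lists of arrays instead of materializing every image at once."""
    batch = []
    for path in paths:
        data = load_fn(path)
        if len(data.shape) == 2:  # skip grayscale images, as above
            continue
        batch.append(data)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

Each yielded batch can then be converted to an array and written to a column before the next batch is loaded, keeping memory usage bounded.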
Read validation data (images and labels) from disk and store them in NumPy arrays, same as before.
[14]:
val_images = []
val_labels = []

for synset in tqdm(os.listdir(dataset_dir / "val")):
    label = synset_label[synset]
    for image_filename in os.listdir(dataset_dir / "val" / synset):
        image = Image.open(dataset_dir / "val" / synset / image_filename)
        image = image.resize((163, 160))
        data = np.asarray(image)
        if len(data.shape) == 2:  # discard grayscale images
            continue
        val_images.append(data)
        val_labels.append(label)

val_images = np.array(val_images)
100%|██████████| 10/10 [00:12<00:00, 1.22s/it]
[15]:
val_images.shape
[15]:
(3856, 160, 163, 3)
4. Column initialization¶
With the checkout write-enabled, we can now initialize a new column of the repository using the add_ndarray_column() method.
All samples within a column have the same data type and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.
You will need to provide a column name and a prototype, so Hangar can infer the shape and data type of the arrays to be stored. train_im_col will become a column accessor object.
[16]:
train_im_col = master_checkout.add_ndarray_column(
    name="training_images", prototype=train_images[0]
)
Verify we successfully added the new column:
[17]:
master_checkout.columns
[17]:
Hangar Columns
Writeable : True
Number of Columns : 1
Column Names / Partial Remote References:
- training_images / False
Get useful information about the new column simply by inspecting train_im_col …
[18]:
train_im_col
[18]:
Hangar FlatSampleWriter
Column Name : training_images
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : uint8
Shape : (160, 163, 3)
Number of Samples : 0
Partial Remote Data Refs : False
… or by leveraging the dict-style column access through the checkout object. Both provide the same information.
[19]:
master_checkout.columns["training_images"]
[19]:
Hangar FlatSampleWriter
Column Name : training_images
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : uint8
Shape : (160, 163, 3)
Number of Samples : 0
Partial Remote Data Refs : False
Since Hangar 0.5 it is possible to have a column with a string data type; we will use one to store the labels of our dataset.
[20]:
train_lab_col = master_checkout.add_str_column(name="training_labels")
[21]:
train_lab_col
[21]:
Hangar FlatSampleWriter
Column Name : training_labels
Writeable : True
Column Type : str
Column Layout : flat
Schema Type : variable_shape
DType : <class 'str'>
Shape : None
Number of Samples : 0
Partial Remote Data Refs : False
5. Adding data¶
To add data to a named column, we can use dict-style assignment (refer to the __setitem__, __getitem__, and __delitem__ methods) or the update() method. Sample keys can be of either str or int type.
[22]:
train_im_col[0] = train_images[0]
train_lab_col[0] = train_labels[0]
As we can see, Number of Samples is now equal to 1.
[23]:
master_checkout.columns["training_labels"]
[23]:
Hangar FlatSampleWriter
Column Name : training_labels
Writeable : True
Column Type : str
Column Layout : flat
Schema Type : variable_shape
DType : <class 'str'>
Shape : None
Number of Samples : 1
Partial Remote Data Refs : False
[24]:
data = {1: train_images[1], 2: train_images[2]}
[25]:
train_im_col.update(data)
[26]:
train_im_col
[26]:
Hangar FlatSampleWriter
Column Name : training_images
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : uint8
Shape : (160, 163, 3)
Number of Samples : 3
Partial Remote Data Refs : False
Let’s add the remaining training images:
[27]:
with train_im_col:
    for i, img in tqdm(enumerate(train_images), total=train_images.shape[0]):
        if i not in [0, 1, 2]:
            train_im_col[i] = img
100%|██████████| 9296/9296 [00:36<00:00, 257.92it/s]
[28]:
with train_lab_col:
    for i, label in tqdm(enumerate(train_labels), total=len(train_labels)):
        if i != 0:
            train_lab_col[i] = label
100%|██████████| 9296/9296 [00:01<00:00, 5513.23it/s]
[29]:
train_lab_col
[29]:
Hangar FlatSampleWriter
Column Name : training_labels
Writeable : True
Column Type : str
Column Layout : flat
Schema Type : variable_shape
DType : <class 'str'>
Shape : None
Number of Samples : 9296
Partial Remote Data Refs : False
Both the training_images and the training_labels columns have 9296 samples. Great!
Note
To get an overview of the different ways you could add data to a Hangar repository (also from a performance point of view), please refer to the Performance section of the Hangar Tutorial Part 1.
6. Committing changes¶
Once you have made a set of changes you want to commit, simply call the commit() method and specify a message.
The returned value (a=ecc943c89b9b09e41574c9849f11937828fece28) is the hash of this commit.
[30]:
master_checkout.commit("Add Imagenette training images and labels")
[30]:
'a=ecc943c89b9b09e41574c9849f11937828fece28'
Let’s add the validation data to the repository …
[31]:
val_im_col = master_checkout.add_ndarray_column(
    name="validation_images", prototype=val_images[0]
)
val_lab_col = master_checkout.add_str_column(name="validation_labels")
[32]:
with val_im_col, val_lab_col:
    for i, (img, label) in tqdm(
        enumerate(zip(val_images, val_labels)), total=len(val_labels)
    ):
        val_im_col[i] = img
        val_lab_col[i] = label
100%|██████████| 3856/3856 [00:08<00:00, 474.25it/s]
… and commit!
[33]:
master_checkout.commit("Add Imagenette validation images and labels")
[33]:
'a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba'
To view the history of your commits:
[34]:
master_checkout.log()
* a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba (master) : Add Imagenette validation images and labels
* a=ecc943c89b9b09e41574c9849f11937828fece28 : Add Imagenette training images and labels
Do not forget to close the write-enabled checkout!¶
[35]:
master_checkout.close()
Let’s inspect the repository state! This will show disk usage information, the details of the last commit and all the information about the dataset columns.
[36]:
repo.summary()
Summary of Contents Contained in Data Repository
==================
| Repository Info
|-----------------
| Base Directory: /Volumes/Archivio/tensorwerk/hangar/imagenette
| Disk Usage: 862.09 MB
===================
| Commit Details
-------------------
| Commit: a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba
| Created: Sat Apr 4 11:29:12 2020
| By: Alessia Marcolini
| Email: alessia@tensorwerk.com
| Message: Add Imagenette validation images and labels
==================
| DataSets
|-----------------
| Number of Named Columns: 4
|
| * Column Name: ColumnSchemaKey(column="training_images", layout="flat")
| Num Data Pieces: 9296
| Details:
| - column_layout: flat
| - column_type: ndarray
| - schema_type: fixed_shape
| - shape: (160, 163, 3)
| - dtype: uint8
| - backend: 01
| - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}
|
| * Column Name: ColumnSchemaKey(column="training_labels", layout="flat")
| Num Data Pieces: 9296
| Details:
| - column_layout: flat
| - column_type: str
| - schema_type: variable_shape
| - dtype: <class 'str'>
| - backend: 30
| - backend_options: {}
|
| * Column Name: ColumnSchemaKey(column="validation_images", layout="flat")
| Num Data Pieces: 3856
| Details:
| - column_layout: flat
| - column_type: ndarray
| - schema_type: fixed_shape
| - shape: (160, 163, 3)
| - dtype: uint8
| - backend: 01
| - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}
|
| * Column Name: ColumnSchemaKey(column="validation_labels", layout="flat")
| Num Data Pieces: 3856
| Details:
| - column_layout: flat
| - column_type: str
| - schema_type: variable_shape
| - dtype: <class 'str'>
| - backend: 30
| - backend_options: {}
==================
| Metadata:
|-----------------
| Number of Keys: 0
Great! You’ve made it to the end of the “Real World” Quick Start Tutorial!! 👏🏼
Please check out the other tutorials for more advanced topics such as branching & merging, conflict resolution, and data loaders for TensorFlow and PyTorch!