“Real World” Quick Start Tutorial
This tutorial will guide you through the basics of Hangar while playing with some “real world” data:
- adding data to a repository
- committing changes
- reading data from a commit
- inspecting contents of a commit
Setup
You can install Hangar via pip:
$ pip install hangar
or via conda:
$ conda install -c conda-forge hangar
Other requirements for this tutorial are:
- pillow - the Python Imaging Library
- tqdm - a simple tool to display progress bars (installed automatically as it is a requirement of Hangar)
$ pip install pillow
1. Create and Initialize a Repository
When working with Hangar programmatically (the CLI is covered in later tutorials), we always start with the following import:
[1]:
from hangar import Repository
Create the folder where you want to store the Hangar repository:
[2]:
!mkdir /Volumes/Archivio/tensorwerk/hangar/imagenette
and create the Repository object. Note that when you specify a new folder for a Hangar repository, Python shows a warning saying that you will need to initialize the repo before starting to work on it.
[3]:
repo = Repository(path="/Volumes/Archivio/tensorwerk/hangar/imagenette")
//anaconda/envs/hangar-nested/lib/python3.7/site-packages/hangar-0.5.0.dev1-py3.7-macosx-10.9-x86_64.egg/hangar/context.py:94: UserWarning: No repository exists at /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar, please use `repo.init()` method
warnings.warn(msg, UserWarning)
Initialize the Repository, providing your name and your email.
Warning
Please be aware that setting the remove_old parameter to True removes and reinitializes the Hangar repository at the given path.
[4]:
repo.init(
    user_name="Alessia Marcolini", user_email="alessia@tensorwerk.com", remove_old=True
)
Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar
[4]:
'/Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar'
2. Open the Staging Area for Writing
A Repository can be checked out in two modes: write-enabled and read-only. We need to check out the repo in write mode in order to initialize the columns and write into them.
[5]:
master_checkout = repo.checkout(write=True)
A checkout provides access to columns. The columns attribute of a checkout is the interface for working with all of the data on disk!
[6]:
master_checkout.columns
[6]:
Hangar Columns
Writeable : True
Number of Columns : 0
Column Names / Partial Remote References:
-
3. Download and Prepare Some Conventionally Stored Data
To start playing with Hangar, let’s get some data to work on. We’ll be using the Imagenette dataset.
The following commands will download ~96 MB of data to the local directory and decompress the tarball containing ~9,200 .jpeg images into the data folder in the current working directory.
[7]:
!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz -P data
--2020-04-04 13:25:37-- https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
Resolving s3.amazonaws.com... 52.216.238.197
Connecting to s3.amazonaws.com|52.216.238.197|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98948031 (94M) [application/x-tar]
Saving to: ‘data/imagenette2-160.tgz’
imagenette2-160.tgz 100%[===================>] 94.36M 4.52MB/s in 22s
2020-04-04 13:26:00 (4.31 MB/s) - ‘data/imagenette2-160.tgz’ saved [98948031/98948031]
[8]:
!tar -xzf data/imagenette2-160.tgz -C data
[9]:
!wget http://image-net.org/archive/words.txt -P data/imagenette2-160
--2020-04-04 13:26:24-- http://image-net.org/archive/words.txt
Resolving image-net.org... 171.64.68.16
Connecting to image-net.org|171.64.68.16|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2655750 (2.5M) [text/plain]
Saving to: ‘data/imagenette2-160/words.txt’
words.txt 100%[===================>] 2.53M 884KB/s in 2.9s
2020-04-04 13:26:27 (884 KB/s) - ‘data/imagenette2-160/words.txt’ saved [2655750/2655750]
The dataset directory structure on disk is as follows:
Each subdirectory in the train / val folders (named starting with "n0") contains a few hundred images which feature objects/elements of a common classification (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute, etc.). The image file names follow a convention specific to the ImageNet project, but can be thought of as essentially random (so long as they are unique).
imagenette2-160
├── train
│   ├── n01440764
│   ├── n02102040
│   ├── n02979186
│   ├── n03000684
│   ├── n03028079
│   ├── n03394916
│   ├── n03417042
│   ├── n03425413
│   ├── n03445777
│   └── n03888257
└── val
    ├── n01440764
    ├── n02102040
    ├── n02979186
    ├── n03000684
    ├── n03028079
    ├── n03394916
    ├── n03417042
    ├── n03425413
    ├── n03445777
    └── n03888257
Classification/Label Data
The labels associated with each image are contained in a separate .txt file; we download words.txt to the directory the images are extracted into.
Reviewing the contents of this file, we find a mapping of classification codes (subdirectory names starting with "n0") to human-readable descriptions of the contents. A small selection of the file is provided below as an illustration.
n01635343 Rhyacotriton, genus Rhyacotriton
n01635480 olympic salamander, Rhyacotriton olympicus
n01635659 Plethodontidae, family Plethodontidae
n01635964 Plethodon, genus Plethodon
n01636127 lungless salamander, plethodont
n01636352 eastern red-backed salamander, Plethodon cinereus
n01636510 western red-backed salamander, Plethodon vehiculum
n01636675 Desmograthus, genus Desmograthus
n01636829 dusky salamander
n01636984 Aneides, genus Aneides
n01637112 climbing salamander
n01637338 arboreal salamander, Aneides lugubris
n01637478 Batrachoseps, genus Batrachoseps
n01637615 slender salamander, worm salamander
n01637796 Hydromantes, genus Hydromantes
Mapping Classification Codes to Meaningful Descriptors
We begin by reading each line of this file and creating a dictionary to store the correspondence between ImageNet synset name and a human-readable label.
[10]:
from pathlib import Path
dataset_dir = Path("./data/imagenette2-160")
synset_label = {}
with open(dataset_dir / "words.txt", "r") as f:
    for line in f.readlines():
        synset, label = line.split("\t")
        synset_label[synset] = label.rstrip()
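The same parsing logic can be illustrated on a small in-memory excerpt of words.txt (the two lines below are taken from the sample shown above), without touching the file on disk:

```python
# Minimal sketch of the parsing logic, run on an in-memory excerpt
# of words.txt instead of the downloaded file.
sample = (
    "n01635343\tRhyacotriton, genus Rhyacotriton\n"
    "n01635480\tolympic salamander, Rhyacotriton olympicus\n"
)

synset_label = {}
for line in sample.splitlines():
    # each line is "<synset>\t<human readable label>"
    synset, label = line.split("\t")
    synset_label[synset] = label.rstrip()

print(synset_label["n01635480"])  # olympic salamander, Rhyacotriton olympicus
```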
Read training data (images and labels) from disk and store them in NumPy arrays.
[11]:
import os
from tqdm import tqdm
import numpy as np
from PIL import Image
[12]:
train_images = []
train_labels = []
for synset in tqdm(os.listdir(dataset_dir / "train")):
    label = synset_label[synset]
    for image_filename in os.listdir(dataset_dir / "train" / synset):
        image = Image.open(dataset_dir / "train" / synset / image_filename)
        image = image.resize((163, 160))
        data = np.asarray(image)
        if len(data.shape) == 2:  # discard B&W images
            continue
        train_images.append(data)
        train_labels.append(label)
train_images = np.array(train_images)
100%|██████████| 10/10 [00:31<00:00, 3.12s/it]
[13]:
train_images.shape
[13]:
(9296, 160, 163, 3)
Note
Here we are reading the images from disk into a big Python list and then converting it to a NumPy array. This can be impractical for larger datasets: you might want to consider reading the files in batches instead.
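As a hypothetical sketch of that batched approach, a small generic generator can yield the file paths a fixed-size chunk at a time; the loading/resizing steps from the cell above would then run inside the loop, keeping peak memory bounded by the batch size (the `batched` helper below is our own illustration, not a Hangar API):

```python
# Hypothetical sketch of batch-wise reading: instead of appending every
# image to one big list, process the inputs a fixed-size chunk at a time.
def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch

# e.g. for chunk in batched(image_paths, 256): load and store `chunk`
```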
Read validation data (images and labels) from disk and store them in NumPy arrays, same as before.
[14]:
val_images = []
val_labels = []
for synset in tqdm(os.listdir(dataset_dir / "val")):
    label = synset_label[synset]
    for image_filename in os.listdir(dataset_dir / "val" / synset):
        image = Image.open(dataset_dir / "val" / synset / image_filename)
        image = image.resize((163, 160))
        data = np.asarray(image)
        if len(data.shape) == 2:  # discard B&W images
            continue
        val_images.append(data)
        val_labels.append(label)
val_images = np.array(val_images)
100%|██████████| 10/10 [00:12<00:00, 1.22s/it]
[15]:
val_images.shape
[15]:
(3856, 160, 163, 3)
4. Column Initialization
With the checkout write-enabled, we can now initialize a new column of the repository using the add_ndarray_column() method.
All samples within a column share the same data type and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.
You will need to provide a column name and a prototype, so Hangar can infer the shape and dtype of the arrays the column will contain. train_im_col will become a column accessor object.
[16]:
train_im_col = master_checkout.add_ndarray_column(
    name="training_images", prototype=train_images[0]
)
Verify we successfully added the new column:
[17]:
master_checkout.columns
[17]:
Hangar Columns
Writeable : True
Number of Columns : 1
Column Names / Partial Remote References:
- training_images / False
Get useful information about the new column simply by inspecting train_im_col …
[18]:
train_im_col
[18]:
Hangar FlatSampleWriter
Column Name : training_images
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : uint8
Shape : (160, 163, 3)
Number of Samples : 0
Partial Remote Data Refs : False
… or by using the dict-style column access on the checkout object. Both provide the same information.
[19]:
master_checkout.columns["training_images"]
[19]:
Hangar FlatSampleWriter
Column Name : training_images
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : uint8
Shape : (160, 163, 3)
Number of Samples : 0
Partial Remote Data Refs : False
Since Hangar 0.5, it is possible to have a column with a string datatype; we will use one to store the labels of our dataset.
[20]:
train_lab_col = master_checkout.add_str_column(name="training_labels")
[21]:
train_lab_col
[21]:
Hangar FlatSampleWriter
Column Name : training_labels
Writeable : True
Column Type : str
Column Layout : flat
Schema Type : variable_shape
DType : <class 'str'>
Shape : None
Number of Samples : 0
Partial Remote Data Refs : False
5. Adding Data
To add data to a named column, we can use dict-style assignment (see the __setitem__, __getitem__, and __delitem__ methods) or the update() method. Sample keys can be either str or int.
[22]:
train_im_col[0] = train_images[0]
train_lab_col[0] = train_labels[0]
As we can see, Number of Samples is now equal to 1.
[23]:
master_checkout.columns["training_labels"]
[23]:
Hangar FlatSampleWriter
Column Name : training_labels
Writeable : True
Column Type : str
Column Layout : flat
Schema Type : variable_shape
DType : <class 'str'>
Shape : None
Number of Samples : 1
Partial Remote Data Refs : False
[24]:
data = {1: train_images[1], 2: train_images[2]}
[25]:
train_im_col.update(data)
[26]:
train_im_col
[26]:
Hangar FlatSampleWriter
Column Name : training_images
Writeable : True
Column Type : ndarray
Column Layout : flat
Schema Type : fixed_shape
DType : uint8
Shape : (160, 163, 3)
Number of Samples : 3
Partial Remote Data Refs : False
Let’s add the remaining training images:
[27]:
with train_im_col:
    for i, img in tqdm(enumerate(train_images), total=train_images.shape[0]):
        if i not in [0, 1, 2]:
            train_im_col[i] = img
100%|██████████| 9296/9296 [00:36<00:00, 257.92it/s]
[28]:
with train_lab_col:
    for i, label in tqdm(enumerate(train_labels), total=len(train_labels)):
        if i != 0:
            train_lab_col[i] = label
100%|██████████| 9296/9296 [00:01<00:00, 5513.23it/s]
[29]:
train_lab_col
[29]:
Hangar FlatSampleWriter
Column Name : training_labels
Writeable : True
Column Type : str
Column Layout : flat
Schema Type : variable_shape
DType : <class 'str'>
Shape : None
Number of Samples : 9296
Partial Remote Data Refs : False
Both the training_images and the training_labels columns now have 9296 samples. Great!
Note
To get an overview of the different ways you could add data to a Hangar repository (also from a performance point of view), please refer to the Performance section of the Hangar Tutorial Part 1.
6. Committing Changes
Once you have made a set of changes you want to commit, simply call the commit() method and specify a message.
The returned value (a=ecc943c89b9b09e41574c9849f11937828fece28) is the hash of this commit.
[30]:
master_checkout.commit("Add Imagenette training images and labels")
[30]:
'a=ecc943c89b9b09e41574c9849f11937828fece28'
Let’s add the validation data to the repository …
[31]:
val_im_col = master_checkout.add_ndarray_column(
    name="validation_images", prototype=val_images[0]
)
val_lab_col = master_checkout.add_str_column(name="validation_labels")
[32]:
with val_im_col, val_lab_col:
    for i, (img, label) in tqdm(
        enumerate(zip(val_images, val_labels)), total=len(val_labels)
    ):
        val_im_col[i] = img
        val_lab_col[i] = label
100%|██████████| 3856/3856 [00:08<00:00, 474.25it/s]
… and commit!
[33]:
master_checkout.commit("Add Imagenette validation images and labels")
[33]:
'a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba'
To view the history of your commits:
[34]:
master_checkout.log()
* a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba (master) : Add Imagenette validation images and labels
* a=ecc943c89b9b09e41574c9849f11937828fece28 : Add Imagenette training images and labels
[35]:
master_checkout.close()
Let’s inspect the repository state! This will show disk usage information, the details of the last commit, and all the information about the dataset columns.
[36]:
repo.summary()
Summary of Contents Contained in Data Repository
==================
| Repository Info
|-----------------
| Base Directory: /Volumes/Archivio/tensorwerk/hangar/imagenette
| Disk Usage: 862.09 MB
===================
| Commit Details
-------------------
| Commit: a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba
| Created: Sat Apr 4 11:29:12 2020
| By: Alessia Marcolini
| Email: alessia@tensorwerk.com
| Message: Add Imagenette validation images and labels
==================
| DataSets
|-----------------
| Number of Named Columns: 4
|
| * Column Name: ColumnSchemaKey(column="training_images", layout="flat")
| Num Data Pieces: 9296
| Details:
| - column_layout: flat
| - column_type: ndarray
| - schema_type: fixed_shape
| - shape: (160, 163, 3)
| - dtype: uint8
| - backend: 01
| - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}
|
| * Column Name: ColumnSchemaKey(column="training_labels", layout="flat")
| Num Data Pieces: 9296
| Details:
| - column_layout: flat
| - column_type: str
| - schema_type: variable_shape
| - dtype: <class 'str'>
| - backend: 30
| - backend_options: {}
|
| * Column Name: ColumnSchemaKey(column="validation_images", layout="flat")
| Num Data Pieces: 3856
| Details:
| - column_layout: flat
| - column_type: ndarray
| - schema_type: fixed_shape
| - shape: (160, 163, 3)
| - dtype: uint8
| - backend: 01
| - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}
|
| * Column Name: ColumnSchemaKey(column="validation_labels", layout="flat")
| Num Data Pieces: 3856
| Details:
| - column_layout: flat
| - column_type: str
| - schema_type: variable_shape
| - dtype: <class 'str'>
| - backend: 30
| - backend_options: {}
==================
| Metadata:
|-----------------
| Number of Keys: 0
Great! You’ve made it to the end of the “Real World” Quick Start Tutorial!
Please check out the other tutorials for more advanced topics such as branching & merging, conflict resolution, and data loaders for TensorFlow and PyTorch!