{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## \"Real World\" Quick Start Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial will guide you on working with the basics of Hangar, while playing with some \"real world\" data:\n",
    "\n",
    "* adding data to a repository\n",
    "* commiting changes\n",
    "* reading data from a commit\n",
    "* inspecting contents of a commit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can install Hangar via `pip`:\n",
    "\n",
    "```\n",
    "$ pip install hangar\n",
    "```\n",
    "\n",
    "or via `conda`:\n",
    "\n",
    "```\n",
    "$ conda install -c conda-forge hangar\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other requirements for this tutorial are:\n",
    "\n",
    "* pillow - the python imaging library\n",
    "* tqdm - a simple tool to display progress bars (this is installed automatically as it is a requirement for `Hangar`)\n",
    "\n",
    "```\n",
    "$ pip install pillow\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Create and Initialize a \"Repository\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When working with Hangar programatically (the CLI is covered in later tutorials), we always start with the following import:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hangar import Repository"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the folder where you want to store the Hangar `Repository`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir /Volumes/Archivio/tensorwerk/hangar/imagenette"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and create the `Repository` object. Note that when you specify a new folder for a Hangar repository, Python shows you a warning saying that you will need to initialize the repo before starting working on it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "//anaconda/envs/hangar-nested/lib/python3.7/site-packages/hangar-0.5.0.dev1-py3.7-macosx-10.9-x86_64.egg/hangar/context.py:94: UserWarning: No repository exists at /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar, please use `repo.init()` method\n",
      "  warnings.warn(msg, UserWarning)\n"
     ]
    }
   ],
   "source": [
    "repo = Repository(path=\"/Volumes/Archivio/tensorwerk/hangar/imagenette\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the `Repository` providing your name and your email.\n",
    "\n",
    ".. warning:: Please be aware that the `remove_old` parameter set to `True` **removes and reinitializes** a Hangar repository at the given path."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'/Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "repo.init(\n",
    "    user_name=\"Alessia Marcolini\", user_email=\"alessia@tensorwerk.com\", remove_old=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Open the Staging Area for Writing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `Repository` can be checked out in two modes: write-enabled and read-only. We need to checkout the repo in write mode in order to initialize the columns and write into them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "master_checkout = repo.checkout(write=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A checkout allows access to `columns`. The `columns` attribute of a checkout provides the interface to working with all of the data on disk!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Hangar Columns                \n",
       "    Writeable         : True                \n",
       "    Number of Columns : 0                \n",
       "    Column Names / Partial Remote References:                \n",
       "      - "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "master_checkout.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Download and Prepare Some Conventionally Stored Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To start playing with Hangar, let's get some data to work on. We'll be using the [Imagenette dataset](https://github.com/fastai/imagenette).\n",
    "\n",
    "The following commands will download ~96 MB of data to the local directory and decompress the tarball containing ~ 9,200 `.jpeg` images in the folder `data` in the current working directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2020-04-04 13:25:37--  https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz\n",
      "Resolving s3.amazonaws.com... 52.216.238.197\n",
      "Connecting to s3.amazonaws.com|52.216.238.197|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 98948031 (94M) [application/x-tar]\n",
      "Saving to: ‘data/imagenette2-160.tgz’\n",
      "\n",
      "imagenette2-160.tgz 100%[===================>]  94.36M  4.52MB/s    in 22s     \n",
      "\n",
      "2020-04-04 13:26:00 (4.31 MB/s) - ‘data/imagenette2-160.tgz’ saved [98948031/98948031]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz -P data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "!tar -xzf data/imagenette2-160.tgz -C data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2020-04-04 13:26:24--  http://image-net.org/archive/words.txt\n",
      "Resolving image-net.org... 171.64.68.16\n",
      "Connecting to image-net.org|171.64.68.16|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2655750 (2.5M) [text/plain]\n",
      "Saving to: ‘data/imagenette2-160/words.txt’\n",
      "\n",
      "words.txt           100%[===================>]   2.53M   884KB/s    in 2.9s    \n",
      "\n",
      "2020-04-04 13:26:27 (884 KB/s) - ‘data/imagenette2-160/words.txt’ saved [2655750/2655750]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget http://image-net.org/archive/words.txt -P data/imagenette2-160"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The dataset directory structure on disk is as follows:\n",
    "\n",
    "Each subdirectory in the `train` / `val` folders (named starting with `\"n0\"`) contains a few hundred images which feature objects/elements of a common classification  (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute, etc.). The image file names follow a convention specific to the ImageNet project, but can be thought of as essentially random (so long as they are unique).\n",
    "\n",
    "```\n",
    "imagenette2-160\n",
    "├── train\n",
    "│   ├── n01440764\n",
    "│   ├── n02102040\n",
    "│   ├── n02979186\n",
    "│   ├── n03000684\n",
    "│   ├── n03028079\n",
    "│   ├── n03394916\n",
    "│   ├── n03417042\n",
    "│   ├── n03425413\n",
    "│   ├── n03445777\n",
    "│   └── n03888257\n",
    "└── val\n",
    "    ├── n01440764\n",
    "    ├── n02102040\n",
    "    ├── n02979186\n",
    "    ├── n03000684\n",
    "    ├── n03028079\n",
    "    ├── n03394916\n",
    "    ├── n03417042\n",
    "    ├── n03425413\n",
    "    ├── n03445777\n",
    "    └── n03888257\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Classification/Label Data\n",
    "\n",
    "The labels associated with each image are contained in a seperate `.txt` file, we download the `words.txt` to the directory the images are extracted into.\n",
    "\n",
    "Reviewing the contents of this file, we will find a mapping of classification codes (subdirectory names starting with `\"n0\"`) to human readable descriptions of the contents. A small selection of the file is provided below as an illustration.\n",
    "\n",
    "```\n",
    "n01635343\tRhyacotriton, genus Rhyacotriton\n",
    "n01635480\tolympic salamander, Rhyacotriton olympicus\n",
    "n01635659\tPlethodontidae, family Plethodontidae\n",
    "n01635964\tPlethodon, genus Plethodon\n",
    "n01636127\tlungless salamander, plethodont\n",
    "n01636352\teastern red-backed salamander, Plethodon cinereus\n",
    "n01636510\twestern red-backed salamander, Plethodon vehiculum\n",
    "n01636675\tDesmograthus, genus Desmograthus\n",
    "n01636829\tdusky salamander\n",
    "n01636984\tAneides, genus Aneides\n",
    "n01637112\tclimbing salamander\n",
    "n01637338\tarboreal salamander, Aneides lugubris\n",
    "n01637478\tBatrachoseps, genus Batrachoseps\n",
    "n01637615\tslender salamander, worm salamander\n",
    "n01637796\tHydromantes, genus Hydromantes\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mapping Classification Codes to Meaningful Descriptors\n",
    "\n",
    "We begin by reading each line of this file and creating a dictionary to store the corrispondence between ImageNet synset name and a human readable label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "dataset_dir = Path(\"./data/imagenette2-160\")\n",
    "\n",
    "synset_label = {}\n",
    "with open(dataset_dir / \"words.txt\", \"r\") as f:\n",
    "    for line in f.readlines():\n",
    "        synset, label = line.split(\"\\t\")\n",
    "        synset_label[synset] = label.rstrip()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read training data (images and labels) from disk and store them in NumPy arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from tqdm import tqdm\n",
    "\n",
    "import numpy as np\n",
    "from PIL import Image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10/10 [00:31<00:00,  3.12s/it]\n"
     ]
    }
   ],
   "source": [
    "train_images = []\n",
    "train_labels = []\n",
    "\n",
    "for synset in tqdm(os.listdir(dataset_dir / \"train\")):\n",
    "    label = synset_label[synset]\n",
    "\n",
    "    for image_filename in os.listdir(dataset_dir / \"train\" / synset):\n",
    "        image = Image.open(dataset_dir / \"train\" / synset / image_filename)\n",
    "        image = image.resize((163, 160))\n",
    "        data = np.asarray(image)\n",
    "\n",
    "        if len(data.shape) == 2:  # discard B&W images\n",
    "            continue\n",
    "\n",
    "        train_images.append(data)\n",
    "        train_labels.append(label)\n",
    "\n",
    "train_images = np.array(train_images)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(9296, 160, 163, 3)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_images.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ".. note:: Here we are reading the images from disk and storing them in a big Python list, and then converting it to a NumPy array. Note that it could be impractical for larger datasets. You might want to consider the idea of reading files in batch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read validation data (images and labels) from disk and store them in NumPy arrays, same as before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10/10 [00:12<00:00,  1.22s/it]\n"
     ]
    }
   ],
   "source": [
    "val_images = []\n",
    "val_labels = []\n",
    "\n",
    "for synset in tqdm(os.listdir(dataset_dir / \"val\")):\n",
    "    label = synset_label[synset]\n",
    "\n",
    "    for image_filename in os.listdir(dataset_dir / \"val\" / synset):\n",
    "        image = Image.open(dataset_dir / \"val\" / synset / image_filename)\n",
    "        image = image.resize((163, 160))\n",
    "        data = np.asarray(image)\n",
    "\n",
    "        if len(data.shape) == 2:  # discard B&W images\n",
    "            continue\n",
    "\n",
    "        val_images.append(data)\n",
    "        val_labels.append(label)\n",
    "\n",
    "val_images = np.array(val_images)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3856, 160, 163, 3)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val_images.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Column initialization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With checkout write-enabled, we can now initialize a new column of the repository using the method `add_ndarray_column()`.\n",
    "\n",
    "All samples within a column have the same data type, and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.\n",
    "\n",
    "You will need to provide a column `name` and a `prototype`, so Hangar can infer the shape of the elements contained in the array.\n",
    "`train_im_col` will become a column accessor object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_im_col = master_checkout.add_ndarray_column(\n",
    "    name=\"training_images\", prototype=train_images[0]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Verify we successfully added the new column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Hangar Columns                \n",
       "    Writeable         : True                \n",
       "    Number of Columns : 1                \n",
       "    Column Names / Partial Remote References:                \n",
       "      - training_images / False"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "master_checkout.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get useful information about the new column simply by inspecting `train_im_col` ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Hangar FlatSampleWriter                 \n",
       "    Column Name              : training_images                \n",
       "    Writeable                : True                \n",
       "    Column Type              : ndarray                \n",
       "    Column Layout            : flat                \n",
       "    Schema Type              : fixed_shape                \n",
       "    DType                    : uint8                \n",
       "    Shape                    : (160, 163, 3)                \n",
       "    Number of Samples        : 0                \n",
       "    Partial Remote Data Refs : False\n"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_im_col"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... or by leveraging the dict-style columns access through the `checkout` object. They provide the same information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Hangar FlatSampleWriter                 \n",
       "    Column Name              : training_images                \n",
       "    Writeable                : True                \n",
       "    Column Type              : ndarray                \n",
       "    Column Layout            : flat                \n",
       "    Schema Type              : fixed_shape                \n",
       "    DType                    : uint8                \n",
       "    Shape                    : (160, 163, 3)                \n",
       "    Number of Samples        : 0                \n",
       "    Partial Remote Data Refs : False\n"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "master_checkout.columns[\"training_images\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since Hangar 0.5, it's possible to have a column with string datatype, and we will be using it to store the labels of our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_lab_col = master_checkout.add_str_column(name=\"training_labels\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Hangar FlatSampleWriter                 \n",
       "    Column Name              : training_labels                \n",
       "    Writeable                : True                \n",
       "    Column Type              : str                \n",
       "    Column Layout            : flat                \n",
       "    Schema Type              : variable_shape                \n",
       "    DType                    : <class 'str'>                \n",
       "    Shape                    : None                \n",
       "    Number of Samples        : 0                \n",
       "    Partial Remote Data Refs : False\n"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_lab_col"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Adding data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To add data to a named column, we can use dict-style mode (refer to the `__setitem__`, `__getitem__`, and `__delitem__` methods) or the `update()` method. Sample keys can be either `str` or `int` type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_im_col[0] = train_images[0]\n",
    "train_lab_col[0] = train_labels[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, `Number of Samples` is equal to 1 now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Hangar FlatSampleWriter                 \n",
       "    Column Name              : training_labels                \n",
       "    Writeable                : True                \n",
       "    Column Type              : str                \n",
       "    Column Layout            : flat                \n",
       "    Schema Type              : variable_shape                \n",
       "    DType                    : <class 'str'>                \n",
       "    Shape                    : None                \n",
       "    Number of Samples        : 1                \n",
       "    Partial Remote Data Refs : False\n"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "master_checkout.columns[\"training_labels\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = {1: train_images[1], 2: train_images[2]}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_im_col.update(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Hangar FlatSampleWriter                 \n",
       "    Column Name              : training_images                \n",
       "    Writeable                : True                \n",
       "    Column Type              : ndarray                \n",
       "    Column Layout            : flat                \n",
       "    Schema Type              : fixed_shape                \n",
       "    DType                    : uint8                \n",
       "    Shape                    : (160, 163, 3)                \n",
       "    Number of Samples        : 3                \n",
       "    Partial Remote Data Refs : False\n"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_im_col"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add the remaining training images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 9296/9296 [00:36<00:00, 257.92it/s]\n"
     ]
    }
   ],
   "source": [
    "with train_im_col:\n",
    "    for i, img in tqdm(enumerate(train_images), total=train_images.shape[0]):\n",
    "        if i not in [0, 1, 2]:\n",
    "            train_im_col[i] = img"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "code_folding": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 9296/9296 [00:01<00:00, 5513.23it/s] \n"
     ]
    }
   ],
   "source": [
    "with train_lab_col:\n",
    "    for i, label in tqdm(enumerate(train_labels), total=len(train_labels)):\n",
    "        if i != 0:\n",
    "            train_lab_col[i] = label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Hangar FlatSampleWriter                 \n",
       "    Column Name              : training_labels                \n",
       "    Writeable                : True                \n",
       "    Column Type              : str                \n",
       "    Column Layout            : flat                \n",
       "    Schema Type              : variable_shape                \n",
       "    DType                    : <class 'str'>                \n",
       "    Shape                    : None                \n",
       "    Number of Samples        : 9296                \n",
       "    Partial Remote Data Refs : False\n"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_lab_col"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both the `training_images` and the `training_labels` have 9296 samples. Great!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ".. note:: To get an overview of the different ways you could add data to a Hangar repository (also from a performance point of view), please refer to the Performance section of the Hangar Tutorial Part 1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Committing changes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have made a set of changes you want to commit, simply call the `commit()` method and specify a message.\n",
    "\n",
    "The returned value (`a=ecc943c89b9b09e41574c9849f11937828fece28`) is the commit hash of this commit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a=ecc943c89b9b09e41574c9849f11937828fece28'"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "master_checkout.commit(\"Add Imagenette training images and labels\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add the validation data to the repository ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "val_im_col = master_checkout.add_ndarray_column(\n",
    "    name=\"validation_images\", prototype=val_images[0]\n",
    ")\n",
    "val_lab_col = master_checkout.add_str_column(name=\"validation_labels\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3856/3856 [00:08<00:00, 474.25it/s]\n"
     ]
    }
   ],
   "source": [
    "with val_im_col, val_lab_col:\n",
    "    for img, label in tqdm(zip(val_images, val_labels), total=len(val_labels)):\n",
    "        val_im_col[i] = img\n",
    "        val_lab_col[i] = label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... and commit!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "master_checkout.commit(\"Add Imagenette validation images and labels\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To view the **history** of your commits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "* a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba (\u001B[1;31mmaster\u001B[m) : Add Imagenette validation images and labels\n",
      "* a=ecc943c89b9b09e41574c9849f11937828fece28 : Add Imagenette training images and labels\n"
     ]
    }
   ],
   "source": [
    "master_checkout.log()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Do not forget to close the write-enabled checkout!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "master_checkout.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's inspect the repository state! This will show disk usage information, the details of the last commit and all the information about the dataset columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Summary of Contents Contained in Data Repository \n",
      " \n",
      "================== \n",
      "| Repository Info \n",
      "|----------------- \n",
      "|  Base Directory: /Volumes/Archivio/tensorwerk/hangar/imagenette \n",
      "|  Disk Usage: 862.09 MB \n",
      " \n",
      "=================== \n",
      "| Commit Details \n",
      "------------------- \n",
      "|  Commit: a=e31ef9a06c8d1a4cefeb52c336b2c33d1dca3fba \n",
      "|  Created: Sat Apr  4 11:29:12 2020 \n",
      "|  By: Alessia Marcolini \n",
      "|  Email: alessia@tensorwerk.com \n",
      "|  Message: Add Imagenette validation images and labels \n",
      " \n",
      "================== \n",
      "| DataSets \n",
      "|----------------- \n",
      "|  Number of Named Columns: 4 \n",
      "|\n",
      "|  * Column Name: ColumnSchemaKey(column=\"training_images\", layout=\"flat\") \n",
      "|    Num Data Pieces: 9296 \n",
      "|    Details: \n",
      "|    - column_layout: flat \n",
      "|    - column_type: ndarray \n",
      "|    - schema_type: fixed_shape \n",
      "|    - shape: (160, 163, 3) \n",
      "|    - dtype: uint8 \n",
      "|    - backend: 01 \n",
      "|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'} \n",
      "|\n",
      "|  * Column Name: ColumnSchemaKey(column=\"training_labels\", layout=\"flat\") \n",
      "|    Num Data Pieces: 9296 \n",
      "|    Details: \n",
      "|    - column_layout: flat \n",
      "|    - column_type: str \n",
      "|    - schema_type: variable_shape \n",
      "|    - dtype: <class'str'> \n",
      "|    - backend: 30 \n",
      "|    - backend_options: {} \n",
      "|\n",
      "|  * Column Name: ColumnSchemaKey(column=\"validation_images\", layout=\"flat\") \n",
      "|    Num Data Pieces: 1 \n",
      "|    Details: \n",
      "|    - column_layout: flat \n",
      "|    - column_type: ndarray \n",
      "|    - schema_type: fixed_shape \n",
      "|    - shape: (160, 163, 3) \n",
      "|    - dtype: uint8 \n",
      "|    - backend: 01 \n",
      "|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'} \n",
      "|\n",
      "|  * Column Name: ColumnSchemaKey(column=\"validation_labels\", layout=\"flat\") \n",
      "|    Num Data Pieces: 1 \n",
      "|    Details: \n",
      "|    - column_layout: flat \n",
      "|    - column_type: str \n",
      "|    - schema_type: variable_shape \n",
      "|    - dtype: <class'str'> \n",
      "|    - backend: 30 \n",
      "|    - backend_options: {} \n",
      " \n",
      "================== \n",
      "| Metadata: \n",
      "|----------------- \n",
      "|  Number of Keys: 0 \n",
      "\n"
     ]
    }
   ],
   "source": [
    "repo.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! You've made it until the end of the \"Real World\" Quick Start Tutorial!! 👏🏼\n",
    "\n",
    "Please check out the other tutorials for more advanced stuff such as branching & merging, conflicts resolution and data loaders for TensorFlow and PyTorch!"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}