{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Working With Remote Servers\n", "\n", "This tutorial introduces how to start a remote Hangar server and how to work with [remotes](api.rst#hangar.repository.Remotes) from the client side.\n", "\n", "Particular attention is paid to the concept of ***partial fetch* / *partial clone*** operations. This is a key component of the Hangar design: it provides the ability to quickly and efficiently work with data contained in remote repositories whose full size would be prohibitive for local use under most circumstances.\n", "\n", "*Note:*\n", "\n", "> At the time of writing, the API, user-facing functionality, client-server negotiation protocols, and test coverage of the remotes implementation are generally adequate for this to serve as an \"alpha\" quality preview. However, please be warned that significantly less time has been spent in this module to optimize speed, refactor for simplicity, and assure stability under heavy loads than in the rest of the Hangar core. While we can guarantee that your data is secure on disk, you may experience crashes from time to time when working with remotes. In addition, sending data over the wire should NOT be considered secure in ANY way. No in-transit encryption, user authentication, or secure access limitations are implemented at this moment. We realize the importance of these types of protections, and they are on our radar for the next release cycle. If you are interested in making a contribution to Hangar, this module contains a lot of low-hanging fruit which would provide drastic improvements and act as a good intro to the internal Hangar data model. Please get in touch with us to discuss!"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Starting a Hangar Server\n", "\n", "To start a Hangar server, navigate to the command line and simply execute:\n", "\n", "```\n", "$ hangar server\n", "```\n", "\n", "This will get a local server instance running at `localhost:50051`. The IP and port can be configured by setting the `--ip` and `--port` flags to the desired values in the command line.\n", "\n", "A blocking process will begin in that terminal session. Leave it running while you experiment with connecting from a client repo." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Remotes with a Local Repository\n", "\n", "The [CLI](cli.rst#hangar-cli-documentation) is the easiest way to interact with the remote server from a local repository (though all functionality is mirrored via the [repository API](api.rst#hangar.repository.Remotes); more on that later).\n", "\n", "Before we begin, we will set up a repository with some data, a few commits, two branches, and a merge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Setup a Test Repo\n", "\n", "As usual, we begin by creating a repository and adding some data. This should be familiar to you from previous tutorials."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from hangar import Repository\n", "import numpy as np\n", "from tqdm import tqdm\n", "\n", "testData = np.loadtxt('/Users/rick/projects/tensorwerk/hangar/dev/data/dota2Dataset/dota2Test.csv', delimiter=',', dtype=np.uint8)\n", "trainData = np.loadtxt('/Users/rick/projects/tensorwerk/hangar/dev/data/dota2Dataset/dota2Train.csv', delimiter=',', dtype=np.uint16)\n", "\n", "testName = 'test'\n", "testPrototype = testData[0]\n", "trainName = 'train'\n", "trainPrototype = trainData[0]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hangar Repo initialized at: /Users/rick/projects/tensorwerk/hangar/dev/intro/.hangar\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/rick/projects/tensorwerk/hangar/hangar-py/src/hangar/context.py:94: UserWarning: No repository exists at /Users/rick/projects/tensorwerk/hangar/dev/intro/.hangar, please use `repo.init()` method\n", " warnings.warn(msg, UserWarning)\n" ] } ], "source": [ "repo = Repository('/Users/rick/projects/tensorwerk/hangar/dev/intro/')\n", "repo.init(user_name='Rick Izzo', user_email='rick@tensorwerk.com', remove_old=True)\n", "co = repo.checkout(write=True)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "10500it [00:02, 4286.17it/s] \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 (\u001B[1;31madd-train\u001B[m) (\u001B[1;31mmaster\u001B[m) : initial commit on master with test data\n" ] } ], "source": [ "co.add_ndarray_column(testName, prototype=testPrototype)\n", "testcol = co.columns[testName]\n", "\n", "pbar = tqdm(total=testData.shape[0])\n", "with testcol as tcol:\n", "    for gameIdx, gameData in enumerate(testData):\n", "        if (gameIdx % 500 == 0):\n", "            pbar.update(500)\n", "        tcol.append(gameData)\n", "pbar.close()\n", "\n", "co.commit('initial commit on master with test data')\n", "\n", "repo.create_branch('add-train')\n", "co.close()\n", "repo.log()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "93000it [00:22, 4078.73it/s] \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "* a=957d20e4b921f41975591cc8ee51a4a6912cb919 (\u001B[1;31madd-train\u001B[m) : added training data on another branch\n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 (\u001B[1;31mmaster\u001B[m) : initial commit on master with test data\n" ] } ], "source": [ "co = repo.checkout(write=True, branch='add-train')\n", "\n", "co.add_ndarray_column(trainName, prototype=trainPrototype)\n", "traincol = co.columns[trainName]\n", "\n", "pbar = tqdm(total=trainData.shape[0])\n", "with traincol as trcol:\n", "    for gameIdx, gameData in enumerate(trainData):\n", "        if (gameIdx % 500 == 0):\n", "            pbar.update(500)\n", "        trcol.append(gameData)\n", "pbar.close()\n", "\n", "co.commit('added training data on another branch')\n", "co.close()\n", "repo.log()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d (\u001B[1;31mmaster\u001B[m) : more changes here\n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 : initial commit on master with test data\n" ] } ], "source": [ "co = repo.checkout(write=True, branch='master')\n", "co.metadata['earaea'] = 'eara'\n", "co.commit('more changes here')\n", "co.close()\n", "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pushing to a Remote" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the [API remote add()](api.rst#hangar.repository.Remotes.add) method to add a remote; however, this can also be done with the [CLI
command](cli.rst#hangar-remote-add):\n", "\n", "    $ hangar remote add origin localhost:50051" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RemoteInfo(name='origin', address='localhost:50051')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repo.remote.add('origin', 'localhost:50051')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pushing is as simple as running the [push()](api.rst#hangar.repository.Remotes.push) method\n", "from the [API](api.rst#hangar.repository.Remotes.push) or [CLI](cli.rst#hangar-push):\n", "\n", "    $ hangar push origin master" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Push the `master` branch:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "counting objects: 100%|██████████| 2/2 [00:00<00:00, 5.47it/s]\n", "pushing schemas: 100%|██████████| 1/1 [00:00<00:00, 133.74it/s]\n", "pushing data: 97%|█████████▋| 10001/10294 [00:01<00:00, 7676.23it/s]\n", "pushing metadata: 100%|██████████| 1/1 [00:00<00:00, 328.50it/s]\n", "pushing commit refs: 100%|██████████| 2/2 [00:00<00:00, 140.73it/s]\n" ] }, { "data": { "text/plain": [ "'master'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repo.remote.push('origin', 'master')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Push the `add-train` branch:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "counting objects: 100%|██████████| 1/1 [00:01<00:00, 1.44s/it]\n", "pushing schemas: 100%|██████████| 1/1 [00:00<00:00, 126.05it/s]\n", "pushing data: 99%|█████████▉| 92001/92650 [00:12<00:00, 7107.60it/s] \n", "pushing metadata: 0it [00:00, ?it/s]\n", "pushing commit refs: 100%|██████████| 1/1 [00:00<00:00, 17.05it/s]\n" ] }, { "data": {
"text/plain": [ "'add-train'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repo.remote.push('origin', 'add-train')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Details of the Negotiation Process\n", "\n", "> The following details are not necessary to use the system, but may be of interest to some readers.\n", "\n", "When we push data, **we perform a negotiation with the server** which basically occurs like this:\n", "\n", "\n", "- Hi, I would like to push this branch, do you have it?\n", "\n", "  - If yes, what is the latest commit you record on it?\n", "\n", "    - Is that the same commit I'm trying to push? If yes, abort.\n", "\n", "    - Is that a commit I don't have? If yes, someone else has updated that branch, abort.\n", "\n", "- Here are the commit digests which are parents of my branch's head; which commits are you missing?\n", "\n", "- Ok great, I'm going to scan through each of those commits to find the data hashes they contain. Tell me which ones you are missing.\n", "\n", "- Thanks, now I'll send you all of the data corresponding to those hashes. It might be a lot of data, so we'll handle this in batches so that if my connection cuts out, we can resume this later.\n", "\n", "- Now that you have the data, I'm going to send the actual commit references for you to store. This isn't that much information, but you'll be sure to verify that I'm not trying to pull any funny business and send you incorrect data.\n", "\n", "- Now that you've received everything, and have verified it matches what I told you it is, go ahead and make those commits I've pushed `available` as the `HEAD` of the branch I just sent. It's some good work that others will want!\n", "\n", "\n", "When we want to fetch updates to a branch, essentially the exact same thing happens in reverse.
Instead of asking the server what it doesn't have, we ask it what it does have, and then request the stuff that we are missing!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Partial Fetching and Clones\n", "\n", "**Now we will introduce one of the most important and unique features of Hangar remotes: Partial fetch/clone of data!**\n", "\n", "*There is a very real problem with keeping the full history of data - **it's huge**!* The size of the data can very easily exceed what fits on (most) contributors' laptops or personal workstations. This section explains how Hangar can handle working with columns which are prohibitively large to download or store on a single machine.\n", "\n", "As mentioned in High Performance From Simplicity, under the hood Hangar deals with “Data” and “Bookkeeping” completely separately. We’ve previously covered what exactly we mean by Data in How Hangar Thinks About Data, so we’ll briefly cover the second major component of Hangar here.\n", "In short, “Bookkeeping” describes everything about the repository. By everything, we do mean that the Bookkeeping records describe everything: all commits, parents, branches, columns, samples, data descriptors, schemas, commit messages, etc. Though complete, these records are fairly small (tens of MB in size for decently sized repositories with decent history), and are highly compressed for fast transfer between a Hangar client/server.\n", "\n", "A brief technical interlude\n", "\n", "> There is one very important (and rather complex) property which gives Hangar Bookkeeping massive power: the existence of some data piece is always known to Hangar and stored immutably once committed.
However, the access pattern, backend, and locating information for this data piece may (and over time, will) be unique in every Hangar repository instance.\n", ">\n", "> Though the details of how this works are well beyond the scope of this document, the following example may provide some insight into the implications of this property:\n", ">\n", "> If you clone some Hangar repository, Bookkeeping says that “some number of data pieces exist” and that they should be retrieved from the server. However, the bookkeeping records transferred in a fetch / push / clone operation do not include information about where that piece of data existed on the client (or server) computer. Two synced repositories can use completely different backends to store the data, in completely different locations, and it does not matter - Hangar only guarantees that when collaborators ask for a data sample in some checkout, they will be provided with identical arrays, not that the arrays will come from the same place or be stored in the same way. Only when data is actually retrieved is the “locating information” set for that repository instance.\n", "\n", "Because Hangar makes no assumptions about how/where it should retrieve some piece of data, or even an assumption that it exists on the local machine, and because records are small and completely describe history, once a machine has the Bookkeeping, it can decide what data it actually wants to materialize on its local disk! These partial fetch / partial clone operations can materialize any desired data, whether it be a few records at the head of a branch, all data in a commit, or the entire historical data. A future release will even include the ability to stream data directly to a Hangar checkout and materialize the data in memory without having to save it to disk at all!\n", "\n", "More importantly: since Bookkeeping describes all history, merging can be performed between branches which may contain partial (or even no) actual data.
Aka **you don’t need data on disk to merge changes into it.** It’s an odd concept which will be shown in this tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cloning a Remote Repo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " $ hangar clone localhost:50051" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/rick/projects/tensorwerk/hangar/hangar-py/src/hangar/context.py:94: UserWarning: No repository exists at /Users/rick/projects/tensorwerk/hangar/dev/dota-clone/.hangar, please use `repo.init()` method\n", " warnings.warn(msg, UserWarning)\n" ] } ], "source": [ "cloneRepo = Repository('/Users/rick/projects/tensorwerk/hangar/dev/dota-clone/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we perform the initial clone, we will only receive the `master` branch by default." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "fetching commit data refs: 0%| | 0/2 [00:00 does not exist on this machine. 
Perform a `data-fetch` operation to retrieve it from the remote server.", "output_type": "error", "traceback": [ "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m", "\u001B[0;31mFileNotFoundError\u001B[0m Traceback (most recent call last)", "\u001B[0;32m\u001B[0m in \u001B[0;36m\u001B[0;34m\u001B[0m\n\u001B[0;32m----> 1\u001B[0;31m \u001B[0mco\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mcolumns\u001B[0m\u001B[0;34m[\u001B[0m\u001B[0;34m'test'\u001B[0m\u001B[0;34m]\u001B[0m\u001B[0;34m[\u001B[0m\u001B[0mtestKey\u001B[0m\u001B[0;34m]\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m", "\u001B[0;32m~/projects/tensorwerk/hangar/hangar-py/src/hangar/columns/layout_flat.py\u001B[0m in \u001B[0;36m__getitem__\u001B[0;34m(self, key)\u001B[0m\n\u001B[1;32m 222\u001B[0m \"\"\"\n\u001B[1;32m 223\u001B[0m \u001B[0mspec\u001B[0m \u001B[0;34m=\u001B[0m \u001B[0mself\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0m_samples\u001B[0m\u001B[0;34m[\u001B[0m\u001B[0mkey\u001B[0m\u001B[0;34m]\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0;32m--> 224\u001B[0;31m \u001B[0;32mreturn\u001B[0m \u001B[0mself\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0m_be_fs\u001B[0m\u001B[0;34m[\u001B[0m\u001B[0mspec\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mbackend\u001B[0m\u001B[0;34m]\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mread_data\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mspec\u001B[0m\u001B[0;34m)\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m\u001B[1;32m 225\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 226\u001B[0m \u001B[0;32mdef\u001B[0m \u001B[0mget\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mself\u001B[0m\u001B[0;34m,\u001B[0m \u001B[0mkey\u001B[0m\u001B[0;34m:\u001B[0m \u001B[0mKeyType\u001B[0m\u001B[0;34m,\u001B[0m \u001B[0mdefault\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0;32mNone\u001B[0m\u001B[0;34m)\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n",
"\u001B[0;32m~/projects/tensorwerk/hangar/hangar-py/src/hangar/backends/remote_50.py\u001B[0m in \u001B[0;36mread_data\u001B[0;34m(self, hashVal)\u001B[0m\n\u001B[1;32m 172\u001B[0m \u001B[0;32mdef\u001B[0m \u001B[0mread_data\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mself\u001B[0m\u001B[0;34m,\u001B[0m \u001B[0mhashVal\u001B[0m\u001B[0;34m:\u001B[0m \u001B[0mREMOTE_50_DataHashSpec\u001B[0m\u001B[0;34m)\u001B[0m \u001B[0;34m->\u001B[0m \u001B[0;32mNone\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 173\u001B[0m raise FileNotFoundError(\n\u001B[0;32m--> 174\u001B[0;31m \u001B[0;34mf'data hash spec: {REMOTE_50_DataHashSpec} does not exist on this machine. '\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m\u001B[1;32m 175\u001B[0m f'Perform a `data-fetch` operation to retrieve it from the remote server.')\n\u001B[1;32m 176\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n", "\u001B[0;31mFileNotFoundError\u001B[0m: data hash spec: does not exist on this machine. Perform a `data-fetch` operation to retrieve it from the remote server." ] } ], "source": [ "co.columns['test'][testKey]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Fetching Data from a Remote\n", "\n", "To retrieve the data, we use the [fetch_data()](api.rst#hangar.repository.Remotes.fetch_data)\n", "method (accessible via the [API](api.rst#hangar.repository.Remotes.fetch_data) or\n", "[fetch-data](cli.rst#hangar-fetch-data) via the CLI).\n", "\n", "The amount / type of data to retrieve is extremely configurable via the following options:\n", "\n", ".. include:: ./noindexapi/apiremotefetchdata.rst\n", "\n", "The call below retrieves all the data on the `master` branch, but not the data on the `add-train` branch."
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "counting objects: 100%|██████████| 1/1 [00:00<00:00, 27.45it/s]\n", "fetching data: 100%|██████████| 10294/10294 [00:01<00:00, 6664.60it/s]\n" ] }, { "data": { "text/plain": [ "['a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cloneRepo.remote.fetch_data('origin', branch='master')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * Checking out BRANCH: master with current HEAD: a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d\n" ] } ], "source": [ "co = cloneRepo.checkout(branch='master')" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar ReaderCheckout \n", " Writer : False \n", " Commit Hash : a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d \n", " Num Columns : 1 \n", " Num Metadata : 1\n" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike before, the `repr` now shows that there are no partial remote references:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar Columns \n", " Writeable : False \n", " Number of Columns : 1 \n", " Column Names / Partial Remote References: \n", " - test / False" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar FlatSampleReader \n", " Column Name : test \n", " Writeable : False \n", " Column Type : ndarray \n", " Column Layout : flat \n", " Schema Type : fixed_shape \n", " DType : uint8 \n", " Shape : (117,) \n", " Number of
Samples : 10294 \n", " Partial Remote Data Refs : False\n" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns['test']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***When we access the data this time, it is available and retrieved as requested!***" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([255, 223, 8, 2, 0, 255, 0, 0, 0, 0, 0, 0, 1,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 1, 0, 0, 0, 255, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 255, 0, 0,\n", " 0, 0, 0, 0, 0, 255, 0, 0, 0, 0, 1, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n", " dtype=uint8)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co['test', testKey]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "co.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Working with mixed local / remote checkout Data\n", "\n", "If we were to check out the `add-train` branch now, we would see that there is no local data for the `train` column, but that the data common to the ancestor that `master` and `add-train` share is available."
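, "\n",
"\n",
"The `FileNotFoundError` traceback shown earlier hints at how this mixing works: each read is dispatched on a per-sample backend code (`self._be_fs[spec.backend].read_data(spec)`), and the `remote_50` backend simply raises. The following is a minimal sketch of that idea under stated assumptions, not Hangar's actual implementation. `SampleSpec`, `LocalReader`, `RemoteReferenceReader`, `read_sample`, `has_partial_references`, and the digests are hypothetical names; the code `'50'` is borrowed from the `remote_50.py` module name, and `'10'` is assumed to denote some local backend.\n",
"\n",
"```python\n",
"from collections import namedtuple\n",
"\n",
"# Hypothetical record: which backend holds a sample, and its data digest.\n",
"SampleSpec = namedtuple('SampleSpec', ['backend', 'digest'])\n",
"\n",
"\n",
"class LocalReader:\n",
"    # Stand-in for a backend whose data is materialized on this machine.\n",
"    def __init__(self, store):\n",
"        self._store = store  # digest -> data\n",
"\n",
"    def read_data(self, spec):\n",
"        return self._store[spec.digest]\n",
"\n",
"\n",
"class RemoteReferenceReader:\n",
"    # Stand-in for a reference-only backend: the record exists, the data does not.\n",
"    def read_data(self, spec):\n",
"        raise FileNotFoundError(f'data for {spec.digest} is not on this machine')\n",
"\n",
"\n",
"def read_sample(samples, readers, key):\n",
"    # Dispatch on the backend code stored with the sample record.\n",
"    spec = samples[key]\n",
"    return readers[spec.backend].read_data(spec)\n",
"\n",
"\n",
"def has_partial_references(samples):\n",
"    # True if any sample is only a reference to data held on a remote.\n",
"    return any(spec.backend == '50' for spec in samples.values())\n",
"\n",
"\n",
"readers = {'10': LocalReader({'h-test-0': [255, 223, 8]}), '50': RemoteReferenceReader()}\n",
"test_samples = {0: SampleSpec('10', 'h-test-0')}    # fetched, readable\n",
"train_samples = {0: SampleSpec('50', 'h-train-0')}  # reference only\n",
"```\n",
"\n",
"In this sketch, reading from `test_samples` succeeds while reading from `train_samples` raises `FileNotFoundError`, mirroring the behavior demonstrated in the cells that follow."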
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=957d20e4b921f41975591cc8ee51a4a6912cb919 (\u001B[1;31madd-train\u001B[m) (\u001B[1;31morigin/add-train\u001B[m) : added training data on another branch\n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 : initial commit on master with test data\n" ] } ], "source": [ "cloneRepo.log('add-train')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the common ancestor is commit: `a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To show that there is no data for the `train` column on the `add-train` branch:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * Checking out BRANCH: add-train with current HEAD: a=957d20e4b921f41975591cc8ee51a4a6912cb919\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/rick/projects/tensorwerk/hangar/hangar-py/src/hangar/columns/constructors.py:45: UserWarning: Column: train contains `reference-only` samples, with actual data residing on a remote server.
A `fetch-data` operation is required to access these samples.\n", " f'operation is required to access these samples.', UserWarning)\n" ] } ], "source": [ "co = cloneRepo.checkout(branch='add-train')" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar ReaderCheckout \n", " Writer : False \n", " Commit Hash : a=957d20e4b921f41975591cc8ee51a4a6912cb919 \n", " Num Columns : 2 \n", " Num Metadata : 0\n" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar Columns \n", " Writeable : False \n", " Number of Columns : 2 \n", " Column Names / Partial Remote References: \n", " - test / False\n", " - train / True" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([255, 223, 8, 2, 0, 255, 0, 0, 0, 0, 0, 0, 1,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 1, 0, 0, 0, 255, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 255, 0, 0,\n", " 0, 0, 0, 0, 0, 255, 0, 0, 0, 0, 1, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n", " dtype=uint8)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co['test', testKey]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "trainKey = next(co.columns['train'].keys())" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "ename": "FileNotFoundError", "evalue": "data hash spec: does not exist on this machine. 
Perform a `data-fetch` operation to retrieve it from the remote server.", "output_type": "error", "traceback": [ "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m", "\u001B[0;31mFileNotFoundError\u001B[0m Traceback (most recent call last)", "\u001B[0;32m\u001B[0m in \u001B[0;36m\u001B[0;34m\u001B[0m\n\u001B[0;32m----> 1\u001B[0;31m \u001B[0mco\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mcolumns\u001B[0m\u001B[0;34m[\u001B[0m\u001B[0;34m'train'\u001B[0m\u001B[0;34m]\u001B[0m\u001B[0;34m[\u001B[0m\u001B[0mtrainKey\u001B[0m\u001B[0;34m]\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m", "\u001B[0;32m~/projects/tensorwerk/hangar/hangar-py/src/hangar/columns/layout_flat.py\u001B[0m in \u001B[0;36m__getitem__\u001B[0;34m(self, key)\u001B[0m\n\u001B[1;32m 222\u001B[0m \"\"\"\n\u001B[1;32m 223\u001B[0m \u001B[0mspec\u001B[0m \u001B[0;34m=\u001B[0m \u001B[0mself\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0m_samples\u001B[0m\u001B[0;34m[\u001B[0m\u001B[0mkey\u001B[0m\u001B[0;34m]\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0;32m--> 224\u001B[0;31m \u001B[0;32mreturn\u001B[0m \u001B[0mself\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0m_be_fs\u001B[0m\u001B[0;34m[\u001B[0m\u001B[0mspec\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mbackend\u001B[0m\u001B[0;34m]\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mread_data\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mspec\u001B[0m\u001B[0;34m)\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m\u001B[1;32m 225\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 226\u001B[0m \u001B[0;32mdef\u001B[0m \u001B[0mget\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mself\u001B[0m\u001B[0;34m,\u001B[0m \u001B[0mkey\u001B[0m\u001B[0;34m:\u001B[0m \u001B[0mKeyType\u001B[0m\u001B[0;34m,\u001B[0m \u001B[0mdefault\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0;32mNone\u001B[0m\u001B[0;34m)\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n",
"\u001B[0;32m~/projects/tensorwerk/hangar/hangar-py/src/hangar/backends/remote_50.py\u001B[0m in \u001B[0;36mread_data\u001B[0;34m(self, hashVal)\u001B[0m\n\u001B[1;32m 172\u001B[0m \u001B[0;32mdef\u001B[0m \u001B[0mread_data\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mself\u001B[0m\u001B[0;34m,\u001B[0m \u001B[0mhashVal\u001B[0m\u001B[0;34m:\u001B[0m \u001B[0mREMOTE_50_DataHashSpec\u001B[0m\u001B[0;34m)\u001B[0m \u001B[0;34m->\u001B[0m \u001B[0;32mNone\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 173\u001B[0m raise FileNotFoundError(\n\u001B[0;32m--> 174\u001B[0;31m \u001B[0;34mf'data hash spec: {REMOTE_50_DataHashSpec} does not exist on this machine. '\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m\u001B[1;32m 175\u001B[0m f'Perform a `data-fetch` operation to retrieve it from the remote server.')\n\u001B[1;32m 176\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n", "\u001B[0;31mFileNotFoundError\u001B[0m: data hash spec: does not exist on this machine. Perform a `data-fetch` operation to retrieve it from the remote server." ] } ], "source": [ "co.columns['train'][trainKey]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "co.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging Branches with Partial Data\n", "\n", "Even though we don't have the actual data referenced in the `add-train` branch, it is still possible to merge the two branches!\n", "\n", "This is possible because Hangar doesn't use the data contents in its internal model of checkouts / commits, but instead thinks of a checkout as a sequence of columns / metadata / keys & their associated data hashes (which are very small text records; i.e. \"bookkeeping\"). To show this in action, let's merge the two branches `master` (containing all data locally) and `add-train` (containing partial remote references for the `train` column) together and push the result to the remote!"
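, "\n",
"\n",
"To make the idea concrete, here is a minimal sketch of a three-way merge that operates purely on bookkeeping-style records (`{key: data hash}` mappings) and never reads any array data. It is an illustration under simplifying assumptions, not Hangar's merge implementation: the function name `three_way_merge`, the record layout, and the hash strings are all hypothetical.\n",
"\n",
"```python\n",
"def three_way_merge(base, ours, theirs):\n",
"    # Merge {key: data_hash} records relative to a common ancestor `base`.\n",
"    merged = {}\n",
"    for key in set(base) | set(ours) | set(theirs):\n",
"        b, o, t = base.get(key), ours.get(key), theirs.get(key)\n",
"        if o == t:\n",
"            winner = o   # both sides agree (or both removed the key)\n",
"        elif o == b:\n",
"            winner = t   # only their side changed this key\n",
"        elif t == b:\n",
"            winner = o   # only our side changed this key\n",
"        else:\n",
"            raise ValueError(f'conflict on {key!r}')\n",
"        if winner is not None:\n",
"            merged[key] = winner\n",
"    return merged\n",
"\n",
"\n",
"# Placeholder records loosely modeled on this tutorial's branches.\n",
"base = {('test', 0): 'h0'}\n",
"ours = {('test', 0): 'h0', ('metadata', 'earaea'): 'h9'}  # master\n",
"theirs = {('test', 0): 'h0', ('train', 0): 'h1'}          # add-train\n",
"merged = three_way_merge(base, ours, theirs)\n",
"```\n",
"\n",
"Only the hashes are compared, so it makes no difference whether the array behind `'h1'` is on local disk or still on the remote."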
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d (\u001B[1;31mmaster\u001B[m) (\u001B[1;31morigin/master\u001B[m) : more changes here\n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 : initial commit on master with test data\n" ] } ], "source": [ "cloneRepo.log('master')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=957d20e4b921f41975591cc8ee51a4a6912cb919 (\u001B[1;31madd-train\u001B[m) (\u001B[1;31morigin/add-train\u001B[m) : added training data on another branch\n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 : initial commit on master with test data\n" ] } ], "source": [ "cloneRepo.log('add-train')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Perform the Merge**" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected 3-Way Merge Strategy\n" ] }, { "data": { "text/plain": [ "'a=ace3dacbd94f475664ee136dcf05430a2895aca3'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cloneRepo.merge('merge commit here', 'master', 'add-train')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**IT WORKED!**" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=ace3dacbd94f475664ee136dcf05430a2895aca3 (\u001B[1;31mmaster\u001B[m) : merge commit here\n", "\u001B[1;31m|\u001B[m\u001B[1;32m\\\u001B[m \n", "* \u001B[1;32m|\u001B[m a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d (\u001B[1;31morigin/master\u001B[m) : more changes here\n", "\u001B[1;32m|\u001B[m * a=957d20e4b921f41975591cc8ee51a4a6912cb919 (\u001B[1;31madd-train\u001B[m) (\u001B[1;31morigin/add-train\u001B[m) : added training data on another 
branch\n", "\u001B[1;32m|\u001B[m\u001B[1;32m/\u001B[m \n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 : initial commit on master with test data\n" ] } ], "source": [ "cloneRepo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can inspect the summary of the `master` commit to verify that the contents are what we expect (containing both `test` and `train` columns)." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Summary of Contents Contained in Data Repository \n", " \n", "================== \n", "| Repository Info \n", "|----------------- \n", "| Base Directory: /Users/rick/projects/tensorwerk/hangar/dev/dota-clone \n", "| Disk Usage: 42.03 MB \n", " \n", "=================== \n", "| Commit Details \n", "------------------- \n", "| Commit: a=ace3dacbd94f475664ee136dcf05430a2895aca3 \n", "| Created: Tue Feb 25 19:18:30 2020 \n", "| By: rick izzo \n", "| Email: rick@tensorwerk.com \n", "| Message: merge commit here \n", " \n", "================== \n", "| DataSets \n", "|----------------- \n", "| Number of Named Columns: 2 \n", "|\n", "| * Column Name: ColumnSchemaKey(column=\"test\", layout=\"flat\") \n", "| Num Data Pieces: 10294 \n", "| Details: \n", "| - column_layout: flat \n", "| - column_type: ndarray \n", "| - schema_type: fixed_shape \n", "| - shape: (117,) \n", "| - dtype: uint8 \n", "| - backend: 10 \n", "| - backend_options: {} \n", "|\n", "| * Column Name: ColumnSchemaKey(column=\"train\", layout=\"flat\") \n", "| Num Data Pieces: 92650 \n", "| Details: \n", "| - column_layout: flat \n", "| - column_type: ndarray \n", "| - schema_type: fixed_shape \n", "| - shape: (117,) \n", "| - dtype: uint16 \n", "| - backend: 10 \n", "| - backend_options: {} \n", " \n", "================== \n", "| Metadata: \n", "|----------------- \n", "| Number of Keys: 1 \n", "\n" ] } ], "source": [ "cloneRepo.summary()" ] }, { "cell_type": "markdown", "metadata": {},
"source": [ "#### Pushing the Merge back to the Remote\n", "\n", "To push this merge back to our original copy of the Repository (`repo`), we just push the `master` branch back to the remote via the API or CLI." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "counting objects: 100%|██████████| 1/1 [00:00<00:00, 1.02it/s]\n", "pushing schemas: 0it [00:00, ?it/s]\n", "pushing data: 0it [00:00, ?it/s]\n", "pushing metadata: 0it [00:00, ?it/s]\n", "pushing commit refs: 100%|██████████| 1/1 [00:00<00:00, 34.26it/s]\n" ] }, { "data": { "text/plain": [ "'master'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cloneRepo.remote.push('origin', 'master')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the current state of our other instance of the repository (`repo`), we see that the merge changes haven't yet propagated to it (since it hasn't fetched from the remote yet)." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d (\u001B[1;31mmaster\u001B[m) (\u001B[1;31morigin/master\u001B[m) : more changes here\n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 : initial commit on master with test data\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fetch the merged changes, just [fetch()](api.rst#hangar.repository.Remotes.fetch) the branch as normal. Like all fetches, this will be a fast operation, as it is a `partial fetch` operation which does not actually transfer the data."
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "fetching commit data refs: 100%|██████████| 1/1 [00:01<00:00, 1.33s/it]\n", "fetching commit spec: 100%|██████████| 1/1 [00:00<00:00, 37.61it/s]\n" ] }, { "data": { "text/plain": [ "'origin/master'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repo.remote.fetch('origin', 'master')" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=ace3dacbd94f475664ee136dcf05430a2895aca3 (\u001B[1;31morigin/master\u001B[m) : merge commit here\n", "\u001B[1;31m|\u001B[m\u001B[1;32m\\\u001B[m \n", "* \u001B[1;32m|\u001B[m a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d (\u001B[1;31mmaster\u001B[m) : more changes here\n", "\u001B[1;32m|\u001B[m * a=957d20e4b921f41975591cc8ee51a4a6912cb919 (\u001B[1;31madd-train\u001B[m) (\u001B[1;31morigin/add-train\u001B[m) : added training data on another branch\n", "\u001B[1;32m|\u001B[m\u001B[1;32m/\u001B[m \n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 : initial commit on master with test data\n" ] } ], "source": [ "repo.log('origin/master')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To bring our `master` branch up to date is a simple fast-forward merge." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected Fast-Forward Merge Strategy\n" ] }, { "data": { "text/plain": [ "'a=ace3dacbd94f475664ee136dcf05430a2895aca3'" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repo.merge('ff-merge', 'master', 'origin/master')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=ace3dacbd94f475664ee136dcf05430a2895aca3 (\u001B[1;31mmaster\u001B[m) (\u001B[1;31morigin/master\u001B[m) : merge commit here\n", "\u001B[1;31m|\u001B[m\u001B[1;32m\\\u001B[m \n", "* \u001B[1;32m|\u001B[m a=bb1b108ef17b7d7667a2ff396f257d82bad11e1d : more changes here\n", "\u001B[1;32m|\u001B[m * a=957d20e4b921f41975591cc8ee51a4a6912cb919 (\u001B[1;31madd-train\u001B[m) (\u001B[1;31morigin/add-train\u001B[m) : added training data on another branch\n", "\u001B[1;32m|\u001B[m\u001B[1;32m/\u001B[m \n", "* a=b98f6b65c0036489e53ddaf2b30bf797ddc40da0 : initial commit on master with test data\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Everything is as it should be!** Now, try it out for yourself!" 
] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Summary of Contents Contained in Data Repository \n", " \n", "================== \n", "| Repository Info \n", "|----------------- \n", "| Base Directory: /Users/rick/projects/tensorwerk/hangar/dev/intro \n", "| Disk Usage: 77.43 MB \n", " \n", "=================== \n", "| Commit Details \n", "------------------- \n", "| Commit: a=ace3dacbd94f475664ee136dcf05430a2895aca3 \n", "| Created: Tue Feb 25 19:18:30 2020 \n", "| By: rick izzo \n", "| Email: rick@tensorwerk.com \n", "| Message: merge commit here \n", " \n", "================== \n", "| DataSets \n", "|----------------- \n", "| Number of Named Columns: 2 \n", "|\n", "| * Column Name: ColumnSchemaKey(column=\"test\", layout=\"flat\") \n", "| Num Data Pieces: 10294 \n", "| Details: \n", "| - column_layout: flat \n", "| - column_type: ndarray \n", "| - schema_type: fixed_shape \n", "| - shape: (117,) \n", "| - dtype: uint8 \n", "| - backend: 10 \n", "| - backend_options: {} \n", "|\n", "| * Column Name: ColumnSchemaKey(column=\"train\", layout=\"flat\") \n", "| Num Data Pieces: 92650 \n", "| Details: \n", "| - column_layout: flat \n", "| - column_type: ndarray \n", "| - schema_type: fixed_shape \n", "| - shape: (117,) \n", "| - dtype: uint16 \n", "| - backend: 10 \n", "| - backend_options: {} \n", " \n", "================== \n", "| Metadata: \n", "|----------------- \n", "| Number of Keys: 1 \n", "\n" ] } ], "source": [ "repo.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, 
"nbformat_minor": 4 }