{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Checkouts, Branching, & Merging\n", "\n", "This section deals with navigating repository history, creating & merging\n", "branches, and understanding conflicts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Hangar Workflow\n", "\n", "The Hangar workflow is intended to mimic common ``git`` workflows in which small\n", "incremental changes are made and committed on dedicated ``topic`` branches.\n", "After the ``topic`` has been adequately developed, the ``topic`` branch is merged into\n", "a separate branch (commonly referred to as ``master``, though it need not be the\n", "actual branch named ``\"master\"``), where well-vetted and more permanent changes\n", "are kept.\n", "\n", " Create Branch -> Checkout Branch -> Make Changes -> Commit\n", "\n", "#### Making the Initial Commit\n", "\n", "Let's initialize a new repository and see how branching works in Hangar:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from hangar import Repository\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "repo = Repository(path='/Users/rick/projects/tensorwerk/hangar/dev/mnist/')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hangar Repo initialized at: /Users/rick/projects/tensorwerk/hangar/dev/mnist/.hangar\n" ] } ], "source": [ "repo_pth = repo.init(user_name='Test User', user_email='test@foo.com', remove_old=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a repository is first initialized, it has no history, no commits." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "repo.log() # -> returns None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Though the repository is essentially empty at this point in time, there is one\n", "thing which is present: a branch with the name ``\"master\"``." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['master']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repo.list_branches()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This ``\"master\"`` is the branch we make our first commit on; until we do, the\n", "repository is in a semi-unstable state. With no history or contents, most of the\n", "functionality of a repository (to store, retrieve, and work with versions of\n", "data across time) just isn't possible. A significant portion of otherwise\n", "standard operations will generally flat out refuse to execute (i.e. read-only\n", "checkouts, log, push, etc.) until the first commit is made.\n", "\n", "One of the only options available at this point is to create a\n", "write-enabled checkout on the ``\"master\"`` branch and to begin to add data so we\n", "can make a commit. Let’s do that now:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "co = repo.checkout(write=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, there are no columns nor metadata samples recorded in the checkout." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of metadata keys: 0\n", "number of columns: 0\n" ] } ], "source": [ "print(f'number of metadata keys: {len(co.metadata)}')\n", "print(f'number of columns: {len(co.columns)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s add a dummy array just to put something in the repository history to\n", "commit. We'll then close the checkout so we can explore some useful tools which\n", "depend on having at least one historical record (commit) in the repo." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "dummy = np.arange(10, dtype=np.uint16)\n", "col = co.add_ndarray_column('dummy_column', prototype=dummy)\n", "col['0'] = dummy\n", "initialCommitHash = co.commit('first commit with a single sample added to a dummy column')\n", "co.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we check the history now, we can see our first commit hash, and that it is labeled with the branch name `\"master\"`" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e (\u001B[1;31mmaster\u001B[m) : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now our repository contains:\n", "- [A commit](api.rst#hangar.checkout.WriterCheckout.commit_hash): a fully\n", " independent description of the entire repository state as\n", " it existed at some point in time. 
A commit is identified by a `commit_hash`.\n", "- [A branch](api.rst#hangar.checkout.WriterCheckout.branch_name): a label\n", " pointing to a particular `commit` / `commit_hash`.\n", "\n", "Once committed, it is not possible to remove, modify, or otherwise tamper with\n", "the contents of a commit in any way. It is a permanent record, which Hangar has\n", "no method to change once written to disk.\n", "\n", "In addition, as a `commit_hash` is calculated not only from the `commit`’s\n", "contents, but also from the `commit_hash` of its parents (more on this to follow),\n", "knowing a single top-level `commit_hash` allows us to verify the integrity of\n", "the entire repository history. This fundamental behavior holds even in cases of\n", "disk corruption or malicious use.\n", "\n", "### Working with Checkouts & Branches\n", "\n", "As mentioned in the first tutorial, we work with the data in a repository through\n", "a [checkout](api.rst#hangar.repository.Repository.checkout). There are two types\n", "of checkouts (each of which has different uses and abilities):\n", "\n", "**[Checking out a branch / commit for reading:](api.rst#read-only-checkout)** is\n", "the process of retrieving records describing repository state at some point in\n", "time, and setting up access to the referenced data.\n", "\n", "- Any number of read checkout processes can operate on a repository (on\n", " any number of commits) at the same time.\n", "\n", "**[Checking out a branch for writing:](api.rst#write-enabled-checkout)** is the\n", "process of setting up a (mutable) ``staging area`` to temporarily gather\n", "record references / data before all changes have been made and staging area\n", "contents are committed in a new permanent record of history (a `commit`).\n", "\n", "- Only one write-enabled checkout can ever be operating in a repository\n", " at a time.\n", "- When initially creating the checkout, the `staging area` is not\n", " actually “empty”. 
Instead, it has the full contents of the last `commit`\n", " referenced by a branch’s `HEAD`. These records can be removed / mutated / added\n", " to in any way to form the next `commit`. The new `commit` retains a\n", " permanent reference identifying the previous ``HEAD`` ``commit`` which was used as\n", " its base `staging area`.\n", "- On commit, the branch which was checked out has its ``HEAD`` pointer\n", " value updated to the new `commit`’s `commit_hash`. A write-enabled\n", " checkout starting from the same branch will now use that `commit`’s\n", " record content as the base for its `staging area`.\n", "\n", "#### Creating a branch\n", "\n", "A branch is an individual series of changes / commits which diverges from the main\n", "history of the repository at some point in time. All changes made along a branch\n", "are completely isolated from those on other branches. After some point in time,\n", "changes made in disparate branches can be unified through an automatic\n", "`merge` process (described in detail later in this tutorial). In general, the\n", "`Hangar` branching model is semantically identical to the `Git` one; the one exception\n", "is that in Hangar, a branch must always have a `name` and a `base_commit` (no\n", "\"Detached HEAD state\" is possible for a `write-enabled` checkout). If no `base_commit` is\n", "specified, the current writer branch `HEAD` `commit` is used as the `base_commit`\n", "hash for the branch automatically.\n", "\n", "Hangar branches have the same lightweight and performant properties which\n", "make working with `Git` branches so appealing - they are cheap and easy to use,\n", "create, and discard (if necessary).\n", "\n", "To create a branch, use the [create_branch()](api.rst#hangar.repository.Repository.create_branch)\n", "method." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "branch_1 = repo.create_branch(name='testbranch')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BranchHead(name='testbranch', digest='a=eaee002ed9c6e949c3657bd50e3949d6a459d50e')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "branch_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the [list_branches()](api.rst#hangar.repository.Repository.list_branches) and [log()](api.rst#hangar.repository.Repository.log) methods to see that a new branch named `testbranch` has been created and is indeed pointing to our initial commit." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "branch names: ['master', 'testbranch'] \n", "\n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e (\u001B[1;31mmaster\u001B[m) (\u001B[1;31mtestbranch\u001B[m) : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "print(f'branch names: {repo.list_branches()} \\n')\n", "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If instead, we actually specify the base commit (with a different branch\n", "name) we see we do actually get a third branch, 
pointing to the same commit as\n", "`master` and `testbranch`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "branch_2 = repo.create_branch(name='new', base_commit=initialCommitHash)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BranchHead(name='new', digest='a=eaee002ed9c6e949c3657bd50e3949d6a459d50e')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "branch_2" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e (\u001B[1;31mmaster\u001B[m) (\u001B[1;31mnew\u001B[m) (\u001B[1;31mtestbranch\u001B[m) : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Making changes on a branch\n", "\n", "Let’s make some changes on the `new` branch to see how things work.\n", "\n", "We can see that the data we added previously is still here (the `dummy_column` column containing\n", "one sample labeled `0`)." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "co = repo.checkout(write=True, branch='new')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar Columns \n", " Writeable : True \n", " Number of Columns : 1 \n", " Column Names / Partial Remote References: \n", " - dummy_column / False" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar FlatSampleWriter \n", " Column Name : dummy_column \n", " Writeable : True \n", " Column Type : ndarray \n", " Column Layout : flat \n", " Schema Type : fixed_shape \n", " DType : uint16 \n", " Shape : (10,) \n", " Number of Samples : 1 \n", " Partial Remote Data Refs : False\n" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns['dummy_column']" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint16)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns['dummy_column']['0']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add another sample, called `1`, to `dummy_column`." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "arr = np.arange(10, dtype=np.uint16)\n", "# let's increment values so that `0` and `1` aren't set to the same thing\n", "arr += 1\n", "\n", "co['dummy_column', '1'] = arr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that in this checkout, there are indeed two samples in `dummy_column`:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 21, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "len(co.columns['dummy_column'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all, let's commit this and be done with this branch." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "co.commit('commit on `new` branch adding a sample to dummy_arrayset')\n", "co.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### How do changes appear when made on a branch?\n", "\n", "If we look at the log, we see that the branch we were on (`new`) is a commit ahead of `master` and `testbranch`" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=c1cf1bd6863ed0b95239d2c9e1a6c6cc65569e94 (\u001B[1;31mnew\u001B[m) : commit on `new` branch adding a sample to dummy_arrayset\n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e (\u001B[1;31mmaster\u001B[m) (\u001B[1;31mtestbranch\u001B[m) : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The meaning is exactly what one would intuit. We made some changes, they were\n", "reflected on the `new` branch, but the `master` and `testbranch` branches\n", "were not impacted at all, nor were any of the commits!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging (Part 1) Fast-Forward Merges\n", "\n", "Say we like the changes we made on the ``new`` branch so much that we want them\n", "to be included into our ``master`` branch! How do we make this happen for this\n", "scenario??\n", "\n", "Well, the history between the ``HEAD`` of the ``new`` and the ``HEAD`` of the\n", "``master`` branch is perfectly linear. 
In fact, when we began making changes\n", "on ``new``, our staging area was *identical* to what the ``master`` ``HEAD``\n", "commit references right now!\n", "\n", "If you remember that a branch is just a pointer which assigns some ``name``\n", "to a ``commit_hash``, it becomes apparent that a merge in this case really\n", "doesn’t involve any work at all. With a linear history between ``master`` and\n", "``new``, any ``commits`` existing along the path between the ``HEAD`` of\n", "``new`` and ``master`` are the only changes which are introduced, and we can\n", "be sure that this is the only view of the data records which can exist!\n", "\n", "What this means in practice is that for this type of merge, we can just update\n", "the ``HEAD`` of ``master`` to point to the ``HEAD`` of ``\"new\"``, and the\n", "merge is complete.\n", "\n", "This situation is referred to as a **Fast Forward (FF) Merge**. A FF merge is\n", "safe to perform any time a linear history lies between the ``HEAD`` of some\n", "``topic`` and ``base`` branch, regardless of how many commits or changes\n", "were introduced.\n", "\n", "For other situations, a more complicated **Three Way Merge** is required. This\n", "merge method will be explained a bit more later in this tutorial." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "co = repo.checkout(write=True, branch='master')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Performing the Merge\n", "\n", "In practice, you’ll never need to know the details of the merge theory explained\n", "above (or even remember it exists). Hangar automatically figures out which merge\n", "algorithm should be used and then performs whatever calculations are needed to\n", "compute the results.\n", "\n", "As a user, merging in Hangar is a one-liner! 
Just use the [merge()](api.rst#hangar.checkout.WriterCheckout.merge)\n", "method from a `write-enabled` checkout (shown below), or the analogous method\n", "from the Repository Object [repo.merge()](api.rst#hangar.repository.Repository.merge)\n", "(if not already working with a `write-enabled` checkout object)." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected Fast-Forward Merge Strategy\n" ] }, { "data": { "text/plain": [ "'a=c1cf1bd6863ed0b95239d2c9e1a6c6cc65569e94'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.merge(message='message for commit (not used for FF merge)', dev_branch='new')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the log!" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=c1cf1bd6863ed0b95239d2c9e1a6c6cc65569e94 (\u001B[1;31mmaster\u001B[m) (\u001B[1;31mnew\u001B[m) : commit on `new` branch adding a sample to dummy_arrayset\n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e (\u001B[1;31mtestbranch\u001B[m) : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'master'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.branch_name" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'a=c1cf1bd6863ed0b95239d2c9e1a6c6cc65569e94'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.commit_hash" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar FlatSampleWriter \n", " Column Name : dummy_column \n", " Writeable : True \n", 
" Column Type : ndarray \n", " Column Layout : flat \n", " Schema Type : fixed_shape \n", " DType : uint16 \n", " Shape : (10,) \n", " Number of Samples : 2 \n", " Partial Remote Data Refs : False\n" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns['dummy_column']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, everything is as it should be!" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "co.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Making changes to introduce diverged histories\n", "\n", "Let’s now go back to our `testbranch` branch and make some changes there so\n", "we can see what happens when changes don’t follow a linear history." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "co = repo.checkout(write=True, branch='testbranch')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar Columns \n", " Writeable : True \n", " Number of Columns : 1 \n", " Column Names / Partial Remote References: \n", " - dummy_column / False" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar FlatSampleWriter \n", " Column Name : dummy_column \n", " Writeable : True \n", " Column Type : ndarray \n", " Column Layout : flat \n", " Schema Type : fixed_shape \n", " DType : uint16 \n", " Shape : (10,) \n", " Number of Samples : 1 \n", " Partial Remote Data Refs : False\n" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns['dummy_column']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will start by mutating sample `0` in `dummy_arrayset` to a different value" ] }, { 
"cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([50, 51, 52, 53, 54, 55, 56, 57, 58, 59], dtype=uint16)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "old_arr = co['dummy_column', '0']\n", "new_arr = old_arr + 50\n", "new_arr" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "co['dummy_column', '0'] = new_arr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s make a commit here, then add some metadata and make a new commit (all on\n", "the `testbranch` branch)." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'a=fcd82f86e39b19c3e5351dda063884b5d2fda67b'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.commit('mutated sample `0` of `dummy_column` to new value')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=fcd82f86e39b19c3e5351dda063884b5d2fda67b (\u001B[1;31mtestbranch\u001B[m) : mutated sample `0` of `dummy_column` to new value\n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "co.metadata['hello'] = 'world'" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'a=69a08ca41ca1f5577fb0ffcf59d4d1585f614c4d'" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.commit('added hellow world metadata')" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "co.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at our history how, we see that none of the 
original branches reference\n", "our first commit anymore." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=69a08ca41ca1f5577fb0ffcf59d4d1585f614c4d (\u001B[1;31mtestbranch\u001B[m) : added hellow world metadata\n", "* a=fcd82f86e39b19c3e5351dda063884b5d2fda67b : mutated sample `0` of `dummy_column` to new value\n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the history of the `master` branch by specifying it as an argument to the `log()` method." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=c1cf1bd6863ed0b95239d2c9e1a6c6cc65569e94 (\u001B[1;31mmaster\u001B[m) (\u001B[1;31mnew\u001B[m) : commit on `new` branch adding a sample to dummy_arrayset\n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log('master')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging (Part 2) Three Way Merge\n", "\n", "If we now want to merge the changes on `testbranch` into `master`, we can't just follow a simple linear history; **the branches have diverged**.\n", "\n", "For this case, Hangar implements a **Three Way Merge** algorithm which does the following:\n", "- Find the most recent common ancestor `commit` present in both the `testbranch` and `master` branches\n", "- Compute what changed between the common ancestor and each branch's `HEAD` commit\n", "- Check if any of the changes conflict with each other (more on this in a later tutorial)\n", "- If no conflicts are present, compute the results of the merge between the two sets of changes\n", "- Create a new `commit` containing the merge results, reference both branch `HEAD`s 
as parents of the new `commit`, and update the `base` branch `HEAD` to that new `commit`'s `commit_hash`" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "co = repo.checkout(write=True, branch='master')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, as a user, the details are completely irrelevant, and the operation\n", "occurs from the same one-liner call we used before for the FF Merge." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected 3-Way Merge Strategy\n" ] }, { "data": { "text/plain": [ "'a=002041fe8d8846b06f33842964904b627de55214'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.merge(message='merge of testbranch into master', dev_branch='testbranch')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we now look at the log, we see that this has a much different look than\n", "before. 
The three-way merge results in a history which references changes made\n", "in both diverged branches, and unifies them in a single ``commit``." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=002041fe8d8846b06f33842964904b627de55214 (\u001B[1;31mmaster\u001B[m) : merge of testbranch into master\n", "\u001B[1;31m|\u001B[m\u001B[1;32m\\\u001B[m \n", "\u001B[1;31m|\u001B[m * a=69a08ca41ca1f5577fb0ffcf59d4d1585f614c4d (\u001B[1;31mtestbranch\u001B[m) : added hellow world metadata\n", "\u001B[1;31m|\u001B[m * a=fcd82f86e39b19c3e5351dda063884b5d2fda67b : mutated sample `0` of `dummy_column` to new value\n", "* \u001B[1;32m|\u001B[m a=c1cf1bd6863ed0b95239d2c9e1a6c6cc65569e94 (\u001B[1;31mnew\u001B[m) : commit on `new` branch adding a sample to dummy_arrayset\n", "\u001B[1;32m|\u001B[m\u001B[1;32m/\u001B[m \n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Manually inspecting the merge result to verify it matches our expectations\n", "\n", "`dummy_column` should contain two arrays. Key `1` was set in the previous\n", "commit originally made in `new` and merged into `master`. Key `0` was\n", "mutated in `testbranch` and unchanged in `master`, so the update from\n", "`testbranch` is kept.\n", "\n", "There should be one metadata sample with the key `hello` and the value\n", "``\"world\"``." 
] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar Columns \n", " Writeable : True \n", " Number of Columns : 1 \n", " Column Names / Partial Remote References: \n", " - dummy_column / False" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar FlatSampleWriter \n", " Column Name : dummy_column \n", " Writeable : True \n", " Column Type : ndarray \n", " Column Layout : flat \n", " Schema Type : fixed_shape \n", " DType : uint16 \n", " Shape : (10,) \n", " Number of Samples : 2 \n", " Partial Remote Data Refs : False\n" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.columns['dummy_column']" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[array([50, 51, 52, 53, 54, 55, 56, 57, 58, 59], dtype=uint16),\n", " array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint16)]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co['dummy_column', ['0', '1']]" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hangar Metadata \n", " Writeable: True \n", " Number of Keys: 1\n" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.metadata" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'world'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.metadata['hello']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The Merge was a success!**" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "co.close()" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Conflicts\n", "\n", "Now that we've seen merging in action, the next step is to talk about conflicts.\n", "\n", "#### How Are Conflicts Detected?\n", "\n", "Any merge conflicts can be identified and addressed ahead of running a `merge`\n", "command by using the built in [diff](api.rst#hangar.diff.WriterUserDiff) tools.\n", "When diffing commits, Hangar will provide a list of conflicts which it identifies.\n", "In general these fall into 4 categories:\n", "\n", "1. **Additions** in both branches which created new keys (samples /\n", " columns / metadata) with non-compatible values. For samples &\n", " metadata, the hash of the data is compared, for columns, the schema\n", " specification is checked for compatibility in a method custom to the\n", " internal workings of Hangar.\n", "2. **Removal** in `Master Commit/Branch` **& Mutation** in `Dev Commit / Branch`. Applies for samples, columns, and metadata identically.\n", "3. **Mutation** in `Dev Commit/Branch` **& Removal** in `Master Commit / Branch`. Applies for samples, columns, and metadata identically.\n", "4. **Mutations** on keys of both branches to non-compatible values. For\n", " samples & metadata, the hash of the data is compared; for columns, the\n", " schema specification is checked for compatibility in a method custom to the\n", " internal workings of Hangar.\n", "\n", "#### Let's make a merge conflict\n", "\n", "To force a conflict, we are going to checkout the `new` branch and set the\n", "metadata key `hello` to the value `foo conflict... BOO!`. 
Then if we try\n", "to merge this into the `testbranch` branch (which set `hello` to a value\n", "of `world`) we see how hangar will identify the conflict and halt without\n", "making any changes.\n", "\n", "Automated conflict resolution will be introduced in a future version of Hangar,\n", "for now it is up to the user to manually resolve conflicts by making any\n", "necessary changes in each branch before reattempting a merge operation." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "co = repo.checkout(write=True, branch='new')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "co.metadata['hello'] = 'foo conflict... BOO!'" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'a=95896880b33fc06a3c2359a03408f07c87bcc8c0'" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.commit ('commit on new branch to hello metadata key so we can demonstrate a conflict')" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=95896880b33fc06a3c2359a03408f07c87bcc8c0 (\u001B[1;31mnew\u001B[m) : commit on new branch to hello metadata key so we can demonstrate a conflict\n", "* a=c1cf1bd6863ed0b95239d2c9e1a6c6cc65569e94 : commit on `new` branch adding a sample to dummy_arrayset\n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**When we attempt the merge, an exception is thrown telling us there is a conflict!**" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected 3-Way Merge Strategy\n" ] }, { "ename": "ValueError", "evalue": "HANGAR VALUE ERROR:: Merge ABORTED 
with conflict: Conflicts(t1=[(b'l:hello', b'2=d8fa6800caf496e637d965faac1a033e4636c2e6')], t21=[], t22=[], t3=[], conflict=True)", "output_type": "error", "traceback": [ "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m", "\u001B[0;31mValueError\u001B[0m Traceback (most recent call last)", "\u001B[0;32m\u001B[0m in \u001B[0;36m\u001B[0;34m\u001B[0m\n\u001B[0;32m----> 1\u001B[0;31m \u001B[0mco\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mmerge\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mmessage\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0;34m'this merge should not happen'\u001B[0m\u001B[0;34m,\u001B[0m \u001B[0mdev_branch\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0;34m'testbranch'\u001B[0m\u001B[0;34m)\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m", "\u001B[0;32m~/projects/tensorwerk/hangar/hangar-py/src/hangar/checkout.py\u001B[0m in \u001B[0;36mmerge\u001B[0;34m(self, message, dev_branch)\u001B[0m\n\u001B[1;32m 1027\u001B[0m \u001B[0mdev_branch\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0mdev_branch\u001B[0m\u001B[0;34m,\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 1028\u001B[0m \u001B[0mrepo_path\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0mself\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0m_repo_path\u001B[0m\u001B[0;34m,\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0;32m-> 1029\u001B[0;31m writer_uuid=self._writer_lock)\n\u001B[0m\u001B[1;32m 1030\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 1031\u001B[0m \u001B[0;32mfor\u001B[0m \u001B[0masetHandle\u001B[0m \u001B[0;32min\u001B[0m \u001B[0mself\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0m_columns\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mvalues\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0;34m)\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n", "\u001B[0;32m~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py\u001B[0m in \u001B[0;36mselect_merge_algorithm\u001B[0;34m(message, branchenv, 
stageenv, refenv, stagehashenv, master_branch, dev_branch, repo_path, writer_uuid)\u001B[0m\n\u001B[1;32m 136\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 137\u001B[0m \u001B[0;32mexcept\u001B[0m \u001B[0mValueError\u001B[0m \u001B[0;32mas\u001B[0m \u001B[0me\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0;32m--> 138\u001B[0;31m \u001B[0;32mraise\u001B[0m \u001B[0me\u001B[0m \u001B[0;32mfrom\u001B[0m \u001B[0;32mNone\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m\u001B[1;32m 139\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 140\u001B[0m \u001B[0;32mfinally\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n", "\u001B[0;32m~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py\u001B[0m in \u001B[0;36mselect_merge_algorithm\u001B[0;34m(message, branchenv, stageenv, refenv, stagehashenv, master_branch, dev_branch, repo_path, writer_uuid)\u001B[0m\n\u001B[1;32m 133\u001B[0m \u001B[0mrefenv\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0mrefenv\u001B[0m\u001B[0;34m,\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 134\u001B[0m \u001B[0mstagehashenv\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0mstagehashenv\u001B[0m\u001B[0;34m,\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0;32m--> 135\u001B[0;31m repo_path=repo_path)\n\u001B[0m\u001B[1;32m 136\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 137\u001B[0m \u001B[0;32mexcept\u001B[0m \u001B[0mValueError\u001B[0m \u001B[0;32mas\u001B[0m \u001B[0me\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n", "\u001B[0;32m~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py\u001B[0m in \u001B[0;36m_three_way_merge\u001B[0;34m(message, master_branch, masterHEAD, dev_branch, devHEAD, ancestorHEAD, branchenv, stageenv, refenv, stagehashenv, repo_path)\u001B[0m\n\u001B[1;32m 260\u001B[0m \u001B[0;32mif\u001B[0m 
\u001B[0mconflict\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mconflict\u001B[0m \u001B[0;32mis\u001B[0m \u001B[0;32mTrue\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 261\u001B[0m \u001B[0mmsg\u001B[0m \u001B[0;34m=\u001B[0m \u001B[0;34mf'HANGAR VALUE ERROR:: Merge ABORTED with conflict: {conflict}'\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0;32m--> 262\u001B[0;31m \u001B[0;32mraise\u001B[0m \u001B[0mValueError\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mmsg\u001B[0m\u001B[0;34m)\u001B[0m \u001B[0;32mfrom\u001B[0m \u001B[0;32mNone\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n\u001B[0m\u001B[1;32m 263\u001B[0m \u001B[0;34m\u001B[0m\u001B[0m\n\u001B[1;32m 264\u001B[0m \u001B[0;32mwith\u001B[0m \u001B[0mmEnv\u001B[0m\u001B[0;34m.\u001B[0m\u001B[0mbegin\u001B[0m\u001B[0;34m(\u001B[0m\u001B[0mwrite\u001B[0m\u001B[0;34m=\u001B[0m\u001B[0;32mTrue\u001B[0m\u001B[0;34m)\u001B[0m \u001B[0;32mas\u001B[0m \u001B[0mtxn\u001B[0m\u001B[0;34m:\u001B[0m\u001B[0;34m\u001B[0m\u001B[0;34m\u001B[0m\u001B[0m\n", "\u001B[0;31mValueError\u001B[0m: HANGAR VALUE ERROR:: Merge ABORTED with conflict: Conflicts(t1=[(b'l:hello', b'2=d8fa6800caf496e637d965faac1a033e4636c2e6')], t21=[], t22=[], t3=[], conflict=True)" ] } ], "source": [ "co.merge(message='this merge should not happen', dev_branch='testbranch')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Checking for Conflicts\n", "\n", "Alternatively, use the diff methods on a checkout to test for conflicts before attempting a merge.\n", "\n", "It is possible to diff between a checkout object and:\n", "\n", "1. Another branch ([diff.branch()](api.rst#hangar.diff.WriterUserDiff.branch))\n", "2. A specified commit ([diff.commit()](api.rst#hangar.diff.WriterUserDiff.commit))\n", "3. 
Changes made in the staging area before a commit is made\n", " ([diff.staged()](api.rst#hangar.diff.WriterUserDiff.staged))\n", " (for `write-enabled` checkouts only.)\n", "\n", "A diff is also available via the [CLI status tool](cli.rst#hangar-status), which compares the staging area against any branch/commit\n", "(only a human-readable summary is produced)." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "merge_results, conflicts_found = co.diff.branch('testbranch')" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Conflicts(t1=Changes(schema={}, samples=(), metadata=(MetadataRecordKey(key='hello'),)), t21=Changes(schema={}, samples=(), metadata=()), t22=Changes(schema={}, samples=(), metadata=()), t3=Changes(schema={}, samples=(), metadata=()), conflict=True)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conflicts_found" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(MetadataRecordKey(key='hello'),)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conflicts_found.t1.metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type codes for a `Conflicts` `namedtuple` such as the one we saw (shown here in abbreviated form):\n", "\n", " Conflicts(t1=('hello',), t21=(), t22=(), t3=(), conflict=True)\n", "\n", "are as follows:\n", "\n", "- ``t1``: Addition of key in master AND dev with different values.\n", "- ``t21``: Removed key in master, mutated value in dev.\n", "- ``t22``: Removed key in dev, mutated value in master.\n", "- ``t3``: Mutated key in both master AND dev to different values.\n", "- ``conflict``: Bool indicating if any type of conflict is present." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To resolve, remove the conflict" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'a=e69ba8aeffc130c57d2ae0a8131c8ea59083cb62'" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del co.metadata['hello']\n", "# resolved conflict by removing hello key\n", "co.commit('commit which removes conflicting metadata key')" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected 3-Way Merge Strategy\n" ] }, { "data": { "text/plain": [ "'a=ef7ddf4a4a216315d929bd905e78866e3ad6e4fd'" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "co.merge(message='this merge succeeds as it no longer has a conflict', dev_branch='testbranch')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can verify that history looks as we would expect via the log!" 
] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* a=ef7ddf4a4a216315d929bd905e78866e3ad6e4fd (\u001B[1;31mnew\u001B[m) : this merge succeeds as it no longer has a conflict\n", "\u001B[1;31m|\u001B[m\u001B[1;32m\\\u001B[m \n", "* \u001B[1;32m|\u001B[m a=e69ba8aeffc130c57d2ae0a8131c8ea59083cb62 : commit which removes conflicting metadata key\n", "* \u001B[1;32m|\u001B[m a=95896880b33fc06a3c2359a03408f07c87bcc8c0 : commit on new branch to hello metadata key so we can demonstrate a conflict\n", "\u001B[1;32m|\u001B[m * a=69a08ca41ca1f5577fb0ffcf59d4d1585f614c4d (\u001B[1;31mtestbranch\u001B[m) : added hellow world metadata\n", "\u001B[1;32m|\u001B[m * a=fcd82f86e39b19c3e5351dda063884b5d2fda67b : mutated sample `0` of `dummy_column` to new value\n", "* \u001B[1;32m|\u001B[m a=c1cf1bd6863ed0b95239d2c9e1a6c6cc65569e94 : commit on `new` branch adding a sample to dummy_arrayset\n", "\u001B[1;32m|\u001B[m\u001B[1;32m/\u001B[m \n", "* a=eaee002ed9c6e949c3657bd50e3949d6a459d50e : first commit with a single sample added to a dummy column\n" ] } ], "source": [ "repo.log()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }