Part 2: Checkouts, Branching, & Merging

This section deals with navigating repository history, creating & merging branches, and understanding conflicts.

The Hangar Workflow

The Hangar workflow is intended to mimic common Git workflows in which small incremental changes are made and committed on dedicated topic branches. After the topic has been adequately developed, the topic branch is merged into a separate branch (commonly referred to as master, though it need not be the actual branch named "master"), where well-vetted and more permanent changes are kept.

Create Branch -> Checkout Branch -> Make Changes -> Commit

Making the Initial Commit

Let’s initialize a new repository and see how branching works in Hangar.

[1]:
from hangar import Repository
import numpy as np
[2]:
repo = Repository(path='foo/pth')
[3]:
repo_pth = repo.init(user_name='Test User', user_email='test@foo.com')

When a repository is first initialized, it has no history, no commits.

[4]:
repo.log() # -> returns None

Though the repository is essentially empty at this point in time, one thing is already present: a branch named "master".

[5]:
repo.list_branches()
[5]:
['master']

This "master" is the branch we make our first commit on; until we do, the repository is in a semi-unstable state. With no history or contents, most of the functionality of a repository (to store, retrieve, and work with versions of data across time) just isn’t possible. A significant portion of otherwise standard operations will flat out refuse to execute (i.e. read-only checkouts, log, push, etc.) until the first commit is made.

One of the only options available at this point in time is to create a write-enabled checkout on the "master" branch and begin to add data so we can make a commit. Let’s do that now:

[6]:
co = repo.checkout(write=True)

As expected, there are no arraysets or metadata samples recorded in the checkout.

[7]:
print(f'number of metadata keys: {len(co.metadata)}')
print(f'number of arraysets: {len(co.arraysets)}')
number of metadata keys: 0
number of arraysets: 0

Let’s add a dummy array just to put something in the repository history to commit. We’ll then close the checkout so we can explore some useful tools which depend on having at least one historical record (commit) in the repo.

[8]:
dummy = np.arange(10, dtype=np.uint16)
aset = co.arraysets.init_arrayset(name='dummy_arrayset', prototype=dummy)
aset['0'] = dummy
initialCommitHash = co.commit('first commit with a single sample added to a dummy arrayset')
co.close()
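
Because a write-enabled checkout is exclusive, it is worth making sure close() runs even when something fails between opening the checkout and committing. The following is a minimal sketch of one way to do that with the standard library's contextlib.closing. It relies only on the checkout(), commit(), and close() calls shown in this tutorial; the helper name commit_with_cleanup and the make_changes callback are hypothetical names introduced for illustration, not part of Hangar.

```python
from contextlib import closing


def commit_with_cleanup(repo, message, make_changes):
    """Open a write checkout, apply `make_changes(co)`, commit, and always close.

    `repo` is assumed to expose `checkout(write=True)`, and the returned
    checkout is assumed to expose `commit(message)` and `close()`.
    """
    # closing() calls co.close() on exit, whether or not an exception occurred
    with closing(repo.checkout(write=True)) as co:
        make_changes(co)
        return co.commit(message)
```

With this helper, the cell above could be expressed as a small function that initializes the arrayset and adds the sample, passed in as make_changes.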

If we check the history now, we can see our first commit hash, and that it is labeled with the branch name "master".

[9]:
repo.log()
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 (master) : first commit with a single sample added to a dummy arrayset

So now our repository contains:

  • A commit: a fully independent description of the entire repository state as it existed at some point in time. A commit is identified by a commit_hash.
  • A branch: a label pointing to a particular commit / commit_hash.

Once committed, it is not possible to remove, modify, or otherwise tamper with the contents of a commit in any way. It is a permanent record, which Hangar has no method to change once written to disk.

In addition, because a commit_hash is calculated not only from the commit’s contents but also from the commit_hash of its parents (more on this to follow), knowing a single top-level commit_hash allows us to verify the integrity of the entire repository history. This fundamental behavior holds even in cases of disk corruption or malicious use.
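
To make the idea of hash chaining concrete, here is a small illustrative sketch using Python's hashlib. It is not Hangar's actual hashing scheme (the function name toy_commit_hash and the byte strings used as "content" are assumptions made for this example); it only shows why a change anywhere in history changes every hash computed on top of it.

```python
import hashlib


def toy_commit_hash(content: bytes, parent_hashes: tuple = ()) -> str:
    """Hash a commit's content together with its parents' hashes (illustrative only)."""
    h = hashlib.sha1()
    for parent in parent_hashes:
        h.update(parent.encode())
    h.update(content)
    return h.hexdigest()


root = toy_commit_hash(b'sample 0 = [0..9]')
child = toy_commit_hash(b'sample 1 = [1..10]', (root,))

# Tampering with the root content changes the root hash, and therefore
# the hash of every commit built on top of it.
tampered_root = toy_commit_hash(b'sample 0 = [9..0]')
tampered_child = toy_commit_hash(b'sample 1 = [1..10]', (tampered_root,))

print(child != tampered_child)  # True
```

Verifying a history then amounts to recomputing the hashes from the root upward and comparing the result against the known top-level hash.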

Working with Checkouts & Branches

As mentioned in the first tutorial, we work with the data in a repository through a checkout. There are two types of checkouts (each of which has different uses and abilities):

Checking out a branch/commit for reading: is the process of retrieving records describing repository state at some point in time, and setting up access to the referenced data.

  • Any number of read checkout processes can operate on a repository (on any number of commits) at the same time.

Checking out a branch for writing: is the process of setting up a (mutable) staging area to temporarily gather record references / data until all changes have been made and the staging area contents are committed as a new permanent record of history (a commit).

  • Only one write-enabled checkout can ever be operating in a repository at a time.
  • When initially creating the checkout, the staging area is not actually “empty”. Instead, it has the full contents of the last commit referenced by a branch’s HEAD. These records can be removed/mutated/added to in any way to form the next commit. The new commit retains a permanent reference identifying the previous HEAD commit which was used as its base staging area.
  • On commit, the branch which was checked out has its HEAD pointer value updated to the new commit’s commit_hash. A write-enabled checkout starting from the same branch will now use that commit’s record content as the base for its staging area.
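
The behavior described in these bullets can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions (commits are plain dict snapshots, commit ids are counters rather than hashes, and the class name ToyRepo and its methods are invented for this example); it does not reflect how Hangar stores records internally.

```python
import copy


class ToyRepo:
    """Toy model: commits are frozen snapshots, branches map names to commit ids."""

    def __init__(self):
        self.commits = {}             # commit id -> (parent id, snapshot dict)
        self.branches = {'master': None}
        self._writer_active = False   # stands in for the single-writer lock

    def checkout_write(self, branch):
        if self._writer_active:
            raise PermissionError('another write-enabled checkout is open')
        self._writer_active = True
        head = self.branches[branch]
        # the staging area starts as a copy of the branch HEAD contents
        staging = copy.deepcopy(self.commits[head][1]) if head is not None else {}
        return branch, staging

    def commit(self, branch, staging):
        parent = self.branches[branch]
        commit_id = f'c{len(self.commits)}'
        self.commits[commit_id] = (parent, copy.deepcopy(staging))
        self.branches[branch] = commit_id   # move the branch HEAD pointer
        return commit_id

    def close(self):
        self._writer_active = False


toy = ToyRepo()
branch, stage = toy.checkout_write('master')
stage['0'] = [0, 1, 2]
first = toy.commit(branch, stage)
toy.close()

branch, stage = toy.checkout_write('master')
print(stage)                            # {'0': [0, 1, 2]}
stage['1'] = [1, 2, 3]
second = toy.commit(branch, stage)
toy.close()

print(toy.commits[first][1])            # {'0': [0, 1, 2]}
print(toy.commits[second][0] == first)  # True
```

The second checkout begins with the contents of the first commit, the first commit's snapshot is untouched by later changes, and the second commit records the first as its parent.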

Creating a branch

A branch is an individual series of changes/commits which diverge from the main history of the repository at some point in time. All changes made along a branch are completely isolated from those on other branches. After some point in time, changes made in disparate branches can be unified through an automatic merge process (described in detail later in this tutorial). In general, the Hangar branching model is semantically identical to Git’s; Hangar branches also have the same lightweight and performant properties which make working with Git branches so appealing.

In Hangar, a branch must always have a name and a base_commit. However, if no base_commit is specified, the current writer branch HEAD commit is used as the base_commit hash for the branch automatically.

[10]:
branch_1 = repo.create_branch(name='testbranch')
[11]:
branch_1
[11]:
'testbranch'

Viewing the log, we see that a new branch named testbranch is pointing to our initial commit.

[12]:
print(f'branch names: {repo.list_branches()} \n')
repo.log()
branch names: ['master', 'testbranch']

* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 (master) (testbranch) : first commit with a single sample added to a dummy arrayset

If, instead, we do specify the base commit (with a different branch name), we get a third branch pointing to the same commit as "master" and "testbranch".

[13]:
branch_2 = repo.create_branch(name='new', base_commit=initialCommitHash)
[14]:
branch_2
[14]:
'new'
[15]:
repo.log()
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 (master) (new) (testbranch) : first commit with a single sample added to a dummy arrayset

Making changes on a branch

Let’s make some changes on the "new" branch to see how things work. We can see that the data we added previously is still here (dummy arrayset containing one sample labeled 0).

[16]:
co = repo.checkout(write=True, branch='new')
[17]:
co.arraysets
[17]:
Hangar Arraysets
    Writeable: True
    Arrayset Names / Partial Remote References:
      - dummy_arrayset / False
[18]:
co.arraysets['dummy_arrayset']
[18]:
Hangar ArraysetDataWriter
    Arrayset Name             : dummy_arrayset
    Schema Hash              : 43edf7aa314c
    Variable Shape           : False
    (max) Shape              : (10,)
    Datatype                 : <class 'numpy.uint16'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 1
    Partial Remote Data Refs : False

[19]:
co.arraysets['dummy_arrayset']['0']
[19]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint16)

Let’s add another sample to the dummy_arrayset called 1.

[20]:
arr = np.arange(10, dtype=np.uint16)
# let's increment values so that `0` and `1` aren't set to the same thing
arr += 1

co.arraysets['dummy_arrayset']['1'] = arr

We can see that in this checkout, there are indeed two samples in the dummy_arrayset.

[21]:
len(co.arraysets['dummy_arrayset'])
[21]:
2

That’s all; let’s commit this and be done with this branch.

[22]:
co.commit('commit on `new` branch adding a sample to dummy_arrayset')
co.close()

How do changes appear when made on a branch?

If we look at the log, we see that the branch we were on (new) is a commit ahead of master and testbranch.

[23]:
repo.log()
* 186f1ccae28ad8f58bcae95dd8c1115a3b0de9dd (new) : commit on `new` branch adding a sample to dummy_arrayset
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 (master) (testbranch) : first commit with a single sample added to a dummy arrayset

The meaning is exactly what one would intuit: we made some changes and they were reflected on the new branch, but the master and testbranch branches were not impacted at all, nor were any of the commits!

Merging (Part 1) Fast-Forward Merges

Say we like the changes we made on the new branch so much that we want them to be included in our master branch! How do we make this happen in this scenario?

Well, the history between the HEAD of the "new" and the HEAD of the "master" branch is perfectly linear. In fact, when we began making changes on "new", our staging area was identical to what the "master" HEAD commit references are right now!

If you remember that a branch is just a pointer which assigns a name to a commit_hash, it becomes apparent that a merge in this case really doesn’t involve any work at all. With a linear history between "master" and "new", the commits existing along the path between the HEAD of "master" and the HEAD of "new" are the only changes which are introduced, and we can be sure that this is the only view of the data records which can exist!

What this means in practice is that for this type of merge, we can just update the HEAD of "master" to point to the "HEAD" of "new", and the merge is complete.

This situation is referred to as a Fast Forward (FF) Merge. A FF merge is safe to perform any time a linear history lies between the HEAD of some topic branch and the HEAD of the base branch, regardless of how many commits or changes were introduced.

For other situations, a more complicated Three Way Merge is required. This merge method is explained later in this tutorial.
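
As a rough sketch of the fast-forward condition, the check below asks whether the base branch HEAD is an ancestor of the topic branch HEAD. It models history as a plain dict mapping each commit to its parents, using abbreviated versions of the hashes from this tutorial; the function names are hypothetical and this is not Hangar's internal implementation.

```python
def ancestors(parents, commit):
    """Return the set containing `commit` and every commit reachable through its parents."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def can_fast_forward(parents, base_head, topic_head):
    """A fast-forward is possible when the base HEAD is an ancestor of the topic HEAD."""
    return base_head in ancestors(parents, topic_head)


# history from this tutorial (abbreviated hashes): commit -> tuple of parents
parents = {'0fd892b': (), '186f1cc': ('0fd892b',)}
branches = {'master': '0fd892b', 'testbranch': '0fd892b', 'new': '186f1cc'}

if can_fast_forward(parents, branches['master'], branches['new']):
    # the whole "merge" is a pointer update
    branches['master'] = branches['new']

print(branches['master'])  # 186f1cc
```

Note that the check is not symmetric: "new" could not be fast-forwarded to the original "master" HEAD, because that would discard a commit.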

[24]:
co = repo.checkout(write=True, branch='master')

Performing the Merge

In practice, you’ll never need to know the details of the merge theory explained above (or even remember it exists). Hangar automatically figures out which merge algorithm should be used and then performs whatever calculations are needed to compute the results.

As a user, merging in Hangar is a one-liner!

[25]:
co.merge(message='message for commit (not used for FF merge)', dev_branch='new')
[25]:
'186f1ccae28ad8f58bcae95dd8c1115a3b0de9dd'

Let’s check the log!

[26]:
repo.log()
* 186f1ccae28ad8f58bcae95dd8c1115a3b0de9dd (master) (new) : commit on `new` branch adding a sample to dummy_arrayset
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 (testbranch) : first commit with a single sample added to a dummy arrayset
[27]:
co.branch_name
[27]:
'master'
[28]:
co.commit_hash
[28]:
'186f1ccae28ad8f58bcae95dd8c1115a3b0de9dd'
[29]:
co.arraysets['dummy_arrayset']
[29]:
Hangar ArraysetDataWriter
    Arrayset Name             : dummy_arrayset
    Schema Hash              : 43edf7aa314c
    Variable Shape           : False
    (max) Shape              : (10,)
    Datatype                 : <class 'numpy.uint16'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 2
    Partial Remote Data Refs : False

As you can see, everything is as it should be!

[30]:
co.close()

Making changes to introduce diverged histories

Let’s now go back to our "testbranch" branch and make some changes there so we can see what happens when changes don’t follow a linear history.

[31]:
co = repo.checkout(write=True, branch='testbranch')
[32]:
co.arraysets
[32]:
Hangar Arraysets
    Writeable: True
    Arrayset Names / Partial Remote References:
      - dummy_arrayset / False
[33]:
co.arraysets['dummy_arrayset']
[33]:
Hangar ArraysetDataWriter
    Arrayset Name             : dummy_arrayset
    Schema Hash              : 43edf7aa314c
    Variable Shape           : False
    (max) Shape              : (10,)
    Datatype                 : <class 'numpy.uint16'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 1
    Partial Remote Data Refs : False

We will start by mutating sample 0 in dummy_arrayset to a different value

[34]:
dummy_aset = co.arraysets['dummy_arrayset']
[35]:
old_arr = dummy_aset['0']
new_arr = old_arr + 50
new_arr
[35]:
array([50, 51, 52, 53, 54, 55, 56, 57, 58, 59], dtype=uint16)
[36]:
dummy_aset['0'] = new_arr

Let’s make a commit here, then add some metadata and make a new commit (all on the testbranch branch).

[37]:
co.commit('mutated sample `0` of `dummy_arrayset` to new value')
[37]:
'2fe5c53a899ba6accbe8c19debd9a489e3baeaed'
[38]:
repo.log()
* 2fe5c53a899ba6accbe8c19debd9a489e3baeaed (testbranch) : mutated sample `0` of `dummy_arrayset` to new value
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 : first commit with a single sample added to a dummy arrayset
[39]:
co.metadata['hello'] = 'world'
[40]:
co.commit('added hellow world metadata')
[40]:
'836ba8ff1fe552fb65944e2340b2a2ef2b2b62d4'
[41]:
co.close()

Looking at our history now, we see that none of the original branches reference our first commit anymore.

[42]:
repo.log()
* 836ba8ff1fe552fb65944e2340b2a2ef2b2b62d4 (testbranch) : added hellow world metadata
* 2fe5c53a899ba6accbe8c19debd9a489e3baeaed : mutated sample `0` of `dummy_arrayset` to new value
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 : first commit with a single sample added to a dummy arrayset

We can check the history of the "master" branch by specifying it as an argument to the log() method.

[43]:
repo.log('master')
* 186f1ccae28ad8f58bcae95dd8c1115a3b0de9dd (master) (new) : commit on `new` branch adding a sample to dummy_arrayset
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 : first commit with a single sample added to a dummy arrayset

Merging (Part 2) Three Way Merge

If we now want to merge the changes on "testbranch" into "master", we can’t just follow a simple linear history; the branches have diverged.

For this case, Hangar implements a Three Way Merge algorithm which does the following:

  • Find the most recent common ancestor commit present in both the "testbranch" and "master" branches.
  • Compute what changed between the common ancestor and each branch’s HEAD commit.
  • Check if any of the changes conflict with each other (more on this in a later tutorial).
  • If no conflicts are present, compute the results of the merge between the two sets of changes.
  • Create a new commit containing the merge results which references both branch HEADs as parents of the new commit, and update the base branch HEAD to that new commit’s commit_hash.
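
The core of these steps can be sketched on plain dictionaries. The example below is a simplified illustration, not Hangar's merge code: it assumes the common ancestor has already been found, treats each commit as a flat key/value mapping, and compares values with == (Hangar compares data hashes instead). The function name three_way_merge and the 'sample:' / 'meta:' key prefixes are invented for this sketch.

```python
MISSING = object()  # sentinel meaning "key not present in this commit"


def three_way_merge(ancestor, master, dev):
    """Merge two dicts that diverged from `ancestor`; raise ValueError on conflict."""
    merged, conflicts = {}, []
    for key in sorted(set(ancestor) | set(master) | set(dev)):
        a = ancestor.get(key, MISSING)
        m = master.get(key, MISSING)
        d = dev.get(key, MISSING)
        if m == d:        # both sides agree (including "both removed")
            result = m
        elif m == a:      # only dev changed this key
            result = d
        elif d == a:      # only master changed this key
            result = m
        else:             # both sides changed it, in different ways
            conflicts.append(key)
            continue
        if result is not MISSING:
            merged[key] = result
    if conflicts:
        raise ValueError(f'merge aborted, conflicting keys: {conflicts}')
    return merged


# A simplified version of the state of this tutorial's branches
ancestor = {'sample:0': list(range(10))}
master = {'sample:0': list(range(10)), 'sample:1': list(range(1, 11))}
dev = {'sample:0': list(range(50, 60)), 'meta:hello': 'world'}

merged = three_way_merge(ancestor, master, dev)
print(sorted(merged))         # ['meta:hello', 'sample:0', 'sample:1']
print(merged['sample:0'][0])  # 50
```

Each key is resolved independently: a change made on only one side is kept, and a key changed differently on both sides stops the merge.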

[44]:
co = repo.checkout(write=True, branch='master')

Once again, as a user, the details are completely irrelevant, and the operation occurs from the same one-liner call we used before for the FF Merge.

[45]:
co.merge(message='merge of testbranch into master', dev_branch='testbranch')
[45]:
'fd4a07ada0f138870924fc4ffee47839b77f1fbe'

If we now look at the log, we see that this has a much different look than before. The three way merge results in a history which references changes made in both diverged branches, and unifies them in a single commit.

[46]:
repo.log()
*   fd4a07ada0f138870924fc4ffee47839b77f1fbe (master) : merge of testbranch into master
|\
| * 836ba8ff1fe552fb65944e2340b2a2ef2b2b62d4 (testbranch) : added hellow world metadata
| * 2fe5c53a899ba6accbe8c19debd9a489e3baeaed : mutated sample `0` of `dummy_arrayset` to new value
* | 186f1ccae28ad8f58bcae95dd8c1115a3b0de9dd (new) : commit on `new` branch adding a sample to dummy_arrayset
|/
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 : first commit with a single sample added to a dummy arrayset

Manually inspecting the merge result to verify it matches our expectations

dummy_arrayset should contain two arrays: key 1 was set in the commit originally made in "new" and merged into "master", while key 0 was mutated in "testbranch" and unchanged in "master", so the update from "testbranch" is kept.

There should be one metadata sample with the key "hello" and the value "world".

[47]:
co.arraysets
[47]:
Hangar Arraysets
    Writeable: True
    Arrayset Names / Partial Remote References:
      - dummy_arrayset / False
[48]:
co.arraysets['dummy_arrayset']
[48]:
Hangar ArraysetDataWriter
    Arrayset Name             : dummy_arrayset
    Schema Hash              : 43edf7aa314c
    Variable Shape           : False
    (max) Shape              : (10,)
    Datatype                 : <class 'numpy.uint16'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 2
    Partial Remote Data Refs : False

[49]:
co.arraysets['dummy_arrayset']['0']
[49]:
array([50, 51, 52, 53, 54, 55, 56, 57, 58, 59], dtype=uint16)
[50]:
co.arraysets['dummy_arrayset']['1']
[50]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=uint16)
[51]:
co.metadata
[51]:
Hangar Metadata
    Writeable: True
    Number of Keys: 1

[52]:
co.metadata['hello']
[52]:
'world'

The Merge was a success!

[53]:
co.close()

Conflicts

Now that we’ve seen merging in action, the next step is to talk about conflicts.

How Are Conflicts Detected?

Any merge conflicts can be identified and addressed ahead of running a merge command by using the built-in diff tools. When diffing commits, Hangar will provide a list of the conflicts it identifies. In general these fall into 4 categories:

  1. Additions in both branches which created new keys (samples / arraysets / metadata) with non-compatible values. For samples & metadata, the hash of the data is compared; for arraysets, the schema specification is checked for compatibility in a manner specific to the internal workings of Hangar.
  2. Removal in Master Commit/Branch & Mutation in Dev Commit/Branch. Applies for samples, arraysets, and metadata identically.
  3. Mutation in Master Commit/Branch & Removal in Dev Commit/Branch. Applies for samples, arraysets, and metadata identically.
  4. Mutations on keys in both branches to non-compatible values. For samples & metadata, the hash of the data is compared; for arraysets, the schema specification is checked for compatibility in a manner specific to the internal workings of Hangar.
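
A minimal sketch of how keys might be sorted into these four categories is shown below. It works on plain dictionaries and compares values directly rather than by hash, so it is an illustration of the classification logic only; the function name classify_conflicts is hypothetical, and the labels t1, t21, t22, and t3 follow the type codes Hangar reports in its conflict records.

```python
MISSING = object()  # sentinel meaning "key not present in this commit"


def classify_conflicts(ancestor, master, dev):
    """Sort conflicting keys into the four conflict categories."""
    found = {'t1': [], 't21': [], 't22': [], 't3': []}
    for key in sorted(set(ancestor) | set(master) | set(dev)):
        a = ancestor.get(key, MISSING)
        m = master.get(key, MISSING)
        d = dev.get(key, MISSING)
        if m == d or m == a or d == a:
            continue  # at most one side changed, or both made the same change
        if a is MISSING:
            found['t1'].append(key)    # added in both, different values
        elif m is MISSING:
            found['t21'].append(key)   # removed in master, mutated in dev
        elif d is MISSING:
            found['t22'].append(key)   # removed in dev, mutated in master
        else:
            found['t3'].append(key)    # mutated in both, different values
    return found


ancestor = {'a': 1, 'b': 2, 'c': 3}
master = {'b': 20, 'c': 30, 'hello': 'foo'}   # removed a, mutated b and c, added hello
dev = {'a': 10, 'c': 31, 'hello': 'world'}    # mutated a, removed b, mutated c, added hello

print(classify_conflicts(ancestor, master, dev))
# {'t1': ['hello'], 't21': ['a'], 't22': ['b'], 't3': ['c']}
```

Keys changed on only one side, or changed identically on both, never appear in the result because they can be merged without ambiguity.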

Let’s make a merge conflict

To force a conflict, we are going to check out the "new" branch and set the metadata key "hello" to the value "foo conflict... BOO!". If we then try to merge "testbranch" (which set "hello" to a value of "world") into it, we will see how Hangar identifies the conflict and halts without making any changes.

Automated conflict resolution will be introduced in a future version of Hangar; for now, it is up to the user to manually resolve conflicts by making any necessary changes in each branch before reattempting a merge operation.

[54]:
co = repo.checkout(write=True, branch='new')
[55]:
co.metadata['hello'] = 'foo conflict... BOO!'
[56]:
co.commit('commit on new branch to hello metadata key so we can demonstrate a conflict')
[56]:
'0a0c4dbcfe63ce10fd2a87a98b785ce03099b09e'
[57]:
repo.log()
* 0a0c4dbcfe63ce10fd2a87a98b785ce03099b09e (new) : commit on new branch to hello metadata key so we can demonstrate a conflict
* 186f1ccae28ad8f58bcae95dd8c1115a3b0de9dd : commit on `new` branch adding a sample to dummy_arrayset
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 : first commit with a single sample added to a dummy arrayset

When we attempt the merge, an exception is thrown telling us there is a conflict!

[58]:
co.merge(message='this merge should not happen', dev_branch='testbranch')
HANGAR VALUE ERROR:: Merge ABORTED with conflict: {'aset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False), 'meta': ConflictRecords(t1=(MetadataRecordKey(meta_name='hello'),), t21=(), t22=(), t3=(), conflict=True), 'sample': {'dummy_arrayset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False)}, 'conflict_found': True}
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-58-1a98dce1852b> in <module>
----> 1 co.merge(message='this merge should not happen', dev_branch='testbranch')

~/projects/tensorwerk/hangar/hangar-py/src/hangar/checkout.py in merge(self, message, dev_branch)
    468             dev_branch_name=dev_branch,
    469             repo_path=self._repo_path,
--> 470             writer_uuid=self._writer_lock)
    471
    472         for asetHandle in self._arraysets.values():

~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in select_merge_algorithm(message, branchenv, stageenv, refenv, stagehashenv, master_branch_name, dev_branch_name, repo_path, writer_uuid)
    131
    132     except ValueError as e:
--> 133         raise e from None
    134
    135     finally:

~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in select_merge_algorithm(message, branchenv, stageenv, refenv, stagehashenv, master_branch_name, dev_branch_name, repo_path, writer_uuid)
    128                 refenv=refenv,
    129                 stagehashenv=stagehashenv,
--> 130                 repo_path=repo_path)
    131
    132     except ValueError as e:

~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in _three_way_merge(message, master_branch_name, masterHEAD, dev_branch_name, devHEAD, ancestorHEAD, branchenv, stageenv, refenv, stagehashenv, repo_path)
    256     except ValueError as e:
    257         logger.error(e, exc_info=False)
--> 258         raise e from None
    259
    260     fmtCont = _merge_dict_to_lmdb_tuples(patchedRecs=mergeContents)

~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in _three_way_merge(message, master_branch_name, masterHEAD, dev_branch_name, devHEAD, ancestorHEAD, branchenv, stageenv, refenv, stagehashenv, repo_path)
    253
    254     try:
--> 255         mergeContents = _compute_merge_results(a_cont=aCont, m_cont=mCont, d_cont=dCont)
    256     except ValueError as e:
    257         logger.error(e, exc_info=False)

~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in _compute_merge_results(a_cont, m_cont, d_cont)
    350     if confs['conflict_found'] is True:
    351         msg = f'HANGAR VALUE ERROR:: Merge ABORTED with conflict: {confs}'
--> 352         raise ValueError(msg) from None
    353
    354     # merging: arrayset schemas

ValueError: HANGAR VALUE ERROR:: Merge ABORTED with conflict: {'aset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False), 'meta': ConflictRecords(t1=(MetadataRecordKey(meta_name='hello'),), t21=(), t22=(), t3=(), conflict=True), 'sample': {'dummy_arrayset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False)}, 'conflict_found': True}

Checking for Conflicts

Alternatively, use the diff methods on a checkout to test for conflicts before attempting a merge.

[59]:
merge_results, conflicts_found = co.diff.branch('testbranch')
[60]:
conflicts_found
[60]:
{'aset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False),
 'meta': ConflictRecords(t1=(MetadataRecordKey(meta_name='hello'),), t21=(), t22=(), t3=(), conflict=True),
 'sample': {'dummy_arrayset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False)},
 'conflict_found': True}
[61]:
conflicts_found['meta']
[61]:
ConflictRecords(t1=(MetadataRecordKey(meta_name='hello'),), t21=(), t22=(), t3=(), conflict=True)
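
Putting the two calls together, a guarded merge might look like the sketch below. It uses only the co.diff.branch() and co.merge() calls as they appear in this tutorial, and assumes the second element returned by diff.branch() is a mapping with a 'conflict_found' entry, as in the output above; the helper name merge_if_clean is hypothetical, and other Hangar versions may return a differently shaped result.

```python
def merge_if_clean(co, dev_branch, message):
    """Merge `dev_branch` into the checkout's branch only if the diff reports no conflicts.

    Returns the merge commit hash, or None when conflicts were found.
    """
    _, conflicts = co.diff.branch(dev_branch)
    if conflicts['conflict_found']:
        return None
    return co.merge(message=message, dev_branch=dev_branch)
```

Checking first avoids having to catch the ValueError raised by merge(), and leaves the caller free to inspect the conflicts and decide how to resolve them.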

The type codes for a ConflictRecords namedtuple such as the one we saw:

ConflictRecords(t1=('hello',), t21=(), t22=(), t3=(), conflict=True)

are as follows:

  • t1: Addition of key in master AND dev with different values.
  • t21: Removed key in master, mutated value in dev.
  • t22: Removed key in dev, mutated value in master.
  • t3: Mutated key in both master AND dev to different values.
  • conflict: Bool indicating if any type of conflict is present.
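
For reporting purposes, the type codes can be turned into readable messages. The sketch below defines a stand-in namedtuple with the same field names as the record printed above (the real class lives inside Hangar and is not imported here), and a hypothetical helper named describe that lists each conflicting key with an explanation.

```python
from collections import namedtuple

# Stand-in with the same field names as the record shown above (not Hangar's class)
ConflictRecords = namedtuple('ConflictRecords', ['t1', 't21', 't22', 't3', 'conflict'])

DESCRIPTIONS = {
    't1': 'added in master and dev with different values',
    't21': 'removed in master, mutated in dev',
    't22': 'removed in dev, mutated in master',
    't3': 'mutated in master and dev to different values',
}


def describe(record):
    """Return one human-readable line per conflicting key in `record`."""
    lines = []
    for code in ('t1', 't21', 't22', 't3'):
        for key in getattr(record, code):
            lines.append(f'{key}: {DESCRIPTIONS[code]}')
    return lines


rec = ConflictRecords(t1=('hello',), t21=(), t22=(), t3=(), conflict=True)
print(describe(rec))  # ['hello: added in master and dev with different values']
```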

To resolve, remove the conflict

[62]:
del co.metadata['hello']
co.metadata['resolved'] = 'conflict by removing hello key'
co.commit('commit which removes conflicting metadata key')
[62]:
'9af80ed5df5d893b5e918f1a060cce4c46d9ddec'
[63]:
co.merge(message='this merge succeeds as it no longer has a conflict', dev_branch='testbranch')
[63]:
'b3b097d069f351e5b4688f1ebf30ae1a5aa94f4a'

We can verify that history looks as we would expect via the log!

[64]:
repo.log()
*   b3b097d069f351e5b4688f1ebf30ae1a5aa94f4a (new) : this merge succeeds as it no longer has a conflict
|\
* | 9af80ed5df5d893b5e918f1a060cce4c46d9ddec : commit which removes conflicting metadata key
* | 0a0c4dbcfe63ce10fd2a87a98b785ce03099b09e : commit on new branch to hello metadata key so we can demonstrate a conflict
| * 836ba8ff1fe552fb65944e2340b2a2ef2b2b62d4 (testbranch) : added hellow world metadata
| * 2fe5c53a899ba6accbe8c19debd9a489e3baeaed : mutated sample `0` of `dummy_arrayset` to new value
* | 186f1ccae28ad8f58bcae95dd8c1115a3b0de9dd : commit on `new` branch adding a sample to dummy_arrayset
|/
* 0fd892b7ce9d9d0150c68bb5483876d58c28cbf1 : first commit with a single sample added to a dummy arrayset