Part 3: Working With Remote Servers¶

This tutorial will introduce how to start a remote Hangar server, and how to work with remotes from the client side.

Particular attention is paid to *partial fetch* / *partial clone* operations. These are a key component of the Hangar design: they make it possible to work quickly and efficiently with data in remote repositories whose full size would be prohibitive for local use under most circumstances.

Note:

At the time of writing, the API, user-facing functionality, client-server negotiation protocols, and test coverage of the remotes implementation are generally adequate for this to serve as an “alpha” quality preview. However, please be warned that significantly less time has been spent on this module to optimize speed, refactor for simplicity, and assure stability under heavy loads than on the rest of the Hangar core. While we can guarantee that your data is secure on disk, you may experience crashes from time to time when working with remotes. In addition, sending data over the wire should NOT be considered secure in ANY way. No in-transit encryption, user authentication, or secure access limitations are implemented at this moment. We realize the importance of these types of protections, and they are on our radar for the next release cycle. If you are interested in making a contribution to Hangar, this module contains a lot of low-hanging fruit which would provide drastic improvements and act as a good intro to the internal Hangar data model. Please get in touch with us to discuss!

Starting a Hangar Server¶

To start a Hangar server, open a terminal and execute:

$ hangar server

This will get a local server instance running at localhost:50051. The IP and port can be configured by setting the --ip and --port flags to the desired values at the command line.

A blocking process will begin in that terminal session. Leave it running while you experiment with connecting from a client repo.
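Before pointing a client at the server, it can help to confirm that something is actually listening on the chosen address. The following is a minimal sketch using only the Python standard library. The names `parse_address` and `is_reachable` are hypothetical helpers, not part of Hangar, and a successful TCP connection only shows that the port is open, not that a Hangar server is behind it.

```python
import socket


def parse_address(address):
    """Split a 'host:port' string such as 'localhost:50051' into (host, port)."""
    host, _, port = address.rpartition(':')
    if not host or not port.isdigit():
        raise ValueError(f'expected "host:port", got {address!r}')
    return host, int(port)


def is_reachable(address, timeout=1.0):
    """Return True if a TCP connection to `address` can be opened."""
    host, port = parse_address(address)
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the server from this section running, `is_reachable('localhost:50051')` would be expected to return `True`.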

Using Remotes with a Local Repository¶

The CLI is the easiest way to interact with the remote server from a local repository (though all functionality is mirrored via the repository API; more on that later).

Before we begin, we will set up a repository with some data, a few commits, and two branches.

Setup a Test Repo¶

As usual, we begin by creating a repository and adding some data. This should be familiar to you from previous tutorials.

[1]:

from hangar import Repository
import numpy as np
from tqdm import tqdm

testData = np.loadtxt('/Users/rick/projects/tensorwerk/hangar/dev/data/dota2Dataset/dota2Test.csv', delimiter=',', dtype=np.uint8)
trainData = np.loadtxt('/Users/rick/projects/tensorwerk/hangar/dev/data/dota2Dataset/dota2Train.csv', delimiter=',', dtype=np.uint16)

testName = 'test'
testPrototype = testData[0]
trainName = 'train'
trainPrototype = trainData[0]

[2]:

repo = Repository('/Users/rick/projects/tensorwerk/hangar/dev/intro/')
repo.init(user_name='Rick Izzo', user_email='rick@tensorwerk.com', remove_old=True)
co = repo.checkout(write=True)

Hangar Repo initialized at: /Users/rick/projects/tensorwerk/hangar/dev/intro/.hangar

[3]:

co.arraysets.init_arrayset(name=testName, prototype=testPrototype, named_samples=False)
testaset = co.arraysets[testName]

pbar = tqdm(total=testData.shape[0])
with testaset as ds:
    for gameIdx, gameData in enumerate(testData):
        if (gameIdx % 500 == 0):
            pbar.update(500)
        ds.add(gameData)
pbar.close()

co.commit('initial commit on master with test data')

repo.create_branch('add-train')
co.close()
repo.log()

10500it [00:00, 22604.66it/s]

* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f (add-train) (master) : initial commit on master with test data

[4]:

co = repo.checkout(write=True, branch='add-train')

co.arraysets.init_arrayset(name=trainName, prototype=trainPrototype, named_samples=False)
trainaset = co.arraysets[trainName]

pbar = tqdm(total=trainData.shape[0])
with trainaset as dt:
    for gameIdx, gameData in enumerate(trainData):
        if (gameIdx % 500 == 0):
            pbar.update(500)
        dt.add(gameData)
pbar.close()

co.commit('added training data on another branch')
co.close()
repo.log()

93000it [00:03, 23300.08it/s]

* 903fa337a6d1925f82a1700ad76f6c074eec8d7b (add-train) : added training data on another branch
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f (master) : initial commit on master with test data

[5]:

co = repo.checkout(write=True, branch='master')
co.metadata['earaea'] = 'eara'
co.commit('more changes here')
co.close()
repo.log()

* b119a4db817d9a4120593938ee4115402aa1405f (master) : more changes here
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

Pushing to a Remote¶

We will use the API to add a remote; this can also be done with the CLI command:

$ hangar remote add origin localhost:50051

[6]:

repo.remote.add('origin', 'localhost:50051')

[6]:

RemoteInfo(name='origin', address='localhost:50051')

Pushing takes a single command from the CLI or the API:

$ hangar push origin master

Push the “master” branch

[7]:

repo.remote.push('origin', 'master')

counting objects: 100%|██████████| 2/2 [00:00<00:00, 13.31it/s]
pushing schemas: 100%|██████████| 1/1 [00:00<00:00, 273.60it/s]
pushing data: 10295it [00:00, 27622.34it/s]
pushing metadata: 100%|██████████| 1/1 [00:00<00:00, 510.94it/s]
pushing commit refs: 100%|██████████| 2/2 [00:00<00:00, 36.52it/s]

[7]:

'master'

Push the “add-train” branch

[8]:

repo.remote.push('origin', 'add-train')

counting objects: 100%|██████████| 1/1 [00:00<00:00,  1.14it/s]
pushing schemas: 100%|██████████| 1/1 [00:00<00:00, 464.02it/s]
pushing data: 92651it [00:04, 6427.60it/s]
pushing metadata: 0it [00:00, ?it/s]
pushing commit refs: 100%|██████████| 1/1 [00:00<00:00,  3.68it/s]

[8]:

'add-train'

Details of the Negotiation Process¶

(The following details are not necessary to use the system, but may be of interest to some readers)

When we push data, we perform a negotiation with the server which basically goes like this:

Hi, I would like to push this branch. Do you have it?
If yes, what is the latest commit you have recorded on it?
- Is that the same commit I’m trying to push? If yes, abort.
- Is that a commit I don’t have? If yes, someone else has updated that branch, abort.
Here are the commit digests which are parents of my branch’s head. Which commits are you missing?
OK great, I’m going to scan through each of those commits to find the data hashes they contain. Tell me which ones you are missing.
Thanks, now I’ll send you all of the data corresponding to those hashes. It might be a lot of data, so we’ll handle this in batches so that if my connection cuts out, we can resume later.
Now that you have the data, I’m going to send the actual commit references for you to store. This isn’t much information, but be sure to verify that I’m not trying to pull any funny business and send you incorrect data.
Now that you’ve received everything and have verified that it matches what I told you it is, go ahead and make those commits I’ve pushed available as the HEAD of the branch I just sent. It’s some good work that others will want!

When we want to fetch updates to a branch, essentially the exact same thing happens in reverse. Instead of asking the server what it doesn’t have, we ask it what it does have, and then request the stuff that we are missing!
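The exchange above can be modelled with plain dictionaries and sets. This is only a toy sketch of the idea, not Hangar's wire protocol or implementation: the `client` / `server` dictionary layout and the `push_branch` function are assumptions made for illustration, and batching, compression, and hash verification of the payloads are left out.

```python
def push_branch(client, server, branch):
    """Toy push negotiation; returns (status, commits_sent, data_pieces_sent).

    Assumed layout (hypothetical):
      client['heads']   : branch name -> head commit digest
      client['history'] : branch name -> list of commit digests, oldest first
      client['commits'] : commit digest -> set of data hashes it references
      client['data']    : data hash -> payload
      server            : same 'heads', 'commits' and 'data' keys
    """
    head = client['heads'][branch]
    server_head = server['heads'].get(branch)

    # Does the server have the branch, and at which commit?
    if server_head == head:
        return ('up-to-date', 0, 0)
    if server_head is not None and server_head not in client['commits']:
        return ('rejected', 0, 0)  # someone else has updated the branch

    # Which commits in our history is the server missing?
    missing_commits = [c for c in client['history'][branch]
                       if c not in server['commits']]

    # Which data hashes referenced by those commits is the server missing?
    referenced = set()
    for digest in missing_commits:
        referenced |= client['commits'][digest]
    missing_hashes = referenced - set(server['data'])

    # Send the data (a real transfer would be batched and resumable).
    for data_hash in missing_hashes:
        server['data'][data_hash] = client['data'][data_hash]

    # Send the commit references; the server checks that it really holds
    # every piece of data they mention before accepting them.
    for digest in missing_commits:
        if not client['commits'][digest] <= set(server['data']):
            raise RuntimeError(f'commit {digest} references missing data')
        server['commits'][digest] = set(client['commits'][digest])

    # Finally, move the server's branch head.
    server['heads'][branch] = head
    return ('pushed', len(missing_commits), len(missing_hashes))
```

A fetch in this model is the same set difference computed in the other direction: the client asks which commits and hashes the server has that it lacks.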

Partial Fetching and Clones¶

Now we will introduce one of the most important and unique features of Hangar remotes: Partial fetch/clone of data!

There is a very real problem with keeping the full history of data: **it’s huge!** The size of the data can very easily exceed what fits on (most) contributors’ laptops or personal workstations. This section explains how Hangar can handle working with arraysets which are prohibitively large to download or store on a single machine.

As mentioned in High Performance From Simplicity, under the hood Hangar deals with “Data” and “Bookkeeping” completely separately. We’ve previously covered what exactly we mean by Data in How Hangar Thinks About Data, so we’ll briefly cover the second major component of Hangar here. In short, “Bookkeeping” describes everything about the repository. By everything, we do mean that the Bookkeeping records describe everything: all commits, parents, branches, arraysets, samples, data descriptors, schemas, commit messages, etc. Though complete, these records are fairly small (tens of MB for reasonably large repositories with a substantial history), and are highly compressed for fast transfer between a Hangar client and server.

A brief technical interlude

There is one very important (and rather complex) property which gives Hangar Bookkeeping massive power: the existence of a piece of data is always known to Hangar and stored immutably once committed. However, the access pattern, backend, and locating information for this piece of data may (and over time, will) be unique in every Hangar repository instance.

Though the details of how this works are well beyond the scope of this document, the following example may provide some insight into the implications of this property:

If you clone some Hangar repository, Bookkeeping says that “some number of data pieces exist” and that they should be retrieved from the server. However, the bookkeeping records transferred in a fetch / push / clone operation do not include information about where that piece of data existed on the client (or server) computer. Two synced repositories can use completely different backends to store the data, in completely different locations, and it does not matter - Hangar only guarantees that when collaborators ask for a data sample in some checkout, they will be provided with identical arrays, not that the arrays will come from the same place or be stored in the same way. Only when data is actually retrieved is the “locating information” set for that repository instance.

Because Hangar makes no assumptions about how/where it should retrieve some piece of data, or even an assumption that it exists on the local machine, and because records are small and completely describe history, once a machine has the Bookkeeping, it can decide what data it actually wants to materialize on its local disk! These partial fetch / partial clone operations can materialize any desired data, whether it be a few records at the head of a branch, all data in a commit, or the entire historical data. A future release will even include the ability to stream data directly to a Hangar checkout and materialize the data in memory without having to save it to disk at all!

More importantly: since Bookkeeping describes all history, merging can be performed between branches which may contain partial (or even no) actual data. In other words, you don’t need data on disk to merge changes into it. It’s an odd concept, and it is demonstrated later in this tutorial.
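To make the split between “Bookkeeping” and “Data” concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not Hangar's storage format: the `ToyRepo` class, its method names, and the use of SHA-1 as the content digest are all invented for illustration. Records map sample keys to digests, while the payloads live in a separate store that may be empty or only partly filled.

```python
import hashlib


def digest(payload: bytes) -> str:
    """Content digest of a payload (SHA-1 is an assumption for this sketch)."""
    return hashlib.sha1(payload).hexdigest()


class ToyRepo:
    def __init__(self):
        self.records = {}  # commit name -> {sample key: data digest}
        self.store = {}    # data digest -> payload materialized locally

    def commit(self, name, samples):
        """Record a commit and store the payloads it references."""
        record = {}
        for key, payload in samples.items():
            d = digest(payload)
            self.store[d] = payload
            record[key] = d
        self.records[name] = record

    def clone_records(self):
        """'Partial clone': copy the bookkeeping only, none of the data."""
        other = ToyRepo()
        other.records = {n: dict(r) for n, r in self.records.items()}
        return other

    def fetch_data(self, remote, name, keys=None):
        """Materialize data for one commit, optionally only for some keys."""
        record = self.records[name]
        wanted = record if keys is None else {k: record[k] for k in keys}
        for d in wanted.values():
            self.store[d] = remote.store[d]

    def get(self, name, key):
        d = self.records[name][key]
        if d not in self.store:
            raise FileNotFoundError(f'data {d} does not exist on this machine')
        return self.store[d]
```

In this model a clone knows that every sample exists (the digests are present in `records`) but can only read the ones whose payloads have been fetched into its own `store`.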

Cloning a Remote Repo¶

$ hangar clone localhost:50051

[9]:

cloneRepo = Repository('/Users/rick/projects/tensorwerk/hangar/dev/dota-clone/')

When we perform the initial clone, we will only receive the “master” branch by default.

[10]:

cloneRepo.clone('rick izzo', 'rick@tensorwerk.com', 'localhost:50051', remove_old=True)

fetching commit data refs:  50%|█████     | 1/2 [00:00<00:00,  7.50it/s]

Hangar Repo initialized at: /Users/rick/projects/tensorwerk/hangar/dev/dota-clone/.hangar

fetching commit data refs: 100%|██████████| 2/2 [00:00<00:00,  8.84it/s]
fetching commit spec: 100%|██████████| 2/2 [00:00<00:00, 26.22it/s]

Hard reset requested with writer_lock: 893d3a43-7f95-44e4-9fed-72feb3cf49df

[10]:

'master'

[11]:

cloneRepo.log()

* b119a4db817d9a4120593938ee4115402aa1405f (master) (origin/master) : more changes here
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

[12]:

cloneRepo.list_branches()

[12]:

['master', 'origin/master']

To get the “add-train” branch, we fetch it from the remote

[13]:

cloneRepo.remote.fetch('origin', 'add-train')

fetching commit data refs: 100%|██████████| 1/1 [00:01<00:00,  1.02s/it]
fetching commit spec: 100%|██████████| 1/1 [00:00<00:00,  3.69it/s]

[13]:

'origin/add-train'

[14]:

cloneRepo.list_branches()

[14]:

['master', 'origin/add-train', 'origin/master']

[15]:

cloneRepo.log(branch='origin/add-train')

* 903fa337a6d1925f82a1700ad76f6c074eec8d7b (origin/add-train) : added training data on another branch
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

We will create a local branch from the origin/add-train branch, just like in Git

[16]:

cloneRepo.create_branch('add-train', '903fa337a6d1925f82a1700ad76f6c074eec8d7b')

[16]:

'add-train'

[17]:

cloneRepo.list_branches()

[17]:

['add-train', 'master', 'origin/add-train', 'origin/master']

[18]:

cloneRepo.log(branch='add-train')

* 903fa337a6d1925f82a1700ad76f6c074eec8d7b (add-train) (origin/add-train) : added training data on another branch
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data
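The branch listings above follow a simple naming pattern: remote-tracking branches carry an `origin/` prefix, local branches do not. A small helper can use that pattern to spot fetched branches that have no local counterpart yet. This is a minimal sketch operating on a plain list of names such as the one returned by `list_branches()`; the function name `untracked_remote_branches` is hypothetical and not part of Hangar.

```python
def untracked_remote_branches(branches, remote='origin'):
    """Names of `remote/<name>` branches that have no local `<name>` branch."""
    prefix = f'{remote}/'
    # Anything without the remote prefix is treated as a local branch.
    local = {b for b in branches if not b.startswith(prefix)}
    return sorted(
        b[len(prefix):]
        for b in branches
        if b.startswith(prefix) and b[len(prefix):] not in local
    )
```

Applied to the listing printed right after the fetch, this reports `add-train` as the only branch still lacking a local copy; after `create_branch` it reports nothing.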

Checking out a Partial Clone/Fetch¶

When we fetch/clone, the transfers are very quick, because only the commit records/history are retrieved. The data itself is not sent, because transferring all data across the entire history may be prohibitively large.

When you check out a commit with partial data, you will be shown a warning indicating that some data is not available locally. An error is raised if you try to access that particular sample data. Otherwise, everything will appear as normal.

[19]:

co = cloneRepo.checkout(branch='master')

 * Checking out BRANCH: master with current HEAD: b119a4db817d9a4120593938ee4115402aa1405f

/Users/rick/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py:115: UserWarning: Arrayset: test contains `reference-only` samples, with actual data residing on a remote server. A `fetch-data` operation is required to access these samples.
  f'operation is required to access these samples.', UserWarning)

[20]:

co

[20]:

Hangar ReaderCheckout
    Writer       : False
    Commit Hash  : b119a4db817d9a4120593938ee4115402aa1405f
    Num Arraysets : 1
    Num Metadata : 1

We can see from the repr that the arraysets contain partial remote references.

[21]:

co.arraysets

[21]:

Hangar Arraysets
    Writeable: False
    Arrayset Names / Partial Remote References:
      - test / True

[22]:

co.arraysets['test']

[22]:

Hangar ArraysetDataReader
    Arrayset Name             : test
    Schema Hash              : 2bd5a5720bc3
    Variable Shape           : False
    (max) Shape              : (117,)
    Datatype                 : <class 'numpy.uint8'>
    Named Samples            : False
    Access Mode              : r
    Number of Samples        : 10294
    Partial Remote Data Refs : True

[23]:

testKey = next(co.arraysets['test'].keys())

[24]:

co.arraysets['test'][testKey]

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-24-7900b0d54ebc> in <module>
----> 1 co.arraysets['test'][testKey]

~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in __getitem__(self, key)
    141             sample array data corresponding to the provided key
    142         """
--> 143         return self.get(key)
    144
    145     def __iter__(self) -> Iterator[Union[str, int]]:

~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in get(self, name)
    350         try:
    351             spec = self._sspecs[name]
--> 352             data = self._fs[spec.backend].read_data(spec)
    353             return data
    354         except KeyError:

~/projects/tensorwerk/hangar/hangar-py/src/hangar/backends/remote_50.py in read_data(self, hashVal)
    134     def read_data(self, hashVal: REMOTE_50_DataHashSpec) -> None:
    135         raise FileNotFoundError(
--> 136             f'data hash spec: {REMOTE_50_DataHashSpec} does not exist on this machine. '
    137             f'Perform a `data-fetch` operation to retrieve it from the remote server.')
    138

FileNotFoundError: data hash spec: <class 'hangar.backends.remote_50.REMOTE_50_DataHashSpec'> does not exist on this machine. Perform a `data-fetch` operation to retrieve it from the remote server.

[25]:

co.close()

Fetching Data from a Remote¶

To retrieve the data, we use the fetch_data operation (accessible via the API or fetch-data via the CLI).

The amount / type of data to retrieve is extremely configurable via the following options:

Retrieve the data for some commit which exists in a `partial` state.

    Parameters
    ----------
    remote : str
        name of the remote to pull the data from
    branch : str, optional
        The name of a branch whose HEAD will be used as the data fetch
        point. If None, ``commit`` argument expected, by default None
    commit : str, optional
        Commit hash to retrieve data for, If None, ``branch`` argument
        expected, by default None
    arrayset_names : Optional[Sequence[str]]
        Names of the arraysets which should be retrieved for the particular
        commits, any arraysets not named will not have their data fetched
        from the server. Default behavior is to retrieve all arraysets
    max_num_bytes : Optional[int]
        If you wish to limit the amount of data sent to the local machine,
        set a `max_num_bytes` parameter. This will retrieve only this
        amount of data from the server to be placed on the local disk.
        Default is to retrieve all data regardless of how large.
    retrieve_all_history : Optional[bool]
        if data should be retrieved for all history accessible by the parents
        of this commit HEAD. by default False

    Returns
    -------
    List[str]
        commit hashes of the data which was returned.

This will retrieve all the data on the “master” branch, but not on the “add-train” branch

[26]:

cloneRepo.remote.fetch_data('origin', branch='master')

counting objects: 100%|██████████| 1/1 [00:00<00:00, 39.35it/s]
fetching data: 100%|██████████| 10294/10294 [00:00<00:00, 17452.01it/s]

[26]:

['b119a4db817d9a4120593938ee4115402aa1405f']
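The options listed in the docstring above can be combined to limit what a fetch materializes. Below is a small sketch of a helper that assembles the keyword arguments and checks them first. The helper `fetch_data_kwargs` is hypothetical, and the rule that exactly one of `branch` / `commit` must be given is an assumption read from the docstring, not verified Hangar behavior. The keyword names themselves are taken from the docstring.

```python
def fetch_data_kwargs(branch=None, commit=None, arrayset_names=None,
                      max_num_bytes=None, retrieve_all_history=False):
    """Build keyword arguments for `repo.remote.fetch_data(remote, **kwargs)`."""
    # Assumption: the fetch point is given by a branch OR a commit, not both.
    if (branch is None) == (commit is None):
        raise ValueError('specify exactly one of `branch` or `commit`')
    if max_num_bytes is not None and max_num_bytes <= 0:
        raise ValueError('`max_num_bytes` must be a positive integer')

    kwargs = {'branch': branch} if branch is not None else {'commit': commit}
    if arrayset_names is not None:
        kwargs['arrayset_names'] = list(arrayset_names)
    if max_num_bytes is not None:
        kwargs['max_num_bytes'] = int(max_num_bytes)
    if retrieve_all_history:
        kwargs['retrieve_all_history'] = True
    return kwargs


# Only the "train" arrayset at the head of "add-train", capped at 10 MiB.
kwargs = fetch_data_kwargs(branch='add-train',
                           arrayset_names=['train'],
                           max_num_bytes=10 * 1024 ** 2)
```

These would then be passed along as `cloneRepo.remote.fetch_data('origin', **kwargs)`.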

[27]:

co = cloneRepo.checkout(branch='master')

 * Checking out BRANCH: master with current HEAD: b119a4db817d9a4120593938ee4115402aa1405f

[28]:

co

[28]:

Hangar ReaderCheckout
    Writer       : False
    Commit Hash  : b119a4db817d9a4120593938ee4115402aa1405f
    Num Arraysets : 1
    Num Metadata : 1

Unlike before, the repr shows that there are no partial references.

[29]:

co.arraysets

[29]:

Hangar Arraysets
    Writeable: False
    Arrayset Names / Partial Remote References:
      - test / False

[30]:

co.arraysets['test']

[30]:

Hangar ArraysetDataReader
    Arrayset Name             : test
    Schema Hash              : 2bd5a5720bc3
    Variable Shape           : False
    (max) Shape              : (117,)
    Datatype                 : <class 'numpy.uint8'>
    Named Samples            : False
    Access Mode              : r
    Number of Samples        : 10294
    Partial Remote Data Refs : False

*When we access the data this time, it is available and retrieved as requested!*

[31]:

co['test', testKey]

[31]:

array([255, 223,   8,   2,   0, 255,   0,   0,   0,   0,   0,   0,   1,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   1,   0,   0,   0, 255,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   1,   0,   0,   0, 255,   0,   0,
         0,   0,   0,   0,   0, 255,   0,   0,   0,   0,   1,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0],
      dtype=uint8)

[32]:

co.close()

Working with mixed local/remote checkout Data¶

If we were to check out the “add-train” branch now, we would see that there is no data for the "train" arrayset, but there will be data common to the ancestor that “master” and “add-train” share.

[33]:

cloneRepo.log('add-train')

* 903fa337a6d1925f82a1700ad76f6c074eec8d7b (add-train) (origin/add-train) : added training data on another branch
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

In this case, the common ancestor is commit: 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f
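Finding a common ancestor only requires the commit parent links, which are part of the bookkeeping records. The sketch below walks a hand-written parent map for the history in this tutorial. The digests are abbreviated to seven characters for readability, and `ancestors` / `common_ancestors` are illustrative names, not Hangar functions.

```python
from collections import deque

# Parent links for the two branches in the clone (abbreviated digests).
parents = {
    '9b93b39': [],           # initial commit on master with test data
    'b119a4d': ['9b93b39'],  # more changes here (master)
    '903fa33': ['9b93b39'],  # added training data on another branch (add-train)
}


def ancestors(parents, commit):
    """All commits reachable from `commit`, including the commit itself."""
    seen = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents.get(current, ()))
    return seen


def common_ancestors(parents, a, b):
    """Commits that appear in the history of both `a` and `b`."""
    return ancestors(parents, a) & ancestors(parents, b)


shared = common_ancestors(parents, 'b119a4d', '903fa33')
```

Here `shared` contains only the initial commit, which is why the "test" data fetched for “master” is also readable from “add-train”.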

To show that the "train" data is not available locally on the “add-train” branch:

[34]:

co = cloneRepo.checkout(branch='add-train')

 * Checking out BRANCH: add-train with current HEAD: 903fa337a6d1925f82a1700ad76f6c074eec8d7b

/Users/rick/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py:115: UserWarning: Arrayset: train contains `reference-only` samples, with actual data residing on a remote server. A `fetch-data` operation is required to access these samples.
  f'operation is required to access these samples.', UserWarning)

[35]:

co

[35]:

Hangar ReaderCheckout
    Writer       : False
    Commit Hash  : 903fa337a6d1925f82a1700ad76f6c074eec8d7b
    Num Arraysets : 2
    Num Metadata : 0

[36]:

co.arraysets

[36]:

Hangar Arraysets
    Writeable: False
    Arrayset Names / Partial Remote References:
      - test / False
      - train / True

[37]:

co['test', testKey]

[37]:

array([255, 223,   8,   2,   0, 255,   0,   0,   0,   0,   0,   0,   1,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   1,   0,   0,   0, 255,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   1,   0,   0,   0, 255,   0,   0,
         0,   0,   0,   0,   0, 255,   0,   0,   0,   0,   1,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0],
      dtype=uint8)

[38]:

trainKey = next(co.arraysets['train'].keys())

[39]:

co.arraysets['train'][trainKey]

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-39-c20187a5b311> in <module>
----> 1 co.arraysets['train'][trainKey]

~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in __getitem__(self, key)
    141             sample array data corresponding to the provided key
    142         """
--> 143         return self.get(key)
    144
    145     def __iter__(self) -> Iterator[Union[str, int]]:

~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in get(self, name)
    350         try:
    351             spec = self._sspecs[name]
--> 352             data = self._fs[spec.backend].read_data(spec)
    353             return data
    354         except KeyError:

~/projects/tensorwerk/hangar/hangar-py/src/hangar/backends/remote_50.py in read_data(self, hashVal)
    134     def read_data(self, hashVal: REMOTE_50_DataHashSpec) -> None:
    135         raise FileNotFoundError(
--> 136             f'data hash spec: {REMOTE_50_DataHashSpec} does not exist on this machine. '
    137             f'Perform a `data-fetch` operation to retrieve it from the remote server.')
    138

FileNotFoundError: data hash spec: <class 'hangar.backends.remote_50.REMOTE_50_DataHashSpec'> does not exist on this machine. Perform a `data-fetch` operation to retrieve it from the remote server.

[40]:

co.close()

Merging Branches with Partial Data¶

Even though we don’t have the actual data referenced in the "add-train" branch, it is still possible to merge the two branches!

This is possible because Hangar doesn’t use the data contents in its internal model of checkouts/commits, but instead thinks of a checkout as a sequence of arraysets/metadata/keys and their associated data hashes (which are very small text records, i.e. “bookkeeping”). To show this in action, let’s merge the two branches "master" (containing all data locally) and "add-train" (containing partial remote references for the "train" arrayset) together and push the result to the remote!

[41]:

cloneRepo.log('master')

* b119a4db817d9a4120593938ee4115402aa1405f (master) (origin/master) : more changes here
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

[42]:

cloneRepo.log('add-train')

* 903fa337a6d1925f82a1700ad76f6c074eec8d7b (add-train) (origin/add-train) : added training data on another branch
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

Perform the Merge

[43]:

cloneRepo.merge('merge commit here', 'master', 'add-train')

Selected 3-Way Merge Strategy

[43]:

'71f3bd864919c6e0c5ef95e2e8fb67102a0f94a2'

IT WORKED!

[44]:

cloneRepo.log()

*   71f3bd864919c6e0c5ef95e2e8fb67102a0f94a2 (master) : merge commit here
|\
* | b119a4db817d9a4120593938ee4115402aa1405f (origin/master) : more changes here
| * 903fa337a6d1925f82a1700ad76f6c074eec8d7b (add-train) (origin/add-train) : added training data on another branch
|/
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data
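Why a merge needs no data on disk becomes clearer when a commit is viewed as a mapping from keys to data hashes. The following is a generic three-way merge over such mappings, written as a sketch: it is not Hangar's merge implementation, and the key layout and hash strings are made up for illustration. Note that it only ever compares hashes and never opens a payload.

```python
def three_way_merge(base, ours, theirs):
    """Merge two {key: data_hash} records relative to their common ancestor."""
    merged = {}
    for key in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both sides agree (or both removed the key)
            value = o
        elif o == b:      # only their side changed it
            value = t
        elif t == b:      # only our side changed it
            value = o
        else:             # both sides changed it differently
            raise ValueError(f'conflict on {key!r}')
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical records: (arrayset or 'metadata', sample key) -> hash
base = {('test', 0): 'hash-a'}
master = {('test', 0): 'hash-a', ('metadata', 'earaea'): 'hash-m'}
add_train = {('test', 0): 'hash-a', ('train', 0): 'hash-t'}

merged = three_way_merge(base, master, add_train)
```

The result holds the "test" sample, the metadata added on “master”, and the "train" sample added on “add-train”, even though no "train" payload was ever consulted.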

We can inspect the summary of the master commit to confirm that the contents are what we expect (containing both the test and train arraysets).

[45]:

cloneRepo.summary()

Summary of Contents Contained in Data Repository

==================
| Repository Info
|-----------------
|  Base Directory: /Users/rick/projects/tensorwerk/hangar/dev/dota-clone
|  Disk Usage: 45.61 MB

===================
| Commit Details
-------------------
|  Commit: 71f3bd864919c6e0c5ef95e2e8fb67102a0f94a2
|  Created: Mon Aug 19 17:41:02 2019
|  By: rick izzo
|  Email: rick@tensorwerk.com
|  Message: merge commit here

==================
| DataSets
|-----------------
|  Number of Named Arraysets: 2
|
|  * Arrayset Name: test
|    Num Arrays: 10294
|    Details:
|    - schema_hash: 2bd5a5720bc3
|    - schema_dtype: 2
|    - schema_is_var: False
|    - schema_max_shape: (117,)
|    - schema_is_named: False
|    - schema_default_backend: 10
|
|  * Arrayset Name: train
|    Num Arrays: 92650
|    Details:
|    - schema_hash: ded1ae23f9af
|    - schema_dtype: 4
|    - schema_is_var: False
|    - schema_max_shape: (117,)
|    - schema_is_named: False
|    - schema_default_backend: 10

==================
| Metadata:
|-----------------
|  Number of Keys: 1

Pushing the Merge back to the Remote¶

To push this merge back to our original copy of the Repository (repo), we just push the "master" branch back to the remote via the API or CLI

[46]:

cloneRepo.remote.push('origin', 'master')

counting objects: 100%|██████████| 1/1 [00:00<00:00,  1.46it/s]
pushing schemas: 0it [00:00, ?it/s]
pushing data: 0it [00:00, ?it/s]
pushing metadata: 0it [00:00, ?it/s]
pushing commit refs: 100%|██████████| 1/1 [00:00<00:00,  3.85it/s]

[46]:

'master'

Looking at the current state of our other instance of the repo, "repo", we see that the merge changes have not yet propagated to it (since it hasn’t fetched from the remote yet).

[47]:

repo.log()

* b119a4db817d9a4120593938ee4115402aa1405f (master) (origin/master) : more changes here
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

To fetch the merged changes, just fetch the branch as normal. Like all fetches, this will be a fast operation, as it is a partial fetch which does not actually transfer the data.

[48]:

repo.remote.fetch('origin', 'master')

fetching commit data refs: 100%|██████████| 1/1 [00:00<00:00,  1.24it/s]
fetching commit spec: 100%|██████████| 1/1 [00:00<00:00,  3.62it/s]

[48]:

'origin/master'

[49]:

repo.log('origin/master')

*   71f3bd864919c6e0c5ef95e2e8fb67102a0f94a2 (origin/master) : merge commit here
|\
* | b119a4db817d9a4120593938ee4115402aa1405f (master) : more changes here
| * 903fa337a6d1925f82a1700ad76f6c074eec8d7b (add-train) (origin/add-train) : added training data on another branch
|/
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

Bringing our "master" branch up to date is a simple fast-forward merge.

[50]:

repo.merge('ff-merge', 'master', 'origin/master')

Selected Fast-Forward Merge Strategy

[50]:

'71f3bd864919c6e0c5ef95e2e8fb67102a0f94a2'
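The two merges in this tutorial reported different strategies: a 3-way merge in the clone, and a fast-forward here. The choice can be explained by ancestry alone. The sketch below is an illustration of that reasoning rather than Hangar's actual selection code; digests are abbreviated, and the `'up-to-date'` label is an invented name for the case where there is nothing to merge.

```python
# Parent links after the merge commit has been fetched (abbreviated digests).
parents = {
    '9b93b39': [],
    'b119a4d': ['9b93b39'],
    '903fa33': ['9b93b39'],
    '71f3bd8': ['b119a4d', '903fa33'],  # merge commit here
}


def is_ancestor(parents, older, newer):
    """True if `older` is reachable from `newer` by following parent links."""
    stack, seen = [newer], set()
    while stack:
        current = stack.pop()
        if current == older:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return False


def merge_strategy(parents, master_head, dev_head):
    """Pick a strategy for merging `dev_head` into `master_head`."""
    if is_ancestor(parents, dev_head, master_head):
        return 'up-to-date'    # master already contains dev
    if is_ancestor(parents, master_head, dev_head):
        return 'fast-forward'  # dev simply extends master
    return '3-way'             # the histories have diverged
```

Merging `origin/master` (`71f3bd8`) into the local `master` (`b119a4d`) is a fast-forward because the local head is an ancestor of the fetched one, whereas `b119a4d` and `903fa33` had diverged and needed a 3-way merge.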

[51]:

repo.log()

*   71f3bd864919c6e0c5ef95e2e8fb67102a0f94a2 (master) (origin/master) : merge commit here
|\
* | b119a4db817d9a4120593938ee4115402aa1405f : more changes here
| * 903fa337a6d1925f82a1700ad76f6c074eec8d7b (add-train) (origin/add-train) : added training data on another branch
|/
* 9b93b393e8852a1fa57f0170f54b30c2c0c7d90f : initial commit on master with test data

Everything is as it should be! Now, try it out for yourself!

[52]:

repo.summary()

Summary of Contents Contained in Data Repository

==================
| Repository Info
|-----------------
|  Base Directory: /Users/rick/projects/tensorwerk/hangar/dev/intro
|  Disk Usage: 79.98 MB

===================
| Commit Details
-------------------
|  Commit: 71f3bd864919c6e0c5ef95e2e8fb67102a0f94a2
|  Created: Mon Aug 19 17:41:02 2019
|  By: rick izzo
|  Email: rick@tensorwerk.com
|  Message: merge commit here

==================
| DataSets
|-----------------
|  Number of Named Arraysets: 2
|
|  * Arrayset Name: test
|    Num Arrays: 10294
|    Details:
|    - schema_hash: 2bd5a5720bc3
|    - schema_dtype: 2
|    - schema_is_var: False
|    - schema_max_shape: (117,)
|    - schema_is_named: False
|    - schema_default_backend: 10
|
|  * Arrayset Name: train
|    Num Arrays: 92650
|    Details:
|    - schema_hash: ded1ae23f9af
|    - schema_dtype: 4
|    - schema_is_var: False
|    - schema_max_shape: (117,)
|    - schema_is_named: False
|    - schema_default_backend: 10

==================
| Metadata:
|-----------------
|  Number of Keys: 1