Python API¶

This is the python API for the Hangar project.

Repository¶

class Repository(path: Union[str, pathlib.Path], exists: bool = True)¶

Launching point for all user operations in a Hangar repository.

All interaction, including the ability to initialize a repo, checkout a commit (for either reading or writing), create a branch, merge branches, or generally view the contents or state of the local repository starts here. Just provide this class instance with a path to an existing Hangar repository, or to a directory one should be initialized, and all required data for starting your work on the repo will automatically be populated.

>>> from hangar import Repository
>>> repo = Repository('foo/path/to/dir')

Parameters

path (Union[str, os.PathLike]) – local directory path where the Hangar repository exists (or initialized)
exists (bool, optional) –
True if a Hangar repository should exist at the given directory path. Should no Hangar repository exists at that location, a UserWarning will be raised indicating that the init() method needs to be called.

False if the provided path does not need to (but optionally can) contain a Hangar repository. if a Hangar repository does not exist at that path, the usual UserWarning will be suppressed.

In both cases, the path must exist and the user must have sufficient OS permissions to write to that location. Default = True

checkout(write: bool = False, *, branch: str = '', commit: str = '') → Union[hangar.checkout.ReaderCheckout, hangar.checkout.WriterCheckout]¶

Checkout the repo at some point in time in either read or write mode.

Only one writer instance can exist at a time. Write enabled checkout must must create a staging area from the HEAD commit of a branch. On the contrary, any number of reader checkouts can exist at the same time and can specify either a branch name or a commit hash.

Parameters

write (bool, optional) – Specify if the checkout is write capable, defaults to False
branch (str, optional) – name of the branch to checkout. This utilizes the state of the repo as it existed at the branch HEAD commit when this checkout object was instantiated, defaults to ‘’
commit (str, optional) – specific hash of a commit to use for the checkout (instead of a branch HEAD commit). This argument takes precedent over a branch name parameter if it is set. Note: this only will be used in non-writeable checkouts, defaults to ‘’

Raises

ValueError – If the value of write argument is not boolean
ValueError – If commit argument is set to any value when write=True. Only branch argument is allowed.

Returns

Checkout object which can be used to interact with the repository data

Return type

Union[ReaderCheckout, WriterCheckout]

clone(user_name: str, user_email: str, remote_address: str, *, remove_old: bool = False) → str ¶

Download a remote repository to the local disk.

The clone method implemented here is very similar to a git clone operation. This method will pull all commit records, history, and data which are parents of the remote’s master branch head commit. If a Repository exists at the specified directory, the operation will fail.

Parameters

user_name (str) – Name of the person who will make commits to the repository. This information is recorded permanently in the commit records.
user_email (str) – Email address of the repository user. This information is recorded permanently in any commits created.
remote_address (str) – location where the hangar.remote.server.HangarServer process is running and accessible by the clone user.
remove_old (bool, optional, kwarg only) – DANGER! DEVELOPMENT USE ONLY! If enabled, a hangar.repository.Repository existing on disk at the same path as the requested clone location will be completely removed and replaced with the newly cloned repo. (the default is False, which will not modify any contents on disk and which will refuse to create a repository at a given location if one already exists there.)

Returns

Name of the master branch for the newly cloned repository.

Return type

str

create_branch(name: str, base_commit: str = None) → hangar.records.heads.BranchHead¶

create a branch with the provided name from a certain commit.

If no base commit hash is specified, the current writer branch HEAD commit is used as the base_commit hash for the branch. Note that creating a branch does not actually create a checkout object for interaction with the data. to interact you must use the repository checkout method to properly initialize a read (or write) enabled checkout object.

>>> from hangar import Repository
>>> repo = Repository('foo/path/to/dir')

>>> repo.create_branch('testbranch')
    BranchHead(name='testbranch', digest='b66b...a8cc')
>>> repo.list_branches()
    ['master', 'testbranch']
>>> co = repo.checkout(write=True, branch='testbranch')
>>> # add data ...
>>> newDigest = co.commit('added some stuff')

>>> repo.create_branch('new-changes', base_commit=newDigest)
    BranchHead(name='new-changes', digest='35kd...3254')
>>> repo.list_branches()
    ['master', 'new-changes', 'testbranch']

Parameters

name (str) – name to assign to the new branch
base_commit (str, optional) – commit hash to start the branch root at. if not specified, the writer branch HEAD commit at the time of execution will be used, defaults to None

Returns

NamedTuple[str, str] with fields for name and digest of the branch created (if the operation was successful)

Return type

BranchHead

Raises

ValueError – If the branch name provided contains characters outside of alpha-numeric ascii characters and “.”, “_”, “-” (no whitespace), or is > 64 characters.
ValueError – If the branch already exists.
RuntimeError – If the repository does not have at-least one commit on the “default” (ie. master) branch.

diff(master: str, dev: str) → hangar.diff.DiffAndConflicts¶

Calculate diff between master and dev branch/commits.

Diff is calculated as if we are to merge “dev” into “master”

Parameters

master (str) – branch name or commit hash digest to use as the “master” which changes made in “dev” are compared to.
dev (str) – branch name or commit hash digest to use as the “dev” (ie. “feature”) branch which changes have been made to which are to be compared to the contents of “master”.

Returns

Standard output diff structure.

Return type

DiffAndConflicts

force_release_writer_lock() → bool ¶

Force release the lock left behind by an unclosed writer-checkout

Warning

NEVER USE THIS METHOD IF WRITER PROCESS IS CURRENTLY ACTIVE. At the time of writing, the implications of improper/malicious use of this are not understood, and there is a a risk of of undefined behavior or (potentially) data corruption.

At the moment, the responsibility to close a write-enabled checkout is placed entirely on the user. If the close() method is not called before the program terminates, a new checkout with write=True will fail. The lock can only be released via a call to this method.

Note

This entire mechanism is subject to review/replacement in the future.

Returns: if the operation was successful.
Return type: bool

init(user_name: str, user_email: str, *, remove_old: bool = False) → str ¶

Initialize a Hangar repository at the specified directory path.

This function must be called before a checkout can be performed.

Parameters

user_name (str) – Name of the repository user account.
user_email (str) – Email address of the repository user account.
remove_old (bool, kwarg-only) – DEVELOPER USE ONLY – remove and reinitialize a Hangar repository at the given path, Default = False

Returns

the full directory path where the Hangar repository was initialized on disk.

Return type

str

property initialized¶

Check if the repository has been initialized or not

Returns: True if repository has been initialized.
Return type: bool

list_branches() → List[str]¶

list all branch names created in the repository.

Returns: the branch names recorded in the repository
Return type: List[str]

log(branch: str = None, commit: str = None, *, return_contents: bool = False, show_time: bool = False, show_user: bool = False) → Optional[dict]¶

Displays a pretty printed commit log graph to the terminal.

Note

For programatic access, the return_contents value can be set to true which will retrieve relevant commit specifications as dictionary elements.

Parameters

branch (str, optional) – The name of the branch to start the log process from. (Default value = None)
commit (str, optional) – The commit hash to start the log process from. (Default value = None)
return_contents (bool, optional, kwarg only) – If true, return the commit graph specifications in a dictionary suitable for programatic access/evaluation.
show_time (bool, optional, kwarg only) – If true and return_contents is False, show the time of each commit on the printed log graph
show_user (bool, optional, kwarg only) – If true and return_contents is False, show the committer of each commit on the printed log graph

Returns

Dict containing the commit ancestor graph, and all specifications.

Return type

Optional[dict]

merge(message: str, master_branch: str, dev_branch: str) → str ¶

Perform a merge of the changes made on two branches.

Parameters

message (str) – Commit message to use for this merge.
master_branch (str) – name of the master branch to merge into
dev_branch (str) – name of the dev/feature branch to merge

Returns

Hash of the commit which is written if possible.

Return type

str

property path¶

Return the path to the repository on disk, read-only attribute

Returns: path to the specified repository, not including .hangar directory
Return type: str

property remote¶

Accessor to the methods controlling remote interactions.

See also

Remotes for available methods of this property

Returns: Accessor object methods for controlling remote interactions.
Return type: Remotes

remove_branch(name: str, *, force_delete: bool = False) → hangar.records.heads.BranchHead¶

Permanently delete a branch pointer from the repository history.

Since a branch (by definition) is the name associated with the HEAD commit of a historical path, the default behavior of this method is to throw an exception (no-op) should the HEAD not be referenced as an ancestor (or at least as a twin) of a separate branch which is currently ALIVE. If referenced in another branch’s history, we are assured that all changes have been merged and recorded, and that this pointer can be safely deleted without risk of damage to historical provenance or (eventual) loss to garbage collection.

>>> from hangar import Repository
>>> repo = Repository('foo/path/to/dir')

>>> repo.create_branch('first-testbranch')
BranchHead(name='first-testbranch', digest='9785...56da')
>>> repo.create_branch('second-testbranch')
BranchHead(name='second-testbranch', digest='9785...56da')
>>> repo.list_branches()
['master', 'first-testbranch', 'second-testbranch']
>>> # Make a commit to advance a branch
>>> co = repo.checkout(write=True, branch='first-testbranch')
>>> # add data ...
>>> co.commit('added some stuff')
'3l253la5hna3k3a553256nak35hq5q534kq35532'
>>> co.close()

>>> repo.remove_branch('second-testbranch')
BranchHead(name='second-testbranch', digest='9785...56da')

A user may manually specify to delete an un-merged branch, in which case the force_delete keyword-only argument should be set to True.

>>> # check out master and try to remove 'first-testbranch'
>>> co = repo.checkout(write=True, branch='master')
>>> co.close()

>>> repo.remove_branch('first-testbranch')
Traceback (most recent call last):
    ...
RuntimeError: ("The branch first-testbranch is not fully merged. "
"If you are sure you want to delete it, re-run with "
"force-remove parameter set.")
>>> # Now set the `force_delete` parameter
>>> repo.remove_branch('first-testbranch', force_delete=True)
BranchHead(name='first-testbranch', digest='9785...56da')

It is important to note that while this method will handle all safety checks, argument validation, and performs the operation to permanently delete a branch name/digest pointer, **no commit refs along the history will be deleted from the Hangar database*.* Most of the history contains commit refs which must be safe in other branch histories, and recent commits may have been used as the base for some new history. As such, even if some of the latest commits leading up to a deleted branch HEAD are orphaned (unreachable), the records (and all data added in those commits) will remain on the disk.

In the future, we intend to implement a garbage collector which will remove orphan commits which have not been modified for some set amount of time (probably on the order of a few months), but this is not implemented at the moment.

Should an accidental forced branch deletion occur, it is possible to recover and create a new branch head pointing to the same commit. If the commit digest of the removed branch HEAD is known, its as simple as specifying a name and the base_digest in the normal create_branch() method. If the digest is unknown, it will be a bit more work, but some of the developer facing introspection tools / routines could be used to either manually or (with minimal effort) programmatically find the orphan commit candidates. If you find yourself having accidentally deleted a branch, and must get it back, please reach out on the Github Issues page. We’ll gladly explain more in depth and walk you through the process in any way we can help!

Parameters

name (str) – name of the branch which should be deleted. This branch must exist, and cannot refer to a remote tracked branch (ie. origin/devbranch), please see exception descriptions for other parameters determining validity of argument
force_delete (bool, optional) – If True, remove the branch pointer even if the changes are un-merged in other branch histories. May result in orphaned commits which may be time-consuming to recover if needed, by default False

Returns

NamedTuple[str, str] with fields for name and digest of the branch pointer deleted.

Return type

BranchHead

Raises

ValueError – If a branch with the provided name does not exist locally
PermissionError – If removal of the branch would result in a repository with zero local branches.
PermissionError – If a write enabled checkout is holding the writer-lock at time of this call.
PermissionError – If the branch to be removed was the last used in a write-enabled checkout, and whose contents form the base of the staging area.
RuntimeError – If the branch has not been fully merged into other branch histories, and force_delete option is not True.

property size_human¶

Disk space used by the repository returned in human readable string.

>>> repo.size_human
'1.23 GB'
>>> print(type(repo.size_human))
<class 'str'>

Returns: disk space used by the repository formated in human readable text.
Return type: str

property size_nbytes¶

Disk space used by the repository returned in number of bytes.

>>> repo.size_nbytes
1234567890
>>> print(type(repo.size_nbytes))
<class 'int'>

Returns: number of bytes used by the repository on disk.
Return type: int

summary(*, branch: str = '', commit: str = '') → None ¶

Print a summary of the repository contents to the terminal

Parameters

branch (str, optional) – A specific branch name whose head commit will be used as the summary point (Default value = ‘’)
commit (str, optional) – A specific commit hash which should be used as the summary point. (Default value = ‘’)

verify_repo_integrity() → bool ¶

Verify the integrity of the repository data on disk.

Runs a full cryptographic verification of repository contents in order to ensure the integrity of all data and history recorded on disk.

Note

This proof may take a significant amount of time to run for repositories which:

store significant quantities of data on disk.
have a very large number of commits in their history.

As a brief explanation for why these are the driving factors behind processing time:

Every single piece of data in the repositories history must be read from disk, cryptographically hashed, and compared to the expected value. There is no exception to this rule; regardless of when a piece of data was added / removed from an column, or for how many (or how few) commits some sample exists in. The integrity of the commit tree at any point after some piece of data is added to the repo can only be validated if it - and all earlier data pieces - are proven to be intact and unchanged.

Note: This does not mean that the verification is repeatedly performed for every commit some piece of data is stored in. Each data piece is read from disk and verified only once, regardless of how many commits some piece of data is referenced in.
Each commit reference (defining names / contents of a commit) must be decompressed and parsed into a usable data structure. We scan across all data digests referenced in the commit and ensure that the corresponding data piece is known to hangar (and validated as unchanged). The commit refs (along with the corresponding user records, message, and parent map), are then re-serialized and cryptographically hashed for comparison to the expected value. While this process is fairly efficient for a single commit, it must be repeated for each commit in the repository history, and may take a non-trivial amount of time for repositories with thousands of commits.

While the two points above are the most time consuming operations, there are many more checks which are performed alongside them as part of the full verification run.

Returns: True if integrity verification is successful, otherwise False; in this case, a message describing the offending component will be printed to stdout.
Return type: bool

property version¶

Find the version of Hangar software the repository is written with

Returns: semantic version of major, minor, micro version of repo software version.
Return type: str

property writer_lock_held¶

Check if the writer lock is currently marked as held. Read-only attribute.

Returns: True is writer-lock is held, False if writer-lock is free.
Return type: bool

Remotes¶

class Remotes¶

Class which governs access to remote interactor objects.

Note

The remote-server implementation is under heavy development, and is likely to undergo changes in the Future. While we intend to ensure compatability between software versions of Hangar repositories written to disk, the API is likely to change. Please follow our process at: https://www.github.com/tensorwerk/hangar-py

add(name: str, address: str) → hangar.remotes.RemoteInfo¶

Add a remote to the repository accessible by name at address.

Parameters

name (str) – the name which should be used to refer to the remote server (ie: ‘origin’)
address (str) – the IP:PORT where the hangar server is running

Returns

Two-tuple containing (name, address) of the remote added to the client’s server list.

Return type

RemoteInfo

Raises

ValueError – If provided name contains any non ascii letter characters characters, or if the string is longer than 64 characters long.
ValueError – If a remote with the provided name is already listed on this client, No-Op. In order to update a remote server address, it must be removed and then re-added with the desired address.

fetch(remote: str, branch: str) → str ¶

Retrieve new commits made on a remote repository branch.

This is semantically identical to a git fetch command. Any new commits along the branch will be retrieved, but placed on an isolated branch to the local copy (ie. remote_name/branch_name). In order to unify histories, simply merge the remote branch into the local branch.

Parameters

remote (str) – name of the remote repository to fetch from (ie. origin)
branch (str) – name of the branch to fetch the commit references for.

Returns

Name of the branch which stores the retrieved commits.

Return type

str

fetch_data(remote: str, branch: str = None, commit: str = None, *, column_names: Optional[Sequence[str]] = None, max_num_bytes: int = None, retrieve_all_history: bool = False) → List[str]¶

Retrieve the data for some commit which exists in a partial state.

Parameters

remote (str) – name of the remote to pull the data from
branch (str, optional) – The name of a branch whose HEAD will be used as the data fetch point. If None, commit argument expected, by default None
commit (str, optional) – Commit hash to retrieve data for, If None, branch argument expected, by default None
column_names (Optional[Sequence[str]]) – Names of the columns which should be retrieved for the particular commits, any columns not named will not have their data fetched from the server. Default behavior is to retrieve all columns
max_num_bytes (Optional[int]) – If you wish to limit the amount of data sent to the local machine, set a max_num_bytes parameter. This will retrieve only this amount of data from the server to be placed on the local disk. Default is to retrieve all data regardless of how large.
retrieve_all_history (Optional[bool]) – if data should be retrieved for all history accessible by the parents of this commit HEAD. by default False

Returns

commit hashes of the data which was returned.

Return type

List[str]

Raises

ValueError – if branch and commit args are set simultaneously.
ValueError – if specified commit does not exist in the repository.
ValueError – if branch name does not exist in the repository.

fetch_data_sample(remote: str, column: str, samples: Union[str, int, Sequence[Union[str, int]], Sequence[Union[Tuple[Union[str, int], Union[str, int]], Tuple[Union[str, int]], str, int]]], branch: Optional[str] = None, commit: Optional[str] = None) → str ¶

Granular fetch data operation allowing selection of individual samples.

Warning

This is a specialized version of the fetch_data() method for use in specilized situations where some prior knowledge is known about the data. Most users should prefer fetch_data() over this version.

In some cases, it may be desireable to only perform a fetch data operation for some particular samples within a column (without needing to download any other data contained in the column). This method allows for the granular specification of keys to fetch in a certain column at the selected branch / commit time point.

Parameters

remote (str) – name of the remote server to pull data from
column (str) – name of the column which data is being fetched from.
samples (Union[KeyType, Sequence[KeyType],) –
Sequence[Union[Tuple[KeyType, KeyType], Tuple[KeyType], KeyType]]] Key, or sequence of sample keys to select.
- Flat column layouts should provide just a single key, or flat sequence of keys which will be fetched from the server. ie. sample1 OR [sample1, sample2, sample3, etc.]
- Nested column layouts can provide tuples specifying (sample, subsample) records to retrieve, tuples with an Ellipsis character in the subsample index (sample, …) (which will fetch all subsamples for the given sample), or can provide lone sample keys in the sequences sample (which will also fetch all subsamples listed under the sample) OR ANY COMBINATION of the above.
branch (Optional[str]) – branch head to operate on, either branch or commit argument must be passed, but NOT both. Default is None
commit (Optional[str]) – commit to operate on, either branch or commit argument must be passed, but NOT both.

Returns

On success, the commit hash which data was fetched into.

Return type

str

list_all() → List[hangar.remotes.RemoteInfo]¶

List all remote names and addresses recorded in the client’s repository.

Returns: list of namedtuple specifying (name, address) for each remote server recorded in the client repo.
Return type: List[RemoteInfo]

ping(name: str) → float ¶

Ping remote server and check the round trip time.

Parameters

name (str) – name of the remote server to ping

Returns

round trip time it took to ping the server after the connection was established and requested client configuration was retrieved

Return type

float

Raises

KeyError – If no remote with the provided name is recorded.
ConnectionError – If the remote server could not be reached.

push(remote: str, branch: str, *, username: str = '', password: str = '') → str ¶

push changes made on a local repository to a remote repository.

This method is semantically identical to a git push operation. Any local updates will be sent to the remote repository.

Note

The current implementation is not capable of performing a force push operation. As such, remote branches with diverged histories to the local repo must be retrieved, locally merged, then re-pushed. This feature will be added in the near future.

Parameters

remote (str) – name of the remote repository to make the push on.
branch (str) – Name of the branch to push to the remote. If the branch name does not exist on the remote, the it will be created
username (str, optional, kwarg-only) – credentials to use for authentication if repository push restrictions are enabled, by default ‘’.
password (str, optional, kwarg-only) – credentials to use for authentication if repository push restrictions are enabled, by default ‘’.

Returns

Name of the branch which was pushed

Return type

str

remove(name: str) → hangar.remotes.RemoteInfo¶

Remove a remote repository from the branch records

Parameters: name (str) – name of the remote to remove the reference to
Raises: KeyError – If a remote with the provided name does not exist
Returns: The channel address which was removed at the given remote name
Return type: str

Write Enabled Checkout¶

Checkout¶

class WriterCheckout¶

Checkout the repository at the head of a given branch for writing.

This is the entry point for all writing operations to the repository, the writer class records all interactions in a special "staging" area, which is based off the state of the repository as it existed at the HEAD commit of a branch.

>>> co = repo.checkout(write=True)
>>> co.branch_name
'master'
>>> co.commit_hash
'masterheadcommithash'
>>> co.close()

At the moment, only one instance of this class can write data to the staging area at a time. After the desired operations have been completed, it is crucial to call close() to release the writer lock. In addition, after any changes have been made to the staging area, the branch HEAD cannot be changed. In order to checkout another branch HEAD for writing, you must either commit() the changes, or perform a hard-reset of the staging area to the last commit via reset_staging_area().

In order to reduce the chance that the python interpreter is shut down without calling close(), which releases the writer lock - a common mistake during ipython / jupyter sessions - an atexit hook is registered to close(). If properly closed by the user, the hook is unregistered after completion with no ill effects. So long as a the process is NOT terminated via non-python SIGKILL, fatal internal python error, or or special os exit methods, cleanup will occur on interpreter shutdown and the writer lock will be released. If a non-handled termination method does occur, the force_release_writer_lock() method must be called manually when a new python process wishes to open the writer checkout.

__contains__(key)¶: Determine if some column name (key) exists in the checkout.

__getitem__(index)¶

Dictionary style access to columns and samples

Checkout object can be thought of as a “dataset” (“dset”) mapping a view of samples across columns.

>>> dset = repo.checkout(branch='master')
>>>
# Get an column contained in the checkout.
>>> dset['foo']
ColumnDataReader
>>>
# Get a specific sample from ``'foo'`` (returns a single array)
>>> dset['foo', '1']
np.array([1])
>>>
# Get multiple samples from ``'foo'`` (returns a list of arrays, in order
# of input keys)
>>> dset[['foo', '1'], ['foo', '2'],  ['foo', '324']]
[np.array([1]), np.ndarray([2]), np.ndarray([324])]
>>>
# Get sample from multiple columns, column/data returned is ordered
# in same manner as input of func.
>>> dset[['foo', '1'], ['bar', '1'],  ['baz', '1']]
[np.array([1]), np.ndarray([1, 1]), np.ndarray([1, 1, 1])]
>>>
# Get multiple samples from multiple columns            >>> keys = [(col, str(samp)) for samp in range(2) for col in ['foo', 'bar']]
>>> keys
[('foo', '0'), ('bar', '0'), ('foo', '1'), ('bar', '1')]
>>> dset[keys]
[np.array([1]), np.array([1, 1]), np.array([2]), np.array([2, 2])]

Arbitrary column layouts are supported by simply adding additional members to the keys for each piece of data. For example, getting data from a column with a nested layout:

>>> dset['nested_col', 'sample_1', 'subsample_0']
np.array([1, 0])
>>>
# a sample accessor object can be retrieved at will...
>>> dset['nested_col', 'sample_1']
<class 'FlatSubsampleReader'>(column_name='nested_col', sample_name='sample_1')
>>>
# to get all subsamples in a nested sample use the Ellipsis operator
>>> dset['nested_col', 'sample_1', ...]
{'subsample_0': np.array([1, 0]),
 'subsample_1': np.array([1, 1]),
 ...
 'subsample_n': np.array([1, 255])}

Retrieval of data from different column types can be mixed and combined as desired. For example, retrieving data from both flat and nested columns simultaneously:

>>> dset[('nested_col', 'sample_1', '0'), ('foo', '0')]
[np.array([1, 0]), np.array([0])]
>>> dset[('nested_col', 'sample_1', ...), ('foo', '0')]
[{'subsample_0': np.array([1, 0]), 'subsample_1': np.array([1, 1])},
 np.array([0])]
>>> dset[('foo', '0'), ('nested_col', 'sample_1')]
[np.array([0]),
 <class 'FlatSubsampleReader'>(column_name='nested_col', sample_name='sample_1')]

If a column or data key does not exist, then this method will raise a KeyError. As an alternative, missing keys can be gracefully handeled by calling get() instead. This method does not (by default) raise an error if a key is missing. Instead, a (configurable) default value is simply inserted in it’s place.

>>> dset['foo', 'DOES_NOT_EXIST']
-------------------------------------------------------------------
KeyError                           Traceback (most recent call last)
<ipython-input-40-731e6ea62fb8> in <module>
----> 1 res = co['foo', 'DOES_NOT_EXIST']
KeyError: 'DOES_NOT_EXIST'

Parameters

index –

column name, sample key(s) or sequence of list/tuple of column name, sample keys(s) which should be retrieved in the operation.

Please see detailed explanation above for full explanation of accepted argument format / result types.

Returns

Columns – single column parameter, no samples specified
Any – Single column specified, single sample key specified
List[Any] – arbitrary columns, multiple samples array data for each sample is returned in same order sample keys are received.

__iter__()¶: Iterate over column keys

__len__()¶: Returns number of columns in the checkout.

add_bytes_column(name: str, contains_subsamples: bool = False, *, backend: Optional[str] = None, backend_options: Optional[dict] = None)¶

Initializes a bytes container column

Columns are created in order to store some arbitrary collection of data pieces. In this case, we store bbytes data. Items need not be related to each-other in any direct capacity; the only criteria hangar requires is that all pieces of data stored in the column have a compatible schema with each-other (more on this below). Each piece of data is indexed by some key (either user defined or automatically generated depending on the user’s preferences). Both single level stores (sample keys mapping to data on disk) and nested stores (where some sample key maps to an arbitrary number of subsamples, in turn each pointing to some piece of store data on disk) are supported.

All data pieces within a column have the same data type. For bytes columns, there is no distinction between 'variable_shape' and 'fixed_shape' schema types. Values are allowed to take on a value of any size so long as the datatype and contents are valid for the schema definition.

Parameters

name (str) – Name assigned to the column
contains_subsamples (bool, optional) – True if the column column should store data in a nested structure. In this scheme, a sample key is used to index an arbitrary number of subsamples which map some (sub)key to a piece of data. If False, sample keys map directly to a single piece of data; essentially acting as a single level key/value store. By default, False.
backend (Optional[str], optional) – ADVANCED USERS ONLY, backend format code to use for column data. If None, automatically inferred and set based on data shape and type. by default None
backend_options (Optional[dict], optional) – ADVANCED USERS ONLY, filter opts to apply to column data. If None, automatically inferred and set based on data shape and type. by default None

Returns

instance object of the initialized column.

Return type

Columns

add_ndarray_column(name: str, shape: Optional[Union[int, tuple]] = None, dtype: Optional[numpy.dtype] = None, prototype: Optional[numpy.ndarray] = None, variable_shape: bool = False, contains_subsamples: bool = False, *, backend: Optional[str] = None, backend_options: Optional[dict] = None)¶

Initializes a numpy.ndarray container column.

Columns are created in order to store some arbitrary collection of data pieces. In this case, we store numpy.ndarray data. Items need not be related to each-other in any direct capacity; the only criteria hangar requires is that all pieces of data stored in the column have a compatible schema with each-other (more on this below). Each piece of data is indexed by some key (either user defined or automatically generated depending on the user’s preferences). Both single level stores (sample keys mapping to data on disk) and nested stores (where some sample key maps to an arbitrary number of subsamples, in turn each pointing to some piece of store data on disk) are supported.

All data pieces within a column have the same data type and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample. For fixed dimension sizes, all data pieces written to the column must have the same shape & size which was specified at the time the column column was initialized. Alternatively, variable sized columns can write data pieces with dimensions of any size (up to a specified maximum).

Parameters

name (str) – The name assigned to this column.
shape (Optional[Union[int, Tuple[int]]]) – The shape of the data samples which will be written in this column. This argument and the dtype argument are required if a prototype is not provided, defaults to None.
dtype (Optional[numpy.dtype]) – The datatype of this column. This argument and the shape argument are required if a prototype is not provided., defaults to None.
prototype (Optional[numpy.ndarray]) – A sample array of correct datatype and shape which will be used to initialize the column storage mechanisms. If this is provided, the shape and dtype arguments must not be set, defaults to None.
variable_shape (bool, optional) – If this is a variable sized column. If true, a the maximum shape is set from the provided shape or prototype argument. Any sample added to the column can then have dimension sizes <= to this initial specification (so long as they have the same rank as what was specified) defaults to False.
contains_subsamples (bool, optional) – True if the column column should store data in a nested structure. In this scheme, a sample key is used to index an arbitrary number of subsamples which map some (sub)key to some piece of data. If False, sample keys map directly to a single piece of data; essentially acting as a single level key/value store. By default, False.
backend (Optional[str], optional) – ADVANCED USERS ONLY, backend format code to use for column data. If None, automatically inferred and set based on data shape and type. by default None
backend_options (Optional[dict], optional) – ADVANCED USERS ONLY, filter opts to apply to column data. If None, automatically inferred and set based on data shape and type. by default None

Returns

instance object of the initialized column.

Return type

Columns

add_str_column(name: str, contains_subsamples: bool = False, *, backend: Optional[str] = None, backend_options: Optional[dict] = None)¶

Initializes a str container column

Columns are created in order to store some arbitrary collection of data pieces. In this case, we store str data. Items need not be related to each-other in any direct capacity; the only criteria hangar requires is that all pieces of data stored in the column have a compatible schema with each-other (more on this below). Each piece of data is indexed by some key (either user defined or automatically generated depending on the user’s preferences). Both single level stores (sample keys mapping to data on disk) and nested stores (where some sample key maps to an arbitrary number of subsamples, in turn each pointing to some piece of store data on disk) are supported.

All data pieces within a column have the same data type. For str columns, there is no distinction between 'variable_shape' and 'fixed_shape' schema types. Values are allowed to take on a value of any size so long as the datatype and contents are valid for the schema definition.

Parameters

name (str) – Name assigned to the column
contains_subsamples (bool, optional) – True if the column column should store data in a nested structure. In this scheme, a sample key is used to index an arbitrary number of subsamples which map some (sub)key to a piece of data. If False, sample keys map directly to a single piece of data; essentially acting as a single level key/value store. By default, False.
backend (Optional[str], optional) – ADVANCED USERS ONLY, backend format code to use for column data. If None, automatically inferred and set based on data shape and type. by default None
backend_options (Optional[dict], optional) – ADVANCED USERS ONLY, filter opts to apply to column data. If None, automatically inferred and set based on data shape and type. by default None

Returns

instance object of the initialized column.

Return type

Columns

property branch_name¶

Branch this write enabled checkout’s staging area was based on.

Returns: name of the branch whose commit HEAD changes are staged from.
Return type: str

close() → None ¶

Close all handles to the writer checkout and release the writer lock.

Failure to call this method after the writer checkout has been used will result in a lock being placed on the repository which will not allow any writes until it has been manually cleared.

property columns¶

Provides access to column interaction object.

Can be used to either return the columns accessor for all elements or a single column instance by using dictionary style indexing.

>>> co = repo.checkout(write=True)
>>> cols = co.columns
>>> len(cols)
0
>>> fooCol = co.add_ndarray_column('foo', shape=(10, 10), dtype=np.uint8)
>>> len(co.columns)
1
>>> len(co)
1
>>> list(co.columns.keys())
['foo']
>>> list(co.keys())
['foo']
>>> fooCol = co.columns['foo']
>>> fooCol.dtype
np.fooDtype
>>> fooCol = cols.get('foo')
>>> fooCol.dtype
np.fooDtype
>>> 'foo' in co.columns
True
>>> 'bar' in co.columns
False

See also

The class Columns contains all methods accessible by this property accessor

Returns: the columns object which behaves exactly like a columns accessor class but which can be invalidated when the writer lock is released.
Return type: Columns

commit(commit_message: str) → str ¶

Commit the changes made in the staging area on the checkout branch.

Parameters: commit_message (str, optional) – user proved message for a log of what was changed in this commit. Should a fast forward commit be possible, this will NOT be added to fast-forward HEAD.
Returns: The commit hash of the new commit.
Return type: str
Raises: RuntimeError – If no changes have been made in the staging area, no commit occurs.

property commit_hash¶

Commit hash which the staging area of branch_name is based on.

Returns: commit hash
Return type: str

property diff¶

Access the differ methods which are aware of any staged changes.

See also

The class hangar.diff.WriterUserDiff contains all methods accessible by this property accessor

Returns: weakref proxy to the differ object (and contained methods) which behaves exactly like the differ class but which can be invalidated when the writer lock is released.
Return type: WriterUserDiff

get(keys, default=None, except_missing=False)¶

View of sample data across columns gracefully handling missing sample keys.

Please see __getitem__() for full description. This method is identical with a single exception: if a sample key is not present in an column, this method will plane a null None value in it’s return slot rather than throwing a KeyError like the dict style access does.

Parameters

keys –
sequence of column name (and optionally) sample key(s) or sequence of list/tuple of column name, sample keys(s) which should be retrieved in the operation.

Please see detailed explanation in __getitem__() for full explanation of accepted argument format / result types.
default (Any, optional) – default value to insert in results for the case where some column name / sample key is not found, and the except_missing parameter is set to False.
except_missing (bool, optional) – If False, will not throw exceptions on missing sample key value. Will raise KeyError if True and missing key found.

Returns

Columns – single column parameter, no samples specified
Any – Single column specified, single sample key specified
List[Any] – arbitrary columns, multiple samples array data for each sample is returned in same order sample keys are received.

items()¶: Generator yielding tuple of (name, accessor object) of every column

keys()¶: Generator yielding the name (key) of every column

log(branch: str = None, commit: str = None, *, return_contents: bool = False, show_time: bool = False, show_user: bool = False) → Optional[dict]¶

Displays a pretty printed commit log graph to the terminal.

Note

For programatic access, the return_contents value can be set to true which will retrieve relevant commit specifications as dictionary elements.

if Neither branch nor commit arguments are supplied, the branch which is currently checked out for writing will be used as default.

Parameters

branch (str, optional) – The name of the branch to start the log process from. (Default value = None)
commit (str, optional) – The commit hash to start the log process from. (Default value = None)
return_contents (bool, optional, kwarg only) – If true, return the commit graph specifications in a dictionary suitable for programatic access/evaluation.
show_time (bool, optional, kwarg only) – If true and return_contents is False, show the time of each commit on the printed log graph
show_user (bool, optional, kwarg only) – If true and return_contents is False, show the committer of each commit on the printed log graph

Returns

Dict containing the commit ancestor graph, and all specifications.

Return type

Optional[dict]

merge(message: str, dev_branch: str) → str ¶

Merge the currently checked out commit with the provided branch name.

If a fast-forward merge is possible, it will be performed, and the commit message argument to this function will be ignored.

Parameters

message (str) – commit message to attach to a three-way merge
dev_branch (str) – name of the branch which should be merge into this branch (ie master)

Returns

commit hash of the new commit for the master branch this checkout was started from.

Return type

str

reset_staging_area(*, force=False) → str ¶

Perform a hard reset of the staging area to the last commit head.

After this operation completes, the writer checkout will automatically close in the typical fashion (any held references to :attr:column or :attr:metadata objects will finalize and destruct as normal), In order to perform any further operation, a new checkout needs to be opened.

Warning

This operation is IRREVERSIBLE. all records and data which are note stored in a previous commit will be permanently deleted.

Returns: Commit hash of the head which the staging area is reset to.
Return type: str
Raises: RuntimeError – If no changes have been made to the staging area, No-Op.

values()¶: Generator yielding accessor object of every column

Columns¶

class Columns¶

Common access patterns and initialization/removal of columns in a checkout.

This object is the entry point to all data stored in their individual columns. Each column contains a common schema which dictates the general shape, dtype, and access patters which the backends optimize access for. The methods contained within allow us to create, remove, query, and access these collections of common data pieces.

__contains__(key: str) → bool ¶

Determine if a column with a particular name is stored in the checkout

Parameters: key (str) – name of the column to check for
Returns: True if a column with the provided name exists in the checkout, otherwise False.
Return type: bool

__delitem__(key: str) → str ¶

Remove a column and all data records if write-enabled process.

Parameters: key (str) – Name of the column to remove from the repository. This will remove all records from the staging area (though the actual data and all records are still accessible) if they were previously committed.
Returns: If successful, the name of the removed column.
Return type: str
Raises: PermissionError – If any enclosed column is opened in a connection manager.

__getitem__(key: str) → Union[NestedSampleWriter, FlatSubsampleWriter, FlatSampleWriter]¶

Dict style access to return the column object with specified key/name.

Parameters: key (string) – name of the column object to get.
Returns: The object which is returned depends on the mode of checkout specified. If the column was checked out with write-enabled, return writer object, otherwise return read only object.
Return type: ModifierTypes

__len__() → int ¶: Get the number of column columns contained in the checkout.

property contains_remote_references¶

Dict of bool indicating data reference locality in each column.

Returns: For each column name key, boolean value where False indicates all samples in column exist locally, True if some reference remote sources.
Return type: Mapping[str, bool]

delete(column: str) → str ¶

Remove the column and all data contained within it.

Parameters

column (str) – name of the column to remove

Returns

name of the removed column

Return type

str

Raises

PermissionError – If any enclosed column is opened in a connection manager.
KeyError – If a column does not exist with the provided name

get(name: str) → Union[NestedSampleWriter, FlatSubsampleWriter, FlatSampleWriter]¶

Returns a column access object.

This can be used in lieu of the dictionary style access.

Parameters: name (str) – name of the column to return
Returns: ColumnData accessor (set to read or write mode as appropriate) which governs interaction with the data
Return type: ModifierTypes

property iswriteable¶: Bool indicating if this column object is write-enabled. Read-only attribute.

items() → Iterable[Tuple[str, Union[NestedSampleWriter, FlatSubsampleWriter, FlatSampleWriter]]]¶

Generator providing access to column_name, Columns

Yields: Iterable[Tuple[str, ModifierTypes]] – returns two tuple of all all column names/object pairs in the checkout.

keys() → List[str]¶

list all column keys (names) in the checkout

Returns: list of column names
Return type: List[str]

property remote_sample_keys¶

Determine columns samples names which reference remote sources.

Returns: dict where keys are column names and values are iterables of samples in the column containing remote references
Return type: Mapping[str, Iterable[Union[int, str]]]

values() → Iterable[Union[NestedSampleWriter, FlatSubsampleWriter, FlatSampleWriter]]¶

Yield all column object instances in the checkout.

Yields: Iterable[ModifierTypes] – Generator of ColumnData accessor objects (set to read or write mode as appropriate)

Flat Column Layout Container¶

class FlatSampleWriter¶

__contains__(key: Union[str, int]) → bool ¶: Determine if a key is a valid sample name in the column.

__delitem__(key: Union[str, int]) → None ¶

Remove a sample from the column. Convenience method to delete().

See also

pop() to return a value and then delete it in the same operation

Parameters: key (KeyType) – Name of the sample to remove from the column.

__getitem__(key: Union[str, int])¶

Retrieve data for some sample key via dict style access conventions.

See also

get()

Parameters: key (KeyType) – Sample key to retrieve from the column.
Returns: Data corresponding to the provided sample key.
Return type: value
Raises: KeyError – if no sample with the requested key exists.

__iter__() → Iterable[Union[str, int]]¶

Create iterator yielding an column sample keys.

Yields: Iterable[KeyType] – Sample key contained in the column.

__len__() → int ¶: Check how many samples are present in a given column.

__setitem__(key, value)¶

Store a piece of data in a column.

See also

update() for an implementation analogous to python’s built in dict.update() method which accepts a dict or iterable of key/value pairs to add in the same operation.

Parameters

key – name to assign to the sample (assuming the column accepts named samples), If str, can only contain alpha-numeric ascii characters (in addition to ‘-‘, ‘.’, ‘_’). Integer key must be >= 0. by default, None
value – data to store as a sample in the column.

append(value) → Union[str, int]¶

Store some data in a sample with an automatically generated key.

This method should only be used if the context some piece of data is used in is independent from its value (i.e. when reading data back, there is no useful information which needs to be conveyed between the data source’s name/id and the value of that piece of information.) Think carefully before going this route, as this posit does not apply to many common use cases.

To store the data with a user defined key, use update() or __setitem__()

Parameters: value – Piece of data to store in the column.
Returns: Name of the generated key this data is stored with.
Return type: KeyType

property backend¶: Code indicating which backing store is used when writing data.

property backend_options¶: Filter / Compression options applied to backend when writing data.

change_backend(backend: str, backend_options: Optional[dict] = None)¶

Change the default backend and filters applied to future data writes.

Warning

This method is meant for advanced users only. Please refer to the hangar backend codebase for information on accepted parameters and options.

Parameters

backend (str) – Backend format code to swtich to.
backend_options (Optional[dict]) – Backend option specification to use (if specified). If left to default value of None, then default options for backend are automatically used.

Raises

RuntimeError – If this method was called while this column is invoked in a context manager
ValueError – If the backend format code is not valid.

property column¶: Name of the column.

property column_layout¶: Column layout type (‘nested’, ‘flat’, etc).

property column_type¶: Data container type of the column (‘ndarray’, ‘str’, etc).

property contains_remote_references¶

Bool indicating if all samples in column exist on local disk.

The data associated with samples referencing some remote server will need to be downloaded (fetched in the hangar vocabulary) before they can be read into memory.

Returns: False if at least one sample in the column references data stored on some remote server. True if all sample data is available on the machine’s local disk.
Return type: bool

property contains_subsamples¶: Bool indicating if sub-samples are contained in this column container.

property dtype¶: Dtype of the columns data (np.float, str, etc).

get(key: Union[str, int], default=None)¶

Retrieve the data associated with some sample key

Parameters

key (KeyType) – The name of the subsample(s) to retrieve. Passing a single subsample key will return the stored data value.
default (Any) – if a key parameter is not found, then return this value instead. By default, None.

Returns

data data stored under subsample key if key exists, else default value if not found.

Return type

value

property iswriteable¶: Bool indicating if this column object is write-enabled.

items(local: bool = False) → Iterable[Tuple[Union[str, int], Any]]¶

Generator yielding (name, data) tuple for every subsample.

Parameters: local (bool, optional) – If True, returned keys/values will only correspond to data which is available for reading on the local disk, No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Tuple[KeyType, Any]] – Name and stored value for every subsample inside the sample.

keys(local: bool = False) → Iterable[Union[str, int]]¶

Generator yielding the name (key) of every subsample.

Parameters: local (bool, optional) – If True, returned keys will only correspond to data which is available for reading on the local disk, by default False.
Yields: Iterable[KeyType] – Keys of one subsample at a time inside the sample.

pop(key: Union[str, int])¶

Retrieve some value for some key(s) and delete it in the same operation.

Parameters: key (KeysType) – Sample key to remove
Returns: Upon success, the value of the removed key.
Return type: value
Raises: KeyError – If there is no sample with some key in the column.

property remote_reference_keys¶

Compute sample names whose data is stored in a remote server reference.

Returns: list of sample keys in the column whose data references indicate they are stored on a remote server.
Return type: Tuple[KeyType]

property schema_type¶: Schema type of the contained data (‘variable_shape’, ‘fixed_shape’, etc).

property shape¶: (Max) shape of data that can (is) written in the column.

update(other=None, **kwargs)¶

Store some data with the key/value pairs from other, overwriting existing keys.

update() implements functionality similar to python’s builtin dict.update() method, accepting either a dictionary or other iterable (of length two) listing key / value pairs.

Parameters

other – Accepts either another dictionary object or an iterable of key/value pairs (as tuples or other iterables of length two). mapping sample names to data value instances instances, If sample name is string type, can only contain alpha-numeric ascii characters (in addition to ‘-‘, ‘.’, ‘_’). Int key must be >= 0. By default, None.
**kwargs – keyword arguments provided will be saved with keywords as sample keys (string type only) and values as np.array instances.

values(local: bool = False) → Iterable[Any]¶

Generator yielding the data for every subsample.

Parameters: local (bool, optional) – If True, returned values will only correspond to data which is available for reading on the local disk. No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Any] – Values of one subsample at a time inside the sample.

Nested Column Layout Container¶

class NestedSampleWriter¶

__contains__(key: Union[str, int]) → bool ¶: Determine if some sample key exists in the column.

__delitem__(key: Union[str, int])¶: Remove a sample (including all contained subsamples) from the column.

See also

pop() for alternative implementing a simultaneous get value and delete operation.

__getitem__(key: Union[str, int]) → hangar.columns.layout_nested.FlatSubsampleReader ¶

Get the sample access class for some sample key.

Parameters: key (KeyType) – Name of sample to retrieve
Returns: Sample accessor corresponding to the given key
Return type: FlatSubsampleReader
Raises: KeyError – If no sample with the provided key exists.

__iter__() → Iterable[Union[str, int]]¶

Create iterator yielding an column sample keys.

Yields: Iterable[KeyType] – Sample key contained in the column.

__len__() → int ¶: Find number of samples in the column

__setitem__(key, value) → None ¶: Store some subsample key / subsample data map, overwriting existing keys.

See also

update() for alternative syntax for setting values.

property backend¶: Code indicating which backing store is used when writing data.

property backend_options¶: Filter / Compression options applied to backend when writing data.

change_backend(backend: str, backend_options: Optional[dict] = None)¶

Change the default backend and filters applied to future data writes.

Warning

This method is meant for advanced users only. Please refer to the hangar backend codebase for information on accepted parameters and options.

Parameters

backend (str) – Backend format code to swtich to.
backend_options – Backend option specification to use (if specified). If left to default value of None, then default options for backend are automatically used.

Raises

RuntimeError – If this method was called while this column is invoked in a context manager
ValueError – If the backend format code is not valid.

property column¶: Name of the column.

property column_layout¶: Column layout type (‘nested’, ‘flat’, etc).

property column_type¶: Data container type of the column (‘ndarray’, ‘str’, etc).

property contains_remote_references¶

Bool indicating all subsamples in sample column exist on local disk.

The data associated with subsamples referencing some remote server will need to be downloaded (fetched in the hangar vocabulary) before they can be read into memory.

Returns: False if at least one subsample in the column references data stored on some remote server. True if all sample data is available on the machine’s local disk.
Return type: bool

property contains_subsamples¶: Bool indicating if sub-samples are contained in this column container.

property dtype¶: Dtype of the columns data (np.float, str, etc).

get(key: Union[str, int, ellipsis, slice], default: Any = None) → hangar.columns.layout_nested.FlatSubsampleReader ¶

Retrieve data for some sample key(s) in the column.

Parameters

key (GetKeysType) – The name of the subsample(s) to retrieve
default (Any) – if a key parameter is not found, then return this value instead. By default, None.

Returns

Sample accessor class given by name key which can be used to access subsample data.

Return type

FlatSubsampleReader

property iswriteable¶: Bool indicating if this column object is write-enabled.

items(local: bool = False) → Iterable[Tuple[Union[str, int], Any]]¶

Generator yielding (name, data) tuple for every subsample.

Parameters: local (bool, optional) – If True, returned keys/values will only correspond to data which is available for reading on the local disk, No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Tuple[KeyType, Any]] – Name and stored value for every subsample inside the sample.

keys(local: bool = False) → Iterable[Union[str, int]]¶

Generator yielding the name (key) of every subsample.

Parameters: local (bool, optional) – If True, returned keys will only correspond to data which is available for reading on the local disk, by default False.
Yields: Iterable[KeyType] – Keys of one subsample at a time inside the sample.

property num_subsamples¶: Calculate total number of subsamples existing in all samples in column

pop(key: Union[str, int]) → Dict[Union[str, int], Any]¶

Retrieve some value for some key(s) and delete it in the same operation.

Parameters: key (KeysType) – sample key to remove
Returns: Upon success, a nested dictionary mapping sample names to a dict of subsample names and subsample values for every sample key passed into this method.
Return type: Dict[KeyType, KeyArrMap]

property remote_reference_keys¶

Compute subsample names whose data is stored in a remote server reference.

Returns: list of subsample keys in the column whose data references indicate they are stored on a remote server.
Return type: Tuple[KeyType]

property schema_type¶: Schema type of the contained data (‘variable_shape’, ‘fixed_shape’, etc).

property shape¶: (Max) shape of data that can (is) written in the column.

update(other=None, **kwargs) → None ¶

Store some data with the key/value pairs, overwriting existing keys.

update() implements functionality similar to python’s builtin dict.update() method, accepting either a dictionary or other iterable (of length two) listing key / value pairs.

Parameters

other – Dictionary mapping sample names to subsample data maps. Or Sequence (list or tuple) where element one is the sample name and element two is a subsample data map.
**kwargs – keyword arguments provided will be saved with keywords as sample keys (string type only) and values as a mapping of subarray keys to data values.

values(local: bool = False) → Iterable[Any]¶

Generator yielding the tensor data for every subsample.

Parameters: local (bool, optional) – If True, returned values will only correspond to data which is available for reading on the local disk. No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Any] – Values of one subsample at a time inside the sample.

class FlatSubsampleWriter¶

__delitem__(key: Union[str, int])¶

Remove a subsample from the column.`.

See also

pop() to simultaneously get a keys value and delete it.

Parameters: key (KeyType) – Name of the sample to remove from the column.

__getitem__(key: Union[str, int, ellipsis, slice]) → Union[Any, Dict[Union[str, int], Any]]¶

Retrieve data for some subsample key via dict style access conventions.

See also

get()

Parameters: key (GetKeysType) – Sample key to retrieve from the column. Alternatively, slice syntax can be used to retrieve a selection of subsample keys/values. An empty slice (: == slice(None)) or Ellipsis (...) will return all subsample keys/values. Passing a non-empty slice ([1:5] == slice(1, 5)) will select keys to retrieve by enumerating all subsamples and retrieving the element (key) for each step across the range. Note: order of enumeration is not guaranteed; do not rely on any ordering observed when using this method.
Returns: Sample data corresponding to the provided key. or dictionary of subsample keys/data if Ellipsis or slice passed in as key.
Return type: Union[Any, Dict[KeyType, Any]]
Raises: KeyError – if no sample with the requested key exists.

__setitem__(key, value)¶

Store data as a subsample. Convenience method to add().

See also

update() for an implementation analogous to python’s built in dict.update() method which accepts a dict or iterable of key/value pairs to add in the same operation.

Parameters

key – Key (name) of the subsample to add to the column.
value – Data to add as the sample.

append(value) → Union[str, int]¶

Store some data in a subsample with an automatically generated key.

This method should only be used if the context some piece of data is used in is independent from its value (ie. when reading data back, there is no useful information which needs to be conveyed between the data source’s name/id and the value of that piece of information.) Think carefully before going this route, as this posit does not apply to many common use cases.

See also

In order to store the data with a user defined key, use update() or __setitem__()

Parameters: value – Piece of data to store in the column.
Returns: Name of the generated key this data is stored with.
Return type: KeyType

property column¶: Name of the column.

property contains_remote_references¶

Bool indicating all subsamples in sample column exist on local disk.

The data associated with subsamples referencing some remote server will need to be downloaded (fetched in the hangar vocabulary) before they can be read into memory.

Returns: False if at least one subsample in the column references data stored on some remote server. True if all sample data is available on the machine’s local disk.
Return type: bool

property data¶

Return dict mapping every subsample key / data value stored in the sample.

Returns: Dictionary mapping subsample name(s) (keys) to their stored values as numpy.ndarray instances.
Return type: Dict[KeyType, Any]

get(key: Union[str, int], default=None)¶

Retrieve the data associated with some subsample key

Parameters

key (GetKeysType) – The name of the subsample(s) to retrieve. Passing a single subsample key will return the stored numpy.ndarray
default – if a key parameter is not found, then return this value instead. By default, None.

Returns

data stored under subsample key if key exists, else default value if not found.

Return type

value

property iswriteable¶: Bool indicating if this column object is write-enabled.

items(local: bool = False) → Iterable[Tuple[Union[str, int], Any]]¶

Generator yielding (name, data) tuple for every subsample.

Parameters: local (bool, optional) – If True, returned keys/values will only correspond to data which is available for reading on the local disk, No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Tuple[KeyType, Any]] – Name and stored value for every subsample inside the sample.

keys(local: bool = False) → Iterable[Union[str, int]]¶

Generator yielding the name (key) of every subsample.

Parameters: local (bool, optional) – If True, returned keys will only correspond to data which is available for reading on the local disk, by default False.
Yields: Iterable[KeyType] – Keys of one subsample at a time inside the sample.

pop(key: Union[str, int])¶

Retrieve some value for some key(s) and delete it in the same operation.

Parameters: key (KeysType) – Sample key to remove
Returns: Upon success, the value of the removed key.
Return type: value

property remote_reference_keys¶

Compute subsample names whose data is stored in a remote server reference.

Returns: list of subsample keys in the column whose data references indicate they are stored on a remote server.
Return type: Tuple[KeyType]

property sample¶: Name of the sample this column subsamples are stured under.

update(other=None, **kwargs)¶

Store data with the key/value pairs, overwriting existing keys.

update() implements functionality similar to python’s builtin dict.update() method, accepting either a dictionary or other iterable (of length two) listing key / value pairs.

Parameters

other – Accepts either another dictionary object or an iterable of key/value pairs (as tuples or other iterables of length two). mapping sample names to data values, If sample name is string type, can only contain alpha-numeric ascii characters (in addition to ‘-‘, ‘.’, ‘_’). Int key must be >= 0. By default, None.
**kwargs – keyword arguments provided will be saved with keywords as subsample keys (string type only) and values as np.array instances.

values(local: bool = False) → Iterable[Any]¶

Generator yielding the data for every subsample.

Parameters: local (bool, optional) – If True, returned values will only correspond to data which is available for reading on the local disk. No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Any] – Values of one subsample at a time inside the sample.

Differ¶

class WriterUserDiff¶

Methods diffing contents of a WriterCheckout instance.

These provide diffing implementations to compare the current HEAD of a checkout to a branch, commit, or the staging area "base" contents. The results are generally returned as a nested set of named tuples. In addition, the status() method is implemented which can be used to quickly determine if there are any uncommitted changes written in the checkout.

When diffing of commits or branches is performed, if there is not a linear history of commits between current HEAD and the diff commit (ie. a history which would permit a "fast-forward" merge), the result field named conflict will contain information on any merge conflicts that would exist if staging area HEAD and the (compared) "dev" HEAD were merged “right now”. Though this field is present for all diff comparisons, it can only contain non-empty values in the cases where a three way merge would need to be performed.

Fast Forward is Possible
========================

    (master)          (foo)
a ----- b ----- c ----- d


3-Way Merge Required
====================

                     (master)
a ----- b ----- c ----- d
        \
         \               (foo)
          \----- ee ----- ff

branch(dev_branch: str) → hangar.diff.DiffAndConflicts¶

Compute diff between HEAD and branch, returning user-facing results.

Parameters: dev_branch (str) – name of the branch whose HEAD will be used to calculate the diff of.
Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm.
Return type: DiffAndConflicts
Raises: ValueError – If the specified dev_branch does not exist.

commit(dev_commit_hash: str) → hangar.diff.DiffAndConflicts¶

Compute diff between HEAD and commit, returning user-facing results.

Parameters: dev_commit_hash (str) – hash of the commit to be used as the comparison.
Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm.
Return type: DiffAndConflicts
Raises: ValueError – if the specified dev_commit_hash is not a valid commit reference.

staged() → hangar.diff.DiffAndConflicts¶

Return diff of staging area to base, returning user-facing results.

Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm.
Return type: DiffAndConflicts

status() → str ¶

Determine if changes have been made in the staging area

If the contents of the staging area and it’s parent commit are the same, the status is said to be “CLEAN”. If even one column or metadata record has changed however, the status is “DIRTY”.

Returns: “CLEAN” if no changes have been made, otherwise “DIRTY”
Return type: str

Bulk Importer¶

Bulk importer methods to ingest large quantities of data into Hangar.

The following module is designed to address challenges inherent to writing massive amounts of data to a hangar repository via the standard API. Since write-enabled checkouts are limited to processing in a single thread, the time required to import hundreds of Gigabytes (or Terabytes) of data into Hangar (from external sources) can become prohibitivly long. This module implements a multi-processed importer which reduces import time nearly linearly with the number of CPU cores allocated on a machine.

There are a number of challenges to overcome:

How to validate data against a column schema?
- Does the column exist?
- Are the key(s) valid?
- Is the data a valid type/shape/precision valid for the the selected column schema?
How to handle duplicated data?
- If an identical piece of data is recorded in the repository already, only record the sample reference (do not write the data to disk again).
- If the bulk import method would write identical pieces of data to the repository multiple times, and the data does not already exist, then that piece of content should only be written to disk once. Only sample references should be saved after that.
How to handle transactionality?
- What happens if some column, sample keys, or data piece is invalid and cannot be written as desired?
- How to rollback partial changes if the process is inturupted in the middle of a bulk import operation?
How to limit memory usage if many processes are trying to load and write large tensors?

Rough outline of steps:

Validate UDF & Argument Signature

Read, Validate, and Hash UDF results –> Task Recipe

Prune Recipe

Read, Validate, Write Data to Isolated Backend Storage

Record Sample References in Isolated Environment

If all successful, make isolated data known to repository core, otherwise abort to starting state.

class UDF_Return(column: str, key: Union[str, int, Tuple[Union[str, int], Union[str, int]]], data: Union[numpy.ndarray, str, bytes])¶

User-Defined Function return container for bulk importer read functions

Variables

column (str) – column name to place data into
key (Union[KeyType, Tuple[KeyType, KeyType]]) – key to place flat sample into, or 2-tuple of keys for nested samples
data (Union[np.ndarray, str, bytes]) – piece of data to place in the column with the provided key.

property column¶: Alias for field number 0

property data¶: Alias for field number 2

property key¶: Alias for field number 1

run_bulk_import(repo: Repository, branch_name: str, column_names: List[str], udf: Callable[[…], Iterator[UDF_Return]], udf_kwargs: List[dict], *, ncpus: int = 0, autocommit: bool = True)¶

Perform a bulk import operation from a given user-defined function.

In order to provide for arbitrary input data sources along with ensuring the core promises of hangar hold we require the following from users:

Define some arbitrary function (ie “user-defined function” / “UDF”) which accepts some arguments and yields data. The UDF must be a generator function, yielding only values which are of UDF_Return type. The results yielded by the UDF must be deterministic for a given set of inputs. This includes all values of the UDF_Return (columns and keys, as well as data).

A list of input arguments to the UDF must be provided, this is formatted as a sequence (list / tuple) of keyword-arg dictionaries, each of which must be valid when unpacked and bound to the UDF signature. Additionally, all columns must be specified up front. If any columns are named a UDF_Return which were not pre-specified, the entire operation will fail.

Notes

This is an all-or-nothing operation, either all data is successfully read, validated, and written to the storage backends, or none of it is. A single maleformed key or data type/shape will cause the entire import operation to abort.
The input kwargs should be fairly small (of no consequence to load into memory), data out should be large. The results of the UDF will only be stored in memory for a very short period (just the time it takes to be validated against the column schema and compressed / flushed to disk).
Every step of the process is executed as a generator, lazily loading data the entire way. If possible, we recomend writing the UDF such that data is not allocated in memory before it is ready to be yielded.
If it is possible, the task recipe will be pruned and optimized in such a way that iteration over the UDF will be short circuted during the second pass (writing data to the backend). As this can greatly reduce processing time, we recomend trying to yield data pieces which are likely to be unique first from the UDF.

Warning

Please be aware that these methods should not be executed within a Jupyter Notebook / Jupyter Lab when running the bulk importer at scale. The internal implemenation makes significant use of multiprocess Queues for work distribution and recording. The heavy loads placed on the system have been observed to place strain on Jupyters ZeroMQ implementation, resulting in random failures which may or may not even display a traceback to indicate failure mode.

A small sample set of data can be used within jupyter to test an implementation without problems, but for full scale operations it is best run in a script with the operations protected by a __main__ block.

Examples

>>> import os
>>> import numpy as np
>>> from PIL import Image
>>> from hangar.bulk_importer import UDF_Return

>>> def image_loader(file_path):
...     im = Image.open(file_name)
...     arr = np.array(im.resize(512, 512))
...     im_record = UDF_Return(column='image', key=(category, sample), data=arr)
...     yield im_record
...
...     root, sample_file = os.path.split(file_path)
...     category = os.path.dirname(root)
...     sample_name, _ = os.path.splitext(sample_file)
...     path_record = UDF_Return(column='file_str', key=(category, sample_name), data=file_path)
...     yield path_record
...
>>> udf_kwargs = [
...     {'file_path': '/foo/cat/image_001.jpeg'},
...     {'file_path': '/foo/cat/image_002.jpeg'},
...     {'file_path': '/foo/dog/image_001.jpeg'},
...     {'file_path': '/foo/bird/image_011.jpeg'},
...     {'file_path': '/foo/bird/image_003.jpeg'}
... ]
>>> repo = Repository('foo/path/to/repo')
>>> from hangar.bulk_importer import run_bulk_import
>>> run_bulk_import(
...     repo, branch_name='master', column_names=['file_str', 'image'],
...     udf=image_loader, udf_kwargs=udf_kwargs)

However, the following will not work, since the output is non-deterministic.

>>> def nondeterminstic(x, y):
...     first = str(x * y)
...     yield UDF_Return(column='valstr', key=f'{x}_{y}', data=first)
...
...     second = str(x * y * random())
...     yield UDF_Return(column='valstr', key=f'{x}_{y}', data=second)
...
>>> udf_kwargs = [
...     {'x': 1, 'y': 2},
...     {'x': 1, 'y': 3},
...     {'x': 2, 'y': 4},
... ]
>>> run_bulk_import(
...     repo, branch_name='master', column_names=['valstr'],
...     udf=image_loader, udf_kwargs=udf_kwargs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: contents returned in subbsequent calls to UDF with identical
  kwargs yielded different results. UDFs MUST generate deterministic
  results for the given inputs. Input kwargs generating this result:
  {'x': 1, 'y': 2}.

Not all columns must be returned from every input to the UDF, the number of data pieces yielded can also vary arbitrarily (so long as the results are deterministic for a particular set of inputs)

>>> def maybe_load(x_arr, y_arr, sample_name, columns=['default']):
...     for column in columns:
...         arr = np.multiply(x_arr, y_arr)
...         yield UDF_Return(column=column, key=sample_name, data=arr)
...     #
...     # do some strange processing which only outputs another column sometimes
...     if len(columns) == 1:
...         other = np.array(x_arr.shape) * np.array(y_arr.shape)
...         yield UDF_Return(column='strange_column', key=sample_name, data=other)
...
>>> udf_kwargs = [
...     {'x_arr': np.arange(10), 'y_arr': np.arange(10) + 1, 'sample_name': 'sample_1'},
...     {'x_arr': np.arange(10), 'y_arr': np.arange(10) + 1, 'sample_name': 'sample_2', 'columns': ['foo', 'bar', 'default']},
...     {'x_arr': np.arange(10) * 2, 'y_arr': np.arange(10), 'sample_name': 'sample_3'},
... ]
>>> run_bulk_import(
...     repo, branch_name='master',
...     column_names=['default', 'foo', 'bar', 'strange_column'],
...     udf=maybe_load, udf_kwargs=udf_kwargs)

Parameters

repo ('Repository') – Initialized repository object to import data into.
branch_name (str) – Name of the branch to checkout and import data into.
column_names (List[str]) – Names of all columns which data should be saved to.
udf (UDF_T) – User-Defined Function (generator style; yielding an arbitrary number of values when iterated on) which is passed an unpacked kwarg dict as input and yields a single UDF_Return instance at a time when iterated over. Cannot contain
udf_kwargs (List[dict]) – A sequence of keyword argument dictionaries which are individually unpacked as inputs into the user-defined function (UDF). the keyword argument dictionaries
ncpus (int, optional, default=0) – Number of Parallel processes to read data files & write to hangar backend stores in. If <= 0, then the default is set to num_cpus / 2. The value of this parameter should never exceed the total CPU count of the system. Import time scales mostly linearly with ncpus. Optimal performance is achieved by balancing memory usage of the UDF function and backend storage writer processes against the total system memory. generally increase linearly up to
autocommit (bool, optional, default=True) – Control whether a commit should be made after successfully importing the specified data to the staging area of the branch.

Read Only Checkout¶

Checkout¶

class ReaderCheckout¶

Checkout the repository as it exists at a particular branch.

This class is instantiated automatically from a repository checkout operation. This object will govern all access to data and interaction methods the user requests.

>>> co = repo.checkout()
>>> isinstance(co, ReaderCheckout)
True

If a commit hash is provided, it will take precedent over the branch name parameter. If neither a branch not commit is specified, the staging environment’s base branch HEAD commit hash will be read.

>>> co = repo.checkout(commit='foocommit')
>>> co.commit_hash
'foocommit'
>>> co.close()
>>> co = repo.checkout(branch='testbranch')
>>> co.commit_hash
'someothercommithashhere'
>>> co.close()

Unlike WriterCheckout, any number of ReaderCheckout objects can exist on the repository independently. Like the write-enabled variant, the close() method should be called after performing the necessary operations on the repo. However, as there is no concept of a lock for read-only checkouts, this is just to free up memory resources, rather than changing recorded access state.

In order to reduce the chance that the python interpreter is shut down without calling close(), - a common mistake during ipython / jupyter sessions - an atexit hook is registered to close(). If properly closed by the user, the hook is unregistered after completion with no ill effects. So long as a the process is NOT terminated via non-python SIGKILL, fatal internal python error, or or special os exit methods, cleanup will occur on interpreter shutdown and resources will be freed. If a non-handled termination method does occur, the implications of holding resources varies on a per-OS basis. While no risk to data integrity is observed, repeated misuse may require a system reboot in order to achieve expected performance characteristics.

__contains__(key)¶: Determine if some column name (key) exists in the checkout.

__getitem__(index)¶

Dictionary style access to columns and samples

Checkout object can be thought of as a “dataset” (“dset”) mapping a view of samples across columns.

>>> dset = repo.checkout(branch='master')
>>>
# Get an column contained in the checkout.
>>> dset['foo']
ColumnDataReader
>>>
# Get a specific sample from ``'foo'`` (returns a single array)
>>> dset['foo', '1']
np.array([1])
>>>
# Get multiple samples from ``'foo'`` (returns a list of arrays, in order
# of input keys)
>>> dset[['foo', '1'], ['foo', '2'],  ['foo', '324']]
[np.array([1]), np.ndarray([2]), np.ndarray([324])]
>>>
# Get sample from multiple columns, column/data returned is ordered
# in same manner as input of func.
>>> dset[['foo', '1'], ['bar', '1'],  ['baz', '1']]
[np.array([1]), np.ndarray([1, 1]), np.ndarray([1, 1, 1])]
>>>
# Get multiple samples from multiple columns            >>> keys = [(col, str(samp)) for samp in range(2) for col in ['foo', 'bar']]
>>> keys
[('foo', '0'), ('bar', '0'), ('foo', '1'), ('bar', '1')]
>>> dset[keys]
[np.array([1]), np.array([1, 1]), np.array([2]), np.array([2, 2])]

Arbitrary column layouts are supported by simply adding additional members to the keys for each piece of data. For example, getting data from a column with a nested layout:

>>> dset['nested_col', 'sample_1', 'subsample_0']
np.array([1, 0])
>>>
# a sample accessor object can be retrieved at will...
>>> dset['nested_col', 'sample_1']
<class 'FlatSubsampleReader'>(column_name='nested_col', sample_name='sample_1')
>>>
# to get all subsamples in a nested sample use the Ellipsis operator
>>> dset['nested_col', 'sample_1', ...]
{'subsample_0': np.array([1, 0]),
 'subsample_1': np.array([1, 1]),
 ...
 'subsample_n': np.array([1, 255])}

Retrieval of data from different column types can be mixed and combined as desired. For example, retrieving data from both flat and nested columns simultaneously:

>>> dset[('nested_col', 'sample_1', '0'), ('foo', '0')]
[np.array([1, 0]), np.array([0])]
>>> dset[('nested_col', 'sample_1', ...), ('foo', '0')]
[{'subsample_0': np.array([1, 0]), 'subsample_1': np.array([1, 1])},
 np.array([0])]
>>> dset[('foo', '0'), ('nested_col', 'sample_1')]
[np.array([0]),
 <class 'FlatSubsampleReader'>(column_name='nested_col', sample_name='sample_1')]

If a column or data key does not exist, then this method will raise a KeyError. As an alternative, missing keys can be gracefully handeled by calling get() instead. This method does not (by default) raise an error if a key is missing. Instead, a (configurable) default value is simply inserted in it’s place.

>>> dset['foo', 'DOES_NOT_EXIST']
-------------------------------------------------------------------
KeyError                           Traceback (most recent call last)
<ipython-input-40-731e6ea62fb8> in <module>
----> 1 res = co['foo', 'DOES_NOT_EXIST']
KeyError: 'DOES_NOT_EXIST'

Parameters

index –

column name, sample key(s) or sequence of list/tuple of column name, sample keys(s) which should be retrieved in the operation.

Please see detailed explanation above for full explanation of accepted argument format / result types.

Returns

Columns – single column parameter, no samples specified
Any – Single column specified, single sample key specified
List[Any] – arbitrary columns, multiple samples array data for each sample is returned in same order sample keys are received.

__iter__()¶: Iterate over column keys

__len__()¶: Returns number of columns in the checkout.

close() → None ¶

Gracefully close the reader checkout object.

Though not strictly required for reader checkouts (as opposed to writers), closing the checkout after reading will free file handles and system resources, which may improve performance for repositories with multiple simultaneous read checkouts.

property columns¶

Provides access to column interaction object.

Can be used to either return the columns accessor for all elements or a single column instance by using dictionary style indexing.

>>> co = repo.checkout(write=False)
>>> len(co.columns)
1
>>> print(co.columns.keys())
['foo']
>>> fooCol = co.columns['foo']
>>> fooCol.dtype
np.fooDtype
>>> cols = co.columns
>>> fooCol = cols['foo']
>>> fooCol.dtype
np.fooDtype
>>> fooCol = cols.get('foo')
>>> fooCol.dtype
np.fooDtype

See also

The class Columns contains all methods accessible by this property accessor

Returns: the columns object which behaves exactly like a columns accessor class but which can be invalidated when the writer lock is released.
Return type: Columns

property commit_hash¶

Commit hash this read-only checkout’s data is read from.

>>> co = repo.checkout()
>>> co.commit_hash
foohashdigesthere

Returns: commit hash of the checkout
Return type: str

property diff¶

Access the differ methods for a read-only checkout.

See also

The class ReaderUserDiff contains all methods accessible by this property accessor

Returns: weakref proxy to the differ object (and contained methods) which behaves exactly like the differ class but which can be invalidated when the writer lock is released.
Return type: ReaderUserDiff

get(keys, default=None, except_missing=False)¶

View of sample data across columns gracefully handling missing sample keys.

Please see __getitem__() for full description. This method is identical with a single exception: if a sample key is not present in an column, this method will plane a null None value in it’s return slot rather than throwing a KeyError like the dict style access does.

Parameters

keys –
sequence of column name (and optionally) sample key(s) or sequence of list/tuple of column name, sample keys(s) which should be retrieved in the operation.

Please see detailed explanation in __getitem__() for full explanation of accepted argument format / result types.
default (Any, optional) – default value to insert in results for the case where some column name / sample key is not found, and the except_missing parameter is set to False.
except_missing (bool, optional) – If False, will not throw exceptions on missing sample key value. Will raise KeyError if True and missing key found.

Returns

Columns – single column parameter, no samples specified
Any – Single column specified, single sample key specified
List[Any] – arbitrary columns, multiple samples array data for each sample is returned in same order sample keys are received.

items()¶: Generator yielding tuple of (name, accessor object) of every column

keys()¶: Generator yielding the name (key) of every column

log(branch: str = None, commit: str = None, *, return_contents: bool = False, show_time: bool = False, show_user: bool = False) → Optional[dict]¶

Displays a pretty printed commit log graph to the terminal.

Note

For programatic access, the return_contents value can be set to true which will retrieve relevant commit specifications as dictionary elements.

if Neither branch nor commit arguments are supplied, the commit digest of the currently reader checkout will be used as default.

Parameters

branch (str, optional) – The name of the branch to start the log process from. (Default value = None)
commit (str, optional) – The commit hash to start the log process from. (Default value = None)
return_contents (bool, optional, kwarg only) – If true, return the commit graph specifications in a dictionary suitable for programatic access/evaluation.
show_time (bool, optional, kwarg only) – If true and return_contents is False, show the time of each commit on the printed log graph
show_user (bool, optional, kwarg only) – If true and return_contents is False, show the committer of each commit on the printed log graph

Returns

Dict containing the commit ancestor graph, and all specifications.

Return type

Optional[dict]

values()¶: Generator yielding accessor object of every column

Flat Column Layout Container¶

class FlatSampleReader¶

Class implementing get access to data in a column.

This class exposes the standard API to access data stored in a single level key / value mapping column. Usage is modeled after the python dict style syntax – with a few additional utility and inspection methods and properties added. Methods named after those of a python dict have syntactically identical arguments and behavior to that of the standard library.

If not opened in a write-enabled checkout, then attempts to add or delete data or container properties will raise an exception (in the form of a PermissionError). No changes will be propogated unless a write-enabled checkout is used.

This object can be serialized – pickled – for parallel processing / reading if opened in a read-only checkout. Parallel operations are both thread and process safe, though performance may significantly differ between multithreaded vs multiprocessed code (depending on the backend data is stored in). Attempts to serialize objects opened in write-enabled checkouts are not supported and will raise a PermissionError if attempted. This behavior is enforced in order to ensure data and record integrity while writing to the repository.

__contains__(key: Union[str, int]) → bool ¶: Determine if a key is a valid sample name in the column.

__getitem__(key: Union[str, int])¶

Retrieve data for some sample key via dict style access conventions.

See also

get()

Parameters: key (KeyType) – Sample key to retrieve from the column.
Returns: Data corresponding to the provided sample key.
Return type: value
Raises: KeyError – if no sample with the requested key exists.

__iter__() → Iterable[Union[str, int]]¶

Create iterator yielding an column sample keys.

Yields: Iterable[KeyType] – Sample key contained in the column.

__len__() → int ¶: Check how many samples are present in a given column.

property backend¶: Code indicating which backing store is used when writing data.

property backend_options¶: Filter / Compression options applied to backend when writing data.

property column¶: Name of the column.

property column_layout¶: Column layout type (‘nested’, ‘flat’, etc).

property column_type¶: Data container type of the column (‘ndarray’, ‘str’, etc).

property contains_remote_references¶

Bool indicating if all samples in column exist on local disk.

The data associated with samples referencing some remote server will need to be downloaded (fetched in the hangar vocabulary) before they can be read into memory.

Returns: False if at least one sample in the column references data stored on some remote server. True if all sample data is available on the machine’s local disk.
Return type: bool

property contains_subsamples¶: Bool indicating if sub-samples are contained in this column container.

property dtype¶: Dtype of the columns data (np.float, str, etc).

get(key: Union[str, int], default=None)¶

Retrieve the data associated with some sample key

Parameters

key (KeyType) – The name of the subsample(s) to retrieve. Passing a single subsample key will return the stored data value.
default (Any) – if a key parameter is not found, then return this value instead. By default, None.

Returns

data data stored under subsample key if key exists, else default value if not found.

Return type

value

property iswriteable¶: Bool indicating if this column object is write-enabled.

items(local: bool = False) → Iterable[Tuple[Union[str, int], Any]]¶

Generator yielding (name, data) tuple for every subsample.

Parameters: local (bool, optional) – If True, returned keys/values will only correspond to data which is available for reading on the local disk, No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Tuple[KeyType, Any]] – Name and stored value for every subsample inside the sample.

keys(local: bool = False) → Iterable[Union[str, int]]¶

Generator yielding the name (key) of every subsample.

Parameters: local (bool, optional) – If True, returned keys will only correspond to data which is available for reading on the local disk, by default False.
Yields: Iterable[KeyType] – Keys of one subsample at a time inside the sample.

property remote_reference_keys¶

Compute sample names whose data is stored in a remote server reference.

Returns: list of sample keys in the column whose data references indicate they are stored on a remote server.
Return type: Tuple[KeyType]

property schema_type¶: Schema type of the contained data (‘variable_shape’, ‘fixed_shape’, etc).

property shape¶: (Max) shape of data that can (is) written in the column.

values(local: bool = False) → Iterable[Any]¶

Generator yielding the data for every subsample.

Parameters: local (bool, optional) – If True, returned values will only correspond to data which is available for reading on the local disk. No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Any] – Values of one subsample at a time inside the sample.

Nested Column Layout Container¶

class NestedSampleReader¶

__contains__(key: Union[str, int]) → bool ¶: Determine if some sample key exists in the column.

__getitem__(key: Union[str, int]) → hangar.columns.layout_nested.FlatSubsampleReader ¶

Get the sample access class for some sample key.

Parameters: key (KeyType) – Name of sample to retrieve
Returns: Sample accessor corresponding to the given key
Return type: FlatSubsampleReader
Raises: KeyError – If no sample with the provided key exists.

__iter__() → Iterable[Union[str, int]]¶

Create iterator yielding an column sample keys.

Yields: Iterable[KeyType] – Sample key contained in the column.

__len__() → int ¶: Find number of samples in the column

property backend¶: Code indicating which backing store is used when writing data.

property backend_options¶: Filter / Compression options applied to backend when writing data.

property column¶: Name of the column.

property column_layout¶: Column layout type (‘nested’, ‘flat’, etc).

property column_type¶: Data container type of the column (‘ndarray’, ‘str’, etc).

property contains_remote_references¶

Bool indicating all subsamples in sample column exist on local disk.

The data associated with subsamples referencing some remote server will need to be downloaded (fetched in the hangar vocabulary) before they can be read into memory.

Returns: False if at least one subsample in the column references data stored on some remote server. True if all sample data is available on the machine’s local disk.
Return type: bool

property contains_subsamples¶: Bool indicating if sub-samples are contained in this column container.

property dtype¶: Dtype of the columns data (np.float, str, etc).

get(key: Union[str, int, ellipsis, slice], default: Any = None) → hangar.columns.layout_nested.FlatSubsampleReader ¶

Retrieve data for some sample key(s) in the column.

Parameters

key (GetKeysType) – The name of the subsample(s) to retrieve
default (Any) – if a key parameter is not found, then return this value instead. By default, None.

Returns

Sample accessor class given by name key which can be used to access subsample data.

Return type

FlatSubsampleReader

property iswriteable¶: Bool indicating if this column object is write-enabled.

items(local: bool = False) → Iterable[Tuple[Union[str, int], Any]]¶

Generator yielding (name, data) tuple for every subsample.

Parameters: local (bool, optional) – If True, returned keys/values will only correspond to data which is available for reading on the local disk, No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Tuple[KeyType, Any]] – Name and stored value for every subsample inside the sample.

keys(local: bool = False) → Iterable[Union[str, int]]¶

Generator yielding the name (key) of every subsample.

Parameters: local (bool, optional) – If True, returned keys will only correspond to data which is available for reading on the local disk, by default False.
Yields: Iterable[KeyType] – Keys of one subsample at a time inside the sample.

property num_subsamples¶: Calculate total number of subsamples existing in all samples in column

property remote_reference_keys¶

Compute subsample names whose data is stored in a remote server reference.

Returns: list of subsample keys in the column whose data references indicate they are stored on a remote server.
Return type: Tuple[KeyType]

property schema_type¶: Schema type of the contained data (‘variable_shape’, ‘fixed_shape’, etc).

property shape¶: (Max) shape of data that can (is) written in the column.

values(local: bool = False) → Iterable[Any]¶

Generator yielding the tensor data for every subsample.

Parameters: local (bool, optional) – If True, returned values will only correspond to data which is available for reading on the local disk. No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Any] – Values of one subsample at a time inside the sample.

class FlatSubsampleReader¶

__getitem__(key: Union[str, int, ellipsis, slice]) → Union[Any, Dict[Union[str, int], Any]]¶

Retrieve data for some subsample key via dict style access conventions.

See also

get()

Parameters: key (GetKeysType) – Sample key to retrieve from the column. Alternatively, slice syntax can be used to retrieve a selection of subsample keys/values. An empty slice (: == slice(None)) or Ellipsis (...) will return all subsample keys/values. Passing a non-empty slice ([1:5] == slice(1, 5)) will select keys to retrieve by enumerating all subsamples and retrieving the element (key) for each step across the range. Note: order of enumeration is not guaranteed; do not rely on any ordering observed when using this method.
Returns: Sample data corresponding to the provided key. or dictionary of subsample keys/data if Ellipsis or slice passed in as key.
Return type: Union[Any, Dict[KeyType, Any]]
Raises: KeyError – if no sample with the requested key exists.

property column¶: Name of the column.

property contains_remote_references¶

Bool indicating all subsamples in sample column exist on local disk.

The data associated with subsamples referencing some remote server will need to be downloaded (fetched in the hangar vocabulary) before they can be read into memory.

Returns: False if at least one subsample in the column references data stored on some remote server. True if all sample data is available on the machine’s local disk.
Return type: bool

property data¶

Return dict mapping every subsample key / data value stored in the sample.

Returns: Dictionary mapping subsample name(s) (keys) to their stored values as numpy.ndarray instances.
Return type: Dict[KeyType, Any]

get(key: Union[str, int], default=None)¶

Retrieve the data associated with some subsample key

Parameters

key (GetKeysType) – The name of the subsample(s) to retrieve. Passing a single subsample key will return the stored numpy.ndarray
default – if a key parameter is not found, then return this value instead. By default, None.

Returns

data stored under subsample key if key exists, else default value if not found.

Return type

value

property iswriteable¶: Bool indicating if this column object is write-enabled.

items(local: bool = False) → Iterable[Tuple[Union[str, int], Any]]¶

Generator yielding (name, data) tuple for every subsample.

Parameters: local (bool, optional) – If True, returned keys/values will only correspond to data which is available for reading on the local disk, No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Tuple[KeyType, Any]] – Name and stored value for every subsample inside the sample.

keys(local: bool = False) → Iterable[Union[str, int]]¶

Generator yielding the name (key) of every subsample.

Parameters: local (bool, optional) – If True, returned keys will only correspond to data which is available for reading on the local disk, by default False.
Yields: Iterable[KeyType] – Keys of one subsample at a time inside the sample.

property remote_reference_keys¶

Compute subsample names whose data is stored in a remote server reference.

Returns: list of subsample keys in the column whose data references indicate they are stored on a remote server.
Return type: Tuple[KeyType]

property sample¶: Name of the sample this column subsamples are stured under.

values(local: bool = False) → Iterable[Any]¶

Generator yielding the data for every subsample.

Parameters: local (bool, optional) – If True, returned values will only correspond to data which is available for reading on the local disk. No attempt will be made to read data existing on a remote server, by default False.
Yields: Iterable[Any] – Values of one subsample at a time inside the sample.

Differ¶

class ReaderUserDiff¶

Methods diffing contents of a ReaderCheckout instance.

These provide diffing implementations to compare the current checkout HEAD of a to a branch or commit. The results are generally returned as a nested set of named tuples.

When diffing of commits or branches is performed, if there is not a linear history of commits between current HEAD and the diff commit (ie. a history which would permit a "fast-forward" merge), the result field named conflict will contain information on any merge conflicts that would exist if staging area HEAD and the (compared) "dev" HEAD were merged “right now”. Though this field is present for all diff comparisons, it can only contain non-empty values in the cases where a three way merge would need to be performed.

Fast Forward is Possible
========================

    (master)          (foo)
a ----- b ----- c ----- d


3-Way Merge Required
====================

                     (master)
a ----- b ----- c ----- d
        \
         \               (foo)
          \----- ee ----- ff

branch(dev_branch: str) → hangar.diff.DiffAndConflicts¶

Compute diff between HEAD and branch name, returning user-facing results.

Parameters: dev_branch (str) – name of the branch whose HEAD will be used to calculate the diff of.
Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm.
Return type: DiffAndConflicts
Raises: ValueError – If the specified dev_branch does not exist.

commit(dev_commit_hash: str) → hangar.diff.DiffAndConflicts¶

Compute diff between HEAD and commit hash, returning user-facing results.

Parameters: dev_commit_hash (str) – hash of the commit to be used as the comparison.
Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm.
Return type: DiffAndConflicts
Raises: ValueError – if the specified dev_commit_hash is not a valid commit reference.

ML Framework Dataloaders¶

Tensorflow¶

make_tf_dataset(columns, keys: Sequence[str] = None, index_range: slice = None, shuffle: bool = True)¶

Uses the hangar columns to make a tensorflow dataset. It uses from_generator function from tensorflow.data.Dataset with a generator function that wraps all the hangar columns. In such instances tensorflow Dataset does shuffle by loading the subset of data which can fit into the memory and shuffle that subset. Since it is not really a global shuffle make_tf_dataset accepts a shuffle argument which will be used by the generator to shuffle each time it is being called.

Warning

tf.data.Dataset.from_generator currently uses tf.compat.v1.py_func() internally. Hence the serialization function (yield_data) will not be serialized in a GraphDef. Therefore, you won’t be able to serialize your model and restore it in a different environment if you use make_tf_dataset. The operation must run in the same address space as the Python program that calls tf.compat.v1.py_func(). If you are using distributed TensorFlow, you must run a tf.distribute.Server in the same process as the program that calls tf.compat.v1.py_func() and you must pin the created operation to a device in that server (e.g. using with tf.device():)

Parameters

columns (Columns or Sequence) – A column object, a tuple of column object or a list of column objects`
keys (Sequence[str]) – An iterable of sample names. If given only those samples will fetched from the column
index_range (slice) – A python slice object which will be used to find the subset of column. Argument keys takes priority over index_range i.e. if both are given, keys will be used and index_range will be ignored
shuffle (bool) – generator uses this to decide a global shuffle accross all the samples is required or not. But user doesn’t have any restriction on doing`column.shuffle()` on the returned column

Examples

>>> from hangar import Repository
>>> from hangar import make_tf_dataset
>>> import tensorflow as tf
>>> tf.compat.v1.enable_eager_execution()
>>> repo = Repository('.')
>>> co = repo.checkout()
>>> data = co.columns['mnist_data']
>>> target = co.columns['mnist_target']
>>> tf_dset = make_tf_dataset([data, target])
>>> tf_dset = tf_dset.batch(512)
>>> for bdata, btarget in tf_dset:
...     print(bdata.shape, btarget.shape)

Returns
Return type: tf.data.Dataset

Pytorch¶

make_torch_dataset(columns, keys: Sequence[str] = None, index_range: slice = None, field_names: Sequence[str] = None)¶

Returns a torch.utils.data.Dataset object which can be loaded into a torch.utils.data.DataLoader.

Warning

On Windows systems, setting the parameter num_workers in the resulting torch.utils.data.DataLoader method will result in a RuntimeError or deadlock. This is due to limitations of multiprocess start methods on Windows itself. Using the default argument value (num_workers=0) will let the DataLoader work in single process mode as expected.

Parameters

columns (Columns or Sequence) – A column object, a tuple of column object or a list of column objects.
keys (Sequence[str]) – An iterable collection of sample names. If given only those samples will fetched from the column
index_range (slice) – A python slice object which will be used to find the subset of column. Argument keys takes priority over range i.e. if both are given, keys will be used and range will be ignored
field_names (Sequence[str], optional) – An array of field names used as the field_names for the returned dict keys. If not given, column names will be used as the field_names.

Examples

>>> from hangar import Repository
>>> from torch.utils.data import DataLoader
>>> from hangar import make_torch_dataset
>>> repo = Repository('.')
>>> co = repo.checkout()
>>> aset = co.columns['dummy_aset']
>>> torch_dset = make_torch_dataset(aset, index_range=slice(1, 100))
>>> loader = DataLoader(torch_dset, batch_size=16)
>>> for batch in loader:
...     train_model(batch)

Returns
Return type: torch.utils.data.Dataset