Python API¶
This is the Python API for the Hangar project.
Repository¶
-
class Repository(path: os.PathLike, exists: bool = True)¶
Launching point for all user operations in a Hangar repository.
All interaction, including the ability to initialize a repo, checkout a commit (for either reading or writing), create a branch, merge branches, or generally view the contents or state of the local repository, starts here. Simply provide this class instance with a path to an existing Hangar repository, or to a directory where one should be initialized, and all required data for starting your work on the repo will be populated automatically.
>>> from hangar import Repository
>>> repo = Repository('foo/path/to/dir')
Parameters:
- path (str) – Local directory path where the Hangar repository exists (or should be initialized).
- exists (bool, optional) – True if a Hangar repository should exist at the given directory path. Should no Hangar repository exist at that location, a UserWarning will be raised indicating that the init() method needs to be called. False if the provided path does not need to (but optionally can) contain a Hangar repository; if a Hangar repository does not exist at that path, the usual UserWarning will be suppressed. In both cases, the path must exist and the user must have sufficient OS permissions to write to that location. Default = True.
-
checkout(write: bool = False, *, branch: str = 'master', commit: str = '') → Union[hangar.checkout.ReaderCheckout, hangar.checkout.WriterCheckout]¶
Checkout the repo at some point in time in either read or write mode.
Only one writer instance can exist at a time. A write-enabled checkout must create a staging area from the HEAD commit of a branch. In contrast, any number of reader checkouts can exist at the same time, and each can specify either a branch name or a commit hash.
Parameters:
- write (bool, optional) – Specify if the checkout is write capable, defaults to False.
- branch (str, optional) – Name of the branch to checkout. This utilizes the state of the repo as it existed at the branch HEAD commit when this checkout object was instantiated, defaults to 'master'.
- commit (str, optional) – Specific hash of a commit to use for the checkout (instead of a branch HEAD commit). This argument takes precedence over a branch name parameter if it is set. Note: this will only be used in non-writeable checkouts, defaults to ''.
Raises: ValueError – If the value of the write argument is not boolean.
Returns: Checkout object which can be used to interact with the repository data.
Return type: Union[ReaderCheckout, WriterCheckout]
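The read/write distinction above can be sketched as follows (a usage sketch, assuming repo is an already-initialized Repository; the commit hash shown is a placeholder):
>>> w_co = repo.checkout(write=True)             # exclusive write-enabled checkout of 'master' HEAD
>>> r_co = repo.checkout()                       # read-only checkout; any number may coexist
>>> old = repo.checkout(commit='somecommithash') # read-only view of a specific commit
>>> w_co.close()                                 # release the writer lock when done
>>> r_co.close()
>>> old.close()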
-
clone(user_name: str, user_email: str, remote_address: str, *, remove_old: bool = False) → str¶
Download a remote repository to the local disk.
The clone method implemented here is very similar to a git clone operation. This method will pull all commit records, history, and data which are parents of the remote's master branch head commit. If a Repository exists at the specified directory, the operation will fail.
Parameters:
- user_name (str) – Name of the person who will make commits to the repository. This information is recorded permanently in the commit records.
- user_email (str) – Email address of the repository user. This information is recorded permanently in any commits created.
- remote_address (str) – Location where the hangar.remote.server.HangarServer process is running and accessible by the clone user.
- remove_old (bool, optional, kwarg-only) – DANGER! DEVELOPMENT USE ONLY! If enabled, a hangar.repository.Repository existing on disk at the same path as the requested clone location will be completely removed and replaced with the newly cloned repo. (The default is False, which will not modify any contents on disk and will refuse to create a repository at a given location if one already exists there.)
Returns: Name of the master branch for the newly cloned repository.
Return type: str
-
create_branch(name: str, base_commit: str = None) → str¶
Create a branch with the provided name from a certain commit.
If no base commit hash is specified, the current writer branch HEAD commit is used as the base_commit hash for the branch. Note that creating a branch does not actually create a checkout object for interaction with the data. To interact, you must use the repository checkout() method to properly initialize a read (or write) enabled checkout object.
Parameters:
- name (str) – Name to assign to the new branch.
- base_commit (str, optional) – Commit hash to start the branch root at. If not specified, the writer branch HEAD commit at the time of execution will be used, defaults to None.
Returns: Name of the branch which was created.
Return type: str
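As a sketch of both forms (branch names and the commit hash are placeholders), branching from the writer HEAD versus an explicit base commit:
>>> repo.create_branch('dev')
'dev'
>>> repo.create_branch('hotfix', base_commit='somecommithash')
'hotfix'
>>> co = repo.checkout(write=True, branch='dev')  # a separate checkout is still required to interact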
-
force_release_writer_lock() → bool¶
Force release the lock left behind by an unclosed writer-checkout.
Warning
NEVER USE THIS METHOD IF A WRITER PROCESS IS CURRENTLY ACTIVE. At the time of writing, the implications of improper/malicious use of this are not understood, and there is a risk of undefined behavior or (potentially) data corruption.
At the moment, the responsibility to close a write-enabled checkout is placed entirely on the user. If the close() method is not called before the program terminates, a new checkout with write=True will fail. The lock can only be released via a call to this method.
Note
This entire mechanism is subject to review/replacement in the future.
Returns: True if the operation was successful.
Return type: bool
-
init(user_name: str, user_email: str, *, remove_old: bool = False) → os.PathLike¶
Initialize a Hangar repository at the specified directory path.
This function must be called before a checkout can be performed.
Parameters:
- user_name (str) – Name of the repository user account.
- user_email (str) – Email address of the repository user account.
- remove_old (bool, kwarg-only) – DEVELOPER USE ONLY – remove and reinitialize a Hangar repository at the given path, Default = False.
Returns: The full directory path where the Hangar repository was initialized on disk.
Return type: os.PathLike
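A minimal first-time setup, combining init() with the exists=False constructor flag described above (a sketch; the path and identity values are placeholders):
>>> repo = Repository('foo/path/to/dir', exists=False)
>>> repo.init(user_name='Jane Doe', user_email='jane@example.com')
>>> repo.initialized
True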
-
initialized¶
Check if the repository has been initialized or not.
Returns: True if the repository has been initialized.
Return type: bool
-
list_branches() → List[str]¶
List all branch names created in the repository.
Returns: The branch names recorded in the repository.
Return type: List[str]
-
log(branch: str = None, commit: str = None, *, return_contents: bool = False, show_time: bool = False, show_user: bool = False) → Optional[dict]¶
Display a pretty-printed commit log graph in the terminal.
Note
For programmatic access, the return_contents value can be set to True, which will retrieve the relevant commit specifications as dictionary elements.
Parameters:
- branch (str, optional) – The name of the branch to start the log process from. (Default value = None)
- commit (str, optional) – The commit hash to start the log process from. (Default value = None)
- return_contents (bool, optional, kwarg-only) – If True, return the commit graph specifications in a dictionary suitable for programmatic access/evaluation.
- show_time (bool, optional, kwarg-only) – If True and return_contents is False, show the time of each commit on the printed log graph.
- show_user (bool, optional, kwarg-only) – If True and return_contents is False, show the committer of each commit on the printed log graph.
Returns: Dict containing the commit ancestor graph and all specifications.
Return type: Optional[dict]
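For example, printing the graph versus retrieving it programmatically (a sketch, assuming a repository with at least one commit):
>>> repo.log(branch='master', show_time=True)        # pretty printed to the terminal
>>> spec = repo.log(branch='master', return_contents=True)
>>> isinstance(spec, dict)
True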
-
merge(message: str, master_branch: str, dev_branch: str) → str¶
Perform a merge of the changes made on two branches.
Parameters:
- message (str) – Commit message to use for this merge.
- master_branch (str) – Name of the master branch to merge into.
- dev_branch (str) – Name of the dev/feature branch to merge.
Returns: Hash of the commit which is written, if possible.
Return type: str
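A sketch of merging a feature branch back into master at the repository level (branch names are illustrative):
>>> merge_hash = repo.merge('merge dev work into master', 'master', 'dev')
The returned value is the hash of the resulting merge commit on the master branch.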
-
path¶
Return the path to the repository on disk; read-only attribute.
Returns: Path to the specified repository, not including the .hangar directory.
Return type: os.PathLike
-
remote¶
Accessor to the methods controlling remote interactions.
See also
Remotes for available methods of this property.
Returns: Accessor object methods for controlling remote interactions.
Return type: Remotes
-
remove_branch(name)¶
Not Implemented
-
summary(*, branch: str = '', commit: str = '') → None¶
Print a summary of the repository contents to the terminal.
Parameters:
- branch (str, optional) – A specific branch name whose head commit will be used as the summary point. (Default value = '')
- commit (str, optional) – A specific commit hash which should be used as the summary point. (Default value = '')
-
version¶
Find the version of the Hangar software the repository is written with.
Returns: Semantic version (major, minor, micro) of the repository software version.
Return type: str
-
writer_lock_held¶
Check if the writer lock is currently marked as held; read-only attribute.
Returns: True if the writer lock is held, False if the writer lock is free.
Return type: bool
-
class Remotes¶
Class which governs access to remote interactor objects.
Note
The remote-server implementation is under heavy development and is likely to undergo changes in the future. While we intend to ensure compatibility between software versions of Hangar repositories written to disk, the API is likely to change. Please follow our process at: https://www.github.com/tensorwerk/hangar-py
-
add(name: str, address: str) → hangar.remotes.RemoteInfo¶
Add a remote to the repository, accessible by name at address.
Parameters:
- name (str) – The name which should be used to refer to the remote server (ie: 'origin').
- address (str) – The IP:PORT where the hangar server is running.
Returns: Two-tuple containing (name, address) of the remote added to the client's server list.
Return type: RemoteInfo
Raises: ValueError – If a remote with the provided name is already listed on this client, No-Op. In order to update a remote server address, it must be removed and then re-added with the desired address.
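For example, registering a server and verifying connectivity before any fetch or push (a sketch; the name and address are placeholders):
>>> repo.remote.add('origin', 'localhost:50051')
>>> rtt = repo.remote.ping('origin')   # round trip time as a float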
-
fetch(remote: str, branch: str) → str¶
Retrieve new commits made on a remote repository branch.
This is semantically identical to a git fetch command. Any new commits along the branch will be retrieved, but placed on a branch isolated from the local copy (ie. remote_name/branch_name). In order to unify histories, simply merge the remote branch into the local branch.
Parameters:
- remote (str) – Name of the remote repository to fetch from (ie. origin).
- branch (str) – Name of the branch to fetch the commit references for.
Returns: Name of the branch which stores the retrieved commits.
Return type: str
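The fetch-then-merge workflow described above might look like this (a sketch; remote and branch names are illustrative):
>>> fetched = repo.remote.fetch('origin', 'master')   # ie. 'origin/master'
>>> co = repo.checkout(write=True, branch='master')
>>> co.merge('merge remote changes', fetched)
>>> co.close()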
-
fetch_data(remote: str, branch: str = None, commit: str = None, *, arrayset_names: Optional[Sequence[str]] = None, max_num_bytes: int = None, retrieve_all_history: bool = False) → List[str]¶
Retrieve the data for some commit which exists in a partial state.
Parameters:
- remote (str) – Name of the remote to pull the data from.
- branch (str, optional) – The name of a branch whose HEAD will be used as the data fetch point. If None, the commit argument is expected, by default None.
- commit (str, optional) – Commit hash to retrieve data for. If None, the branch argument is expected, by default None.
- arrayset_names (Optional[Sequence[str]]) – Names of the arraysets which should be retrieved for the particular commits; any arraysets not named will not have their data fetched from the server. Default behavior is to retrieve all arraysets.
- max_num_bytes (Optional[int]) – If you wish to limit the amount of data sent to the local machine, set a max_num_bytes parameter. This will retrieve only this amount of data from the server to be placed on the local disk. Default is to retrieve all data regardless of size.
- retrieve_all_history (Optional[bool]) – If data should be retrieved for all history accessible by the parents of this commit HEAD, by default False.
Returns: Commit hashes of the data which was returned.
Return type: List[str]
Raises:
ValueError – If branch and commit args are set simultaneously.
ValueError – If the specified commit does not exist in the repository.
ValueError – If the branch name does not exist in the repository.
-
list_all() → List[hangar.remotes.RemoteInfo]¶
List all remote names and addresses recorded in the client's repository.
Returns: List of namedtuples specifying (name, address) for each remote server recorded in the client repo.
Return type: List[RemoteInfo]
-
ping(name: str) → float¶
Ping the remote server and check the round trip time.
Parameters: name (str) – Name of the remote server to ping.
Returns: Round trip time it took to ping the server after the connection was established and the requested client configuration was retrieved.
Return type: float
Raises:
KeyError – If no remote with the provided name is recorded.
ConnectionError – If the remote server could not be reached.
-
push(remote: str, branch: str, *, username: str = '', password: str = '') → bool¶
Push changes made on a local repository to a remote repository.
This method is semantically identical to a git push operation. Any local updates will be sent to the remote repository.
Note
The current implementation is not capable of performing a force push operation. As such, remote branches with histories diverged from the local repo must be retrieved, locally merged, then re-pushed. This feature will be added in the near future.
Parameters:
- remote (str) – Name of the remote repository to make the push on.
- branch (str) – Name of the branch to push to the remote. If the branch name does not exist on the remote, it will be created.
- username (str, optional, kwarg-only) – Credentials to use for authentication if repository push restrictions are enabled, by default ''.
- password (str, optional, kwarg-only) – Credentials to use for authentication if repository push restrictions are enabled, by default ''.
Returns: Name of the branch which was pushed.
Return type: str
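A sketch of pushing local work (remote and branch names are illustrative; the credentials matter only when the server enables push restrictions):
>>> repo.remote.push('origin', 'master')
>>> repo.remote.push('origin', 'dev', username='user', password='pass')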
-
remove(name: str) → hangar.remotes.RemoteInfo¶
Remove a remote repository from the branch records.
Parameters: name (str) – Name of the remote to remove the reference to.
Raises: ValueError – If a remote with the provided name does not exist.
Returns: The channel address which was removed at the given remote name.
Return type: str
-
Write Enabled Checkout¶
-
class WriterCheckout¶
Checkout the repository at the head of a given branch for writing.
This is the entry point for all writing operations to the repository. The writer class records all interactions in a special "staging" area, which is based off the state of the repository as it existed at the HEAD commit of a branch.
>>> co = repo.checkout(write=True)
>>> co.branch_name
'master'
>>> co.commit_hash
'masterheadcommithash'
>>> co.close()
At the moment, only one instance of this class can write data to the staging area at a time. After the desired operations have been completed, it is crucial to call close() to release the writer lock. In addition, after any changes have been made to the staging area, the branch HEAD cannot be changed. In order to checkout another branch HEAD for writing, you must either commit() the changes, or perform a hard reset of the staging area to the last commit via reset_staging_area().
In order to reduce the chance that the python interpreter is shut down without calling close(), which releases the writer lock (a common mistake during ipython / jupyter sessions), an atexit hook is registered to close(). If properly closed by the user, the hook is unregistered after completion with no ill effects. So long as the process is NOT terminated via non-python SIGKILL, fatal internal python error, or special os exit methods, cleanup will occur on interpreter shutdown and the writer lock will be released. If a non-handled termination method does occur, the force_release_writer_lock() method must be called manually when a new python process wishes to open the writer checkout.
-
__getitem__(index)¶
Dictionary style access to arraysets and samples.
Checkout object can be thought of as a "dataset" ("dset") mapping a view of samples across arraysets.
>>> dset = repo.checkout(branch='master')
Get an arrayset contained in the checkout.
>>> dset['foo']
ArraysetDataReader
Get a specific sample from 'foo' (returns a single array).
>>> dset['foo', '1']
np.array([1])
Get multiple samples from 'foo' (returns a list of arrays, in order of input keys).
>>> dset['foo', ['1', '2', '324']]
[np.array([1]), np.ndarray([2]), np.ndarray([324])]
Get a sample from multiple arraysets (returns a namedtuple of arrays, field names = arrayset names).
>>> dset[('foo', 'bar', 'baz'), '1']
ArraysetData(foo=array([1]), bar=array([11]), baz=array([111]))
Get multiple samples from multiple arraysets (returns a list of namedtuples of arrays sorted in input key order, field names = arrayset names).
>>> dset[('foo', 'bar'), ('1', '2')]
[ArraysetData(foo=array([1]), bar=array([11])), ArraysetData(foo=array([2]), bar=array([22]))]
Get samples from all arraysets (shortcut syntax).
>>> out = dset[:, ('1', '2')]
>>> out = dset[..., ('1', '2')]
>>> out
[ArraysetData(foo=array([1]), bar=array([11]), baz=array([111])), ArraysetData(foo=array([2]), bar=array([22]), baz=array([222]))]
>>> out = dset[:, '1']
>>> out = dset[..., '1']
>>> out
ArraysetData(foo=array([1]), bar=array([11]), baz=array([111]))
Parameters: index – Please see the detailed explanation above for full options. The first element (or collection) specified must be of str type and correspond to an arrayset name(s). Alternatively, the Ellipsis operator (...) or unbounded slice operator (: <==> slice(None)) can be used to indicate "select all" behavior. If a second element (or collection) is present, the keys correspond to sample names present within (all) the specified arraysets. If a key is not present in even one arrayset, the entire get operation will abort with a KeyError. If desired, the same selection syntax can be used with the get() method, which will not error in these situations, but simply return None values in the appropriate position for keys which do not exist.
Returns:
- Arrayset – Single arrayset parameter, no samples specified.
- np.ndarray – Single arrayset specified, single sample key specified.
- List[np.ndarray] – Single arrayset, multiple samples; array data for each sample is returned in the same order the sample keys were received.
- List[NamedTuple[*np.ndarray]] – Multiple arraysets, multiple samples. Each arrayset's name is used as a field in the NamedTuple elements; each NamedTuple contains arrays stored in each arrayset via a common sample key. Each sample key is returned as an individual element in the List. The samples are returned in the same order the keys were received.
Notes
- All specified arraysets must exist.
- All specified sample keys must exist in all specified arraysets, otherwise a standard exception is thrown.
- Slice syntax cannot be used in the sample keys field.
- Slice syntax for the arrayset field cannot specify start, stop, or step fields; it is solely a shortcut syntax for 'get all arraysets' in the : or slice(None) form.
-
__setitem__(index, value)¶
Syntax for setting items.
Checkout object can be thought of as a "dataset" ("dset") mapping a view of samples across arraysets:
>>> dset = repo.checkout(branch='master', write=True)
Add a single sample to a single arrayset.
>>> dset['foo', 1] = np.array([1])
>>> dset['foo', 1]
array([1])
Add multiple samples to a single arrayset.
>>> dset['foo', [1, 2, 3]] = [np.array([1]), np.array([2]), np.array([3])]
>>> dset['foo', [1, 2, 3]]
[array([1]), array([2]), array([3])]
Add a single sample to multiple arraysets.
>>> dset[['foo', 'bar'], 1] = [np.array([1]), np.array([11])]
>>> dset[:, 1]
ArraysetData(foo=array([1]), bar=array([11]))
Parameters:
- index (Union[Iterable[str], Iterable[str, int]]) – Please see the detailed explanation above for full options. The first element (or collection) specified must be of str type and correspond to an arrayset name(s). The second element (or collection) are keys corresponding to sample names which the data should be written to. Unlike the __getitem__() method, only ONE of the arrayset name(s) or sample key(s) can specify multiple elements at the same time. Ie. if multiple arraysets are specified, only one sample key can be set; likewise, if multiple samples are specified, only one arrayset can be specified. When specifying multiple arraysets or samples, each data piece to be stored must reside as an individual element (np.ndarray) in a List or Tuple. The number of keys and the number of values must match exactly.
- values (Union[np.ndarray, Iterable[np.ndarray]]) – Data to store in the specified arraysets/sample keys. When specifying multiple arraysets or samples, each data piece to be stored must reside as an individual element (np.ndarray) in a List or Tuple. The number of keys and the number of values must match exactly.
Notes
- No slicing syntax is supported for either arraysets or samples. This is in order to ensure explicit setting of values in the desired fields/keys.
- Adding multiple samples to multiple arraysets is not yet supported.
-
arraysets¶
Provides access to the arrayset interaction object.
Can be used to either return the arraysets accessor for all elements or a single arrayset instance by using dictionary style indexing.
>>> co = repo.checkout(write=True)
>>> asets = co.arraysets
>>> len(asets)
0
>>> fooAset = asets.init_arrayset('foo', shape=(10, 10), dtype=np.uint8)
>>> len(co.arraysets)
1
>>> print(co.arraysets.keys())
['foo']
>>> fooAset = co.arraysets['foo']
>>> fooAset.dtype
np.fooDtype
>>> fooAset = asets.get('foo')
>>> fooAset.dtype
np.fooDtype
See also
The class Arraysets contains all methods accessible by this property accessor.
Returns: Weakref proxy to the arraysets object, which behaves exactly like an arraysets accessor class but which can be invalidated when the writer lock is released.
Return type: Arraysets
-
branch_name¶
Branch this write-enabled checkout's staging area was based on.
Returns: Name of the branch whose commit HEAD changes are staged from.
Return type: str
-
close() → None¶
Close all handles to the writer checkout and release the writer lock.
Failure to call this method after the writer checkout has been used will result in a lock being placed on the repository which will not allow any writes until it has been manually cleared.
-
commit(commit_message: str) → str¶
Commit the changes made in the staging area on the checkout branch.
Parameters: commit_message (str, optional) – User provided message for a log of what was changed in this commit. Should a fast-forward commit be possible, this will NOT be added to the fast-forward HEAD.
Returns: The commit hash of the new commit.
Return type: str
Raises: RuntimeError – If no changes have been made in the staging area, no commit occurs.
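Putting the writer workflow together, a minimal edit-and-commit cycle might look like this (a sketch; the arrayset name 'foo' and data are illustrative and assumed to already exist):
>>> co = repo.checkout(write=True)
>>> co['foo', '1'] = np.array([1])
>>> digest = co.commit('added sample 1 to foo')
>>> co.close()   # always release the writer lock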
-
commit_hash¶
Commit hash which the staging area of branch_name is based on.
Returns: Commit hash.
Return type: str
-
diff¶
Access the differ methods which are aware of any staged changes.
See also
The class hangar.diff.WriterUserDiff contains all methods accessible by this property accessor.
Returns: Weakref proxy to the differ object (and contained methods), which behaves exactly like the differ class but which can be invalidated when the writer lock is released.
Return type: WriterUserDiff
-
get(arraysets, samples, *, except_missing=False)¶
View of samples across arraysets which handles missing sample keys.
Please see __getitem__() for a full description. This method is identical with a single exception: if a sample key is not present in an arrayset, this method will place a null None value in its return slot rather than throwing a KeyError like the dict style access does.
Parameters:
- arraysets (Union[str, Iterable[str], Ellipses, slice(None)]) – Name(s) of the arraysets to query. The Ellipsis operator (...) or unbounded slice operator (: <==> slice(None)) can be used to indicate "select all" behavior.
- samples (Union[str, int, Iterable[Union[str, int]]]) – Name(s) of the samples to query.
- except_missing (bool, KWARG ONLY) – If False, will not throw exceptions on a missing sample key value. Will raise a KeyError if True and a missing key is found.
Returns:
- Arrayset – Single arrayset parameter, no samples specified.
- np.ndarray – Single arrayset specified, single sample key specified.
- List[np.ndarray] – Single arrayset, multiple samples; array data for each sample is returned in the same order the sample keys were received.
- List[NamedTuple[*np.ndarray]] – Multiple arraysets, multiple samples. Each arrayset's name is used as a field in the NamedTuple elements; each NamedTuple contains arrays stored in each arrayset via a common sample key. Each sample key is returned as an individual element in the List. The samples are returned in the same order the keys were received.
-
merge(message: str, dev_branch: str) → str¶
Merge the currently checked out commit with the provided branch name.
If a fast-forward merge is possible, it will be performed, and the commit message argument to this function will be ignored.
Parameters:
- message (str) – Commit message to attach to a three-way merge.
- dev_branch (str) – Name of the branch which should be merged into this branch (master).
Returns: Commit hash of the new commit for the master branch this checkout was started from.
Return type: str
-
metadata¶
Provides access to the metadata interaction object.
See also
The class hangar.metadata.MetadataWriter contains all methods accessible by this property accessor.
Returns: Weakref proxy to the metadata object, which behaves exactly like a metadata class but which can be invalidated when the writer lock is released.
Return type: MetadataWriter
-
reset_staging_area() → str¶
Perform a hard reset of the staging area to the last commit head.
After this operation completes, the writer checkout will automatically close in the typical fashion (any held references to arrayset or metadata objects will finalize and destruct as normal). In order to perform any further operations, a new checkout needs to be opened.
Warning
This operation is IRREVERSIBLE. All records and data which are not stored in a previous commit will be permanently deleted.
Returns: Commit hash of the head which the staging area is reset to.
Return type: str
Raises: RuntimeError – If no changes have been made to the staging area, No-Op.
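For instance, discarding unwanted staged changes (a sketch; note that the checkout closes itself afterwards, so a new one must be opened):
>>> co = repo.checkout(write=True)
>>> co['foo', '1'] = np.array([999])   # a mistaken write
>>> co.reset_staging_area()            # staging reset to last commit; checkout closes
>>> co = repo.checkout(write=True)     # reopen to continue working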
-
Arraysets¶
-
class Arraysets¶
Common access patterns and initialization/removal of arraysets in a checkout.
This object is the entry point to all tensor data stored in their individual arraysets. Each arrayset contains a common schema which dictates the general shape, dtype, and access patterns which the backends optimize access for. The methods contained within allow us to create, remove, query, and access these collections of common tensors.
-
__contains__(key: str) → bool¶
Determine if an arrayset with a particular name is stored in the checkout.
Parameters: key (str) – Name of the arrayset to check for.
Returns: True if an arrayset with the provided name exists in the checkout, otherwise False.
Return type: bool
-
__delitem__(key: str) → str¶
Remove an arrayset and all data records if this is a write-enabled process.
Parameters: key (str) – Name of the arrayset to remove from the repository. This will remove all records from the staging area (though the actual data and all records are still accessible) if they were previously committed.
Returns: If successful, the name of the removed arrayset.
Return type: str
Raises: PermissionError – If this is a read-only checkout, no operation is permitted.
-
__getitem__(key: str) → Union[hangar.arrayset.ArraysetDataReader, hangar.arrayset.ArraysetDataWriter]¶
Dict style access to return the arrayset object with the specified key/name.
Parameters: key (str) – Name of the arrayset object to get.
Returns: The object which is returned depends on the mode of checkout specified. If the arrayset was checked out with write enabled, return a writer object, otherwise return a read-only object.
Return type: ArraysetDataReader or ArraysetDataWriter
-
__setitem__(key, value)¶
Specifically prevent use of dict style setting for arrayset objects.
Arraysets must be created using the factory function init_arrayset().
Raises: PermissionError – This operation is not allowed under any circumstance.
-
contains_remote_references¶
Dict of bools indicating data reference locality in each arrayset.
Returns: For each arrayset name key, a boolean value where False indicates that all samples in the arrayset exist locally, True if some reference remote sources.
Return type: Mapping[str, bool]
-
get(name: str) → Union[hangar.arrayset.ArraysetDataReader, hangar.arrayset.ArraysetDataWriter]¶
Returns an arrayset access object.
This can be used in lieu of the dictionary style access.
Parameters: name (str) – Name of the arrayset to return.
Returns: ArraysetData accessor (set to read or write mode as appropriate) which governs interaction with the data.
Return type: Union[ArraysetDataReader, ArraysetDataWriter]
Raises: KeyError – If no arrayset with the given name exists in the checkout.
-
init_arrayset(name: str, shape: Union[int, Tuple[int]] = None, dtype: numpy.dtype = None, prototype: numpy.ndarray = None, named_samples: bool = True, variable_shape: bool = False, *, backend: str = None) → hangar.arrayset.ArraysetDataWriter¶
Initializes an arrayset in the repository.
Arraysets are groups of related data pieces (samples). All samples within an arrayset have the same data type and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.
For fixed dimension sizes, all samples written to the arrayset must have the same size that was initially specified upon arrayset initialization. Variable size arraysets, on the other hand, can write samples with dimensions of any size less than a maximum which is required to be set upon arrayset creation.
Parameters:
- name (str) – The name assigned to this arrayset.
- shape (Union[int, Tuple[int]]) – The shape of the data samples which will be written in this arrayset. This argument and the dtype argument are required if a prototype is not provided, defaults to None.
- dtype (np.dtype) – The datatype of this arrayset. This argument and the shape argument are required if a prototype is not provided, defaults to None.
- prototype (np.ndarray) – A sample array of correct datatype and shape which will be used to initialize the arrayset storage mechanisms. If this is provided, the shape and dtype arguments must not be set, defaults to None.
- named_samples (bool, optional) – If the samples in the arrayset have names associated with them. If set, all samples must be provided names; if not, no name will be assigned. Defaults to True, which means all samples should have names.
- variable_shape (bool, optional) – If this is a variable sized arrayset. If True, the maximum shape is set from the provided shape or prototype argument. Any sample added to the arrayset can then have dimension sizes <= this initial specification (so long as it has the same rank as what was specified), defaults to False.
- backend (DEVELOPER USE ONLY. str, optional, kwarg-only) – Backend which should be used to write the arrayset files on disk.
Returns: Instance object of the initialized arrayset.
Return type: ArraysetDataWriter
Raises:
ValueError – If the provided name contains any non-ascii or non alpha-numeric characters.
ValueError – If the required shape and dtype arguments are not provided in absence of the prototype argument.
ValueError – If the prototype argument is not a C contiguous ndarray.
LookupError – If an arrayset already exists with the provided name.
ValueError – If the rank of the maximum tensor shape is > 31.
ValueError – If there is a zero sized dimension in the shape argument.
ValueError – If the specified backend is not valid.
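The fixed versus variable shape behaviors described above might be initialized as follows (a sketch; arrayset names, shapes, and dtypes are illustrative):
>>> co = repo.checkout(write=True)
>>> imgs = co.arraysets.init_arrayset('imgs', shape=(28, 28), dtype=np.uint8)
>>> vecs = co.arraysets.init_arrayset('vecs', prototype=np.zeros(10, dtype=np.float32))
>>> txt = co.arraysets.init_arrayset('txt', shape=(500,), dtype=np.uint8, variable_shape=True)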
-
iswriteable¶
Bool indicating if this arrayset object is write-enabled. Read-only attribute.
-
items
() → Iterable[Tuple[str, Union[hangar.arrayset.ArraysetDataReader, hangar.arrayset.ArraysetDataWriter]]]¶ generator providing access to arrayset_name,
Arraysets
Yields: Iterable[Tuple[str, Union[ArraysetDataReader, ArraysetDataWriter]]] – two-tuple of every arrayset name/object pair in the checkout.
-
keys
() → List[str]¶ list all arrayset keys (names) in the checkout
Returns: list of arrayset names Return type: List[str]
-
multi_add
(mapping: Mapping[str, numpy.ndarray]) → str¶ Add related samples to un-named arraysets with the same generated key.
If you have multiple arraysets in a checkout whose samples are related to each other in some manner, there are two ways of associating samples together:
- using named arraysets and setting each tensor in each arrayset to the same sample “name”.
- using un-named arraysets with this “add” method, which accepts a dictionary of “arrayset names” as keys and “tensors” (ie. individual samples) as values.
When method (2) - this method - is used, the internally generated sample ids will be set to the same value for the samples in each arrayset. That way a user can iterate over the sample keys in one arrayset, and use those same keys to get the related tensor samples in another arrayset.
Parameters: mapping (Mapping[str, np.ndarray]) – Dict mapping (any number of) arrayset names to tensor data (samples) which to add. The arraysets must exist, and must be set to accept samples which are not named by the user Returns: generated id (key) which each sample is stored under in their corresponding arrayset. This is the same for all samples specified in the input dictionary. Return type: str Raises: KeyError
– If no arrayset with the given name exists in the checkout
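A minimal sketch of method (2), assuming a write checkout where two un-named arraysets 'image' and 'label' have already been initialized (names, shapes, and dtypes are illustrative):

```
>>> batch = {'image': np.zeros((28, 28), dtype=np.float32),
...          'label': np.array([7], dtype=np.int64)}
>>> key = co.arraysets.multi_add(batch)
>>> # the single generated key retrieves the related samples from both arraysets
>>> co.arraysets['image'][key].shape
(28, 28)
>>> co.arraysets['label'][key]
array([7])
```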
-
remote_sample_keys
¶ Determine the sample names in each arrayset which reference remote sources.
Returns: dict where keys are arrayset names and values are iterables of samples in the arrayset containing remote references Return type: Mapping[str, Iterable[Union[int, str]]]
-
remove_aset
(aset_name: str) → str¶ remove the arrayset and all data contained within it from the repository.
Parameters: aset_name (str) – name of the arrayset to remove Returns: name of the removed arrayset Return type: str Raises: KeyError
– If an arrayset does not exist with the provided name
-
values
() → Iterable[Union[hangar.arrayset.ArraysetDataReader, hangar.arrayset.ArraysetDataWriter]]¶ yield all arrayset object instances in the checkout.
Yields: Iterable[Union[ArraysetDataReader, ArraysetDataWriter]] – Generator of ArraysetData accessor objects (set to read or write mode as appropriate)
-
Arrayset Data¶
-
class
ArraysetDataWriter
¶ Class implementing methods to write data to an arrayset.
Writer-specific methods are contained here, while read functionality is shared with the methods common to
ArraysetDataReader
. Write-enabled checkouts are not thread/process safe for eitherwrites
ORreads
, a restriction we impose forwrite-enabled
checkouts in order to ensure data integrity above all else.See also
-
__contains__
(key: Union[str, int]) → bool¶ Determine if a key is a valid sample name in the arrayset
Parameters: key (Union[str, int]) – name to check if it is a sample in the arrayset Returns: True if key exists, else False Return type: bool
-
__delitem__
(key: Union[str, int]) → Union[str, int]¶ Remove a sample from the arrayset. Convenience method to
remove()
.See also
Parameters: key (Union[str, int]) – Name of the sample to remove from the arrayset Returns: Name of the sample removed from the arrayset (assuming operation successful) Return type: Union[str, int]
-
__getitem__
(key: Union[str, int]) → numpy.ndarray¶ Retrieve a sample with a given key. Convenience method for dict style access.
See also
Parameters: key (Union[str, int]) – sample key to retrieve from the arrayset Returns: sample array data corresponding to the provided key Return type: np.ndarray
-
__len__
() → int¶ Check how many samples are present in a given arrayset
Returns: number of samples the arrayset contains Return type: int
-
__setitem__
(key: Union[str, int], value: numpy.ndarray) → Union[str, int]¶ Store a piece of data in an arrayset. Convenience method to
add()
.See also
Parameters: - key (Union[str, int]) – name of the sample to add to the arrayset
- value (np.array) – tensor data to add as the sample
Returns: sample name of the stored data (assuming operation was successful)
Return type: Union[str, int]
-
add
(data: numpy.ndarray, name: Union[str, int] = None, **kwargs) → Union[str, int]¶ Store a piece of data in an arrayset
Parameters: - data (np.ndarray) – data to store as a sample in the arrayset.
- name (Union[str, int], optional) – name to assign to the sample (assuming the arrayset accepts named samples). If str, can only contain alpha-numeric ascii characters (in addition to ‘-‘, ‘.’, ‘_’). Integer keys must be >= 0. Defaults to None.
Returns: sample name of the stored data (assuming the operation was successful)
Return type: Union[str, int]
Raises: ValueError
– If no name arg was provided for arrayset requiring named samples.ValueError
– If input data tensor rank exceeds specified rank of arrayset samples.ValueError
– For variable shape arraysets, if a dimension size of the input data tensor exceeds specified max dimension size of the arrayset samples.ValueError
– For fixed shape arraysets, if input data dimensions do not exactly match specified arrayset dimensions.ValueError
– If type of data argument is not an instance of np.ndarray.ValueError
– If data is not “C” contiguous array layout.ValueError
– If the datatype of the input data does not match the specified data type of the arrayset
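Illustrative use of add() and its __setitem__ equivalent, assuming a write-enabled (28, 28) float32 arrayset named 'mnist' that accepts named samples (all names here are assumptions):

```
>>> aset = co.arraysets['mnist']
>>> sample = np.random.randn(28, 28).astype(np.float32)
>>> aset.add(sample, name='digit_0')
'digit_0'
>>> aset['digit_1'] = sample    # dict-style equivalent of add()
>>> len(aset)
2
```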
-
contains_remote_references
¶ Bool indicating if all samples exist locally or if some reference remote sources.
-
dtype
¶ Datatype of the arrayset schema. Read-only attribute.
-
get
(name: Union[str, int]) → numpy.ndarray¶ Retrieve a sample in the arrayset with a specific name.
The method is thread/process safe IF used in a read only checkout. Use this if the calling application wants to manually manage multiprocess logic for data retrieval. Otherwise, see the
get_batch()
method to retrieve multiple data samples simultaneously; that method uses a multiprocess pool of workers (managed by Hangar) to drastically increase access speed and simplify application developer workflows.Note
In most situations, we have observed little to no performance improvement when using multithreading. However, access time can be decreased nearly linearly with the number of CPU cores / workers if multiprocessing is used.
Parameters: name (Union[str, int]) – Name of the sample to retrieve data for. Returns: Tensor data stored in the arrayset archived with provided name(s). Return type: np.ndarray Raises: KeyError
– if the arrayset does not contain data with the provided name
-
get_batch
(names: Iterable[Union[str, int]], *, n_cpus: int = None, start_method: str = 'spawn') → List[numpy.ndarray]¶ Retrieve a batch of sample data with the provided names.
This method is (technically) thread & process safe, though it should not be called in parallel via multithread/process application code; This method has been seen to drastically decrease retrieval time of sample batches (as compared to looping over single sample names sequentially). Internally it implements a multiprocess pool of workers (managed by hangar) to simplify application developer workflows.
Parameters: - names (Iterable[Union[str, int]]) – list/tuple of sample names to retrieve data for.
- n_cpus (int, kwarg-only) – number of worker processes to use for retrieval; if None, half of the system CPU count is used. Setting this value to
1
will not use a multiprocess pool to perform the work. Default is None. - start_method (str, kwarg-only) – One of ‘spawn’, ‘fork’, ‘forkserver’ specifying the process pool start method. Not all options are available on all platforms; see the python multiprocessing docs for details. Default is ‘spawn’.
Returns: Tensor data stored in the arrayset archived with provided name(s).
If a single sample name is passed in, the corresponding np.ndarray data will be returned.
If a list/tuple of sample names is passed in the
names
argument, a tuple of sizelen(names)
will be returned, where each element is an np.ndarray containing the data at the position its name appears in the
parameter.Return type: List[np.ndarray]
Raises: KeyError
– if the arrayset does not contain data with the provided name
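A hedged sketch of batch retrieval in a read-only checkout; the arrayset and sample names reuse the illustrative assumptions from the earlier examples:

```
>>> co = repo.checkout()            # read-only checkout
>>> aset = co.arraysets['mnist']
>>> batch = aset.get_batch(['digit_0', 'digit_1'], n_cpus=2)
>>> len(batch)                      # one array per requested name, in order
2
>>> batch[0].shape
(28, 28)
```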
-
iswriteable
¶ Bool indicating if this arrayset object is write-enabled. Read-only attribute.
-
items
() → Iterator[Tuple[Union[str, int], numpy.ndarray]]¶ generator yielding two-tuple of (name, tensor), for every sample in the arrayset.
For write enabled checkouts, it is technically possible to iterate over the arrayset object while adding/deleting data; in order to avoid internal python runtime errors (
dictionary changed size during iteration
) we make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts in order to avoid behavioral differences.Yields: Iterator[Tuple[Union[str, int], np.ndarray]] – sample name and stored value for every sample inside the arrayset
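The dict-style iterators can be used together much like a regular mapping; for example (sample names and shapes are illustrative assumptions):

```
>>> list(aset.keys())
['digit_0', 'digit_1']
>>> for name, tensor in aset.items():
...     print(name, tensor.shape)
digit_0 (28, 28)
digit_1 (28, 28)
```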
-
keys
() → Iterator[Union[str, int]]¶ generator which yields the names of every sample in the arrayset
For write enabled checkouts, it is technically possible to iterate over the arrayset object while adding/deleting data; in order to avoid internal python runtime errors (
dictionary changed size during iteration
) we make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts in order to avoid behavioral differences.Yields: Iterator[Union[str, int]] – keys of one sample at a time inside the arrayset
-
name
¶ Name of the arrayset. Read-Only attribute.
-
named_samples
¶ Bool indicating if samples are named. Read-only attribute.
-
remote_reference_sample_keys
¶ Returns sample names whose data is stored in a remote server reference.
Returns: list of sample keys in the arrayset. Return type: List[Union[str, int]]
-
remove
(name: Union[str, int]) → Union[str, int]¶ Remove a sample with the provided name from the arrayset.
Note
This operation will NEVER actually remove any data from disk. If you commit a tensor at any point in time, it will always remain accessible by checking out a previous commit when the tensor was present. This is just a way to tell Hangar that you don’t want some piece of data to clutter up the current version of the repository.
Warning
Though this may change in a future release, in the current version of Hangar, we cannot recover references to data which was added to the staging area, written to disk, but then removed before a commit operation was run. This would be a similar sequence of events as: checking out a git branch, changing a bunch of text in the file, and immediately performing a hard reset. If it was never committed, git doesn’t know about it, and (at the moment) neither does Hangar.
Parameters: name (Union[str, int]) – name of the sample to remove. Returns: If the operation was successful, name of the data sample deleted. Return type: Union[str, int] Raises: KeyError
– If a sample with the provided name does not exist in the arrayset.
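Sketch of the remove semantics described above; the commit hash is a placeholder, and this assumes 'digit_0' was committed before being removed:

```
>>> aset.remove('digit_0')
'digit_0'
>>> 'digit_0' in aset
False
>>> # data committed earlier remains reachable from the prior commit
>>> old = repo.checkout(commit='priorCommitHash')
>>> 'digit_0' in old.arraysets['mnist']
True
```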
-
shape
¶ Shape (or max_shape) of the arrayset sample tensors. Read-only attribute.
-
values
() → Iterator[numpy.ndarray]¶ generator which yields the tensor data for every sample in the arrayset
For write enabled checkouts, it is technically possible to iterate over the arrayset object while adding/deleting data; in order to avoid internal python runtime errors (
dictionary changed size during iteration
) we make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts in order to avoid behavioral differences.Yields: Iterator[np.ndarray] – values of one sample at a time inside the arrayset
-
variable_shape
¶ Bool indicating if arrayset schema is variable sized. Read-only attribute.
-
Metadata¶
-
class
MetadataWriter
¶ Class implementing write access to repository metadata.
Similar to the
ArraysetDataWriter
, this class inherits the functionality of theMetadataReader
for reading. The only difference is that the reader will be initialized with data records pointing to the staging area, and not a commit which is checked out.Note
Write-enabled metadata objects are not thread or process safe. Read-only checkouts can safely use multithreading to retrieve data via the standard
MetadataReader.get()
callsSee also
MetadataReader
for the intended use of this functionality.-
__contains__
(key: Union[str, int]) → bool¶ Determine if a key with the provided name is in the metadata
Parameters: key (Union[str, int]) – key to check for containment testing Returns: True if key exists, False otherwise Return type: bool
-
__delitem__
(key: Union[str, int]) → Union[str, int]¶ Remove a key/value pair from metadata. Convenience method to
remove()
.See also
remove()
for the function this calls into.Parameters: key (Union[str, int]) – Name of the metadata piece to remove. Returns: Metadata key removed from the checkout (assuming operation successful) Return type: Union[str, int]
-
__getitem__
(key: Union[str, int]) → str¶ Retrieve a metadata sample with a key. Convenience method for dict style access.
See also
Parameters: key (Union[str, int]) – metadata key to retrieve from the checkout Returns: value of the metadata key/value pair stored in the checkout. Return type: string
-
__len__
() → int¶ Determine how many metadata key/value pairs are in the checkout
Returns: number of metadata key/value pairs. Return type: int
-
__setitem__
(key: Union[str, int], value: str) → Union[str, int]¶ Store a key/value pair as metadata. Convenience method to
add()
.See also
Parameters: - key (Union[str, int]) – name of the key to add as metadata
- value (string) – value to add as metadata
Returns: key of the stored metadata sample (assuming operation was successful)
Return type: Union[str, int]
-
add
(key: Union[str, int], value: str) → Union[str, int]¶ Add a piece of metadata to the staging area of the next commit.
Parameters: - key (Union[str, int]) – Name of the metadata piece, alphanumeric ascii characters only
- value (string) – Metadata value to store in the repository, any length of valid ascii characters.
Returns: The name of the metadata key written to the database if the operation succeeded.
Return type: Union[str, int]
Raises: ValueError
– If the key contains any whitespace or non alpha-numeric characters.ValueError
– If the value contains any non ascii characters.
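A hedged example of storing and reading metadata on a write checkout (the keys and values here are illustrative):

```
>>> co.metadata.add('author', 'jane')
'author'
>>> co.metadata['license'] = 'MIT'     # dict-style equivalent of add()
>>> co.metadata.get('author')
'jane'
>>> len(co.metadata)
2
```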
-
get
(key: Union[str, int]) → str¶ retrieve a piece of metadata from the checkout.
Parameters: key (Union[str, int]) – The name of the metadata piece to retrieve.
Returns: The stored metadata value associated with the key.
Return type: str
Raises: ValueError
– If the key is not str type or contains whitespace or non alpha-numeric characters.KeyError
– If no metadata exists in the checkout with the provided key.
-
iswriteable
¶ Read-only attribute indicating if this metadata object is write-enabled.
Returns: True if write-enabled checkout, Otherwise False. Return type: bool
-
items
() → Iterator[Tuple[Union[str, int], str]]¶ generator yielding key/value for all metadata recorded in checkout.
For write enabled checkouts, it is technically possible to iterate over the metadata object while adding/deleting data; in order to avoid internal python runtime errors (
dictionary changed size during iteration
) we make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts in order to avoid behavioral differences.Yields: Iterator[Tuple[Union[str, int], str]] – metadata key and stored value for every piece in the checkout.
-
keys
() → Iterator[Union[str, int]]¶ generator which yields the names of every metadata piece in the checkout.
For write enabled checkouts, it is technically possible to iterate over the metadata object while adding/deleting data; in order to avoid internal python runtime errors (
dictionary changed size during iteration
) we make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts in order to avoid behavioral differences.Yields: Iterator[Union[str, int]] – keys of one metadata sample at a time
-
remove
(key: Union[str, int]) → Union[str, int]¶ Remove a piece of metadata from the staging area of the next commit.
Parameters: key (Union[str, int]) – Metadata name to remove.
Returns: Name of the metadata key/value pair removed, if the operation was successful.
Return type: Union[str, int]
Raises: ValueError
– If the provided key is not a string containing only ascii-alphanumeric characters.KeyError
– If the checkout does not contain metadata with the provided key.
-
values
() → Iterator[str]¶ generator yielding all metadata values in the checkout
For write enabled checkouts, it is technically possible to iterate over the metadata object while adding/deleting data; in order to avoid internal python runtime errors (
dictionary changed size during iteration
) we make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts in order to avoid behavioral differences.Yields: Iterator[str] – values of one metadata piece at a time
-
Differ¶
-
class
WriterUserDiff
¶ Methods diffing contents of a
WriterCheckout
instance.These provide diffing implementations to compare the current
HEAD
of a checkout to a branch, commit, or the staging area"base"
contents. The results are generally returned as a nested set of named tuples. In addition, thestatus()
method is implemented which can be used to quickly determine if there are any uncommitted changes written in the checkout.When diffing of commits or branches is performed, if there is not a linear history of commits between current
HEAD
and the diff commit (ie. a history which would permit a"fast-forward" merge
), the result field namedconflict
will contain information on any merge conflicts that would exist if staging areaHEAD
and the (compared)"dev" HEAD
were merged “right now”. Though this field is present for all diff comparisons, it can only contain non-empty values in the cases where a three way merge would need to be performed.Fast Forward is Possible ======================== (master) (foo) a ----- b ----- c ----- d 3-Way Merge Required ==================== (master) a ----- b ----- c ----- d \ \ (foo) \----- ee ----- ff
-
branch
(dev_branch: str) → hangar.diff.DiffAndConflicts¶ Compute diff between HEAD and branch, returning user-facing results.
Parameters: dev_branch (str) – name of the branch whose HEAD will be used to calculate the diff. Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm. Return type: DiffAndConflicts Raises: ValueError
– If the specified dev_branch does not exist.
-
commit
(dev_commit_hash: str) → hangar.diff.DiffAndConflicts¶ Compute diff between HEAD and commit, returning user-facing results.
Parameters: dev_commit_hash (str) – hash of the commit to be used as the comparison. Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm. Return type: DiffAndConflicts Raises: ValueError
– if the specified dev_commit_hash is not a valid commit reference.
-
staged
() → hangar.diff.DiffAndConflicts¶ Return diff of staging area to base, returning user-facing results.
Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm. Return type: DiffAndConflicts
-
status
() → str¶ Determine if changes have been made in the staging area
If the contents of the staging area and its parent commit are the same, the status is said to be “CLEAN”. If even one arrayset or metadata record has changed, however, the status is “DIRTY”.
Returns: “CLEAN” if no changes have been made, otherwise “DIRTY” Return type: str
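Sketch of the differ workflow on a write checkout; the branch name and the staged change are assumptions:

```
>>> co = repo.checkout(write=True)
>>> co.diff.status()
'CLEAN'
>>> co.metadata['note'] = 'work in progress'   # stage any change
>>> co.diff.status()
'DIRTY'
>>> diff, conflicts = co.diff.branch('testbranch')  # two-tuple, per the docs above
```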
-
Read Only Checkout¶
-
class
ReaderCheckout
¶ Checkout the repository as it exists at a particular branch.
This class is instantiated automatically from a repository checkout operation. This object will govern all access to data and interaction methods the user requests.
>>> co = repo.checkout() >>> isinstance(co, ReaderCheckout) True
If a commit hash is provided, it will take precedence over the branch name parameter. If neither a branch nor a commit is specified, the staging environment’s base branch
HEAD
commit hash will be read.>>> co = repo.checkout(commit='foocommit') >>> co.commit_hash 'foocommit' >>> co.close() >>> co = repo.checkout(branch='testbranch') >>> co.commit_hash 'someothercommithashhere' >>> co.close()
Unlike
WriterCheckout
, any number ofReaderCheckout
objects can exist on the repository independently. Like thewrite-enabled
variant, theclose()
method should be called after performing the necessary operations on the repo. However, as there is no concept of alock
forread-only
checkouts, this is just to free up memory resources, rather than changing recorded access state.In order to reduce the chance that the python interpreter is shut down without calling
close()
, - a common mistake during ipython / jupyter sessions - an atexit hook is registered toclose()
. If properly closed by the user, the hook is unregistered after completion with no ill effects. So long as the process is NOT terminated via non-python
SIGKILL
, fatal internal python error, or special os exit
methods, cleanup will occur on interpreter shutdown and resources will be freed. If a non-handled termination method does occur, the implications of holding resources vary on a per-OS basis. While no risk to data integrity is observed, repeated misuse may require a system reboot in order to achieve expected performance characteristics.-
__getitem__
(index)¶ Dictionary style access to arraysets and samples
Checkout object can be thought of as a “dataset” (“dset”) mapping a view of samples across arraysets.
>>> dset = repo.checkout(branch='master')
Get an arrayset contained in the checkout.
>>> dset['foo'] ArraysetDataReader
Get a specific sample from
'foo'
(returns a single array)>>> dset['foo', '1'] np.array([1])
Get multiple samples from
'foo'
(returns a list of arrays, in order of input keys)>>> dset['foo', ['1', '2', '324']] [np.array([1]), np.array([2]), np.array([324])]
Get sample from multiple arraysets (returns namedtuple of arrays, field names = arrayset names)
>>> dset[('foo', 'bar', 'baz'), '1'] ArraysetData(foo=array([1]), bar=array([11]), baz=array([111]))
Get multiple samples from multiple arraysets(returns list of namedtuple of array sorted in input key order, field names = arrayset names)
>>> dset[('foo', 'bar'), ('1', '2')] [ArraysetData(foo=array([1]), bar=array([11])), ArraysetData(foo=array([2]), bar=array([22]))]
Get samples from all arraysets (shortcut syntax)
>>> out = dset[:, ('1', '2')] >>> out = dset[..., ('1', '2')] >>> out [ArraysetData(foo=array([1]), bar=array([11]), baz=array([111])), ArraysetData(foo=array([2]), bar=array([22]), baz=array([222]))]
>>> out = dset[:, '1'] >>> out = dset[..., '1'] >>> out ArraysetData(foo=array([1]), bar=array([11]), baz=array([111]))
Parameters: index – Please see the detailed explanation above for the full options. The options are hard coded to the order of specification.
The first element (or collection) specified must be
str
type and correspond to arrayset name(s). Alternatively the Ellipsis operator (
) or unbounded slice operator (:
<==>slice(None)
) can be used to indicate “select all” behavior.If a second element (or collection) is present, the keys correspond to sample names present within (all) the specified arraysets. If a key is not present in even on arrayset, the entire
get
operation will abort withKeyError
. If desired, the same selection syntax can be used with theget()
method, which will not Error in these situations, but simply returnNone
values in the appropriate position for keys which do not exist.Returns: - Arrayset – single arrayset parameter, no samples specified
- np.ndarray – Single arrayset specified, single sample key specified
- List[np.ndarray] – Single arrayset, multiple samples. Array data for each sample is returned in the same order the sample keys are received.
- List[NamedTuple[*np.ndarray]] – Multiple arraysets, multiple samples. Each arrayset’s name is used as a field in the NamedTuple elements; each NamedTuple contains the arrays stored in each arrayset under a common sample key. The values for each sample key are returned as an individual element in the List, in the same order the keys were received.
Notes
- All specified arraysets must exist
- All specified sample keys must exist in all specified arraysets, otherwise standard exception thrown
- Slice syntax cannot be used in sample keys field
- Slice syntax for arrayset field cannot specify start, stop, or
step fields; it is solely a shortcut syntax for ‘get all arraysets’ in
the
:
orslice(None)
form
-
arraysets
¶ Provides access to arrayset interaction object.
Can be used to either return the arraysets accessor for all elements or a single arrayset instance by using dictionary style indexing.
>>> co = repo.checkout(write=False) >>> len(co.arraysets) 1 >>> print(co.arraysets.keys()) ['foo']
>>> fooAset = co.arraysets['foo'] >>> fooAset.dtype np.fooDtype
>>> asets = co.arraysets >>> fooAset = asets['foo'] >>> fooAset = asets.get('foo') >>> fooAset.dtype np.fooDtype
See also
The class
Arraysets
contains all methods accessible by this property accessorReturns: weakref proxy to the arraysets object which behaves exactly like an arrayset accessor class but which can be invalidated when the writer lock is released. Return type: Arraysets
-
close
() → None¶ Gracefully close the reader checkout object.
Though not strictly required for reader checkouts (as opposed to writers), closing the checkout after reading will free file handles and system resources, which may improve performance for repositories with multiple simultaneous read checkouts.
-
commit_hash
¶ Commit hash this read-only checkout’s data is read from.
>>> co.commit_hash foohashdigesthere
Returns: commit hash of the checkout Return type: string
-
diff
¶ Access the differ methods for a read-only checkout.
See also
The class
ReaderUserDiff
contains all methods accessible by this property accessorReturns: weakref proxy to the differ object (and contained methods) which behaves exactly like the differ class but which can be invalidated when the writer lock is released. Return type: ReaderUserDiff
-
get
(arraysets, samples, *, except_missing=False)¶ View of sample data across arraysets, gracefully handling missing sample keys.
Please see
__getitem__()
for a full description. This method is identical with a single exception: if a sample key is not present in an arrayset, this method will place a null None
value in its return slot rather than throwing a KeyError
like the dict style access does.Parameters: - arraysets (Union[str, Iterable[str], Ellipses, slice(None)]) – Name(s) of the arraysets to query. The Ellipsis operator (
...
) or unbounded slice operator (:
<==>slice(None)
) can be used to indicate “select all” behavior. - samples (Union[str, int, Iterable[Union[str, int]]]) – Names(s) of the samples to query
- except_missing (bool, KWARG ONLY) – If False, missing sample keys will not raise an exception. If True, a KeyError will be raised when a missing key is found.
Returns: - Arrayset – single arrayset parameter, no samples specified
- np.ndarray – Single arrayset specified, single sample key specified
- List[np.ndarray] – Single arrayset, multiple samples. Array data for each sample is returned in the same order the sample keys are received.
- List[NamedTuple[*np.ndarray]] – Multiple arraysets, multiple samples. Each arrayset’s name is used as a field in the NamedTuple elements; each NamedTuple contains the arrays stored in each arrayset under a common sample key. The values for each sample key are returned as an individual element in the List, in the same order the keys were received.
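Contrasting get() with the dict-style access described above; the arrayset and sample keys reuse the illustrative names from the __getitem__ examples, and '999' is assumed to be missing:

```
>>> dset = repo.checkout(branch='master')
>>> # dict-style access raises KeyError if '999' is absent from any arrayset;
>>> # get() instead places None in the corresponding slot
>>> out = dset.get(('foo', 'bar'), ('1', '999'))
```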
-
metadata
¶ Provides access to metadata interaction object.
See also
The class
hangar.metadata.MetadataReader
contains all methods accessible by this property accessorReturns: weakref proxy to the metadata object which behaves exactly like a metadata class but which can be invalidated when the writer lock is released. Return type: MetadataReader
-
Arraysets¶
-
class
Arraysets
Common access patterns and initialization/removal of arraysets in a checkout.
This object is the entry point to all tensor data stored in their individual arraysets. Each arrayset contains a common schema which dictates the general shape, dtype, and access patterns which the backends optimize access for. The methods contained within allow us to create, remove, query, and access these collections of common tensors.
-
__contains__
(key: str) → bool Determine if an arrayset with a particular name is stored in the checkout
Parameters: key (str) – name of the arrayset to check for Returns: True if an arrayset with the provided name exists in the checkout, otherwise False. Return type: bool
-
__getitem__
(key: str) → Union[hangar.arrayset.ArraysetDataReader, hangar.arrayset.ArraysetDataWriter] Dict style access to return the arrayset object with specified key/name.
Parameters: key (string) – name of the arrayset object to get. Returns: The object returned depends on the mode of the checkout. If the checkout is write-enabled, a writer object is returned; otherwise a read-only object. Return type: ArraysetDataReader
orArraysetDataWriter
-
get
(name: str) → Union[hangar.arrayset.ArraysetDataReader, hangar.arrayset.ArraysetDataWriter] Returns an arrayset access object.
This can be used in lieu of the dictionary style access.
Parameters: name (str) – name of the arrayset to return Returns: ArraysetData accessor (set to read or write mode as appropriate) which governs interaction with the data Return type: Union[ArraysetDataReader, ArraysetDataWriter] Raises: KeyError
– If no arrayset with the given name exists in the checkout
-
iswriteable
Bool indicating if this arrayset object is write-enabled. Read-only attribute.
-
items
() → Iterable[Tuple[str, Union[hangar.arrayset.ArraysetDataReader, hangar.arrayset.ArraysetDataWriter]]] generator providing access to arrayset_name,
Arraysets
Yields: Iterable[Tuple[str, Union[ArraysetDataReader, ArraysetDataWriter]]] – two-tuple of every arrayset name/object pair in the checkout.
-
keys
() → List[str] list all arrayset keys (names) in the checkout
Returns: list of arrayset names Return type: List[str]
-
values
() → Iterable[Union[hangar.arrayset.ArraysetDataReader, hangar.arrayset.ArraysetDataWriter]] yield all arrayset object instances in the checkout.
Yields: Iterable[Union[ArraysetDataReader, ArraysetDataWriter]] – Generator of ArraysetData accessor objects (set to read or write mode as appropriate)
-
Arrayset Data¶
-
class
ArraysetDataReader
¶ Class implementing get access to data in an arrayset.
The methods implemented here are common to the
ArraysetDataWriter
accessor class as well as to this"read-only"
method. Though minimal, the behavior of read and write checkouts is slightly unique, with the main difference being that"read-only"
checkouts implement both thread and process safe access methods. This is not possible for"write-enabled"
checkouts, and attempts at multiprocess/threaded writes will generally fail with cryptic error messages.-
__contains__
(key: Union[str, int]) → bool¶ Determine if a key is a valid sample name in the arrayset
Parameters: key (Union[str, int]) – name to check if it is a sample in the arrayset Returns: True if key exists, else False Return type: bool
-
__getitem__
(key: Union[str, int]) → numpy.ndarray¶ Retrieve a sample with a given key. Convenience method for dict style access.
See also
Parameters: key (Union[str, int]) – sample key to retrieve from the arrayset Returns: sample array data corresponding to the provided key Return type: np.ndarray
-
__len__
() → int¶ Check how many samples are present in a given arrayset
Returns: number of samples the arrayset contains Return type: int
-
contains_remote_references
¶ Bool indicating if all samples exist locally or if some reference remote sources.
-
dtype
¶ Datatype of the arrayset schema. Read-only attribute.
-
get
(name: Union[str, int]) → numpy.ndarray¶ Retrieve a sample in the arrayset with a specific name.
This method is thread/process safe IF used in a read-only checkout. Use it if the calling application wants to manually manage multiprocess logic for data retrieval. Otherwise, see the
get_batch()
method to retrieve multiple data samples simultaneously. That method uses a multiprocess pool of workers (managed by hangar) to drastically increase access speed and simplify application developer workflows.Note
In most situations, we have observed little to no performance improvement when using multithreading. However, access time can decrease nearly linearly with the number of CPU cores / workers when multiprocessing is used.
Parameters: name (Union[str, int]) – Name of the sample to retrieve data for. Returns: Tensor data stored in the arrayset archived with provided name(s). Return type: np.ndarray Raises: KeyError
– if the arrayset does not contain data with the provided name
-
get_batch
(names: Iterable[Union[str, int]], *, n_cpus: int = None, start_method: str = 'spawn') → List[numpy.ndarray]¶ Retrieve a batch of sample data with the provided names.
This method is (technically) thread & process safe, though it should not be called in parallel via multithread/process application code; it has been seen to drastically decrease retrieval time of sample batches (as compared to looping over single sample names sequentially). Internally it implements a multiprocess pool of workers (managed by hangar) to simplify application developer workflows.
Parameters: - names (Iterable[Union[str, int]]) – list/tuple of sample names to retrieve data for.
- n_cpus (int, kwarg-only) – if not None, uses num_cpus / 2 of the system for retrieval. Setting
this value to
1
will not use a multiprocess pool to perform the work. Default is None - start_method (str, kwarg-only) – One of ‘spawn’, ‘fork’, ‘forkserver’ specifying the process pool start method. Not all options are available on all platforms. See the python multiprocessing docs for details. Default is ‘spawn’.
Returns: Tensor data stored in the arrayset archived with provided name(s).
If a single sample name is passed in as the names argument, the corresponding np.ndarray data will be returned.
If a list/tuple of sample names is passed in the
names
argument, a tuple of size len(names)
will be returned, where each element is an np.ndarray containing the data at the position its name is listed in the names
parameter.Return type: List[np.ndarray]
Raises: KeyError
– if the arrayset does not contain data with the provided name
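The batched-retrieval behavior described above can be sketched with a worker pool. This is an illustrative sketch, not Hangar's implementation: the thread-backed `multiprocessing.dummy.Pool` stands in for the process pool Hangar manages, and `read_sample` plus the in-memory `store` are hypothetical stand-ins for single-sample retrieval.

```python
from multiprocessing.dummy import Pool  # thread-backed stand-in for a process pool

# Hypothetical stand-in for stored samples and single-sample retrieval
# (in Hangar this role is played by per-sample get access).
store = {'s0': [0, 0], 's1': [1, 1], 's2': [2, 2]}

def read_sample(name):
    if name not in store:
        raise KeyError(f'arrayset does not contain data with name: {name}')
    return store[name]

def get_batch(names, n_workers=2):
    """Retrieve many samples at once; results preserve the order of `names`."""
    with Pool(n_workers) as pool:
        return pool.map(read_sample, names)

print(get_batch(['s2', 's0']))  # → [[2, 2], [0, 0]]
```

A real process pool (as the `start_method` parameter implies) must pickle work across process boundaries, which is why Hangar restricts this path to read-only checkouts; threads are used here only to keep the sketch self-contained.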
-
iswriteable
¶ Bool indicating if this arrayset object is write-enabled. Read-only attribute.
-
items
() → Iterator[Tuple[Union[str, int], numpy.ndarray]]¶ generator yielding two-tuple of (name, tensor), for every sample in the arrayset.
For write-enabled checkouts, it is technically possible to iterate over the arrayset object while adding/deleting data; to avoid internal python runtime errors (
dictionary changed size during iteration
) we have to make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts to avoid behavioral differences.Yields: Iterator[Tuple[Union[str, int], np.ndarray]] – sample name and stored value for every sample inside the arrayset
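The key-list copy described above can be demonstrated with a plain dict (the names here are illustrative, not Hangar internals): iterating over a snapshot of the keys lets the caller mutate the mapping mid-loop without triggering "dictionary changed size during iteration".

```python
samples = {'a': 1, 'b': 2, 'c': 3}

def items_snapshot(mapping):
    """Yield (key, value) pairs over a snapshot of the keys, so the
    mapping may safely grow or shrink while iterating."""
    for key in list(mapping.keys()):  # copy the key list before the loop
        yield key, mapping[key]

seen = []
for name, value in items_snapshot(samples):
    samples[name + '_copy'] = value  # mutation mid-loop: safe with the snapshot
    seen.append(name)

print(seen)          # → ['a', 'b', 'c']
print(len(samples))  # → 6
```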
-
keys
() → Iterator[Union[str, int]]¶ generator which yields the names of every sample in the arrayset
For write-enabled checkouts, it is technically possible to iterate over the arrayset object while adding/deleting data; to avoid internal python runtime errors (
dictionary changed size during iteration
) we have to make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts to avoid behavioral differences.Yields: Iterator[Union[str, int]] – keys of one sample at a time inside the arrayset
-
name
¶ Name of the arrayset. Read-Only attribute.
-
named_samples
¶ Bool indicating if samples are named. Read-only attribute.
-
remote_reference_sample_keys
¶ Returns sample names whose data is stored in a remote server reference.
Returns: list of sample keys in the arrayset. Return type: List[Union[str, int]]
-
shape
¶ Shape (or max_shape) of the arrayset sample tensors. Read-only attribute.
-
values
() → Iterator[numpy.ndarray]¶ generator which yields the tensor data for every sample in the arrayset
For write-enabled checkouts, it is technically possible to iterate over the arrayset object while adding/deleting data; to avoid internal python runtime errors (
dictionary changed size during iteration
) we have to make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts to avoid behavioral differences.Yields: Iterator[np.ndarray] – values of one sample at a time inside the arrayset
-
variable_shape
¶ Bool indicating if arrayset schema is variable sized. Read-only attribute.
-
Metadata¶
-
class
MetadataReader
¶ Class implementing get access to the metadata in a repository.
Unlike the
ArraysetDataReader
andArraysetDataWriter
, the equivalent Metadata classes do not need a factory function or class to coordinate access through the checkout. This is primarily because the metadata is only stored at a single level, and because the long term storage is much simpler than for array data (it just writes to an lmdb database).Note
It is important to realize that this is not intended to serve as a general store for large amounts of textual data, and has no optimization to support such use cases at this time. This should only serve to attach helpful labels, or other quick information primarily intended for human book-keeping, to the main tensor data!
Note
Write-enabled metadata objects are not thread or process safe. Read-only checkouts can safely use multithreading to retrieve data via the standard
MetadataReader.get()
calls.-
__contains__
(key: Union[str, int]) → bool¶ Determine if a key with the provided name is in the metadata
Parameters: key (Union[str, int]) – key to check for containment testing Returns: True if key exists, False otherwise Return type: bool
-
__getitem__
(key: Union[str, int]) → str¶ Retrieve a metadata sample with a key. Convenience method for dict style access.
See also
Parameters: key (Union[str, int]) – metadata key to retrieve from the checkout Returns: value of the metadata key/value pair stored in the checkout. Return type: string
-
__len__
() → int¶ Determine how many metadata key/value pairs are in the checkout
Returns: number of metadata key/value pairs. Return type: int
-
get
(key: Union[str, int]) → str¶ Retrieve a piece of metadata from the checkout.
Parameters: key (Union[str, int]) – The name of the metadata piece to retrieve.
Returns: The stored metadata value associated with the key.
Return type: str
Raises: ValueError
– If the key is not str type or contains whitespace or non alpha-numeric characters.KeyError
– If no metadata exists in the checkout with the provided key.
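A minimal sketch of the key check implied by the ValueError above. The function name and exact rules are assumptions for illustration (note the signature accepts Union[str, int] while the error text describes str-only keys; this sketch follows the error text), not Hangar's actual validation code.

```python
def validate_metadata_key(key):
    """Raise ValueError unless `key` is a str containing only
    alphanumeric characters (no whitespace), per the constraint above.
    Hypothetical helper, not part of the Hangar API."""
    if not isinstance(key, str):
        raise ValueError(f'metadata key must be str, not {type(key).__name__}')
    if not key.isalnum():
        raise ValueError(f'metadata key {key!r} contains whitespace or '
                         'non alpha-numeric characters')
    return key

validate_metadata_key('trainsplit')     # passes
# validate_metadata_key('train split')  # would raise ValueError
```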
-
iswriteable
¶ Read-only attribute indicating if this metadata object is write-enabled.
Returns: True if write-enabled checkout, Otherwise False. Return type: bool
-
items
() → Iterator[Tuple[Union[str, int], str]]¶ generator yielding key/value for all metadata recorded in checkout.
For write-enabled checkouts, it is technically possible to iterate over the metadata object while adding/deleting data; to avoid internal python runtime errors (
dictionary changed size during iteration
) we have to make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts to avoid behavioral differences.Yields: Iterator[Tuple[Union[str, int], str]] – metadata key and stored value for every piece in the checkout.
-
keys
() → Iterator[Union[str, int]]¶ generator which yields the names of every metadata piece in the checkout.
For write-enabled checkouts, it is technically possible to iterate over the metadata object while adding/deleting data; to avoid internal python runtime errors (
dictionary changed size during iteration
) we have to make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts to avoid behavioral differences.Yields: Iterator[Union[str, int]] – keys of one metadata sample at a time
-
values
() → Iterator[str]¶ generator yielding all metadata values in the checkout
For write-enabled checkouts, it is technically possible to iterate over the metadata object while adding/deleting data; to avoid internal python runtime errors (
dictionary changed size during iteration
) we have to make a copy of the key list before beginning the loop. While not necessary for read checkouts, we perform the same operation for both read and write checkouts to avoid behavioral differences.Yields: Iterator[str] – values of one metadata piece at a time
-
Differ¶
-
class
ReaderUserDiff
¶ Methods diffing contents of a
ReaderCheckout
instance.These provide diffing implementations to compare the current checkout
HEAD
to a branch or commit. The results are generally returned as a nested set of named tuples.When a diff of commits or branches is performed, if there is not a linear history of commits between the current
HEAD
and the diff commit (i.e. a history which would permit a "fast-forward" merge
), the result field namedconflict
will contain information on any merge conflicts that would exist if staging areaHEAD
and the (compared)"dev" HEAD
were merged “right now”. Though this field is present for all diff comparisons, it can only contain non-empty values in the cases where a three way merge would need to be performed.
Fast Forward is Possible
========================
(master)                (foo)
a ----- b ----- c ----- d

3-Way Merge Required
====================
                (master)
a ----- b ----- c ----- d
 \
  \              (foo)
   \----- ee ----- ff
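The distinction drawn in the diagrams can be sketched as an ancestry check over parent pointers: a fast-forward is possible only when one HEAD is an ancestor of the other, otherwise a three-way merge is required. The commit graph and function names below are illustrative assumptions, not Hangar's diff machinery.

```python
# Child -> parent pointers mirroring the diagrams: a linear chain
# a-b-c-d, plus a branch ee-ff diverging after a.
parents = {'b': 'a', 'c': 'b', 'd': 'c', 'ee': 'a', 'ff': 'ee'}

def is_ancestor(commit, head):
    """Walk head's parent chain; True if `commit` lies on it."""
    while head is not None:
        if head == commit:
            return True
        head = parents.get(head)
    return False

def fast_forward_possible(head_a, head_b):
    """A fast-forward merge requires a linear history between the HEADs."""
    return is_ancestor(head_a, head_b) or is_ancestor(head_b, head_a)

print(fast_forward_possible('b', 'd'))   # → True  (linear history)
print(fast_forward_possible('d', 'ff'))  # → False (3-way merge required)
```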
-
branch
(dev_branch: str) → hangar.diff.DiffAndConflicts¶ Compute diff between HEAD and branch name, returning user-facing results.
Parameters: dev_branch (str) – name of the branch whose HEAD will be used to calculate the diff. Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm. Return type: DiffAndConflicts Raises: ValueError
– If the specified dev_branch does not exist.
-
commit
(dev_commit_hash: str) → hangar.diff.DiffAndConflicts¶ Compute diff between HEAD and commit hash, returning user-facing results.
Parameters: dev_commit_hash (str) – hash of the commit to be used as the comparison. Returns: two-tuple of diff, conflict (if any) calculated in the diff algorithm. Return type: DiffAndConflicts Raises: ValueError
– if the specified dev_commit_hash is not a valid commit reference.
-
ML Framework Dataloaders¶
Tensorflow¶
-
make_tf_dataset
(arraysets, keys: Sequence[str] = None, index_range: slice = None, shuffle: bool = True)¶ Uses the hangar arraysets to make a tensorflow dataset. It uses the from_generator function from tf.data.Dataset with a generator function that wraps all the hangar arraysets. In such cases, a tensorflow Dataset shuffles by loading the subset of data which can fit into memory and shuffling that subset. Since this is not really a global shuffle, make_tf_dataset accepts a shuffle argument which the generator uses to shuffle the sample order each time it is called.
Warning
tf.data.Dataset.from_generator currently uses tf.compat.v1.py_func() internally. Hence the serialization function (yield_data) will not be serialized in a GraphDef. Therefore, you won’t be able to serialize your model and restore it in a different environment if you use make_tf_dataset. The operation must run in the same address space as the Python program that calls tf.compat.v1.py_func(). If you are using distributed TensorFlow, you must run a tf.distribute.Server in the same process as the program that calls tf.compat.v1.py_func() and you must pin the created operation to a device in that server (e.g. using with tf.device():)
Parameters: - arraysets (
ArraysetDataReader
or Sequence) – An arrayset object, a tuple of arrayset objects, or a list of arrayset objects - keys (Sequence[str]) – An iterable of sample names. If given, only those samples will be fetched from the arrayset
- index_range (slice) – A python slice object which will be used to find the subset of the arrayset. The keys argument takes priority over index_range, i.e. if both are given, keys will be used and index_range will be ignored
- shuffle (bool) – the generator uses this to decide whether a global shuffle across all the samples is required or not. Nothing prevents the user from also calling `.shuffle()` on the returned dataset
Examples
>>> from hangar import Repository >>> from hangar import make_tf_dataset >>> import tensorflow as tf >>> tf.compat.v1.enable_eager_execution() >>> repo = Repository('.') >>> co = repo.checkout() >>> data = co.arraysets['mnist_data'] >>> target = co.arraysets['mnist_target'] >>> tf_dset = make_tf_dataset([data, target]) >>> tf_dset = tf_dset.batch(512) >>> for bdata, btarget in tf_dset: ... print(bdata.shape, btarget.shape)
Returns: Return type: tf.data.Dataset
- arraysets (
Pytorch¶
-
make_torch_dataset
(arraysets, keys: Sequence[str] = None, index_range: slice = None, field_names: Sequence[str] = None)¶ Returns a torch.utils.data.Dataset object which can be loaded into a torch.utils.data.DataLoader.
Parameters: - arraysets (
ArraysetDataReader
or Sequence) – An arrayset object, a tuple of arrayset objects, or a list of arrayset objects. - keys (Sequence[str]) – An iterable collection of sample names. If given, only those samples will be fetched from the arrayset
- index_range (slice) – A python slice object which will be used to find the subset of the arrayset. The keys argument takes priority over index_range, i.e. if both are given, keys will be used and index_range will be ignored
- field_names (list or tuple of str) – A sequence of field names used as the field_names for the returned namedtuple. If not given, arrayset names will be used as the field_names.
Examples
>>> from hangar import Repository >>> from torch.utils.data import DataLoader >>> from hangar import make_torch_dataset >>> repo = Repository('.') >>> co = repo.checkout() >>> aset = co.arraysets['dummy_aset'] >>> torch_dset = make_torch_dataset(aset, index_range=slice(1, 100)) >>> loader = DataLoader(torch_dset, batch_size=16) >>> for batch in loader: ... train_model(batch)
Returns: Return type: torch.utils.data.Dataset
- arraysets (