Change Log¶
v0.5.2 (2020-05-08)¶
New Features¶
Improvements¶
`str` typed columns can now accept data containing any Unicode code point. In prior releases, data containing any non-ASCII character could not be written to this column type. (#198) @rlizzo
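The distinction this entry draws can be illustrated with the standard library alone. The sketch below is not Hangar's serialization code, and the helper name `roundtrip` is hypothetical. It only shows why an ASCII-only encoding rejects such text while UTF-8 accepts any code point.

```python
def roundtrip(text: str, encoding: str) -> str:
    """Encode text to bytes and decode it back with the same encoding."""
    return text.encode(encoding).decode(encoding)


sample = "naïve café ☕"

# UTF-8 can represent every Unicode code point, so the round trip is lossless.
utf8_result = roundtrip(sample, "utf-8")

# ASCII only covers code points 0-127, so encoding fails on 'ï', 'é' and '☕'.
try:
    roundtrip(sample, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```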
Bug Fixes¶
Fixed issue where `str` and (newly added) `bytes` column data could not be fetched / pushed between a local client repository and remote server. (#198) @rlizzo
v0.5.1 (2020-04-05)¶
Bug Fixes¶
Fixed issue where importing `make_torch_dataloader` or `make_tf_dataloader` under Python 3.6 would raise a `NameError` regardless of whether the package is installed. (#196) @rlizzo
v0.5.0 (2020-04-04)¶
Improvements¶
Major backend overhaul which defines column layouts and data types in the same interchangeable / extensible manner as storage backends. This will allow rapid development of new layouts and data type support as new use cases are discovered by the community. (#184) @rlizzo
Column and backend classes are now fully serializable (pickleable) for `read-only` checkouts. (#180) @rlizzo
Modularized internal structure of API classes to easily allow new column layouts / data types to be added in the future. (#180) @rlizzo
Improved type / value checking of manual specification for column `backend` and `backend_options`. (#180) @rlizzo
Standardized column data access API to follow python standard library `dict` methods API. (#180) @rlizzo
Memory usage of arrayset checkouts has been reduced by ~70% by using C-structs for allocating sample record locating info. (#179) @rlizzo
Read times from the `HDF5_00` and `HDF5_01` backends have been reduced by 33-38% (or more for arraysets with many samples) by eliminating redundant computation of chunked storage B-Tree. (#179) @rlizzo
Commit times and checkout times have been reduced by 11-18% by optimizing record parsing and memory allocation. (#179) @rlizzo
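One of the entries above standardizes column data access on the methods of the standard library `dict`. As a rough illustration of what that implies for callers, the sketch below builds a toy read-only container on `collections.abc.Mapping`. The class `ToyColumn` is hypothetical and is not Hangar's column implementation.

```python
from collections.abc import Mapping


class ToyColumn(Mapping):
    """Hypothetical read-only column exposing dict-style access."""

    def __init__(self, samples):
        self._samples = dict(samples)

    def __getitem__(self, key):
        return self._samples[key]

    def __iter__(self):
        return iter(self._samples)

    def __len__(self):
        return len(self._samples)


# Mapping supplies keys(), values(), items(), get() and `in` for free.
col = ToyColumn({"a": 1, "b": 2})
missing_value = col.get("zzz")
```

Defining only the three abstract methods is enough here because `Mapping` derives the remaining dict-style methods from them.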
New Features¶
Added `str` type column with the same behavior as the `ndarray` column (supporting both single-level and nested layouts) to replace the functionality of the removed `metadata` container. (#184) @rlizzo
New backend based on `LMDB` has been added (specifier of `lmdb_30`). (#184) @rlizzo
Added `.diff()` method to `Repository` class to enable diffing changes between any pair of commits / branches without needing to open the diff base in a checkout. (#183) @rlizzo
New CLI command `hangar diff` which reports a summary view of changes made between any pair of commits / branches. (#183) @rlizzo
Added `.log()` method to `Checkout` objects so graphical commit graph or machine readable commit details / DAG can be queried when operating on a particular commit. (#183) @rlizzo
“string” type columns now supported alongside “ndarray” column type. (#180) @rlizzo
New “column” API, which replaces “arrayset” name. (#180) @rlizzo
Arraysets can now contain “nested subsamples” under a common sample key. (#179) @rlizzo
New API to add and remove samples from an arrayset. (#179) @rlizzo
Added `repo.size_nbytes` and `repo.size_human` to report the disk usage of a repository. (#174) @rlizzo
Added method to traverse the entire repository history and cryptographically verify integrity. (#173) @rlizzo
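The diffing entries above report changes between any pair of commits. The general idea can be sketched as a comparison of two key-to-digest mappings. The function `diff_commits`, the result keys, and the sample data below are all hypothetical, and the sketch makes no claim about the output format of `.diff()` or `hangar diff`.

```python
def diff_commits(base: dict, other: dict) -> dict:
    """Summarize changes between two {sample_key: content_digest} mappings."""
    base_keys, other_keys = set(base), set(other)
    return {
        # keys present only in `other`
        "added": sorted(other_keys - base_keys),
        # keys present only in `base`
        "deleted": sorted(base_keys - other_keys),
        # keys present in both whose digests differ
        "mutated": sorted(k for k in base_keys & other_keys if base[k] != other[k]),
    }


commit_a = {"img_0": "d1", "img_1": "d2", "img_2": "d3"}
commit_b = {"img_0": "d1", "img_1": "d9", "img_3": "d4"}
summary = diff_commits(commit_a, commit_b)
```

Because only digests are compared, a diff of this kind never needs to load the sample data itself.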
Changes¶
Argument syntax of `__getitem__()` and `get()` methods of `ReaderCheckout` and `WriterCheckout` classes has changed. The new format supports handling arbitrary arguments specific to retrieval of data from any column type. (#183) @rlizzo
Removed¶
`metadata` container for `str` typed data has been completely removed. It is replaced by a highly extensible and much more user-friendly `str` typed column. (#184) @rlizzo
`__setitem__()` method in `WriterCheckout` objects. Writing data to columns via a checkout object is no longer supported. (#183) @rlizzo
Bug Fixes¶
Backend data stores no longer use file symlinks, improving compatibility with some types of file systems. (#171) @rlizzo
All arrayset types (“flat” and “nested subsamples”) and backend readers can now be pickled – for parallel processing – in a read-only checkout. (#179) @rlizzo
Breaking changes¶
New backend record serialization format is incompatible with repositories written in version 0.4 or earlier.
New arrayset API is incompatible with Hangar API in version 0.4 or earlier.
v0.4.0 (2019-11-21)¶
New Features¶
Added ability to delete branch names/pointers from a local repository via both API and CLI. (#128) @rlizzo
Added `local` keyword arg to arrayset key/value iterators to return only locally available samples. (#131) @rlizzo
Ability to change the backend storage format and options applied to an `arrayset` after initialization. (#133) @rlizzo
Added blosc compression to HDF5 backend by default on PyPI installations. (#146) @rlizzo
Added Benchmarking Suite to Test for Performance Regressions in PRs. (#155) @rlizzo
Added new backend optimized to increase speeds for fixed size arrayset access. (#160) @rlizzo
Improvements¶
Removed `msgpack` and `pyyaml` dependencies. Cleaned up and improved remote client/server code. (#130) @rlizzo
Multiprocess Torch DataLoaders allowed on Linux and MacOS. (#144) @rlizzo
Added CLI options `commit`, `checkout`, `arrayset create`, & `arrayset remove`. (#150) @rlizzo
Documentation Improvements and Typo-Fixes. (#156) @alessiamarcolini
Removed implicit removal of arrayset schema from checkout if every sample was removed from arrayset. This could potentially result in dangling accessors which may or may not self-destruct (as expected) in certain edge-cases. (#159) @rlizzo
Added type codes to hash digests so that calculation function can be updated in the future without breaking repos written in previous Hangar versions. (#165) @rlizzo
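The type-code entry above describes a common versioning technique: prefix each digest with a short code naming the function that produced it, so that old digests stay verifiable after the default changes. The sketch below shows the idea with `hashlib`. The registry, the `code=hexdigest` layout, and both function names are assumptions for illustration, not Hangar's actual digest format.

```python
import hashlib

# Hypothetical registry mapping a one-character type code to a hash function.
HASHERS = {
    "0": lambda data: hashlib.blake2b(data, digest_size=20).hexdigest(),
    "1": lambda data: hashlib.sha256(data).hexdigest(),
}
CURRENT_CODE = "1"


def make_digest(data: bytes, code: str = CURRENT_CODE) -> str:
    """Return '<code>=<hexdigest>' so the algorithm is recoverable later."""
    return f"{code}={HASHERS[code](data)}"


def verify_digest(data: bytes, digest: str) -> bool:
    """Recompute the digest with whichever algorithm its type code names."""
    code, _, _ = digest.partition("=")
    return make_digest(data, code) == digest


old = make_digest(b"payload", code="0")  # as written by an "older" version
new = make_digest(b"payload")            # as written by the "current" version
```

Both `old` and `new` verify against the same payload, which is the property that lets the default function change without invalidating existing records.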
Bug Fixes¶
Programmatic access to repository log contents now returns branch heads alongside other log info. (#125) @rlizzo
Fixed minor bug in types of values allowed for `Arrayset` names vs `Sample` names. (#151) @rlizzo
Fixed issue where using checkout object to access a sample in multiple arraysets would try to create a `namedtuple` instance with invalid field names. Now incompatible field names are automatically renamed with their positional index. (#161) @rlizzo
Explicitly raise error if `commit` argument is set while checking out a repository with `write=True`. (#166) @rlizzo
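The `namedtuple` fix above matches the behavior of the standard library's `rename=True` option, which replaces invalid field names with an underscore followed by the field's position. Whether Hangar uses this exact option is an assumption, and the arrayset names below are made up for illustration.

```python
from collections import namedtuple

# Hypothetical arrayset names; two of them are not valid Python identifiers.
arrayset_names = ["images", "1-labels", "class"]

# rename=True replaces each invalid field name with '_<position>'.
Sample = namedtuple("Sample", arrayset_names, rename=True)
row = Sample("img", "lbl", "cls")
```

Here `Sample._fields` is `('images', '_1', '_2')`. Without `rename=True` the same call raises `ValueError`.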
Breaking changes¶
New commit reference serialization format is incompatible with repositories written in version 0.3.0 or earlier.
v0.3.0 (2019-09-10)¶
New Features¶
API addition allowing reading and writing arrayset data from a checkout object directly. (#115) @rlizzo
Data importers, exporters, and viewers via CLI for common file formats. Includes plugin system for easy extensibility in the future. (#103) (@rlizzo, @hhsecond)
Improvements¶
Added Tutorial on Tensorflow and PyTorch Dataloaders. (#117) @hhsecond
Large performance improvement to diff/merge algorithm (~30x previous). (#112) @rlizzo
New commit hash algorithm which is much more reproducible in the long term. (#120) @rlizzo
HDF5 backend updated to increase speed of reading/writing variable sized dataset compressed chunks (#120) @rlizzo
Bug Fixes¶
Fixed ML Dataloader errors for a number of edge cases surrounding partial-remote data and non-common keys. (#110) (@hhsecond, @rlizzo)
Breaking changes¶
New commit hash algorithm is incompatible with repositories written in version 0.2.0 or earlier
v0.2.0 (2019-08-09)¶
New Features¶
Selection heuristics to determine appropriate backend from arrayset schema. (#70) @rlizzo
Partial remote clones and fetch operations now fully supported. (#85) @rlizzo
CLI has been placed under test coverage, added interface usage to docs. (#85) @rlizzo
TensorFlow and PyTorch Machine Learning Dataloader Methods (Experimental Release). (#91) lead: @hhsecond, co-author: @rlizzo, reviewed by: @elistevens
Improvements¶
Record format versioning and standardization so as not to break backwards compatibility in the future. (#70) @rlizzo
Backend addition and update developer protocols and documentation. (#70) @rlizzo
Read-only checkout arrayset sample `get` methods are now multithread and multiprocess safe. (#84) @rlizzo
Read-only checkout metadata sample `get` methods are thread safe if used within a context manager. (#101) @rlizzo
Samples can be assigned integer names in addition to `string` names. (#89) @rlizzo
Forgetting to close a `write-enabled` checkout before terminating the python process will close the checkout automatically in many situations. (#101) @rlizzo
Repository software version compatibility methods added to ensure upgrade paths in the future. (#101) @rlizzo
Many tests added (including support for Mac OSX on Travis-CI). lead: @rlizzo, co-author: @hhsecond
Bug Fixes¶
Diffs for fast-forward merges now return sensible results. (#77) @rlizzo
Many type annotations added, and developer documentation improved. @hhsecond & @rlizzo
Breaking changes¶
Renamed all references to `datasets` in the API / world-view to `arraysets`.
These are backwards incompatible changes. For all versions > 0.2, repository upgrade utilities will be provided if breaking changes occur.
v0.1.1 (2019-05-24)¶
Bug Fixes¶
Fixed typo in README which was uploaded to PyPI.
v0.1.0 (2019-05-24)¶
New Features¶
Remote client-server config negotiation and administrator permissions. (#10) @rlizzo
Allow single python process to access multiple repositories simultaneously. (#20) @rlizzo
Fast-Forward and 3-Way Merge and Diff methods now fully supported and behaving as expected. (#32) @rlizzo
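The merge entry above names two strategies, fast-forward and 3-way. The sketch below shows the general 3-way technique on plain key-to-digest mappings. It is a textbook-style illustration under assumed data shapes, not Hangar's merge algorithm, and `three_way_merge` is a hypothetical name.

```python
def three_way_merge(ancestor: dict, master: dict, dev: dict) -> dict:
    """Merge two {key: digest} mappings relative to their common ancestor.

    Raises ValueError if both sides changed the same key in different ways.
    """
    merged, conflicts = {}, []
    missing = object()  # sentinel meaning "key absent on this side"
    for key in set(ancestor) | set(master) | set(dev):
        base = ancestor.get(key, missing)
        ours = master.get(key, missing)
        theirs = dev.get(key, missing)
        if ours == theirs:        # both sides agree (or neither changed)
            result = ours
        elif ours == base:        # only dev changed this key
            result = theirs
        elif theirs == base:      # only master changed this key
            result = ours
        else:                     # both changed it, and differently
            conflicts.append(key)
            continue
        if result is not missing:  # a one-sided deletion drops the key
            merged[key] = result
    if conflicts:
        raise ValueError(f"conflicting keys: {sorted(conflicts)}")
    return merged


ancestor = {"a": "1", "b": "2", "c": "3"}
master = {"a": "1", "b": "2x", "c": "3"}  # changed "b"
dev = {"a": "1", "b": "2", "d": "4"}      # deleted "c", added "d"
merged = three_way_merge(ancestor, master, dev)
```

When `master` is identical to `ancestor`, the result equals `dev`, which corresponds to the fast-forward case.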
Improvements¶
Any potential failure cases raise exceptions instead of silently returning. (#16) @rlizzo
Many usability improvements in a variety of commits.
Bug Fixes¶
Ensure references to checkout arrayset or metadata objects cannot operate after the checkout is closed. (#41) @rlizzo
Sensible exception classes and error messages raised on a variety of situations (Many commits). @hhsecond & @rlizzo
Many minor issues addressed.
API Additions¶
Refer to API documentation (#23)
Breaking changes¶
All repositories written with previous versions of Hangar are liable to break when using this version. Please upgrade versions immediately.
v0.0.0 (2019-04-15)¶
First Public Release of Hangar!