Releases: fako/datagrowth
ETL terminology
Starts using more standard terminology for ETL operations to make the learning curve less steep:
- The `Resource` class now specifies an abstract `extract` method to adhere to the strategy pattern more explicitly and harmonize naming with ETL terminology.
- To fit better into ETL terminology `ExtractProcessor` now also has an alias named `TransformProcessor`. In the same spirit the `extract` method has a `transform` alias and `extract_from_resource` a `transform_resource` alias.
- Configurations with "extract" in their name remain unchanged for now, until the impact of changing them has been assessed.
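The aliasing described above boils down to exposing the same callables under transform-oriented names. Here is a minimal, hypothetical sketch of that pattern (the class bodies are stand-ins, not Datagrowth's actual implementation):

```python
from abc import ABC, abstractmethod

class Resource(ABC):
    """Minimal stand-in for a Resource with an abstract extract hook."""

    @abstractmethod
    def extract(self):
        ...

class ExtractProcessor:
    """Minimal stand-in that pulls data out of a Resource."""

    def extract_from_resource(self, resource):
        return resource.extract()

# ETL-style aliases: the same objects under transform-oriented names.
TransformProcessor = ExtractProcessor
TransformProcessor.transform_resource = ExtractProcessor.extract_from_resource

class JsonResource(Resource):
    def __init__(self, data):
        self.data = data

    def extract(self):
        return {"title": self.data.get("title")}

processor = TransformProcessor()
print(processor.transform_resource(JsonResource({"title": "ETL"})))  # {'title': 'ETL'}
```

Because the aliases point at the same objects, existing code that uses the extract-oriented names keeps working unchanged.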
Apart from this there are general software updates:
- Adds support for Python 3.13 and removes support for 3.8.
- Removes support for Django 3.2 and adds support for Django 5.2.
This version aims to replace all 0.19.x installations to make the maintenance of the project more manageable over the coming years.
Workshop version
A version that enables a playground for ideas originating from an AI workshop.
This update is the first Datagrowth version that includes the DatasetVersion model.
The implementation of that model can be a steep change compared to current implementations.
However it's not required to adopt Datagrowth's `DatasetVersion` to update to v0.20.
Instead you can run your own `DatasetVersion`, which should implement the `influence` method,
or set the `dataset_version` attribute to `None` for `Collection` and `Document`
if you don't want to use any `DatasetVersion`.
Other important changes:
- Minimal version for Celery is now 5.x.
- Minimal version for jsonschema is now 4.20.0, but jsonschema draft version remains 4.
- The `global_pipeline_app_label` and `global_pipeline_models` configurations have been renamed to `global_datatypes_app_label` and `global_datatype_models`.
- The `extractor`, `depends_on`, `to_property` and `apply_to_resource` configurations are now part of the `growth_processor` namespace.
- The `batch_size` setting is now part of the default global configuration namespace.
- The `async` configuration will no longer get patched to `asynchronous` to be compatible with Python >= 3.7. Instead supply `asynchronous` directly and replace all `async` occurrences.
- The `load_config` decorator no longer accepts default values. Use `register_defaults` instead.
- When using `ConfigurationType.supplement` default values are now ignored when determining if values exist.
- The `pipeline` attribute gets replaced by the `task_results` attribute for `Document`, `Collection` and `DatasetVersion`.
- When writing contributions to `Documents` the default field is now `derivatives`. Furthermore a key equal to the `growth_phase` is automatically added to the `derivatives` dictionary. The value for this key is always an empty dictionary. Any `to_property` configuration will write to this dictionary. Otherwise contributions get merged into the dictionary. It's still possible to write to `properties` without adding special `growth_phase` keys for backward compatibility.
- Contributions to `Documents` gathered through `ExtractProcessor.pass_resource_through` may consist of simple values. If `to_property` is set these values will be available under that property. Otherwise the simple values get added to a dictionary with one "value" key and this dictionary gets merged like normal.
- If a `ResourceGrowthProcessor` encounters multiple `Resources` per `Document`, or if a single `Resource` yields multiple results, then the `reduce_contributions` method will be called to determine how contribution data from `Resources` should complement `Document` data. The default is to only use the first result that comes from `Resources` in order to be backward compatible.
- The `Resource` class now exposes `validate_input` to override in child classes for input validation. This validation strategy will replace JSONSchema based validation for performance reasons in the future.
- Adds a `TestClientResource` that allows creating `Resources` that connect to Django views which return test data. Especially useful when testing Datagrowth components that take `HttpResources` as arguments.
- Importing `DataStorage` from `datagrowth.datatypes.documents.db.base` has to be replaced with importing from `datagrowth.datatypes.storage`.
- The `DataStorages` dataclass has been added to manage typing for dynamically loaded `DataStorage` models.
- The `DatasetVersion.task_definitions` field holds dictionaries per `DataStorage` model that specify which tasks should run for which model.
- The `DatasetVersion.errors` field has a `seeding` and `tasks` field where some basic error information is kept for debugging purposes.
- A `DatasetVersion` will influence its `Collections` and `Documents`. `Collections` may set `DatasetVersion` for `Documents` and facilitate `DatasetVersion` influence for them.
- Task definitions given to `DatasetVersion` propagate to `Collection` and `Document` through the influence method.
- The `Dataset.create_dataset_version` method will create a non-pending `DatasetVersion` with the default `GROWTH_STRATEGY` and `DatasetVersion.tasks` set. It also creates a default non-pending `Collection` with `Collection.tasks` set. Customize defaults by setting `DOCUMENT_TASKS`, `COLLECTION_TASKS`, `DATASET_VERSION_TASKS`, `COLLECTION_IDENTIFIER`, `COLLECTION_REFEREE` and `DATASET_VERSION_MODEL`. Or override `Dataset.get_collection_factories`, `Dataset.get_seeding_factories` and/or `Dataset.get_task_definitions` for more control.
- `Document.invalidate_task` will now always set the `pending_at` and `finished_at` attributes, regardless of whether tasks have run before.
- The `content` of a `Document` now contains output from `derivatives` through `Document.get_derivatives_content`.
- Calling `validate_pending_data_storages` may now update `DatasetVersion.is_current` and `DatasetVersion.errors`.
- Commands inheriting from `DatasetCommand` that expect `Community` compliant objects should set `cast_as_community` to `True` on the Command class and rename `handle_dataset` to `handle_community`.
- Unlike the legacy `Community` model a `Dataset` has a unique signature. If the signature of a `Dataset` matches an existing `Dataset` the `growth` method will create a new `DatasetVersion` instead of a different `Dataset`.
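The `derivatives` convention above can be illustrated with a small, hypothetical helper (this is not Datagrowth's actual code): contributions land in an initially empty dictionary keyed by `growth_phase`; a `to_property` configuration writes under that property, otherwise contributions are merged in.

```python
def add_contribution(document, growth_phase, contribution, to_property=None):
    # Ensure an (initially empty) dictionary exists for this growth phase.
    phase = document["derivatives"].setdefault(growth_phase, {})
    if to_property is not None:
        # A to_property configuration stores the value under that property.
        phase[to_property] = contribution
    else:
        # Otherwise dictionary contributions get merged into the phase.
        phase.update(contribution)
    return document

doc = {"properties": {"id": 1}, "derivatives": {}}
add_contribution(doc, "metadata", {"language": "en"})
add_contribution(doc, "download", "file.pdf", to_property="url")
print(doc["derivatives"])  # {'metadata': {'language': 'en'}, 'download': {'url': 'file.pdf'}}
```

The helper name, phase names and document shape are illustrative assumptions; only the keying by `growth_phase` and the `to_property` behaviour follow the description above.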
Workshop version (prerelease)
A first version that enables a playground for ideas originating from an AI workshop.
Resource iterators
This version allows for the use of Resource iterators, which enables applications to retrieve and process Resources using generators instead of loading everything into memory. To make optimal use of this feature Collection also exposes an iterative interface to add and update Documents.
- Adds support for Python 3.12.
- No longer specifies a parser for BeautifulSoup when loading XML content. BeautifulSoup warns against using Datagrowth's previous default parser (lxml) for XML parsing as it is less reliable.
- Allows `ExtractProcessor` to extract data using a generator function for the "@" objective. This can be useful to extract from nested data structures.
- Provides a `send_iterator` generator that initiates and sends an `HttpResource` as well as any subsequent `HttpResources`. This generator allows you to do something with in-between results when fetching the data.
- Provides a `send_serie_iterator` generator which acts like the `send_iterator` except it can perform multiple send calls.
- Provides a `content_iterator` generator that, given a `send_iterator` or `send_serie_iterator`, will extract the content from generated `HttpResources` using a given objective. This generator will also yield in-between results as extracted content.
- Adds `Collection.add_batches` and `Collection.update_batches` which are variants on `Collection.add` and `Collection.update` that will return generators instead of adding/updating everything in-memory.
- `Collection.update`, `Collection.add`, `Collection.update_batches` and `Collection.add_batches` will check for equality between `Documents` before adding or updating. This makes it possible to skip inserts/updates in particular cases by overriding `Document.__eq__`. `Collection.add` and `Collection.add_batches` require input as a list for this to work to prevent unexpected excessive memory usage.
- When using `Collection.add_batches` or `Collection.update_batches` a `NO_MODIFICATION` object can be passed as the `modified_at` parameter to prevent updating `Collection.modified_at` with these (repeating) calls.
- Uses `Collection.document_update_fields` to determine which fields to update in `bulk_update` calls by `Collection`.
- Adds `Document.build` to support creating a `Document` from raw data.
- `Document.update` will now use properties as update data instead of content when given another `Document` as data argument.
- Deprecates `Collection.init_document` in favour of `Collection.build_document` for consistency in naming.
- `Document.output_from_content` will now return lists instead of mapping generators when given multiple arguments. The convenience of lists is more important here than the memory footprint, which will be minimal anyway.
- Makes `Document.output_from_content` pass along content if values are not a JSON path.
- Allows `Document.output_from_content` to use different starting characters for replacement JSON paths.
- `ConfigurationField.contribute_to_class` will first call `TextField.contribute_to_class` before setting `ConfigurationProperty` upon the class.
- Removes the validate parameter from `Collection.add`, `Collection.update` and `Document.update`.
- Moved the `load_session` decorator into `datagrowth.resources.http`.
- Moved the `get_resource_link` function into `datagrowth.resources.http`.
- Sets the default batch size to a smaller 100 elements per batch and `Collection.update` now respects this default.
- Removes implicit Indico and Wizenoze API key loading.
- Corrects log names to "datagrowth" instead of "datascope".
- Adds a `copy_dataset` command that will copy a dataset by signature.
- The `async` configuration has been removed from the settings file.
- A `resource_exception_log_level` setting now controls at what level `DGResourceExceptions` will get logged.
- Additionally `resource_exception_reraise` now controls whether `DGResourceExceptions` get reraised.
- The fallback for `JSONField` imports from `django.contrib.postgres.fields` has been removed.
- Adds the `global_allow_redirects` configuration which controls how the requests library will handle redirects. Defaults to `True` even for "head" requests.
- Exposes `ProcessorFactory` and `DataStorageFactory` to easily build processors and datatypes in the future.
- Adds the `Collection.reload_document_ids` method to be able to load `Document.id` after `bulk_create`.
- For consistent `Resource` serialization adds `serialize_resources` and `update_serialized_resources`.
- Experimental support for `ResourceFixturesMixin` that can be used to load resource content through fixture files.
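The generator-based batching idea behind `Collection.add_batches` can be sketched in plain Python (illustrative only, not Datagrowth's implementation): input is consumed batch by batch instead of being materialized in memory at once, and each batch is yielded so callers can act on in-between results.

```python
from itertools import islice

def ibatch(iterable, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch

def add_batches(collection, documents, batch_size=100):
    """Add documents batch by batch, yielding each batch as it is stored."""
    for batch in ibatch(documents, batch_size):
        collection.extend(batch)  # stands in for a bulk_create call
        yield batch  # callers can act on in-between results

collection = []
sizes = [len(batch) for batch in add_batches(collection, range(250))]
print(sizes)  # [100, 100, 50]
```

The `collection` here is just a list standing in for a Django-backed `Collection`; the point is that nothing beyond one batch needs to live in memory at a time.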
Python 3.11 and Django 4.2
Updates the package to support Python 3.11 and Django 4.2.
Python 3.10
- Adds support for Python 3.10 and drops support for Python 3.6.
- Uses the html.parser instead of html5lib parser when parsing HTML pages.
- Fetches the last `Resource` when retrieving from cache to prevent `MultipleObjectsReturned` exceptions in async environments.
- Allows PUT as an `HttpResource` send method.
Django 3.2
Updates the package to support Django 3.2 features. It further supports Document and Collection models, which are now unit tested.
These are the breaking changes this release:
- It's recommended to update to Django 3.2 before using Datagrowth 0.17.
- Note that a Django migration is required to make Datagrowth 0.17 work.
- Drops support for Django 1.11.
- MySQL backends are no longer supported with Django versions below 3.2.
- Schemas on `Document` and `Collection` are removed as their usage is not recommended. Consider working schemaless when using these `DataStorage` derivative classes.
- As schemas are no longer available for `DataStorage` derivative classes, all write functionality from the default `DataStorage` API views is removed.
- `DataStorage` API URL patterns now require app labels as namespaces to prevent ambiguity.
- The API version can be specified using the `DATAGROWTH_API_VERSION` setting.
- `DataStorage.update` is reintroduced because of potential performance benefits.
- `Document.update` no longer takes first values from iterators given to it.
- `Collection.update` no longer accepts a single dict or `Document` for updating. It also works using lookups from `JSONField` instead of the inferior `reference` mechanic.
- `DataStorage.url` now provides a generic way to build URLs for `Collection` and `Document`. These URLs will expect URL patterns to exist with names in the format: `v<api-version>:<app-name>:<model-name>-content`. This replaces the old formats which were less flexible: `v1:<app-name>:collection-content` and `v1:<app-name>:document-content`.
- `HttpResource` will use `django.contrib.postgres.fields.JSONField` or `django.db.models.JSONField` for the `request` and `head` fields.
- `ShellResource` will use `django.contrib.postgres.fields.JSONField` or `django.db.models.JSONField` for the `command` field.
- The resources and datatypes modules now each have an admin module to import `AdminModels` easily.
- `ConfigurationProperty` now uses a simpler constructor and allows defaults for all arguments.
- Removes the unused `global_token` default configuration.
- Removes the unused `http_resource_batch_size` default configuration.
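The generic URL name format above can be composed with a trivial helper (illustrative only; `content_url_name` and the app name "sources" are hypothetical, as Datagrowth builds these names via `DataStorage.url`):

```python
def content_url_name(api_version, app_name, model_name):
    # Compose a name like v<api-version>:<app-name>:<model-name>-content,
    # suitable for looking up a namespaced Django URL pattern.
    return f"v{api_version}:{app_name}:{model_name}-content"

print(content_url_name(1, "sources", "document"))  # v1:sources:document-content
print(content_url_name(1, "sources", "collection"))  # v1:sources:collection-content
```

In a Django project such a name would typically be passed to `django.urls.reverse` together with the object's primary key.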
Python 3.8
A minor update that drops support for Python 3.5 and adds support for Python 3.8.
It also prepares some updates that are coming up in the near future.
Datagrowth (package)
The first release of the Datagrowth package to be installed in projects.
After copy pasting this code a few times across projects it was time for a package to make maintenance a lot easier.
This version contains fully functioning, tested and documented Resources & Configuration classes.
As well as some more experimental code that is to be released in full at a later date.
Below are the breaking changes that occur with this release:
- Renamed exceptions that are prefixed with DS to names prefixed with DG. This migrates Datascope exceptions to Datagrowth exceptions. Affected exceptions: `DSNoContent`, `DSHttpError403LimitExceeded`, `DSHttpError400NoToken`, `DSHttpWarning300` and `DSInvalidResource`.
- `batchize` used to be a function that returned batches and possibly a leftover batch. Now `ibatch` creates batches internally.
- `reach` no longer accepts paths not starting with `$`.
- Collection serializers do not include their content by default any more. Add it yourself by appending to `default_fields` or use the collection-content endpoint.
- A `google_cx` config value is no longer provided by default. It should come from the `GOOGLE_CX` setting in your settings file.
- The `register_config_defaults` alias is no longer available. Use `register_defaults` directly.
- The `MOCK_CONFIGURATION` alias is no longer available. Omit the configuration altogether and use `register_defaults`.
- Passing a default configuration to `load_config` is deprecated. Use `register_defaults` instead.
- `ExtractProcessor` now raises `DGNoContent`.
- `fetch_only` renamed to `cache_only`.
- Non-existing resources will now raise a `DGResourceDoesNotExist` if `cache_only` is True.
- The `meta` property is removed from `Resource`, use the `variables` method instead.
- All data hashes will be invalidated, because the hasher now sorts keys.
- `schema` is allowed to be empty on `DataStorage`, which means there will be no validation by default. This is recommended, but requires migrations for some projects.
- `_handle_errors` has been renamed to `handle_errors` and is an explicit candidate for overriding.
- `_update_from_response` has been renamed to `_update_from_results` for a more consistent Resource API.
Datagrowth (prerelease)
The first release of the Datagrowth package to be installed in projects.
After copy pasting this code a few times across projects it was time for a package to make maintenance a lot easier.
This version contains fully functioning, tested and documented Resources & Configuration classes.
As well as some more experimental code that is to be released in full at a later date.