ckanext-syndicate

CKAN plugin for dataset syndication between CKAN instances

This plugin provides a mechanism for syndicating datasets to another CKAN instance. If a dataset has the syndicate flag set to True in its custom metadata, any updates to that dataset will be reflected in the syndicated version.

Resources in the syndicated dataset are stored as URLs pointing to the resources in the original dataset. You must have the API key of a user on the target CKAN instance. See the Config Settings section below for details.

Other plugins can modify the data being syndicated or react to before/after syndication events by implementing the ISyndicate interface and subscribing to the corresponding signals. This is useful when schemas differ between CKAN instances.

Requirements

Python 3.10+

Syndicating over SSL requires pyOpenSSL.

Compatibility with core CKAN versions:

CKAN version    | Compatibility
2.9 and earlier | no
2.10            | yes
2.11            | yes

Installation

To install ckanext-syndicate:

  1. Activate your CKAN virtual environment, for example:
. /usr/lib/ckan/default/bin/activate
  2. Clone the source and install it in the virtualenv:
git clone https://github.com/DataShades/ckanext-syndicate.git
cd ckanext-syndicate
pip install -e .
  3. Add syndicate tables to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/ckan.ini).

  4. Apply database migrations:

ckan db upgrade
  5. Restart CKAN. For example, if you've deployed CKAN with Apache on Ubuntu:
sudo service apache2 reload

Config settings

Syndication creates and updates datasets on the remote portal. It is also possible to syndicate a dataset to multiple portals simultaneously. ckanext-syndicate makes no assumptions about how many syndication endpoints you have and performs each synchronization separately, as if you had configured the first endpoint, run syndication, switched the configuration to the next endpoint, and run syndication again.

Internally, the set of config options related to a particular endpoint is called a profile (ckanext.syndicate.types.Profile). Each profile has an ID, which is part of the config option name: ckanext.syndicate.profile.<PROFILE ID>.<OPTION>. If you want to syndicate a dataset to two different portals, first and another, the configuration may look like this:

ckanext.syndicate.profile.first.ckan_url = https://data.example.com
ckanext.syndicate.profile.another.ckan_url = https://another.example.com
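
To make the naming scheme concrete, here is a minimal Python sketch of how options with the ckanext.syndicate.profile.<PROFILE ID>.<OPTION> shape could be grouped by profile ID. It is an illustration only, not the extension's own parsing code: group_profile_options and the ckan.site_url value are hypothetical, and the sketch assumes profile IDs contain no dots.

```python
from collections import defaultdict

PROFILE_PREFIX = "ckanext.syndicate.profile."


def group_profile_options(config: dict[str, str]) -> dict[str, dict[str, str]]:
    """Group `ckanext.syndicate.profile.<ID>.<OPTION>` keys by profile ID."""
    profiles: dict[str, dict[str, str]] = defaultdict(dict)
    for key, value in config.items():
        if not key.startswith(PROFILE_PREFIX):
            continue  # unrelated option, e.g. ckan.site_url
        # Split the remainder into "<ID>" and "<OPTION>" at the first dot
        # (assumption: profile IDs themselves contain no dots).
        profile_id, _, option = key[len(PROFILE_PREFIX):].partition(".")
        if profile_id and option:
            profiles[profile_id][option] = value
    return dict(profiles)


config = {
    "ckan.site_url": "https://local.example.com",  # hypothetical local portal
    "ckanext.syndicate.profile.first.ckan_url": "https://data.example.com",
    "ckanext.syndicate.profile.another.ckan_url": "https://another.example.com",
}
profiles = group_profile_options(config)
```

With the two options from the example, profiles holds one entry per profile ID (first and another), each with its own ckan_url.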

Here is the full list of config options available for Profile. Don't forget to replace PROFILE_ID with any identifier you like.

Note: In the options below, PREFIX = ckanext.syndicate.profile.PROFILE_ID.

Option | Default | Example | Description
PREFIX.ckan_url | (required) | https://data.example.com | The URL of the target CKAN instance to which datasets will be syndicated.
PREFIX.api_key | (required) | 9efdd954-c643-444a-97a1-c9c374cef861 | The API key of the user on the target CKAN instance.
PREFIX.organization | None | test-org | The name of the organization on the target CKAN instance where syndicated datasets are created.
PREFIX.flag | syndicate | syndicate_to_hdx | The custom metadata flag used to mark datasets for syndication.
PREFIX.field_id | syndicated_id | hdx_id | The custom metadata field used to store the syndicated dataset ID on the original dataset.
PREFIX.name_prefix | '' | my-prefix | A prefix added to the name of the syndicated dataset.
PREFIX.replicate_organization | false | true | Whether to replicate the original dataset’s organization on the target CKAN instance.
PREFIX.update_organization | false | true | Whether to update the organization's metadata (extras are not updated) if the organization already exists.
PREFIX.refresh_package_name | false | true | Whether to refresh the dataset name on the remote portal.
PREFIX.author | None | ricardomm | The username whose API key is used. If a dataset already exists on the target CKAN, it will only be updated if its creator matches this username.
PREFIX.user_agent | None | My CKAN Syndicator/1.0 | Custom User-Agent string to use for HTTP requests to the target CKAN instance.
PREFIX.upload_organization_image | true | false | Whether to upload the organization image when replicating the organization.
PREFIX.queue | default | syndication | The name of the background jobs queue used for syndication tasks for this profile.
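
As a rough illustration of how the defaults and required options listed above fit together, the following Python sketch fills in missing options and rejects a profile that lacks ckan_url or api_key. The resolve_profile helper and the DEFAULTS mapping are hypothetical and only mirror the table; the extension's real ckanext.syndicate.types.Profile may behave differently.

```python
from typing import Any

REQUIRED_OPTIONS = ("ckan_url", "api_key")

# Defaults copied from the options table; this mapping is an illustration,
# not the extension's own data structure.
DEFAULTS: dict[str, Any] = {
    "organization": None,
    "flag": "syndicate",
    "field_id": "syndicated_id",
    "name_prefix": "",
    "replicate_organization": False,
    "update_organization": False,
    "refresh_package_name": False,
    "author": None,
    "user_agent": None,
    "upload_organization_image": True,
    "queue": "default",
}


def resolve_profile(options: dict[str, Any]) -> dict[str, Any]:
    """Return the options merged over the defaults, or raise if incomplete."""
    missing = [name for name in REQUIRED_OPTIONS if not options.get(name)]
    if missing:
        raise ValueError(f"missing required options: {', '.join(missing)}")
    # Explicitly configured options win over the defaults.
    return {**DEFAULTS, **options}


resolved = resolve_profile(
    {"ckan_url": "https://data.example.com", "api_key": "<API KEY>", "flag": "syndicate_to_hdx"}
)
```

Here resolved keeps the configured flag (syndicate_to_hdx) and falls back to the defaults for everything else, such as field_id and queue.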

In addition, the following config option controls the behavior of the syndication process in general:

Option | Default | Description
ckanext.syndicate.sync_on_changes | true | Whether to automatically syndicate datasets whenever they are created, updated, or deleted. Disable this option if syndication should be triggered manually.

Extending

Signals

Syndication can be configured for each individual portal. There are two types of customization: reactions to events and changes to workflow.

Reactions are useful when you need to perform a side effect right before or right after syndication. This can be achieved via blinker signals. ckanext-syndicate provides four signals that can be imported from ckanext.syndicate.signals (or subscribed to via ISignal, starting from CKAN v2.10):

  • before_syndication
  • after_syndication
  • before_group_syndication
  • after_group_syndication

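The before_syndication and after_syndication signals receive the local dataset's ID as the sender and an extra keyword argument named profile (the current syndication profile). A basic subscription looks like this:

from ckanext.syndicate.signals import after_syndication


@after_syndication.connect
def after_syndication_listener(package_id, **kwargs):
    # The sender is the local dataset's ID; the profile arrives as a keyword argument.
    profile = kwargs.get("profile")
    if profile:
        do_something(package_id, profile)  # placeholder for your own side effect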

Interface

Changes to the syndication workflow are made via the ckanext.syndicate.interfaces.ISyndicate interface. At the moment, it provides the following methods:

  • skip_syndication - decide whether syndication must be skipped for the given package and profile.
  • prepare_package_for_syndication - update the package before it is sent to the remote portal. This is especially useful if the portal you are syndicating to uses a different metadata schema.
  • prepare_group_for_syndication - update the group before it is sent to the remote portal.

Basic implementations look like this:

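from typing import Any

import ckan.plugins as plugins
from ckan import model

from ckanext.syndicate.interfaces import ISyndicate
from ckanext.syndicate.types import Profile


class MyPlugin(plugins.Plugin):
    plugins.implements(ISyndicate, inherit=True)

    def skip_syndication(self, package: model.Package, profile: Profile) -> bool:
        # should_be_syndicated is a placeholder for your own check.
        return not should_be_syndicated(package)

    def prepare_package_for_syndication(
        self, package_id: str, data_dict: dict[str, Any], profile: Profile
    ) -> dict[str, Any]:
        # Drop a field that must not reach the remote portal.
        data_dict.pop("sensitive_field", None)
        return data_dict

    def prepare_group_for_syndication(
        self, group_id: str, group: dict[str, Any], profile: Profile
    ) -> dict[str, Any]:
        group.pop("sensitive_field", None)
        return group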
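
When the remote portal uses a different metadata schema, the body of prepare_package_for_syndication often boils down to renaming and dropping fields. The sketch below shows that idea on plain dictionaries so it can run without CKAN. The function adapt_to_remote_schema, the field mapping, and the sample dataset are all hypothetical and chosen only for illustration.

```python
from typing import Any

# Hypothetical mapping from local field names to the remote portal's names.
FIELD_MAP = {"notes": "description", "maintainer": "contact_name"}
# Hypothetical fields that must never leave the local portal.
PRIVATE_FIELDS = {"sensitive_field"}


def adapt_to_remote_schema(data_dict: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of data_dict with private fields dropped and keys renamed."""
    adapted: dict[str, Any] = {}
    for key, value in data_dict.items():
        if key in PRIVATE_FIELDS:
            continue  # never send this field to the remote portal
        # Rename the key if the remote schema calls it something else.
        adapted[FIELD_MAP.get(key, key)] = value
    return adapted


local = {
    "name": "air-quality",
    "notes": "Hourly readings",
    "sensitive_field": "secret",
}
remote = adapt_to_remote_schema(local)
```

A plugin could call such a helper from prepare_package_for_syndication and return its result. Building a new dictionary instead of mutating the input keeps the local dataset untouched.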

The default implementation of skip_syndication prevents syndication of:

  • private datasets
  • datasets with a falsy value in the field specified by the ckanext.syndicate.profile.PROFILE_ID.flag config option (syndicate by default)
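
The two rules above can be sketched on a plain dictionary as follows. This is not the extension's actual implementation, which receives a model.Package. The should_skip helper is hypothetical, and it assumes the flag is stored as a real boolean, so a string value such as "false" would need to be converted first.

```python
from typing import Any


def should_skip(package: dict[str, Any], flag: str = "syndicate") -> bool:
    """Return True when the dataset should not be syndicated."""
    if package.get("private"):
        return True  # private datasets are never syndicated
    # A missing or falsy flag also prevents syndication.
    return not package.get(flag)


public_flagged = {"name": "air-quality", "private": False, "syndicate": True}
private_flagged = {"name": "internal", "private": True, "syndicate": True}
public_unflagged = {"name": "draft", "private": False}
```

Of the three sample datasets, only public_flagged would be syndicated under these rules.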

CLI commands

Mass or individual syndication can also be triggered from the command line:

ckan syndicate sync [ID]

To list the syndication profiles that would be applied to the given datasets:

ckan syndicate check [ID]

Synchronization of an individual profile can also be triggered from the command line:

ckan syndicate sync-profile [PROFILE_ID]
ckan syndicate sync-profile [PROFILE_ID] -f # foreground

Tests

Install the development requirements from dev-requirements.txt:

pip install -r dev-requirements.txt

To run the tests, do:

pytest --ckan-ini=test.ini

License

AGPL
