Usage@Reingest
The reingest process consists of multiple tasks: exporting data, grouping it into batches, starting the ingests, and so on. ARCLib comes with an automated reingest feature which reingests the latest versions of all packages in the archive. If finer control or a partial reingest is needed, combine lower-level features such as search, export and ingest routines into a custom reingest process.
A Workflow definition used for reingest must meet the following requirements:
- Should not contain Duplicate SIP check task
- The SIP Profile used must lead ARCLib to find the same Authorial ID as the one found by the SIP Profile used for the previous ingest
- See Path to XML file with authorial ID and XPath to node with authorial ID at Usage@Sip profiles
- The admin places the packages into the transfer area and activates a standard ingest routine.
- If the data must be exported from Archival Storage, here are the export options:
- Export through ARCLib using Export routines
- Export using Archival Storage HTTP API (backup endpoint)
- Export by accessing the logical storage directly
- In all cases the format of exported data is not the same as format required for import
- There is no need for ARCLib XML in transfer area
- There must be .sums file in transfer area
Keep in mind that the SIP Profile and Validation Profile used to produce the original package might not be the ones referenced from the Producer Profile of the original package, because a user can choose a different profile when solving an ingest incident.
Initiating
The Initiate new reingest button is located in the Reingest UI section. Only one reingest can run at a time.
During the initiating phase the database is scanned and a reingest record is created for every package to be reingested. Packages are grouped by the Producer, Workflow definition and JSON config (including SIP and Validation Profile entries) used during the ingest of the original package. For every <Producer,WorkflowDefinition> pair a special reingest Producer Profile is created together with a special ingest routine, with a name like reingest_timestamp_producerId_workflowDefinitionId. These entities are accessible in the standard UI sections but are not editable.
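The grouping described above can be sketched roughly as follows. This is only an illustration, not ARCLib code: the `OriginalIngest` record, the `group_for_reingest` function and the timestamp format are hypothetical; only the grouping keys and the naming pattern come from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class OriginalIngest(NamedTuple):
    """Hypothetical record of how a package was originally ingested."""
    package_id: str
    producer_id: str
    workflow_definition_id: str
    json_config: str


def routine_name(timestamp: str, producer_id: str, workflow_definition_id: str) -> str:
    # Naming pattern taken from the text; the timestamp format is an assumption.
    return f"reingest_{timestamp}_{producer_id}_{workflow_definition_id}"


def group_for_reingest(
    ingests: Iterable[OriginalIngest], timestamp: str
) -> Dict[str, Dict[str, List[str]]]:
    """Group package IDs by reingest routine name, then by JSON config."""
    groups: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for ing in ingests:
        name = routine_name(timestamp, ing.producer_id, ing.workflow_definition_id)
        groups[name][ing.json_config].append(ing.package_id)
    # Convert nested defaultdicts to plain dicts for predictable behaviour.
    return {name: dict(by_config) for name, by_config in groups.items()}
```

Each outer key corresponds to one reingest Producer Profile and ingest routine, and each inner key to one JSON config used for that pair.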
Unlike with standard ingest routines, the folder of the routine itself does not contain packages. It contains other folders, one for every unique JSON configuration used for ingests of the <Producer,WorkflowDefinition> pair. The folder name is the MD5 sum of the JSON config. During the reingest init phase a JSON config text file is stored in each of these folders. In the next phase the folders are filled with packages for reingest.
The init phase ends with a switch to the JOB_STOPPED state, in which the reingest waits for user interaction. The user can check and modify data in the DB or on the filesystem if needed and then Continue with creation of the batches.
An example of a modification on the filesystem is changing the SIP Profile ID linked from the JSON config, if the originally used SIP Profile would generate ARCLib XML which is no longer valid in the newer ARCLib version. Another option for the admin is to rename config.json to config.json.skip to completely skip the reingest of all packages belonging to that particular folder. Edits like these may be made throughout the whole process, so if some batches have already started with a bad config, the admin can for example simply skip the rest and cancel the incidents of those which have already started.
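A minimal sketch of this folder layout, assuming the MD5 sum is computed over the UTF-8 bytes of the JSON config text exactly as stored (the exact bytes ARCLib hashes are not specified here). The function names are hypothetical; the file name config.json is the one used on this page.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def config_folder_name(json_config: str) -> str:
    """MD5 hex digest of the config text (UTF-8 encoding is an assumption)."""
    return hashlib.md5(json_config.encode("utf-8")).hexdigest()


def init_routine_folder(routine_dir: Path, json_configs: Iterable[str]) -> List[Path]:
    """Create one subfolder per unique JSON config and store the config in it."""
    created = []
    for cfg in sorted(set(json_configs)):
        folder = routine_dir / config_folder_name(cfg)
        folder.mkdir(parents=True, exist_ok=True)
        (folder / "config.json").write_text(cfg, encoding="utf-8")
        created.append(folder)
    return created
```

Identical configs map to the same folder, so packages originally ingested with the same configuration end up together.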
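The skip mechanism is a plain file rename, so it can also be scripted. The helpers below are a hypothetical sketch; only the file names config.json and config.json.skip come from the text.

```python
from pathlib import Path


def skip_config_folder(folder: Path) -> bool:
    """Rename config.json to config.json.skip; return False if nothing to rename."""
    config = folder / "config.json"
    if not config.exists():
        return False
    config.rename(folder / "config.json.skip")
    return True


def is_skipped(folder: Path) -> bool:
    """True if the folder has been marked to be skipped."""
    return (folder / "config.json.skip").exists()
```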
Export jobs and batching
In the period defined in the exportCron section of application.yml, the export job exports data from Archival Storage to the transfer area. The Exporter tries to export as many packages as possible with respect to the available size of the transfer area and workspace. There is also a configurable threshold in application.yml: the minimum amount of space the process must leave free. Once the disk is filled, the Exporter fires all ingest routines, which results in ingest batches being created via the standard process. Users can view and stop the reingest batches and solve their incidents the same way they do with standard ingest batches.
If some ingests fail, the data can be examined or repaired on the filesystem, and a user with the reingest privilege can manually fire the reingest routine again (to pick up data from the transfer area) in the ingest routine detail. This feature is available only for reingest routines, which are run by the reingest job or via this button instead of by a CRON.
The Exporter job runs in the background for the whole time, in the defined period, skipping the action if any of the reingest packages is still present in the transfer area. Once the transfer area is free of all packages, the export of the next batch of packages from Archival Storage begins.
The user can Stop creation of the batches in the Reingest UI section. This interrupts the exporter period but has no effect on the ingest batches or on a currently running exporter process. A stopped job can continue if the user requests it in the Reingest UI section.
Once all data are exported to the transfer area, the exporter job switches the Reingest state to JOB_FINISHED.
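How the space limit and the threshold might interact can be sketched as follows. This is a simplified assumption, not the actual Exporter logic: it treats the budget as the smaller of the free space in the transfer area and in the workspace, minus the threshold, and stops at the first package that does not fit. All names are hypothetical.

```python
import shutil
from typing import Iterable, List, Tuple


def free_bytes(path: str) -> int:
    """Free space on the filesystem holding the given path."""
    return shutil.disk_usage(path).free


def export_budget(free: int, min_free: int) -> int:
    """Bytes that may be used while still leaving min_free bytes untouched."""
    return max(0, free - min_free)


def select_packages(
    package_sizes: Iterable[Tuple[str, int]],
    free_transfer: int,
    free_workspace: int,
    min_free: int,
) -> List[str]:
    """Pick packages in order until the next one would exceed the budget."""
    budget = min(
        export_budget(free_transfer, min_free),
        export_budget(free_workspace, min_free),
    )
    selected, used = [], 0
    for package_id, size in package_sizes:
        if used + size > budget:
            break  # assumption: stop at the first package that does not fit
        selected.append(package_id)
        used += size
    return selected
```

In a real run, `free_transfer` and `free_workspace` could be obtained with `free_bytes` on the respective directories.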
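The periodic behaviour of the exporter described above can be summarised in a small sketch of one scheduled run. It is illustrative only: `exporter_tick`, its parameters and the returned labels other than JOB_FINISHED are hypothetical, and `export_batch` stands in for whatever actually moves packages into the transfer area.

```python
from typing import Callable, List


def exporter_tick(
    stopped: bool,
    in_transfer_area: List[str],
    pending: List[str],
    export_batch: Callable[[List[str]], List[str]],
) -> str:
    """One scheduled run of the exporter; returns a label describing the outcome."""
    if stopped:
        # The user stopped batch creation in the Reingest UI section.
        return "skipped: stopped by user"
    if in_transfer_area:
        # Packages from a previous export are still being ingested.
        return "skipped: transfer area not empty"
    exported = export_batch(pending)
    for package_id in exported:
        pending.remove(package_id)
    # Everything has been exported: the reingest job is finished.
    return "JOB_FINISHED" if not pending else "exported"
```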
Terminating reingest
The user can terminate the reingest at any time. Termination results in the deletion of all reingest producer profiles, ingest routines and data from the workspace.
After a successful reingest (JOB_FINISHED state plus successful states of the Ingest Batches) the user should also terminate the reingest to clean up the system and unlock the locked features. Before doing so, hit the Count transfer area packages button to make sure there are no running or failed packages waiting in the transfer area, as they would disappear during termination.
Notes
- Standard ingests are allowed to run in parallel with reingest, but
- versioning is not allowed - it fails
- make sure the transfer area or workspace will not overflow
- Deletions and bulk deletions are not allowed during reingest