# Register Sources BODS

Register Sources BODS is a shared library for the OpenOwnership Register project. It is designed for use with any data source in the Beneficial Ownership Data Standard (BODS) format.
The primary purposes of this library are:
- Providing typed objects for the JSON Lines data. It uses the dry-types and dry-struct gems to specify the different object types allowed in the data returned.
- Persisting the BODS records using Elasticsearch. This functionality includes creating a mapping for indexing the possible fields observed as well as functions for storage and retrieval.
- Publishing BODS statements to a designated Kinesis stream.
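As a rough illustration of the mapping idea in the second point, an Elasticsearch index mapping for BODS statements might look like the following. The field names and types here are assumptions for illustration only, not the library's actual mapping:

```ruby
# Illustrative Elasticsearch index mapping for BODS statements.
# Field names and types are assumptions; the library defines its own mapping.
BODS_MAPPING = {
  mappings: {
    properties: {
      statementID:   { type: 'keyword' },
      statementType: { type: 'keyword' },
      statementDate: { type: 'date' },
      name:          { type: 'text' }
    }
  }
}.freeze

puts BODS_MAPPING[:mappings][:properties].keys.inspect
```

A hash like this would be passed when creating the index, so that fields such as statementID can be filtered exactly (keyword) while names remain full-text searchable (text).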
This library does not include transformation of other data standards into BODS format; that is the purpose of the Register Transformers.
The data standard is BODS 0.2.
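Since BODS data arrives as JSON Lines (one statement object per line), reading a record looks roughly like this. The record below is made up, and the real library wraps each parsed hash in a typed dry-struct object rather than using plain hashes:

```ruby
require 'json'

# One made-up BODS 0.2 entity statement in JSON Lines form.
line = '{"statementID":"abc-123","statementType":"entityStatement","name":"Example Ltd"}'

record = JSON.parse(line)

# The library would validate and coerce this into a typed object;
# here we just check the discriminator field by hand.
raise 'unexpected statement type' unless record['statementType'] == 'entityStatement'

puts record['name']
```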
## Installation

Install and boot Register.
Configure your environment using the example file:
```
cp .env.example .env
```

Run the tests:
```
docker compose run sources-bods test
```

## Usage

To ingest a local `xx.jsonl` file into the `raw-xx` index, optionally publishing to the `xx-dev` Kinesis stream:
```
docker compose run sources-bods ingest-local data/imports/xx.jsonl raw-xx
docker compose run sources-bods ingest-local data/imports/xx.jsonl raw-xx xx-dev
```

To transform a local `xx.jsonl` file from the `raw-xx` index into the `bods_v2_xx_dev1` index, optionally publishing to the `bods-xx-dev` Kinesis stream:
```
docker compose run sources-bods transform-local data/imports/xx.jsonl raw-xx bods_v2_xx_dev1
docker compose run sources-bods transform-local data/imports/xx.jsonl raw-xx bods_v2_xx_dev1 bods-xx-dev
```

Optionally, a `0` can be appended to the command to disable resolving via OpenCorporates. If resolving needs to be disabled but publishing to a Kinesis stream is not required, `'' 0` can be used as the final two arguments.
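For example, combining the options described above, a local transform with OpenCorporates resolving disabled and no Kinesis publishing would look like this (file and index names as in the examples above):

```shell
# '' keeps the Kinesis stream argument empty; the trailing 0 disables
# resolving via OpenCorporates.
docker compose run sources-bods transform-local data/imports/xx.jsonl raw-xx bods_v2_xx_dev1 '' 0
```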
To bulk ingest the `raw/xx/` S3 prefix into the `raw-xx` index, optionally publishing to the `xx-dev` Kinesis stream:
```
docker compose run sources-bods ingest-bulk raw/xx/ raw-xx
docker compose run sources-bods ingest-bulk raw/xx/ raw-xx xx-dev
```

To bulk transform the `raw/xx/` S3 prefix from the `raw-xx` index into the `bods_v2_xx_dev1` index, optionally publishing to the `bods-xx-dev` Kinesis stream:
```
docker compose run sources-bods transform-bulk raw/xx/ raw-xx bods_v2_xx_dev1
docker compose run sources-bods transform-bulk raw/xx/ raw-xx bods_v2_xx_dev1 bods-xx-dev
```

Optionally, a `0` can be appended to the command to disable resolving via OpenCorporates. If resolving needs to be disabled but publishing to a Kinesis stream is not required, `'' 0` can be used as the final two arguments.
## Monthly bulk data

To perform the monthly bulk data tasks, it is necessary to import the latest raw data, process it into BODS statements, and export the statements to compressed files available for download internally and from the Register website. These tasks span multiple repositories and commands.
All of these commands should be run on the Register server in EC2 (bods-register).
Ingester OC, Ingester PSC, Ingester DK, and Ingester SK steps can be done in any order, or in parallel.
### Ingester OC

https://github.com/openownership/register-ingester-oc?tab=readme-ov-file#helper-script
Check out the latest code and build via Docker:

```
cd ~/register-ingester-oc/
git checkout main
git pull
docker compose build
```

Ingest the bulk data, where YYYY-MM-DD is the date the OpenCorporates FTP files were published:
```
docker compose run ingester-oc ingest-bulk YYYY-MM-DD
```

This will ask you for the FTP password, 3 times.
### Ingester PSC

Note that there is also a streaming ingester service running on Heroku (register-ingester-psc-prd). It might not be necessary to complete the rest of this step if that process is all working correctly without missed data (not currently the case).
Check out the latest code and build via Docker:

```
cd ~/register-ingester-psc/
git checkout main
git pull
docker compose build
```

Ingest the bulk data:
```
docker compose run ingester-psc ingest-bulk
```

### Ingester DK

https://github.com/openownership/register-ingester-dk?tab=readme-ov-file#usage
Check out the latest code and build via Docker:

```
cd ~/register-ingester-dk/
git checkout master
git pull
docker compose build
```

Ingest the bulk data:
```
docker compose run ingester-dk ingest-bulk
```

### Ingester SK

https://github.com/openownership/register-ingester-sk?tab=readme-ov-file#usage
Check out the latest code and build via Docker:

```
cd ~/register-ingester-sk/
git checkout main
git pull
docker compose build
```

Ingest the bulk data:
```
docker compose run ingester-sk ingest-bulk
```

Transformer PSC, Transformer DK, and Transformer SK steps can be done in any order, or in parallel, once their dependencies are satisfied.
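The transform-bulk commands below all read from a `raw_data` prefix partitioned by source, year, and month. A hypothetical helper showing how such a prefix is assembled (the zero-padding of MM is an assumption based on the examples):

```ruby
require 'date'

# Hypothetical helper: build the partitioned S3 prefix used by the
# transform-bulk commands (raw_data/source=XX/year=YYYY/month=MM/).
def raw_data_prefix(source, date)
  format('raw_data/source=%s/year=%04d/month=%02d/', source, date.year, date.month)
end

puts raw_data_prefix('PSC', Date.new(2024, 3))
# → raw_data/source=PSC/year=2024/month=03/
```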
### Transformer PSC

The Transformer PSC step depends on the Ingester OC and Ingester PSC steps.

https://github.com/openownership/register-transformer-psc?tab=readme-ov-file#bulk-data
Note that there is also a streaming transformer service running on Heroku (register-transformer-psc-prd). It might not be necessary to complete the rest of this step if that process is all working correctly and no additional bulk data had to be imported.
Check out the latest code and build via Docker:

```
cd ~/register-transformer-psc/
git checkout main
git pull
docker compose build
```

Transform the bulk data, where YYYY and MM are the current year and month to be transformed:
```
docker compose run transformer-psc transform-bulk raw_data/source=PSC/year=YYYY/month=MM/
```

### Transformer DK

The Transformer DK step depends on the Ingester OC and Ingester DK steps.
https://github.com/openownership/register-transformer-dk?tab=readme-ov-file#usage
Check out the latest code and build via Docker:

```
cd ~/register-transformer-dk/
git checkout master
git pull
docker compose build
```

Transform the bulk data, where YYYY and MM are the current year and month to be transformed:
```
docker compose run transformer-dk transform-bulk raw_data/source=DK/year=YYYY/month=MM/
```

### Transformer SK

The Transformer SK step depends on the Ingester OC and Ingester SK steps.
https://github.com/openownership/register-transformer-sk?tab=readme-ov-file#usage
Check out the latest code and build via Docker:

```
cd ~/register-transformer-sk/
git checkout main
git pull
docker compose build
```

Transform the bulk data, where YYYY and MM are the current year and month to be transformed:
```
docker compose run transformer-sk transform-bulk raw_data/source=SK/year=YYYY/month=MM/
```

### Combiner

The Download S3 files step and all subsequent Combiner steps depend on the Transformer steps being completed.
https://github.com/openownership/register-sources-bods
openownership/register#265 (comment)
Check out the latest code and build via Docker:

```
cd ~/register-sources-bods/
git checkout main
git pull
docker compose build
```

Download the files:
```
sync-clones
```

Combine the files for each source, then combine all of them:

```
docker compose run sources-bods combine data/imports/source=PSC/ data/exports/prd/ psc
docker compose run sources-bods combine data/imports/source=DK/ data/exports/prd/ dk
docker compose run sources-bods combine data/imports/source=SK/ data/exports/prd/ sk
docker compose run sources-bods combine-all data/exports/prd/
```

Upload the files:
```
sync-exports-tx
```

Check that the All compressed file appears on the Register website automatically:
https://register.openownership.org/download
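The combine steps above gather per-source exports into single downloadable files. A rough sketch of that idea follows; this is an assumption about the behaviour, not the library's implementation, and the output filename is made up. It simply concatenates JSON Lines files from an imports directory into one gzipped export:

```ruby
require 'zlib'
require 'fileutils'

# Sketch: concatenate every .jsonl file under imports_dir into one gzipped
# export, one statement per line (assumed combine behaviour; the actual
# library's output layout and filenames may differ).
def combine(imports_dir, exports_dir, name)
  FileUtils.mkdir_p(exports_dir)
  out_path = File.join(exports_dir, "statements.#{name}.jsonl.gz")
  Zlib::GzipWriter.open(out_path) do |gz|
    Dir.glob(File.join(imports_dir, '**', '*.jsonl')).sort.each do |path|
      File.foreach(path) { |line| gz.write(line) }
    end
  end
  out_path
end
```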
Announce the availability of the bulk data exports internally on Slack in the #oo-technology channel.