CHAI is an attempt at an open-source data pipeline for package managers. The goal is to have a pipeline that can use the data from any package manager and provide a normalized data source for a wide range of use cases.
Use Docker
- Run docker compose build to create the latest Docker images.
- Then, run docker compose up to launch.
Note
This will run CHAI for all package managers. For example, crates alone takes over an hour and consumes more than 5 GB of storage.
Currently, we support only two package managers:
- crates
- Homebrew
You can run a single package manager by running
docker compose up -e ... <package_manager>
We plan to support npm, PyPI, and RubyGems next.
Specify these arguments, e.g. docker compose -e FOO=bar up:
- FREQUENCY: Sets how often (in hours) the pipeline should run.
- TEST: Runs the loader in test mode when set to true, skipping certain data insertions.
- FETCH: Determines whether to fetch new data from the source when set to true.
- NO_CACHE: When set to true, deletes temporary files after processing.
Note
The NO_CACHE flag does not prevent files from being downloaded to your local
storage; it only means that the files are deleted once processing is done.
These arguments are all configurable in the docker-compose.yml file.
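Because the arguments above are plain strings, a typo such as TEST=yes can go unnoticed. The following is a minimal sketch of a pre-flight check, not part of CHAI itself: the function name chai_check_args is hypothetical, and the rule that all four values must be set explicitly is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of CHAI): sanity-check the documented
# arguments before handing them to Docker Compose.
chai_check_args() {
    # FREQUENCY is documented as a number of hours; accept digits only.
    case "${FREQUENCY:-}" in
        ''|*[!0-9]*)
            echo "FREQUENCY must be a whole number of hours" >&2
            return 1 ;;
    esac
    # TEST, FETCH and NO_CACHE are documented as true/false switches.
    for name in TEST FETCH NO_CACHE; do
        eval "value=\${$name:-}"
        case "$value" in
            true|false) ;;
            *)
                echo "$name must be 'true' or 'false'" >&2
                return 1 ;;
        esac
    done
    # Print the resolved settings so they can be reviewed.
    echo "FREQUENCY=$FREQUENCY TEST=$TEST FETCH=$FETCH NO_CACHE=$NO_CACHE"
}

# Example values, chosen only for illustration.
FREQUENCY=24 TEST=true FETCH=true NO_CACHE=false
chai_check_args
```

The function returns a non-zero status on the first invalid value, so it can be chained in front of the launch command.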
- db: PostgreSQL database for the reduced package data
- alembic: handles migrations
- package_managers: fetches and writes data for each package manager
- api: a simple REST API for reading from the db
Stuff happens. Start over:
rm -rf ./data: removes all the data the fetcher has written.
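Since fetched data can run to several gigabytes, it may help to see how much space is being freed before deleting it. This is a small sketch under stated assumptions: chai_clean is a hypothetical name, and the default path ./data is taken from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CHAI): report the size of the fetched
# data, then remove it. Defaults to ./data, the path used above.
chai_clean() {
    data_dir=${1:-./data}
    if [ -d "$data_dir" ]; then
        # Show the total size so the user knows what is being freed.
        du -sh "$data_dir"
        rm -rf "$data_dir"
    else
        echo "nothing to remove at $data_dir"
    fi
}
```

Calling chai_clean with no argument acts on ./data; passing a path acts on that directory instead.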
Our goal is to build a data schema that looks like this:
You can read more about specific data models in the db README.
Our specific application extracts the dependency graph to understand which pieces of the open-source graph are critical. We also built a simple example that displays SBOM metadata for your repository.
There are many other potential use cases for this data:
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)
Tip
Help us add the above to the examples folder.
- The database URL is postgresql://postgres:s3cr3t@localhost:5435/chai, and is used as CHAI_DATABASE_URL in the environment. psql "$CHAI_DATABASE_URL" will connect you to the database.
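To make the parts of the database URL explicit (note the non-default port 5435), here is a sketch that splits it using POSIX parameter expansion. The variable names db_user, db_host, db_port and db_name are illustrative, and the approach assumes the password contains no @, : or / characters.

```shell
#!/bin/sh
# The URL as given above; CHAI_DATABASE_URL is the documented variable name.
CHAI_DATABASE_URL="postgresql://postgres:s3cr3t@localhost:5435/chai"

rest=${CHAI_DATABASE_URL#*://}   # strip the scheme
creds=${rest%%@*}                # user:password
hostpart=${rest#*@}              # host:port/database
hostport=${hostpart%%/*}         # host:port

db_user=${creds%%:*}
db_host=${hostport%%:*}
db_port=${hostport##*:}
db_name=${hostpart#*/}

echo "user=$db_user host=$db_host port=$db_port db=$db_name"
```

The individual parts can then be passed to tools that do not accept a full connection URL.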
These are tasks that can be run using xcfile.dev. If you use pkgx, typing dev loads the environment. Alternatively, run them manually.
rm -rf db/data data .venv

docker compose build

Requires: build
docker compose up -d

Env: TEST=true
Env: DEBUG=true
docker compose up

Requires: build
Env: TEST=true
Env: DEBUG=true
docker compose up

docker compose down

docker compose logs

Requires: stop
rm -rf db/data

Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"

Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic upgrade head

Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic downgrade -$STEP

psql "postgresql://postgres:s3cr3t@localhost:5435/chai"

psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"

psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"

Refreshes table knowledge from the db.
docker compose restart api

docker compose down --remove-orphans
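When running the migration tasks manually, the Env and Inputs entries have to be supplied by hand. The sketch below wraps two of them as shell functions. The function names chai_generate_migration and chai_downgrade are hypothetical, and it assumes the commands are run from the repository root with alembic installed.

```shell
#!/bin/sh
# Env entry shared by the migration tasks above.
export CHAI_DATABASE_URL="postgresql://postgres:s3cr3t@localhost:5435/chai"

# Hypothetical wrapper: MIGRATION_NAME is the task's input.
# The body runs in a subshell, so the caller's working directory is unchanged.
chai_generate_migration() (
    MIGRATION_NAME=${1:?usage: chai_generate_migration <name>}
    cd alembic || exit 1
    alembic revision --autogenerate -m "$MIGRATION_NAME"
)

# Hypothetical wrapper: STEP is the number of revisions to step back.
chai_downgrade() (
    STEP=${1:?usage: chai_downgrade <steps>}
    cd alembic || exit 1
    alembic downgrade "-$STEP"
)
```

For example, chai_downgrade 1 would step back one revision. Omitting the argument prints the usage message and returns a non-zero status without invoking alembic.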