This library uses PgDCP SQLa to create an enhnanced GitLab schema which includes utility views and repositories cache.
JustDenoPerl- PostgreSQL client
miller(for testing generated CSVs)
To check if your dependencies are properly installed, run:
just doctorgit clone https://github.com/netspective-studios/gitlab-enhanced.git
cd gitlab-enhancedThen:
- Copy
gitlab-canonical.env.exampletogitlab-canonical.envand fill out GitLab database information. Thegitlab-canonical.envfile is included in.gitignoreand will not be git-tracked. All environment variables ingitlab-canonical.envwill be available, automatically, to thejustdiscover-gitlab-project-repo-assetstarget. These two environment variables are important and they should not share the same value -- thejust context=production db-deploy-cleantarget will destroy any existing$SQLACTL_GITLAB_ENHANCE_SCHEMA_NAMEschema so be sure to separate the schemas as shown by default otherwise you could accidentally delete your production GitLab database.SQLACTL_GITLAB_CANONICAL_SCHEMA_NAME=publicSQLACTL_GITLAB_ENHANCE_SCHEMA_NAME=stateless_enhance_service_gitlab
- Copy
gitlab-project-repo-assets.env.exampletogitlab-project-repo-assets.envand fill out database information for the destination of thegitlab_project_repo_assetstable. Thegitlab-project-repo-assets.envfile is included in.gitignoreand will not be git-tracked. All environment variables ingitlab-project-repo-assets.envwill be available, automatically, to thejustpersist-gitlab-project-repo-assetstarget below.
If the gitlab_project_repo_assets table will be in the same database/schemas defined by gitlab-canonical.env then gitlab-project-repo-assets.env could just have the same database credentials. However, if the gitlab_project_repo_assets table will be in a different database then the contents of gitlab-canonical.env then gitlab-project-repo-assets.env should point to their respective databases.
To drop the enhanced schema and recreate basic SQL objects:
just context=production db-deploy-cleanTo idempotently create basic SQL objects without dropping the enhanced schema:
just context=production db-deployWhat's available after db-deploy in database pointed to by .env:
gitlab_qualified_namespacesview delivers Gitlabnamespaceswith level and hierarchical qualifications (e.g. path/path/... and Name::Name::...) instead of flat onlygitlab_qualified_projectsview delivers GitLab projects with namespace-qualified names and logical paths with hierarchygitlab_qualified_project_reposview delivers GitLab projects with namespace-qualified names, logical paths, and physical Gitaly repository paths with hierarchygitlab_qualified_project_repos_clone(gitlab_host_name)function delivers GitLab projects with namespace-qualified names, logical paths, and relative Gitaly repository paths on disk plus cloning paths on a given hostgitlab_qualified_project_repos_clone(gitlab_host_name, parent_namespace_id)function delivers GitLab projects under a specific namespace ID with namespace-qualified names, logical paths, and relative Gitaly repository paths on disk plus cloning paths on a given hostgitlab_qualified_project_repos_bare(gitlab_bare_repos_home_on_disk)function delivers GitLab projects with namespace-qualified names, logical paths, and absolute paths to Gitaly bare Git repositoriesgitlab_qualified_project_repos_bare(gitlab_bare_repos_home_on_disk, parent_namespace_id)function delivers GitLab projects under a specific namespace ID with namespace-qualified names, logical paths, and absolute paths to Gitaly bare Git repositories
The just db-deploy target deploys convenience PostgreSQL views which can then be used by these four just targets:
discover-gitlab-project-repo-assetspersist-gitlab-project-repo-assetsdiscover-gitlab-project-repo-assets-contentpersist-gitlab-project-repo-assets-content
There is a convenience target that will run all the above in sequence:
discover-persist-gitlab-project-repo-assets-content
just discover-persist-gitlab-project-repo-assets-content 8The above command will run all the following targets in sequence:
just discover-gitlab-project-repo-assets 8
just persist-gitlab-project-repo-assets
just discover-gitlab-project-repo-assets-content
just persist-gitlab-project-repo-assets-contentjust discover-gitlab-project-repo-assets 8uses PostgreSQL convenience views to generate a CSV file (gitlab-project-repo-assets.csv) of all the Gitaly bare Git repositories under GitLab Namespace ID '8' (any GitLab group ID may be passed in).- The generated CSV file contains the latest commit information (only meta-data, not content) for each branch of each GitLab project repo.
- On a 6-core i5 processor with direct access to the GitLab bare Git repos this take around 90 seconds for about 3,500 small project repos (assuming about 37,000 cumulative files included in the 3,500 or so Git repos).
just persist-gitlab-project-repo-assetsvalidatesgitlab-project-repo-assets.csvusingmillerand then inserts all rows ingitlab-project-repo-assets.csvinto thegitlab_project_repo_assetsPostgreSQL table using theCOPY FROMSQL command. This should take less than 30 seconds to complete if the database is on the same server as the CSV file.just discover-gitlab-project-repo-assets-contentuses rows in thegitlab_project_repo_assetsPostgreSQL table to generate a CSV file (gitlab-project-repo-assets-content.csv) which contains the name, size, and base64-encoded content of each unique Git object in all the files discovered throughjust discover-gitlab-project-repo-assets 8.- The generated CSV file contains the Git Object ID, file name, file size in bytes, and base64-encoded content for each unique Git object (uniqueness is determined by Git object ID, git file name, and git file size).
- On a 6-core i5 processor with direct access to the GitLab bare Git repos this take around 45 seconds for about 9,000 small-ish files.
just persist-gitlab-project-repo-assets-contentvalidatesgitlab-project-repo-assets-content.csvusingmillerand then inserts all rows ingitlab-project-repo-assets-content.csvinto thegitlab_project_repo_assets_contentPostgreSQL table using theCOPY FROMSQL command. This should take less than 10 seconds to complete if the database is on the same server as the CSV file.
discover-gitlab-project-repo-assets uses the gitlab_qualified_project_repos_bare(gitlab_bare_repos_home_on_disk, parent_namespace_id) function to find all Gitaly bare Git repositories under a specific GitLab namespace ID (group). Once it finds the bare repos, it uses xargs to run Git commands in parallel (using numproc processes) to create a CSV file with the following:
| Column | Type | Purpose |
|---|---|---|
discovered_at |
timestamptz | Timestamp of when the discovery of this row occurred |
gl_project_id |
integer | GitLab project ID acquired from [GitLab].projects.id table |
gl_project_repo_id |
integer | GitLab project repo ID acquired from [GitLab].project_repositories.id table |
gl_gitaly_bare_repo_host |
text | GitLab project Gitaly bare repo path host |
gl_gitaly_bare_repo_host_ip_addr |
text | GitLab project Gitaly bare repo path host |
gl_gitaly_bare_repo_path |
text | GitLab project Gitaly bare repo path |
git_branch |
text | Git branch acquired from bare Git repo using git for-each-ref command |
git_file_mode |
text | Git file mode acquired from bare Git repo using git ls-tree -r {branch} command |
git_asset_type |
text | Git asset type (e.g. blob) acquired from bare Git repo using git ls-tree -r {branch} command |
git_object_id |
text | Git object (e.g. blob) ID acquired from bare Git repo using git ls-tree -r {branch} command |
git_file_size_bytes |
integer | Git file size in bytes acquired from bare Git repo using git ls-tree -r {branch} command |
git_file_name |
text | Git file name acquired from bare Git repo using git ls-tree -r {branch} command |
git_commit_hash |
text | Git file commit hash acquired from bare Git repo using git log -1 {branch} {git_file_name} command |
git_author_date |
timestamptz | Git file author date acquired from bare Git repo using git log -1 {branch} {git_file_name} command |
git_commit_date |
timestamptz | Git file commit date acquired from bare Git repo using git log -1 {branch} {git_file_name} command (commit date is usually the same as author date unless the repo was manipulated) |
git_author_name |
text | Git file author name acquired from bare Git repo using git log -1 {branch} {git_file_name} command |
git_author_email |
text | Git file author e-mail address acquired from bare Git repo using git log -1 {branch} {git_file_name} command |
git_committer_name |
text | Git file committer name acquired from bare Git repo using git log -1 {branch} {git_file_name} command |
git_committer_email |
text | Git file committer e-mail address acquired from bare Git repo using git log -1 {branch} {git_file_name} command |
git_commit_subject |
text | Git file commit message subject acquired from bare Git repo using git log -1 {branch} {git_file_name} command |
persist-gitlab-project-repo-assets uses the CSV file created by discover-gitlab-project-repo-assets target and:
- drops the
gitlab_project_repo_assetstable - creates the
gitlab_project_repo_assetstable - imports the CSV file into the
gitlab_project_repo_assetstable
Since the repo trees are now just SQL, every file's meta data and latest commit status is a query away:
select *
from stateless_enhance_service_gitlab.gitlab_project_repo_assets
where git_file_name = 'offering-profile.lhc-form.json'
and git_branch = 'draft'just discover-gitlab-project-repo-assets-contentuses rows in thegitlab_project_repo_assetsPostgreSQL table to generate a CSV file (gitlab-project-repo-assets-content.csv) which contains the name, size, and base64-encoded content of each unique Git object in all the files discovered throughjust discover-gitlab-project-repo-assets.- Git objects IDs are considered "globally unique" so we can have the same file content used across Git repos stored only once.
- There might be some long-term issues with content conflicts that should be considered. Read SHA-1 collision detection on GitHub.com and 10.2 Git Internals - Git Objects.
| Column | Type | Purpose |
|---|---|---|
discovered_at |
timestamptz | Timestamp of when the discovery of this row occurred |
git_object_id |
text | Git object (e.g. blob) ID acquired from bare Git repo |
git_file_name |
text | Git file name acquired from bare Git repo |
git_file_size_bytes |
integer | Git file size in bytes acquired from bare Git repo |
git_file_content_base64 |
text | Git file commit content, in Base64 format, from bare Git repo using git show -r {git_object_id} |
just persist-gitlab-project-repo-assets-contentvalidatesgitlab-project-repo-assets-content.csvusingmillerand then inserts all rows ingitlab-project-repo-assets-content.csvinto thegitlab_project_repo_assets_contentPostgreSQL table using theCOPY FROMSQL command.
persist-gitlab-project-repo-assets-content uses the CSV file created by discover-gitlab-project-repo-assets-content target and:
- drops the
gitlab_project_repo_assets_contenttable - creates the
gitlab_project_repo_assets_contenttable - imports the CSV file into the
gitlab_project_repo_assets_contenttable
Since the repo contents are now just SQL, the content and meta data are now a query away:
select gl_project_id,
git_author_date,
prac.git_file_size_bytes,
convert_from(decode(git_file_content_base64, 'base64'), 'UTF8')::jsonb as lhc_form
from stateless_enhance_service_gitlab.gitlab_project_repo_assets pra,
stateless_enhance_service_gitlab.gitlab_project_repo_assets_content prac
where prac.git_file_name = 'offering-profile.lhc-form.json'
and git_branch = 'draft'
and pra.git_object_id = prac.git_object_id