✨ (New Backend) Offline Repository Mining: Integrating GitLab Archive Data into Binocular's Hexagonal Backend #415

@uberroot4

Description

Binocular is an open-source Mining Software Repositories (MSR) tool. It aggregates data from Version Control Systems (VCS), Issue Tracking Systems (ITS), and Continuous Integration (CI) pipelines to provide time-based analytical visualizations via a modern web dashboard.

The currently active development branch (feature/backend-new) introduces a complete architectural overhaul. The legacy Node.js backend (binocular-backend) is being replaced by a new Kotlin-based backend (binocular-backend-new) designed according to the Hexagonal Architecture (Ports & Adapters) pattern. This approach cleanly separates the domain logic from external integrations, making it straightforward to add new data sources by implementing new adapters against existing domain ports.
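In that pattern, the domain core owns the port interface and the GitLab archive integration becomes just another adapter plugged in behind it. A minimal Kotlin sketch of the split (names such as `IssueRepositoryPort` and `InMemoryIssueRepository` are illustrative, not actual `binocular-backend-new` identifiers):

```kotlin
// Illustrative Ports & Adapters split: the domain core defines the port and
// the domain types; an adapter implements the port without the core ever
// referencing adapter-specific (e.g. archive-specific) types.

data class Issue(val iid: Long, val title: String)

// Port: owned by the domain core; knows nothing about GitLab archives.
interface IssueRepositoryPort {
    fun save(issue: Issue)
    fun findByIid(iid: Long): Issue?
}

// Adapter: lives outside the core; here a trivial in-memory stand-in.
class InMemoryIssueRepository : IssueRepositoryPort {
    private val store = mutableMapOf<Long, Issue>()
    override fun save(issue: Issue) { store[issue.iid] = issue }
    override fun findByIid(iid: Long): Issue? = store[iid]
}
```

A new data source then only needs a new adapter class implementing the same port; the domain services calling `IssueRepositoryPort` stay untouched.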

Binocular's existing data acquisition model requires a live connection to a GitLab instance via its REST API. This creates a fundamental barrier for projects where the live instance is no longer accessible — e.g., decommissioned company servers, archived academic projects, or compliance-constrained environments. GitLab provides a native project export mechanism that produces a compressed archive containing structured NDJSON files for issues, merge requests, CI pipelines, milestones, members, labels, and related entities. This archive format is a rich, self-contained snapshot of a project's full history.

Exploiting this archive as a data source for Binocular would unlock analysis of offline, historical, or otherwise inaccessible repositories, significantly broadening the tool's applicability in research and industry contexts.
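Because the export format carries a declared version, ingestion can begin with a compatibility gate before any NDJSON is touched. A hedged sketch of such a guard; the version numbers and supported range are made up for illustration, not real GitLab values:

```kotlin
// Hypothetical archive-version guard: parse the declared format version and
// decide up front whether the importer has a mapping for it. The supported
// range below is invented for the example.

data class VersionCheck(val compatible: Boolean, val message: String)

fun checkArchiveVersion(raw: String): VersionCheck {
    val supported = 13..16  // illustrative range, not actual GitLab archive versions
    val major = raw.substringBefore('.').toIntOrNull()
        ?: return VersionCheck(false, "unparseable archive version '$raw'")
    return if (major in supported) {
        VersionCheck(true, "archive version $raw supported")
    } else {
        VersionCheck(false, "archive version $raw outside supported range $supported")
    }
}
```

Depending on policy, an incompatible result can either abort with a structured error or downgrade to a logged warning.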

Specific problems to solve include:

  • Schema variance: GitLab archive format versions differ across GitLab releases (version field in the archive). The integration must handle this gracefully.
  • Data volume and streaming: Archives can contain tens of thousands of issues, merge request notes, and pipeline jobs. Naive in-memory deserialization is not viable; NDJSON must be processed in a streaming fashion.
  • Entity relationship resolution: The NDJSON files are relationally structured (e.g., merge requests reference milestones, issues reference labels). Cross-entity resolution must occur during ingestion without live API calls.
  • Mapping to the Binocular domain model: The GitLab archive schema does not map 1:1 to Binocular's internal domain entities. A principled translation layer is required.
  • Idempotency and re-ingestion: Running the import twice must not produce duplicate data.
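The streaming concern above can be sketched as a lazy, line-by-line pipeline: each NDJSON line is one JSON document, so the file folds into domain objects one entry at a time with constant memory. The regex-based field extraction below stands in for a real incremental JSON parser (e.g. Jackson or kotlinx.serialization), and the field names are only assumed to follow the export schema:

```kotlin
import java.io.Reader

// Sketch of streaming NDJSON ingestion: lineSequence() is lazy, so entries
// are parsed and emitted one at a time instead of materializing the file.

data class RawIssue(val iid: Long, val title: String)

fun streamIssues(reader: Reader): Sequence<RawIssue> {
    // Naive field extraction standing in for a proper JSON parser.
    val iidField = Regex("\"iid\"\\s*:\\s*(\\d+)")
    val titleField = Regex("\"title\"\\s*:\\s*\"([^\"]*)\"")
    return reader.buffered().lineSequence()
        .filter { it.isNotBlank() }
        .mapNotNull { line ->
            val iid = iidField.find(line)?.groupValues?.get(1)?.toLongOrNull()
            val title = titleField.find(line)?.groupValues?.get(1)
            if (iid != null && title != null) RawIssue(iid, title) else null
        }
}
```

Consumers iterate the sequence (e.g. `streamIssues(file.reader()).forEach { persist(it) }`), so peak heap usage is bounded by one entry, not the archive size.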

Functional Requirements

| ID | Requirement |
| --- | --- |
| FR-01 | The system SHALL accept a path to a valid GitLab project export archive (`.tar.gz`) as input. |
| FR-02 | The system SHALL parse and ingest `issues.ndjson`, `merge_requests.ndjson`, `ci_pipelines.ndjson`, `labels.ndjson`, `milestones.ndjson`, and `project_members.ndjson`. |
| FR-03 | The system SHALL correctly resolve intra-archive references (e.g., issue-to-label, merge-request-to-milestone). |
| FR-04 | The system SHALL map all parsed entities to the Binocular domain model and persist them via the existing repository ports. |
| FR-05 | The system SHALL detect the archive format version and log a warning or abort with a structured error for incompatible versions. |
| FR-06 | The system SHALL be idempotent: re-ingesting the same archive SHALL NOT create duplicate domain entities. |
| FR-07 | The system SHALL report ingestion progress (entities processed, errors encountered) in a machine-readable format suitable for CLI output and structured logging. |
| FR-08 | The frontend SHALL provide at least one visualization that is meaningfully extended or newly created to surface data available only through archive ingestion. |
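FR-06 in particular suggests keying persistence on a stable identifier already carried by the archive (such as an issue `iid`), so that re-ingestion becomes an upsert rather than an insert. A stdlib-only sketch; the types and store are illustrative, not project code:

```kotlin
// Idempotent ingestion sketch: persistence keyed on a stable archive
// identifier, so importing the same entity twice overwrites instead of
// duplicating.

data class ArchivedIssue(val iid: Long, val title: String)

class IdempotentIssueStore {
    private val byIid = mutableMapOf<Long, ArchivedIssue>()

    /** Inserts or overwrites; returns true only if the entity was new. */
    fun upsert(issue: ArchivedIssue): Boolean =
        byIid.put(issue.iid, issue) == null

    val count: Int get() = byIid.size
}
```

With a real database behind the repository port, the same property would come from an upsert on a unique key rather than an in-memory map.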

Non-Functional Requirements

| ID | Requirement |
| --- | --- |
| NFR-01 | The adapter MUST conform to the hexagonal architecture: all external I/O (file system, parsing) MUST be behind port interfaces; the domain core MUST NOT reference archive-specific types. |
| NFR-02 | NDJSON files MUST be processed as streams; peak heap usage during ingestion of a 500 MB archive MUST NOT exceed a documented and justified threshold. |
| NFR-03 | The implementation MUST be written in idiomatic Kotlin, using coroutines for asynchronous/streaming operations where appropriate. |
| NFR-04 | All public API surfaces (ports, domain services, use cases) MUST be documented with KDoc. |
| NFR-05 | Code MUST be formatted using ktlint and pass static analysis with detekt at the project's configured rule set without suppressions. |

Engineering / Quality Requirements

| ID | Requirement |
| --- | --- |
| EQ-01 | Unit test coverage of all domain mapping logic and archive parsing components MUST reach ≥ 80% line coverage, measured via JaCoCo and enforced in CI. |
| EQ-02 | Integration tests MUST cover the full ingestion pipeline using a real (anonymized or synthetic) GitLab archive fixture, asserting correct entity counts and relationship integrity. |
| EQ-03 | Property-based tests (e.g., using kotest with its property testing module) MUST be used to validate mapping functions against edge cases (null fields, empty collections, boundary timestamps). |
| EQ-04 | Contract tests MUST verify that the `GitLabArchiveAdapter` fully satisfies the port interface contracts already defined in `binocular-backend-new`, using a shared test suite or abstract test class pattern. |
| EQ-05 | A CI pipeline (GitHub Actions or equivalent) MUST be configured to run the full test suite, linting, and coverage gating on every pull request. |
| EQ-06 | All test fixtures (sample archive fragments) MUST be version-controlled and regeneratable from a documented script. |
| EQ-07 | No production code MAY suppress exceptions silently; all error paths MUST be tested by at least one dedicated test case. |
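The property-based testing idea behind EQ-03 can be illustrated even without kotest: generate many randomized inputs, including the edge cases the requirement names, and assert invariants of a mapping function rather than checking single hand-picked examples. `normalizeLabels` below is a hypothetical mapping function, not project code; in the project itself this loop would be replaced by kotest's property module:

```kotlin
import kotlin.random.Random

// Hand-rolled property check standing in for a kotest property test:
// random inputs (including null and empty collections) are fed through a
// hypothetical mapping function, and invariants are asserted on every output.

fun normalizeLabels(labels: List<String>?): List<String> =
    labels.orEmpty()
        .map { it.trim().lowercase() }
        .filter { it.isNotEmpty() }
        .distinct()

fun checkNormalizeLabelsProperties(iterations: Int = 1000, seed: Long = 42L) {
    val rng = Random(seed)
    val pool = listOf("Bug", " bug ", "", "feature", "FEATURE", "  ")
    repeat(iterations) {
        val input = if (rng.nextInt(10) == 0) null
                    else List(rng.nextInt(6)) { pool.random(rng) }
        val out = normalizeLabels(input)
        // Invariants: trimmed lowercase non-empty entries, no duplicates,
        // and idempotence under re-normalization.
        check(out.all { it == it.trim().lowercase() && it.isNotEmpty() })
        check(out.size == out.distinct().size)
        check(normalizeLabels(out) == out)
    }
}
```

The invariant style (idempotence, duplicate-freedom) is exactly what transfers to kotest's `checkAll`/`forAll` once the real mapping functions exist.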
