11 changes: 11 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,11 @@
# To get started with Dependabot version updates, you'll need to specify which
# package ecosystems to update and where the package manifests are located.
# Please see the documentation for all configuration options:
# https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file

version: 2
updates:
  - package-ecosystem: "pip" # See documentation for possible values
    directory: "/" # Location of package manifests
    schedule:
      interval: "weekly"
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.pylintrc
.ruff_cache/
.vscode/
170 changes: 170 additions & 0 deletions ARCHITECTURE.md
@@ -0,0 +1,170 @@
# Repoview Architecture Documentation

## Application Overview and Objectives

**Repoview** is a static site generator designed to make YUM repositories easily browseable via a web browser.

The primary objective of Repoview is to transform the raw metadata of a YUM repository (packages, groups, changelogs) into a set of interlinked, user-friendly HTML pages. This allows users to explore the contents of a repository without needing to use command-line tools like `yum` or `dnf`.

Key features include:
- **Static Output**: Generates pure HTML/CSS/XML files, requiring no active server-side processing (like PHP or Python) on the hosting web server.
- **Incremental Generation**: Tracks the state of generated files to only regenerate pages for packages or groups that have changed, significantly speeding up updates for large repositories.
- **Templating Support**: Uses the Genshi templating engine, allowing for complete customization of the output look and feel.
- **RSS Feeds**: Optionally generates RSS feeds for the latest package updates.

## Architecture and Design Choices

Repoview is written in **Python 3** and follows a procedural workflow encapsulated within a main controller class.

### Execution Phases

The `Repoview` constructor performs the entire workflow in a deterministic set of phases (mirrored by inline comments in `repoview.py`):

1. **Repository Discovery** – validate the `repodata/repomd.xml`, locate compressed SQLite artifacts (`primary`, `other`, optional `group`), and open database handles. Compressed inputs (`.gz`, `.bz2`, `.xz`) are streamed into temporary files and scheduled for cleanup.
2. **Filesystem Preparation** – resolve the output directory (user-selectable via `-o/--output-dir`, always nested under the repo root), optionally wipe it when `--force` is set, copy the `layout/` assets from the template directory, and initialize the incremental `state.sqlite` database (optionally stored outside the repo via `--state-dir`, with repository-specific filenames hashed via MD5).
3. **Grouping & Package Rendering** – load groups from `comps.xml`, from RPM `Group` tags, or from synthesized letter buckets. For each group, build package summaries, render package pages (with change detection, avoiding duplicate renders through an in-memory cache), and then render the group page if any dependency changed.
4. **Aggregate Views** – compute the latest packages list, render `index.html`, and optionally generate `latest-feed.xml` using the RSS template and package data.
5. **State Finalization** – clean up stale files left from previous runs and commit the updated checksums to `state.sqlite` so subsequent invocations stay incremental.
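
The compressed-input handling in phase 1 can be sketched with the standard library alone. This is an illustration rather than the code in `repoview.py`: the helper name `decompress_to_temp`, its return shape, and the extension-to-opener mapping are assumptions made for the example.

```python
import bz2
import gzip
import lzma
import os
import shutil
import tempfile

# Map file extensions to the standard-library opener that can read them.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def decompress_to_temp(path):
    """Stream a compressed file into a temporary file.

    Hypothetical helper: returns (usable_path, is_temporary). Uncompressed
    inputs are returned unchanged so callers can treat both cases alike.
    """
    ext = os.path.splitext(path)[1]
    opener = _OPENERS.get(ext)
    if opener is None:
        return path, False
    with opener(path, "rb") as src, tempfile.NamedTemporaryFile(
        suffix=".sqlite", delete=False
    ) as dst:
        shutil.copyfileobj(src, dst)  # copy in chunks, never all in memory
        return dst.name, True
```

The caller is expected to remember every path returned with `is_temporary` set and delete it once the database handles are closed.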

### Core Components

1. **Data Ingestion (YUM Metadata)**:
   - Repoview does not parse RPM headers directly. Instead, it relies on the SQLite metadata databases (`primary.sqlite`, `other.sqlite`) generated by `createrepo`.
   - **Compression Handling**: It automatically detects and decompresses metadata databases (supporting `.gz`, `.bz2`, and `.xz` formats) into temporary files for processing.
   - It verifies the repository structure by parsing `repodata/repomd.xml`.
   - It connects to these SQLite databases to query package details, file lists, and changelogs.

2. **State Management (Incremental Builds)**:
   - To avoid rebuilding the entire site on every run, Repoview maintains a local SQLite database (`state.sqlite`).
   - **Checksumming**: For every generated page (package, group, index), a content-based checksum (MD5) is calculated.
   - **Change Detection**:
     - Before writing a file to disk, the calculated checksum is compared against the stored checksum in `state.sqlite`. If they match, the file write is skipped.
     - Package data is memoized per name (`self.written`) so packages that appear in multiple groups are rendered once but referenced many times.
   - **Stale File Cleanup**: The system tracks which files are visited during a run. Files present in the output directory but not visited are considered "stale" (e.g., deleted packages) and are removed.

3. **Templating Engine**:
   - **Genshi**: The application uses Genshi for rendering HTML.
   - **Structure**:
     - `index.kid`: The main entry page listing groups and latest packages.
     - `group.kid`: Displays lists of packages within a specific group.
     - `package.kid`: Detailed view of a single package.
     - `rss.kid`: XML template for the RSS feed.
   - **Layout**: A `layout` directory containing static assets (CSS, images) is copied to the output directory.

4. **Grouping Logic**:
   - **Comps.xml**: If available, Repoview uses the `comps.xml` file to organize packages into logical groups (e.g., "Development", "System Tools").
   - **RPM Groups**: As a fallback, it can group packages based on the `Group` tag in the RPM metadata.
   - **Alphabetical**: It automatically generates "Letter Groups" (Packages A, Packages B, etc.) for easier browsing. These groups share the same rendering pipeline and benefit from the package memoization cache.
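
The "Letter Groups" idea from the grouping logic above can be illustrated in a few lines. This is a sketch under assumptions: the helper name `letter_groups`, the bucketing by upper-cased first character, and the returned tuple shape are all hypothetical and may differ from what `repoview.py` does.

```python
from collections import defaultdict


def letter_groups(package_names):
    """Bucket package names by their upper-cased first character.

    Hypothetical helper: returns a list of (group_title, sorted_names) tuples.
    """
    buckets = defaultdict(list)
    for name in package_names:
        if name:  # skip empty names
            buckets[name[0].upper()].append(name)
    return [
        ("Packages " + letter, sorted(names))
        for letter, names in sorted(buckets.items())
    ]


groups = letter_groups(["bash", "zlib", "binutils", "Zope"])
```

Because every bucket is just another list of package names, it can be fed to the same rendering path as a comps or RPM group.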

### Data Flow

1. **Initialization**: Parse arguments, set up output directories, initialize state DB.
2. **Repo Connection**: Connect to `primary` and `other` SQLite databases.
3. **Group Processing**:
   - Iterate through each defined group.
   - For each package in the group, fetch details and changelogs.
   - Render package page -> Checksum -> Write if changed.
   - Render group page -> Checksum -> Write if changed.
4. **Index Generation**: Aggregate group lists and "latest modified" packages to render `index.html`.
5. **Finalization**: Commit state changes and delete stale files.
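
The "Render -> Checksum -> Write if changed" step can be sketched as below. The function names `mk_checksum` and `has_changed` and the `state (filename, checksum)` table are taken from the documentation in this change, but their signatures, the table schema details, and the use of `INSERT OR REPLACE` are assumptions made for the example.

```python
import hashlib
import sqlite3


def mk_checksum(content):
    """Return the hex MD5 digest of rendered page content (a str)."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def has_changed(conn, filename, checksum):
    """Return True and record the checksum if it differs from the stored one."""
    row = conn.execute(
        "SELECT checksum FROM state WHERE filename = ?", (filename,)
    ).fetchone()
    if row is not None and row[0] == checksum:
        return False  # identical content: the caller can skip the write
    conn.execute(
        "INSERT OR REPLACE INTO state (filename, checksum) VALUES (?, ?)",
        (filename, checksum),
    )
    return True


conn = sqlite3.connect(":memory:")  # the real tool persists this on disk
conn.execute("CREATE TABLE state (filename TEXT PRIMARY KEY, checksum TEXT)")

page = "<html>bash 5.2</html>"
first = has_changed(conn, "bash.html", mk_checksum(page))   # new file
second = has_changed(conn, "bash.html", mk_checksum(page))  # unchanged
```

On a second run with identical content, the lookup finds a matching checksum and the page is not rewritten.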

### Python Environment

Repoview is built as a self-contained, single-file Python utility (`repoview.py`) designed for ease of deployment and broad compatibility.

#### Core Dependencies

While the application leans heavily on the Python standard library, it requires a few key external modules to function:

- **Genshi (`genshi.template`)**:
  - *Role*: The primary templating engine used to render HTML and XML output.
  - *Usage*: It processes `.kid` template files, injecting Python objects (package lists, repository metadata) into the markup.

- **RPM Bindings (`rpm`)**:
  - *Role*: Provides native RPM functionality.
  - *Usage*: Specifically used for `rpm.labelCompare` to accurately sort and compare package versions (Epoch-Version-Release) and architectures.

- **Libcomps (`libcomps`)**:
  - *Role*: Library for parsing `comps.xml` files.
  - *Usage*: Optional but recommended. It is used to parse group definitions when `comps.xml` is present or specified.

- **SQLite (`sqlite3`)**:
  - *Role*: Database interface, provided by the Python standard library.
  - *Usage*: Used to interact with the YUM metadata databases (`primary.sqlite`, `other.sqlite`) and the internal state tracking database (`state.sqlite`).

#### Code Strategy

- **Single-File Distribution**: The entire application logic resides in `repoview.py`, making it easy to copy and run without complex installation procedures.
- **Standard Library First**: It prioritizes standard library modules (`os`, `sys`, `shutil`, `hashlib`, `xml.etree`, `optparse`) to minimize external dependencies.
- **Compression Support**: It uses standard libraries (`gzip`, `bz2`, `lzma`) to transparently handle compressed metadata files commonly found in repositories.
- **Graceful Degradation**: The code includes try-except blocks for imports to handle different environment configurations (e.g., falling back to `cElementTree` or different `sqlite` import paths).
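
The guarded-import pattern mentioned under "Graceful Degradation" looks roughly like this. It is a generic sketch: the `HAVE_LIBCOMPS` flag and the `grouping_strategy` helper are hypothetical, and whether `repoview.py` falls back to RPM groups in exactly this way is an assumption.

```python
try:
    import libcomps  # optional third-party bindings
    HAVE_LIBCOMPS = True
except ImportError:
    libcomps = None
    HAVE_LIBCOMPS = False


def grouping_strategy(comps_path):
    """Pick a grouping source based on what is available (hypothetical)."""
    if comps_path and HAVE_LIBCOMPS:
        return "comps"
    return "rpm-groups"
```

Keeping the availability check in one module-level flag means the rest of the code never has to catch `ImportError` itself.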

## Command Line Arguments

Repoview CLI accepts the following arguments:

| Argument | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `repodir` | Path | (Required) | The root directory of the repository (containing the `repodata` folder). |
| `-i`, `--ignore-package` | String (Glob) | `[]` | Ignore packages matching the glob pattern (e.g., `*debuginfo*`). Can be specified multiple times. |
| `-x`, `--exclude-arch` | String | `[]` | Exclude packages for specific architectures (e.g., `src`). Can be specified multiple times. |
| `-k`, `--template-dir` | Path | `/usr/share/repoview/templates/default` | Path to a custom directory containing Genshi templates (`*.kid`) and layout files. |
| `-o`, `--output-dir` | Path | `repoview` | Subdirectory (within `repodir`) where HTML files will be generated. |
| `-s`, `--state-dir` | Path | `[output-dir]` | Directory to store the `state.sqlite` database. Defaults to the output directory. |
| `-t`, `--title` | String | `"Repoview"` | Title of the repository to be displayed on generated pages. |
| `-u`, `--url` | URL | `None` | Base URL of the repository. Required for generating valid RSS feed links. |
| `-f`, `--force` | Flag | `False` | Force regeneration of all pages, ignoring the state database checksums. |
| `-q`, `--quiet` | Flag | `False` | Suppress standard output status messages. Only fatal errors are printed. |
| `-c`, `--comps` | Path | `None` | Path to an alternative `comps.xml` file, overriding the one in `repomd.xml`. |
| `-V`, `--version` | Flag | - | Print version number and exit. |
| `-h`, `--help` | Flag | - | Print usage message and exit. |
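
A subset of the table above can be expressed with `optparse`, which the documentation lists among the standard-library modules in use. This is an illustrative sketch, not the parser in `repoview.py`: the `dest` names (`ignore`, `xarch`, `outdir`, and so on) and the help strings are assumptions.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser mirroring part of the table above (illustrative only)."""
    parser = OptionParser(usage="%prog [options] repodir")
    parser.add_option("-i", "--ignore-package", dest="ignore",
                      action="append", default=[],
                      help="glob of package names to skip (repeatable)")
    parser.add_option("-x", "--exclude-arch", dest="xarch",
                      action="append", default=[],
                      help="architecture to exclude (repeatable)")
    parser.add_option("-o", "--output-dir", dest="outdir", default="repoview",
                      help="subdirectory of repodir for generated pages")
    parser.add_option("-t", "--title", dest="title", default="Repoview")
    parser.add_option("-u", "--url", dest="url", default=None)
    parser.add_option("-f", "--force", dest="force",
                      action="store_true", default=False)
    return parser


opts, args = build_parser().parse_args(
    ["-i", "*debuginfo*", "-i", "*doc*", "-o", "docs", "/var/www/html/repo"]
)
```

With `action="append"`, each repeated `-i` adds one pattern to the list, which matches the "can be specified multiple times" behaviour described in the table. The positional `repodir` ends up in `args`.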

## Examples

### Basic Usage

Generate repoview pages for a repository located at `/var/www/html/repo`. The output will be in `/var/www/html/repo/repoview`.

```bash
repoview /var/www/html/repo
```

### Custom Title and RSS Feed

Generate pages with a specific title and enable RSS feed generation (requires URL).

```bash
repoview -t "My Enterprise Updates" -u "http://updates.example.com/repo" /var/www/html/repo
```

### Excluding Debug Packages

Skip processing for debuginfo and documentation packages to save time and space.

```bash
repoview -i "*debuginfo*" -i "*doc*" /var/www/html/repo
```
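
How such glob patterns select package names can be illustrated with the standard-library `fnmatch` module. Whether `repoview.py` uses `fnmatch` internally is an assumption here, and `is_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_ignored(pkgname, patterns):
    """Return True if the package name matches any ignore glob."""
    return any(fnmatchcase(pkgname, pattern) for pattern in patterns)


patterns = ["*debuginfo*", "*doc*"]
names = ["bash", "bash-debuginfo", "bash-doc", "docker"]
kept = [n for n in names if not is_ignored(n, patterns)]
```

Under these matching rules a broad pattern such as `*doc*` also matches `docker`, so a tighter pattern like `*-doc` may be a safer choice.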

### Force Regeneration

Force a complete rebuild of the site, useful after changing templates or upgrading Repoview.

```bash
repoview --force /var/www/html/repo
```

### Using Custom Templates

Use a custom set of templates located in `~/my-templates`.

```bash
repoview -k ~/my-templates /var/www/html/repo
```

### Custom Output Location

Write the generated site to `/var/www/html/repo/docs` instead of the default `repoview` folder.

```bash
repoview -o docs /var/www/html/repo
```
181 changes: 181 additions & 0 deletions CHARTS.md
@@ -0,0 +1,181 @@
# Repoview Data Processing and Generation Flow

This codemap traces the complete data processing pipeline of Repoview, from repository metadata parsing through static HTML generation. Key locations include the main entry point, repository validation, package data querying, template rendering, and state management.

## Trace 1: Repository Initialization and Setup
**Description**: Entry point flow from command line to repository validation and database connections

```mermaid
graph TD
subgraph Initialization
A["main command line parser"] -->|"parse_args"| B["Repoview with opts"]
B --> C["Repoview.__init__"]
end

subgraph Constructor
C --> D["setup_repo"]
C --> H["setup_state_db"]
C --> I["setup_outdir"]
C --> J["process groups and packages"]
end

subgraph RepoSetup
D --> E["parse repomd.xml file"]
E --> E1["repoxml = open(repomd).read"]
D --> F["locate metadata databases"]
F --> F1["primary_db for packages"]
F --> F2["other_db for changelogs"]
D --> G["establish SQLite connections"]
G --> G1["self.pconn = sqlite.connect"]
end

J --> K["Repository validation complete"]
```

### Key Locations
| ID | Title | Description | Source | Code |
|---|---|---|---|---|
| 1a | Main Entry Point | Instantiates the Repoview controller with parsed command line options | repoview.py:1048 | `Repoview(opts)` instantiates the controller |
| 1b | Repository Setup | Validates repository structure and locates metadata files | repoview.py:196 | `self.setup_repo()` kicks off repository validation |
| 1c | Metadata Parsing | Reads and parses repomd.xml to find database locations | repoview.py:379 | `repoxml = open(repomd).read()` loads metadata XML |
| 1d | Database Connections | Establishes SQLite connections to primary and other metadata databases | repoview.py:420 | `self.pconn = sqlite.connect(primary)` opens primary DB |
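
Steps 1c and 1d can be sketched as follows: parse `repomd.xml` and collect the location of each metadata file by type. The namespace and the `data`/`location` element layout follow the usual `repomd.xml` format; the helper name `find_databases` and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

REPO_NS = "{http://linux.duke.edu/metadata/repo}"


def find_databases(repomd_xml):
    """Map data types such as 'primary_db' to their location hrefs."""
    root = ET.fromstring(repomd_xml)
    found = {}
    for data in root.findall(REPO_NS + "data"):
        location = data.find(REPO_NS + "location")
        if location is not None:
            found[data.get("type")] = location.get("href")
    return found


# Made-up sample; real files carry checksums, timestamps and more entries.
SAMPLE = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary_db">
    <location href="repodata/abc-primary.sqlite.bz2"/>
  </data>
  <data type="other_db">
    <location href="repodata/abc-other.sqlite.bz2"/>
  </data>
</repomd>"""

dbs = find_databases(SAMPLE)
```

A missing `primary_db` or `other_db` key in the result is the natural point to report that the repository lacks SQLite metadata.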


## Trace 2: Group Discovery and Organization
**Description**: How packages are organized into groups using comps.xml or RPM group tags

```mermaid
graph TD
A["Repository setup completion"] --> B["Check for custom comps file"]
B --> C{"Use comps.xml?"}

C -->|"Yes"| D["Parse comps file path"]
D --> E["Setup comps groups"]
E --> F["Load libcomps parser"]
E --> G["Parse XML structure"]

C -->|"No"| H["Fallback to RPM groups"]
H --> I["Setup RPM groups"]
I --> J["Query distinct RPM group tags"]
I --> K["Group packages by RPM metadata"]
```

### Key Locations
| ID | Title | Description | Source | Code |
|---|---|---|---|---|
| 2a | Comps File Check | Determines whether to use custom comps.xml or repository default | repoview.py:428 | `if self.opts.comps:` honors CLI override |
| 2b | Comps Groups Setup | Parses comps.xml to extract package group definitions | repoview.py:432 | `self.setup_comps_groups(comps)` loads comps data |
| 2c | XML Parsing | Uses libcomps to parse the comps.xml file structure | repoview.py:794 | `comps.fromxml_f(compsxml)` parses comps XML |
| 2d | RPM Groups Fallback | Uses RPM group tags when comps.xml is not available | repoview.py:203 | `self.setup_rpm_groups()` builds fallback groups |
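
The RPM-group fallback (2d) amounts to a `DISTINCT` query over the package table. The sketch below uses an in-memory database with a two-column stand-in for the `packages` table of `primary.sqlite`; the column name `rpm_group`, the helper `rpm_groups`, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for the `packages` table in primary.sqlite; the real
# table has many more columns.
conn.execute("CREATE TABLE packages (name TEXT, rpm_group TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [
        ("bash", "System Environment/Shells"),
        ("zsh", "System Environment/Shells"),
        ("gcc", "Development/Languages"),
    ],
)


def rpm_groups(conn):
    """Return {group_name: sorted package names} built from RPM Group tags."""
    groups = {}
    query = "SELECT DISTINCT rpm_group FROM packages ORDER BY rpm_group"
    for (group,) in conn.execute(query).fetchall():
        rows = conn.execute(
            "SELECT DISTINCT name FROM packages "
            "WHERE rpm_group = ? ORDER BY name",
            (group,),
        )
        groups[group] = [name for (name,) in rows]
    return groups


groups = rpm_groups(conn)
```

Parameter binding with `?` keeps group names containing quotes or slashes from breaking the query.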


## Trace 3: Package Data Processing and Generation
**Description**: Core data flow from package querying to individual HTML page generation

```mermaid
graph TD
A["do_packages entry point"] --> B["for each package in group"]
B --> C["get_package_data"]

subgraph DataFetching
C --> D["SQL query construction"]
D --> E["pcursor.execute"]
C --> F["fetch package versions"]
C --> G["get changelog data"]
end

B --> H["calculate checksum"]
H --> I["mk_checksum"]

B --> J{"check if changed"}
J -->|"Yes"| K["return package tuples"]
J -->|"No"| K

K --> L["Template rendering phase"]
L --> M["Generate HTML if changed"]
```

### Key Locations
| ID | Title | Description | Source | Code |
|---|---|---|---|---|
| 3a | Package Processing Init | Starts processing packages for each group | repoview.py:239 | `packages = self.do_packages(...)` drives group build |
| 3b | Package Data Query | Queries SQLite databases for package metadata and changelogs | repoview.py:641 | `pkg_data = self.get_package_data(pkgname)` |
| 3c | Database Query Execution | Executes SQL to fetch package versions and metadata | repoview.py:532 | `pcursor.execute(query)` runs package query |
| 3d | Change Detection | Calculates checksum to determine if regeneration is needed | repoview.py:650 | `checksum = self.mk_checksum(...)` |


## Trace 4: Template Rendering and HTML Generation
**Description**: How Genshi templates are processed to generate the final HTML output

```mermaid
graph TD
A["do_packages processes group"] --> B["get_package_data"]
B --> C["Package data collected"]
A --> D["mk_checksum"]
D --> E{"has_changed?"}

E -->|"True"| F["Template Loading Phase"]
F --> G["pkg_kid.load PKGKID"]
G --> H["Template Generation Phase"]

subgraph Rendering
H --> I["tmpl.generate"]
I --> J["Injects group_data"]
I --> K["Injects pkg_data"]
I --> L["Injects repo_data"]
end

H --> M["HTML Rendering Phase"]
M --> N["stream.render to XHTML"]
N --> O["f.write saves to file"]

P["Index Page Generation"] --> Q["idx_kid.load IDXKID"]
```

### Key Locations
| ID | Title | Description | Source | Code |
|---|---|---|---|---|
| 4a | Template Loading | Loads the package template using Genshi TemplateLoader | repoview.py:658 | `tmpl = self.pkg_kid.load(PKGKID)` fetches template |
| 4b | Template Generation | Generates template stream with package and repository data | repoview.py:660 | `stream = tmpl.generate(...)` |
| 4c | HTML Rendering | Renders the template to XHTML and writes to file | repoview.py:666 | `handle.write(stream.render(...))` writes XHTML |
| 4d | Index Generation | Generates the main index page with group listings | repoview.py:278 | `tmpl = idx_kid.load(IDXKID)` prepares index template |


## Trace 5: State Management and Incremental Builds
**Description**: Efficiency mechanisms for tracking changes and avoiding unnecessary regeneration

```mermaid
graph TD
A["Repoview.__init__"] --> B["setup_state_db"]

subgraph Setup
B --> C["Load existing checksums"]
B --> D["Initialize state SQLite DB"]
end

A --> E["do_packages"]

subgraph Processing
E --> F["get_package_data"]
E --> G["mk_checksum"]
E --> H["has_changed"]
H --> I["Compare with stored checksum"]
H --> J["INSERT INTO state"]
end

A --> K["Final cleanup phase"]
K --> L["remove_stale"]

subgraph Cleanup
L --> M["Find orphaned files"]
L --> N["DELETE FROM state cleanup"]
end
```

### Key Locations
| ID | Title | Description | Source | Code |
|---|---|---|---|---|
| 5a | State Database Setup | Initializes SQLite database for tracking file checksums | repoview.py:199 | `self.setup_state_db()` prepares state tracking |
| 5b | Change Detection | Checks if file content has changed since last generation | repoview.py:651 | `if self.has_changed(...):` guards writes |
| 5c | State Tracking | Records new file checksums in the state database | repoview.py:716 | `INSERT INTO state (filename, checksum)` |
| 5d | Cleanup Process | Removes files that are no longer present in the repository | repoview.py:295 | `self.remove_stale()` prunes files |
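
The cleanup step (5d) can be sketched as a walk over the output directory that removes anything not visited during the current run. The name `remove_stale` comes from the table above, but this signature and the `visited` set are assumptions; the real method presumably also deletes the matching rows from the state database and may skip the copied `layout` assets.

```python
import os
import tempfile


def remove_stale(outdir, visited):
    """Delete files under outdir that were not visited in this run.

    `visited` is a set of paths relative to outdir. Returns the sorted list
    of relative paths that were removed.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(outdir):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            rel = os.path.relpath(full, outdir)
            if rel not in visited:
                os.remove(full)
                removed.append(rel)
    return sorted(removed)


# Demonstration in a throwaway directory.
outdir = tempfile.mkdtemp()
for name in ("index.html", "bash.html", "oldpkg.html"):
    with open(os.path.join(outdir, name), "w") as handle:
        handle.write("<html/>")

removed = remove_stale(outdir, visited={"index.html", "bash.html"})
```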