Merged
128 changes: 123 additions & 5 deletions audit/gdcd/scripts/README.md
@@ -1,15 +1,18 @@
# Log Parser Scripts

This directory contains a script to parse GDCD log files and analyze page changes, specifically identifying moved pages vs truly new/removed pages and tracking applied usage examples.
This directory contains scripts to parse GDCD log files and analyze page changes, specifically identifying moved pages vs truly new/removed pages and tracking applied usage examples.

## Files

- `parse-log.go` - Main Go script that performs the log parsing and analysis
- `parse-log.go` - Go script that performs log parsing and analysis for page changes
- `compare-page-counts.go` - Go script that compares page counts from log files with audit-cli output
- `README.md` - This documentation file

## Purpose

The script analyzes log files to distinguish between:
### parse-log.go

The parse-log.go script analyzes log files to distinguish between:

1. **Moved Pages**: Pages that appear to be removed and created but are actually the same page moved to a new location within the same project
2. **Maybe New Pages**: Pages that may be genuinely new additions
@@ -18,9 +21,43 @@ The script analyzes log files to distinguish between:

All results are reported with **project context** to clearly show which project each page belongs to.

### compare-page-counts.go

The compare-page-counts.go script compares page counts between:

1. **Log File**: Page counts extracted from GDCD log files (lines like "Found 78 docs pages for project csharp")
2. **audit-cli**: Current page counts from running `audit-cli count pages --current-only --count-by-project`
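
As a minimal sketch of how the log-side counts could be extracted, the snippet below parses the example line quoted above. The regular expression and the `parsePageCount` name are illustrative assumptions, not necessarily what compare-page-counts.go does.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// pageCountRe matches messages such as "Found 78 docs pages for project csharp".
// The pattern is an assumption based on the example line in this README.
var pageCountRe = regexp.MustCompile(`Found (\d+) docs pages for project (\S+)`)

// parsePageCount (hypothetical helper) extracts the project name and page
// count from one log line. ok is false when the line has no page-count message.
func parsePageCount(line string) (project string, count int, ok bool) {
	m := pageCountRe.FindStringSubmatch(line)
	if m == nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return "", 0, false
	}
	return m[2], n, true
}

func main() {
	project, count, ok := parsePageCount("Found 78 docs pages for project csharp")
	fmt.Println(project, count, ok) // csharp 78 true
}
```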

This helps identify discrepancies between what was processed during a GDCD run and the current state of the documentation repository. Differences can indicate:
- Pages added or removed since the log was generated
- Project name mismatches between systems
- Data inconsistencies that need investigation

The script automatically:
1. Runs audit-cli once to identify projects that exist only in audit-cli (not in the log)
2. Re-runs audit-cli with those projects excluded using the `--exclude-dirs` flag
3. Compares the filtered results for a cleaner comparison
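
The first step above amounts to a set difference between the two sources. A rough sketch, assuming both sets of counts are already held in maps keyed by project name (the `onlyInAudit` helper and the sample counts for `realm` and `guides` are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// onlyInAudit (hypothetical helper) returns the sorted project names that are
// present in auditCounts but absent from logCounts.
func onlyInAudit(logCounts, auditCounts map[string]int) []string {
	var names []string
	for name := range auditCounts {
		if _, inLog := logCounts[name]; !inLog {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Illustrative data only; the realm and guides counts are made up.
	logCounts := map[string]int{"csharp": 78, "atlas": 777}
	auditCounts := map[string]int{"csharp": 77, "atlas": 703, "realm": 10, "guides": 5}

	extra := onlyInAudit(logCounts, auditCounts)
	fmt.Printf("Found %d projects only in audit-cli: %v\n", len(extra), extra)
}
```

The resulting names are what the second run would exclude via `--exclude-dirs`; how the script formats that flag's value is not shown here.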
Review comment on lines +37 to +39 (Collaborator): nice


The script includes built-in project name mappings to handle known differences between log file project names and audit-cli project names:
- `scala` → `scala-driver`
- `cloud-docs` → `atlas`
- `c` → `c-driver`
- `cloudgov` → `atlas-government`
- `django` → `django-mongodb`
- `docs` → `manual`
- `docs-relational-migrator` → `relational-migrator`
- `laravel` → `laravel-mongodb`
- `pymongo` → `pymongo-driver`
- `pymongo-arrow` → `pymongo-arrow-driver`
- `mck` → `kubernetes`

The script also excludes deprecated projects from comparison:
- `docs-k8s-operator` (deprecated)
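
One way to picture the mapping and the deprecated-project filter is a pair of lookup tables applied to the log-side counts before comparison. This is a sketch under that assumption; the variable and function names are hypothetical, and only a few of the mappings listed above are reproduced.

```go
package main

import "fmt"

// projectNameMap translates log file project names to audit-cli project names.
// Only a subset of the documented mappings is shown.
var projectNameMap = map[string]string{
	"scala":      "scala-driver",
	"cloud-docs": "atlas",
	"docs":       "manual",
	"mck":        "kubernetes",
}

// deprecatedProjects lists projects that are skipped during comparison.
var deprecatedProjects = map[string]bool{
	"docs-k8s-operator": true,
}

// normalizeLogCounts (hypothetical helper) drops deprecated projects and
// renames the remaining ones so they can be compared with audit-cli output.
func normalizeLogCounts(logCounts map[string]int) map[string]int {
	out := make(map[string]int, len(logCounts))
	for name, n := range logCounts {
		if deprecatedProjects[name] {
			continue
		}
		if mapped, ok := projectNameMap[name]; ok {
			name = mapped
		}
		out[name] += n
	}
	return out
}

func main() {
	// The docs-k8s-operator count is made up for illustration.
	logCounts := map[string]int{"docs": 1668, "csharp": 78, "docs-k8s-operator": 12}
	fmt.Println(normalizeLogCounts(logCounts)) // map[csharp:78 manual:1668]
}
```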

## Dependencies

- Go
- `audit-cli` command (required for compare-page-counts.go) - must be available in your PATH

## How It Works

@@ -59,7 +96,9 @@ moved, we must manually adjust the count of new applied usage examples to omit t

## Usage

**Important**: You must be in the scripts directory to run the Go script directly:
**Important**: You must be in the scripts directory to run the Go scripts directly:

### parse-log.go

```bash
# Navigate to the scripts directory first
@@ -70,9 +109,22 @@ go run parse-log.go ../logs/2025-09-24-18-01-30-app.log
go run parse-log.go /absolute/path/to/your/log/file.log
```

### compare-page-counts.go

```bash
# Navigate to the scripts directory first
cd /Your/Local/Filepath/tooling/audit/gdcd/scripts

# Then run the Go script with log file and docs repo path
go run compare-page-counts.go ../logs/2025-12-10-17-58-47-app.log /path/to/docs-mongodb-internal
go run compare-page-counts.go /absolute/path/to/log/file.log /absolute/path/to/docs/repo
```

## Output Format

The script produces four sections:
### parse-log.go

The parse-log.go script produces four sections:

### 1. MOVED PAGES
```
@@ -108,6 +160,72 @@ APPLIED USAGE [pymongo]: data-formats|custom-types|type-codecs (1 applied usage
Total new applied usage examples: 17
```

### compare-page-counts.go

The compare-page-counts.go script compares page counts from the log file with the current state from audit-cli and produces output like:

```
=== INITIAL COMPARISON ===
Found 6 projects only in audit-cli: [app-services guides mongodb-analyzer mongodb-intellij mongodb-vscode realm]

Re-running audit-cli with exclusions...

=== PAGE COUNT COMPARISON ===

Projects with differences:
--------------------------------------------------
atlas Log: 777 Audit: 703 (diff: -74)
atlas-architecture Log: 124 Audit: 121 (diff: -3)
atlas-cli Log: 1276 Audit: 930 (diff: -346)
atlas-operator Log: 58 Audit: 57 (diff: -1)
c-driver Log: 86 Audit: 56 (diff: -30)
cloud-manager Log: 490 Audit: 482 (diff: -8)
compass Log: 117 Audit: 115 (diff: -2)
cpp-driver Log: 56 Audit: 52 (diff: -4)
csharp Log: 78 Audit: 77 (diff: -1)
database-tools Log: 61 Audit: 53 (diff: -8)
django-mongodb Log: 30 Audit: 27 (diff: -3)
drivers Log: 21 Audit: 20 (diff: -1)
entity-framework Log: 13 Audit: 14 (diff: +1)
golang Log: 143 Audit: 68 (diff: -75)
java Log: 90 Audit: 89 (diff: -1)
java-rs Log: 56 Audit: 55 (diff: -1)
kotlin Log: 88 Audit: 87 (diff: -1)
kotlin-sync Log: 95 Audit: 66 (diff: -29)
landing Log: 27 Audit: 23 (diff: -4)
laravel-mongodb Log: 58 Audit: 57 (diff: -1)
manual Log: 1668 Audit: 1596 (diff: -72)
mongocli Log: 403 Audit: 17 (diff: -386)
mongoid Log: 60 Audit: 59 (diff: -1)
mongosync Log: 73 Audit: 88 (diff: +15)
node Log: 77 Audit: 76 (diff: -1)
ops-manager Log: 632 Audit: 628 (diff: -4)
php-library Log: 259 Audit: 258 (diff: -1)
pymongo-arrow-driver Log: 8 Audit: 9 (diff: +1)
pymongo-driver Log: 67 Audit: 66 (diff: -1)
relational-migrator Log: 135 Audit: 109 (diff: -26)
ruby-driver Log: 91 Audit: 62 (diff: -29)
rust Log: 76 Audit: 74 (diff: -2)
scala-driver Log: 44 Audit: 43 (diff: -1)
spark-connector Log: 16 Audit: 17 (diff: +1)
voyage Log: 0 Audit: 1 (diff: +1)

=== SUMMARY ===
Total projects: 43
Matching counts: 8
Different counts: 35

Total pages in log: 7869
Total pages in audit-cli: 6771
Difference: -1098
```

This helps identify:
- **Matching counts**: Projects where log and audit-cli agree
- **Different counts**: Projects where counts differ (with the difference shown)
- **Only in log**: Projects found in the log but not in audit-cli output (may indicate project name mismatches)
- **Total pages**: Sum of all page counts from each source, excluding deprecated projects and projects only in audit-cli
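
The arithmetic behind the report can be sketched as follows: each diff is the audit-cli count minus the log count, and the summary totals are plain sums. The `row` type and `compare` function are hypothetical, and treating a project missing from one source as a count of 0 is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// row holds the two counts for one project (hypothetical type).
type row struct {
	Project    string
	Log, Audit int
}

// Diff is the audit-cli count minus the log count, e.g. 703 - 777 = -74.
func (r row) Diff() int { return r.Audit - r.Log }

// compare builds one row per project found in either map, sorted by name.
// A project missing from one map gets a count of 0 (assumption).
func compare(logCounts, auditCounts map[string]int) []row {
	seen := map[string]bool{}
	var rows []row
	for _, m := range []map[string]int{logCounts, auditCounts} {
		for name := range m {
			if !seen[name] {
				seen[name] = true
				rows = append(rows, row{name, logCounts[name], auditCounts[name]})
			}
		}
	}
	sort.Slice(rows, func(i, j int) bool { return rows[i].Project < rows[j].Project })
	return rows
}

func main() {
	logCounts := map[string]int{"atlas": 777, "entity-framework": 13}
	auditCounts := map[string]int{"atlas": 703, "entity-framework": 14}

	logTotal, auditTotal := 0, 0
	for _, r := range compare(logCounts, auditCounts) {
		logTotal += r.Log
		auditTotal += r.Audit
		if r.Diff() != 0 {
			// %+d prints an explicit sign, matching "(diff: -74)" and "(diff: +1)".
			fmt.Printf("%-20s Log: %d Audit: %d (diff: %+d)\n", r.Project, r.Log, r.Audit, r.Diff())
		}
	}
	fmt.Printf("Difference: %+d\n", auditTotal-logTotal)
}
```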

## Log Format Requirements

The scripts expect log lines in the following formats: