Skip to content

fix(provider): handle UTF-8 BOM in JSON and CSV parsing#24

Merged
oritwoen merged 2 commits intomainfrom
fix/provider-bom-parsing
Mar 11, 2026
Merged

fix(provider): handle UTF-8 BOM in JSON and CSV parsing#24
oritwoen merged 2 commits intomainfrom
fix/provider-bom-parsing

Conversation

@oritwoen
Copy link
Owner

@oritwoen oritwoen commented Mar 11, 2026

Summary

I hit this while validating fixture ingestion from exported files. detect_format already tolerated UTF-8 BOM, but actual parsing still used raw content, so BOM-prefixed input could be detected correctly and then fail in serde_json or CSV deserialization.

Changes

  • Strip UTF-8 BOM at the start of parse_signatures before dispatching to JSON/CSV parsers
  • Add a regression test for BOM-prefixed JSON input
  • Add a regression test for BOM-prefixed CSV input
  • Add direct detect_format tests for BOM-prefixed JSON and CSV input

Testing

  • cargo test --all-features
  • cargo clippy --all-features -- -D warnings
  • cargo build --release

Related Issues

Closes #25

@codeant-ai
Copy link

codeant-ai bot commented Mar 11, 2026

CodeAnt AI is reviewing your PR.

@coderabbitai
Copy link

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

parse_signatures now strips UTF-8 BOM prefixes before format detection to prevent JSON/CSV parsing failures. Two new tests validate BOM handling for both formats while existing behavior for non-BOM inputs remains unchanged.

Changes

Cohort / File(s) Summary
BOM handling in signature parsing
src/provider.rs
Added BOM stripping logic to parse_signatures before format detection. Includes test coverage for JSON and CSV inputs with UTF-8 BOM prefixes.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

A tiny mark at the start of the line,
Breaks JSON parsing—oh what a sign!
Strip it away with a whisper so clean,
Now signatures flow where BOM's never been. 🎯

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed Title accurately describes the main change: fixing UTF-8 BOM handling in JSON and CSV parsing within the provider module.
Description check ✅ Passed PR description covers all required template sections with clear details about the BOM handling fix, specific changes made, and comprehensive testing evidence.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/provider-bom-parsing
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch fix/provider-bom-parsing

Comment @coderabbitai help to get the list of available commands and usage tips.

@augmentcode
Copy link

augmentcode bot commented Mar 11, 2026

🤖 Augment PR Summary

Summary: Ensures signature parsing correctly handles UTF-8 BOM-prefixed inputs.
Changes: Strips a leading BOM before format detection/deserialization and adds regression tests for BOM-prefixed JSON and CSV fixtures.

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

@codeant-ai
Copy link

codeant-ai bot commented Mar 11, 2026

Sequence Diagram

This PR updates the signature loading flow to strip a UTF-8 BOM before format detection and parsing so that JSON and CSV inputs with a BOM are parsed correctly.

sequenceDiagram
    participant Caller
    participant Provider

    Caller->>Provider: load signatures from input
    Provider->>Provider: Read file or stdin into string
    Provider->>Provider: Strip UTF-8 BOM prefix if present
    Provider->>Provider: Detect format as JSON or CSV

    alt JSON format
        Provider->>Provider: Parse JSON and normalize inputs
    else CSV format
        Provider->>Provider: Parse CSV and normalize inputs
    end

    Provider-->>Caller: Return parsed signatures list
Loading

Generated by CodeAnt AI

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 1 file

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

Auto-approved: Trivial fix to handle UTF-8 BOM in input parsing with regression tests.

Architecture diagram
sequenceDiagram
    participant Client
    participant P as Provider (parse_signatures)
    participant D as detect_format()
    participant Parser as JSON/CSV Parsers

    Note over Client,Parser: Signature Ingestion Flow

    Client->>P: parse_signatures(raw_content)
    
    P->>P: NEW: strip_prefix(BOM)
    Note right of P: Removes \u{FEFF} if present at start
    
    P->>D: detect_format(stripped_content)
    D-->>P: returns Format (Json | Csv)

    alt Format::Json
        P->>Parser: CHANGED: parse_json(stripped_content)
        Parser-->>P: Vec<Signature>
    else Format::Csv
        P->>Parser: CHANGED: parse_csv(stripped_content)
        Parser-->>P: Vec<Signature>
    end

    alt Success
        P-->>Client: Ok(signatures)
    else Parse Error (e.g. Malformed)
        P-->>Client: Err(Error)
    end
Loading

@codeant-ai
Copy link

codeant-ai bot commented Mar 11, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Behavior consistency for CSV reader
    CSV parsing uses csv::Reader::from_reader(content.as_bytes()) after BOM removal. Ensure BOM removal always occurs before any direct CSV parsing path (including callers that may call parse_csv directly) to avoid headers containing a leading BOM. Consider documenting or centralizing this expectation.

  • Test gap: format detection with BOM
    New tests assert that parse_signatures accepts BOM-prefixed inputs, but there is no direct assertion that detect_format correctly classifies BOM-prefixed JSON/CSV inputs. Add tests for detect_format with BOM input to prevent regressions.

  • Redundant BOM handling
    The BOM is now stripped in parse_signatures, but detect_format also strips the BOM when inspecting content. This duplication is harmless but confusing and may lead to subtle inconsistencies. Consider centralizing BOM removal so detection and parsing share a single behavior.

@codeant-ai
Copy link

codeant-ai bot commented Mar 11, 2026

CodeAnt AI finished reviewing your PR.

@oritwoen oritwoen self-assigned this Mar 11, 2026
@oritwoen oritwoen merged commit 4197b9b into main Mar 11, 2026
4 checks passed
@oritwoen oritwoen deleted the fix/provider-bom-parsing branch March 11, 2026 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: BOM-prefixed JSON/CSV input fails during parsing

1 participant