Fix encoding issues and separate recategorization command #4

Merged: HanssonMagnus merged 2 commits into main from fix/encoding-and-code-quality on Nov 10, 2025

@HanssonMagnus
Owner

Summary

This PR fixes critical encoding issues and improves the recategorization workflow.

Changes

Encoding Fixes

  • Add explicit UTF-8 encoding to all file operations
  • Fix missing encoding in date cache read/write operations
  • Fix missing encoding in CLI cache operations
  • Ensure consistent behavior across systems whose default encoding is not UTF-8 (especially Windows)
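
As a rough illustration of the encoding fix, the sketch below reads and writes a JSON cache with an explicit encoding="utf-8". The function names read_cache and write_cache and the cache layout are assumptions made for this example, not the project's actual helpers.

```python
import json
from pathlib import Path


def write_cache(path: Path, data: dict) -> None:
    """Write a JSON cache file (hypothetical helper for illustration)."""
    # Without encoding="utf-8", open() uses the platform default,
    # which on many Windows setups is a legacy code page such as cp1252.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)


def read_cache(path: Path) -> dict:
    """Read the cache back; a missing file is treated as an empty cache."""
    if not path.exists():
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

Because both sides name the encoding, a cache written on one system should read back identically on another, including non-ASCII text.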

Recategorization Improvements

  • Remove automatic recategorization from scrape_bis()
  • Add standalone recategorize CLI command
  • Update run_all to include recategorization step
  • Improve recategorization with transactional updates (updates metadata.json immediately after each successful move)
  • Add metadata preservation during recategorization
  • Update documentation (README and API docs)
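
The transactional update described above could look roughly like the following sketch, which assumes metadata.json maps file names to dictionaries with a "category" field and that uncategorized files live in an "unknown" folder. The function name recategorize, the moves argument, and the directory layout are hypothetical and only illustrate the idea of saving metadata right after each successful move.

```python
import json
import shutil
from pathlib import Path


def recategorize(base_dir: Path, moves: dict[str, str]) -> list[str]:
    """Move files out of 'unknown' and persist metadata after every move.

    moves maps a file name to its new category (assumed input shape).
    Returns the names of the files that were moved successfully.
    """
    meta_path = base_dir / "metadata.json"
    metadata = json.loads(meta_path.read_text(encoding="utf-8"))
    moved = []
    for name, category in moves.items():
        src = base_dir / "unknown" / name
        dst_dir = base_dir / category
        try:
            dst_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst_dir / name))
        except OSError:
            # The move failed, so this file's metadata stays untouched.
            continue
        # Preserve the existing fields and change only the category.
        entry = metadata.get(name, {})
        entry["category"] = category
        metadata[name] = entry
        # Write immediately so an interruption loses at most the current file.
        meta_path.write_text(
            json.dumps(metadata, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        moved.append(name)
    return moved
```

Writing metadata.json inside the loop costs extra I/O, but it keeps the file on disk and its metadata entry in step if a later move fails.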

Testing

  • All 48 tests pass
  • Manual testing completed with full scrape workflow
  • Verified encoding fixes work correctly

Impact

  • Breaking Change: scrape_bis() no longer recategorizes automatically. Users must call recategorize_unknown_files() explicitly or use the run_all command.
  • Improvement: More explicit workflow and better error recovery
  • Bug Fix: File operations no longer rely on the platform default encoding, which could cause problems on Windows and other non-UTF-8 systems
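
To make the new command split concrete, here is a minimal argparse sketch. The scrape_bis() and recategorize_unknown_files() bodies are placeholders with assumed no-argument signatures, and the "scrape" subcommand name and "example-cli" program name are assumptions; only the recategorize and run_all command names come from this PR.

```python
import argparse


def scrape_bis() -> str:
    # Placeholder for the real scraper, which no longer recategorizes.
    return "scraped"


def recategorize_unknown_files() -> str:
    # Placeholder for the real recategorization step.
    return "recategorized"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example-cli")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("scrape")        # assumed name for the scrape-only command
    sub.add_parser("recategorize")  # standalone recategorization
    sub.add_parser("run_all")       # scrape, then recategorize
    return parser


def main(argv: list[str]) -> list[str]:
    """Run the selected steps in order and return their results."""
    args = build_parser().parse_args(argv)
    steps = []
    if args.command in ("scrape", "run_all"):
        steps.append(scrape_bis())
    if args.command in ("recategorize", "run_all"):
        steps.append(recategorize_unknown_files())
    return steps
```

Under these assumptions, code that previously relied on scraping alone to recategorize would either call both functions in sequence or switch to run_all.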

- Add explicit UTF-8 encoding to all file operations
- Fix missing encoding in date cache read/write operations
- Fix missing encoding in CLI cache operations
- Add documentation for unused log_dir parameter

This ensures consistent behavior across different systems,
especially Windows and non-UTF-8 locales.

- Remove automatic recategorization from scrape_bis()
- Add standalone 'recategorize' CLI command
- Update run_all to include recategorization step
- Update documentation (README and API docs)
- Improve recategorization with transactional updates
- Add metadata preservation during recategorization
- Update CHANGELOG with recent fixes
@HanssonMagnus HanssonMagnus merged commit c616d25 into main Nov 10, 2025
3 checks passed
@HanssonMagnus HanssonMagnus deleted the fix/encoding-and-code-quality branch November 10, 2025 08:28