Conversation

@jmcnamara
Collaborator

@jmcnamara jmcnamara commented Sep 5, 2025

No description provided.

@jmcnamara
Collaborator Author

jmcnamara commented Sep 5, 2025

This PR replaces the built-in Compound File Binary (CFB) file handling with the external, better tested and maintained, cfb crate.

This avoids a number of confusing bugs in the built-in code, including potential out-of-memory failures such as #550.

Some tasks to be completed:

  • Fix the failing OOM test case. Fixed.
  • Convert VBA project reading to cfb.rs.
  • Add test case for xls VBA reading
  • Replace cfb in password checking
  • Remove the redundant cfb code.
  • Do performance testing.

@jmcnamara
Collaborator Author

jmcnamara commented Sep 5, 2025

Note for future testing. One of the test cases, test_oom_allocation, failed in an initial version of this PR because it contains both a "Workbook" and "Book" stream. This is valid and was used for backward compatibility since older versions of Excel could load "Book" and ignore "Workbook". Newer versions of Excel would look for "Workbook" first and then if that didn't exist they could fall back to "Book".

Anyway, the calamine parsing of "Workbook" succeeded but the parsing of "Book" failed. This should be investigated at some point.
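
The lookup order described above can be sketched as follows. This is only an illustration and not calamine's actual code: `pick_workbook_stream` is a hypothetical helper, the stream names are assumed to have been listed from the CFB directory already, and matching is assumed to be exact.

```rust
/// Hypothetical helper: choose which top-level stream to parse from an xls file.
/// Prefers "Workbook" and falls back to "Book", mirroring the behaviour
/// described above for newer versions of Excel.
fn pick_workbook_stream<'a>(stream_names: &[&'a str]) -> Option<&'a str> {
    ["Workbook", "Book"]
        .iter()
        .find_map(|wanted| stream_names.iter().copied().find(|name| name == wanted))
}
```

For a file like the one in test_oom_allocation, which contains both streams, this selects "Workbook" regardless of the order in which the names are listed.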

@jmcnamara jmcnamara force-pushed the replace_cfb branch 2 times, most recently from 4041e7d to c35e90a Compare September 6, 2025 19:24
jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Sep 6, 2025
Replace the built-in Compound File Binary (CFB) file handling
with the external, better tested and maintained, cfb crate.

This requires a breaking API change to the VbaProject::new()
method and it drops the VbaProject::from_cfb() method.

See: tafia#551 and tafia#550
jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Sep 7, 2025
Replace the built-in Compound File Binary (CFB) file handling
with the external, better tested and maintained, cfb crate.

This requires a breaking API change to the VbaProject::new()
method and it drops the VbaProject::from_cfb() method.

See: tafia#551 and tafia#550
jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Sep 7, 2025
Replace the built-in Compound File Binary (CFB) check for
file password protection/encryption with cfb crate methods.

See: tafia#551
jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Sep 7, 2025
Remove the built-in cfb code that has been replaced by
the cfb.rs crate.

See: tafia#551
@jmcnamara jmcnamara changed the title WIP: replace built-in cfb handling with cfb crate Replace built-in cfb handling with cfb crate Sep 7, 2025
@jmcnamara jmcnamara marked this pull request as ready for review September 7, 2025 00:56
@jmcnamara
Collaborator Author

I have moved this from a WIP to a full PR and it is now ready for review.

@jmcnamara jmcnamara requested a review from Copilot September 7, 2025 00:57

Copilot AI left a comment

Pull Request Overview

This PR replaces the built-in CFB (Compound File Binary) handling implementation with the external cfb crate. This modernizes the codebase by relying on a dedicated library for CFB file parsing instead of maintaining a custom implementation.

Key changes:

  • Replaces custom CFB implementation with the cfb crate dependency
  • Updates VBA project parsing logic for both XLS and XLSM formats
  • Refactors password protection detection to use the new CFB library
  • Improves test organization with better naming and structure
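
As background for the password protection item: an encrypted xlsx or xlsb file is stored as a CFB container wrapping the encrypted package, not as a plain zip archive, so a cheap first check is the CFB signature at the start of the file. The sketch below shows only that signature check. `looks_like_cfb` is a hypothetical helper, and how this PR actually performs the detection is not shown in this thread.

```rust
/// The 8-byte signature at the start of every Compound File Binary container.
const CFB_SIGNATURE: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

/// Hypothetical helper: returns true if `header` starts with the CFB signature.
/// A match only says the file is a CFB container; confirming encryption would
/// still require inspecting the streams inside it.
fn looks_like_cfb(header: &[u8]) -> bool {
    header.starts_with(&CFB_SIGNATURE)
}
```

A regular, unencrypted xlsx file starts with the zip signature `PK\x03\x04` and would not match.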

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File              Description
Cargo.toml        Adds cfb crate dependency
src/cfb.rs        Removes custom CFB implementation, keeps utility functions
src/vba.rs        Updates VBA parsing to use cfb crate with unified logic for XLS/XLSM
src/xls.rs        Refactors XLS parsing to use cfb crate for compound file operations
src/xlsx/mod.rs   Updates XLSX VBA parsing and password detection with cfb crate
src/xlsb/mod.rs   Updates XLSB VBA parsing and password detection with cfb crate
src/utils.rs      Moves utility function to test module
tests/test.rs     Improves test naming and organization

@jmcnamara
Collaborator Author

I'm looking for reviews on this but also for some performance checks from anyone who deals with parsing large xls files.

@sftse
Contributor

sftse commented Sep 8, 2025

One thing to consider is the deep dependency tree that rust-cfb acquired in order to fix quite an edge case of Unicode uppercasing #62. I wasn't enthusiastic about it at the time, but maybe there's something that can be done before merging this change? That fix took rust-cfb from a lightweight dependency to quite an expensive one.

@jmcnamara
Collaborator Author

jmcnamara commented Sep 9, 2025

One thing to consider is the deep dependency tree that rust-cfb acquired in order to fix quite an edge case of Unicode uppercasing

Thanks for pointing that out. That is a huge addition to the dependency tree:

Diff details

$ diff cfb_native_tree.txt cfb_crate_tree.txt 
4a5,81
> ├── cfb v0.11.0
> │   ├── fnv v1.0.7
> │   ├── icu_casemap v1.5.1
> │   │   ├── displaydoc v0.2.5 (proc-macro)
> │   │   │   ├── proc-macro2 v1.0.95
> │   │   │   │   └── unicode-ident v1.0.18
> │   │   │   ├── quote v1.0.40
> │   │   │   │   └── proc-macro2 v1.0.95 (*)
> │   │   │   └── syn v2.0.104
> │   │   │       ├── proc-macro2 v1.0.95 (*)
> │   │   │       ├── quote v1.0.40 (*)
> │   │   │       └── unicode-ident v1.0.18
> │   │   ├── icu_casemap_data v1.5.1
> │   │   ├── icu_collections v1.5.0
> │   │   │   ├── displaydoc v0.2.5 (proc-macro) (*)
> │   │   │   ├── yoke v0.7.5
> │   │   │   │   ├── stable_deref_trait v1.2.0
> │   │   │   │   ├── yoke-derive v0.7.5 (proc-macro)
> │   │   │   │   │   ├── proc-macro2 v1.0.95 (*)
> │   │   │   │   │   ├── quote v1.0.40 (*)
> │   │   │   │   │   ├── syn v2.0.104 (*)
> │   │   │   │   │   └── synstructure v0.13.2
> │   │   │   │   │       ├── proc-macro2 v1.0.95 (*)
> │   │   │   │   │       ├── quote v1.0.40 (*)
> │   │   │   │   │       └── syn v2.0.104 (*)
> │   │   │   │   └── zerofrom v0.1.6
> │   │   │   │       └── zerofrom-derive v0.1.6 (proc-macro)
> │   │   │   │           ├── proc-macro2 v1.0.95 (*)
> │   │   │   │           ├── quote v1.0.40 (*)
> │   │   │   │           ├── syn v2.0.104 (*)
> │   │   │   │           └── synstructure v0.13.2 (*)
> │   │   │   ├── zerofrom v0.1.6 (*)
> │   │   │   └── zerovec v0.10.4
> │   │   │       ├── yoke v0.7.5 (*)
> │   │   │       ├── zerofrom v0.1.6 (*)
> │   │   │       └── zerovec-derive v0.10.3 (proc-macro)
> │   │   │           ├── proc-macro2 v1.0.95 (*)
> │   │   │           ├── quote v1.0.40 (*)
> │   │   │           └── syn v2.0.104 (*)
> │   │   ├── icu_locid v1.5.0
> │   │   │   ├── displaydoc v0.2.5 (proc-macro) (*)
> │   │   │   ├── litemap v0.7.5
> │   │   │   ├── tinystr v0.7.6
> │   │   │   │   ├── displaydoc v0.2.5 (proc-macro) (*)
> │   │   │   │   └── zerovec v0.10.4 (*)
> │   │   │   ├── writeable v0.5.5
> │   │   │   └── zerovec v0.10.4 (*)
> │   │   ├── icu_properties v1.5.1
> │   │   │   ├── displaydoc v0.2.5 (proc-macro) (*)
> │   │   │   ├── icu_collections v1.5.0 (*)
> │   │   │   ├── icu_locid_transform v1.5.0
> │   │   │   │   ├── displaydoc v0.2.5 (proc-macro) (*)
> │   │   │   │   ├── icu_locid v1.5.0 (*)
> │   │   │   │   ├── icu_locid_transform_data v1.5.1
> │   │   │   │   ├── icu_provider v1.5.0
> │   │   │   │   │   ├── displaydoc v0.2.5 (proc-macro) (*)
> │   │   │   │   │   ├── icu_locid v1.5.0 (*)
> │   │   │   │   │   ├── icu_provider_macros v1.5.0 (proc-macro)
> │   │   │   │   │   │   ├── proc-macro2 v1.0.95 (*)
> │   │   │   │   │   │   ├── quote v1.0.40 (*)
> │   │   │   │   │   │   └── syn v2.0.104 (*)
> │   │   │   │   │   ├── stable_deref_trait v1.2.0
> │   │   │   │   │   ├── tinystr v0.7.6 (*)
> │   │   │   │   │   ├── writeable v0.5.5
> │   │   │   │   │   ├── yoke v0.7.5 (*)
> │   │   │   │   │   ├── zerofrom v0.1.6 (*)
> │   │   │   │   │   └── zerovec v0.10.4 (*)
> │   │   │   │   ├── tinystr v0.7.6 (*)
> │   │   │   │   └── zerovec v0.10.4 (*)
> │   │   │   ├── icu_properties_data v1.5.1
> │   │   │   ├── icu_provider v1.5.0 (*)
> │   │   │   ├── tinystr v0.7.6 (*)
> │   │   │   └── zerovec v0.10.4 (*)
> │   │   ├── icu_provider v1.5.0 (*)
> │   │   ├── writeable v0.5.5
> │   │   └── zerovec v0.10.4 (*)
> │   └── uuid v1.18.1

The associated increase in binary size isn't bad. In the following case the increase is about 6%:

$ ls -lSr target/release/examples/excel_to_csv_cfb_*
-rwxr-xr-x  1 John  staff  1299344 Sep  9 09:49 target/release/examples/excel_to_csv_cfb_native
-rwxr-xr-x  1 John  staff  1380984 Sep  9 09:53 target/release/examples/excel_to_csv_cfb_crate

However, the cfb.rs dependency takes calamine's relatively small dependency chain to something quite big.

So maybe this is a non-runner.

@KillTheMule

As a user of calamine, I'd still prefer correctness over lighter dependencies, but that might be just me. Otoh, I'm personally using this with ods, so I'll be paying the bill without any additional gain :) Maybe a feature flag makes sense? One for each file type, with the default adding all (so there's no breaking change), or something like that.

@sftse
Contributor

sftse commented Sep 9, 2025

I'm working to see if we can do something about this.

@sftse
Contributor

sftse commented Sep 9, 2025

#69 is up.

@jmcnamara
Collaborator Author

#69 is up.

Good work. cfb.rs as a lighter dependency would remove any barriers to merging.

jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Sep 15, 2025
Replace the built-in Compound File Binary (CFB) file handling
with the external, better tested and maintained, cfb crate.

This requires a breaking API change to the VbaProject::new()
method and it drops the VbaProject::from_cfb() method.

See: tafia#551 and tafia#550
jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Sep 15, 2025
Replace the built-in Compound File Binary (CFB) check for
file password protection/encryption with cfb crate methods.

See: tafia#551
jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Sep 15, 2025
Remove the built-in cfb code that has been replaced by
the cfb.rs crate.

See: tafia#551
@jmcnamara jmcnamara changed the title Replace built-in cfb handling with cfb crate cfb: replace built-in cfb handling with cfb crate Sep 16, 2025
@dimastbk
Contributor

Master: test bench_xls ... bench: 84,629,129.40 ns/iter (+/- 2,162,809.64)
This PR: test bench_xls ... bench: 194,271,326.00 ns/iter (+/- 12,749,483.05)

File https://github.com/MarkPflug/Benchmarks/blob/main/source/Benchmarks/Data/65K_Records_Data.xls

@jmcnamara
Collaborator Author

@dimastbk Thanks. That is helpful. And possibly deal breaking.

Did you use the Calamine "bench" code for this or something else?

@dimastbk
Contributor

Did you use the Calamine "bench" code

Yes.

@jmcnamara
Collaborator Author

jmcnamara commented Sep 16, 2025

Thanks @dimastbk for this initial performance test.

I've looked into the performance of using cfb.rs vs the native implementation and the cfb crate is ~2x slower, even for small files.

The bottleneck seems to be in reading the CFB Streams via the Stream.read_to_end() function. As far as I can see there is no way to either avoid this or improve it.
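
For reference, the read pattern in question looks roughly like the sketch below. `cfb::Stream` implements `std::io::Read`, so the helper is written against that trait and can be exercised with any reader, such as `std::io::Cursor`. The names `read_whole_stream` and `size_hint` are hypothetical, and whether pre-sizing the buffer makes any difference to the cfb crate's performance has not been measured here.

```rust
use std::io::{self, Read};

/// Hypothetical helper: read an entire stream into memory.
/// `size_hint` is the expected stream length and is only used to
/// pre-size the buffer; a wrong hint does not affect correctness.
fn read_whole_stream<R: Read>(mut stream: R, size_hint: usize) -> io::Result<Vec<u8>> {
    let mut buf = Vec::with_capacity(size_hint);
    // This is the call that shows up as the hot spot when `R` is a cfb::Stream.
    stream.read_to_end(&mut buf)?;
    Ok(buf)
}
```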

There is also an existing performance bug report in the rust-cfb repo which doesn't have a resolution: mdsteele/rust-cfb#57

So all in all, it looks like cfb.rs may not be a suitable replacement for the native implementation.

Adding a typical flamegraph for future reference. The main bottleneck starts with std::io::Read::read_to_end which is called on cfb::Stream:

[Image: flamegraph, cfb_perf_pr]

jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Oct 21, 2025
Replace the built-in Compound File Binary (CFB) file handling
with the external, better tested and maintained, cfb crate.

This requires a breaking API change to the VbaProject::new()
method and it drops the VbaProject::from_cfb() method.

See: tafia#551 and tafia#550
jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Oct 21, 2025
Replace the built-in Compound File Binary (CFB) check for
file password protection/encryption with cfb crate methods.

See: tafia#551
jmcnamara added a commit to jmcnamara/calamine that referenced this pull request Oct 21, 2025
Remove the built-in cfb code that has been replaced by
the cfb.rs crate.

See: tafia#551
@sftse
Contributor

sftse commented Oct 21, 2025

I intend to investigate the performance regression if there is a consensus that this is necessary to merge, but can't say when I can get around to it.

@jmcnamara
Collaborator Author

jmcnamara commented Oct 21, 2025

I've updated this PR to pick up cfb.rs v0.12.0 and rebased it onto main.

However, I am going to park/close this PR since the performance delta is too big. I would probably need to dig into the cfb.rs code to fix that. Also, cfb.rs doesn't seem to be actively maintained.

@jmcnamara jmcnamara closed this Oct 21, 2025
@sftse
Contributor

sftse commented Oct 24, 2025

Since July 1, six PRs from me have been merged into the cfb crate. That seems to me a rather reasonable cadence for a project that has hardly any open issues. The most recent suggested fix took 10 days from PR to merge, which is again rather reasonable. The release seems to have simply been forgotten, since a follow-up request for a release was answered within 8 hours.

@jmcnamara
Collaborator Author

jmcnamara commented Oct 24, 2025

Overall my main concern is the 2x performance delta against the native implementation.

I put a lot of work into this PR so I would like to see it merged and I would really prefer to replace the native cfb implementation with something that is better structured and at least maintained.

I think a 10-20% performance degradation would be an acceptable trade-off for the additional benefits of cfb.rs but 100% is probably too much. If the performance issue is fixed or if enough people agree that the performance trade-off is acceptable I will re-open the PR.
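
To make the threshold concrete, here is a small sketch of the comparison. Both function names are hypothetical, and the 20% limit is just the upper end of the range mentioned above, not a project policy.

```rust
/// Slowdown of `candidate_ns` relative to `baseline_ns`, as a percentage.
fn regression_percent(baseline_ns: f64, candidate_ns: f64) -> f64 {
    (candidate_ns - baseline_ns) / baseline_ns * 100.0
}

/// Hypothetical acceptance rule: allow regressions up to `max_percent`.
fn is_acceptable(regression: f64, max_percent: f64) -> bool {
    regression <= max_percent
}
```

With the bench_xls numbers reported earlier in this thread (84,629,129.40 ns/iter on master versus 194,271,326.00 ns/iter with this PR), `regression_percent` gives roughly 130%, which is well above a 20% limit.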

In the meantime I will try to periodically rebase it.

4 participants