-
Notifications
You must be signed in to change notification settings - Fork 210
cfb: replace built-in cfb handling with cfb crate #551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR replaces the built-in Compound File Binary (CFB) file handling with the external, better tested and maintained, cfb crate. This avoids a number of confusing and potentially out-of-memory bugs such as #550 Some tasks to be completed:
|
18a76c5 to
452c6e7
Compare
|
Note for future testing. One of the test cases, Anyway, the calamine parsing of "Workbook" succeeded by "Book" failed. This should be investigated at some point. |
4041e7d to
c35e90a
Compare
c35e90a to
30469b3
Compare
Replace the built-in Compound File Binary (CFB) check for file password protection/encryption with cfb crate methods. See: tafia#551
Remove the built-in cfb code that has been replaced by the cfb.rs crate. See: tafia#551
30469b3 to
5bcb7cc
Compare
|
I have moved this from a WIP to a full PR and it is now ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR replaces the built-in CFB (Compound File Binary) handling implementation with the external cfb crate. This modernizes the codebase by leveraging a dedicated library for CFB file parsing instead of maintaining custom implementation.
Key changes:
- Replaces custom CFB implementation with the
cfbcrate dependency - Updates VBA project parsing logic for both XLS and XLSM formats
- Refactors password protection detection to use the new CFB library
- Improves test organization with better naming and structure
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| Cargo.toml | Adds cfb crate dependency |
| src/cfb.rs | Removes custom CFB implementation, keeps utility functions |
| src/vba.rs | Updates VBA parsing to use cfb crate with unified logic for XLS/XLSM |
| src/xls.rs | Refactors XLS parsing to use cfb crate for compound file operations |
| src/xlsx/mod.rs | Updates XLSX VBA parsing and password detection with cfb crate |
| src/xlsb/mod.rs | Updates XLSB VBA parsing and password detection with cfb crate |
| src/utils.rs | Moves utility function to test module |
| tests/test.rs | Improves test naming and organization |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
I'm looking for reviews on this but also for some performance checks from anyone who deals with parsing large xls files. |
|
One thing to consider is the deep dependency tree that |
Thanks for pointing that out. That is a huge addition to the dependency tree: Diff details
$ diff cfb_native_tree.txt cfb_crate_tree.txt
4a5,81
> ├── cfb v0.11.0
> │ ├── fnv v1.0.7
> │ ├── icu_casemap v1.5.1
> │ │ ├── displaydoc v0.2.5 (proc-macro)
> │ │ │ ├── proc-macro2 v1.0.95
> │ │ │ │ └── unicode-ident v1.0.18
> │ │ │ ├── quote v1.0.40
> │ │ │ │ └── proc-macro2 v1.0.95 (*)
> │ │ │ └── syn v2.0.104
> │ │ │ ├── proc-macro2 v1.0.95 (*)
> │ │ │ ├── quote v1.0.40 (*)
> │ │ │ └── unicode-ident v1.0.18
> │ │ ├── icu_casemap_data v1.5.1
> │ │ ├── icu_collections v1.5.0
> │ │ │ ├── displaydoc v0.2.5 (proc-macro) (*)
> │ │ │ ├── yoke v0.7.5
> │ │ │ │ ├── stable_deref_trait v1.2.0
> │ │ │ │ ├── yoke-derive v0.7.5 (proc-macro)
> │ │ │ │ │ ├── proc-macro2 v1.0.95 (*)
> │ │ │ │ │ ├── quote v1.0.40 (*)
> │ │ │ │ │ ├── syn v2.0.104 (*)
> │ │ │ │ │ └── synstructure v0.13.2
> │ │ │ │ │ ├── proc-macro2 v1.0.95 (*)
> │ │ │ │ │ ├── quote v1.0.40 (*)
> │ │ │ │ │ └── syn v2.0.104 (*)
> │ │ │ │ └── zerofrom v0.1.6
> │ │ │ │ └── zerofrom-derive v0.1.6 (proc-macro)
> │ │ │ │ ├── proc-macro2 v1.0.95 (*)
> │ │ │ │ ├── quote v1.0.40 (*)
> │ │ │ │ ├── syn v2.0.104 (*)
> │ │ │ │ └── synstructure v0.13.2 (*)
> │ │ │ ├── zerofrom v0.1.6 (*)
> │ │ │ └── zerovec v0.10.4
> │ │ │ ├── yoke v0.7.5 (*)
> │ │ │ ├── zerofrom v0.1.6 (*)
> │ │ │ └── zerovec-derive v0.10.3 (proc-macro)
> │ │ │ ├── proc-macro2 v1.0.95 (*)
> │ │ │ ├── quote v1.0.40 (*)
> │ │ │ └── syn v2.0.104 (*)
> │ │ ├── icu_locid v1.5.0
> │ │ │ ├── displaydoc v0.2.5 (proc-macro) (*)
> │ │ │ ├── litemap v0.7.5
> │ │ │ ├── tinystr v0.7.6
> │ │ │ │ ├── displaydoc v0.2.5 (proc-macro) (*)
> │ │ │ │ └── zerovec v0.10.4 (*)
> │ │ │ ├── writeable v0.5.5
> │ │ │ └── zerovec v0.10.4 (*)
> │ │ ├── icu_properties v1.5.1
> │ │ │ ├── displaydoc v0.2.5 (proc-macro) (*)
> │ │ │ ├── icu_collections v1.5.0 (*)
> │ │ │ ├── icu_locid_transform v1.5.0
> │ │ │ │ ├── displaydoc v0.2.5 (proc-macro) (*)
> │ │ │ │ ├── icu_locid v1.5.0 (*)
> │ │ │ │ ├── icu_locid_transform_data v1.5.1
> │ │ │ │ ├── icu_provider v1.5.0
> │ │ │ │ │ ├── displaydoc v0.2.5 (proc-macro) (*)
> │ │ │ │ │ ├── icu_locid v1.5.0 (*)
> │ │ │ │ │ ├── icu_provider_macros v1.5.0 (proc-macro)
> │ │ │ │ │ │ ├── proc-macro2 v1.0.95 (*)
> │ │ │ │ │ │ ├── quote v1.0.40 (*)
> │ │ │ │ │ │ └── syn v2.0.104 (*)
> │ │ │ │ │ ├── stable_deref_trait v1.2.0
> │ │ │ │ │ ├── tinystr v0.7.6 (*)
> │ │ │ │ │ ├── writeable v0.5.5
> │ │ │ │ │ ├── yoke v0.7.5 (*)
> │ │ │ │ │ ├── zerofrom v0.1.6 (*)
> │ │ │ │ │ └── zerovec v0.10.4 (*)
> │ │ │ │ ├── tinystr v0.7.6 (*)
> │ │ │ │ └── zerovec v0.10.4 (*)
> │ │ │ ├── icu_properties_data v1.5.1
> │ │ │ ├── icu_provider v1.5.0 (*)
> │ │ │ ├── tinystr v0.7.6 (*)
> │ │ │ └── zerovec v0.10.4 (*)
> │ │ ├── icu_provider v1.5.0 (*)
> │ │ ├── writeable v0.5.5
> │ │ └── zerovec v0.10.4 (*)
> │ └── uuid v1.18.1The associated increase in binary size isn't bad. In the following case the increase is about 6%: However, the So maybe this is a non-runner. |
|
As a user of calamine, I'd still prefer correctness over lighter dependencies, but that might be just me. Otoh, I'm personally using this with ods, so I'll be paying the bill without any additional gain :) Maybe a feature flag makes sense? One for each file type, with the default adding all (so there's no breaking change), or something like that. |
|
I'm working to see if we can do something about this. |
|
#69 is up. |
Good work. |
Replace the built-in Compound File Binary (CFB) check for file password protection/encryption with cfb crate methods. See: tafia#551
Remove the built-in cfb code that has been replaced by the cfb.rs crate. See: tafia#551
5bcb7cc to
0b69329
Compare
0b69329 to
450e437
Compare
Replace the built-in Compound File Binary (CFB) check for file password protection/encryption with cfb crate methods. See: tafia#551
Remove the built-in cfb code that has been replaced by the cfb.rs crate. See: tafia#551
|
Master: File https://github.com/MarkPflug/Benchmarks/blob/main/source/Benchmarks/Data/65K_Records_Data.xls |
|
@dimastbk Thanks. That is helpful. And possibly deal breaking. Did you use the Calamine "bench" code for this or something else? |
Yes. |
|
Thanks @dimastbk for this initial performance test. I've looked into the performance of using The bottleneck seems to be in reading the CFB Streams via the There is also an existing performance bug report in the So all in all, it looks like Adding a typical flamegraph for future reference. The main bottleneck starts with |
Replace the built-in Compound File Binary (CFB) check for file password protection/encryption with cfb crate methods. See: tafia#551
Remove the built-in cfb code that has been replaced by the cfb.rs crate. See: tafia#551
450e437 to
16cbc6d
Compare
Replace the built-in Compound File Binary (CFB) check for file password protection/encryption with cfb crate methods. See: tafia#551
16cbc6d to
6fa3717
Compare
Remove the built-in cfb code that has been replaced by the cfb.rs crate. See: tafia#551
Remove the built-in cfb code that has been replaced by the cfb.rs crate. See: tafia#551
6fa3717 to
8281794
Compare
|
I intend to investigate the performance regression if there is a consensus that this is necessary to merge, but can't say when I can get around to it. |
|
I've updated this PR to pick up pick up cfb.rs v0.12.0 and rebased to main. However, I am going to park/close this PR since the performance delta is too big. I would probably need to dig into the |
|
Since the 1. July 6 PRs from me were merged to the cfb crate, this seems to me a rather reasonable cadence for a project that has hardly any open issues. The most recent suggested fix took 10 days from PR to merge, again, rather reasonable. The release seems to have been forgotten, since a follow up request for a release was answered within 8h. |
|
Overall my main concern is the 2x performance delta against the native implementation. I put a lot of work into this PR so I would like to see it merged and I would really prefer to replace the native cfb implementation with something that is better structured and at least maintained. I think a 10-20% performance degradation would be an acceptable trade-off for the additional benefits of cfb.rs but 100% is probably too much. If the performance issue is fixed or if enough people agree that the performance trade-off is acceptable I will re-open the PR. In the meantime I will try to periodically rebase it. |
No description provided.