TODO

catdoc issues

medium priority: incomplete/inconsistent input file checking

GitHub issue #9: catdoc doesn't report an error if you give it a directory to convert; catppt and xls2csv print "/usr is not OLE file or Error" but still exit with status 0. All three should fail consistently, with a non-zero exit status. Could also check

Unreadable file inconsistency: % catdoc -b /tmp/unreadable prints "catdoc: Permission denied", but % catppt /tmp/unreadable prints the more useful "/tmp/unreadable: Permission denied".

Likewise for /tmp/nosuchfile
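
One way to make the three tools agree is to route every input path through a single check before parsing. The sketch below is only an illustration: `check_input_path` and `open_input` are hypothetical names, not existing catdoc functions, and the message format simply copies the "path: reason" style that catppt already prints.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not in catdoc): return 0 if `path` exists and is
 * not a directory, otherwise an errno value describing the problem.
 */
int check_input_path(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return errno;           /* e.g. ENOENT for /tmp/nosuchfile */
    if (S_ISDIR(st.st_mode))
        return EISDIR;          /* e.g. /usr */
    return 0;
}

/*
 * Hypothetical helper (not in catdoc): open `path` for reading, or print
 * "<path>: <reason>" to stderr and return NULL. The caller is expected to
 * exit with a non-zero status when it gets NULL.
 */
FILE *open_input(const char *path)
{
    int err = check_input_path(path);
    FILE *f = NULL;

    if (err == 0) {
        f = fopen(path, "rb");
        if (f == NULL)
            err = errno;        /* e.g. EACCES for /tmp/unreadable */
    }
    if (f == NULL)
        fprintf(stderr, "%s: %s\n", path, strerror(err));
    return f;
}
```

With a shared helper like this, a directory, an unreadable file, and a missing file would all produce the same message shape, and each tool would only need to turn a NULL return into `exit(EXIT_FAILURE)`.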

catppt issues

low priority: catppt slide separator (form feed) inconsistency

  • basic.ppt (MS PowerPoint format) outputs a form feed between each slide and after the last slide. test_LO_file.ppt (LibreOffice Impress format) only outputs a form feed after the last slide. Root cause: MS PPT stores text in SlideListWithText with SlidePersistAtom records that trigger slide_state = START_SLIDE, while LO PPT stores text in PPDrawing/Escher ClientTextbox records where there is no equivalent per-slide state transition. Fix would require emitting slide separators when entering each Slide container, not just when processing SlidePersistAtom.
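
A possible shape for that fix is sketched below. It assumes the parser can call one hook whenever it enters a new slide (from either a SlidePersistAtom or a Slide container) and another whenever it writes text; all names here (`slide_sep`, `on_slide_start`, `on_text`, `on_document_end`) are hypothetical and do not exist in catppt. Tracking whether any text was written since the last separator keeps a file that triggers both hooks for the same slide from emitting two form feeds.

```c
#include <stdio.h>

/* Hypothetical separator state; not part of catppt. */
struct slide_sep {
    int text_since_sep;   /* non-zero once text follows the last form feed */
};

/* Emit a form feed only if some text was written since the last one. */
static void flush_separator(struct slide_sep *s, FILE *out)
{
    if (s->text_since_sep) {
        fputc('\f', out);
        s->text_since_sep = 0;
    }
}

/* Call when entering a slide, whichever record type signalled it. */
void on_slide_start(struct slide_sep *s, FILE *out)
{
    flush_separator(s, out);
}

/* Call for every piece of extracted slide text. */
void on_text(struct slide_sep *s, FILE *out, const char *text)
{
    fputs(text, out);
    s->text_since_sep = 1;
}

/* Call once after the last record, to terminate the final slide. */
void on_document_end(struct slide_sep *s, FILE *out)
{
    flush_separator(s, out);
}
```

One trade-off of this sketch is that a slide with no text produces no form feed at all, which differs from the current basic.ppt output if that file contains empty slides.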

medium priority: catppt doesn't extract text from chart object

tests/file_example_PPT_1MB.ppt has a chart object on slide 2. catppt does not output the text in the chart's Data Table.

low priority: "IRM protected" in protection_warning.ppt sample file.

The sample PowerPoint I found and used as a basic .ppt sample test file doesn't show any text in LibreOffice besides "This presentation has been IRM protected by policy." This makes it a bad test file. Maybe it's worth figuring out if there's other text in the file and if this is a special protected PowerPoint feature.

CI

Fedora RPM automation

  • Trigger Fedora package builds for skierpage/catdoc Copr once packit permissions are sorted.
    • trigger: release in .packit.yaml targets owner: skierpage, project: catdoc but builds never happen: the release shows up in the Packit dashboard under "Releases Handled" with no build created, and https://copr.fedorainfracloud.org/coprs/skierpage/catdoc/builds/ stays empty.
    • Packit is installed as a GitHub App (Settings > GitHub Actions > Packit-as-a-service).
    • PR builds work fine (go to temporary packit/skierpage-catdoc-N Copr projects).
    • Latest Packit docs focus on dist-git integration (submitting to Fedora proper); the Copr release trigger workflow may need different configuration or explicit Copr permissions granted to the Packit service account.

Test cleanup

  • Maybe fix the memory leaks reported by asan so tests don't have to set ASAN_OPTIONS=detect_leaks=0

    Investigation shows xls2csv has memory leaks (154 bytes: 128 from rowptr[row].cells allocation at sheet.c:38, 26 from cell content at xlsparse.c:307). catdoc does not leak.

    free_sheet() IS being called and executes all free() calls for rows/cells, but ASan still reports the leak. This may be a complex issue with realloc/ASan interaction, or allocations happening in a different code path. The leaks are small and occur at program exit, so this may be "won't fix".

Research: find more Office 2007 test files

  • Do the Gnome LocalSearch (formerly Tracker), KFileMetadata, and LibreOffice projects have test documents? In particular I'm looking for other encodings.
  • Is there an online Word 97-to-Office 2007 emulator or in-browser tool that lets you make your own?

Research: consider alternative approaches

This is old C code. ALTERNATIVES discusses alternate ways to extract text.

Confirm some CVE fixes

v0.97 fixed nearly all outstanding CVEs and memory access errors; see NEWS for details. But there are a few loose ends:

  • Contact rycbar77 and ask for their POC for CVE-2023-41633, "Catdoc v0.95 was discovered to contain a NULL pointer dereference via the component xls2csv at src/fileutil.c.". This may be skierpage/catdoc issue #8; if so, it's fixed by commit e91fef7.

  • CVE-2023-46345 (references rycbar77's gist), Strftime-Nullptr-Dereference: "NULL pointer dereference via the component xls2csv at src/xlsparse.c". One of the commits to src/xlsparse.c in this fork in February 2026 may have fixed this.

    • Need the POC file in order to reproduce.
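
Until the POC is available, the exact crash path is unknown. One common way a strftime NULL dereference arises is that `gmtime()` or `localtime()` returns NULL for an out-of-range timestamp and the result is passed straight to `strftime()`. The sketch below shows the defensive pattern under that assumption; `format_time_safe` and the "#DATE?" placeholder are hypothetical and are not taken from xlsparse.c.

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not in catdoc): format `t` into `buf` using `fmt`.
 * Returns 1 on success. If the time cannot be converted, or the result
 * does not fit, writes a placeholder instead and returns 0, so that
 * strftime() is never called with a NULL struct tm.
 */
int format_time_safe(char *buf, size_t size, const char *fmt, time_t t)
{
    const struct tm *tm;

    if (size == 0)
        return 0;

    tm = gmtime(&t);            /* may be NULL for out-of-range values */
    if (tm == NULL || strftime(buf, size, fmt, tm) == 0) {
        snprintf(buf, size, "#DATE?");
        return 0;
    }
    return 1;
}
```

If the POC does turn out to exercise this path, a regression test could feed the same cell value through a guard like this and check that the tool prints the placeholder rather than crashing.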

Investigate fixes committed to libdoc

  • Review commits made to the similar C source code in libdoc that is based on catdoc.

    Note that libdoc issue #1 (CVE-2018-20453) and issue #2 (CVE-2018-20451) seem to be fixed by catdoc commit 12ab509. This is confusing since user kasha13 claimed these were fixed in libdoc with different source code changes.

MISC

BACKLOG

Victor Wagner's original TODO:

  • support dual-byte (CJK) encodings as output
  • Find a way to extract rowspan information from XLS.
  • Make XLS2CSV to output sheet partially when memory exhausted
  • Plain-text output method for XLS2CSV and its support in wordview
  • textmode (ck) wordview
  • Improve RTF support
  • Extract text from Top Level OLE objects ???
  • Write correct TeX commands for accented latin letters and most often used mathematical symbols (20xx-25xx) into tex.specchars file
  • Add handling of tables & footnotes
  • Fastsave support