GitHub issue #9: catdoc reports no error when given a directory to convert; catppt and xls2csv print "/usr is not OLE file or Error" but still exit with status 0. They should consistently fail. Could also check:
Unreadable file inconsistency: `% catdoc -b /tmp/unreadable` prints `catdoc: Permission denied`, but `% catppt /tmp/unreadable` prints the better `/tmp/unreadable: Permission denied`.
Likewise for `/tmp/nosuchfile`.
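
One way to make all three tools fail consistently is to route every input through a single open helper. The sketch below is only an illustration: `open_input` is a hypothetical name, not an existing catdoc function, and the `path: reason` message format is assumed from the catppt output quoted above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper, not an existing catdoc function. Opens `path` for
 * reading and rejects directories. On failure it prints "path: reason" to
 * stderr and returns NULL so the caller can exit with a non-zero status. */
static FILE *open_input(const char *path)
{
    struct stat st;
    FILE *f;

    if (stat(path, &st) != 0) {
        /* Covers missing files (ENOENT) and unsearchable directories. */
        fprintf(stderr, "%s: %s\n", path, strerror(errno));
        return NULL;
    }
    if (S_ISDIR(st.st_mode)) {
        /* fopen() on a directory can succeed, so reject it explicitly. */
        fprintf(stderr, "%s: %s\n", path, strerror(EISDIR));
        return NULL;
    }
    f = fopen(path, "rb");
    if (f == NULL) {
        /* Covers unreadable files (EACCES). */
        fprintf(stderr, "%s: %s\n", path, strerror(errno));
        return NULL;
    }
    return f;
}
```

A caller would treat a `NULL` return as a failed conversion and make sure the process exit status ends up non-zero.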
-
`basic.ppt` (MS PowerPoint format) outputs a form feed between each slide and after the last slide. `test_LO_file.ppt` (LibreOffice Impress format) only outputs a form feed after the last slide. Root cause: MS PPT stores text in `SlideListWithText` with `SlidePersistAtom` records that trigger `slide_state = START_SLIDE`, while LO PPT stores text in `PPDrawing`/`EscherClientTextbox` records where there is no equivalent per-slide state transition. The fix would require emitting slide separators when entering each `Slide` container, not just when processing `SlidePersistAtom`.
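
The proposed separator logic can be sketched independently of the PPT record parsing. Everything below is hypothetical: `rec_kind`, `struct rec`, `append` and `render_slides` are illustration-only names, and the real catppt code works on binary record headers rather than on an array like this. The point is only that tying the form feed to "a slide starts" gives the same output regardless of which record type carried the text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified record model. catppt itself walks binary
 * record headers; this enum only marks the two events that matter here. */
enum rec_kind { REC_SLIDE_START, REC_TEXT };

struct rec {
    enum rec_kind kind;
    const char *text;            /* only used when kind == REC_TEXT */
};

/* Appends `s` to the NUL-terminated string in `buf` (capacity `cap`).
 * Returns 0 on success, -1 if the result would not fit. */
static int append(char *buf, size_t cap, const char *s)
{
    size_t used = strlen(buf);
    size_t add = strlen(s);

    if (add >= cap - used)       /* need used + add + 1 <= cap */
        return -1;
    memcpy(buf + used, s, add + 1);
    return 0;
}

/* Renders all records into `buf`, emitting a form feed whenever a slide
 * ends: before the next slide starts and once after the last slide.
 * Returns 0 on success, -1 if `buf` is too small. */
static int render_slides(const struct rec *recs, size_t n,
                         char *buf, size_t cap)
{
    int in_slide = 0;
    size_t i;

    if (cap == 0)
        return -1;
    buf[0] = '\0';
    for (i = 0; i < n; i++) {
        if (recs[i].kind == REC_SLIDE_START) {
            if (in_slide && append(buf, cap, "\f") != 0)
                return -1;       /* separator between slides */
            in_slide = 1;
        } else {
            if (append(buf, cap, recs[i].text) != 0 ||
                append(buf, cap, "\n") != 0)
                return -1;
        }
    }
    if (in_slide && append(buf, cap, "\f") != 0)
        return -1;               /* form feed after the last slide */
    return 0;
}
```

With two slides containing "a" and "b", this produces a form feed between the slides and one after the last, matching what `basic.ppt` currently outputs.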
tests/file_example_PPT_1MB.ppt has a chart object on slide 2. catppt does not output the text in its Data Table.
The sample PowerPoint I found and used as a basic .ppt sample test file doesn't show any text in LibreOffice besides "This presentation has been IRM protected by policy." This makes it a bad test file. Maybe it's worth figuring out if there's other text in the file and if this is a special protected PowerPoint feature.
- Trigger Fedora package builds for skierpage/catdoc Copr once packit permissions are sorted.
`trigger: release` in `.packit.yaml` targets `owner: skierpage, project: catdoc` but builds never happen: the release shows up in the Packit dashboard under "Releases Handled" with no build created, and https://copr.fedorainfracloud.org/coprs/skierpage/catdoc/builds/ stays empty.
- Packit is installed as a GitHub App (Settings > GitHub Actions > Packit-as-a-service).
- PR builds work fine (go to temporary `packit/skierpage-catdoc-N` Copr projects).
- Latest Packit docs focus on `dist-git` integration (submitting to Fedora proper); the Copr release trigger workflow may need different configuration or explicit Copr permissions granted to the Packit service account.
-
Maybe fix the memory leaks reported by ASan so tests don't have to set `ASAN_OPTIONS=detect_leaks=0`.
Investigation shows xls2csv has memory leaks (154 bytes: 128 from rowptr[row].cells allocation at sheet.c:38, 26 from cell content at xlsparse.c:307). catdoc does not leak.
free_sheet() IS being called and executes all free() calls for rows/cells, but ASan still reports the leak. This may be a complex issue with realloc/ASan interaction, or allocations happening in a different code path. The leaks are small and occur at program exit, so this may be "won't fix".
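
For comparison while investigating, here is a minimal sketch of a rows/cells ownership pattern in which every allocation has exactly one matching `free()`. The struct layout and all function bodies are hypothetical; only the names `free_sheet` and the `rows[row].cells` shape are borrowed from the notes above, and nothing here is taken from sheet.c or xlsparse.c. It highlights two places where a leak can hide even though a free routine runs: overwriting a cell pointer without freeing the old value, and walking a "last used" index instead of the allocated length.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout, loosely modelled on the rowptr[row].cells notes. */
struct row { char **cells; size_t ncells; };   /* ncells: allocated length */
struct sheet { struct row *rows; size_t nrows; };

/* Grows rows[] so `row` is a valid index; new entries start out empty. */
static int ensure_row(struct sheet *s, size_t row)
{
    struct row *p;
    size_t i, n = row + 1;

    if (row < s->nrows)
        return 0;
    p = realloc(s->rows, n * sizeof *p);
    if (p == NULL)
        return -1;               /* old block is untouched, still owned by s */
    for (i = s->nrows; i < n; i++) {
        p[i].cells = NULL;
        p[i].ncells = 0;
    }
    s->rows = p;
    s->nrows = n;
    return 0;
}

/* Stores a copy of `text` at (row, col), replacing any previous value. */
static int set_cell(struct sheet *s, size_t row, size_t col, const char *text)
{
    struct row *r;
    char *copy;
    size_t i;

    if (ensure_row(s, row) != 0)
        return -1;
    r = &s->rows[row];
    if (col >= r->ncells) {
        char **c = realloc(r->cells, (col + 1) * sizeof *c);
        if (c == NULL)
            return -1;
        for (i = r->ncells; i <= col; i++)
            c[i] = NULL;
        r->cells = c;
        r->ncells = col + 1;
    }
    copy = malloc(strlen(text) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, text);
    free(r->cells[col]);         /* dropping the old pointer would leak it */
    r->cells[col] = copy;
    return 0;
}

/* Frees every cell, every cells[] array, then rows[], walking the
 * allocated lengths rather than a "last used" index. */
static void free_sheet(struct sheet *s)
{
    size_t i, j;

    for (i = 0; i < s->nrows; i++) {
        for (j = 0; j < s->rows[i].ncells; j++)
            free(s->rows[i].cells[j]);
        free(s->rows[i].cells);
    }
    free(s->rows);
    s->rows = NULL;
    s->nrows = 0;
}
```

If the real code differs from this pattern in either of those two spots, that would be a candidate explanation worth checking before concluding that the report is a realloc/ASan artifact.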
- Do the Gnome LocalSearch (formerly Tracker), KFileMetadata, and LibreOffice projects have test documents? In particular I'm looking for other encodings.
- Is there an online Word 97-to-Office 2007 emulator or in-browser tool that lets you make your own?
This is old C code. ALTERNATIVES discusses alternate ways to extract text.
v0.97 fixed nearly all outstanding CVEs and memory access errors; see [NEWS](NEWS) for details. But there are a few loose ends:
-
Contact rycbar77 and ask for their POC for CVE-2023-41633, "Catdoc v0.95 was discovered to contain a NULL pointer dereference via the component xls2csv at src/fileutil.c.". This may be skierpage/catdoc issue #8; if so, it's fixed by commit e91fef7.
-
CVE-2023-46345 (references rycbar77's gist), Strftime-Nullptr-Dereference "NULL pointer dereference via the component xls2csv at src/xlsparse.c". One of the commits to src/xlsparse.c in this fork in February 2026 may have fixed this.
- Need the POC file in order to reproduce.
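
Until the POC is available, the gist title suggests the crash involves `strftime()` receiving a NULL `struct tm *`, which `gmtime()` and `localtime()` can return when a time value is not representable. That reading is an assumption, not a confirmed diagnosis. A defensive wrapper along these lines would rule the pattern out; `safe_strftime` is a hypothetical name and is not taken from xlsparse.c.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical wrapper, not taken from xlsparse.c. Formats `tm` with
 * strftime() but tolerates tm == NULL, which gmtime()/localtime() return
 * when the time value cannot be represented. Returns the number of
 * characters written, or 0 with buf set to "" on failure. */
static size_t safe_strftime(char *buf, size_t cap, const char *fmt,
                            const struct tm *tm)
{
    size_t n;

    if (buf == NULL || cap == 0)
        return 0;
    buf[0] = '\0';
    if (fmt == NULL || tm == NULL)
        return 0;                /* would otherwise dereference NULL */
    n = strftime(buf, cap, fmt, tm);
    if (n == 0)
        buf[0] = '\0';           /* buffer contents are unspecified here */
    return n;
}
```

Once a POC file exists, running it against a build with and without such a guard would show whether this is the actual code path.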
-
Review commits made to the similar C source code in libdoc that is based on catdoc.
Note that libdoc issue #1 (CVE-2018-20453) and issue #2 (CVE-2018-20451) seem to be fixed by catdoc commit 12ab509. This is confusing since user kasha13 claimed these were fixed in libdoc with different source code changes.
- Incorporate Victor Wagner's notes at https://www.wagner.pp.ru/~vitus/software/catdoc/ into README.md
Victor Wagner's original TODO:
- support dual-byte (CJK) encodings as output
- Find a way to extract rowspan information from XLS.
- Make XLS2CSV to output sheet partially when memory exhausted
- Plain-text output method for XLS2CSV and its support in wordview
- textmode (ck) wordview
- Improve RTF support
- Extract text from Top Level OLE objects ???
- Write correct TeX commands for accented latin letters and most often used mathematical symbols (20xx-25xx) into tex.specchars file
- Add handling of tables & footnotes
- Fastsave support