Merged
4 changes: 4 additions & 0 deletions .github/workflows/c-cpp.yml
@@ -1,5 +1,9 @@
name: C/C++ CI

# Per https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true

on:
push:

20 changes: 17 additions & 3 deletions NEWS
@@ -1,3 +1,15 @@
0.97.2 March 26 2026
Fix Tcl incompatibility in wordview GUI wrapper around catdoc.
Add the "distribution tarball" created by `make check` as a release
asset; other CI automation improvements.
Add more PowerPoint sample files to tests.
Fix empty catppt output from many PowerPoint files (issue #6) including
LibreOffice Impress documents saved in .ppt format (issue #7) by
processing ClientTextboxes in PPDrawing records.
Don't print spurious theme names (e.g. "Office Theme", "Default
Design") in PowerPoint files that appear as CString records in master
slides.

0.97.1 March 21 2026
Clean up address sanitizer tests.
Simplify RPM package builds triggered by Packit.
@@ -29,9 +41,11 @@
Developed test framework for memory access errors caught by
address sanitizer.
Fixed CVE-2023-31979 global buffer overflow vulnerability in the
process_file function in src/reader.c.
This also addresses CVE-2018-20451 and CVE-2018-20453 that were found
in libdoc (based on catdoc) and were also present in catdoc..
process_file function in src/reader.c, reported as issue 9 in
petewarden's early catdoc repo on GitHub. This also addresses bugs
CVE-2018-20451 and CVE-2018-20453 found in libdoc (based on catdoc),
that were also present in catdoc.

Fixed NULL dereference if charset directory is empty (issue #8), which
may be the same bug as CVE-2023-41633.
Fixed memory leak in charset handling.
29 changes: 17 additions & 12 deletions README.md
@@ -1,4 +1,4 @@
# catdoc version 0.97.1 in development
# catdoc version 0.97.2

`catdoc` is a program which reads MS-Office Word `.doc` files and prints their
content as readable ASCII text to stdout. It can also produce correct
@@ -25,10 +25,14 @@ the text of old MS-Office files.
user-definable output formats and support for Word97 files, which contain
UNICODE internally.

## version 0.97.2

This release extracts text from many more PowerPoint `.ppt` files.

## version 0.97

This release of the catdoc programs addresses numerous vulnerabilities
described below. To do so it has updated autoconf/automake tooling to make it
described [below](#vulnerabilities). To do so it has updated autoconf/automake tooling to make it
easier to build with Address Sanitizer, and an automake test harness to check
for memory errors. The steps to build it from source changed slightly, see
[INSTALL](INSTALL).
@@ -66,21 +70,22 @@ from Office 97-2003 files, or convert them to other formats.
The catdoc programs are unsafe C code that parse old files. Unexpected or
garbled file content will cause them to crash and running them on a
specially-crafted file may allow an attacker to interfere with the operation of
your computer. Version 0.97 fixes several memory access errors and Common
your computer. Version 0.97 fixes many memory access errors and Common
Vulnerabilities and Exposures reported against various forks and distribution
packages of catdoc over the years, but there may be more.

This release of the catdoc programs incorporates the Debian patches for the
vulnerabilities
packages of catdoc over the years:
- it incorporates the Debian patches for the vulnerabilities
[CVE-2024-54028](https://nvd.nist.gov/vuln/detail/CVE-2024-54028),
[CVE-2024-52035](https://nvd.nist.gov/vuln/detail/CVE-2024-52035),
and
[CVE-2024-48877](https://nvd.nist.gov/vuln/detail/CVE-2024-48877)
identified and addressed by the Cisco Talos team, and several other memory
access vulnerabilities reported over the years.
See [NEWS](NEWS) and the commit history (search history for "CVE") for other
fixes made. Some were detected by Address Sanitizer tools, see
[tests/asan_failures/README.md](tests/asan_failures/README.md) for more details.
identified and addressed by the Cisco Talos team
- it fixes the memory access errors [reported by yangzao against vbwagner's catdoc upstream](/vbwagner/catdoc/issues?q=is%3Aissue%20author%3Ayangzao)
- it fixes the buffer overflow [CVE-2023-31979](https://nvd.nist.gov/vuln/detail/CVE-2023-31979) reported by randomssr against petewarden's outdated copy of catdoc; this also fixes the heap buffer over-reads in [CVE-2018-20451](https://nvd.nist.gov/vuln/detail/CVE-2018-20451) and [CVE-2018-20453](https://nvd.nist.gov/vuln/detail/CVE-2018-20453) reported against libdoc (based on catdoc)
- it probably fixes [CVE-2023-46345](https://nvd.nist.gov/vuln/detail/CVE-2023-46345) and [CVE-2023-41633](https://nvd.nist.gov/vuln/detail/CVE-2023-41633) reported by rycbar77
- together, these fixes address all the [many bugs detected by Dean Pierce in 2015](https://seclists.org/oss-sec/2015/q1/835)

Some were detected by Address Sanitizer tools,
see [tests/asan_failures/README.md](tests/asan_failures/README.md) for more details.

## Documentation, bugs, more information

70 changes: 55 additions & 15 deletions TODO.md
@@ -1,5 +1,55 @@
# TODO

## catdoc issues

### medium priority: incomplete/inconsistent input file checking

GitHub issue #9: catdoc doesn't report an error if you give it a directory to convert; catppt and xls2csv print "/usr is not OLE file or Error" but don't signal failure (the exit status is 0). All three should fail consistently. Could also check

Unreadable file inconsistency:
% catdoc -b /tmp/unreadable
prints
catdoc: Permission denied
but
% catppt /tmp/unreadable
prints the better
/tmp/unreadable: Permission denied

Likewise for /tmp/nosuchfile
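
One way to make the three programs agree is a shared pre-flight check. The following is only a sketch under stated assumptions: `check_input_file` is a hypothetical helper that does not exist in catdoc, and the `path: reason` message format simply copies what catppt already prints for an unreadable file.

```c
#define _POSIX_C_SOURCE 200809L  /* for stat() and S_ISDIR under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper, not part of catdoc.
 * Returns 0 if `path` exists, is not a directory, and can be opened for
 * reading. Otherwise prints "<path>: <reason>" to stderr and returns -1,
 * so the caller can exit with a nonzero status. */
static int check_input_file(const char *path)
{
    struct stat st;
    FILE *f;

    assert(path != NULL);

    if (stat(path, &st) != 0) {
        /* Covers nonexistent files, e.g. /tmp/nosuchfile */
        fprintf(stderr, "%s: %s\n", path, strerror(errno));
        return -1;
    }
    if (S_ISDIR(st.st_mode)) {
        /* Covers directories, e.g. /usr */
        fprintf(stderr, "%s: %s\n", path, strerror(EISDIR));
        return -1;
    }
    f = fopen(path, "rb");
    if (f == NULL) {
        /* Covers unreadable files, e.g. /tmp/unreadable */
        fprintf(stderr, "%s: %s\n", path, strerror(errno));
        return -1;
    }
    fclose(f);
    return 0;
}
```

Each program could call such a helper once per file argument and exit nonzero when it returns -1, which would give identical messages and exit statuses for directories, unreadable files, and missing files.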


## catppt issues

### low priority: catppt slide separator (form feed) inconsistency
- [ ] `basic.ppt` (MS PowerPoint format) outputs a form feed between each slide and after
the last slide. `test_LO_file.ppt` (LibreOffice Impress format) only outputs a form
feed after the last slide. Root cause: MS PPT stores text in `SlideListWithText` with
`SlidePersistAtom` records that trigger `slide_state = START_SLIDE`, while LO PPT
stores text in `PPDrawing`/Escher `ClientTextbox` records where there is no equivalent
per-slide state transition. Fix would require emitting slide separators when entering
each `Slide` container, not just when processing `SlidePersistAtom`.
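
The proposed fix can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `rec_kind`, `struct rec` and `render_slides` are illustration-only names, not catppt's actual data structures, and the input is a pre-flattened record list rather than a real `.ppt` stream. The point is only that the separator is driven by entering a slide, not by a `SlidePersistAtom`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified record kinds: entering a Slide container,
 * or a run of text that belongs to the current slide. */
enum rec_kind { REC_SLIDE_START, REC_TEXT };

struct rec {
    enum rec_kind kind;
    const char *text;   /* only used for REC_TEXT */
};

/* Copies slide text into `out`, writing a form feed between slides and
 * after the last slide. Returns the number of bytes written, not
 * counting the terminating NUL. Output is truncated to fit `cap`. */
static size_t render_slides(const struct rec *recs, size_t n,
                            char *out, size_t cap)
{
    size_t used = 0;
    int in_slide = 0;

    assert(out != NULL && cap > 0);

    for (size_t i = 0; i < n; i++) {
        if (recs[i].kind == REC_SLIDE_START) {
            /* Close the previous slide, if any, before starting a new one. */
            if (in_slide && used + 1 < cap)
                out[used++] = '\f';
            in_slide = 1;
        } else {
            for (const char *p = recs[i].text; *p && used + 1 < cap; p++)
                out[used++] = *p;
        }
    }
    /* Form feed after the last slide, matching the MS PowerPoint case. */
    if (in_slide && used + 1 < cap)
        out[used++] = '\f';
    out[used] = '\0';
    return used;
}
```

With this shape, both file flavors would produce the same separators as long as each one reports when a slide begins, regardless of where its text is stored.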

### medium priority: catppt doesn't extract text from chart object

tests/file_example_PPT_1MB.ppt has a chart object on slide 2. catppt does not output the text in the Data Table.

### low priority: "IRM protected" in protection_warning.ppt sample file.

The sample PowerPoint I found and used as a basic .ppt sample test file doesn't show any text in LibreOffice besides "This presentation has been IRM protected by policy." This makes it a bad test file. Maybe it's worth figuring out if there's other text in the file and if this is a special protected PowerPoint feature.

## CI

### Fedora RPM automation
- [ ] Trigger Fedora package builds for skierpage/catdoc Copr once packit permissions are sorted.
- `trigger: release` in .packit.yaml targets `owner: skierpage, project: catdoc` but builds
never happen: the release shows up in the Packit dashboard under "Releases Handled" with no
build created, and https://copr.fedorainfracloud.org/coprs/skierpage/catdoc/builds/ stays empty.
- Packit is installed as a GitHub App (Settings > GitHub Actions > Packit-as-a-service).
- PR builds work fine (go to temporary `packit/skierpage-catdoc-N` Copr projects).
- Latest Packit docs focus on `dist-git` integration (submitting to Fedora proper); the Copr
release trigger workflow may need different configuration or explicit Copr permissions granted
to the Packit service account.

## Test cleanup

- [ ] Maybe fix the memory leaks reported by asan so tests don't have to set ASAN_OPTIONS=detect_leaks=0
@@ -20,10 +70,14 @@
tool that lets you make your own?

## Research: consider alternative approaches

This is old C code.
[ALTERNATIVES](ALTERNATIVES.md) discusses alternate ways to extract text.

## Investigate additional CVEs:
## Confirm some CVE fixes

v0.97 fixed nearly all outstanding CVEs and memory access errors; see [NEWS](NEWS) for details.
But there are a few loose ends:

- [ ] Contact rycbar77 and ask for their POC for [CVE-2023-41633](https://nvd.nist.gov/vuln/detail/CVE-2023-41633), "Catdoc v0.95 was discovered to contain a NULL pointer dereference via the component xls2csv at src/fileutil.c." This may be [skierpage/catdoc issue #8](/skierpage/catdoc/issues/8); if so, it's fixed by commit e91fef7.

@@ -37,21 +91,7 @@ This is old C code.

Note that libdoc [issue #1](https://github.com/uvoteam/libdoc/issues/1) (CVE-2018-20453) and [issue #2](https://github.com/uvoteam/libdoc/issues/2) (CVE-2018-20451) seem to be fixed by catdoc commit 12ab509. This is confusing since user kasha13 claimed these were fixed in libdoc with different source code changes.

## Investigate issues reported against petewarden fork of catdoc
[petewarden/catdoc](https://github.com/petewarden/catdoc) is a commit of version 0.93 of catdoc,
dating from around 2010
- [✅] Analyze [issues](https://github.com/petewarden/catdoc/issues) , including:
- [✅] [Issue 3](https://github.com/petewarden/catdoc/issues/3), "change character ก to Ď during xls2csv -d utf-8 /source.xls > desination.csv" may be fixed in this fork, see comment on issue
- [✅] [Issue 4](https://github.com/petewarden/catdoc/issues/4), "Heap-buffer-overflow in catdoc version 0.95 (numutils.c)" is fixed as part of Address Sanitizer fixes in this fork
- [✅] [Issue 5](https://github.com/petewarden/catdoc/issues/5) "Heap-buffer-overflow in catdoc version 0.95 (numutils.c)" is removed
- [✅] [Issue 6](https://github.com/petewarden/catdoc/issues/6) "Global-buffer-overflow in xls2csv" is removed
- [✅] [Issue 7](https://github.com/petewarden/catdoc/issues/7) "Buffer overflow in xls2csv (xlsparse.c:716)" is fixed by commit 44daea395; it may have a Debian bug number.
- [✅] [Issue 7](https://github.com/petewarden/catdoc/issues/7) "Buffer overflow in xls2csv (xlsparse.c:716)" is fixed by commit 44daea395; it may have a Debian bug number.
- [✅] [Issue 9](https://github.com/petewarden/catdoc/issues/9), "catdoc global buffer overflow -- by misuse of the option "-b"" is fixed by commit 1a09fc5
- [✅] [Issue 10](https://github.com/petewarden/catdoc/issues/10), "global-buffer-overflow on reader.c:177:20" is fixed in this fork

## MISC
- [ ] Trigger Fedora package builds for skierpage/catdoc Copr once packit permissions are sorted.
- [ ] Incorporate Victor Wagner's notes at https://www.wagner.pp.ru/~vitus/software/catdoc/ into README.md

# BACKLOG
2 changes: 1 addition & 1 deletion catdoc.spec
@@ -1,5 +1,5 @@
Name: catdoc
Version: 0.97.1
Version: 0.97.2
Release: %autorelease
Summary: programs which extract text from Microsoft Office 97-2004 files
License: GPL-2.0-or-later
2 changes: 1 addition & 1 deletion configure.ac
@@ -1,5 +1,5 @@
dnl Process this file with autoconf to produce a configure script.
AC_INIT([catdoc],[0.97.1])
AC_INIT([catdoc],[0.97.2])
AC_CONFIG_AUX_DIR([build-aux])
AM_INIT_AUTOMAKE([-Wall foreign])

7 changes: 7 additions & 0 deletions src/catdoc.h
@@ -7,6 +7,13 @@
#ifndef CATDOC_H
#define CATDOC_H

#ifdef DEBUG
#define DBGPRINT(fmt, ...) \
fprintf(stderr, "DEBUG [%s:%d]: " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__)
#else
#define DBGPRINT(fmt, ...) do {} while(0)
#endif

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif
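
A minimal usage sketch for the `DBGPRINT` macro added above. It repeats the macro so the snippet compiles on its own. `record_end` is a hypothetical function, and defining `DEBUG` in the source is an assumption made for illustration (the diff does not show how `DEBUG` gets set; a compiler flag such as `-DDEBUG` would be the usual route). Note that `##__VA_ARGS__` is a GNU extension accepted by GCC and Clang.

```c
#include <assert.h>
#include <stdio.h>

#define DEBUG 1  /* assumption: enabled here only so the example prints */

#ifdef DEBUG
#define DBGPRINT(fmt, ...) \
    fprintf(stderr, "DEBUG [%s:%d]: " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__)
#else
#define DBGPRINT(fmt, ...) do {} while(0)
#endif

/* Hypothetical helper: returns the offset just past a record and
 * traces its arguments on stderr when DEBUG is defined. */
static long record_end(long offset, long length)
{
    assert(offset >= 0 && length >= 0);
    DBGPRINT("record at offset %ld, length %ld", offset, length);
    DBGPRINT("format string with no extra arguments");
    return offset + length;
}
```

Because the non-debug variant expands to `do {} while(0)`, a call such as `DBGPRINT("x");` stays a single statement and is safe inside an unbraced `if`/`else`.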