diff --git a/.github/workflows/c-cpp.yml b/.github/workflows/c-cpp.yml index efb00e1..6fc40ce 100644 --- a/.github/workflows/c-cpp.yml +++ b/.github/workflows/c-cpp.yml @@ -1,5 +1,9 @@ name: C/C++ CI +# Per https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/ +env: + FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true + on: push: diff --git a/NEWS b/NEWS index 77e0bfa..d68522b 100644 --- a/NEWS +++ b/NEWS @@ -1,3 +1,15 @@ + 0.97.2 March 26 2026 + Fix Tcl incompatibility in wordview GUI wrapper around catdoc. + Add the "distribution tarball" created by `make check` as a release + asset; other CI automation improvements. + Add more PowerPoint sample files to tests. + Fix empty catppt output from many PowerPoint files (issue #6) including + LibreOffice Impress documents saved in .ppt format (issue #7) by + processing ClientTextboxes in PPDrawing records. + Don't print spurious theme names (e.g. "Office Theme", "Default + Design") in PowerPoint files that appear as CString records in master + slides. + 0.97.1 March 21 2026 Clean up address sanitizer tests. Simplify RPM package builds triggered by Packit. @@ -29,9 +41,11 @@ Developed test framework for memory access errors caught by address sanitizer. Fixed CVE-2023-31979 global buffer overflow vulnerability in the - process_file function in src/reader.c. - This also addresses CVE-2018-20451 and CVE-2018-20453 that were found - in libdoc (based on catdoc) and were also present in catdoc.. + process_file function in src/reader.c, reported as issue 9 in + petewarden's early catdoc repo on GitHub. This also addresses bugs + CVE-2018-20451 and CVE-2018-20453 found in libdoc (based on catdoc), + that were also present in catdoc. + Fixed NULL derefence if charset directory is empty (issue #8), which may be the same bug as CVE-2023-41633. Fixed memory leak in charset handling. 
diff --git a/README.md b/README.md index e36a3da..7b739c5 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# catdoc version 0.97.1 in development +# catdoc version 0.97.2 `catdoc` is a program which reads MS-Office Word `.doc` files and prints their content as readable ASCII text to stdout. It can also produce correct @@ -25,10 +25,14 @@ the text of old MS-Office files. user-definable output formats and support for Word97 files, which contain UNICODE internally. +## version 0.97.2 + +This release extracts text from many more PowerPoint `.ppt` files. + ## version 0.97 This release of the catdoc programs addresses numerous vulnerabilities -described below. To do so it has updated autoconf/automake tooling to make it +described [below](#vulnerabilities). To do so it has updated autoconf/automake tooling to make it easier to build with Address Sanitizer, and an automake test harness to check for memory errors. The steps to build it from source changed slightly, see [INSTALL](INSTALL). @@ -66,21 +70,22 @@ from Office 97-2003 files, or convert them to other formats. The catdoc programs are unsafe C code that parse old files. Unexpected or garbled file content will cause them to crash and running them on a specially-crafted file may allow an attacker to interfere with the operation of -your computer. Version 0.97 fixes several memory access errors and Common +your computer. Version 0.97 fixes many memory access errors and Common Vulnerabilities and Exposures reported against various forks and distribution -packages of catdoc over the years, but there may be more. 
- -This release of the catdoc programs incorporates the Debian patches for the -vulnerabilities +packages of catdoc over the years: +- it incorporates the Debian patches for the vulnerabilities [CVE-2024-54028](https://nvd.nist.gov/vuln/detail/CVE-2024-54028), [CVE-2024-52035](https://nvd.nist.gov/vuln/detail/CVE-2024-52035), and [CVE-2024-48877](https://nvd.nist.gov/vuln/detail/CVE-2024-48877) -identified and addressed by the Cisco Talos team, and several other memory -access vulnerabilities reported over the years. -See [NEWS](NEWS) and the commit history (search history for "CVE") for other -fixes made. Some were detected by Address Sanitizer tools, see -[tests/asan_failures/README.md](tests/asan_failures/README.md) for more details. +identified and addressed by the Cisco Talos team +- it fixes the memory access errors [reported by yangzao against vbwagner's catdoc upstream](/vbwagner/catdoc/issues?q=is%3Aissue%20author%3Ayangzao) +- it fixes the buffer overflow [CVE-2023-31979](https://nvd.nist.gov/vuln/detail/CVE-2023-31979) reported by randomssr against petewarden's outdated copy of catdoc; this also fixes the heap buffer over-reads in [CVE-2018-20451](https://nvd.nist.gov/vuln/detail/CVE-2018-20451) and [CVE-2018-20453](https://nvd.nist.gov/vuln/detail/CVE-2018-20453) reported against libdoc (based on catdoc) +- it probably fixes [CVE-2023-46345](https://nvd.nist.gov/vuln/detail/CVE-2023-46345) and [CVE-2023-41633](https://nvd.nist.gov/vuln/detail/CVE-2023-41633) reported by rycbar77 +- together, these fixes address all the [many bugs detected by Dean Pierce in 2015](https://seclists.org/oss-sec/2015/q1/835) + +Some were detected by Address Sanitizer tools, +see [tests/asan_failures/README.md](tests/asan_failures/README.md) for more details. 
## Documentation, bugs, more information diff --git a/TODO.md b/TODO.md index 297b31b..ca9b16c 100644 --- a/TODO.md +++ b/TODO.md @@ -1,5 +1,55 @@ # TODO +## catdoc issues + +### medium priority: incomplete/inconsistent input file checking + +GitHub issue #9: catdoc doesn't error if you give it a directory to convert; catppt and xls2csv print "/usr is not OLE file or Error" (with no error, exit status is 0). They should consistently fail. Could also check + +Unreadable file inconsistency: + % catdoc -b /tmp/unreadable +prints + catdoc: Permission denied +but + % catppt /tmp/unreadable +prints the better + /tmp/unreadable: Permission denied + +Likewise for /tmp/nosuchfile + + +## catppt issues + +### low priority: catppt slide separator (form feed) inconsistency +- [ ] `basic.ppt` (MS PowerPoint format) outputs a form feed between each slide and after + the last slide. `test_LO_file.ppt` (LibreOffice Impress format) only outputs a form + feed after the last slide. Root cause: MS PPT stores text in `SlideListWithText` with + `SlidePersistAtom` records that trigger `slide_state = START_SLIDE`, while LO PPT + stores text in `PPDrawing`/Escher `ClientTextbox` records where there is no equivalent + per-slide state transition. Fix would require emitting slide separators when entering + each `Slide` container, not just when processing `SlidePersistAtom`. + +### medium priority: catppt doesn't extract text from chart object + +tests/file_example_PPT_1MB.ppt has a chart object on slide 2. catppt does not output the text in the Data Table. + +### low priority: "IRM protected" in protection_warning.ppt sample file. + +The sample PowerPoint I found and used as a basic .ppt sample test file doesn't show any text in LibreOffice besides "This presentation has been IRM protected by policy." This makes it a bad test file. Maybe it's worth figuring out if there's other text in the file and if this is a special protected PowerPoint feature. 
+ +## CI + +### Fedora RPM automation +- [ ] Trigger Fedora package builds for skierpage/catdoc Copr once packit permissions are sorted. + - `trigger: release` in .packit.yaml targets `owner: skierpage, project: catdoc` but builds + never happen: the release shows up in the Packit dashboard under "Releases Handled" with no + build created, and https://copr.fedorainfracloud.org/coprs/skierpage/catdoc/builds/ stays empty. + - Packit is installed as a GitHub App (Settings > GitHub Actions > Packit-as-a-service). + - PR builds work fine (go to temporary `packit/skierpage-catdoc-N` Copr projects). + - Latest Packit docs focus on `dist-git` integration (submitting to Fedora proper); the Copr + release trigger workflow may need different configuration or explicit Copr permissions granted + to the Packit service account. + ## Test cleanup - [ ] Maybe fix the memory leaks reported by asan so tests don't have to set ASAN_OPTIONS=detect_leaks=0 @@ -20,10 +70,14 @@ tool that lets you make your own? ## Research: consider alternative approaches + This is old C code. [ALTERNATIVES](ALTERNATIVES.md) discusses alternate ways to extract text. -## Investigate additional CVEs: +## Confirm some CVE fixes + +v0.97 fixed nearly all outstanding CVEs and memory access errors, see [NEWS](NEWS) for details. +But there are a few loose ends: - [ ] Contact rycbar77 and ask for their POC for [CVE-2023-41633](https://nvd.nist.gov/vuln/detail/CVE-2023-41633), "Catdoc v0.95 was discovered to contain a NULL pointer dereference via the component xls2csv at src/fileutil.c.". This may [skierpage/catdoc issue #8](/skierpage/catdoc/issues/8), if so it's fixed by commit e91fef7. @@ -37,21 +91,7 @@ This is old C code. Note that libdoc [issue #1](https://github.com/uvoteam/libdoc/issues/1) (CVE-2018-20453) and [issue #2](https://github.com/uvoteam/libdoc/issues/2) (CVE-2018-20451) seem to be fixed by catdoc commit 12ab509. 
This is confusing since user kasha13 claimed these were fixed in libdoc with different source code changes. -## Investigate issues reported against petewarden fork of catdoc -[petewarden/catdoc](https://github.com/petewarden/catdoc) is a commit of version 0.93 of catdoc, -dating from around 2010 -- [✅] Analyze [issues](https://github.com/petewarden/catdoc/issues) , including: - - [✅] [Issue 3](https://github.com/petewarden/catdoc/issues/3), "change character ก to Ď during xls2csv -d utf-8 /source.xls > desination.csv" may be fixed in this fork, see comment on issue - - [✅] [Issue 4](https://github.com/petewarden/catdoc/issues/4), "Heap-buffer-overflow in catdoc version 0.95 (numutils.c)" is fixed as part of Address Sanitizer fixes in this fork - - [✅] [Issue 5](https://github.com/petewarden/catdoc/issues/5) "Heap-buffer-overflow in catdoc version 0.95 (numutils.c)" is removed - - [✅] [Issue 6](https://github.com/petewarden/catdoc/issues/6) "Global-buffer-overflow in xls2csv" is removed - - [✅] [Issue 7](https://github.com/petewarden/catdoc/issues/7) "Buffer overflow in xls2csv (xlsparse.c:716)" is fixed by commit 44daea395; it may have a Debian bug number. - - [✅] [Issue 7](https://github.com/petewarden/catdoc/issues/7) "Buffer overflow in xls2csv (xlsparse.c:716)" is fixed by commit 44daea395; it may have a Debian bug number. - - [✅] [Issue 9](https://github.com/petewarden/catdoc/issues/9), "catdoc global buffer overflow -- by misuse of the option "-b"" is fixed by commit 1a09fc5 - - [✅] [Issue 10](https://github.com/petewarden/catdoc/issues/10), "global-buffer-overflow on reader.c:177:20" is fixed in this fork - ## MISC -- [ ] Trigger Fedora package builds for skierpage/catdoc Copr once packit permissions are sorted. 
- [ ] Incorporate Victor Wagner's notes at https://www.wagner.pp.ru/~vitus/software/catdoc/ into README.md # BACKLOG diff --git a/catdoc.spec b/catdoc.spec index 6d1d31a..73055d8 100644 --- a/catdoc.spec +++ b/catdoc.spec @@ -1,5 +1,5 @@ Name: catdoc -Version: 0.97.1 +Version: 0.97.2 Release: %autorelease Summary: programs which extract text from Microsoft Office 97-2004 files License: GPL-2.0-or-later diff --git a/configure.ac b/configure.ac index a460ec8..49e969b 100644 --- a/configure.ac +++ b/configure.ac @@ -1,5 +1,5 @@ dnl Process this file with autoconf to produce a configure script. -AC_INIT([catdoc],[0.97.1]) +AC_INIT([catdoc],[0.97.2]) AC_CONFIG_AUX_DIR([build-aux]) AM_INIT_AUTOMAKE([-Wall foreign]) diff --git a/src/catdoc.h b/src/catdoc.h index ab8bc5b..b02a890 100644 --- a/src/catdoc.h +++ b/src/catdoc.h @@ -7,6 +7,13 @@ #ifndef CATDOC_H #define CATDOC_H +#ifdef DEBUG +#define DBGPRINT(fmt, ...) \ + fprintf(stderr, "DEBUG [%s:%d]: " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__) +#else +#define DBGPRINT(fmt, ...) do {} while(0) +#endif + #ifdef HAVE_CONFIG_H #include #endif diff --git a/src/pptparse.c b/src/pptparse.c index f475f07..5f8694f 100644 --- a/src/pptparse.c +++ b/src/pptparse.c @@ -1,13 +1,13 @@ /** * @file pptparse.c * @author Alex Ott - * @date 23 2004 + * @date 23 ??? 
2004 * Version: $Id: pptparse.c,v 1.2 2006-10-17 19:11:29 vitus Exp $ * Copyright: Alex Ott - * + * * @brief .ppt parsing routines - * - * + * + * */ #ifdef HAVE_CONFIG_H @@ -21,7 +21,7 @@ #include "catdoc.h" #include "ppttypes.h" -char *slide_separator = "\f"; +char *slide_separator = "\f"; static void process_item (int rectype, long reclen, FILE* input); @@ -31,21 +31,22 @@ static void process_item (int rectype, long reclen, FILE* input); static void start_text_out(void); -/** - * - * - * @param input - * @param filename +/** + * + * + * @param input + * @param filename */ enum {START_FILE,START_SLIDE,TEXTOUT,END_FILE} slide_state ; +static int in_slide = 0; /* true when inside a Slide record, not a master or notes */ static void start_text_out(void) { if (slide_state == START_SLIDE) { fputs(slide_separator,stdout); } slide_state = TEXTOUT; -} +} void do_ppt(FILE *input,char *filename) { int itemsread=1; int rectype; @@ -54,11 +55,7 @@ void do_ppt(FILE *input,char *filename) { slide_state = START_FILE; while(itemsread) { itemsread = catdoc_read(recbuf, 1, 8, input); -/* fprintf(stderr,"itemsread=%d: ",itemsread); */ -/* for(i=0; i<8; i++) */ -/* fprintf(stderr,"%02x ",recbuf[i]); */ -/* fprintf(stderr,"\n"); */ - + if (catdoc_eof(input)) { process_item(DOCUMENT_END,0,input); return; @@ -67,240 +64,258 @@ void do_ppt(FILE *input,char *filename) { break; rectype=getshort(recbuf,2); reclen=getulong(recbuf,4); + DBGPRINT("read record type=%d len=%ld", rectype, reclen); if (reclen < 0) { return; - } + } process_item(rectype,reclen,input); } } -/** - * - * - * @param rectype - * @param reclen - * @param input +/** + * + * + * @param rectype + * @param reclen + * @param input */ static void process_item (int rectype, long reclen, FILE* input) { int i=0, u; static unsigned char buf[2]; -/* fprintf(stderr,"Processing record %d length %d\n",rectype,reclen); - * */ switch(rectype) { case DOCUMENT_END: -/* fprintf(stderr,"End of document, ended at %ld\n",catdoc_tell(input)); 
*/ + DBGPRINT("End of document, ended at %ld", catdoc_tell(input)); catdoc_seek(input, reclen, SEEK_CUR); if (slide_state == TEXTOUT) { fputs(slide_separator,stdout); slide_state = END_FILE; - } + } break; case DOCUMENT: -/* fprintf(stderr,"Start of document, reclen=%ld, started at %ld\n", reclen, */ -/* catdoc_tell(input)); */ + DBGPRINT("Start of document, reclen=%ld, started at %ld", reclen, catdoc_tell(input)); break; case DOCUMENT_ATOM: -/* fprintf(stderr,"DocumentAtom, reclen=%ld\n", reclen); */ + DBGPRINT("DocumentAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case SLIDE: -/* fprintf(stderr,"Slide, reclen=%ld\n", reclen); */ + DBGPRINT("Slide, reclen=%ld", reclen); + in_slide = 1; break; case SLIDE_ATOM: -/* fprintf(stderr,"SlideAtom, reclen=%ld\n", reclen); */ + DBGPRINT("SlideAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; - + case SLIDE_BASE: -/* fprintf(stderr,"SlideBase, reclen=%ld\n", reclen); */ + DBGPRINT("SlideBase, reclen=%ld", reclen); break; case SLIDE_BASE_ATOM: -/* fprintf(stderr,"SlideBaseAtom, reclen=%ld\n", reclen); */ + DBGPRINT("SlideBaseAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; - + case NOTES: -/* fprintf(stderr,"Notes, reclen=%ld\n", reclen); */ + DBGPRINT("Notes, reclen=%ld", reclen); + in_slide = 0; break; case NOTES_ATOM: -/* fprintf(stderr,"NotesAtom, reclen=%ld\n", reclen); */ + DBGPRINT("NotesAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; - + case HEADERS_FOOTERS: -/* fprintf(stderr,"HeadersFooters, reclen=%ld\n", reclen); */ + DBGPRINT("HeadersFooters, reclen=%ld", reclen); break; case HEADERS_FOOTERS_ATOM: -/* fprintf(stderr,"HeadersFootersAtom, reclen=%ld\n", reclen); */ + DBGPRINT("HeadersFootersAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; - + case MAIN_MASTER: -/* fprintf(stderr,"MainMaster, reclen=%ld\n", reclen); */ + DBGPRINT("MainMaster, reclen=%ld", reclen); + in_slide = 0; break; - + 
case TEXT_BYTES_ATOM: { -/* fprintf(stderr,"TextBytes, reclen=%ld\n", reclen); */ - start_text_out(); - for(i=0; i < reclen; i++) { - catdoc_read(buf,1,1,input); - if((unsigned char)*buf!=0x0d) - fputs(convert_char((unsigned char)*buf),stdout); - else - fputc('\n',stdout); - } - fputc('\n',stdout); + DBGPRINT("TextBytesAtom, reclen=%ld", reclen); + start_text_out(); + for(i=0; i < reclen; i++) { + catdoc_read(buf,1,1,input); + if((unsigned char)*buf!=0x0d) + fputs(convert_char((unsigned char)*buf),stdout); + else + fputc('\n',stdout); + } + fputc('\n',stdout); } break; - - case TEXT_CHARS_ATOM: + + case TEXT_CHARS_ATOM: case CSTRING: { - long text_len; - -/* fprintf(stderr,"CString, reclen=%ld\n", reclen); */ - start_text_out(); - text_len=reclen/2; - for(i=0; i < text_len; i++) { - catdoc_read(buf,2,1,input); - u=(unsigned short)getshort(buf,0); - if(u!=0x0d) - fputs(convert_char(u),stdout); - else - fputc('\n',stdout); - } - fputc('\n',stdout); + long text_len; + + DBGPRINT("TextCharsAtom/CString, reclen=%ld", reclen); + if (!in_slide) { + catdoc_seek(input, reclen, SEEK_CUR); + break; + } + start_text_out(); + text_len=reclen/2; + for(i=0; i < text_len; i++) { + catdoc_read(buf,2,1,input); + u=(unsigned short)getshort(buf,0); + if(u!=0x0d) + fputs(convert_char(u),stdout); + else + fputc('\n',stdout); + } + fputc('\n',stdout); } break; - + case USER_EDIT_ATOM: -/* fprintf(stderr,"UserEditAtom, reclen=%ld\n", reclen); */ + DBGPRINT("UserEditAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case COLOR_SCHEME_ATOM: -/* fprintf(stderr,"ColorSchemeAtom, reclen=%ld\n", reclen); */ + DBGPRINT("ColorSchemeAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case PPDRAWING: -/* fprintf(stderr,"PPDrawing, reclen=%ld\n", reclen); */ - catdoc_seek(input, reclen, SEEK_CUR); + if (in_slide) { + DBGPRINT("PPDrawing, reclen=%ld - descending into slide drawing", reclen); + } else { + DBGPRINT("PPDrawing, reclen=%ld - skipping (not in 
slide)", reclen); + catdoc_seek(input, reclen, SEEK_CUR); + } + break; + + case ESCHER_DG_CONTAINER: + DBGPRINT("Escher DgContainer, reclen=%ld", reclen); + break; + + case ESCHER_SPGR_CONTAINER: + DBGPRINT("Escher SpgrContainer, reclen=%ld", reclen); + break; + + case ESCHER_SP_CONTAINER: + DBGPRINT("Escher SpContainer, reclen=%ld", reclen); + break; + + case ESCHER_CLIENT_TEXTBOX: + DBGPRINT("Escher ClientTextbox, reclen=%ld - contains text", reclen); + break; + + case ESCHER_CLIENT_DATA: + DBGPRINT("Escher ClientData, reclen=%ld", reclen); break; case ENVIRONMENT: -/* fprintf(stderr,"Environment, reclen=%ld\n", reclen); */ + DBGPRINT("Environment, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case SSDOC_INFO_ATOM: -/* fprintf(stderr,"SSDocInfoAtom, reclen=%ld\n", reclen); */ + DBGPRINT("SSDocInfoAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case SSSLIDE_INFO_ATOM: -/* fprintf(stderr,"SSSlideInfoAtom, reclen=%ld\n", reclen); */ + DBGPRINT("SSSlideInfoAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case PROG_TAGS: -/* fprintf(stderr,"ProgTags, reclen=%ld\n", reclen); */ + DBGPRINT("ProgTags, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case PROG_STRING_TAG: -/* fprintf(stderr,"ProgStringTag, reclen=%ld\n", reclen); */ + DBGPRINT("ProgStringTag, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case PROG_BINARY_TAG: -/* fprintf(stderr,"ProgBinaryTag, reclen=%ld\n", reclen); */ + DBGPRINT("ProgBinaryTag, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case LIST: -/* fprintf(stderr,"List, reclen=%ld\n", reclen); */ + DBGPRINT("List, reclen=%ld", reclen); break; case SLIDE_LIST_WITH_TEXT: -/* fprintf(stderr,"SlideListWithText, reclen=%ld\n", reclen); */ -/* fputs("---------------------------------------\n",stderr); */ + DBGPRINT("SlideListWithText, reclen=%ld", reclen); break; case PERSIST_PTR_INCREMENTAL_BLOCK: -/* 
fprintf(stderr,"PersistPtrIncrementalBlock, reclen=%ld\n", reclen); */ + DBGPRINT("PersistPtrIncrementalBlock, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case EX_OLE_OBJ_STG: -/* fprintf(stderr,"ExOleObjStg, reclen=%ld\n", reclen); */ + DBGPRINT("ExOleObjStg, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case PPDRAWING_GROUP: -/* fprintf(stderr,"PpdrawingGroup, reclen=%ld\n", reclen); */ + DBGPRINT("PpdrawingGroup, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case EX_OBJ_LIST: -/* fprintf(stderr,"ExObjList, reclen=%ld\n", reclen); */ + DBGPRINT("ExObjList, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case TX_MASTER_STYLE_ATOM: -/* fprintf(stderr,"TxMasterStyleAtom, reclen=%ld\n", reclen); */ + DBGPRINT("TxMasterStyleAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case HANDOUT: -/* fprintf(stderr,"Handout, reclen=%ld\n", reclen); */ + DBGPRINT("Handout, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case SLIDE_PERSIST_ATOM: + DBGPRINT("SlidePersistAtom, reclen=%ld", reclen); if (slide_state != START_FILE) { slide_state = START_SLIDE; - } + } catdoc_seek(input, reclen, SEEK_CUR); break; case TEXT_HEADER_ATOM: -/* fprintf(stderr,"TextHeaderAtom, reclen=%ld\n", reclen); */ + DBGPRINT("TextHeaderAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case TEXT_SPEC_INFO: -/* fprintf(stderr,"TextSpecInfo, reclen=%ld\n", reclen); */ + DBGPRINT("TextSpecInfo, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; case STYLE_TEXT_PROP_ATOM: -/* fprintf(stderr,"StyleTextPropAtom, reclen=%ld\n", reclen); */ + DBGPRINT("StyleTextPropAtom, reclen=%ld", reclen); catdoc_seek(input, reclen, SEEK_CUR); break; - /* case : - fprintf(stderr,", reclen=%ld\n", reclen); - catdoc_seek(input, reclen, SEEK_CUR); - break;*/ - - /* case : - fprintf(stderr,", reclen=%ld\n", reclen); - catdoc_seek(input, reclen, SEEK_CUR); - break;*/ 
- default: -/* fprintf(stderr,"Default action for rectype=%d reclen=%ld\n", */ -/* rectype, reclen); */ + DBGPRINT("Default action for rectype=%d reclen=%ld", rectype, reclen); catdoc_seek(input, reclen, SEEK_CUR); } - + } diff --git a/src/ppttypes.h b/src/ppttypes.h index 521e0b3..3693fc5 100644 --- a/src/ppttypes.h +++ b/src/ppttypes.h @@ -1,7 +1,7 @@ /** * @file ppttypes.h * @author Alex Ott - * @date 26 2004 + * @date 26 ��� 2004 * Version: $Id: ppttypes.h,v 1.1 2006-02-24 17:44:06 vitus Exp $ * Copyright: Alex Ott * @@ -53,11 +53,12 @@ #define PROG_STRING_TAG 5001 #define PROG_BINARY_TAG 5002 #define PERSIST_PTR_INCREMENTAL_BLOCK 6002 -/* #define */ -/* #define */ -/* #define */ -/* #define */ -/* #define */ +/* Escher (Office Drawing) container record types, used inside PPDrawing */ +#define ESCHER_DG_CONTAINER 0xF002 /* Drawing container */ +#define ESCHER_SPGR_CONTAINER 0xF003 /* Shape group container */ +#define ESCHER_SP_CONTAINER 0xF004 /* Shape container */ +#define ESCHER_CLIENT_TEXTBOX 0xF00D /* Client textbox - contains PPT text atoms */ +#define ESCHER_CLIENT_DATA 0xF011 /* Client data container */ #endif /* _PPTTYPES_H */ diff --git a/tests/Makefile.am b/tests/Makefile.am index 784975a..74bb5c2 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -4,11 +4,14 @@ # File-based tests (each .expected file is a test) FILE_TESTS = \ basic.doc.expected \ - basic.ppt.expected \ + example-3slides.ppt.expected \ basic.xls.expected \ - hungarian.xls.expected \ test_LO_file.doc.expected \ test_LO_file.ppt.expected \ + file_example_PPT_1MB.ppt.expected \ + protection_warning.ppt.expected \ + test_LO_file.ppt.expected \ + hungarian.xls.expected \ unicode_MS.xls.expected # Custom tests (scripts that don't follow the file pattern) @@ -49,7 +52,7 @@ EXPECTED_LOG_COMPILER = $(srcdir)/run-single-test.sh TEST_LOG_COMPILER = # Expected test failures (known bugs not fixed yet) -XFAIL_TESTS = test_LO_file.ppt.expected +XFAIL_TESTS = # Scripts needed for testing 
check_SCRIPTS = run-single-test.sh \ diff --git a/tests/README.md b/tests/README.md index 71b492c..4dd5e99 100644 --- a/tests/README.md +++ b/tests/README.md @@ -2,11 +2,11 @@ Automatic: install following INSTALL instructions, then run `make check`. Manual: Run each test file through its converter (catdoc for Microsoft Office .doc files, pptdoc for .ppt, and xls2csv for .xls) and compare the output the -corresponding .expected file. +corresponding .expected file, ignoring whitespace differences. For example, - catdoc basic.doc | diff - basic.doc.expected + catdoc basic.doc | diff -u --ignore-blank-lines --ignore-trailing-space - basic.doc.expected ## Test files include: unicode_MS.xls diff --git a/tests/asan_failures/README.md b/tests/asan_failures/README.md index 5c76f42..355c975 100644 --- a/tests/asan_failures/README.md +++ b/tests/asan_failures/README.md @@ -34,27 +34,28 @@ ASAN bug tests. ## Bugs reported against pwarden repo -https://github.com/petewarden/catdoc is the old version 0.93 of catdoc put on GitHub before https://github.com/vbwagner/catdoc (version 0.95) was created. +[petewarden/catdoc](/petewarden/catdoc) is the old version 0.93 of catdoc put on GitHub before [vbwagner/catdoc](/vbwagner/catdoc) (version 0.95) was created. - cve-2023-31979.test - This reproduces https://github.com/petewarden/catdoc/issues/9 using test file + This reproduces petewarden/catdoc#9 using test file global-buffer-overflow, which became CVE-2023-31979 +The other bugs (issues 3-10) [reported against petewarden/catdoc](/petewarden/catdoc/issues) are fixed in later versions of catdoc from vbwagner or in this fork, see comments on those issues. ## Bugs reported against libdoc -https://github.com/uvoteam/libdoc is based on catdoc sources, so some +[uvoteam/libdoc](/uvoteam/libdoc) is based on catdoc sources, so some bugs reported against it apply to catdoc as well. 
- cve-2018-20451.test - This reproduces https://github.com/uvoteam/libdoc/issues/2 using test file + This reproduces uvoteam/libdoc#2 using test file libdoc_reader_process_file_203.overflow, which became CVE-2018-20451 - cve-2018-20453.test - This reproduces https://github.com/uvoteam/libdoc/issues/1 using test file + This reproduces uvoteam/libdoc#1 using test file libdoc_numutils_getlong_22.overflow, which became CVE-2018-20453 ## Miscellaneous memory access bugs @@ -71,7 +72,7 @@ bugs reported against it apply to catdoc as well. ## yangzao bugs -GitHub user @yangzao reported several issues [in vbwagner's catdoc repository](https://github.com/vbwagner/catdoc/issues) +GitHub user @yangzao reported several issues [in vbwagner's catdoc repository](/vbwagner/catdoc/issues) ### Summary @@ -86,14 +87,14 @@ and cause the program to exit with error codes, but they no longer trigger ASAN GitHub issue | Test File | POC Location | Tool | Current Status | Fixed by | Description | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -[vbwagner #6](https://github.com/vbwagner/catdoc/issues/6) | vbwagner-issue-6.test | vbwagner_issue_6/1 | xls2csv | ✅ PASS | d74d3ac | NULL deref in calcFileBlockOffset() - AccessViolation at ole.c:450/544 | -[vbwagner #7](https://github.com/vbwagner/catdoc/issues/7) | vbwagner-issue-7.test | vbwagner_issue_7/2 | xls2csv | ✅ PASS | 1aa8a2c | AccessViolation at xlsparse.c:679 (number2string) | -[vbwagner #8](https://github.com/vbwagner/catdoc/issues/8) | vbwagner-issue-8.test | vbwagner_issue_8/3 | xls2csv | ✅ PASS | 70d2bd1 | heap-buffer-overflow(read) at xlsparse.c:493 | -[vbwagner #9](https://github.com/vbwagner/catdoc/issues/9) | vbwagner-issue-9.test | vbwagner_issue_9/4 | xls2csv | ✅ PASS | 2c156ed | NULL deref in stradd() - AccessViolation at fileutil.c:124 | -[vbwagner #10](https://github.com/vbwagner/catdoc/issues/10) | vbwagner-issue-10.test | vbwagner_issue_10/5 | xls2csv | ✅ PASS | 70d2bd1 | AccessViolation at xlsparse.c:438 | -[vbwagner 
#11](https://github.com/vbwagner/catdoc/issues/11)/CVE-2017-11110 | vbwagner-issue-11.test | vbwagner_issue_11/6 | xls2csv | ✅ PASS | possibly 7c6fd7b | heap-buffer-overflow(read) at numutils.c:22 | -[vbwagner #12](https://github.com/vbwagner/catdoc/issues/12) | vbwagner-issue-12.test | vbwagner_issue_12/7 | xls2csv | ✅ PASS | 4c5e43b | global-buffer-overflow(write) at xlsparse.c:608 | -[vbwagner #13](https://github.com/vbwagner/catdoc/issues/13) | vbwagner-issue-13.test | vbwagner_issue_13/8 | xls2csv | ✅ PASS | 44daea3 | global-buffer-overflow(write) at xlsparse.c:716 | +vbwagner/catdoc#6 | vbwagner-issue-6.test | vbwagner_issue_6/1 | xls2csv | ✅ PASS | d74d3ac | NULL deref in calcFileBlockOffset() - AccessViolation at ole.c:450/544 | +vbwagner/catdoc#7 | vbwagner-issue-7.test | vbwagner_issue_7/2 | xls2csv | ✅ PASS | 1aa8a2c | AccessViolation at xlsparse.c:679 (number2string) | +vbwagner/catdoc#8 | vbwagner-issue-8.test | vbwagner_issue_8/3 | xls2csv | ✅ PASS | 70d2bd1 | heap-buffer-overflow(read) at xlsparse.c:493 | +vbwagner/catdoc#9 | vbwagner-issue-9.test | vbwagner_issue_9/4 | xls2csv | ✅ PASS | 2c156ed | NULL deref in stradd() - AccessViolation at fileutil.c:124 | +vbwagner/catdoc#10 | vbwagner-issue-10.test | vbwagner_issue_10/5 | xls2csv | ✅ PASS | 70d2bd1 | AccessViolation at xlsparse.c:438 | +vbwagner/catdoc#11/CVE-2017-11110 | vbwagner-issue-11.test | vbwagner_issue_11/6 | xls2csv | ✅ PASS | possibly 7c6fd7b | heap-buffer-overflow(read) at numutils.c:22 | +vbwagner/catdoc#12 | vbwagner-issue-12.test | vbwagner_issue_12/7 | xls2csv | ✅ PASS | 4c5e43b | global-buffer-overflow(write) at xlsparse.c:608 | +vbwagner/catdoc#13 | vbwagner-issue-13.test | vbwagner_issue_13/8 | xls2csv | ✅ PASS | 44daea3 | global-buffer-overflow(write) at xlsparse.c:716 | --- diff --git a/tests/example-3slides.ppt b/tests/example-3slides.ppt new file mode 100644 index 0000000..59e8ae2 Binary files /dev/null and b/tests/example-3slides.ppt differ diff --git 
a/tests/example-3slides.ppt.expected b/tests/example-3slides.ppt.expected new file mode 100644 index 0000000..9f02988 --- /dev/null +++ b/tests/example-3slides.ppt.expected @@ -0,0 +1,5 @@ +Example file +Created by: examplefiles.org +Second page +Third page + \ No newline at end of file diff --git a/tests/file_example_PPT_1MB.ppt b/tests/file_example_PPT_1MB.ppt new file mode 100644 index 0000000..34f0fea Binary files /dev/null and b/tests/file_example_PPT_1MB.ppt differ diff --git a/tests/file_example_PPT_1MB.ppt.expected b/tests/file_example_PPT_1MB.ppt.expected new file mode 100644 index 0000000..45e58d8 --- /dev/null +++ b/tests/file_example_PPT_1MB.ppt.expected @@ -0,0 +1,14 @@ +Lorem ipsum +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla. + +Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus. Maecenas non lorem quis tellus placerat varius. Nulla facilisi. Aenean congue fringilla justo ut aliquam. Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante sagittis. Morbi viverra semper lorem nec molestie. Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate. 
+ +Chart +Table +Column 1 +Column 2 +Column 3 +Column 4 +Column 5 +Photo + diff --git a/tests/basic.ppt b/tests/protection_warning.ppt similarity index 100% rename from tests/basic.ppt rename to tests/protection_warning.ppt diff --git a/tests/basic.ppt.expected b/tests/protection_warning.ppt.expected similarity index 76% rename from tests/basic.ppt.expected rename to tests/protection_warning.ppt.expected index 9382ea6..225a545 100644 --- a/tests/basic.ppt.expected +++ b/tests/protection_warning.ppt.expected @@ -1,3 +1,2 @@ This presentation has been IRM protected by policy. - Default Design \ No newline at end of file diff --git a/tests/test_LO_file.ppt b/tests/test_LO_file.ppt index ba62592..7c9e321 100644 Binary files a/tests/test_LO_file.ppt and b/tests/test_LO_file.ppt differ diff --git a/tests/test_LO_file.ppt.expected b/tests/test_LO_file.ppt.expected index d03ce19..b784f88 100644 --- a/tests/test_LO_file.ppt.expected +++ b/tests/test_LO_file.ppt.expected @@ -1,5 +1,6 @@ -Slide 1 header -Slide 2 text -Slide 2 header +Slide 1 subtitle +Slide 1 title +Slide 2 title Slide 2 bullet 1 -Slide 2 bullet 2 +Slide 2 bullet 2 Déjà vu 希望 + \ No newline at end of file
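
Reviewer note on the `src/pptparse.c` change: the fix works because catppt walks the stream as a flat sequence of 8-byte record headers, so "descending" into a container just means not seeking past its body. The sketch below restates that idea over an in-memory buffer so it can be tried in isolation. It is not the code in this PR: every `sketch_`/`REC_` name is hypothetical, the Escher type values are copied from the `ppttypes.h` hunk, the remaining type values are assumed from the MS-PPT record type table, and character set conversion is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names. Escher values are from the ppttypes.h hunk;
 * the other values are assumed from the MS-PPT record type table. */
enum {
    REC_SLIDE                 = 1006,
    REC_NOTES                 = 1008,
    REC_MAIN_MASTER           = 1016,
    REC_PPDRAWING             = 1036,
    REC_TEXT_BYTES_ATOM       = 4008,
    REC_ESCHER_DG_CONTAINER   = 0xF002,
    REC_ESCHER_SPGR_CONTAINER = 0xF003,
    REC_ESCHER_SP_CONTAINER   = 0xF004,
    REC_ESCHER_CLIENT_TEXTBOX = 0xF00D,
    REC_ESCHER_CLIENT_DATA    = 0xF011
};

/* Little-endian readers for the header fields. */
static uint16_t sketch_rd16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}
static uint32_t sketch_rd32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write an 8-byte record header (version/instance bytes left 0); returns 8. */
size_t sketch_put_header(unsigned char *dst, uint16_t type, uint32_t reclen) {
    dst[0] = 0;
    dst[1] = 0;
    dst[2] = (unsigned char)(type & 0xff);
    dst[3] = (unsigned char)(type >> 8);
    dst[4] = (unsigned char)(reclen & 0xff);
    dst[5] = (unsigned char)((reclen >> 8) & 0xff);
    dst[6] = (unsigned char)((reclen >> 16) & 0xff);
    dst[7] = (unsigned char)((reclen >> 24) & 0xff);
    return 8;
}

/* Copy TextBytesAtom text into out (CR becomes LF, LF appended per atom).
 * Returns the number of characters written, excluding the terminator. */
size_t sketch_extract_text(const unsigned char *buf, size_t len,
                           char *out, size_t outsz) {
    size_t pos = 0, n = 0;
    int in_slide = 0;

    if (outsz == 0)
        return 0;
    while (len - pos >= 8) {
        uint16_t type = sketch_rd16(buf + pos + 2);
        uint32_t reclen = sketch_rd32(buf + pos + 4);
        pos += 8;
        switch (type) {
        case REC_SLIDE:
            in_slide = 1;
            continue;                 /* container: children follow */
        case REC_NOTES:
        case REC_MAIN_MASTER:
            in_slide = 0;
            continue;
        case REC_PPDRAWING:
            if (in_slide)
                continue;             /* descend only inside a slide */
            break;                    /* otherwise skip the whole body */
        case REC_ESCHER_DG_CONTAINER:
        case REC_ESCHER_SPGR_CONTAINER:
        case REC_ESCHER_SP_CONTAINER:
        case REC_ESCHER_CLIENT_TEXTBOX:
        case REC_ESCHER_CLIENT_DATA:
            continue;
        default:
            break;
        }
        if (reclen > len - pos)
            break;                    /* truncated record: stop */
        if (type == REC_TEXT_BYTES_ATOM) {
            for (uint32_t i = 0; i < reclen; i++) {
                unsigned char c = buf[pos + i];
                if (n + 1 < outsz)
                    out[n++] = (c == 0x0d) ? '\n' : (char)c;
            }
            if (n + 1 < outsz)
                out[n++] = '\n';
        }
        pos += reclen;
    }
    out[n] = '\0';
    return n;
}
```

As in the PR, a `PPDrawing` under a master is skipped by its record length, which is how theme names stored in master slides stay out of the output, while the same record under a `Slide` is entered so the text atoms inside `ClientTextbox` are reached.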