doc

A Go package for extracting text from Microsoft Word .doc binary files.

Features

Extract plain text from Microsoft Word .doc binary files
Handle both compressed and uncompressed text formats
Support for multiple character encodings, including Chinese characters
Simple and easy-to-use API

Installation

go get github.com/lee501/doc

Usage

package main

import (
	"fmt"
	"os"

	"github.com/lee501/doc"
)

func main() {
	// Open Word document
	file, err := os.Open("document.doc")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// Extract text from document
	text, err := doc.ParseDoc(file)
	if err != nil {
		panic(err)
	}

	// Print the extracted text. If ParseDoc returns an io.Reader rather than a
	// string, read it first (for example with io.ReadAll).
	fmt.Println(text)
}

Features in Detail

Compressed and Uncompressed Text Handling

Text is decoded by translateCompressedText or translateUncompressedText, depending on how it is stored
Compressed text typically uses a single-byte encoding (e.g., ANSI/CP1252)
Uncompressed text typically uses a double-byte Unicode encoding (little-endian UTF-16)
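
The choice between the two storage forms is recorded per text piece in the file. The following is a minimal sketch of that decision, assuming the FcCompressed layout described in the MS-DOC specification (bit 30 is the compression flag, and a compressed piece starts at half the stored offset). The names pieceLocation and locatePiece are hypothetical and are not part of this package's API.

```go
package main

import "fmt"

// pieceLocation describes where a run of text lives in the document stream.
// Hypothetical type for illustration; not part of this package's API.
type pieceLocation struct {
	Offset       uint32 // byte offset of the first character
	BytesPerChar uint32 // 1 for compressed (8-bit), 2 for uncompressed (UTF-16LE)
}

// locatePiece interprets a raw 32-bit FcCompressed value.
// Assumption (MS-DOC): bits 0-29 hold fc, bit 30 is the fCompressed flag.
func locatePiece(raw uint32) pieceLocation {
	const compressedFlag = 1 << 30
	fc := raw & (compressedFlag - 1) // keep the low 30 bits
	if raw&compressedFlag != 0 {
		// Compressed text: one byte per character, stored at fc/2.
		return pieceLocation{Offset: fc / 2, BytesPerChar: 1}
	}
	// Uncompressed text: two bytes per character, stored at fc.
	return pieceLocation{Offset: fc, BytesPerChar: 2}
}

func main() {
	fmt.Printf("%+v\n", locatePiece(0x40001000)) // flag set
	fmt.Printf("%+v\n", locatePiece(0x00000800)) // flag clear
}
```

The caller would then pass BytesPerChar times the character count of raw bytes to the matching translate function.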

Enhanced Unicode Support

translateUncompressedText handles double-byte Unicode characters
Each 16-bit code unit is read with binary.LittleEndian.Uint16
The decoded code points are written out as UTF-8
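
The steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the package's actual translateUncompressedText: the helper name decodeUTF16LE is hypothetical, and surrogate pairs are handled through unicode/utf16, which the package may or may not do.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE converts little-endian UTF-16 bytes to a Go (UTF-8) string.
// A trailing odd byte is ignored. Hypothetical helper for illustration.
func decodeUTF16LE(b []byte) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	// utf16.Decode combines surrogate pairs into runes; converting the
	// rune slice to a string encodes it as UTF-8.
	return string(utf16.Decode(units))
}

func main() {
	// "Hi" followed by U+4E2D and U+6587, each stored low byte first.
	raw := []byte{0x48, 0x00, 0x69, 0x00, 0x2D, 0x4E, 0x87, 0x65}
	fmt.Println(decodeUTF16LE(raw))
}
```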

Improved Chinese Character Handling

handleANSICharacter handles bytes in single-byte text that may belong to Chinese characters
GBK support comes from the golang.org/x/text/encoding/simplifiedchinese package
GBK decoding is attempted for high-byte characters (bytes of 0x80 and above)
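
To illustrate the high-byte branch, here is a minimal sketch that only decides where a GBK double-byte sequence could start. It uses the standard library alone. In the package, decoding itself is delegated to the GBK decoder from golang.org/x/text. The helpers isGBKPair and splitANSI are hypothetical, and handleANSICharacter may be structured differently.

```go
package main

import "fmt"

// isGBKPair reports whether (lead, trail) falls in the GBK double-byte range:
// lead 0x81-0xFE, trail 0x40-0xFE excluding 0x7F.
func isGBKPair(lead, trail byte) bool {
	return lead >= 0x81 && lead <= 0xFE &&
		trail >= 0x40 && trail <= 0xFE && trail != 0x7F
}

// splitANSI cuts a single-byte text run into units: one byte for ASCII or
// unpaired high bytes, two bytes where a GBK pair is plausible.
// Hypothetical helper for illustration.
func splitANSI(b []byte) [][]byte {
	var units [][]byte
	for i := 0; i < len(b); {
		if b[i] >= 0x80 && i+1 < len(b) && isGBKPair(b[i], b[i+1]) {
			units = append(units, b[i:i+2])
			i += 2
			continue
		}
		units = append(units, b[i:i+1])
		i++
	}
	return units
}

func main() {
	// 'A', then 0xD6 0xD0 (the GBK bytes for U+4E2D), then 'B'.
	for _, u := range splitANSI([]byte{0x41, 0xD6, 0xD0, 0x42}) {
		fmt.Printf("% X\n", u)
	}
}
```

A range check like this is ambiguous on its own: 0xE9 0x61 is both a valid GBK pair and the two Windows-1252 characters U+00E9 and 'a', so a real implementation needs a fallback when GBK decoding produces implausible output.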

Better Character Mapping

replaceCompressed converts Windows-1252 special characters to their correct UTF-8 output
Each mapping is commented with its Unicode code point
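
A mapping of this kind can be sketched as a small lookup table. The table below is partial and illustrative, covering only some of the bytes in the 0x80-0x9F range where Windows-1252 differs from Latin-1. The names cp1252Specials and cp1252ToUTF8 are hypothetical and do not reflect how replaceCompressed is actually written.

```go
package main

import "fmt"

// cp1252Specials maps selected Windows-1252 bytes to Unicode code points.
// Partial table for illustration only.
var cp1252Specials = map[byte]rune{
	0x80: '\u20AC', // EURO SIGN
	0x85: '\u2026', // HORIZONTAL ELLIPSIS
	0x91: '\u2018', // LEFT SINGLE QUOTATION MARK
	0x92: '\u2019', // RIGHT SINGLE QUOTATION MARK
	0x93: '\u201C', // LEFT DOUBLE QUOTATION MARK
	0x94: '\u201D', // RIGHT DOUBLE QUOTATION MARK
	0x95: '\u2022', // BULLET
	0x96: '\u2013', // EN DASH
	0x97: '\u2014', // EM DASH
	0x99: '\u2122', // TRADE MARK SIGN
}

// cp1252ToUTF8 converts single-byte text to a UTF-8 string. Bytes listed in
// the table use their mapped code point; every other byte maps to the code
// point of the same value, which matches Windows-1252 for 0x00-0x7F and
// 0xA0-0xFF.
func cp1252ToUTF8(b []byte) string {
	runes := make([]rune, 0, len(b))
	for _, c := range b {
		if r, ok := cp1252Specials[c]; ok {
			runes = append(runes, r)
		} else {
			runes = append(runes, rune(c))
		}
	}
	return string(runes)
}

func main() {
	// Curly-quoted "Hi" followed by an ellipsis.
	fmt.Println(cp1252ToUTF8([]byte{0x93, 'H', 'i', 0x94, 0x85}))
}
```

Bytes in 0x80-0x9F that are missing from this partial table pass through as C1 control code points, so a complete implementation would list the whole range.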

Encoding Detection

detectChineseEncoding helper identifies potential Chinese encodings
Provides a foundation for more intelligent encoding detection in the future
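
As a rough idea of what such detection can look like, the sketch below classifies a byte slice by checking for valid UTF-8 first and then testing whether every high byte starts a well-formed GBK pair. This is an assumed heuristic for illustration, not the logic of detectChineseEncoding. The function name guessEncoding and its return labels are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// guessEncoding returns "ascii", "utf-8", "gbk", or "unknown".
// Hypothetical heuristic for illustration.
func guessEncoding(b []byte) string {
	high, paired := 0, 0
	for i := 0; i < len(b); i++ {
		if b[i] < 0x80 {
			continue
		}
		high++ // a high byte in lead position
		if i+1 < len(b) && b[i] >= 0x81 && b[i] <= 0xFE &&
			b[i+1] >= 0x40 && b[i+1] <= 0xFE && b[i+1] != 0x7F {
			paired++
			i++ // skip the trail byte
		}
	}
	switch {
	case high == 0:
		return "ascii"
	case utf8.Valid(b):
		return "utf-8"
	case paired == high:
		return "gbk"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(guessEncoding([]byte("hello")))
	fmt.Println(guessEncoding([]byte{0xD6, 0xD0, 0xCE, 0xC4})) // GBK bytes for U+4E2D U+6587
	fmt.Println(guessEncoding([]byte("\u4E2D\u6587")))         // the same text as UTF-8
}
```

A structural check like this cannot tell GBK apart from other double-byte encodings with overlapping ranges, which is one reason more intelligent detection would need character frequency data or document metadata.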

Dependencies

golang.org/x/text (encoding/simplifiedchinese, used for GBK decoding)

License

This project is licensed under the MIT License - see the LICENSE file for details.
