doc

A Go package for extracting text from Microsoft Word .doc binary files.

Features

  • Extract plain text from Microsoft Word .doc binary files
  • Handle both compressed and uncompressed text formats
  • Support for multiple character encodings, including Chinese characters
  • Simple and easy-to-use API

Installation

go get github.com/lee501/doc

Usage

package main

import (
	"fmt"
	"os"

	"github.com/lee501/doc"
)

func main() {
	// Open Word document
	file, err := os.Open("document.doc")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// Extract text from document
	text, err := doc.ParseDoc(file)
	if err != nil {
		panic(err)
	}

	// Convert reader to string and print
	fmt.Println(text)
}

Features in Detail

  1. Compressed and Uncompressed Text Handling
     • translateCompressedText and translateUncompressedText handle the two text storage formats
     • Compressed text typically uses a single-byte encoding (e.g., ANSI/CP1252)
     • Uncompressed text typically uses a double-byte, little-endian Unicode encoding
  2. Unicode Support
     • translateUncompressedText handles double-byte Unicode characters
     • binary.LittleEndian.Uint16 reads each 16-bit Unicode value
     • Unicode values are converted to UTF-8 on output
  3. Chinese Character Handling
     • handleANSICharacter handles bytes that may belong to Chinese characters
     • GBK encoding support comes from the golang.org/x/text/encoding/simplifiedchinese package
     • GBK decoding is attempted for high-byte characters
  4. Character Mapping
     • replaceCompressed converts Windows-1252 special characters to UTF-8
     • Each mapping is commented with its Unicode code point
  5. Encoding Detection
     • detectChineseEncoding is a helper that identifies potential Chinese encodings
     • It is intended as groundwork for more intelligent encoding detection in the future
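
To make the two text formats above concrete, here is a minimal, standard-library-only sketch of how single-byte and double-byte text can be turned into UTF-8. It is not this package's implementation: decodeCompressed, decodeUncompressed, and cp1252Specials are hypothetical names chosen for illustration, the mapping table covers only a subset of the Windows-1252 special characters, and the GBK fallback is left out because it requires golang.org/x/text.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
	"unicode/utf16"
)

// cp1252Specials is a partial, illustrative table of Windows-1252 bytes in
// the 0x80-0x9F range and the Unicode code points they stand for.
var cp1252Specials = map[byte]rune{
	0x80: '\u20AC', // EURO SIGN
	0x85: '\u2026', // HORIZONTAL ELLIPSIS
	0x91: '\u2018', // LEFT SINGLE QUOTATION MARK
	0x92: '\u2019', // RIGHT SINGLE QUOTATION MARK
	0x93: '\u201C', // LEFT DOUBLE QUOTATION MARK
	0x94: '\u201D', // RIGHT DOUBLE QUOTATION MARK
	0x95: '\u2022', // BULLET
	0x96: '\u2013', // EN DASH
	0x99: '\u2122', // TRADE MARK SIGN
}

// decodeCompressed (hypothetical) treats every byte as one character.
// Bytes found in cp1252Specials are remapped; all others are taken as
// Latin-1, where the byte value equals the Unicode code point.
func decodeCompressed(b []byte) string {
	var sb strings.Builder
	for _, c := range b {
		if r, ok := cp1252Specials[c]; ok {
			sb.WriteRune(r)
			continue
		}
		sb.WriteRune(rune(c))
	}
	return sb.String()
}

// decodeUncompressed (hypothetical) reads the input as little-endian
// 16-bit values and converts them to a UTF-8 string. A trailing odd
// byte, if any, is ignored.
func decodeUncompressed(b []byte) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	return string(utf16.Decode(units))
}

func main() {
	// Single-byte text: curly quotes around "hi".
	fmt.Println(decodeCompressed([]byte{0x93, 'h', 'i', 0x94}))
	// Double-byte text: U+4E2D U+6587 stored low byte first.
	fmt.Println(decodeUncompressed([]byte{0x2D, 0x4E, 0x87, 0x65}))
}
```

Going through utf16.Decode rather than casting each 16-bit value to a rune means surrogate pairs also come out as single characters. A real parser would additionally need the GBK path described in item 3 for documents whose single-byte text is not CP1252.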

Dependencies

License

This project is licensed under the MIT License - see the LICENSE file for details.
