A Go package for extracting text from Microsoft Word .doc binary files.
- Extract plain text from Microsoft Word .doc binary files
- Handle both compressed and uncompressed text formats
- Support for multiple character encodings, including Chinese characters
- Simple and easy-to-use API
go get github.com/lee501/doc

package main

import (
	"fmt"
	"github.com/lee501/doc"
	"os"
)

func main() {
	// Open Word document
	file, err := os.Open("document.doc")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// Extract text from document
	text, err := doc.ParseDoc(file)
	if err != nil {
		panic(err)
	}

	// Convert reader to string and print
	fmt.Println(text)
}

- Support Compressed and Uncompressed Text Handling
  - translateCompressedText and translateUncompressedText handle the two text formats
  - Compressed text typically uses a single-byte encoding (e.g., ANSI/CP1252)
  - Uncompressed text typically uses two-byte UTF-16LE Unicode encoding
- Enhanced Unicode Support
  - Properly handle double-byte Unicode characters in translateUncompressedText
  - Use binary.LittleEndian.Uint16 to read UTF-16 code units
  - Convert the decoded code points to UTF-8 output
- Improved Chinese Character Handling
  - handleANSICharacter function handles potential Chinese characters
  - Integrated the golang.org/x/text/encoding/simplifiedchinese package for GBK encoding support
  - Attempt GBK decoding for high-byte characters
- Better Character Mapping
  - replaceCompressed function correctly converts Windows-1252 special characters to UTF-8
  - Added detailed Unicode code point comments
- Encoding Detection
  - detectChineseEncoding helper function identifies potential Chinese encodings
  - Lays the foundation for more intelligent encoding detection in the future
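
The two decoding paths described in the list above can be illustrated with a minimal, standard-library-only sketch. This is not the package's actual code: the names decodeCompressed, decodeUncompressed, and cp1252High are hypothetical, and the sketch assumes compressed text is plain CP1252 and uncompressed text is UTF-16LE. It leaves out the GBK fallback for high bytes entirely.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
	"unicode/utf16"
)

// cp1252High maps the Windows-1252 bytes 0x80-0x9F, the only range that
// differs from Latin-1, to Unicode code points. Zero marks the five
// bytes that CP1252 leaves undefined.
var cp1252High = [32]rune{
	0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
	0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0,
	0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
	0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178,
}

// decodeCompressed (hypothetical name) converts single-byte CP1252 text
// to a UTF-8 string. Undefined bytes become U+FFFD.
func decodeCompressed(b []byte) string {
	var sb strings.Builder
	for _, c := range b {
		r := rune(c) // outside 0x80-0x9F the byte value equals the code point
		if c >= 0x80 && c <= 0x9F {
			if r = cp1252High[c-0x80]; r == 0 {
				r = '\uFFFD'
			}
		}
		sb.WriteRune(r)
	}
	return sb.String()
}

// decodeUncompressed (hypothetical name) reads little-endian UTF-16 code
// units and converts them to a UTF-8 string. A trailing odd byte is ignored.
func decodeUncompressed(b []byte) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	// utf16.Decode combines surrogate pairs into single code points.
	return string(utf16.Decode(units))
}

func main() {
	fmt.Println(decodeCompressed([]byte{0x93, 'H', 'i', 0x94}))     // “Hi”
	fmt.Println(decodeUncompressed([]byte{0x2D, 0x4E, 0x87, 0x65})) // 中文
}
```

Going through utf16.Decode rather than casting each 16-bit value to a rune keeps characters outside the Basic Multilingual Plane intact, since those are stored as surrogate pairs.
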
This project is licensed under the MIT License - see the LICENSE file for details.