
go-sat


Pure Go library for sentence boundary detection using wtpsplit/SaT ONNX models.

Features

  • Sentence boundary detection using neural models
  • Sentence completeness checking with confidence scores
  • Text segmentation into sentences
  • Automatic chunking for long texts (handles sequences > 512 tokens)
  • Thread-safe with configurable session pooling
  • Pure Go tokenizer (SentencePiece Unigram algorithm)
  • Benchmark tool for evaluating model accuracy

Requirements

  • Go 1.23 or later
  • ONNX Runtime shared library

Installation

go get github.com/jamesainslie/go-sat

ONNX Runtime

The library requires the ONNX Runtime shared library. See the onnxruntime_go installation guide for platform-specific instructions.

macOS (Homebrew):

brew install onnxruntime
export ONNXRUNTIME_SHARED_LIBRARY_PATH="$(brew --prefix onnxruntime)/lib/libonnxruntime.dylib"

Linux:

# Download from https://github.com/microsoft/onnxruntime/releases
# Extract and set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/path/to/onnxruntime/lib:$LD_LIBRARY_PATH

Model Files

Download the required model files from HuggingFace:

# Tokenizer (XLM-RoBERTa SentencePiece model)
curl -L -o sentencepiece.bpe.model \
  "https://huggingface.co/xlm-roberta-base/resolve/main/sentencepiece.bpe.model"

# SaT model (sat-1l-sm: fast, suitable for most use cases)
curl -L -o model.onnx \
  "https://huggingface.co/segment-any-text/sat-1l-sm/resolve/main/model.onnx"

Note: Use model.onnx, not model_optimized.onnx. The optimized model uses different input tensor types.

Usage

Basic Example

package main

import (
    "context"
    "fmt"
    "log"

    sat "github.com/jamesainslie/go-sat"
)

func main() {
    seg, err := sat.New("model.onnx", "sentencepiece.bpe.model")
    if err != nil {
        log.Fatal(err)
    }
    defer seg.Close()

    ctx := context.Background()

    // Check if text is a complete sentence
    complete, confidence, err := seg.IsComplete(ctx, "Hello world.")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Complete: %v (confidence: %.2f)\n", complete, confidence)

    // Split text into sentences
    sentences, err := seg.Segment(ctx, "Hello world. How are you?")
    if err != nil {
        log.Fatal(err)
    }
    for _, s := range sentences {
        fmt.Println(s)
    }
}

Configuration Options

seg, err := sat.New(modelPath, tokenizerPath,
    sat.WithThreshold(0.025),       // Boundary detection threshold (default: 0.025)
    sat.WithPoolSize(4),            // ONNX session pool size (default: runtime.NumCPU())
    sat.WithLogger(slog.Default()), // Custom logger (default: slog.Default())
)

Architecture

Components

Component Description
Segmenter Main API - coordinates tokenization, inference, and boundary detection
Tokenizer Pure Go SentencePiece implementation (Unigram algorithm with Viterbi decoding)
Session Pool Thread-safe pool of ONNX Runtime sessions for concurrent inference
Inference Handles float16 tensor conversion and model execution
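The Unigram decoding the tokenizer performs can be illustrated with a toy vocabulary. This is a sketch of the Viterbi idea only, not the library's actual tokenizer: the vocabulary and log-probabilities below are made up, and real SentencePiece models add normalization and byte-fallback on top of this.

```go
package main

import (
	"fmt"
	"math"
)

// viterbiSegment picks the tokenization of text that maximizes the sum of
// token log-probabilities over a given vocabulary (the Unigram decoding
// idea): best[i] holds the best score for the prefix text[:i], back[i] the
// split point that achieved it.
func viterbiSegment(text string, logProb map[string]float64) []string {
	n := len(text)
	best := make([]float64, n+1)
	back := make([]int, n+1)
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
		for j := 0; j < i; j++ {
			lp, ok := logProb[text[j:i]]
			if !ok {
				continue
			}
			if s := best[j] + lp; s > best[i] {
				best[i], back[i] = s, j
			}
		}
	}
	// Walk the back-pointers from the end to recover the token sequence.
	var toks []string
	for i := n; i > 0; i = back[i] {
		toks = append([]string{text[back[i]:i]}, toks...)
	}
	return toks
}

func main() {
	vocab := map[string]float64{"hello": -1, "he": -2, "llo": -2, "l": -5, "o": -5}
	// The single token "hello" (-1) beats "he"+"llo" (-4), so it wins.
	fmt.Println(viterbiSegment("hello", vocab)) // → [hello]
}
```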

Data Flow

  1. Text input normalized (whitespace handling, SentencePiece prefix)
  2. Tokenized using Viterbi dynamic programming algorithm
  3. Token IDs remapped from SentencePiece to HuggingFace convention
  4. For long texts (>512 tokens), split into overlapping chunks (64 token overlap)
  5. ONNX model inference produces per-token boundary logits (float16)
  6. Logits converted to float32, averaged in overlap regions
  7. Sigmoid applied; positions above threshold mark sentence boundaries
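Steps 5–7 can be sketched in miniature. The sigmoid-then-threshold logic below mirrors the description above, but the logit values are invented for illustration and the real model emits one logit per token of actual input:

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid maps a raw logit to a probability in (0, 1).
func sigmoid(x float32) float32 {
	return float32(1 / (1 + math.Exp(-float64(x))))
}

// boundaries returns the token positions whose boundary probability
// exceeds the threshold, i.e. the positions treated as sentence ends.
func boundaries(logits []float32, threshold float32) []int {
	var out []int
	for i, l := range logits {
		if sigmoid(l) > threshold {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	// Hypothetical per-token logits: strongly negative everywhere except
	// positions 3 and 7, which stand in for sentence-final tokens.
	logits := []float32{-8, -9, -7, 2.5, -8, -9, -7, 1.0}
	fmt.Println(boundaries(logits, 0.025)) // → [3 7]
}
```

Note that the default threshold of 0.025 is applied to the post-sigmoid probability, which is why even mildly positive logits clear it easily while strongly negative ones do not.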

Model Requirements

The SaT model expects:

  • input_ids: int64 tensor of token IDs
  • attention_mask: float16 tensor (1.0 for valid tokens)
  • Output: float16 logits with shape [batch, sequence, 1]
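Because the model's logits arrive as float16 while Go has no native half-precision type, step 6 of the data flow implies a decode from raw uint16 bits. The helper below is a standard IEEE 754 half-to-single widening; the library's own conversion may differ in detail:

```go
package main

import (
	"fmt"
	"math"
)

// float16ToFloat32 decodes an IEEE 754 half-precision value stored in a
// uint16 by widening its sign, exponent, and mantissa fields to float32.
func float16ToFloat32(h uint16) float32 {
	sign := uint32(h&0x8000) << 16
	exp := (h >> 10) & 0x1F
	mant := uint32(h & 0x3FF)

	var bits uint32
	switch {
	case exp == 0x1F: // Inf or NaN: all-ones exponent carries over
		bits = sign | 0x7F800000 | mant<<13
	case exp == 0 && mant == 0: // signed zero
		bits = sign
	case exp == 0: // subnormal: renormalize into a float32 normal
		e := uint32(113) // 127 - 15 + 1
		for mant&0x400 == 0 {
			mant <<= 1
			e--
		}
		bits = sign | e<<23 | (mant&0x3FF)<<13
	default: // normal: rebias exponent from 15 to 127
		bits = sign | (uint32(exp)+112)<<23 | mant<<13
	}
	return math.Float32frombits(bits)
}

func main() {
	fmt.Println(float16ToFloat32(0x3C00)) // → 1
	fmt.Println(float16ToFloat32(0xC000)) // → -2
}
```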

See docs/ARCHITECTURE.md for detailed technical documentation.

API Reference

Function Description
New(modelPath, tokenizerPath string, opts ...Option) Create a new Segmenter
(*Segmenter).IsComplete(ctx, text) (bool, float32, error) Check if text is a complete sentence
(*Segmenter).Segment(ctx, text) ([]string, error) Split text into sentences
(*Segmenter).Close() error Release all resources

See docs/API.md for detailed API documentation with examples.

CLI Tools

sat-cli

Command-line tool for testing and debugging:

go install github.com/jamesainslie/go-sat/cmd/sat-cli@latest

# Check sentence completeness
sat-cli -model model.onnx -tokenizer sentencepiece.bpe.model "Hello world."

# Segment text into sentences
sat-cli -model model.onnx -tokenizer sentencepiece.bpe.model -mode segment "Hello. World."

sat-bench

Benchmark tool for evaluating model accuracy against a test corpus:

go install github.com/jamesainslie/go-sat/cmd/sat-bench@latest

# Evaluate against UD-EWT gold-standard corpus (recommended)
./scripts/fetch-ud-ewt.sh && go run ./scripts/process-ud-ewt.go
sat-bench -model model.onnx -tokenizer sentencepiece.bpe.model -corpus testdata/ud-ewt-test -tolerance 10 -sweep

# Or evaluate against TED transcripts
sat-bench -model model.onnx -tokenizer sentencepiece.bpe.model -corpus testdata/ted -threshold 0.025

Example output (UD-EWT test set):

Precision: 0.87  Recall: 0.80  F1: 0.84  Weighted: 0.84
(TP: 1662, FP: 239, FN: 415)

See docs/BENCHMARKING.md for detailed guidance on interpreting results and corpus formats.

Building

The project uses Stave for build automation:

# Install stave
go install github.com/yaklabco/stave/cmd/stave@latest

# Build everything (lint, test, build)
stave

# Quick shortcuts
stave b          # Build binaries
stave t          # Run tests
stave l          # Run linter

# Run benchmarks (auto-configures ONNX runtime on macOS)
stave bench:run
stave bench:sweep

See stave --list for all available targets.

Performance

  • Inference latency: Typically < 50ms for utterances under 50 tokens
  • Thread-safe: Multiple goroutines can call methods concurrently
  • Session pooling: Configurable pool size to balance memory and throughput

The default pool size equals runtime.NumCPU(). For memory-constrained environments, reduce pool size. For high-throughput scenarios, ensure pool size matches expected concurrency.
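A buffered channel is the idiomatic Go way to build the kind of fixed-size session pool described above. The stand-alone sketch below uses a placeholder session type to show the acquire/release pattern; it is an illustration of the technique, not the library's actual implementation:

```go
package main

import "fmt"

// session stands in for an ONNX Runtime session; the real resource would
// hold model state and be expensive to create.
type session struct{ id int }

// pool hands out sessions via a buffered channel: acquire blocks when all
// sessions are in use, which naturally caps concurrency at the pool size.
type pool struct{ ch chan *session }

func newPool(n int) *pool {
	p := &pool{ch: make(chan *session, n)}
	for i := 0; i < n; i++ {
		p.ch <- &session{id: i}
	}
	return p
}

func (p *pool) acquire() *session  { return <-p.ch }
func (p *pool) release(s *session) { p.ch <- s }

func main() {
	p := newPool(2)
	s := p.acquire()
	defer p.release(s)
	fmt.Println("acquired session", s.id) // → acquired session 0
}
```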

Error Handling

The library defines sentinel errors for conditions callers may need to handle:

var (
    ErrModelNotFound   = errors.New("sat: model file not found")
    ErrInvalidModel    = errors.New("sat: invalid model format")
    ErrTokenizerFailed = errors.New("sat: tokenizer initialization failed")
)

Example error handling:

seg, err := sat.New(modelPath, tokenizerPath)
if errors.Is(err, sat.ErrModelNotFound) {
    // Handle missing model file
}

Contributing

Contributions are welcome. See CONTRIBUTING.md for guidelines.

License

MIT
