DKSplit for Go. A high-performance word segmentation library. Split domain names and concatenated strings into words using BiLSTM-CRF + ONNX Runtime.

DKSplit-go

⚠️ Security Notice: The only official repositories for this project are ABTdomain/dksplit (Python) and ABTdomain/dksplit-go (Go). We are aware of unauthorized clones that may distribute suspicious files. Please only download from our official repositories.

Go implementation of DKSplit - fast word segmentation for text without spaces.

Built with a BiLSTM-CRF model and ONNX Runtime.

Performance

CPU                    Mode    QPS
Intel Core i9-14900K   Single  ~1,700/s
Intel Core i9-14900K   Batch   ~7,000/s
Intel Core i9-9900K    Single  ~1,000/s
Intel Core i9-9900K    Batch   ~3,000/s

Batch mode is roughly 3x to 4x faster than single mode, based on the figures above.

Compared to the Python version:

  • Single: 2.7x faster
  • Batch: 5.6x faster
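
Throughput figures like these come down to simple arithmetic: QPS is the number of inputs processed divided by elapsed wall-clock time, and a speedup is the ratio of two QPS values. The sketch below is only an illustration of that arithmetic. The helpers `qps` and `speedup` are hypothetical names, not part of dksplit-go, and the numbers in `main` are made up rather than taken from the benchmark.

```go
package main

import "fmt"

// qps returns queries per second for n items processed in the given
// number of seconds. It returns 0 for a non-positive duration.
func qps(n int, seconds float64) float64 {
	if seconds <= 0 {
		return 0
	}
	return float64(n) / seconds
}

// speedup returns how many times faster a candidate throughput is
// than a baseline throughput. It returns 0 for a non-positive baseline.
func speedup(candidateQPS, baselineQPS float64) float64 {
	if baselineQPS <= 0 {
		return 0
	}
	return candidateQPS / baselineQPS
}

func main() {
	// Hypothetical measurements, not taken from the benchmark above.
	single := qps(1000, 1.0) // 1,000 inputs in 1 s
	batch := qps(10000, 2.5) // 10,000 inputs in 2.5 s
	fmt.Printf("single: %.0f/s, batch: %.0f/s, speedup: %.1fx\n",
		single, batch, speedup(batch, single))
	// Prints: single: 1000/s, batch: 4000/s, speedup: 4.0x
}
```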

Install

go get github.com/ABTdomain/dksplit-go

Usage

package main

import (
    "fmt"
    "log"

    dksplit "github.com/ABTdomain/dksplit-go"
)

func main() {
    splitter, err := dksplit.New("models")
    if err != nil {
        log.Fatal(err)
    }
    defer splitter.Close()

    // Single input
    result, err := splitter.Split("chatgptlogin")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(result)
    // Output: [chatgpt login]

    // Batch input
    results, err := splitter.SplitBatch([]string{"openaikey", "microsoftoffice"}, 256)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(results)
    // Output: [[openai key] [microsoft office]]
}
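
The usage example passes bare lowercase labels such as "chatgptlogin". This README does not say whether Split accepts full domain names or mixed case, so callers may want to normalize input first. The following is a minimal sketch under that assumption. The helper `label` is hypothetical and not part of dksplit-go, and it simply takes the text before the first dot, so it does not handle subdomains such as "www".

```go
package main

import (
	"fmt"
	"strings"
)

// label returns the lowercased text before the first dot of a domain
// name, with surrounding whitespace removed. Hypothetical helper.
func label(domain string) string {
	s := strings.ToLower(strings.TrimSpace(domain))
	if i := strings.IndexByte(s, '.'); i >= 0 {
		s = s[:i]
	}
	return s
}

func main() {
	for _, d := range []string{"ChatGPTLogin.com", "microsoftoffice", " PsychologyToday.com "} {
		fmt.Println(label(d))
	}
	// Prints:
	// chatgptlogin
	// microsoftoffice
	// psychologytoday
}
```

The normalized strings can then be passed to Split or SplitBatch as in the usage example.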

Examples

Input               Output
chatgptlogin        chatgpt login
kubernetescluster   kubernetes cluster
microsoftoffice     microsoft office
mercibeaucoup       merci beaucoup
gutenmorgen         guten morgen

Real World Benchmark

Tested on Majestic Million domains:

Input                       Output
amitriptylineinfo           amitriptyline info
autoriteprotectiondonnees   autorite protection donnees
mountaingoatsoftware        mountain goat software
psychologytoday             psychology today
affordablecollegesonline    affordable colleges online
stephenwolfram              stephen wolfram
ralphlauren                 ralphlauren
m12ivermectin               m12i vermectin

Run the benchmark yourself:

wget https://downloads.majestic.com/majestic_million.csv -O top-1m.csv
go test -v -run TestRealWorldBenchmark

Accuracy Benchmark

For detailed accuracy benchmarks on 1,000 real newly registered domains (DKSplit vs WordSegment vs WordNinja vs GPT-5.2), see the Python version benchmark.

The Go and Python versions use the same model and produce identical results.
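
One way to check the "identical results" claim on your own data is to run both versions over the same inputs and compare the segmentations position by position. The sketch below is one possible comparison. The helper `mismatches` is hypothetical, and the sample outputs in `main` are invented for illustration rather than produced by either implementation.

```go
package main

import (
	"fmt"
	"slices"
)

// mismatches returns the indices at which two lists of segmentations
// differ. If the lists have different lengths, every index present in
// only one of them counts as a mismatch.
func mismatches(a, b [][]string) []int {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	var out []int
	for i := 0; i < n; i++ {
		if i >= len(a) || i >= len(b) || !slices.Equal(a[i], b[i]) {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	// Invented sample data: the second entry differs on purpose.
	goOut := [][]string{{"chatgpt", "login"}, {"openai", "key"}}
	pyOut := [][]string{{"chatgpt", "login"}, {"open", "ai", "key"}}
	fmt.Println(mismatches(goOut, pyOut)) // prints [1]
}
```

An empty result means the two lists agree on every input.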

Results on Intel Core i9-9900K:

  • Dataset: 10,000 unique domains (length > 10, no hyphens)
  • QPS: 3,175/s
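
The dataset criteria above (unique, length > 10, no hyphens) can be expressed as a small filter. This is a sketch only, not the benchmark's actual code. The helper `selectDomains` is a hypothetical name, and it assumes the length is measured on the string passed in, since the README does not say whether the TLD is included in the count.

```go
package main

import (
	"fmt"
	"strings"
)

// selectDomains returns up to limit unique names that are longer than
// 10 bytes and contain no hyphen, preserving input order.
func selectDomains(names []string, limit int) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, n := range names {
		if len(out) >= limit {
			break
		}
		if len(n) <= 10 || strings.Contains(n, "-") {
			continue
		}
		if _, dup := seen[n]; dup {
			continue
		}
		seen[n] = struct{}{}
		out = append(out, n)
	}
	return out
}

func main() {
	names := []string{
		"psychologytoday",        // kept
		"short",                  // too short
		"mountain-goat-software", // contains hyphens
		"psychologytoday",        // duplicate
		"stephenwolfram",         // kept
	}
	fmt.Println(selectDomains(names, 10000))
	// Prints: [psychologytoday stephenwolfram]
}
```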

Requirements

  • Go 1.21+
  • Linux x64

License

This project is licensed under the Apache License 2.0.

Please attribute as: DKsplit by ABTdomain
