36 changes: 18 additions & 18 deletions README.md
@@ -1,22 +1,22 @@
<h1 align="center">strutil</h1>

<p align="center">
-<a href="https://github.com/adrg/strutil/actions/workflows/tests.yml">
-<img alt="Tests status" src="https://github.com/adrg/strutil/actions/workflows/tests.yml/badge.svg">
+<a href="https://github.com/dorzzz/strutil/actions/workflows/tests.yml">
+<img alt="Tests status" src="https://github.com/dorzzz/strutil/actions/workflows/tests.yml/badge.svg">
</a>
<a href="https://codecov.io/gh/adrg/strutil">
<img alt="Code coverage" src="https://codecov.io/gh/adrg/strutil/branch/master/graphs/badge.svg?branch=master" />
</a>
-<a href="https://pkg.go.dev/github.com/adrg/strutil">
-<img alt="pkg.go.dev documentation" src="https://pkg.go.dev/badge/github.com/adrg/strutil" />
+<a href="https://pkg.go.dev/github.com/dorzzz/strutil">
+<img alt="pkg.go.dev documentation" src="https://pkg.go.dev/badge/github.com/dorzzz/strutil" />
</a>
<a href="https://opensource.org/licenses/MIT" rel="nofollow">
<img alt="MIT license" src="https://img.shields.io/github/license/adrg/strutil" />
</a>
-<a href="https://goreportcard.com/report/github.com/adrg/strutil">
-<img alt="Go report card" src="https://goreportcard.com/badge/github.com/adrg/strutil" />
+<a href="https://goreportcard.com/report/github.com/dorzzz/strutil">
+<img alt="Go report card" src="https://goreportcard.com/badge/github.com/dorzzz/strutil" />
</a>
-<a href="https://github.com/adrg/strutil/issues">
+<a href="https://github.com/dorzzz/strutil/issues">
<img alt="GitHub issues" src="https://img.shields.io/github/issues/adrg/strutil" />
</a>
<a href="https://ko-fi.com/T6T72WATK">
@@ -26,12 +26,12 @@

strutil provides a collection of string metrics for calculating string similarity as well as
other string utility functions.
-Full documentation can be found at https://pkg.go.dev/github.com/adrg/strutil.
+Full documentation can be found at https://pkg.go.dev/github.com/dorzzz/strutil.

## Installation

```
-go get github.com/adrg/strutil
+go get github.com/dorzzz/strutil
```

## String metrics
@@ -60,7 +60,7 @@ func Similarity(a, b string, metric StringMetric) float64 {
```

All defined string metrics can be found in the
-[metrics](https://pkg.go.dev/github.com/adrg/strutil/metrics) package.
+[metrics](https://pkg.go.dev/github.com/dorzzz/strutil/metrics) package.

#### Hamming

@@ -77,7 +77,7 @@ fmt.Printf("%d\n", ham.Distance("one", "once")) // Output: 2
```

More information and additional examples can be found on
-[pkg.go.dev](https://pkg.go.dev/github.com/adrg/strutil/metrics#Hamming).
+[pkg.go.dev](https://pkg.go.dev/github.com/dorzzz/strutil/metrics#Hamming).

#### Levenshtein

@@ -106,7 +106,7 @@ fmt.Printf("%d\n", lev.Distance("graph", "giraffe")) // Output: 4
```

More information and additional examples can be found on
-[pkg.go.dev](https://pkg.go.dev/github.com/adrg/strutil/metrics#Levenshtein).
+[pkg.go.dev](https://pkg.go.dev/github.com/dorzzz/strutil/metrics#Levenshtein).

#### Jaro

@@ -116,7 +116,7 @@ fmt.Printf("%.2f\n", similarity) // Output: 0.78
```

More information and additional examples can be found on
-[pkg.go.dev](https://pkg.go.dev/github.com/adrg/strutil/metrics#Jaro).
+[pkg.go.dev](https://pkg.go.dev/github.com/dorzzz/strutil/metrics#Jaro).

#### Jaro-Winkler

@@ -126,7 +126,7 @@ fmt.Printf("%.2f\n", similarity) // Output: 0.80
```

More information and additional examples can be found on
-[pkg.go.dev](https://pkg.go.dev/github.com/adrg/strutil/metrics#JaroWinkler).
+[pkg.go.dev](https://pkg.go.dev/github.com/dorzzz/strutil/metrics#JaroWinkler).

#### Smith-Waterman-Gotoh

@@ -152,7 +152,7 @@ fmt.Printf("%.2f\n", similarity) // Output: 0.96
```

More information and additional examples can be found on
-[pkg.go.dev](https://pkg.go.dev/github.com/adrg/strutil/metrics#SmithWatermanGotoh).
+[pkg.go.dev](https://pkg.go.dev/github.com/dorzzz/strutil/metrics#SmithWatermanGotoh).

#### Sorensen-Dice

@@ -174,7 +174,7 @@ fmt.Printf("%.2f\n", similarity) // Output: 0.53
```

More information and additional examples can be found on
-[pkg.go.dev](https://pkg.go.dev/github.com/adrg/strutil/metrics#SorensenDice).
+[pkg.go.dev](https://pkg.go.dev/github.com/dorzzz/strutil/metrics#SorensenDice).

#### Jaccard

@@ -214,7 +214,7 @@ where SD is the Sorensen-Dice coefficient and J is the Jaccard index.
```

More information and additional examples can be found on
-[pkg.go.dev](https://pkg.go.dev/github.com/adrg/strutil/metrics#Jaccard).
+[pkg.go.dev](https://pkg.go.dev/github.com/dorzzz/strutil/metrics#Jaccard).

#### Overlap Coefficient

@@ -236,7 +236,7 @@ fmt.Printf("%.2f\n", similarity) // Output: 0.57
```

More information and additional examples can be found on
-[pkg.go.dev](https://pkg.go.dev/github.com/adrg/strutil/metrics#OverlapCoefficient).
+[pkg.go.dev](https://pkg.go.dev/github.com/dorzzz/strutil/metrics#OverlapCoefficient).

## References


4 changes: 2 additions & 2 deletions example_test.go
@@ -3,8 +3,8 @@ package strutil_test
import (
	"fmt"

-	"github.com/adrg/strutil"
-	"github.com/adrg/strutil/metrics"
+	"github.com/dorzzz/strutil"
+	"github.com/dorzzz/strutil/metrics"
)

func ExampleSimilarity() {

2 changes: 1 addition & 1 deletion go.mod
@@ -1,4 +1,4 @@
-module github.com/adrg/strutil
+module github.com/dorzzz/strutil

go 1.19

2 changes: 1 addition & 1 deletion internal/mathutil/mathutil_test.go
@@ -3,7 +3,7 @@ package mathutil_test
import (
	"testing"

-	"github.com/adrg/strutil/internal/mathutil"
+	"github.com/dorzzz/strutil/internal/mathutil"
	"github.com/stretchr/testify/require"
)

2 changes: 1 addition & 1 deletion internal/ngram/ngram.go
@@ -1,6 +1,6 @@
package ngram

-import "github.com/adrg/strutil/internal/mathutil"
+import "github.com/dorzzz/strutil/internal/mathutil"

// Count returns the n-gram count of the specified size for the
// provided term. An n-gram size of 1 is used if the provided size is

2 changes: 1 addition & 1 deletion internal/ngram/ngram_test.go
@@ -3,7 +3,7 @@ package ngram_test
import (
	"testing"

-	"github.com/adrg/strutil/internal/ngram"
+	"github.com/dorzzz/strutil/internal/ngram"
	"github.com/stretchr/testify/require"
)

2 changes: 1 addition & 1 deletion internal/stringutil/stringutil_test.go
@@ -3,7 +3,7 @@ package stringutil_test
import (
	"testing"

-	"github.com/adrg/strutil/internal/stringutil"
+	"github.com/dorzzz/strutil/internal/stringutil"
	"github.com/stretchr/testify/require"
)

2 changes: 1 addition & 1 deletion metrics/examples_test.go
@@ -3,7 +3,7 @@ package metrics_test
import (
	"fmt"

-	"github.com/adrg/strutil/metrics"
+	"github.com/dorzzz/strutil/metrics"
)

func ExampleHamming() {

2 changes: 1 addition & 1 deletion metrics/jaccard.go
@@ -3,7 +3,7 @@ package metrics
import (
	"strings"

-	"github.com/adrg/strutil/internal/ngram"
+	"github.com/dorzzz/strutil/internal/ngram"
)

// Jaccard represents the Jaccard index for measuring the similarity

125 changes: 117 additions & 8 deletions metrics/jaro.go
@@ -4,39 +4,44 @@ import (
	"strings"
	"unicode/utf8"

-	"github.com/adrg/strutil/internal/mathutil"
+	"github.com/dorzzz/strutil/internal/mathutil"
)

// Jaro represents the Jaro metric for measuring the similarity
// between sequences.
// For more information see https://en.wikipedia.org/wiki/Jaro-Winkler_distance.
type Jaro struct {
	// CaseSensitive specifies if the string comparison is case sensitive.
-	CaseSensitive bool
+	CaseSensitive     bool
+	UseStandardWindow int
}

// NewJaro returns a new Jaro string metric.
//
// Default options:
// CaseSensitive: true
+// UseStandardWindow: 0 (uses original strutil algorithm)
func NewJaro() *Jaro {
	return &Jaro{
-		CaseSensitive: true,
+		CaseSensitive:     true,
+		UseStandardWindow: 0,
	}
}

// Compare returns the Jaro similarity of a and b. The returned similarity is
// a number between 0 and 1. Larger similarity numbers indicate closer matches.
func (m *Jaro) Compare(a, b string) float64 {
-	// Check if both terms are empty.
+	// Use rune counts (UTF-8 code points) for lengths.
	lenA, lenB := utf8.RuneCountInString(a), utf8.RuneCountInString(b)
+
+	// Check if both terms are empty.
	if lenA == 0 && lenB == 0 {
-		return 1
+		return 1.0
	}

	// Check if one of the terms is empty.
	if lenA == 0 || lenB == 0 {
-		return 0
+		return 0.0
	}

	// Lower terms if case insensitive comparison is specified.
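
Review note on the rune-count comment in this hunk: Go's built-in `len` on a string counts bytes, while `utf8.RuneCountInString` counts code points, which is presumably why `Compare` normalizes by the latter. A minimal, self-contained sketch of the difference follows; the helper name `lengths` is made up for this illustration and is not part of strutil.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengths is a hypothetical helper (not part of strutil) that reports
// the byte length and the rune count of s.
func lengths(s string) (byteLen, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"graph", "na\u00efve", "日本語"} {
		b, r := lengths(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
	// Output:
	// "graph": 5 bytes, 5 runes
	// "naïve": 6 bytes, 5 runes
	// "日本語": 9 bytes, 3 runes
}
```

Dividing the match count by a byte length would understate the similarity of non-ASCII strings, so rune counts are the safer denominator here.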
@@ -45,7 +50,28 @@ func (m *Jaro) Compare(a, b string) float64 {
		b = strings.ToLower(b)
	}

-	// Get matching runes.
+	// Choose algorithm based on UseStandardWindow
+	if m.UseStandardWindow == 1 {
+		// Apache Commons implementation
+		if a == b {
+			return 1.0
+		}
+
+		ra := []rune(a)
+		rb := []rune(b)
+
+		matches, halfTranspositions := jaroMatches(ra, rb, m.UseStandardWindow)
+		if matches == 0 {
+			return 0.0
+		}
+
+		mFloat := float64(matches)
+		return (mFloat/float64(lenA) +
+			mFloat/float64(lenB) +
+			(mFloat-float64(halfTranspositions)/2.0)/mFloat) / 3.0
+	}
+
+	// Original strutil implementation (default)
	halfLen := mathutil.Max(0, mathutil.Max(lenA, lenB)/2)
	mrA := matchingRunes(a, b, halfLen)
	mrB := matchingRunes(b, a, halfLen)
@@ -55,12 +81,78 @@ func (m *Jaro) Compare(a, b string) float64 {
		return 0.0
	}

-	// Return similarity.
	return (float64(fmLen)/float64(lenA) +
		float64(smLen)/float64(lenB) +
		float64(fmLen-transpositions(mrA, mrB)/2)/float64(fmLen)) / 3.0
}

+// jaroMatches mirrors Apache's JaroWinklerSimilarity.matches(...) logic,
+// but operating on rune slices instead of Java chars.
+func jaroMatches(first, second []rune, useStandardWindow int) (matches int, halfTranspositions int) {
+	var maxRunes, minRunes []rune
+	if len(first) > len(second) {
+		maxRunes = first
+		minRunes = second
+	} else {
+		maxRunes = second
+		minRunes = first
+	}
+
+	// range = Math.max(max.length()/2 - 1, 0)
+	rng := maxInt(len(maxRunes)/2-useStandardWindow, 0)
+
+	matchIndexes := make([]int, len(minRunes))
+	for i := range matchIndexes {
+		matchIndexes[i] = -1
+	}
+	matchFlags := make([]bool, len(maxRunes))
+
+	// Find matches
+	for mi, c1 := range minRunes {
+		start := maxInt(mi-rng, 0)
+		end := minInt(mi+rng+1, len(maxRunes))
+		for xi := start; xi < end; xi++ {
+			if !matchFlags[xi] && c1 == maxRunes[xi] {
+				matchIndexes[mi] = xi
+				matchFlags[xi] = true
+				matches++
+				break
+			}
+		}
+	}
+
+	// Build the two matched sequences ms1, ms2
+	ms1 := make([]rune, matches)
+	ms2 := make([]rune, matches)
+
+	si := 0
+	for i := 0; i < len(minRunes); i++ {
+		if matchIndexes[i] != -1 {
+			ms1[si] = minRunes[i]
+			si++
+		}
+	}
+
+	si = 0
+	for i := 0; i < len(maxRunes); i++ {
+		if matchFlags[i] {
+			ms2[si] = maxRunes[i]
+			si++
+		}
+	}
+
+	// Count half-transpositions
+	for i := 0; i < len(ms1); i++ {
+		if ms1[i] != ms2[i] {
+			halfTranspositions++
+		}
+	}
+
+	return matches, halfTranspositions
+}
+
+// matchingRunes returns the matching runes between a and b within the specified limit.
+// This is the original strutil implementation.
func matchingRunes(a, b string, limit int) []rune {
	var (
		runesA = []rune(a)
@@ -83,6 +175,8 @@ func matchingRunes(a, b string, limit int) []rune {
	return runesCommon
}

+// transpositions counts the number of transpositions between two rune slices.
+// This is the original strutil implementation.
func transpositions(a, b []rune) int {
	var count int

@@ -95,3 +189,18 @@ func transpositions(a, b []rune) int {

	return count
}
+
+// local int helpers
+func minInt(a, b int) int {
+	if a < b {
+		return a
+	}
+	return b
+}
+
+func maxInt(a, b int) int {
+	if a > b {
+		return a
+	}
+	return b
+}
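
Review note on the match window: `jaroMatches` searches for each rune within `len(maxRunes)/2 - useStandardWindow` positions, so a value of 1 gives the textbook Jaro window. The standalone sketch below re-implements that matching and half-transposition logic to show how the window width changes scores. It is an illustration under assumptions, not the PR's code: the function `jaroSketch` and its `windowOffset` parameter are hypothetical, and an offset of 0 here is not claimed to reproduce strutil's original `matchingRunes` path.

```go
package main

import "fmt"

// jaroSketch is a hypothetical, self-contained Jaro similarity over runes.
// The search window is max(len)/2 - windowOffset, clamped at 0, mirroring
// the formula used by jaroMatches in this diff.
func jaroSketch(a, b string, windowOffset int) float64 {
	short, long := []rune(a), []rune(b)
	if len(short) == 0 && len(long) == 0 {
		return 1
	}
	if len(short) == 0 || len(long) == 0 {
		return 0
	}
	// Iterate over the shorter input, as jaroMatches does.
	if len(short) > len(long) {
		short, long = long, short
	}
	window := len(long)/2 - windowOffset
	if window < 0 {
		window = 0
	}

	matchedShort := make([]bool, len(short))
	matchedLong := make([]bool, len(long))
	matches := 0
	for i, r := range short {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(long) {
			hi = len(long)
		}
		for j := lo; j < hi; j++ {
			if !matchedLong[j] && long[j] == r {
				matchedShort[i], matchedLong[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}

	// Walk the matched runes of both inputs in order; every position
	// where they differ counts as one half-transposition.
	half, j := 0, 0
	for i := range short {
		if !matchedShort[i] {
			continue
		}
		for !matchedLong[j] {
			j++
		}
		if short[i] != long[j] {
			half++
		}
		j++
	}

	m := float64(matches)
	return (m/float64(len(short)) + m/float64(len(long)) + (m-float64(half)/2)/m) / 3
}

func main() {
	fmt.Printf("%.4f\n", jaroSketch("MARTHA", "MARHTA", 1)) // 0.9444
	fmt.Printf("%.4f\n", jaroSketch("ab", "ba", 1))         // 0.0000
	fmt.Printf("%.4f\n", jaroSketch("ab", "ba", 0))         // 0.8333
}
```

The `"ab"` / `"ba"` pair shows the effect most clearly: with the textbook window (offset 1) the window is 0 and nothing matches, while the wider window (offset 0) matches both runes as a transposition.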