High-compatibility XPath library for Go with precise location tracking
- High Compatibility - Strives for close compatibility with jsdom's XPath evaluation
- Precise Location Tracking - Byte-offset positioning in source HTML/XML
- Dual Extraction Modes - Extract full elements or content-only with the `ContentsOnly` option
- High Performance - Optimized evaluation engine with smart caching
- Production Ready - Comprehensive error handling and extensive testing
- Battle Tested - Extensively tested against reference implementations
- Zero Dependencies - Pure Go implementation, no external dependencies
- Developer Friendly - Rich debugging support with trace logging
package main

import (
	"fmt"
	"log"

	"github.com/reclaimprotocol/xpath-go"
)

func main() {
	html := `<html><body><div id="content" class="main">Hello World</div></body></html>`

	// Simple query
	results, err := xpath.Query("//div[@id='content']", html)
	if err != nil {
		log.Fatal(err)
	}

	for _, result := range results {
		fmt.Printf("Found: %s\n", result.TextContent)
		fmt.Printf("Location: %d-%d\n", result.StartLocation, result.EndLocation)
		fmt.Printf("Path: %s\n", result.Path)
	}
}

go get github.com/reclaimprotocol/xpath-go

- `child::`, `parent::`, `ancestor::`, `descendant::`
- `following::`, `preceding::`, `following-sibling::`, `preceding-sibling::`
- `attribute::`, `namespace::`, `self::`
- `descendant-or-self::`, `ancestor-or-self::`
- Node Functions: `text()`, `node()`, `position()`, `last()`, `count()`
- String Functions: `string()`, `normalize-space()`, `starts-with()`, `contains()`, `substring()`
- Boolean Functions: `boolean()`, `not()`
- Number Functions: `number()`, `string-length()`
- Comparison: `=`, `!=`, `<`, `>`, `<=`, `>=`
- Logical: `and`, `or`, `not()`
- Arithmetic: `+`, `-`, `*`, `div`, `mod`
- Union: `|` (pipe operator)
- Attribute predicates: `[@id='test']`, `[@class and @id]`
- Position predicates: `[1]`, `[last()]`, `[position()>2]`
- Content predicates: `[text()='value']`, `[contains(text(), 'substring')]`
- Complex boolean expressions: `[@id='a' or @class='b']` and `[position()=1]`
Comprehensive XPath 1.0 implementation with extensive test coverage
| Feature Category | Support | Details |
|---|---|---|
| Basic Selection | Full | Element, attribute, wildcard selection |
| Attribute Queries | Full | Attribute existence, value matching, complex conditions |
| Text Functions | Full | text(), contains(), starts-with(), normalize-space() |
| Position Functions | Full | position(), last(), numeric positions |
| Axes Navigation | Full | All XPath axes including ancestor/descendant |
| Complex Predicates | Full | Boolean logic, nested predicates, unions |
| String Functions | Full | substring(), string-length() with edge cases |
XPath-Go preserves HTML entities in their original encoded form, which differs from JavaScript's DOM behavior:
<!-- Source HTML -->
<p>Text with &amp; &lt; &gt; characters</p>

| Implementation | Text Content | Matching XPath Query |
|---|---|---|
| JavaScript DOM | "Text with & < > characters" | `//p[contains(text(), '&')]` |
| XPath-Go | "Text with &amp; &lt; &gt; characters" | `//p[contains(text(), '&amp;')]` |
Why this difference exists:
- Preserves original HTML content exactly as written
- No information loss - you can decode when needed
- Predictable behavior - what you see is what you get
- Security benefits - prevents entity-related parsing issues
Working with entities:
// Method 1: Query with encoded entities
encoded, _ := xpath.Query("//p[contains(text(), '&amp;')]", htmlContent)

// Method 2: Decode after extraction. html here is the standard library
// package, so the document variable must not also be named html.
all, _ := xpath.Query("//p", htmlContent)
decoded := html.UnescapeString(all[0].TextContent)

Read the complete HTML Entity Handling guide for detailed information and best practices.
XPath-Go uses byte-based position tracking for performance and Go ecosystem compatibility:
<!-- Source HTML -->
<p>Hello 世界</p>

| Implementation | Position Calculation | StartLocation | EndLocation |
|---|---|---|---|
| JavaScript DOM | Character-based | 0 | 27 |
| XPath-Go | Byte-based | 0 | 31 |
Why byte-based positioning:
- Go idiomatic - Aligns with Go's string handling and byte slice operations
- Performance - No Unicode code point counting overhead during parsing
- Memory efficient - Direct byte offset calculations
- Deterministic - Consistent across all platforms and Go versions
Working with Unicode positions:
// Method 1: Use byte positions directly (recommended for Go)
html := `<p>Hello 世界</p>`
results, _ := xpath.Query("//p", html)
content := html[results[0].StartLocation:results[0].EndLocation]

// Method 2: Convert to character positions if needed
import "unicode/utf8"

func ByteToCharPos(s string, bytePos int) int {
	return utf8.RuneCountInString(s[:bytePos])
}

While XPath-Go aims for high compatibility with web standards, there are some intentional design choices:
- HTML Entity Preservation: Maintains the original encoded form (`&amp;` rather than `&`) for security and consistency
- Unicode Position Tracking: Uses byte offsets for Go ecosystem compatibility
- Performance Optimizations: Some complex expressions may have subtle evaluation differences

For complete compatibility details, see docs/COMPATIBILITY.md.
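
The byte-versus-character difference noted in the list above can be checked with the standard library alone. This is a rough sketch, not library code: `runeOffset` and `utf16Offset` are hypothetical helpers. It assumes the offset passed in falls on a rune boundary, and that the JavaScript side counts UTF-16 code units, as JavaScript string indices do.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// runeOffset converts a byte offset into s to a count of Unicode code points.
// byteOff must fall on a rune boundary.
func runeOffset(s string, byteOff int) int {
	return utf8.RuneCountInString(s[:byteOff])
}

// utf16Offset converts a byte offset into s to a count of UTF-16 code units,
// the unit JavaScript uses for string indices.
func utf16Offset(s string, byteOff int) int {
	return len(utf16.Encode([]rune(s[:byteOff])))
}

func main() {
	for _, s := range []string{"世界!", "a😀b"} {
		end := len(s) // byte length, the unit XPath-Go reports
		fmt.Printf("%q bytes=%d runes=%d utf16=%d\n",
			s, end, runeOffset(s, end), utf16Offset(s, end))
	}
}
```

For text inside the Basic Multilingual Plane the rune count and the UTF-16 count agree. They diverge for characters such as emoji, which take one rune but two UTF-16 code units, so the right conversion depends on what the consumer of the offsets counts.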
Get precise byte positions for all matched nodes:
results, _ := xpath.Query("//div[@class='content']", htmlContent)
for _, result := range results {
	fmt.Printf("Element: <%s>\n", result.NodeName)
	fmt.Printf("Text: %s\n", result.TextContent)
	fmt.Printf("Byte Range: %d-%d\n", result.StartLocation, result.EndLocation)
	fmt.Printf("XPath: %s\n", result.Path)
	fmt.Printf("Attributes: %+v\n", result.Attributes)
}

For repeated queries, compile once and reuse:
// Compile once
compiled, err := xpath.Compile("//div[@class='item'][position()>1]")
if err != nil {
	log.Fatal(err)
}

// Use multiple times (faster)
for _, htmlDoc := range documents {
	results, err := compiled.Evaluate(htmlDoc)
	if err != nil {
		log.Printf("Error: %v", err)
		continue
	}
	// Process results...
}

Control output format and extraction mode:
results, err := xpath.QueryWithOptions("//p", html, xpath.Options{
	IncludeLocation: true,
	OutputFormat:    "values", // "nodes", "values", "paths"
	ContentsOnly:    false,    // Extract full elements (default)
})

// Extract only inner content between tags
contents, err := xpath.QueryWithOptions("//div", html, xpath.Options{
	ContentsOnly: true, // Extract content-only: <div>content</div> → "content"
})

xpath.EnableTrace()
defer xpath.DisableTrace()
results, err := xpath.Query("//div[contains(@class, 'complex')]//p[last()]", html)
// Detailed evaluation steps logged to stderr

// Element selection
xpath.Query("//div", html) // All div elements
xpath.Query("/html/body/div", html) // Specific path
xpath.Query("//div[@id='main']", html) // Div with specific ID
// Attribute selection
xpath.Query("//div/@class", html) // Class attributes
xpath.Query("//*[@href]", html) // Elements with href
xpath.Query("//a[@href and @title]", html) // Links with both attributes

// Text content
xpath.Query("//p[text()='Hello']", html) // Exact text match
xpath.Query("//div[contains(text(), 'world')]", html) // Text contains
xpath.Query("//span[normalize-space(text())='Clean']", html) // Normalized text
// Position-based
xpath.Query("//li[1]", html) // First list item
xpath.Query("//tr[last()]", html) // Last table row
xpath.Query("//div[position()>2]", html) // Divs after second

// Boolean logic
xpath.Query("//div[@id='a' or @class='b']", html) // OR condition
xpath.Query("//p[@class and text()]", html) // AND condition
xpath.Query("//div[not(@class)]", html) // NOT condition
// Nested conditions
xpath.Query("//ul[li[@class='active']]", html) // UL containing active LI
xpath.Query("//div[@class='container']//p[position()=2]", html) // Second P in container
// Complex expressions
xpath.Query("//article[.//h1 and count(.//p)>2]", html) // Articles with H1 and 3+ paragraphs

// Family relationships
xpath.Query("//h2/following-sibling::p", html) // P elements after H2
xpath.Query("//span/parent::div[@class='box']", html) // Parent div with class
xpath.Query("//td/ancestor::table[@id='data']", html) // Ancestor table with ID
// Advanced navigation
xpath.Query("//div[@id='start']/descendant-or-self::*[@class]", html) // The div itself and its descendants that have a class
xpath.Query("//li[3]/preceding-sibling::li", html) // Previous siblings

Extract either full elements or just their inner content:
html := `<div class="box">Hello <span>World</span>!</div>`

// Full element extraction (default)
full, _ := xpath.QueryWithOptions("//div", html, xpath.Options{
	ContentsOnly: false,
})
// StartLocation/EndLocation: <div class="box">Hello <span>World</span>!</div>

// Content-only extraction
inner, _ := xpath.QueryWithOptions("//div", html, xpath.Options{
	ContentsOnly: true,
})
// StartLocation/EndLocation: Hello <span>World</span>!
fmt.Printf("Content-only range: %s\n", html[inner[0].StartLocation:inner[0].EndLocation])

// Fine-grained control (always available)
result := full[0]
fmt.Printf("Full element: %s\n", html[result.StartLocation:result.EndLocation])
fmt.Printf("Inner content: %s\n", html[result.ContentStart:result.ContentEnd])

Use Cases:
- Full elements (`ContentsOnly: false`): HTML processing, DOM manipulation, complete element extraction
- Content only (`ContentsOnly: true`): Text processing, content analysis, clean text extraction without tags
Optimized for production use:
- Fast parsing with caching support
- Efficient evaluation with minimal memory allocations
- Thread-safe design for concurrent usage
- Compiled expressions for repeated queries
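
The last two points can be combined by sharing one compiled expression across goroutines. The sketch below is illustrative only and does not import XPath-Go: `Evaluator`, the local `Result` type, `evaluateAll`, and the `tagCounter` fake are all hypothetical stand-ins, and the interface merely mirrors the `Evaluate(content)` call shown elsewhere in this README. It assumes the thread-safety claim above extends to compiled expressions.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Result is a local stand-in for the library's result type.
type Result struct {
	TextContent string
}

// Evaluator is a hypothetical interface matching the Evaluate call
// documented for compiled expressions.
type Evaluator interface {
	Evaluate(content string) ([]Result, error)
}

// evaluateAll runs one shared evaluator over all documents concurrently.
// Each goroutine writes only to its own slice index, so no lock is needed.
func evaluateAll(ev Evaluator, docs []string) ([][]Result, []error) {
	results := make([][]Result, len(docs))
	errs := make([]error, len(docs))
	var wg sync.WaitGroup
	for i, doc := range docs {
		wg.Add(1)
		go func(i int, doc string) {
			defer wg.Done()
			results[i], errs[i] = ev.Evaluate(doc)
		}(i, doc)
	}
	wg.Wait()
	return results, errs
}

// tagCounter is a fake evaluator for demonstration: it returns one empty
// Result per occurrence of an opening tag without attributes.
type tagCounter struct{ tag string }

func (t tagCounter) Evaluate(content string) ([]Result, error) {
	n := strings.Count(content, "<"+t.tag+">")
	return make([]Result, n), nil
}

func main() {
	docs := []string{"<ul><li>a</li><li>b</li></ul>", "<p>none</p>"}
	results, errs := evaluateAll(tagCounter{tag: "li"}, docs)
	for i := range docs {
		fmt.Printf("doc %d: %d matches, err=%v\n", i, len(results[i]), errs[i])
	}
}
```

With the real library, the value returned by `xpath.Compile` would take the place of `tagCounter`, provided its `Evaluate` method has the signature assumed here.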
// Compile once, use many times
compiled, _ := xpath.Compile("//div[@class='item'][position()>1]")
results, _ := compiled.Evaluate(html)

// Basic query evaluation
func Query(xpathExpr, content string) ([]Result, error)
// Query with custom options
func QueryWithOptions(xpathExpr, content string, opts Options) ([]Result, error)
// Compile XPath for reuse (performance optimization)
func Compile(xpathExpr string) (*XPath, error)
// Enable/disable debug tracing
func EnableTrace()
func DisableTrace()

type Result struct {
	Value         string            // Node value or text content
	NodeName      string            // Element name (div, span, etc.)
	NodeType      int               // Node type (1=element, 2=attribute, 3=text)
	Attributes    map[string]string // Element attributes
	StartLocation int               // Byte offset where the match starts (full element or content-only)
	EndLocation   int               // Byte offset where the match ends (full element or content-only)
	ContentStart  int               // Start of inner content (after opening tag)
	ContentEnd    int               // End of inner content (before closing tag)
	Path          string            // Generated XPath path
	TextContent   string            // Text content of node and children
}

type Options struct {
	IncludeLocation bool   // Include byte positions (default: true)
	OutputFormat    string // "nodes", "values", "paths" (default: "nodes")
	ContentsOnly    bool   // Extract only inner content between tags (default: false)
}

ContentsOnly Mode:
- `false` (default): Extract full elements including tags: `<div>content</div>`
- `true`: Extract only inner content: `content`

Both modes maintain precise position tracking. With `ContentsOnly: true`, `StartLocation`/`EndLocation` point to the content boundaries, while `ContentStart`/`ContentEnd` are always available for fine-grained control.
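
How the four offsets relate can be shown without the library. In this sketch, `span`, `locateFirst`, and `contentsOnly` are hypothetical names: `span` mirrors only the offset fields of `Result`, and the offsets are computed with a naive string search rather than by XPath-Go. It is not a parser; it only handles a single, well-formed, non-nested element.

```go
package main

import (
	"fmt"
	"strings"
)

// span mirrors the four offset fields documented on Result (stand-in type).
type span struct {
	StartLocation, EndLocation int
	ContentStart, ContentEnd   int
}

// locateFirst finds the first opening tag and the last matching closing tag
// by plain string search. Illustration only: it ignores nesting and would
// also match longer tag names that share the same prefix.
func locateFirst(src, tag string) (span, bool) {
	open := strings.Index(src, "<"+tag)
	if open < 0 {
		return span{}, false
	}
	gt := strings.Index(src[open:], ">")
	if gt < 0 {
		return span{}, false
	}
	closeTag := "</" + tag + ">"
	closeAt := strings.LastIndex(src, closeTag)
	if closeAt < open+gt+1 {
		return span{}, false
	}
	return span{
		StartLocation: open,
		EndLocation:   closeAt + len(closeTag),
		ContentStart:  open + gt + 1,
		ContentEnd:    closeAt,
	}, true
}

// contentsOnly models the documented ContentsOnly: true behavior, where
// StartLocation/EndLocation move to the content boundaries.
func contentsOnly(s span) span {
	s.StartLocation, s.EndLocation = s.ContentStart, s.ContentEnd
	return s
}

func main() {
	src := `<div class="box">Hello <span>World</span>!</div>`
	full, ok := locateFirst(src, "div")
	if !ok {
		fmt.Println("no match")
		return
	}
	inner := contentsOnly(full)
	fmt.Println(src[full.StartLocation:full.EndLocation])   // whole element
	fmt.Println(src[inner.StartLocation:inner.EndLocation]) // Hello <span>World</span>!
}
```

Because `ContentStart`/`ContentEnd` are documented as always populated, a caller that needs both ranges can run a single query in the default mode and slice twice, rather than querying once per mode.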
# Go tests
go test ./...
# Compatibility tests (requires Node.js)
cd tests && npm install && npm test
# Benchmarks
go test -bench=. -benchmem ./...

We welcome contributions!
# Clone and setup
git clone https://github.com/reclaimprotocol/xpath-go.git
cd xpath-go && go mod download
# Run tests
go test ./... && cd tests && npm install && npm test

This project is licensed under the MIT License - see the LICENSE file for details.
- Built with high compatibility goals for jsdom and web standards
- Inspired by the W3C XPath 1.0 Specification
- Thanks to the Go community for excellent tooling and libraries
Production Ready: This library is actively used in production and provides reliable XPath evaluation for Go applications.