Conversation

@TuSKan (Contributor) commented Dec 16, 2025

Goal of this PR

Expose Reader Offset and OCF Block Status for Concurrent Decoder

Description

This PR enhances the Reader and the OCF Decoder by exposing internal state information that is critical for advanced processing scenarios such as progress tracking, concurrent splitting, and debugging.

Key Changes

  • Reader.InputOffset(): Adds a method to the Reader to retrieve the current input offset. This allows consumers to know exactly where in the underlying stream the reader is currently positioned.
  • OCF Decoder.BlockStatus(): Introduces a BlockStatus() method (and corresponding struct, sketched just after this list) to the OCF Decoder. This provides a snapshot of the current block being processed, including:
    • Current: The index of the current record within the block.
    • Count: The total number of records in the current block.
    • Size: The size (in bytes) of the current block.
    • Offset: The input offset provided by the underlying reader.
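
A rough sketch of the shape BlockStatus might take, based on the fields listed above (the concrete integer types are assumptions, not the PR's exact declaration):

// BlockStatus is a snapshot of the OCF block currently being decoded.
type BlockStatus struct {
    Current int   // index of the current record within the block
    Count   int   // total number of records in the block
    Size    int64 // size of the current block in bytes
    Offset  int64 // input offset reported by the underlying Reader
}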

Motivation

Currently, the avro package abstracts away the underlying stream position and block details. While this is fine for simple reading, it limits users who need to:

  1. Track Progress: Accurately report percentage completion when processing large OCF files.
  2. Implement Splitting: Effectively split file processing based on block boundaries and offsets.
  3. Debug: Gain better visibility into how the decoder is traversing the file structure.

Use Case Example

A data processing pipeline can now use BlockStatus() to log precise progress or checkpoint processing at specific block offsets, improving reliability and observability.

// Example usage
decoder, _ := ocf.NewDecoder(r)
for decoder.HasNext() {
    var record MyRecord
    if err := decoder.Decode(&record); err != nil {
        break
    }

    // Track progress
    status := decoder.BlockStatus()
    fmt.Printf("Processing record %d/%d of block at offset %d\n", status.Current, status.Count, status.Offset)
}
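
Building on the same fields, here is a hedged sketch of block-boundary checkpointing (saveCheckpoint is a hypothetical persistence hook, and this assumes Offset reflects the reader position once the current block has been read):

// Checkpoint once per block: after the last record of a block has been
// decoded, persist the reader offset so processing can be resumed or audited.
for decoder.HasNext() {
    var record MyRecord
    if err := decoder.Decode(&record); err != nil {
        break
    }

    status := decoder.BlockStatus()
    if status.Current == status.Count {
        saveCheckpoint(status.Offset) // hypothetical persistence hook
    }
}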

How did I test it?

I added a test, TestConcurrentDecode, that exercises concurrent decoding.

Please let me know whether this PR makes sense to you.

@TuSKan changed the title from Header to Concurrently Decoder on Dec 16, 2025
@TuSKan changed the title from Concurrently Decoder to Concurrent Decoder on Dec 16, 2025
Comment on lines -382 to +432
writer := avro.NewWriter(w, 512, avro.WithWriterConfig(cfg.EncodingConfig))
writer := avro.NewWriter(w, 512, avro.WithWriterConfig(avro.DefaultConfig))
Member

Why?

Contributor Author

This fixes bug #590.

Comment on lines -423 to +473
writer := avro.NewWriter(w, 512, avro.WithWriterConfig(cfg.EncodingConfig))
writer := avro.NewWriter(w, 512, avro.WithWriterConfig(avro.DefaultConfig))
Member

Why?

Contributor Author

Same as above.

Comment on lines +62 to +75
func newDecoderConfig(opts ...DecoderFunc) *decoderConfig {
    cfg := decoderConfig{
        DecoderConfig: avro.DefaultConfig,
        SchemaCache:   avro.DefaultSchemaCache,
        CodecOptions: codecOptions{
            DeflateCompressionLevel: flate.DefaultCompression,
        },
    }
    for _, opt := range opts {
        opt(&cfg)
    }
    return &cfg
}

Member

Not sure that this change helped anything.

Contributor Author

Avoiding duplicate code.

need := min(r.tail-r.head, tokenLen-1)

// Construct boundary window: stash + beginning of new buffer
boundary := make([]byte, len(stash)+need)
Member

This is a known size, allocate once and reuse instead of constantly re-allocating.
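
A hedged sketch of one way to read this suggestion (r.boundary is a hypothetical scratch field on the reader, not code from this PR):

// Reuse a single scratch buffer, growing it only when it is too small,
// instead of allocating a new slice on every boundary check.
if cap(r.boundary) < len(stash)+need {
    r.boundary = make([]byte, len(stash)+need)
}
boundary := r.boundary[:len(stash)+need]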

copy(boundary, stash)
copy(boundary[len(stash):], r.buf[r.head:r.head+need])

if idx := bytes.Index(boundary, token); idx >= 0 {
Member

In this case, surely the reader has advanced too far, as the start of the token is no longer in the buffer.

Contributor Author

This is the case where the token extends past the buffer, so a bigger buffer is needed.

data: []byte{0x38, 0x36},
},
{

Member

Why?

Contributor Author

Auto format

data: []byte{0x38, 0x36},
},
{

Member

Why?

Contributor Author

Auto format
