@UnravelSports UnravelSports commented Dec 17, 2025

This is a continuation of stephTchembeu#5 and #513. Due to all the misalignments it was easier to create a new PR.

Overview

This PR adds TrackingDataset.to_cdf(). It includes @koenvo's improved writing PR #515 and a cleaned-up version of the work done by @stephTchembeu.

With this, kloppy tracking data can be written to the Common Data Format (Anzer et al. 2025).

from kloppy import skillcorner

dataset = skillcorner.load_open_data(only_alive=False)

dataset.to_cdf(
    metadata_output_file='output/metadata.json',
    tracking_output_file='output/tracking.jsonl'
)

Because kloppy does not process some values that the CDF treats as mandatory (stadium id, competition id, season id, tracking version and collection timing), the call above emits warnings such as:

UserWarning: Missing mandatory ID at 'competition.id'. Currently replaced with the value 'MISSING_MANDATORY_VALUE'. Please provide the correct value to 'additional_metadata' to completely adhere to the CDF specification.
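To see at a glance which mandatory fields are still missing, the warnings can be recorded and parsed. The following is a minimal sketch, not part of this PR: emit_example_warnings is a hypothetical stand-in for dataset.to_cdf(...) so the snippet runs without kloppy, and the 'season.id' path is an assumption based on the list of mandatory values above.

```python
import re
import warnings


def emit_example_warnings():
    # Hypothetical stand-in for dataset.to_cdf(...). It only reproduces the
    # warning text quoted above; 'season.id' is an assumed path.
    for path in ("competition.id", "season.id"):
        warnings.warn(
            f"Missing mandatory ID at '{path}'. Currently replaced with the "
            "value 'MISSING_MANDATORY_VALUE'.",
            UserWarning,
        )


def collect_missing_paths(func):
    """Run func and return the field paths named in its 'Missing mandatory' warnings."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones that would normally be shown once.
        warnings.simplefilter("always")
        func()
    pattern = re.compile(r"Missing mandatory \w+ at '([^']+)'")
    paths = []
    for caught_warning in caught:
        match = pattern.search(str(caught_warning.message))
        if match:
            paths.append(match.group(1))
    return paths


missing = collect_missing_paths(emit_example_warnings)
print(missing)
```

In real use, the stand-in would be replaced by a function that calls to_cdf on the dataset.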

We can resolve this by passing additional_metadata to to_cdf(). The TypedDicts from the Common Data Format Validator are optional here, but they help keep everything in the correct schema:

from cdf.domain import CdfMetaDataSchema, Stadium, Competition, Season, Meta, Tracking

additional_meta_data = CdfMetaDataSchema(
    competition=Competition(
        id="COMP_123",
        name="Test Competition",
        format="league_20"
    ),
    season=Season(id="SEASON_2024", name="2024/25"),
    stadium=Stadium(
        id="STADIUM_456",
        name="Test Arena",
        turf="grass",
    ),
    meta=Meta(
        tracking=Tracking(
            version="2.0.0",
            name="TestTracker",
            fps=30,
            collection_timing="live"
        )
    )
)
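Since TypedDict constructors return regular dicts at runtime, the same structure can also be written as plain nested dicts when the validator is not installed. This is a sketch that assumes the classes above are ordinary typing.TypedDicts, as described; the name additional_meta_data_plain is only illustrative.

```python
# Plain-dict equivalent of the TypedDict example above (same keys and values).
additional_meta_data_plain = {
    "competition": {
        "id": "COMP_123",
        "name": "Test Competition",
        "format": "league_20",
    },
    "season": {"id": "SEASON_2024", "name": "2024/25"},
    "stadium": {
        "id": "STADIUM_456",
        "name": "Test Arena",
        "turf": "grass",
    },
    "meta": {
        "tracking": {
            "version": "2.0.0",
            "name": "TestTracker",
            "fps": 30,
            "collection_timing": "live",
        }
    },
}
```

The TypedDict version remains preferable when a type checker is in use, because misspelled keys are then caught before runtime.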

We can then run:

from kloppy import skillcorner

dataset = skillcorner.load_open_data(only_alive=False)

dataset.to_cdf(
    metadata_output_file='output/metadata.json',
    tracking_output_file='output/tracking.jsonl',
    additional_metadata=additional_meta_data
)

This call no longer emits the warnings above and should write the correct files.

Note: we set only_alive=True, because not doing so will also show a warning.
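A quick sanity check of the two output files needs only the standard library: the metadata file is a single JSON document and the tracking file holds one JSON object per line. The sketch below makes no assumption about the CDF field names; it runs on tiny placeholder files whose keys are purely illustrative, and summarize_cdf_output is a hypothetical helper, not part of kloppy.

```python
import json
import tempfile
from pathlib import Path


def summarize_cdf_output(metadata_path, tracking_path):
    """Return the top-level metadata keys, the frame count and the first frame's keys."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    frame_count = 0
    first_frame_keys = []
    with open(tracking_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            frame = json.loads(line)
            if frame_count == 0:
                first_frame_keys = sorted(frame)
            frame_count += 1
    return {
        "metadata_keys": sorted(metadata),
        "frame_count": frame_count,
        "first_frame_keys": first_frame_keys,
    }


# Demo on placeholder files; the keys are illustrative, not the CDF schema.
with tempfile.TemporaryDirectory() as tmp:
    meta_file = Path(tmp) / "metadata.json"
    track_file = Path(tmp) / "tracking.jsonl"
    meta_file.write_text(
        json.dumps({"competition": {"id": "COMP_123"}, "season": {"id": "SEASON_2024"}}),
        encoding="utf-8",
    )
    track_file.write_text(
        '{"placeholder_frame": 1}\n{"placeholder_frame": 2}\n', encoding="utf-8"
    )
    summary = summarize_cdf_output(meta_file, track_file)

print(summary)
```

Pointing the helper at output/metadata.json and output/tracking.jsonl would give the same kind of summary for the real files.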

Common Data Format Validator

We have new unit tests that cover the writing functionality, and tests that validate the output against the CDF schema using the common-data-format-validator. This is a development dependency. Note that if the CDF changes its structure, these tests will fail on the kloppy side too. I can imagine this is not ideal, but I am not sure what to do about it. Any suggestions here are more than welcome.

Next Steps

I would like to continue with reading CDF tracking data and writing and reading CDF event data. Should I do this in a new PR, or shall I pile everything into this?

stephTchembeu and others added 27 commits October 28, 2025 09:42
This adds comprehensive write support to the open_as_file() function with
efficient memory management and streaming capabilities.

Key features:
- BufferedStream: SpooledTemporaryFile wrapper with chunked I/O (5MB memory threshold)
- Write modes: 'wb' (write), 'ab' (append) - binary only
- Adapter pattern: write_from_stream() method (opt-in for adapters)
- Compression support: .gz, .bz2, .xz files handled automatically
- Local files and S3 URIs supported via FSSpecAdapter
- Protocols for type safety: SupportsRead, SupportsWrite

Implementation details:
- read_from()/write_to() methods use shutil.copyfileobj for chunked copying
- Context manager pattern buffers writes and flushes on exit
- No breaking changes to existing read functionality
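The buffering idea described in this commit message can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not the BufferedStream implementation from the commit: buffered_copy and CHUNK_SIZE are hypothetical, and only the 5MB threshold and the use of SpooledTemporaryFile with shutil.copyfileobj come from the text above.

```python
import io
import shutil
import tempfile

MEMORY_THRESHOLD = 5 * 1024 * 1024  # 5MB, as stated in the commit message
CHUNK_SIZE = 64 * 1024  # assumed chunk size, chosen only for illustration


def buffered_copy(source, destination):
    """Copy source to destination through a spooled buffer, chunk by chunk."""
    # The buffer stays in memory until it exceeds MEMORY_THRESHOLD, then
    # transparently rolls over to a temporary file on disk.
    with tempfile.SpooledTemporaryFile(max_size=MEMORY_THRESHOLD, mode="w+b") as buffer:
        shutil.copyfileobj(source, buffer, CHUNK_SIZE)
        buffer.seek(0)  # rewind before reading the buffered bytes back
        shutil.copyfileobj(buffer, destination, CHUNK_SIZE)


source = io.BytesIO(b"frame\n" * 1000)
destination = io.BytesIO()
buffered_copy(source, destination)
print(len(destination.getvalue()))
```

Small payloads never touch the disk with this approach, while large ones avoid holding the whole stream in memory.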
@UnravelSports UnravelSports changed the title [test] [CDF] to_cdf Dec 17, 2025
@UnravelSports UnravelSports requested review from probberechts and removed request for probberechts December 17, 2025 11:58

3 participants