GC overhead with small objects #75

@hamishmorgan

A substantial amount of time is wasted garbage-collecting small objects that need not have been created in the first place. This is particularly prominent in the Count stage of the process, where approximately 20% of CPU time is devoted to GC. Since the Count stage is optimally parallelised, GC can account for the loss of one or more cores' worth of processing power.

The offending objects are probably TokenPair and Token objects, though this has yet to be substantiated.

To resolve this, the IO layer needs to be restructured and extended so that it can produce collections of token ids stored in arrays, similar to the SparseVector system.
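
A rough sketch of what an array-backed collection of token ids could look like, assuming token ids are plain `int` values. The class name `TokenPairBuffer` and all of its methods are hypothetical and are not part of the existing code base. Pairs are interleaved in a single `int[]`, so adding a pair allocates nothing except when the backing array grows.

```java
import java.util.Arrays;

/**
 * Hypothetical growable buffer of (id1, id2) token-id pairs.
 * Ids are stored interleaved in one primitive array, so no
 * per-pair object is created for the garbage collector to track.
 */
final class TokenPairBuffer {

    // Layout: [id1_0, id2_0, id1_1, id2_1, ...]
    private int[] ids = new int[16];

    // Number of pairs stored (not the number of ints).
    private int size = 0;

    /** Appends a pair, doubling the backing array when it is full. */
    void add(int id1, int id2) {
        if (2 * size == ids.length) {
            ids = Arrays.copyOf(ids, ids.length * 2);
        }
        ids[2 * size] = id1;
        ids[2 * size + 1] = id2;
        size++;
    }

    int size() {
        return size;
    }

    /** First id of the i-th pair; the caller must ensure 0 <= i < size(). */
    int first(int i) {
        return ids[2 * i];
    }

    /** Second id of the i-th pair; the caller must ensure 0 <= i < size(). */
    int second(int i) {
        return ids[2 * i + 1];
    }

    /**
     * Counts how often each first id occurs.
     * Assumes every first id lies in [0, vocabularySize).
     */
    long[] countFirst(int vocabularySize) {
        long[] counts = new long[vocabularySize];
        for (int i = 0; i < size; i++) {
            counts[ids[2 * i]]++;
        }
        return counts;
    }
}
```

Whether this actually removes the GC overhead depends on TokenPair and Token being the real culprits, which has not been confirmed yet.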
