
Implement Chunking for Large Files in freq_checker/io.py #8

@Dnysus

Description


Problem

Currently, freq_checker/io.py loads the entire dataset into memory via pandas.read_csv() or pandas.read_excel(). When a file is larger than available RAM (e.g., > 5 GB), this raises a MemoryError and crashes the application.

Goal

Implement chunked processing so that files of arbitrary size can be handled with a low, constant memory footprint.

Proposed Solution:

  1. Update io.py:
  • Modify load_data to return an iterator (generator) of DataFrames when a chunksize is provided.
  2. Update core.py:
  • Refactor find_duplicates to consume the generator.

  • Aggregation Logic: Instead of running value_counts() once over the whole dataset, run it on each chunk, add the target column's counts to a running total (or another lightweight temporary structure), and aggregate the partial results at the end.

  • Note: Merging partial counts is required because a duplicate value might appear once in Chunk A and once in Chunk B.
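A minimal sketch of steps 1–2 above. The names load_data and find_duplicates come from this issue; the Counter-based running total is one possible lightweight structure, not a prescribed design:

```python
from collections import Counter

import pandas as pd

def load_data(path, chunksize=None):
    """Return a single DataFrame (current behaviour), or an iterator
    of DataFrames when a chunksize is provided."""
    if chunksize is None:
        return pd.read_csv(path)
    return pd.read_csv(path, chunksize=chunksize)

def find_duplicates(chunks, column):
    """Merge per-chunk value counts into one running total, then keep
    only values that appear more than once across ALL chunks."""
    totals = Counter()
    for chunk in chunks:
        totals.update(chunk[column].value_counts().to_dict())
    return {value: n for value, n in totals.items() if n > 1}
```

Because the merge happens after all chunks are consumed, a value that appears once in Chunk A and once in Chunk B is correctly reported as a duplicate.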

Technical Details:

  • Use pd.read_csv(..., chunksize=10000).

  • Ensure the fuzzy matching path raises a warning or error when chunking is enabled, since fuzzy matching across chunks is computationally expensive (O(N^2) pairwise comparisons) and complex to parallelize without blocking.
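The two technical details above could be combined as follows; read_chunks is a hypothetical helper name, and the eager fuzzy check is one way to fail fast rather than mis-count:

```python
import pandas as pd

CHUNKSIZE = 10_000  # rows per chunk, as suggested in the issue

def read_chunks(path, fuzzy=False):
    """Return a chunked CSV reader, refusing fuzzy mode up front."""
    # Fuzzy matching needs every pair of rows available at once (O(N^2)),
    # so reject it in chunked mode instead of producing wrong results.
    if fuzzy:
        raise ValueError("fuzzy matching is not supported with chunked reading")
    return pd.read_csv(path, chunksize=CHUNKSIZE)
```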

Acceptance Criteria:

  • Script successfully processes a CSV generated to be larger than available RAM (e.g., a 10 GB dummy file).

  • Output match counts are identical to the non-chunked version.
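One way to exercise both criteria locally: stream a synthetic CSV to disk, then compare chunked totals against a single-pass value_counts(). make_dummy_csv and counts_match are hypothetical helper names; scale rows up until the file exceeds available RAM for the real test:

```python
from functools import reduce

import pandas as pd

def make_dummy_csv(path, rows=1_000_000, distinct=1000):
    """Stream a synthetic CSV to disk without holding it in memory."""
    with open(path, "w") as f:
        f.write("value\n")
        for i in range(rows):
            f.write(f"{i % distinct}\n")

def counts_match(path, column="value", chunksize=10_000):
    """True if chunked partial counts merge to the single-pass result."""
    whole = pd.read_csv(path)[column].value_counts()
    # fill_value=0 handles values absent from some chunks
    partial = reduce(
        lambda a, b: a.add(b, fill_value=0),
        (c[column].value_counts() for c in pd.read_csv(path, chunksize=chunksize)),
    )
    return whole.sort_index().equals(partial.sort_index().astype("int64"))
```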
