Problem: freq_checker/io.py currently loads the entire dataset into memory via pandas.read_csv() or pandas.read_excel(). Processing very large files (e.g., > 5GB, or any file larger than available RAM) raises a MemoryError and crashes the application.
Goal: Implement chunked processing to handle files of arbitrary size with a low, constant memory footprint.
Proposed Solution:
- Update io.py:
  - Modify load_data to return an iterator (generator) of DataFrames when a chunksize is provided.
- Update core.py:
  - Refactor find_duplicates to consume the generator.
  - Aggregation logic: instead of running value_counts() once, run it on each chunk, add the target column's per-chunk counts to a running total (or a temporary lightweight structure), and aggregate the results at the end.
  - Note: merging partial counts is required because a duplicate value might appear once in Chunk A and once in Chunk B.
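The two refactors above can be sketched as follows. The names load_data and find_duplicates come from the issue, but the signatures and the per-column aggregation are assumptions, not the actual freq_checker API:

```python
# Hypothetical sketch of the chunked pipeline; signatures are assumptions.
import pandas as pd


def load_data(path, chunksize=None):
    """Yield DataFrames one chunk at a time when a chunksize is given."""
    if chunksize is None:
        yield pd.read_csv(path)
    else:
        yield from pd.read_csv(path, chunksize=chunksize)


def find_duplicates(chunks, column):
    """Merge per-chunk value_counts into a global total, then keep duplicates."""
    total = pd.Series(dtype="int64")
    for chunk in chunks:
        counts = chunk[column].value_counts()
        # add(..., fill_value=0) merges values seen in some chunks but not others.
        total = total.add(counts, fill_value=0)
    total = total.astype("int64")
    return total[total > 1]
```

Only one chunk plus the running total is in memory at any point, which is what gives the constant footprint.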
Technical Details:
- Use pd.read_csv(..., chunksize=10000).
- Ensure the fuzzy matching logic raises a warning or error when chunking is enabled, since fuzzy matching across chunks is computationally expensive (O(N^2)) and complex to parallelize without blocking.
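A guard for the fuzzy-matching restriction could look like this. The function name and the `fuzzy` flag are hypothetical, not the existing CLI options:

```python
# Hypothetical option guard; `chunksize` and `fuzzy` names are assumptions.
def check_chunked_options(chunksize, fuzzy):
    """Reject the unsupported combination of chunked reading and fuzzy matching."""
    if chunksize is not None and fuzzy:
        raise ValueError(
            "Fuzzy matching is not supported in chunked mode: comparing values "
            "across chunk boundaries is O(N^2). Disable fuzzy matching or run "
            "without a chunksize."
        )
```

Failing fast here is preferable to silently producing partial fuzzy matches that only cover within-chunk pairs.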
Acceptance Criteria:
- The script successfully processes a CSV generated to be larger than available RAM (e.g., a 10GB dummy file).
- Output match counts are identical to those of the non-chunked version.
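The second criterion can be checked with an in-memory equivalence test along these lines, with a small synthetic CSV standing in for the 10GB file:

```python
import io

import pandas as pd

# Synthetic CSV: 1000 rows, values 0..99 each appearing 10 times.
csv_text = "id\n" + "\n".join(str(i % 100) for i in range(1000))

# Reference: non-chunked counts.
ref = pd.read_csv(io.StringIO(csv_text))["id"].value_counts().sort_index()

# Chunked: merge per-chunk counts into a running total.
total = pd.Series(dtype="int64")
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=128):
    total = total.add(chunk["id"].value_counts(), fill_value=0)
total = total.astype("int64").sort_index()

# Chunked and non-chunked results must match exactly.
assert total.to_dict() == ref.to_dict()
```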