Problem: freq_checker/io.py currently loads the entire dataset into memory via pandas.read_csv() or pandas.read_excel(). Processing very large files (e.g., > 5GB, or any file larger than available RAM) raises a MemoryError and crashes the application.
Goal: Implement chunked processing to handle files of arbitrary size with a low, constant memory footprint.
Proposed Solution:
- Update io.py:
  - Modify load_data to return an iterator (generator) of DataFrames when a chunksize is provided.
- Update core.py:
  - Refactor find_duplicates to consume the generator.
  - Aggregation logic: instead of running value_counts() once, run it on each chunk, add the target column's per-chunk counts to a running total (or a temporary lightweight structure), and aggregate the results at the end.
  - Note: merging partial counts is required because a duplicate value might appear once in Chunk A and once in Chunk B.
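The two refactors above can be sketched as follows. The names load_data and find_duplicates come from the issue, but the signatures and the per-column aggregation are assumptions, not the actual freq_checker API:

```python
# Hypothetical sketch of the chunked pipeline; signatures are assumptions.
import pandas as pd


def load_data(path, chunksize=None):
    """Yield DataFrames one chunk at a time when a chunksize is given."""
    if chunksize is None:
        yield pd.read_csv(path)
    else:
        yield from pd.read_csv(path, chunksize=chunksize)


def find_duplicates(chunks, column):
    """Merge per-chunk value_counts into a global total, then keep duplicates."""
    total = pd.Series(dtype="int64")
    for chunk in chunks:
        counts = chunk[column].value_counts()
        # add(..., fill_value=0) merges values seen in some chunks but not others.
        total = total.add(counts, fill_value=0)
    total = total.astype("int64")
    return total[total > 1]
```

Only one chunk plus the running total is in memory at any point, which is what gives the constant footprint.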
Technical Details:
- Use pd.read_csv(..., chunksize=10000).
- Ensure the fuzzy matching logic raises a warning or error when chunking is enabled, since fuzzy matching across chunks is computationally expensive (O(N^2)) and complex to parallelize without blocking.
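A guard for the fuzzy-matching restriction could look like this. The function name and the `fuzzy` flag are hypothetical, not the existing CLI options:

```python
# Hypothetical option guard; `chunksize` and `fuzzy` names are assumptions.
def check_chunked_options(chunksize, fuzzy):
    """Reject the unsupported combination of chunked reading and fuzzy matching."""
    if chunksize is not None and fuzzy:
        raise ValueError(
            "Fuzzy matching is not supported in chunked mode: comparing values "
            "across chunk boundaries is O(N^2). Disable fuzzy matching or run "
            "without a chunksize."
        )
```

Failing fast here is preferable to silently producing partial fuzzy matches that only cover within-chunk pairs.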
Acceptance Criteria:
- The script successfully processes a CSV generated to be larger than available RAM (e.g., a 10GB dummy file).
- Output match counts are identical to those of the non-chunked version.
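The second criterion can be checked with an in-memory equivalence test along these lines, with a small synthetic CSV standing in for the 10GB file:

```python
import io

import pandas as pd

# Synthetic CSV: 1000 rows, values 0..99 each appearing 10 times.
csv_text = "id\n" + "\n".join(str(i % 100) for i in range(1000))

# Reference: non-chunked counts.
ref = pd.read_csv(io.StringIO(csv_text))["id"].value_counts().sort_index()

# Chunked: merge per-chunk counts into a running total.
total = pd.Series(dtype="int64")
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=128):
    total = total.add(chunk["id"].value_counts(), fill_value=0)
total = total.astype("int64").sort_index()

# Chunked and non-chunked results must match exactly.
assert total.to_dict() == ref.to_dict()
```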