The benchmark was inferring or loading schemas repeatedly:
- Once for the schema benchmark
- Once at the start of the conversion benchmark (~22 seconds)
- Then loading from the JSON file 4 times (once per format)
Total schema operations in the conversion benchmark: 5 (1 inference + 4 file loads)
The benchmark now infers schemas only ONCE:
- Infer schemas at the start of conversion benchmark (~22 seconds)
- Reuse the same `SchemaReader` instance with pre-loaded schemas for all 4 formats
- No repeated file loading - schemas stay in memory
Total schema operations: 1 inference, 0 file loads
- Added: `logger.info("Inferring schemas once for all conversions...")`
- Stores schemas: `schemas = reader.schemas`
- Passes same `schema_reader` to all Converters
- Sets `schema_report_path=None` to prevent file loading
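
The benchmark-side change listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `SchemaReader` and `Converter` are simplified stand-ins, and `infer_schemas()`, the constructor arguments, the `inference_calls` counter, and the `run_conversion_benchmark` helper are hypothetical names introduced only for this example.

```python
import logging

logger = logging.getLogger(__name__)


class SchemaReader:
    """Hypothetical stand-in that counts how often inference runs."""

    def __init__(self):
        self.schemas = {}
        self.inference_calls = 0

    def infer_schemas(self, data_dir):
        # Placeholder for the expensive inference step.
        self.inference_calls += 1
        self.schemas = {"example.json": {"id": "int64"}}
        return self.schemas


class Converter:
    """Hypothetical stand-in for the real Converter."""

    def __init__(self, output_format, schema_reader, schema_report_path=None):
        self.output_format = output_format
        self.schema_reader = schema_reader
        self.schema_report_path = schema_report_path

    def convert_all(self):
        # Reuse whatever the shared reader already holds in memory.
        return self.schema_reader.schemas


def run_conversion_benchmark(data_dir, formats=("parquet", "csv", "avro", "orc")):
    reader = SchemaReader()
    logger.info("Inferring schemas once for all conversions...")
    reader.infer_schemas(data_dir)  # the only inference call

    results = {}
    for fmt in formats:
        logger.info("Benchmarking %s conversion...", fmt)
        # Same reader for every format; no report path, so no file loading.
        converter = Converter(fmt, schema_reader=reader, schema_report_path=None)
        results[fmt] = converter.convert_all()
    return reader, results
```

Because every `Converter` receives the same reader object, each format sees the identical in-memory schema dictionary, and the inference counter stays at 1 regardless of how many formats are benchmarked.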
#### src/converter.py
- Updated `convert_all()` to check for pre-loaded schemas first:

  ```python
  if self.schema_reader.schemas:
      logger.info("Using pre-loaded schemas from SchemaReader")
      schemas = self.schema_reader.schemas
  elif self.schema_report_path:
      logger.info("Loading schemas from schema report...")
      schemas = SchemaReader.load_schemas_from_json(...)
  ```
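
To make the lookup order concrete, here is a small self-contained sketch of the same in-memory-first precedence, pulled out into a standalone function so it can be exercised without the real classes. The function name `resolve_schemas`, the `load_from_json` callback, and the returned source label are assumptions made for illustration; the final branch (neither source available) is not shown in the original snippet and is a guess.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def resolve_schemas(schema_reader, schema_report_path, load_from_json):
    """Return (schemas, source), preferring schemas already held in memory."""
    if schema_reader.schemas:
        logger.info("Using pre-loaded schemas from SchemaReader")
        return schema_reader.schemas, "memory"
    if schema_report_path:
        logger.info("Loading schemas from schema report...")
        return load_from_json(schema_report_path), "file"
    # Assumption: the original snippet does not show this case.
    return None, "none"


# Usage: a reader with pre-loaded schemas never triggers the loader.
reader = SimpleNamespace(schemas={"events.json": {"id": "int64"}})
schemas, source = resolve_schemas(reader, None, load_from_json=lambda path: {})
```

One detail worth noting: the check relies on truthiness, so if `schemas` is an empty dict the code falls through to the file-loading branch. That is probably the desired behavior, but it means "pre-loaded but empty" and "not loaded" are treated the same.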
✅ Schemas inferred once and reused across all 4 format conversions
✅ No redundant file I/O
✅ Cleaner logs - only one schema inference message
✅ Faster benchmarking - eliminates the 4 schema file loads (one per format)
The logs now show:
2025-11-23 18:05:44 - INFO - Inferring schemas once for all conversions...
2025-11-23 18:05:44 - INFO - Found 6 JSON file(s) in data
[Schema inference happens ONCE - ~22 seconds]
2025-11-23 18:06:06 - INFO - Benchmarking parquet conversion...
2025-11-23 18:06:06 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to parquet]
2025-11-23 18:06:10 - INFO - Benchmarking csv conversion...
2025-11-23 18:06:10 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to CSV]
2025-11-23 18:06:13 - INFO - Benchmarking avro conversion...
2025-11-23 18:06:13 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to Avro]
2025-11-23 18:06:16 - INFO - Benchmarking orc conversion...
2025-11-23 18:06:16 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to ORC]
No more repeated schema inference or file loading!