Skip to content

Latest commit

 

History

History
63 lines (53 loc) · 2.44 KB

File metadata and controls

63 lines (53 loc) · 2.44 KB

Benchmark Optimization Summary

Before Optimization

The benchmark was inferring schemas 5 times:

  1. Once for the schema benchmark
  2. Once at the start of conversion benchmark (~22 seconds)
  3. Then loading from JSON file 4 times (once per format)

Total schema operations: 5 times (1 inference + 4 file loads)

After Optimization

The benchmark now infers schemas only ONCE:

  1. Infer schemas at the start of conversion benchmark (~22 seconds)
  2. Reuse the same SchemaReader instance with pre-loaded schemas for all 4 formats
  3. No repeated file loading - schemas stay in memory

Total schema operations: 1 inference, 0 file loads

Key Changes

src/benchmark.py

  • Added: logger.info("Inferring schemas once for all conversions...")
  • Stores schemas: schemas = reader.schemas
  • Passes same schema_reader to all Converters
  • Sets schema_report_path=None to prevent file loading

####src/converter.py

  • Updated convert_all() to check for pre-loaded schemas first:
    if self.schema_reader.schemas:
        logger.info("Using pre-loaded schemas from SchemaReader")
        schemas = self.schema_reader.schemas
    elif self.schema_report_path:
        logger.info("Loading schemas from schema report...")
        schemas = SchemaReader.load_schemas_from_json(...)

Result

Schemas inferred once and reused across all 4 format conversions
No redundant file I/O
Cleaner logs - only one schema inference message
Faster benchmarking - eliminates 4x schema loading operations

The logs now show:

2025-11-23 18:05:44 - INFO - Inferring schemas once for all conversions...
2025-11-23 18:05:44 - INFO - Found 6 JSON file(s) in data
[Schema inference happens ONCE - ~22 seconds]
2025-11-23 18:06:06 - INFO - Benchmarking parquet conversion...
2025-11-23 18:06:06 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to parquet]
2025-11-23 18:06:10 - INFO - Benchmarking csv conversion...
2025-11-23 18:06:10 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to CSV]
2025-11-23 18:06:13 - INFO - Benchmarking avro conversion...
2025-11-23 18:06:13 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to Avro]
2025-11-23 18:06:16 - INFO - Benchmarking orc conversion...
2025-11-23 18:06:16 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to ORC]

No more repeated schema inference or file loading!