SchemaForge/BENCHMARK_OPTIMIZATION.md at main · Syntax-Error-1337/SchemaForge

Benchmark Optimization Summary

Before Optimization

The benchmark repeated schema work across its phases:

Once for the schema benchmark (inference)
Once at the start of the conversion benchmark (inference, ~22 seconds)
Four more times during conversion, when each format reloaded the schemas from the JSON file

Total schema operations during conversion: 5 (1 inference + 4 file loads)

After Optimization

The benchmark now infers schemas only ONCE:

Infer schemas at the start of the conversion benchmark (~22 seconds)
Reuse the same `SchemaReader` instance, with its pre-loaded schemas, for all 4 formats
No repeated file loading: schemas stay in memory

Total schema operations: 1 inference, 0 file loads
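
The difference in operation counts can be illustrated with a small counting sketch. `CountingReader`, `infer`, `load_from_file`, `run_before`, and `run_after` are hypothetical stand-ins written for this example, not SchemaForge's actual API; the four format names are taken from the log output in the Result section.

```python
from dataclasses import dataclass, field

# The four output formats named in the benchmark logs.
FORMATS = ["parquet", "csv", "avro", "orc"]


@dataclass
class CountingReader:
    """Hypothetical stand-in for SchemaReader that only counts operations."""

    inferences: int = 0
    file_loads: int = 0
    schemas: dict = field(default_factory=dict)

    def infer(self):
        # Placeholder for the expensive inference step (~22 s in the real run).
        self.inferences += 1
        self.schemas = {"example.json": {"id": "integer"}}
        return self.schemas

    def load_from_file(self):
        # Placeholder for re-reading the schema report from disk.
        self.file_loads += 1
        return dict(self.schemas)


def run_before(reader):
    """Old flow: infer once, then reload the report for every format."""
    reader.infer()
    return [(fmt, reader.load_from_file()) for fmt in FORMATS]


def run_after(reader):
    """New flow: infer once, then hand the same in-memory dict to every format."""
    schemas = reader.infer()
    return [(fmt, schemas) for fmt in FORMATS]
```

With these stand-ins, `run_before` records 1 inference and 4 file loads, while `run_after` records 1 inference and 0 file loads, matching the totals stated above.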

Key Changes

`src/benchmark.py`

Added a log message: `logger.info("Inferring schemas once for all conversions...")`
Stores the inferred schemas: `schemas = reader.schemas`
Passes the same `schema_reader` to every `Converter`
Sets `schema_report_path=None` to prevent file loading
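
Taken together, these changes might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the contents of `src/benchmark.py`: the `SchemaReader` and `Converter` classes here are simplified stand-ins, and `infer_all`, `output_format`, and `build_converters` are hypothetical names. Only `schemas`, `schema_reader`, `schema_report_path`, and the log message come from the list above.

```python
import logging

logger = logging.getLogger("benchmark_sketch")

FORMATS = ["parquet", "csv", "avro", "orc"]


class SchemaReader:
    """Simplified stand-in; the real class lives in the project."""

    def __init__(self):
        self.schemas = {}

    def infer_all(self):  # hypothetical method name
        # Placeholder for the real inference over the input JSON files.
        self.schemas = {"example.json": {"id": "integer"}}
        return self.schemas


class Converter:
    """Simplified stand-in that only records its constructor arguments."""

    def __init__(self, schema_reader, schema_report_path=None, output_format="parquet"):
        self.schema_reader = schema_reader
        self.schema_report_path = schema_report_path
        self.output_format = output_format  # hypothetical parameter


def build_converters():
    """Infer once, then give every Converter the same reader instance."""
    reader = SchemaReader()
    logger.info("Inferring schemas once for all conversions...")
    reader.infer_all()
    schemas = reader.schemas  # kept in memory for the whole run

    converters = [
        # schema_report_path=None means there is no report file to fall back to.
        Converter(schema_reader=reader, schema_report_path=None, output_format=fmt)
        for fmt in FORMATS
    ]
    return reader, schemas, converters
```

Because every `Converter` holds a reference to the same reader object, no copy of the schemas is made and nothing needs to be read back from disk.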

`src/converter.py`

Updated `convert_all()` to check for pre-loaded schemas first:

```python
if self.schema_reader.schemas:
    logger.info("Using pre-loaded schemas from SchemaReader")
    schemas = self.schema_reader.schemas
elif self.schema_report_path:
    logger.info("Loading schemas from schema report...")
    schemas = SchemaReader.load_schemas_from_json(...)
```
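
The same precedence rule can be exercised in isolation. The sketch below is a self-contained approximation, not the project's code: `resolve_schemas` is a hypothetical helper, the module-level `load_schemas_from_json` is a stand-in that assumes the report is a plain JSON object, and the final `ValueError` branch is an assumption, since the excerpt above does not show what happens when neither source is available.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("converter_sketch")


def load_schemas_from_json(path):
    """Stand-in loader: assumes the schema report is a plain JSON object."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def resolve_schemas(preloaded, schema_report_path):
    """Prefer in-memory schemas; fall back to the report file if one is given."""
    if preloaded:
        logger.info("Using pre-loaded schemas from SchemaReader")
        return preloaded
    if schema_report_path:
        logger.info("Loading schemas from schema report...")
        return load_schemas_from_json(schema_report_path)
    # Assumption: the original snippet does not show this branch.
    raise ValueError("No pre-loaded schemas and no schema report path")
```

Note that the truthiness check treats an empty dict as "not loaded", so a reader that has not run inference yet still falls through to the file-based path.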

Result

✅ Schemas inferred once and reused across all 4 format conversions
✅ No redundant file I/O
✅ Cleaner logs - only one schema inference message
✅ Faster benchmarking - eliminates the 4 schema file loads

The logs now show:

2025-11-23 18:05:44 - INFO - Inferring schemas once for all conversions...
2025-11-23 18:05:44 - INFO - Found 6 JSON file(s) in data
[Schema inference happens ONCE - ~22 seconds]
2025-11-23 18:06:06 - INFO - Benchmarking parquet conversion...
2025-11-23 18:06:06 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to parquet]
2025-11-23 18:06:10 - INFO - Benchmarking csv conversion...
2025-11-23 18:06:10 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to CSV]
2025-11-23 18:06:13 - INFO - Benchmarking avro conversion...
2025-11-23 18:06:13 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to Avro]
2025-11-23 18:06:16 - INFO - Benchmarking orc conversion...
2025-11-23 18:06:16 - INFO - Using pre-loaded schemas from SchemaReader ✅
[Converts all files to ORC]
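
The per-phase timings implied by this log can be recovered by subtracting consecutive timestamps. The sketch below uses only the phase-start lines shown above; `parse_timestamp` and `phase_durations` are illustrative helper names. The ORC phase has no following timestamp in the excerpt, so its duration cannot be derived and is left out.

```python
from datetime import datetime

# Phase-start lines copied from the log excerpt above.
LOG_LINES = [
    "2025-11-23 18:05:44 - INFO - Inferring schemas once for all conversions...",
    "2025-11-23 18:06:06 - INFO - Benchmarking parquet conversion...",
    "2025-11-23 18:06:10 - INFO - Benchmarking csv conversion...",
    "2025-11-23 18:06:13 - INFO - Benchmarking avro conversion...",
    "2025-11-23 18:06:16 - INFO - Benchmarking orc conversion...",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' prefix of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def phase_durations(lines):
    """Map each phase message to the seconds elapsed until the next phase starts."""
    stamps = [parse_timestamp(line) for line in lines]
    durations = {}
    # zip stops at the shortest input, so the last line has no duration.
    for line, start, end in zip(lines, stamps, stamps[1:]):
        message = line.split(" - INFO - ", 1)[1]
        durations[message] = int((end - start).total_seconds())
    return durations


durations = phase_durations(LOG_LINES)
```

For this excerpt the inference phase comes out at 22 seconds, consistent with the "~22 seconds" figure, followed by 4, 3, and 3 seconds for the Parquet, CSV, and Avro conversions.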

No more repeated schema inference or file loading!
