Home
Romain Ardiet edited this page Sep 21, 2025 · 2 revisions
- Instantiate a ParquetReader
import io.github.romibuzi.parquetdiff.ParquetReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
ParquetReader reader = new ParquetReader(FileSystem.get(new Configuration()));
// or (equivalent)
ParquetReader reader = ParquetReader.getDefault();
- Read a Parquet directory
import io.github.romibuzi.parquetdiff.metadata.ParquetDetails;
import org.apache.hadoop.fs.Path;
import java.util.List;
List<ParquetDetails> parquets = reader.readParquetDirectory(new Path("my_data.parquet"));
// or
List<ParquetDetails> parquets = reader.readParquetDirectory("my_data.parquet");
- Compare all Parquet schemas, using pair-wise comparison
import io.github.romibuzi.parquetdiff.diff.ParquetComparator;
import io.github.romibuzi.parquetdiff.diff.ParquetSchemaDiff;
List<ParquetSchemaDiff> diffs = ParquetComparator.findSchemasDifferences(parquets);
if (!diffs.isEmpty()) {
ParquetSchemaDiff firstDiff = diffs.get(0);
}
- Inspect differences found
ParquetSchemaDiff represents one or more differences (missing field, additional field, field with a different type) between two individual Parquet files.
System.out.println("differences between " + firstDiff.getFirst() + " and " + firstDiff.getSecond());
System.out.println(firstDiff.getAdditionalNodes());
System.out.println(firstDiff.getMissingNodes());
System.out.println(firstDiff.getTypeDiffs());
System.out.println(firstDiff.getPrimitiveTypeDiffs());
System.out.println(firstDiff.getRepetitionDiffs());
- Read a single Parquet file
ParquetDetails file = reader.readParquetFile("my_file.parquet");
ParquetDetails other = reader.readParquetFile("my_other_file.parquet");
- Compare two Parquet files
import java.util.Optional;
Optional<ParquetSchemaDiff> diff = ParquetComparator.findSchemasDifferences(file, other);
The result is an Optional holding at most a single ParquetSchemaDiff.
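To make the three kinds of difference mentioned above (missing field, additional field, field with a different type) and the idea of pair-wise comparison more concrete, here is a minimal, self-contained sketch. It does not use or reproduce the library's implementation: `Schema`, `Diff`, `diff` and `pairwise` are hypothetical names introduced for illustration. The sketch makes several assumptions: a schema is modelled as a flat map of field name to type name, "missing" means present in the first file but absent from the second, "pair-wise" means every unordered pair of files, and an empty Optional means the two schemas are identical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.TreeSet;

public class SchemaDiffSketch {

    /** Hypothetical stand-in for one file's schema: field name -> type name. */
    public record Schema(String file, Map<String, String> fields) {}

    /** Hypothetical stand-in for a schema diff (not the library's ParquetSchemaDiff). */
    public record Diff(String first, String second,
                       TreeSet<String> additional,
                       TreeSet<String> missing,
                       TreeMap<String, String> typeChanges) {}

    /** Compares two schemas; returns an empty Optional when they are identical (assumption). */
    public static Optional<Diff> diff(Schema a, Schema b) {
        // Fields that only exist in the second schema.
        TreeSet<String> additional = new TreeSet<>(b.fields().keySet());
        additional.removeAll(a.fields().keySet());

        // Fields that only exist in the first schema.
        TreeSet<String> missing = new TreeSet<>(a.fields().keySet());
        missing.removeAll(b.fields().keySet());

        // Fields present in both schemas but declared with different types.
        TreeMap<String, String> typeChanges = new TreeMap<>();
        for (Map.Entry<String, String> e : a.fields().entrySet()) {
            String otherType = b.fields().get(e.getKey());
            if (otherType != null && !otherType.equals(e.getValue())) {
                typeChanges.put(e.getKey(), e.getValue() + " -> " + otherType);
            }
        }

        if (additional.isEmpty() && missing.isEmpty() && typeChanges.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new Diff(a.file(), b.file(), additional, missing, typeChanges));
    }

    /** Compares every unordered pair of schemas and keeps only the pairs that differ. */
    public static List<Diff> pairwise(List<Schema> schemas) {
        List<Diff> diffs = new ArrayList<>();
        for (int i = 0; i < schemas.size(); i++) {
            for (int j = i + 1; j < schemas.size(); j++) {
                diff(schemas.get(i), schemas.get(j)).ifPresent(diffs::add);
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        Schema a = new Schema("a.parquet", Map.of("id", "INT64", "name", "BINARY"));
        Schema b = new Schema("b.parquet", Map.of("id", "INT32", "email", "BINARY"));
        Schema c = new Schema("c.parquet", Map.of("id", "INT64", "name", "BINARY"));

        // Print one line per differing pair of files.
        for (Diff d : pairwise(List.of(a, b, c))) {
            System.out.println(d);
        }
    }
}
```

Real Parquet schemas are nested and also carry repetition levels, which this flat map deliberately ignores; the sketch is only meant to show how a list of per-pair diffs and a single optional diff relate to each other.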