Romain Ardiet edited this page Sep 21, 2025 · 2 revisions

How to use the library.

  1. Instantiate a ParquetReader
import io.github.romibuzi.parquetdiff.ParquetReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;


ParquetReader reader = new ParquetReader(FileSystem.get(new Configuration()));
// or, equivalently:
// ParquetReader reader = ParquetReader.getDefault();
  2. Read a Parquet directory
import io.github.romibuzi.parquetdiff.metadata.ParquetDetails;
import org.apache.hadoop.fs.Path;
import java.util.List;

List<ParquetDetails> parquets = reader.readParquetDirectory(new Path("my_data.parquet"));
// or, passing the path as a String:
// List<ParquetDetails> parquets = reader.readParquetDirectory("my_data.parquet");
  3. Compare all Parquet schemas, using pairwise comparison
import io.github.romibuzi.parquetdiff.diff.ParquetComparator;
import io.github.romibuzi.parquetdiff.diff.ParquetSchemaDiff;

List<ParquetSchemaDiff> diffs = ParquetComparator.findSchemasDifferences(parquets);

if (!diffs.isEmpty()) {
    ParquetSchemaDiff firstDiff = diffs.get(0);
}
  4. Inspect the differences found

A ParquetSchemaDiff represents one or more differences (a missing field, an additional field, or a field with a different type) between two individual Parquet files.

System.out.println("differences between " + firstDiff.getFirst() + " and " + firstDiff.getSecond());
System.out.println(firstDiff.getAdditionalNodes());
System.out.println(firstDiff.getMissingNodes());
System.out.println(firstDiff.getTypeDiffs());
System.out.println(firstDiff.getPrimitiveTypeDiffs());
System.out.println(firstDiff.getRepetitionDiffs());
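
The three kinds of difference named above (missing field, additional field, field with a different type) can be illustrated without the library. The sketch below is not how parquetdiff is implemented: it assumes a schema can be modelled as a flat map from field name to type name, and the class `SchemaDiffSketch` and its methods are hypothetical names introduced only for this example. It also assumes that "missing" means present in the first schema but absent from the second, which may not match the library's convention.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of parquetdiff: a schema is modelled
// as a flat map from field name to type name.
final class SchemaDiffSketch {

    // Fields present in `first` but absent from `second` (assumed convention).
    static List<String> missingFields(Map<String, String> first, Map<String, String> second) {
        List<String> missing = new ArrayList<>();
        for (String name : first.keySet()) {
            if (!second.containsKey(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Fields present in `second` but absent from `first`.
    static List<String> additionalFields(Map<String, String> first, Map<String, String> second) {
        return missingFields(second, first);
    }

    // Fields present in both schemas whose type names differ.
    static List<String> typeDiffs(Map<String, String> first, Map<String, String> second) {
        List<String> diffs = new ArrayList<>();
        for (Map.Entry<String, String> entry : first.entrySet()) {
            String otherType = second.get(entry.getKey());
            if (otherType != null && !otherType.equals(entry.getValue())) {
                diffs.add(entry.getKey() + ": " + entry.getValue() + " -> " + otherType);
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "INT64");
        first.put("name", "BINARY");
        first.put("price", "FLOAT");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "INT64");
        second.put("price", "DOUBLE");
        second.put("currency", "BINARY");

        System.out.println("missing:    " + missingFields(first, second));    // [name]
        System.out.println("additional: " + additionalFields(first, second)); // [currency]
        System.out.println("type diffs: " + typeDiffs(first, second));        // [price: FLOAT -> DOUBLE]
    }
}
```

A real Parquet schema is a tree of nested groups with repetition levels, so the flat map here only conveys the idea behind the three categories.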

Alternate Use Cases

Read a single Parquet file

ParquetDetails file = reader.readParquetFile("my_file.parquet");
ParquetDetails other = reader.readParquetFile("my_other_file.parquet");

Compare two Parquet files

import java.util.Optional;

Optional<ParquetSchemaDiff> diff = ParquetComparator.findSchemasDifferences(file, other);

At most one ParquetSchemaDiff is returned, wrapped in an Optional.
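
Because the two-file comparison returns an `Optional`, the caller has to handle both the present and the empty case. The sketch below shows one way to do that with the standard `java.util.Optional` API. It uses a hypothetical `compareTypes` helper that returns a `String` in place of a `ParquetSchemaDiff`, and it assumes the Optional is empty when no difference is found, which the library's behavior should be checked against.

```java
import java.util.Optional;

// Hypothetical stand-in, not part of parquetdiff: a String plays the
// role of ParquetSchemaDiff so the example runs without the library.
final class OptionalDiffSketch {

    // Returns a description of the difference, or empty when the types match
    // (assumed semantics for the empty case).
    static Optional<String> compareTypes(String firstType, String secondType) {
        return firstType.equals(secondType)
                ? Optional.empty()
                : Optional.of(firstType + " -> " + secondType);
    }

    // Turns the Optional into a message, covering both cases.
    static String describe(Optional<String> diff) {
        return diff.map(d -> "schemas differ: " + d).orElse("schemas match");
    }

    public static void main(String[] args) {
        System.out.println(describe(compareTypes("INT64", "INT64")));  // schemas match
        System.out.println(describe(compareTypes("FLOAT", "DOUBLE"))); // schemas differ: FLOAT -> DOUBLE
    }
}
```

The same `map(...).orElse(...)` pattern, or `isPresent()` followed by `get()`, would apply to the `Optional<ParquetSchemaDiff>` from the example above.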
