Command-line parquet file explorer. Built with Rust, Clap, Arrow and DataFusion.
$ cargo build --release
$ cargo run -- -f <file.parquet> -d
- -d (--describe) show table structure
- -f <filename> (--file <filename>) parquet file
- -q <query> (--query <query>) SQL query to run, e.g. "SELECT * FROM tablename", where tablename is the filename without its extension
- -n (--newfile) signals that a new file will be created, requires -o
- -o <outputfile> (--output <outputfile>) goes along with -n to dump the query result to a new parquet file
- -s (--singlefile) tells parquet-explorer to write a single parquet file instead of a partitioned parquet (partitioned output is the default behaviour)
DataFusion provides a PostgreSQL-compatible SQL dialect. The table name follows the parquet filename: if the file is called original_parquet, query against original_parquet. The information schema is enabled, so "SHOW COLUMNS FROM original_parquet" prints the columns of the original_parquet file.
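The naming rule above (table name = filename without its extension) can be sketched with the Rust standard library. This is only an illustration: `table_name_from_path` is a hypothetical helper, not necessarily how parquet-explorer derives the name internally.

```rust
use std::path::Path;

/// Hypothetical helper: derive the SQL table name from a parquet path
/// by dropping the directory part and the final extension, if any.
fn table_name_from_path(path: &str) -> Option<String> {
    Path::new(path)
        .file_stem()                    // "data/sales.parquet" -> "sales"
        .and_then(|stem| stem.to_str()) // None if the name is not valid UTF-8
        .map(|stem| stem.to_string())
}
```

Under this sketch, both `original_parquet` and `data/original_parquet.parquet` map to the table name `original_parquet`, while an empty path yields `None`.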
parquet-explorer -f original_parquet -q "SELECT * from original_parquet LIMIT 10" will return the first 10 rows of original_parquet.
parquet-explorer -f original_parquet -q "SELECT * from original_parquet LIMIT 10" -n -o destination_parquet will query original_parquet, take the first 10 rows and write them to a new parquet file.
parquet-explorer -f datasource.parquet -i will spawn a REPL prompt. You can query the data source with regular SQL statements, inspect the schema with the describe command, and quit with CTRL+D, exit or quit.
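The REPL behaviour described above can be sketched as a small dispatch loop. This is a minimal sketch, not the tool's actual implementation: `ReplCommand`, `parse_repl_line` and `read_commands` are hypothetical names, and treating the keywords case-insensitively and skipping blank lines are assumptions.

```rust
use std::io::BufRead;

/// Hypothetical classification of one line typed at the prompt.
#[derive(Debug, PartialEq)]
enum ReplCommand {
    Describe,
    Quit,
    Query(String),
    Empty,
}

/// Map a raw input line to a command. Anything that is not a known
/// keyword is treated as a SQL statement (assumption).
fn parse_repl_line(line: &str) -> ReplCommand {
    let trimmed = line.trim();
    match trimmed.to_ascii_lowercase().as_str() {
        "" => ReplCommand::Empty,
        "describe" => ReplCommand::Describe,
        "exit" | "quit" => ReplCommand::Quit,
        _ => ReplCommand::Query(trimmed.to_string()),
    }
}

/// Read lines until `exit`/`quit` or end of input (what CTRL+D sends on
/// a terminal), returning the commands a real REPL would execute.
fn read_commands<R: BufRead>(input: R) -> Vec<ReplCommand> {
    let mut commands = Vec::new();
    for line in input.lines() {
        let Ok(line) = line else { break }; // stop on a read error
        match parse_repl_line(&line) {
            ReplCommand::Empty => continue,
            ReplCommand::Quit => break,
            other => commands.push(other),
        }
    }
    commands
}
```

Passing `std::io::stdin().lock()` as `input` would drive the loop from the terminal; a real REPL would run each `Query` and print the schema on `Describe` instead of collecting them.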