ParquetSharpLINQ


A high-performance LINQ provider for ParquetSharp that queries entire Hive-partitioned and Delta Lake Parquet datasets, not just single files, with automatic query optimization, predicate pushdown, and efficient column projection.

Features

  • Source Generated Mappers - Zero reflection for data mapping
  • Delta Lake Support - Automatic transaction log reading
  • Partition Pruning - Only scans matching partitions
  • Column Projection - Only reads requested columns
  • Indexed Column Predicates - Row-group pruning for indexed properties
  • Type Safe - Compile-time validation
  • Cross-Platform - Works on Windows and Linux
  • Azure Blob Storage - Stream directly from cloud storage

Quick Start

Installation

dotnet add package ParquetSharpLINQ

For Azure Blob Storage support:

dotnet add package ParquetSharpLINQ.Azure

Define Your Entity

using ParquetSharpLINQ.Attributes;

public class SalesRecord
{
    [ParquetColumn("id")]
    public long Id { get; set; }
    
    [ParquetColumn("product_name")]
    public string ProductName { get; set; }
    
    [ParquetColumn("total_amount")]
    public decimal TotalAmount { get; set; }
    
    [ParquetColumn("year", IsPartition = true)]
    public int Year { get; set; }
    
    [ParquetColumn("region", IsPartition = true)]
    public string Region { get; set; }
}

Query Local Files

using ParquetSharpLINQ;

using var table = ParquetTable<SalesRecord>.Factory.FromFileSystem("/data/sales");

var count = table.Count(s => s.Region == "eu-west" && s.Year == 2024);
var results = table.Where(s => s.TotalAmount > 1000).ToList();

Query Delta Lake Tables

using ParquetSharpLINQ;

// Automatically detects and reads _delta_log/
using var table = ParquetTable<SalesRecord>.Factory.FromFileSystem("/data/delta-sales");

var results = table.Where(s => s.Year == 2024).ToList();

Query Azure Blob Storage

using ParquetSharpLINQ;
using ParquetSharpLINQ.Azure;  // Extension methods for Azure support

using var table = ParquetTable<SalesRecord>.Factory.FromAzureBlob(
    connectionString: "DefaultEndpointsProtocol=https;AccountName=...",
    containerName: "sales-data");

var results = table.Where(s => s.Year == 2024).ToList();

Directory Structure

Hive-style partitioning:

/data/sales/
├── year=2023/
│   ├── region=us-east/
│   │   └── data.parquet
│   └── region=eu-west/
│       └── data.parquet
└── year=2024/
    └── region=us-east/
        └── data.parquet

Delta Lake:

/data/delta-sales/
├── _delta_log/
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
└── year=2024/
    └── data.parquet

Key Features

Partition Pruning

All LINQ methods with predicates support automatic partition pruning:

  • Count(predicate), LongCount(predicate)
  • Any(predicate), All(predicate)
  • First(predicate), FirstOrDefault(predicate)
  • Single(predicate), SingleOrDefault(predicate)
  • Last(predicate), LastOrDefault(predicate)
  • Where(predicate)
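For example, a predicate on partition columns lets these operators skip non-matching directories entirely. A minimal sketch reusing the SalesRecord entity from Quick Start (partition values are illustrative):

// Only partitions under year=2024/region=eu-west are enumerated;
// other directories are never opened.
var exists = table.Any(s => s.Year == 2024 && s.Region == "eu-west");

// Pruning restricts scanning to region=us-east partitions; the
// TotalAmount filter is then evaluated on rows within them.
var firstBigSale = table.FirstOrDefault(s => s.Region == "us-east" && s.TotalAmount > 1000);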

Column Projection

Only the requested columns are read. Without a Select projection, only the mapped entity columns are read; partition columns are populated from directory metadata rather than read from the Parquet files:

var summary = table
    .Select(s => new { s.Id, s.ProductName })
    .ToList();
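Combining a filter with a projection works the same way. A sketch using the same entity; under the behavior described above, only the columns the query references (total_amount, id, product_name) need to be read:

var topSales = table
    .Where(s => s.TotalAmount > 1000)          // predicate column: total_amount
    .Select(s => new { s.Id, s.ProductName })  // projected columns: id, product_name
    .ToList();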

Indexed Column Predicates

Mark properties as indexed to enable fast row-group pruning for Count and Where predicates.

using ParquetSharpLINQ.Attributes;

public class SalesRecord
{
    [ParquetColumn("id", Indexed = true)]
    public long Id { get; set; }

    [ParquetColumn("product_name", Indexed = true, ComparerType = typeof(StringComparer))]
    public string ProductName { get; set; }
}

Notes:

  • Indexing reads column values per row group and keeps an in-memory cache per column and file.
  • ComparerType is optional; if omitted, the property type must implement IComparable or IComparable<T>.
  • Currently optimized constraints: equality, inequality, range comparisons, and string.StartsWith with StringComparison.Ordinal.
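Example queries that the listed constraint types can accelerate, using the indexed SalesRecord above (the filter values are illustrative):

// Range constraint on an indexed column: row groups whose cached
// min/max values cannot contain matches are skipped.
var recent = table.Count(s => s.Id >= 1_000_000);

// Ordinal StartsWith on an indexed string column.
var widgets = table
    .Where(s => s.ProductName.StartsWith("Wid", StringComparison.Ordinal))
    .ToList();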

AllowMissing

Use AllowMissing to permit missing columns (nullable properties only):

public class SalesRecord
{
    [ParquetColumn("optional_note", AllowMissing = true)]
    public string? OptionalNote { get; set; }
}

Automatic Type Conversion

Partition values parsed from directory names (strings) are converted to the property type:

  • "06" → 6 (int)
  • "2024-12-07" → DateTime or DateOnly

Case-Insensitive Matching

Column names and partition values are matched case-insensitively. For partition filtering, use lowercase values or a case-insensitive comparison:

// Recommended: lowercase (matches normalized values)
table.Where(s => s.Region == "us-east")

// Alternative: case-insensitive comparison
table.Where(s => s.Region.Equals("US-EAST", StringComparison.OrdinalIgnoreCase))

Delta Lake Support

Automatically detects the _delta_log/ directory and:

  • Reads the transaction log files
  • Queries only active files (respects deletes and updates)
  • Falls back to Hive-style scanning if no Delta log is found

Supported: Add, Remove, Metadata, Protocol actions
Not supported: Time travel, Checkpoints (uses JSON logs only)

Testing

# All tests
dotnet test

# Unit tests only
dotnet test --filter "Category=Unit"

# Integration tests
dotnet test --filter "Category=Integration"

See ParquetSharpLINQ.Tests/README.md for details.

Performance

Benchmark results with 180 partitions (900K records):

Query                              Partitions Read   Speedup
Full scan                          180/180           1.0x
region='eu-west'                   36/180            ~5x
year=2024 AND region='eu-west'     12/180            ~15x

Indexed column benchmarks (180 partitions, 540 parquet files, 5,400,000 records):

Method                                                                                   Mean           Error           StdDev         Allocated
Indexed      .Count(r => r.ClientId.StartsWith("46"))                                    7.035 ms       0.1072 ms       0.0166 ms      105.62 KB
Non-indexed  .Count(r => r.ClientId.StartsWith("46"))                                    5,938.280 ms   1,191.0275 ms   309.3061 ms    6,973,873.87 KB
Indexed      .Where(r => r.ClientId.StartsWith("46")).ToList()                           343.405 ms     8.3807 ms       2.1765 ms      352,443.88 KB
Non-indexed  .Where(r => r.ClientId.StartsWith("46")).ToList()                           6,274.230 ms   1,088.8686 ms   168.5036 ms    6,974,388.8 KB
Indexed      .Where(r => r.ClientId.StartsWith("46")).Select(r => r.ProductName).ToList()     254.965 ms     11.1857 ms      2.9049 ms      250,093.02 KB
Non-indexed  .Where(r => r.ClientId.StartsWith("46")).Select(r => r.ProductName).ToList()     3,902.831 ms   499.7335 ms     129.7792 ms    4,668,789.84 KB

Benchmarks

cd ParquetSharpLINQ.Benchmarks

# Generate test data
dotnet run -c Release -- generate ./data 5000

# Run benchmarks  
dotnet run -c Release -- analyze ./data

See ParquetSharpLINQ.Benchmarks/README.md for details.

Requirements

  • .NET 8.0 or higher
  • ParquetSharp 21.0.0+

Architecture

  • ParquetSharpLINQ - Core LINQ query provider with composition-based design
    • ParquetTable<T> - Main queryable interface (implements IQueryable<T>)
    • ParquetTableFactory<T> - Factory for creating ParquetTable instances
    • ParquetQueryProvider<T> - LINQ expression tree visitor and query optimizer
    • ParquetEnumerationStrategy<T> - Executes queries with partition pruning, indexing, and column projection
    • IPartitionDiscoveryStrategy - Pluggable partition discovery interface
    • IParquetReader - Pluggable Parquet reading interface
    • FileSystemPartitionDiscovery - Discovers Hive partitions and Delta logs from local filesystem
    • ParquetSharpReader - Reads Parquet files from local filesystem
  • ParquetSharpLINQ.Generator - Source generator for zero-reflection data mappers
  • ParquetSharpLINQ.Azure - Azure Blob Storage extension package
    • ParquetTableFactoryExtensions - Adds FromAzureBlob() factory method
    • AzureBlobPartitionDiscovery - Discovers partitions and Delta logs from Azure Blob Storage
    • AzureBlobParquetReader - Streams Parquet files from Azure with LRU caching
  • ParquetSharpLINQ.Tests - Unit and integration tests
  • ParquetSharpLINQ.Benchmarks - Performance testing

License

MIT License - see LICENSE for details.

Author

Kornél Naszály - GitHub
