Test PySpark code at lightning speed - no JVM required
Current release: 3.29.0
10x faster tests • Drop-in PySpark replacement • Zero JVM overhead • Thread-safe Polars backend
Tired of waiting 30+ seconds for Spark to initialize in every test?
Sparkless is a lightweight PySpark replacement that runs your tests 10x faster by eliminating JVM overhead. Your existing PySpark code works unchanged; just swap the import.
```python
# Before
from pyspark.sql import SparkSession

# After
from sparkless.sql import SparkSession
```
| Feature | Description |
|---|---|
| 10x Faster | No JVM startup (30s → 0.1s) |
| Drop-in Replacement | Use existing PySpark code unchanged |
| Zero Java | Pure Python with Polars backend (thread-safe, no SQL required) |
| 100% Compatible | Full PySpark 3.2-3.5 API support |
| Lazy Evaluation | Mirrors PySpark's execution model |
| Production Ready | 2314+ passing tests, 100% mypy typed |
| Thread-Safe | Polars backend designed for parallel execution |
| Modular Design | DDL parsing via the standalone spark-ddl-parser package |
| Type Safe | Full type checking with ty, comprehensive type annotations |
- Unit Testing - Fast, isolated test execution with automatic cleanup
- CI/CD Pipelines - Reliable tests without infrastructure or resource leaks
- Local Development - Prototype without Spark cluster
- Documentation - Runnable examples without setup
- Learning - Understand PySpark without complexity
- Integration Tests - Configurable memory limits for large dataset testing
```bash
pip install sparkless
```
Need help? Check out the full documentation for detailed guides, API reference, and examples.
```python
from sparkless.sql import SparkSession, functions as F
# Create session
spark = SparkSession("MyApp")
# Your PySpark code works as-is
data = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
df = spark.createDataFrame(data)
# All operations work
result = df.filter(F.col("age") > 25).select("name").collect()
print(result)
# Output: [Row(name='Bob')]
# Show the DataFrame
df.show()
# Output:
# DataFrame[2 rows, 2 columns]
# age name
# 25 Alice
# 30   Bob
```

```python
import pytest
from sparkless.sql import SparkSession, functions as F


def test_data_pipeline():
    """Test PySpark logic without a Spark cluster."""
    spark = SparkSession("TestApp")

    # Test data
    data = [{"score": 95}, {"score": 87}, {"score": 92}]
    df = spark.createDataFrame(data)

    # Business logic
    high_scores = df.filter(F.col("score") > 90)

    # Assertions
    assert high_scores.count() == 2
    assert high_scores.agg(F.avg("score")).collect()[0][0] == 93.5

    # Always clean up
    spark.stop()
```
Sparkless implements 120+ functions and 70+ DataFrame methods across PySpark 3.0-3.5:
| Category | Functions | Examples |
|---|---|---|
| String (40+) | Text manipulation, regex, formatting | upper, concat, regexp_extract, soundex |
| Math (35+) | Arithmetic, trigonometry, rounding | abs, sqrt, sin, cos, ln |
| DateTime (30+) | Date/time operations, timezones | date_add, hour, weekday, convert_timezone |
| Array (25+) | Array manipulation, lambdas | array_distinct, transform, filter, aggregate |
| Aggregate (20+) | Statistical functions | sum, avg, median, percentile, max_by |
| Map (10+) | Dictionary operations | map_keys, map_filter, transform_values |
| Conditional (8+) | Logic and null handling | when, coalesce, ifnull, nullif |
| Window (8+) | Ranking and analytics | row_number, rank, lag, lead |
| XML (9+) | XML parsing and generation | from_xml, to_xml, xpath_* |
| Bitwise (6+) | Bit manipulation | bit_count, bit_and, bit_xor |
See complete function list: PYSPARK_FUNCTION_MATRIX.md | Full API Documentation
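To make the categories above concrete, here is a minimal, hedged sketch combining a few of the listed functions (string, math, and conditional); the data and column names are purely illustrative:

```python
from sparkless.sql import SparkSession, functions as F

spark = SparkSession("FunctionsDemo")
df = spark.createDataFrame([
    {"name": "alice", "score": 91},
    {"name": "bob", "score": 78},
])

result = df.select(
    F.upper(F.col("name")).alias("name"),                     # string function
    F.round(F.sqrt(F.col("score")), 2).alias("score_sqrt"),   # math functions
    F.when(F.col("score") >= 80, "pass")                      # conditional logic
        .otherwise("fail").alias("grade"),
)
result.show()
spark.stop()
```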
- Transformations: `select`, `filter`, `withColumn`, `drop`, `distinct`, `orderBy`, `replace`
- Aggregations: `groupBy`, `agg`, `count`, `sum`, `avg`, `min`, `max`, `median`, `mode`
- Joins: `inner`, `left`, `right`, `outer`, `cross`
- Advanced: `union`, `pivot`, `unpivot`, `explode`, `transform`
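As an illustration of the DataFrame methods above, here is a small hedged sketch of a join followed by a grouped aggregation; the tables and column names are illustrative only:

```python
from sparkless.sql import SparkSession, functions as F

spark = SparkSession("JoinDemo")
employees = spark.createDataFrame([
    {"id": 1, "name": "Alice", "dept_id": 10},
    {"id": 2, "name": "Bob", "dept_id": 20},
    {"id": 3, "name": "Charlie", "dept_id": 10},
])
departments = spark.createDataFrame([
    {"dept_id": 10, "dept": "IT"},
    {"dept_id": 20, "dept": "HR"},
])

# Inner join, then count employees per department
summary = (
    employees.join(departments, on="dept_id", how="inner")
    .groupBy("dept")
    .agg(F.count("id").alias("headcount"))
    .orderBy("dept")
)
summary.show()
spark.stop()
```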
```python
from sparkless.sql import Window, functions as F

# Ranking and analytics
df = spark.createDataFrame([
    {"name": "Alice", "dept": "IT", "salary": 50000},
    {"name": "Bob", "dept": "HR", "salary": 60000},
    {"name": "Charlie", "dept": "IT", "salary": 70000},
])

result = df.withColumn("rank", F.row_number().over(
    Window.partitionBy("dept").orderBy("salary")
))

# Show results
for row in result.collect():
    print(row)

# Output:
# Row(dept='HR', name='Bob', salary=60000, rank=1)
# Row(dept='IT', name='Alice', salary=50000, rank=1)
# Row(dept='IT', name='Charlie', salary=70000, rank=2)
```

```python
df = spark.createDataFrame([
    {"name": "Alice", "salary": 50000},
    {"name": "Bob", "salary": 60000},
    {"name": "Charlie", "salary": 70000},
])
# Create temporary view for SQL queries
df.createOrReplaceTempView("employees")
# Execute SQL queries
result = spark.sql("SELECT name, salary FROM employees WHERE salary > 50000")
result.show()
# SQL support enables querying DataFrames using SQL syntax
```
Full Delta Lake table format support:
```python
# Write as Delta table
df.write.format("delta").mode("overwrite").saveAsTable("catalog.users")
# Time travel - query historical versions
v0_data = spark.read.format("delta").option("versionAsOf", 0).table("catalog.users")
# Schema evolution
new_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("catalog.users")
# MERGE operations for upserts
spark.sql("""
MERGE INTO catalog.users AS target
USING updates AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")Sparkless mirrors PySpark's lazy execution model:
# Transformations are queued (not executed)
result = df.filter(F.col("age") > 25).select("name")
# Actions trigger execution
rows = result.collect()   # <- Execution happens here
count = result.count()    # <- Or here
```
DataFrame operation chains are automatically optimized using Common Table Expressions (CTEs):
```python
# Enable lazy evaluation for CTE optimization
data = [
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 70000},
    {"name": "David", "age": 28, "salary": 55000},
]
df = spark.createDataFrame(data)
# This entire chain executes as ONE optimized query:
result = (
    df.filter(F.col("age") > 25)                  # CTE 0: WHERE clause
    .select("name", "age", "salary")              # CTE 1: Column selection
    .withColumn("bonus", F.col("salary") * 0.1)   # CTE 2: New column
    .orderBy(F.desc("salary"))                    # CTE 3: ORDER BY
    .limit(2)                                     # CTE 4: LIMIT
).collect() # Single query execution here
# Result:
# [Row(name='Charlie', age=35, salary=70000, bonus=7000.0),
# Row(name='Bob', age=30, salary=60000, bonus=6000.0)]
# Performance: 5-10x faster than creating 5 intermediate tables
```
Sparkless uses Polars as the default backend, providing:
- Thread Safety - Designed for parallel execution
- High Performance - Optimized DataFrame operations
- Parquet Storage - Tables persist as Parquet files
- Lazy Evaluation - Automatic query optimization
```python
# Default backend (Polars) - thread-safe, high-performance
spark = SparkSession("MyApp")

# Explicit backend selection
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "polars") \
    .getOrCreate()
```

```python
# Memory backend for lightweight testing
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "memory") \
    .getOrCreate()

# File backend for persistent storage
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "file") \
    .config("spark.sparkless.backend.basePath", "/tmp/sparkless") \
    .getOrCreate()
```
Tables created with `saveAsTable()` can persist across multiple sessions:
```python
# First session - create table
spark1 = SparkSession("App1", db_path="test.db")
df = spark1.createDataFrame([{"id": 1, "name": "Alice"}])
df.write.mode("overwrite").saveAsTable("schema.my_table")
spark1.stop()
# Second session - table persists
spark2 = SparkSession("App2", db_path="test.db")
assert spark2.catalog.tableExists("schema", "my_table")  # True
result = spark2.table("schema.my_table").collect()  # Works!
spark2.stop()
```
Key Features:
- Cross-Session Persistence: Tables persist when using the `db_path` parameter
- Schema Discovery: Automatically discovers existing schemas and tables
- Catalog Synchronization: Reliable `catalog.tableExists()` checks
- Data Integrity: Full support for `append` and `overwrite` modes
Control memory usage and test isolation:
```python
# Default: 1GB memory limit, no disk spillover (best for tests)
spark = SparkSession("MyApp")
# Custom memory limit
spark = SparkSession("MyApp", max_memory="4GB")
# Allow disk spillover for large datasets
spark = SparkSession(
    "MyApp",
    max_memory="8GB",
    allow_disk_spillover=True  # Uses unique temp directory per session
)
```
Real-world test suite improvements:
| Operation | PySpark | Sparkless | Speedup |
|---|---|---|---|
| Session Creation | 30-45s | 0.1s | 300x |
| Simple Query | 2-5s | 0.01s | 200x |
| Window Functions | 5-10s | 0.05s | 100x |
| Full Test Suite | 5-10min | 30-60s | 10x |
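If you want to sanity-check session startup on your own machine, a trivial, illustrative measurement looks like the following; actual numbers will vary with hardware and workload:

```python
import time

from sparkless.sql import SparkSession

start = time.perf_counter()
spark = SparkSession("BenchmarkApp")  # no JVM startup to wait for
print(f"Session created in {time.perf_counter() - start:.3f}s")
spark.stop()
```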
- 9 New String & JSON Functions - Implemented missing PySpark functions for improved compatibility (see the short example below)
  - String Functions: `soundex()` (phonetic matching), `translate()` (character translation), `levenshtein()` (edit distance), `crc32()` (checksum), `xxhash64()` (deterministic hashing), `regexp_extract_all()` (array of regex matches), `substring_index()` (delimiter-based extraction)
  - JSON Functions: `get_json_object()` (JSONPath extraction), `json_tuple()` (multi-field extraction to separate columns)
  - All functions match PySpark behavior exactly, including edge cases (nulls, empty strings, invalid JSON)
    - `xxhash64` uses the deterministic XXHash64 algorithm (seed=42), matching PySpark output
    - `regexp_extract_all` extracts all regex matches as arrays (PySpark 3.5+ feature)
- Comprehensive Test Coverage - 18 new tests ensuring full PySpark compatibility
  - 8 parity tests validating exact PySpark behavior
  - 9 robust edge case tests covering nulls, empty strings, invalid JSON, missing paths/fields
  - All tests pass in both Sparkless (mock) and PySpark backends
- Documentation Updates - Updated CHANGELOG and PYSPARK_FUNCTION_MATRIX with implementation details
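A minimal, hedged sketch of two of the helpers listed above, `soundex()` and `get_json_object()`; the data is illustrative and the signatures are assumed to mirror PySpark's:

```python
from sparkless.sql import SparkSession, functions as F

spark = SparkSession("WhatsNewDemo")
df = spark.createDataFrame([
    {"name": "Robert", "payload": '{"city": "Austin", "zip": "78701"}'},
])

result = df.select(
    F.soundex(F.col("name")).alias("name_code"),                  # phonetic code, e.g. 'R163'
    F.get_json_object(F.col("payload"), "$.city").alias("city"),  # JSONPath extraction
)
result.show()
spark.stop()
```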
- Case-Insensitive Column Names - Complete refactoring with a centralized `ColumnResolver` system
  - Added `spark.sql.caseSensitive` configuration (default: `false`, matching PySpark)
  - All column resolution respects case sensitivity settings
  - Ambiguity detection for multiple columns differing only by case
  - Comprehensive test coverage: 34 unit tests + 17 integration tests
  - Issue #264: Fixed case-insensitive column resolution in `withColumn` with `F.col()`
  - Fixed case-sensitive mode enforcement for all DataFrame operations
- Tuple-Based DataFrame Creation - Fixed tuple-based data parameter support
  - Issue #270: Fixed `createDataFrame` with tuple-based data to convert tuples to dictionaries
  - All operations now work correctly: `.show()`, `.unionByName()`, `.fillna()`, `.replace()`, `.dropna()`, `.join()`, etc.
  - Strict length validation matching PySpark behavior (`LENGTH_SHOULD_BE_THE_SAME` error)
  - Supports tuple, list, and mixed tuple/dict/Row data
  - 23 comprehensive unit tests, all passing in Sparkless and PySpark modes
- PySpark Compatibility Enhancements (see the short example below)
  - Issue #247: Added `elementType` keyword argument support to `ArrayType` (PySpark convention)
  - Issue #260: Implemented `Column.eqNullSafe` for null-safe equality comparisons (`NULL <=> NULL` returns `True`)
  - Issue #261: Implemented full support for the `Column.between()` API with inclusive bounds
  - Issue #262: Fixed `ArrayType` initialization with positional arguments (e.g., `ArrayType(DoubleType(), True)`)
  - Issue #263: Fixed `isnan()` on string columns to match PySpark behavior (returns `False` for strings)
- Bug Fixes
  - Fixed `fillna()` to properly materialize lazy DataFrames after joins
  - Fixed all AttributeError issues with tuple-based data (`.keys()`, `.get()`, `.items()`, `.copy()`)
  - Fixed `isnan()` Polars backend errors on string columns
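A brief sketch of the null-safe equality and `between()` behaviors noted above; the data is illustrative and the Column API is assumed to match PySpark:

```python
from sparkless.sql import SparkSession, functions as F

spark = SparkSession("CompatDemo")
df = spark.createDataFrame([
    {"a": None, "b": None, "age": 25},
    {"a": 1, "b": 2, "age": 40},
])

result = df.select(
    F.col("a").eqNullSafe(F.col("b")).alias("null_safe_eq"),  # NULL <=> NULL evaluates to True
    F.col("age").between(20, 30).alias("in_range"),           # inclusive bounds: 20 <= age <= 30
)
result.show()
spark.stop()
```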
- Issue Fixes - Fixed 7 critical issues (225-231) improving PySpark compatibility:
  - Issue #225: String-to-numeric type coercion for comparison operations
  - Issue #226: `isin()` method with `*values` arguments and type coercion
  - Issue #227: `getItem()` out-of-bounds handling (returns `None` instead of errors)
  - Issue #228: Regex look-ahead/look-behind fallback support
  - Issue #229: Pandas DataFrame support with proper recognition
  - Issue #230: Case-insensitive column name matching across all operations
  - Issue #231: `simpleString()` method for all DataType classes
- SQL JOIN Parsing - Fixed SQL JOIN condition parsing and validation
- `select()` Validation - Fixed validation to properly handle `ColumnOperation` expressions
- Test Coverage - All 50 tests passing for issues 225-231, including pandas DataFrame support
- Code Quality - Applied ruff formatting, fixed linting issues, and resolved mypy type errors
- Exception Handling Fixes - Fixed critical exception handling issues (issue #183): replaced bare `except:` clause with `except Exception:` and added comprehensive logging to exception handlers for better debuggability.
- Comprehensive Test Coverage - Added 10 comprehensive test cases for string concatenation cache handling edge cases (issue #188), covering empty strings, None values, nested operations, and numeric vs string operations.
- Improved Documentation - Enhanced documentation for the string concatenation cache heuristic, documenting limitations and expected behavior vs PySpark.
- Code Quality Review - Systematic review of `dictionary.get()` usage patterns throughout the codebase, confirming all patterns are safe with appropriate default values.
- Type Safety - Fixed mypy errors in CI: improved type narrowing for `ColumnOperation.operation` and removed redundant casts in `writer.py`.
- Complete SQL DDL/DML - Full implementation of `CREATE TABLE`, `DROP TABLE`, `INSERT INTO`, `UPDATE`, and `DELETE FROM` statements in the SQL executor.
- Enhanced SQL Parser - Comprehensive support for DDL statements with column definitions, `IF NOT EXISTS`, and `IF EXISTS` clauses.
- INSERT Operations - Support for `INSERT INTO ... VALUES (...)` with multiple rows and `INSERT INTO ... SELECT ...` sub-queries.
- UPDATE & DELETE - Full support for `UPDATE ... SET ... WHERE ...` and `DELETE FROM ... WHERE ...` with Python-based expression evaluation.
- Bug Fixes - Fixed recursion errors in schema projection and resolved import shadowing issues in the SQL executor.
- Code Quality - Improved linting, formatting, and type safety across the codebase.
- Feature-Flagged Profiling - Introduced `sparkless.utils.profiling` with opt-in instrumentation for Polars hot paths and expression evaluation, plus a new guide at `docs/performance/profiling.md`.
- Adaptive Execution Simulation - Query plans can now inject synthetic `REPARTITION` steps based on skew metrics, configurable via `QueryOptimizer.configure_adaptive_execution` and covered by new regression tests.
- Pandas Backend Choice - Added an optional native pandas mode (`MOCK_SPARK_PANDAS_MODE`) with benchmarking support (`scripts/benchmark_pandas_fallback.py`) and documentation in `docs/performance/pandas_fallback.md`.
- Session-Literal Helpers - `F.current_catalog`, `F.current_database`, `F.current_schema`, and `F.current_user` return PySpark-compatible literals and understand the active session (with new regression coverage).
- Reliable Catalog Context - The Polars backend and unified storage manager now track the selected schema so `setCurrentDatabase` works end-to-end, and `SparkContext.sparkUser()` mirrors PySpark behaviour.
- Pure-Python Stats - Lightweight `percentile` and `covariance` helpers keep percentile/cov tests green even without NumPy, eliminating native-crash regressions.
- Dynamic Dispatch - `F.call_function("func_name", ...)` lets wrappers dynamically invoke registered Sparkless functions with PySpark-style error messages.
- Unified Commands - `Makefile`, `install.sh`, and docs now point to `bash tests/run_all_tests.sh`, `ruff`, and `mypy` as the standard dev workflow.
- Automated Gates - New GitHub Actions pipeline runs linting, type-checking, and the full test suite on every push and PR.
- Forward Roadmap - Published `plans/typing_delta_roadmap.md` to track mypy debt reduction and Delta feature milestones.
- Documentation Sweep - README and quick-start docs highlight the 3.4.0 tooling changes and contributor expectations.
- Zero mypy Debt - `mypy sparkless` now runs clean after migrating the Polars executor, expression evaluator, Delta merge helpers, and reader/writer stack to Python 3.8+ compatible type syntax.
- Accurate DataFrame Interfaces - `DataFrameReader.load()` and related helpers now return `IDataFrame` consistently while keeping type-only imports behind `TYPE_CHECKING`.
- Safer Delta & Projection Fallbacks - Python-evaluated select columns always receive string aliases, and Delta merge alias handling no longer leaks `None` keys into evaluation contexts.
- Docs & Metadata Updated - README highlights the new type guarantees and all packaging metadata points to v3.3.0.
- Python 3.8+ Required - Packaging metadata, tooling configs, and installation docs now align on Python 3.8 as the minimum supported runtime.
- Compatibility Layer - Uses `typing_extensions` for Python 3.8 compatibility; datetime helpers use native typing with proper fallbacks.
- Type Hint Modernisation - Uses `typing` module generics (`List`, `Dict`, `Tuple`) for Python 3.8 compatibility, with `from __future__ import annotations` for deferred evaluation.
- Ruff Formatting by Default - Adopted `ruff format` across the repository, keeping style consistent with the Ruff rule set.
- 260-File Type Coverage - DataFrame mixins now implement structural typing protocols (`SupportsDataFrameOps`), giving a clean `mypy` run across the entire project.
- Zero Ruff Debt - Repository-wide linting is enabled by default; `ruff check` passes with no warnings thanks to tighter casts, imports, and configuration.
- Backend Selection Docs - Updated configuration builder and new `docs/backend_selection.md` make it trivial to toggle between Polars, Memory, File, or DuckDB backends.
- Delta Schema Evolution Fixes - Polars mergeSchema appends now align frames to the on-disk schema, restoring compatibility with evolving Delta tables.
- Improved Test Harness - `tests/run_all_tests.sh` respects virtual environments and ensures documentation examples are executed with the correct interpreter.
Dependency Cleanup & Type Safety:
- Removed Legacy Dependencies - Removed unused `sqlglot` dependency (legacy DuckDB/SQL backend code)
- Code Cleanup - Removed unused legacy SQL translation modules (`sql_translator.py`, `spark_function_mapper.py`)
- Type Safety - Fixed 177 type errors using the `ty` type checker, improved return type annotations
- Linting - Fixed all 63 ruff linting errors, codebase fully formatted
- All Tests Passing - Full test suite validated (2314+ tests, all passing)
- Cleaner Dependencies - Reduced dependency footprint, faster installation
Polars Backend Migration:
- Polars Backend - Complete migration to Polars for thread-safe, high-performance operations
- Thread Safety - Polars is thread-safe by design: no more connection locks or threading issues
- Parquet Storage - Tables now persist as Parquet files
- Performance - Better performance for DataFrame operations
- All tests passing - Full test suite validated with the Polars backend
- Production-ready - Stable release with improved architecture
See Migration Guide for details.
Full documentation available at sparkless.readthedocs.io
- Installation & Setup
- Quick Start Guide
- Migration from PySpark
- spark-ddl-parser - Zero-dependency PySpark DDL schema parser
- API Reference
- Lazy Evaluation
- SQL Operations
- Storage & Persistence
- Configuration
- Benchmarking
- Plugins & Hooks
- Pytest Integration
- Threading Guide
- Memory Management
- CTE Optimization
```bash
# Install for development
git clone https://github.com/eddiethedean/sparkless.git
cd sparkless
pip install -e ".[dev]"
# Run all tests (with proper isolation)
bash tests/run_all_tests.sh
# Format code
ruff format .
ruff check . --fix
# Type checking
mypy sparkless tests
# Linting
ruff check .
```
We welcome contributions! Areas of interest:
- Performance - Further Polars optimizations
- Documentation - Examples, guides, tutorials
- Bug Fixes - Edge cases and compatibility issues
- PySpark API Coverage - Additional functions and methods
- Tests - Additional test coverage and scenarios
While Sparkless provides comprehensive PySpark compatibility, some advanced features are planned for future releases:
- Error Handling: Enhanced error messages with recovery strategies
- Performance: Advanced query optimization, parallel execution, intelligent caching
- Enterprise: Schema evolution, data lineage, audit logging
- Compatibility: PySpark 3.6+, Iceberg support
Want to contribute? These are great opportunities for community contributions!
MIT License - see LICENSE file for details.
- GitHub: github.com/eddiethedean/sparkless
- PyPI: pypi.org/project/sparkless
- Documentation: sparkless.readthedocs.io
- Issues: github.com/eddiethedean/sparkless/issues
Built with love for the PySpark community
Star this repo if Sparkless helps speed up your tests!