Abraxos is a lightweight Python toolkit for robust, row-aware data processing using Pandas and Pydantic. It helps you build resilient ETL pipelines that gracefully handle errors at the row level.
Traditional data pipelines fail completely when they encounter a single bad row. Abraxos changes that:
- Fault-tolerant by design - isolate and recover from row-level errors
- Full error visibility - see exactly which rows failed and why
- Automatic retry logic - recursive splitting to isolate problem rows (see the sketch after the feature list below)
- Production-ready - 118 tests, 92% coverage, type-safe
CSV Ingestion with Bad Line Recovery
Read CSVs in full or in chunks, automatically capturing malformed lines separately.

Transform DataFrames Resiliently
Apply transformation functions and automatically isolate rows that fail.

Pydantic-Based Row Validation
Validate each row using Pydantic models, separating valid and invalid records.

SQL Insertion with Error Splitting
Insert DataFrames into SQL databases with automatic retry and chunking for failed rows.
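The retry logic is divide and conquer: when an operation fails on a batch, the batch is halved and each half is retried, until the failing rows stand alone. Here is a minimal sketch of the idea (a hypothetical re-implementation for illustration, not Abraxos's actual code):

import pandas as pd

def apply_with_isolation(df, func):
    """Apply func to df; on failure, split and recurse to isolate bad rows."""
    try:
        return func(df), df.iloc[0:0]  # whole batch succeeded, no failures
    except Exception:
        if len(df) <= 1:
            return df.iloc[0:0], df  # a single row that still fails: isolate it
        mid = len(df) // 2
        ok_top, bad_top = apply_with_isolation(df.iloc[:mid], func)
        ok_bot, bad_bot = apply_with_isolation(df.iloc[mid:], func)
        return pd.concat([ok_top, ok_bot]), pd.concat([bad_top, bad_bot])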
pip install abraxos

With optional dependencies:
# For SQL support
pip install abraxos[sql]
# For Pydantic validation
pip install abraxos[validate]
# For development
pip install abraxos[dev]
# Everything
pip install abraxos[all]

Requirements:
- Python 3.10+
- pandas >= 1.5.0
- numpy >= 1.23.0
- Optional: sqlalchemy >= 2.0.0
- Optional: pydantic >= 2.0.0
Full documentation is available at: https://abraxos.readthedocs.io
Here are real, tested examples showing Abraxos in action:
Abraxos captures malformed lines instead of crashing your pipeline:
from abraxos import read_csv
# Read a CSV that has some malformed lines
result = read_csv("data.csv")
print("Bad lines:", result.bad_lines)
print("\nClean data:")
print(result.dataframe)

Output:
Bad lines: [['TOO', 'MANY', 'COLUMNS', 'HERE']]
Clean data:
id name age
0 1 Joe 28
1 2 Alice 35
2 3 Marcus 40
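Because bad_lines is a plain list of token lists, the captured rows stay available for inspection or manual repair. A small sketch (the repair step itself is left to you):

import pandas as pd
from abraxos import read_csv

result = read_csv("data.csv")
# Rebuild the malformed rows into a DataFrame for manual inspection
bad_df = pd.DataFrame(result.bad_lines)
print(bad_df)  # fix the rows by hand, then append them to result.dataframe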
Apply transformations that automatically isolate problematic rows:
import pandas as pd
from abraxos import transform
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': [' Joe ', ' Alice ', ' Marcus '],
    'age': [28, 35, 40]
})

def clean_data(df):
    df = df.copy()
    df["name"] = df["name"].str.strip().str.lower()
    return df
result = transform(df, clean_data)
print("Errors:", result.errors)
print("\nSuccess DataFrame:")
print(result.success_df)

Output:
Errors: []
Success DataFrame:
id name age
0 1 joe 28
1 2 alice 35
2 3 marcus 40
When transformation fails on some rows, Abraxos automatically isolates them:
import pandas as pd
from abraxos import transform
df = pd.DataFrame({'value': [1, 2, 0, 3, 4]})
def divide_by_value(df):
    df = df.copy()
    if (df['value'] == 0).any():
        raise ValueError('Cannot divide by zero')
    df['result'] = 100 / df['value']
    return df
result = transform(df, divide_by_value)
print(f"Errors encountered: {len(result.errors)}")
print(f"\nSuccessful rows ({len(result.success_df)}):")
print(result.success_df)
print(f"\nFailed rows ({len(result.errored_df)}):")
print(result.errored_df)

Output:
Errors encountered: 1
Successful rows (4):
value result
0 1 100.000000
1 2 50.000000
3 3 33.333333
4 4 25.000000
Failed rows (1):
value
2 0
Notice how Abraxos automatically isolated the problematic row (value=0) and processed the rest!
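The isolated rows stay available in result.errored_df, so you can retry them with a corrected function. Continuing the example above (divide_safely is a hypothetical fallback, not part of Abraxos):

def divide_safely(df):
    df = df.copy()
    # Treat zero as missing instead of raising
    df['result'] = 100 / df['value'].replace(0, float('nan'))
    return df

retry_result = transform(result.errored_df, divide_safely)
print(retry_result.success_df)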
Validate each row and separate valid from invalid data:
import pandas as pd
from abraxos import validate
from pydantic import BaseModel
class Person(BaseModel):
    name: str
    age: int

df = pd.DataFrame({
    'name': ['Joe', 'Alice', 'Marcus'],
    'age': [28, 'invalid', 40]
})
result = validate(df, Person)
print("Valid rows:")
print(result.success_df)
print(f"\nNumber of validation errors: {len(result.errors)}")
print("\nInvalid rows:")
print(result.errored_df)

Output:
Valid rows:
name age
0 Joe 28
2 Marcus 40
Number of validation errors: 1
Invalid rows:
name age
1 Alice invalid
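Any Pydantic model works here, so validation can go beyond type checks. A short sketch using a constrained field (the Adult model and its bounds are illustrative):

import pandas as pd
from abraxos import validate
from pydantic import BaseModel, Field

class Adult(BaseModel):
    name: str
    age: int = Field(ge=18, le=120)  # reject out-of-range ages

df = pd.DataFrame({'name': ['Joe', 'Kid'], 'age': [28, 12]})
result = validate(df, Adult)
print(result.errored_df)  # the under-age row ends up here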
Insert data into SQL with automatic error handling:
import pandas as pd
from abraxos import to_sql
from sqlalchemy import create_engine
engine = create_engine("sqlite:///example.db")
df = pd.DataFrame({
    'name': ['Joe', 'Alice', 'Marcus'],
    'age': [28, 35, 40]
})
result = to_sql(df, "people", engine)
print(f"Successful inserts: {result.success_df.shape[0]}")
print(f"Failed rows: {result.errored_df.shape[0]}")Output:
Successful inserts: 3
Failed rows: 0
Data in database:
name age
0 Joe 28
1 Alice 35
2 Marcus 40
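If the target table enforces constraints, violating rows are split out the same way. A sketch, assuming to_sql appends to an existing table (the ages table and its NOT NULL constraint are illustrative):

import pandas as pd
from abraxos import to_sql
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///example.db")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE IF NOT EXISTS ages (age INTEGER NOT NULL)"))

df = pd.DataFrame({'age': [28.0, None, 40.0]})  # the NULL violates NOT NULL
result = to_sql(df, "ages", engine)
print(result.errored_df)  # the offending row, isolated from the rest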
Read and process large CSV files efficiently:
from abraxos import read_csv
# Read in chunks (chunksize=2 here so the output below is easy to follow;
# use a larger value such as 1000 for big files)
print("Reading in chunks of 2 rows:")
for i, chunk_result in enumerate(read_csv("large_file.csv", chunksize=2), start=1):
    print(f"Chunk {i}:")
    print(chunk_result.dataframe)
    # chunk_result.bad_lines holds any malformed lines in this chunk
    # ... your processing logic here

Output:
Reading in chunks of 2 rows:
Chunk 1:
id value
0 1 10
1 2 20
Chunk 2:
id value
2 3 30
3 4 40
Chunk 3:
id value
4 5 50
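Chunked reading composes naturally with the other tools: each chunk can be transformed (or validated) as it streams in. A sketch combining read_csv and transform (the clean function is illustrative):

import pandas as pd
from abraxos import read_csv, transform

def clean(df):
    df = df.copy()
    df['value'] = pd.to_numeric(df['value'], errors='raise')  # fail on non-numeric
    return df

clean_chunks, bad_rows = [], []
for chunk_result in read_csv("large_file.csv", chunksize=1000):
    result = transform(chunk_result.dataframe, clean)
    clean_chunks.append(result.success_df)
    bad_rows.append(result.errored_df)

combined = pd.concat(clean_chunks, ignore_index=True)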
Here's a complete example combining multiple features:
import pandas as pd
from abraxos import read_csv, transform, validate, to_sql
from pydantic import BaseModel
from sqlalchemy import create_engine
# 1. Extract: Read CSV with error recovery
csv_result = read_csv("messy_data.csv")
print(f"Captured {len(csv_result.bad_lines)} bad lines")
# 2. Transform: Clean the data
def clean_data(df):
    df = df.copy()
    df['name'] = df['name'].str.strip().str.title()
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    return df
transform_result = transform(csv_result.dataframe, clean_data)
print(f"Transformed {len(transform_result.success_df)} rows successfully")
# 3. Validate: Ensure data quality
class Person(BaseModel):
    name: str
    age: int
validate_result = validate(transform_result.success_df, Person)
print(f"Validated {len(validate_result.success_df)} rows")
print(f"Validation failed for {len(validate_result.errored_df)} rows")
# 4. Load: Insert into database
engine = create_engine("sqlite:///clean_data.db")
load_result = to_sql(validate_result.success_df, "people", engine)
print(f"Loaded {len(load_result.success_df)} rows to database")
# Save error reports
csv_result.bad_lines # Malformed CSV lines
transform_result.errored_df # Rows that failed transformation
validate_result.errored_df # Rows that failed validation
load_result.errored_df      # Rows that failed to insert
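Since three of these artifacts are ordinary DataFrames, a standard pandas call persists them for later review (the file names below are arbitrary):

# Persist the error artifacts as an audit trail
transform_result.errored_df.to_csv("errors_transform.csv", index=False)
validate_result.errored_df.to_csv("errors_validate.csv", index=False)
load_result.errored_df.to_csv("errors_load.csv", index=False)
with open("errors_bad_lines.txt", "w") as f:
    for line in csv_result.bad_lines:
        f.write(",".join(line) + "\n")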
read_csv(path, chunksize=None) - Read CSV files with automatic bad line recovery.
Returns: ReadCsvResult(bad_lines, dataframe), or a generator of results when reading in chunks.
transform(df, func) - Apply a transformation function with automatic error isolation.
Returns: TransformResult(errors, errored_df, success_df)
validate(df, model) - Validate DataFrame rows using a Pydantic model.
Returns: ValidateResult(errors, errored_df, success_df)
to_sql(df, table, engine) - Insert DataFrame into SQL database with retry logic.
Returns: ToSqlResult(errors, errored_df, success_df)
- split(df, n=2) - Split DataFrame into n parts
- clear(df) - Create empty DataFrame with same schema
- to_records(df) - Convert DataFrame to list of dicts with None for NaN
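A quick look at the utilities, assuming they are importable from the package top level (the sample data is illustrative):

import pandas as pd
from abraxos import split, clear, to_records

df = pd.DataFrame({'id': [1, 2, 3, 4], 'value': [10.0, None, 30.0, 40.0]})

halves = split(df, n=2)    # two smaller DataFrames covering all rows
empty = clear(df)          # zero rows, same columns
records = to_records(df)   # [{'id': 1, 'value': 10.0}, {'id': 2, 'value': None}, ...]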
Abraxos is thoroughly tested and type-safe:
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests with coverage (118 tests, 92% coverage)
pytest
# Run type checking
mypy abraxos # Success: no issues found
# Run linting and formatting
ruff check . # All checks passed
ruff format .

Test Coverage:
- 118 tests passing
- 92% code coverage
- All major code paths tested
- Type-safe with mypy
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Quick checklist:
- Add tests for new features
- Maintain 90%+ coverage
- Pass all type checks (mypy abraxos)
- Pass all lints (ruff check .)
- Update documentation
See CHANGELOG.md for version history and migration guides.
MIT License © 2024 Odos Matthews
Crafted by Odos Matthews to bring resilience and magic to data workflows.
If Abraxos helps your project, consider:
- Starring the repo
- Reporting issues
- Contributing improvements
- Sharing with others

Happy data processing!