
Abraxos


Abraxos is a lightweight Python toolkit for robust, row-aware data processing using Pandas and Pydantic. It helps you build resilient ETL pipelines that gracefully handle errors at the row level.

✨ Why Abraxos?

Traditional data pipelines fail completely when they encounter a single bad row. Abraxos changes that:

  • πŸ›‘οΈ Fault-tolerant by design - isolate and recover from row-level errors
  • πŸ” Full error visibility - see exactly which rows failed and why
  • πŸ”„ Automatic retry logic - recursive splitting to isolate problem rows (see the sketch after this list)
  • πŸ“Š Production-ready - 118 tests, 92% coverage, type-safe
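
Conceptually, the retry logic is divide and conquer: try the whole frame, and on failure split it and recurse until the failing rows stand alone. Here is a minimal sketch of that idea in plain pandas (the helper name apply_or_isolate is hypothetical, not Abraxos's actual implementation):

import pandas as pd

def apply_or_isolate(df, func):
    # Sketch of recursive error isolation (hypothetical helper).
    try:
        return func(df), df.iloc[0:0]      # whole frame succeeded, no failures
    except Exception:
        if len(df) <= 1:
            return df.iloc[0:0], df        # a single failing row: isolate it
        mid = len(df) // 2                 # split the frame and retry each half
        left_ok, left_bad = apply_or_isolate(df.iloc[:mid], func)
        right_ok, right_bad = apply_or_isolate(df.iloc[mid:], func)
        return pd.concat([left_ok, right_ok]), pd.concat([left_bad, right_bad])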

πŸš€ Features

  • πŸ“„ CSV Ingestion with Bad Line Recovery
    Read CSVs in full or in chunks, automatically capturing malformed lines separately.

  • πŸ” Transform DataFrames Resiliently
    Apply transformation functions and automatically isolate rows that fail.

  • πŸ§ͺ Pydantic-Based Row Validation
    Validate each row using Pydantic models, separating valid and invalid records.

  • πŸ›’οΈ SQL Insertion with Error Splitting
    Insert DataFrames into SQL databases with automatic retry and chunking for failed rows.


πŸ“¦ Installation

pip install abraxos

With optional dependencies:

# For SQL support
pip install abraxos[sql]

# For Pydantic validation
pip install abraxos[validate]

# For development
pip install abraxos[dev]

# Everything
pip install abraxos[all]

Requirements:

  • Python 3.10+
  • pandas >= 1.5.0
  • numpy >= 1.23.0
  • Optional: sqlalchemy >= 2.0.0
  • Optional: pydantic >= 2.0.0

πŸ“– Documentation

Full documentation is available at: https://abraxos.readthedocs.io


🎯 Quick Start

Here are real, tested examples showing Abraxos in action:

πŸ” Example 1: Read CSVs with Error Recovery

Abraxos captures malformed lines instead of crashing your pipeline:

from abraxos import read_csv

# Read a CSV that has some malformed lines
result = read_csv("data.csv")

print("Bad lines:", result.bad_lines)
print("\nClean data:")
print(result.dataframe)

Output:

Bad lines: [['TOO', 'MANY', 'COLUMNS', 'HERE']]

Clean data:
   id    name  age
0   1     Joe   28
1   2   Alice   35
2   3  Marcus   40

🧼 Example 2: Transform with Fault Isolation

Apply transformations that automatically isolate problematic rows:

import pandas as pd
from abraxos import transform

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['  Joe  ', '  Alice  ', '  Marcus  '],
    'age': [28, 35, 40]
})

def clean_data(df):
    df = df.copy()
    df["name"] = df["name"].str.strip().str.lower()
    return df

result = transform(df, clean_data)
print("Errors:", result.errors)
print("\nSuccess DataFrame:")
print(result.success_df)

Output:

Errors: []

Success DataFrame:
   id    name  age
0   1     joe   28
1   2   alice   35
2   3  marcus   40

⚑ Example 3: Automatic Error Isolation

When transformation fails on some rows, Abraxos automatically isolates them:

import pandas as pd
from abraxos import transform

df = pd.DataFrame({'value': [1, 2, 0, 3, 4]})

def divide_by_value(df):
    df = df.copy()
    if (df['value'] == 0).any():
        raise ValueError('Cannot divide by zero')
    df['result'] = 100 / df['value']
    return df

result = transform(df, divide_by_value)

print(f"Errors encountered: {len(result.errors)}")
print(f"\nSuccessful rows ({len(result.success_df)}):")
print(result.success_df)
print(f"\nFailed rows ({len(result.errored_df)}):")
print(result.errored_df)

Output:

Errors encountered: 1

Successful rows (4):
   value      result
0      1  100.000000
1      2   50.000000
3      3   33.333333
4      4   25.000000

Failed rows (1):
   value
2      0

Notice how Abraxos automatically isolated the problematic row (value=0) and processed the rest!


βœ… Example 4: Validate with Pydantic

Validate each row and separate valid from invalid data:

import pandas as pd
from abraxos import validate
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

df = pd.DataFrame({
    'name': ['Joe', 'Alice', 'Marcus'],
    'age': [28, 'invalid', 40]
})

result = validate(df, Person)

print("Valid rows:")
print(result.success_df)
print(f"\nNumber of validation errors: {len(result.errors)}")
print("\nInvalid rows:")
print(result.errored_df)

Output:

Valid rows:
     name  age
0     Joe   28
2  Marcus   40

Number of validation errors: 1

Invalid rows:
    name      age
1  Alice  invalid

πŸ—ƒοΈ Example 5: SQL Insertion with Retry Logic

Insert data into SQL with automatic error handling:

import pandas as pd
from abraxos import to_sql
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")

df = pd.DataFrame({
    'name': ['Joe', 'Alice', 'Marcus'],
    'age': [28, 35, 40]
})

result = to_sql(df, "people", engine)

print(f"Successful inserts: {result.success_df.shape[0]}")
print(f"Failed rows: {result.errored_df.shape[0]}")

print("\nData in database:")
print(pd.read_sql("SELECT * FROM people", engine))

Output:

Successful inserts: 3
Failed rows: 0

Data in database:
     name  age
0     Joe   28
1   Alice   35
2  Marcus   40

πŸ“š Example 6: Process Large Files in Chunks

Read and process large CSV files efficiently:

from abraxos import read_csv

# Read in chunks (chunksize=2 here so the demo output stays short;
# use something like 1000 for real files)
for i, chunk_result in enumerate(read_csv("large_file.csv", chunksize=2), start=1):
    print(f"\nChunk {i}:")
    print(chunk_result.dataframe)
    # ... your per-chunk processing logic here

Output:

Chunk 1:
   id  value
0   1     10
1   2     20

Chunk 2:
   id  value
2   3     30
3   4     40

Chunk 3:
   id  value
4   5     50

πŸ”„ Complete ETL Pipeline Example

Here's a complete example combining multiple features:

import pandas as pd
from abraxos import read_csv, transform, validate, to_sql
from pydantic import BaseModel
from sqlalchemy import create_engine

# 1. Extract: Read CSV with error recovery
csv_result = read_csv("messy_data.csv")
print(f"Captured {len(csv_result.bad_lines)} bad lines")

# 2. Transform: Clean the data
def clean_data(df):
    df = df.copy()
    df['name'] = df['name'].str.strip().str.title()
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    return df.dropna()

transform_result = transform(csv_result.dataframe, clean_data)
print(f"Transformed {len(transform_result.success_df)} rows successfully")

# 3. Validate: Ensure data quality
class Person(BaseModel):
    name: str
    age: int

validate_result = validate(transform_result.success_df, Person)
print(f"Validated {len(validate_result.success_df)} rows")
print(f"Validation failed for {len(validate_result.errored_df)} rows")

# 4. Load: Insert into database
engine = create_engine("sqlite:///clean_data.db")
load_result = to_sql(validate_result.success_df, "people", engine)
print(f"Loaded {len(load_result.success_df)} rows to database")

# Error artifacts are available at every stage for reporting:
#   csv_result.bad_lines          - malformed CSV lines
#   transform_result.errored_df   - rows that failed transformation
#   validate_result.errored_df    - rows that failed validation
#   load_result.errored_df        - rows that failed to insert
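
To persist these artifacts for auditing, ordinary pandas and standard-library calls are enough (a sketch; the file names are arbitrary):

import json

transform_result.errored_df.to_csv("transform_errors.csv", index=False)
validate_result.errored_df.to_csv("validation_errors.csv", index=False)
load_result.errored_df.to_csv("load_errors.csv", index=False)

with open("bad_lines.json", "w") as f:
    json.dump(csv_result.bad_lines, f, indent=2)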

πŸ—οΈ API Reference

Core Functions

read_csv(path, *, chunksize=None, **kwargs) -> ReadCsvResult | Generator

Read CSV files with automatic bad line recovery.

Returns: ReadCsvResult(bad_lines, dataframe) or generator of results if chunked.
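
Because the return type depends on chunksize, usage differs slightly between the two modes (a quick sketch, assuming a local data.csv):

from abraxos import read_csv

whole = read_csv("data.csv")                      # single ReadCsvResult
print(len(whole.dataframe), "rows,", len(whole.bad_lines), "bad lines")

for part in read_csv("data.csv", chunksize=500):  # generator of ReadCsvResult
    print(part.dataframe.shape)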

transform(df, transformer, chunks=2) -> TransformResult

Apply a transformation function with automatic error isolation.

Returns: TransformResult(errors, errored_df, success_df)

validate(df, model) -> ValidateResult

Validate DataFrame rows using a Pydantic model.

Returns: ValidateResult(errors, errored_df, success_df)

to_sql(df, name, con, *, if_exists='append', chunks=2, **kwargs) -> ToSqlResult

Insert DataFrame into SQL database with retry logic.

Returns: ToSqlResult(errors, errored_df, success_df)

Utility Functions

  • split(df, n=2) - Split DataFrame into n parts
  • clear(df) - Create empty DataFrame with same schema
  • to_records(df) - Convert DataFrame to list of dicts with None for NaN
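
A quick illustration of the helpers (a sketch; split is assumed to return a list of DataFrames, per its signature):

import pandas as pd
from abraxos import split, clear, to_records

df = pd.DataFrame({"a": [1.0, None], "b": ["x", "y"]})

parts = split(df, n=2)     # assumed: a list of two smaller DataFrames
empty = clear(df)          # empty DataFrame with the same columns
records = to_records(df)   # [{'a': 1.0, 'b': 'x'}, {'a': None, 'b': 'y'}]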

πŸ§ͺ Testing & Development

Abraxos is thoroughly tested and type-safe:

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests with coverage (118 tests, 92% coverage)
pytest

# Run type checking
mypy abraxos  # Success: no issues found

# Run linting and formatting
ruff check .  # All checks passed
ruff format .

Test Coverage:

  • 118 tests passing
  • 92% code coverage
  • All major code paths tested
  • Type-safe with mypy

🀝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Quick checklist:

  • βœ… Add tests for new features
  • βœ… Maintain 90%+ coverage
  • βœ… Pass all type checks (mypy abraxos)
  • βœ… Pass all lints (ruff check .)
  • βœ… Update documentation

πŸ“ Changelog

See CHANGELOG.md for version history and migration guides.


πŸ“„ License

MIT License Β© 2024 Odos Matthews


πŸ§™β€β™‚οΈ Author

Crafted by Odos Matthews to bring resilience and magic to data workflows.


⭐ Support

If Abraxos helps your project, consider:

  • ⭐ Starring the repo
  • πŸ› Reporting issues
  • 🀝 Contributing improvements
  • πŸ“’ Sharing with others

Happy data processing! πŸš€
