
❄️ Articuno ❄️

Convert Polars, Pandas, and PySpark DataFrames, SQLAlchemy models, and SQLModel classes to Pydantic models with schema inference, and generate clean Python class code. Bidirectional conversion from Pydantic back to each of these formats is also supported.

Python 3.8+ · Type Checked · Code Style · Test Coverage


✨ Features

Core Functionality:

  • 🔍 Infer Pydantic models dynamically from Polars, Pandas, or PySpark DataFrames
  • 🗄️ Convert SQLAlchemy and SQLModel model classes to/from Pydantic
  • 📋 Infer models directly from iterables of dictionaries (SQL results, JSON records, etc.)
  • 🎯 Automatic type detection for basic types, nested structures, and temporal data
  • 🔄 Generator-based for memory-efficient processing of large datasets
  • 🔁 Bidirectional conversions between Pydantic and supported backends
  • 🎨 Generate clean Python model code using datamodel-code-generator

Advanced Features:

  • PyArrow support for high-performance Pandas columns (int64[pyarrow], string[pyarrow], timestamp[pyarrow], etc.)
  • 📅 Full temporal type support: datetime, date, timedelta across all backends
  • 🗂️ Nested structures: Supports nested dicts, lists, and complex hierarchies
  • 🔧 Optional field detection: Automatically identifies nullable fields
  • 🎛️ Configurable scanning: max_scan parameter to limit schema inference
  • 🔒 Force optional mode: Make all fields optional regardless of data
  • Model name validation: Ensures valid Python identifiers
  • 🧪 Comprehensively tested: 112 tests, 87% code coverage

Design:

  • 🪶 Lightweight, dependency-flexible architecture
  • 🔌 Optional dependencies for Polars, Pandas, PyArrow, SQLAlchemy, SQLModel, and PySpark
  • 🎯 Type-checked with mypy
  • 📏 Linted with ruff

📦 Installation

Install the core package:

pip install articuno

Add optional dependencies as needed:

# For Polars support
pip install articuno[polars]

# For Pandas support (with PyArrow)
pip install articuno[pandas]

# For SQLAlchemy support
pip install articuno[sqlalchemy]

# For SQLModel support
pip install articuno[sqlmodel]

# For PySpark support
pip install articuno[pyspark]

# Full install with all backends
pip install articuno[polars,pandas,sqlalchemy,sqlmodel,pyspark]

# Development dependencies (includes pytest, mypy, ruff)
pip install articuno[dev]

🚀 Quick Start

DataFrame to Pydantic Models

from articuno import df_to_pydantic, infer_pydantic_model
import polars as pl

# Create a DataFrame
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95.5, 88.0, 92.3],
    "active": [True, False, True]
})

# Convert to Pydantic instances (returns a generator)
instances = list(df_to_pydantic(df, model_name="UserModel"))
print(instances[0])
# Output: id=1 name='Alice' score=95.5 active=True

# Or just get the model class
Model = infer_pydantic_model(df, model_name="UserModel")
print(Model.model_json_schema())

Dict Iterables to Pydantic

Perfect for SQL query results, API responses, or JSON data:

from articuno import df_to_pydantic, infer_pydantic_model

# From database results, API responses, etc.
records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

# Automatically infer and create instances
instances = list(df_to_pydantic(records, model_name="User"))

# Or infer just the model
Model = infer_pydantic_model(records, model_name="User")

📓 Example Notebooks

Comprehensive Jupyter notebooks demonstrating all features:

Core Examples

Advanced Examples

Legacy Examples

Note: All example notebooks have been executed and saved with outputs, so you can view the results directly on GitHub without running them.


📚 Advanced Usage

Temporal Types Support

Articuno fully supports datetime, date, and timedelta types:

from datetime import datetime, date, timedelta
from articuno import infer_pydantic_model

data = [
    {
        "event_id": 1,
        "event_date": date(2024, 1, 15),
        "timestamp": datetime(2024, 1, 15, 10, 30),
        "duration": timedelta(hours=2, minutes=30)
    }
]

Model = infer_pydantic_model(data, model_name="Event")
# Fields will have correct datetime.date, datetime.datetime, datetime.timedelta types
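Continuing the example, you can inspect the inferred annotations through Pydantic v2's model_fields mapping; the printed output below is illustrative:

for name, field in Model.model_fields.items():
    print(name, "->", field.annotation)
# Roughly:
#   event_id -> <class 'int'>
#   event_date -> <class 'datetime.date'>
#   timestamp -> <class 'datetime.datetime'>
#   duration -> <class 'datetime.timedelta'>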

PyArrow-Backed Pandas Columns

Full support for high-performance PyArrow dtypes:

from datetime import datetime

import pandas as pd
import pyarrow as pa
from articuno import infer_pydantic_model

df = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64[pyarrow]"),
    "name": pd.Series(["Alice", "Bob", "Charlie"], dtype="string[pyarrow]"),
    "created": pd.Series([
        datetime(2024, 1, 1),
        datetime(2024, 1, 2),
        datetime(2024, 1, 3)
    ], dtype=pd.ArrowDtype(pa.timestamp("ms"))),
    "active": pd.Series([True, False, True], dtype="bool[pyarrow]")
})

Model = infer_pydantic_model(df, model_name="ArrowModel")

Supported PyArrow types:

  • int64[pyarrow], int32[pyarrow], etc.
  • string[pyarrow]
  • bool[pyarrow]
  • timestamp[pyarrow] → datetime.datetime
  • date32[pyarrow], date64[pyarrow] → datetime.date
  • duration[pyarrow] → datetime.timedelta

Nested Structures

Handle complex nested data with ease:

from articuno import infer_pydantic_model

data = [
    {
        "user_id": 1,
        "profile": {
            "name": "Alice",
            "age": 30,
            "preferences": {
                "theme": "dark",
                "notifications": True
            }
        },
        "tags": ["python", "data-science"]
    }
]

Model = infer_pydantic_model(data, model_name="UserProfile")
# Nested dicts become nested Pydantic models
# Lists are preserved with List[...] typing
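Continuing the example, instances built with df_to_pydantic expose the nested data as attributes; a short sketch (the exact nested model classes generated are an implementation detail):

from articuno import df_to_pydantic

instance = next(df_to_pydantic(data, model=Model))
print(instance.profile.name)               # "Alice"
print(instance.profile.preferences.theme)  # "dark"
print(instance.tags)                       # ["python", "data-science"]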

Force Optional Fields

Make all fields optional regardless of the data:

import polars as pl
from articuno import infer_pydantic_model

df = pl.DataFrame({
    "required": [1, 2, 3],
    "also_required": ["a", "b", "c"]
})

# Force all fields to be Optional
Model = infer_pydantic_model(df, force_optional=True)

# Now you can create instances with None values
instance = Model(required=None, also_required=None)

Limit Schema Scanning

For large datasets, limit how many records are scanned:

# Only scan first 100 records for schema inference
Model = infer_pydantic_model(
    large_dataset,
    model_name="LargeModel",
    max_scan=100
)
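The trade-off is that only the scanned records shape the schema; a field whose type only varies later in the data will not influence inference. A hedged sketch of that behavior, assuming inference looks solely at the scanned sample:

from articuno import infer_pydantic_model

# "note" is None in every scanned record, so it would likely be
# inferred as optional; the later string values are never seen.
records = [{"id": i, "note": None if i < 100 else "late"} for i in range(1_000)]
Model = infer_pydantic_model(records, model_name="Scanned", max_scan=100)

If your source is a generator rather than a list, remember that it is consumed as it is scanned, so materialize the data first if you also need to build instances from it.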

Memory-Efficient Processing

df_to_pydantic returns a generator for memory efficiency:

# Generator - memory efficient for large datasets
instances_gen = df_to_pydantic(large_df, model_name="Record")

# Process one at a time
for instance in instances_gen:
    process(instance)

# Or collect all at once if needed
instances_list = list(df_to_pydantic(df, model_name="Record"))

Code Generation

Generate clean Python code for your models:

from articuno import infer_pydantic_model
from articuno.codegen import generate_class_code

# Infer model from data
Model = infer_pydantic_model(data, model_name="User")

# Generate Python code
code = generate_class_code(Model)
print(code)

# Or save to file
code = generate_class_code(Model, output_path="models.py")

Pre-defined Models

Use a pre-defined model instead of inferring:

from pydantic import BaseModel
from articuno import df_to_pydantic

class UserModel(BaseModel):
    id: int
    name: str
    email: str

# Use your existing model
instances = list(df_to_pydantic(df, model=UserModel))

SQLAlchemy Model Conversion

Convert between SQLAlchemy declarative models and Pydantic:

from sqlalchemy import Column, Integer, String, Float
from sqlalchemy.orm import DeclarativeBase
from articuno import infer_pydantic_model, pydantic_to_sqlalchemy

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False)
    email = Column(String(255), nullable=True)

# Convert SQLAlchemy model to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")

# Convert Pydantic model to SQLAlchemy
from pydantic import BaseModel as PydanticBase

class ProductModel(PydanticBase):
    id: int
    name: str
    price: float

SQLAlchemyModel = pydantic_to_sqlalchemy(ProductModel, model_name="Product")
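The inferred model can then validate ORM objects directly; a minimal sketch using Pydantic v2's from_attributes validation:

# Validate an ORM instance against the inferred Pydantic model.
user = User(id=1, name="Alice", email="alice@example.com")
validated = PydanticModel.model_validate(user, from_attributes=True)
print(validated.name)  # "Alice"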

SQLModel Conversion

Convert between SQLModel and Pydantic (SQLModel already extends Pydantic):

from sqlmodel import SQLModel, Field
from articuno import infer_pydantic_model, pydantic_to_sqlmodel

class User(SQLModel, table=True):
    __tablename__ = "users"
    id: int | None = Field(default=None, primary_key=True)
    name: str
    email: str | None = None

# Convert SQLModel to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")

# Convert Pydantic to SQLModel
from pydantic import BaseModel

class ProductModel(BaseModel):
    id: int | None = None
    name: str
    price: float

SQLModelClass = pydantic_to_sqlmodel(ProductModel, model_name="Product")

PySpark DataFrame Conversion

Convert between PySpark DataFrames and Pydantic:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType
from articuno import infer_pydantic_model, df_to_pydantic, pydantic_to_pyspark

spark = SparkSession.builder.appName("articuno").getOrCreate()

# Create PySpark DataFrame
data = [(1, "Alice"), (2, "Bob")]
schema = StructType([
    StructField("id", LongType(), False),
    StructField("name", StringType(), False),
])
df = spark.createDataFrame(data, schema=schema)

# Convert PySpark DataFrame to Pydantic
Model = infer_pydantic_model(df, model_name="UserModel")
instances = list(df_to_pydantic(df, model=Model))

# Convert Pydantic instances to PySpark DataFrame
from pydantic import BaseModel

class UserModel(BaseModel):
    id: int
    name: str

instances = [UserModel(id=1, name="Alice"), UserModel(id=2, name="Bob")]
df = pydantic_to_pyspark(instances, model=UserModel)

⚙️ Supported Type Mappings

| Polars Type | Pandas Type (incl. PyArrow) | Dict/Iterable | SQLAlchemy Type | SQLModel Type | PySpark Type | Pydantic Type |
|---|---|---|---|---|---|---|
| pl.Int*, pl.UInt* | int64, Int64, int64[pyarrow] | int | Integer, BigInteger | int | IntegerType, LongType | int |
| pl.Float* | float64, float64[pyarrow] | float | Float, Numeric | float | FloatType, DoubleType | float |
| pl.Utf8, pl.String | object, string[pyarrow] | str | String, Text | str | StringType | str |
| pl.Boolean | bool, bool[pyarrow] | bool | Boolean | bool | BooleanType | bool |
| pl.Date | datetime64[ns], date[pyarrow] | date | Date | date | DateType | datetime.date |
| pl.Datetime | datetime64[ns], timestamp[pyarrow] | datetime | DateTime | datetime | TimestampType | datetime.datetime |
| pl.Duration | timedelta64[ns], duration[pyarrow] | timedelta | - | - | - | datetime.timedelta |
| pl.List | list | list | - | List[...] | ArrayType | List[...] |
| pl.Struct | dict (nested) | dict (nested) | - | - | StructType | Nested BaseModel |
| pl.Null | None, NaN | None | nullable=True | Optional[...] | nullable=True | Optional[...] |
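As a sanity check, here is a small sketch exercising a few of these mappings from the Polars side (the exact printed annotations may differ by version):

import datetime as dt
import polars as pl
from articuno import infer_pydantic_model

df = pl.DataFrame({
    "id": [1, 2],                                        # pl.Int64 -> int
    "when": [dt.date(2024, 1, 1), dt.date(2024, 1, 2)],  # pl.Date -> datetime.date
    "tags": [["a"], ["b", "c"]],                         # pl.List -> List[str]
})

Model = infer_pydantic_model(df, model_name="Mapped")
for name, field in Model.model_fields.items():
    print(name, "->", field.annotation)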

🎯 Real-World Examples

API Response Processing

# Process API responses
from datetime import datetime
from articuno import df_to_pydantic, infer_pydantic_model
api_data = [
    {
        "status": "success",
        "data": {
            "user_id": 123,
            "username": "alice",
            "created_at": datetime(2024, 1, 15, 10, 30)
        }
    }
]

APIResponse = infer_pydantic_model(api_data, model_name="APIResponse")
instances = list(df_to_pydantic(api_data, model=APIResponse))

SQL Query Results

import sqlite3
from articuno import infer_pydantic_model, df_to_pydantic

# Get results from database
conn = sqlite3.connect("database.db")
conn.row_factory = sqlite3.Row
cursor = conn.execute("SELECT * FROM users")
rows = [dict(row) for row in cursor.fetchall()]

# Convert to Pydantic
UserModel = infer_pydantic_model(rows, model_name="User")
users = list(df_to_pydantic(rows, model=UserModel))

E-commerce Order Processing

from datetime import datetime
from articuno import infer_pydantic_model

orders = [
    {
        "order_id": 1001,
        "customer": {"id": 501, "name": "Alice", "email": "alice@example.com"},
        "items": [
            {"product": "Laptop", "quantity": 1, "price": 999.99},
            {"product": "Mouse", "quantity": 2, "price": 29.99}
        ],
        "total": 1059.97,
        "created_at": datetime(2024, 1, 15, 10, 30)
    }
]

Order = infer_pydantic_model(orders, model_name="Order")

🧪 Testing & Quality

Articuno is thoroughly tested and type-checked:

# Run tests
pytest

# Run with coverage
pytest --cov=articuno --cov-report=term-missing

# Type checking
mypy articuno

# Linting
ruff check .

Test Statistics:

  • 112 comprehensive tests
  • 87% code coverage
  • All tests passing ✅
  • Type-checked with mypy ✅
  • Linted with ruff ✅

🔧 Development

# Clone the repository
git clone https://github.com/eddiethedean/articuno.git
cd articuno

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run type checking
mypy articuno

# Run linting
ruff check .

💡 Tips & Best Practices

  1. Use generators for large datasets: df_to_pydantic returns a generator by default for memory efficiency
  2. Limit scanning for performance: Use max_scan parameter when dealing with huge datasets
  3. Validate model names: Articuno automatically validates that model names are valid Python identifiers
  4. Handle optional fields: Use force_optional=True when working with sparse data
  5. Type precedence: Articuno correctly handles bool vs int (bool is checked first); a short sketch follows below
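Why the bool check in tip 5 matters: bool is a subclass of int in Python, so a naive isinstance check against int would misclassify booleans. A minimal sketch:

# bool is a subclass of int, so type checks must test bool first.
print(isinstance(True, int))  # True - a naive int check captures booleans
print(type(True) is bool)     # True - an explicit bool check does not

from articuno import infer_pydantic_model

records = [{"flag": True}, {"flag": False}]
Model = infer_pydantic_model(records, model_name="Flags")
# "flag" is inferred as bool, not int.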

🐛 Troubleshooting

Import Errors

If you get import errors for polars or pandas:

pip install articuno[polars]  # or [pandas]

PyArrow Issues

For PyArrow support:

pip install pyarrow

Generator Indexing

df_to_pydantic returns a generator. Convert to list if you need indexing:

instances = list(df_to_pydantic(df, model_name="Model"))
print(instances[0])  # Now you can index

📖 API Reference

Main Functions

infer_pydantic_model(source, model_name="AutoModel", force_optional=False, max_scan=1000)

Infer a Pydantic model class from a DataFrame or dict iterable.

Parameters:

  • source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
  • model_name: Name for the generated model (must be valid Python identifier)
  • force_optional: Make all fields optional
  • max_scan: Max records to scan for dict iterables

Returns: Type[BaseModel]

df_to_pydantic(source, model=None, model_name=None, force_optional=False, max_scan=1000)

Convert DataFrame or dict iterable to Pydantic instances.

Parameters:

  • source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
  • model: Optional pre-defined model to use
  • model_name: Name for inferred model if model is None
  • force_optional: Make all fields optional
  • max_scan: Max records to scan for dict iterables

Returns: Generator[BaseModel, None, None]

generate_class_code(model, output_path=None, model_name=None)

Generate Python code from a Pydantic model.

Parameters:

  • model: Pydantic model class
  • output_path: Optional file path to write code to
  • model_name: Optional name override

Returns: str (the generated code)
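A minimal end-to-end sketch tying the three functions together (the records and output path are illustrative):

from articuno import df_to_pydantic, infer_pydantic_model
from articuno.codegen import generate_class_code

records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

Model = infer_pydantic_model(records, model_name="User")    # infer the class
users = list(df_to_pydantic(records, model=Model))          # build instances
code = generate_class_code(Model, output_path="models.py")  # emit source code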


📄 License

MIT © Odos Matthews



📝 Changelog

v0.9.0

  • ✨ Added full datetime/date/timedelta support across all backends
  • ✨ Added PyArrow temporal type support (timestamp, date, duration)
  • ✨ Added model name validation (ensures valid Python identifiers)
  • 🐛 Fixed bool vs int type precedence
  • 🐛 Fixed DataFrame vs iterable detection order
  • 🐛 Fixed temporary directory cleanup in code generation
  • 🐛 Added defensive checks for empty samples
  • 📝 Improved documentation with generator behavior notes
  • 🧪 Added comprehensive test suite (112 tests, 87% coverage)
  • 🔍 Full mypy type checking
  • 📏 Ruff linting compliance
  • 📓 Added 9 comprehensive example notebooks with outputs
  • 📚 Enhanced README with complete guide and examples
