
❄️ Articuno ❄️

Convert Polars, Pandas, and PySpark DataFrames, SQLAlchemy models, and SQLModel classes to Pydantic models with schema inference, and generate clean Python class code. Bidirectional conversion from Pydantic back to each of these formats is also supported.

Python 3.8+ · Type Checked · Code Style · Test Coverage


✨ Features

Core Functionality:

  • 🔍 Infer Pydantic models dynamically from Polars, Pandas, or PySpark DataFrames
  • 🗄️ Convert SQLAlchemy and SQLModel model classes to/from Pydantic
  • 📋 Infer models directly from iterables of dictionaries (SQL results, JSON records, etc.)
  • 🎯 Automatic type detection for basic types, nested structures, and temporal data
  • 🔄 Generator-based for memory-efficient processing of large datasets
  • 🔁 Bidirectional conversions between Pydantic and supported backends
  • 🎨 Generate clean Python model code using datamodel-code-generator

Advanced Features:

  • PyArrow support for high-performance Pandas columns (int64[pyarrow], string[pyarrow], timestamp[pyarrow], etc.)
  • 📅 Full temporal type support: datetime, date, timedelta across all backends
  • 🗂️ Nested structures: Supports nested dicts, lists, and complex hierarchies
  • 🔧 Optional field detection: Automatically identifies nullable fields
  • 🎛️ Configurable scanning: max_scan parameter to limit schema inference
  • 🔒 Force optional mode: Make all fields optional regardless of data
  • Model name validation: Ensures valid Python identifiers
  • 🧪 Comprehensively tested: 112 tests, 87% code coverage

Design:

  • 🪶 Lightweight, dependency-flexible architecture
  • 🔌 Optional dependencies for Polars, Pandas, PyArrow, SQLAlchemy, SQLModel, and PySpark
  • 🎯 Type-checked with mypy
  • 📏 Linted with ruff

📦 Installation

Install the core package:

pip install articuno

Add optional dependencies as needed:

# For Polars support
pip install articuno[polars]

# For Pandas support (with PyArrow)
pip install articuno[pandas]

# For SQLAlchemy support
pip install articuno[sqlalchemy]

# For SQLModel support
pip install articuno[sqlmodel]

# For PySpark support
pip install articuno[pyspark]

# Full install with all backends
pip install articuno[polars,pandas,sqlalchemy,sqlmodel,pyspark]

# Development dependencies (includes pytest, mypy, ruff)
pip install articuno[dev]

🚀 Quick Start

DataFrame to Pydantic Models

from articuno import df_to_pydantic, infer_pydantic_model
import polars as pl

# Create a DataFrame
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95.5, 88.0, 92.3],
    "active": [True, False, True]
})

# Convert to Pydantic instances (returns a generator)
instances = list(df_to_pydantic(df, model_name="UserModel"))
print(instances[0])
# Output: id=1 name='Alice' score=95.5 active=True

# Or just get the model class
Model = infer_pydantic_model(df, model_name="UserModel")
print(Model.model_json_schema())

Dict Iterables to Pydantic

Perfect for SQL query results, API responses, or JSON data:

from articuno import df_to_pydantic, infer_pydantic_model

# From database results, API responses, etc.
records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

# Automatically infer and create instances
instances = list(df_to_pydantic(records, model_name="User"))

# Or infer just the model
Model = infer_pydantic_model(records, model_name="User")

📓 Example Notebooks

Comprehensive Jupyter notebooks demonstrating all features:

Core Examples

Advanced Examples

Legacy Examples

Note: All example notebooks have been executed and saved with outputs, so you can view the results directly on GitHub without running them.


📚 Advanced Usage

Temporal Types Support

Articuno fully supports datetime, date, and timedelta types:

from datetime import datetime, date, timedelta
from articuno import infer_pydantic_model

data = [
    {
        "event_id": 1,
        "event_date": date(2024, 1, 15),
        "timestamp": datetime(2024, 1, 15, 10, 30),
        "duration": timedelta(hours=2, minutes=30)
    }
]

Model = infer_pydantic_model(data, model_name="Event")
# Fields will have correct datetime.date, datetime.datetime, datetime.timedelta types
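Continuing the example, you can inspect the inferred annotations through Pydantic v2's model_fields mapping; the printed output below is illustrative:

for name, field in Model.model_fields.items():
    print(name, "->", field.annotation)
# Roughly:
#   event_id -> <class 'int'>
#   event_date -> <class 'datetime.date'>
#   timestamp -> <class 'datetime.datetime'>
#   duration -> <class 'datetime.timedelta'>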

PyArrow-Backed Pandas Columns

Full support for high-performance PyArrow dtypes:

from datetime import datetime

import pandas as pd
import pyarrow as pa
from articuno import infer_pydantic_model

df = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64[pyarrow]"),
    "name": pd.Series(["Alice", "Bob", "Charlie"], dtype="string[pyarrow]"),
    "created": pd.Series([
        datetime(2024, 1, 1),
        datetime(2024, 1, 2),
        datetime(2024, 1, 3)
    ], dtype=pd.ArrowDtype(pa.timestamp("ms"))),
    "active": pd.Series([True, False, True], dtype="bool[pyarrow]")
})

Model = infer_pydantic_model(df, model_name="ArrowModel")

Supported PyArrow types:

  • int64[pyarrow], int32[pyarrow], etc.
  • string[pyarrow]
  • bool[pyarrow]
  • timestamp[pyarrow] → datetime.datetime
  • date32[pyarrow], date64[pyarrow] → datetime.date
  • duration[pyarrow] → datetime.timedelta

Nested Structures

Handle complex nested data with ease:

from articuno import infer_pydantic_model

data = [
    {
        "user_id": 1,
        "profile": {
            "name": "Alice",
            "age": 30,
            "preferences": {
                "theme": "dark",
                "notifications": True
            }
        },
        "tags": ["python", "data-science"]
    }
]

Model = infer_pydantic_model(data, model_name="UserProfile")
# Nested dicts become nested Pydantic models
# Lists are preserved with List[...] typing
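Continuing the example, instances built with df_to_pydantic expose the nested data as attributes; a short sketch (the exact nested model classes generated are an implementation detail):

from articuno import df_to_pydantic

instance = next(df_to_pydantic(data, model=Model))
print(instance.profile.name)               # "Alice"
print(instance.profile.preferences.theme)  # "dark"
print(instance.tags)                       # ["python", "data-science"]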

Force Optional Fields

Make all fields optional regardless of the data:

import polars as pl
from articuno import infer_pydantic_model

df = pl.DataFrame({
    "required": [1, 2, 3],
    "also_required": ["a", "b", "c"]
})

# Force all fields to be Optional
Model = infer_pydantic_model(df, force_optional=True)

# Now you can create instances with None values
instance = Model(required=None, also_required=None)

Limit Schema Scanning

For large datasets, limit how many records are scanned:

# Only scan first 100 records for schema inference
Model = infer_pydantic_model(
    large_dataset,
    model_name="LargeModel",
    max_scan=100
)
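The trade-off is that only the scanned records shape the schema; a field whose type only varies later in the data will not influence inference. A hedged sketch of that behavior, assuming inference looks solely at the scanned sample:

from articuno import infer_pydantic_model

# "note" is None in every scanned record, so it would likely be
# inferred as optional; the later string values are never seen.
records = [{"id": i, "note": None if i < 100 else "late"} for i in range(1_000)]
Model = infer_pydantic_model(records, model_name="Scanned", max_scan=100)

If your source is a generator rather than a list, remember that it is consumed as it is scanned, so materialize the data first if you also need to build instances from it.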

Memory-Efficient Processing

df_to_pydantic returns a generator for memory efficiency:

# Generator - memory efficient for large datasets
instances_gen = df_to_pydantic(large_df, model_name="Record")

# Process one at a time
for instance in instances_gen:
    process(instance)

# Or collect all at once if needed
instances_list = list(df_to_pydantic(df, model_name="Record"))

Code Generation

Generate clean Python code for your models:

from articuno import infer_pydantic_model
from articuno.codegen import generate_class_code

# Infer model from data
Model = infer_pydantic_model(data, model_name="User")

# Generate Python code
code = generate_class_code(Model)
print(code)

# Or save to file
code = generate_class_code(Model, output_path="models.py")

Pre-defined Models

Use a pre-defined model instead of inferring:

from pydantic import BaseModel
from articuno import df_to_pydantic

class UserModel(BaseModel):
    id: int
    name: str
    email: str

# Use your existing model
instances = list(df_to_pydantic(df, model=UserModel))

SQLAlchemy Model Conversion

Convert between SQLAlchemy declarative models and Pydantic:

from sqlalchemy import Column, Integer, String, Float
from sqlalchemy.orm import DeclarativeBase
from articuno import infer_pydantic_model, pydantic_to_sqlalchemy

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False)
    email = Column(String(255), nullable=True)

# Convert SQLAlchemy model to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")

# Convert Pydantic model to SQLAlchemy
from pydantic import BaseModel as PydanticBase

class ProductModel(PydanticBase):
    id: int
    name: str
    price: float

SQLAlchemyModel = pydantic_to_sqlalchemy(ProductModel, model_name="Product")
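The inferred model can then validate ORM objects directly; a minimal sketch using Pydantic v2's from_attributes validation:

# Validate an ORM instance against the inferred Pydantic model.
user = User(id=1, name="Alice", email="alice@example.com")
validated = PydanticModel.model_validate(user, from_attributes=True)
print(validated.name)  # "Alice"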

SQLModel Conversion

Convert between SQLModel and Pydantic (SQLModel already extends Pydantic):

from sqlmodel import SQLModel, Field
from articuno import infer_pydantic_model, pydantic_to_sqlmodel

class User(SQLModel, table=True):
    __tablename__ = "users"
    id: int | None = Field(default=None, primary_key=True)
    name: str
    email: str | None = None

# Convert SQLModel to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")

# Convert Pydantic to SQLModel
from pydantic import BaseModel

class ProductModel(BaseModel):
    id: int | None = None
    name: str
    price: float

SQLModelClass = pydantic_to_sqlmodel(ProductModel, model_name="Product")

PySpark DataFrame Conversion

Convert between PySpark DataFrames and Pydantic:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType
from articuno import infer_pydantic_model, df_to_pydantic, pydantic_to_pyspark

spark = SparkSession.builder.appName("articuno").getOrCreate()

# Create PySpark DataFrame
data = [(1, "Alice"), (2, "Bob")]
schema = StructType([
    StructField("id", LongType(), False),
    StructField("name", StringType(), False),
])
df = spark.createDataFrame(data, schema=schema)

# Convert PySpark DataFrame to Pydantic
Model = infer_pydantic_model(df, model_name="UserModel")
instances = list(df_to_pydantic(df, model=Model))

# Convert Pydantic instances to PySpark DataFrame
from pydantic import BaseModel

class UserModel(BaseModel):
    id: int
    name: str

instances = [UserModel(id=1, name="Alice"), UserModel(id=2, name="Bob")]
df = pydantic_to_pyspark(instances, model=UserModel)

⚙️ Supported Type Mappings

| Polars Type | Pandas Type (incl. PyArrow) | Dict/Iterable | SQLAlchemy Type | SQLModel Type | PySpark Type | Pydantic Type |
|---|---|---|---|---|---|---|
| pl.Int*, pl.UInt* | int64, Int64, int64[pyarrow] | int | Integer, BigInteger | int | IntegerType, LongType | int |
| pl.Float* | float64, float64[pyarrow] | float | Float, Numeric | float | FloatType, DoubleType | float |
| pl.Utf8, pl.String | object, string[pyarrow] | str | String, Text | str | StringType | str |
| pl.Boolean | bool, bool[pyarrow] | bool | Boolean | bool | BooleanType | bool |
| pl.Date | datetime64[ns], date[pyarrow] | date | Date | date | DateType | datetime.date |
| pl.Datetime | datetime64[ns], timestamp[pyarrow] | datetime | DateTime | datetime | TimestampType | datetime.datetime |
| pl.Duration | timedelta64[ns], duration[pyarrow] | timedelta | - | - | - | datetime.timedelta |
| pl.List | list | list | - | List[...] | ArrayType | List[...] |
| pl.Struct | dict (nested) | dict (nested) | - | - | StructType | Nested BaseModel |
| pl.Null | None, NaN | None | nullable=True | Optional[...] | nullable=True | Optional[...] |
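As a sanity check, here is a small sketch exercising a few of these mappings from the Polars side (the exact printed annotations may differ by version):

import datetime as dt
import polars as pl
from articuno import infer_pydantic_model

df = pl.DataFrame({
    "id": [1, 2],                                        # pl.Int64 -> int
    "when": [dt.date(2024, 1, 1), dt.date(2024, 1, 2)],  # pl.Date -> datetime.date
    "tags": [["a"], ["b", "c"]],                         # pl.List -> List[str]
})

Model = infer_pydantic_model(df, model_name="Mapped")
for name, field in Model.model_fields.items():
    print(name, "->", field.annotation)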

🎯 Real-World Examples

API Response Processing

# Process API responses
from datetime import datetime
from articuno import df_to_pydantic, infer_pydantic_model
api_data = [
    {
        "status": "success",
        "data": {
            "user_id": 123,
            "username": "alice",
            "created_at": datetime(2024, 1, 15, 10, 30)
        }
    }
]

APIResponse = infer_pydantic_model(api_data, model_name="APIResponse")
instances = list(df_to_pydantic(api_data, model=APIResponse))

SQL Query Results

import sqlite3
from articuno import infer_pydantic_model, df_to_pydantic

# Get results from database
conn = sqlite3.connect("database.db")
conn.row_factory = sqlite3.Row
cursor = conn.execute("SELECT * FROM users")
rows = [dict(row) for row in cursor.fetchall()]

# Convert to Pydantic
UserModel = infer_pydantic_model(rows, model_name="User")
users = list(df_to_pydantic(rows, model=UserModel))

E-commerce Order Processing

from datetime import datetime
from articuno import infer_pydantic_model

orders = [
    {
        "order_id": 1001,
        "customer": {"id": 501, "name": "Alice", "email": "alice@example.com"},
        "items": [
            {"product": "Laptop", "quantity": 1, "price": 999.99},
            {"product": "Mouse", "quantity": 2, "price": 29.99}
        ],
        "total": 1059.97,
        "created_at": datetime(2024, 1, 15, 10, 30)
    }
]

Order = infer_pydantic_model(orders, model_name="Order")

🧪 Testing & Quality

Articuno is thoroughly tested and type-checked:

# Run tests
pytest

# Run with coverage
pytest --cov=articuno --cov-report=term-missing

# Type checking
mypy articuno

# Linting
ruff check .

Test Statistics:

  • 112 comprehensive tests
  • 87% code coverage
  • All tests passing ✅
  • Type-checked with mypy ✅
  • Linted with ruff ✅

🔧 Development

# Clone the repository
git clone https://github.com/eddiethedean/articuno.git
cd articuno

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run type checking
mypy articuno

# Run linting
ruff check .

💡 Tips & Best Practices

  1. Use generators for large datasets: df_to_pydantic returns a generator by default for memory efficiency
  2. Limit scanning for performance: Use max_scan parameter when dealing with huge datasets
  3. Validate model names: Articuno automatically validates that model names are valid Python identifiers
  4. Handle optional fields: Use force_optional=True when working with sparse data
  5. Type precedence: Articuno correctly handles bool vs int (bool is checked first); a short sketch follows below
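Why the bool check in tip 5 matters: bool is a subclass of int in Python, so a naive isinstance check against int would misclassify booleans. A minimal sketch:

# bool is a subclass of int, so type checks must test bool first.
print(isinstance(True, int))  # True - a naive int check captures booleans
print(type(True) is bool)     # True - an explicit bool check does not

from articuno import infer_pydantic_model

records = [{"flag": True}, {"flag": False}]
Model = infer_pydantic_model(records, model_name="Flags")
# "flag" is inferred as bool, not int.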

🐛 Troubleshooting

Import Errors

If you get import errors for polars or pandas:

pip install articuno[polars]  # or [pandas]

PyArrow Issues

For PyArrow support:

pip install pyarrow

Generator Indexing

df_to_pydantic returns a generator. Convert to list if you need indexing:

instances = list(df_to_pydantic(df, model_name="Model"))
print(instances[0])  # Now you can index

📖 API Reference

Main Functions

infer_pydantic_model(source, model_name="AutoModel", force_optional=False, max_scan=1000)

Infer a Pydantic model class from a DataFrame or dict iterable.

Parameters:

  • source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
  • model_name: Name for the generated model (must be valid Python identifier)
  • force_optional: Make all fields optional
  • max_scan: Max records to scan for dict iterables

Returns: Type[BaseModel]

df_to_pydantic(source, model=None, model_name=None, force_optional=False, max_scan=1000)

Convert DataFrame or dict iterable to Pydantic instances.

Parameters:

  • source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
  • model: Optional pre-defined model to use
  • model_name: Name for inferred model if model is None
  • force_optional: Make all fields optional
  • max_scan: Max records to scan for dict iterables

Returns: Generator[BaseModel, None, None]

generate_class_code(model, output_path=None, model_name=None)

Generate Python code from a Pydantic model.

Parameters:

  • model: Pydantic model class
  • output_path: Optional file path to write code to
  • model_name: Optional name override

Returns: str (the generated code)
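A minimal end-to-end sketch tying the three functions together (the records and output path are illustrative):

from articuno import df_to_pydantic, infer_pydantic_model
from articuno.codegen import generate_class_code

records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

Model = infer_pydantic_model(records, model_name="User")    # infer the class
users = list(df_to_pydantic(records, model=Model))          # build instances
code = generate_class_code(Model, output_path="models.py")  # emit source code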


📄 License

MIT © Odos Matthews



📝 Changelog

v0.9.0

  • ✨ Added full datetime/date/timedelta support across all backends
  • ✨ Added PyArrow temporal type support (timestamp, date, duration)
  • ✨ Added model name validation (ensures valid Python identifiers)
  • 🐛 Fixed bool vs int type precedence
  • 🐛 Fixed DataFrame vs iterable detection order
  • 🐛 Fixed temporary directory cleanup in code generation
  • 🐛 Added defensive checks for empty samples
  • 📝 Improved documentation with generator behavior notes
  • 🧪 Added comprehensive test suite (112 tests, 87% coverage)
  • 🔍 Full mypy type checking
  • 📏 Ruff linting compliance
  • 📓 Added 9 comprehensive example notebooks with outputs
  • 📚 Enhanced README with complete guide and examples
