Convert Polars, Pandas, or PySpark DataFrames, SQLAlchemy models, or SQLModel classes to Pydantic models with schema inference, and generate clean Python class code. Bidirectional conversion from Pydantic back to these formats is also supported.
Core Functionality:
- 🔍 Infer Pydantic models dynamically from Polars, Pandas, or PySpark DataFrames
- 🗄️ Convert SQLAlchemy and SQLModel model classes to/from Pydantic
- 📋 Infer models directly from iterables of dictionaries (SQL results, JSON records, etc.)
- 🎯 Automatic type detection for basic types, nested structures, and temporal data
- 🔄 Generator-based for memory-efficient processing of large datasets
- 🔁 Bidirectional conversions between Pydantic and supported backends
- 🎨 Generate clean Python model code using datamodel-code-generator
Advanced Features:
- ⚡ PyArrow support for high-performance Pandas columns (int64[pyarrow], string[pyarrow], timestamp[pyarrow], etc.)
- 📅 Full temporal type support: datetime, date, timedelta across all backends
- 🗂️ Nested structures: Supports nested dicts, lists, and complex hierarchies
- 🔧 Optional field detection: Automatically identifies nullable fields
- 🎛️ Configurable scanning: max_scan parameter to limit schema inference
- 🔒 Force optional mode: Make all fields optional regardless of data
- ✅ Model name validation: Ensures valid Python identifiers
- 🧪 Comprehensively tested: 112 tests, 87% code coverage
Design:
- 🪶 Lightweight, dependency-flexible architecture
- 🔌 Optional dependencies for Polars, Pandas, PyArrow, SQLAlchemy, SQLModel, and PySpark
- 🎯 Type-checked with mypy
- 📏 Linted with ruff
Install the core package:
pip install articuno
Add optional dependencies as needed:
# For Polars support
pip install articuno[polars]
# For Pandas support (with PyArrow)
pip install articuno[pandas]
# For SQLAlchemy support
pip install articuno[sqlalchemy]
# For SQLModel support
pip install articuno[sqlmodel]
# For PySpark support
pip install articuno[pyspark]
# Full install with all backends
pip install articuno[polars,pandas,sqlalchemy,sqlmodel,pyspark]
# Development dependencies (includes pytest, mypy, ruff)
pip install articuno[dev]
Quick start with a Polars DataFrame:
from articuno import df_to_pydantic, infer_pydantic_model
import polars as pl
# Create a DataFrame
df = pl.DataFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
"score": [95.5, 88.0, 92.3],
"active": [True, False, True]
})
# Convert to Pydantic instances (returns a generator)
instances = list(df_to_pydantic(df, model_name="UserModel"))
print(instances[0])
# Output: id=1 name='Alice' score=95.5 active=True
# Or just get the model class
Model = infer_pydantic_model(df, model_name="UserModel")
print(Model.model_json_schema())
Inference from iterables of dictionaries is perfect for SQL query results, API responses, or JSON data:
from articuno import df_to_pydantic, infer_pydantic_model
# From database results, API responses, etc.
records = [
{"id": 1, "name": "Alice", "email": "alice@example.com"},
{"id": 2, "name": "Bob", "email": "bob@example.com"},
]
# Automatically infer and create instances
instances = list(df_to_pydantic(records, model_name="User"))
# Or infer just the model
Model = infer_pydantic_model(records, model_name="User")
Comprehensive Jupyter notebooks demonstrating all features:
- 01_quick_start.ipynb - Basic usage with Polars, Pandas, and dict iterables
- 02_temporal_types.ipynb - Working with datetime, date, and timedelta
- 03_pyarrow_support.ipynb - PyArrow-backed Pandas columns
- 04_nested_structures.ipynb - Complex nested dictionaries and lists
- 05_force_optional_and_max_scan.ipynb - Control optional fields and scanning
- 06_code_generation.ipynb - Generate Python code from models
- 07_api_responses.ipynb - Process REST API responses
- 08_database_results.ipynb - SQL query results to Pydantic
- 09_advanced_features.ipynb - Generators, type precedence, unicode
- polars_inference.ipynb - Polars-specific inference
- articuno_pandas_pyarrow_example.ipynb - Pandas with PyArrow
- pandas_nested.ipynb - Nested Pandas structures
- optional_nested_example.ipynb - Optional nested fields
- articuno_inference_demo.ipynb - General inference demo
Note: All example notebooks have been executed and saved with outputs, so you can view the results directly on GitHub without running them.
Articuno fully supports datetime, date, and timedelta types:
from datetime import datetime, date, timedelta
from articuno import infer_pydantic_model
data = [
{
"event_id": 1,
"event_date": date(2024, 1, 15),
"timestamp": datetime(2024, 1, 15, 10, 30),
"duration": timedelta(hours=2, minutes=30)
}
]
Model = infer_pydantic_model(data, model_name="Event")
# Fields will have correct datetime.date, datetime.datetime, and datetime.timedelta types
Full support for high-performance PyArrow dtypes:
from datetime import datetime
import pandas as pd
import pyarrow as pa
from articuno import infer_pydantic_model
df = pd.DataFrame({
"id": pd.Series([1, 2, 3], dtype="int64[pyarrow]"),
"name": pd.Series(["Alice", "Bob", "Charlie"], dtype="string[pyarrow]"),
"created": pd.Series([
datetime(2024, 1, 1),
datetime(2024, 1, 2),
datetime(2024, 1, 3)
], dtype=pd.ArrowDtype(pa.timestamp("ms"))),
"active": pd.Series([True, False, True], dtype="bool[pyarrow]")
})
Model = infer_pydantic_model(df, model_name="ArrowModel")
Supported PyArrow types:
- int64[pyarrow], int32[pyarrow], etc.
- string[pyarrow]
- bool[pyarrow]
- timestamp[pyarrow] → datetime.datetime
- date32[pyarrow], date64[pyarrow] → datetime.date
- duration[pyarrow] → datetime.timedelta
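To round out the temporal coverage, here is a minimal sketch (assuming pandas 2.x with PyArrow installed) using date32 and duration columns; the expected annotations follow the list above:
import datetime
import pandas as pd
import pyarrow as pa
from articuno import infer_pydantic_model
# date32 and duration columns, matching the supported types listed above
df = pd.DataFrame({
    "day": pd.Series([datetime.date(2024, 1, 1)], dtype=pd.ArrowDtype(pa.date32())),
    "elapsed": pd.Series([datetime.timedelta(minutes=5)], dtype=pd.ArrowDtype(pa.duration("s"))),
})
Model = infer_pydantic_model(df, model_name="ArrowTemporal")
# Expected: day -> datetime.date, elapsed -> datetime.timedelta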
Handle complex nested data with ease:
from articuno import infer_pydantic_model
data = [
{
"user_id": 1,
"profile": {
"name": "Alice",
"age": 30,
"preferences": {
"theme": "dark",
"notifications": True
}
},
"tags": ["python", "data-science"]
}
]
Model = infer_pydantic_model(data, model_name="UserProfile")
# Nested dicts become nested Pydantic models
# Lists are preserved with List[...] typing
Make all fields optional regardless of the data:
import polars as pl
from articuno import infer_pydantic_model
df = pl.DataFrame({
"required": [1, 2, 3],
"also_required": ["a", "b", "c"]
})
# Force all fields to be Optional
Model = infer_pydantic_model(df, force_optional=True)
# Now you can create instances with None values
instance = Model(required=None, also_required=None)
For large datasets, limit how many records are scanned:
# Only scan first 100 records for schema inference
Model = infer_pydantic_model(
large_dataset,
model_name="LargeModel",
max_scan=100
)
df_to_pydantic returns a generator for memory efficiency:
# Generator - memory efficient for large datasets
instances_gen = df_to_pydantic(large_df, model_name="Record")
# Process one at a time
for instance in instances_gen:
process(instance)
# Or collect all at once if needed
instances_list = list(df_to_pydantic(df, model_name="Record"))
Generate clean Python code for your models:
from articuno import infer_pydantic_model
from articuno.codegen import generate_class_code
# Infer model from data
Model = infer_pydantic_model(data, model_name="User")
# Generate Python code
code = generate_class_code(Model)
print(code)
# Or save to file
code = generate_class_code(Model, output_path="models.py")
Use a pre-defined model instead of inferring:
from pydantic import BaseModel
from articuno import df_to_pydantic
class UserModel(BaseModel):
id: int
name: str
email: str
# Use your existing model
instances = list(df_to_pydantic(df, model=UserModel))
Convert between SQLAlchemy declarative models and Pydantic:
from sqlalchemy import Column, Integer, String, Float
from sqlalchemy.orm import DeclarativeBase
from articuno import infer_pydantic_model, pydantic_to_sqlalchemy
class Base(DeclarativeBase):
pass
class User(Base):
__tablename__ = "users"
id = Column(Integer, primary_key=True)
name = Column(String(255), nullable=False)
email = Column(String(255), nullable=True)
# Convert SQLAlchemy model to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")
# Convert Pydantic model to SQLAlchemy
from pydantic import BaseModel as PydanticBase
class ProductModel(PydanticBase):
id: int
name: str
price: float
SQLAlchemyModel = pydantic_to_sqlalchemy(ProductModel, model_name="Product")
Convert between SQLModel and Pydantic (SQLModel already extends Pydantic):
from sqlmodel import SQLModel, Field
from articuno import infer_pydantic_model, pydantic_to_sqlmodel
class User(SQLModel, table=True):
__tablename__ = "users"
id: int | None = Field(default=None, primary_key=True)
name: str
email: str | None = None
# Convert SQLModel to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")
# Convert Pydantic to SQLModel
from pydantic import BaseModel
class ProductModel(BaseModel):
id: int | None = None
name: str
price: float
SQLModelClass = pydantic_to_sqlmodel(ProductModel, model_name="Product")
Convert between PySpark DataFrames and Pydantic:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType
from articuno import infer_pydantic_model, df_to_pydantic, pydantic_to_pyspark
spark = SparkSession.builder.appName("articuno").getOrCreate()
# Create PySpark DataFrame
data = [(1, "Alice"), (2, "Bob")]
schema = StructType([
StructField("id", LongType(), False),
StructField("name", StringType(), False),
])
df = spark.createDataFrame(data, schema=schema)
# Convert PySpark DataFrame to Pydantic
Model = infer_pydantic_model(df, model_name="UserModel")
instances = list(df_to_pydantic(df, model=Model))
# Convert Pydantic instances to PySpark DataFrame
from pydantic import BaseModel
class UserModel(BaseModel):
id: int
name: str
instances = [UserModel(id=1, name="Alice"), UserModel(id=2, name="Bob")]
df = pydantic_to_pyspark(instances, model=UserModel)
| Polars Type | Pandas Type (incl. PyArrow) | Dict/Iterable | SQLAlchemy Type | SQLModel Type | PySpark Type | Pydantic Type |
|---|---|---|---|---|---|---|
| pl.Int*, pl.UInt* | int64, Int64, int64[pyarrow] | int | Integer, BigInteger | int | IntegerType, LongType | int |
| pl.Float* | float64, float64[pyarrow] | float | Float, Numeric | float | FloatType, DoubleType | float |
| pl.Utf8, pl.String | object, string[pyarrow] | str | String, Text | str | StringType | str |
| pl.Boolean | bool, bool[pyarrow] | bool | Boolean | bool | BooleanType | bool |
| pl.Date | datetime64[ns], date[pyarrow] | date | Date | date | DateType | datetime.date |
| pl.Datetime | datetime64[ns], timestamp[pyarrow] | datetime | DateTime | datetime | TimestampType | datetime.datetime |
| pl.Duration | timedelta64[ns], duration[pyarrow] | timedelta | - | - | - | datetime.timedelta |
| pl.List | list | list | - | List[...] | ArrayType | List[...] |
| pl.Struct | dict (nested) | dict (nested) | - | - | StructType | Nested BaseModel |
| pl.Null | None, NaN | None | nullable=True | Optional[...] | nullable=True | Optional[...] |
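As a quick check of this mapping in practice, here is a minimal sketch (assuming Polars is installed; Pydantic v2's model_fields exposes the resolved annotations):
import datetime
import polars as pl
from articuno import infer_pydantic_model
# String, boolean, and date columns from the table above
df = pl.DataFrame({
    "name": ["Alice", "Bob"],
    "active": [True, False],
    "joined": [datetime.date(2024, 1, 1), datetime.date(2024, 2, 1)],
})
Model = infer_pydantic_model(df, model_name="Member")
for field_name, field in Model.model_fields.items():
    print(field_name, field.annotation)
# Expected per the table: str, bool, datetime.date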
from datetime import datetime
from articuno import infer_pydantic_model, df_to_pydantic
# Process API responses
api_data = [
{
"status": "success",
"data": {
"user_id": 123,
"username": "alice",
"created_at": datetime(2024, 1, 15, 10, 30)
}
}
]
APIResponse = infer_pydantic_model(api_data, model_name="APIResponse")
instances = list(df_to_pydantic(api_data, model=APIResponse))
import sqlite3
from articuno import infer_pydantic_model, df_to_pydantic
# Get results from database
conn = sqlite3.connect("database.db")
conn.row_factory = sqlite3.Row
cursor = conn.execute("SELECT * FROM users")
rows = [dict(row) for row in cursor.fetchall()]
# Convert to Pydantic
UserModel = infer_pydantic_model(rows, model_name="User")
users = list(df_to_pydantic(rows, model=UserModel))
from datetime import datetime
from articuno import infer_pydantic_model
orders = [
{
"order_id": 1001,
"customer": {"id": 501, "name": "Alice", "email": "alice@example.com"},
"items": [
{"product": "Laptop", "quantity": 1, "price": 999.99},
{"product": "Mouse", "quantity": 2, "price": 29.99}
],
"total": 1059.97,
"created_at": datetime(2024, 1, 15, 10, 30)
}
]
Order = infer_pydantic_model(orders, model_name="Order")
Articuno is thoroughly tested and type-checked:
# Run tests
pytest
# Run with coverage
pytest --cov=articuno --cov-report=term-missing
# Type checking
mypy articuno
# Linting
ruff check .
Test Statistics:
- 112 comprehensive tests
- 87% code coverage
- All tests passing ✅
- Type-checked with mypy ✅
- Linted with ruff ✅
# Clone the repository
git clone https://github.com/eddiethedean/articuno.git
cd articuno
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run type checking
mypy articuno
# Run linting
ruff check .
- Use generators for large datasets: df_to_pydantic returns a generator by default for memory efficiency
- Limit scanning for performance: Use the max_scan parameter when dealing with huge datasets
- Validate model names: Articuno automatically validates that model names are valid Python identifiers
- Handle optional fields: Use force_optional=True when working with sparse data
- Type precedence: Articuno correctly handles bool vs int (bool is checked first; see the sketch after this list)
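A minimal sketch of the bool/int precedence point, using an illustrative dict iterable (the expected annotations follow the tip above):
from articuno import infer_pydantic_model
# In Python, bool is a subclass of int, so naive checks can misclassify it;
# per the tip above, bool is checked before int during inference.
records = [{"flag": True, "count": 1}, {"flag": False, "count": 2}]
Model = infer_pydantic_model(records, model_name="Flags")
print(Model.model_fields["flag"].annotation)   # expected: bool
print(Model.model_fields["count"].annotation)  # expected: int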
If you get import errors for polars or pandas:
pip install articuno[polars]  # or [pandas]
For PyArrow support:
pip install pyarrow
df_to_pydantic returns a generator. Convert to list if you need indexing:
instances = list(df_to_pydantic(df, model_name="Model"))
print(instances[0])  # Now you can index
infer_pydantic_model: Infer a Pydantic model class from a DataFrame or dict iterable.
Parameters:
- source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
- model_name: Name for the generated model (must be a valid Python identifier)
- force_optional: Make all fields optional
- max_scan: Max records to scan for dict iterables
Returns: Type[BaseModel]
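A compact usage sketch combining the parameters above (the records and field names here are illustrative):
from articuno import infer_pydantic_model
# Illustrative records; max_scan caps how many are inspected and
# force_optional makes every inferred field Optional.
records = [{"id": i, "label": f"item-{i}"} for i in range(10_000)]
Model = infer_pydantic_model(
    records,
    model_name="Item",
    force_optional=True,
    max_scan=100,
)
print(Model.model_json_schema())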
df_to_pydantic: Convert a DataFrame or dict iterable to Pydantic instances.
Parameters:
- source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
- model: Optional pre-defined model to use
- model_name: Name for the inferred model if model is None
- force_optional: Make all fields optional
- max_scan: Max records to scan for dict iterables
Returns: Generator[BaseModel, None, None]
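A short sketch of both calling modes, either inferring a model by name or supplying a pre-defined one (the User class below is illustrative):
from pydantic import BaseModel
from articuno import df_to_pydantic
rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
# Mode 1: infer the model on the fly via model_name
for instance in df_to_pydantic(rows, model_name="User"):
    print(instance)
# Mode 2: validate against a pre-defined model via model=
class User(BaseModel):
    id: int
    name: str
users = list(df_to_pydantic(rows, model=User))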
generate_class_code: Generate Python code from a Pydantic model.
Parameters:
- model: Pydantic model class
- output_path: Optional file path to write code to
- model_name: Optional name override
Returns: str (the generated code)
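A minimal sketch of the name override, assuming model_name simply renames the emitted class (Model is a previously inferred model, as in the code-generation example above):
from articuno.codegen import generate_class_code
# Generate source for Model, emitting the class under a different name
code = generate_class_code(Model, model_name="UserRecord")
print(code)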
- GitHub Repository
- Datamodel Code Generator
- Poldantic (Polars integration)
- Polars
- Pandas
- PyArrow
MIT © Odos Matthews
- Built with Pydantic
- Code generation powered by datamodel-code-generator
- Polars support via poldantic
- ✨ Added full datetime/date/timedelta support across all backends
- ✨ Added PyArrow temporal type support (timestamp, date, duration)
- ✨ Added model name validation (ensures valid Python identifiers)
- 🐛 Fixed bool vs int type precedence
- 🐛 Fixed DataFrame vs iterable detection order
- 🐛 Fixed temporary directory cleanup in code generation
- 🐛 Added defensive checks for empty samples
- 📝 Improved documentation with generator behavior notes
- 🧪 Added comprehensive test suite (112 tests, 87% coverage)
- 🔍 Full mypy type checking
- 📏 Ruff linting compliance
- 📓 Added 9 comprehensive example notebooks with outputs
- 📚 Enhanced README with complete guide and examples