Anvil

Anvil is a dataflow-oriented scripting language and execution engine built on Rust and Apache DataFusion.

It is designed for readable, composable, graph-friendly data pipelines that can be parsed, analyzed, and executed deterministically.

At its core, Anvil treats data processing as a sequence of tools connected by flows, with optional branching and variable binding.

Command Line Interface

Anvil runs a script when given a script file. With no script, it starts an interactive REPL.

Usage

anvil [OPTIONS] [SCRIPT]

SCRIPT
Optional path to an Anvil script file.
If omitted, Anvil starts in REPL mode.

Options

`-d, --dot [PATH]`

Emit the execution plan as a Graphviz DOT graph instead of executing the plan.

If PATH is provided, the DOT output is written to that file.
If PATH is omitted, the DOT output is written to stdout.

This is useful for inspecting execution order, data lineage, tool dependencies, and branching behavior.

Examples

Run an Anvil script normally

anvil examples/join.avl

Generate DOT output to stdout

anvil --dot examples/join.avl

Write DOT output to a file

anvil --dot plan.dot examples/join.avl

Generate a PNG using Graphviz

You can pipe the DOT output directly into the dot binary to generate an image:

anvil --dot examples/join.avl | dot -Tpng -o plan.png

Or, if you wrote the DOT file explicitly:

anvil --dot plan.dot examples/join.avl
dot -Tpng plan.dot -o plan.png

The resulting graph visually distinguishes tools and variables, and edge labels represent data ports (including branch outputs such as true and false).
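To make the shape of such a graph concrete, here is a minimal Python sketch that renders a small dataflow graph as DOT text. It is not Anvil's emitter: the node shapes (box for tools, ellipse for variables), the graph name `plan`, and the `in` port label are assumptions chosen for illustration, and the actual output of `--dot` may differ.

```python
def to_dot(nodes, edges):
    """Render a tiny dataflow graph as Graphviz DOT text.

    nodes: dict mapping node id -> kind ("tool" or "variable")
    edges: list of (source id, target id, port label) tuples
    """
    # Assumed styling: Anvil's real shapes are not documented here.
    shapes = {"tool": "box", "variable": "ellipse"}
    lines = ["digraph plan {"]
    for name, kind in nodes.items():
        lines.append(f'  "{name}" [shape={shapes[kind]}];')
    for src, dst, port in edges:
        # The edge label carries the data port, e.g. a branch output.
        lines.append(f'  "{src}" -> "{dst}" [label="{port}"];')
    lines.append("}")
    return "\n".join(lines)


nodes = {
    "users": "variable",
    "filter": "tool",
    "minors": "variable",
    "adults": "variable",
}
edges = [
    ("users", "filter", "in"),      # "in" is a hypothetical port name
    ("filter", "minors", "true"),
    ("filter", "adults", "false"),
]
dot_text = to_dot(nodes, edges)
print(dot_text)
```

Saving the printed text to a file and running `dot -Tpng` on it produces an image in the same way as the commands above.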

Key Concepts

Flow-based execution

An Anvil script is a sequence of statements. Each statement defines a flow of tools and variables connected by pipes (|).

[input: './data/users.parquet'] | [select: id='id', email='email'] | [print];

Each tool consumes zero or more dataframes (a source such as input reads a file and takes none) and produces zero or more dataframes.

Tools

Tools are written using bracket syntax:

[tool_name: arguments]

Examples:

[input: './data/users.parquet']
[select: 'id,email']
[print]

The brackets clearly distinguish tools from variables and make branching explicit.

Variables

A statement can bind its final result to a variable with the > operator:

[input: './data/users.parquet'] > users;

Variables may be used as inputs to later flows:

users | [count] | [print];

Variables are first-class graph nodes — they represent stored data, not execution.
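The relationship between flows and variables can be sketched in Python. This is only a mental model, not Anvil's engine: rows are plain dictionaries instead of DataFusion dataframes, and `run_flow`, `select`, `count`, and `env` are hypothetical names introduced for the example.

```python
def run_flow(source, tools):
    """Pass `source` through each tool in order, like tools chained with |."""
    data = source
    for tool in tools:
        data = tool(data)
    return data


def select(*cols):
    """Build a tool that keeps only the named columns of each row."""
    return lambda rows: [{c: row[c] for c in cols} for row in rows]


def count(rows):
    """A tool that replaces its input with a single row holding the row count."""
    return [{"count": len(rows)}]


raw = [
    {"id": 1, "email": "a@example.com", "age": 17},
    {"id": 2, "email": "b@example.com", "age": 40},
]

env = {}  # variable name -> stored data

# Roughly: <source> | [select: ...] > users;
env["users"] = run_flow(raw, [select("id", "email")])

# Roughly: users | [count] | [print];
result = run_flow(env["users"], [count])
print(result)
```

The variable holds data that has already been produced, so a later flow that starts from `users` reads the stored rows rather than repeating the work that created them.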

Comments and whitespace

- Comments start with #
- Whitespace is flexible and mostly insignificant

# Load users
[input: './data/users.parquet'] > users;

Grammar Overview (Informal)

- Statements end with ;
- Tools are chained with |
- Variable binding uses >
- Tool arguments support:
  - positional arguments
  - keyword arguments
  - flow arguments (nested pipelines in parentheses)
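As a rough illustration of the first rule and the comment rule, the Python sketch below splits script text into statements. It is not Anvil's parser. It assumes that comments run to the end of the line, that strings are single-quoted with no escape sequences, and that text after the last ; is ignored; `split_statements` is a hypothetical helper.

```python
def split_statements(script):
    """Split script text into statements on ';'.

    A ';' or '#' inside a single-quoted string is treated as ordinary text.
    A '#' outside a string starts a comment that runs to the end of the line.
    """
    statements, current = [], []
    in_string = in_comment = False
    for ch in script:
        if in_comment:
            if ch == "\n":          # the comment ends with the line
                in_comment = False
                current.append(ch)
            continue
        if in_string:
            current.append(ch)
            if ch == "'":           # closing quote
                in_string = False
            continue
        if ch == "'":
            in_string = True
            current.append(ch)
        elif ch == "#":
            in_comment = True
        elif ch == ";":
            text = "".join(current).strip()
            if text:
                statements.append(text)
            current = []
        else:
            current.append(ch)
    return statements


script = (
    "# Load users\n"
    "[input: './data/a;b.parquet'] > users;\n"
    "users | [count] | [print];\n"
)
statements = split_statements(script)
for statement in statements:
    print(statement)
```

A real parser would go on to tokenize each statement into tools, pipes, and bindings; this sketch stops at the statement boundary.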

Tool Arguments

Positional arguments

[limit: 10]

Keyword arguments

[sort: id=desc]

Mixed positional + keyword (positional first)

[register: './data/users.parquet', table='users']

Flow arguments (subflows)

Some tools accept flows as argument values. Flows used as arguments must be wrapped in parentheses.

[join:
    df_lt=users
    df_rt=([input: './data/orders.parquet'])
    left_cols='id'
    right_cols='user_id'
]

A flow argument may reference:

- a variable
- a tool
- an entire pipeline

Branching

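A flow may branch by following a tool with : and a comma-separated list of port => target pairs. Each target names where that branch's output goes, such as a variable or a further tool.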

users | [filter: '$age < 18']
    : true => minors
    , false => adults;

Each branch produces its own output flow.
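The effect of a branching filter can be sketched in Python as a partition of the input rows. This is an illustration under assumptions, not Anvil's implementation: rows are dictionaries, the predicate is a Python function rather than an expression string, and how rows with null values are routed is not specified by this document, so the sketch does not model it. `branch_filter` is a hypothetical name.

```python
def branch_filter(rows, predicate):
    """Split rows into the two named outputs of a filter: 'true' and 'false'."""
    ports = {"true": [], "false": []}
    for row in rows:
        ports["true" if predicate(row) else "false"].append(row)
    return ports


users = [
    {"name": "Ann", "age": 12},
    {"name": "Bo", "age": 30},
    {"name": "Cy", "age": 17},
]

ports = branch_filter(users, lambda row: row["age"] < 18)

# Roughly: : true => minors , false => adults;
env = {"minors": ports["true"], "adults": ports["false"]}
print(env)
```

Every input row lands in exactly one of the two outputs, which is why both targets can be consumed independently by later flows.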

Available Tools

I/O

input — read a file into a dataframe
output — write a dataframe to a file
print — write a dataframe to stdout
register — register a file as a SQL table

Inspection

schema — produce a dataframe describing the schema
describe — metadata and statistics
count — count rows
distinct — distinct rows

Transformation

select — select columns / expressions
filter — filter rows using expressions
project — compute new columns from expressions
sort — sort by expressions
limit — limit number of rows
drop — drop columns
fill — fill null values

Set operations

union
intersect
join

SQL

sql — execute SQL against registered tables

Expressions

Many tools accept DataFusion expressions as strings.

Examples:

[filter: '$age > 30']
[project: total='$price * $quantity']
[sort: created_at, user_id]

Column references use $column_name.
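Because column references carry a $ prefix, the columns an expression depends on can be found by simple text scanning, which is one way lineage analysis could start. The Python sketch below is an illustration only: it assumes column names consist of letters, digits, and underscores and do not start with a digit, and it does not skip $ characters that appear inside string literals. `referenced_columns` is a hypothetical helper, not part of Anvil.

```python
import re

# Assumed identifier shape: a letter or underscore, then letters, digits, underscores.
COLUMN_REF = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def referenced_columns(expression):
    """Return the distinct column names referenced as $name, in first-seen order."""
    seen = []
    for name in COLUMN_REF.findall(expression):
        if name not in seen:
            seen.append(name)
    return seen


print(referenced_columns("$price * $quantity"))
print(referenced_columns("$age >= 18 and $age < 65"))
```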

Example Scripts

Load and inspect data

[input: './data/users.parquet'] > users;

users | [schema] | [print];
users | [count]  | [print];

Filtering and projection

[input: './data/users.parquet']
| [filter: '$age >= 18']
| [project:
      full_name='$first_name || " " || $last_name',
      age_bucket='$age / 10'
  ]
| [print];

Join with subflows

[join:
    type='inner'
    df_lt=([input: './data/users.parquet'])
    df_rt=([input: './data/orders.parquet'])
    cols_lt='id'
    cols_rt='user_id'
]
| [print];
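For readers unfamiliar with joins, the Python sketch below shows what an inner join on one key column from each side computes. It is a generic illustration, not Anvil's or DataFusion's implementation: rows are dictionaries, the sample data is invented, and merging the two rows into one dictionary is an assumption about how output columns are combined. `inner_join` is a hypothetical name.

```python
def inner_join(left, right, left_col, right_col):
    """Pair every left row with every right row whose key value matches."""
    # Index the right side by key so each left row is matched in one lookup.
    by_key = {}
    for row in right:
        by_key.setdefault(row[right_col], []).append(row)

    joined = []
    for left_row in left:
        for right_row in by_key.get(left_row[left_col], []):
            joined.append({**left_row, **right_row})
    return joined


users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 1},
    {"order_id": 12, "user_id": 3},
]

rows = inner_join(users, orders, "id", "user_id")
for row in rows:
    print(row)
```

Users without orders and orders without a matching user are both dropped, which is the defining property of an inner join.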

Branching example

[input: './data/messy.parquet'] | [filter: '$three == true']:
	true => [print],
	false => df;

df | [print];

SQL example

[register: './data/users.parquet', table='users'];

[sql: 'SELECT age, COUNT(*) FROM users GROUP BY age']
| [print];

Design Goals

Readable pipelines
Explicit dataflow
Graph-based execution
Static analyzability (lineage, dependencies)
Tight integration with DataFusion

Status

Anvil is under active development.

Current areas of focus:

execution graph construction
data lineage tracking
REPL support
richer expression semantics
