Anvil is a dataflow-oriented scripting language and execution engine built on Rust and Apache DataFusion.
It is designed for readable, composable, graph-friendly data pipelines that can be parsed, analyzed, and executed deterministically.
At its core, Anvil treats data processing as a sequence of tools connected by flows, with optional branching and variable binding.
Anvil can be run either by providing a script file or by entering an interactive REPL (when no script is specified).
anvil [OPTIONS] [SCRIPT]
SCRIPT
Optional path to an Anvil script file.
If omitted, Anvil starts in REPL mode.
--dot [PATH]
Emit the execution plan as a Graphviz DOT graph instead of executing the plan.
- If PATH is provided, the DOT output is written to that file.
- If PATH is omitted, the DOT output is written to stdout.
This is useful for inspecting execution order, data lineage, tool dependencies, and branching behavior.
anvil examples/join.avl
anvil --dot examples/join.avl
anvil --dot plan.dot examples/join.avl
You can pipe the DOT output directly into the dot binary to generate an image:
anvil --dot examples/join.avl | dot -Tpng -o plan.png
Or, if you wrote the DOT file explicitly:
anvil --dot plan.dot examples/join.avl
dot -Tpng plan.dot -o plan.png
The resulting graph visually distinguishes tools and variables, and edge labels represent data ports (including branch outputs such as true and false).
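To make the shape of such a graph concrete, here is a minimal Python sketch that renders DOT text for a small plan with one branching tool. The `to_dot` helper, the node shapes (boxes for tools, ellipses for variables), and the label layout are assumptions chosen for illustration. Anvil's actual DOT output may look different.

```python
def to_dot(tools, variables, edges):
    """Render a small dataflow graph as Graphviz DOT text.

    tools, variables: lists of node names.
    edges: list of (source, target, port_label) tuples; port_label may be "".
    """
    lines = ["digraph plan {"]
    # Assumed styling: boxes for tools, ellipses for variables.
    for name in tools:
        lines.append(f'  "{name}" [shape=box];')
    for name in variables:
        lines.append(f'  "{name}" [shape=ellipse];')
    for src, dst, port in edges:
        # Only label edges that carry a named port, such as a branch output.
        attrs = f' [label="{port}"]' if port else ""
        lines.append(f'  "{src}" -> "{dst}"{attrs};')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(
    tools=["input", "filter"],
    variables=["users", "minors", "adults"],
    edges=[
        ("input", "users", ""),
        ("users", "filter", ""),
        ("filter", "minors", "true"),
        ("filter", "adults", "false"),
    ],
)
print(dot_text)
```

The printed text can be saved to a file and rendered with the same `dot -Tpng` command shown above.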
An Anvil script is a sequence of statements. Each statement defines a flow of tools and variables connected by pipes (|).
[input: './data/users.parquet'] | [select: id='id', email='email'] | [print];
Each tool consumes one or more dataframes and produces zero or more dataframes.
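The pipe-separated structure of a statement can be illustrated with a small Python sketch that splits a statement into its stages. This is not Anvil's parser. The `split_stages` helper is hypothetical, and it assumes that a pipe only separates stages when it appears outside quotes, brackets, and parentheses.

```python
def split_stages(statement):
    """Split one statement on top-level '|' characters.

    Pipes inside quotes, brackets, or parentheses are left alone.
    """
    stages, current = [], []
    depth = 0      # nesting depth of [] and ()
    quote = None   # currently open quote character, if any
    for ch in statement.rstrip().rstrip(";"):
        if quote:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch in "[(":
            depth += 1
        elif ch in "])":
            depth -= 1
        elif ch == "|" and depth == 0:
            # A top-level pipe ends the current stage.
            stages.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    stages.append("".join(current).strip())
    return stages


stages = split_stages(
    "[input: './data/users.parquet'] | [select: id='id', email='email'] | [print];"
)
print(stages)
```

The sketch ignores variable binding and branching, so it only covers plain tool chains.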
Tools are written using bracket syntax:
[tool_name: arguments]
Examples:
[input: './data/users.parquet']
[select: 'id,email']
[print]
The brackets clearly distinguish tools from variables and make branching explicit.
A statement can bind its final result to a variable using >:
[input: './data/users.parquet'] > users;
Variables may be used as inputs to later flows:
users | [count] | [print];
Variables are first-class graph nodes — they represent stored data, not execution.
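One consequence of treating variables as graph nodes is that dependencies between statements can be derived from the variables they bind and read. The Python sketch below models this with a hand-written `statements` table and the standard library's `graphlib`. The table layout and the `dependency_graph` helper are assumptions for illustration and say nothing about how Anvil builds its graph internally.

```python
from graphlib import TopologicalSorter

# Hypothetical model: each statement lists the variables it reads and binds.
statements = {
    "s1": {"reads": [], "binds": "users"},      # [input: ...] > users;
    "s2": {"reads": ["users"], "binds": None},  # users | [count] | [print];
    "s3": {"reads": ["users"], "binds": None},  # users | [schema] | [print];
}


def dependency_graph(statements):
    """Map each statement to the set of statements whose bindings it reads."""
    producer = {s["binds"]: name for name, s in statements.items() if s["binds"]}
    return {
        name: {producer[var] for var in s["reads"]}
        for name, s in statements.items()
    }


graph = dependency_graph(statements)
# Any valid order runs the statement that binds `users` before its readers.
order = list(TopologicalSorter(graph).static_order())
print(order)
```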
- Comments start with #
- Whitespace is flexible and mostly insignificant
# Load users
[input: './data/users.parquet'] > users;
- Statements end with ;
- Tools are chained with |
- Variable binding uses >
- Tool arguments support:
  - positional arguments
  - keyword arguments
  - flow arguments (nested pipelines in parentheses)
[limit: 10]
[register: './data/users.parquet', table='users']
[sort: id=desc]
Some tools accept flows as argument values. Flows used as arguments must be wrapped in parentheses.
[join:
df_lt=users
df_rt=([input: './data/orders.parquet'])
cols_lt='id'
cols_rt='user_id'
]
A flow argument may reference:
- a variable
- a tool
- an entire pipeline
Flows may branch using : and named branch targets.
users | [filter: '$age < 18']
: true => minors
, false => adults;
Each branch produces its own output flow.
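The effect of a branching filter can be pictured as a partition of the input rows into two named outputs. The Python sketch below shows that idea on plain dictionaries. The `branch_filter` helper is hypothetical, and the sketch does not cover how Anvil handles rows where the predicate evaluates to null.

```python
def branch_filter(rows, predicate):
    """Split rows into two named outputs, keyed by the predicate result."""
    outputs = {"true": [], "false": []}
    for row in rows:
        key = "true" if predicate(row) else "false"
        outputs[key].append(row)
    return outputs


users = [
    {"name": "Ada", "age": 36},
    {"name": "Tim", "age": 12},
    {"name": "Lin", "age": 17},
]

# Mirrors: users | [filter: '$age < 18'] : true => minors, false => adults;
branches = branch_filter(users, lambda row: row["age"] < 18)
minors, adults = branches["true"], branches["false"]
print(minors)
print(adults)
```

Every input row lands in exactly one output, so the two branches together always cover the original data.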
- input — read a file into a dataframe
- output — write a dataframe to a file
- print — write a dataframe to stdout
- register — register a file as a SQL table
- schema — produce a dataframe describing the schema
- describe — metadata and statistics
- count — count rows
- distinct — distinct rows
- select — select columns / expressions
- filter — filter rows using expressions
- project — compute new columns from expressions
- sort — sort by expressions
- limit — limit number of rows
- drop — drop columns
- fill — fill null values
- union
- intersect
- join
- sql — execute SQL against registered tables
Many tools accept DataFusion expressions as strings.
Examples:
[filter: '$age > 30']
[project: total='$price * $quantity']
[sort: created_at, user_id]
Column references use $column_name.
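Because column references share a common prefix, the columns an expression depends on can be found by scanning the expression string. The Python sketch below does this with a regular expression. The `referenced_columns` helper and the identifier rule (letters, digits, and underscores, not starting with a digit) are assumptions. Anvil's grammar may accept other column names.

```python
import re

# Assumed identifier rule: a letter or underscore, then letters, digits, underscores.
COLUMN_REF = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def referenced_columns(expression):
    """Return column names referenced as $name, in first-seen order, without duplicates."""
    seen = []
    for name in COLUMN_REF.findall(expression):
        if name not in seen:
            seen.append(name)
    return seen


columns = referenced_columns("$price * $quantity")
print(columns)
```

A scan like this is one way to derive column-level lineage for tools such as filter and project without evaluating the expression.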
[input: './data/users.parquet'] > users;
users | [schema] | [print];
users | [count] | [print];
[input: './data/users.parquet']
| [filter: '$age >= 18']
| [project:
full_name='$first_name || " " || $last_name',
age_bucket='$age / 10'
]
| [print];
[join:
type='inner'
df_lt=([input: './data/users.parquet'])
df_rt=([input: './data/orders.parquet'])
cols_lt='id'
cols_rt='user_id'
]
| [print];
[input: './data/messy.parquet'] | [filter: '$three == true']:
true => [print],
false => df;
df | [print];
[register: './data/users.parquet', table='users'];
[sql: 'SELECT age, COUNT(*) FROM users GROUP BY age']
| [print];
- Readable pipelines
- Explicit dataflow
- Graph-based execution
- Static analyzability (lineage, dependencies)
- Tight integration with DataFusion
Anvil is under active development.
Current areas of focus:
- execution graph construction
- data lineage tracking
- REPL support
- richer expression semantics