README.md is the English source of truth. The Chinese mirror lives in README-zh.md. Keep both files aligned.
Velaria is a local-first C++17 dataflow engine research project. The current goal is narrow and explicit:
- keep one native kernel as the execution source of truth
- keep the single-node path stable
- expose that kernel through a supported Python ecosystem layer
- use the same-host actor/rpc path as an experiment lane, not as a second kernel
## Core kernel

Owns:
- local batch and streaming execution
- logical planning and minimal SQL mapping
- source/sink ABI
- explain / progress / checkpoint contract
- local vector search
Repository entrypoints:
- docs:
- source groups:
  - `//:velaria_core_logical_sources`
  - `//:velaria_core_execution_sources`
  - `//:velaria_core_contract_sources`
- regression:
  - `//:core_regression`
## Python ecosystem layer

Owns:
- native binding in `python_api`
- Arrow ingress and output
- `uv` workflow: wheel / native wheel / CLI packaging
- Excel / Bitable / custom stream adapters
- local workspace tracking for runs and artifacts
Does not own:
- execution hot-path semantics
- independent explain / progress / checkpoint semantics
- replacement checkpoint storage
- SQLite as a large-result engine
Repository entrypoints:
- docs:
- source groups:
  - `//:velaria_python_ecosystem_sources`
  - `//python_api:velaria_python_supported_sources`
  - `//python_api:velaria_python_example_sources`
  - `//python_api:velaria_python_experimental_sources`
- regression:
  - `//:python_ecosystem_regression`
  - `//python_api:velaria_python_supported_regression`
  - `./scripts/run_python_ecosystem_regression.sh`
## Experimental same-host lane

Owns:

- same-host `actor/rpc/jobmaster` experiments
- transport / codec / scheduler observation
- same-host smoke and benchmark tooling
Does not imply:
- distributed scheduling
- distributed fault recovery
- cluster resource governance
- production distributed execution
Repository entrypoints:
- source group:
  - `//:velaria_experimental_sources`
- regression:
  - `//:experimental_regression`
  - `./scripts/run_experimental_regression.sh`
Dataflow:

```
Arrow / CSV / Python ingress
  -> DataflowSession / DataFrame / StreamingDataFrame
  -> local runtime kernel
  -> sink
  -> explain / progress / checkpoint
```
Public session entry:
`DataflowSession`
Core user-facing objects:
`DataFrame`, `StreamingDataFrame`, `StreamingQuery`
Main stream entry points:
- `session.readStream(source)`
- `session.readStreamCsvDir(path)`
- `session.streamSql(sql)`
- `session.explainStreamSql(sql, options)`
- `session.startStreamSql(sql, options)`
- `StreamingDataFrame.writeStream(sink, options)`
Stable contract surfaces:
- `StreamingQueryProgress` and `snapshotJson()`
- `explainStreamSql(...)`
- `execution_mode` / `execution_reason` / `transport_mode`
- `checkpoint_delivery_mode`
- source/sink lifecycle: `open -> nextBatch -> checkpoint -> ack -> close`
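The source/sink lifecycle can be sketched as a toy Python driver loop. The `ListSource` class, its method names, and the offset-token shape below are illustrative only, not the real source/sink ABI:

```python
# Toy sketch of the lifecycle contract:
# open -> nextBatch -> checkpoint -> ack -> close.
# ListSource and run_source are illustrative names, not the real ABI.

class ListSource:
    def __init__(self, batches):
        self.batches = batches
        self.pos = 0
        self.opened = False

    def open(self):
        self.opened = True

    def next_batch(self):
        # Return the next batch, or None when the source is drained.
        if self.pos >= len(self.batches):
            return None
        batch = self.batches[self.pos]
        self.pos += 1
        return batch

    def checkpoint(self):
        # Offset-style checkpoint token: how many batches were emitted.
        return {"offset": self.pos}

    def ack(self, token):
        # The engine acknowledges the checkpoint once it is durable.
        self.last_acked = token

    def close(self):
        self.opened = False


def run_source(source):
    """Drive one source through the full lifecycle, collecting batches."""
    out = []
    source.open()
    while (batch := source.next_batch()) is not None:
        out.append(batch)
        source.ack(source.checkpoint())
    source.close()
    return out
```

Here the checkpoint is taken after every batch for clarity; a real engine would batch acknowledgements according to its checkpoint delivery mode.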
`explainStreamSql(...)` always returns:

- `logical`
- `physical`
- `strategy`
strategy is the single outlet for mode selection, fallback reason, transport, backpressure, and checkpoint delivery mode.
Workspace persistence keeps the kernel contract unchanged:

- `explain.json` stores `logical / physical / strategy`
- `progress.jsonl` appends native `snapshotJson()` output line by line
- large results stay in files; SQLite stores only index rows and small previews
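Because `progress.jsonl` holds one native `snapshotJson()` object per appended line, a minimal reader is straightforward. The tolerance for a truncated final line is an assumption about append-only files, and the field names in the test data are illustrative:

```python
import json

def read_progress_snapshots(lines):
    """Parse progress.jsonl content: one snapshot JSON object per line.

    Skips blank lines; a truncated trailing line (e.g. from a crash
    mid-append) stops parsing at the last complete snapshot instead of
    raising, since the file is append-only.
    """
    snapshots = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            snapshots.append(json.loads(line))
        except json.JSONDecodeError:
            break  # partial final line; keep what parsed cleanly
    return snapshots
```

In practice this would be fed from `open(run_dir / "progress.jsonl")` rather than an in-memory list.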
Available today:
- one native kernel for batch + streaming: `read_csv`, `readStream(...)`, `readStreamCsvDir(...)`
- query-local backpressure, bounded backlog, progress snapshots, checkpoint path
- execution modes: `single-process`, `local-workers`
- file source/sink support
- basic stream operators: `select / filter / withColumn / drop / limit / window`
- stateful stream aggregates: `sum / count / min / max / avg`
- minimal stream SQL subset
- local vector search on fixed-dimension float vectors
- Python Arrow ingress/output
- tracked local runs with run directory persistence and artifact indexing
- same-host actor/rpc/jobmaster smoke path
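The query-local backpressure and bounded backlog mentioned above can be illustrated with a toy structure. `BoundedBacklog` and its refuse-when-full policy are a sketch of the idea only, not the kernel's actual mechanism:

```python
from collections import deque

class BoundedBacklog:
    """Toy sketch of a query-local bounded backlog.

    offer() refuses new batches once the bound is hit, which is the
    backpressure signal for the upstream source to pause; take() drains
    from the front and frees capacity again.
    """

    def __init__(self, max_batches):
        self.max_batches = max_batches
        self.backlog = deque()

    def offer(self, batch):
        if len(self.backlog) >= self.max_batches:
            return False  # full: caller must pause and retry later
        self.backlog.append(batch)
        return True

    def take(self):
        return self.backlog.popleft() if self.backlog else None
```

A refuse-and-retry policy keeps memory bounded per query without any cross-query coordination, which matches the "query-local" framing above.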
Out of scope:
- completed distributed runtime claims
- Python callbacks or Python UDFs in the hot path
- broad SQL expansion such as full `JOIN / CTE / subquery / UNION`
- ANN / standalone vector DB / distributed vector execution
Main supported Python surfaces:
- `Session.read_csv(...)`
- `Session.sql(...)`
- `Session.create_dataframe_from_arrow(...)`
- `Session.create_stream_from_arrow(...)`
- `Session.create_temp_view(...)`
- `Session.read_stream_csv_dir(...)`
- `Session.stream_sql(...)`
- `Session.explain_stream_sql(...)`
- `Session.start_stream_sql(...)`
- `Session.vector_search(...)`
- `Session.explain_vector_search(...)`
- `read_excel(...)`
- custom source / custom sink adapters
- `Workspace`
  - root under `VELARIA_HOME` or `~/.velaria`
- `RunStore`
  - one run directory per execution
  - persists `run.json`, `inputs.json`, `explain.json`, `progress.jsonl`, logs, and `artifacts/`
- `ArtifactIndex`
  - SQLite-first metadata index
  - JSONL fallback when SQLite is unavailable
  - preview cache for small result slices only
This layer is for agent/skill invocation, local traceability, and machine-readable CLI integration. It is not a second execution engine.
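The SQLite-first index with JSONL fallback can be sketched roughly as below. The class name, table schema, and field names are illustrative, not the real `ArtifactIndex` layout:

```python
import json
import sqlite3

class ArtifactIndexSketch:
    """Toy SQLite-first metadata index with a JSONL fallback.

    Only small index rows go into SQLite; large results stay in files,
    per the layering above. Schema and names are illustrative.
    """

    def __init__(self, db_path=":memory:", jsonl_path=None):
        self.jsonl_path = jsonl_path
        try:
            self.db = sqlite3.connect(db_path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS artifacts "
                "(artifact_id TEXT PRIMARY KEY, run_id TEXT, path TEXT)"
            )
        except sqlite3.Error:
            self.db = None  # SQLite unavailable: fall back to JSONL

    def add(self, artifact_id, run_id, path):
        if self.db is not None:
            self.db.execute(
                "INSERT OR REPLACE INTO artifacts VALUES (?, ?, ?)",
                (artifact_id, run_id, path),
            )
        elif self.jsonl_path:
            row = {"artifact_id": artifact_id, "run_id": run_id, "path": path}
            with open(self.jsonl_path, "a") as f:
                f.write(json.dumps(row) + "\n")

    def list_for_run(self, run_id):
        if self.db is not None:
            cur = self.db.execute(
                "SELECT artifact_id, path FROM artifacts "
                "WHERE run_id = ? ORDER BY artifact_id",
                (run_id,),
            )
            return cur.fetchall()
        return []
```

The point of the fallback path is graceful degradation: the run remains traceable as append-only JSONL even when no SQLite index can be opened.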
Repo-visible CLI entrypoints are:

- source checkout:

  ```sh
  uv run --project python_api python python_api/velaria_cli.py ...
  ```

- packaged binary:

  ```sh
  ./dist/velaria-cli ...
  ```
Do not assume a global velaria-cli command exists unless you have installed one separately.
Bootstrap:

```sh
bazel build //:velaria_pyext
bazel run //python_api:sync_native_extension
uv sync --project python_api --python python3.12
```

Run examples:

```sh
uv run --project python_api python python_api/examples/demo_batch_sql_arrow.py
uv run --project python_api python python_api/examples/demo_stream_sql.py
uv run --project python_api python python_api/examples/demo_vector_search.py
```

Tracked run examples:

```sh
uv run --project python_api python python_api/velaria_cli.py run start -- csv-sql \
  --csv /path/to/input.csv \
  --query "SELECT * FROM input_table LIMIT 5"

uv run --project python_api python python_api/velaria_cli.py run show --run-id <run_id>
uv run --project python_api python python_api/velaria_cli.py artifacts list --run-id <run_id>
uv run --project python_api python python_api/velaria_cli.py artifacts preview --artifact-id <artifact_id>
```

Vector search is a local kernel capability, not a separate subsystem.
Current scope:

- fixed-dimension `float32`
- metrics: `cosine`, `dot`, `l2`
- `top-k` exact scan only
- Python `Session.vector_search(...)`
- Arrow `FixedSizeList<float32>`
- explain output

Preferred local CSV vector text shapes: `[1 2 3]` or `[1,2,3]`
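A minimal parser for these two text shapes might look like this; it is an illustrative reading of the shapes shown above, not the kernel's actual CSV vector parser:

```python
def parse_vector_text(text):
    """Parse CSV vector text such as "[1 2 3]" or "[1,2,3]".

    Accepts spaces or commas as separators and returns a list of floats.
    Illustrative only; the kernel's parser may be stricter.
    """
    inner = text.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    return [float(tok) for tok in inner.replace(",", " ").split()]
```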
Design doc:
CLI examples:

```sh
uv run --project python_api python python_api/velaria_cli.py csv-sql \
  --csv /path/to/input.csv \
  --query "SELECT * FROM input_table LIMIT 5"

./dist/velaria-cli vector-search \
  --csv /path/to/vectors.csv \
  --vector-column embedding \
  --query-vector "0.1,0.2,0.3" \
  --metric cosine \
  --top-k 5
```

Vector explain is part of the stable contract. Current fields include:

- `mode=exact-scan`
- `metric=<cosine|dot|l2>`
- `dimension=<N>`
- `top_k=<K>`
- `candidate_rows=<M>`
- `filter_pushdown=false`
- `acceleration=flat-buffer+heap-topk`
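The `mode=exact-scan` plus `acceleration=flat-buffer+heap-topk` strategy can be illustrated with a small reference implementation. This sketch demonstrates the technique only (scan every candidate, keep the k best in a min-heap); it is not the kernel's code:

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def exact_scan_topk(rows, query, k, metric=cosine):
    """Exact scan over all rows, keeping the k best scores in a min-heap.

    The heap root is always the worst retained score, so each candidate
    costs O(log k) at most; total cost is O(n log k) for n rows.
    """
    heap = []  # (score, row_id) pairs; smallest retained score on top
    for row_id, vec in enumerate(rows):
        score = metric(vec, query)
        if len(heap) < k:
            heapq.heappush(heap, (score, row_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, row_id))
    return sorted(heap, key=lambda s: -s[0])  # best score first
```

With a contiguous (flat) buffer of fixed-dimension vectors, the same loop vectorizes well, which is presumably what the `flat-buffer` half of the acceleration field refers to.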
Benchmark baseline:

```sh
./scripts/run_vector_search_benchmark.sh
```

Same-host flow:

```
client -> scheduler(jobmaster) -> worker -> in-proc operator chain -> result
```

Build:

```sh
bazel build //:actor_rpc_scheduler //:actor_rpc_worker //:actor_rpc_client //:actor_rpc_smoke
```

Smoke:

```sh
bazel run //:actor_rpc_smoke
```

Three-process local run:

```sh
bazel run //:actor_rpc_scheduler -- --listen 127.0.0.1:61000 --node-id scheduler
bazel run //:actor_rpc_worker -- --connect 127.0.0.1:61000 --node-id worker-1
bazel run //:actor_rpc_client -- --connect 127.0.0.1:61000 --payload "demo payload"
```

Single-node baseline:

```sh
bazel run //:sql_demo
bazel run //:df_demo
bazel run //:stream_demo
```

Layered regression entrypoints:

```sh
./scripts/run_core_regression.sh
./scripts/run_python_ecosystem_regression.sh
./scripts/run_experimental_regression.sh
./scripts/run_stream_observability_regression.sh
```

Direct Bazel suites:

```sh
bazel test //:core_regression
bazel test //:python_ecosystem_regression
bazel test //:experimental_regression
```

Conventions:

- language baseline: C++17
- build system: Bazel
- keep `DataflowSession` as the public session entry
- do not break `sql_demo / df_demo / stream_demo`
- keep example source files as `.cc`
- use `uv` for Python commands in this repository
- keep `README.md` and `README-zh.md` aligned