Merged
26 changes: 26 additions & 0 deletions BUILD.bazel
Expand Up @@ -15,6 +15,7 @@ cc_library(
"src/dataflow/runtime/job_master.cc",
"src/dataflow/runtime/byte_transport.cc",
"src/dataflow/runtime/rpc_runner.cc",
"src/dataflow/runtime/vector_index.cc",
"src/dataflow/rpc/rpc_codec.cc",
"src/dataflow/transport/ipc_transport.cc",
"src/dataflow/ai/plugin_runtime.cc",
Expand Down Expand Up @@ -43,6 +44,7 @@ cc_library(
"src/dataflow/runtime/job_master.h",
"src/dataflow/runtime/observability.h",
"src/dataflow/runtime/rpc_runner.h",
"src/dataflow/runtime/vector_index.h",
"src/dataflow/ai/plugin_runtime.h",
"src/dataflow/rpc/rpc_codec.h",
"src/dataflow/transport/ipc_transport.h",
Expand Down Expand Up @@ -128,6 +130,12 @@ cc_binary(
deps = [":dataflow_core"],
)

cc_binary(
    name = "velaria_cli",
    srcs = ["src/dataflow/examples/velaria_cli.cc"],
    deps = [":dataflow_core"],
)

cc_library(
name = "dataflow_actor_rpc_codec",
srcs = [
Expand Down Expand Up @@ -235,6 +243,15 @@ cc_binary(
deps = [":dataflow_stream_actor_runtime"],
)

cc_binary(
    name = "vector_search_benchmark",
    srcs = ["src/dataflow/examples/vector_search_benchmark.cc"],
    deps = [
        ":dataflow_actor_rpc_codec",
        ":dataflow_core",
    ],
)

cc_test(
name = "sql_regression_test",
srcs = ["src/dataflow/tests/sql_regression_test.cc"],
Expand Down Expand Up @@ -270,3 +287,12 @@ cc_test(
srcs = ["src/dataflow/tests/stream_strategy_explain_test.cc"],
deps = [":dataflow_core"],
)

cc_test(
    name = "vector_runtime_test",
    srcs = ["src/dataflow/tests/vector_runtime_test.cc"],
    deps = [
        ":dataflow_actor_rpc_codec",
        ":dataflow_core",
    ],
)
51 changes: 51 additions & 0 deletions README-zh.md
Expand Up @@ -229,6 +229,43 @@ uv run --project python_api python python_api/demo_batch_sql_arrow.py
uv run --project python_api python python_api/demo_stream_sql.py
```

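The Python `Session` also gains a vector query entry point, `Session.vector_search(table, vector_column, query_vector, top_k, metric)` (`metric` supports cosine/dot/l2), along with an explain interface, `Session.explain_vector_search(...)`.

A single-file CLI executable can be packaged (it bundles the Python runtime dependencies and the native `_velaria.so`):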

```bash
./scripts/build_py_cli_executable.sh
./dist/velaria-cli csv-sql \
--csv /path/to/input.csv \
--query "SELECT * FROM input_table LIMIT 5"
```

A native CLI binary can also be built directly (no Python environment is needed at runtime):

```bash
bazel build //:velaria_cli
./bazel-bin/velaria_cli \
--csv /path/to/input.csv \
--query "SELECT * FROM input_table LIMIT 5"
```

Vector query via the native CLI (fixed-length vector; supports cosine/cosin, dot, and l2):

```bash
./bazel-bin/velaria_cli \
--csv /path/to/vectors.csv \
--vector-column embedding \
--query-vector "0.1,0.2,0.3" \
--metric cosine \
--top-k 5
```

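The runtime transport layer now preserves the `FixedVector` type in both the proto-like and binary row batch codecs, so vector dimension semantics are not lost in cross-process transfers.
In the internal codecs, FixedVector is now encoded as a raw float bit payload, which avoids the precision loss of a text round trip.
Vector search is currently scoped to a local exact scan (`mode=exact-scan`) over fixed-dimension float vectors; v0.1 includes no ANN or distributed execution path.
Arrow ingestion adds a native fast path for `FixedSizeList<float32>`, which reduces Python object conversion overhead for vector columns.
Result return in the same-host actor runtime now uses a "two-frame" model: control messages stay on `actor-rpc-v1`, while the result table travels separately as a `table-bin-v1` `DataBatch` frame linked by `correlation_id`. The hot path no longer packs the whole result table into the actor JSON body.

## Same-Host Multi-Process Experiment

The same-host path is intentionally minimal: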
Expand All @@ -249,6 +286,8 @@ smoke:
bazel run //:actor_rpc_smoke
```

The smoke target now verifies both the actor control message and the correlated binary `DataBatch` result frame.

Three-process local run:

```bash
Expand All @@ -270,6 +309,18 @@ Dashboard:
- `//:stream_benchmark`
- `//:stream_actor_benchmark`
- `//:tpch_q1_style_benchmark`
- `//:vector_search_benchmark`

Vector benchmark:

```bash
bazel run //:vector_search_benchmark
```

It emits two kinds of JSON lines:

- `vector-query`: cold query, warm query, and warm explain latency
- `vector-transport`: encode/decode time and payload size for proto-like vs `BinaryRowBatch`, plus actor control-frame overhead

Same-host observability regression:

51 changes: 51 additions & 0 deletions README.md
Expand Up @@ -190,6 +190,8 @@ Main API:
- `Session.stream_sql(...)`
- `Session.explain_stream_sql(...)`
- `Session.start_stream_sql(...)`
- `Session.vector_search(table, vector_column, query_vector, top_k=10, metric="cosine")` (`metric`: cosine/dot/l2)
- `Session.explain_vector_search(table, vector_column, query_vector, top_k=10, metric="cosine")`

Arrow ingestion accepts:

Expand All @@ -208,6 +210,41 @@ uv run --project python_api python python_api/demo_batch_sql_arrow.py
uv run --project python_api python python_api/demo_stream_sql.py
```

Build a single-file CLI executable (bundles Python runtime deps + native `_velaria.so`):

```bash
./scripts/build_py_cli_executable.sh
./dist/velaria-cli csv-sql \
--csv /path/to/input.csv \
--query "SELECT * FROM input_table LIMIT 5"
```

Build a native CLI binary (no Python dependency at runtime):

```bash
bazel build //:velaria_cli
./bazel-bin/velaria_cli \
--csv /path/to/input.csv \
--query "SELECT * FROM input_table LIMIT 5"
```

Vector query (fixed-length vector, cosine/dot/l2) via native CLI:

```bash
./bazel-bin/velaria_cli \
--csv /path/to/vectors.csv \
--vector-column embedding \
--query-vector "0.1,0.2,0.3" \
--metric cosine \
--top-k 5
```
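
The `--query-vector` flag above takes a comma-separated list of floats. As a rough illustration of the validation such input needs before a fixed-dimension search, here is a minimal Python sketch. The function name `parse_query_vector` and its `expected_dimension` parameter are hypothetical and are not taken from the Velaria CLI source.

```python
import math


def parse_query_vector(text, expected_dimension=None):
    """Parse "0.1,0.2,0.3" into a list of floats (hypothetical helper).

    Raises ValueError on empty components, non-numeric or non-finite
    values, or a dimension mismatch against the vector column.
    """
    parts = [p.strip() for p in text.split(",")]
    if any(p == "" for p in parts):
        raise ValueError("query vector must be a comma-separated list of floats")
    values = [float(p) for p in parts]  # float() raises ValueError on bad input
    if not all(math.isfinite(v) for v in values):
        raise ValueError("query vector components must be finite")
    if expected_dimension is not None and len(values) != expected_dimension:
        raise ValueError(
            f"dimension mismatch: got {len(values)}, expected {expected_dimension}"
        )
    return values
```

Rejecting a mismatched dimension up front keeps the scan itself free of per-row length checks.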

Runtime-level vector transport now preserves `FixedVector` through proto-like and binary row batch codecs, so cross-process payloads keep vector type and dimensions.
FixedVector serialization now uses raw float bit payload encoding in internal codecs to avoid text round-trip precision loss.
Vector search is currently limited to a local exact scan (`mode=exact-scan`) over fixed-dimension float vectors; v0.1 has no ANN or distributed path.
Arrow ingestion now includes a direct `FixedSizeList<float32>` fast path in the native bridge, reducing Python object conversion overhead on vector columns.
For same-host actor runtime results, the control message stays on `actor-rpc-v1`, while the result table is forwarded as a separate `table-bin-v1` `DataBatch` frame linked by `correlation_id`. The hot result path no longer puts row payloads inside the actor JSON body.
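
To illustrate why a raw float bit payload avoids the precision loss of a text round trip, here is a minimal Python sketch. The wire layout (a little-endian `uint32` dimension followed by packed `float32` values) and the function names are assumptions made for this example; they are not Velaria's actual codec format.

```python
import struct


def encode_fixed_vector(values):
    """Pack as <uint32 dimension><float32 * dimension> (hypothetical layout)."""
    return struct.pack(f"<I{len(values)}f", len(values), *values)


def decode_fixed_vector(payload):
    """Inverse of encode_fixed_vector: read the dimension, then the floats."""
    (dim,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from(f"<{dim}f", payload, 4))


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


original = [to_float32(v) for v in (0.1, 0.2, 1.0 / 3.0)]

# Raw bits: every float32 value survives exactly.
binary_roundtrip = decode_fixed_vector(encode_fixed_vector(original))

# Text with a fixed number of decimals: 1/3 is truncated and comes back different.
text_roundtrip = [to_float32(float(f"{v:.6f}")) for v in original]

print(binary_roundtrip == original)
print(text_roundtrip == original)
```

A text encoding can be made lossless by printing enough significant digits, but the raw payload is both exact by construction and fixed-size (4 bytes per component in this sketch).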

## Same-Host Multi-Process Experiment

The same-host path is intentionally minimal:
Expand All @@ -228,6 +265,8 @@ Smoke:
bazel run //:actor_rpc_smoke
```

The smoke target now verifies both the actor control message and the correlated binary `DataBatch` result frame.
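
The pairing that the smoke target checks can be pictured with a small Python sketch: a JSON control message and a separate binary result frame that share a `correlation_id`. Apart from `actor-rpc-v1`, `table-bin-v1`, `DataBatch`, and `correlation_id`, every field and function name here is hypothetical, and the in-memory dict standing in for the binary frame is a simplification of whatever framing the runtime actually uses.

```python
import json
import uuid


def make_result_frames(table_payload: bytes):
    """Build a control message and a data frame linked by one correlation_id."""
    correlation_id = uuid.uuid4().hex
    control = json.dumps(
        {
            "protocol": "actor-rpc-v1",
            "kind": "result-ready",  # hypothetical message kind
            "correlation_id": correlation_id,
        }
    ).encode("utf-8")
    data_frame = {
        "protocol": "table-bin-v1",
        "frame_type": "DataBatch",
        "correlation_id": correlation_id,
        "payload": table_payload,  # rows travel here, not in the JSON body
    }
    return control, data_frame


def match_frame(control_bytes: bytes, data_frames):
    """Return the data frame whose correlation_id matches the control message."""
    control = json.loads(control_bytes)
    for frame in data_frames:
        if frame["correlation_id"] == control["correlation_id"]:
            return frame
    return None
```

Keeping row data out of the JSON body means the control message stays small and the table bytes never pass through JSON encoding.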

Three-process local run:

```bash
Expand All @@ -249,6 +288,18 @@ Useful local targets:
- `//:stream_benchmark`
- `//:stream_actor_benchmark`
- `//:tpch_q1_style_benchmark`
- `//:vector_search_benchmark`

Vector benchmark:

```bash
bazel run //:vector_search_benchmark
```

It emits JSON lines for:

- `vector-query`: cold query, warm query, and warm explain latency
- `vector-transport`: proto-like vs `BinaryRowBatch` serialize/deserialize cost and payload size, plus actor control-frame overhead

Same-host observability regression:

67 changes: 67 additions & 0 deletions docs/local_vector_search_v01.md
@@ -0,0 +1,67 @@
# Local Vector Search v0.1 (Velaria)

## Scope

This document defines a minimal local-first vector search path for Velaria.

### Goals

- Fixed-dimension `float32` vector column support.
- Exact scan backend only.
- Metrics: `cosine`, `dot`, `l2`.
- `top-k` query support.
- C++ API via `DataFrame` / `DataflowSession`.
- Python front-end API for invoking vector search.
- Explain text that mirrors actual runtime behavior.
- Keep the ingestion and query paths zero-copy where possible.

### Non-goals (v0.1)

- No ANN index (HNSW/IVF/PQ).
- No distributed vector execution.
- No standalone vector database subsystem.
- No new SQL grammar for vector search in this phase.

## Minimal abstractions

- `Value::DataType::FixedVector` stores fixed-dimension float vectors.
- `VectorIndex` runtime interface with an `ExactScanVectorIndex` implementation.
- `ExactScanVectorIndex` uses flat contiguous buffers and heap top-k selection for scan acceleration.
- Internal vector transport codecs use raw float bit payloads to avoid text precision loss.
- `VectorSearchMetric`: cosine/dot/l2.
- `VectorSearchResult`: `{row_id, score}`.
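
The abstractions above can be sketched in a few lines of Python: vectors stored back to back in one flat buffer, a per-row score, and heap-based top-k selection. This illustrates the general technique only and is not the `ExactScanVectorIndex` implementation. The function names are hypothetical, and treating `l2` as a distance (smaller is better) while `cosine` and `dot` are similarities (larger is better) is an assumption.

```python
import heapq
import math


def score(metric, query, row):
    """Score one row against the query under the given metric."""
    dot = sum(q * r for q, r in zip(query, row))
    if metric == "dot":
        return dot
    if metric == "cosine":
        norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(r * r for r in row))
        return dot / norm if norm else 0.0
    if metric == "l2":
        return math.sqrt(sum((q - r) ** 2 for q, r in zip(query, row)))
    raise ValueError(f"unknown metric: {metric}")


def exact_scan(flat, dimension, query, top_k, metric="cosine"):
    """Scan every row in a flat buffer and return the top_k (row_id, score) pairs."""
    if len(query) != dimension:
        raise ValueError("dimension mismatch between query and vector column")
    if len(flat) % dimension != 0:
        raise ValueError("flat buffer length is not a multiple of dimension")
    rows = len(flat) // dimension
    scored = (
        (score(metric, query, flat[i * dimension:(i + 1) * dimension]), i)
        for i in range(rows)
    )
    # heapq keeps only top_k candidates instead of sorting all rows.
    pick = heapq.nsmallest if metric == "l2" else heapq.nlargest
    return [(row_id, s) for s, row_id in pick(top_k, scored)]


# Three 2-dimensional rows stored back to back in one flat list.
flat = [1.0, 0.0, 0.0, 1.0, 2.0, 2.0]
print(exact_scan(flat, 2, [1.0, 0.0], top_k=2, metric="dot"))
```

Heap selection costs roughly O(N log k) for N candidate rows, which is cheaper than a full sort when `top_k` is small.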

## Public API draft

### C++

- `DataFrame::vectorQuery(vector_column, query_vector, top_k, metric)`
- `DataFrame::explainVectorQuery(vector_column, query_vector, top_k, metric)`
- `DataflowSession::vectorQuery(table, vector_column, query_vector, top_k, metric)`
- `DataflowSession::explainVectorQuery(table, vector_column, query_vector, top_k, metric)`

### Python

- `Session.vector_search(table, vector_column, query_vector, top_k=10, metric="cosine")`
- `Session.explain_vector_search(table, vector_column, query_vector, top_k=10, metric="cosine")`

## Explain fields

Current explain output contains:

- `mode=exact-scan`
- `metric=<cosine|dot|l2>`
- `dimension=<N>`
- `top_k=<K>`
- `candidate_rows=<M>`
- `filter_pushdown=false`
- `acceleration=flat-buffer+heap-topk`
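
For tests or tooling that assert on these fields, a small parser is enough. The sketch below assumes the explain text is a whitespace- or newline-separated sequence of `key=value` tokens, which this document does not specify. The sample string is assembled from the field list above with placeholder values and is not captured output.

```python
def parse_explain(text):
    """Collect key=value tokens from explain text into a dict (assumed format)."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


# Placeholder values for illustration only.
sample = """mode=exact-scan
metric=cosine
dimension=3
top_k=5
candidate_rows=1000
filter_pushdown=false
acceleration=flat-buffer+heap-topk"""

fields = parse_explain(sample)
assert fields["mode"] == "exact-scan"
assert int(fields["dimension"]) == 3
```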

## Test matrix

- Vector value roundtrip in proto-like serializer.
- Vector value roundtrip in binary row batch codec.
- Runtime query correctness for cosine/l2/dot top-k.
- Dimension mismatch rejection.
- Python API shape and argument validation.
- Arrow `FixedSizeList<float32>` ingestion fast path coverage.
18 changes: 18 additions & 0 deletions python_api/BUILD.bazel
Expand Up @@ -45,6 +45,13 @@ py_binary(
deps = [":velaria_py_pkg"],
)

py_binary(
    name = "velaria_cli",
    srcs = ["velaria_cli.py"],
    main = "velaria_cli.py",
    deps = [":velaria_py_pkg"],
)

py_package(
name = "velaria_pkg",
packages = ["velaria"],
Expand Down Expand Up @@ -157,3 +164,14 @@ py_test(
":velaria_py_pkg",
],
)

py_test(
    name = "vector_search_test",
    srcs = ["tests/test_vector_search.py"],
    main = "tests/test_vector_search.py",
    imports = ["."],
    deps = [
        ":velaria_py_pkg",
        requirement("pyarrow"),
    ],
)
30 changes: 30 additions & 0 deletions python_api/README.md
Expand Up @@ -21,6 +21,36 @@ uv run --project python_api python python_api/demo_batch_sql_arrow.py
uv run --project python_api python python_api/demo_stream_sql.py
```

Single-file CLI packaging (Python deps + native `_velaria.so`):

```bash
./scripts/build_py_cli_executable.sh
./dist/velaria-cli csv-sql --csv /path/to/input.csv --query "SELECT * FROM input_table LIMIT 5"
./dist/velaria-cli vector-search --csv /path/to/vectors.csv --vector-column embedding --query-vector "0.1,0.2,0.3" --metric cosine --top-k 5
```

Python Session API for local vector search:

```python
from velaria import Session

session = Session()
# assume a temp view named "vec_src" already exists
out = session.vector_search("vec_src", "embedding", [0.1, 0.2, 0.3], top_k=5, metric="dot")
print(out.to_rows())
print(session.explain_vector_search("vec_src", "embedding", [0.1, 0.2, 0.3], top_k=5, metric="dot"))
```

Current vector search scope is local exact scan only (`cosine`/`dot`/`l2`) on fixed-dimension float vectors.

Native binary CLI alternative (does not require a Python environment at runtime):

```bash
bazel build //:velaria_cli
./bazel-bin/velaria_cli --csv /path/to/input.csv --query "SELECT * FROM input_table LIMIT 5"
./bazel-bin/velaria_cli --csv /path/to/vectors.csv --vector-column embedding --query-vector "0.1,0.2,0.3" --metric l2 --top-k 5
```

## CI packaging

PR CI builds and uploads two native wheel variants: