158 changes: 158 additions & 0 deletions docs/ivf.md
@@ -0,0 +1,158 @@
# IVF Index

## Definition
IVF (Inverted File Index) improves search efficiency by **partitioning** data into buckets, thus reducing the search scope.

## Working Principle
1. **Clustering Phase**:
First, cluster the entire high-dimensional vector dataset into multiple non-overlapping clusters, each stored as an inverted list. K-means is a commonly used clustering algorithm. The cluster centers are called centroids. Suppose $n$ vectors are clustered into $m$ clusters, each with one centroid.
2. **Index Building Phase**:
Each vector is assigned to the cluster whose centroid is closest to it. The vector's information (such as the vector ID) is added to the corresponding inverted list. In this way, the IVF index is built, with each inverted list containing all the vectors belonging to that cluster.
3. **Search Phase**:
Given a query vector, first compute the distances between the query and all centroids and find the $k$ closest centroids ($k$ is a retrieval parameter). Then perform an exact nearest-neighbor search only within the inverted lists of these $k$ centroids, which significantly reduces the number of vectors searched.
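
The three phases above can be sketched in a few lines of C++. This is a toy illustration rather than the VSAG implementation: `ToyIVF`, `L2Sqr`, and `scan_buckets` are hypothetical names, the centroids are assumed to come from a separate clustering step (at least one centroid is required), and the search returns only the single nearest vector ID.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance between two vectors of equal length.
float L2Sqr(const Vec& a, const Vec& b) {
    float sum = 0.0F;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Toy IVF index (hypothetical). Centroids are supplied by the caller,
// e.g. as the output of k-means; at least one centroid is assumed.
struct ToyIVF {
    std::vector<Vec> centroids;
    std::vector<std::vector<std::size_t>> lists;  // one inverted list of vector IDs per centroid
    std::vector<Vec> vectors;

    explicit ToyIVF(std::vector<Vec> c) : centroids(std::move(c)), lists(centroids.size()) {}

    // Index of the centroid nearest to v.
    std::size_t NearestCentroid(const Vec& v) const {
        std::size_t best = 0;
        for (std::size_t i = 1; i < centroids.size(); ++i) {
            if (L2Sqr(v, centroids[i]) < L2Sqr(v, centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    // Index building: append the new vector's ID to its nearest centroid's list.
    void Add(const Vec& v) {
        lists[NearestCentroid(v)].push_back(vectors.size());
        vectors.push_back(v);
    }

    // Search: rank centroids by distance, then scan only the closest
    // `scan_buckets` inverted lists exactly. Returns vectors.size() if nothing was found.
    std::size_t Search(const Vec& q, std::size_t scan_buckets) const {
        std::vector<std::size_t> order(centroids.size());
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
            return L2Sqr(q, centroids[a]) < L2Sqr(q, centroids[b]);
        });
        std::size_t best = vectors.size();
        float best_dist = std::numeric_limits<float>::max();
        for (std::size_t p = 0; p < std::min(scan_buckets, order.size()); ++p) {
            for (std::size_t id : lists[order[p]]) {
                float d = L2Sqr(q, vectors[id]);
                if (d < best_dist) {
                    best_dist = d;
                    best = id;
                }
            }
        }
        return best;
    }
};
```

Scanning fewer lists is faster but can miss the true nearest neighbor when it lies in a list that was not visited, which is the recall and latency trade-off that the search parameters control.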

## Suitable Scenarios (recommended if two or more of the following conditions are met)
1. The vector dimension is not very high, usually below 512. At higher dimensions the "curse of dimensionality" becomes a disadvantage.
2. Large-scale data scenarios, typically with over 100 million data points.
3. Memory-constrained scenarios, since IVF uses less memory than graph-based algorithms.
4. Large top-k recall requirements or complex filtering scenarios.

## Usage
For examples, refer to [106_index_ivf.cpp](https://github.com/antgroup/vsag/blob/main/examples/cpp/106_index_ivf.cpp).

## Detailed Explanation of Building Parameters

### partition_strategy_type
- **Parameter Type**: string
- **Parameter Description**: Bucket partitioning strategy type
- **Optional Values**: "ivf", "gno_imi"
- **Default Value**: "ivf"

### first_order_buckets_count
- **Parameter Type**: int
- **Parameter Description**: Only effective when `partition_strategy_type` is "gno_imi", representing the number of first-level buckets.
- **Optional Values**: 1 to INT_MAX
- **Default Value**: 10

### second_order_buckets_count
- **Parameter Type**: int
- **Parameter Description**: Only effective when `partition_strategy_type` is "gno_imi", representing the number of second-level buckets.
- **Optional Values**: 1 to INT_MAX
- **Default Value**: 10

### buckets_count
- **Parameter Type**: int
- **Parameter Description**: Only effective when `partition_strategy_type` is "ivf", representing the number of buckets.
- **Optional Values**: 1 to INT_MAX
- **Default Value**: 10

### ivf_train_type
- **Parameter Type**: string
- **Parameter Description**: Clustering algorithm type
- **Optional Values**: "kmeans", "random"
- **Default Value**: "kmeans"

### base_quantization_type
- **Parameter Type**: string
- **Parameter Description**: Coarse-ranking vector quantization type (encoding of in-bucket vectors)
- **Optional Values**: "fp32", "fp16", "bf16", "sq8", "sq8_uniform", "sq4_uniform", "pq", "rabitq", "pqfs"
- **Default Value**: "fp32"

### base_io_type
- **Parameter Type**: string
- **Parameter Description**: Coarse-ranking vector IO type (storage access type of in-bucket vectors)
- **Optional Values**: "memory_io", "block_memory_io"
- **Default Value**: "memory_io"

### base_pq_dim
- **Parameter Type**: int
- **Parameter Description**: Coarse-ranking vector PQ dimension, used for re-ranking
- **Optional Values**: 1 to dim
- **Default Value**: 1

### use_reorder
- **Parameter Type**: bool
- **Parameter Description**: Whether to use re-ranking
- **Optional Values**: true, false
- **Default Value**: false

### precise_quantization_type
- **Parameter Type**: string
- **Parameter Description**: Fine-ranking vector quantization type, used for re-ranking
- **Optional Values**: "fp32", "fp16", "bf16", "sq8", "sq8_uniform", "sq4_uniform", "pq", "rabitq", "pqfs"
- **Default Value**: "fp32"

### precise_io_type
- **Parameter Type**: string
- **Parameter Description**: Fine-ranking vector IO type, used for re-ranking
- **Optional Values**: "memory_io", "block_memory_io", "mmap_io", "buffer_io", "async_io", "reader_io"
- **Default Value**: "block_memory_io"

### precise_file_path
- **Parameter Type**: string
- **Parameter Description**: Fine-ranking vector file path, used for re-ranking
- **Optional Values**: Any valid file path
- **Default Value**: ""

## Examples for Build Parameter String
```json
"index_param": {
"buckets_count": 50,
"base_quantization_type": "fp32",
"partition_strategy_type": "ivf",
"ivf_train_type": "kmeans"
}
```
This means the index is built with 50 buckets, the base quantization type is fp32, the partition strategy type is ivf, and the IVF train type is kmeans.

```json
"index_param": {
"buckets_count": 50,
"base_quantization_type": "pqfs",
"partition_strategy_type": "ivf",
"ivf_train_type": "random",
"precise_quantization_type": "fp16",
"use_reorder": true,
"base_pq_dim": 32,
"precise_io_type": "async_io",
"precise_file_path": "./precise_codes"
}
```
This means the index is built with 50 buckets, the base quantization type is pqfs with PQ dim = 32, the partition strategy type is ivf, and the IVF train type is random. The configuration enables reordering with fp16 as the precise quantization type, uses libaio asynchronous I/O to access the precise codes, and stores the precise codes in the file './precise_codes'.
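
The fragments above show only the `index_param` object. As a rough sketch, the first fragment could be embedded in a complete build parameter string as follows. The outer fields `dtype`, `metric_type`, and `dim`, their values, and the helper name `MakeIvfBuildParameters` are assumptions made for illustration; the linked `106_index_ivf.cpp` example is the authoritative reference for the full format.

```cpp
#include <string>

// Hypothetical helper: wraps the "index_param" fragment from the first example
// in a complete JSON object. The outer fields ("dtype", "metric_type", "dim")
// and their values are assumptions, not taken from this document.
std::string MakeIvfBuildParameters() {
    return R"({
    "dtype": "float32",
    "metric_type": "l2",
    "dim": 128,
    "index_param": {
        "buckets_count": 50,
        "base_quantization_type": "fp32",
        "partition_strategy_type": "ivf",
        "ivf_train_type": "kmeans"
    }
})";
}
```

A raw string literal keeps the JSON readable and avoids escaping every quote.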

## Detailed Explanation of Search Parameters

### scan_buckets_count
- **Parameter Type**: int
- **Parameter Description**: Number of buckets to scan
- **Optional Values**: 1 to buckets_count
- **Default Value**: **must be provided (no default value)**

### factor
- **Parameter Type**: float
- **Parameter Description**: Scan factor used for reordering. For example, with topk=10 and factor=2.0, the IVF stage recalls 20 candidates, which are then reordered using the precise codes.
- **Optional Values**: 1.0 to FLOAT_MAX
- **Default Value**: 2.0
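
The arithmetic behind `factor` can be written as a one-line helper. This is a sketch under assumptions: `CoarseCandidateCount` is a hypothetical name, and rounding up for non-integer products is a guess rather than documented behavior.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper (not a VSAG function): number of candidates the coarse
// IVF stage recalls before precise re-ranking. Rounding up is an assumption.
std::int64_t CoarseCandidateCount(std::int64_t topk, float factor) {
    double product = static_cast<double>(topk) * static_cast<double>(factor);
    return static_cast<std::int64_t>(std::ceil(product));
}
```

With topk=10 and factor=2.0 this gives 20, matching the example in the parameter description. The coarse stage may still return fewer candidates than this when the scanned buckets contain fewer vectors.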

### parallelism
- **Parameter Type**: int
- **Parameter Description**: Number of threads to use for parallel search per query
- **Optional Values**: 1 to INT_MAX
- **Default Value**: 1 (only the main search thread performs the search)
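
The search loop in `src/algorithm/ivf.cpp` skips a bucket when `i % search_thread_count != thread_id`, so buckets appear to be distributed round-robin across the search threads. The sketch below reproduces that assignment in isolation; `BucketsForThread` is a hypothetical helper and `thread_count` is assumed to be at least 1.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: indices of the buckets that one worker scans when
// buckets are assigned round-robin. Assumes thread_count >= 1.
std::vector<std::uint64_t> BucketsForThread(std::uint64_t bucket_count,
                                            std::uint64_t thread_count,
                                            std::uint64_t thread_id) {
    std::vector<std::uint64_t> mine;
    for (std::uint64_t i = 0; i < bucket_count; ++i) {
        if (i % thread_count != thread_id) {
            continue;  // this bucket belongs to another worker
        }
        mine.push_back(i);
    }
    return mine;
}
```

With 10 buckets and `parallelism` set to 4, worker 1 scans buckets 1, 5, and 9. Setting `parallelism` higher than the number of scanned buckets leaves some workers with nothing to do.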

### timeout_ms
- **Parameter Type**: double
- **Parameter Description**: Maximum time cost in milliseconds for each query, used to control the search time cost
- **Optional Values**: 1 to DOUBLE_MAX
- **Default Value**: DOUBLE_MAX
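
A timeout of this kind is typically enforced by checking a timer between units of work and stopping early once the threshold is exceeded. The following sketch shows the pattern. `DeadlineTimer` and `ScanBuckets` are hypothetical stand-ins built on `std::chrono`; only the method names `SetThreshold` and `CheckOvertime` mirror the calls made in this change, and the real timer class may behave differently.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for the timer used by the search loop.
class DeadlineTimer {
public:
    // Threshold in milliseconds, measured from construction.
    void SetThreshold(double ms) { threshold_ms_ = ms; }

    // True once more than the threshold has elapsed.
    bool CheckOvertime() const {
        std::chrono::duration<double, std::milli> elapsed = Clock::now() - start_;
        return elapsed.count() > threshold_ms_;
    }

private:
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_{Clock::now()};
    double threshold_ms_{std::numeric_limits<double>::max()};
};

// Scans buckets in order and stops as soon as the deadline has passed.
// Returns how many buckets were actually scanned.
std::size_t ScanBuckets(std::size_t bucket_count, const DeadlineTimer& timer) {
    std::size_t scanned = 0;
    for (std::size_t i = 0; i < bucket_count; ++i) {
        if (timer.CheckOvertime()) {
            break;  // keep whatever has been collected so far
        }
        ++scanned;  // a real index would compute distances for bucket i here
    }
    return scanned;
}
```

Because the check happens between buckets, a query can overshoot the limit by the time needed to finish the bucket in progress, and a search that times out returns results from only the buckets scanned so far.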

## Examples for Search Parameter String
```json
"ivf": {
"scan_buckets_count": 10,
"factor": 2.0,
"parallelism": 4,
"timeout_ms": 30.0
}
```
This means the search scans 10 buckets with a factor of 2.0, uses a parallelism of 4 (about 4 threads per query), and has a maximum time cost of 30 ms (when the search time exceeds 30 ms, the current result is returned).
26 changes: 26 additions & 0 deletions include/vsag/dataset.h
@@ -270,6 +270,32 @@ class Dataset : public std::enable_shared_from_this<Dataset> {
*/
virtual int64_t
GetExtraInfoSize() const = 0;

/**
* @brief Sets the statistics for the dataset.
*
* @param Statstics The statistics string.
* @return DatasetPtr A shared pointer to the dataset with updated statistics.
*/
virtual DatasetPtr
Statstics(const std::string& Statstics) = 0;

/**
* @brief Retrieves all statistics of the dataset.
*
* @return std::string The statistics string.
*/
virtual std::string
GetStatstics() const = 0;

/**
* @brief Retrieves the statistics of the dataset for the given keys.
*
* @param stat_keys The vector of stat keys.
* @return std::vector<std::string> The vector of stat values (an empty string for a missing key).
*/
virtual std::vector<std::string>
GetStatstics(const std::vector<std::string>& stat_keys) const = 0;
};

}; // namespace vsag
18 changes: 12 additions & 6 deletions src/algorithm/hgraph.cpp
@@ -985,12 +985,11 @@ HGraph::add_one_point(const void* data, int level, InnerIdType inner_id) {
bool
HGraph::graph_add_one(const void* data, int level, InnerIdType inner_id) {
DistHeapPtr result = nullptr;
-InnerSearchParam param{
-.topk = 1,
-.ep = this->entry_point_id_,
-.ef = 1,
-.is_inner_id_allowed = nullptr,
-};
+InnerSearchParam param;
+param.topk = 1;
+param.ep = this->entry_point_id_;
+param.ef = 1;
+param.is_inner_id_allowed = nullptr;

LockGuard cur_lock(neighbors_mutex_, inner_id);
auto flatten_codes = basic_flatten_codes_;
@@ -1675,6 +1674,7 @@ HGraph::SearchWithRequest(const SearchRequest& request) const {
search_param.ef = 1;
search_param.is_inner_id_allowed = nullptr;
search_param.search_alloc = search_allocator;

const auto* raw_query = get_data(query);
for (auto i = static_cast<int64_t>(this->route_graphs_.size() - 1); i >= 0; --i) {
auto result = this->search_one_graph(
@@ -1703,6 +1703,11 @@ HGraph::SearchWithRequest(const SearchRequest& request) const {
search_param.is_inner_id_allowed = ft;
search_param.topk = static_cast<int64_t>(search_param.ef);
search_param.consider_duplicate = true;
if (params.enable_time_record) {
search_param.time_cost = std::make_shared<Timer>();
search_param.time_cost->SetThreshold(params.timeout_ms);
(*search_param.stats)["is_timeout"] = false;
}
auto search_result = this->search_one_graph(
raw_query, this->bottom_graph_, this->basic_flatten_codes_, search_param);

@@ -1734,6 +1739,7 @@ HGraph::SearchWithRequest(const SearchRequest& request) const {
}
search_result->Pop();
}
dataset_results->Statstics(search_param.stats->dump());
return std::move(dataset_results);
}

5 changes: 5 additions & 0 deletions src/algorithm/hgraph_parameter.cpp
@@ -220,6 +220,11 @@ HGraphSearchParameters::FromJson(const std::string& json_string) {
obj.use_extra_info_filter = params[INDEX_TYPE_HGRAPH][HGRAPH_USE_EXTRA_INFO_FILTER];
}

if (params[INDEX_TYPE_HGRAPH].contains(SEARCH_MAX_TIME_COST_MS)) {
obj.timeout_ms = params[INDEX_TYPE_HGRAPH][SEARCH_MAX_TIME_COST_MS];
obj.enable_time_record = true;
}

return obj;
}
} // namespace vsag
2 changes: 2 additions & 0 deletions src/algorithm/hgraph_parameter.h
@@ -79,6 +79,8 @@ class HGraphSearchParameters {
int64_t ef_search{30};
bool use_reorder{false};
bool use_extra_info_filter{false};
bool enable_time_record{false};
double timeout_ms{std::numeric_limits<double>::max()};

private:
HGraphSearchParameters() = default;
10 changes: 9 additions & 1 deletion src/algorithm/ivf.cpp
@@ -626,14 +626,19 @@ IVF::create_search_param(const std::string& parameters, const FilterPtr& filter)
param.factor = search_param.topk_factor;
param.first_order_scan_ratio = search_param.first_order_scan_ratio;
param.parallel_search_thread_count = search_param.parallel_search_thread_count;
if (search_param.enable_time_record) {
param.time_cost = std::make_shared<Timer>();
param.time_cost->SetThreshold(search_param.timeout_ms);
}
return param;
}

DatasetPtr
IVF::reorder(int64_t topk, DistHeapPtr& input, const float* query) const {
auto [dataset_results, dists, labels] = create_fast_dataset(topk, allocator_);
auto reorder_heap = Reorder::ReorderByFlatten(input, reorder_codes_, query, allocator_, topk);
-for (int64_t j = topk - 1; j >= 0; --j) {
+auto size = static_cast<int64_t>(reorder_heap->Size());
+for (int64_t j = size - 1; j >= 0; --j) {
dists[j] = reorder_heap->Top().first;
labels[j] = label_table_->GetLabelById(reorder_heap->Top().second);
reorder_heap->Pop();
@@ -696,6 +701,9 @@ IVF::search(const DatasetPtr& query, const InnerSearchParam& param) const {
Vector<float> centroid(dim_, allocator_);
Vector<float> dist(allocator_);
for (uint64_t i = 0; i < bucket_count; ++i) {
if (param.time_cost != nullptr and param.time_cost->CheckOvertime()) {
break;
}
if (i % search_thread_count != thread_id) {
continue;
}
7 changes: 7 additions & 0 deletions src/algorithm/ivf_parameter.h
@@ -85,6 +85,11 @@ class IVFSearchParameters {
obj.parallel_search_thread_count = params[INDEX_TYPE_IVF][IVF_SEARCH_PARALLELISM];
}

if (params[INDEX_TYPE_IVF].contains(SEARCH_MAX_TIME_COST_MS)) {
obj.timeout_ms = params[INDEX_TYPE_IVF][SEARCH_MAX_TIME_COST_MS];
obj.enable_time_record = true;
}

return obj;
}

@@ -93,6 +98,8 @@ class IVFSearchParameters {
float topk_factor{2.0F};
float first_order_scan_ratio{1.0F};
int64_t parallel_search_thread_count{1};
double timeout_ms{std::numeric_limits<double>::max()};
bool enable_time_record{false};

private:
IVFSearchParameters() = default;
16 changes: 16 additions & 0 deletions src/dataset_impl.cpp
@@ -15,6 +15,8 @@

#include "dataset_impl.h"

#include "typing.h"

namespace vsag {

DatasetPtr
@@ -29,4 +31,18 @@ DatasetImpl::MakeEmptyDataset() {
return result;
}

std::vector<std::string>
DatasetImpl::GetStatstics(const std::vector<std::string>& stat_keys) const {
auto json = JsonType::parse(this->statstics_);
std::vector<std::string> result;
for (const auto& key : stat_keys) {
if (json.contains(key)) {
result.emplace_back(json[key].dump());
} else {
result.emplace_back("");
}
}
return result;
}

}; // namespace vsag
16 changes: 16 additions & 0 deletions src/dataset_impl.h
@@ -267,13 +267,29 @@ class DatasetImpl : public Dataset {
return 0;
}

DatasetPtr
Statstics(const std::string& statstics) override {
this->statstics_ = statstics;
return shared_from_this();
}

std::vector<std::string>
GetStatstics(const std::vector<std::string>& stat_keys) const override;

std::string
GetStatstics() const override {
return this->statstics_;
}

static DatasetPtr
MakeEmptyDataset();

private:
bool owner_{true};
std::unordered_map<std::string, var> data_;
Allocator* allocator_ = nullptr;

std::string statstics_;
};

}; // namespace vsag
6 changes: 6 additions & 0 deletions src/impl/basic_searcher.cpp
@@ -298,6 +298,12 @@ BasicSearcher::search_impl(const GraphInterfacePtr& graph,
hops++;
auto current_node_pair = candidate_set->Top();

if (inner_search_param.time_cost != nullptr and
inner_search_param.time_cost->CheckOvertime()) {
(*inner_search_param.stats)["is_timeout"] = true;
break;
}

if constexpr (mode == InnerSearchMode::KNN_SEARCH) {
if ((-current_node_pair.first) > lower_bound && top_candidates->Size() == ef) {
break;