Data inside each colored patch is retrieved in a single I/O operation. This pattern of "localized data access" extends
to high dimensional space.
This project provides a Scalable, Durable DistanceTree.
A
DistanceTree<K,V>is a Key-Value store that allows you to find similar Key-Value pairs in high-dimensional datasets.A
DistanceTreecan:
- Perform Range Searches:
getAllWithinRange(K searchKey, double range)(left side of graphic)- Perform kNN Searches:
getNClosest(K searchKey, int n)(right side of graphic, k = 10)All
DistanceTree<K,V>searches return SearchResults<K, V>
Imagine range and knn searches like this, implemented efficiently, and applied to high dimensional data!
Prior to this project, multi-dimensional data was stored and searched using a
MetricTree (see GitHub implementation here).
Unfortunately, a MetricTree is just an in-memory data structure. This project adapts MetricTree's
implementation into a DistanceTree that can:
- Store a dataset that cannot fit into memory.
- Search a dataset that cannot fit into memory.
- Persist data beyond the lifetime of a single JVM process.
- Adopt the data storage layer of their choice (e.g., local files on disk, DuckDB, PostgreSQL, cloud blob storage, etc.). Each data storage layer will have different performance, durability, and dollar cost characteristics.
- Adopting this Java library requires your project to be on Java 17 or later.
- The latest official release is version:
0.0.1 - All official releases are available at Maven Central.
GRADLE:
dependencies {
implementation("org.mitre:dist-tree:0.0.1")
}
MAVEN:
<dependency>
<groupId>org.mitre</groupId>
<artifactId>dist-tree</artifactId>
<version>0.0.1</version>
</dependency>
// Building an DistanceTree starts with a TreeConfig
DistanceTree<LatLong, String> tree = TreeConfig.<LatLong, String>builder()
.keySerde(latLongSerde())
.valueSerde(stringUtf8Serde())
.distMetric((a, b) -> a.distanceTo(b).inNauticalMiles())
.buildTree();
// Gather the Key-Value data we want to store in the tree
List<LatLong> locations = loadBusinessLocations();
List<String> names = loadBusinessNames();
tree.addBatches(batchify(locations, names, 1_000));
// JVM is shutdown, but our dataset is persisted!
System.exit()
// The next JVM loads the tree from the default DataStore
DistanceTree<LatLong, String> tree = TreeConfig.<LatLong, String>builder()
.keySerde(latLongSerde())
.valueSerde(stringUtf8Serde())
.distMetric((a, b) -> a.distanceTo(b).inNauticalMiles())
.buildTree();
// No manual data loading is necessary
// Perform a "range search"
LatLong searchKey = randomLatLong();
double searchRadiusNm = 0.5;
SearchResults<LatLong, String> businessesNearby = tree.rangeSearch(searchKey, searchRadiusNm);
// Perform a "knn search"
SearchResults<LatLong, String> fiveClosestBusinesses = tree.knnSearch(searchKey, 5);
Note: Searching Key-Value data in a DistanceTree does not require shutting down the JVM and building a new tree.
Searches can be performed on the same tree that received the original Key-Value batches. However, advanced
TreeConfig options can prohibit co-mingling data mutations (e.g. writes) and data searching (e.g. reads).
In a nutshell:
- If you can measure the "distance" between two keys, then a
DistanceTreecan efficiently search Key-Value pairs using their Keys. - MITRE's Commons Project contains multiple data types designed to integrate
with this
DistanceTreeproject.- These pre-built key-types include:
LatLong,LatLong64,LatLongPath,LatLong64Path,AltitudePath, andVehiclePath, andPathPair.
- These pre-built key-types include:
-
Types derived from physical concepts
LatLongandLatLong64- These are simple 2-d location measurements
- DistanceMetric =
latLong1.distanceTo(latLong2).in(myFavoriteUnit);
LatLongPathandLatLong64Path- These are sequences of
LatLonglocations - The distance (e.g. difference) between any two paths can be measured.
- This means we can easily search for "similar paths" using a
DistanceTree
- These are sequences of
VehiclePathdata- These are sequences of 3d position data (e.g. (lat, long, alt)_1, (lat, long, alt)_2, (lat, long, alt)_3, ...)
- The idea is the same as
LatLongPath, but also with altitude data
PathPairdata- These are pairs of
VehiclePaths - Use the similarity between "VehiclePath Pairs" to find safety events that share similar path dynamics
- These are pairs of
Audio Signals- Compute the FFT of an audio signal
- Distance function = "Earth mover distance(FFT_1, FFT_2)"
-
Types derived from mathy concepts
n-dimensional unit vectors- If your keys are simple vectors it may make sense to use a Vector Database
Probabilty Mass Functions- Use the Kullback–Leibler Divergence to measure the distance btw two PMFs.
Markov Transition Matrices- An
n x nMarkov Transition Matrix is justnProbability Mass Functions. - The "distance" between any 2 markov matrices is the sum of
nseparate KL-Divergence measures (one for each pair of rows in the transition matrix).
- An
A DistanceTree stores Key-Value pairs (i.e. Tuples) and make those KV pairs searchable. The physical data storage
solution is represented by the DataStore interface. Different implementations of DataStore provide different levels
of speed, latency, and reliability.
There are two pre-packaged DataStore implementations. Select a DataStore, and receive the performance vs.
reliability trade-off that DataStore makes. (Hopefully, more DataStore implementation will be coming soon.)
The available DataStores are:
DuckDbStore: This is the defaultDataStoreimplementation.- Pros: DuckDB is a well adopted, open-source database system that runs "in-process" on locally stored data. DuckDB's core thesis is that modern computers are powerful enough that many database-ish tasks can now occur locally. The overhead and complexity of a remote, sharded, and replicated database may not be necessary. In other words, DuckDB is simple. And using DuckDB to persist data and manage I/O means DistanceTree users do not need to integrate with a separate data storage service (other than the local disk).
- Cons: This
DataStoredoes not benefit from reliability improving features like redundant components, data replication, and load balancing.
InMemoryStore: This DataStore is fast. It does not perform I/O. All data is kept in memory.- Pros: Fast! Great for problems that do not need persistent data storage.
- Cons: Does not persist data.
(Note: Currently, no pre-packaged DataStore is backed by a truly highly-reliable data storage system.
Implementing a DataStore backed by a more reliable data storage system is perfectly achievable. We have not
prioritized this task because DuckDb has been sufficient for our needs.)
- Performance: A discussion on how Tree Configuration choices can impact performance is here.
- Frequently Asked Questions: are listed here.
- What is a Metric Space?: is discussion here
- Storing Nodes, Tuples, and Pages: is discussed here.
- Project Goals and Non-Goals: are discussed here.
- Possible future optimizations: Are discussed here.
- Idea Graveyard: Old and rejected ideas are discussed here
- Open Source Release Process: Our manual(??!!) release process is described here, the script is here.
Contributions are welcomed and encouraged. We are currently looking for contributions that:
- Implement any of these future optimizations
- Add
DataStoreimplementations backed by:PostgreSQL,SQLite, or other data storage platforms
- "M-tree An Efficient Access Method for Similarity Search in Metric Spaces"
- The most relevant paper by Paolo Ciaccia, Marco Patella, and Pavel Zezula
- "Indexing Metric Spaces with M-tree"
- By: Paolo Ciaccia, Marco Patella, and Pavel Zezula
- "An implementation of the M-tree index structure for PostgreSQL using GiST
- By István Donkó, János M. Szalai-Gindl, Gergo Gombos, and Attila Kiss
- Vantage-Point trees
- Binary space partition trees
- PostgreSQL GiST extension
- pgvector
- faiss
- This repo contains links to many other academic references
- Approved for Public Release; Distribution Unlimited. Public Release Case Number 25-1212.
- Copyright: The contents of this project is copyright
The MITRE Corporation. See details here . - Open Source License: This project is released under the Apache License Version 2.0. See details here.
- Data Rights Legend: See details here.
