Add Documentation.md #370
Open
tomjridge wants to merge 4 commits into mirage:main from tomjridge:documentation.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| # Index: overview of what it is and how it works | ||
| Index is essentially a key-value store. Its main functionality is described in `src/index_intf.ml`, but that file does not describe the implementation in much detail. The purpose of this document is to fill that gap. | ||
| ## Initial comments | ||
| Index is parameterized by various things. The most important point is that **keys and values must be fixed size**. Indeed, for things to run smoothly, the keys and values must also be reasonably small. The Tezos instance, for example, uses 32 bytes for the keys and 3 integers (or similar) for the values. So Index is not a generic key-value store (although it would be fairly simple to build such a thing on top of Index). | ||
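To make the fixed-size requirement concrete, here is a minimal sketch of a fixed-width encoding. The sizes follow the Tezos example above (32-byte keys, three integers per value); the function names and the choice of big-endian 64-bit integers are assumptions made for illustration and are not taken from Index.

```ocaml
(* Hypothetical fixed-size layout; not Index's actual encoding. *)
let key_size = 32          (* bytes per key, as in the Tezos example *)
let value_size = 3 * 8     (* three integers, assumed 64 bits each *)
let entry_size = key_size + value_size

(* Encode three ints into exactly [value_size] bytes. *)
let encode_value (a, b, c) =
  let buf = Bytes.create value_size in
  Bytes.set_int64_be buf 0 (Int64.of_int a);
  Bytes.set_int64_be buf 8 (Int64.of_int b);
  Bytes.set_int64_be buf 16 (Int64.of_int c);
  Bytes.to_string buf

(* Decode a string produced by [encode_value]. *)
let decode_value s =
  let buf = Bytes.of_string s in
  let get off = Int64.to_int (Bytes.get_int64_be buf off) in
  (get 0, get 8, get 16)

(* Because every entry has the same size, the i-th entry of a file
   starts at a directly computable byte offset. *)
let entry_offset i = i * entry_size
```

The last function is the point of the restriction: with fixed-size entries, a position in the data file can be computed rather than searched for.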
| Further, keys must be hashable (with the hash represented as an int), and the implementation even requires that the user specify the number of bits in the hash that are relevant. | ||
| FIXME There is a worrying comment regarding the key hash: "underestimation [of the number of relevant bits] will result in undefined behavior"; most code that uses hashes should still work (albeit very slowly) even if all hashes have the same value. So, this comment is a bit unusual. What happens if hashes collide? What do we do about this? | ||
| Apart from these restrictions on keys and values, the interface exposed by Index is fairly typical of key-value stores: | ||
| * Basic operations: find, mem, replace (note, no delete operation; replace functions as add) | ||
| * Slightly unusual operation: clear | ||
| * Traversals: filter, iter | ||
| * Low-level syncing: sync, flush | ||
| * Unusual merge operations: is_merging, merge, try_merge | ||
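The basic operations in the list above can be modelled with a toy in-memory store. This is only a sketch of the semantics described in the list (in particular, `replace` doubling as add, and the absence of delete). The module name `Toy_store` and all signatures are simplified assumptions and should not be read as the signatures in `src/index_intf.ml`.

```ocaml
(* Toy model only: names and signatures are illustrative assumptions. *)
module Toy_store = struct
  type t = (string, int) Hashtbl.t

  let v () : t = Hashtbl.create 16

  (* [replace] adds the binding if the key is absent and overwrites it
     otherwise, so no separate "add" is needed. *)
  let replace (t : t) k v = Hashtbl.replace t k v

  let find (t : t) k = Hashtbl.find_opt t k
  let mem (t : t) k = Hashtbl.mem t k

  (* [clear] drops every binding; there is no per-key delete. *)
  let clear (t : t) = Hashtbl.reset t
end
```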
| The merge operations relate to the internal implementation of Index, which we now discuss further. | ||
| ## Internal operation of Index | ||
| As a rough approximation, Index functions as follows: | ||
| * New key-value entries are appended to the end of a log file; the contents of the log file are also kept in an in-memory hashtable. | ||
| * When the log gets "big", it is merged into the index proper. **Merging takes place asynchronously** in a separate OS thread. There is some contention on the OCaml runtime lock, so the merge interferes somewhat with the main thread, and perhaps the two threads could be better balanced. While merging is taking place, new entries are placed in a "log_async" file, which is renamed over the original log file once the merge completes and the original log is no longer needed. | ||
| * The index proper (or, index/data) is a single (usually large) file which contains all the key-value entries, **sorted by the hash of the key**. | ||
| * There is an in-memory **fan** which provides fast lookup within the index data. Essentially it provides a function `search` which takes an integer hash (of the desired key) and returns a pair of (lo,hi) offsets within the data file (i.e., it indicates the part of the file that contains the relevant key-value, if there is an entry for the key at all). The fan is usually constructed as part of the merge. | ||
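The fan described in the last bullet can be sketched as an array of prefix sums indexed by the leading bits of the hash. This is an illustration of the idea, not the code in Index: the names `build_fan`, `search`, `hash_size` and `fan_bits` are chosen here, offsets are expressed as entry indices rather than byte offsets, and every hash is assumed to fit in `hash_size` bits.

```ocaml
(* Sketch of a fan over entries sorted by hash; illustrative only. *)

(* [hashes] holds the hash of each entry, in file order (ascending).
   After construction, fan.(p) is the index one past the last entry
   whose top [fan_bits] bits are <= p. *)
let build_fan ~hash_size ~fan_bits (hashes : int array) =
  let shift = hash_size - fan_bits in
  let fan = Array.make (1 lsl fan_bits) 0 in
  (* Count the entries that fall under each prefix. *)
  Array.iter (fun h -> let p = h lsr shift in fan.(p) <- fan.(p) + 1) hashes;
  (* Turn the counts into running totals. *)
  for p = 1 to Array.length fan - 1 do
    fan.(p) <- fan.(p) + fan.(p - 1)
  done;
  fan

(* Return the half-open range (lo, hi) of entry indices that can
   contain a key with hash [h]. An empty range means "not present". *)
let search ~hash_size ~fan_bits fan h =
  let p = h lsr (hash_size - fan_bits) in
  let lo = if p = 0 then 0 else fan.(p - 1) in
  (lo, fan.(p))
```

For byte offsets, the indices would be multiplied by the fixed entry size. The lookup then only needs to read the `(lo, hi)` slice of the data file, which is why the data must be sorted by hash.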
| The merge then consists of creating a new data file, with additional entries. At the moment, this is done by scanning the data file from the beginning, copying entries to a new data file, and inserting (or replacing!) the additional entries from the in-memory log hashtable. When this is finished, the new data file is renamed over the old data file. | ||
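The merge just described is, in essence, a merge of two sorted sequences in which a log entry wins over a data entry with the same key. The sketch below uses in-memory lists instead of files. The `entry` record, the function name `merge`, and the assumption that entries with equal hashes are ordered by key are all illustrative choices, not details taken from Index.

```ocaml
(* Illustrative merge of sorted data with an in-memory log. *)
type entry = { hash : int; key : string; value : int }

(* Assumed total order: by hash, then by key to break ties. *)
let order a b = compare (a.hash, a.key) (b.hash, b.key)

(* [data] must already be sorted by [order]. The result is the new
   data sequence, also sorted, with log entries added or replacing. *)
let merge (data : entry list) (log : (string, entry) Hashtbl.t) : entry list =
  let pending =
    Hashtbl.fold (fun _ e acc -> e :: acc) log [] |> List.sort order
  in
  let rec go data pending acc =
    match data, pending with
    | [], rest | rest, [] -> List.rev_append acc rest
    | d :: ds, p :: ps ->
      let c = order d p in
      if c = 0 then go ds ps (p :: acc)          (* same key: log replaces *)
      else if c < 0 then go ds pending (d :: acc) (* copy old entry *)
      else go data ps (p :: acc)                  (* insert new entry *)
  in
  go data pending []
```

Both inputs are consumed strictly front to back, which corresponds to the sequential scan of the data file mentioned above.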
| A drawback of this scheme is that as the data becomes larger, the time to merge becomes correspondingly larger. However, the scheme is remarkably effective for even quite large data, because sequential reading and writing of files is extremely well-optimised on modern systems. | ||
| ## Blocking merges | ||
| Eventually, the index data gets large enough that the "log_async" becomes full before the merge completes. At this point Index blocks, waiting for the merge of the log to complete. When that merge completes, the full log_async causes another merge to be initiated. Thus, in a high-write scenario with a large data file, merges will be running most of the time, and Index will block periodically whenever the log_async fills up before the current merge has finished. | ||
| ## Correctness | ||
| Some care is taken to try to ensure that the code functions correctly even in the event of a system crash or similar. We consider some of the files involved, and what measures are taken. | ||
| **Log files:** These files are updated by appending data to the end. Reads can occur at arbitrary offsets, but writes only add data at the end of the file. The log and log_async use the "IO" interface `io_intf.ml`, which is a model of the underlying filesystem (it includes, e.g., a function "rename" to rename a file). For files, the operations are: create (v, v_readonly), get "offset" (which is actually the offset at which new data will be placed when flushed from the buffer), read from offset, and get/set header. | ||
| The header includes a "generation" and an "offset". | ||
| Data can be buffered, hence the size of the underlying file (the "raw" file) can be less than the size of the in-memory data. | ||
| FIXME not clear what header (int63) is; the offset at the end of the header info? Need to get someone to walk me through the code in index_unix.ml | ||
You must be referring to that part of the doc:
index/src/data.ml
Lines 17 to 22 in fe5e962
hash_size is used by the fan to select the bits in the short hash of a keys, depending on the size of the fan. Overestimating imply that the fan will refer to bits that don't carry informations (typically zeroes).
Underestimating should be fine in my opinion.
Underestimating causes undefined behaviour because the bit selection then drops the MSBs, which breaks the order. It is an important invariant of the fan-out that the bit selection preserves the hash order.
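A small sketch may help make the ordering argument concrete. It assumes a bit selection that keeps `fan_bits` bits just below the declared `hash_size`; this is a guess at the shape of the selection made for illustration (the names `select` and `is_monotone` are hypothetical), not the code in `data.ml`.

```ocaml
(* Hypothetical bit selection: take [fan_bits] bits directly below
   position [hash_size]. Not taken from Index's source. *)
let select ~hash_size ~fan_bits h =
  (h lsr (hash_size - fan_bits)) land ((1 lsl fan_bits) - 1)

(* Check that [f] preserves the order of an ascending list. *)
let is_monotone f hs =
  let rec go = function
    | a :: (b :: _ as rest) -> f a <= f b && go rest
    | _ -> true
  in
  go hs

(* Hashes that really use 8 bits, sorted ascending. *)
let hashes = [ 0x10; 0x7f; 0x80; 0xff ]

let ok_exact = is_monotone (select ~hash_size:8 ~fan_bits:2) hashes
let ok_over = is_monotone (select ~hash_size:12 ~fan_bits:2) hashes
let ok_under = is_monotone (select ~hash_size:4 ~fan_bits:2) hashes
```

Under these assumptions, the exact and overestimated sizes keep the selected bits in hash order (the overestimate simply selects zero bits that carry no information), while the underestimate discards the real MSBs and the selected values no longer follow the sorted order of the data.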