jonathan's notes #1
Replies: 8 comments
-
So, I wanted to have two possibilities. One is to explicitly say that a value is deleted: if it exists in the sqlite db, it should explicitly be removed. One could see that a value was removed in pulled data, but I would also like to have the idea of being able to do graftable rewrites of these histories, where someone could, for example, reset this every year or whatever if it's getting large, and it wouldn't remove anything, only add from that point on. However, perhaps we can just treat a deletion between a head you have and one you fetch as an explicit removal, and have a different methodology for ref fetches that all of a sudden don't share a parent.
-
No, I certainly write a parent as the last thing that was serialized or materialized for this data. This is indeed how I do 3-way determination of conflict or fast detection of new values.
-
I only assumed writing state to refs when you're about to push. Not every modification would create a tree/commit.
-
I may have misdocumented this. I thought the spec said that the remote value wins. However, if we did need to look at a timestamp, there is a timestamp in the first commit that introduces the new key.
-
I think this would need to be the domain of something that uses this spec/library (so, GitButler). I don't think it would necessarily be automatic - there are some things that you would not want to retarget (CI attestations, gpg signatures, etc), some things you would (patch id? signoff? branch id?), and some things we may want to add to new commits automatically (previous version, etc).
-
So most of the point of this spec is to try to figure out a more flexible and scalable system for transmitting metadata than notes. I started with considerations of the problem set for vcs metadata that Rodrigo outlined in his JJ talk. Namely, a system that can accommodate wider use cases than notes, for example:
I was trying to design a primitive set of instructions and storage formats that would let GitButler implement the …
-
I didn't want to do this (although it would be possible) both because IO would be slower (especially value writes, since we would have to write a new tree and commit on every mutation) and also because I would like to be able to have a data source that could locally combine multiple meta sources (i.e., an internal company metadata ref and a public one). If we're operating directly out of the refs, we would have to check all of them whenever we do any key read, which would be slow and cumbersome, I think.
-
I'm not particularly interested in hybrid solutions with … Since this is built for our own porcelains (…
-
Jonathan Tan wrote this in Discord:
Here are my thoughts on `gmeta`. I'm only writing based on what I read in https://schacon.github.io/gmeta/ and https://github.com/schacon/gmeta - in particular, I didn't look at the source code.

My summary of `gmeta`

This is a system of attaching metadata to certain "targets". The metadata is stored locally in a SQLite database and can be serialized into Git objects, which can be pushed, fetched, and "materialized" (combining fetched metadata with local metadata).
Local storage
Each metadata item consists of a "target" (commit, change ID, branch, path, or "project" representing a global value), a "key" (arbitrary string with limits on bytes allowed), and a "value". The value can be of type "string" (from what I can see, there are no disallowed bytes), "list", or tombstone.
Each modification is also written to a log, including its timestamp and the email address of the user who made the modification.
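A minimal sketch of how such a local store might look in SQLite. The spec doesn't describe `gmeta`'s actual schema, so every table and column name here is an assumption; it only mirrors the pieces named above (target, key, typed value, plus a modification log with timestamp and email).

```python
import sqlite3

# Illustrative only: gmeta's real schema is not given in the spec.
# Table and column names below are assumptions, not from the project.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE meta (
    target_kind TEXT NOT NULL,   -- 'commit', 'change-id', 'branch', 'path', 'project'
    target      TEXT NOT NULL,
    key         TEXT NOT NULL,
    value_type  TEXT NOT NULL,   -- 'string', 'list', or 'tombstone'
    value       TEXT,            -- NULL for tombstones
    PRIMARY KEY (target_kind, target, key)
);
CREATE TABLE log (
    target_kind TEXT NOT NULL,
    target      TEXT NOT NULL,
    key         TEXT NOT NULL,
    ts          INTEGER NOT NULL,  -- timestamp of the modification
    email       TEXT NOT NULL      -- who made the modification
);
""")

db.execute("INSERT INTO meta VALUES ('commit', 'abc123', 'review-state', 'string', 'approved')")
db.execute("INSERT INTO log VALUES ('commit', 'abc123', 'review-state', 1700000000, 'a@example.com')")

row = db.execute(
    "SELECT value FROM meta WHERE target = 'abc123' AND key = 'review-state'"
).fetchone()
print(row[0])  # approved
```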
"List" is actually a set of (timestamp, string) pairs, but the UX mostly treats this as a list of strings ordered by timestamp (for example, when writing a list, the second element is written with the timestamp incremented by 1 and so on). The "lists" can be pushed to, popped from, and cleared, but more complicated manipulations like inserting an item in an arbitrary position don't seem possible without rewriting the timestamps of existing items or forging the timestamp of the newly added item.
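The list semantics described above can be sketched like so. The function names are mine; the only spec-derived behaviors are storing a set of (timestamp, string) pairs, incrementing the timestamp by 1 for each subsequent element of a batch write, and presenting the set ordered by timestamp.

```python
import time

def push(entries, values, now=None):
    """Append values to a 'list'; the second and later values get the
    timestamp incremented by 1, as described in the spec."""
    now = int(time.time()) if now is None else now
    for i, v in enumerate(values):
        entries.add((now + i, v))

def as_list(entries):
    """The UX view: strings ordered by timestamp."""
    return [v for _, v in sorted(entries)]

entries = set()
push(entries, ["first", "second", "third"], now=1000)
print(as_list(entries))  # ['first', 'second', 'third']
```

Inserting at an arbitrary position really does have no natural encoding here: the only way to land between timestamps 1000 and 1001 would be to forge a timestamp or rewrite existing ones, which matches the limitation noted above.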
Tombstone doesn't seem strictly necessary except possibly to make `gmeta get --with-authorship` (get the current value and provide information about the last modification) not need to check the log to see if a missing entry is due to a deletion.

Serialization
Whenever serialization happens, a commit is written to `refs/meta/local`. The commit's trees, as recursively seen by a command like `git ls-tree -r`, will contain:

- `100644 blob <blob-id> <target-path>/<key-path>/__value`
- `100644 blob <blob-id> <target-path>/<key-path>/__list/<timestamp>-<short-hash-of-contents>` entries
- `100644 blob <blob-id> <target-path>/<key-path>/__deleted` pointing to a blob with JSON-stored timestamp and email of deleter

It is unclear to me what the parent of this commit is, if any. It is written:
This leads me to think that whenever serialization happens, a single commit with no parent is written. But subsequently, there is talk of a "three-way merge" and "fast-forward materialization", which are only possible if there is a commit for every modification, and if each commit has a parent representing the immediately previous modification. So I'm not sure.
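As a rough illustration of that layout, the tree paths could be constructed like this. Path-escaping rules for targets and keys aren't described in the spec, so the helper names and the example target path are made up.

```python
# Hypothetical helpers illustrating the serialized tree layout.
# The spec does not define how targets/keys are escaped into paths,
# so "commits/abc123" below is an invented example, not gmeta's format.

def value_path(target_path, key_path):
    """Path of the blob holding a 'string' value."""
    return f"{target_path}/{key_path}/__value"

def list_entry_path(target_path, key_path, ts, short_hash):
    """Path of one entry of a 'list' value."""
    return f"{target_path}/{key_path}/__list/{ts}-{short_hash}"

def deleted_path(target_path, key_path):
    """Path of the tombstone blob (JSON with timestamp and deleter email)."""
    return f"{target_path}/{key_path}/__deleted"

print(value_path("commits/abc123", "review-state"))
```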
Exchange
The `refs/meta/local` commit can be pushed and fetched as usual. A `gmeta materialize <remote>` command (which assumes that the metadata was fetched into `refs/meta/remotes/<remote>`) is included; it will combine the contents of that ref with the local contents. For "string" values, the last timestamp wins (where the timestamp is stored doesn't seem to be described in the spec). For "list", all the entries from both sides are combined and deduplicated - this works because they are stored as `<timestamp>-<short-hash-of-contents>`.

Automated retargeting of metadata when a commit is rewritten?
The spec doesn't seem to describe this.
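Returning to the materialize step: the merge rules (last timestamp wins for strings; set union for lists, deduplicated via their `<timestamp>-<short-hash-of-contents>` keys) can be sketched as follows. Function names and the 8-character short hash are my assumptions, not from the spec.

```python
import hashlib

def entry_key(ts, value):
    # Assumed key shape: <timestamp>-<short-hash-of-contents>;
    # the hash algorithm and length are guesses.
    return f"{ts}-{hashlib.sha1(value.encode()).hexdigest()[:8]}"

def merge_string(local, remote):
    """local/remote are (timestamp, value) pairs; the newer write wins."""
    return max(local, remote, key=lambda p: p[0])[1]

def merge_list(local, remote):
    """Union of both sides' (timestamp, value) entries; identical entries
    collapse because they produce the same <timestamp>-<hash> key."""
    merged = {entry_key(ts, v): (ts, v) for ts, v in list(local) + list(remote)}
    return [v for ts, v in sorted(merged.values())]

print(merge_string((10, "draft"), (20, "approved")))          # approved
print(merge_list([(1, "a"), (2, "b")], [(2, "b"), (3, "c")]))  # ['a', 'b', 'c']
```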
Comparison with `git notes`

`git notes` is more restricted in its targets (it supports any Git object, but not arbitrary strings like change IDs) and has no concept of `gmeta`'s "key" (it maps targets to values directly). It also supports only the equivalent of `gmeta`'s "string" value. Its main on-disk format is a commit with a flat tree containing entries whose filename is the target object ID in hexadecimal and whose blob ID is the value.

However, it is more integrated with other Git commands. `git log` can automatically display associated notes, and `git` can be configured to automatically rewrite notes when a commit is rewritten (from the documentation, currently only rewriting through `amend` and `rebase` seems to be supported). There is no automatic fetching/pushing of notes whenever its associated commit is fetched/pushed, though.

Possible hybrid solutions
It might be worth discussing hybrid solutions, especially if we plan to upstream `gmeta`. One option is to reuse the `git notes` format, but by convention treat the value as a JSON file (or Git trailers, and so on). This makes it cumbersome to write large payloads to one specific key, though (not only do we have to base64 encode or similar, we always have to rewrite the whole file).

Probably a better option is to greatly expand the target part of `git notes` into a target+key format. Since we are adding backwards-incompatible entries anyway, we could add nested trees, much like in the `gmeta` serialization format. Existing notes would remain as-is (treated as metadata with a commit "target" and an empty "key"). The existing mechanism that automatically rewrites commits could be expanded.
git logcan be configured to automatically display, and large payloads like a transcript could go into a named key into its own blob, andgit logwill not automatically display them.As a migration path, we could even have both
refs/meta/localandrefs/notes/commits(wheregit notesstores its data) operating at the same time.refs/meta/localcould store whatever we want (whether the original proposal or one of my proposed hybrid solutions) and inrefs/notes/commits, we note the original object ID of any commit that we have metadata for. Then, Git can track rewrites, and whenever the user performs an operation on metadata, we can automatically update the existing notes according to what's inrefs/notes/commitsbefore proceeding with the operation.Conclusion
If we're interested in better integration with Git and/or upstreaming this, I think we should consider something more like the hybrid solutions proposed above, or at least explain why we can't do them (e.g. Git objects are too slow, which is why we're using a SQLite database, or we really need the "list" data type which is cumbersome to represent with Git objects, so we might as well design it from scratch). Even if not, it might be worth investigating if operating directly with the serialized form (instead of having a SQLite database) is possible, as that is simpler (only one data representation).
Also, we should clarify whether the serialized commit has a parent or not.