[unified arch] Cache the outputs of the vision encoder #241
Start tracking the tokens and cache in `cache_wrapper`. When we receive a followup prompt, we no longer reprocess the images.
Cases:
I added two tests: one for SWA caches, which often cannot be trimmed, and one for non-SWA caches, which are usually trimmable.
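For reviewers, here is a rough sketch of the reuse logic the two tests exercise. It is not the code in this PR: `ToyCache`, `ToyCacheWrapper`, and `plan` are hypothetical names, the cache only tracks its length, and images are assumed to show up in the token list as placeholder ids. The idea is to compare the tracked tokens with the new prompt, keep the shared prefix, trim the rest if the cache allows it, and otherwise start over.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading tokens that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class ToyCache:
    """Hypothetical stand-in for a KV cache; it only tracks its length."""
    length: int = 0
    trimmable: bool = True  # assumption: False models an SWA-style cache

    def trim(self, n: int) -> None:
        if not self.trimmable:
            raise ValueError("cache cannot be trimmed")
        self.length -= n


@dataclass
class ToyCacheWrapper:
    """Hypothetical wrapper that tracks tokens alongside the cache."""
    cache: ToyCache = field(default_factory=ToyCache)
    tokens: List[int] = field(default_factory=list)

    def plan(self, prompt: List[int]) -> Tuple[int, List[int]]:
        """Return (reused_token_count, tokens_still_to_process)."""
        shared = common_prefix_len(self.tokens, prompt)
        stale = len(self.tokens) - shared  # cached tokens past the shared prefix
        if stale > 0:
            if self.cache.trimmable:
                self.cache.trim(stale)
            else:
                # Cannot roll back: drop everything and reprocess from scratch.
                self.cache = ToyCache(trimmable=False)
                shared = 0
        # Pretend the caller now processes the remaining tokens.
        self.tokens = list(prompt)
        self.cache.length = len(prompt)
        return shared, prompt[shared:]
```

With a trimmable cache, a followup that extends or diverges from the previous prompt only processes the new suffix. With a non-trimmable cache, extending the prompt still reuses everything, but a divergence forces a full reprocess, images included.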
Note that there is still room for improvement. Namely, we could cache the embeddings per image so that we can selectively reuse them in cases where the cache cannot be trimmed. This can be added as a feature in a future PR.
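One possible shape for that future feature, purely as a sketch: key each image's embeddings by a hash of its bytes, so that an image seen before skips the vision encoder even when the rest of the cache has to be rebuilt. `ImageEmbeddingCache` and the `encode` callable are hypothetical and not part of this PR; a real version would store tensors rather than lists and would need an eviction policy.

```python
import hashlib
from typing import Callable, Dict, List


class ImageEmbeddingCache:
    """Hypothetical per-image cache keyed by a SHA-256 of the image bytes."""

    def __init__(self, encode: Callable[[bytes], List[float]]):
        self._encode = encode  # stand-in for the vision encoder
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # how many times the encoder actually ran

    def get(self, image_bytes: bytes) -> List[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(image_bytes)
        return self._store[key]
```

Hashing the raw bytes means two prompts that attach the same file share one entry, regardless of where the image appears in the prompt.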
Note that this doesn't cache images going through the non-unified stack; that can be added in a future PR.