Embedded Allocation Performance Analysis #3
Description
This PR experiments with variable width allocation to compare performance with the current xmalloc approach.
Embeddings are allocated with `rb_data_typed_object_zalloc`, and the type is marked as `RUBY_TYPED_EMBEDDABLE` in the C extension. This allows efficient variable-width allocation, with embeddable typed data for compact memory usage.

Embeddings are managed as native C structures with automatic memory management, which now uses variable-width allocation for each vector.
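To make the idea of variable-width slots concrete, here is a minimal arithmetic sketch of which slot an embedded payload would need. The slot sizes, the header size, and the helper name `smallest_slot_for` are assumptions for illustration, not values taken from this PR; the real numbers depend on the Ruby build.

```ruby
# Assumed slot sizes in bytes for a 64-bit CRuby with variable-width
# allocation; check GC.stat_heap on the target build before relying on them.
ASSUMED_SLOT_SIZES = [40, 80, 160, 320, 640].freeze

# Assumed per-object header in bytes preceding the embedded payload.
# The real value depends on the Ruby version.
ASSUMED_HEADER_BYTES = 32

# Returns the smallest slot size that can hold the header plus payload,
# or nil when the payload is too large for any slot.
def smallest_slot_for(payload_bytes,
                      slot_sizes: ASSUMED_SLOT_SIZES,
                      header_bytes: ASSUMED_HEADER_BYTES)
  needed = header_bytes + payload_bytes
  slot_sizes.sort.find { |slot| slot >= needed }
end

# Assuming 4-byte floats, try three hypothetical vector dimensions.
[8, 128, 1536].each do |dim|
  slot = smallest_slot_for(dim * 4)
  label = slot ? "#{slot}-byte slot" : "no slot fits"
  puts "#{dim} dims -> #{label}"
end
```

Under these assumptions, small vectors fit comfortably in a slot, while a vector with thousands of bytes of data exceeds every slot size, which is relevant when the embeddings are large.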
Result: xmalloc proves superior in all metrics
Performance
Note on the old comparison: the performance spec was later improved to fill some gaps. I am keeping this first comparison session as a record of the progression.
bundle exec rspec spec/performance_spec.rb --seed 57210
Performance Benchmarks Comparison (OLD COMPARISON)
Embedding Creation Performance (10,000 iterations)
Cosine Similarity Performance (10,000 iterations)
RSS Memory Usage During Tests
Memory Usage Delta (10,000 embeddings)
The Paradox Explained
The massive memory increase with the embedded allocation (current branch) is actually expected and highlights a fundamental difference in how Ruby's GC manages memory:
With xmalloc (Master Branch)
With Embedded Allocation (Current Branch)
Issues with that Benchmark
We're creating 20,000 embeddings total (10k + 10k), but only measuring RSS after creating all of them. The embedded allocation keeps all this memory until GC runs.
The objs array holds references to all 10,000 embeddings, preventing GC from cleaning them up even if it runs.
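The effect of holding references can be sketched in plain Ruby. This is a hedged illustration, not the actual performance spec: `FakeEmbedding` is a hypothetical stand-in for the gem's C-backed class, and the sketch counts live objects with `ObjectSpace` as a portable proxy for RSS.

```ruby
# Hypothetical stand-in for the gem's embedding class; the real one is
# implemented in the C extension.
class FakeEmbedding
  def initialize(dim)
    @values = Array.new(dim) { rand }
  end
end

# Count live instances as a portable proxy for retained memory.
def live_embeddings
  ObjectSpace.each_object(FakeEmbedding).count
end

# Build `count` embeddings and return how many are live while the array
# still references them. The array is emptied before returning, so nothing
# references the embeddings afterwards.
def build_and_count(count, dim)
  objs = Array.new(count) { FakeEmbedding.new(dim) }
  held = live_embeddings
  objs.clear
  held
end

# Compare the number of live embeddings while references are held with the
# number still live after the references are dropped and a GC has run.
def held_vs_released(count: 10_000, dim: 16)
  GC.start
  baseline = live_embeddings
  held = build_and_count(count, dim) - baseline
  GC.start
  released = live_embeddings - baseline
  { held: held, released: released }
end
```

Calling `held_vs_released` returns a hash with the two counts. While the array holds references, every embedding stays live no matter how often the GC runs; only after the references are dropped can a GC pass reclaim them, so a measurement taken while `objs` is still populated says little about what is retained after cleanup. Ruby's GC scans the stack conservatively, so a few objects may survive the second `GC.start`.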
Performance Benchmarks Comparison (NEW COMPARISON)
Core Performance Metrics
Embedding Creation Performance (10,000 iterations)
Cosine Similarity Performance (10,000 iterations)
RSS Memory After Cleanup
Memory Usage Analysis (10,000 embeddings)
Memory Delta Comparison
Memory Retained After GC
Memory Efficiency Ratings
Allocation Pattern Analysis
Create and Discard Pattern (Final RSS)
Hold References Pattern (Final RSS)
Critical Issues with Embedded Allocation:
The embedded allocation (current branch) is consistently worse in every meaningful metric:
The Technical Explanation
The embedded allocation approach fails because:
Final consideration
The variable width allocation technique, while elegant in theory, is completely unsuitable for this use case where we're creating many large objects that contain significant amounts of data.
So I'll stick with xmalloc. The data is unambiguous - xmalloc is superior in every way: