Embedded Allocation Performance Analysis #3

marcomd · 2025-06-15T11:29:00Z

Description

This PR experiments with Variable Width Allocation (embedding the vector data directly in the Ruby object) and compares its performance against the current xmalloc approach.

Allocated embeddings with `rb_data_typed_object_zalloc` and marked the type as `RUBY_TYPED_EMBEDDABLE` in the C extension. This allows efficient variable-width allocation, with the typed data embedded in the Ruby object for compact memory usage.

Embeddings are still managed as native C structures with automatic memory management. On this branch, each vector is allocated with variable-width allocation instead of a separate xmalloc buffer.

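Result: xmalloc proves superior in every core metric (creation time, cosine similarity, and memory). Only test suite load time favors the embedded branch.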

Performance

⚠️ Treat these numbers as indicative only: they vary considerably between runs, and a proper comparison would average many readings. For simplicity, I took the best of three attempts.
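A minimal sketch of how repeated readings could be summarized instead of keeping only the best run. The helper names `summarize` and `bench`, the run count, and the sample workload are hypothetical and are not part of the performance spec.

```ruby
require "benchmark"

# Summarize a list of timings in seconds (hypothetical helper).
def summarize(samples)
  sorted = samples.sort
  mean = samples.sum / samples.size.to_f
  mid = sorted.size / 2
  median = sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  # Population standard deviation: how far individual runs scatter around the mean.
  stddev = Math.sqrt(samples.sum { |s| (s - mean)**2 } / samples.size)
  { min: sorted.first, mean: mean, median: median, stddev: stddev }
end

# Run the block `runs` times and summarize the wall-clock timings.
def bench(runs: 10)
  summarize(Array.new(runs) { Benchmark.realtime { yield } })
end

# Stand-in workload; in the spec this would be embedding creation or similarity.
stats = bench(runs: 10) { Array.new(10_000) { |i| i * 2.0 } }
puts format("min %.4fs  mean %.4fs  median %.4fs  stddev %.4fs",
            stats[:min], stats[:mean], stats[:median], stats[:stddev])
```

Reporting the median together with the standard deviation makes it easier to judge whether a difference of a few milliseconds between branches is larger than the run-to-run noise.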

Note on the old comparison: the performance spec was later improved to fill some gaps. I am keeping this first comparison session to document the progression.

bundle exec rspec spec/performance_spec.rb --seed 57210

Performance Benchmarks Comparison (OLD COMPARISON)

Metric	Master Branch	Current Branch	Change	Impact
Test Suite Runtime	1.44s	1.57s	+0.13s	⚠️ 9% slower
Test Suite Load Time	1.11s	1.01s	-0.10s	✅ 9% faster

Embedding Creation Performance (10,000 iterations)

Embedding Size	Master Branch	Current Branch	Change	Performance
768	36ms	26ms	-10ms	✅ 28% faster
2048	54ms	51ms	-3ms	✅ 6% faster
3072	69ms	73ms	+4ms	⚠️ 6% slower
4096	131ms	128ms	-3ms	✅ 2% faster

Cosine Similarity Performance (10,000 iterations)

Embedding Size	Master Branch	Current Branch	Change	Performance
768	26ms	33ms	+7ms	⚠️ 27% slower
2048	66ms	75ms	+9ms	⚠️ 14% slower
3072	105ms	117ms	+12ms	⚠️ 11% slower
4096	149ms	142ms	-7ms	✅ 5% faster

RSS Memory Usage During Tests

Embedding Size	Master Branch	Current Branch	Change	Memory Impact
768	148.95 MB	633.72 MB	+484.77 MB	⚠️ 325% increase
2048	200.16 MB	884.05 MB	+683.89 MB	⚠️ 342% increase
3072	240.94 MB	1409.16 MB	+1168.22 MB	⚠️ 485% increase
4096	286.59 MB	531.33 MB	+244.74 MB	⚠️ 85% increase

Memory Usage Delta (10,000 embeddings)

Embedding Size	Master Branch	Current Branch	Change	Memory Efficiency
768	5.02 MB	34.27 MB	+29.25 MB	⚠️ 582% increase
2048	13.48 MB	82.92 MB	+69.44 MB	⚠️ 515% increase
3072	53.41 MB	122.7 MB	+69.29 MB	⚠️ 130% increase
4096	66.56 MB	162.02 MB	+95.46 MB	⚠️ 143% increase

The Paradox Explained

The massive memory increase with the embedded allocation (current branch) is actually expected and highlights a fundamental difference in how Ruby's GC manages memory:

With xmalloc (Master Branch)

- Ruby allocates a small Ruby object (~40-80 bytes)
- The embedding data is allocated separately via xmalloc
- Ruby's GC can independently manage and free both pieces
- Immediate deallocation: when the Ruby object is swept, the C memory is freed immediately

With Embedded Allocation (Current Branch)

- Ruby allocates one large object containing both the Ruby object and the embedding data
- The entire chunk must be managed as a single unit by Ruby's GC
- Delayed deallocation: memory stays allocated until the next GC cycle
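The same split between data stored in the object's slot and data stored in a separate malloc buffer exists for core classes in CRuby, which allows a runnable illustration. This sketch uses `String` as a stand-in and does not touch the gem's `Embedding` type; the string sizes are arbitrary assumptions.

```ruby
require "objspace"

short = "x" * 8        # small enough to be stored inside the object's own slot
long  = "x" * 10_000   # too large for a slot, so the bytes live in a separate malloc buffer

short_size = ObjectSpace.memsize_of(short)
long_size  = ObjectSpace.memsize_of(long)

puts "short string: #{short_size} bytes"
puts "long string:  #{long_size} bytes"
```

For the long string, the reported size includes the external buffer on top of the slot, which is the analogue of the xmalloc layout on master.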

Issues with that Benchmark

Memory Measurement Timing

```ruby
# This is problematic:
objs = Array.new(n) { RagEmbeddings::Embedding.from_array(emb1) }
sim_time = Benchmark.realtime do
  objs.each do |obj|
    obj2 = RagEmbeddings::Embedding.from_array(emb2)  # Creating n MORE objects!
    obj.cosine_similarity(obj2)
  end
end
```

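With n = 10,000 this creates 20,000 embeddings in total (10k up front plus 10k inside the timed block), so the similarity timing also includes allocation, and RSS is measured only after all of them exist. The embedded allocation keeps all of this memory until GC runs.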
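One way to keep allocation out of the similarity timing is to build both operand sets before the timed block. The sketch below uses a hypothetical pure-Ruby `StandInEmbedding` class so that it runs without the C extension; with the gem loaded, `RagEmbeddings::Embedding` would take its place. The values of `n` and `dim` are arbitrary.

```ruby
require "benchmark"

# Hypothetical pure-Ruby stand-in for RagEmbeddings::Embedding.
class StandInEmbedding
  attr_reader :values

  def self.from_array(values)
    new(values)
  end

  def initialize(values)
    @values = values.dup.freeze
  end

  def norm
    Math.sqrt(@values.sum { |v| v * v })
  end

  def cosine_similarity(other)
    dot = 0.0
    @values.each_with_index { |v, i| dot += v * other.values[i] }
    dot / (norm * other.norm)
  end
end

n = 200
dim = 768
emb1 = Array.new(dim) { rand }
emb2 = Array.new(dim) { rand }

# Allocate both operand sets up front so the timed block measures
# only cosine_similarity, not the creation of n extra objects.
objs   = Array.new(n) { StandInEmbedding.from_array(emb1) }
others = Array.new(n) { StandInEmbedding.from_array(emb2) }

sim_time = Benchmark.realtime do
  objs.each_with_index { |obj, i| obj.cosine_similarity(others[i]) }
end

puts format("similarity only: %.2f ms", sim_time * 1000)
```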

RSS Measurement Issues

- RSS includes all process memory, not just the embeddings
- No baseline measurement is taken before the embeddings are created
- GC may not have run at the time of measurement

Array References Preventing GC

The objs array holds references to all 10,000 embeddings, so the GC cannot reclaim them even if it runs.
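The three issues above (no baseline, GC timing, live references) can be handled in one measurement helper. This is a sketch under stated assumptions: `rss_mb` and `measure_rss` are hypothetical names, RSS is read from `/proc/self/status` on Linux with a `ps` fallback elsewhere, and the workload is a plain Ruby array rather than real embeddings.

```ruby
# Resident set size of this process in MB, or nil if it cannot be read.
def rss_mb
  status = "/proc/self/status"
  if File.readable?(status)
    line = File.foreach(status).find { |l| l.start_with?("VmRSS:") }
    return line.split[1].to_f / 1024.0 if line   # VmRSS is reported in kB
  end
  out = `ps -o rss= -p #{Process.pid}`.strip      # fallback, also reported in kB
  out.empty? ? nil : out.to_f / 1024.0
rescue StandardError
  nil
end

# Measure RSS growth while the block's result is alive, and again after
# the reference is dropped and a GC has run (hypothetical helper).
def measure_rss
  GC.start                 # settle the heap before taking the baseline
  baseline = rss_mb
  held = yield             # keep the objects alive while measuring
  with_refs = rss_mb
  held = nil               # drop the only reference so GC can reclaim them
  GC.start
  after_gc = rss_mb
  return nil if [baseline, with_refs, after_gc].any?(&:nil?)

  { delta_mb: with_refs - baseline, retained_mb: after_gc - baseline }
end

result = measure_rss { Array.new(2_000) { Array.new(768, 0.0) } }
p result
```

Even with this structure, RSS remains a coarse signal: the allocator may not return freed pages to the operating system, and CRuby's conservative stack scanning can keep a few objects alive after the reference is cleared.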

Performance Benchmarks Comparison (NEW COMPARISON)

Metric	Master Branch (xmalloc)	Current Branch (Data Embedded)	Change	Impact
Test Suite Runtime	4.50s	4.95s	+0.45s	⚠️ 10% slower
Test Suite Load Time	0.47s	0.23s	-0.24s	✅ 51% faster

Core Performance Metrics

Embedding Creation Performance (10,000 iterations)

Embedding Size	Master Branch	Current Branch	Change	Performance
768	22ms	23ms	+1ms	⚠️ 5% slower
2048	57ms	62ms	+5ms	⚠️ 9% slower
3072	92ms	96ms	+4ms	⚠️ 4% slower
4096	112ms	115ms	+3ms	⚠️ 3% slower

Cosine Similarity Performance (10,000 iterations)

Embedding Size	Master Branch	Current Branch	Change	Performance
768	8ms	10ms	+2ms	⚠️ 25% slower
2048	21ms	27ms	+6ms	⚠️ 29% slower
3072	30ms	38ms	+8ms	⚠️ 27% slower
4096	41ms	48ms	+7ms	⚠️ 17% slower

RSS Memory After Cleanup

Embedding Size	Master Branch	Current Branch	Change	Memory Impact
768	81.11 MB	112.63 MB	+31.52 MB	⚠️ 39% increase
2048	160.64 MB	368.89 MB	+208.25 MB	⚠️ 130% increase
3072	162.33 MB	767.22 MB	+604.89 MB	⚠️ 373% increase
4096	196.02 MB	682.80 MB	+486.78 MB	⚠️ 248% increase

Memory Usage Analysis (10,000 embeddings)

Memory Delta Comparison

Embedding Size	Master Branch	Current Branch	Change	Memory Efficiency
768	16.47 MB	34.37 MB	+17.90 MB	⚠️ 109% increase
2048	4.52 MB	44.12 MB	+39.60 MB	⚠️ 876% increase
3072	68.96 MB	122.71 MB	+53.75 MB	⚠️ 78% increase
4096	45.97 MB	98.85 MB	+52.88 MB	⚠️ 115% increase

Memory Retained After GC

Embedding Size	Master Branch	Current Branch	Change	GC Efficiency
768	16.48 MB	34.37 MB	+17.89 MB	⚠️ 109% increase
2048	4.52 MB	29.83 MB	+25.31 MB	⚠️ 560% increase
3072	20.97 MB	122.71 MB	+101.74 MB	⚠️ 485% increase
4096	2.14 MB	98.85 MB	+96.71 MB	⚠️ 4520% increase

Memory Efficiency Ratings

Embedding Size	Master Branch	Current Branch	Change	Efficiency Impact
768	179.3%	85.9%	-93.4pp	⚠️ Less efficient
2048	1733.5%	177.6%	-1555.9pp	⚠️ Significantly less efficient
3072	170.3%	95.7%	-74.6pp	⚠️ Less efficient
4096	340.4%	158.3%	-182.1pp	⚠️ Less efficient
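The PR does not state how the efficiency rating is computed. The values in the table are consistent with theoretical size divided by measured memory delta, where theoretical size assumes 4 bytes per float plus 24 bytes of per-embedding overhead and 1 MB = 1024 × 1024 bytes. Both constants are inferred from the published numbers, not taken from the spec, so treat the sketch below as a reconstruction.

```ruby
BYTES_PER_FLOAT = 4               # assumption: float32 storage
HEADER_BYTES    = 24              # assumption: inferred per-embedding overhead
MIB             = 1024.0 * 1024.0

# Size in MB that `count` embeddings of dimension `dim` would need in theory.
def theoretical_mb(dim, count)
  (dim * BYTES_PER_FLOAT + HEADER_BYTES) * count / MIB
end

# Theoretical size as a percentage of the measured memory delta.
def efficiency_pct(dim, count, measured_delta_mb)
  (theoretical_mb(dim, count) / measured_delta_mb * 100).round(1)
end

puts efficiency_pct(768, 10_000, 34.37)   # current branch, 768 dimensions
puts efficiency_pct(2048, 10_000, 4.52)   # master branch, 2048 dimensions
```

Under this reading, a rating above 100% means the measured RSS delta was smaller than the raw data size, which points to measurement noise (for example, memory reused from earlier allocations) rather than to genuinely better-than-ideal storage.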

Allocation Pattern Analysis

Create and Discard Pattern (Final RSS)

Embedding Size	Master Branch	Current Branch	Change	Pattern Impact
768	88.95 MB	146.88 MB	+57.93 MB	⚠️ 65% increase
2048	167.63 MB	437.25 MB	+269.62 MB	⚠️ 161% increase
3072	126.83 MB	461.06 MB	+334.23 MB	⚠️ 264% increase
4096	188.02 MB	731.53 MB	+543.51 MB	⚠️ 289% increase

Hold References Pattern (Final RSS)

Embedding Size	Master Branch	Current Branch	Change	Reference Impact
768	97.75 MB	181.36 MB	+83.61 MB	⚠️ 86% increase
2048	168.91 MB	520.70 MB	+351.79 MB	⚠️ 208% increase
3072	194.23 MB	583.91 MB	+389.68 MB	⚠️ 201% increase
4096	183.08 MB	756.03 MB	+572.95 MB	⚠️ 313% increase

Critical Issues with Embedded Allocation:

The embedded allocation (current branch) is consistently worse in every meaningful metric:

Massive Memory Bloat

- 2048 size: 876% increase in memory delta
- 4096 size: 4520% increase in retained memory after GC
- RSS roughly 1.4-4.7x higher across the RSS tables above

Performance Degradation

- Cosine similarity 17-29% slower (this is huge for a core operation)
- Creation 3-9% slower
- Overall test suite 10% slower

Poor GC Behavior

- Memory retained after GC is 109-4520% higher
- "Create and discard" pattern shows 65-289% higher final memory usage

The Technical Explanation

The embedded allocation approach fails because:

- Ruby's GC algorithm isn't optimized for very large objects
- GC trigger thresholds are based on object count, not memory size
- Sweep cost increases significantly with larger objects
- Memory fragmentation becomes worse with large embedded allocations
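These points are not measured directly in this PR. A minimal sketch for checking them is to snapshot a few `GC.stat` counters around a workload on each branch; the `gc_snapshot` helper and the string workload are hypothetical stand-ins, and the keys used are CRuby-specific.

```ruby
# Snapshot a few CRuby GC counters (hypothetical helper).
def gc_snapshot
  {
    gc_runs:         GC.stat(:count),
    live_slots:      GC.stat(:heap_live_slots),
    malloc_increase: GC.stat(:malloc_increase_bytes),
    total_allocated: GC.stat(:total_allocated_objects)
  }
end

before = gc_snapshot
workload = Array.new(50_000) { "x" * 1_000 }   # stand-in; replace with embedding creation
after = gc_snapshot

puts "objects created: #{workload.size}"
after.each { |key, value| puts "#{key}: #{before[key]} -> #{value}" }
```

Comparing `gc_runs` and `malloc_increase` between the two branches for the same workload would show whether the embedded layout really triggers fewer collections for the same amount of data.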

Final consideration

The variable width allocation technique, while elegant in theory, is unsuitable for this use case, where we create many large objects that each contain a significant amount of data.
So I'll stick with xmalloc. In the new comparison it comes out ahead on every core metric:

✅ Faster cosine similarity (core operation)
✅ Roughly 2-46x less memory retained after GC
✅ Better GC behavior
✅ More predictable memory usage


marcomd added 4 commits June 15, 2025 11:15

Implemented Variable Width Allocation

af92207

- Allocated embeddings with `rb_data_typed_object_zalloc` and marked the type as `RUBY_TYPED_EMBEDDABLE` in the C extension

Merge branch 'master' into feature/variable-width-allocation

3d38718

Merge branch 'master' into feature/variable-width-allocation

2b20d05

Merge branch 'master' into feature/variable-width-allocation

7cfd16a

marcomd self-assigned this Jun 15, 2025

marcomd added the enhancement New feature or request label Jun 15, 2025

This was referenced Jun 15, 2025

Add RUBY_TYPED_EMBEDDABLE flag support #1

Closed

Add RUBY_TYPED_EMBEDDABLE flag support - solution 2 #2

Closed

marcomd added 3 commits June 15, 2025 13:39

Fixed VERSION

10316a4

Added required_ruby_version

112d5ab

Updated README

7077238

marcomd changed the title ~~Implemented Variable Width Allocation~~ Research: Embedded Allocation Performance Analysis Jun 22, 2025

marcomd added the research label Jun 22, 2025

marcomd changed the title ~~Research: Embedded Allocation Performance Analysis~~ Embedded Allocation Performance Analysis Jun 22, 2025

marcomd removed the enhancement New feature or request label Jun 22, 2025
