@marcomd commented Jun 15, 2025

Description

This PR experiments with variable width allocation to compare performance with the current xmalloc approach.

Embeddings are allocated with rb_data_typed_object_zalloc, and the type is marked as RUBY_TYPED_EMBEDDABLE in the C extension. This allows variable-width allocation, with the embeddable typed data stored compactly inside the Ruby object.

Embeddings are still managed as native C structures with automatic memory management. On this branch each vector uses variable-width allocation instead of a separate xmalloc buffer.

Result: xmalloc proves superior in every core metric (creation time, cosine similarity time, and memory usage).

Performance

⚠️ Treat these figures as indicative only. They vary considerably between runs, and an average over many readings would be more reliable; for simplicity I took the best of three attempts.

Note on the old comparison: the performance spec was later improved to fill some gaps. I am keeping this first comparison session as a record of the progression.
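
A more robust reading could report the spread of several runs rather than only the best one. The sketch below is not part of the performance spec; `timing_summary` and the workload in the usage line are hypothetical names chosen for illustration.

```ruby
require "benchmark"

# Run the given block `runs` times and summarise the wall-clock timings in ms.
def timing_summary(runs: 10)
  samples = Array.new(runs) { Benchmark.realtime { yield } * 1000.0 }
  sorted = samples.sort
  mid = runs / 2
  median = runs.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0

  {
    best: sorted.first,
    median: median,
    mean: samples.sum / runs,
    worst: sorted.last
  }
end

# Hypothetical workload standing in for one benchmark case.
summary = timing_summary(runs: 10) { Array.new(768) { rand }.sum }
puts summary.map { |name, ms| format("%s=%.3fms", name, ms) }.join(" ")
```

Comparing the median or mean between branches, alongside the best run, makes it easier to tell a real difference from run-to-run noise.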

bundle exec rspec spec/performance_spec.rb --seed 57210

Performance Benchmarks Comparison (OLD COMPARISON)

| Metric | Master Branch | Current Branch | Change | Impact |
| --- | --- | --- | --- | --- |
| Test Suite Runtime | 1.44s | 1.57s | +0.13s | ⚠️ 9% slower |
| Test Suite Load Time | 1.11s | 1.01s | -0.10s | ✅ 9% faster |

Embedding Creation Performance (10,000 iterations)

| Embedding Size | Master Branch | Current Branch | Change | Performance |
| --- | --- | --- | --- | --- |
| 768 | 36ms | 26ms | -10ms | ✅ 28% faster |
| 2048 | 54ms | 51ms | -3ms | ✅ 6% faster |
| 3072 | 69ms | 73ms | +4ms | ⚠️ 6% slower |
| 4096 | 131ms | 128ms | -3ms | ✅ 2% faster |

Cosine Similarity Performance (10,000 iterations)

| Embedding Size | Master Branch | Current Branch | Change | Performance |
| --- | --- | --- | --- | --- |
| 768 | 26ms | 33ms | +7ms | ⚠️ 27% slower |
| 2048 | 66ms | 75ms | +9ms | ⚠️ 14% slower |
| 3072 | 105ms | 117ms | +12ms | ⚠️ 11% slower |
| 4096 | 149ms | 142ms | -7ms | ✅ 5% faster |

RSS Memory Usage During Tests

| Embedding Size | Master Branch | Current Branch | Change | Memory Impact |
| --- | --- | --- | --- | --- |
| 768 | 148.95 MB | 633.72 MB | +484.77 MB | ⚠️ 325% increase |
| 2048 | 200.16 MB | 884.05 MB | +683.89 MB | ⚠️ 342% increase |
| 3072 | 240.94 MB | 1409.16 MB | +1168.22 MB | ⚠️ 485% increase |
| 4096 | 286.59 MB | 531.33 MB | +244.74 MB | ⚠️ 85% increase |

Memory Usage Delta (10,000 embeddings)

| Embedding Size | Master Branch | Current Branch | Change | Memory Efficiency |
| --- | --- | --- | --- | --- |
| 768 | 5.02 MB | 34.27 MB | +29.25 MB | ⚠️ 582% increase |
| 2048 | 13.48 MB | 82.92 MB | +69.44 MB | ⚠️ 515% increase |
| 3072 | 53.41 MB | 122.7 MB | +69.29 MB | ⚠️ 130% increase |
| 4096 | 66.56 MB | 162.02 MB | +95.46 MB | ⚠️ 143% increase |

The Paradox Explained

The massive memory increase with the embedded allocation (current branch) is actually expected and highlights a fundamental difference in how Ruby's GC manages memory:

With xmalloc (Master Branch)

  • Ruby allocates a small Ruby object (~40-80 bytes)
  • The embedding data is allocated separately via xmalloc
  • Ruby's GC can independently manage and free both pieces
  • Immediate deallocation: When the Ruby object is swept, the C memory is freed immediately

With Embedded Allocation (Current Branch)

  • Ruby allocates one large object containing both the Ruby object and embedding data
  • The entire chunk must be managed as a single unit by Ruby's GC
  • Delayed deallocation: Memory stays allocated until the next GC cycle
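
The same split between "payload inside the object slot" and "payload in a separate malloc'ed buffer" can be observed from plain Ruby on built-in strings. This is only an illustration of the two layouts, not the gem's Embedding class; `embedded?` is a hypothetical helper built on ObjectSpace.dump, which reports an "embedded" flag for strings.

```ruby
require "json"
require "objspace"

# Report whether MRI stored the string's bytes inside its own heap slot.
def embedded?(obj)
  JSON.parse(ObjectSpace.dump(obj)).fetch("embedded", false)
end

small = "x" * 8         # short enough to live inside the object slot
large = "x" * 100_000   # far larger than any slot, so the bytes sit in a separate buffer

puts "small: embedded=#{embedded?(small)} memsize=#{ObjectSpace.memsize_of(small)}"
puts "large: embedded=#{embedded?(large)} memsize=#{ObjectSpace.memsize_of(large)}"
```

For the large string, memsize_of includes the separately allocated buffer, which is the part the GC frees through the allocator when the object is swept.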

Issues with that Benchmark

  1. Memory Measurement Timing
```ruby
# This is problematic:
objs = Array.new(n) { RagEmbeddings::Embedding.from_array(emb1) }
sim_time = Benchmark.realtime do
  objs.each do |obj|
    obj2 = RagEmbeddings::Embedding.from_array(emb2)  # Creating n MORE objects!
    obj.cosine_similarity(obj2)
  end
end
```

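We're creating 20,000 embeddings in total (10k + 10k), but RSS is only measured after all of them have been created. The embedded allocation keeps all this memory until GC runs. The timed block also creates obj2 on every iteration, so sim_time measures allocation as well as cosine similarity.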

  2. RSS Measurement Issues
  • RSS includes all process memory, not just the embeddings
  • No baseline measurement
  • GC may not have run at the time of measurement
  3. Array References Preventing GC
    The objs array holds references to all 10,000 embeddings, preventing GC from cleaning them up even if it runs.
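
A measurement that avoids these three issues might look like the sketch below. It is not the actual spec: `FakeEmbedding` is a hypothetical pure-Ruby stand-in for RagEmbeddings::Embedding so the example runs without the gem, and `rss_mb` assumes a `ps` command that accepts `-o rss= -p`.

```ruby
require "benchmark"

# Hypothetical pure-Ruby stand-in for RagEmbeddings::Embedding.
class FakeEmbedding
  attr_reader :values

  def self.from_array(values)
    new(values)
  end

  def initialize(values)
    @values = values.dup.freeze
  end

  def cosine_similarity(other)
    dot = norm_a = norm_b = 0.0
    @values.each_with_index do |a, i|
      b = other.values[i]
      dot += a * b
      norm_a += a * a
      norm_b += b * b
    end
    dot / Math.sqrt(norm_a * norm_b)
  end
end

# Resident set size in MB, or nil when `ps` is not available.
def rss_mb
  out = `ps -o rss= -p #{Process.pid}`
  out.strip.empty? ? nil : out.to_i / 1024.0
rescue SystemCallError
  nil
end

def measure_similarity(klass, dims:, n:)
  emb1 = Array.new(dims) { rand }
  emb2 = Array.new(dims) { rand }

  GC.start
  baseline = rss_mb                      # baseline before any embedding exists

  # Build both operand sets before the timer starts.
  lefts = Array.new(n) { klass.from_array(emb1) }
  rights = Array.new(n) { klass.from_array(emb2) }

  sim_time = Benchmark.realtime do       # times cosine similarity only
    lefts.each_with_index { |obj, i| obj.cosine_similarity(rights[i]) }
  end
  held = rss_mb                          # RSS while references are still held

  lefts = rights = nil                   # drop references so GC can reclaim them
  GC.start
  after_gc = rss_mb

  { sim_ms: sim_time * 1000.0, baseline_mb: baseline, held_mb: held, after_gc_mb: after_gc }
end

p measure_similarity(FakeEmbedding, dims: 768, n: 1_000)
```

Subtracting baseline_mb from held_mb and after_gc_mb gives the memory attributable to the embeddings rather than to the whole process.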

Performance Benchmarks Comparison NEW

| Metric | Master Branch (xmalloc) | Current Branch (Data Embedded) | Change | Impact |
| --- | --- | --- | --- | --- |
| Test Suite Runtime | 4.50s | 4.95s | +0.45s | ⚠️ 10% slower |
| Test Suite Load Time | 0.47s | 0.23s | -0.24s | ✅ 51% faster |

Core Performance Metrics

Embedding Creation Performance (10,000 iterations)

| Embedding Size | Master Branch | Current Branch | Change | Performance |
| --- | --- | --- | --- | --- |
| 768 | 22ms | 23ms | +1ms | ⚠️ 5% slower |
| 2048 | 57ms | 62ms | +5ms | ⚠️ 9% slower |
| 3072 | 92ms | 96ms | +4ms | ⚠️ 4% slower |
| 4096 | 112ms | 115ms | +3ms | ⚠️ 3% slower |

Cosine Similarity Performance (10,000 iterations)

| Embedding Size | Master Branch | Current Branch | Change | Performance |
| --- | --- | --- | --- | --- |
| 768 | 8ms | 10ms | +2ms | ⚠️ 25% slower |
| 2048 | 21ms | 27ms | +6ms | ⚠️ 29% slower |
| 3072 | 30ms | 38ms | +8ms | ⚠️ 27% slower |
| 4096 | 41ms | 48ms | +7ms | ⚠️ 17% slower |

RSS Memory After Cleanup

| Embedding Size | Master Branch | Current Branch | Change | Memory Impact |
| --- | --- | --- | --- | --- |
| 768 | 81.11 MB | 112.63 MB | +31.52 MB | ⚠️ 39% increase |
| 2048 | 160.64 MB | 368.89 MB | +208.25 MB | ⚠️ 130% increase |
| 3072 | 162.33 MB | 767.22 MB | +604.89 MB | ⚠️ 373% increase |
| 4096 | 196.02 MB | 682.80 MB | +486.78 MB | ⚠️ 248% increase |

Memory Usage Analysis (10,000 embeddings)

Memory Delta Comparison

| Embedding Size | Master Branch | Current Branch | Change | Memory Efficiency |
| --- | --- | --- | --- | --- |
| 768 | 16.47 MB | 34.37 MB | +17.90 MB | ⚠️ 109% increase |
| 2048 | 4.52 MB | 44.12 MB | +39.60 MB | ⚠️ 876% increase |
| 3072 | 68.96 MB | 122.71 MB | +53.75 MB | ⚠️ 78% increase |
| 4096 | 45.97 MB | 98.85 MB | +52.88 MB | ⚠️ 115% increase |

Memory Retained After GC

| Embedding Size | Master Branch | Current Branch | Change | GC Efficiency |
| --- | --- | --- | --- | --- |
| 768 | 16.48 MB | 34.37 MB | +17.89 MB | ⚠️ 109% increase |
| 2048 | 4.52 MB | 29.83 MB | +25.31 MB | ⚠️ 560% increase |
| 3072 | 20.97 MB | 122.71 MB | +101.74 MB | ⚠️ 485% increase |
| 4096 | 2.14 MB | 98.85 MB | +96.71 MB | ⚠️ 4520% increase |

Memory Efficiency Ratings

| Embedding Size | Master Branch | Current Branch | Change | Efficiency Impact |
| --- | --- | --- | --- | --- |
| 768 | 179.3% | 85.9% | -93.4pp | ⚠️ Less efficient |
| 2048 | 1733.5% | 177.6% | -1555.9pp | ⚠️ Significantly less efficient |
| 3072 | 170.3% | 95.7% | -74.6pp | ⚠️ Less efficient |
| 4096 | 340.4% | 158.3% | -182.1pp | ⚠️ Less efficient |

Allocation Pattern Analysis

Create and Discard Pattern (Final RSS)

| Embedding Size | Master Branch | Current Branch | Change | Pattern Impact |
| --- | --- | --- | --- | --- |
| 768 | 88.95 MB | 146.88 MB | +57.93 MB | ⚠️ 65% increase |
| 2048 | 167.63 MB | 437.25 MB | +269.62 MB | ⚠️ 161% increase |
| 3072 | 126.83 MB | 461.06 MB | +334.23 MB | ⚠️ 264% increase |
| 4096 | 188.02 MB | 731.53 MB | +543.51 MB | ⚠️ 289% increase |

Hold References Pattern (Final RSS)

| Embedding Size | Master Branch | Current Branch | Change | Reference Impact |
| --- | --- | --- | --- | --- |
| 768 | 97.75 MB | 181.36 MB | +83.61 MB | ⚠️ 86% increase |
| 2048 | 168.91 MB | 520.70 MB | +351.79 MB | ⚠️ 208% increase |
| 3072 | 194.23 MB | 583.91 MB | +389.68 MB | ⚠️ 201% increase |
| 4096 | 183.08 MB | 756.03 MB | +572.95 MB | ⚠️ 313% increase |

Critical Issues with Embedded Allocation:

The embedded allocation (current branch) is consistently worse in every meaningful metric:

  1. Massive Memory Bloat
  • 2048 size: 876% increase in memory delta
  • 4096 size: 4520% increase in retained memory after GC
  • Peak memory usage 2-4x higher
  2. Performance Degradation
  • Cosine similarity 17-29% slower (this is huge for a core operation)
  • Creation 3-9% slower
  • Overall test suite 10% slower
  3. Poor GC Behavior
  • Memory retention after GC is catastrophically bad
  • "Create and discard" pattern shows 65-289% higher final memory usage

The Technical Explanation

The embedded allocation approach fails because:

  • Ruby's GC algorithm isn't optimized for very large objects
  • GC trigger thresholds are based on object count, not memory size
  • Sweep cost increases significantly with larger objects
  • Memory fragmentation becomes worse with large embedded allocations
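
These points are my reading of the numbers rather than something I traced through the GC. One way to check them is to watch how the GC counters move around an allocation burst: GC.stat exposes both slot-based counters and malloc-based byte counters. In the sketch below `gc_snapshot` and `gc_delta` are hypothetical helpers, and the string workload is only a placeholder for the embedding creation loop.

```ruby
# Snapshot the GC counters that matter for this comparison.
def gc_snapshot(keys = %i[count heap_live_slots total_allocated_objects malloc_increase_bytes])
  stats = GC.stat
  keys.to_h { |key| [key, stats[key]] }
end

# Run the block and return how far each counter moved.
def gc_delta
  before = gc_snapshot
  yield
  after = gc_snapshot
  before.keys.to_h { |key| [key, after[key] - before[key]] }
end

# Placeholder workload: 10,000 objects with a 4 KB payload each.
delta = gc_delta { Array.new(10_000) { "x" * 4096 } }
delta.each { |key, change| puts format("%-24s %+d", key, change) }
```

Running the same helper around the embedding creation loop on both branches would show how many GC runs each allocation strategy triggers and whether the byte counters move at all.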

Final consideration

The variable width allocation technique, while elegant in theory, is completely unsuitable for this use case where we're creating many large objects that contain significant amounts of data.
So I'll stick with xmalloc. On the points that matter for this gem, the data is unambiguous:

  • ✅ faster cosine similarity (core operation)
  • ✅ 2-45x more memory efficient
  • ✅ Better GC behavior
  • ✅ More predictable memory usage

@marcomd self-assigned this on Jun 15, 2025
@marcomd added the "enhancement" label on Jun 15, 2025
@marcomd changed the title from "Implemented Variable Width Allocation" to "Research: Embedded Allocation Performance Analysis" on Jun 22, 2025
@marcomd changed the title from "Research: Embedded Allocation Performance Analysis" to "Embedded Allocation Performance Analysis" on Jun 22, 2025
@marcomd removed the "enhancement" label on Jun 22, 2025