Description
Thanks for working on this!
I've been testing running embeddings in a RunPod serverless environment, but the performance isn't what I would have expected. For bge-m3 we're seeing an end-to-end latency of ~600ms, while RunPod itself reports around 100ms delay time and around 110ms processing time.
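For reference, this is roughly how I measure the end-to-end latency from the client side. It's only a sketch: the endpoint ID is a placeholder and the input payload shape is an assumption that depends on the worker's expected schema.

```python
import os
import time

import requests

# Placeholder endpoint ID; the payload shape below is an assumption and
# needs to match whatever input schema the worker actually expects.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {"input": {"model": "BAAI/bge-m3", "input": ["What is BGE M3?"]}}
headers = {"Authorization": f"Bearer {API_KEY}"}

start = time.time()
response = requests.post(URL, json=payload, headers=headers, timeout=30)
print(f"End-to-end latency: {time.time() - start:.3f}s, status: {response.status_code}")
```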
I tried running bge-m3 locally on my machine (on a GeForce RTX 4080, directly via Python using BGEM3FlagModel). The first embedding has a very high latency as well (~180ms), but subsequent embeddings are very fast, as expected: around 4-5ms for simple text.
I don't see an obvious reason why requests after the first one on an already running worker would still take 100+ms. Is this something that can be improved? I would be willing to contribute, but I'd like to ask first whether this performance is to be expected or whether there is potential to improve it.
I would also like to ask about the 100ms delay time. What could be the reasons for it being so high even though the worker is already running?
We are using European data centers. Could it be that the requests are somehow routed through the US?
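My expectation was that a warm worker only pays the per-request encode cost, since the model should already be resident. Roughly, I'd picture the handler side like the minimal sketch below; this is an assumption about how the worker is structured, and the input key names are placeholders, not the actual worker code.

```python
import runpod
from FlagEmbedding import BGEM3FlagModel

# Loaded once at worker startup, not per request, so warm requests
# should only pay the encode cost.
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

def handler(job):
    # The inner "input" key and its shape are assumptions; the real
    # worker schema may differ.
    texts = job["input"]["input"]
    vecs = model.encode(texts)['dense_vecs']
    return {"embeddings": [v.tolist() for v in vecs]}

runpod.serverless.start({"handler": handler})
```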
This is the Python script I used for testing locally:
```python
import time

from FlagEmbedding import BGEM3FlagModel

# Load the model once up front; the first encode call still pays some
# one-time initialization cost (CUDA context, kernel warm-up, etc.).
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

sentences = ["What is BGE M3?"]
sentences2 = ["More text"]
sentences3 = ["<<< More text"]

def get_embeddings(inputs):
    start_time = time.time()
    model.encode(inputs)['dense_vecs']
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time} seconds")

get_embeddings(sentences)
get_embeddings(sentences2)
get_embeddings(sentences3)
```

Output:

```
Execution time: 0.214857816696167 seconds
Execution time: 0.004781007766723633 seconds
Execution time: 0.00433349609375 seconds
```