
Conversation

@vishalbakshi (Contributor) commented Feb 11, 2025

I was building an index of 200k documents (from the UKPLab/DAPR Genomics dataset) and noticed that when building the index in batches with `add_to_index`, it hung on a list comprehension at line 162 of `colbert.py`:

new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]} 
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in self.pid_docid_map.values()
]

Profiling showed that the entire index-building process took 1240 seconds, of which this list comprehension alone accounted for 352 seconds.


The issue was that `self.pid_docid_map.values()` was recomputed on every iteration of the list comprehension, and each `not in` check was a linear scan over it. Hoisting that call out of the comprehension so it is computed once, and converting the result to a `set` so each membership test is O(1) on average, brought its execution time from 352 seconds down to under 0.6 seconds.

pid_values = set(self.pid_docid_map.values())  # computed once; set gives O(1) membership tests
new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]}
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in pid_values
]


This PR implements that code.
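For intuition, here is a minimal, self-contained sketch of why the hoisted set is so much faster (the map sizes and variable names below are illustrative, not taken from the RAGatouille codebase): checking membership in `dict.values()` linearly scans the view on every iteration, while a precomputed `set` answers each check in constant time on average.

import time

# Illustrative stand-ins for self.pid_docid_map and the new document ids.
pid_docid_map = {pid: f"doc_{pid}" for pid in range(200_000)}
new_ids = [f"doc_{pid}" for pid in range(100_000, 300_000)]

# Slow: .values() is linearly scanned on every check, so only a small
# slice of the ids is timed here to keep the demo quick.
start = time.perf_counter()
kept = [i for i in new_ids[:2_000] if i not in pid_docid_map.values()]
print(f".values() per iteration: {time.perf_counter() - start:.2f}s for 2k checks")

# Fast: build the set once; each membership test is O(1) on average.
existing_ids = set(pid_docid_map.values())
start = time.perf_counter()
kept = [i for i in new_ids if i not in existing_ids]
print(f"precomputed set: {time.perf_counter() - start:.2f}s for 200k checks")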

This is my first PR to RAGatouille, so please let me know if additional information is needed. Thank you for maintaining such an awesome library!

For reference, here's the code I used to profile this issue in a Google Colab Pro notebook (T4 w/High-RAM instance):

!pip install datasets ragatouille
!pip uninstall -y faiss-cpu && pip install faiss-gpu-cu12

from datasets import load_dataset
from ragatouille import RAGPretrainedModel

dataset_name = "Genomics"
passages = load_dataset("UKPLab/dapr", f"{dataset_name}-corpus", split="test")
queries = load_dataset("UKPLab/dapr", f"{dataset_name}-queries", split="test")
qrels_rows = load_dataset("UKPLab/dapr", f"{dataset_name}-qrels", split="test")

RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")

batch_size = 100_000
batch = passages[:batch_size]

# Process the first batch separately to create the index
index_path = RAG.index(
    index_name=f"{dataset_name}_index",
    collection=batch["text"],
    document_ids=batch["_id"],
    use_faiss=True
)

# add second batch with `add_to_index`
batch = passages[batch_size : batch_size*2]

%prun -D /content/bs_100000.prof RAG.add_to_index(new_collection=batch["text"], new_document_ids=batch["_id"], use_faiss=True)
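To read the saved profile outside the notebook, the standard-library pstats module works; the path below matches the -D argument above.

import pstats

# Load the profile written by %prun -D and show the 10 entries
# with the highest cumulative time.
stats = pstats.Stats("/content/bs_100000.prof")
stats.sort_stats("cumulative").print_stats(10)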

Currently `if new_pid_docid_map[pid] not in self.pid_docid_map.values()` recalculates `.values()` on each iteration of the list comprehension, so I'm moving that call out of the list comprehension and storing the result in a `set`. Hoping this speeds up the `add_to_index` operation.
Removing the previous list comprehension that I had commented out while testing the updated code.
@bclavie (Contributor) commented Feb 11, 2025

Very helpful, thank you! I have no words to answer the shame I feel reading the values() call in a loop 😆

@bclavie bclavie enabled auto-merge (squash) February 11, 2025 01:41
@bclavie (Contributor) commented Feb 11, 2025

(ignore the CI btw -- it's going away in a hot minute. It's not related to your PR but to getting rid of one of the worst ideas I've ever had, using poetry. We're moving very temporarily to a homebrew pyproject.toml, then uv)

@bclavie bclavie merged commit 0df18e7 into AnswerDotAI:main Feb 11, 2025
@vishalbakshi vishalbakshi deleted the bugfix/add_to_index branch February 11, 2025 16:51