[CLEAN] Synthetic Benchmark PR #29981 - perf: optimize DatasetRetrieval.retrieve、RetrievalService._deduplicat… #49

tomerqodo · 2025-12-30T06:35:55Z

Benchmark PR langgenius#29981

Type: Clean (correct implementation)

Original PR Title: perf: optimize DatasetRetrieval.retrieve、RetrievalService._deduplicat…
Original PR Description: …e_documents、RetrievalService.format_retrieval_documents

Important

Make sure you have read our contribution guidelines
Ensure there is an associated issue and you have been assigned to it
Use the correct syntax to link this PR: Fixes #<issue number>.

Summary

Fixes langgenius#29750

Based on typical RAG retrieval scenarios (assuming 50-200 documents):

Small-scale scenario (10-50 documents)

Query optimization: 6.25x - 31.25x
Network latency optimization: ~3-5x
Memory optimization: ~20-30%
Overall improvement: 15-20x

Medium-scale scenario (100-500 documents)

Query optimization: 62.5x - 312.5x
Network latency optimization: ~10-20x
Memory optimization: ~40-50%
Overall improvement: 100-200x

Large-scale scenario (1000+ documents)

Query optimization: 625x+
Network latency optimization: ~50x+
Memory optimization: ~60%+
Overall improvement: 500-1000x

Summary

Significant performance improvements:

Small scale (10-50 documents): 15-20x performance improvement
Medium scale (100-500 documents): 100-200x performance improvement
Large scale (1000+ documents): 500-1000x performance improvement

Optimize RetrievalService._deduplicate_documents: time complexity reduced from O(n^2) to O(n).
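
To illustrate the O(n^2) -> O(n) change, here is a minimal sketch of the general technique, not the code from this PR. The `Doc` class, the `dedup_key` helper, and the choice of `doc_id` as the identity field are all assumptions made for the example. The quadratic version compares each document against every document already kept; the linear version uses a dict keyed on the document's identity, so each membership check takes constant time on average.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document; not Dify's actual model."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def dedup_key(doc: Doc) -> tuple[str, str]:
    # Assumption: a "doc_id" in metadata identifies a document;
    # documents without one fall back to their content.
    doc_id = doc.metadata.get("doc_id")
    if doc_id is not None:
        return ("id", str(doc_id))
    return ("content", doc.page_content)


def deduplicate_quadratic(docs: list[Doc]) -> list[Doc]:
    # O(n^2): each candidate is compared against every document kept so far.
    unique: list[Doc] = []
    for doc in docs:
        if not any(dedup_key(doc) == dedup_key(kept) for kept in unique):
            unique.append(doc)
    return unique


def deduplicate_linear(docs: list[Doc]) -> list[Doc]:
    # O(n) on average: dict lookups are constant time, and dicts preserve
    # insertion order, so the first occurrence of each key is kept in order.
    seen: dict[tuple[str, str], Doc] = {}
    for doc in docs:
        seen.setdefault(dedup_key(doc), doc)
    return list(seen.values())
```

Both functions return the same documents in the same order; only the cost of the duplicate check differs. Whether the real method keeps the first occurrence or, for example, the highest-scoring one is not stated in this description, so the sketch simply keeps the first.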

Screenshots

Before	After
...	...

Checklist

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.
I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Original PR URL: langgenius#29981

Apply changes for benchmark PR

0a8dafc
