The following report details the rationale and evaluation process for comparing two versions of the search functionality on our AGENT platform. The goal of this evaluation is to assess which search system offers better relevance and ranking quality; we assessed this using industry-standard metrics drawn from information retrieval (IR) research (see References).
This evaluation is focused on two search functionalities:
- Dynamic relational search: a key-based search currently employed on our platforms, referred to as "OLD"
- Vector database search: a vector-based semantic search in testing, referred to as "NEW"
Limitation: The current search function only allows users to search studies published on the platform. Eventually, it will return results across a combined catalog of studies, models, tools, and cohorts. For the purposes of this evaluation, both search functions return results only for studies published on the platform.
Given time and resource constraints, a human relevance judgment approach was adopted, using # evaluators with knowledge of the domain. A set of representative queries was developed based on common user tasks and search goals. Each system's top 5 (or other #?) results for these queries were captured and scored manually for relevance.
We selected three well-established metrics from the information retrieval literature (see References):
- Precision@5 (P@5): Measures the proportion of relevant results in the top 5 results. Useful for judging immediate usefulness. [1, 2]
- Mean Reciprocal Rank (MRR): Considers the rank position of the first relevant result. High MRR means users likely find a relevant item quickly. [3, 4]
- Normalized Discounted Cumulative Gain at 5 (nDCG@5): Measures the usefulness of results ranked by position and graded relevance, with discounted value for lower-ranked relevant items. [5, 6, 7]
These metrics collectively capture both relevance and ranking quality.
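To make the definitions concrete, the sketch below computes all three metrics for a single query's ranked relevance grades. This is a minimal Python illustration, not the scoring-matrix implementation itself; it assumes the 0/1/2 relevance scale described below, binary relevance (grade ≥ 1) for P@5 and the reciprocal rank, and the linear-gain DCG formulation rel / log2(rank + 1).

```python
import math

def precision_at_k(grades, k=5):
    """Fraction of the top-k results judged relevant (grade >= 1)."""
    return sum(1 for g in grades[:k] if g >= 1) / k

def reciprocal_rank(grades):
    """1 / rank of the first relevant result; 0 if none is relevant."""
    for rank, g in enumerate(grades, start=1):
        if g >= 1:
            return 1 / rank
    return 0.0

def ndcg_at_k(grades, k=5):
    """DCG@k divided by the ideal DCG@k, using linear gain rel / log2(rank + 1)."""
    def dcg(gs):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gs, start=1))
    idcg = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / idcg if idcg > 0 else 0.0

# Example: grades for one query's top 5 results (2 = highly, 1 = somewhat, 0 = not relevant)
grades = [0, 2, 1, 0, 2]
print(precision_at_k(grades))        # 0.6
print(reciprocal_rank(grades))       # 0.5
print(round(ndcg_at_k(grades), 3))
```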
Each query’s results were manually assessed using the following relevance scale:
- 2 = Highly relevant
- 1 = Somewhat relevant
- 0 = Not relevant
- "-" = I don't know (evaluator provides commentary in lieu of a quantitative rank)
The table below captures all relevant information for each query/system combination:
| Query ID | Query Text | Filters Applied | System Version | Rank | Result Title | Relevance (0–2) | Reciprocal Rank | Log2(Rank+1) | DCG Contribution |
|---|---|---|---|---|---|---|---|---|---|
| Q1 | RATE | n/a | Old | 1 | ... | | | | |
| Q2 | RATE | Wearables | Old | 1 | ... | | | | |
| Q3 | Wearables | n/a | Old | 2 | ... | | | | |
| Q4 | n/a | Wearables | Old | 1 | ... | | | | |
| Q5 | RATE | n/a | Old | 1 | ... | | | | |
| Q6 | RATE | n/a | Old | 1 | ... | | | | |
| Q7 | RATE | n/a | Old | 1 | ... | | | | |
| Q8 | RATE | n/a | Old | 1 | ... | | | | |
| Q9 | RATE | n/a | Old | 1 | ... | | | | |
| Q10 | RATE | n/a | Old | 1 | ... | | | | |
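For reference, the derived columns in the scoring matrix (Reciprocal Rank, Log2(Rank+1), DCG Contribution) can be computed from Rank and Relevance alone. The helper below is an illustrative sketch rather than the actual spreadsheet formulas; it assumes the linear-gain contribution relevance / log2(rank + 1) and assumes the Reciprocal Rank cell is filled only on the row of the first relevant result for a query.

```python
import math

def derived_columns(rank, relevance, is_first_relevant):
    """Compute the scoring-matrix columns for a single result row.

    rank: 1-based position of the result.
    relevance: graded judgment (0, 1, or 2).
    is_first_relevant: True if this is the query's first relevant result (assumption).
    """
    log_discount = math.log2(rank + 1)
    return {
        "Reciprocal Rank": 1 / rank if is_first_relevant else None,
        "Log2(Rank+1)": round(log_discount, 4),
        # Linear gain; the exponential form (2**relevance - 1) is also common.
        "DCG Contribution": round(relevance / log_discount, 4),
    }

# Example row: result at rank 3 judged "somewhat relevant" (1),
# and it is the first relevant result for its query.
print(derived_columns(rank=3, relevance=1, is_first_relevant=True))
```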
The following steps were used to ensure consistency in the evaluation:
- A set of 5–10 representative user queries was defined in advance
- Top 5 results for each query were captured from both systems
- Relevance judgments were made based on domain knowledge
- Each metric (P@5, MRR, nDCG@5) was computed from the scoring matrix (see the aggregation sketch below)
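As a sketch of that aggregation step (not the actual spreadsheet formulas), per-query scores can be averaged into one figure per metric per system. The query IDs and score values below are illustrative placeholders, not evaluation results.

```python
from statistics import mean

def summarize(per_query_scores):
    """Average per-query metric scores into a single value per metric."""
    metrics = per_query_scores[next(iter(per_query_scores))].keys()
    return {m: round(mean(q[m] for q in per_query_scores.values()), 3) for m in metrics}

# Placeholder scores for one system; real values come from the scoring matrix.
old_system = {
    "Q1": {"P@5": 0.4, "RR": 1.0, "nDCG@5": 0.55},
    "Q2": {"P@5": 0.2, "RR": 0.5, "nDCG@5": 0.40},
}
print(summarize(old_system))  # {'P@5': 0.3, 'RR': 0.75, 'nDCG@5': 0.475}
```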
The evaluation is subject to the following limitations:
- Evaluator bias: The judgments reflect a small group of domain experts, which may introduce subjectivity
- Limited query set: A small number of test queries were used, limiting generalizability
- Offline testing only: Results were not validated with real users in live sessions or A/B tests
- Search behavior context omitted: The evaluation does not account for time to result, scrolling, or click behavior
Despite these limitations, the evaluation provides a directionally useful analysis of relative system performance.
Based on the metric results and qualitative observations, we recommend the following:
- If the vector search (NEW) consistently outperforms the current search (OLD) on MRR and nDCG@5, consider prioritizing its rollout
- Use the results to guide further user testing or A/B testing with live users
- Explore ways to blend keyword and semantic search if some queries perform better in one engine than the other
- Continue refining query processing and metadata quality, as both impact performance
Supporting files:
- Search Evaluation Evaluator Input.xlsx
- Documents/VisDes and Workflows/10. Search and Discovery/.03 Testing/Search Evaluation Evaluator Input - Search Scoring Matrix.md
- repo/data/search scoring matrix.md
Precision@5 (P@5): Of the top k results, how many were actually relevant; higher is better.
- What it measures
  - The proportion of relevant items in the top k results returned by the search engine.
  - Due to the small search catalog on AGENT (env), k here spans the entire catalog.
- Why it's recommended
  - It's simple, intuitive, and reflects what users see first (typically only the top 5-10 results)
  - Strongly aligned with real-world behavior: users rarely go beyond the first page of results
  - Useful for comparing how well two systems prioritize relevant results early in the ranking
- Sources
  - [1] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
  - [2] Sakai, T. (2007). On the Reliability of Information Retrieval Metrics.
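A minimal P@5 sketch is shown below. It makes two handling assumptions that the scoring matrix would need to settle explicitly: a "-" (I don't know) judgment is treated as not relevant, and the denominator stays at k even when a system returns fewer than k results.

```python
def precision_at_5(judgments, k=5):
    """P@5 for one query.

    judgments: graded scores for the returned results, in rank order;
    '-' marks an "I don't know" judgment.
    Assumptions (not specified in the evaluation plan): '-' counts as
    not relevant, and the denominator is always k.
    """
    top = judgments[:k]
    relevant = sum(1 for j in top if j != "-" and int(j) >= 1)
    return relevant / k

print(precision_at_5([2, "-", 1, 0]))  # 2 relevant among the returned results -> 2/5 = 0.4
```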
Mean Reciprocal Rank (MRR): How early does the first relevant result appear in the list? The sooner, the better, so higher is better.
- What it measures
  - The inverse of the rank at which the first relevant result appears, averaged across queries
- Why it's recommended
  - Especially effective for navigational queries (where the user is looking for a specific item)
  - Rewards systems that return at least one relevant result early in the list
  - Often used in QA systems and internal search where finding one "correct" thing is the goal
- Sources
  - [3] Voorhees, E. M. (1999). The TREC-8 Question Answering Track Evaluation.
  - [4] Manning et al. (2008), Introduction to Information Retrieval, Chapter 8
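The sketch below averages reciprocal ranks across queries. It assumes the common convention that a query with no relevant result in the captured top 5 contributes 0 to the mean; the evaluation plan does not state this explicitly.

```python
def mean_reciprocal_rank(queries):
    """MRR over a list of queries, each a list of relevance grades in rank order."""
    total = 0.0
    for grades in queries:
        rr = 0.0  # assumed convention: a query with no relevant result contributes 0
        for rank, g in enumerate(grades, start=1):
            if g >= 1:
                rr = 1 / rank
                break
        total += rr
    return total / len(queries)

# First relevant result at rank 1, rank 3, and never -> (1 + 1/3 + 0) / 3
print(round(mean_reciprocal_rank([[2, 0, 1], [0, 0, 1], [0, 0, 0]]), 3))  # 0.444
```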
Normalized Discounted Cumulative Gain at 5 (nDCG@5): How well are the most relevant results ranked near the top? Higher is better, with a perfect score of 1.
- What it measures
  - A graded relevance score that penalizes relevant results appearing lower in the ranking
  - Supports multi-level relevance judgments (e.g., highly relevant vs. somewhat relevant)
- Why it's recommended
  - Captures ranking quality more holistically than binary metrics
  - Suitable when some items are more relevant than others, which is especially common in search catalogs like ours
  - De facto standard for graded relevance in search engines, recommendation systems, and ML benchmarks
- Sources
  - [5] Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM TOIS.
  - [6] Microsoft Research (2008). Learning to Rank Challenge.
  - [7] Manning et al. (2008), Chapter 8.4
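To illustrate why graded relevance matters, the snippet below compares two rankings that contain the same judged items but order them differently: P@5 is identical for both, while nDCG@5 rewards the ranking that puts the highly relevant result first. It reuses the linear-gain formulation assumed in the earlier sketches.

```python
import math

def ndcg_at_5(grades, k=5):
    """nDCG@k with linear gain rel / log2(rank + 1)."""
    dcg = sum(g / math.log2(r + 1) for r, g in enumerate(grades[:k], start=1))
    ideal = sorted(grades, reverse=True)[:k]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Same judged items, different order: both rankings have P@5 = 0.4,
# but the one that surfaces the "highly relevant" item first scores higher.
print(round(ndcg_at_5([2, 1, 0, 0, 0]), 3))  # 1.0 (already the ideal ordering)
print(round(ndcg_at_5([0, 0, 1, 2, 0]), 3))  # lower, despite identical P@5
```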