The following report details the rationale and evaluation process for comparing two versions of the search functionality on our AGENT platform. The goal of this evaluation is to assess which search system offers better relevance and ranking quality; we assessed this using industry-standard metrics drawn from information retrieval (IR) research (see References).
This evaluation is focused on two search functionalities:
- Dynamic relational search: a key-based search currently employed on our platforms, referred to as "OLD"
- Vector database search: a vector-based semantic search in testing, referred to as "NEW"
Limitation: The current search function only allows users to search studies published on the platform. Eventually, it will return results across a combined catalog of studies, models, tools, and cohorts. For the purposes of this evaluation, both search functions return results only for studies published on the platform.
Given time and resource constraints, a human relevance judgment approach was adopted, using # evaluators with knowledge of the domain. A set of representative queries was developed based on common user tasks and search goals. Each system's top 5 (or other #?) results for these queries were captured and scored manually for relevance.
We selected three well-established metrics from the information retrieval literature (see References):
- Precision@5 (P@5): Measures the proportion of relevant results in the top 5 results. Useful for judging immediate usefulness. [1, 2]
- Mean Reciprocal Rank (MRR): Considers the rank position of the first relevant result. High MRR means users likely find a relevant item quickly. [3, 4]
- Normalized Discounted Cumulative Gain at 5 (nDCG@5): Measures the usefulness of results ranked by position and graded relevance, with discounted value for lower-ranked relevant items. [5, 6, 7]
These metrics collectively capture both relevance and ranking quality.
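To make the definitions concrete, the sketch below computes all three metrics for a single query's ranked relevance grades. This is a minimal Python illustration, not the scoring-matrix implementation itself; it assumes the 0/1/2 relevance scale described below, binary relevance (grade ≥ 1) for P@5 and the reciprocal rank, and the linear-gain DCG formulation rel / log2(rank + 1).

```python
import math

def precision_at_k(grades, k=5):
    """Fraction of the top-k results judged relevant (grade >= 1)."""
    return sum(1 for g in grades[:k] if g >= 1) / k

def reciprocal_rank(grades):
    """1 / rank of the first relevant result; 0 if none is relevant."""
    for rank, g in enumerate(grades, start=1):
        if g >= 1:
            return 1 / rank
    return 0.0

def ndcg_at_k(grades, k=5):
    """DCG@k divided by the ideal DCG@k, using linear gain rel / log2(rank + 1)."""
    def dcg(gs):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gs, start=1))
    idcg = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / idcg if idcg > 0 else 0.0

# Example: grades for one query's top 5 results (2 = highly, 1 = somewhat, 0 = not relevant)
grades = [0, 2, 1, 0, 2]
print(precision_at_k(grades))        # 0.6
print(reciprocal_rank(grades))       # 0.5
print(round(ndcg_at_k(grades), 3))
```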
Each query’s results were manually assessed using the following relevance scale:
- 2 = Highly relevant
- 1 = Somewhat relevant
- 0 = Not relevant
- "-" = I don't know (evaluator provides commentary in lieu of a quantitative rank)
The table below captures all relevant information for each query/system combination:
| Query ID | Query Text | Filters Applied | System Version | Rank | Result Title | Relevance (0–2) | Reciprocal Rank | Log2(Rank+1) | DCG Contribution |
|---|---|---|---|---|---|---|---|---|---|
| Q1 | RATE | n/a | Old | 1 | ... | | | | |
| Q2 | RATE | Wearables | Old | 1 | ... | | | | |
| Q3 | Wearables | n/a | Old | 2 | ... | | | | |
| Q4 | n/a | Wearables | Old | 1 | ... | | | | |
| Q5 | RATE | n/a | Old | 1 | ... | | | | |
| Q6 | RATE | n/a | Old | 1 | ... | | | | |
| Q7 | RATE | n/a | Old | 1 | ... | | | | |
| Q8 | RATE | n/a | Old | 1 | ... | | | | |
| Q9 | RATE | n/a | Old | 1 | ... | | | | |
| Q10 | RATE | n/a | Old | 1 | ... | | | | |
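For reference, the derived columns in the scoring matrix (Reciprocal Rank, Log2(Rank+1), DCG Contribution) can be computed from Rank and Relevance alone. The helper below is an illustrative sketch rather than the actual spreadsheet formulas; it assumes the linear-gain contribution relevance / log2(rank + 1) and assumes the Reciprocal Rank cell is filled only on the row of the first relevant result for a query.

```python
import math

def derived_columns(rank, relevance, is_first_relevant):
    """Compute the scoring-matrix columns for a single result row.

    rank: 1-based position of the result.
    relevance: graded judgment (0, 1, or 2).
    is_first_relevant: True if this is the query's first relevant result (assumption).
    """
    log_discount = math.log2(rank + 1)
    return {
        "Reciprocal Rank": 1 / rank if is_first_relevant else None,
        "Log2(Rank+1)": round(log_discount, 4),
        # Linear gain; the exponential form (2**relevance - 1) is also common.
        "DCG Contribution": round(relevance / log_discount, 4),
    }

# Example row: result at rank 3 judged "somewhat relevant" (1),
# and it is the first relevant result for its query.
print(derived_columns(rank=3, relevance=1, is_first_relevant=True))
```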
The following steps were used to ensure consistency in the evaluation:
- A set of 5–10 representative user queries was defined in advance
- Top 5 results for each query were captured from both systems
- Relevance judgments were made based on domain knowledge
- Each metric (P@5, MRR, nDCG@5) was computed from the scoring matrix (see the aggregation sketch below)
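As a sketch of that aggregation step (not the actual spreadsheet formulas), per-query scores can be averaged into one figure per metric per system. The query IDs and score values below are illustrative placeholders, not evaluation results.

```python
from statistics import mean

def summarize(per_query_scores):
    """Average per-query metric scores into a single value per metric."""
    metrics = per_query_scores[next(iter(per_query_scores))].keys()
    return {m: round(mean(q[m] for q in per_query_scores.values()), 3) for m in metrics}

# Placeholder scores for one system; real values come from the scoring matrix.
old_system = {
    "Q1": {"P@5": 0.4, "RR": 1.0, "nDCG@5": 0.55},
    "Q2": {"P@5": 0.2, "RR": 0.5, "nDCG@5": 0.40},
}
print(summarize(old_system))  # {'P@5': 0.3, 'RR': 0.75, 'nDCG@5': 0.475}
```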
The evaluation is subject to the following limitations:
- Evaluator bias: The judgments reflect a small group of domain experts, which may introduce subjectivity
- Limited query set: A small number of test queries were used, limiting generalizability
- Offline testing only: Results were not validated with real users in live sessions or A/B tests
- Search behavior context omitted: The evaluation does not account for time to result, scrolling, or click behavior
Despite these limitations, the evaluation provides a directionally useful analysis of relative system performance.
Based on the metric results and qualitative observations, we recommend the following:
- If the vector search (NEW) consistently outperforms the current search (OLD) on MRR and nDCG@5, consider prioritizing its rollout
- Use the results to guide further user testing or A/B testing with live users
- Explore ways to blend keyword and semantic search if some queries perform better in one engine than the other
- Continue refining query processing and metadata quality, as both impact performance
Supporting files:
- Search Evaluation Evaluator Input.xlsx
- Documents/VisDes and Workflows/10. Search and Discovery/.03 Testing/Search Evaluation Evaluator Input - Search Scoring Matrix.md
- repo/data/search scoring matrix.md
Precision@5 (P@5): Of the top k results, how many were actually relevant; higher is better.
- What it measures
  - The proportion of relevant items in the top k results returned by the search engine.
  - Due to the small search catalog on AGENT (env), k here spans the entire catalog.
- Why it's recommended
  - It's simple, intuitive, and reflects what users see first (typically only the top 5-10 results)
  - Strongly aligned with real-world behavior: users rarely go beyond the first page of results
  - Useful for comparing how well two systems prioritize relevant results early in the ranking
- Sources
  - [1] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
  - [2] Sakai, T. (2007). On the Reliability of Information Retrieval Metrics.
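A minimal P@5 sketch is shown below. It makes two handling assumptions that the scoring matrix would need to settle explicitly: a "-" (I don't know) judgment is treated as not relevant, and the denominator stays at k even when a system returns fewer than k results.

```python
def precision_at_5(judgments, k=5):
    """P@5 for one query.

    judgments: graded scores for the returned results, in rank order;
    '-' marks an "I don't know" judgment.
    Assumptions (not specified in the evaluation plan): '-' counts as
    not relevant, and the denominator is always k.
    """
    top = judgments[:k]
    relevant = sum(1 for j in top if j != "-" and int(j) >= 1)
    return relevant / k

print(precision_at_5([2, "-", 1, 0]))  # 2 relevant among the returned results -> 2/5 = 0.4
```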
Mean Reciprocal Rank (MRR): How early does the first relevant result appear in the list? The sooner, the better, so higher is better.
- What it measures
  - The inverse of the rank at which the first relevant result appears, averaged across queries
- Why it's recommended
  - Especially effective for navigational queries (where the user is looking for a specific item)
  - Rewards systems that return at least one relevant result early in the list
  - Often used in QA systems and internal search where finding one "correct" thing is the goal
- Sources
  - [3] Voorhees, E. M. (1999). The TREC-8 Question Answering Track Evaluation.
  - [4] Manning et al. (2008), Introduction to Information Retrieval, Chapter 8
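The sketch below averages reciprocal ranks across queries. It assumes the common convention that a query with no relevant result in the captured top 5 contributes 0 to the mean; the evaluation plan does not state this explicitly.

```python
def mean_reciprocal_rank(queries):
    """MRR over a list of queries, each a list of relevance grades in rank order."""
    total = 0.0
    for grades in queries:
        rr = 0.0  # assumed convention: a query with no relevant result contributes 0
        for rank, g in enumerate(grades, start=1):
            if g >= 1:
                rr = 1 / rank
                break
        total += rr
    return total / len(queries)

# First relevant result at rank 1, rank 3, and never -> (1 + 1/3 + 0) / 3
print(round(mean_reciprocal_rank([[2, 0, 1], [0, 0, 1], [0, 0, 0]]), 3))  # 0.444
```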
Normalized Discounted Cumulative Gain at 5 (nDCG@5): How well are the most relevant results ranked near the top? Higher is better, with a perfect score of 1.
- What it measures
  - A graded relevance score that penalizes relevant results appearing lower in the ranking
  - Supports multi-level relevance judgments (e.g., highly relevant vs. somewhat relevant)
- Why it's recommended
  - Captures ranking quality more holistically than binary metrics
  - Suitable when some items are more relevant than others, which is especially common in search catalogs like ours
  - De facto standard for graded relevance in search engines, recommendation systems, and ML benchmarks
- Sources
  - [5] Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM TOIS.
  - [6] Microsoft Research (2008). Learning to Rank Challenge.
  - [7] Manning et al. (2008), Chapter 8.4
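To illustrate why graded relevance matters, the snippet below compares two rankings that contain the same judged items but order them differently: P@5 is identical for both, while nDCG@5 rewards the ranking that puts the highly relevant result first. It reuses the linear-gain formulation assumed in the earlier sketches.

```python
import math

def ndcg_at_5(grades, k=5):
    """nDCG@k with linear gain rel / log2(rank + 1)."""
    dcg = sum(g / math.log2(r + 1) for r, g in enumerate(grades[:k], start=1))
    ideal = sorted(grades, reverse=True)[:k]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Same judged items, different order: both rankings have P@5 = 0.4,
# but the one that surfaces the "highly relevant" item first scores higher.
print(round(ndcg_at_5([2, 1, 0, 0, 0]), 3))  # 1.0 (already the ideal ordering)
print(round(ndcg_at_5([0, 0, 1, 2, 0]), 3))  # lower, despite identical P@5
```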