benchmark eval changes #404
base: main
Conversation
- Introduced `analyze_dataset_sessions.py` for analyzing LongMemEval dataset sessions and generating statistics. - Added `.gitignore` files for both the main benchmarks and the LongMemEval directory to exclude unnecessary files. - Implemented logging in `zep_longmem_eval.py` for better traceability during data ingestion and evaluation. - Removed obsolete Jupyter notebook `zep_longmem_eval.ipynb` to streamline the project structure.
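As an illustration only, session statistics over a LongMemEval-style JSON file might be computed along these lines (the haystack_sessions field name follows the dataset's published format; the exact fields used by the PR's script are not shown here):

    import json
    import statistics

    def analyze_sessions(dataset_path: str) -> None:
        # Basic per-question session/message statistics for a LongMemEval-style file.
        with open(dataset_path) as f:
            questions = json.load(f)
        session_counts = [len(q.get("haystack_sessions", [])) for q in questions]
        message_counts = [sum(len(s) for s in q.get("haystack_sessions", [])) for q in questions]
        print(f"Questions: {len(questions)}")
        print(f"Sessions per question: mean={statistics.mean(session_counts):.1f}, max={max(session_counts)}")
        print(f"Messages per question: mean={statistics.mean(message_counts):.1f}, max={max(message_counts)}")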
…ialization - Removed the dry run functionality from the LongMemEvalRunner class, simplifying the initialization process. - Adjusted the logic for setting up the Zep and OpenAI clients based on the development environment. - Cleaned up the dataset download method to remove dry run checks, ensuring direct execution of the download process. - Updated argument parsing in the main function to eliminate the dry run option.
…tency - Standardized string formatting to use double quotes throughout the `analyze_dataset_sessions.py` and `zep_longmem_eval.py` files. - Cleaned up whitespace and indentation for better code clarity. - Enhanced logging messages for clarity during dataset loading and processing. - Updated function signatures for better readability and consistency in parameter formatting.
…gestion - Introduced a `start_index` parameter in the `LongMemEvalRunner` class to allow users to specify the starting point for data ingestion. - Updated logging to reflect the actual number of sessions being ingested based on the provided start index. - Modified the argument parser in the main function to include the new `--start-index` option for command-line usage.
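For context, a minimal sketch of how a --start-index flag is typically threaded into such a runner (the constructor call is assumed, not taken from the PR):

    import argparse

    def build_parser() -> argparse.ArgumentParser:
        # Hypothetical CLI wiring; the real script's other flags are omitted.
        parser = argparse.ArgumentParser(description="LongMemEval ingestion/evaluation runner")
        parser.add_argument(
            "--start-index",
            type=int,
            default=0,
            help="Index of the first session to ingest (everything before it is skipped)",
        )
        return parser

    if __name__ == "__main__":
        args = build_parser().parse_args()
        # runner = LongMemEvalRunner(start_index=args.start_index)  # constructor signature assumed
        print(f"Would start ingestion at index {args.start_index}")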
Caution
Changes requested ❌
Reviewed everything up to a008fed in 2 minutes and 34 seconds.
- Reviewed 292 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. benchmarks/longmemeval/queue_depth.py:4
- Draft comment:
Consider wrapping execution logic in a main() guard and adding error handling for file I/O. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% This is a new file in a benchmarks directory, likely meant for ad-hoc analysis. While error handling and main() guards are good practices, this seems like overkill for a simple benchmark script. The comment is technically correct but may not add value given the context and purpose of the file. The comment suggests good engineering practices that could make the code more robust. File I/O can fail and having proper error handling is generally good practice. Given this is a benchmark script in a benchmarks directory, it's likely meant to be simple and straightforward. Adding error handling would make it more verbose without adding much value in this specific context. The comment should be removed as it suggests overengineering a simple benchmark script. While technically correct, it violates the rule about not making comments unless there is clearly a code change required.
2. benchmarks/longmemeval/zep_longmem_eval.py:163
- Draft comment:
The tar.extractall() call uses filter="data", but filter expects a callable. Verify the intended usage for safe extraction. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. benchmarks/longmemeval/zep_longmem_eval.py:392
- Draft comment:
Ensure that DatasetManager, ZepIngester, and LLMEvaluator are imported or defined, as they are used in the init method. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% This is a legitimate issue - the code is trying to use classes that don't exist in the current file. This would cause runtime errors. The comment is pointing out a real problem that needs to be fixed. However, the phrasing "Ensure that..." makes it a verification request rather than a direct statement of the issue. The comment uses "Ensure that..." phrasing which violates our rules about verification requests. It could be more direct in stating the issue. While the phrasing could be better, the underlying issue is real and important - the code will not work without these missing classes being defined or imported. The comment identifies a critical issue that needs to be fixed, but should be rephrased to directly state the problem rather than asking for verification.
4. benchmarks/longmemeval/zep_longmem_eval.py:499
- Draft comment:
Method naming inconsistency: The grader method was renamed to 'grade_response' but is still called as 'lme_grader'. Update the call to match the new method name. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_exPxn4y2dvPa9J7C
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
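On the tar.extractall(filter=...) point in draft comment 2 above: on interpreters that implement PEP 706 (Python 3.12, and the backports in recent 3.10/3.11 point releases), the filter argument also accepts the string names "data", "tar", and "fully_trusted", so filter="data" is the documented safe-extraction shorthand there rather than an error; a callable is only one of the accepted forms. A minimal sketch of a version-tolerant call:

    import tarfile

    def safe_extract(archive_path: str, dest: str) -> None:
        # Extract an archive, using the stdlib "data" filter where the interpreter supports it.
        with tarfile.open(archive_path) as tar:
            if hasattr(tarfile, "data_filter"):
                # Filter support is present (PEP 706); the string name is accepted.
                tar.extractall(path=dest, filter="data")
            else:
                # Older interpreters: no filter argument at all.
                tar.extractall(path=dest)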
    global question_max
    question_max = max(question_max, len(question))

    print(f"Question max: {question_max}")
Replace the print() call with a logger call (e.g., logger.debug/info) for consistent logging.
Suggested change:
    - print(f"Question max: {question_max}")
    + self.logger.info(f"Question max: {question_max}")
…alysis - Introduced `calculate_scores.py` to compute question type scores from a JSONL file, providing detailed output of correct and total responses. - Added `queue_depth.py` to analyze queue metrics from a local JSON file, calculating the average task values and total tasks in the queue. - Updated documentation in `zep_longmem_eval.py` for clarity on fact representation and event timestamps.
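A rough sketch of what per-question-type scoring over a JSONL results file can look like (field names such as question_type and correct are assumptions, not the PR's actual schema):

    import json
    from collections import defaultdict

    def calculate_scores(results_path: str) -> None:
        # Tally correct vs. total responses per question type from a JSONL results file.
        correct = defaultdict(int)
        total = defaultdict(int)
        with open(results_path) as f:
            for line in f:
                row = json.loads(line)
                qtype = row["question_type"]  # assumed field name
                total[qtype] += 1
                if row["correct"]:            # assumed boolean field
                    correct[qtype] += 1
        for qtype in sorted(total):
            score = correct[qtype] / total[qtype]
            print(f"{qtype}: {correct[qtype]}/{total[qtype]} ({score:.1%})")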
Caution
Changes requested ❌
Reviewed 89703c3 in 2 minutes and 35 seconds.
- Reviewed 117 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 6 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. benchmarks/longmemeval/calculate_scores.py:55
- Draft comment:
Add a trailing newline at the end of the file to comply with POSIX standards. - Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
2. benchmarks/longmemeval/queue_depth.py:6
- Draft comment:
Guard against missing keys in the JSON by checking if the key exists before accessing it. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% This is a benchmark/evaluation script, not production code. The key seems to be a hardcoded UUID, suggesting this is for analyzing specific known data. Error handling might make the code more robust, but it's not clear if it's necessary for this use case. The comment doesn't provide strong evidence that missing keys are actually a problem here. The comment raises a valid defensive programming practice. Missing key handling could prevent crashes if the data format changes. While defensive programming is good practice, this appears to be a benchmark script for specific known data. Over-engineering error handling may not be valuable here. Delete the comment as it suggests adding complexity that may not be necessary for this benchmark script's specific use case.
3. benchmarks/longmemeval/queue_depth.py:15
- Draft comment:
Check if 'gwf' is non-empty before dividing to avoid a ZeroDivisionError. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The code assumes the key "GraphWorkflow-1fe79d02-2163-4b3b-8530-677551b358c6" exists and contains data. If the key doesn't exist, it would raise a KeyError before reaching the division. If the key exists but maps to an empty dict, a ZeroDivisionError could occur. However, this seems like a simple metrics script where the author knows their data structure. The comment identifies a real potential error case. Empty dictionaries could cause runtime errors that aren't immediately obvious. However, this appears to be a one-off metrics script working with known data. The hardcoded UUID suggests this is not meant to be robust production code. The comment, while technically correct, is overly defensive for a metrics script working with known data structure. It violates the rule about not making speculative comments.
4. benchmarks/longmemeval/zep_longmem_eval.py:33
- Draft comment:
Consider using 'datetime' consistently instead of splitting it as 'date time' in the context template for clarity. - Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
5. benchmarks/longmemeval/zep_longmem_eval.py:164
- Draft comment:
The 'filter' argument in tar.extractall must be a callable, not a string. Update or remove it to correctly filter archive members. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. benchmarks/longmemeval/calculate_scores.py:55
- Draft comment:
It appears there's no newline at the end of the file. Adding a trailing newline would adhere to common style guidelines. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% While having a newline at the end of file is a common convention and some tools expect it, this is a very minor style issue. It's something that could be automatically fixed by an autoformatter or linter. The comment doesn't point out a functional issue or suggest an important code quality improvement. Missing newlines at end of file can cause issues with some Unix tools and make diffs less clean. It's a real issue that some would consider worth fixing. While true, this is exactly the kind of minor style issue that should be handled by automation rather than code review comments. It doesn't affect functionality and isn't worth the reviewer's or author's time. Delete the comment as it points out a very minor style issue that would be better handled by automation.
Workflow ID: wflow_EFBWQlFDAHEwXhcy
Important
Looks good to me! 👍
Reviewed c0c0196 in 1 minute and 11 seconds.
- Reviewed 13 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 1 draft comment. View it below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. benchmarks/longmemeval/zep_longmem_eval.py:34
- Draft comment:
Grammar fix: Changed 'date time' to 'datetime' for clarity. - Reason this comment was not posted:
Comment looked like it was already resolved.
Workflow ID: wflow_sw93zeiQsSC16uJF
…luation, and summarization - Deleted `analyze_dataset_sessions.py` as part of the refactoring process. - Introduced `ingestion.py` for data ingestion into the Zep knowledge graph with detailed logging. - Added `evaluation.py` for evaluating conversations using Zep context retrieval. - Created `summarization.py` for summarization utilities, enhancing context composition. - Implemented `common.py` for shared constants and utility functions. - Established `zep_ontology.py` to define custom entity and edge types for Zep knowledge graphs. - Updated `zep_longmem_eval.py` to orchestrate ingestion and evaluation processes with new runners.
- Removed unnecessary comments and consolidated entity and edge type sections for clarity. - Eliminated unused fields from the OccurredAt and WorksFor edge types. - Simplified the setup_zep_ontology function by directly setting the ontology without exception handling or summary printing. - Improved overall readability and maintainability of the code.
- Introduced asynchronous processing for user sessions to improve concurrency and efficiency. - Added detailed logging for user and session creation, including error handling and message addition status. - Implemented a semaphore to limit concurrent user processing and a lock for thread-safe CSV writing. - Refactored the ingestion method to utilize the new `_process_user` method for better code organization and readability. - Ensured custom ontology setup is performed asynchronously if requested.
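A simplified sketch of the concurrency pattern described here (semaphore-bounded user processing plus a lock around result writes); the limit of 10 and the ingest_user helper are illustrative, not the PR's actual names:

    import asyncio
    import csv

    MAX_CONCURRENT_USERS = 10  # assumed limit; the PR's actual value may differ

    async def ingest_user(user_id: str) -> str:
        # Stand-in for the real Zep ingestion call.
        await asyncio.sleep(0.1)
        return "ok"

    async def process_user(user_id, sem, lock, writer):
        async with sem:              # cap how many users are processed at once
            status = await ingest_user(user_id)
            async with lock:         # keep CSV writes from interleaving
                writer.writerow([user_id, status])

    async def run(user_ids, csv_path="results.csv"):
        sem = asyncio.Semaphore(MAX_CONCURRENT_USERS)
        lock = asyncio.Lock()
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            await asyncio.gather(*(process_user(u, sem, lock, writer) for u in user_ids))

    # asyncio.run(run(["user_0", "user_1", "user_2"]))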
- Implemented chunking and contextualization for large messages to ensure they fit within memory constraints. - Introduced new methods for handling large messages, including `_chunk_large_message` and `_contextualize_chunk`. - Updated the ingestion logic to process large messages separately, maintaining detailed logging of chunking results. - Enhanced the Zep ontology setup function to support asynchronous operations. - Improved overall code organization and readability in the ingestion module.
- Introduced replay mode for re-ingesting failed users, improving error handling and robustness. - Updated `IngestionRunner` to support replay functionality and detailed logging of ingestion results. - Refactored `EvaluationRunner` to improve prompt clarity and response generation logic. - Enhanced chunking logic in `IngestionRunner` for better handling of large messages. - Improved overall code organization and readability across multiple modules.
- Updated `EvaluationRunner` to support configurable model parameters for response, grading, and summarization. - Introduced a new `grid_search.py` module for systematic evaluation of parameter combinations, improving experimentation efficiency. - Added `grid_search_config.yaml` for defining search parameters and evaluation settings. - Implemented a `RerankerFactory` to facilitate reranking options, including integration with MxbaiRerankV2. - Enhanced summarization logic to include message context in summaries, improving response relevance. - Removed outdated session analysis summary file to streamline project structure.
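As a rough illustration of the grid-search idea (the YAML key search_params and the evaluate() entry point below are placeholders, not the PR's actual schema):

    import itertools
    import yaml

    def iter_param_grid(config_path: str):
        # Expand every combination of the listed parameter values.
        with open(config_path) as f:
            grid = yaml.safe_load(f)["search_params"]  # assumed top-level key
        keys = list(grid)
        for values in itertools.product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))

    if __name__ == "__main__":
        for params in iter_param_grid("grid_search_config.yaml"):
            print(f"Evaluating with {params}")
            # score = evaluate(**params)  # hypothetical evaluation entry point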
- Introduced `bayesian_search.py` for implementing Bayesian optimization using Optuna, enhancing the evaluation process. - Configured logging and output management for search results, ensuring reproducibility and traceability. - Updated `common.py` with new summarization prompt templates for improved context extraction. - Enhanced `EvaluationRunner` to support dynamic summarization strategies based on configuration. - Removed outdated `grid_search_config.yaml` to streamline configuration management and replaced it with `search_config.yaml` for unified search settings.
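A minimal Optuna sketch of the kind of search loop this describes (the parameter names and the objective body are assumptions; the PR's real search space lives in search_config.yaml):

    import optuna

    def objective(trial: optuna.Trial) -> float:
        # Hypothetical search space for retrieval/response settings.
        top_k = trial.suggest_int("top_k", 5, 30)
        temperature = trial.suggest_float("temperature", 0.0, 1.0)
        print(f"Trial with top_k={top_k}, temperature={temperature:.2f}")
        # score = run_evaluation(top_k=top_k, temperature=temperature)  # assumed entry point
        score = 0.0  # placeholder so the sketch runs standalone
        return score

    if __name__ == "__main__":
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=20)
        print("Best params:", study.best_params)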
- Added resolution markers for Python version compatibility. - Introduced new packages: `accelerate`, `aiohappyeyeballs`, `aiohttp`, `aiolimiter`, `aiosignal`, `alembic`, `yarl`, `zep-cloud`, and `zstandard`, along with their respective versions and dependencies. - Updated existing package entries with detailed source and wheel information for better package management.
…ement - Integrated `tiktoken` for measuring context length and counting tokens in the `EvaluationRunner`. - Added methods to count tokens and characters in context, improving evaluation metrics. - Updated dependencies in `pyproject.toml` to include `tiktoken` for enhanced functionality. - Removed unnecessary whitespace in `ingestion.py` for cleaner code.
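For reference, counting tokens and characters with tiktoken typically looks like this (the cl100k_base encoding is an assumption; the PR may pin a different model or encoding):

    import tiktoken

    def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
        # Token count under the given tiktoken encoding.
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))

    def count_chars(text: str) -> int:
        return len(text)

    context = "Example retrieved context from the knowledge graph."
    print(count_tokens(context), "tokens,", count_chars(context), "characters")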
- Added concurrency control to the ingestion process, allowing up to 10 concurrent users during data ingestion. - Implemented checkpointing functionality to track completed and failed users, enabling resumption of ingestion from the last successful state. - Updated the benchmark configuration to include concurrency settings. - Enhanced logging and user feedback during ingestion and evaluation processes. - Introduced a Makefile for easier development tasks, including testing, formatting, and linting. - Improved documentation to reflect new features and usage instructions.
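A simplified sketch of file-based checkpointing of completed and failed users (the JSON layout and file name are assumptions; the PR's format is not shown here). Note that only completed users are skipped on resume, so failed users get retried:

    import json
    from pathlib import Path

    CHECKPOINT = Path("ingestion_checkpoint.json")  # assumed file name

    def load_checkpoint() -> dict:
        if CHECKPOINT.exists():
            return json.loads(CHECKPOINT.read_text())
        return {"completed": [], "failed": []}

    def save_checkpoint(state: dict) -> None:
        CHECKPOINT.write_text(json.dumps(state, indent=2))

    def remaining_users(all_users: list[str], state: dict) -> list[str]:
        # Skip only completed users; failed users remain in the work list for retry.
        done = set(state["completed"])
        return [u for u in all_users if u not in done]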
…xtualization - Implemented chunking and contextualization for large messages to improve handling and integration into the ingestion process. - Added new constants for message size limits and chunking parameters to enhance configurability. - Updated the ingestion logic to process messages in batches, ensuring efficient handling of both normal and large messages. - Improved error handling and logging for large message processing, providing better insights during ingestion. - Refactored code for clarity and maintainability, including the addition of helper methods for chunking and contextualization.
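A rough sketch of the size-gated chunking this describes (constant values and the contextual prefix are placeholders; the PR's _chunk_large_message and _contextualize_chunk methods may behave differently):

    MAX_MESSAGE_CHARS = 8000   # assumed ingestion size limit
    CHUNK_SIZE = 6000          # assumed chunk size
    CHUNK_OVERLAP = 500        # assumed overlap between consecutive chunks

    def chunk_large_message(text: str) -> list[str]:
        # Split an oversized message into overlapping chunks that fit the limit.
        if len(text) <= MAX_MESSAGE_CHARS:
            return [text]
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start : start + CHUNK_SIZE])
            start += CHUNK_SIZE - CHUNK_OVERLAP
        return chunks

    def contextualize_chunk(chunk: str, index: int, total: int) -> str:
        # Prefix each chunk so it still makes sense when ingested in isolation.
        return f"[Part {index + 1}/{total} of a longer message]\n{chunk}"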
…nd ingestion logic - Eliminated the concurrency setting from the benchmark configuration and related classes, simplifying the ingestion process. - Updated the ingestion runner to no longer require concurrency parameters, streamlining user processing. - Adjusted documentation and tests to reflect the removal of concurrency features, ensuring clarity and consistency.
… user processing - Adjusted user processing to only count completed users when resuming from a checkpoint, allowing for retries of failed users. - Enhanced reporting to distinguish between new users and retries, providing clearer feedback during ingestion. - Updated checkpoint management to prevent duplicate entries for completed and failed users, ensuring accurate tracking.
Important
This pull request adds scripts for evaluating Locomo and LongMemEval datasets using Zep's knowledge graph and OpenAI's LLM, including ingestion, response generation, grading, and analysis functionalities.
- zep_locomo_eval.py for grading responses using OpenAI's LLM with a custom grading prompt.
- zep_locomo_ingestion.py for ingesting the Locomo dataset into Zep's knowledge graph.
- zep_locomo_responses.py for generating responses using OpenAI's LLM based on Zep's context.
- zep_locomo_search.py for searching and composing context from Zep's knowledge graph.
- zep_eval.py and zep_responses.py for grading and generating responses using OpenAI's LLM.
- analyze_dataset_sessions.py for analyzing session statistics from the LongMemEval dataset.
- calculate_scores.py for calculating scores by question type from JSONL results.
- queue_depth.py for analyzing queue metrics from a JSON file.
- zep_longmem_eval.py for a comprehensive evaluation pipeline including ingestion, response generation, and grading using Zep and OpenAI's LLM.

This description was created by
for c0c0196. You can customize this summary. It will automatically update as commits are pushed.