benchmark eval changes #404
base: main
Conversation
- Introduced `analyze_dataset_sessions.py` for analyzing LongMemEval dataset sessions and generating statistics. - Added `.gitignore` files for both the main benchmarks and the LongMemEval directory to exclude unnecessary files. - Implemented logging in `zep_longmem_eval.py` for better traceability during data ingestion and evaluation. - Removed obsolete Jupyter notebook `zep_longmem_eval.ipynb` to streamline the project structure.
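As an illustration only, session statistics over a LongMemEval-style JSON file might be computed along these lines (the haystack_sessions field name follows the dataset's published format; the exact fields used by the PR's script are not shown here):

    import json
    import statistics

    def analyze_sessions(dataset_path: str) -> None:
        # Basic per-question session/message statistics for a LongMemEval-style file.
        with open(dataset_path) as f:
            questions = json.load(f)
        session_counts = [len(q.get("haystack_sessions", [])) for q in questions]
        message_counts = [sum(len(s) for s in q.get("haystack_sessions", [])) for q in questions]
        print(f"Questions: {len(questions)}")
        print(f"Sessions per question: mean={statistics.mean(session_counts):.1f}, max={max(session_counts)}")
        print(f"Messages per question: mean={statistics.mean(message_counts):.1f}, max={max(message_counts)}")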
…ialization - Removed the dry run functionality from the LongMemEvalRunner class, simplifying the initialization process. - Adjusted the logic for setting up the Zep and OpenAI clients based on the development environment. - Cleaned up the dataset download method to remove dry run checks, ensuring direct execution of the download process. - Updated argument parsing in the main function to eliminate the dry run option.
…tency - Standardized string formatting to use double quotes throughout the `analyze_dataset_sessions.py` and `zep_longmem_eval.py` files. - Cleaned up whitespace and indentation for better code clarity. - Enhanced logging messages for clarity during dataset loading and processing. - Updated function signatures for better readability and consistency in parameter formatting.
…gestion - Introduced a `start_index` parameter in the `LongMemEvalRunner` class to allow users to specify the starting point for data ingestion. - Updated logging to reflect the actual number of sessions being ingested based on the provided start index. - Modified the argument parser in the main function to include the new `--start-index` option for command-line usage.
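For context, a minimal sketch of how a --start-index flag is typically threaded into such a runner (the constructor call is assumed, not taken from the PR):

    import argparse

    def build_parser() -> argparse.ArgumentParser:
        # Hypothetical CLI wiring; the real script's other flags are omitted.
        parser = argparse.ArgumentParser(description="LongMemEval ingestion/evaluation runner")
        parser.add_argument(
            "--start-index",
            type=int,
            default=0,
            help="Index of the first session to ingest (everything before it is skipped)",
        )
        return parser

    if __name__ == "__main__":
        args = build_parser().parse_args()
        # runner = LongMemEvalRunner(start_index=args.start_index)  # constructor signature assumed
        print(f"Would start ingestion at index {args.start_index}")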
Caution
Changes requested ❌
Reviewed everything up to a008fed in 2 minutes and 34 seconds.
- Reviewed 292 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. benchmarks/longmemeval/queue_depth.py:4
- Draft comment:
Consider wrapping execution logic in a main() guard and adding error handling for file I/O. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% This is a new file in a benchmarks directory, likely meant for ad-hoc analysis. While error handling and main() guards are good practices, this seems like overkill for a simple benchmark script. The comment is technically correct but may not add value given the context and purpose of the file. The comment suggests good engineering practices that could make the code more robust. File I/O can fail and having proper error handling is generally good practice. Given this is a benchmark script in a benchmarks directory, it's likely meant to be simple and straightforward. Adding error handling would make it more verbose without adding much value in this specific context. The comment should be removed as it suggests overengineering a simple benchmark script. While technically correct, it violates the rule about not making comments unless there is clearly a code change required.
2. benchmarks/longmemeval/zep_longmem_eval.py:163
- Draft comment:
The tar.extractall() call uses filter="data", but filter expects a callable. Verify the intended usage for safe extraction. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. benchmarks/longmemeval/zep_longmem_eval.py:392
- Draft comment:
Ensure that DatasetManager, ZepIngester, and LLMEvaluator are imported or defined, as they are used in the init method. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% This is a legitimate issue - the code is trying to use classes that don't exist in the current file. This would cause runtime errors. The comment is pointing out a real problem that needs to be fixed. However, the phrasing "Ensure that..." makes it a verification request rather than a direct statement of the issue. The comment uses "Ensure that..." phrasing which violates our rules about verification requests. It could be more direct in stating the issue. While the phrasing could be better, the underlying issue is real and important - the code will not work without these missing classes being defined or imported. The comment identifies a critical issue that needs to be fixed, but should be rephrased to directly state the problem rather than asking for verification.
4. benchmarks/longmemeval/zep_longmem_eval.py:499
- Draft comment:
Method naming inconsistency: The grader method was renamed to 'grade_response' but is still called as 'lme_grader'. Update the call to match the new method name. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_exPxn4y2dvPa9J7C
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
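On the tar.extractall(filter=...) point in draft comment 2 above: on interpreters that implement PEP 706 (Python 3.12, and the backports in recent 3.10/3.11 point releases), the filter argument also accepts the string names "data", "tar", and "fully_trusted", so filter="data" is the documented safe-extraction shorthand there rather than an error; a callable is only one of the accepted forms. A minimal sketch of a version-tolerant call:

    import tarfile

    def safe_extract(archive_path: str, dest: str) -> None:
        # Extract an archive, using the stdlib "data" filter where the interpreter supports it.
        with tarfile.open(archive_path) as tar:
            if hasattr(tarfile, "data_filter"):
                # Filter support is present (PEP 706); the string name is accepted.
                tar.extractall(path=dest, filter="data")
            else:
                # Older interpreters: no filter argument at all.
                tar.extractall(path=dest)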
    global question_max
    question_max = max(question_max, len(question))

    print(f"Question max: {question_max}")
Replace the print() call with a logger call (e.g., logger.debug/info) for consistent logging.
Suggested change:
    - print(f"Question max: {question_max}")
    + self.logger.info(f"Question max: {question_max}")
…alysis - Introduced `calculate_scores.py` to compute question type scores from a JSONL file, providing detailed output of correct and total responses. - Added `queue_depth.py` to analyze queue metrics from a local JSON file, calculating the average task values and total tasks in the queue. - Updated documentation in `zep_longmem_eval.py` for clarity on fact representation and event timestamps.
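A rough sketch of what per-question-type scoring over a JSONL results file can look like (field names such as question_type and correct are assumptions, not the PR's actual schema):

    import json
    from collections import defaultdict

    def calculate_scores(results_path: str) -> None:
        # Tally correct vs. total responses per question type from a JSONL results file.
        correct = defaultdict(int)
        total = defaultdict(int)
        with open(results_path) as f:
            for line in f:
                row = json.loads(line)
                qtype = row["question_type"]  # assumed field name
                total[qtype] += 1
                if row["correct"]:            # assumed boolean field
                    correct[qtype] += 1
        for qtype in sorted(total):
            score = correct[qtype] / total[qtype]
            print(f"{qtype}: {correct[qtype]}/{total[qtype]} ({score:.1%})")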
Caution
Changes requested ❌
Reviewed 89703c3 in 2 minutes and 35 seconds.
- Reviewed 117 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 6 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. benchmarks/longmemeval/calculate_scores.py:55
- Draft comment:
Add a trailing newline at the end of the file to comply with POSIX standards. - Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
2. benchmarks/longmemeval/queue_depth.py:6
- Draft comment:
Guard against missing keys in the JSON by checking if the key exists before accessing it. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% This is a benchmark/evaluation script, not production code. The key seems to be a hardcoded UUID, suggesting this is for analyzing specific known data. Error handling might make the code more robust, but it's not clear if it's necessary for this use case. The comment doesn't provide strong evidence that missing keys are actually a problem here. The comment raises a valid defensive programming practice. Missing key handling could prevent crashes if the data format changes. While defensive programming is good practice, this appears to be a benchmark script for specific known data. Over-engineering error handling may not be valuable here. Delete the comment as it suggests adding complexity that may not be necessary for this benchmark script's specific use case.
3. benchmarks/longmemeval/queue_depth.py:15
- Draft comment:
Check if 'gwf' is non-empty before dividing to avoid a ZeroDivisionError. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The code assumes the key "GraphWorkflow-1fe79d02-2163-4b3b-8530-677551b358c6" exists and contains data. If the key doesn't exist, it would raise a KeyError before reaching the division. If the key exists but maps to an empty dict, a ZeroDivisionError could occur. However, this seems like a simple metrics script where the author knows their data structure. The comment identifies a real potential error case. Empty dictionaries could cause runtime errors that aren't immediately obvious. However, this appears to be a one-off metrics script working with known data. The hardcoded UUID suggests this is not meant to be robust production code. The comment, while technically correct, is overly defensive for a metrics script working with known data structure. It violates the rule about not making speculative comments.
4. benchmarks/longmemeval/zep_longmem_eval.py:33
- Draft comment:
Consider using 'datetime' consistently instead of splitting it as 'date time' in the context template for clarity. - Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
5. benchmarks/longmemeval/zep_longmem_eval.py:164
- Draft comment:
The 'filter' argument in tar.extractall must be a callable, not a string. Update or remove it to correctly filter archive members. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. benchmarks/longmemeval/calculate_scores.py:55
- Draft comment:
It appears there's no newline at the end of the file. Adding a trailing newline would adhere to common style guidelines. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% While having a newline at the end of file is a common convention and some tools expect it, this is a very minor style issue. It's something that could be automatically fixed by an autoformatter or linter. The comment doesn't point out a functional issue or suggest an important code quality improvement. Missing newlines at end of file can cause issues with some Unix tools and make diffs less clean. It's a real issue that some would consider worth fixing. While true, this is exactly the kind of minor style issue that should be handled by automation rather than code review comments. It doesn't affect functionality and isn't worth the reviewer's or author's time. Delete the comment as it points out a very minor style issue that would be better handled by automation.
Workflow ID: wflow_EFBWQlFDAHEwXhcy
Important
Looks good to me! 👍
Reviewed c0c0196 in 1 minute and 11 seconds.
- Reviewed 13 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 1 draft comment. View it below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. benchmarks/longmemeval/zep_longmem_eval.py:34
- Draft comment:
Grammar fix: Changed 'date time' to 'datetime' for clarity. - Reason this comment was not posted:
Comment looked like it was already resolved.
Workflow ID: wflow_sw93zeiQsSC16uJF
…luation, and summarization - Deleted `analyze_dataset_sessions.py` as part of the refactoring process. - Introduced `ingestion.py` for data ingestion into the Zep knowledge graph with detailed logging. - Added `evaluation.py` for evaluating conversations using Zep context retrieval. - Created `summarization.py` for summarization utilities, enhancing context composition. - Implemented `common.py` for shared constants and utility functions. - Established `zep_ontology.py` to define custom entity and edge types for Zep knowledge graphs. - Updated `zep_longmem_eval.py` to orchestrate ingestion and evaluation processes with new runners.
- Removed unnecessary comments and consolidated entity and edge type sections for clarity. - Eliminated unused fields from the OccurredAt and WorksFor edge types. - Simplified the setup_zep_ontology function by directly setting the ontology without exception handling or summary printing. - Improved overall readability and maintainability of the code.
- Introduced asynchronous processing for user sessions to improve concurrency and efficiency. - Added detailed logging for user and session creation, including error handling and message addition status. - Implemented a semaphore to limit concurrent user processing and a lock for thread-safe CSV writing. - Refactored the ingestion method to utilize the new `_process_user` method for better code organization and readability. - Ensured custom ontology setup is performed asynchronously if requested.
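A simplified sketch of the concurrency pattern described here (semaphore-bounded user processing plus a lock around result writes); the limit of 10 and the ingest_user helper are illustrative, not the PR's actual names:

    import asyncio
    import csv

    MAX_CONCURRENT_USERS = 10  # assumed limit; the PR's actual value may differ

    async def ingest_user(user_id: str) -> str:
        # Stand-in for the real Zep ingestion call.
        await asyncio.sleep(0.1)
        return "ok"

    async def process_user(user_id, sem, lock, writer):
        async with sem:              # cap how many users are processed at once
            status = await ingest_user(user_id)
            async with lock:         # keep CSV writes from interleaving
                writer.writerow([user_id, status])

    async def run(user_ids, csv_path="results.csv"):
        sem = asyncio.Semaphore(MAX_CONCURRENT_USERS)
        lock = asyncio.Lock()
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            await asyncio.gather(*(process_user(u, sem, lock, writer) for u in user_ids))

    # asyncio.run(run(["user_0", "user_1", "user_2"]))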
- Implemented chunking and contextualization for large messages to ensure they fit within memory constraints. - Introduced new methods for handling large messages, including `_chunk_large_message` and `_contextualize_chunk`. - Updated the ingestion logic to process large messages separately, maintaining detailed logging of chunking results. - Enhanced the Zep ontology setup function to support asynchronous operations. - Improved overall code organization and readability in the ingestion module.
- Introduced replay mode for re-ingesting failed users, improving error handling and robustness. - Updated `IngestionRunner` to support replay functionality and detailed logging of ingestion results. - Refactored `EvaluationRunner` to improve prompt clarity and response generation logic. - Enhanced chunking logic in `IngestionRunner` for better handling of large messages. - Improved overall code organization and readability across multiple modules.
- Updated `EvaluationRunner` to support configurable model parameters for response, grading, and summarization. - Introduced a new `grid_search.py` module for systematic evaluation of parameter combinations, improving experimentation efficiency. - Added `grid_search_config.yaml` for defining search parameters and evaluation settings. - Implemented a `RerankerFactory` to facilitate reranking options, including integration with MxbaiRerankV2. - Enhanced summarization logic to include message context in summaries, improving response relevance. - Removed outdated session analysis summary file to streamline project structure.
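As a rough illustration of the grid-search idea (the YAML key search_params and the evaluate() entry point below are placeholders, not the PR's actual schema):

    import itertools
    import yaml

    def iter_param_grid(config_path: str):
        # Expand every combination of the listed parameter values.
        with open(config_path) as f:
            grid = yaml.safe_load(f)["search_params"]  # assumed top-level key
        keys = list(grid)
        for values in itertools.product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))

    if __name__ == "__main__":
        for params in iter_param_grid("grid_search_config.yaml"):
            print(f"Evaluating with {params}")
            # score = evaluate(**params)  # hypothetical evaluation entry point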
- Introduced `bayesian_search.py` for implementing Bayesian optimization using Optuna, enhancing the evaluation process. - Configured logging and output management for search results, ensuring reproducibility and traceability. - Updated `common.py` with new summarization prompt templates for improved context extraction. - Enhanced `EvaluationRunner` to support dynamic summarization strategies based on configuration. - Removed outdated `grid_search_config.yaml` to streamline configuration management and replaced it with `search_config.yaml` for unified search settings.
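A minimal Optuna sketch of the kind of search loop this describes (the parameter names and the objective body are assumptions; the PR's real search space lives in search_config.yaml):

    import optuna

    def objective(trial: optuna.Trial) -> float:
        # Hypothetical search space for retrieval/response settings.
        top_k = trial.suggest_int("top_k", 5, 30)
        temperature = trial.suggest_float("temperature", 0.0, 1.0)
        print(f"Trial with top_k={top_k}, temperature={temperature:.2f}")
        # score = run_evaluation(top_k=top_k, temperature=temperature)  # assumed entry point
        score = 0.0  # placeholder so the sketch runs standalone
        return score

    if __name__ == "__main__":
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=20)
        print("Best params:", study.best_params)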
- Added resolution markers for Python version compatibility. - Introduced new packages: `accelerate`, `aiohappyeyeballs`, `aiohttp`, `aiolimiter`, `aiosignal`, `alembic`, `yarl`, `zep-cloud`, and `zstandard`, along with their respective versions and dependencies. - Updated existing package entries with detailed source and wheel information for better package management.
…ement - Integrated `tiktoken` for measuring context length and counting tokens in the `EvaluationRunner`. - Added methods to count tokens and characters in context, improving evaluation metrics. - Updated dependencies in `pyproject.toml` to include `tiktoken` for enhanced functionality. - Removed unnecessary whitespace in `ingestion.py` for cleaner code.
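For reference, counting tokens and characters with tiktoken typically looks like this (the cl100k_base encoding is an assumption; the PR may pin a different model or encoding):

    import tiktoken

    def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
        # Token count under the given tiktoken encoding.
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))

    def count_chars(text: str) -> int:
        return len(text)

    context = "Example retrieved context from the knowledge graph."
    print(count_tokens(context), "tokens,", count_chars(context), "characters")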
- Added concurrency control to the ingestion process, allowing up to 10 concurrent users during data ingestion. - Implemented checkpointing functionality to track completed and failed users, enabling resumption of ingestion from the last successful state. - Updated the benchmark configuration to include concurrency settings. - Enhanced logging and user feedback during ingestion and evaluation processes. - Introduced a Makefile for easier development tasks, including testing, formatting, and linting. - Improved documentation to reflect new features and usage instructions.
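A simplified sketch of file-based checkpointing of completed and failed users (the JSON layout and file name are assumptions; the PR's format is not shown here). Note that only completed users are skipped on resume, so failed users get retried:

    import json
    from pathlib import Path

    CHECKPOINT = Path("ingestion_checkpoint.json")  # assumed file name

    def load_checkpoint() -> dict:
        if CHECKPOINT.exists():
            return json.loads(CHECKPOINT.read_text())
        return {"completed": [], "failed": []}

    def save_checkpoint(state: dict) -> None:
        CHECKPOINT.write_text(json.dumps(state, indent=2))

    def remaining_users(all_users: list[str], state: dict) -> list[str]:
        # Skip only completed users; failed users remain in the work list for retry.
        done = set(state["completed"])
        return [u for u in all_users if u not in done]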
…xtualization - Implemented chunking and contextualization for large messages to improve handling and integration into the ingestion process. - Added new constants for message size limits and chunking parameters to enhance configurability. - Updated the ingestion logic to process messages in batches, ensuring efficient handling of both normal and large messages. - Improved error handling and logging for large message processing, providing better insights during ingestion. - Refactored code for clarity and maintainability, including the addition of helper methods for chunking and contextualization.
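A rough sketch of the size-gated chunking this describes (constant values and the contextual prefix are placeholders; the PR's _chunk_large_message and _contextualize_chunk methods may behave differently):

    MAX_MESSAGE_CHARS = 8000   # assumed ingestion size limit
    CHUNK_SIZE = 6000          # assumed chunk size
    CHUNK_OVERLAP = 500        # assumed overlap between consecutive chunks

    def chunk_large_message(text: str) -> list[str]:
        # Split an oversized message into overlapping chunks that fit the limit.
        if len(text) <= MAX_MESSAGE_CHARS:
            return [text]
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start : start + CHUNK_SIZE])
            start += CHUNK_SIZE - CHUNK_OVERLAP
        return chunks

    def contextualize_chunk(chunk: str, index: int, total: int) -> str:
        # Prefix each chunk so it still makes sense when ingested in isolation.
        return f"[Part {index + 1}/{total} of a longer message]\n{chunk}"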
…nd ingestion logic - Eliminated the concurrency setting from the benchmark configuration and related classes, simplifying the ingestion process. - Updated the ingestion runner to no longer require concurrency parameters, streamlining user processing. - Adjusted documentation and tests to reflect the removal of concurrency features, ensuring clarity and consistency.
… user processing - Adjusted user processing to only count completed users when resuming from a checkpoint, allowing for retries of failed users. - Enhanced reporting to distinguish between new users and retries, providing clearer feedback during ingestion. - Updated checkpoint management to prevent duplicate entries for completed and failed users, ensuring accurate tracking.
Important
This pull request adds scripts for evaluating Locomo and LongMemEval datasets using Zep's knowledge graph and OpenAI's LLM, including ingestion, response generation, grading, and analysis functionalities.
- zep_locomo_eval.py for grading responses using OpenAI's LLM with a custom grading prompt.
- zep_locomo_ingestion.py for ingesting the Locomo dataset into Zep's knowledge graph.
- zep_locomo_responses.py for generating responses using OpenAI's LLM based on Zep's context.
- zep_locomo_search.py for searching and composing context from Zep's knowledge graph.
- zep_eval.py and zep_responses.py for grading and generating responses using OpenAI's LLM.
- analyze_dataset_sessions.py for analyzing session statistics from the LongMemEval dataset.
- calculate_scores.py for calculating scores by question type from JSONL results.
- queue_depth.py for analyzing queue metrics from a JSON file.
- zep_longmem_eval.py for a comprehensive evaluation pipeline including ingestion, response generation, and grading using Zep and OpenAI's LLM.

This description was created by
for c0c0196. You can customize this summary. It will automatically update as commits are pushed.