Course Exam Benchmark: Add Evaluation Infrastructure #61
@@ -0,0 +1,20 @@

```python
from .courseexam import courseexam
from .dataset import load_dataset, load_exam_metadata
from .metrics import (
    points_accuracy,
    total_points_earned,
    total_points_possible,
)
from .scorer import exam_scorer

__all__ = [
    "courseexam",
    "load_dataset",
    "load_exam_metadata",
    "exam_scorer",
    "points_accuracy",
    "total_points_earned",
    "total_points_possible",
]

__version__ = "0.1.0"
```
@@ -0,0 +1,73 @@

```python
from inspect_ai import Task, task
from inspect_ai.model import GenerateConfig
from inspect_ai.solver import (
    Generate,
    Solver,
    TaskState,
    generate,
    multiple_choice,
    solver,
)

from courseexam.dataset import load_dataset, load_exam_metadata
from courseexam.metrics import (
    points_accuracy,
    total_points_earned,
    total_points_possible,
)
from courseexam.scorer import exam_scorer


@solver
def conditional_solver() -> Solver:
```
Collaborator: I assume we can also implement our own solver here, right?

Author: Yeah. I actually started by implementing a custom solver before I realized this is more convenient.
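For reference, a hand-rolled solver would follow the same `@solver` pattern used by `conditional_solver` above. The sketch below is illustrative only: `prepend_instructions` and its prompt text are assumptions for the example, not part of this PR.

```python
from inspect_ai.solver import Generate, Solver, TaskState, solver


@solver
def prepend_instructions(instructions: str) -> Solver:
    """Hypothetical custom solver that injects exam-specific instructions."""

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Prepend the instructions to the user prompt, then run generation.
        state.user_prompt.text = f"{instructions}\n\n{state.user_prompt.text}"
        return await generate(state)

    return solve
```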
```python
    mc_solver = multiple_choice()
```
Collaborator: Do we know what concrete prompt this multiple_choice function sends to the LLM?

Author: Yes, we can view the full LLM query and response in the UI.
```python
    gen_solver = generate()

    async def solve(state: TaskState, generate_fn: Generate) -> TaskState:
        question_type = state.metadata.get("type", "Freeform")

        if question_type == "ExactMatch":
            return await mc_solver(state, generate_fn)
        else:
            return await gen_solver(state, generate_fn)

    return solve


@task
def courseexam(
    exam_ids: str | list[str] | None = None,
    question_types: str | list[str] | None = None,
    tags: str | list[str] | None = None,
    shuffle: bool = False,
    max_tokens: int = 2048,
    judge_model: str = "openai/gpt-4o-mini",
) -> Task:
    dataset = load_dataset(
        exam_ids=exam_ids,
        question_types=question_types,
        tags=tags,
        shuffle=shuffle,
    )

    metadata = load_exam_metadata()

    exam_info = {
        "num_exams": len(metadata["exams"]),
        "exams": metadata["exams"],
    }

    return Task(
        dataset=dataset,
        solver=conditional_solver(),
        scorer=exam_scorer(judge_model=judge_model),
        metrics=[
            points_accuracy(),
            total_points_earned(),
            total_points_possible(),
        ],
        config=GenerateConfig(max_tokens=max_tokens, temperature=0.0),
        metadata=exam_info,
        name="courseexam",
        version="0.1.0",
    )
```
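For anyone trying this locally, the task can be kicked off through inspect_ai's Python API roughly as follows; the exam id and model name are placeholders, not values defined by this PR.

```python
from inspect_ai import eval

from courseexam import courseexam

# Placeholder arguments: substitute a real exam id and the model to benchmark.
logs = eval(
    courseexam(exam_ids="exam-1", shuffle=True),
    model="openai/gpt-4o-mini",
)
```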
Collaborator: I think many models, such as the ones from OpenAI, Anthropic, and Google, already support multimodal input (images and text). I am wondering why we need to convert images to text.

Author: Images put language models at a disadvantage, and most of the models we benchmark (as of now) are language-only, so for now I think questions that rely on images shouldn't be included. Later, when we support agents, we can decide how to handle image-based questions, either with clear flags or in a separate standalone benchmark. What do you think?
Collaborator: Models like gpt-4o and claude-sonnet-4.5 do support images, I think; please double-check.

Author: They do, yes. Do you think we should put questions that reference images in a separate benchmark?

Collaborator: Can we have a tag or label for these tasks? If the model supports images, we test everything; if not, we only run the questions without images. They all stay inside the same exam, just with additional labels. What do you think?
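One way the tagging idea could look with the `tags` filter already on the task: assuming image-dependent questions were tagged (the tag names below are hypothetical and not yet defined in the dataset), a run could include or exclude them per model.

```python
from inspect_ai import eval

from courseexam import courseexam

# Hypothetical tags: questions needing an image carry "image", the rest carry
# "text_only". Neither tag exists in the dataset yet; this also assumes that
# load_dataset keeps only samples matching the given tags.
text_only_run = courseexam(tags="text_only")  # for language-only models
full_run = courseexam()                       # for multimodal models

eval(text_only_run, model="openai/gpt-4o-mini")
```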