Add sheet scoring #45
base: main
Conversation
morganizzzm commented on Aug 11, 2025
- Add a SheetScorer class that scores biblical reference discussion levels (0-4 scale), title interest, and creativity, with Hebrew/English multi-language support. Add resilient LLM processing with OpenAI function calling, recursive fallback handling, intelligent content chunking, and token management (smart truncation plus LLM-based summarization for large sheets).
- Add Celery integration: SheetScoringInput/SheetScoringOutput dataclasses in the sheet_scoring module and a score_sheet_task.
- Register the sheet_scoring package in Celery autodiscovery and add a test script for validation.
Commits:
- feat: Add SheetScorer tool for analyzing Jewish study sheets with LLM-based scoring. This enables automated analysis of Jewish study sheets to score reference discussion quality and title engagement, supporting content curation workflows.
- …code review: replaced manual multiline f-strings with textwrap.dedent for prompt builders; simplified _invoke_llm_with_function by using getattr on additional_kwargs; renamed SheetScoringOutput.processed_at → processed_datetime; moved the datetime→ISO conversion into __post_init__ instead of a custom __init__; expanded and clarified comments on configuration constants; switched from relative to absolute imports after adding LLM to PYTHONPATH.
- …consistent with topic_prompt and other llm apps: scoring_io_input.py → sheet_scoring_input.py, scoring_io_output.py → sheet_scoring_output.py; package renamed scoring_io → sheet_scoring, with all inputs renamed accordingly.
- added this by mistake in one of the commits
- …removed import of env variables
app/sheet_scoring/text_utils.py (Outdated)
| return "" | ||
| text = TAG_RE.sub("", raw) | ||
| text = html.unescape(text) | ||
| text = re.sub(r"\s+\n", "\n", text) # trim spaces before newlines |
If we don't need tabs, I think this is better:

```python
return '\n'.join([' '.join(line.split()) for line in text.split('\n')])
```

I think it should be faster, but maybe I'm wrong, and maybe it's not as readable (it seems readable to me, but I'm not sure it is for everyone).
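For context, a hedged side-by-side of the two approaches; the sample text is made up for illustration:

```python
import re

text = "hello   world  \nfoo\tbar  \n"

# Regex from the diff above: trims trailing spaces before newlines,
# but leaves whitespace runs inside a line untouched.
regex_out = re.sub(r"\s+\n", "\n", text)

# Reviewer's one-liner: collapses every whitespace run within each
# line to a single space (tabs are lost), preserving line breaks.
split_out = "\n".join(" ".join(line.split()) for line in text.split("\n"))

print(repr(regex_out))  # 'hello   world\nfoo\tbar\n'
print(repr(split_out))  # 'hello world\nfoo bar\n'
```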
app/sheet_scoring/text_utils.py (Outdated)
```python
    return len(TOKEN_RE.findall(text))


def sheet_to_text_views(
```
This is not too long to stay on one line (we usually don't break a def like this).
app/sheet_scoring/text_utils.py (Outdated)
```python
def sheet_to_text_views(
    sheet: Dict[str, Any],
    default_lang: str = "en",
) -> Tuple[str, str, str, bool, float]:
```
Returning a dict is better: it makes it much clearer what you have in hand, so if there is no reason for the tuple, please change it. It can also be a class if you wish.
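For illustration, a hedged sketch of the class option; the name SheetTextViews is hypothetical, though the field names follow the dict keys the PR later adopts (quotes_only, no_quotes, with_quotes, has_original, creativity_score):

```python
from dataclasses import dataclass

@dataclass
class SheetTextViews:
    quotes_only: str         # quoted source text only
    no_quotes: str           # user-written commentary only
    with_quotes: str         # full sheet text, quotes included
    has_original: bool       # whether the sheet has original writing
    creativity_score: float  # share of user-generated content
```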
```python
    default_lang: str = "en",
) -> Tuple[str, str, str, bool, float]:
    """
    Build three plain-text snapshots of a Sefaria sheet **and** compute a
```
Why both? Can you split it into two functions?
Because all of the outputs are extracted during the same pass over the sheet's attributes; if I split it into two functions, it will take twice as long.
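A minimal sketch of that single-pass argument, assuming hypothetical source keys ("text" for quoted passages, "outsideText" for user writing); this is not the PR's actual parsing code:

```python
def build_views_and_score(sources):
    # One walk over the sources yields both the text views and the
    # creativity ratio, so splitting would repeat the traversal.
    quoted, original = [], []
    for source in sources:
        if "text" in source:          # quoted passage (assumed key)
            quoted.append(source["text"])
        if "outsideText" in source:   # user-written content (assumed key)
            original.append(source["outsideText"])
    total = len(quoted) + len(original)
    creativity = len(original) / total if total else 0.0
    return {
        "quotes_only": "\n".join(quoted),
        "no_quotes": "\n".join(original),
        "creativity_score": creativity,
    }
```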
```python
    token_margin: int = DEFAULT_MAX_OUTPUT_TOKENS,
    max_ref_to_process: int = DEFAULT_MAX_REFS_TO_PROCESS,
    chunk_size: int = DEFAULT_CHUNK_SIZE,

```
Remove the newline.
```python
            self.LANGUAGE_FIELD: LanguageCode.DEFAULT
        }


    def _normalize_scores_to_percentages(
```
Add a docstring to explain what this function does.
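For illustration, a hedged sketch of what such a docstring might look like; the 0-4 → 0-100 mapping is an assumption based on the scale described elsewhere in the PR, not the actual implementation:

```python
def _normalize_scores_to_percentages(self, scores: dict) -> dict:
    """Convert raw 0-4 reference scores into 0-100 percentages.

    Returns a new dict mapping each ref to score / 4 * 100.
    """
    return {ref: score / 4 * 100 for ref, score in scores.items()}
```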
```python
    • First try the whole list.
    • If the model returns < len(refs) scores (or JSON error),
      split the list in two and grade each half recursively.
    """
```
I didn't understand what this function should do, and specifically why it returns two dicts.
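One plausible reading (an assumption, not confirmed in the PR) is that one dict holds the successful scores and the other records refs that could not be scored. A minimal sketch of the split-and-retry pattern under that assumption, with all names hypothetical:

```python
def score_refs(refs, ask_llm):
    """Try the whole list; on short or failed output, halve and recurse."""
    scores = ask_llm(refs)  # hypothetical LLM call: dict of ref -> score, or None
    if scores is not None and len(scores) == len(refs):
        return scores, {}                 # (scores, failures)
    if len(refs) == 1:
        return {}, {refs[0]: "unscored"}  # base case: give up on this ref
    mid = len(refs) // 2
    left_ok, left_bad = score_refs(refs[:mid], ask_llm)
    right_ok, right_bad = score_refs(refs[mid:], ask_llm)
    return {**left_ok, **right_ok}, {**left_bad, **right_bad}
```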
```python
        validated_scores = {}
        for ref, score in chunk_scores.items():
            validated_scores[ref] = self._validate_score_level(
                score,f"ref_score[{ref}]"
```
Missing space after the comma.
```python
            logger.error(f"Chunk GPT failed: {e}")
            return None


    def _last_regular_start(self, n: int, chunk: int, overlap: int) -> int:
```
Can you give n a more descriptive name?
…and unified output:
- Updated SheetScoringInput to hold only the data used by SheetScorer: sheet_id, title, sources, expanded_refs
- Refactored SheetScorer to return SheetScoringOutput directly instead of a dict
- Changed score_one_sheet to call process_sheet_by_content with explicit args (title, sources, etc.)
- Added request_status and request_status_message fields to SheetScoringOutput
- Modified sheet_to_text_views to accept title/sources and return a dict with quotes_only, no_quotes, with_quotes, has_original, creativity_score
- Simplified strip_html to only collapse whitespace/newlines, removing HTML entity decoding
- Added create_failure_output for standardized error handling
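Putting together the fields named in this commit and the earlier __post_init__ note, a hedged sketch of the two dataclasses; exact types and defaults are assumptions, and the output's score fields are omitted because the commit message does not enumerate them:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Union

@dataclass
class SheetScoringInput:
    sheet_id: int
    title: str
    sources: List[Dict[str, Any]]
    expanded_refs: List[str]

@dataclass
class SheetScoringOutput:
    sheet_id: int
    request_status: str
    request_status_message: str
    processed_datetime: Union[datetime, str] = ""

    def __post_init__(self):
        # Per the commit notes: the datetime -> ISO-string conversion
        # lives in __post_init__ rather than a custom __init__.
        if isinstance(self.processed_datetime, datetime):
            self.processed_datetime = self.processed_datetime.isoformat()
```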
Pull request overview
This PR introduces a comprehensive sheet scoring system that uses LLMs to automatically analyze Jewish study sheets. The system scores biblical reference discussion levels (0-4 scale), evaluates title interest, and calculates creativity metrics based on user-generated content percentage. The implementation includes resilient LLM processing with OpenAI function calling, intelligent chunking for large sheets, and Celery integration for asynchronous task processing.
Key Changes:
- Implements SheetScorer class with multi-language support (Hebrew/English) for analyzing study sheets via OpenAI GPT models
- Adds Celery task integration with score_sheet_task and structured input/output dataclasses
- Reduces Celery worker concurrency from 50 to 4 to accommodate resource-intensive LLM operations
- Fixes typo in sentencizer prompt ("will" → "with")
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| build/entrypoint.sh | Reduces Celery concurrency from 50 to 4 for LLM workloads |
| app/util/sentencizer.py | Fixes typo in system prompt ("will" → "with") |
| app/sheet_scoring/text_utils.py | Adds text processing utilities for token counting and sheet content extraction |
| app/sheet_scoring/tasks.py | Implements Celery task wrapper for sheet scoring |
| app/sheet_scoring/sheet_scoring.py | Provides main API entry point for scoring sheets |
| app/sheet_scoring/openai_sheets_scorer.py | Core LLM scoring engine with resilient processing and chunking strategies |
| app/sheet_scoring/__init__.py | Package initialization file |
| app/sheet_scoring/README.md | Comprehensive documentation for the sheet scoring system |
| app/requirements.txt | Updates dependencies: pins langchain versions, adds langchain-openai, httpx, and unpins tiktoken |
| app/llm_interface/sefaria_llm_interface/sheet_scoring/sheet_scoring_output.py | Defines output dataclass for sheet scoring results |
| app/llm_interface/sefaria_llm_interface/sheet_scoring/sheet_scoring_input.py | Defines input dataclass for sheet scoring requests |
| app/llm_interface/sefaria_llm_interface/sheet_scoring/__init__.py | Exports sheet scoring dataclasses |
| app/celery_setup/app.py | Registers sheet_scoring package in Celery autodiscovery |
```
langchain==0.2.1
langchain-core==0.2.2
langchain-openai==0.1.8
```
Copilot AI commented on Jan 7, 2026
Dependency constraint change: Changed from flexible version langchain[llms]~=0.2.1 to exact version langchain==0.2.1 and split into separate packages (langchain-core, langchain-openai). While this provides more control, ensure this exact version pinning doesn't cause conflicts with other dependencies. The split into specific packages is good practice for reducing unnecessary dependencies.
Suggested change:
```diff
-langchain==0.2.1
-langchain-core==0.2.2
-langchain-openai==0.1.8
+langchain~=0.2.1
+langchain-core~=0.2.2
+langchain-openai~=0.1.8
```
Making versions work took me a day...
```python
def _create_json_llm(self, api_key: str, model: str) -> ChatOpenAI:
    """Create LLM client for JSON responses."""
    http_client = httpx.Client()
    return ChatOpenAI(
        model=model,
        temperature=0,
        top_p=0,
        frequency_penalty=0,
        presence_penalty=0,
        seed=42,
        api_key=api_key,
        http_client=http_client,
    )


def _create_text_llm(self, api_key: str, model: str) -> ChatOpenAI:
    """Create LLM client for text responses."""
    http_client = httpx.Client()
    return ChatOpenAI(
        model=model,
        temperature=0,
        model_kwargs={"response_format": {"type": "text"}},
        api_key=api_key,
        http_client=http_client,
    )
```
Copilot AI commented on Jan 7, 2026
Resource leak: httpx.Client() instances are created but never closed. These HTTP clients should be properly closed to avoid resource leaks. Consider using context managers or storing the client as an instance variable with proper cleanup in a destructor, or initialize ChatOpenAI without passing an http_client parameter to let it manage its own client lifecycle.
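A minimal sketch of the two remediations suggested here; the class name is hypothetical, and the model string and key are placeholders:

```python
import httpx
from langchain_openai import ChatOpenAI

# Option 1: omit http_client and let ChatOpenAI manage its own
# connection lifecycle.
llm = ChatOpenAI(model="gpt-4o", temperature=0, api_key="sk-...")

# Option 2: keep one client as an instance attribute and close it
# explicitly when the scorer is finished.
class ScorerClients:
    def __init__(self, api_key: str, model: str):
        self._http_client = httpx.Client()
        self.llm = ChatOpenAI(
            model=model,
            temperature=0,
            api_key=api_key,
            http_client=self._http_client,
        )

    def close(self) -> None:
        self._http_client.close()
```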
app/llm_interface/sefaria_llm_interface/sheet_scoring/sheet_scoring_output.py (Outdated, resolved)
```python
from sefaria_llm_interface.sheet_scoring.sheet_scoring_input import *
from sefaria_llm_interface.sheet_scoring.sheet_scoring_output import *
```
Copilot AI commented on Jan 7, 2026
Import pollutes the enclosing namespace, as the imported module sefaria_llm_interface.sheet_scoring.sheet_scoring_input does not define `__all__`.
Suggested change:
```diff
-from sefaria_llm_interface.sheet_scoring.sheet_scoring_input import *
-from sefaria_llm_interface.sheet_scoring.sheet_scoring_output import *
+import sefaria_llm_interface.sheet_scoring.sheet_scoring_input as sheet_scoring_input
+import sefaria_llm_interface.sheet_scoring.sheet_scoring_output as sheet_scoring_output
+
+__all__ = []
+# Re-export all public (non-underscore) names from the submodules
+for _mod in (sheet_scoring_input, sheet_scoring_output):
+    for _name in dir(_mod):
+        if not _name.startswith("_"):
+            globals()[_name] = getattr(_mod, _name)
+            __all__.append(_name)
```
```python
from sefaria_llm_interface.sheet_scoring.sheet_scoring_output import *
```
Copilot AI commented on Jan 7, 2026
Import pollutes the enclosing namespace, as the imported module sefaria_llm_interface.sheet_scoring.sheet_scoring_output does not define `__all__`.
Suggested change:
```diff
-from sefaria_llm_interface.sheet_scoring.sheet_scoring_output import *
+import sefaria_llm_interface.sheet_scoring.sheet_scoring_output as _sheet_scoring_output
+
+# Re-export all public names from sheet_scoring_output without using a wildcard import.
+for _name, _value in _sheet_scoring_output.__dict__.items():
+    if not _name.startswith("_"):
+        globals()[_name] = _value
```
```diff
 diff-match-patch
 dnspython~=2.5.0
-tiktoken~=0.4.0
+tiktoken
```
Copilot AI commented on Jan 7, 2026
The tiktoken dependency is now specified without a version, which means builds and runtime will pull the latest mutable release and any compromised or malicious update could execute arbitrary code with the application's privileges (including access to secrets and data). This creates a concrete supply chain risk because an attacker who gains control of the PyPI project or its release process can silently introduce a backdoored version that will be automatically deployed. Pin tiktoken to a specific, vetted version (and ideally manage upgrades explicitly) to ensure only known-good code is installed.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@YishaiGlasner there are still some conflicts that need to be resolved. Also, can you give a bit of context when you have a chance, plus anything specific you want us to be looking for? Thank you so much!