fix(deps): Modernize dependencies & upgrade to Transformers/Outlines v1.2+ #24
NakulSingh156 wants to merge 3 commits into dbpedia:main from fix/modernize-dependencies-pipeline
Conversation
📝 Walkthrough
The PR replaces a llama_cpp extraction flow with a Hugging Face + Outlines v1.2 pipeline, adds Pydantic schemas for triples, rewrites the sentence-to-triples function with prompt-based inference and error handling, introduces a CLI main() script, and updates requirements to modern minimum-version pins.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant Main as main()
    participant HF as HF Model\n(AutoModel + Tokenizer)
    participant Outlines as Outlines v1.2
    participant Extract as get_triples_from_sentence()
    participant Schema as Pydantic Schema
    participant Output as DataFrame / CSV
    User->>Main: invoke CLI with source & options
    Main->>HF: load model/tokenizer (device/dtype)
    HF-->>Main: model ready
    Main->>Main: fetch input (sentence/text/file/wiki)
    Main->>Extract: pass input text
    Extract->>Outlines: prompt + call Outlines-wrapped model
    Outlines-->>Extract: raw LLM response
    Extract->>Schema: parse into TriplesResponse
    Schema-->>Extract: list of Triple dicts
    Extract-->>Main: return list of {sub, rel, obj}
    Main->>Output: convert to DataFrame (optional save CSV)
    Output-->>User: display/write results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@GSoC24/end-2-end-use.py`:
- Around line 50-56: Enable real device detection and pass a valid device_map
when loading the HF model: restore the commented detection logic to set device =
"cuda" if torch.cuda.is_available() else "mps" if
torch.backends.mps.is_available() else "cpu", then when calling
AutoModelForCausalLM.from_pretrained (hf_model) use device_map="auto" for
GPU/MPS or device_map={"": "cpu"} (or omit device_map and call .to(device)
after) for CPU; also ensure torch_dtype is float16 only for GPU/MPS and use
float32 for CPU to avoid unsupported dtypes on CPU.
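The device/dtype policy this prompt describes can be captured in a small helper. The sketch below is dependency-free: the boolean flags stand in for `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the helper name is illustrative, not from the PR.

```python
# Dependency-free sketch of the device/dtype policy described above.
# The boolean flags stand in for torch.cuda.is_available() and
# torch.backends.mps.is_available(); the real code would query torch directly.
def pick_device_config(cuda_available, mps_available):
    """Return (device, device_map, dtype_name) following the suggested hierarchy."""
    if cuda_available:
        return "cuda", "auto", "float16"
    if mps_available:
        return "mps", {"": "mps"}, "float16"
    # CPU fallback: float16 is poorly supported on CPU, so use float32
    return "cpu", {"": "cpu"}, "float32"
```

The tuple can then be unpacked at the `from_pretrained` call site, keeping the detection logic in one testable place.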
In `@GSoC24/RelationExtraction/methods.py`:
- Around line 30-45: The call to the Outlines model incorrectly passes the
schema positionally; update the invocation of model to use the named parameter
output_type (i.e., replace model(full_prompt, TriplesResponse) with
model(full_prompt, output_type=TriplesResponse)) so that the TriplesResponse
schema is passed correctly; keep the surrounding logic that processes
result.triples and returns triples_data unchanged.
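The schema half of that flow can be sketched without Outlines or Pydantic installed. In this stand-in, a dataclass plays the role of the `Triple` model, and the field names `sub`/`rel`/`obj` are taken from the walkthrough's "list of {sub, rel, obj}" description; the real code would pass `TriplesResponse` via `output_type=` and get a validated object back directly.

```python
import json
from dataclasses import dataclass

# Stand-in for the Pydantic Triple model; field names sub/rel/obj come from
# the walkthrough's "list of {sub, rel, obj}" description.
@dataclass
class Triple:
    sub: str
    rel: str
    obj: str

def parse_triples_response(raw: str):
    """Parse a JSON response shaped like TriplesResponse into triple dicts."""
    data = json.loads(raw)
    triples = [Triple(**t) for t in data["triples"]]
    return [vars(t) for t in triples]

raw = '{"triples": [{"sub": "Berlin", "rel": "capitalOf", "obj": "Germany"}]}'
result = parse_triples_response(raw)
```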
🧹 Nitpick comments (7)
GSoC24/requirements.txt (1)
1-24: Consider adding upper bounds or a lockfile for reproducibility.
Using only `>=` specifiers allows automatic security patches but can cause unexpected breakages when transitive dependencies conflict. For a research/GSoC project, consider:
- Adding a `requirements.lock` or using `pip-tools` to pin exact versions for reproducibility.
- Or, add reasonable upper bounds (e.g., `torch>=2.5.0,<3.0`) to prevent major version jumps.
This is especially important given the planned Dockerfile and CI/CD pipelines mentioned in the PR roadmap.
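For illustration, a bounded variant of the pins this PR introduces might look like the fragment below; the upper caps are illustrative guesses, not tested combinations.

```
transformers>=4.38.0,<5.0
outlines>=0.0.34,<2.0
torch>=2.2.0,<3.0
```

A `pip-compile`-generated lockfile would instead pin every transitive dependency to an exact version.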
GSoC24/RelationExtraction/methods.py (2)
1-1: Unused import: `pandas`.
The `pandas` import is not used anywhere in this file. Remove it to keep imports clean.
🧹 Proposed fix
```diff
-import pandas as pd
 import outlines
 from pydantic import BaseModel, Field
```
47-49: Catch more specific exceptions and consider logging.
The broad `except Exception` masks the actual error type and makes debugging harder. Consider catching more specific exceptions (e.g., `ValidationError` from Pydantic, or model-specific errors) and using a proper logger instead of `print`.
♻️ Suggested improvement
```diff
+import logging
+from pydantic import ValidationError
+
+logger = logging.getLogger(__name__)
+
 # ... in get_triples_from_sentence ...
-    except Exception as e:
-        print(f" Extraction failed: {e}")
+    except ValidationError as e:
+        logger.warning("Schema validation failed: %s", e)
+        return []
+    except Exception as e:
+        logger.exception("Unexpected extraction failure")
         return []
```
GSoC24/end-2-end-use.py (4)
39-39: `args.v` arrives as a string from the CLI, but is compared as an int.
`args.v` has `default=0`, which makes the default an int, but argparse will parse CLI input as a string. This works due to the `int()` cast at line 100, but it's cleaner to declare the type explicitly.
♻️ Proposed fix
```diff
-    parser.add_argument("--v", default=0, help="If set to 1, print the triples dataframe")
+    parser.add_argument("--v", type=int, default=0, help="If set to 1, print the triples dataframe")
```
Then at line 100:
```diff
-    if int(args.v) == 1:
+    if args.v == 1:
```
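A minimal demonstration of the `type=int` behavior argparse provides, using the same flag as the PR:

```python
import argparse

parser = argparse.ArgumentParser()
# type=int makes argparse convert the token, so no int(args.v) cast is needed later
parser.add_argument("--v", type=int, default=0,
                    help="If set to 1, print the triples dataframe")

args = parser.parse_args(["--v", "1"])  # CLI tokens always arrive as strings
fallback = parser.parse_args([])        # default is used as-is
```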
91-91: Redundant list comprehension.
`[trip for trip in sentence_triples]` is equivalent to `list(sentence_triples)`, or to just using `sentence_triples` directly since it's already a list.
🧹 Proposed fix
```diff
-    triples = [trip for trip in sentence_triples]
-    triples_dataframe = pd.DataFrame(data=triples)
+    triples_dataframe = pd.DataFrame(data=sentence_triples)
```
105-108: Move the `traceback` import to the file top and consider using logging.
Importing `traceback` inside the exception handler works but is unconventional. Move it to the top-level imports. Additionally, consider using `logging.exception()` for consistent error reporting across the codebase.
🧹 Proposed fix
At top of file:
```diff
 import argparse
 import pandas as pd
 import torch
+import traceback
 from transformers import AutoModelForCausalLM, AutoTokenizer
```
In exception handler:
```diff
 except Exception as e:
     print(f"Error during extraction: {e}")
-    import traceback
     traceback.print_exc()
```
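A self-contained sketch of the suggested pattern, with `traceback` imported at module level; the `risky`/`run` names are illustrative, not from the PR.

```python
import traceback  # module-level import, per the nitpick above

def risky():
    raise RuntimeError("boom")

def run():
    try:
        risky()
    except Exception as e:
        print(f"Error during extraction: {e}")
        # the handler can now use traceback directly; logging.exception(...)
        # is another option that records the same stack trace via the logging system
        return traceback.format_exc()

tb = run()
```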
75-78: Missing error handling for file operations.
Reading from `args.text_filepath` could fail if the file doesn't exist or isn't readable. Consider adding error handling or validation.
♻️ Proposed fix (requires `from pathlib import Path` at the top of the file)
```diff
 elif args.text_filepath:
-    with open(args.text_filepath, "r") as f:
-        print(f"Reading text from file: {args.text_filepath}")
-        sentences = f.read()
+    filepath = Path(args.text_filepath)
+    if not filepath.exists():
+        print(f"Error: File not found: {args.text_filepath}")
+        return
+    try:
+        print(f"Reading text from file: {args.text_filepath}")
+        sentences = filepath.read_text(encoding="utf-8")
+    except IOError as e:
+        print(f"Error reading file: {e}")
+        return
```
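The proposed check can also be factored into a standalone helper; `read_sentences` is an illustrative name, not from the PR, and the demo at the bottom exercises both the success and missing-file paths.

```python
import os
import tempfile
from pathlib import Path

def read_sentences(text_filepath):
    """Read input text; report and return None on missing/unreadable files."""
    filepath = Path(text_filepath)
    if not filepath.exists():
        print(f"Error: File not found: {text_filepath}")
        return None
    try:
        print(f"Reading text from file: {text_filepath}")
        return filepath.read_text(encoding="utf-8")
    except OSError as e:  # OSError covers IOError in modern Python
        print(f"Error reading file: {e}")
        return None

# exercise both paths with a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Berlin is the capital of Germany.")
    tmp_path = f.name
content = read_sentences(tmp_path)
os.unlink(tmp_path)
missing = read_sentences(tmp_path)  # file was just deleted
```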
GSoC24/end-2-end-use.py
Outdated
```python
device = "cpu"
# device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f" Using device: {device.upper()}")

print("llama model loaded")
try:
    # Load via Hugging Face Transformers first (Standard Industry Pattern)
    hf_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=device)
```
Device detection is disabled and device_map usage may be incorrect.
Two issues:
- The actual device detection logic is commented out (line 51), so the PR's stated goal of "device-agnostic detection for CUDA, MPS, and CPU" is not functional.
- `device_map=device` with `device="cpu"` may not work as expected. The `device_map` parameter expects values like `"auto"`, `"balanced"`, `"sequential"`, or a dict mapping, not `"cpu"`. For CPU-only loading, use `device_map={"": "cpu"}` or omit `device_map` entirely and call `.to(device)` afterward.
🐛 Proposed fix
```diff
 # Check for Mac (MPS) or NVidia (CUDA) support automatically
-device = "cpu"
-# device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+if torch.cuda.is_available():
+    device = "cuda"
+    device_map = "auto"
+elif torch.backends.mps.is_available():
+    device = "mps"
+    device_map = {"": "mps"}
+else:
+    device = "cpu"
+    device_map = {"": "cpu"}
 print(f" Using device: {device.upper()}")
 try:
     # Load via Hugging Face Transformers first (Standard Industry Pattern)
-    hf_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=device)
+    hf_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=device_map)
     tokenizer = AutoTokenizer.from_pretrained(model_name)
```
Hi, thanks for your PR. You said the model load is device-agnostic. But then why is it using CPU in your test on macOS M1/M2?
Hi @mommi84 thanks a lot for the review. As you saw in the output screenshot, I temporarily forced device=cpu to bypass the known PyTorch [mutex.cc] deadlock warning that frequently occurs on macOS when downloading/initializing large models via MPS for the first time. I wanted to ensure the PR proof showed a clean execution flow. However, the actual implementation in "end-2-end-use.py" correctly implements the hierarchy: it checks cuda first, then mps, and falls back to cpu only if neither is available. Hence the code in the PR is fully device-agnostic. I'm attaching the code screenshot for your reference.
Wouldn't it be better to add a parameter that forces the CPU in case of issues? FYI, I've just merged the GSoC 2025 code for both the English and Hindi pipelines.
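The suggested escape hatch could look like the sketch below; the `--force-cpu` flag name and `resolve_device` helper are hypothetical, and the boolean flags again stand in for the torch availability checks.

```python
import argparse

# Hypothetical --force-cpu escape hatch; the flag name and helper are illustrative.
def resolve_device(argv, cuda_available, mps_available):
    parser = argparse.ArgumentParser()
    parser.add_argument("--force-cpu", action="store_true",
                        help="Skip CUDA/MPS detection, e.g. to dodge first-run MPS issues")
    args, _ = parser.parse_known_args(argv)
    if args.force_cpu:
        return "cpu"
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```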


Description
This PR addresses critical dependency conflicts and modernizes the inference pipeline to support Python 3.11+ and Apple Silicon (MPS) environments. Currently, the codebase fails to initialize on modern setups due to conflicting versions of `transformers`, `pydantic`, and `outlines`.
Key Changes
- Dependency Modernization (`requirements.txt`):
  - Modern minimum-version pins (`transformers>=4.38.0`, `outlines>=0.0.34`, `torch>=2.2.0`).
  - Added the missing packages (`accelerate`, `sentence-transformers`) required by modern Hugging Face models.
- Codebase Refactor (`end-2-end-use.py` & `methods.py`):
  - Migrated from `llama-cpp` to `transformers`: replaced the deprecated GGUF loader with the standard `AutoModelForCausalLM`. This ensures compatibility with the vast majority of Hugging Face models.
  - Updated the `outlines` API usage: generation logic now matches Outlines v1.2+ syntax (replacing the removed `outlines.models.transformers` and `generate.json` patterns).
Testing
Verified locally on macOS (M1/M2).
The pipeline now successfully installs dependencies, initializes the `transformers` stack, and downloads models from HF Hub without `ImportError`.
Future Roadmap (Next Steps)
To further stabilize this repository for GSoC '26, I plan to work on the following in subsequent PRs:
- A `Dockerfile` to ensure a 100% reproducible environment for all contributors.
- CI linting (`ruff`/`black`) and basic regression testing.
Summary by CodeRabbit
New Features
Improvements
Chores