Context
The stub generator (dev/generate_stubs.R) produces clean R wrapper functions from OpenAPI schemas. Some functions need post-processing logic beyond what the generated stub provides — dispatching across multiple stubs, annotation joins, string coercion, DTXSID extraction. Currently these are hand-written files that get overwritten when stubs are regenerated after schema changes.
We need a system that:
- Stores post-processing logic separately from generated output
- Automatically applies it during stub generation
- Survives schema-driven regeneration (the whole point)
- Is easy to add new recipes and edit existing ones
- Works with the existing lifecycle badge protection in 05_file_scaffold.R
Current functions that need post-processing
- ct_bioactivity: dispatches to 4 generated stubs by search_type, optional annotation join via secondary API call
- ct_lists_all: projection selection logic, DTXSID comma-separated string coercion
- ct_list: uppercase coercion, DTXSID extraction + string split + dedup
More will be added as untested endpoints are validated.
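To make the kind of logic at stake concrete, here is a minimal sketch of the "string split + dedup" step. The helper name split_dtxsids, the input shape (a character vector of comma-separated IDs), and the placeholder IDs are all assumptions for illustration, not the existing implementation.

```r
# Hypothetical helper: split comma-separated DTXSID strings, trim, and dedup.
# The input shape is assumed; the IDs below are made-up placeholders.
split_dtxsids <- function(x) {
  ids <- unlist(strsplit(x, ",", fixed = TRUE), use.names = FALSE)
  ids <- trimws(ids)
  ids <- ids[!is.na(ids) & nzchar(ids)]  # drop missing and empty entries
  unique(ids)                            # keeps first-seen order
}

result <- split_dtxsids(c("DTXSID0000001, DTXSID0000002", "DTXSID0000001"))
print(result)
```

Logic of this size is what a stub regeneration currently overwrites, so wherever it ends up living, it has to be stored outside the generated output.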
Design Options
Option A: R list registry in a single file
All recipes live in one file (dev/endpoint_eval/09_recipes.R) as a named R list. Each entry contains metadata (title, lifecycle, params) and the function body as a character string. The generator reads this list and produces complete R files.
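A minimal sketch of what such a registry could look like. The field names follow the metadata listed above, but the exact structure, the render_recipe helper, and the placeholder title and body are assumptions, not the generator's actual API.

```r
# Hypothetical registry: one named entry per recipe, with the body as a string.
recipes <- list(
  ct_list = list(
    title     = "Placeholder title",
    lifecycle = "experimental",
    params    = c("list_name"),
    body      = "list_name <- toupper(list_name)\nlist_name"
  )
)

# Hypothetical renderer: turn one registry entry into function source text.
render_recipe <- function(name, recipe) {
  args  <- paste(recipe$params, collapse = ", ")
  lines <- strsplit(recipe$body, "\n", fixed = TRUE)[[1]]
  body  <- paste0("  ", lines, collapse = "\n")
  paste0(name, " <- function(", args, ") {\n", body, "\n}")
}

src <- render_recipe("ct_list", recipes$ct_list)

# parse() is where a syntax error inside the string body would first surface,
# i.e. at generation time rather than at edit time.
invisible(parse(text = src))
cat(src, "\n")
```

The sketch shows the trade-off directly: the registry is trivial to iterate, but the recipe body is opaque to the editor until it is parsed.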
Advantages:
- Single source of truth — one file to manage all recipes
- Follows existing pipeline conventions (all dev/endpoint_eval/ modules are single R files)
- Easy to iterate the registry programmatically (validation, reporting, drift detection)
- No file discovery logic needed — just read the list
Disadvantages:
- Writing R code inside character strings — no syntax highlighting, autocomplete, or linting in IDE
- Harder to review diffs (string changes vs. real code changes)
- File grows linearly with number of recipes; complex recipes (like chemi_safety with ~100 lines) make the file unwieldy
- Syntax errors in recipe bodies are only caught at generation time, not at edit time
Option B: Separate R files per recipe
Each recipe gets its own file in dev/recipes/ (e.g., dev/recipes/ct_bioactivity.R). Each file contains a standard R function definition that the generator reads, wraps with roxygen docs, and writes to R/. Metadata (title, lifecycle, params) could be in roxygen-style comments or a companion list at the top of the file.
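A rough sketch of the discovery and loading step, assuming one function per file, named after the file, with metadata in roxygen-style header comments. The load_recipe helper and the file contents are hypothetical, and a temporary directory stands in for dev/recipes/ so the sketch runs on its own.

```r
# Stand-in for dev/recipes/ with one hypothetical recipe file.
recipe_dir <- file.path(tempdir(), "recipes")
dir.create(recipe_dir, showWarnings = FALSE)
writeLines(
  c("#' @title Placeholder title",
    "ct_list <- function(list_name) {",
    "  toupper(list_name)",
    "}"),
  file.path(recipe_dir, "ct_list.R")
)

# Hypothetical loader: source the file into its own environment, then pull out
# the function (assumed to share the file's name) and the roxygen-style header.
load_recipe <- function(path) {
  env <- new.env()
  sys.source(path, envir = env)
  name <- tools::file_path_sans_ext(basename(path))
  list(
    name   = name,
    fn     = get(name, envir = env, inherits = FALSE),
    header = grep("^#'", readLines(path), value = TRUE)
  )
}

files  <- list.files(recipe_dir, pattern = "\\.R$", full.names = TRUE)
loaded <- lapply(files, load_recipe)
print(loaded[[1]]$fn("abc"))
```

Because each recipe is ordinary R code, it can be sourced and called in isolation, as in the last line above.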
Advantages:
- Full IDE support — syntax highlighting, autocomplete, linting, debugging all work
- Each recipe is independently readable and reviewable
- Complex recipes stay manageable (own file, own git history)
- Easy to test recipes in isolation (source the file, call the function)
- Git blame works per-recipe
Disadvantages:
- File discovery logic needed (glob dev/recipes/*.R, parse metadata)
- Metadata format needs design (roxygen comments? A header list? A companion YAML?)
- More files to manage
- Need convention for how the generator extracts the function body vs. metadata
Option C: Marker-protected regions in R/ files
Post-processing is written directly in the generated R/ files as normal R code. Special marker comments (e.g., # <<< RECIPE START >>> / # <<< RECIPE END >>>) delineate hand-written sections. The generator preserves everything between markers during regeneration and only rewrites the generated portions.
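The preserve-and-reinject step could be sketched roughly as follows. The marker strings come from the example above; the helper names, the sample file contents, and the projection parameter are hypothetical, and the sketch assumes exactly one marker pair per file.

```r
start_marker <- "# <<< RECIPE START >>>"
end_marker   <- "# <<< RECIPE END >>>"

# Return the lines between the markers, or NULL if markers are missing,
# duplicated, or out of order.
extract_region <- function(lines) {
  s <- which(trimws(lines) == start_marker)
  e <- which(trimws(lines) == end_marker)
  if (length(s) != 1L || length(e) != 1L || s >= e) return(NULL)
  lines[s + seq_len(e - s - 1L)]
}

# Insert a preserved region between the markers of freshly generated lines.
inject_region <- function(generated, region) {
  s <- which(trimws(generated) == start_marker)
  e <- which(trimws(generated) == end_marker)
  stopifnot(length(s) == 1L, length(e) == 1L, s < e)
  c(generated[seq_len(s)], region, generated[e:length(generated)])
}

old_file <- c(
  "ct_list <- function(list_name) {",
  "  # <<< RECIPE START >>>",
  "  list_name <- toupper(list_name)",
  "  # <<< RECIPE END >>>",
  "  list_name",
  "}"
)
new_stub <- c(
  "ct_list <- function(list_name, projection = NULL) {",
  "  # <<< RECIPE START >>>",
  "  # <<< RECIPE END >>>",
  "  list_name",
  "}"
)

merged <- inject_region(new_stub, extract_region(old_file))
writeLines(merged)
```

Note that extract_region returns NULL for a damaged marker pair, in which case this sketch silently injects nothing; a real generator would need to decide whether to abort instead.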
Advantages:
- Most natural workflow — edit the actual R file you're working with
- Full IDE support with complete file context
- No separate recipe files or registries to maintain
- What you see is what you get: the file in R/ IS the source of truth
Disadvantages:
- Fragile — marker comments can be accidentally deleted, moved, or malformed
- Merges and rebases can corrupt marker boundaries
- Mixes generated and hand-written code in the same file (unclear ownership)
- Generator needs complex parsing logic to extract and re-inject protected regions
- Harder to validate — is the marker region valid? Did the generated portion change in a way that breaks the protected region?
- No clean separation between "what the schema gives us" and "what we added"