Design post-processing recipe system for stub generator #120

@seanthimons

Context

The stub generator (dev/generate_stubs.R) produces clean R wrapper functions from OpenAPI schemas. Some functions need post-processing logic beyond what the generated stub provides: dispatching across multiple stubs, annotation joins, string coercion, DTXSID extraction. Currently this logic lives in hand-written files, which are overwritten whenever stubs are regenerated after a schema change.

We need a system that:

  1. Stores post-processing logic separately from generated output
  2. Automatically applies it during stub generation
  3. Survives schema-driven regeneration (the whole point)
  4. Makes it easy to add new recipes and edit existing ones
  5. Works with the existing lifecycle badge protection in 05_file_scaffold.R

Current functions that need post-processing

  • ct_bioactivity — dispatches to 4 generated stubs by search_type, optional annotation join via secondary API call
  • ct_lists_all — projection selection logic, DTXSID comma-separated string coercion
  • ct_list — uppercase coercion, DTXSID extraction + string split + dedup

More will be added as untested endpoints are validated.
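
To make the kind of logic at stake concrete, here is a minimal sketch of the ct_list-style string handling (uppercase coercion, then DTXSID split and dedup). The helper names `normalize_list_name` and `split_dtxsids` and the sample identifiers are hypothetical, and the actual hand-written code may differ.

```r
# Hypothetical helpers, for illustration only; not the package's actual code.

# Coerce a list name to upper case, dropping surrounding whitespace.
normalize_list_name <- function(list_name) {
  toupper(trimws(list_name))
}

# Split comma-separated DTXSID strings into a deduplicated character vector.
# Assumes `dtxsids` is a character vector such as "DTXSID001,DTXSID002".
split_dtxsids <- function(dtxsids) {
  ids <- unlist(strsplit(dtxsids, ",", fixed = TRUE), use.names = FALSE)
  ids <- trimws(ids)
  ids <- ids[!is.na(ids) & nzchar(ids)]
  unique(ids)
}

# Made-up identifiers, not real substances.
split_dtxsids(c("DTXSID001, DTXSID002", "DTXSID001"))
normalize_list_name(" prodwater ")
```

This is the sort of small, schema-independent code that a recipe would need to carry across regenerations.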

Design Options

Option A: R list registry in a single file

All recipes live in one file (dev/endpoint_eval/09_recipes.R) as a named R list. Each entry contains metadata (title, lifecycle, params) and the function body as a character string. The generator reads this list and produces complete R files.
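
A rough sketch of what such a registry and its renderer might look like. The field names follow the description above, but the exact structure, the `render_recipe` helper, and the example recipe body are assumptions made for illustration.

```r
# Hypothetical registry: one named entry per recipe, body stored as a string.
recipes <- list(
  ct_list = list(
    title     = "Get a chemical list",
    lifecycle = "experimental",
    params    = c("list_name"),
    body      = "list_name <- toupper(list_name)\nlist_name"
  )
)

# Hypothetical renderer: turns one registry entry into the text of an R file.
render_recipe <- function(name, recipe) {
  # parse() raises an error on invalid R, so a syntax error in the body
  # string surfaces here, at generation time, and not while editing.
  parse(text = recipe$body)
  args <- paste(recipe$params, collapse = ", ")
  paste0(
    "#' ", recipe$title, "\n",
    "#' Lifecycle: ", recipe$lifecycle, "\n",
    "#' @export\n",
    name, " <- function(", args, ") {\n",
    recipe$body, "\n",
    "}\n"
  )
}

code <- render_recipe("ct_list", recipes$ct_list)
cat(code)
```

Iterating the registry is just `lapply` over `names(recipes)`, which is where the validation and reporting advantage comes from.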

Advantages:

  • Single source of truth — one file to manage all recipes
  • Follows existing pipeline conventions (all dev/endpoint_eval/ modules are single R files)
  • Easy to iterate the registry programmatically (validation, reporting, drift detection)
  • No file discovery logic needed — just read the list

Disadvantages:

  • Writing R code inside character strings — no syntax highlighting, autocomplete, or linting in IDE
  • Harder to review diffs (string changes vs. real code changes)
  • File grows linearly with number of recipes; complex recipes (like chemi_safety with ~100 lines) make the file unwieldy
  • Syntax errors in recipe bodies are only caught at generation time, not at edit time

Option B: Separate R files per recipe

Each recipe gets its own file in dev/recipes/ (e.g., dev/recipes/ct_bioactivity.R). Each file contains a standard R function definition that the generator reads, wraps with roxygen docs, and writes to R/. Metadata (title, lifecycle, params) could be in roxygen-style comments or a companion list at the top of the file.
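
One possible shape for the discovery step, assuming the "companion list at the top of the file" variant. The `recipe_meta` name, the `load_recipes` helper, and the convention that the function is named after the file are all hypothetical. The example writes a throwaway recipe into a temporary directory so it runs on its own.

```r
# Set up a throwaway recipe file so the sketch is self-contained.
recipe_dir <- file.path(tempdir(), "recipes")
dir.create(recipe_dir, showWarnings = FALSE)
writeLines(c(
  "recipe_meta <- list(title = 'Get a chemical list', lifecycle = 'experimental')",
  "ct_list <- function(list_name) {",
  "  toupper(list_name)",
  "}"
), file.path(recipe_dir, "ct_list.R"))

# Hypothetical loader: sources each recipe into its own environment and
# expects a function named after the file plus a `recipe_meta` list.
load_recipes <- function(dir) {
  files <- list.files(dir, pattern = "\\.R$", full.names = TRUE)
  recipes <- lapply(files, function(path) {
    env <- new.env(parent = globalenv())
    sys.source(path, envir = env)
    name <- tools::file_path_sans_ext(basename(path))
    if (!is.function(env[[name]])) {
      stop("Recipe file ", basename(path), " must define a function named ", name)
    }
    list(name = name, meta = env$recipe_meta, fn = env[[name]])
  })
  names(recipes) <- vapply(recipes, function(r) r$name, character(1))
  recipes
}

loaded <- load_recipes(recipe_dir)
loaded$ct_list$fn("prodwater")
# deparse() gives the function source as text, ready to be wrapped with docs.
deparse(loaded$ct_list$fn)
```

Because each recipe is a real function in a real file, it can also be sourced and called directly in tests without going through the generator.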

Advantages:

  • Full IDE support — syntax highlighting, autocomplete, linting, debugging all work
  • Each recipe is independently readable and reviewable
  • Complex recipes stay manageable (own file, own git history)
  • Easy to test recipes in isolation (source the file, call the function)
  • Git blame works per-recipe

Disadvantages:

  • File discovery logic needed (glob dev/recipes/*.R, parse metadata)
  • Metadata format needs design (roxygen comments? A header list? A companion YAML?)
  • More files to manage
  • Need convention for how the generator extracts the function body vs. metadata

Option C: Marker-protected regions in R/ files

Post-processing is written directly in the generated R/ files as normal R code. Special marker comments (e.g., # <<< RECIPE START >>> / # <<< RECIPE END >>>) delineate hand-written sections. The generator preserves everything between markers during regeneration and only rewrites the generated portions.
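
A minimal sketch of the extract-and-reinject step, assuming exactly one marker pair per file. The helper names and the `ct_list_raw` stub in the sample text are hypothetical, and a real implementation would need to handle multiple regions and damaged markers more gracefully.

```r
start_marker <- "# <<< RECIPE START >>>"
end_marker   <- "# <<< RECIPE END >>>"

# Return the hand-written lines between the markers (markers excluded).
extract_region <- function(lines) {
  s <- which(trimws(lines) == start_marker)
  e <- which(trimws(lines) == end_marker)
  if (length(s) != 1 || length(e) != 1 || e <= s) {
    stop("Malformed or missing recipe markers")
  }
  lines[seq(s + 1, length.out = e - s - 1)]
}

# Place `region` between the markers of a freshly generated template,
# replacing whatever the template had there.
inject_region <- function(template, region) {
  s <- which(trimws(template) == start_marker)
  e <- which(trimws(template) == end_marker)
  if (length(s) != 1 || length(e) != 1 || e <= s) {
    stop("Malformed or missing recipe markers")
  }
  c(template[seq_len(s)], region, template[e:length(template)])
}

# Existing file with a hand-written region.
old_file <- c(
  "ct_list <- function(list_name) {",
  "  # <<< RECIPE START >>>",
  "  list_name <- toupper(list_name)",
  "  # <<< RECIPE END >>>",
  "  ct_list_raw(list_name)",
  "}"
)

# Regenerated output after a schema change, with an empty region.
new_template <- c(
  "ct_list <- function(list_name, projection = NULL) {",
  "  # <<< RECIPE START >>>",
  "  # <<< RECIPE END >>>",
  "  ct_list_raw(list_name, projection)",
  "}"
)

merged <- inject_region(new_template, extract_region(old_file))
writeLines(merged)
```

Note that nothing here checks whether the preserved region still makes sense against the new signature, which is the validation gap this option carries.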

Advantages:

  • Most natural workflow — edit the actual R file you're working with
  • Full IDE support with complete file context
  • No separate recipe files or registries to maintain
  • What you see is what you get — the file in R/ IS the source of truth

Disadvantages:

  • Fragile — marker comments can be accidentally deleted, moved, or malformed
  • Merges and rebases can corrupt marker boundaries
  • Mixes generated and hand-written code in the same file (unclear ownership)
  • Generator needs complex parsing logic to extract and re-inject protected regions
  • Harder to validate — is the marker region valid? Did the generated portion change in a way that breaks the protected region?
  • No clean separation between "what the schema gives us" and "what we added"
