
feat: add pdfvision command for Vision API-based PDF conversion #75

Draft
isamu wants to merge 7 commits into main from feat/pdfvision

Conversation

@isamu
Contributor

@isamu isamu commented Feb 8, 2026

Summary

  • Add pdfvision CLI command that uses Vision API (Gemini/OpenAI) to intelligently convert PDFs into narrated presentations
  • Unlike pdf command (1:1 page-to-slide), pdfvision analyzes document structure, identifies figures, and creates logical slide compositions
  • Pipeline: page images + text extraction → Vision API analysis (1 call) → text LLM narration (1 call) → MulmoScript
  • Deduplicate extractJsonFromResponse into shared llm.ts utility
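The shared helper from the last bullet can be sketched as follows. This is a minimal version assuming the utility only needs to strip an optional markdown code fence before parsing; the actual llm.ts implementation may handle more cases:

```typescript
// Minimal sketch of a shared extractJsonFromResponse utility.
// LLMs often wrap JSON answers in markdown code fences; strip them first.
const extractJsonFromResponse = (response: string): unknown => {
  const trimmed = response.trim();
  // Match an optional leading fence (with or without a "json" tag) and a trailing fence.
  const fenced = trimmed.match(/^`{3}(?:json)?\s*([\s\S]*?)\s*`{3}$/);
  return JSON.parse(fenced ? fenced[1] : trimmed);
};
```

Centralizing this in llm.ts means both document-analysis.ts and narration-generator.ts parse LLM output identically.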

User Prompt

Currently, the pdf command turns each PDF page into its own slide. However, if the content itself is readable, I want the tool to understand it piece by piece and, where images are present, build a presentation with proper explanations grounded in those images.

Test plan

  • 227 unit tests pass (31 new: vision-provider, document-analysis, narration-generator)
  • Build, lint, format all clean
  • Manual test: GEMINI_API_KEY=xxx yarn pdfvision paper.pdf -l ja
  • Manual test: OPENAI_API_KEY=xxx yarn pdfvision paper.pdf --provider openai

Closes #74

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added pdfvision CLI to convert PDFs into narrated slide decks; supports Gemini and OpenAI with auto-detection or explicit selection
  • Dependencies

    • Added support for Gemini Vision (new provider) and introduced GEMINI_API_KEY configuration
  • Tests

    • Added tests for provider resolution, document analysis, and narration generation (CI-friendly, no external LLM calls)
  • Documentation

    • Updated README, CLAUDE, and AGENTS docs with pdfvision workflow and usage examples

isamu and others added 4 commits February 8, 2026 15:58
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement intelligent PDF-to-MulmoScript conversion using Vision API:
- vision-provider: Gemini/OpenAI abstraction with auto-detection
- document-analysis: prompt builder and parser for document structure
- narration-generator: prompt builder and parser for slide narrations
- pdfvision pipeline: images + text → Vision analysis → narration → MulmoScript
- CLI command with --provider option

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for vision-provider, document-analysis, and narration-generator
pure functions (no API calls).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move extractJsonFromResponse to llm.ts as shared utility
- Remove duplicate implementations from document-analysis.ts and narration-generator.ts
- Add pdfvision to AGENTS.md, CLAUDE.md, README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@socket-security

socket-security bot commented Feb 8, 2026

Review the following changes in direct dependencies.

Package: @google/generative-ai@0.24.1 (Added)
Scores: Supply Chain Security 100, Vulnerability 100, Quality 80, Maintenance 85, License 100

View full report

@coderabbitai

coderabbitai bot commented Feb 8, 2026

📝 Walkthrough
🚥 Pre-merge checks: 4 passed, 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'feat: add pdfvision command for Vision API-based PDF conversion' directly and clearly summarizes the main change, adding a new pdfvision CLI command.
  • Linked Issues Check ✅ Passed: The PR implements all core coding requirements from issue #74: Vision API integration with Gemini/OpenAI support, document analysis and figure detection, text-only LLM narration generation, provider selection via CLI flags/env vars, and a modular architecture with comprehensive tests.
  • Out of Scope Changes Check ✅ Passed: All changes are scoped to the pdfvision feature: new utilities for vision provider abstraction, document analysis, narration generation, CLI integration, and supporting tests. Documentation updates reflect the new command. No unrelated changes detected.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@package.json`:
- Line 65: The dependency "@google/generative-ai" in package.json is deprecated;
remove or mark it for replacement by updating package.json to stop depending on
"@google/generative-ai" (or add a placeholder adapter package name) and create a
follow-up change to add a tested alternative library and adapter module (e.g., a
new file that exposes the same functions used in our codebase) so callers of the
existing symbols can be switched with minimal friction; update lockfile
(npm/yarn) and CI, add a short migration note in the repo README or a new issue
referencing "@google/generative-ai" deprecation and the planned replacement, and
run unit/integration tests to confirm no breaking changes.

In `@src/convert/pdfvision.ts`:
- Around line 37-41: buildPageImages currently filters out missing image files
which can make images.length differ from the original pageCount and
desynchronize with extractedTexts passed into buildDocumentAnalysisPrompt;
update buildPageImages to preserve the original pageCount (e.g., return both the
VisionImage[] and the original pageCount or a list of missing page indices), and
add a warning log when images.length !== pageCount that lists which page images
are missing so callers know there was a partial failure; then pass the original
pageCount (not images.length) into buildDocumentAnalysisPrompt or adjust callers
to use the preserved pageCount to keep extractedTexts and images aligned.
- Around line 149-156: The current assignment extractedTexts[page.pageNumber] =
page.text can create a sparse array that buildDocumentAnalysisPrompt may skip;
to fix, ensure extractedTexts is a dense array by pre-sizing or filling it and
assigning consistently (e.g., create extractedTexts = new
Array(pageTexts.length).fill("(no text)") and then set
extractedTexts[page.pageNumber - 1] = page.text), or sort pageTexts by
pageNumber and push page.text in order; update the logic around
extractTextFromPdf, page.pageNumber, extractedTexts and
buildDocumentAnalysisPrompt to guarantee no holes and provide a placeholder for
missing pages.
🧹 Nitpick comments (9)
tests/test_vision_provider.ts (1)

17-53: Consider extracting env save/restore into a helper to reduce boilerplate.

The env variable save/restore pattern is repeated across 4 tests with slight variations. A small helper like withEnv({ GEMINI_API_KEY: "test", OPENAI_API_KEY: undefined }, fn) could reduce duplication and make tests more readable. Not blocking — the current approach is correct.
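The suggested helper could look like this. It is a sketch of the reviewer's hypothetical withEnv; the name and shape are assumptions, not existing test code:

```typescript
// Hypothetical withEnv helper: temporarily set (or, via undefined, unset)
// environment variables, run fn, then restore the originals even on throw.
const withEnv = (vars: Record<string, string | undefined>, fn: () => void): void => {
  const saved: Record<string, string | undefined> = {};
  for (const key of Object.keys(vars)) {
    saved[key] = process.env[key];
    if (vars[key] === undefined) {
      delete process.env[key];
    } else {
      process.env[key] = vars[key];
    }
  }
  try {
    fn();
  } finally {
    // Restore the previous values regardless of how fn exits.
    for (const key of Object.keys(saved)) {
      if (saved[key] === undefined) {
        delete process.env[key];
      } else {
        process.env[key] = saved[key];
      }
    }
  }
};
```

A test would then read withEnv({ GEMINI_API_KEY: "test", OPENAI_API_KEY: undefined }, () => { ... }) with no manual cleanup.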

tests/test_narration_generator.ts (1)

96-156: Solid parsing tests — consider adding a duplicate-index case.

The parsing tests cover valid JSON, markdown-wrapped JSON, missing indices, empty array, invalid JSON, and out-of-order indices nicely. One minor gap: there's no test for duplicate indices (e.g., two entries with index: 0). Per the Map constructor in parseNarrationResponse, the last entry wins — it may be worth codifying that behavior with a test.
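The last-entry-wins behavior follows directly from the Map constructor, and a test codifying it could look like this (illustrative; the actual entry shape in parseNarrationResponse is assumed):

```typescript
// Map keeps the LAST value for a duplicated key, so two narration entries
// sharing index 0 resolve to the later one.
const entries: Array<[number, string]> = [
  [0, "first narration"],
  [0, "second narration"], // duplicate index: last one wins
  [1, "other slide"],
];
const narrationsByIndex = new Map(entries);
```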

src/utils/narration-generator.ts (1)

76-88: Duplicate index values silently drop earlier entries.

If the LLM returns multiple narration entries with the same index, the Map constructor keeps only the last one. This is a reasonable behavior for LLM output, but consider logging a warning when duplicates are detected so issues are visible during debugging.

src/utils/vision-provider.ts (4)

46-52: Gemini client is re-instantiated on every call, unlike the cached OpenAI client.

getOpenAIClient() in llm.ts caches the client in a module-level variable, but getGeminiClient() creates a new GoogleGenerativeAI instance each time. For consistency and to avoid unnecessary object creation (especially when both vision and text calls happen in the same pipeline run), consider caching the Gemini client similarly.

♻️ Cache the Gemini client
+let geminiClient: GoogleGenerativeAI | null = null;
+
 const getGeminiClient = (): GoogleGenerativeAI => {
+  if (geminiClient) {
+    return geminiClient;
+  }
   const apiKey = process.env.GEMINI_API_KEY;
   if (!apiKey) {
     throw new Error("GEMINI_API_KEY environment variable is not set");
   }
-  return new GoogleGenerativeAI(apiKey);
+  geminiClient = new GoogleGenerativeAI(apiKey);
+  return geminiClient;
 };

76-104: flatMap returning single-element arrays — map is simpler here.

Each image produces exactly one ChatCompletionContentPart, so flatMap with a single-element array return is equivalent to map. Using map is more straightforward.

♻️ Simplify to map
-  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.flatMap(
-    (img): OpenAI.Chat.ChatCompletionContentPart[] => {
+  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.map(
+    (img): OpenAI.Chat.ChatCompletionContentPart => {
       const base64 = imageToBase64(img.path);
       const mediaType = getImageMediaType(img.path);
-      return [
-        {
-          type: "image_url",
-          image_url: { url: `data:${mediaType};base64,${base64}`, detail: "low" },
-        },
-      ];
+      return {
+        type: "image_url",
+        image_url: { url: `data:${mediaType};base64,${base64}`, detail: "low" },
+      };
     }
   );

54-74: All page images loaded into memory as base64 simultaneously — potential memory pressure for large PDFs.

For a 50-page PDF at 300 DPI, each page image could be several MB. Converting all to base64 (which is ~33% larger) and holding them in memory simultaneously could be significant. The plan doc acknowledges large PDFs as an open question. For the initial implementation this is acceptable, but worth noting as a scaling concern.


97-101: Add responseMimeType: "application/json" to Gemini's generationConfig for consistency with OpenAI's JSON enforcement.

OpenAI uses response_format: { type: "json_object" } to enforce JSON mode, but the Gemini calls don't set equivalent JSON mode constraints. While extractJsonFromResponse handles markdown-wrapped JSON, Gemini could return unparseable formats without the mode enforced. Set responseMimeType: "application/json" in the generationConfig when creating Gemini models to match OpenAI's guarantees.

This applies to both callGeminiVision (line 56) and callGeminiText (line 120).
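A sketch of what that configuration might look like. The model id is illustrative, and responseMimeType follows the GenerationConfig field documented for the @google/generative-ai SDK; this fragment only builds the options object that would be passed to getGenerativeModel:

```typescript
// Sketch: request JSON mode from Gemini via generationConfig.
// In the real code this object would be passed to client.getGenerativeModel(...)
// for both the vision and text calls.
const buildGeminiJsonModelOptions = (modelId: string) => ({
  model: modelId, // illustrative; the pipeline's actual model id may differ
  generationConfig: {
    responseMimeType: "application/json", // mirrors OpenAI's json_object mode
  },
});
```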

src/utils/document-analysis.ts (1)

122-128: Unsafe as casts on type and importance fields accept any string from LLM output.

If the LLM returns an unexpected value like "photo" for type or "critical" for importance, the cast silently passes through without validation. Since this is LLM-generated data, unexpected values are plausible.

Consider adding a validation/normalization step, or at minimum a set of allowed values to fall back on:

♻️ Add validation for enum-like fields
+const VALID_FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const VALID_IMPORTANCE = new Set(["high", "medium", "low"]);
+
 figures: (parsed.figures ?? []).map((f: Record<string, unknown>) => ({
   page: Number(f.page ?? 0),
-  type: String(f.type ?? "figure") as FigureInfo["type"],
+  type: (VALID_FIGURE_TYPES.has(String(f.type)) ? String(f.type) : "figure") as FigureInfo["type"],
   label: f.label ? String(f.label) : undefined,
   description: String(f.description ?? ""),
-  importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+  importance: (VALID_IMPORTANCE.has(String(f.importance)) ? String(f.importance) : "medium") as FigureInfo["importance"],
 })),
src/convert/pdfvision.ts (1)

80-116: buildMulmoScript is well-structured with proper schema validation.

Good fallback chain for imagePage, schema validation via safeParse, and clear error reporting on validation failure.

One observation: the document title from the analysis is not included in the MulmoScript output. If the schema supports a title/metadata field, it could be worth including.

"test:marp:theme": "tsx src/cli.ts marp samples/custom_theme_demo.md --theme samples/custom-ocean.css"
},
"dependencies": {
"@google/generative-ai": "^0.24.1",

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

@google/generative-ai npm package latest version vulnerabilities

💡 Result:

  • Latest version (npm “latest” dist-tag): @google/generative-ai@0.24.1 (shown as the current latest tag on the npm versions page). [1]
  • Known vulnerabilities (package itself): No direct vulnerabilities currently listed for @google/generative-ai in Snyk’s vulnerability database (note: this excludes issues that might exist only in transitive dependencies). [2]
  • Other security trackers: Aikido Intel also reports 0 vulnerabilities and notes the package has 0 direct dependencies. [3]
  • Maintenance status (important): The package is described as legacy/deprecated, with an end-of-support date of Aug 31, 2025 in its published metadata/readme mirrors—so you should plan migration even if there are no known CVEs today. [4]

Sources: [1] [2] [3] [4]


Plan migration away from @google/generative-ai; package is deprecated with end-of-support August 31, 2025.

While version ^0.24.1 is current and has no known direct vulnerabilities, the package is marked as legacy/deprecated. Update the dependency strategy and begin evaluating alternative libraries before the support deadline.


Comment on lines +37 to +41
const buildPageImages = (imagesDir: string, basename: string, pageCount: number): VisionImage[] => {
  return Array.from({ length: pageCount }, (_, i) => ({
    path: path.join(imagesDir, `${basename}-${i}.png`),
  })).filter((img) => fs.existsSync(img.path));
};

⚠️ Potential issue | 🟡 Minor

buildPageImages filtering by existence can cause a mismatch between image count and page count.

If some page images fail to generate (e.g., ImageMagick partially fails), the filtered images array is shorter than pageCount. This shorter length is then passed as pageCount to buildDocumentAnalysisPrompt (via images.length on Line 52), while extractedTexts still has entries for all pages. The Vision API sees fewer images than the text references, which could confuse the analysis.

Consider logging a warning when the counts diverge, or passing the original pageCount separately.
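One way to sketch that fix. The exists predicate is injected here purely for testability (the real code would pass fs.existsSync), and path joining is simplified to string concatenation:

```typescript
interface VisionImage {
  path: string;
}

// Preserve the original pageCount and report which pages are missing
// instead of silently shrinking the images array.
const buildPageImages = (
  imagesDir: string,
  basename: string,
  pageCount: number,
  exists: (p: string) => boolean
): { images: VisionImage[]; missingPages: number[] } => {
  const images: VisionImage[] = [];
  const missingPages: number[] = [];
  for (let i = 0; i < pageCount; i++) {
    const imagePath = `${imagesDir}/${basename}-${i}.png`;
    if (exists(imagePath)) {
      images.push({ path: imagePath });
    } else {
      missingPages.push(i);
    }
  }
  if (missingPages.length > 0) {
    console.warn(`Missing page images for pages: ${missingPages.join(", ")}`);
  }
  return { images, missingPages };
};
```

The caller can then keep passing the original pageCount to buildDocumentAnalysisPrompt while still knowing which images were dropped.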

🤖 Prompt for AI Agents
In `@src/convert/pdfvision.ts` around lines 37 - 41, buildPageImages currently
filters out missing image files which can make images.length differ from the
original pageCount and desynchronize with extractedTexts passed into
buildDocumentAnalysisPrompt; update buildPageImages to preserve the original
pageCount (e.g., return both the VisionImage[] and the original pageCount or a
list of missing page indices), and add a warning log when images.length !==
pageCount that lists which page images are missing so callers know there was a
partial failure; then pass the original pageCount (not images.length) into
buildDocumentAnalysisPrompt or adjust callers to use the preserved pageCount to
keep extractedTexts and images aligned.

Comment on lines +149 to +156
  // Step 2: Extract text
  console.log("Extracting text from PDF...");
  const pageTexts = await extractTextFromPdf(pdfFile);
  const extractedTexts: string[] = [];
  pageTexts.forEach((page) => {
    extractedTexts[page.pageNumber] = page.text;
  });
  console.log(`Extracted text from ${pageTexts.length} pages`);

⚠️ Potential issue | 🟡 Minor

Sparse array risk: index-based assignment on extractedTexts can produce holes that Array.map silently skips.

If extractTextFromPdf ever skips a page number, extractedTexts[page.pageNumber] = page.text creates a sparse array. When buildDocumentAnalysisPrompt iterates with .map(), JS skips sparse holes entirely — those pages won't appear in the prompt (not even as "(no text)"). In practice pdf-parse likely returns all pages, but this pattern is fragile.
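The hole-skipping can be seen in isolation:

```typescript
// Assigning past the current length creates a sparse array: indices 0 and 1
// become holes, not undefined elements.
const sparse: string[] = [];
sparse[2] = "page text";

let callbackRuns = 0;
const mapped = sparse.map((t) => {
  callbackRuns += 1;
  return t.toUpperCase();
});
// .map preserves the length (3) but invokes the callback only once,
// so holes never reach buildDocumentAnalysisPrompt's formatting logic.
```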

🛡️ Pre-fill the array to guarantee density
   const pageTexts = await extractTextFromPdf(pdfFile);
-  const extractedTexts: string[] = [];
+  const extractedTexts: string[] = Array.from({ length: pageCount }, () => "");
   pageTexts.forEach((page) => {
     extractedTexts[page.pageNumber] = page.text;
   });

- Add bbox field to FigureInfo for figure position coordinates
- Vision API prompt now requests bounding box (% of page) for figures
- ImageMagick crops figures from page images based on bbox
- Slides with figureRef use cropped images instead of full page
- Update plan to include cropping as part of final design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 54-63: The cropFigure function currently hardcodes the "convert"
binary and builds a single shell string, which breaks on macOS/Windows and risks
shell-quoting bugs; change cropFigure to pick the ImageMagick command the same
way convertPdfToImages does (choose "magick" on Windows/macOS when appropriate,
otherwise "convert"), then replace the assembled shell string and execSync call
with execFileSync using the chosen binary and an args array (use unquoted
pageImagePath, "-crop", `${bbox.width}%x${bbox.height}%+${bbox.x}%+${bbox.y}%`,
"+repage", outputPath) and pass {stdio: "pipe"} so paths with spaces are handled
safely and the correct binary is invoked.
🧹 Nitpick comments (2)
plans/pdfvision.md (2)

21-47: Add language specifier to fenced code block.

Static analysis (markdownlint MD040) flags this block. Even for plain-text pipeline descriptions, use a specifier like ```text to satisfy the linter and improve rendering consistency.


53-56: Add language specifier to fenced code block.

Same markdownlint MD040 flag here. Use ```text or similar.

- Convert bbox percentage to pixels using actual image dimensions
- Use identify command to get page image width/height
- Fix ImageMagick -crop syntax (was using % for offset incorrectly)
- Improve bbox prompt with precise instructions and example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 49-57: Replace the unsafe execSync call in getImageDimensions with
execFileSync using an argument array (pass ["-format", "%w %h", imagePath]) to
avoid shell interpolation of imagePath; also detect ImageMagick7 by attempting
"identify" then falling back to "magick" (invoke as ["identify", ...] vs
["magick", "identify", ...]) so the binary exists on macOS/Windows, and surface
the caught error (rather than swallowing) or return null consistently; update
the getImageDimensions function to use execFileSync, the proper arg list, and
explicit error handling.
🧹 Nitpick comments (1)
src/utils/document-analysis.ts (1)

120-167: type and importance casts bypass runtime validation — LLM can return arbitrary strings.

Lines 145 and 148 use as casts which perform no runtime check. If the Vision API returns an unexpected value (e.g., "type": "photo" or "importance": "critical"), it silently passes through with a type that doesn't match the FigureInfo union. Downstream comparisons like figure.type === "figure" would silently fail.

Consider clamping to allowed values:

🛡️ Proposed validation
+const FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const IMPORTANCE_LEVELS = new Set(["high", "medium", "low"]);
+
 // inside the figures mapping:
-        type: String(f.type ?? "figure") as FigureInfo["type"],
+        type: (FIGURE_TYPES.has(String(f.type)) ? String(f.type) : "figure") as FigureInfo["type"],
         // ...
-        importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+        importance: (IMPORTANCE_LEVELS.has(String(f.importance)) ? String(f.importance) : "medium") as FigureInfo["importance"],

Comment on lines +49 to +57
const getImageDimensions = (imagePath: string): { width: number; height: number } | null => {
  try {
    const output = execSync(`identify -format "%w %h" "${imagePath}"`, { encoding: "utf-8" });
    const [w, h] = output.trim().split(" ").map(Number);
    return { width: w, height: h };
  } catch {
    return null;
  }
};

⚠️ Potential issue | 🟠 Major

Shell injection risk in getImageDimensions — use execFileSync with an argument array.

imagePath is interpolated into a shell string. If the PDF filename contains shell metacharacters (e.g., $(cmd) or backticks), execSync will expand them inside the double quotes. The basename is derived from user-provided input via path.basename(pdfFile, ".pdf"), which does not sanitize shell-special characters.

Additionally, identify has the same platform portability issue as convert (flagged in a past review for cropFigure): on ImageMagick 7 on macOS/Windows, the standalone identify binary may not exist.

🔒 Proposed fix — use execFileSync
+import { execSync, execFileSync } from "child_process";
+
+const imageMagickCmd = (): string => {
+  return process.platform === "linux" ? "" : "magick";
+};
+
 const getImageDimensions = (imagePath: string): { width: number; height: number } | null => {
   try {
-    const output = execSync(`identify -format "%w %h" "${imagePath}"`, { encoding: "utf-8" });
+    const cmd = imageMagickCmd();
+    const args = cmd
+      ? [cmd, "identify", "-format", "%w %h", imagePath]
+      : ["identify", "-format", "%w %h", imagePath];
+    const binary = args.shift()!;
+    const output = execFileSync(binary, args, { encoding: "utf-8" });
     const [w, h] = output.trim().split(" ").map(Number);
     return { width: w, height: h };
   } catch {
     return null;
   }
 };

- Convert figure pages at 600dpi (2x) for sharper cropped images
- Add 5% padding around bbox to compensate for LLM estimation errors
- Post-process with ImageMagick -trim to clean whitespace + add border
- Use magick command on macOS (ImageMagick 7 compatibility)
- Change OpenAI Vision detail from "low" to "auto" for better analysis
- Improve prompt to request generous bbox including all labels/captions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 226-235: The mapping that builds beats uses narrations[i] || ""
which silently produces empty text when narrations.length <
analysis.slides.length; update the logic around the beats creation (where
analysis.slides is mapped and narrations is referenced) to detect mismatched
lengths and emit a warning (e.g., console.warn or the repo's logger) when
narrations.length !== analysis.slides.length, and additionally log which slide
index(s) are missing narration so missing text is easier to debug before keeping
the existing empty-string fallback.
🧹 Nitpick comments (5)
src/utils/vision-provider.ts (3)

46-52: Gemini client is re-instantiated on every API call.

getGeminiClient creates a new GoogleGenerativeAI instance each time, unlike the OpenAI singleton pattern in llm.ts. For a CLI tool making 1-2 calls this is fine, but if usage grows, consider caching the client.

♻️ Optional: cache the Gemini client
+let geminiClient: GoogleGenerativeAI | null = null;
+
 const getGeminiClient = (): GoogleGenerativeAI => {
+  if (geminiClient) return geminiClient;
   const apiKey = process.env.GEMINI_API_KEY;
   if (!apiKey) {
     throw new Error("GEMINI_API_KEY environment variable is not set");
   }
-  return new GoogleGenerativeAI(apiKey);
+  geminiClient = new GoogleGenerativeAI(apiKey);
+  return geminiClient;
 };

79-90: flatMap returning single-element arrays is equivalent to map.

The flatMap with a callback that always returns a single-element array adds unnecessary complexity. A plain map would be clearer.

♻️ Simplify to map
-  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.flatMap(
-    (img): OpenAI.Chat.ChatCompletionContentPart[] => {
-      const base64 = imageToBase64(img.path);
-      const mediaType = getImageMediaType(img.path);
-      return [
-        {
-          type: "image_url",
-          image_url: { url: `data:${mediaType};base64,${base64}`, detail: "auto" },
-        },
-      ];
-    }
-  );
+  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.map(
+    (img): OpenAI.Chat.ChatCompletionContentPart => {
+      const base64 = imageToBase64(img.path);
+      const mediaType = getImageMediaType(img.path);
+      return {
+        type: "image_url",
+        image_url: { url: `data:${mediaType};base64,${base64}`, detail: "auto" },
+      };
+    }
+  );

26-44: resolveVisionProvider silently prefers Gemini when both keys are set — worth documenting.

When both GEMINI_API_KEY and OPENAI_API_KEY are set and no preferred is provided, Gemini is silently chosen. This is fine as a default, but a brief log or comment would help users understand the precedence.
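That precedence could be made explicit in code; a sketch follows (the env record is injected for testability, whereas the real resolveVisionProvider presumably reads process.env directly):

```typescript
type VisionProvider = "gemini" | "openai";

// Sketch of the resolution order described above: an explicit preference
// wins; otherwise Gemini takes precedence when both API keys are present.
const resolveVisionProvider = (
  env: Record<string, string | undefined>,
  preferred?: VisionProvider
): VisionProvider => {
  if (preferred) return preferred;
  if (env.GEMINI_API_KEY) return "gemini"; // chosen first when both keys exist
  if (env.OPENAI_API_KEY) return "openai";
  throw new Error("Set GEMINI_API_KEY or OPENAI_API_KEY to use pdfvision");
};
```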

src/utils/document-analysis.ts (1)

146-153: Unsafe type assertions on LLM-provided type and importance values.

String(f.type ?? "figure") as FigureInfo["type"] and String(f.importance ?? "medium") as FigureInfo["importance"] will silently accept any string the LLM returns (e.g., "photo", "critical"), violating the union type at runtime. Since this is best-effort LLM parsing, it's not critical, but a fallback to the default on invalid values would be more robust.

♻️ Validate and default on unexpected values
+const VALID_FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const VALID_IMPORTANCE = new Set(["high", "medium", "low"]);
+
 // inside the figures mapping:
-        type: String(f.type ?? "figure") as FigureInfo["type"],
+        type: VALID_FIGURE_TYPES.has(String(f.type)) ? (String(f.type) as FigureInfo["type"]) : "figure",
         ...
-        importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+        importance: VALID_IMPORTANCE.has(String(f.importance)) ? (String(f.importance) as FigureInfo["importance"]) : "medium",
src/convert/pdfvision.ts (1)

173-177: fs.rmdirSync is deprecated; also, the forEach return value triggers a lint warning.

fs.rmdirSync has been deprecated since Node 16 in favor of fs.rmSync. Additionally, fs.unlinkSync returns void but the static analysis tool flags the implicit return from the arrow function inside forEach.

♻️ Proposed fix
   if (fs.existsSync(highResDir)) {
-    fs.readdirSync(highResDir).forEach((f) => fs.unlinkSync(path.join(highResDir, f)));
-    fs.rmdirSync(highResDir);
+    fs.rmSync(highResDir, { recursive: true });
   }

Comment on lines +226 to +235
const beats = analysis.slides.map((slide, i) => {
  const imagePage = slide.imagePage ?? slide.sourcePages[0] ?? 0;
  const pageImagePath = `./images/${basename}-${imagePage}.png`;
  const imagePath =
    slide.figureRef && figureImageMap.has(slide.figureRef)
      ? figureImageMap.get(slide.figureRef)!
      : pageImagePath;

  return {
    text: narrations[i] || "",

⚠️ Potential issue | 🟡 Minor

Silent fallback to empty string when narration is missing for a slide.

Line 235: narrations[i] || "" silently produces empty text if the narration array is shorter than the slides array (e.g., if the LLM returned fewer narrations). Consider logging a warning when narrations.length !== analysis.slides.length to help debug unexpected empty slides.

🔍 Proposed diagnostic
 const buildMulmoScript = (
   analysis: DocumentAnalysis,
   narrations: string[],
   basename: string,
   lang: SupportedLang,
   figureImageMap: Map<string, string>
 ): z.output<typeof mulmoScriptSchema> => {
+  if (narrations.length !== analysis.slides.length) {
+    console.warn(`Warning: narration count (${narrations.length}) differs from slide count (${analysis.slides.length})`);
+  }
   const beats = analysis.slides.map((slide, i) => {

@isamu isamu marked this pull request as draft February 10, 2026 03:11


Development

Successfully merging this pull request may close these issues.

feat: pdfvision - Vision API-based intelligent PDF to presentation conversion
