
feat: add pdfvision command for Vision API-based PDF conversion #75

Draft
isamu wants to merge 7 commits into main from feat/pdfvision

Conversation

@isamu
Contributor

@isamu isamu commented Feb 8, 2026

Summary

  • Add pdfvision CLI command that uses Vision API (Gemini/OpenAI) to intelligently convert PDFs into narrated presentations
  • Unlike pdf command (1:1 page-to-slide), pdfvision analyzes document structure, identifies figures, and creates logical slide compositions
  • Pipeline: page images + text extraction → Vision API analysis (1 call) → text LLM narration (1 call) → MulmoScript
  • Deduplicate extractJsonFromResponse into shared llm.ts utility
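The shared helper from the last bullet can be sketched as follows. This is a minimal version assuming the utility only needs to strip an optional markdown code fence before parsing; the actual llm.ts implementation may handle more cases:

```typescript
// Minimal sketch of a shared extractJsonFromResponse utility.
// LLMs often wrap JSON answers in markdown code fences; strip them first.
const extractJsonFromResponse = (response: string): unknown => {
  const trimmed = response.trim();
  // Match an optional leading fence (with or without a "json" tag) and a trailing fence.
  const fenced = trimmed.match(/^`{3}(?:json)?\s*([\s\S]*?)\s*`{3}$/);
  return JSON.parse(fenced ? fenced[1] : trimmed);
};
```

Centralizing this in llm.ts means both document-analysis.ts and narration-generator.ts parse LLM output identically.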

User Prompt

Currently, the pdf command turns each PDF page into its own slide. However, if the content itself is readable, I want the tool to understand it piece by piece and, where images are present, build a presentation with proper explanations grounded in those images.

Test plan

  • 227 unit tests pass (31 new: vision-provider, document-analysis, narration-generator)
  • Build, lint, format all clean
  • Manual test: GEMINI_API_KEY=xxx yarn pdfvision paper.pdf -l ja
  • Manual test: OPENAI_API_KEY=xxx yarn pdfvision paper.pdf --provider openai

Closes #74

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added pdfvision CLI to convert PDFs into narrated slide decks; supports Gemini and OpenAI with auto-detection or explicit selection
  • Dependencies

    • Added support for Gemini Vision (new provider) and introduced GEMINI_API_KEY configuration
  • Tests

    • Added tests for provider resolution, document analysis, and narration generation (CI-friendly, no external LLM calls)
  • Documentation

    • Updated README, CLAUDE, and AGENTS docs with pdfvision workflow and usage examples

isamu and others added 4 commits February 8, 2026 15:58
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement intelligent PDF-to-MulmoScript conversion using Vision API:
- vision-provider: Gemini/OpenAI abstraction with auto-detection
- document-analysis: prompt builder and parser for document structure
- narration-generator: prompt builder and parser for slide narrations
- pdfvision pipeline: images + text → Vision analysis → narration → MulmoScript
- CLI command with --provider option

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for vision-provider, document-analysis, and narration-generator
pure functions (no API calls).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move extractJsonFromResponse to llm.ts as shared utility
- Remove duplicate implementations from document-analysis.ts and narration-generator.ts
- Add pdfvision to AGENTS.md, CLAUDE.md, README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@socket-security

socket-security bot commented Feb 8, 2026

Review the following changes in direct dependencies.

Package: @google/generative-ai@0.24.1 (Added)
Scores: Supply Chain Security 100, Vulnerability 100, Quality 80, Maintenance 85, License 100

View full report

@coderabbitai

coderabbitai bot commented Feb 8, 2026

📝 Walkthrough
🚥 Pre-merge checks: 4 passed, 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'feat: add pdfvision command for Vision API-based PDF conversion' directly and clearly summarizes the main change, adding a new pdfvision CLI command.
  • Linked Issues Check ✅ Passed: The PR implements all core coding requirements from issue #74: Vision API integration with Gemini/OpenAI support, document analysis and figure detection, text-only LLM narration generation, provider selection via CLI flags/env vars, and a modular architecture with comprehensive tests.
  • Out of Scope Changes Check ✅ Passed: All changes are scoped to the pdfvision feature: new utilities for vision provider abstraction, document analysis, narration generation, CLI integration, and supporting tests. Documentation updates reflect the new command. No unrelated changes detected.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@package.json`:
- Line 65: The dependency "@google/generative-ai" in package.json is deprecated;
remove or mark it for replacement by updating package.json to stop depending on
"@google/generative-ai" (or add a placeholder adapter package name) and create a
follow-up change to add a tested alternative library and adapter module (e.g., a
new file that exposes the same functions used in our codebase) so callers of the
existing symbols can be switched with minimal friction; update lockfile
(npm/yarn) and CI, add a short migration note in the repo README or a new issue
referencing "@google/generative-ai" deprecation and the planned replacement, and
run unit/integration tests to confirm no breaking changes.

In `@src/convert/pdfvision.ts`:
- Around line 37-41: buildPageImages currently filters out missing image files
which can make images.length differ from the original pageCount and
desynchronize with extractedTexts passed into buildDocumentAnalysisPrompt;
update buildPageImages to preserve the original pageCount (e.g., return both the
VisionImage[] and the original pageCount or a list of missing page indices), and
add a warning log when images.length !== pageCount that lists which page images
are missing so callers know there was a partial failure; then pass the original
pageCount (not images.length) into buildDocumentAnalysisPrompt or adjust callers
to use the preserved pageCount to keep extractedTexts and images aligned.
- Around line 149-156: The current assignment extractedTexts[page.pageNumber] =
page.text can create a sparse array that buildDocumentAnalysisPrompt may skip;
to fix, ensure extractedTexts is a dense array by pre-sizing or filling it and
assigning consistently (e.g., create extractedTexts = new
Array(pageTexts.length).fill("(no text)") and then set
extractedTexts[page.pageNumber - 1] = page.text), or sort pageTexts by
pageNumber and push page.text in order; update the logic around
extractTextFromPdf, page.pageNumber, extractedTexts and
buildDocumentAnalysisPrompt to guarantee no holes and provide a placeholder for
missing pages.
🧹 Nitpick comments (9)
tests/test_vision_provider.ts (1)

17-53: Consider extracting env save/restore into a helper to reduce boilerplate.

The env variable save/restore pattern is repeated across 4 tests with slight variations. A small helper like withEnv({ GEMINI_API_KEY: "test", OPENAI_API_KEY: undefined }, fn) could reduce duplication and make tests more readable. Not blocking — the current approach is correct.
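The suggested helper could look like this. It is a sketch of the reviewer's hypothetical withEnv; the name and shape are assumptions, not existing test code:

```typescript
// Hypothetical withEnv helper: temporarily set (or, via undefined, unset)
// environment variables, run fn, then restore the originals even on throw.
const withEnv = (vars: Record<string, string | undefined>, fn: () => void): void => {
  const saved: Record<string, string | undefined> = {};
  for (const key of Object.keys(vars)) {
    saved[key] = process.env[key];
    if (vars[key] === undefined) {
      delete process.env[key];
    } else {
      process.env[key] = vars[key];
    }
  }
  try {
    fn();
  } finally {
    // Restore the previous values regardless of how fn exits.
    for (const key of Object.keys(saved)) {
      if (saved[key] === undefined) {
        delete process.env[key];
      } else {
        process.env[key] = saved[key];
      }
    }
  }
};
```

A test would then read withEnv({ GEMINI_API_KEY: "test", OPENAI_API_KEY: undefined }, () => { ... }) with no manual cleanup.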

tests/test_narration_generator.ts (1)

96-156: Solid parsing tests — consider adding a duplicate-index case.

The parsing tests cover valid JSON, markdown-wrapped JSON, missing indices, empty array, invalid JSON, and out-of-order indices nicely. One minor gap: there's no test for duplicate indices (e.g., two entries with index: 0). Per the Map constructor in parseNarrationResponse, the last entry wins — it may be worth codifying that behavior with a test.
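The last-entry-wins behavior follows directly from the Map constructor, and a test codifying it could look like this (illustrative; the actual entry shape in parseNarrationResponse is assumed):

```typescript
// Map keeps the LAST value for a duplicated key, so two narration entries
// sharing index 0 resolve to the later one.
const entries: Array<[number, string]> = [
  [0, "first narration"],
  [0, "second narration"], // duplicate index: last one wins
  [1, "other slide"],
];
const narrationsByIndex = new Map(entries);
```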

src/utils/narration-generator.ts (1)

76-88: Duplicate index values silently drop earlier entries.

If the LLM returns multiple narration entries with the same index, the Map constructor keeps only the last one. This is a reasonable behavior for LLM output, but consider logging a warning when duplicates are detected so issues are visible during debugging.

src/utils/vision-provider.ts (4)

46-52: Gemini client is re-instantiated on every call, unlike the cached OpenAI client.

getOpenAIClient() in llm.ts caches the client in a module-level variable, but getGeminiClient() creates a new GoogleGenerativeAI instance each time. For consistency and to avoid unnecessary object creation (especially when both vision and text calls happen in the same pipeline run), consider caching the Gemini client similarly.

♻️ Cache the Gemini client
+let geminiClient: GoogleGenerativeAI | null = null;
+
 const getGeminiClient = (): GoogleGenerativeAI => {
+  if (geminiClient) {
+    return geminiClient;
+  }
   const apiKey = process.env.GEMINI_API_KEY;
   if (!apiKey) {
     throw new Error("GEMINI_API_KEY environment variable is not set");
   }
-  return new GoogleGenerativeAI(apiKey);
+  geminiClient = new GoogleGenerativeAI(apiKey);
+  return geminiClient;
 };

76-104: flatMap returning single-element arrays — map is simpler here.

Each image produces exactly one ChatCompletionContentPart, so flatMap with a single-element array return is equivalent to map. Using map is more straightforward.

♻️ Simplify to map
-  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.flatMap(
-    (img): OpenAI.Chat.ChatCompletionContentPart[] => {
+  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.map(
+    (img): OpenAI.Chat.ChatCompletionContentPart => {
       const base64 = imageToBase64(img.path);
       const mediaType = getImageMediaType(img.path);
-      return [
-        {
-          type: "image_url",
-          image_url: { url: `data:${mediaType};base64,${base64}`, detail: "low" },
-        },
-      ];
+      return {
+        type: "image_url",
+        image_url: { url: `data:${mediaType};base64,${base64}`, detail: "low" },
+      };
     }
   );

54-74: All page images loaded into memory as base64 simultaneously — potential memory pressure for large PDFs.

For a 50-page PDF at 300 DPI, each page image could be several MB. Converting all to base64 (which is ~33% larger) and holding them in memory simultaneously could be significant. The plan doc acknowledges large PDFs as an open question. For the initial implementation this is acceptable, but worth noting as a scaling concern.


97-101: Add responseMimeType: "application/json" to Gemini's generationConfig for consistency with OpenAI's JSON enforcement.

OpenAI uses response_format: { type: "json_object" } to enforce JSON mode, but the Gemini calls don't set equivalent JSON mode constraints. While extractJsonFromResponse handles markdown-wrapped JSON, Gemini could return unparseable formats without the mode enforced. Set responseMimeType: "application/json" in the generationConfig when creating Gemini models to match OpenAI's guarantees.

This applies to both callGeminiVision (line 56) and callGeminiText (line 120).
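A sketch of what that configuration might look like. The model id is illustrative, and responseMimeType follows the GenerationConfig field documented for the @google/generative-ai SDK; this fragment only builds the options object that would be passed to getGenerativeModel:

```typescript
// Sketch: request JSON mode from Gemini via generationConfig.
// In the real code this object would be passed to client.getGenerativeModel(...)
// for both the vision and text calls.
const buildGeminiJsonModelOptions = (modelId: string) => ({
  model: modelId, // illustrative; the pipeline's actual model id may differ
  generationConfig: {
    responseMimeType: "application/json", // mirrors OpenAI's json_object mode
  },
});
```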

src/utils/document-analysis.ts (1)

122-128: Unsafe as casts on type and importance fields accept any string from LLM output.

If the LLM returns an unexpected value like "photo" for type or "critical" for importance, the cast silently passes through without validation. Since this is LLM-generated data, unexpected values are plausible.

Consider adding a validation/normalization step, or at minimum a set of allowed values to fall back on:

♻️ Add validation for enum-like fields
+const VALID_FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const VALID_IMPORTANCE = new Set(["high", "medium", "low"]);
+
 figures: (parsed.figures ?? []).map((f: Record<string, unknown>) => ({
   page: Number(f.page ?? 0),
-  type: String(f.type ?? "figure") as FigureInfo["type"],
+  type: (VALID_FIGURE_TYPES.has(String(f.type)) ? String(f.type) : "figure") as FigureInfo["type"],
   label: f.label ? String(f.label) : undefined,
   description: String(f.description ?? ""),
-  importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+  importance: (VALID_IMPORTANCE.has(String(f.importance)) ? String(f.importance) : "medium") as FigureInfo["importance"],
 })),
src/convert/pdfvision.ts (1)

80-116: buildMulmoScript is well-structured with proper schema validation.

Good fallback chain for imagePage, schema validation via safeParse, and clear error reporting on validation failure.

One observation: the document title from the analysis is not included in the MulmoScript output. If the schema supports a title/metadata field, it could be worth including.

"test:marp:theme": "tsx src/cli.ts marp samples/custom_theme_demo.md --theme samples/custom-ocean.css"
},
"dependencies": {
"@google/generative-ai": "^0.24.1",

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

@google/generative-ai npm package latest version vulnerabilities

💡 Result:

  • Latest version (npm “latest” dist-tag): @google/generative-ai@0.24.1 (shown as the current latest tag on the npm versions page). [1]
  • Known vulnerabilities (package itself): No direct vulnerabilities currently listed for @google/generative-ai in Snyk’s vulnerability database (note: this excludes issues that might exist only in transitive dependencies). [2]
  • Other security trackers: Aikido Intel also reports 0 vulnerabilities and notes the package has 0 direct dependencies. [3]
  • Maintenance status (important): The package is described as legacy/deprecated, with an end-of-support date of Aug 31, 2025 in its published metadata/readme mirrors—so you should plan migration even if there are no known CVEs today. [4]

Sources: [1] [2] [3] [4]


Plan migration away from @google/generative-ai; package is deprecated with end-of-support August 31, 2025.

While version ^0.24.1 is current and has no known direct vulnerabilities, the package is marked as legacy/deprecated. Update the dependency strategy and begin evaluating alternative libraries before the support deadline.


Comment on lines +37 to +41
const buildPageImages = (imagesDir: string, basename: string, pageCount: number): VisionImage[] => {
  return Array.from({ length: pageCount }, (_, i) => ({
    path: path.join(imagesDir, `${basename}-${i}.png`),
  })).filter((img) => fs.existsSync(img.path));
};

⚠️ Potential issue | 🟡 Minor

buildPageImages filtering by existence can cause a mismatch between image count and page count.

If some page images fail to generate (e.g., ImageMagick partially fails), the filtered images array is shorter than pageCount. This shorter length is then passed as pageCount to buildDocumentAnalysisPrompt (via images.length on Line 52), while extractedTexts still has entries for all pages. The Vision API sees fewer images than the text references, which could confuse the analysis.

Consider logging a warning when the counts diverge, or passing the original pageCount separately.
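One way to sketch that fix. The exists predicate is injected here purely for testability (the real code would pass fs.existsSync), and path joining is simplified to string concatenation:

```typescript
interface VisionImage {
  path: string;
}

// Preserve the original pageCount and report which pages are missing
// instead of silently shrinking the images array.
const buildPageImages = (
  imagesDir: string,
  basename: string,
  pageCount: number,
  exists: (p: string) => boolean
): { images: VisionImage[]; missingPages: number[] } => {
  const images: VisionImage[] = [];
  const missingPages: number[] = [];
  for (let i = 0; i < pageCount; i++) {
    const imagePath = `${imagesDir}/${basename}-${i}.png`;
    if (exists(imagePath)) {
      images.push({ path: imagePath });
    } else {
      missingPages.push(i);
    }
  }
  if (missingPages.length > 0) {
    console.warn(`Missing page images for pages: ${missingPages.join(", ")}`);
  }
  return { images, missingPages };
};
```

The caller can then keep passing the original pageCount to buildDocumentAnalysisPrompt while still knowing which images were dropped.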

🤖 Prompt for AI Agents
In `@src/convert/pdfvision.ts` around lines 37 - 41, buildPageImages currently
filters out missing image files which can make images.length differ from the
original pageCount and desynchronize with extractedTexts passed into
buildDocumentAnalysisPrompt; update buildPageImages to preserve the original
pageCount (e.g., return both the VisionImage[] and the original pageCount or a
list of missing page indices), and add a warning log when images.length !==
pageCount that lists which page images are missing so callers know there was a
partial failure; then pass the original pageCount (not images.length) into
buildDocumentAnalysisPrompt or adjust callers to use the preserved pageCount to
keep extractedTexts and images aligned.

Comment on lines +149 to +156
  // Step 2: Extract text
  console.log("Extracting text from PDF...");
  const pageTexts = await extractTextFromPdf(pdfFile);
  const extractedTexts: string[] = [];
  pageTexts.forEach((page) => {
    extractedTexts[page.pageNumber] = page.text;
  });
  console.log(`Extracted text from ${pageTexts.length} pages`);

⚠️ Potential issue | 🟡 Minor

Sparse array risk: index-based assignment on extractedTexts can produce holes that Array.map silently skips.

If extractTextFromPdf ever skips a page number, extractedTexts[page.pageNumber] = page.text creates a sparse array. When buildDocumentAnalysisPrompt iterates with .map(), JS skips sparse holes entirely — those pages won't appear in the prompt (not even as "(no text)"). In practice pdf-parse likely returns all pages, but this pattern is fragile.
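The hole-skipping can be seen in isolation:

```typescript
// Assigning past the current length creates a sparse array: indices 0 and 1
// become holes, not undefined elements.
const sparse: string[] = [];
sparse[2] = "page text";

let callbackRuns = 0;
const mapped = sparse.map((t) => {
  callbackRuns += 1;
  return t.toUpperCase();
});
// .map preserves the length (3) but invokes the callback only once,
// so holes never reach buildDocumentAnalysisPrompt's formatting logic.
```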

🛡️ Pre-fill the array to guarantee density
   const pageTexts = await extractTextFromPdf(pdfFile);
-  const extractedTexts: string[] = [];
+  const extractedTexts: string[] = Array.from({ length: pageCount }, () => "");
   pageTexts.forEach((page) => {
     extractedTexts[page.pageNumber] = page.text;
   });

- Add bbox field to FigureInfo for figure position coordinates
- Vision API prompt now requests bounding box (% of page) for figures
- ImageMagick crops figures from page images based on bbox
- Slides with figureRef use cropped images instead of full page
- Update plan to include cropping as part of final design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 54-63: The cropFigure function currently hardcodes the "convert"
binary and builds a single shell string, which breaks on macOS/Windows and risks
shell-quoting bugs; change cropFigure to pick the ImageMagick command the same
way convertPdfToImages does (choose "magick" on Windows/macOS when appropriate,
otherwise "convert"), then replace the assembled shell string and execSync call
with execFileSync using the chosen binary and an args array (use unquoted
pageImagePath, "-crop", `${bbox.width}%x${bbox.height}%+${bbox.x}%+${bbox.y}%`,
"+repage", outputPath) and pass {stdio: "pipe"} so paths with spaces are handled
safely and the correct binary is invoked.
🧹 Nitpick comments (2)
plans/pdfvision.md (2)

21-47: Add language specifier to fenced code block.

Static analysis (markdownlint MD040) flags this block. Even for plain-text pipeline descriptions, use a specifier like ```text to satisfy the linter and improve rendering consistency.


53-56: Add language specifier to fenced code block.

Same markdownlint MD040 flag here. Use ```text or similar.

- Convert bbox percentage to pixels using actual image dimensions
- Use identify command to get page image width/height
- Fix ImageMagick -crop syntax (was using % for offset incorrectly)
- Improve bbox prompt with precise instructions and example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 49-57: Replace the unsafe execSync call in getImageDimensions with
execFileSync using an argument array (pass ["-format", "%w %h", imagePath]) to
avoid shell interpolation of imagePath; also detect ImageMagick7 by attempting
"identify" then falling back to "magick" (invoke as ["identify", ...] vs
["magick", "identify", ...]) so the binary exists on macOS/Windows, and surface
the caught error (rather than swallowing) or return null consistently; update
the getImageDimensions function to use execFileSync, the proper arg list, and
explicit error handling.
🧹 Nitpick comments (1)
src/utils/document-analysis.ts (1)

120-167: type and importance casts bypass runtime validation — LLM can return arbitrary strings.

Lines 145 and 148 use as casts which perform no runtime check. If the Vision API returns an unexpected value (e.g., "type": "photo" or "importance": "critical"), it silently passes through with a type that doesn't match the FigureInfo union. Downstream comparisons like figure.type === "figure" would silently fail.

Consider clamping to allowed values:

🛡️ Proposed validation
+const FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const IMPORTANCE_LEVELS = new Set(["high", "medium", "low"]);
+
 // inside the figures mapping:
-        type: String(f.type ?? "figure") as FigureInfo["type"],
+        type: (FIGURE_TYPES.has(String(f.type)) ? String(f.type) : "figure") as FigureInfo["type"],
         // ...
-        importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+        importance: (IMPORTANCE_LEVELS.has(String(f.importance)) ? String(f.importance) : "medium") as FigureInfo["importance"],

Comment on lines +49 to +57
const getImageDimensions = (imagePath: string): { width: number; height: number } | null => {
  try {
    const output = execSync(`identify -format "%w %h" "${imagePath}"`, { encoding: "utf-8" });
    const [w, h] = output.trim().split(" ").map(Number);
    return { width: w, height: h };
  } catch {
    return null;
  }
};

⚠️ Potential issue | 🟠 Major

Shell injection risk in getImageDimensions — use execFileSync with an argument array.

imagePath is interpolated into a shell string. If the PDF filename contains shell metacharacters (e.g., $(cmd) or backticks), execSync will expand them inside the double quotes. The basename is derived from user-provided input via path.basename(pdfFile, ".pdf"), which does not sanitize shell-special characters.

Additionally, identify has the same platform portability issue as convert (flagged in a past review for cropFigure): on ImageMagick 7 on macOS/Windows, the standalone identify binary may not exist.

🔒 Proposed fix — use execFileSync
+import { execSync, execFileSync } from "child_process";
+
+const imageMagickCmd = (): string => {
+  return process.platform === "linux" ? "" : "magick";
+};
+
 const getImageDimensions = (imagePath: string): { width: number; height: number } | null => {
   try {
-    const output = execSync(`identify -format "%w %h" "${imagePath}"`, { encoding: "utf-8" });
+    const cmd = imageMagickCmd();
+    const args = cmd
+      ? [cmd, "identify", "-format", "%w %h", imagePath]
+      : ["identify", "-format", "%w %h", imagePath];
+    const binary = args.shift()!;
+    const output = execFileSync(binary, args, { encoding: "utf-8" });
     const [w, h] = output.trim().split(" ").map(Number);
     return { width: w, height: h };
   } catch {
     return null;
   }
 };

- Convert figure pages at 600dpi (2x) for sharper cropped images
- Add 5% padding around bbox to compensate for LLM estimation errors
- Post-process with ImageMagick -trim to clean whitespace + add border
- Use magick command on macOS (ImageMagick 7 compatibility)
- Change OpenAI Vision detail from "low" to "auto" for better analysis
- Improve prompt to request generous bbox including all labels/captions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 226-235: The mapping that builds beats uses narrations[i] || ""
which silently produces empty text when narrations.length <
analysis.slides.length; update the logic around the beats creation (where
analysis.slides is mapped and narrations is referenced) to detect mismatched
lengths and emit a warning (e.g., console.warn or the repo's logger) when
narrations.length !== analysis.slides.length, and additionally log which slide
index(s) are missing narration so missing text is easier to debug before keeping
the existing empty-string fallback.
🧹 Nitpick comments (5)
src/utils/vision-provider.ts (3)

46-52: Gemini client is re-instantiated on every API call.

getGeminiClient creates a new GoogleGenerativeAI instance each time, unlike the OpenAI singleton pattern in llm.ts. For a CLI tool making 1-2 calls this is fine, but if usage grows, consider caching the client.

♻️ Optional: cache the Gemini client
+let geminiClient: GoogleGenerativeAI | null = null;
+
 const getGeminiClient = (): GoogleGenerativeAI => {
+  if (geminiClient) return geminiClient;
   const apiKey = process.env.GEMINI_API_KEY;
   if (!apiKey) {
     throw new Error("GEMINI_API_KEY environment variable is not set");
   }
-  return new GoogleGenerativeAI(apiKey);
+  geminiClient = new GoogleGenerativeAI(apiKey);
+  return geminiClient;
 };

79-90: flatMap returning single-element arrays is equivalent to map.

The flatMap with a callback that always returns a single-element array adds unnecessary complexity. A plain map would be clearer.

♻️ Simplify to map
-  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.flatMap(
-    (img): OpenAI.Chat.ChatCompletionContentPart[] => {
-      const base64 = imageToBase64(img.path);
-      const mediaType = getImageMediaType(img.path);
-      return [
-        {
-          type: "image_url",
-          image_url: { url: `data:${mediaType};base64,${base64}`, detail: "auto" },
-        },
-      ];
-    }
-  );
+  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.map(
+    (img): OpenAI.Chat.ChatCompletionContentPart => {
+      const base64 = imageToBase64(img.path);
+      const mediaType = getImageMediaType(img.path);
+      return {
+        type: "image_url",
+        image_url: { url: `data:${mediaType};base64,${base64}`, detail: "auto" },
+      };
+    }
+  );

26-44: resolveVisionProvider silently prefers Gemini when both keys are set — worth documenting.

When both GEMINI_API_KEY and OPENAI_API_KEY are set and no preferred is provided, Gemini is silently chosen. This is fine as a default, but a brief log or comment would help users understand the precedence.
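That precedence could be made explicit in code; a sketch follows (the env record is injected for testability, whereas the real resolveVisionProvider presumably reads process.env directly):

```typescript
type VisionProvider = "gemini" | "openai";

// Sketch of the resolution order described above: an explicit preference
// wins; otherwise Gemini takes precedence when both API keys are present.
const resolveVisionProvider = (
  env: Record<string, string | undefined>,
  preferred?: VisionProvider
): VisionProvider => {
  if (preferred) return preferred;
  if (env.GEMINI_API_KEY) return "gemini"; // chosen first when both keys exist
  if (env.OPENAI_API_KEY) return "openai";
  throw new Error("Set GEMINI_API_KEY or OPENAI_API_KEY to use pdfvision");
};
```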

src/utils/document-analysis.ts (1)

146-153: Unsafe type assertions on LLM-provided type and importance values.

String(f.type ?? "figure") as FigureInfo["type"] and String(f.importance ?? "medium") as FigureInfo["importance"] will silently accept any string the LLM returns (e.g., "photo", "critical"), violating the union type at runtime. Since this is best-effort LLM parsing, it's not critical, but a fallback to the default on invalid values would be more robust.

♻️ Validate and default on unexpected values
+const VALID_FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const VALID_IMPORTANCE = new Set(["high", "medium", "low"]);
+
 // inside the figures mapping:
-        type: String(f.type ?? "figure") as FigureInfo["type"],
+        type: VALID_FIGURE_TYPES.has(String(f.type)) ? (String(f.type) as FigureInfo["type"]) : "figure",
         ...
-        importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+        importance: VALID_IMPORTANCE.has(String(f.importance)) ? (String(f.importance) as FigureInfo["importance"]) : "medium",
src/convert/pdfvision.ts (1)

173-177: fs.rmdirSync is deprecated; also, the forEach return value triggers a lint warning.

fs.rmdirSync has been deprecated since Node 16 in favor of fs.rmSync. Additionally, fs.unlinkSync returns void but the static analysis tool flags the implicit return from the arrow function inside forEach.

♻️ Proposed fix
   if (fs.existsSync(highResDir)) {
-    fs.readdirSync(highResDir).forEach((f) => fs.unlinkSync(path.join(highResDir, f)));
-    fs.rmdirSync(highResDir);
+    fs.rmSync(highResDir, { recursive: true });
   }

Comment on lines +226 to +235
const beats = analysis.slides.map((slide, i) => {
  const imagePage = slide.imagePage ?? slide.sourcePages[0] ?? 0;
  const pageImagePath = `./images/${basename}-${imagePage}.png`;
  const imagePath =
    slide.figureRef && figureImageMap.has(slide.figureRef)
      ? figureImageMap.get(slide.figureRef)!
      : pageImagePath;

  return {
    text: narrations[i] || "",

⚠️ Potential issue | 🟡 Minor

Silent fallback to empty string when narration is missing for a slide.

Line 235: narrations[i] || "" silently produces empty text if the narration array is shorter than the slides array (e.g., if the LLM returned fewer narrations). Consider logging a warning when narrations.length !== analysis.slides.length to help debug unexpected empty slides.

🔍 Proposed diagnostic
 const buildMulmoScript = (
   analysis: DocumentAnalysis,
   narrations: string[],
   basename: string,
   lang: SupportedLang,
   figureImageMap: Map<string, string>
 ): z.output<typeof mulmoScriptSchema> => {
+  if (narrations.length !== analysis.slides.length) {
+    console.warn(`Warning: narration count (${narrations.length}) differs from slide count (${analysis.slides.length})`);
+  }
   const beats = analysis.slides.map((slide, i) => {

@isamu isamu marked this pull request as draft February 10, 2026 03:11


Development

Successfully merging this pull request may close these issues.

feat: pdfvision - Vision API-based intelligent PDF to presentation conversion
