feat: add pdfvision command for Vision API-based PDF conversion (#75)
Conversation
Implement intelligent PDF-to-MulmoScript conversion using Vision API:
- vision-provider: Gemini/OpenAI abstraction with auto-detection
- document-analysis: prompt builder and parser for document structure
- narration-generator: prompt builder and parser for slide narrations
- pdfvision pipeline: images + text → Vision analysis → narration → MulmoScript
- CLI command with --provider option

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for vision-provider, document-analysis, and narration-generator pure functions (no API calls).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move extractJsonFromResponse to llm.ts as shared utility
- Remove duplicate implementations from document-analysis.ts and narration-generator.ts
- Add pdfvision to AGENTS.md, CLAUDE.md, README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
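A shared utility like the `extractJsonFromResponse` mentioned above typically strips optional Markdown code fences from LLM output before parsing. The following is a hypothetical sketch of what such a helper might look like — the actual implementation in `llm.ts` may differ:

```typescript
// Hypothetical sketch: extract JSON from an LLM response that may be wrapped
// in a Markdown code fence. Not the actual llm.ts implementation.
const extractJsonFromResponse = (response: string): unknown => {
  const trimmed = response.trim();
  // Strip a leading ```json (or bare ```) fence and the trailing fence, if present.
  const fenceMatch = trimmed.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/);
  const jsonText = fenceMatch ? fenceMatch[1] : trimmed;
  return JSON.parse(jsonText);
};

console.log(extractJsonFromResponse('{"a": 1}'));
console.log(extractJsonFromResponse('```json\n{"a": 1}\n```'));
```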
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
📝 Walkthrough

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@package.json`:
- Line 65: The dependency "@google/generative-ai" in package.json is deprecated;
remove or mark it for replacement by updating package.json to stop depending on
"@google/generative-ai" (or add a placeholder adapter package name) and create a
follow-up change to add a tested alternative library and adapter module (e.g., a
new file that exposes the same functions used in our codebase) so callers of the
existing symbols can be switched with minimal friction; update lockfile
(npm/yarn) and CI, add a short migration note in the repo README or a new issue
referencing "@google/generative-ai" deprecation and the planned replacement, and
run unit/integration tests to confirm no breaking changes.
In `@src/convert/pdfvision.ts`:
- Around line 37-41: buildPageImages currently filters out missing image files
which can make images.length differ from the original pageCount and
desynchronize with extractedTexts passed into buildDocumentAnalysisPrompt;
update buildPageImages to preserve the original pageCount (e.g., return both the
VisionImage[] and the original pageCount or a list of missing page indices), and
add a warning log when images.length !== pageCount that lists which page images
are missing so callers know there was a partial failure; then pass the original
pageCount (not images.length) into buildDocumentAnalysisPrompt or adjust callers
to use the preserved pageCount to keep extractedTexts and images aligned.
- Around line 149-156: The current assignment extractedTexts[page.pageNumber] =
page.text can create a sparse array that buildDocumentAnalysisPrompt may skip;
to fix, ensure extractedTexts is a dense array by pre-sizing or filling it and
assigning consistently (e.g., create extractedTexts = new
Array(pageTexts.length).fill("(no text)") and then set
extractedTexts[page.pageNumber - 1] = page.text), or sort pageTexts by
pageNumber and push page.text in order; update the logic around
extractTextFromPdf, page.pageNumber, extractedTexts and
buildDocumentAnalysisPrompt to guarantee no holes and provide a placeholder for
missing pages.
🧹 Nitpick comments (9)
tests/test_vision_provider.ts (1)
17-53: Consider extracting env save/restore into a helper to reduce boilerplate.

The env variable save/restore pattern is repeated across 4 tests with slight variations. A small helper like `withEnv({ GEMINI_API_KEY: "test", OPENAI_API_KEY: undefined }, fn)` could reduce duplication and make tests more readable. Not blocking — the current approach is correct.

tests/test_narration_generator.ts (1)
96-156: Solid parsing tests — consider adding a duplicate-index case.

The parsing tests cover valid JSON, markdown-wrapped JSON, missing indices, empty array, invalid JSON, and out-of-order indices nicely. One minor gap: there's no test for duplicate indices (e.g., two entries with `index: 0`). Per the `Map` constructor in `parseNarrationResponse`, the last entry wins — it may be worth codifying that behavior with a test.

src/utils/narration-generator.ts (1)
76-88: Duplicate `index` values silently drop earlier entries.

If the LLM returns multiple narration entries with the same `index`, the `Map` constructor keeps only the last one. This is reasonable behavior for LLM output, but consider logging a warning when duplicates are detected so issues are visible during debugging.

src/utils/vision-provider.ts (4)
46-52: Gemini client is re-instantiated on every call, unlike the cached OpenAI client.
`getOpenAIClient()` in `llm.ts` caches the client in a module-level variable, but `getGeminiClient()` creates a new `GoogleGenerativeAI` instance each time. For consistency and to avoid unnecessary object creation (especially when both vision and text calls happen in the same pipeline run), consider caching the Gemini client similarly.

♻️ Cache the Gemini client

+let geminiClient: GoogleGenerativeAI | null = null;
+
 const getGeminiClient = (): GoogleGenerativeAI => {
+  if (geminiClient) {
+    return geminiClient;
+  }
   const apiKey = process.env.GEMINI_API_KEY;
   if (!apiKey) {
     throw new Error("GEMINI_API_KEY environment variable is not set");
   }
-  return new GoogleGenerativeAI(apiKey);
+  geminiClient = new GoogleGenerativeAI(apiKey);
+  return geminiClient;
 };
76-104: `flatMap` returning single-element arrays — `map` is simpler here.

Each image produces exactly one `ChatCompletionContentPart`, so `flatMap` with a single-element array return is equivalent to `map`. Using `map` is more straightforward.

♻️ Simplify to map

-  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.flatMap(
-    (img): OpenAI.Chat.ChatCompletionContentPart[] => {
+  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.map(
+    (img): OpenAI.Chat.ChatCompletionContentPart => {
       const base64 = imageToBase64(img.path);
       const mediaType = getImageMediaType(img.path);
-      return [
-        {
-          type: "image_url",
-          image_url: { url: `data:${mediaType};base64,${base64}`, detail: "low" },
-        },
-      ];
+      return {
+        type: "image_url",
+        image_url: { url: `data:${mediaType};base64,${base64}`, detail: "low" },
+      };
     }
   );
54-74: All page images loaded into memory as base64 simultaneously — potential memory pressure for large PDFs.

For a 50-page PDF at 300 DPI, each page image could be several MB. Converting all to base64 (which is ~33% larger) and holding them in memory simultaneously could be significant. The plan doc acknowledges large PDFs as an open question. For the initial implementation this is acceptable, but worth noting as a scaling concern.
97-101: Add `responseMimeType: "application/json"` to Gemini's generationConfig for consistency with OpenAI's JSON enforcement.

OpenAI uses `response_format: { type: "json_object" }` to enforce JSON mode, but the Gemini calls don't set equivalent JSON mode constraints. While `extractJsonFromResponse` handles markdown-wrapped JSON, Gemini could return unparseable formats without the mode enforced. Set `responseMimeType: "application/json"` in the `generationConfig` when creating Gemini models to match OpenAI's guarantees. This applies to both `callGeminiVision` (line 56) and `callGeminiText` (line 120).

src/utils/document-analysis.ts (1)
122-128: Unsafe `as` casts on `type` and `importance` fields accept any string from LLM output.

If the LLM returns an unexpected value like `"photo"` for `type` or `"critical"` for `importance`, the cast silently passes through without validation. Since this is LLM-generated data, unexpected values are plausible. Consider adding a validation/normalization step, or at minimum a set of allowed values to fall back on:

♻️ Add validation for enum-like fields

+const VALID_FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const VALID_IMPORTANCE = new Set(["high", "medium", "low"]);
+
 figures: (parsed.figures ?? []).map((f: Record<string, unknown>) => ({
   page: Number(f.page ?? 0),
-  type: String(f.type ?? "figure") as FigureInfo["type"],
+  type: (VALID_FIGURE_TYPES.has(String(f.type)) ? String(f.type) : "figure") as FigureInfo["type"],
   label: f.label ? String(f.label) : undefined,
   description: String(f.description ?? ""),
-  importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+  importance: (VALID_IMPORTANCE.has(String(f.importance)) ? String(f.importance) : "medium") as FigureInfo["importance"],
 })),

src/convert/pdfvision.ts (1)
80-116: `buildMulmoScript` is well-structured with proper schema validation.

Good fallback chain for `imagePage`, schema validation via `safeParse`, and clear error reporting on validation failure. One observation: the document `title` from the analysis is not included in the MulmoScript output. If the schema supports a title/metadata field, it could be worth including.
    "test:marp:theme": "tsx src/cli.ts marp samples/custom_theme_demo.md --theme samples/custom-ocean.css"
  },
  "dependencies": {
    "@google/generative-ai": "^0.24.1",
🧩 Analysis chain
🌐 Web query:
@google/generative-ai npm package latest version vulnerabilities
💡 Result:
- Latest version (npm “latest” dist-tag): `@google/generative-ai@0.24.1` (shown as the current `latest` tag on the npm versions page). [1]
- Known vulnerabilities (package itself): No direct vulnerabilities currently listed for `@google/generative-ai` in Snyk’s vulnerability database (note: this excludes issues that might exist only in transitive dependencies). [2]
- Other security trackers: Aikido Intel also reports 0 vulnerabilities and notes the package has 0 direct dependencies. [3]
- Maintenance status (important): The package is described as legacy/deprecated, with an end-of-support date of Aug 31, 2025 in its published metadata/readme mirrors—so you should plan migration even if there are no known CVEs today. [4]
Sources: [1] [2] [3] [4]
Plan migration away from @google/generative-ai; package is deprecated with end-of-support August 31, 2025.
While version ^0.24.1 is current and has no known direct vulnerabilities, the package is marked as legacy/deprecated. Update the dependency strategy and begin evaluating alternative libraries before the support deadline.
const buildPageImages = (imagesDir: string, basename: string, pageCount: number): VisionImage[] => {
  return Array.from({ length: pageCount }, (_, i) => ({
    path: path.join(imagesDir, `${basename}-${i}.png`),
  })).filter((img) => fs.existsSync(img.path));
};
buildPageImages filtering by existence can cause a mismatch between image count and page count.
If some page images fail to generate (e.g., ImageMagick partially fails), the filtered images array is shorter than pageCount. This shorter length is then passed as pageCount to buildDocumentAnalysisPrompt (via images.length on Line 52), while extractedTexts still has entries for all pages. The Vision API sees fewer images than the text references, which could confuse the analysis.
Consider logging a warning when the counts diverge, or passing the original pageCount separately.
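One way to preserve the original page count is to widen the return shape. The sketch below uses a hypothetical `PageImagesResult` type (the real `VisionImage` shape and naming in the codebase may differ) — callers would then pass `result.pageCount`, not `result.images.length`, into `buildDocumentAnalysisPrompt`:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface VisionImage {
  path: string;
}

// Hypothetical return shape: images plus the original page count and
// the indices of any pages whose rendered image is missing.
interface PageImagesResult {
  images: VisionImage[];
  pageCount: number;      // original count, independent of missing files
  missingPages: number[]; // 0-based indices of pages with no image file
}

const buildPageImages = (imagesDir: string, basename: string, pageCount: number): PageImagesResult => {
  const all = Array.from({ length: pageCount }, (_, i) => ({
    path: path.join(imagesDir, `${basename}-${i}.png`),
  }));
  const images = all.filter((img) => fs.existsSync(img.path));
  const missingPages = all.flatMap((img, i) => (fs.existsSync(img.path) ? [] : [i]));
  if (missingPages.length > 0) {
    // Surface partial failures instead of silently shrinking the array.
    console.warn(`Warning: missing page images for pages: ${missingPages.join(", ")}`);
  }
  return { images, pageCount, missingPages };
};
```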
// Step 2: Extract text
console.log("Extracting text from PDF...");
const pageTexts = await extractTextFromPdf(pdfFile);
const extractedTexts: string[] = [];
pageTexts.forEach((page) => {
  extractedTexts[page.pageNumber] = page.text;
});
console.log(`Extracted text from ${pageTexts.length} pages`);
Sparse array risk: index-based assignment on extractedTexts can produce holes that Array.map silently skips.
If extractTextFromPdf ever skips a page number, extractedTexts[page.pageNumber] = page.text creates a sparse array. When buildDocumentAnalysisPrompt iterates with .map(), JS skips sparse holes entirely — those pages won't appear in the prompt (not even as "(no text)"). In practice pdf-parse likely returns all pages, but this pattern is fragile.
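The hole-skipping behavior described above is easy to verify in isolation — `Array.prototype.map` never invokes its callback for holes, which is distinct from an index that explicitly holds `undefined`:

```typescript
// Demonstration: index-based assignment past the current length creates a
// sparse array, and map() silently skips the holes.
const sparse: string[] = [];
sparse[2] = "text of page 3"; // indices 0 and 1 become holes, not undefined slots

let calls = 0;
const mapped = sparse.map((t) => {
  calls += 1;
  return `- ${t}`;
});

console.log(sparse.length); // 3
console.log(calls);         // 1 — the callback ran only for index 2
console.log(0 in sparse);   // false — index 0 is a hole
console.log(mapped[2]);     // "- text of page 3"
```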
🛡️ Pre-fill the array to guarantee density
const pageTexts = await extractTextFromPdf(pdfFile);
- const extractedTexts: string[] = [];
+ const extractedTexts: string[] = Array.from({ length: pageCount }, () => "");
pageTexts.forEach((page) => {
extractedTexts[page.pageNumber] = page.text;
 });

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Step 2: Extract text
console.log("Extracting text from PDF...");
const pageTexts = await extractTextFromPdf(pdfFile);
const extractedTexts: string[] = Array.from({ length: pageTexts.length }, () => "");
pageTexts.forEach((page) => {
  extractedTexts[page.pageNumber] = page.text;
});
console.log(`Extracted text from ${pageTexts.length} pages`);
- Add bbox field to FigureInfo for figure position coordinates
- Vision API prompt now requests bounding box (% of page) for figures
- ImageMagick crops figures from page images based on bbox
- Slides with figureRef use cropped images instead of full page
- Update plan to include cropping as part of final design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 54-63: The cropFigure function currently hardcodes the "convert"
binary and builds a single shell string, which breaks on macOS/Windows and risks
shell-quoting bugs; change cropFigure to pick the ImageMagick command the same
way convertPdfToImages does (choose "magick" on Windows/macOS when appropriate,
otherwise "convert"), then replace the assembled shell string and execSync call
with execFileSync using the chosen binary and an args array (use unquoted
pageImagePath, "-crop", `${bbox.width}%x${bbox.height}%+${bbox.x}%+${bbox.y}%`,
"+repage", outputPath) and pass {stdio: "pipe"} so paths with spaces are handled
safely and the correct binary is invoked.
🧹 Nitpick comments (2)
plans/pdfvision.md (2)
21-47: Add language specifier to fenced code block.

Static analysis (markdownlint MD040) flags this block. Even for plain-text pipeline descriptions, use a specifier like `text` after the opening fence to satisfy the linter and improve rendering consistency.

53-56: Add language specifier to fenced code block.

Same markdownlint MD040 flag here. Use `text` or similar.
- Convert bbox percentage to pixels using actual image dimensions
- Use identify command to get page image width/height
- Fix ImageMagick -crop syntax (was using % for offset incorrectly)
- Improve bbox prompt with precise instructions and example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 49-57: Replace the unsafe execSync call in getImageDimensions with
execFileSync using an argument array (pass ["-format", "%w %h", imagePath]) to
avoid shell interpolation of imagePath; also detect ImageMagick7 by attempting
"identify" then falling back to "magick" (invoke as ["identify", ...] vs
["magick", "identify", ...]) so the binary exists on macOS/Windows, and surface
the caught error (rather than swallowing) or return null consistently; update
the getImageDimensions function to use execFileSync, the proper arg list, and
explicit error handling.
🧹 Nitpick comments (1)
src/utils/document-analysis.ts (1)
120-167: `type` and `importance` casts bypass runtime validation — LLM can return arbitrary strings.

Lines 145 and 148 use `as` casts which perform no runtime check. If the Vision API returns an unexpected value (e.g., `"type": "photo"` or `"importance": "critical"`), it silently passes through with a type that doesn't match the `FigureInfo` union. Downstream comparisons like `figure.type === "figure"` would silently fail. Consider clamping to allowed values:

🛡️ Proposed validation

+const FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const IMPORTANCE_LEVELS = new Set(["high", "medium", "low"]);
+
 // inside the figures mapping:
-  type: String(f.type ?? "figure") as FigureInfo["type"],
+  type: (FIGURE_TYPES.has(String(f.type)) ? String(f.type) : "figure") as FigureInfo["type"],
 // ...
-  importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+  importance: (IMPORTANCE_LEVELS.has(String(f.importance)) ? String(f.importance) : "medium") as FigureInfo["importance"],
const getImageDimensions = (imagePath: string): { width: number; height: number } | null => {
  try {
    const output = execSync(`identify -format "%w %h" "${imagePath}"`, { encoding: "utf-8" });
    const [w, h] = output.trim().split(" ").map(Number);
    return { width: w, height: h };
  } catch {
    return null;
  }
};
Shell injection risk in getImageDimensions — use execFileSync with an argument array.
imagePath is interpolated into a shell string. If the PDF filename contains shell metacharacters (e.g., $(cmd) or backticks), execSync will expand them inside the double quotes. The basename is derived from user-provided input via path.basename(pdfFile, ".pdf"), which does not sanitize shell-special characters.
Additionally, identify has the same platform portability issue as convert (flagged in a past review for cropFigure): on ImageMagick 7 on macOS/Windows, the standalone identify binary may not exist.
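The practical difference is that `execFileSync` hands each argument to the binary verbatim, with no shell in between, so metacharacters in a filename are never expanded. A quick demonstration using `echo` as a stand-in binary (assumes a POSIX environment with `echo` on PATH — not the actual `identify` call):

```typescript
import { execSync, execFileSync } from "node:child_process";

// A filename carrying a command-substitution payload.
const hostile = "page-$(echo pwned).png";

// execFileSync: no shell involved, the argument arrives literally.
const safe = execFileSync("echo", [hostile], { encoding: "utf-8" }).trim();
console.log(safe); // page-$(echo pwned).png

// execSync with string interpolation: the shell expands $(...) even
// inside double quotes — exactly the risk flagged above.
const unsafe = execSync(`echo "${hostile}"`, { encoding: "utf-8" }).trim();
console.log(unsafe); // page-pwned.png
```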
🔒 Proposed fix — use execFileSync
+import { execSync, execFileSync } from "child_process";
+
+const imageMagickCmd = (): string => {
+ return process.platform === "linux" ? "" : "magick";
+};
+
const getImageDimensions = (imagePath: string): { width: number; height: number } | null => {
try {
- const output = execSync(`identify -format "%w %h" "${imagePath}"`, { encoding: "utf-8" });
+ const cmd = imageMagickCmd();
+ const args = cmd
+ ? [cmd, "identify", "-format", "%w %h", imagePath]
+ : ["identify", "-format", "%w %h", imagePath];
+ const binary = args.shift()!;
+ const output = execFileSync(binary, args, { encoding: "utf-8" });
const [w, h] = output.trim().split(" ").map(Number);
return { width: w, height: h };
} catch {
return null;
}
};
- Convert figure pages at 600dpi (2x) for sharper cropped images
- Add 5% padding around bbox to compensate for LLM estimation errors
- Post-process with ImageMagick -trim to clean whitespace + add border
- Use magick command on macOS (ImageMagick 7 compatibility)
- Change OpenAI Vision detail from "low" to "auto" for better analysis
- Improve prompt to request generous bbox including all labels/captions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/convert/pdfvision.ts`:
- Around line 226-235: The mapping that builds beats uses narrations[i] || ""
which silently produces empty text when narrations.length <
analysis.slides.length; update the logic around the beats creation (where
analysis.slides is mapped and narrations is referenced) to detect mismatched
lengths and emit a warning (e.g., console.warn or the repo's logger) when
narrations.length !== analysis.slides.length, and additionally log which slide
index(s) are missing narration so missing text is easier to debug before keeping
the existing empty-string fallback.
🧹 Nitpick comments (5)
src/utils/vision-provider.ts (3)
46-52: Gemini client is re-instantiated on every API call.
`getGeminiClient` creates a new `GoogleGenerativeAI` instance each time, unlike the OpenAI singleton pattern in `llm.ts`. For a CLI tool making 1-2 calls this is fine, but if usage grows, consider caching the client.

♻️ Optional: cache the Gemini client

+let geminiClient: GoogleGenerativeAI | null = null;
+
 const getGeminiClient = (): GoogleGenerativeAI => {
+  if (geminiClient) return geminiClient;
   const apiKey = process.env.GEMINI_API_KEY;
   if (!apiKey) {
     throw new Error("GEMINI_API_KEY environment variable is not set");
   }
-  return new GoogleGenerativeAI(apiKey);
+  geminiClient = new GoogleGenerativeAI(apiKey);
+  return geminiClient;
 };
79-90: `flatMap` returning single-element arrays is equivalent to `map`.

The `flatMap` with a callback that always returns a single-element array adds unnecessary complexity. A plain `map` would be clearer.

♻️ Simplify to map

-  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.flatMap(
-    (img): OpenAI.Chat.ChatCompletionContentPart[] => {
-      const base64 = imageToBase64(img.path);
-      const mediaType = getImageMediaType(img.path);
-      return [
-        {
-          type: "image_url",
-          image_url: { url: `data:${mediaType};base64,${base64}`, detail: "auto" },
-        },
-      ];
-    }
-  );
+  const imageContents: OpenAI.Chat.ChatCompletionContentPart[] = request.images.map(
+    (img): OpenAI.Chat.ChatCompletionContentPart => {
+      const base64 = imageToBase64(img.path);
+      const mediaType = getImageMediaType(img.path);
+      return {
+        type: "image_url",
+        image_url: { url: `data:${mediaType};base64,${base64}`, detail: "auto" },
+      };
+    }
+  );
26-44: `resolveVisionProvider` silently prefers Gemini when both keys are set — worth documenting.

When both `GEMINI_API_KEY` and `OPENAI_API_KEY` are set and no `preferred` is provided, Gemini is silently chosen. This is fine as a default, but a brief log or comment would help users understand the precedence.

src/utils/document-analysis.ts (1)
146-153: Unsafe type assertions on LLM-provided `type` and `importance` values.

`String(f.type ?? "figure") as FigureInfo["type"]` and `String(f.importance ?? "medium") as FigureInfo["importance"]` will silently accept any string the LLM returns (e.g., `"photo"`, `"critical"`), violating the union type at runtime. Since this is best-effort LLM parsing, it's not critical, but a fallback to the default on invalid values would be more robust.

♻️ Validate and default on unexpected values

+const VALID_FIGURE_TYPES = new Set(["figure", "table", "chart", "diagram"]);
+const VALID_IMPORTANCE = new Set(["high", "medium", "low"]);
+
 // inside the figures mapping:
-  type: String(f.type ?? "figure") as FigureInfo["type"],
+  type: VALID_FIGURE_TYPES.has(String(f.type)) ? (String(f.type) as FigureInfo["type"]) : "figure",
 ...
-  importance: String(f.importance ?? "medium") as FigureInfo["importance"],
+  importance: VALID_IMPORTANCE.has(String(f.importance)) ? (String(f.importance) as FigureInfo["importance"]) : "medium",

src/convert/pdfvision.ts (1)
173-177: `fs.rmdirSync` is deprecated; also, the `forEach` return value triggers a lint warning.

`fs.rmdirSync` has been deprecated since Node 16 in favor of `fs.rmSync`. Additionally, `fs.unlinkSync` returns `void` but the static analysis tool flags the implicit return from the arrow function inside `forEach`.

♻️ Proposed fix

 if (fs.existsSync(highResDir)) {
-  fs.readdirSync(highResDir).forEach((f) => fs.unlinkSync(path.join(highResDir, f)));
-  fs.rmdirSync(highResDir);
+  fs.rmSync(highResDir, { recursive: true });
 }
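A quick check that a single `fs.rmSync` call does remove a non-empty directory, replacing the readdir/unlink/rmdir dance (scratch directory names here are illustrative):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Create a scratch directory containing one file.
const scratchDir = fs.mkdtempSync("highres-demo-");
fs.writeFileSync(path.join(scratchDir, "page-0.png"), "stub");

// One call removes the directory and its contents.
fs.rmSync(scratchDir, { recursive: true });

console.log(fs.existsSync(scratchDir)); // false
```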
const beats = analysis.slides.map((slide, i) => {
  const imagePage = slide.imagePage ?? slide.sourcePages[0] ?? 0;
  const pageImagePath = `./images/${basename}-${imagePage}.png`;
  const imagePath =
    slide.figureRef && figureImageMap.has(slide.figureRef)
      ? figureImageMap.get(slide.figureRef)!
      : pageImagePath;

  return {
    text: narrations[i] || "",
Silent fallback to empty string when narration is missing for a slide.
Line 235: narrations[i] || "" silently produces empty text if the narration array is shorter than the slides array (e.g., if the LLM returned fewer narrations). Consider logging a warning when narrations.length !== analysis.slides.length to help debug unexpected empty slides.
🔍 Proposed diagnostic
const buildMulmoScript = (
analysis: DocumentAnalysis,
narrations: string[],
basename: string,
lang: SupportedLang,
figureImageMap: Map<string, string>
): z.output<typeof mulmoScriptSchema> => {
+ if (narrations.length !== analysis.slides.length) {
+ console.warn(`Warning: narration count (${narrations.length}) differs from slide count (${analysis.slides.length})`);
+ }
const beats = analysis.slides.map((slide, i) => {🤖 Prompt for AI Agents
In `@src/convert/pdfvision.ts` around lines 226 - 235, The mapping that builds
beats uses narrations[i] || "" which silently produces empty text when
narrations.length < analysis.slides.length; update the logic around the beats
creation (where analysis.slides is mapped and narrations is referenced) to
detect mismatched lengths and emit a warning (e.g., console.warn or the repo's
logger) when narrations.length !== analysis.slides.length, and additionally log
which slide index(s) are missing narration so missing text is easier to debug
before keeping the existing empty-string fallback.
Summary
- `pdfvision` CLI command that uses Vision API (Gemini/OpenAI) to intelligently convert PDFs into narrated presentations
- Unlike the `pdf` command (1:1 page-to-slide), `pdfvision` analyzes document structure, identifies figures, and creates logical slide compositions
- Moved `extractJsonFromResponse` into shared `llm.ts` utility

User Prompt
Currently, each PDF page becomes one slide. However, if the content itself can be read, I want the tool to understand each piece of content and, where images are present, produce a proper presentation whose explanations are grounded in those images.
Test plan
- GEMINI_API_KEY=xxx yarn pdfvision paper.pdf -l ja
- OPENAI_API_KEY=xxx yarn pdfvision paper.pdf --provider openai

Closes #74
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Dependencies
Tests
Documentation