[AGENT] Harden transcription final pass and replay tooling #160

Open
BASIC-BIT wants to merge 15 commits into master from feature-transcription-final-pass

Conversation

@BASIC-BIT
Collaborator

Summary

  • harden transcription voting and final-pass reconciliation against punctuation-only outputs, repeated low-information hallucinations, and segment-boundary drift
  • add replay-oriented eval tooling for prompt vs no-prompt comparison, plus targeted tests and docs updates for the final-pass pipeline
  • keep transcription observability high with extra trivial-text telemetry and config/docs coverage for the new safeguards

Testing

  • yarn test --runTestsByPath src/utils/__tests__/transcriptionText.test.ts src/utils/__tests__/transcriptionVote.test.ts src/services/__tests__/transcriptionFinalPassService.test.ts
  • yarn lint:check src/evals/transcriptionEval.ts src/services/transcriptionFinalPassService.ts src/services/transcriptionService.ts src/utils/transcriptionVote.ts src/utils/transcriptionText.ts src/utils/__tests__/transcriptionText.test.ts src/utils/__tests__/transcriptionVote.test.ts src/services/__tests__/transcriptionFinalPassService.test.ts
  • yarn build
  • npx prettier --check src/evals/transcriptionEval.ts src/services/transcriptionFinalPassService.ts src/services/transcriptionService.ts src/utils/transcriptionVote.ts src/utils/transcriptionText.ts src/utils/__tests__/transcriptionText.test.ts src/utils/__tests__/transcriptionVote.test.ts src/services/__tests__/transcriptionFinalPassService.test.ts docs/audio-transcription.md docs/feature-toggles.md docs/audio-transcription-plan.md

Notes

  • yarn markdownlint:check is not runnable in this environment because markdownlint-cli2 is not installed.

Copilot AI review requested due to automatic review settings March 21, 2026 19:28

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23d8397d30


Copilot AI left a comment

Pull request overview

This PR hardens the transcription pipeline by adding a finalized-audio “final pass” verification step, strengthening prompt vs no-prompt voting against low-information outputs, and adding replay-oriented eval tooling to compare prompt/no-prompt behavior with extra telemetry and docs coverage.

Changes:

  • Add runTranscriptionFinalPass to verify transcripts against finalized mixed audio, apply high-confidence edits, reject trivial replacements, and filter repeated low-information segments.
  • Improve transcription vote selection by penalizing trivial/punctuation-only candidates and recording new “trivial text” telemetry.
  • Extend transcription eval tooling with --compare-no-prompt, and update tests/config/docs/prompts to support the new pipeline.
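
As a rough illustration of the "trivial/punctuation-only" classification mentioned above, here is a minimal sketch. The names `TextQualitySketch` and `classifyTextSketch` are hypothetical, and the rule that empty text counts as trivial is an assumption; the PR's actual helpers in `transcriptionText.ts` may behave differently.

```typescript
// Hypothetical sketch; not the PR's implementation.
interface TextQualitySketch {
  trimmed: string;
  wordCount: number;
  isPunctuationOnly: boolean;
  isTrivial: boolean;
}

// Unicode-aware: a "word" is any run of letters or digits in any script.
const WORD_PATTERN_SKETCH = /[\p{L}\p{N}]+/gu;

function classifyTextSketch(text: string): TextQualitySketch {
  const trimmed = text.trim();
  const words = trimmed.match(WORD_PATTERN_SKETCH) ?? [];
  // Non-empty text with no letters or digits is punctuation-only (e.g. a bare ".").
  const isPunctuationOnly = trimmed.length > 0 && words.length === 0;
  // Assumption: empty text and punctuation-only text are both treated as trivial.
  const isTrivial = trimmed.length === 0 || isPunctuationOnly;
  return { trimmed, wordCount: words.length, isPunctuationOnly, isTrivial };
}
```

A caller could use `classifyTextSketch(candidate.text).isTrivial` as a gate before any score comparison.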

Reviewed changes

Copilot reviewed 37 out of 38 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/services/transcriptionPromptService.test.ts | Adds coverage for final-pass prompt formatting and Langfuse prompt name. |
| test/services/transcriptionFinalPassService.test.ts | Adds integration-style tests for chunking, boundary handling, and guardrail fallback. |
| test/services/imageCaptionService.test.ts | Updates test runtime config shape with finalPassEnabled. |
| test/services/contextService.test.ts | Updates default transcription config in tests to include finalPassEnabled. |
| test/services/autoRecordCancellationService.test.ts | Updates test runtime config shape with finalPassEnabled. |
| test/commands/endMeeting.test.ts | Mocks and asserts end-meeting flow invokes the final pass when transcription is enabled. |
| test/audio/startProcessingSnippet.test.ts | Updates test runtime config shape with finalPassEnabled. |
| src/utils/transcriptionVote.ts | Hardens vote scoring/selection using trivial-text detection and a no-prompt quality floor. |
| src/utils/transcriptionText.ts | Introduces transcript text-quality helpers (trivial, punctuation-only, low-information, near-duplicate). |
| src/utils/__tests__/transcriptionVote.test.ts | Adds tests for punctuation-only/trivial vote behavior. |
| src/utils/__tests__/transcriptionText.test.ts | Adds tests for punctuation-only detection and normal speech classification. |
| src/types/audio.ts | Adds finalPassTranscript field to AudioFileData. |
| src/services/uploadService.ts | Prefers finalPassTranscript when building transcript JSON output. |
| src/services/transcriptionService.ts | Adds trivial-text telemetry and logs candidate quality details. |
| src/services/transcriptionPromptService.ts | Adds Langfuse prompt builder for final-pass reconciliation and baseline segment formatting. |
| src/services/transcriptionFinalPassService.ts | Implements finalized-audio chunk transcription, reconciliation, edit application, and repetition filtering. |
| src/services/meetingTimelineService.ts | Prefers finalPassTranscript for live timeline segments. |
| src/services/meetingConfigService.ts | Plumbs transcription.finalPass.enabled into runtime config resolution. |
| src/services/configService.ts | Adds Langfuse prompt name config for final-pass prompt (LANGFUSE_PROMPT_TRANSCRIPTION_FINAL_PASS). |
| src/services/__tests__/transcriptionFinalPassService.test.ts | Adds focused unit tests for trivial replacement rejection and repetition filtering. |
| src/evals/transcriptionEval.ts | Adds trivial-text counts and a prompt vs no-prompt comparison mode in eval output. |
| src/constants.ts | Adds final-pass chunking, confidence, and guardrail constants. |
| src/config/types.ts | Extends runtime config typing with finalPassEnabled. |
| src/config/registry.ts | Registers transcription.finalPass.enabled config entry (global toggle). |
| src/config/keys.ts | Adds config key constant for transcription.finalPass.enabled. |
| src/commands/endMeeting.ts | Runs final transcription pass after mixed audio is built and before compiling transcript/notes. |
| src/audio.ts | Prefers finalPassTranscript when compiling full meeting transcription. |
| prompts/chronote-transcription-final-pass-chat.md | Adds a Langfuse-managed prompt for reconciliation edits JSON output. |
| docs/feature-toggles.md | Documents the new transcription.finalPass.enabled toggle and updates vote description. |
| docs/audio-transcription.md | Documents punctuation-only handling and final-pass behaviors. |
| docs/audio-transcription-plan.md | Updates plan with replay tooling and repeated-phrase cleanup. |
| apps/docs-site/docs/core-concepts/meeting-lifecycle.md | Updates meeting lifecycle docs to include finalized-audio verification stage. |
| README.md | Mentions the finalized-audio verification pass at meeting end. |
| AGENTS.md | Updates agent context for final-pass toggle and prompt override env var. |
| .opencode/skills/pr-review-recycle/SKILL.md | Updates PR review recycle checklist to include a 5-minute post-push wait. |
| .opencode/skills/pr-post-push-sop/SKILL.md | Adds a new post-push SOP skill document. |
| .github/copilot-instructions.md | Mirrors AGENTS guidance updates, including post-push SOP and final-pass notes. |

@greptile-apps

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR hardens the transcription pipeline against two families of low-quality output: punctuation-only results (e.g. a bare .) that could win a prompt-vs-no-prompt vote or overwrite an existing transcript via the final pass, and repeated short "greeting-style" hallucinations from the same speaker that the LLM reconcile step may not catch. It also adds a --compare-no-prompt replay mode to the eval tooling and extends observability with trivial-text quality signals throughout the transcription logs.

Key changes:

  • transcriptionText.ts (new): getTranscriptionTextQuality / isTrivialTranscriptionText / isLowInformationTranscriptionText / areLowInformationTranscriptionTextsNearDuplicates form a clean utility layer consumed by both the vote arbiter and the final-pass service.
  • transcriptionVote.ts: Trivial-text short-circuits fire before score comparison; a NO_PROMPT_MIN_SELECTION_SCORE = 120 quality floor prevents very low-scoring no-prompt candidates from ever winning; symmetric reason codes (prompt_trivial_text, no_prompt_trivial_text, both_trivial_text, no_prompt_below_quality_floor) improve log replay.
  • transcriptionFinalPassService.ts: rejectedTrivialEdits counter tracks blocked punctuation-only replacements; applyRepeatedLowInformationFilter runs after edits are applied (or in place of them when none were accepted), keyed on raw userId so same-named speakers from different users are never conflated.
  • transcriptionEval.ts: --compare-no-prompt flag runs two sequential batch passes and emits a side-by-side JSON comparison for offline vote investigations.
  • All prior review-thread concerns (missing mock fields, contiguous-subarray naming, trivial-penalty telemetry comment, symmetric reason codes, segment usability ordering, redundant guard, blank-line formatting) are addressed in this iteration.
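
The vote ordering described in the key changes can be sketched roughly as follows. This is a hedged illustration only: the reason codes and the floor value of 120 come from the summary above, but which candidate wins in each branch, the `score_comparison` reason, the `isTrivialSketch` check, and the `CandidateSketch` shape are all assumptions.

```typescript
// Hypothetical sketch of the selection order; not the PR's implementation.
const NO_PROMPT_MIN_SELECTION_SCORE = 120; // value quoted in the review summary

type VoteReasonSketch =
  | "prompt_trivial_text"
  | "no_prompt_trivial_text"
  | "both_trivial_text"
  | "no_prompt_below_quality_floor"
  | "score_comparison"; // hypothetical catch-all, not named in the PR

interface CandidateSketch {
  text: string;
  score: number; // assumed to be computed elsewhere
}

interface VoteResultSketch {
  winner: "prompt" | "no_prompt";
  reason: VoteReasonSketch;
}

// Assumption: text with no letters or digits is trivial.
const isTrivialSketch = (text: string): boolean => !/[\p{L}\p{N}]/u.test(text);

function selectCandidateSketch(
  prompt: CandidateSketch,
  noPrompt: CandidateSketch,
): VoteResultSketch {
  const promptTrivial = isTrivialSketch(prompt.text);
  const noPromptTrivial = isTrivialSketch(noPrompt.text);
  // Trivial-text short-circuits run before any score comparison.
  if (promptTrivial && noPromptTrivial) {
    return { winner: "prompt", reason: "both_trivial_text" };
  }
  if (promptTrivial) return { winner: "no_prompt", reason: "prompt_trivial_text" };
  if (noPromptTrivial) return { winner: "prompt", reason: "no_prompt_trivial_text" };
  // Quality floor: a very low-scoring no-prompt candidate can never win.
  if (noPrompt.score < NO_PROMPT_MIN_SELECTION_SCORE) {
    return { winner: "prompt", reason: "no_prompt_below_quality_floor" };
  }
  return noPrompt.score > prompt.score
    ? { winner: "no_prompt", reason: "score_comparison" }
    : { winner: "prompt", reason: "score_comparison" };
}
```

Returning a reason code alongside the winner is what makes each vote replayable from logs without re-running the models.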

Confidence Score: 5/5

  • Safe to merge — all prior concerns resolved, new logic is well-guarded and thoroughly tested.
  • Every concern from the previous review round has been addressed with targeted fixes. The new applyRepeatedLowInformationFilter logic is correct (window filtering keyed on userId, dropped segments skipped cleanly), the trivial-text short-circuits fire before score comparison as intended, and the score constants confirm NO_PROMPT_MIN_SELECTION_SCORE = 120 is a sensible floor. The only remaining item is a cosmetic comment in the Bedrock branch of buildTranscribeOnce, which has no runtime impact.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/utils/transcriptionText.ts | New utility module providing text quality analysis. Logic for trivial/low-information detection, contiguous subarray matching, and near-duplicate detection is sound and well-tested. |
| src/utils/transcriptionVote.ts | Adds trivial-text short-circuits, a NO_PROMPT_MIN_SELECTION_SCORE=120 quality floor, and symmetric reason codes for vote telemetry. Score constants confirm 120 is a sensible floor for genuinely low-quality outputs. |
| src/services/transcriptionFinalPassService.ts | Adds rejectedTrivialEdits counter, speakerKey field to BaselineSegment, and applyRepeatedLowInformationFilter. Logic correctly keys filtering on userId (not display name), uses a 180-second window, and skips already-dropped segments. |
| src/services/transcriptionService.ts | Adds rawTextQuality and selectedTextQuality telemetry to transcription logs and vote metadata; replaces hardcoded failure string with the extracted constant. |
| src/evals/transcriptionEval.ts | Adds --compare-no-prompt flag and trivialTextCount tracking. The promptOverride parameter is accepted but silently unused in the Bedrock branch of buildTranscribeOnce; this is harmless because --compare-no-prompt is already guarded to OpenAI only. |
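
To make the repetition filter described for transcriptionFinalPassService.ts more concrete, here is a minimal sketch under stated assumptions. The 180-second window and the userId keying come from the overview; the `SegmentSketch` shape, the three-word "low information" threshold, the exact-token-match duplicate check, and the requirement that segments arrive sorted by start time are hypothetical simplifications.

```typescript
// Hypothetical sketch; not the PR's applyRepeatedLowInformationFilter.
interface SegmentSketch {
  userId: string;
  startSeconds: number;
  transcript: string;
  finalPassTranscript?: string;
}

const REPEAT_WINDOW_SECONDS = 180; // window size quoted in the review
const MAX_LOW_INFO_WORDS = 3; // hypothetical threshold for "low information"

const tokenizeSketch = (text: string): string[] =>
  text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];

// Mutates segments in place and returns how many were filtered.
// Assumes segments are sorted by startSeconds.
function filterRepeatsSketch(segments: SegmentSketch[]): number {
  const keptBySpeaker = new Map<string, { startSeconds: number; key: string }[]>();
  let filtered = 0;
  for (const segment of segments) {
    const text = segment.finalPassTranscript ?? segment.transcript;
    const words = tokenizeSketch(text);
    // Skip already-dropped (empty) segments and segments with real content.
    if (words.length === 0 || words.length > MAX_LOW_INFO_WORDS) continue;
    const key = words.join(" ");
    // Keyed on userId so same-named speakers are never conflated.
    const recent = (keptBySpeaker.get(segment.userId) ?? []).filter(
      (kept) => segment.startSeconds - kept.startSeconds <= REPEAT_WINDOW_SECONDS,
    );
    if (recent.some((kept) => kept.key === key)) {
      segment.finalPassTranscript = ""; // drop the repeated phrase
      filtered += 1;
    } else {
      recent.push({ startSeconds: segment.startSeconds, key });
    }
    keptBySpeaker.set(segment.userId, recent);
  }
  return filtered;
}
```

Because only kept segments enter the per-speaker window, a dropped repeat never extends the window for later segments.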

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[runTranscriptionFinalPass] --> B[buildBaselineSegments]
    B --> C[processAllChunks]
    C --> D{LLM reconcile batch}
    D -->|edit| E[applyBatchEdits]
    E --> F{replace edit?}
    F -->|yes| G{isTrivialTranscriptionText?}
    G -->|yes| H[rejectedTrivialEdits++\nskip edit]
    G -->|no| I[updateAcceptedEdit]
    F -->|no - drop| I
    C --> J{acceptedEdits.size == 0?}
    J -->|yes| K[applyRepeatedLowInformationFilter]
    J -->|no| L{guardrail thresholds exceeded?}
    L -->|yes| M[return fallback result\nskip repetition filter]
    L -->|no| N[applyAcceptedEdits]
    N --> K
    K --> O{isLowInformationTranscriptionText?}
    O -->|no| P[skip segment]
    O -->|yes| Q{areLowInformationTextsNearDuplicates\nwithin 180s window?}
    Q -->|yes| R[finalPassTranscript = ''\nrepetitionFilteredSegments++]
    Q -->|no| S[add to speaker window\nlastKeptBySpeaker]
    K --> T[return result with applied=true]
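
The control flow in the flowchart can be approximated in code as below. This is a simplified sketch, not the service itself: `EditSketch`, `planFinalPassSketch`, and the `MAX_EDIT_RATIO_SKETCH` guardrail are hypothetical stand-ins, since the real guardrail thresholds are not shown in this conversation.

```typescript
// Hypothetical sketch of the flowchart's decisions; not the PR's implementation.
type EditSketch =
  | { segmentId: string; action: "replace"; text: string }
  | { segmentId: string; action: "drop" };

interface FinalPassPlanSketch {
  applied: boolean;
  acceptedEdits: Map<string, EditSketch>;
  rejectedTrivialEdits: number;
  ranRepetitionFilter: boolean;
}

const MAX_EDIT_RATIO_SKETCH = 0.5; // hypothetical guardrail threshold

function planFinalPassSketch(
  edits: EditSketch[],
  segmentCount: number,
): FinalPassPlanSketch {
  const acceptedEdits = new Map<string, EditSketch>();
  let rejectedTrivialEdits = 0;
  for (const edit of edits) {
    // Replace edits whose text has no letters or digits are rejected.
    if (edit.action === "replace" && !/[\p{L}\p{N}]/u.test(edit.text)) {
      rejectedTrivialEdits += 1;
      continue;
    }
    acceptedEdits.set(edit.segmentId, edit);
  }
  const guardrailExceeded =
    segmentCount > 0 && acceptedEdits.size / segmentCount > MAX_EDIT_RATIO_SKETCH;
  if (acceptedEdits.size > 0 && guardrailExceeded) {
    // Fallback: discard edits and skip the repetition filter entirely.
    return {
      applied: false,
      acceptedEdits: new Map(),
      rejectedTrivialEdits,
      ranRepetitionFilter: false,
    };
  }
  // Either no edits were accepted or they passed the guardrail;
  // in both cases the repetition filter runs next.
  return { applied: true, acceptedEdits, rejectedTrivialEdits, ranRepetitionFilter: true };
}
```

The point of the fallback branch is that a suspicious reconcile result leaves the baseline transcript untouched rather than layering further deterministic changes on top of it.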

Comments Outside Diff (1)

  1. src/evals/transcriptionEval.ts, lines 356-370

    P2 promptOverride silently unused in Bedrock branch

    buildTranscribeOnce accepts promptOverride as a parameter, but the Bedrock branch ignores it — the same bedrockConfig is used regardless of what is passed. This is harmless today because --compare-no-prompt is already guarded to OpenAI only, but the mismatch between the function signature and actual Bedrock behavior can mislead future maintainers into thinking the override is honoured for Bedrock too.

    A comment makes the intent explicit:

```suggestion
        : (() => {
            const bedrockConfig = resolveBedrockConfig();
            const region = config.storage.awsRegion;
            const bedrockClient = new BedrockDataAutomationRuntimeClient({
              region,
            });
            const s3Client = new S3Client({ region });
            // Bedrock does not support a dynamic prompt override; promptOverride is unused here.
            return () =>
              transcribeWithBedrock(
                bedrockClient,
                s3Client,
                filePath,
                bedrockConfig,
              );
          })();
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (6): Last reviewed commit: "[AGENT] Add spacing before final-pass de..."

@BASIC-BIT
Collaborator Author

[AGENT] Addressed the concrete bot review items locally before this push: added the missing final-pass mock fields in test/commands/endMeeting.test.ts, switched repeated-phrase tracking to stable user IDs, removed transcript text from the new warning log payloads, restricted --compare-no-prompt to OpenAI, and renamed the contiguous-token helper with an intent comment. I am intentionally keeping the focused src/services/__tests__ coverage alongside the broader test/services suite.

Copilot AI review requested due to automatic review settings March 23, 2026 05:18

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23422584614

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23422620451

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23422804157

Copilot AI review requested due to automatic review settings March 23, 2026 05:35

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/utils/transcriptionVote.ts:111

  • calculateCandidateQuality uses getTranscriptionTextQuality (Unicode-aware) but still normalizes words via normalizeWords, which strips non-ASCII letters/numbers. This means repetition/unique-word heuristics won’t work for non-Latin transcripts and can skew scoring/telemetry when wordCount comes from non-ASCII text. Consider reusing the Unicode-aware token normalization from transcriptionText (or switching normalizeWords to \p{L}/\p{N}) so quality scoring is consistent across languages.
  const textQuality = getTranscriptionTextQuality(candidate.text);
  const trimmed = textQuality.trimmed;
  const wordCount = textQuality.wordCount;
  const normalizedWords = normalizeWords(trimmed);
  const uniqueWordRatio =
    normalizedWords.length === 0
      ? 1
      : new Set(normalizedWords).size / normalizedWords.length;
  const maxConsecutiveRepeats = calculateMaxConsecutiveRepeats(normalizedWords);
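
The suggestion above, switching word normalization to `\p{L}`/`\p{N}`, could look roughly like this. The function names are hypothetical and the NFKC normalization step is an extra assumption; the repository's actual `normalizeWords` is not shown here.

```typescript
// Hypothetical sketch of a Unicode-aware replacement for an ASCII-only normalizer.
function normalizeWordsUnicodeSketch(text: string): string[] {
  return text
    .normalize("NFKC") // assumption: fold compatibility forms before tokenizing
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u) // split on anything that is not a letter or digit
    .filter((word) => word.length > 0);
}

// Mirrors the snippet's ratio: unique words divided by total words, 1 when empty.
function uniqueWordRatioSketch(words: string[]): number {
  return words.length === 0 ? 1 : new Set(words).size / words.length;
}
```

With an ASCII-only normalizer, a Cyrillic or CJK transcript would yield an empty word list and a ratio of 1, hiding any repetition; the Unicode-aware version keeps those tokens so the heuristics still apply.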

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23423069771

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23423323835

Copilot AI review requested due to automatic review settings March 23, 2026 05:53

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23423501705

@BASIC-BIT
Collaborator Author

[AGENT] Addressed the last minor cleanup items from the latest Greptile pass: the no-space text test now has an accurate description, and the guardrail-threshold fallback in transcriptionFinalPassService now documents why it intentionally skips the deterministic repetition filter when the reconcile model looks suspicious.

Copilot AI review requested due to automatic review settings March 23, 2026 06:01

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

src/evals/transcriptionEval.ts:467

  • --compare-no-prompt can be combined with dataset mode (--dataset / LANGFUSE_EVAL_DATASET), but runLangfuseEval currently ignores compareNoPrompt (it only runs a single batch per item). This makes the flag silently ineffective in dataset runs. Consider rejecting --compare-no-prompt when useDataset is true, or extending the dataset task/output schema to run and record both prompt modes.
  const compareNoPrompt = hasFlag("--compare-no-prompt");
  if (compareNoPrompt && provider !== "openai") {
    throw new Error(
      "--compare-no-prompt currently supports only --provider openai.",
    );
  }

  const datasetFlag =
    readFlagValue("--langfuse-dataset") ?? readFlagValue("--dataset");
  const experimentFlag =
    readFlagValue("--langfuse-experiment") ?? readFlagValue("--experiment");
  const envDataset = process.env.LANGFUSE_EVAL_DATASET;
  const useDataset = Boolean(datasetFlag ?? (!file && envDataset));
  const langfuseDataset = datasetFlag ?? (useDataset ? envDataset : undefined);
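
One way to act on the first option in the comment above (rejecting the flag in dataset mode) is a small validation step. This is only a sketch: `EvalFlagsSketch` and `validateEvalFlagsSketch` are hypothetical names, and the dataset-mode error message is invented for illustration; only the OpenAI message is taken from the snippet.

```typescript
// Hypothetical sketch; the real eval script parses flags inline.
interface EvalFlagsSketch {
  compareNoPrompt: boolean;
  provider: string;
  useDataset: boolean;
}

function validateEvalFlagsSketch(flags: EvalFlagsSketch): void {
  if (!flags.compareNoPrompt) return;
  if (flags.provider !== "openai") {
    throw new Error(
      "--compare-no-prompt currently supports only --provider openai.",
    );
  }
  if (flags.useDataset) {
    // Assumed message: fail loudly instead of silently ignoring the flag.
    throw new Error("--compare-no-prompt is not supported in dataset mode.");
  }
}
```

Note that in the original snippet `useDataset` is computed after the provider check, so a check like this would have to run once both values are known.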

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23423671051

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23423718230

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/23423938912

2 participants