@BuddhiLW BuddhiLW commented Dec 29, 2025

Summary

  • Fix for Emacs Org-mode files (.org) being incorrectly rejected as binary files
  • Apache Tika misdetects .org files as application/vnd.lotus-organizer (Lotus Organizer format)
  • Discovered while working on the emacs-mcp project

Changes

  • Added text-file-extensions set for known text file extensions that Tika may misdetect
  • Added text-file-names set for dotfiles/special files without traditional extensions (Makefile, .gitignore, etc.)
  • Modified text-file? to check extension first before falling back to MIME type detection

Test plan

  • Added unit tests for get-filename, get-file-extension, text-extension?
  • Added integration test for .org file detection
  • All existing tests pass

Co-Authored-By: Pedro Gomes Branquinho <pedrogbranquinho@gmail.com>

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Improved text file recognition with explicit support for additional file types, including better handling of .org files and other text-based formats.
    • Enhanced file type detection that prioritizes known text patterns before falling back to MIME-based detection for more reliable identification.

Discovered while working on emacs-mcp project: Apache Tika incorrectly
detects Emacs Org-mode files (.org) as application/vnd.lotus-organizer
(Lotus Organizer format), causing read_file to reject them as
unsupported binary files.

This fix adds extension-based text file detection that runs before
MIME type checking:
- text-file-extensions: known text extensions (.org, .md, .rst, etc.)
- text-file-names: dotfiles/special files (Makefile, .gitignore, etc.)
- text-extension?: checks both extension and filename
- text-file?: now checks extension first, then falls back to MIME

Fixes the error: "File read not supported for `/path/file.org`
with mime-type `application/vnd.lotus-organizer`"

Co-Authored-By: Pedro Gomes Branquinho <pedrogbranquinho@gmail.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Dec 29, 2025

📝 Walkthrough

This PR enhances text file detection by introducing extension and filename-based recognition that precedes MIME-type checking. New utility functions extract file metadata from paths, and curated collections of known text extensions and filenames enable robust file classification independent of system MIME heuristics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Text file detection enhancements: src/clojure_mcp/file_content.clj | Added text-file-extensions and text-file-names collections for known text patterns. Introduced utility functions get-filename and get-file-extension, which extract metadata from paths. New text-extension? predicate validates files against known patterns. Modified text-file? to check extension-based patterns first, then fall back to MIME-type detection. |
| Test coverage for file utilities: test/clojure_mcp/file_content_test.clj | Added comprehensive unit tests covering: get-filename extraction from various path formats and edge cases; get-file-extension including multi-extension scenarios (e.g., .tar.gz) and case-insensitivity; text-extension? validation against known patterns and unknown extensions; org-mode file special handling. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

  • bhauman/clojure-mcp#91: Also modifies text-detection logic in src/clojure_mcp/file_content.clj, expanding MIME-based recognition alongside this PR's extension-based prechecks.

Poem

A rabbit hops through filenames fast,
No more MIME confusion casting doubt—
Extensions and names hold steadfast,
Text files found, beyond a doubt! 🐰📄

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main issue: .org files being misdetected as Lotus Organizer binary format. It accurately reflects the core problem solved by the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f53436e and 7a93e02.

📒 Files selected for processing (2)
  • src/clojure_mcp/file_content.clj
  • test/clojure_mcp/file_content_test.clj
🧰 Additional context used
📓 Path-based instructions (2)
**/*.clj

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.clj: Use :require with ns aliases in import statements (e.g., [clojure.string :as string])
Use kebab-case for variable and function names
End predicate functions with ? (e.g., is-top-level-form?)
Use try/catch with specific exception handling; use atom for tracking errors
Use 2-space indentation and maintain whitespace in edited forms
Align namespace names with directory structure (e.g., clojure-mcp.repl-tools for clojure_mcp/repl_tools.clj)
Include clear tool :description for LLM guidance in MCP tool definitions
Validate inputs and provide helpful error messages in MCP tools
Return structured data with both result and error status from MCP tools
Maintain atom-based state for consistent service access in MCP tools

Files:

  • test/clojure_mcp/file_content_test.clj
  • src/clojure_mcp/file_content.clj
**/*_test.clj

📄 CodeRabbit inference engine (CLAUDE.md)

Use deftest with descriptive test names; use testing for subsections; use is for assertions

Files:

  • test/clojure_mcp/file_content_test.clj
🧠 Learnings (3)
📚 Learning: 2025-12-07T23:16:26.445Z
Learnt from: CR
Repo: bhauman/clojure-mcp PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-07T23:16:26.445Z
Learning: Applies to **/*_test.clj : Use `deftest` with descriptive test names; use `testing` for subsections; use `is` for assertions

Applied to files:

  • test/clojure_mcp/file_content_test.clj
📚 Learning: 2025-12-27T06:54:07.157Z
Learnt from: nandoolle
Repo: bhauman/clojure-mcp PR: 138
File: src/clojure_mcp/agent/langchain/model.clj:91-97
Timestamp: 2025-12-27T06:54:07.157Z
Learning: When reviewing Clojure code (e.g., in src/clojure_mcp/... ), verify calls to presets/get-preset follow the two arities: [preset-key] and [preset-key overrides]. The 2-arg version should merge the provided overrides map with the preset defaults and return the resulting configuration. Check for correct map merging, preservation of immutability, and validation of overrides keys to ensure predictable configuration outcomes.

Applied to files:

  • test/clojure_mcp/file_content_test.clj
  • src/clojure_mcp/file_content.clj
📚 Learning: 2025-12-07T23:16:26.445Z
Learnt from: CR
Repo: bhauman/clojure-mcp PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-07T23:16:26.445Z
Learning: Applies to **/*.clj : Include clear tool `:description` for LLM guidance in MCP tool definitions

Applied to files:

  • src/clojure_mcp/file_content.clj
🔇 Additional comments (10)
test/clojure_mcp/file_content_test.clj (4)

131-139: LGTM! Comprehensive test coverage for filename extraction.

The tests appropriately cover path extraction, dotfiles, case conversion, and edge cases. The implementation correctly validates that filenames are normalized to lowercase.


141-162: LGTM! Thorough test coverage for extension extraction.

The tests comprehensively validate:

  • Basic extraction with case normalization
  • Multi-part extensions (intentionally returns last segment)
  • Dotfiles (correctly distinguishes .gitignore with no extension from .gitignore.local with .local extension)
  • Edge cases (nil, empty, no extension)

164-192: LGTM! Complete test coverage for text extension detection.

The tests validate all key behaviors:

  • Known text extensions (.org, .md, .txt, etc.)
  • Known text filenames (dotfiles and build files)
  • Case-insensitive matching for both categories
  • Rejection of unknown extensions

194-219: LGTM! Excellent integration test for the core fix.

This test directly validates the PR's main objective: ensuring .org files are recognized as text despite Apache Tika's MIME type misdetection. The use of realistic org-mode content (metadata, headings, lists, code blocks) provides strong confidence in the fix.

src/clojure_mcp/file_content.clj (6)

24-43: LGTM! Comprehensive text extension coverage.

The extension set is well-curated and addresses the core issue (.org files). All entries correctly include the dot prefix and are lowercase, which aligns with the get-file-extension implementation.


45-59: LGTM! Good coverage of extension-less text files.

The filename set appropriately handles common dotfiles and build configuration files. All entries are lowercase, consistent with the case-normalization in get-filename.


89-103: LGTM! Clean and defensive utility implementations.

Both functions correctly:

  • Handle nil and empty inputs
  • Support cross-platform path separators ([/\\])
  • Normalize case for consistent matching
  • Handle edge cases (dotfiles, multi-part extensions)

The pos? dot-idx check in get-file-extension properly ensures dotfiles return nil.
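
For readers without the diff at hand, here is a minimal sketch of how helpers with the behavior described above could look. It is not the PR's actual code: the namespace name and the function bodies are assumptions, written only to illustrate the `[/\\]` separator split, the lowercase normalization, and the `pos?` check that makes a leading-dot name such as `.gitignore` yield no extension.

```clojure
(ns example.path-utils
  (:require [clojure.string :as string]))

(defn get-filename
  "Returns the lowercased last path segment, or nil for nil/blank input.
  Hypothetical reimplementation for illustration only."
  [path]
  (when-not (string/blank? path)
    ;; Split on either separator so Unix and Windows paths both work.
    (some-> (string/split path #"[/\\]")
            last
            not-empty
            string/lower-case)))

(defn get-file-extension
  "Returns the lowercased last extension including the dot (e.g. \".gz\"),
  or nil when there is none. Hypothetical reimplementation."
  [path]
  (when-let [filename (get-filename path)]
    (let [dot-idx (string/last-index-of filename ".")]
      ;; pos? rejects index 0, so a dotfile like \".gitignore\" has no extension.
      (when (and dot-idx (pos? dot-idx))
        (subs filename dot-idx)))))

(comment
  (get-filename "/home/u/Notes.ORG")      ; "notes.org"
  (get-file-extension "archive.tar.gz")   ; ".gz"
  (get-file-extension ".gitignore")       ; nil
  (get-file-extension ".gitignore.local") ; ".local"
  )
```

Taking only the last segment of a multi-part name keeps the lookup a single set membership test, which matches the "intentionally returns last segment" behavior noted in the test review.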


105-110: LGTM! Clean predicate implementation.

The function correctly combines extension-based and filename-based checks using the curated sets. The implementation is straightforward and efficient.


151-157: LGTM! Core fix correctly implemented.

The extension-first check elegantly solves the MIME misdetection issue. The short-circuit or ensures efficient evaluation, and the updated docstring clearly documents the behavior.
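
The extension-first idea can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: the two sets are small subsets chosen for the example, the `detect-mime` argument is a hypothetical stand-in for the Tika call (the real `text-file?` presumably takes only a path), and the `text/` prefix test is a simplification of whatever MIME rules the project actually applies.

```clojure
(ns example.text-detection
  (:require [clojure.string :as string])
  (:import (java.io File)))

;; Illustrative subsets only; the PR's real sets are larger.
(def text-file-extensions #{".org" ".md" ".rst" ".txt"})
(def text-file-names #{"makefile" ".gitignore"})

(defn text-extension?
  "True when the filename or its last extension is a known text pattern."
  [path]
  (let [filename (when path
                   (string/lower-case (.getName (File. ^String path))))
        dot-idx  (some-> filename (string/last-index-of "."))]
    (boolean
     (or (contains? text-file-names filename)
         (and dot-idx
              (pos? dot-idx)
              (contains? text-file-extensions (subs filename dot-idx)))))))

(defn text-file?
  "Checks known extensions first; only calls detect-mime if that fails.
  detect-mime is a hypothetical 1-arg fn returning a MIME string or nil."
  [path detect-mime]
  (or (text-extension? path)
      (boolean (some-> (detect-mime path)
                       (string/starts-with? "text/")))))

(comment
  ;; A detector that always misreports, as Tika does for .org files.
  (def misdetect (constantly "application/vnd.lotus-organizer"))
  (text-file? "/tmp/kanban.org" misdetect) ; true, detector never consulted
  (text-file? "/tmp/data.bin" misdetect)   ; false, falls through to MIME
  )
```

Because `or` stops at the first truthy value, the detector is never invoked for a known extension, so a wrong MIME answer cannot override the extension match.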


141-149: The flagged inconsistency is not a practical issue. text-file? and file-response->file-content operate in completely separate control flows: the former determines how to read files locally in the unified_read_file tool, while the latter converts tool responses into MCP format in core.clj. They are not called in the same flow, so no inconsistent handling of .org files occurs. Additionally, should-be-file-response? is defined but never called anywhere in the codebase, suggesting it is unused.

Likely an incorrect or invalid review comment.


@BuddhiLW BuddhiLW (Contributor, Author) commented

Before/After Examples

Before (error)

● clojure-mcp - read_file (MCP)(path: "/home/lages/dotfiles/gitthings/emacs-mcp/kanban.org")
  ⎿  Error: File read not supported for `/home/lages/dotfiles/gitthings/emacs-mcp/kanban.org` with mime-type `application/vnd.lotus-organizer`

After (success)

● clojure-mcp - read_file (MCP)(path: "/home/lages/dotfiles/gitthings/emacs-mcp/kanban.org")
  ⎿  ### /home/lages/dotfiles/gitthings/emacs-mcp/kanban.org
     ```org
     #+TITLE: emacs-mcp Kanban
     #+STARTUP: overview
     #+TODO: TODO IN-PROGRESS IN-REVIEW | DONE CANCELLED

     * emacs-mcp
     :PROPERTIES:
     :ID: b70720e8-3a52-41ba-a115-964b81369bb5
     :END:
     ...
     ```

Root cause

Apache Tika's MIME detection incorrectly identifies Emacs Org-mode files (.org) as application/vnd.lotus-organizer (Lotus Organizer, a discontinued PIM application from the 1990s). Both formats historically used the .org extension.

Solution

Added extension-based text file detection that runs before MIME type checking, so known text extensions like .org, .md, .rst, etc. are correctly treated as text regardless of what Tika detects.

@bhauman bhauman merged commit eeaf80b into bhauman:main Jan 5, 2026
2 checks passed