@dadoonet dadoonet commented Nov 25, 2025

This also changes the way the Tika parser is instantiated. It is no longer a static class.

This is a WIP as I'd like to add support for multiple passwords, so we can try many options "à la brute force" in case the directory contains many files with different passwords.
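The "try many options" idea could be sketched as below. This is only an illustration under assumptions: `PasswordCandidates`, `firstWorking`, and the `opens` predicate are hypothetical names that are not part of this PR, and the predicate stands in for "attempt to parse the file with this password".

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper, not part of the PR: tries candidate passwords in order.
final class PasswordCandidates {
    private PasswordCandidates() {}

    // Returns the first candidate accepted by `opens`, or empty if none works.
    // The stream is lazy, so candidates after the first match are never tried.
    static Optional<String> firstWorking(List<String> candidates, Predicate<String> opens) {
        return candidates.stream().filter(opens).findFirst();
    }
}
```

In a real crawler the predicate would wrap a parse attempt and return false when the document reports a wrong password, which can be expensive for large files.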

I think we should implement a `PasswordProvider` interface which could get the password from many possible providers.

The idea is to define `PasswordProvider#getPassword(String path)`, which is responsible for providing the password for a given file.

The simplest one would be `MemoryPasswordProvider`.
The easiest one would be `DiskPasswordProvider`.
And maybe an `ElasticsearchPasswordProvider`.
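A minimal sketch of what the proposed interface and its in-memory implementation might look like. Nothing here exists in the PR yet: the `Optional<String>` return type and the `register` method are assumptions made for illustration, and the interface name would clash with Tika's own `PasswordProvider` if both were imported in the same file.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Proposed interface from the description; the return type is an assumption.
interface PasswordProvider {
    // Returns the password for the file at `path`, or empty if none is known.
    Optional<String> getPassword(String path);
}

// Keeps path-to-password pairs in memory only.
final class MemoryPasswordProvider implements PasswordProvider {
    private final Map<String, String> passwordsByPath = new ConcurrentHashMap<>();

    // Hypothetical registration method; returns this to allow chaining.
    MemoryPasswordProvider register(String path, String password) {
        passwordsByPath.put(path, password);
        return this;
    }

    @Override
    public Optional<String> getPassword(String path) {
        return Optional.ofNullable(passwordsByPath.get(path));
    }
}
```

A disk-backed or Elasticsearch-backed provider would implement the same single method and differ only in where the lookup goes.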

Related to #1916.


Note

Adds password support for protected docs (form/header/query) across REST and 3rd‑party uploads, and replaces static Tika parsing with an instance-based parser.

  • Core/Parsing:
    • Replace static TikaDocParser.generate(...) with an instance-based TikaDocParser held by FsParserAbstract and DocumentApi.
    • Introduce per-instance TikaInstance (no global/static state); expose extractText(...) and langDetector() on instances.
    • Handle encrypted files via Tika PasswordProvider; gracefully log when password missing/incorrect.
  • REST API:
    • Accept password for uploads via form, header, or query in POST /_document and PUT /_document/{id}.
    • Support password for 3rd‑party JSON uploads; pass through to parsing.
    • DocumentApi.enrichDoc(...) updated to use instance parser and optional password.
  • Docs:
    • Add "Document password" section with curl examples; minor formatting tweaks.
  • Integration/Tests:
    • Extend REST ITs to upload protected pdf/docx with passwords and assert non-empty content.
    • Update unit tests (TikaDocParserTest) to cover protected docs and new parser instantiation.
    • OCR IT refactor: detect tesseract path once; split scenarios; minor ignores and logs.
    • Minor test infra tweaks (resource copying overload, startup/shutdown logs, wait/count log messages).

Written by Cursor Bugbot for commit 324e87c. This will update automatically on new commits.

@dadoonet dadoonet self-assigned this Nov 25, 2025
@dadoonet dadoonet added the new (For new features or options) and component:extractor (For Tika, XML and JSON parsers) labels Nov 25, 2025
WriteOutContentHandler handler = new WriteOutContentHandler(indexedChars);
try (stream) {
// Set the password if any
context.set(PasswordProvider.class, new StandardPasswordProvider(password));

Bug: PasswordProvider persists across multiple document parsing operations

The ParseContext is created once in TikaInstance constructor and reused across all parsing operations. When a password is provided, the PasswordProvider is set in this shared context but never cleaned up after parsing. This means subsequent parsing operations with the same TikaDocParser instance will retain the previous password, potentially allowing encrypted documents to be decrypted with incorrect passwords or bypassing password protection unintentionally.
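The leak described above can be reproduced with a small stand-alone model. The sketch below does not use Tika: `Context` and `Password` are hypothetical stand-ins for a type-keyed parse context and a password holder, and it assumes the password is only set when one is supplied. It contrasts a context created once per parser with a context created per parse call.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a type-keyed parse context (hypothetical, not Tika's class).
final class Context {
    private final Map<Class<?>, Object> entries = new HashMap<>();
    <T> void set(Class<T> key, T value) { entries.put(key, value); }
    <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
}

// Stand-in for a password holder.
final class Password {
    final String value;
    Password(String value) { this.value = value; }
}

// Models the reported problem: one context reused across all parse calls.
final class SharedContextParser {
    private final Context context = new Context();

    // Returns the password the parse would actually see.
    String passwordUsedFor(String password) {
        if (password != null) context.set(Password.class, new Password(password));
        Password seen = context.get(Password.class);
        return seen == null ? null : seen.value;
    }
}

// Models one possible fix: a fresh context per parse call.
final class PerCallContextParser {
    String passwordUsedFor(String password) {
        Context context = new Context();
        if (password != null) context.set(Password.class, new Password(password));
        Password seen = context.get(Password.class);
        return seen == null ? null : seen.value;
    }
}
```

With the shared context, a call without a password after a call with one still sees the earlier password; with a per-call context it sees none. Another option would be to keep the shared context and always overwrite or clear the entry before each parse.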


logger.info(" --> Launching test [{}]", currentTestName);
currentTestResourceDir = testResourceTarget.resolve(currentTestName);
- String url = getUrl("samples", currentTestName);
+ String url = getUrl("samples", sampleDirName);

Bug: Test resources directory mismatch with sample directory

In the new copyTestResources(String sampleDirName) method, the sampleDirName parameter is used to locate the source files via getUrl("samples", sampleDirName), but the destination directory is still set using currentTestName instead of sampleDirName. This causes a mismatch where test files from the sampleDirName directory are copied to a different directory named after the test method. When a subclass overrides this method to use a different sample directory (e.g., "ocr"), the resources will be copied to the wrong location.


@dadoonet dadoonet linked an issue Nov 25, 2025 that may be closed by this pull request

public void copyTestResources() throws IOException {
copyTestResources("ocr");
}

Bug: Missing @Before annotation on overridden setup method

The copyTestResources() method overrides the parent's @Before annotated method from AbstractITCase but is missing the @Before annotation. In JUnit 4, the @Before annotation is not inherited when overriding, so this method will never be called before tests run. This means the "ocr" test resources won't be copied to the test directory, causing all OCR tests to fail because they won't find the expected sample files.
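The underlying Java rule is that method annotations do not carry over to an overriding method. The sketch below shows this with plain reflection; `Setup` is a hypothetical stand-in for JUnit 4's `@Before`, and the class names are invented for illustration, so this demonstrates the language behaviour rather than the JUnit runner itself.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Stand-in for JUnit 4's @Before (hypothetical annotation).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Setup {}

class BaseCase {
    @Setup
    public void copyTestResources() {}
}

// Overrides without repeating the annotation, as in the reviewed code.
class OverridesWithoutAnnotation extends BaseCase {
    @Override
    public void copyTestResources() {}
}

// Overrides and repeats the annotation, which is the suggested fix.
class OverridesWithAnnotation extends BaseCase {
    @Setup
    @Override
    public void copyTestResources() {}
}

final class AnnotationCheck {
    private AnnotationCheck() {}

    // True if the copyTestResources() method resolved on `type` carries @Setup.
    static boolean isSetup(Class<?> type) {
        try {
            Method method = type.getMethod("copyTestResources");
            return method.isAnnotationPresent(Setup.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here `AnnotationCheck.isSetup` is true for `BaseCase` and `OverridesWithAnnotation` but false for `OverridesWithoutAnnotation`, which is why repeating `@Before` on the override is the usual remedy.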


sonarqubecloud bot commented Dec 5, 2025


Development

Successfully merging this pull request may close these issues.

Add support for password protected documents

2 participants