Skip to content

3237: docs: add Iceberg public API documentation to contributor guide#40

Open
martin-augment wants to merge 5 commits intomainfrom
pr-3237-2026-01-22-20-41-46
Open

3237: docs: add Iceberg public API documentation to contributor guide#40
martin-augment wants to merge 5 commits intomainfrom
pr-3237-2026-01-22-20-41-46

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

3237: To review by AI

andygrove and others added 5 commits January 21, 2026 17:24
Add documentation detailing all Comet classes and methods that form
the public API used by Apache Iceberg. This helps contributors
understand which APIs may affect Iceberg integration and need
backward compatibility considerations.

The documentation covers:
- org.apache.comet.parquet: FileReader, RowGroupReader, ReadOptions,
  ParquetColumnSpec, column readers, BatchReader, Native JNI methods
- org.apache.comet: CometSchemaImporter
- org.apache.comet.vector: CometVector
- org.apache.comet.shaded.arrow: RootAllocator, ValueVector

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a custom Java annotation @IcebergApi to mark all classes, methods,
constructors, and fields that form the public API used by Apache Iceberg.
This makes it easy to identify which APIs need backward compatibility
considerations when making changes.

The annotation is applied to:
- org.apache.comet.parquet: FileReader, RowGroupReader, ReadOptions,
  WrappedInputFile, ParquetColumnSpec, AbstractColumnReader, ColumnReader,
  BatchReader, MetadataColumnReader, ConstantColumnReader, Native, TypeUtil,
  Utils
- org.apache.comet: CometSchemaImporter
- org.apache.comet.vector: CometVector

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add annotations to:
- AbstractColumnReader.nativeHandle (protected field accessed by Iceberg
  subclasses)
- AbstractCometSchemaImporter.close() (called by Iceberg)

Also update documentation to include these APIs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a new Maven module containing dedicated unit tests for all @IcebergApi
annotated classes, ensuring the public API contract with Apache Iceberg
remains stable and tested.

Key changes:
- Add iceberg-public-api module with 169 tests covering all @IcebergApi classes
- Fix CometVector constructor visibility (protected -> public) to match API annotation
- Add IcebergApiVerificationTest for reflection-based API verification
- Add tests for FileReader, BatchReader, ColumnReader, Native, TypeUtil, Utils
- Add tests for CometVector, CometSchemaImporter, WrappedInputFile

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 22, 2026

Walkthrough

This pull request introduces a public API surface for the Comet-Iceberg integration by creating the @IcebergApi annotation and applying it across multiple classes and methods. Key components include the new annotation definition, its application to approximately 20 classes and 40+ methods throughout the parquet, schema importing, and vector packages, comprehensive documentation of the public API surface in iceberg_public_api.md, and a new Maven test module iceberg-public-api containing reflection-based test suites to verify annotation presence, public accessibility, and API compliance. No behavioral changes are introduced; all modifications are annotation-driven to establish clear API boundaries for Iceberg integration.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the maintainability and clarity of the Comet-Iceberg integration by formally defining and documenting the public API contract. By introducing a dedicated annotation, comprehensive documentation, and an automated test suite, it ensures that critical API elements are explicitly identified, understood, and protected against unintentional breaking changes, thereby improving the stability and future evolution of the integration.

Highlights

  • Iceberg Public API Annotation: Introduced a new @IcebergApi annotation (org.apache.comet.IcebergApi) to explicitly mark classes, methods, constructors, and fields that form the public API contract between Comet and Apache Iceberg. This annotation serves as a clear indicator for developers regarding compatibility considerations.
  • API Contract Formalization: Applied the newly created @IcebergApi annotation across numerous existing Comet classes, methods, and fields within the org.apache.comet, org.apache.comet.parquet, org.apache.comet.vector, and org.apache.arrow.c packages. This formalizes the API surface that Iceberg relies upon for its integration with Comet's native Parquet reader.
  • Comprehensive Documentation: Added a new, detailed documentation page (docs/source/contributor-guide/iceberg_public_api.md) to the contributor guide. This document outlines the specific Comet APIs used by Iceberg, explains the integration approaches (Hybrid and Native Readers), and provides crucial compatibility considerations for future development.
  • Dedicated API Verification Tests: Created a new Maven module (iceberg-public-api) with a suite of reflection-based tests. These tests (IcebergApiVerificationTest.java and various *ApiTest.java files) automatically verify that all elements marked with @IcebergApi are publicly accessible and maintain their expected signatures, ensuring the stability of the Iceberg integration.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude
Copy link
Copy Markdown

claude bot commented Jan 22, 2026

Pull Request Review: Iceberg Public API Documentation

Summary

This PR adds comprehensive documentation and test coverage for the Comet public API used by Apache Iceberg. The changes include a new @IcebergApi annotation, detailed documentation, and a dedicated test module to verify the API contract.


Positive Aspects ✅

1. Excellent Documentation

  • The new docs/source/contributor-guide/iceberg_public_api.md is comprehensive and well-structured
  • Clear explanation of the hybrid vs native reader approaches
  • Detailed API surface documentation with code examples
  • Good coverage of compatibility considerations

2. Well-Designed Annotation

  • The @IcebergApi annotation (common/src/main/java/org/apache/comet/IcebergApi.java) is properly designed:
    • @Retention(RetentionPolicy.RUNTIME) allows reflection-based verification
    • ✅ Targets classes, methods, constructors, and fields
    • ✅ Comprehensive Javadoc explaining the purpose
    • ✅ Proper ASF license header

3. Comprehensive Test Coverage

  • 18 test files covering all annotated API classes
  • IcebergApiVerificationTest uses reflection to verify:
    • All expected classes have annotations
    • Annotated elements are properly accessible (public/protected)
    • Expected methods and constructors exist
  • Individual API test classes (e.g., FileReaderApiTest, BatchReaderApiTest) verify functional correctness
  • ParquetTestHelper provides good testing utilities

4. Proper Module Structure

  • New iceberg-public-api module is test-only:
    • maven.install.skip=true and maven.deploy.skip=true
    • ✅ Only test scope dependencies
    • ✅ Properly integrated into parent POM
  • Good separation of concerns

5. Minimal Code Changes

  • Only 61 @IcebergApi annotations added to existing code
  • No behavioral changes to production code
  • Changes are purely additive for documentation/verification

Issues and Concerns 🔍

1. Missing Import Ordering Fix ⚠️

In AbstractCometSchemaImporter.java:

import org.apache.arrow.vector.types.pojo.Field;

import org.apache.comet.IcebergApi;  // ⚠️ Extra blank line before this

This violates Java import grouping conventions. The org.apache.comet import should be in its own group, but there's an extra blank line that shouldn't be there. It should be:

import org.apache.arrow.vector.types.pojo.Field;
import org.apache.comet.IcebergApi;

Location: common/src/main/java/org/apache/arrow/c/AbstractCometSchemaImporter.java:23-25

2. Documentation Clarity Issues

2.1 RootAllocator Reference

The documentation mentions org.apache.comet.shaded.arrow.memory.RootAllocator but the actual import used in tests is:

org.apache.arrow.memory.BufferAllocator

This could be confusing. The documentation should clarify:

  • When to use the shaded version vs. the non-shaded version
  • Whether Iceberg actually uses the shaded classes or if this is internal to Comet

2.2 Missing Error Handling Context

The documentation shows API signatures but doesn't document:

  • What exceptions can be thrown and when
  • Error handling patterns Iceberg should follow
  • What happens when invalid inputs are provided

3. Test Coverage Gaps

3.1 Missing Negative Tests

The test suite focuses on happy paths but lacks:

  • Tests for error conditions (invalid inputs, IOExceptions, etc.)
  • Tests for boundary conditions (null values, empty arrays, etc.)
  • Tests for concurrent access patterns if applicable

3.2 MockInputFile Implementation

In FileReaderApiTest.java, the MockInputFile class uses reflection to mimic Iceberg's InputFile:

private static class MockInputFile {
  public long getLength() throws IOException { ... }
  public org.apache.parquet.io.SeekableInputStream newStream() { ... }
}

Issue: This doesn't actually implement Iceberg's InputFile interface. The test passes Object.class to WrappedInputFile constructor, which may not catch type safety issues. Consider adding a comment explaining why this approach was chosen, or use an actual Iceberg dependency for more realistic testing.

Location: iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/FileReaderApiTest.java:157-182

4. Inconsistent Annotation Coverage ⚠️

Looking at the documentation vs. code:

  • Documentation mentions protected long nativeHandle field in AbstractColumnReader and MetadataColumnReader
  • These protected fields should have @IcebergApi annotations since they're explicitly documented as "accessed by Iceberg subclasses"

Recommendation: Add @IcebergApi annotations to protected fields that are part of the public API contract.

5. Missing Deprecation Strategy

The documentation notes:

"Changes to these APIs may break Iceberg's Comet integration"

But there's no guidance on:

  • Deprecation policy if APIs need to change
  • Versioning strategy for the API
  • Communication plan for breaking changes
  • Timeline for maintaining backward compatibility

6. Potential Security Concern 🔒

In FileReader constructor parameters:

public FileReader(
    ...
    byte[] fileEncryptionKey,    // ⚠️
    byte[] fileAADPrefix         // ⚠️
)

Concern: The documentation doesn't mention:

  • How encryption keys should be handled securely
  • Whether these byte arrays are copied or referenced (potential mutation issues)
  • Memory cleanup requirements for sensitive data

Recommendation: Add security considerations section to the documentation.


Suggestions for Improvement 💡

1. Add Version Information

Consider adding version metadata to the annotation or documentation:

@IcebergApi
@Since("0.13.0")  // Track when APIs were added
public class FileReader { ... }

2. Document Thread Safety

Add thread safety guarantees/requirements to the documentation. For example:

  • Are FileReader instances thread-safe?
  • Can BatchReader be used concurrently?
  • What synchronization is required?

3. Add API Stability Contract

Create a document section explaining:

## API Stability Guarantee
- **Stable APIs**: Marked with @IcebergApi, maintained for N major versions
- **Breaking Changes**: Require deprecation period of X months
- **Notification**: Changes announced via mailing list and GitHub discussions

4. Consider Adding API Usage Examples

The documentation shows signatures but could benefit from complete working examples:

// Example: Reading a Parquet file with Comet
try (FileReader reader = new FileReader(...)) {
    reader.setRequestedSchemaFromSpecs(specs);
    while ((RowGroupReader rg = reader.readNextRowGroup()) \!= null) {
        // Process row group
    }
}

5. Add Integration Testing

Consider adding a test that actually uses Iceberg (if feasible) to verify the integration works end-to-end, rather than just API surface testing.


Code Quality Assessment 📊

Aspect Rating Notes
Documentation ⭐⭐⭐⭐⭐ Excellent, comprehensive coverage
Test Coverage ⭐⭐⭐⭐ Good API verification, could use more edge cases
Code Quality ⭐⭐⭐⭐⭐ Clean, minimal changes
Architecture ⭐⭐⭐⭐⭐ Well-designed annotation system
Security ⭐⭐⭐ Needs documentation on encryption handling

Performance Considerations ✓

No performance concerns - this is purely documentation and testing infrastructure. The annotation overhead at runtime is negligible.


Recommendation

APPROVE with minor fixes

This is an excellent PR that addresses an important need for documenting the Iceberg integration API. The issues identified are minor and mostly related to documentation clarity and test coverage edge cases.

Required Changes:

  1. Fix the import ordering in AbstractCometSchemaImporter.java

Recommended Changes (can be addressed in follow-up PRs):

  1. Add @IcebergApi to protected fields mentioned in documentation
  2. Clarify shaded vs non-shaded Arrow class usage
  3. Add error handling and security sections to documentation
  4. Consider adding thread safety documentation
  5. Add deprecation/versioning policy

Great work on this comprehensive documentation effort! This will significantly help maintain compatibility with Iceberg going forward. 🎉

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Jan 22, 2026

🤖 Augment PR Summary

Summary: This PR formalizes and documents the subset of Comet’s JVM APIs that Apache Iceberg relies on for its hybrid Parquet reader integration.

Changes:

  • Introduces a new @IcebergApi annotation (runtime-retained) to explicitly mark Iceberg-facing public API elements.
  • Annotates key Parquet reader classes/methods/fields (e.g., FileReader, BatchReader, AbstractColumnReader, Native) plus schema/vector entry points used by Iceberg.
  • Makes CometVector’s constructor public to support Iceberg’s usage/extension patterns.
  • Adds a contributor-guide page documenting the Iceberg public API surface and expected call flows.
  • Adds a new iceberg-public-api Maven module with JUnit/AssertJ tests that verify API presence and accessibility via reflection.
  • Wires the new module into the root build and links the new guide page from the contributor guide index.

Technical Notes: The API is designed to be shading-safe (e.g., using ParquetColumnSpec and Object wrappers to avoid direct dependency on Iceberg-shaded Parquet types).

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 5 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

// Methods used by Iceberg
public void setRequestedSchemaFromSpecs(List<ParquetColumnSpec> specs)
public RowGroupReader readNextRowGroup() throws IOException
public void skipNextRowGroup()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guide documents skipNextRowGroup() as returning void, but the actual @IcebergApi method returns boolean (and BatchReader.nextBatch(int) also returns boolean). Keeping these signatures consistent in the guide will prevent confusion when validating the Iceberg-facing contract.

Other Locations
  • docs/source/contributor-guide/iceberg_public_api.md:162

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎


```java
// Constructor
public WrappedInputFile(org.apache.iceberg.io.InputFile inputFile)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WrappedInputFile constructor is documented as taking org.apache.iceberg.io.InputFile, but the implemented API is WrappedInputFile(Object) to avoid shading/type-coupling. Consider aligning the rest of the signatures in this guide with the shading-safe API (e.g., Utils.getColumnReader using ParquetColumnSpec).

Other Locations
  • docs/source/contributor-guide/iceberg_public_api.md:228

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎

@After
public void tearDown() throws IOException {
if (tempDir != null && Files.exists(tempDir)) {
Files.walk(tempDir).sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Files.walk(tempDir) returns a stream that should be closed (try-with-resources), otherwise it can leak file descriptors and cause intermittent cleanup failures in tests.

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎

return Modifier.isPublic(modifiers) || Modifier.isProtected(modifiers);
}

/** Checks if native library is available. */
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isNativeLibraryAvailable() currently only verifies NativeBase is on the classpath, not that the JNI library can actually be loaded; the name/comment could be misleading if it’s used to gate native-dependent tests.

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎

* @deprecated since 0.10.0, will be removed in 0.11.0.
* @see <a href="https://github.com/apache/datafusion-comet/issues/2079">Comet Issue #2079</a>
*/
@IcebergApi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor is marked as deprecated “will be removed in 0.11.0” but is now also marked @IcebergApi; that combination is easy to misinterpret and risks accidental removal without coordinating the Iceberg integration contract.

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
common/src/main/java/org/apache/comet/parquet/RowGroupReader.java (1)

41-45: Add @IcebergApi annotation to the public constructor.

The public constructor RowGroupReader(long rowCount, long rowIndexOffset) should be annotated with @IcebergApi to align with similar public constructors in the codebase (e.g., CometSchemaImporter, WrappedInputFile). Since this class is part of the Iceberg API surface and FileReader.readNextRowGroup() (an @IcebergApi method) directly instantiates this constructor, the constructor must be explicitly marked as part of the public API.

common/src/main/java/org/apache/comet/parquet/ParquetColumnSpec.java (1)

89-91: Annotate isRepeated() if it’s part of the Iceberg-facing API.

It’s a public getter but is the only one without @IcebergApi, which could unintentionally exclude it from the documented API surface.

✅ Suggested fix
-  public boolean isRepeated() {
+  `@IcebergApi`
+  public boolean isRepeated() {
     return isRepeated;
   }
🤖 Fix all issues with AI agents
In `@docs/source/contributor-guide/iceberg_public_api.md`:
- Around line 308-321: Update the docs to use the actual public API names:
replace the nonexistent CometTypeUtils with TypeUtil when describing conversion
of ColumnDescriptor to ParquetColumnSpec, and change the incorrect
delegate().currentBatch() reference to call currentBatch() directly on the
appropriate reader type (e.g., ColumnReader, BatchReader or
AbstractColumnReader) so the flow reads that you retrieve results via
currentBatch() on the column/batch reader instead of via delegate().
- Around line 250-259: The doc currently lists AbstractCometSchemaImporter under
package org.apache.arrow.c but at runtime the shaded artifact relocates it to
org.apache.comet.shaded.arrow.c; update the text around
AbstractCometSchemaImporter to clarify both names (source vs runtime) or replace
the package with the runtime package org.apache.comet.shaded.arrow.c so API
consumers can import AbstractCometSchemaImporter correctly, and mention that JNI
classes remain under org.apache.arrow.* as an exception.

In
`@iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/BatchReaderApiTest.java`:
- Around line 85-89: The test BatchReaderApiTest.testCloseMethodExists only
asserts the existence of BatchReader.close() but does not verify the `@IcebergApi`
annotation; update this test to mirror
AbstractColumnReaderApiTest.testCloseMethodExists by retrieving the close method
via BatchReader.class.getMethod("close") and asserting that
method.isAnnotationPresent(IcebergApi.class) (or equivalent assertion using the
test framework) so the test both checks the method exists and has the
`@IcebergApi` annotation.
🧹 Nitpick comments (11)
common/src/main/java/org/apache/comet/parquet/ReadOptions.java (1)

151-175: Consider annotating the Builder constructor for API consistency.

The Builder(Configuration conf) constructor is public but lacks the @IcebergApi annotation. While users are expected to use ReadOptions.builder(conf), direct constructor access is possible. For consistency with other annotated constructors in this PR (e.g., CometSchemaImporter, WrappedInputFile), consider either:

  1. Adding @IcebergApi to the constructor, or
  2. Reducing the constructor visibility to package-private if it's only meant to be accessed via the factory method
common/src/main/java/org/apache/comet/parquet/TypeUtil.java (1)

39-46: Deprecated method annotated as Iceberg API - consider adding @Deprecated annotation.

The method is documented as deprecated via Javadoc (@deprecated since 0.10.0, will be removed in 0.11.0), but the @Deprecated Java annotation is missing. Adding both ensures:

  1. Compiler warnings for callers
  2. IDE visibility of deprecation status
♻️ Suggested change
   * `@deprecated` since 0.10.0, will be removed in 0.11.0.
   * `@see` <a href="https://github.com/apache/datafusion-comet/issues/2079">Comet Issue `#2079`</a>
   */
+ `@Deprecated`
  `@IcebergApi`
  public static ColumnDescriptor convertToParquet(StructField field) {
common/src/main/java/org/apache/comet/parquet/WrappedInputFile.java (1)

44-65: Minor: Trailing comment artifacts and reflection accessibility note.

  1. Lines 47 and 58 have trailing // comments that appear to be artifacts and should be removed.

  2. The setAccessible(true) calls may trigger warnings or failures under Java's module system (Java 9+) when the target class is in a non-open module. If Iceberg's InputFile implementations are in packages not opened to Comet, this could fail at runtime. This is likely acceptable given the specific integration context, but worth documenting.

♻️ Remove trailing comment artifacts
-      Method targetMethod = wrapped.getClass().getDeclaredMethod("getLength"); //
+      Method targetMethod = wrapped.getClass().getDeclaredMethod("getLength");
-      Method targetMethod = wrapped.getClass().getDeclaredMethod("newStream"); //
+      Method targetMethod = wrapped.getClass().getDeclaredMethod("newStream");
iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/UtilsApiTest.java (1)

242-274: Consider adding assertion for isRepeated field in round-trip test.

The round-trip conversion test validates most fields but doesn't explicitly assert that isRepeated is preserved. While the original spec sets isRepeated to false (Line 255), the conversion logic in Utils.descriptorToParquetColumnSpec derives this from the repetition type, which should match.

♻️ Suggested enhancement
     assertThat(roundTrippedSpec.getLogicalTypeName()).isEqualTo(originalSpec.getLogicalTypeName());
+    assertThat(roundTrippedSpec.isRepeated()).isEqualTo(originalSpec.isRepeated());
   }
iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/AbstractApiTest.java (1)

37-54: Consider handling deletion failures in tearDown.

The current cleanup uses File::delete which returns false on failure but doesn't throw, so failures are silent. For test cleanup this is usually acceptable, but you may want to log or handle failures for debugging purposes.

♻️ Optional: Add failure handling
   `@After`
   public void tearDown() throws IOException {
     if (tempDir != null && Files.exists(tempDir)) {
-      Files.walk(tempDir).sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
+      Files.walk(tempDir)
+          .sorted(Comparator.reverseOrder())
+          .forEach(
+              path -> {
+                try {
+                  Files.deleteIfExists(path);
+                } catch (IOException e) {
+                  // Log or ignore cleanup failures in tests
+                }
+              });
     }
   }
iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/IcebergApiVerificationTest.java (1)

266-338: Consider adding class-level annotation checks for consistency.

The tests for Utils, TypeUtil, and Native classes only verify method existence but don't check for class-level @IcebergApi annotation presence, unlike other individual class tests (e.g., FileReader, BatchReader). This inconsistency may be intentional if these are utility classes without class-level annotations, but adding a comment would clarify the intent.

♻️ Suggested: Add consistency or clarifying comments
   `@Test`
   public void testUtilsHasExpectedPublicApi() {
     Class<?> clazz = Utils.class;
+    // Utils is a utility class with only static methods annotated, not the class itself
 
     // Check for expected static methods
     assertStaticMethodExists(

Or, if the class should be annotated:

   `@Test`
   public void testUtilsHasExpectedPublicApi() {
     Class<?> clazz = Utils.class;
+    assertThat(isPublic(clazz)).isTrue();
 
     // Check for expected static methods
iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/vector/CometVectorApiTest.java (1)

111-178: Inconsistent IcebergApi annotation verification for inherited methods.

Several method existence tests (e.g., testCloseMethodExists, testIsNullAtMethodExists, and the getter methods) don't verify the @IcebergApi annotation, unlike the tests for setNumNulls, setNumValues, numValues, getValueVector, and slice. If these methods are intentionally excluded from the Iceberg public API surface (e.g., inherited from ColumnVector), this is fine. Otherwise, consider adding annotation verification for consistency.

iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/WrappedInputFileApiTest.java (1)

127-131: Potential integer overflow in test mock (minor).

Casting length from long to int at line 129 could overflow for values exceeding Integer.MAX_VALUE. Since test inputs are controlled small values, this is not an issue in practice, but consider using Math.toIntExact(length) to fail fast if someone adds a test with a large length value.

♻️ Suggested improvement
     public InputStream newStream() {
       // Return a simple ByteArrayInputStream for testing
-      byte[] data = new byte[(int) length];
+      byte[] data = new byte[Math.toIntExact(length)];
       return new ByteArrayInputStream(data);
     }
iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/RowGroupReaderApiTest.java (1)

128-154: Deduplicate MockInputFile across API tests.

This helper duplicates the one in FileReaderApiTest; consider extracting a shared test utility to reduce maintenance drift.

iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/ColumnReaderApiTest.java (1)

75-92: Consider asserting @IcebergApi on readBatch/currentBatch/close for consistency.

Other API tests verify annotations on public methods; adding the checks here tightens the contract.

♻️ Proposed update
   `@Test`
   public void testReadBatchMethodExists() throws NoSuchMethodException {
     Method method = ColumnReader.class.getMethod("readBatch", int.class);
     assertThat(method).isNotNull();
+    assertThat(hasIcebergApiAnnotation(method)).isTrue();
   }

   `@Test`
   public void testCurrentBatchMethodExists() throws NoSuchMethodException {
     Method method = ColumnReader.class.getMethod("currentBatch");
     assertThat(method).isNotNull();
+    assertThat(hasIcebergApiAnnotation(method)).isTrue();
     assertThat(method.getReturnType().getSimpleName()).isEqualTo("CometVector");
   }

   `@Test`
   public void testCloseMethodExists() throws NoSuchMethodException {
     Method method = ColumnReader.class.getMethod("close");
     assertThat(method).isNotNull();
+    assertThat(hasIcebergApiAnnotation(method)).isTrue();
   }
iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/AbstractColumnReaderApiTest.java (1)

57-59: Optional: redundant null assertions.

The getMethod() and getDeclaredField() reflection methods throw NoSuchMethodException/NoSuchFieldException if the element doesn't exist—they never return null. The assertThat(...).isNotNull() calls are technically redundant since the test would already fail with an exception.

This is a minor nit and can be kept for readability if preferred.

Also applies to: 64-66, 71-72, 78-79, 86-88, 94-95

Comment on lines +250 to +259
## Package: `org.apache.arrow.c`

### AbstractCometSchemaImporter

Base class for `CometSchemaImporter`.

```java
// Methods used by Iceberg
public void close()
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "class AbstractCometSchemaImporter" -g '*.java'

Repository: martin-augment/datafusion-comet

Length of output: 206


🏁 Script executed:

rg -n "package.*arrow" -g '*.java' -A 2 | head -50

Repository: martin-augment/datafusion-comet

Length of output: 645


🏁 Script executed:

cat pom.xml | grep -A 20 "shade\|arrow" | head -60

Repository: martin-augment/datafusion-comet

Length of output: 3252


🏁 Script executed:

rg -n "relocation|shadedPattern|includes|excludes" pom.xml -B 2 -A 2 | grep -A 5 "arrow"

Repository: martin-augment/datafusion-comet

Length of output: 57


🏁 Script executed:

fd "pom.xml" -x cat {} \; | grep -B 5 -A 15 "<relocations>"

Repository: martin-augment/datafusion-comet

Length of output: 2282


Clarify the package name for AbstractCometSchemaImporter based on context of use.

The source code resides in org.apache.arrow.c, but the shade plugin relocates all org.apache.arrow.* classes to org.apache.comet.shaded.arrow.* (except JNI classes). Since AbstractCometSchemaImporter is not a JNI class, users importing from the shaded JAR will need to reference org.apache.comet.shaded.arrow.c. Update the documentation to specify which package name applies (source vs. runtime) or use the runtime package name if this guide is for API consumers.

🤖 Prompt for AI Agents
In `@docs/source/contributor-guide/iceberg_public_api.md` around lines 250 - 259,
The doc currently lists AbstractCometSchemaImporter under package
org.apache.arrow.c but at runtime the shaded artifact relocates it to
org.apache.comet.shaded.arrow.c; update the text around
AbstractCometSchemaImporter to clarify both names (source vs runtime) or replace
the package with the runtime package org.apache.comet.shaded.arrow.c so API
consumers can import AbstractCometSchemaImporter correctly, and mention that JNI
classes remain under org.apache.arrow.* as an exception.

Comment on lines +308 to +321
1. Iceberg creates a `WrappedInputFile` from its `InputFile`
2. Creates `ReadOptions` via builder pattern
3. Instantiates `FileReader` with the wrapped input file
4. Converts Parquet `ColumnDescriptor`s to `ParquetColumnSpec`s using `CometTypeUtils`
5. Calls `setRequestedSchemaFromSpecs()` to specify which columns to read
6. Iterates through row groups via `readNextRowGroup()` and `skipNextRowGroup()`

### Column Reading Flow

1. Creates `CometSchemaImporter` with a `RootAllocator`
2. Uses `Utils.getColumnReader()` to create appropriate column readers
3. Calls `reset()` and `setPageReader()` for each row group
4. Uses `BatchReader` to coordinate reading batches across all columns
5. Retrieves results via `delegate().currentBatch()`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether these identifiers exist in this repo.
rg -n "CometTypeUtils|delegate\(\)\.currentBatch" -g '*.java'

Repository: martin-augment/datafusion-comet

Length of output: 57


🏁 Script executed:

#!/bin/bash
# Search for similar utility class names and column reader methods
rg -n "class.*TypeUtil|class.*Utils" -g '*.java' | head -20
rg -n "currentBatch\(\)" -g '*.java' | head -20
rg -n "ColumnReader" -g '*.java' | head -10

Repository: martin-augment/datafusion-comet

Length of output: 4195


🏁 Script executed:

#!/bin/bash
# Read the documentation file to see the full context
cat -n docs/source/contributor-guide/iceberg_public_api.md | sed -n '300,330p'

Repository: martin-augment/datafusion-comet

Length of output: 1483


🏁 Script executed:

#!/bin/bash
# Check TypeUtil methods
cat -n common/src/main/java/org/apache/comet/parquet/TypeUtil.java | head -60

echo "---"

# Check TypeUtilApiTest to see correct usage
cat -n iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/TypeUtilApiTest.java

Repository: martin-augment/datafusion-comet

Length of output: 12938


Correct API symbol names in documentation flow.

Line 311 references CometTypeUtils, which doesn't exist in the codebase. The correct class is TypeUtil. Additionally, line 321 references delegate().currentBatch(), which also doesn't exist. The correct method is currentBatch() called directly on the appropriate column reader class (e.g., ColumnReader, BatchReader, or AbstractColumnReader). Update these references so contributors can accurately map the documented flows to the actual public API.

🤖 Prompt for AI Agents
In `@docs/source/contributor-guide/iceberg_public_api.md` around lines 308 - 321,
Update the docs to use the actual public API names: replace the nonexistent
CometTypeUtils with TypeUtil when describing conversion of ColumnDescriptor to
ParquetColumnSpec, and change the incorrect delegate().currentBatch() reference
to call currentBatch() directly on the appropriate reader type (e.g.,
ColumnReader, BatchReader or AbstractColumnReader) so the flow reads that you
retrieve results via currentBatch() on the column/batch reader instead of via
delegate().

Comment on lines +85 to +89
@Test
public void testCloseMethodExists() throws NoSuchMethodException {
Method method = BatchReader.class.getMethod("close");
assertThat(method).isNotNull();
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing @IcebergApi annotation verification for close() method.

In AbstractColumnReaderApiTest.testCloseMethodExists(), the test verifies that close() has the @IcebergApi annotation. For consistency, this test should also verify the annotation on BatchReader.close().

Proposed fix
 `@Test`
 public void testCloseMethodExists() throws NoSuchMethodException {
   Method method = BatchReader.class.getMethod("close");
   assertThat(method).isNotNull();
+  assertThat(hasIcebergApiAnnotation(method)).isTrue();
 }
🤖 Prompt for AI Agents
In
`@iceberg-public-api/src/test/java/org/apache/comet/iceberg/api/parquet/BatchReaderApiTest.java`
around lines 85 - 89, The test BatchReaderApiTest.testCloseMethodExists only
asserts the existence of BatchReader.close() but does not verify the `@IcebergApi`
annotation; update this test to mirror
AbstractColumnReaderApiTest.testCloseMethodExists by retrieving the close method
via BatchReader.class.getMethod("close") and asserting that
method.isAnnotationPresent(IcebergApi.class) (or equivalent assertion using the
test framework) so the test both checks the method exists and has the
`@IcebergApi` annotation.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new @IcebergApi annotation to formally define the public API used for Apache Iceberg integration. The changes are systematically applied across the codebase, and a new documentation file (iceberg_public_api.md) is added to describe this API. Crucially, a new test module (iceberg-public-api) is included, which uses reflection to verify the API contract, preventing accidental breaking changes in the future. The changes are well-structured and significantly improve the maintainability and clarity of the Iceberg integration. My review identified a few minor inaccuracies in the new documentation, for which I've provided suggestions.

@@ -0,0 +1,342 @@
<!--
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This is a great addition to the documentation! I have a few suggestions to improve its accuracy and clarity:

  1. Utils.getColumnReader signature: The signature for Utils.getColumnReader on line 226 seems incorrect. The actual annotated method in Utils.java is getColumnReader(DataType, ParquetColumnSpec, ...). The documented one uses ColumnDescriptor and has a different set of boolean flags. Could you please update the documentation to reflect the correct signature of the public API method?

  2. MetadataColumnReader.nativeHandle: On line 183, protected long nativeHandle is listed under MetadataColumnReader. This field is inherited from AbstractColumnReader and not declared in MetadataColumnReader itself. To avoid confusion, it might be better to remove it from the MetadataColumnReader section.

  3. CometTypeUtils typo: On line 311, CometTypeUtils is mentioned. This seems to be a typo and should probably be Utils, as Utils.descriptorToParquetColumnSpec is the relevant method in Comet's API.

  4. Ambiguous reset() call: On line 319, the documentation mentions Calls reset() ... for each row group. The method reset() is not on any of the reader classes. It seems to refer to Native.resetBatch(), which is called internally. It would be clearer if this step was rephrased, for example: "Reading a new batch will reset the internal state."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants