Conversation

@aastha25 (Contributor) commented Oct 24, 2025

What changes are proposed in this pull request, and why are they necessary?

Summary
This PR introduces native Apache Iceberg table support to Coral, enabling direct schema conversion from Iceberg to Calcite's RelDataType without lossy intermediate conversions through Hive's type system. The implementation preserves Iceberg-specific type semantics including timestamp precision and explicit nullability.
Key architectural decision: HiveMetastoreClient remains unchanged and does NOT extend CoralCatalog. Integration classes use composition (storing both instances) with runtime dispatch.

New Components

  • Catalog Abstraction (coral-common/src/main/java/com/linkedin/coral/common/catalog/)
  • CoralCatalog: Format-agnostic catalog interface with getTable(), getAllTables(), namespaceExists() (see the interface sketch after this list)
  • CoralTable: Unified table metadata interface (name(), properties(), tableType())
  • HiveCoralTable / IcebergCoralTable: Implementations wrapping native Hive/Iceberg table objects
  • TableType: Simple enum (TABLE or VIEW)
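
For orientation, a minimal sketch of the two interfaces based on the method names above; exact signatures and return types are assumptions for illustration, not the PR's code:

import java.util.List;
import java.util.Map;

// Format-agnostic catalog entry point (sketch).
public interface CoralCatalog {
  CoralTable getTable(String dbName, String tableName); // assumed to return null if the table is absent
  List<String> getAllTables(String dbName);
  boolean namespaceExists(String dbName);
}

// Unified table metadata (sketch).
public interface CoralTable {
  String name();                    // fully qualified, e.g. "db.tbl"
  Map<String, String> properties(); // table properties
  TableType tableType();            // TABLE or VIEW
}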

Iceberg Integration

  • IcebergTable: Calcite ScannableTable implementation for Iceberg tables
  • IcebergTypeConverter: Converts Iceberg Schema → Calcite RelDataType with precision preservation (see the sketch after this list)
  • IcebergHiveTableConverter: Backward compatibility bridge for UDF resolution (converts Iceberg → Hive table object)
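
For intuition, a hedged sketch of the kind of mapping IcebergTypeConverter performs; the class name, method names, and the set of cases shown are assumptions for illustration, not the PR's code:

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.sql.type.SqlTypeName;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public final class IcebergTypeConversionSketch {
  // Maps an Iceberg primitive type to a Calcite type, preserving precision.
  static RelDataType toRelType(Type icebergType, RelDataTypeFactory typeFactory) {
    switch (icebergType.typeId()) {
      case TIMESTAMP:
        // Iceberg timestamps always carry microsecond precision (6 digits)
        return typeFactory.createSqlType(SqlTypeName.TIMESTAMP, 6);
      case STRING:
        return typeFactory.createSqlType(SqlTypeName.VARCHAR, Integer.MAX_VALUE);
      case BOOLEAN:
        return typeFactory.createSqlType(SqlTypeName.BOOLEAN);
      default:
        throw new UnsupportedOperationException("Not covered in this sketch: " + icebergType);
    }
  }

  // Applies Iceberg's explicit nullability to a converted field type.
  static RelDataType toRelType(Types.NestedField field, RelDataTypeFactory typeFactory) {
    RelDataType converted = toRelType(field.type(), typeFactory);
    return typeFactory.createTypeWithNullability(converted, field.isOptional());
  }
}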

Integration Pattern

  • HiveSchema, HiveDbSchema, ToRelConverter: Store both CoralCatalog and HiveMetastoreClient instances
  • Runtime dispatch: if coralCatalog != null, use the unified path; else if msc != null, use the Hive-only path (see the sketch after this list)
  • HiveMetastoreClient and HiveMscAdapter marked @Deprecated (still functional; prefer CoralCatalog)
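
The dual-path dispatch, as a minimal sketch; toCalciteTable is a hypothetical helper (assumed overloaded for both table representations) and the field names follow the description above, so this is illustrative rather than the PR's exact code:

// Sketch of the runtime dispatch used by the integration classes.
private org.apache.calcite.schema.Table getTable(String dbName, String tableName) {
  if (coralCatalog != null) {
    // Unified path: CoralCatalog serves both Hive and Iceberg tables
    return toCalciteTable(coralCatalog.getTable(dbName, tableName));
  } else if (msc != null) {
    // Legacy Hive-only path, kept for backward compatibility
    return toCalciteTable(msc.getTable(dbName, tableName));
  }
  throw new IllegalStateException("Neither CoralCatalog nor HiveMetastoreClient is configured");
}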

How Reviewers Should Read This
Start here:

  1. CoralCatalog.java - New abstraction layer interface
  2. CoralTable.java - Unified table metadata interface
  3. IcebergCoralTable.java - How Iceberg tables are wrapped
  4. IcebergTypeConverter.java - Core schema conversion logic

Then review integration:

  1. HiveDbSchema.java - Dispatch logic based on CoralTable type (Iceberg vs Hive)
  2. IcebergTable.java - Calcite integration
  3. ToRelConverter.java - Dual-path support (CoralCatalog vs HiveMetastoreClient)
  4. HiveMetastoreClient.java - Backward compatibility

Test:

  1. IcebergTableConverterTest.java - End-to-end Iceberg conversion test

How was this patch tested?

New and existing tests pass. Integration tests are a work in progress.

@sumedhsakdeo left a comment


Thanks @aastha25 , code looks great. Added some questions / comments, ptal.

Comment on lines 122 to 123
// Iceberg timestamp type - microsecond precision (6 digits)
convertedType = typeFactory.createSqlType(SqlTypeName.TIMESTAMP, 6);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we handle timestamp with time zone for completeness?

Suggested change
-// Iceberg timestamp type - microsecond precision (6 digits)
-convertedType = typeFactory.createSqlType(SqlTypeName.TIMESTAMP, 6);
+Types.TimestampType timestampType = (Types.TimestampType) icebergType;
+if (timestampType.shouldAdjustToUTC()) {
+  // TIMESTAMP WITH TIME ZONE - stores instant in time
+  convertedType = typeFactory.createSqlType(SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE, 6);
+} else {
+  // TIMESTAMP - stores local datetime
+  convertedType = typeFactory.createSqlType(SqlTypeName.TIMESTAMP, 6);
+}

Ref: https://github.com/apache/calcite/blob/95350ed1a449bbb2f008fcf2b704544e7d95c410/core/src/main/java/org/apache/calcite/sql/type/SqlTypeName.java#L73

convertedType = typeFactory.createSqlType(SqlTypeName.DATE);
break;
case TIME:
convertedType = typeFactory.createSqlType(SqlTypeName.TIME);


Suggested change
-convertedType = typeFactory.createSqlType(SqlTypeName.TIME);
+convertedType = typeFactory.createSqlType(SqlTypeName.TIME, 6);

convertedType = typeFactory.createSqlType(SqlTypeName.BINARY, fixedType.length());
break;
case BINARY:
convertedType = typeFactory.createSqlType(SqlTypeName.VARBINARY, Integer.MAX_VALUE);


Any particular reason why we use VARBINARY over BINARY here? Unlike HiveTypeConverter:

convertedType = dtFactory.createSqlType(SqlTypeName.BINARY);

Contributor Author:


You're right, I'm changing this to be the same as Hive.


@Override
public Schema.TableType getJdbcTableType() {
return dataset.tableType() == TableType.VIEW ? Schema.TableType.VIEW : Schema.TableType.TABLE;


Suggested change
-return dataset.tableType() == TableType.VIEW ? Schema.TableType.VIEW : Schema.TableType.TABLE;
+return Schema.TableType.TABLE;

🤔, with an assert that dataset.tableType() should be TableType.MANAGED_TABLE?

/**
* Returns the underlying Iceberg Table for advanced operations.
*
* @return Iceberg Table object


Suggested change
-* @return Iceberg Table object
+* @return org.apache.iceberg.Table

Comment on lines 21 to 24
* Utility class to convert Iceberg datasets to Hive Table objects for backward compatibility.
*
* <p>This converter creates complete Hive Table objects from Iceberg tables, including schema conversion
* using {@code HiveSchemaUtil}. While the table object acts as "glue code" for backward compatibility,
* it populates all standard Hive table metadata to ensure broad compatibility with downstream code paths.


Do we expect to exercise this glue code in practice? If so, under what scenarios?

Contributor Author:


This code path is used to read table properties when parsing the view SQL in ParseTreeBuilder.

Yes, this glue code is exercised primarily to retrieve eligible table properties on the base tables during the parsing stage in ParseTreeBuilder/HiveFunctionResolver (no schema dependency). Without this glue, we would need larger-scale refactoring in those classes to interpret IcebergTable natively.


// Convert Iceberg schema to Hive columns
try {
storageDescriptor.setCols(HiveSchemaUtil.convert(icebergTable.schema()));


Any particular reason why we choose to set storageDescriptor columns from HiveSchemaUtil.convert(icebergTable.schema()) and not from AvroSchemaUtil.convert(hiveParameters.get("avro.schema.literal"))?

Contributor Author:


(a) The Iceberg schema is the source of truth, (b) the Avro literal may not always exist or could be stale, and (c) this logic is ported from existing production code paths so that we convert an Iceberg table to a Hive table object consistently across the stack.
Practically, setting this one way or the other in this specific class has no bearing on view schema resolution.

Comment on lines +104 to +105
0, // createTime
0, // lastModifiedTime
0, // retention


What are the side-effects of empty metadata here for owner .. retention?

Contributor Author:


None; we practically only need table properties for backward compatibility with the SQL parser logic in ParseTreeBuilder.

* @param dbName Database or namespace name
* @return true if the namespace exists, false otherwise
*/
boolean namespaceExists(String dbName);
Contributor:


Let us fix the inconsistencies among namespace, db, and schema.

@wmoustafa (Contributor):

Let us not use Dataset as it is not a standard term. Table may be more appropriate, but I understand you want to use it elsewhere, so we can find an alternative. I think Schema is also closer to Calcite terminology than Namespace. As much as possible we should use standard terms, or, when in ambiguity, terms closer to Calcite terminology.

* across different table formats (Hive, Iceberg, etc.).
*
* CoralCatalog abstracts away the differences between various table formats
* and provides a consistent way to access dataset information through
Contributor:


Ditto on "Dataset" terminology.

Contributor Author:


Thanks for the feedback; I have refactored the PR to move away from Dataset. We now have:
(1) CoralCatalog (the new catalog interface) and HiveMetastoreClient (the old catalog interface) are independent, and both work with Coral for translations. HiveMetastoreClient has been marked as deprecated in favor of CoralCatalog.
(2) getTable() is the API in CoralCatalog. It returns a CoralTable interface; currently we have two implementations of CoralTable - HiveCoralTable and IcebergCoralTable.
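
For illustration, a hedged usage sketch of the refactored API; the instanceof dispatch mirrors the description above, and the describe helper is hypothetical:

static void describe(CoralCatalog catalog, String dbName, String tableName) {
  CoralTable table = catalog.getTable(dbName, tableName);
  if (table instanceof IcebergCoralTable) {
    System.out.println(table.name() + " is Iceberg-backed");
  } else if (table instanceof HiveCoralTable) {
    System.out.println(table.name() + " is Hive-backed");
  }
}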

*/
@Override
public TableType tableType() {
return TableType.fromHiveTableType(table.getTableType());
Contributor:


Can you write the spec of conversion between Hive, Iceberg, and Coral representations? How does this expand to more table formats? Ideally we should have a Coral representation that is universal enough that everything can be converted to it. So I would expect methods like toCoralType as opposed to fromHiveType. Underlying table formats should not be hardcoded in the universal catalog either.

Contributor Author:


TableType.fromHiveTableType has been deleted.
Also, as discussed, the spec for table format -> Coral IR is just schema conversion, which is captured in TypeConverter for Hive tables and IcebergTypeConverter for Iceberg tables.

* @param icebergCoralTable Iceberg coral table to convert
* @return Hive Table object with complete metadata and schema
*/
public static Table toHiveTable(IcebergCoralTable icebergCoralTable) {
Contributor:


Based on offline discussion, I understood we got rid of those, but it seems we still leverage methods that hardcode the table formats. This is not good for extensibility.

Contributor Author:


Captured the reasoning in detail in the class doc and linked ticket #575, specifying that this class will be deleted in a follow-up patch.

@wmoustafa (Contributor):

Can we eliminate

if (coralCatalog != null) {
  ...
} else {
  ...
}

that we are currently using everywhere and use some adapter class instead?

@wmoustafa (Contributor):

There are quite a few overlapping wrappers:

CoralTable --> HiveCoralTable / IcebergCoralTable
HiveTable / HiveViewTable
IcebergTable

The layering is conceptually unclear. Can we simplify this and merge a few classes?

* @param namespaceName Namespace (database) name
* @return true if the namespace exists, false otherwise
*/
boolean namespaceExists(String namespaceName);
Contributor:


Inconsistency between namespace here and schema elsewhere.

Contributor Author:


Resolved it in multiple places; let me know if I missed it somewhere.

@wmoustafa (Contributor) commented Oct 31, 2025

I have considered a few design options and this seems to make the most sense:

interface CoralTable extends ScannableTable
class HiveTable implements CoralTable
class IcebergTable implements CoralTable

@wmoustafa (Contributor):

> I have considered a few design options and this seems to make the most sense:
>
> interface CoralTable extends ScannableTable
> class HiveTable implements CoralTable
> class IcebergTable implements CoralTable

Discussed offline. The motivation of the above was to avoid duplicating implementation layers (i.e., having both HiveTable and HiveCoralTable, and IcebergTable and IcebergCoralTable, as in the current PR). The idea was to consolidate each table format's implementation into a single class that directly implements ScannableTable. However, this approach exposes RelDataType through the CoralTable API, which could make it harder to replace RelDataType in the future, especially since CoralTable will be exposed to the engine connectors.

To properly decouple the two, we would need a standalone Coral type system that models schema and type metadata independently. That type system has now been introduced in #558, which can serve as the foundation for adopting an approach that makes CoralTable fully standalone and decoupled from ScannableTable.
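
To make the trade-off concrete, a hypothetical standalone shape; the schema() method and its CoralDataType return type are assumptions building on #558, not code from this PR:

public interface CoralTable {
  String name();
  java.util.Map<String, String> properties();
  TableType tableType();
  CoralDataType schema(); // Coral-native type from #558; no Calcite RelDataType in the API
}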

Comment on lines +25 to +29
*
* @return Fully qualified table name
*/
String name();

Contributor:


Does the method name need to reflect it is a fully qualified table name? Also, is this method used anywhere? Let us add things only when necessary.

Contributor Author:


An API called name() returning the fully qualified table name (db.tbl) seems a standard convention (also adopted in Iceberg). It is used in code, so the method is relevant.
Also, I have clarified in the javadoc what this method is expected to return.

*
* Used by Calcite integration to dispatch to HiveTable.
*/
public class HiveCoralTable implements CoralTable {
Contributor:


There are table implementations in common and table implementations in catalog. What is the basis for defining in each?

Contributor Author:


Discussed offline and also tried to clarify in the class code + javadocs.
TL;DR: catalog covers the user <-> Coral interaction; common covers the Coral <-> Calcite interaction.

*
* Used by Calcite integration to dispatch to HiveTable.
*/
public class HiveCoralTable implements CoralTable {
Contributor:


Does this need to be implemented as a plugin to avoid classpath issues? Also note that both Hive and Iceberg co-exist in the same module, which is not quite normal.

Contributor Author:


hive & iceberg dependencies co-existing in dynamic data lakes is an inevitable reality. On the feedback of injecting this dependency at runtime, that can be implemented in multiple way. We can do it in Coral or the downstream consumer of Coral can exclude iceberg brought in by Coral (upstream) & inject their own iceberg.
I'm going to evaluate the feasibility of implementing as this a plugin in Coral outside the scope of this PR.

Comment on lines +119 to +120
final CoralDataType coralType = HiveToCoralTypeConverter.convert(typeInfo);

Contributor:


Example of package inconsistency. Utility method for this implementation is in the common package.

Comment on lines +81 to +83
public org.apache.iceberg.Table getIcebergTable() {
return table;
}
Contributor:


Ideally you should not need this, and it might indicate a leak in the API. Can we avoid it?

Contributor Author:


I have marked this as an internal API which will be removed in a future release. This should be cleaned up as part of #57.


compile deps.'hadoop'.'hadoop-common'

// LinkedIn Iceberg dependencies
Contributor:


Another reason why this needs to be a plugin. This should integrate with OSS Iceberg too.

Contributor Author:


Specifically, I want to bring in the custom shaded distribution of li-iceberg-hive-metastore here. But in general, I feel OK with Coral depending on li-iceberg.

'iceberg-hive-metastore': "com.linkedin.iceberg:iceberg-hive-metastore:${versions['linkedin-iceberg']}:shaded"

Comment on lines +221 to +223
// Convert Iceberg coral table to minimal Hive Table for backward compatibility
// This is needed because downstream code (ParseTreeBuilder, HiveFunctionResolver)
// expects a Hive Table object for Dali UDF resolution
Contributor:


Can you elaborate more? Why not use CoralTable there?


+1; also, is the conversion lossy?

Contributor Author:


The integration here requires only the view's SQL definition and its table properties, so these details are lossless when converting IcebergTable (CoralTable) -> Hive Table.
Why we need this is explained in detail in #575.

* <li>Storage descriptor with SerDe info (for compatibility)</li>
* </ul>
*/
public class IcebergHiveTableConverter {
Contributor:


The point of this change is not to do this anymore.

Contributor Author:


Captured the reasoning in detail in the class doc and linked ticket #575, specifying that this class will be deleted in a follow-up patch.

*
* Copied structure from TypeConverter for consistency.
*/
public class IcebergTypeConverter {
Contributor:


I do not see why this is required. Could you explain?

* @param msc Hive metastore client for Hive-specific access (can be null if coralCatalog is provided)
* @param dbName Database name (must not be null)
*/
HiveDbSchema(CoralCatalog coralCatalog, HiveMetastoreClient msc, @Nonnull String dbName) {
Contributor:


Can we introduce the coral classes here, depending only on CoralCatalog, and mark the Hive ones as deprecated?

Contributor Author:


Good idea; please take a look at the new code. I have added new parallel classes here which work with CoralCatalog.

@sumedhsakdeo left a comment


Mostly looks good. In retrospect, this should have been broken down into smaller PRs to make reviews easier.



/**
* Utility class to convert Iceberg datasets to Hive Table objects for backward compatibility.


Does it also convert the schema? If yes, is the conversion lossy?

Contributor Author:


Captured the reasoning in detail in the class doc and linked ticket #575, specifying that this class will be deleted in a follow-up patch.
The converted schema is not used.

*
* Length indicates:
* - -1: unbounded/variable-length (LENGTH_UNBOUNDED, default for BINARY/VARBINARY)
* - 0: fixed-length (for FIXED types, e.g., Iceberg FIXED(16))


Suggested change
-* - 0: fixed-length (for FIXED types, e.g., Iceberg FIXED(16))
+* - >0: fixed-length (for FIXED types, e.g., Iceberg FIXED(16))

?

Comment on lines +102 to +104
case TIMESTAMP:
// Iceberg timestamp has microsecond precision (6 digits)
return TimestampType.of(6, nullable);


case BINARY:
// Variable-length binary
return BinaryType.of(BinaryType.LENGTH_UNBOUNDED, nullable);
case DECIMAL:


String stringViewExpandedText = null;
if (table.getTableType().equals("VIRTUAL_VIEW")) {
stringViewExpandedText = table.getViewExpandedText();
if (coralCatalog != null) {


The javadoc is missing updates for processView.

