Description
Apache Iceberg version
main (development)
Please describe the bug 🐞
Environment
- iceberg-go: v0.3.1
- Catalog: REST catalog (standard Iceberg REST API)
- Go: 1.21+
- OS: Linux/Windows (reproducible across)
Summary
In v0.3.0, table.NewAddSchemaUpdate(*Schema, lastColumnID, initial) allowed callers to ensure the table’s last-assigned-field-id stayed monotonic even when deleting the column that previously had the highest field ID.
In v0.3.1, the API changed to table.NewAddSchemaUpdate(*Schema) (no lastColumnID). When we only delete the highest-ID column(s) and add no new columns, the library appears to derive last-assigned-field-id from the new schema’s max field id, which decreases. The REST catalog then rejects the commit with:
invalid_metadata: The specified metadata is not valid
Deleting a first/middle column works; deleting the tail (max-ID) column fails.
Note: This repro excludes partition/sort references (i.e., we are not deleting a column referenced by the default spec or sort order).
Steps to Reproduce
- Start with a table whose current schema has fields, e.g. a(id=1), b(id=2), c(id=3). No partition/sort references to c.
- Build a new schema that removes c and keeps a/b with the same field IDs (we don’t touch IDs).
- Submit two updates in one commit:
  - AddSchema using table.NewAddSchemaUpdate(newSchema)
  - SetCurrentSchema using table.NewSetCurrentSchemaUpdate(newSchemaID)
  (We obtain newSchemaID by running b := table.MetadataBuilderFromBase(meta); id, _ := b.AddSchema(newSchema).)
- Include concurrency requirements (optional but recommended):
  - AssertTableUUID(meta.UUID())
  - AssertLastAssignedFieldID(oldLastID), where oldLastID is 3 in this example.
- CommitTable(...) → fails with invalid_metadata.
Minimal code sketch (v0.3.1 style):
meta := tbl.Metadata()
oldLast := highestID(tbl.Schema()) // returns 3 in the example
// Build new schema that keeps a(id=1), b(id=2) only (delete c(id=3))
newSchema := buildSchemaKeepAB(tbl.Schema()) // preserves existing IDs
// Precompute new schema-id
b := table.MetadataBuilderFromBase(meta)
newSchemaID, err := b.AddSchema(newSchema)
if err != nil { panic(err) }
// Prepare updates (v0.3.1 API)
add := table.NewAddSchemaUpdate(newSchema) // no lastColumnID parameter anymore
set := table.NewSetCurrentSchemaUpdate(newSchemaID)
reqs := []table.Requirement{
    table.AssertTableUUID(meta.UUID()),
    table.AssertLastAssignedFieldID(int(oldLast)), // oldLast == 3
}
_, _, err = cat.CommitTable(ctx, tbl, reqs, []table.Update{add, set})
// => invalid_metadata when only deleting tail/highest-ID columns
Expected Behavior
- Deleting columns (including the highest-ID column) should be allowed as long as:
  - We do not change existing field IDs of the kept columns.
  - last-assigned-field-id does not decrease (i.e., remains the previous value).
- In v0.3.0, passing lastColumnID=oldLast ensured monotonicity and commits succeeded.
Actual Behavior
- With v0.3.1, NewAddSchemaUpdate cannot accept lastColumnID.
- When we only delete the max-ID column and add no new columns, the commit is rejected with invalid_metadata, apparently because the derived last-assigned-field-id regresses to the new schema’s max ID.
Analysis
- Iceberg requires last-assigned-field-id to be monotonic (never decreases).
- In the “delete-tail-columns only” scenario, the current last-assigned-field-id is the old max (e.g., 3). The new schema’s max becomes smaller (e.g., 2).
  If the client or server infers the counter from the new schema’s max, it violates monotonicity → invalid_metadata.
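To make the regression concrete, here is a minimal, self-contained sketch of the check described above. The field type and both helpers are hypothetical stand-ins written for this illustration; they are not iceberg-go types, and the real catalog-side validation may differ.

```go
package main

import "fmt"

// field is a hypothetical stand-in for a schema column (not an iceberg-go type).
type field struct {
	ID   int
	Name string
}

// maxFieldID returns the highest field ID in the schema, or 0 if it is empty.
func maxFieldID(fields []field) int {
	highest := 0
	for _, f := range fields {
		if f.ID > highest {
			highest = f.ID
		}
	}
	return highest
}

// checkMonotonic is an assumed model of the validation: the proposed counter
// must not be lower than the counter currently recorded in table metadata.
func checkMonotonic(currentLast, proposedLast int) error {
	if proposedLast < currentLast {
		return fmt.Errorf("last-assigned-field-id regressed from %d to %d",
			currentLast, proposedLast)
	}
	return nil
}

func main() {
	currentLast := 3 // counter recorded for a(id=1), b(id=2), c(id=3)

	dropTail := []field{{1, "a"}, {2, "b"}}   // c removed: derived max is 2
	dropMiddle := []field{{1, "a"}, {3, "c"}} // b removed: derived max is still 3

	// Deriving the counter from the new schema's max fails only for the tail case.
	fmt.Println("drop tail:  ", checkMonotonic(currentLast, maxFieldID(dropTail)))
	fmt.Println("drop middle:", checkMonotonic(currentLast, maxFieldID(dropMiddle)))
}
```

Under these assumptions only the tail deletion produces an error, which matches the observed behavior in this report.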
Workarounds
- Add a sentinel (dummy) column in the same update with ID = oldLast + 1 (e.g., __compat_padding_...), nullable, never used. This keeps the new schema’s max ≥ old max.
  Or, more practically, add a real new column in the same change so max ID increases.
- (Less ideal) Maintain a fork that restores the older API (NewAddSchemaUpdate(schema, lastColumnID, initial)) or custom-craft the REST payload to set last-column-id = oldLast.
- Of course still ensure you’re not deleting a field referenced by partition spec or sort order (not the case in this repro).
Proposal
- API / behavior options:
  - Re-introduce a way to set lastColumnID (or an equivalent parameter) on AddSchema in the Go client; or
  - Have the client compute last-assigned-field-id as max(oldLastID, max(newSchema.FieldIDs)) so it never regresses; or
  - Provide a dedicated update or requirement to explicitly set/preserve last-assigned-field-id without requiring a dummy column.
- Docs: Clarify in v0.3.1 migration notes that callers must ensure the counter doesn’t regress when deleting the highest-ID column, and suggest recommended patterns.
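The second API option, computing the counter as max(oldLastID, max(newSchema.FieldIDs)), could look roughly like the sketch below. The function name and the plain []int input are assumptions made for illustration; this is not the current iceberg-go implementation.

```go
package main

import "fmt"

// nextLastAssignedFieldID returns the larger of the previous counter and the
// highest field ID in the new schema, so the result never decreases.
// Hypothetical helper: the name and signature are not part of iceberg-go.
func nextLastAssignedFieldID(oldLast int, newFieldIDs []int) int {
	next := oldLast
	for _, id := range newFieldIDs {
		if id > next {
			next = id
		}
	}
	return next
}

func main() {
	// Tail column c(id=3) deleted: the counter stays at 3 instead of dropping to 2.
	fmt.Println(nextLastAssignedFieldID(3, []int{1, 2})) // 3

	// A new column d(id=4) added: the counter advances to 4.
	fmt.Println(nextLastAssignedFieldID(3, []int{1, 2, 4})) // 4
}
```

With this rule the client never needs a caller-supplied lastColumnID for the delete-only case, since the old counter acts as a floor.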
Additional Context
- The same flow succeeds if we delete a middle/first column (the new schema’s max ID stays the same).
- The same flow succeeds if we add at least one new column (the new schema’s max ID increases).
Happy to provide a tiny repro program if needed. Thanks!