#649 Add ability to select how key is serialized in Kafka Avro source. #652
Conversation
… use binary or string serializer.
…a Avro ingestion jobs.
Walkthrough

This pull request extends Pramen's Kafka integration by adding topic-based query support and implementing configurable key column serialization strategies. QueryBuilder now recognizes "topic" as a valid query configuration option alongside the existing "table" and "db.table" options. KafkaAvroSource gains a new `key.column.serializer` option that controls how the Kafka key is materialized ("binary", "string", "avro", or "none").
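To make the new surface concrete, here is a small sketch of such a configuration parsed with Typesafe Config (which Pramen uses); the key paths are the ones introduced by this PR, while the topic name and wiring are illustrative only:

```scala
import com.typesafe.config.ConfigFactory

object KafkaSourceConfigSketch extends App {
  // Illustrative values; only the key paths come from this PR.
  val conf = ConfigFactory.parseString(
    """
      |input.topic = "my.kafka.topic"      // now accepted alongside input.table / input.db.table
      |key.column.serializer = "string"    // one of "none", "binary" (default), "string", "avro"
      |""".stripMargin)

  println(conf.getString("input.topic"))           // my.kafka.topic
  println(conf.getString("key.column.serializer")) // string
}
```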
Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as Configuration
    participant KafkaAvroSource
    participant Kafka as Kafka Data
    participant DataFrame
    Config->>KafkaAvroSource: key.column.serializer: "binary"<br/>(or "string", "avro", "none")
    Config->>KafkaAvroSource: key.naming.strategy (optional)
    alt key.naming.strategy defined
        KafkaAvroSource->>Kafka: Read Avro key
        Kafka-->>KafkaAvroSource: Deserialized key
        KafkaAvroSource->>DataFrame: Select & rename to kafka_key
    else key.column.serializer: "binary"
        KafkaAvroSource->>Kafka: Read binary key
        Kafka-->>KafkaAvroSource: Raw key bytes
        KafkaAvroSource->>DataFrame: Materialize as kafka_key
    else key.column.serializer: "string"
        KafkaAvroSource->>Kafka: Read key
        Kafka-->>KafkaAvroSource: Key bytes
        KafkaAvroSource->>DataFrame: Cast to StringType, rename to kafka_key
    else key.column.serializer: "none"
        KafkaAvroSource->>DataFrame: Omit key handling
    end
    KafkaAvroSource->>DataFrame: Finalize with kafka & kafka_key columns
    DataFrame-->>KafkaAvroSource: Output DataFrame
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
Pre-merge checks: ❌ 1 failed check (warning), ✅ 2 passed.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (1)
48-59: Update error messages to include TOPIC_KEY.

The error messages on lines 56 and 58 list the allowed query options but don't mention `TOPIC_KEY`. Users encountering these errors won't know they can use `topic` as an alternative. Apply this diff:
```diff
     val parent = if (parentPath.isEmpty) "" else s" at $parentPath"
     if (prefix.isEmpty)
-      throw new IllegalArgumentException(s"No options are specified for the query. Usually, it is one of: '$SQL_KEY', '$PATH_KEY', '$TABLE_KEY', '$DB_TABLE_KEY'$parent.")
+      throw new IllegalArgumentException(s"No options are specified for the query. Usually, it is one of: '$SQL_KEY', '$PATH_KEY', '$TABLE_KEY', '$DB_TABLE_KEY', '$TOPIC_KEY'$parent.")
     else
-      throw new IllegalArgumentException(s"No options are specified for the '$prefix' query. Usually, it is one of: '$p$SQL_KEY', '$p$PATH_KEY', '$p$TABLE_KEY', '$p$DB_TABLE_KEY'$parent.")
+      throw new IllegalArgumentException(s"No options are specified for the '$prefix' query. Usually, it is one of: '$p$SQL_KEY', '$p$PATH_KEY', '$p$TABLE_KEY', '$p$DB_TABLE_KEY', '$p$TOPIC_KEY'$parent.")
```
🧹 Nitpick comments (1)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (1)
56-59: Clarify documentation about automatic serializer selection.

The comment states that `"avro"` is selected automatically when `key.naming.strategy` is defined, but this is misleading. The code doesn't change `keyColumnSerializer` to `"avro"`; instead, when `keyNamingStrategy` is present (lines 222-227), it bypasses the `keyColumnSerializer` switch entirely and always uses Avro deserialization. Consider clarifying the documentation:
```diff
- * # The Kafka key serializer. Can be "none", "binary", "string", "avro".
- * # When "avro", "key.naming.strategy" should be defined at the "schema.registry" section.
- * # Default is "binary", but if "key.naming.strategy" is defined, "avro" is selected automatically.
+ * # The Kafka key serializer when key.naming.strategy is NOT defined. Can be "none", "binary", "string".
+ * # When key.naming.strategy IS defined in schema.registry, Avro deserialization is used automatically.
+ * # Default is "binary".
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- README.md (3 hunks)
- pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (1 hunk)
- pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (1 hunk)
- pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (2)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/utils/ConfigUtils.scala (2)
- ConfigUtils (24-205)
- getOptionString (43-49)

pramen/extras/src/main/scala/za/co/absa/pramen/extras/writer/model/NamingStrategy.scala (1)
- applyNamingStrategyToAbrisConfig (81-103)
pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (1)
pramen/api/src/main/scala/za/co/absa/pramen/api/Query.scala (2)
- Query (28-56)
- Table (37-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Test Spark 3.3.4 on Scala 2.13.16
- GitHub Check: Test Spark 3.4.4 on Scala 2.13.16
- GitHub Check: Test Spark 3.5.5 on Scala 2.13.16
- GitHub Check: Test Spark 3.3.4 on Scala 2.12.20
- GitHub Check: Test Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Test Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Test Spark 3.5.5 on Scala 2.12.20
- GitHub Check: Test Coverage on Scala 2.12.18
🔇 Additional comments (6)
pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (1)
28-46: LGTM! Clean implementation of TOPIC_KEY support.

The addition of `TOPIC_KEY` follows the existing pattern for `TABLE_KEY` and `DB_TABLE_KEY`. The implementation correctly extends both the `hasDbTable` check and the `tableDef` resolution logic.

pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (3)
22-22: LGTM! Good string normalization.

The addition of the `keyColumnSerializer` config with a sensible default of "binary" is appropriate. Applying `toLowerCase.trim` ensures consistent comparison regardless of user input formatting.

Also applies to: 125-128
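As a standalone sketch of that normalization (the helper is inlined here because `ConfigUtils.getOptionString`'s exact signature is assumed, not quoted):

```scala
import com.typesafe.config.{Config, ConfigFactory}

object SerializerNormalizationSketch extends App {
  // Inlined stand-in for ConfigUtils.getOptionString
  def getOptionString(conf: Config, path: String): Option[String] =
    if (conf.hasPath(path)) Some(conf.getString(path)) else None

  val conf = ConfigFactory.parseString("""key.column.serializer = " Binary " """)

  val keyColumnSerializer = getOptionString(conf, "key.column.serializer")
    .map(_.toLowerCase.trim) // " Binary " -> "binary", regardless of user formatting
    .getOrElse("binary")     // default when the option is absent

  println(keyColumnSerializer) // binary
}
```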
238-244: LGTM! Good conflict detection.

The warnings for payload field name conflicts with metadata column names provide helpful feedback to users and clearly indicate when their data fields will be replaced.
248-259: LGTM! Correct conditional DataFrame assembly.

The conditional logic properly handles both scenarios:
- When a key is present: includes the key column in the output
- When no key: excludes the key column entirely

The consistent pattern of dropping conflicting columns before renaming temp columns ensures the final schema is correct.
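For orientation, a hedged Spark sketch of the assembly pattern being praised; the method and parameter names are assumptions, not the actual code from KafkaAvroSource.scala:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper mirroring the pattern: drop conflicting payload fields
// first, then rename the temporary columns to their final names.
def assembleOutput(df: DataFrame,
                   hasKey: Boolean,
                   kafkaColumnName: String,
                   keyColumnName: String,
                   tempKafkaColumnName: String,
                   tempKafkaKeyColumnName: String): DataFrame = {
  if (hasKey) {
    df.drop(kafkaColumnName, keyColumnName)
      .withColumnRenamed(tempKafkaColumnName, kafkaColumnName)
      .withColumnRenamed(tempKafkaKeyColumnName, keyColumnName)
  } else {
    // No key: the key column is excluded from the output entirely.
    df.drop(kafkaColumnName, keyColumnName, tempKafkaKeyColumnName)
      .withColumnRenamed(tempKafkaColumnName, kafkaColumnName)
  }
}
```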
pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (1)
33-34: LGTM! Good test coverage for TOPIC_KEY.

The test cases appropriately verify that:
- `input.db.table` resolves to a `Query.Table`
- `input.topic` resolves to a `Query.Table`

This confirms that both keys are treated equivalently, which aligns with the implementation in `QueryBuilder.scala`.
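A minimal stand-in illustrating the equivalence these tests check; `resolveTable` below is a hypothetical helper, not QueryBuilder's real API:

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical resolution mirroring the behavior the suite verifies:
// 'table', 'db.table', and 'topic' all resolve to the same table query.
def resolveTable(conf: Config, prefix: String): Option[String] =
  Seq("table", "db.table", "topic")
    .map(k => s"$prefix.$k")
    .collectFirst { case path if conf.hasPath(path) => conf.getString(path) }

val byTopic = ConfigFactory.parseString("""input.topic = "orders.v1"""")
assert(resolveTable(byTopic, "input").contains("orders.v1")) // topic behaves like table
```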
QueryBuilder.scala.README.md (1)
992-992: LGTM! Consistent terminology for Kafka topics.Changing
input.tabletoinput.topicfor the Kafka ingestion example improves clarity and aligns with the domain-specific terminology introduced byTOPIC_KEY.
Actionable comments posted: 1
♻️ Duplicate comments (3)
README.md (1)
944-952: Documentation correctly describes key serialization behavior.

The documentation clearly explains:
- `key.column.serializer` is used when `key.naming.strategy` is NOT defined
- Automatic Avro deserialization when `key.naming.strategy` IS defined
- Valid serializer options and defaults

Note: The typo "deined" mentioned in past reviews has been fixed.
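The two documented modes can be sketched as configs (paths from the docs above; the naming-strategy value is illustrative):

```scala
import com.typesafe.config.ConfigFactory

// Mode 1: no key.naming.strategy, so key.column.serializer decides the key format.
val plainKey = ConfigFactory.parseString(
  """key.column.serializer = "string"  // or "none", or "binary" (the default)""")

// Mode 2: key.naming.strategy under schema.registry is defined, so Avro
// deserialization is used automatically and key.column.serializer is not consulted.
val avroKey = ConfigFactory.parseString(
  """schema.registry.key.naming.strategy = "topic.name"""")

println(plainKey.getString("key.column.serializer"))              // string
println(avroKey.getString("schema.registry.key.naming.strategy")) // topic.name
```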
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (2)
52-60: Documentation comments are clear and correct.

The key serialization behavior is well documented. The typo "deined" mentioned in past reviews has been corrected to "defined".
222-236: Key deserialization logic is well structured.

The branching logic correctly handles:
- Avro deserialization when `keyNamingStrategy` is defined
- Binary, string, and none serializers when it's not
- Clear error for "avro" serializer without naming strategy
- Helpful error message for unknown serializers

The issue mentioned in past reviews about "long" in the error message has been fixed; it's no longer listed as a valid option.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- README.md (3 hunks)
- pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (2 hunks)
- pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (2)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/utils/ConfigUtils.scala (2)
- ConfigUtils (24-205)
- getOptionString (43-49)

pramen/extras/src/main/scala/za/co/absa/pramen/extras/writer/model/NamingStrategy.scala (1)
- applyNamingStrategyToAbrisConfig (81-103)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Test Spark 3.5.5 on Scala 2.13.16
- GitHub Check: Test Spark 3.5.5 on Scala 2.12.20
- GitHub Check: Test Spark 3.3.4 on Scala 2.13.16
- GitHub Check: Test Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Test Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Test Spark 3.3.4 on Scala 2.12.20
- GitHub Check: Test Spark 3.4.4 on Scala 2.13.16
- GitHub Check: Test Coverage on Scala 2.12.18
🔇 Additional comments (4)
pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (1)
28-58: LGTM! Clean extension to support topic-based queries.

The addition of `TOPIC_KEY` and its consistent integration throughout the query resolution logic is well done. This provides better semantic clarity for Kafka and other message-queue sources while maintaining backward compatibility with existing table-based queries.

README.md (1)
993-993: Good update to use topic-based terminology.

Changing from `input.table` to `input.topic` in the Kafka example improves clarity and aligns with the new `TOPIC_KEY` support in QueryBuilder.

pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (2)
248-259: Final DataFrame assembly logic is correct.

The conditional assembly properly:
- Selects appropriate columns based on `hasKey`
- Drops conflicting field names from the payload
- Renames temporary columns to final names
- Maintains sensible column ordering

However, this logic depends on `hasKey` being calculated correctly (see comment on line 220).
271-271: LGTM! Public constant follows existing patterns.

Adding `KEY_COLUMN_SERIALIZER_KEY` makes the configuration key accessible externally and is consistent with other constants in the companion object.
Force-pushed from 1d9fbf0 to 51a7acf.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (1)
238-259: Update warning message to accurately reflect behavior when hasKey is false.

Lines 242-244 warn that a payload field with the same name as `keyColumnName` "will be replaced", but when `hasKey` is false (lines 255-259), the field is actually dropped, not replaced with a Kafka key column. Consider updating the warning to reflect the actual behavior:
```diff
 if (payloadFields.contains(keyColumnName)) {
-  log.warn(s"Payload field '$keyColumnName' conflicts with Kafka key column name and will be replaced.")
+  log.warn(s"Payload field '$keyColumnName' conflicts with reserved Kafka key column name and will be removed.")
 }
```

Alternatively, only drop `keyColumnName` when `hasKey` is true to preserve payload fields when no key is being added.
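A sketch of that alternative; `dropKeyConflict` is a hypothetical helper illustrating the conditional drop:

```scala
import org.apache.spark.sql.DataFrame

// Only drop the conflicting payload field when a key column will actually be added.
def dropKeyConflict(df: DataFrame, hasKey: Boolean, keyColumnName: String): DataFrame =
  if (hasKey && df.columns.contains(keyColumnName)) df.drop(keyColumnName) else df
```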
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- README.md (3 hunks)
- pramen-py/pyproject.toml (1 hunk)
- pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (2 hunks)
- pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (4 hunks)
- pramen/core/src/test/scala/za/co/absa/pramen/core/pipeline/SourceTableParserSuite.scala (1 hunk)
- pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (1)
pramen/api/src/main/scala/za/co/absa/pramen/api/Query.scala (2)
- Query (28-56)
- Table (37-41)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (2)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/utils/ConfigUtils.scala (2)
- ConfigUtils (24-205)
- getOptionString (43-49)

pramen/extras/src/main/scala/za/co/absa/pramen/extras/writer/model/NamingStrategy.scala (1)
- applyNamingStrategyToAbrisConfig (81-103)
🪛 GitHub Actions: python_ci
pramen-py/pyproject.toml
[warning] 1-1: Warning: poetry.lock is not consistent with pyproject.toml. You may be getting improper dependencies. Run `poetry lock [--no-update]` to fix it.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Test Spark 3.5.5 on Scala 2.12.20
- GitHub Check: Test Spark 3.5.5 on Scala 2.13.16
- GitHub Check: Test Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Test Spark 3.4.4 on Scala 2.13.16
- GitHub Check: Test Spark 3.3.4 on Scala 2.12.20
- GitHub Check: Test Spark 3.3.4 on Scala 2.13.16
- GitHub Check: Test Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Test Coverage on Scala 2.12.18
🔇 Additional comments (3)
pramen/core/src/test/scala/za/co/absa/pramen/core/pipeline/SourceTableParserSuite.scala (1)
122-122: LGTM!

The error message correctly reflects the new `input.topic` option introduced in the QueryBuilder changes.

pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (1)
28-58: LGTM!

The TOPIC_KEY addition is implemented consistently throughout:
- Properly defined as a public constant
- Integrated into `hasDbTable` checks
- Included in `tableDef` resolution
- Reflected in all error messages

The implementation correctly treats `topic` as an alternative name for `table`, which aligns well with Kafka source terminology.

pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (1)
33-78: LGTM!

Excellent test coverage for the new TOPIC_KEY functionality:
- Validates that `input.db.table` continues to work (backward compatibility)
- Validates that `input.topic` produces the expected `Query.Table` result
- Error message expectations correctly updated to include the 'topic' option
pramen-py/pyproject.toml (outdated)

```diff
   {version="^22.1.0", python = ">=3.7,<4.0"},
 ]
-pyspark = "3.1.3"
+pyspark = "3.3.4"
```
Update poetry.lock to match pyproject.toml.
The pipeline is reporting that poetry.lock is inconsistent with this change. Run `poetry lock --no-update` to regenerate the lock file.
🤖 Prompt for AI Agents

```text
In pramen-py/pyproject.toml around line 133, the pyproject declares pyspark =
"3.3.4" but poetry.lock is now out of sync; run `poetry lock --no-update` in the
project root to regenerate the lockfile without changing other dependency
versions, commit the updated poetry.lock, and push the change so the pipeline
and repository are consistent.
```
Force-pushed from 51a7acf to c984e75.
Force-pushed from c984e75 to 549e082.
Actionable comments posted: 1
🧹 Nitpick comments (7)
pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (2)
35-35: Rename to reflect topic support.

Variable now also covers `topic`. Consider renaming `hasDbTable` → `hasTableOrTopic` for clarity.

```diff
-    val hasDbTable = conf.hasPath(s"$p$TABLE_KEY") || conf.hasPath(s"$p$DB_TABLE_KEY") || conf.hasPath(s"$p$TOPIC_KEY")
+    val hasTableOrTopic = conf.hasPath(s"$p$TABLE_KEY") || conf.hasPath(s"$p$DB_TABLE_KEY") || conf.hasPath(s"$p$TOPIC_KEY")
@@
-    (hasSql, hasPath, hasDbTable, hesSomething) match {
+    (hasSql, hasPath, hasTableOrTopic, hesSomething) match {
```
56-59: Improve error when multiple query options are set.

Current message targets "no options specified". Add a branch to detect multiple specified options and throw a targeted error (e.g., "Only one of: sql, path, table, db.table, topic is allowed."). Note the existing case arms must also gain a fifth tuple element for this to compile:

```diff
@@
-    (hasSql, hasPath, hasDbTable, hesSomething) match {
-      case (true, false, false, _) => Query.Sql(conf.getString(s"$p$SQL_KEY"))
-      case (false, true, false, _) => Query.Path(conf.getString(s"$p$PATH_KEY"))
-      case (false, false, true, _) => Query.Table(tableDef.get)
-      case (false, false, false, true) => Query.Custom(ConfigUtils.getExtraOptions(conf, prefix))
+    val specified = Seq(
+      if (hasSql) Some(SQL_KEY) else None,
+      if (hasPath) Some(PATH_KEY) else None,
+      if (hasTableOrTopic) Some(s"$TABLE_KEY/$DB_TABLE_KEY/$TOPIC_KEY") else None
+    ).flatten
+    (hasSql, hasPath, hasTableOrTopic, hesSomething, specified.size) match {
+      case (_, _, _, _, n) if n > 1 =>
+        val parent = if (parentPath.isEmpty) "" else s" at $parentPath"
+        if (prefix.isEmpty)
+          throw new IllegalArgumentException(s"Multiple options specified for the query$parent. Only one of: '$SQL_KEY', '$PATH_KEY', '$TABLE_KEY', '$DB_TABLE_KEY', '$TOPIC_KEY' is allowed.")
+        else
+          throw new IllegalArgumentException(s"Multiple options specified for the '$prefix' query$parent. Only one of: '$p$SQL_KEY', '$p$PATH_KEY', '$p$TABLE_KEY', '$p$DB_TABLE_KEY', '$p$TOPIC_KEY' is allowed.")
+      case (true, false, false, _, _) => Query.Sql(conf.getString(s"$p$SQL_KEY"))
+      case (false, true, false, _, _) => Query.Path(conf.getString(s"$p$PATH_KEY"))
+      case (false, false, true, _, _) => Query.Table(tableDef.get)
+      case (false, false, false, true, _) => Query.Custom(ConfigUtils.getExtraOptions(conf, prefix))
       case _ =>
```

pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (1)
78-79: Consider adding multi-option conflict tests.

Add cases where two options are set (e.g., `input.table` + `input.topic`) and assert an "only one allowed" error if you implement that branch in QueryBuilder.
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (3)
229-235: Clarify string decoding or support charset.

Casting binary → string assumes UTF-8. Either:
- Document that "string" uses UTF-8; or
- Allow `string:<charset>` (e.g., `string:utf-16`) and use a `from_csv`-style decode or a small UDF.

Example (documentation-only change preferred in this PR):
```diff
- * # The Kafka key serializer when 'key.naming.strategy' is NOT defined. Can be "none", "binary", "string".
+ * # The Kafka key serializer when 'key.naming.strategy' is NOT defined. Can be "none", "binary", "string" (UTF-8).
```
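If the `string:<charset>` idea were pursued instead, a small UDF sketch could look like this (not part of the PR; the helper name is hypothetical):

```scala
import java.nio.charset.Charset
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf

// Decode key bytes according to a "string" or "string:<charset>" serializer value.
def decodeKey(keyCol: Column, serializer: String): Column = serializer.split(":", 2) match {
  case Array("string") => keyCol.cast("string") // Spark's binary -> string cast assumes UTF-8
  case Array("string", charset) =>
    val decode = udf((bytes: Array[Byte]) =>
      if (bytes == null) null else new String(bytes, Charset.forName(charset)))
    decode(keyCol)
  case _ => keyCol // leave binary/none handling to the existing logic
}
```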
125-128: Default and temp column names.
- Good default for `key.column.serializer` = "binary".
- Optionally make temp names even less collision-prone (e.g., prefix with `__pramen_tmp_`).

```diff
-  private val tempKafkaColumnName = "tmp_pramen_kafka"
-  private val tempKafkaKeyColumnName = "tmp_pramen_kafka_key"
+  private val tempKafkaColumnName = "__pramen_tmp_kafka"
+  private val tempKafkaKeyColumnName = "__pramen_tmp_kafka_key"
```
56-60: Doc tweak: make UTF-8 explicit for "string".

Minor clarity: note that "string" decodes key bytes as UTF-8 (given the cast).
README.md (1)
948-951: State encoding for "string" serializer.

Add "(UTF-8)" to align with Spark binary→string cast behavior.
```diff
- # The Kafka key serializer when key.naming.strategy is NOT defined. Can be "none", "binary", "string".
+ # The Kafka key serializer when key.naming.strategy is NOT defined. Can be "none", "binary", "string" (UTF-8).
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- README.md (3 hunks)
- pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (2 hunks)
- pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (4 hunks)
- pramen/core/src/test/scala/za/co/absa/pramen/core/pipeline/SourceTableParserSuite.scala (1 hunk)
- pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pramen/core/src/test/scala/za/co/absa/pramen/core/pipeline/SourceTableParserSuite.scala
🧰 Additional context used
🧬 Code graph analysis (2)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (2)
pramen/extras/src/main/scala/za/co/absa/pramen/extras/utils/ConfigUtils.scala (2)
- ConfigUtils (24-205)
- getOptionString (43-49)

pramen/extras/src/main/scala/za/co/absa/pramen/extras/writer/model/NamingStrategy.scala (1)
- applyNamingStrategyToAbrisConfig (81-103)
pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (1)
pramen/api/src/main/scala/za/co/absa/pramen/api/Query.scala (2)
- Query (28-56)
- Table (37-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Test Spark 3.3.4 on Scala 2.12.20
- GitHub Check: Test Spark 3.5.5 on Scala 2.12.20
- GitHub Check: Test Spark 3.4.4 on Scala 2.13.16
- GitHub Check: Test Spark 3.5.5 on Scala 2.13.16
- GitHub Check: Test Spark 3.3.4 on Scala 2.13.16
- GitHub Check: Test Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Test Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Test Coverage on Scala 2.12.18
🔇 Additional comments (7)
pramen/core/src/main/scala/za/co/absa/pramen/core/model/QueryBuilder.scala (2)
28-28: Good addition: explicit TOPIC_KEY constant.

Clear, consistent with existing keys; enables topic-based configs cleanly.
42-43: Topic resolution path looks right.

Topic falls back after table/db.table; consistent precedence.
pramen/core/src/test/scala/za/co/absa/pramen/core/model/QueryBuilderSuite.scala (3)
33-35: Tests for db.table and topic look good.

Covers both aliases; maintains expected Query.Table resolution.
58-58: Error expectation updated correctly.

Message now includes 'topic'.
68-68: Prefixed error expectation updated correctly.

Includes 'input.topic' as intended.
pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala (1)
220-220: hasKey fix is correct.

Including `keyNamingStrategy.isDefined` prevents losing the deserialized key when serializer is `none`. Nice.
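For reference, the corrected predicate as described (a sketch with standalone values; the exact expression in the source may differ):

```scala
val keyColumnSerializer: String = "none"
val keyNamingStrategy: Option[String] = Some("topic.name")

// The key is kept when either a non-"none" serializer or a naming strategy is set.
val hasKey = keyColumnSerializer != "none" || keyNamingStrategy.isDefined // true here
```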
README.md (1)

992-992: Docs match code: input.topic example.

Good update to `input.topic`; consistent with QueryBuilder changes.
```diff
 val TOPIC_NAME_KEY = "topic.name"
 val CUSTOM_KAFKA_COLUMN_KEY = "custom.kafka.column"
 val KEY_COLUMN_KEY = "key.column.name"
+val KEY_COLUMN_SERIALIZER_KEY = "key.column.serializer"
```
🧩 Analysis chain
Public constant looks good; ensure config default is discoverable.
Confirm the `key.column.serializer` default is present in reference.conf for visibility.
🏁 Script executed:

```bash
#!/bin/bash
# Verify default presence and examples for key.column.serializer
fd -a reference.conf
rg -nC2 'key\.column\.serializer' -g 'pramen/**/reference.conf'
```

Length of output: 196
I need to inspect the reference.conf files directly to verify if the configuration default is present.
🏁 Script executed:

```bash
#!/bin/bash
# Search for key.column.serializer in all found reference.conf files
echo "=== Searching in extras reference.conf ==="
rg -nC2 'key\.column\.serializer' /home/jailuser/git/pramen/extras/src/main/resources/reference.conf
echo -e "\n=== Searching in core reference.conf ==="
rg -nC2 'key\.column\.serializer' /home/jailuser/git/pramen/core/src/main/resources/reference.conf
echo -e "\n=== Checking if extras reference.conf exists and contains kafka config ==="
cat /home/jailuser/git/pramen/extras/src/main/resources/reference.conf 2>/dev/null | head -50
```

Length of output: 1132
Let me verify more thoroughly by checking the full contents and looking for any Kafka configuration:
🏁 Script executed:

```bash
#!/bin/bash
# Check full extras reference.conf for Kafka sections
echo "=== Full extras reference.conf ==="
cat /home/jailuser/git/pramen/extras/src/main/resources/reference.conf
echo -e "\n=== Search for any 'kafka' in reference.conf files ==="
rg -in 'kafka' /home/jailuser/git/pramen/extras/src/main/resources/reference.conf /home/jailuser/git/pramen/core/src/main/resources/reference.conf
echo -e "\n=== Check KafkaAvroSource for default value handling ==="
rg -nC3 'key\.column\.serializer\|KEY_COLUMN_SERIALIZER' /home/jailuser/git/pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala
```

Length of output: 1120
Let me verify the file path and view the KafkaAvroSource code directly:

```bash
#!/bin/bash
# Find the correct KafkaAvroSource file
echo "=== Finding KafkaAvroSource.scala ==="
find /home/jailuser/git/pramen -name "KafkaAvroSource.scala" -type f
echo -e "\n=== View context around line 271 ==="
fd -e scala "KafkaAvroSource" /home/jailuser/git/pramen --exec cat {} \; | sed -n '265,280p'
```
🏁 Script executed:

```bash
#!/bin/bash
# Find KafkaAvroSource.scala and view context around line 271
find /home/jailuser/git/pramen -name "KafkaAvroSource.scala" -type f
echo -e "\n=== View lines 265-280 of KafkaAvroSource.scala ==="
find /home/jailuser/git/pramen -name "KafkaAvroSource.scala" -type f -exec sed -n '265,280p' {} +
echo -e "\n=== View surrounding context for KEY_COLUMN_SERIALIZER_KEY ==="
find /home/jailuser/git/pramen -name "KafkaAvroSource.scala" -type f -exec grep -nC5 "KEY_COLUMN_SERIALIZER" {} +
```

Length of output: 2062
Let me check if there are any Kafka configuration examples elsewhere in the codebase:
🏁 Script executed:

```bash
#!/bin/bash
# Check if Kafka config is documented elsewhere
echo "=== Search for Kafka config examples in test resources or documentation ==="
find /home/jailuser/git/pramen -type f \( -name "*.conf" -o -name "*.md" -o -name "*.txt" \) -exec grep -l "key\.column" {} +
echo -e "\n=== Check test resources for Kafka configuration examples ==="
find /home/jailuser/git/pramen -path "*/test/resources/*" -name "*.conf" -type f -exec cat {} \;
```

Length of output: 56619
Add Kafka configuration defaults to reference.conf for discoverability.

Verification confirms `key.column.serializer` is absent from reference.conf. While the code provides a hardcoded default of "binary", this configuration is not documented or discoverable. Add a Kafka source configuration section to pramen/extras/src/main/resources/reference.conf listing available options (including `key.column.serializer`, `key.column.name`, `custom.kafka.column`, and `topic.name`) with their defaults for user visibility.
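A Scala sketch of how such reference.conf defaults would resolve through Typesafe Config's withFallback; the default values below are assumptions, not the project's actual choices:

```scala
import com.typesafe.config.ConfigFactory

object ReferenceDefaultsSketch extends App {
  // Stand-in for the proposed reference.conf section (values assumed).
  val referenceDefaults = ConfigFactory.parseString(
    """
      |key.column.serializer = "binary"
      |key.column.name = "kafka_key"
      |""".stripMargin)

  val userConf  = ConfigFactory.parseString("""key.column.serializer = "string"""")
  val effective = userConf.withFallback(referenceDefaults)

  println(effective.getString("key.column.serializer")) // string (user override)
  println(effective.getString("key.column.name"))       // kafka_key (reference default)
}
```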
🤖 Prompt for AI Agents

```text
In pramen/extras/src/main/scala/za/co/absa/pramen/extras/source/KafkaAvroSource.scala
around line 271, the constant KEY_COLUMN_SERIALIZER_KEY
("key.column.serializer") is used but not documented in reference.conf; update
pramen/extras/src/main/resources/reference.conf to add a kafka source
configuration section (e.g., pramen.extras.source.kafka or similar existing
namespace) that lists the relevant keys and their defaults for discoverability:
key.column.serializer = "binary", key.column.name = "key", custom.kafka.column =
"" (or appropriate default), and topic.name = "" (or appropriate default), plus
brief comments describing accepted options where useful; ensure keys match the
exact names used in code and place the section in alphabetical/consistent order
with existing configs.
```
Summary by CodeRabbit

Release Notes

New Features
- New `key.column.serializer` configuration option for Kafka key handling (supports binary, string, avro, or none)
- Kafka sources can now be configured with `input.topic` in addition to `input.table` and `input.db.table`

Documentation
- README updated with the key serialization options and the topic-based Kafka ingestion example