Conversation
Pull request overview
This PR implements high-availability (HA) clustering capabilities for NiFi Registry with database-backed coordination, along with a maintenance mode feature for operational management.
Changes:
- Adds maintenance mode functionality via Spring Boot Actuator endpoints to temporarily block write operations for safe backups
- Implements active-active clustering with database-backed leader election, cache synchronization, and durable event delivery
- Refactors EventService to support both standalone and cluster-aware modes with exactly-once event delivery guarantees
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| MaintenanceModeManager.java | Simple state manager for maintenance mode using AtomicBoolean |
| MaintenanceModeFilter.java | Servlet filter that blocks write HTTP methods during maintenance mode |
| MaintenanceModeHealthIndicator.java | Exposes maintenance mode state via /actuator/health endpoint |
| MaintenanceModeEndpoint.java | Actuator endpoint for toggling maintenance mode via POST requests |
| NiFiRegistrySecurityConfig.java | Integrates maintenance mode filter into security filter chain |
| LeaderElectionManager.java | Interface for cluster leader election |
| DatabaseLeaderElectionManager.java | TTL-based leader election using CLUSTER_LEADER table |
| CacheRefreshPoller.java | Polls CACHE_VERSION table to synchronize authorization caches across nodes |
| DatabaseUserGroupProvider.java | Enhanced to bump cache version on mutations and register with poller |
| DatabaseAccessPolicyProvider.java | Enhanced to bump cache version on mutations and register with poller |
| EventService.java | Refactored from concrete class to interface |
| StandardEventService.java | Standalone in-memory event delivery implementation (renamed from EventService) |
| ClusterAwareEventService.java | Cluster-aware event delivery with database persistence and leader-only delivery |
| EventServiceConfiguration.java | Factory that selects EventService implementation based on cluster mode |
| ExtensionManager.java | Added optional LeaderElectionManager injection for logging |
| V9__AddClusterTables.sql | Creates CACHE_VERSION and CLUSTER_NODE tables |
| V10__AddLeaderAndEventTables.sql | Creates CLUSTER_LEADER and REGISTRY_EVENT tables |
| NiFiRegistryProperties.java | Added cluster configuration property getters |
| nifi-registry.properties | Added cluster configuration property template |
| administration-guide.adoc | Comprehensive documentation for maintenance mode, clustering, and leader election |
| CLAUDE.md | Development guidance for AI coding assistants |
| TestEventService.java | Updated to use StandardEventService |
```java
import org.springframework.boot.health.contributor.Health;
import org.springframework.boot.health.contributor.HealthIndicator;
```
The Spring Boot Actuator import paths appear to be incorrect. The correct package for Spring Boot 2.x and 3.x is `org.springframework.boot.actuate.health`, not `org.springframework.boot.health.contributor`. The imports should be:

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
```

As written, this will cause a compilation failure on those Spring Boot versions.
Suggested change:

```diff
-import org.springframework.boot.health.contributor.Health;
-import org.springframework.boot.health.contributor.HealthIndicator;
+import org.springframework.boot.actuate.health.Health;
+import org.springframework.boot.actuate.health.HealthIndicator;
```
```java
@Override
public void destroy() {
    LOGGER.info("Shutting down CacheRefreshPoller.");
    scheduler.shutdownNow();
}
```
The destroy method always calls `scheduler.shutdownNow()`, even when cluster mode is disabled and the scheduler was never started. While this is likely harmless (the scheduler is initialized in the constructor), it would be cleaner to track whether the scheduler was actually started and only shut it down if it's running. Consider adding a flag like `private volatile boolean started = false;` and checking it in the destroy method.
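A minimal sketch of the suggested guard (the class name here is a stand-in, and `scheduler`/`started` mirror the snippet above rather than the PR's actual fields):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: only shut the scheduler down if it was actually started.
class CacheRefreshPollerSketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile boolean started = false;

    void start(Runnable refreshTask, long intervalMs) {
        scheduler.scheduleAtFixedRate(refreshTask, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
        started = true;
    }

    void destroy() {
        if (started) {               // skip shutdown when cluster mode never started the poller
            scheduler.shutdownNow();
        }
    }

    boolean isStarted() {
        return started;
    }
}
```

That said, `shutdownNow()` on an idle executor is a no-op in practice, so the flag mostly buys clearer logs and intent rather than different behavior.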
```java
private final Map<String, ExtensionClassLoader> classLoaderMap = new HashMap<>();
private final AtomicBoolean loaded = new AtomicBoolean(false);

// Optional: injected only when cluster mode is enabled; used for logging.
```
The comment states "Optional: injected only when cluster mode is enabled", but DatabaseLeaderElectionManager is always created as a Spring bean (it is marked `@Component`). While `@Autowired(required = false)` makes the injection optional at the Spring level, the leaderElectionManager will always be injected in practice since the bean always exists. Update the comment to reflect reality: "Optional injection; used for logging cluster leadership status when available."
Suggested change:

```diff
-// Optional: injected only when cluster mode is enabled; used for logging.
+// Optional injection; used for logging cluster leadership status when available.
```
```asciidoc
.Re-enabling writes after backup
----
curl -X POST https://localhost:18443/nifi-registry/actuator/maintenance \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
----
```
The documentation describes how to disable maintenance mode via a POST request (line 1247), but this will fail due to the deadlock issue identified earlier: the MaintenanceModeFilter blocks all POST requests when maintenance mode is enabled, including the POST to /actuator/maintenance itself. The documentation should include a workaround, such as: "Note: If maintenance mode was accidentally enabled and must be disabled, restart the NiFi Registry process, as maintenance mode is not persisted across restarts."
```sql
-- Cluster node registration table: tracks live nodes for operational visibility.
-- Updated by a heartbeat thread on each node; no quorum or eviction is performed in this phase.

CREATE TABLE CLUSTER_NODE (
    NODE_ID VARCHAR(100) NOT NULL,
    HOSTNAME VARCHAR(255) NOT NULL,
    LAST_HEARTBEAT TIMESTAMP NOT NULL,
    CONSTRAINT PK__CLUSTER_NODE PRIMARY KEY (NODE_ID)
);
```
The CLUSTER_NODE table is created in this migration but there's no code in the PR that actually writes to it. According to the comment, it should be "Updated by a heartbeat thread on each node" but no such heartbeat logic exists. This appears to be dead schema that was planned but not implemented, or code is missing from this PR. Consider either removing this table definition or implementing the heartbeat logic that populates it.
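If the heartbeat path is implemented rather than removed, its shape could look roughly like this (a sketch only; `HeartbeatSketch` and the `upsert` callback are stand-ins for the real code, which would run an upsert against CLUSTER_NODE):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Sketch: periodically report this node's liveness to a shared store.
class HeartbeatSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "cluster-node-heartbeat");
                t.setDaemon(true);
                return t;
            });

    // upsert receives (nodeId, hostname); in real code it would execute an
    // INSERT ... ON CONFLICT UPDATE on CLUSTER_NODE with CURRENT_TIMESTAMP.
    void start(String nodeId, String hostname, BiConsumer<String, String> upsert, long intervalMs) {
        scheduler.scheduleAtFixedRate(() -> upsert.accept(nodeId, hostname),
                0, intervalMs, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```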
Suggested change (removal):

```diff
--- Cluster node registration table: tracks live nodes for operational visibility.
--- Updated by a heartbeat thread on each node; no quorum or eviction is performed in this phase.
-CREATE TABLE CLUSTER_NODE (
-    NODE_ID VARCHAR(100) NOT NULL,
-    HOSTNAME VARCHAR(255) NOT NULL,
-    LAST_HEARTBEAT TIMESTAMP NOT NULL,
-    CONSTRAINT PK__CLUSTER_NODE PRIMARY KEY (NODE_ID)
-);
```
```java
if (maintenanceModeManager.isEnabled() && isWriteMethod(httpRequest.getMethod())) {
    sendMaintenanceModeResponse(httpResponse);
    return;
}
```
The MaintenanceModeFilter blocks ALL write operations (POST, PUT, PATCH, DELETE) when maintenance mode is enabled, including the POST request to the /actuator/maintenance endpoint that would disable maintenance mode. This creates a deadlock: once enabled, maintenance mode cannot be disabled via the API. Consider one of the following:
- Exempting the /actuator/maintenance endpoint from the filter
- Allowing GET with a query parameter to toggle state
- Adding an exclude path pattern to the filter for critical administrative endpoints
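The first option could be sketched as a pure decision helper ahead of the block above (`isWriteMethod` semantics and the exempt path are assumptions mirroring the snippet, not the PR's actual code):

```java
import java.util.Set;

// Sketch: decide whether a request should be blocked during maintenance mode,
// exempting the actuator endpoint that toggles maintenance mode itself.
class MaintenanceBlockDecision {
    private static final Set<String> WRITE_METHODS = Set.of("POST", "PUT", "PATCH", "DELETE");
    private static final String MAINTENANCE_ENDPOINT = "/actuator/maintenance";

    static boolean shouldBlock(boolean maintenanceEnabled, String method, String requestPath) {
        if (!maintenanceEnabled || !WRITE_METHODS.contains(method)) {
            return false;
        }
        // Let the toggle endpoint through so maintenance mode can be disabled via the API.
        return !requestPath.endsWith(MAINTENANCE_ENDPOINT);
    }
}
```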
```java
// Provider instances are not Spring beans; they self-register here on initialisation.
private static volatile DatabaseAccessPolicyProvider accessPolicyProvider;
private static volatile DatabaseUserGroupProvider userGroupProvider;
```
The static volatile fields for provider instances (accessPolicyProvider, userGroupProvider) could cause issues in testing scenarios or if multiple Spring contexts are created. While volatile ensures visibility, static state is shared across all instances of the class. Consider making these instance fields or using a different mechanism to allow the providers to register themselves with the poller, such as a callback passed during provider initialization.
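The callback alternative could look roughly like this (a sketch with a hypothetical `RefreshListener` type; the real provider and poller classes differ):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch: an instance-scoped listener registry instead of static volatile fields.
class CacheRefreshRegistry {
    interface RefreshListener {
        void refresh(long newVersion);
    }

    private final List<RefreshListener> listeners = new CopyOnWriteArrayList<>();

    // Providers call this during their own initialization instead of
    // assigning themselves to a static field on the poller.
    void register(RefreshListener listener) {
        listeners.add(listener);
    }

    // The poller calls this when it observes a bumped CACHE_VERSION.
    void notifyVersion(long newVersion) {
        for (RefreshListener l : listeners) {
            l.refresh(newVersion);
        }
    }
}
```

Because the registry is an instance, each Spring context (or test) gets its own listener set, avoiding cross-context leakage.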
```java
@Bean
@Autowired
public EventService eventService(
        final List<EventHookProvider> eventHookProviders,
        final NiFiRegistryProperties properties,
        final DataSource dataSource,
        final LeaderElectionManager leaderElectionManager) {

    if (properties.isClusterEnabled()) {
        LOGGER.info("Cluster mode is enabled; creating ClusterAwareEventService.");
        return new ClusterAwareEventService(eventHookProviders, dataSource, leaderElectionManager);
```
The EventServiceConfiguration.eventService() method requires a LeaderElectionManager parameter, but in standalone mode (cluster.enabled=false) the DatabaseLeaderElectionManager bean still exists and will be injected. While this works (the bean self-disables when cluster mode is off), it creates an unnecessary dependency. Consider making the LeaderElectionManager parameter optional with @Autowired(required = false) or using @ConditionalOnProperty to only create the ClusterAwareEventService when clustering is enabled.
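A plain-Java sketch of the optional-dependency shape (all names here are stand-ins, not the PR's classes): when the leader election manager is nullable, standalone mode never touches it.

```java
// Sketch: select an implementation from a cluster flag, tolerating a
// missing (null) leader election manager in standalone mode.
class EventServiceSelectionSketch {
    interface EventService { String mode(); }

    static EventService select(boolean clusterEnabled, Object leaderElectionManager) {
        if (clusterEnabled) {
            if (leaderElectionManager == null) {
                throw new IllegalStateException("Cluster mode requires a LeaderElectionManager");
            }
            return () -> "cluster";
        }
        // Standalone mode ignores leaderElectionManager entirely.
        return () -> "standalone";
    }
}
```

With Spring Boot, the same selection is often expressed declaratively instead, with `@ConditionalOnProperty` on separate `@Bean` methods so the cluster-only bean graph is never constructed in standalone mode.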
Force-pushed from 1b628c0 to d651688.
Implements leader election and distributed event hooks for NiFi Registry cluster mode (nifi.registry.cluster.enabled=true).

- V10__AddLeaderAndEventTables.sql: adds CLUSTER_LEADER (TTL-based distributed lock) and REGISTRY_EVENT (durable event log) tables.
- LeaderElectionManager interface + DatabaseLeaderElectionManager: TTL lease (30 s) with 10 s heartbeat. Algorithm: renew own lease → claim expired lease → INSERT first-ever row → remain follower.
- EventService converted to interface; existing in-memory logic moved to StandardEventService (standalone mode). ClusterAwareEventService (cluster mode) persists events to REGISTRY_EVENT and delivers them exactly once from the leader node via a 5 s polling loop with 7-day retention cleanup.
- EventServiceConfiguration (`@Configuration`) selects the right EventService implementation at startup based on cluster mode.
- ExtensionManager: injects optional LeaderElectionManager and logs which node performed extension discovery.
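The lease algorithm described in the commit message could translate to SQL along these lines (a sketch only; the PR's actual column names and statements may differ, and `LEADER_ID`, `NODE_ID`, and `LEASE_EXPIRES` are assumed names):

```sql
-- Step 1: try to renew our own lease (succeeds only if we already hold it).
UPDATE CLUSTER_LEADER
   SET LEASE_EXPIRES = CURRENT_TIMESTAMP + INTERVAL '30' SECOND
 WHERE LEADER_ID = 1 AND NODE_ID = 'node-1';

-- Step 2: if step 1 updated 0 rows, try to claim an expired lease.
UPDATE CLUSTER_LEADER
   SET NODE_ID = 'node-1',
       LEASE_EXPIRES = CURRENT_TIMESTAMP + INTERVAL '30' SECOND
 WHERE LEADER_ID = 1 AND LEASE_EXPIRES < CURRENT_TIMESTAMP;

-- Step 3: if the row has never existed, race to INSERT it; exactly one
-- node wins thanks to the primary key on LEADER_ID.
INSERT INTO CLUSTER_LEADER (LEADER_ID, NODE_ID, LEASE_EXPIRES)
VALUES (1, 'node-1', CURRENT_TIMESTAMP + INTERVAL '30' SECOND);

-- Step 4: otherwise remain a follower and retry on the next 10 s heartbeat.
```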
…ain/java/org/apache/nifi/registry/event/ClusterAwareEventService.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Force-pushed from d651688 to 0782e1a.
Force-pushed from 3523e71 to f7c3872.
HA Configuration — Three NiFi Registry Nodes
There are four files to change on each node (the cluster node identifier is the only per-node difference; the database connection settings are identical everywhere).
Overview
```
         [Load Balancer]
         /      |      \
    Node-1   Node-2   Node-3
         \      |      /
         [PostgreSQL]
```
All three nodes point at the same DB. The DB holds the flow snapshots, users/groups/policies, the CLUSTER_LEADER lease, the REGISTRY_EVENT log, and the CACHE_VERSION counters.
File 1 — conf/nifi-registry.properties
This is the main config file. The DB section and the cluster section both live here. Everything below is the same on all three nodes except nifi.registry.cluster.node.identifier.
```properties
# ── Database (same on all nodes) ──────────────────────────────────────────────
nifi.registry.db.url=jdbc:postgresql://pg-host:5432/nifi_registry
nifi.registry.db.driver.class=org.postgresql.Driver
# Directory containing the JDBC jar:
nifi.registry.db.driver.directory=./lib/postgresql
nifi.registry.db.username=nifireg
nifi.registry.db.password=secret
nifi.registry.db.maxConnections=5

# ── Cluster (change node.identifier per node) ─────────────────────────────────
nifi.registry.cluster.enabled=true
# Set to node-2 / node-3 on the other nodes:
nifi.registry.cluster.node.identifier=node-1
nifi.registry.cluster.cache.refresh.interval.ms=15000
```
Note that comments are placed on their own lines: in a Java .properties file, a trailing `# …` after a value becomes part of the value.
The DB URL is the only place you configure the database. The providers (authorizers.xml, providers.xml) simply say "use the database" — they don't repeat the connection
details, because the framework injects the same DataSource bean into them automatically.
File 2 — conf/providers.xml
Switch flow storage from the default filesystem provider to the database provider. Identical on all three nodes.
The provider class to use is:

    org.apache.nifi.registry.provider.flow.DatabaseFlowPersistenceProvider

Note on extension bundles: If you use the filesystem provider for bundles, mount the same NFS/EFS directory on every node, or switch to S3 (S3BundlePersistenceProvider) for a fully shared, filesystem-free setup.
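The flowPersistenceProvider entry in providers.xml would look roughly like this (a sketch; verify the element names against the default providers.xml shipped with your NiFi Registry version):

```xml
<providers>
    <flowPersistenceProvider>
        <class>org.apache.nifi.registry.provider.flow.DatabaseFlowPersistenceProvider</class>
    </flowPersistenceProvider>
</providers>
```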
File 3 — conf/authorizers.xml
Switch from the default file-backed providers to the DB-backed ones. Identical on all three nodes. Fill in your actual admin identity.
Also confirm nifi-registry.properties points to this authorizer:
nifi.registry.security.authorizer=managed-authorizer
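A rough shape for the DB-backed authorizers (a sketch only; the identifiers, property names, and provider classes should be checked against the authorizers.xml shipped with your build, and the admin identity below is a placeholder):

```xml
<authorizers>
    <userGroupProvider>
        <identifier>database-user-group-provider</identifier>
        <class>org.apache.nifi.registry.security.authorization.database.DatabaseUserGroupProvider</class>
        <property name="Initial User Identity 1">CN=admin, OU=NiFi</property>
    </userGroupProvider>
    <accessPolicyProvider>
        <identifier>database-access-policy-provider</identifier>
        <class>org.apache.nifi.registry.security.authorization.database.DatabaseAccessPolicyProvider</class>
        <property name="User Group Provider">database-user-group-provider</property>
        <property name="Initial Admin Identity">CN=admin, OU=NiFi</property>
    </accessPolicyProvider>
    <authorizer>
        <identifier>managed-authorizer</identifier>
        <class>org.apache.nifi.registry.security.authorization.StandardManagedAuthorizer</class>
        <property name="Access Policy Provider">database-access-policy-provider</property>
    </authorizer>
</authorizers>
```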
File 4 — lib/postgresql/ (JDBC driver)
Drop the PostgreSQL JDBC jar into a directory on every node and point nifi.registry.db.driver.directory at it (done in File 1 above). No XML needed — just the jar file.
What happens on first start
1. Flyway applies the V9/V10 migrations against the shared database, creating the cluster tables.
2. The first node to start wins the INSERT race into CLUSTER_LEADER and becomes leader.
3. The other nodes find the row already exists (and Node-1 holds the lease), so they become followers.
4. The leader renews its 30 s lease every 10 s; if it dies, a follower claims the expired lease.
Summary of what goes where