Nifi 8843 internal #47

Open
prabhjyotsingh wants to merge 6 commits into rel/ODP-2.7.2.3.3.6.4-1 from NIFI-8843-internal

Conversation


@prabhjyotsingh prabhjyotsingh commented Feb 19, 2026

HA Configuration — Three NiFi Registry Nodes

There are four files to change on each node; the cluster node identifier is the only per-node difference — the database connection settings and cluster flags are identical everywhere.

Overview

                [Load Balancer]
                /      |      \
           Node-1   Node-2   Node-3
                \      |      /
                 [PostgreSQL]

All three nodes point at the same DB. The DB holds:

  • Metadata (buckets, flows, versions)
  • Users & groups / access policies (DB-backed providers)
  • Flow content (DB-backed persistence provider)
  • Cache version counters (CACHE_VERSION)
  • Leader election lock (CLUSTER_LEADER)
  • Durable event log (REGISTRY_EVENT)

File 1 — conf/nifi-registry.properties

This is the main config file. The DB section and the cluster section both live here. Everything below is the same on all three nodes except
nifi.registry.cluster.node.identifier.

── Database (same on all nodes) ──────────────────────────────────────────────

nifi.registry.db.url=jdbc:postgresql://pg-host:5432/nifi_registry
nifi.registry.db.driver.class=org.postgresql.Driver
nifi.registry.db.driver.directory=./lib/postgresql # dir containing the JDBC jar
nifi.registry.db.username=nifireg
nifi.registry.db.password=secret
nifi.registry.db.maxConnections=5

── Cluster (change node.identifier per node) ─────────────────────────────────

nifi.registry.cluster.enabled=true
nifi.registry.cluster.node.identifier=node-1 # → node-2 / node-3 on the others
nifi.registry.cluster.cache.refresh.interval.ms=15000

The DB URL is the only place you configure the database. The providers (authorizers.xml, providers.xml) simply say "use the database" — they don't repeat the connection
details, because the framework injects the same DataSource bean into them automatically.


File 2 — conf/providers.xml

Switch flow storage from the default filesystem provider to the database provider. Identical on all three nodes.

  <!-- Flow storage: database-backed instead of the default filesystem provider -->
  <flowPersistenceProvider>
      <class>org.apache.nifi.registry.provider.flow.DatabaseFlowPersistenceProvider</class>
  </flowPersistenceProvider>

  <!-- Extension bundle storage: also switch to S3 or keep filesystem+shared-NFS.
       Filesystem only works if all nodes mount the same directory. -->
  <extensionBundlePersistenceProvider>
      <class>org.apache.nifi.registry.provider.extension.FileSystemBundlePersistenceProvider</class>
      <property name="Extension Bundle Storage Directory">./extension_bundles</property>
  </extensionBundlePersistenceProvider>

Note on extension bundles: If you use the filesystem provider for bundles, mount the same NFS/EFS directory on every node. Or switch to S3 (S3BundlePersistenceProvider) for a
fully shared, filesystem-free setup.
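For the fully shared variant, a sketch of the S3 form of the bundle provider, replacing the filesystem block in providers.xml. Property names follow the NiFi Registry administration guide's S3BundlePersistenceProvider; the bucket name and region here are placeholders — verify both against your Registry version:

```xml
  <extensionBundlePersistenceProvider>
      <class>org.apache.nifi.registry.provider.extension.S3BundlePersistenceProvider</class>
      <property name="Region">us-east-1</property>
      <property name="Bucket Name">nifi-registry-bundles</property>
      <property name="Key Prefix">extension_bundles</property>
      <property name="Credentials Provider">DEFAULT_CHAIN</property>
  </extensionBundlePersistenceProvider>
```

Only one extensionBundlePersistenceProvider may be configured at a time, so this replaces (not supplements) the filesystem entry.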


File 3 — conf/authorizers.xml

Switch from the default file-backed providers to the DB-backed ones. Identical on all three nodes. Fill in your actual admin identity.

  <!-- Users & groups live in the DB (UGP_USER / UGP_GROUP tables) -->
  <userGroupProvider>
      <identifier>database-user-group-provider</identifier>
      <class>org.apache.nifi.registry.security.authorization.database.DatabaseUserGroupProvider</class>
      <property name="Initial User Identity 1">CN=admin, OU=nifi</property>
  </userGroupProvider>

  <!-- Access policies live in the DB (APP_POLICY tables) -->
  <accessPolicyProvider>
      <identifier>database-access-policy-provider</identifier>
      <class>org.apache.nifi.registry.security.authorization.database.DatabaseAccessPolicyProvider</class>
      <property name="User Group Provider">database-user-group-provider</property>
      <property name="Initial Admin Identity">CN=admin, OU=nifi</property>
      <!-- Add NiFi node identities if NiFi itself connects to this Registry -->
      <property name="NiFi Identity 1">CN=nifi-node-1, OU=nifi</property>
  </accessPolicyProvider>

  <authorizer>
      <identifier>managed-authorizer</identifier>
      <class>org.apache.nifi.registry.security.authorization.StandardManagedAuthorizer</class>
      <property name="Access Policy Provider">database-access-policy-provider</property>
  </authorizer>

Also confirm nifi-registry.properties points to this authorizer:
nifi.registry.security.authorizer=managed-authorizer


File 4 — lib/postgresql/ (JDBC driver)

Drop the PostgreSQL JDBC jar into a directory on every node and point nifi.registry.db.driver.directory at it (done in File 1 above). No XML needed — just the jar file.


What happens on first start

  1. Node-1 starts first. Flyway runs all migrations (V2–V10) and creates all tables, including CACHE_VERSION, CLUSTER_LEADER, and REGISTRY_EVENT. Node-1 wins the INSERT into CLUSTER_LEADER and becomes leader.
  2. Node-2 and Node-3 start. Flyway detects all migrations are already applied and skips them. Their heartbeat threads try to INSERT into CLUSTER_LEADER but fail (the row exists and Node-1 holds the lease), so they become followers.
  3. All three nodes poll CACHE_VERSION every 15 s. When any node mutates users/groups/policies, it bumps the version and the other two reload their in-memory caches within 15 s.
  4. If Node-1 dies, Node-2 or Node-3 claims the expired lease within 30 s and takes over event delivery.
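The lease handling in steps 1, 2, and 4 can be sketched in PostgreSQL-flavored SQL. Table and column names here are illustrative only — the actual DDL lives in V10__AddLeaderAndEventTables.sql:

```sql
-- Step 1: the first node to start wins this INSERT and becomes leader.
INSERT INTO CLUSTER_LEADER (ROLE_NAME, NODE_ID, LEASE_EXPIRY)
VALUES ('primary', 'node-1', NOW() + INTERVAL '30 seconds');

-- Heartbeat (every 10 s): renew the lease only if we still hold it.
UPDATE CLUSTER_LEADER
   SET LEASE_EXPIRY = NOW() + INTERVAL '30 seconds'
 WHERE ROLE_NAME = 'primary' AND NODE_ID = 'node-1';

-- Step 4 (failover): a follower takes over only once the lease has expired.
UPDATE CLUSTER_LEADER
   SET NODE_ID = 'node-2', LEASE_EXPIRY = NOW() + INTERVAL '30 seconds'
 WHERE ROLE_NAME = 'primary' AND LEASE_EXPIRY < NOW();
```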

Summary of what goes where

Setting                                            Where
─────────────────────────────────────────────────  ────────────────────────────────
Database URL, driver, username, password           nifi-registry.properties
cluster.enabled, node.identifier, cache interval   nifi-registry.properties
Flow persistence provider (DB vs filesystem)       providers.xml
User/group + policy providers (DB vs file)         authorizers.xml
PostgreSQL JDBC jar                                lib/postgresql/ dir on each node

Nothing else — no extra XML, no ZooKeeper, no NiFi cluster.xml.


Copilot AI left a comment


Pull request overview

This PR implements high-availability (HA) clustering capabilities for NiFi Registry with database-backed coordination, along with a maintenance mode feature for operational management.

Changes:

  • Adds maintenance mode functionality via Spring Boot Actuator endpoints to temporarily block write operations for safe backups
  • Implements active-active clustering with database-backed leader election, cache synchronization, and durable event delivery
  • Refactors EventService to support both standalone and cluster-aware modes with exactly-once event delivery guarantees

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

File Description
MaintenanceModeManager.java Simple state manager for maintenance mode using AtomicBoolean
MaintenanceModeFilter.java Servlet filter that blocks write HTTP methods during maintenance mode
MaintenanceModeHealthIndicator.java Exposes maintenance mode state via /actuator/health endpoint
MaintenanceModeEndpoint.java Actuator endpoint for toggling maintenance mode via POST requests
NiFiRegistrySecurityConfig.java Integrates maintenance mode filter into security filter chain
LeaderElectionManager.java Interface for cluster leader election
DatabaseLeaderElectionManager.java TTL-based leader election using CLUSTER_LEADER table
CacheRefreshPoller.java Polls CACHE_VERSION table to synchronize authorization caches across nodes
DatabaseUserGroupProvider.java Enhanced to bump cache version on mutations and register with poller
DatabaseAccessPolicyProvider.java Enhanced to bump cache version on mutations and register with poller
EventService.java Refactored from concrete class to interface
StandardEventService.java Standalone in-memory event delivery implementation (renamed from EventService)
ClusterAwareEventService.java Cluster-aware event delivery with database persistence and leader-only delivery
EventServiceConfiguration.java Factory that selects EventService implementation based on cluster mode
ExtensionManager.java Added optional LeaderElectionManager injection for logging
V9__AddClusterTables.sql Creates CACHE_VERSION and CLUSTER_NODE tables
V10__AddLeaderAndEventTables.sql Creates CLUSTER_LEADER and REGISTRY_EVENT tables
NiFiRegistryProperties.java Added cluster configuration property getters
nifi-registry.properties Added cluster configuration property template
administration-guide.adoc Comprehensive documentation for maintenance mode, clustering, and leader election
CLAUDE.md Development guidance for AI coding assistants
TestEventService.java Updated to use StandardEventService


Comment on lines +20 to +21
import org.springframework.boot.health.contributor.Health;
import org.springframework.boot.health.contributor.HealthIndicator;

Copilot AI Feb 19, 2026


The Spring Boot Actuator import paths appear to be incorrect. The correct package for Spring Boot 2.x and 3.x is org.springframework.boot.actuate.health, not org.springframework.boot.health.contributor. The imports should be:

  • import org.springframework.boot.actuate.health.Health;
  • import org.springframework.boot.actuate.health.HealthIndicator;

This will cause a compilation failure.

Suggested change
import org.springframework.boot.health.contributor.Health;
import org.springframework.boot.health.contributor.HealthIndicator;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

Comment on lines +92 to +96
@Override
public void destroy() {
LOGGER.info("Shutting down CacheRefreshPoller.");
scheduler.shutdownNow();
}

Copilot AI Feb 19, 2026


The destroy method always calls scheduler.shutdownNow() even when cluster mode is disabled and the scheduler was never started. While this is likely harmless (the scheduler is initialized in the constructor), it would be cleaner to track whether the scheduler was actually started and only shut it down if it's running. Consider adding a flag like private volatile boolean started = false; and checking it in the destroy method.
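One way to sketch the suggested pattern — class and method names here are hypothetical, not the PR's actual code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: only shut the scheduler down if it was actually started.
public class PollerShutdownSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile boolean started = false;

    public void start(boolean clusterEnabled) {
        if (!clusterEnabled) {
            return; // standalone mode: never start the polling task
        }
        scheduler.scheduleAtFixedRate(() -> { /* poll CACHE_VERSION here */ },
                0, 15, TimeUnit.SECONDS);
        started = true;
    }

    public void destroy() {
        if (started) {
            scheduler.shutdownNow();
        }
    }

    public boolean isSchedulerShutdown() {
        return scheduler.isShutdown();
    }
}
```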

private final Map<String, ExtensionClassLoader> classLoaderMap = new HashMap<>();
private final AtomicBoolean loaded = new AtomicBoolean(false);

// Optional: injected only when cluster mode is enabled; used for logging.

Copilot AI Feb 19, 2026


The comment states "Optional: injected only when cluster mode is enabled" but DatabaseLeaderElectionManager is always created as a Spring bean (marked @Component). While @Autowired(required = false) makes the injection optional at the Spring level, the leaderElectionManager will always be injected in practice since the bean always exists. Update the comment to reflect reality: "Optional injection; used for logging cluster leadership status when available."

Suggested change
// Optional: injected only when cluster mode is enabled; used for logging.
// Optional injection; used for logging cluster leadership status when available.

Comment on lines +1245 to +1250
.Re-enabling writes after backup
----
curl -X POST https://localhost:18443/nifi-registry/actuator/maintenance \
-H "Content-Type: application/json" \
-d '{"enabled": false}'
----

Copilot AI Feb 19, 2026


The documentation describes how to disable maintenance mode via a POST request (line 1247), but this will fail due to the deadlock issue identified earlier - the MaintenanceModeFilter blocks all POST requests when maintenance mode is enabled, including the POST to /actuator/maintenance itself. The documentation should include a workaround, such as: "Note: If maintenance mode was accidentally enabled and must be disabled, restart the NiFi Registry process as maintenance mode is not persisted across restarts."

Comment on lines +28 to +36
-- Cluster node registration table: tracks live nodes for operational visibility.
-- Updated by a heartbeat thread on each node; no quorum or eviction is performed in this phase.

CREATE TABLE CLUSTER_NODE (
NODE_ID VARCHAR(100) NOT NULL,
HOSTNAME VARCHAR(255) NOT NULL,
LAST_HEARTBEAT TIMESTAMP NOT NULL,
CONSTRAINT PK__CLUSTER_NODE PRIMARY KEY (NODE_ID)
);

Copilot AI Feb 19, 2026


The CLUSTER_NODE table is created in this migration but there's no code in the PR that actually writes to it. According to the comment, it should be "Updated by a heartbeat thread on each node" but no such heartbeat logic exists. This appears to be dead schema that was planned but not implemented, or code is missing from this PR. Consider either removing this table definition or implementing the heartbeat logic that populates it.

Suggested change
-- Cluster node registration table: tracks live nodes for operational visibility.
-- Updated by a heartbeat thread on each node; no quorum or eviction is performed in this phase.
CREATE TABLE CLUSTER_NODE (
NODE_ID VARCHAR(100) NOT NULL,
HOSTNAME VARCHAR(255) NOT NULL,
LAST_HEARTBEAT TIMESTAMP NOT NULL,
CONSTRAINT PK__CLUSTER_NODE PRIMARY KEY (NODE_ID)
);

Comment on lines +61 to +64
if (maintenanceModeManager.isEnabled() && isWriteMethod(httpRequest.getMethod())) {
sendMaintenanceModeResponse(httpResponse);
return;
}

Copilot AI Feb 19, 2026


The MaintenanceModeFilter blocks ALL write operations (POST, PUT, PATCH, DELETE) when maintenance mode is enabled, including the POST request to /actuator/maintenance endpoint that would disable maintenance mode. This creates a deadlock situation where maintenance mode cannot be disabled via the API once enabled. Consider either:

  1. Exempting the /actuator/maintenance endpoint from the filter
  2. Allowing GET with a query parameter to toggle state
  3. Adding an exclude path pattern to the filter for critical administrative endpoints
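Option 1 can be sketched as a pure predicate — the class, method, and exempt path prefix below are hypothetical, chosen only to illustrate the exemption check:

```java
// Hypothetical sketch of option 1: exempt the maintenance endpoint itself
// from the write-blocking check so maintenance mode can still be disabled.
public class MaintenanceExemptionSketch {
    private static final String EXEMPT_PREFIX = "/actuator/maintenance";

    public static boolean isBlocked(boolean maintenanceEnabled,
                                    String method, String requestUri) {
        boolean isWrite = method.equals("POST") || method.equals("PUT")
                || method.equals("PATCH") || method.equals("DELETE");
        boolean isExempt = requestUri.startsWith(EXEMPT_PREFIX);
        return maintenanceEnabled && isWrite && !isExempt;
    }
}
```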

Comment on lines +57 to +59
// Provider instances are not Spring beans; they self-register here on initialisation.
private static volatile DatabaseAccessPolicyProvider accessPolicyProvider;
private static volatile DatabaseUserGroupProvider userGroupProvider;

Copilot AI Feb 19, 2026


The static volatile fields for provider instances (accessPolicyProvider, userGroupProvider) could cause issues in testing scenarios or if multiple Spring contexts are created. While volatile ensures visibility, static state is shared across all instances of the class. Consider making these instance fields or using a different mechanism to allow the providers to register themselves with the poller, such as a callback passed during provider initialization.
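The callback alternative can be sketched as follows — names are hypothetical, not the PR's actual API:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the suggested alternative: instead of static fields,
// providers register a refresh callback with the poller instance.
public class CallbackRegistrationSketch {
    public interface CacheRefreshListener {
        void refreshCache();
    }

    private final List<CacheRefreshListener> listeners = new CopyOnWriteArrayList<>();

    // Providers call this during initialization instead of setting static state.
    public void register(CacheRefreshListener listener) {
        listeners.add(listener);
    }

    // Invoked when a CACHE_VERSION bump is observed during polling.
    public void notifyListeners() {
        listeners.forEach(CacheRefreshListener::refreshCache);
    }
}
```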

Comment on lines +47 to +57
@Bean
@Autowired
public EventService eventService(
final List<EventHookProvider> eventHookProviders,
final NiFiRegistryProperties properties,
final DataSource dataSource,
final LeaderElectionManager leaderElectionManager) {

if (properties.isClusterEnabled()) {
LOGGER.info("Cluster mode is enabled; creating ClusterAwareEventService.");
return new ClusterAwareEventService(eventHookProviders, dataSource, leaderElectionManager);

Copilot AI Feb 19, 2026


The EventServiceConfiguration.eventService() method requires a LeaderElectionManager parameter, but in standalone mode (cluster.enabled=false) the DatabaseLeaderElectionManager bean still exists and will be injected. While this works (the bean self-disables when cluster mode is off), it creates an unnecessary dependency. Consider making the LeaderElectionManager parameter optional with @Autowired(required = false) or using @ConditionalOnProperty to only create the ClusterAwareEventService when clustering is enabled.

prabhjyotsingh and others added 2 commits February 19, 2026 16:55
Implements leader election and distributed event hooks for NiFi Registry
cluster mode (nifi.registry.cluster.enabled=true).

- V10__AddLeaderAndEventTables.sql: adds CLUSTER_LEADER (TTL-based
  distributed lock) and REGISTRY_EVENT (durable event log) tables.

- LeaderElectionManager interface + DatabaseLeaderElectionManager: TTL
  lease (30 s) with 10 s heartbeat. Algorithm: renew own lease →
  claim expired lease → INSERT first-ever row → remain follower.

- EventService converted to interface; existing in-memory logic moved to
  StandardEventService (standalone mode). ClusterAwareEventService
  (cluster mode) persists events to REGISTRY_EVENT and delivers them
  exactly once from the leader node via a 5 s polling loop with 7-day
  retention cleanup.

- EventServiceConfiguration (@Configuration) selects the right
  EventService implementation at startup based on cluster mode.

- ExtensionManager: injects optional LeaderElectionManager and logs
  which node performed extension discovery.
…ain/java/org/apache/nifi/registry/event/ClusterAwareEventService.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


3 participants