@@ -61,6 +61,7 @@ public class DatanodeConfiguration extends ReconfigurableConfig {
public static final String FAILED_DB_VOLUMES_TOLERATED_KEY = "hdds.datanode.failed.db.volumes.tolerated";
public static final String DISK_CHECK_MIN_GAP_KEY = "hdds.datanode.disk.check.min.gap";
public static final String DISK_CHECK_TIMEOUT_KEY = "hdds.datanode.disk.check.timeout";
public static final String DISK_CHECK_RETRY_GAP_KEY = "hdds.datanode.disk.check.retry.gap";

// Minimum space should be left on volume.
// Ex: If volume has 1000GB and minFreeSpace is configured as 10GB,
@@ -99,6 +100,8 @@ public class DatanodeConfiguration extends ReconfigurableConfig {

static final Duration DISK_CHECK_TIMEOUT_DEFAULT = Duration.ofMinutes(10);

static final Duration DISK_CHECK_RETRY_GAP_DEFAULT = Duration.ofMinutes(1);

static final boolean CONTAINER_SCHEMA_V3_ENABLED_DEFAULT = true;
static final long ROCKSDB_LOG_MAX_FILE_SIZE_BYTES_DEFAULT = 32 * 1024 * 1024;
static final int ROCKSDB_LOG_MAX_FILE_NUM_DEFAULT = 64;
@@ -404,6 +407,17 @@ public class DatanodeConfiguration extends ReconfigurableConfig {
)
private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;

@Config(key = DISK_CHECK_RETRY_GAP_KEY,
defaultValue = "1m",

Reviewer: should it be at least 15m?

Author: No, we don't want a long gap between two checks, since the goal here is to ignore transient errors while opening RocksDb. If the gap between two consecutive attempts at opening RocksDb is too long, it is no longer about transient errors. Secondly, the period between the two open attempts cannot be 15 minutes, as that would exceed the 10-minute disk check timeout. On an error when opening RocksDb, we ideally want the thread to sleep for a very short time, try to open RocksDb a second time, and only if that also fails mark it as an actual failure to open RocksDb.

Reviewer: Should we also increase the 10-minute timeout then? In case of a 3rd retry we would spend 2 of the 10 minutes sleeping, plus the time for actual processing...

Author: Two checks should be more than sufficient to definitively declare whether we are failing to open RocksDb. After two checks, the sliding window should decide whether further checks are required on that volume. If we allowed more than two checks, the timeout for each disk check would also have to become dynamic, and the 10-minute threshold might be too small. On the other hand, we would like to know within 10 minutes whether a disk is unhealthy, since lengthening this check pushes future checks further out.

type = ConfigType.TIME,
tags = {DATANODE},
description = "Time to wait between retries of disk checks."
+ " To ignore transient issues, the RocksDb instance on a disk is validated multiple times before"
+ " declaring failure. This configuration defines the time to wait between the retry attempts."
+ " Unit could be defined with postfix (ns,ms,s,m,h,d)."
)
private Duration diskCheckRetryGap = DISK_CHECK_RETRY_GAP_DEFAULT;
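As a sanity check on these defaults, the extra time added by retries is (maxAttempts - 1) sleeps of the retry gap, and that budget must fit inside the disk check timeout. A minimal sketch of the arithmetic, using the defaults above and the `maxAttempts = 2` hard-coded in `checkDbHealth` later in this diff (the class name is illustrative):

```java
import java.time.Duration;

public class RetryBudgetSketch {
  public static void main(String[] args) {
    // Defaults from DatanodeConfiguration in this PR.
    Duration retryGap = Duration.ofMinutes(1);      // DISK_CHECK_RETRY_GAP_DEFAULT
    Duration checkTimeout = Duration.ofMinutes(10); // DISK_CHECK_TIMEOUT_DEFAULT
    int maxAttempts = 2;                            // hard-coded in checkDbHealth

    // Only the gaps *between* attempts add sleep time: maxAttempts - 1 of them.
    Duration worstCaseSleep = retryGap.multipliedBy(maxAttempts - 1);

    // The sleep budget must leave room for the open attempts themselves.
    System.out.println(worstCaseSleep.compareTo(checkTimeout) < 0); // true for the defaults
  }
}
```

This is the constraint behind the review discussion above: a 15-minute gap would by itself blow the 10-minute timeout.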

@Config(key = "hdds.datanode.chunk.data.validation.check",
defaultValue = "false",
type = ConfigType.BOOLEAN,
@@ -688,6 +702,19 @@ public void validate() {
diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
}

if (diskCheckRetryGap.isNegative()) {
LOG.warn("{} must be greater than zero and was set to {}. Defaulting to {}",
DISK_CHECK_RETRY_GAP_KEY, diskCheckRetryGap, DISK_CHECK_RETRY_GAP_DEFAULT);
diskCheckRetryGap = DISK_CHECK_RETRY_GAP_DEFAULT;
}

if (diskCheckRetryGap.compareTo(diskCheckTimeout) > 0) {
LOG.warn("{} was set to {}. It must be less than {} which is {}. Defaulting to {}",
DISK_CHECK_RETRY_GAP_KEY, diskCheckRetryGap, DISK_CHECK_TIMEOUT_KEY, diskCheckTimeout,
DISK_CHECK_RETRY_GAP_DEFAULT);
diskCheckRetryGap = DISK_CHECK_RETRY_GAP_DEFAULT;
}
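The two guards above can be condensed into one pure function to see their combined effect; a sketch, where `clampRetryGap` is a hypothetical helper and not code from the PR:

```java
import java.time.Duration;

public class ClampRetryGapSketch {
  static final Duration DEFAULT_GAP = Duration.ofMinutes(1);

  // Mirrors the validate() logic: fall back to the default when the gap is
  // negative or exceeds the disk check timeout.
  static Duration clampRetryGap(Duration gap, Duration diskCheckTimeout) {
    if (gap.isNegative() || gap.compareTo(diskCheckTimeout) > 0) {
      return DEFAULT_GAP;
    }
    return gap;
  }

  public static void main(String[] args) {
    Duration timeout = Duration.ofMinutes(10);
    System.out.println(clampRetryGap(Duration.ofSeconds(30), timeout)); // PT30S (kept)
    System.out.println(clampRetryGap(Duration.ofMinutes(-5), timeout)); // PT1M (negative -> default)
    System.out.println(clampRetryGap(Duration.ofMinutes(15), timeout)); // PT1M (exceeds timeout -> default)
  }
}
```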

if (blockDeleteCommandWorkerInterval.isNegative()) {
LOG.warn(BLOCK_DELETE_COMMAND_WORKER_INTERVAL +
" must be greater than zero and was set to {}. Defaulting to {}",
@@ -903,6 +930,10 @@ public Duration getDiskCheckTimeout() {
return diskCheckTimeout;
}

public Duration getDiskCheckRetryGap() {
return diskCheckRetryGap;
}

public void setDiskCheckTimeout(Duration duration) {
diskCheckTimeout = duration;
}
@@ -25,6 +25,7 @@
import jakarta.annotation.Nullable;
import java.io.File;
import java.io.IOException;
import java.time.Duration;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
@@ -326,17 +327,31 @@ public VolumeCheckResult checkDbHealth(File dbFile) throws InterruptedException
return VolumeCheckResult.HEALTHY;
}

// We attempt to open RocksDb twice to ignore any transient errors
// and to confirm that we actually cannot open RocksDb in readonly mode.
final boolean isVolumeTestResultHealthy = true;
try (ManagedOptions managedOptions = new ManagedOptions();
ManagedRocksDB ignored = ManagedRocksDB.openReadOnly(managedOptions, dbFile.toString())) {
volumeTestResultQueue.add(isVolumeTestResultHealthy);
} catch (Exception e) {
if (Thread.currentThread().isInterrupted()) {
throw new InterruptedException("Check of database for volume " + this + " interrupted.");
final int maxAttempts = 2;

Reviewer: should it be configurable?

Author: My thought process for not making this configurable was to avoid problematic configurations where the total retry time could exceed the disk check timeout. For example, with maxAttempts set to 10 and a 1-minute sleep per attempt, this is no longer about transient errors, and the total of 10 minutes would make the method flaky against the 10-minute disk check timeout. I can change it if we think we may still want that control in production.

final Duration maxRetryGap = getDatanodeConfig().getDiskCheckRetryGap();
for (int attempt = 0; attempt < maxAttempts; attempt++) {
try (ManagedOptions managedOptions = new ManagedOptions();
ManagedRocksDB ignored =
ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getTmpDir().getPath())) {
volumeTestResultQueue.add(isVolumeTestResultHealthy);
break;
} catch (Exception e) {
if (Thread.currentThread().isInterrupted()) {
throw new InterruptedException("Check of database for volume " + this + " interrupted.");
}

if (attempt == maxAttempts - 1) {
Reviewer: Basically this retries on any failure, not just transient errors. Do we know which errors are transient and could be recovered by a retry, and which cannot?

Author: Yes, the idea is that if you cannot open RocksDb twice, the problem should definitely be counted as an error. There is no clear list of the possible transient errors, so this check is purely defensive. From the RocksDb docs and comments, it appears that we may be unable to open the database in read-only mode when any of these happen:

  1. the CURRENT file is being renamed
  2. SST files are deleted, or during some states of compaction
  3. the WAL's physical size on disk exceeds the size declared in the MANIFEST

LOG.error("Could not open Volume DB located at {}", dbFile, e);
volumeTestResultQueue.add(!isVolumeTestResultHealthy);
volumeTestFailureCount.incrementAndGet();
} else {
LOG.warn("Could not open Volume DB located at {}", dbFile, e);
Thread.sleep(maxRetryGap.toMillis());
}
}
LOG.warn("Could not open Volume DB located at {}", dbFile, e);
volumeTestResultQueue.add(!isVolumeTestResultHealthy);
volumeTestFailureCount.incrementAndGet();
}
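The control flow of the loop above (break on success, sleep only between attempts, record a failure only after the final attempt) can be isolated into a generic sketch; `openWithRetry` and the class name are illustrative, not from the PR:

```java
import java.time.Duration;
import java.util.concurrent.Callable;

public class BoundedRetrySketch {
  // Returns true if task succeeds within maxAttempts, sleeping gap between tries.
  static boolean openWithRetry(Callable<Void> task, int maxAttempts, Duration gap)
      throws InterruptedException {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        task.call();
        return true;                  // success: stop retrying, like the break above
      } catch (Exception e) {
        if (attempt == maxAttempts - 1) {
          return false;               // last attempt failed: count the real failure
        }
        Thread.sleep(gap.toMillis()); // possibly transient: wait briefly, try again
      }
    }
    return false;                     // unreachable for maxAttempts >= 1
  }

  public static void main(String[] args) throws InterruptedException {
    int[] calls = {0};
    // Fails once, then succeeds: a stand-in for a transient RocksDb open error.
    Callable<Void> flaky = () -> {
      if (calls[0]++ == 0) {
        throw new RuntimeException("transient");
      }
      return null;
    };
    System.out.println(openWithRetry(flaky, 2, Duration.ofMillis(10))); // true

    Callable<Void> broken = () -> { throw new RuntimeException("permanent"); };
    System.out.println(openWithRetry(broken, 2, Duration.ofMillis(10))); // false
  }
}
```

The same shape explains why only two attempts are needed here: one retry is enough to filter a transient error, and anything that fails twice is handed to the sliding-window failure accounting.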

if (volumeTestResultQueue.size() > volumeTestCount
@@ -75,6 +75,14 @@ public static ManagedRocksDB openReadOnly(
);
}

public static ManagedRocksDB openAsSecondary(
final ManagedOptions options,
final String dbPath,
final String secondaryDbLogFilePath)
throws RocksDBException {
return new ManagedRocksDB(RocksDB.openAsSecondary(options, dbPath, secondaryDbLogFilePath));
}

public static ManagedRocksDB open(
final DBOptions options, final String path,
final List<ColumnFamilyDescriptor> columnFamilyDescriptors,