Conversation

Contributor

elliVM commented Dec 11, 2025

Description

Adds an epoch migration mode, which modifies archive reads to return only the first event in each S3 object.

The epoch value of this first event can then be used to migrate missing epoch metadata in the S3 objects.

The event message, and consequently the resulting _raw column, is left empty, as it is not required in this mode.

General

  • I have checked that my test files and functions have meaningful names.
  • I have checked that each test tests only a single behavior.
  • I have done happy tests.
  • I have tested only my own code.
  • I have tested at least all public methods.

Assertions

  • I have checked that my tests use assertions and not runtime overhead.
  • I have checked that my tests end in assertions.
  • I have checked that there are no comparison statements in assertions.
  • I have checked that assertions are in tests and not in helper functions.
  • I have checked that assertions for iterables are outside of for loops and both sides of the iteration blocks.
  • I have checked that assertions are not tested inside consumers.

Testing Data

  • I have tested algorithms and anything else with the possibility of unbound growth.
  • I have checked that all testing data is local and fully replaceable or reproducible or both.
  • I have checked that all test files are standalone.
  • I have checked that all test-specific fake objects and classes are in the test directory.
  • I have checked that my tests do not contain anything related to customers, infrastructure or users.
  • I have checked that my tests do not contain non-generic information.
  • I have checked that my tests do not do external requests and are not privately or publicly routable.

Statements

  • I have checked that my tests do not use throws for exceptions.
  • I have checked that my tests do not use try-catch statements.
  • I have checked that my tests do not use if-else statements.

Java

  • I have checked that my tests for Java use the JUnit library.
  • I have checked that my tests for Java use JUnit utilities for parameters.

Other

  • I have only tested public behavior and not private implementation details.
  • I have checked that my tests are not (partially) commented out.
  • I have checked that hand-crafted variables in assertions are used accordingly.
  • I have tested Object Equality.
  • I have checked that I do not have any manual tests or I have a valid reason for them and I have explained it in the PR description.

Code Quality

  • I have checked that my code follows metrics set in Procedure: Class Metrics.
  • I have checked that my code follows metrics set in Procedure: Method Metrics.
  • I have checked that my code follows metrics set in Procedure: Object Quality.
  • I have checked that my code does not have any NULL values.
  • I have checked my code does not contain FIXME or TODO comments.

elliVM self-assigned this Dec 11, 2025
elliVM linked an issue Dec 11, 2025 that may be closed by this pull request
elliVM requested a review from eemhu December 11, 2025 08:47
elliVM requested a review from eemhu December 15, 2025 06:14
elliVM force-pushed the epoch-migration-mode branch from 26276ff to cae62ff December 15, 2025 06:39
Contributor Author

elliVM commented Dec 15, 2025

rebased, improved test to include multiple messages per S3 object to ensure only the first message is selected for the result

eemhu previously approved these changes Dec 17, 2025
elliVM added the review label ("Issues or pull requests waiting for a review") Dec 18, 2025
elliVM requested a review from kortemik December 18, 2025 07:21
elliVM removed the review label Dec 29, 2025
elliVM force-pushed the epoch-migration-mode branch from f26de61 to 28273f1 January 14, 2026 13:13
elliVM requested a review from eemhu January 14, 2026 13:34
Contributor Author

elliVM commented Jan 14, 2026

rebased

elliVM requested a review from Tiihott January 19, 2026 05:58
elliVM force-pushed the epoch-migration-mode branch from 50bb345 to 22b8e93 January 23, 2026 09:10
Contributor Author

elliVM commented Jan 23, 2026

rebased

    rowWriter.write(8, new EventToOrigin().asUTF8StringFrom(rfc5424Frame));
}
else {
    rowWriter.write(0, 0L);

Member

instead of 0, use the timestamp from the object path. the path always contains elements that are parseable because of the object structure

example path is

2007/10-08/sc-99-99-14-110/f17/f17.logGLOB-2007100800.log.gz

segments 2007/10-08/... are extractable with

^(?<year>\d{4})/(?<month>\d{2})-(?<day>\d{2})/(?>.*)$

Member

time should be 00:00:00 of that day. we can then attempt extracting the hour from the path itself and assume Europe/Helsinki as the time zone. logGLOB-2007100800.log.gz includes the specific hour which could be extracted with

^(?>.*-)(?<year>\d{4})(?<month>\d{2})(?<day>\d{2})(?<hour>\d{2})(?>\..*)$

Member

on further thought, please add this information to the metadata as well if it is available, as it is much easier to do this extraction on the pth_06 side.

Contributor Author

Extraction now prioritizes hourly data when available at the end of the path; otherwise, it falls back to the date from the initial segments.

The path-derived timestamp is returned in the _time column for non-syslog events, with missing hour information defaulting to 00:00:00 in the Europe/Helsinki timezone.

Member

kortemik Jan 30, 2026

please add information on which one of these was used, path or hourformat
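
A minimal sketch of the path-based extraction discussed in this thread, under stated assumptions. The two patterns and the Europe/Helsinki zone are quoted from the comments above. The class name `PathTimestampSketch`, its methods, and the `"hourformat"` / `"path"` source labels are hypothetical and are not taken from the pull request. The hour pattern is tried first, and the leading date segments are the fallback, with the time defaulting to 00:00:00.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a class from this pull request.
final class PathTimestampSketch {

    // Both patterns are quoted from the review comments above.
    private static final Pattern HOUR_PATTERN = Pattern
            .compile("^(?>.*-)(?<year>\\d{4})(?<month>\\d{2})(?<day>\\d{2})(?<hour>\\d{2})(?>\\..*)$");
    private static final Pattern DATE_PATTERN = Pattern
            .compile("^(?<year>\\d{4})/(?<month>\\d{2})-(?<day>\\d{2})/(?>.*)$");
    private static final ZoneId ZONE = ZoneId.of("Europe/Helsinki");

    private final String source; // which pattern produced the value (assumed labels)
    private final ZonedDateTime time;

    private PathTimestampSketch(final String source, final ZonedDateTime time) {
        this.source = source;
        this.time = time;
    }

    // Prefers the hour embedded in the file name, falls back to the date segments.
    static Optional<PathTimestampSketch> from(final String path) {
        final Matcher hourMatcher = HOUR_PATTERN.matcher(path);
        if (hourMatcher.matches()) {
            final int hour = Integer.parseInt(hourMatcher.group("hour"));
            return Optional.of(new PathTimestampSketch("hourformat", toTime(hourMatcher, hour)));
        }
        final Matcher dateMatcher = DATE_PATTERN.matcher(path);
        if (dateMatcher.matches()) {
            return Optional.of(new PathTimestampSketch("path", toTime(dateMatcher, 0)));
        }
        return Optional.empty();
    }

    // Throws java.time.DateTimeException if the digits do not form a valid date or hour.
    private static ZonedDateTime toTime(final Matcher matcher, final int hour) {
        return ZonedDateTime.of(
                Integer.parseInt(matcher.group("year")),
                Integer.parseInt(matcher.group("month")),
                Integer.parseInt(matcher.group("day")),
                hour, 0, 0, 0, ZONE
        );
    }

    String source() {
        return source;
    }

    ZonedDateTime time() {
        return time;
    }
}
```

For the example path from the review, `PathTimestampSketch.from(...)` matches the hour pattern. A path whose file name carries no ten-digit timestamp fails that pattern and is resolved from the `2007/10-08/` segments instead.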

elliVM force-pushed the epoch-migration-mode branch from 48c9631 to 47409b1 January 27, 2026 07:13
Contributor Author

elliVM commented Jan 27, 2026

rebased

elliVM force-pushed the epoch-migration-mode branch from c5d8feb to c85ddcf January 28, 2026 06:53
Contributor Author

elliVM commented Jan 28, 2026

rebased

Contributor Author

elliVM commented Jan 28, 2026

path extracted value now uses ZonedDateTime
c85ddcf

if (LOGGER.isDebugEnabled()) {
    LOGGER.debug("Parser syslog event <[{}]>", rfc5424Frame.toString());
}
final RFC5424Timestamp rfc5424Timestamp = new RFC5424Timestamp(rfc5424Frame.timestamp);

Member

this uses rfc5424Frame even though the content might not be in syslog format. is there a test case for the non-syslog format object case?

Contributor Author

added non-syslog format testing. This instantiates the timestamp here but only uses it later inside the isSyslog block; would it be better to move it inside the block?

Member

it would be more coherent in context to move it inside the block.

Contributor

Tiihott left a comment

There were some missing object equality tests, methods that could be set to private, and a few smaller things to fix. The RowConverterImpl and EpochMigrationRowConverter classes could also use some refactoring, but I think that should be done in a separate PR.

elliVM requested a review from Tiihott January 30, 2026 13:11
);
readAttempted = true;
isSyslogFormat = false;
returnValue = true;

Member

return value must be false, if this was not syslog format, right? there is no next frame to load then

Member

make a test case for this

if (!readAttempted) {
    try {
        boolean nextResult = rfc5424Frame.next();
        readAttempted = true;

Member

please move readAttempted out of the try block. if rfc5424Frame.next() throws, this will not be set to true

Member

make a test case for this

boolean nextResult = rfc5424Frame.next();
readAttempted = true;
isSyslogFormat = nextResult;
returnValue = true;

Member

returnValue must be set from the result of rfc5424Frame.next() instead of being manually set to true.

isSyslogFormat can be set to true if rfc5424Frame.next() does not throw.

Member

make a test case for this

"ParseException at object: <[{}]>/<[{}]>\n message: <{}>", bucket, path,
exception.getMessage()
);
readAttempted = true;

Member

setting readAttempted is not necessary in catch block when it is set before the try block
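
The review comments on the read logic can be combined into one hedged sketch. It is not the pull request's class: `FirstEventReaderSketch`, the `Frame` interface, and the unchecked `ParseFailure` exception are hypothetical stand-ins for the real parser types. The sketch follows the comments as written: `readAttempted` is set before the try block, `returnValue` comes from `next()`, `isSyslogFormat` becomes true only when `next()` does not throw, and the non-syslog case returns false. Whether a non-syslog object should still yield a row is a design question the thread leaves open.

```java
// Hypothetical sketch; field names mirror the snippets under review.
final class FirstEventReaderSketch {

    // Stand-in for the frame parser; next() returns true when a frame was read.
    interface Frame {
        boolean next();
    }

    // Stand-in for the parser's exception type (assumed unchecked here).
    static final class ParseFailure extends RuntimeException {
        ParseFailure(final String message) {
            super(message);
        }
    }

    private final Frame frame;
    private boolean readAttempted = false;
    private boolean isSyslogFormat = false;

    FirstEventReaderSketch(final Frame frame) {
        this.frame = frame;
    }

    // Reads at most one frame; later calls return false without touching the frame.
    boolean next() {
        boolean returnValue = false;
        if (!readAttempted) {
            // Set before the try block so the flag holds even when frame.next() throws.
            readAttempted = true;
            try {
                returnValue = frame.next();
                isSyslogFormat = true; // reached only when next() did not throw
            }
            catch (final ParseFailure exception) {
                isSyslogFormat = false; // returnValue stays false: no frame to load
            }
        }
        return returnValue;
    }

    boolean readAttempted() {
        return readAttempted;
    }

    boolean isSyslogFormat() {
        return isSyslogFormat;
    }
}
```

Because the flag is assigned before the try block, the catch block no longer needs to set it, which is the point of the last comment above.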

public final class MockDBNonSyslogRowImpl implements MockDBRow {

    private final MockDBRow origin;
    private final static Comparator<MockDBRow> COMPARATOR = Comparator

Member

does not compare isSyslog. that must be a private static final boolean field here to make it part of the comparison, and it must be part of equals and hashCode


@Override
public boolean isSyslog() {
    return true;

Member

make this a private static final boolean field and add it to the compareTo, equals and hashCode methods.
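
A minimal sketch of what the two comments on the mock rows ask for, under stated assumptions. `SketchRow` and `NonSyslogSketchRow` are hypothetical, and the single `id` column stands in for the real row's fields. The flag is a private static final constant that takes part in the comparator, `equals` and `hashCode`.

```java
import java.util.Comparator;
import java.util.Objects;

// Hypothetical stand-in for the MockDBRow interface; only two accessors are modelled.
interface SketchRow extends Comparable<SketchRow> {

    long id();

    boolean isSyslog();
}

final class NonSyslogSketchRow implements SketchRow {

    private static final boolean IS_SYSLOG = false;
    // The flag is the last comparison key, so rows that differ only in it are ordered.
    private static final Comparator<SketchRow> COMPARATOR = Comparator
            .comparingLong(SketchRow::id)
            .thenComparing(SketchRow::isSyslog);

    private final long id;

    NonSyslogSketchRow(final long id) {
        this.id = id;
    }

    @Override
    public long id() {
        return id;
    }

    @Override
    public boolean isSyslog() {
        return IS_SYSLOG;
    }

    @Override
    public int compareTo(final SketchRow other) {
        return COMPARATOR.compare(this, other);
    }

    @Override
    public boolean equals(final Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        final NonSyslogSketchRow other = (NonSyslogSketchRow) o;
        // The flag is constant per class, so this check always passes for the same class;
        // it is listed so that the flag is an explicit part of the equality contract.
        return id == other.id && isSyslog() == other.isSyslog();
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, IS_SYSLOG);
    }
}
```

With the flag in the comparator, a non-syslog row and a syslog row that share every other value no longer compare as equal, because `false` sorts before `true`.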

if (isSyslogFormat) {
    final EpochMicros epochMicros = new EpochMicros(rfc5424Timestamp);
    timestampBuilder
            .add("original", String.valueOf(rfc5424Frame.timestamp))

Member

rename original to rfc5424timestamp

Development

Successfully merging this pull request may close these issues.

implement epoch migration mode option

4 participants