-
Notifications
You must be signed in to change notification settings - Fork 9
Epoch migration mode #274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Epoch migration mode #274
Conversation
26276ff to
cae62ff
Compare
|
rebased, improved test to include multiple messages per s3 object to ensure only first message is selected for the result |
f26de61 to
28273f1
Compare
|
rebased |
50bb345 to
22b8e93
Compare
|
rebased |
src/main/java/com/teragrep/pth_06/task/s3/EpochMigrationRowConverter.java
Outdated
Show resolved
Hide resolved
| rowWriter.write(8, new EventToOrigin().asUTF8StringFrom(rfc5424Frame)); | ||
| } | ||
| else { | ||
| rowWriter.write(0, 0L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of 0, use the timestamp from the object path. path has always present elements which are parseable because of the object structure
example path is
2007/10-08/sc-99-99-14-110/f17/f17.logGLOB-2007100800.log.gz
segments 2007/10-08/... are extractable with
^(?<year>\d{4})/(?<month>\d{2})-(?<day>\d{2})/(?>.*)$There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time should be 00:00:00 of that day. we can then attempt extracting the hour from the path itself and assume europe/helsinki as the time. logGLOB-2007100800.log.gz includes the specific hour which could be extracted with
^(?>.*-)(?<year>\d{4})(?<month>\d{2})(?<day>\d{2})(?<hour>\d{2})(?>\..*)$There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
further thinking, please add this information to the metadata as well if it is available as it is much easier to do this extraction on pth_06 side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extraction now prioritizes hourly data when available at the end of the path; otherwise, it falls back to the date from the initial segments.
The path-derived timestamp is returned in the _time column for non-syslog events, with missing hour information defaulting to 00:00:00 in the Europe/Helsinki timezone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add information which one of these was used, path or hourformat
48c9631 to
47409b1
Compare
|
rabased |
…leanup unused objects
… JSON metadata payload for _raw column
…ct is non-syslog, use new mock objects
c5d8feb to
c85ddcf
Compare
|
rebased |
|
path extracted value now uses |
| if (LOGGER.isDebugEnabled()) { | ||
| LOGGER.debug("Parser syslog event <[{}]>", rfc5424Frame.toString()); | ||
| } | ||
| final RFC5424Timestamp rfc5424Timestamp = new RFC5424Timestamp(rfc5424Frame.timestamp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this uses rfc5424Frame even though the content might not be in syslog format, is there a test case for non syslog format object case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added non syslog format testing. This instantiates the timestamp here but only calls it later inside the isSyslog block, would it be better to move it inside the block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be more context coherent to move it inside it.
Tiihott
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were some missing object equality tests, methods that could be set to private and few smaller things to fix. The RowConverterImpl and EpochMigrationRowConverter classes could also use some refactoring but I think that should be done in a separate PR.
src/main/java/com/teragrep/pth_06/task/ArchiveMicroBatchInputPartitionReader.java
Outdated
Show resolved
Hide resolved
src/test/java/com/teragrep/pth_06/task/s3/EventMetadataTest.java
Outdated
Show resolved
Hide resolved
src/test/java/com/teragrep/pth_06/task/s3/PathExtractedTimestampTest.java
Show resolved
Hide resolved
src/main/java/com/teragrep/pth_06/task/s3/EpochMigrationRowConverter.java
Show resolved
Hide resolved
…format test for epoch migration testing
src/main/java/com/teragrep/pth_06/task/s3/EpochMigrationRowConverter.java
Show resolved
Hide resolved
| ); | ||
| readAttempted = true; | ||
| isSyslogFormat = false; | ||
| returnValue = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return value must be false, if this was not syslog format, right? there is no next frame to load then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a test case for this
| if (!readAttempted) { | ||
| try { | ||
| boolean nextResult = rfc5424Frame.next(); | ||
| readAttempted = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please move readAttempted out of the try block, if rfc5424Frame.next(); throws and this will not be set to true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a test case for this
| boolean nextResult = rfc5424Frame.next(); | ||
| readAttempted = true; | ||
| isSyslogFormat = nextResult; | ||
| returnValue = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returnValue must be set by the rfc5424Frame.next(); instead of manually setting to true.
isSyslogFormat can be set to true if rfc5424Frame.next(); does not throw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a test case for this
| "ParseException at object: <[{}]>/<[{}]>\n message: <{}>", bucket, path, | ||
| exception.getMessage() | ||
| ); | ||
| readAttempted = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setting readAttempted is not necessary in catch block when it is set before the try block
| if (LOGGER.isDebugEnabled()) { | ||
| LOGGER.debug("Parser syslog event <[{}]>", rfc5424Frame.toString()); | ||
| } | ||
| final RFC5424Timestamp rfc5424Timestamp = new RFC5424Timestamp(rfc5424Frame.timestamp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be more context coherent to move it inside it.
src/main/java/com/teragrep/pth_06/task/s3/EpochMigrationRowConverter.java
Show resolved
Hide resolved
| public final class MockDBNonSyslogRowImpl implements MockDBRow { | ||
|
|
||
| private final MockDBRow origin; | ||
| private final static Comparator<MockDBRow> COMPARATOR = Comparator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does not comprea isSyslog. that must be a private static final boolean field here to make it part of the comparison and that must be part of equals and hashCode
|
|
||
| @Override | ||
| public boolean isSyslog() { | ||
| return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this private static final boolean field and add to compareTo, equals and hashCode methods.
| if (isSyslogFormat) { | ||
| final EpochMicros epochMicros = new EpochMicros(rfc5424Timestamp); | ||
| timestampBuilder | ||
| .add("original", String.valueOf(rfc5424Frame.timestamp)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename original to rfc5242timestamp
Description
Adds epoch migration mode, which can be used to modify archive reads to return only the first event in each S3 object.
The goal is to return only the first event in the S3 object. The epoch value of this first event can then be used by the user to migrate missing epoch metadata in the S3 objects.
The event message — and consequently the resulting _raw column — is left empty, as it is not required in this mode.
General
Assertions
Testing Data
Statements
Java
Other
Code Quality