
Asdf Intermediate table entry flushing #48

Open
Miauwkeru wants to merge 13 commits into main from asdf-intermediate-data-flushing

Conversation

@Miauwkeru
Contributor

@Miauwkeru Miauwkeru commented Jan 21, 2026

No description provided.

Add functionality to search for specific index tables
This is for AsdfSnapshot, which includes the offset inside the block,
so the table is required for returning the table data and cleaning up
after itself
@Miauwkeru Miauwkeru force-pushed the asdf-intermediate-data-flushing branch from 2ba0059 to 08f4761 Compare January 28, 2026 10:41
@codecov

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 94.52055% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.89%. Comparing base (72b05e7) to head (e7f8a2f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
dissect/evidence/asdf/asdf.py 94.48% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #48      +/-   ##
==========================================
+ Coverage   71.44%   72.89%   +1.44%     
==========================================
  Files          23       23              
  Lines        1387     1487     +100     
==========================================
+ Hits          991     1084      +93     
- Misses        396      403       +7     
Flag Coverage Δ
unittests 72.89% <94.52%> (+1.44%) ⬆️

Flags with carried forward coverage won't be shown.

@Miauwkeru
Contributor Author

Intermediate flushing is added, along with logic to find the table entries and combine it with _table_fit.
The table_index struct contains information about the previous table and the indexes that were flushed to the current table. This allows a faster lookup when we are only interested in one specific index.

I can think of one limitation, though. Once the table has flushed all its contents to disk, duplicate data can be written. Maybe it is a good idea to have an additional asdf tool to remove this kind of duplication from the file.

@Miauwkeru Miauwkeru marked this pull request as ready for review January 28, 2026 10:46
@Miauwkeru Miauwkeru requested a review from Schamper January 28, 2026 10:46


@dataclass
class ReadEntry:
Member

What's this?

Contributor Author

There was a difference between the types used to insert into the table. So, to make sure the correct intent gets communicated, I created ReadEntry, as it looked more confusing when I used table_entry.file_size for data_offset.
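For illustration, a minimal sketch of the idea described here: a read-side dataclass whose field names state intent, instead of overloading on-disk c_asdf.table_entry fields (where e.g. file_size would double as data_offset). The field names below are hypothetical, not the actual PR definitions.

```python
from dataclasses import dataclass


# Hypothetical read-side view of a table entry. Each field name says what
# the value means when reading, rather than reusing on-disk field names.
@dataclass
class ReadEntry:
    offset: int       # offset of the block within the stream
    size: int         # size of the block in bytes
    data_offset: int  # where the block's data lives in the file


entry = ReadEntry(offset=0x1000, size=512, data_offset=0x2000)
```

A caller can then write `entry.data_offset` and mean exactly that, instead of repurposing an unrelated field.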

@@ -285,24 +418,13 @@ def _write_meta(self) -> None:

def _write_table(self) -> None:
Member

Maybe rename to flush.

return self._table.values()

def write(self, table_offset: int = -1) -> bytes:
"""Creates a table to be written to the fileheader"""
Member

This docstring doesn't seem to be accurate.

def values(self) -> ValuesView[list[T]]:
return self._table.values()

def write(self, table_offset: int = -1) -> bytes:
Member

Why not give this function a file-like object directly to write to?
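A sketch of what the reviewer suggests: have the method serialize straight into a file-like object instead of building and returning a bytes blob. The entry serialization below is a stand-in; the real code would write c_asdf structs.

```python
from io import BytesIO
from typing import BinaryIO


def write_table(entries: list[bytes], fh: BinaryIO) -> int:
    """Write all serialized table entries to fh; return the bytes written."""
    written = 0
    for entry in entries:
        written += fh.write(entry)
    return written


buf = BytesIO()
n = write_table([b"entry-1", b"entry-2"], buf)
```

This avoids materializing the whole table in memory and lets the caller decide where the data goes.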

footer = c_asdf.footer(
magic=FOOTER_MAGIC,
table_offset=self._table_offset,
table_offset=self._table.prev_table_offset,
Member

Maybe last_table_offset.

FOOTER_MAGIC = b"FT\xa5\xdf"
SPARSE_BYTES = b"\xa5\xdf"

DEFAULT_NR_OF_ENTRIES = 4 * 1024 * 1024 // len(c_asdf.table_entry)
Member

Nr of entries of what? Maybe DEFAULT_TABLE_SIZE.

def streams(self) -> Iterator[AsdfStream]:
"""Iterate over all streams in the file."""
for i in sorted(self.table.keys()):
for i in sorted(self.table._table.keys()):
Member

Surely this can be nicer.
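One way this could be made nicer: let Table itself be iterable over sorted stream indexes, so callers never reach into the private `_table` dict. A toy sketch, not the PR's actual Table class:

```python
# Toy Table that maps stream indexes to lists of entries and hides its
# internal dict behind __iter__, so streams() can just write `for i in table`.
class Table:
    def __init__(self) -> None:
        self._table: dict[int, list] = {}

    def add(self, idx: int, entry: object) -> None:
        self._table.setdefault(idx, []).append(entry)

    def __iter__(self):
        # Yield stream indexes in sorted order.
        return iter(sorted(self._table))


table = Table()
for idx in (2, 0, 7):
    table.add(idx, object())
```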

T = TypeVar("T", ReadEntry, c_asdf.table_entry)


class Table(Generic[T]):
Member

I don't really understand how any of this code works, what it does and what the purpose is of the added structures.

Contributor Author

To try and explain my reasoning a bit for Table.

What I wanted to do was to create a single point where the table entries live for both ASDFWriter and ASDFSnapshot as both did similar things when either reading or writing data.

As for the added structures, c_asdf.table_index was created for the purpose of locating the previous flushed tables.

struct table_index {
    uint64      prev_table;     // Offset of the previous table FFFFFFFFF denotes last table
    uint64      size;           // Amount of bytes of the table
    uint64      indexes[4];     // Which table entries are inside
};

While testing the worst-case (and unrealistic) maximum of 1 table entry, there were some issues with the tests because the lookup offsets no longer matched the test data. This gave birth to Table.lookup, which looks inside the previous tables and retrieves the offsets from there.

To speed up that process I added indexes to table_index, which stores the stream indexes contained inside the table as a 256-bit bitmap. We can reuse these indexes to search only for the specific stream indexes required by ASDFStream. Although this might be thinking too far ahead.

Of course I could have made table_index.indexes dynamic by using a buffer, but I thought a consistent size of the structure would be more beneficial.

When looking up data, we need to know what other tables are available, which is why I added _table_offsets: to keep track of any flushed tables indicated by c_asdf.table_index.

And I should have probably added such an explanation to Table in the first place.
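The bitmap encoding described above can be sketched as follows, mirroring the `sum(1 << key for key in self._table)` snippet quoted later in this review. `OFFSET_MASK` is assumed here to be the all-ones uint64 mask; `has_index` is a hypothetical helper for the lookup side.

```python
OFFSET_MASK = 0xFFFFFFFFFFFFFFFF  # assumed: mask for one uint64 word


def pack_indexes(stream_indexes: set[int]) -> list[int]:
    # Set one bit per stream index, then split the 256-bit bitmap into
    # four uint64 words, matching table_index.indexes[4] on disk.
    bitmap = sum(1 << idx for idx in stream_indexes)
    return [(bitmap >> (word * 64)) & OFFSET_MASK for word in range(256 // 64)]


def has_index(words: list[int], idx: int) -> bool:
    # Check whether a stream index is present in a packed bitmap.
    return bool(words[idx // 64] & (1 << (idx % 64)))


words = pack_indexes({0, 3, 70})
```

With this encoding, deciding whether a flushed table can contain a given stream index is a single bit test, without parsing the table's entries.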

@Miauwkeru
Contributor Author

I added documentation to the classes and such.

@Miauwkeru Miauwkeru requested a review from Schamper February 19, 2026 15:05

// A structure to keep track of previously flushed tables
struct table_index {
uint64 prev_table; // Offset of the previous table 0xFFFFFFFF_FFFFFFF denotes last table
Member

Suggested change
uint64 prev_table; // Offset of the previous table 0xFFFFFFFF_FFFFFFF denotes last table
uint64 prev_table; // Offset of the previous table, 0xFFFFFFFF_FFFFFFF denotes the last table

struct table_index {
uint64 prev_table; // Offset of the previous table 0xFFFFFFFF_FFFFFFF denotes last table
uint64 size; // Amount of bytes of the table
uint64 indexes[4]; // Which stream indexes are available inside the table
Member

Expand the docstring on how this is stored. Based on this I assume there's only 4 stream indexes available, each stored as a uint64.

indexes = sum(1 << key for key in self._table)
return [(indexes >> (x * 64)) & OFFSET_MASK for x in range(256 // 64)]

def lookup(self, idx: int, fh: BinaryIO) -> list[int]:
Member

This is unused?

table_offset = self.fh.tell()
table_index = c_asdf.table_index(self.fh)
table_offsets.append((table_offset, table_index))
if table_index.prev_table == OFFSET_MASK:
Member

This constant is confusingly named.
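For context, the loop quoted above walks the chain of flushed tables via prev_table. A toy sketch of that traversal, assuming a simplified 16-byte header of two little-endian uint64s (prev_table, size) and an all-ones sentinel; the real c_asdf.table_index also carries the indexes bitmap.

```python
import struct
from io import BytesIO

SENTINEL = 0xFFFFFFFFFFFFFFFF  # assumed "no previous table" marker

# Build a toy file with two chained headers:
# the first table at offset 0 (no predecessor), the second at offset 16.
buf = BytesIO()
buf.write(struct.pack("<QQ", SENTINEL, 100))  # table 1: prev=SENTINEL, size=100
buf.write(struct.pack("<QQ", 0, 200))         # table 2: prev=0, size=200


def walk_tables(fh, last_offset: int) -> list[tuple[int, int]]:
    # Follow prev_table pointers from the newest table back to the oldest,
    # collecting (offset, size) pairs along the way.
    offsets = []
    offset = last_offset
    while True:
        fh.seek(offset)
        prev_table, size = struct.unpack("<QQ", fh.read(16))
        offsets.append((offset, size))
        if prev_table == SENTINEL:
            return offsets
        offset = prev_table


chain = walk_tables(buf, 16)
```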



class Table(Generic[T]):
"""A single point for the table entries to get collected for reading and writing."""
Member

Having one class be responsible for both reading and writing of the table feels a little awkward. Awkward new dataclasses are introduced and both APIs end up just slightly awkward.
