Investigate keeping processes around #5

@gtoonstra

Description

The map/reduce examples have clear boundaries between startup, reading data, processing data, and writing it out to disk. The process lifetime doesn't extend beyond those boundaries, so the cost of disk I/O is paid on every run.

Similar to Apache Spark, keeping intermediate data in memory avoids disk access, which can improve performance. It is important to realize that the processing boundaries themselves don't change; only where the data lives (disk vs. memory) does. The only difference is that at the moment where the mapper (for example) would write a partition to disk and exit, it instead stays around to wait for queries to be executed against the data in its partitions.
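A minimal sketch of the idea (all names here are hypothetical, not from the existing code): instead of writing its partition to disk and exiting, the mapper keeps the partition in memory and serves queries against it on demand.

```python
class ResidentPartition:
    """Holds one partition's rows in memory and serves queries on demand,
    instead of writing the rows to disk and exiting."""

    def __init__(self, rows):
        # The partition stays resident in memory rather than on disk.
        self.rows = list(rows)

    def query(self, func):
        # Execute an arbitrary function against the in-memory partition.
        return func(self.rows)


# The mapper produces its partition once, then stays around for queries.
partition = ResidentPartition([("a", 1), ("b", 2), ("a", 3)])

total = partition.query(lambda rows: sum(v for _, v in rows))
print(total)  # 6
```

In a real process the mapper would block on a socket or queue between queries; the class above only illustrates the lifetime change.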

What's left is to figure out how to express the functions to be executed against the data (which may be in any format) in a consistent way. Most of them are aggregation functions:

  • sum
  • group?
  • etc
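One possible way to express these consistently (a sketch under assumptions, not the project's actual API): every aggregation takes an iterable of key/value pairs, so new functions can be registered by name and applied uniformly regardless of the underlying data format.

```python
from collections import defaultdict

def agg_sum(pairs):
    """Sum all values across the partition."""
    return sum(v for _, v in pairs)

def agg_group(pairs):
    """Group values by key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

# A registry keeps the framework's dispatch generic while the
# functions themselves stay simple and readable.
AGGREGATIONS = {"sum": agg_sum, "group": agg_group}

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(AGGREGATIONS["sum"](pairs))    # 6
print(AGGREGATIONS["group"](pairs))  # {'a': [1, 3], 'b': [2]}
```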

Joins are a lot harder to achieve. Maybe the mapper/reducer process itself can implement specific functions that dictate how this is done, so that the framework doesn't become overly generic and hard to read.
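As a sketch of what such a process-specific function could look like (hypothetical, not an existing implementation), a resident process holding one partition could offer a naive hash join against another partition's pairs:

```python
def hash_join(left, right):
    """Naive hash join on key: build a lookup table from the left
    partition, then probe it with the right partition's pairs."""
    table = {}
    for k, v in left:
        table.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, rv in right for lv in table.get(k, [])]

result = hash_join([("a", 1), ("b", 2)], [("a", 10), ("c", 30)])
print(result)  # [('a', 1, 10)]
```

Keeping the join as a named function on the mapper/reducer, rather than a generic framework primitive, matches the readability concern above.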
