Skip to content

Redundancy lookup : keep only maximal patterns #93

@Siorki

Description

@Siorki

Issue spun off from first point of #7

selection of the best substrings only, not all redundant strings. "Best" means that any string contained within another one, with the same redundancy count is discarded. To achieve this, always try to extend the string on each side until the copies count goes down

For instance, when performing forward lookup on getData(0,0) :

  • ge : 20 matches
  • get : 20 matches
  • getD : 6 matches
  • getDa : 6 matches
  • getDat : 6 matches
  • getData : 6 matches
  • getData( : 6 matches
  • getData(0 : 2 matches
  • getData(0, : 2 matches
  • getData(0,0 : 1 match (i.e. no other occurrence)

All occurrences of getD (6 of them) are also occurrences of getData(. Said otherwise, all occurrences of getD are followed by ata(. Thus it makes sense to pack, for the same cost of one token, getData( instead of getD. Going further, it makes sense not to keep getD in the candidate patterns. Same goes for ge and a few others and we shall only retain :

  • get : 20 matches
  • getData( : 6 matches
  • getData(0, : 2 matches

This should improve packing speed by testing fewer candidates, and also take care of #46 by lowering the memory footprint. Further testing shall make sure it does not worsen the compression.

Metadata

Metadata

Assignees

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions