Skip to content

Add maximum feature width threshold #76

@hamishmorgan

Description

@hamishmorgan

The width of a feature --- as opposed to frequency --- is the number of unique entries it occurs with. E.g the feature mod:red occurs with the entry bus 3 times and apple 7 times, it's frequency is 10 but it's width is 2.

Due to the inverted-index implementation of the all-pairs algorithms, filtering features with a width greater than some threshold, bounds the quadratic element of the algorithm to the threshold. This should have a dramatic impact on time complexity. It may also work as a statistical stop-word filter.

This technique is discussed in:

Rychly and Kilgarriff (2007) An efficient algorithm for building a distributional thesaurus. Proc ACL. Prague, Czech Republic. http://kilgarriff.co.uk/Publications/2007-RychlyKilg-ACL-thesauruses.pdf

Implementation will require a number of substantial changes:

For consistency the width filtering should also be applicable to entries. In addition min width, and max frequency filters should be added. This will result in the following addition parameters:

  • filter.entry.width.min
  • filter.entry.width.max
  • filter.feature.width.min
  • filter.feature.width.max
  • filter.entry.freq.max
  • filter.feature.freq.max

Resulting in a total (count based) filtering set:

  • filter.entry.width.min (Default: 2)
  • filter.entry.width.max (Default: +Infinity)
  • filter.feature.width.min (Default: 2)
  • filter.feature.width.max (Default: +Infinity)
  • filter.entry.freq.min (Default: 2)
  • filter.entry.freq.max (Default: +Infinity)
  • filter.feature.freq.min (Default: 2)
  • filter.feature.freq.max (Default: +Infinity)
  • filter.event.freq.min (Default: 2)
  • filter.event.freq.max (Default: +Infinity)

The width counts must be re-introduced during entries and features counting stage. A non-trivial task since the counts can not simply be summed like frequencies can.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions