-
Notifications
You must be signed in to change notification settings - Fork 8
Description
The width of a feature --- as opposed to frequency --- is the number of unique entries it occurs with. E.g the feature mod:red occurs with the entry bus 3 times and apple 7 times, it's frequency is 10 but it's width is 2.
Due to the inverted-index implementation of the all-pairs algorithms, filtering features with a width greater than some threshold, bounds the quadratic element of the algorithm to the threshold. This should have a dramatic impact on time complexity. It may also work as a statistical stop-word filter.
This technique is discussed in:
Rychly and Kilgarriff (2007) An efficient algorithm for building a distributional thesaurus. Proc ACL. Prague, Czech Republic. http://kilgarriff.co.uk/Publications/2007-RychlyKilg-ACL-thesauruses.pdf
Implementation will require a number of substantial changes:
For consistency the width filtering should also be applicable to entries. In addition min width, and max frequency filters should be added. This will result in the following addition parameters:
filter.entry.width.minfilter.entry.width.maxfilter.feature.width.minfilter.feature.width.maxfilter.entry.freq.maxfilter.feature.freq.max
Resulting in a total (count based) filtering set:
filter.entry.width.min(Default: 2)filter.entry.width.max(Default: +Infinity)filter.feature.width.min(Default: 2)filter.feature.width.max(Default: +Infinity)filter.entry.freq.min(Default: 2)filter.entry.freq.max(Default: +Infinity)filter.feature.freq.min(Default: 2)filter.feature.freq.max(Default: +Infinity)filter.event.freq.min(Default: 2)filter.event.freq.max(Default: +Infinity)
The width counts must be re-introduced during entries and features counting stage. A non-trivial task since the counts can not simply be summed like frequencies can.