Optimise token frequency performance #3
I've implemented some optimisations for token frequency performance, addressing the main bottlenecks.
This assumes that most users only care about the top 500-1000 most frequent tokens anyway; the original approach spent most of its computation on extremely rare tokens that provide little analytical value.
The performance gain grows with the size of the vocabulary (and hence with the dataset): complexity improves from O(n x v²) to O(n x k), where n is the total number of tokens, v is the full vocabulary size, and k = 1000 is the number of retained tokens. Runtime therefore stays consistent regardless of how large the vocabulary becomes, and the optimisation cuts the vocabulary carried into the statistical analysis by 97%.
Expected performance improvements:
Implementation Details:
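
The actual changes are in the diff; purely as an illustrative sketch of the top-k truncation described above (the function name, the k=1000 default, and the assumption that documents arrive as lists of token strings are mine, not code from this repo):

```python
from collections import Counter

def top_k_token_frequencies(documents, k=1000):
    """Count every token once, then keep only the k most frequent.

    `documents` is assumed to be an iterable of token lists,
    e.g. [["the", "cat", "sat"], ["the", "dog"]].
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)          # single O(n) pass over all tokens
    # Heap-based selection of the k most frequent tokens: O(v log k).
    # Everything downstream then touches only these k tokens, which is
    # where the O(n x v²) -> O(n x k) improvement comes from.
    return dict(counts.most_common(k))
```

Usage would be something like `freqs = top_k_token_frequencies(corpus)`, with the subsequent statistics operating on at most k tokens.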
This addresses the regression from commit 4f8f2d0, where switching to "full Stats results" significantly impacted performance, and it makes the DTM implementation (Issue #9) feasible.
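
Issue #9 itself isn't part of this PR, but to illustrate why the smaller vocabulary makes a document-term matrix tractable, here is a rough sketch (the `build_dtm` name, the scipy dependency, and the input shapes are assumptions for illustration only): with k = 1000 the matrix is at most n_docs x 1000 columns, regardless of how many distinct tokens the corpus contains.

```python
from collections import Counter
from scipy.sparse import csr_matrix

def build_dtm(documents, top_tokens):
    """Build a sparse document-term matrix over a fixed, small vocabulary.

    `top_tokens` is the dict of the k most frequent tokens (see the sketch
    above); `documents` is again an iterable of token lists.
    """
    documents = list(documents)
    vocab = {token: col for col, token in enumerate(top_tokens)}
    rows, cols, vals = [], [], []
    for row, tokens in enumerate(documents):
        # Count only tokens that survived the top-k cut.
        counts = Counter(tok for tok in tokens if tok in vocab)
        for tok, count in counts.items():
            rows.append(row)
            cols.append(vocab[tok])
            vals.append(count)
    return csr_matrix((vals, (rows, cols)),
                      shape=(len(documents), len(vocab)))
```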
Ready for testing and merge.