NGram counter for large corpuses
You can install the package using the following steps:
Run pip install from an admin prompt:
pip uninstall VLNGramCounter -y
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/VLNGramCounter.git
Or, if you have the code locally:
pip uninstall VLNGramCounter -y
python -OO -m pip install -v c:/repos/TextCorpusLabs/VLNGramCounter
Counts the n-grams contained in a folder of TXT files.
VLNGramCounter -source d:/data/corpus -dest d:/data/corpus.ngrams.csv
The following are required parameters:
- source is the folder containing the TXT files.
- dest is the CSV file used to store the n-gram results.
The following are optional parameters:
- length is the length of the n-gram. The default is 1.
- chunk_size is the number of items held by the control structure before chunking. Higher values use more RAM, but compute the overall result faster. The default is 1M.
- include counts only the values in this CSV list. The default is to count everything.
- exclude ignores the values in this CSV list. The default is to exclude nothing. Note: due to the order of operations, it only makes sense to exclude single tokens.
- cutoff is the minimum count an n-gram needs to be kept. The default is 2.
- top is the number of n-grams to save. The default is to keep 10K.
- keep_case (flag) keeps the casing as-is before converting to tokens for counting. The default is to upper case everything.
- keep_punct (flag) keeps all punctuation as-is before converting to tokens for counting. The default is to remove all tokens that are only punctuation.
NOTE: The order of operations for complex counting is as follows:
- Transformation (keep_case)
- Exclusion (keep_punct > exclude)
- Inclusion (include)
- Filter (cutoff > top)
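The order of operations above can be illustrated with a minimal Python sketch. This is not the package's actual implementation: the function name count_ngrams, whitespace tokenization, and matching include against space-joined n-grams are all assumptions made for illustration. It does show why exclude only makes sense for single tokens: exclusion runs on tokens before any n-gram is formed.

```python
import string
from collections import Counter

def count_ngrams(lines, length=1, include=None, exclude=None,
                 cutoff=2, top=10_000, keep_case=False, keep_punct=False):
    """Hypothetical sketch of the documented order of operations."""
    exclude = set(exclude or [])
    counts = Counter()
    for line in lines:
        # 1. Transformation: upper case everything unless keep_case is set.
        if not keep_case:
            line = line.upper()
        tokens = line.split()  # assumption: simple whitespace tokenization
        # 2. Exclusion: drop punctuation-only tokens, then excluded tokens.
        if not keep_punct:
            tokens = [t for t in tokens
                      if not all(ch in string.punctuation for ch in t)]
        tokens = [t for t in tokens if t not in exclude]
        # Build n-grams from the tokens that survived exclusion.
        for i in range(len(tokens) - length + 1):
            gram = " ".join(tokens[i:i + length])
            # 3. Inclusion: count only listed n-grams, if a list was given.
            if include is None or gram in include:
                counts[gram] += 1
    # 4. Filter: apply the cutoff first, then keep the top entries.
    kept = [(g, c) for g, c in counts.most_common() if c >= cutoff]
    return kept[:top]

if __name__ == "__main__":
    sample = ["the cat sat .", "The cat ran !"]
    print(count_ngrams(sample, length=2))
```

Because the transformation step runs first, values passed to exclude or include in this sketch must already be upper case unless keep_case is set.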
The code in this repo is set up as a module. Debugging and testing assume that the module is already installed. In order to debug (F5) or run the tests (Ctrl + ; Ctrl + A), make sure to install the module as editable (see below).
pip uninstall VLNGramCounter -y
python -m pip install -e c:/repos/TextCorpusLabs/VLNGramCounter
When debugging in VSCode for the first time, consider adding the config below to the launch.json file.
"args": [
  "-source", "d:/data/corpus",
  "-dest", "d:/data/corpus.ngrams.csv",
  "-length", "1"
]