Running the Software

Running using byblo.sh

This script is designed to be the primary point of usage for the thesaurus building software. It runs a complete build process from frequency counting to K-Nearest-Neighbours in a single pass, providing all the most commonly used functionality of the underlying software.

$ ./byblo.sh [<options>] [@<config>] -i <file>

Where the arguments are:

<file> Input instances file containing head/contexts pairs.
@<config> Options and input files can be read from a file specified directly after an '@' character. Options in this file should be specified exactly as they would be at the command line, and may contain additional @ references to other config files.
<options> Any number of the option switches specified [below][General Options].
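As a rough illustration of the @<config> mechanism, the sketch below writes two small config files, one referencing the other, and assembles the matching command line. The file names (common.conf, build.conf, instances.txt), the `run` helper, and the one-token-per-line layout are assumptions made for this example, not something the software is known to require. The helper prints the command and only executes it when ./byblo.sh is present in the current directory.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command, then run it only if its script
# is executable in the current directory.
run() {
    echo "$*"
    if [ -x "$1" ]; then "$@"; fi
}

# Hypothetical file names throughout; a scratch directory keeps things tidy.
conf_dir=$(mktemp -d)

# Shared settings, one token per line (a conservative layout assumption).
cat > "$conf_dir/common.conf" <<EOF
--charset
UTF-8
EOF

# Build-specific settings, pulling in the shared file with a nested @ reference.
cat > "$conf_dir/build.conf" <<EOF
@$conf_dir/common.conf
--threads
4
EOF

# General form: [<options>] [@<config>] -i <file>
run ./byblo.sh "@$conf_dir/build.conf" -i instances.txt
```

Keeping shared settings in one file and per-build settings in another means a single experiment can be repeated without retyping a long option list.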

Except where noted, the following options apply to all steps within the build process:

-c <name>, --charset <name> Set the character encoding to use when reading and writing all files. Default is UTF-8.
-h, --help Display usage information and exit.
-o <path>, --output <path> Directory that will contain all (non-temporary) output files. The default is the current working directory.

The device holding the output directory should have a large amount of free space. The exact amount depends on the task, but a typical thesaurus can require hundreds of gigabytes, or even terabytes.
-t <num>, --threads <num> Number of concurrent threads to use during processes that can be parallelised (i.e. all-pairs, sorting, and KNN). The default is chosen to make good use of the available cores; usually the number of cores plus 1.
-T <path>, --temp-dir <path> Path to the directory that will store temporary files used during certain processes: filtering, sorting, and KNN. The device holding this directory should have an amount of free space similar to that of the output directory (see above). In addition, I/O performance has a significant impact on processing time, so local storage is preferable. The default is to create a temporary sub-directory inside the output directory.
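The general options can be combined as in the following sketch, which checks the free space available to the output and temporary directories before building the command. The directory locations, the thread count, instances.txt, and the `run` helper are placeholders chosen for illustration. Creating the directories beforehand is a precaution of this sketch rather than a documented requirement.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command, then run it only if its script
# is executable in the current directory.
run() {
    echo "$*"
    if [ -x "$1" ]; then "$@"; fi
}

# Hypothetical locations; in practice pick directories on large, fast local disks.
base=$(mktemp -d)
out_dir="$base/thesaurus"
tmp_dir="$base/scratch"
mkdir -p "$out_dir" "$tmp_dir"

# Report free space (in 1024-byte blocks) on the devices holding both directories.
df -Pk "$out_dir" "$tmp_dir" || echo "could not query free space"

# Explicit charset, thread count, output directory and temporary directory.
run ./byblo.sh -c UTF-8 -t 4 -o "$out_dir" -T "$tmp_dir" -i instances.txt
```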

Filtering Options

The following options configure the input filtering stage of the build process. They can be used to narrow down the area being investigated. Filtering also improves overall performance, since data is removed before the more computationally demanding later stages of the process.

-fef <num>, --filter-entry-freq <num> Accept only those entries with a frequency greater than or equal to <num>. This filter is generally used to improve the quality of the thesaurus, since low frequency entries are unlikely to be described accurately.
-fep <exp>, --filter-entry-pattern <exp> Accept only those entries that match the given Perl style regular expression <exp>. This filter is generally used to select only particular syntactic classes of entries, such as those composed entirely of alphabetical characters.
-few <file>, --filter-entry-wordlist <file> Accept only those entries that match exactly a line within the given <file>. This filter is used to investigate the distributional similarity of some small subset of words.
-fff <num>, --filter-feature-freq <num> Accept only those features with a frequency greater than or equal to <num>. This filter is generally used to improve the performance of the thesaurus build process: low frequency features are unlikely to have a significant effect on the distributional similarity of the heads they describe, and only serve to slow down the process.
-ffp <exp>, --filter-feature-pattern <exp> Accept only those context strings that match the given Perl style regular expression <exp>. It can be used as either a performance aid or an investigatory tool.
-ffw <file>, --filter-feature-wordlist <file> Accept only those features that are an exact match to a line within the given <file>.
-fvf <num>, --filter-event-freq <num> Accept only those events (entry/feature pair observations) with a frequency greater than or equal to <num>. This filter is generally used to improve the quality of the thesaurus, since low frequency events are unlikely to be very discriminative.
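The filters above might be combined as follows. This is only a sketch: the thresholds, the pattern, the word list contents, instances.txt, and the `run` helper are illustrative assumptions, and the long option names are used to keep the intent readable. The regular expression is single-quoted so that the shell passes it through unchanged (the printed command shows it without the quotes).

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command, then run it only if its script
# is executable in the current directory.
run() {
    echo "$*"
    if [ -x "$1" ]; then "$@"; fi
}

# Hypothetical word list: one entry per line, matched exactly.
wordlist=$(mktemp)
printf '%s\n' apple banana cherry > "$wordlist"

# Illustrative thresholds only: keep alphabetic entries seen at least 10 times,
# features seen at least 5 times, and events seen at least twice.
run ./byblo.sh \
    --filter-entry-freq 10 \
    --filter-entry-pattern '^[a-zA-Z]+$' \
    --filter-feature-freq 5 \
    --filter-event-freq 2 \
    -i instances.txt

# Alternatively, restrict the build to the three entries in the word list.
run ./byblo.sh --filter-entry-wordlist "$wordlist" -i instances.txt
```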

Counting Options

The counting stage no longer has any additional options.

All-pairs Options

The following options affect the all-pairs similarity search part of the thesaurus build process.

-m <name>, --measure <name> Proximity measure to compare heads with. Commonly used examples include: Lin, Jaccard, CRMI, RecallMI. The default is Lin.
--measure-reversed For asymmetric measures, calculate the inverse similarities, i.e. reverse the operands of the similarity function: sim(q,r) becomes sim(r,q). Has no effect with symmetric measures.
-Smn <num>, --similarity-min <num> Minimum similarity threshold over which resultant pairs will be output. This option is generally used for performance purposes. Without this option the size of the output pairs file can be quadratic with respect to the number of entries. Default is negative-infinity.
-Smx <num>, --similarity-max <num> Maximum similarity threshold under which resultant pairs will be output. Typically used instead of --similarity-min for distance based measures, where smaller values indicate greater similarity. Default is positive-infinity.

Additional options for specific measures:

--crmi-beta <num> For CRMI measure. Value should be in the range 0 to 1 inclusive. Default is 1.
--crmi-gamma <num> For CRMI measure. Value should be in the range 0 to 1 inclusive. Default is 0.
--lee-alpha <num> For Lee measure. Value should be in the range 0 to 1 exclusive. Default is 0.99.
--mink-p <num> For Lp measure. Value should be an integer in the range 1 to infinity. Default is 2.
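A couple of invocations using the all-pairs options could look like the sketch below. The measure names come from the list above, while the threshold and parameter values, instances.txt, and the `run` helper are arbitrary choices for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command, then run it only if its script
# is executable in the current directory.
run() {
    echo "$*"
    if [ -x "$1" ]; then "$@"; fi
}

# Jaccard similarity, discarding weakly similar pairs to keep the output small.
# The 0.1 threshold is an arbitrary example value.
run ./byblo.sh -m Jaccard --similarity-min 0.1 -i instances.txt

# CRMI with explicit beta and gamma parameters (both within the 0 to 1 range).
run ./byblo.sh -m CRMI --crmi-beta 0.5 --crmi-gamma 0.5 -i instances.txt
```

Setting a minimum similarity is usually the first thing to try when the pairs file grows too large, since the unfiltered output can be quadratic in the number of entries.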

Sorting and K-Nearest-Neighbours Options

-k <num> For each resultant head string, produce its <num> nearest neighbours. Default is 100.
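For instance, a smaller neighbour list can be requested as sketched here. The value 10, instances.txt, and the `run` helper are assumptions made for the example.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command, then run it only if its script
# is executable in the current directory.
run() {
    echo "$*"
    if [ -x "$1" ]; then "$@"; fi
}

# Keep only the 10 nearest neighbours of each head instead of the default 100.
k=10
run ./byblo.sh -m Lin -k "$k" -i instances.txt
```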

Running using tools.sh

For most users the byblo.sh method (described above) is sufficient to produce a thesaurus. However, it is also possible to run specific stages of the pipeline with greater control using the tools.sh script.
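As a sketch of what running individual stages might look like, the commands below chain the count, allpairs and sort-sims commands using only options that appear in the tools.sh usage listing. The file names, the measure and threshold, and the `run` helper are hypothetical. This is not necessarily the exact sequence byblo.sh performs: filtering and any intermediate sorting that a full build may need are left out.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command, then run it only if its script
# is executable in the current directory.
run() {
    echo "$*"
    if [ -x "$1" ]; then "$@"; fi
}

# Hypothetical file names for the input and the intermediate results.
input=instances.txt
entries=instances.entries
features=instances.features
events=instances.events
sims=instances.sims

# 1. Count entry, feature and entry-feature (event) frequencies.
run ./tools.sh count -i "$input" -oe "$entries" -of "$features" -oef "$events"

# 2. Compute pairwise similarities from the event frequency vectors.
run ./tools.sh allpairs -i "$events" -ie "$entries" -if "$features" \
    -o "$sims" -m Jaccard -Smn 0.1

# 3. Sort the resulting similarity file.
run ./tools.sh sort-sims -i "$sims" -o "$sims.sorted"
```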

See the (rather long) usage:

$ ./tools.sh --help
Usage: Tools [options] [command] [command options]
  Options:
    -h, --help   Display this help message.
                 Default: false
  Commands:
    allpairs      Usage: allpairs [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
              --crmi-beta              Beta paramter to Weed's CRMI measure.
                                       Default: 0.5
              --crmi-gamma             Gamma parameter to Weed's CRMI measure.
                                       Default: 0.5
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
          -ip, --identity-pairs        Produce similarity between pair of
                                       identical entries.
                                       Default: false
        * -i, --input                  Event frequency vectors files.
          -ie, --input-entries         Entry frequencies file
          -if, --input-features        Feature frequencies file
              --lee-alpha              Alpha parameter to Lee's alpha-skew
                                       divergence measure.
                                       Default: 0.99
          -m, --measure                Similarity measure to use.
                                       Default: Lin
              --measure-reversed       Swap similarity measure inputs.
                                       Default: false
              --mink-p                 P parameter to Minkowski/Lp space
                                       measure.
                                       Default: 2.0
        * -o, --output                 Output similarity matrix file.
          -Smx, --similarity-max       Maximum similarity threshold.
                                       Default: Infinity
          -Smn, --similarity-min       Minimum similarity threshold.
                                       Default: -Infinity
          -t, --threads                Number of concurrent processing threads.
                                       Default: 9

    count      Usage: count [options]      
        Options:
          -c, --charset                   The character set encoding to use for
                                          both reading input and writing output
                                          files.
                                          Default: UTF-8
          -Xe, --entries-index-file       Index file for enumerating entries.
          -Ee, --enumerated-entries       Whether tokens in the first column of
                                          the input file are indexed.
                                          Default: false
          -Ef, --enumerated-features      Whether entries in the second column
                                          of the input file are indexed.
                                          Default: false
          -et, --enumerator-type          Options: [Memory, JDBC]
                                          Default: Memory
          -Xf, --features-index-file      Index file for enumerating features.
          -h, --help                      Display this help message.
                                          Default: false
        * -i, --input                     Input instances file
        * -oe, --output-entries           Output entry frequencies file
        * -oef, --output-entry-features   Output entry-feature frequencies file
        * -of, --output-features          Output feature frequencies file
          -T, --temporary-directory       Directory used for holding temporary
                                          files.
                                          Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
          -t, --threads                   Number of threads to use.
                                          Default: 8

    sort-events      Usage: sort-events [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.
          -r, --reverse                Reverse the result of comparisons.
                                       Default: false
          -T, --temporary-directory    Directory which will be used for storing
                                       temporary files.
                                       Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
          -t, --threads                Number of threads to use.
                                       Default: 8

    unindex-instances      Usage: unindex-instances [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    merge-sims      Usage: merge-sims [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -ifa, --input-file-a         The first file to merge.
        * -ifb, --input-file-b         The second file to merge.
        * -of, --output-file           The output file to which both input will
                                       be merged.
          -r, --reverse                Reverse the result of comparisons.
                                       Default: false

    unindex-features      Usage: unindex-features [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    index-events      Usage: index-events [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    sort-sims      Usage: sort-sims [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.
          -r, --reverse                Reverse the result of comparisons.
                                       Default: false
          -T, --temporary-directory    Directory which will be used for storing
                                       temporary files.
                                       Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
          -t, --threads                Number of threads to use.
                                       Default: 8

    unindex-neighbours      Usage: unindex-neighbours [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    index-sims      Usage: index-sims [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    index-neighbours      Usage: index-neighbours [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    merge-ents      Usage: merge-ents [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -ifa, --input-file-a         The first file to merge.
        * -ifb, --input-file-b         The second file to merge.
        * -of, --output-file           The output file to which both input will
                                       be merged.
          -r, --reverse                Reverse the result of comparisons.
                                       Default: false

    index-entries      Usage: index-entries [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    merge-feats      Usage: merge-feats [options]      
        Options:
          -c, --charset            The character set encoding to use for both
                                   reading input and writing output files.
                                   Default: UTF-8
          -E, --enumerated         Whether tokens in the input file are
                                   enumerated.
                                   Default: false
          -et, --enumerator-type   Options: [Memory, JDBC]
                                   Default: Memory
          -h, --help               Display this help message.
                                   Default: false
          -X, --index-file         Index for the string tokens.
        * -ifa, --input-file-a     The first file to merge.
        * -ifb, --input-file-b     The second file to merge.
        * -of, --output-file       The output file to which both input will be
                                   merged.
          -r, --reverse            Reverse the result of comparisons.
                                   Default: false

    index      Usage: index [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    merge-events      Usage: merge-events [options]      
        Options:
          -c, --charset            The character set encoding to use for both
                                   reading input and writing output files.
                                   Default: UTF-8
          -E, --enumerated         Whether tokens in the input file are
                                   enumerated.
                                   Default: false
          -et, --enumerator-type   Options: [Memory, JDBC]
                                   Default: Memory
          -h, --help               Display this help message.
                                   Default: false
          -X, --index-file         Index for the string tokens.
        * -ifa, --input-file-a     The first file to merge.
        * -ifb, --input-file-b     The second file to merge.
        * -of, --output-file       The output file to which both input will be
                                   merged.
          -r, --reverse            Reverse the result of comparisons.
                                   Default: false

    sort-feats      Usage: sort-feats [options]      
        Options:
          -c, --charset               The character set encoding to use for both
                                      reading input and writing output files.
                                      Default: UTF-8
          -E, --enumerated            Whether tokens in the input file are
                                      enumerated.
                                      Default: false
          -et, --enumerator-type      Options: [Memory, JDBC]
                                      Default: Memory
          -h, --help                  Display this help message.
                                      Default: false
          -X, --index-file            Index for the string tokens.
        * -i, --input                 Source file that will be read
        * -o, --output                Destination file that will be writen to.
          -r, --reverse               Reverse the result of comparisons.
                                      Default: false
          -T, --temporary-directory   Directory which will be used for storing
                                      temporary files.
                                      Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
          -t, --threads               Number of threads to use.
                                      Default: 8

    unindex-events      Usage: unindex-events [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read
        * -o, --output                 Destination file that will be writen to.

    index-features      Usage: index-features [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read.
        * -o, --output                 Destination file that will be written to.

    index-instances      Usage: index-instances [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read.
        * -o, --output                 Destination file that will be written to.

    knn-sims      Usage: knn-sims [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read.
        * -o, --output                 Destination file that will be written to.
          -r, --reverse                Reverse the result of comparisons.
                                       Default: false
          -T, --temporary-directory    Directory which will be used for storing
                                       temporary files.
                                       Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
          -t, --threads                Number of threads to use.
                                       Default: 8
          -k                           The number of neighbours to produce for
                                       each base entry.
                                       Default: 100

    unindex-entries      Usage: unindex-entries [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read.
        * -o, --output                 Destination file that will be written to.

    sort-ents      Usage: sort-ents [options]      
        Options:
          -c, --charset               The character set encoding to use for both
                                      reading input and writing output files.
                                      Default: UTF-8
          -E, --enumerated            Whether tokens in the input file are
                                      enumerated.
                                      Default: false
          -et, --enumerator-type      Options: [Memory, JDBC]
                                      Default: Memory
          -h, --help                  Display this help message.
                                      Default: false
          -X, --index-file            Index for the string tokens.
        * -i, --input                 Source file that will be read.
        * -o, --output                Destination file that will be written to.
          -r, --reverse               Reverse the result of comparisons.
                                      Default: false
          -T, --temporary-directory   Directory which will be used for storing
                                      temporary files.
                                      Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
          -t, --threads               Number of threads to use.
                                      Default: 8

    filter      Usage: filter [options]      
        Options:
          -c, --charset                      The character set encoding to use
                                             for both reading input and writing
                                             output files.
                                             Default: UTF-8
          -Xe, --entries-index-file          Index file for enumerating entries.
          -Ee, --enumerated-entries          Whether tokens in the first column
                                             of the input file are indexed.
                                             Default: false
          -Ef, --enumerated-features         Whether entries in the second
                                             column of the input file are indexed.
                                             Default: false
          -et, --enumerator-type             Options: [Memory, JDBC]
                                             Default: Memory
          -Xf, --features-index-file         Index file for enumerating
                                             features.
          -fef, --filter-entry-freq          Minimum entry pair frequency
                                             threshold.
                                             Default: 0.0
          -fep, --filter-entry-pattern       Regular expression that accepted
                                             entries must match.
          -few, --filter-entry-whitelist     Whitelist file containing entries
                                             of interest. (All others will be
                                             ignored)
          -fvf, --filter-event-freq          Minimum event frequency threshold.
                                             Default: 0.0
          -fff, --filter-feature-freq        Minimum feature pair frequency
                                             threshold.
                                             Default: 0.0
          -ffp, --filter-feature-pattern     Regular expression that accepted
                                             features must match.
          -ffw, --filter-feature-whitelist   Whitelist file containing features
                                             of interest. (All others will be
                                             ignored)
          -h, --help                         Display this help message.
                                             Default: false
        * -ie, --input-entries               Input entry frequencies file.
        * -iv, --input-events                Input event frequencies file.
        * -if, --input-features              Input features frequencies file.
        * -oe, --output-entries              Output entry frequencies file.
        * -ov, --output-events               Output event frequencies file.
        * -of, --output-features             Output features frequencies file.
          -T, --temp-dir                     Temporary directory which will be
                                             used during filtering.
                                             Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp

    unindex-sims      Usage: unindex-sims [options]      
        Options:
          -c, --charset                The character set encoding to use for
                                       both reading input and writing output files.
                                       Default: UTF-8
          -Xe, --entries-index-file    Index file for enumerating entries.
          -Ee, --enumerated-entries    Whether tokens in the first column of the
                                       input file are indexed.
                                       Default: false
          -Ef, --enumerated-features   Whether entries in the second column of
                                       the input file are indexed.
                                       Default: false
          -et, --enumerator-type       Options: [Memory, JDBC]
                                       Default: Memory
          -Xf, --features-index-file   Index file for enumerating features.
          -h, --help                   Display this help message.
                                       Default: false
        * -i, --input                  Source file that will be read.
        * -o, --output                 Destination file that will be written to.
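
As a rough illustration of how the options listed above combine, the sketch below assembles command lines for the filter and knn-sims commands and prints them without running anything. It assumes the commands are invoked as `./tools.sh <command> [options]`, which is not confirmed by the listing itself, and every file name (`entries`, `events`, `features`, `sims`, and their outputs) is a hypothetical placeholder. Only the switches come from the help text above.

```shell
#!/bin/sh
# Dry-run sketch: the command lines are printed, not executed.
# ASSUMPTION: sub-commands are run as "./tools.sh <command> [options]".
# All file names below are hypothetical placeholders.

TOOLS="./tools.sh"

# filter: the three inputs and three outputs are required (marked * above).
# -fef 5 raises the minimum entry frequency threshold from its default of 0.0.
filter_cmd="$TOOLS filter -ie entries -iv events -if features -oe entries.filtered -ov events.filtered -of features.filtered -fef 5"

# knn-sims: -i and -o are required; -k 50 overrides the default of 100
# neighbours per base entry, and -t 4 sets the number of threads.
knn_cmd="$TOOLS knn-sims -i sims -o sims.neighbours -k 50 -t 4"

echo "$filter_cmd"
echo "$knn_cmd"
```

Once the wrapper script name and the file paths have been checked against a local installation, the printed lines can be pasted into a terminal.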
