Running the Software
This script is designed to be the primary point of usage for the thesaurus building software. It runs a complete build process from frequency counting to K-Nearest-Neighbours in a single pass, providing all the most commonly used functionality of the underlying software.
$ ./byblo.sh [<options>] [@<config>] -i <file>
Where the arguments are:
- `<file>`: Input instances file containing head/contexts pairs.
- `@<config>`: Options and input files can be read from a file specified directly after an `@` character. Options in this file should be specified exactly as they would be at the command line, and may contain additional `@` references to other config files.
- `<options>`: Any number of the option switches specified [below][General Options].
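As a minimal sketch of the `@<config>` mechanism, the snippet below writes a few options to a config file and builds the matching command line. The file names (`byblo.conf`, `instances.tsv`) and the option values are hypothetical, and the command is only printed unless `byblo.sh` is present in the current directory.

```shell
# Hypothetical file names; substitute your own paths.
workdir=$(mktemp -d)
conf="$workdir/byblo.conf"

# Options are written exactly as they would be typed at the command line.
printf '%s\n' '-fef 10 -m Jaccard -k 50' > "$conf"

cmd="./byblo.sh @$conf -i instances.tsv"
if [ -x ./byblo.sh ]; then
    $cmd                      # run the complete build
else
    echo "would run: $cmd"    # dry run when byblo.sh is not in this directory
fi
```

Keeping options in a file makes a build repeatable: the same config can be reused with a different `-i` input.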
Except where noted, the following options apply to all steps within the build process:
- `-c <name>`, `--charset <name>`: Set the character encoding to use when reading and writing all files. Default is UTF-8.
- `-h`, `--help`: Display usage information and exit.
- `-o <path>`, `--output <path>`: Directory that will contain all (non-temporary) output files. The default is the current working directory. The device should have a very significant amount of free space. The specific amount depends on the task, but a typical thesaurus can require hundreds of gigabytes, or even terabytes.
- `-t <num>`, `--threads <num>`: Number of concurrent threads to use during processes that can be parallelised (i.e. all-pairs, sorting, and KNN). The default is chosen to make good use of the available cores; usually the number of cores plus 1.
- `-T <path>`, `--temp-dir <path>`: Path to the directory that will store temporary files used during certain processes: filter, sorting, and KNN. The device holding this directory should have an amount of free space similar to the output directory (see above). In addition, I/O performance has a significant impact on processing time, so local storage is preferable. The default is to create a temporary sub-directory inside the output directory.
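Since both the output and temporary directories need a lot of space, it can help to check the device before starting a long build. The following is a rough sketch only: the paths, thread count, and input file name are hypothetical, and the command is printed rather than executed.

```shell
# Hypothetical paths; a real build may need far more space than a test run.
out=$(mktemp -d)          # stands in for the output directory
tmp="$out/tmp"            # temporary files kept on the same local device
mkdir -p "$tmp"

# Report free space (in KiB) on the device holding the output directory.
free_kib=$(df -Pk "$out" | awk 'NR == 2 { print $4 }')
echo "free KiB on output device: $free_kib"

cmd="./byblo.sh -i instances.tsv -o $out -T $tmp -t 4 -c UTF-8"
echo "would run: $cmd"
```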
The following options configure the input filtering stage of the build process. They can be used to narrow down the area being investigated. Filtering also improves overall performance, since data is removed before the more computationally demanding later stages of the process.
- `-fef <num>`, `--filter-entry-freq <num>`: Accept only those entries with a frequency greater than or equal to `<num>`. This filter is generally used to improve the quality of the thesaurus, since low frequency entries are unlikely to be described accurately.
- `-fep <exp>`, `--filter-entry-pattern <exp>`: Accept only those entries that match the given Perl style regular expression `<exp>`. This filter is generally used to find only particular syntactical ranges of entries, such as those composed entirely of alphabetical characters.
- `-few <file>`, `--filter-entry-wordlist <file>`: Accept only those entries that match exactly a line within the given `<file>`. This filter is used to investigate the distributional similarity of some small subset of words.
- `-fff <num>`, `--filter-feature-freq <num>`: Accept only those features with a frequency greater than or equal to `<num>`. This filter is generally used to improve the performance of the thesaurus build process: low frequency features are unlikely to have a significant effect on the distributional similarity of the heads they describe, so they only serve to slow down the process.
- `-ffp <exp>`, `--filter-feature-pattern <exp>`: Accept only those features that match the given Perl style regular expression `<exp>`. It can be used as either a performance aid or an investigatory tool.
- `-ffw <file>`, `--filter-feature-wordlist <file>`: Accept only those features that are an exact match to a line within the given `<file>`.
- `-fvf <num>`, `--filter-event-freq <num>`: Accept only those events (entry/feature pair observations) with a frequency greater than or equal to `<num>`. This filter is generally used to improve the quality of the thesaurus, since low frequency events are unlikely to be very discriminative.
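To get a feel for what a frequency threshold and a pattern will keep, the two filters can be previewed on a small listing with standard tools before committing to a full build. This is an illustration only: the tab-separated `entry<TAB>frequency` layout and the sample words are assumptions, and `grep -E` uses POSIX extended syntax, which agrees with Perl style expressions only for simple patterns such as the one shown.

```shell
# Hypothetical entry-frequency listing: entry<TAB>frequency (format assumed).
entries=$(mktemp)
printf 'cat\t12\ndog\t7\nc4t\t3\nzebra\t1\n' > "$entries"

min_freq=2            # plays the role of -fef <num>
pattern='^[a-z]+$'    # plays the role of -fep <exp>

# Keep entries with frequency >= min_freq whose text matches the pattern.
accepted=$(awk -F '\t' -v min="$min_freq" '$2 + 0 >= min + 0 { print $1 }' "$entries" |
    LC_ALL=C grep -E "$pattern")
echo "$accepted"

echo "would run: ./byblo.sh -i instances.tsv -fef $min_freq -fep '$pattern'"
```

Here `c4t` is rejected by the pattern and `zebra` by the frequency threshold, so only `cat` and `dog` remain.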
The counting stage no longer has any additional options.
The following options affect the all-pairs similarity search part of the thesaurus build process.
- `-m <name>`, `--measure <name>`: Proximity measure to compare heads with. Commonly used examples include: Lin, Jaccard, CRMI, RecallMI. The default is Lin.
- `--measure-reversed`: For asymmetric measures, calculate the inverse similarities, i.e. reverse the operands of the similarity function: sim(q,r) becomes sim(r,q). Has no effect with symmetric measures.
- `-Smn <num>`, `--similarity-min <num>`: Minimum similarity threshold over which resultant pairs will be output. This option is generally used for performance purposes. Without this option the output pairs file size can be quadratic with respect to the number of entries. Default is negative-infinity.
- `-Smx <num>`, `--similarity-max <num>`: Maximum similarity threshold under which resultant pairs will be output. Typically used instead of `--similarity-min` for distance based measures, since smaller values indicate greater similarity. Default is positive-infinity.
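The quadratic growth mentioned for `--similarity-min` is easy to underestimate. As a back-of-the-envelope sketch (the entry count, threshold value, and file name are hypothetical), the upper bound on output pairs can be computed before choosing a threshold:

```shell
entries=20000                      # hypothetical number of entries after filtering
pairs=$((entries * entries))       # upper bound: one output line per ordered pair
echo "up to $pairs similarity pairs without a threshold"

# Illustrative values only; Jaccard is one of the measures listed above.
cmd="./byblo.sh -i instances.tsv -m Jaccard -Smn 0.1"
echo "would run: $cmd"
```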
Additional options for specific measures:
- `--crmi-beta <num>`: For CRMI measure. Value should be in the range 0 to 1 inclusive. Default is 1.
- `--crmi-gamma <num>`: For CRMI measure. Value should be in the range 0 to 1 inclusive. Default is 0.
- `--lee-alpha <num>`: For Lee measure. Value should be in the range 0 to 1 exclusive. Default is 0.99.
- `--mink-p <num>`: For Lp measure. Value should be an integer in the range 1 to infinity. Default is 2.
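Because a long build can fail late on a bad parameter, it may be worth checking the documented ranges up front. The sketch below validates two of the parameters against the ranges stated above; the variable names are illustrative, and passing `Lee` to `-m` is an assumption based on the measure name used in this list.

```shell
alpha=0.99    # --lee-alpha: must lie strictly between 0 and 1
p=2           # --mink-p: integer, 1 or greater

# awk handles the floating-point comparison that plain sh arithmetic cannot.
if awk -v a="$alpha" 'BEGIN { exit !(a > 0 && a < 1) }'; then
    alpha_ok=yes
else
    alpha_ok=no
fi

# Reject empty values, anything containing a non-digit, and a bare zero.
case "$p" in
    ''|*[!0-9]*|0) p_ok=no ;;
    *) p_ok=yes ;;
esac
echo "lee-alpha ok: $alpha_ok, mink-p ok: $p_ok"

# Measure name assumed; check the names your build accepts.
echo "would run: ./byblo.sh -i instances.tsv -m Lee --lee-alpha $alpha"
```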
- `-k <num>`: For each resultant head string, produce its `<num>` nearest neighbours. Default is 100.
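The effect of `-k` can be illustrated with standard tools on a toy similarity listing. This is not how the software itself computes neighbours; it is a small sketch that assumes a tab-separated `entry<TAB>neighbour<TAB>similarity` layout and made-up values.

```shell
# Hypothetical similarity listing: entry<TAB>neighbour<TAB>similarity.
sims=$(mktemp)
printf 'cat\tdog\t0.8\ncat\tzebra\t0.2\ncat\tlion\t0.6\ndog\tcat\t0.8\ndog\tlion\t0.3\n' > "$sims"

k=2
tab=$(printf '\t')

# Group by entry, order each group by descending similarity,
# then keep only the first k lines of every group.
neighbours=$(LC_ALL=C sort -t "$tab" -k1,1 -k3,3nr "$sims" |
    awk -F '\t' -v k="$k" 'seen[$1]++ < k')
echo "$neighbours"
```

With `k=2`, the weakest neighbour of `cat` (`zebra`) is dropped, while `dog` keeps both of its neighbours.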
For most users the byblo.sh method (described above) is sufficient to produce a thesaurus. However, it is also possible to run specific stages of the pipeline with greater control using the tools.sh script.
See the (rather long) usage:
$ ./tools.sh --help
Usage: Tools [options] [command] [command options]
Options:
-h, --help Display this help message.
Default: false
Commands:
allpairs Usage: allpairs [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
--crmi-beta Beta paramter to Weed's CRMI measure.
Default: 0.5
--crmi-gamma Gamma parameter to Weed's CRMI measure.
Default: 0.5
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
-ip, --identity-pairs Produce similarity between pair of
identical entries.
Default: false
* -i, --input Event frequency vectors files.
-ie, --input-entries Entry frequencies file
-if, --input-features Feature frequencies file
--lee-alpha Alpha parameter to Lee's alpha-skew
divergence measure.
Default: 0.99
-m, --measure Similarity measure to use.
Default: Lin
--measure-reversed Swap similarity measure inputs.
Default: false
--mink-p P parameter to Minkowski/Lp space
measure.
Default: 2.0
* -o, --output Output similarity matrix file.
-Smx, --similarity-max Maximum similarity threshold.
Default: Infinity
-Smn, --similarity-min Minimum similarity threshold.
Default: -Infinity
-t, --threads Number of concurrent processing threads.
Default: 9
count Usage: count [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output
files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of
the input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column
of the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Input instances file
* -oe, --output-entries Output entry frequencies file
* -oef, --output-entry-features Output entry-feature frequencies file
* -of, --output-features Output feature frequencies file
-T, --temporary-directory Directory used for holding temporary
files.
Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
-t, --threads Number of threads to use.
Default: 8
sort-events Usage: sort-events [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
-r, --reverse Reverse the result of comparisons.
Default: false
-T, --temporary-directory Directory which will be used for storing
temporary files.
Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
-t, --threads Number of threads to use.
Default: 8
unindex-instances Usage: unindex-instances [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
merge-sims Usage: merge-sims [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -ifa, --input-file-a The first file to merge.
* -ifb, --input-file-b The second file to merge.
* -of, --output-file The output file to which both input will
be merged.
-r, --reverse Reverse the result of comparisons.
Default: false
unindex-features Usage: unindex-features [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
index-events Usage: index-events [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
sort-sims Usage: sort-sims [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
-r, --reverse Reverse the result of comparisons.
Default: false
-T, --temporary-directory Directory which will be used for storing
temporary files.
Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
-t, --threads Number of threads to use.
Default: 8
unindex-neighbours Usage: unindex-neighbours [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
index-sims Usage: index-sims [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
index-neighbours Usage: index-neighbours [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
merge-ents Usage: merge-ents [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -ifa, --input-file-a The first file to merge.
* -ifb, --input-file-b The second file to merge.
* -of, --output-file The output file to which both input will
be merged.
-r, --reverse Reverse the result of comparisons.
Default: false
index-entries Usage: index-entries [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
merge-feats Usage: merge-feats [options]
Options:
-c, --charset The character set encoding to use for both
reading input and writing output files.
Default: UTF-8
-E, --enumerated Whether tokens in the input file are
enumerated.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-h, --help Display this help message.
Default: false
-X, --index-file Index for the string tokens.
* -ifa, --input-file-a The first file to merge.
* -ifb, --input-file-b The second file to merge.
* -of, --output-file The output file to which both input will be
merged.
-r, --reverse Reverse the result of comparisons.
Default: false
index Usage: index [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
merge-events Usage: merge-events [options]
Options:
-c, --charset The character set encoding to use for both
reading input and writing output files.
Default: UTF-8
-E, --enumerated Whether tokens in the input file are
enumerated.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-h, --help Display this help message.
Default: false
-X, --index-file Index for the string tokens.
* -ifa, --input-file-a The first file to merge.
* -ifb, --input-file-b The second file to merge.
* -of, --output-file The output file to which both input will be
merged.
-r, --reverse Reverse the result of comparisons.
Default: false
sort-feats Usage: sort-feats [options]
Options:
-c, --charset The character set encoding to use for both
reading input and writing output files.
Default: UTF-8
-E, --enumerated Whether tokens in the input file are
enumerated.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-h, --help Display this help message.
Default: false
-X, --index-file Index for the string tokens.
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
-r, --reverse Reverse the result of comparisons.
Default: false
-T, --temporary-directory Directory which will be used for storing
temporary files.
Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
-t, --threads Number of threads to use.
Default: 8
unindex-events Usage: unindex-events [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
index-features Usage: index-features [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
index-instances Usage: index-instances [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
knn-sims Usage: knn-sims [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
-r, --reverse Reverse the result of comparisons.
Default: false
-T, --temporary-directory Directory which will be used for storing
temporary files.
Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
-t, --threads Number of threads to use.
Default: 8
-k The number of neighbours to produce for
each base entry.
Default: 100
unindex-entries Usage: unindex-entries [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
sort-ents Usage: sort-ents [options]
Options:
-c, --charset The character set encoding to use for both
reading input and writing output files.
Default: UTF-8
-E, --enumerated Whether tokens in the input file are
enumerated.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-h, --help Display this help message.
Default: false
-X, --index-file Index for the string tokens.
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
-r, --reverse Reverse the result of comparisons.
Default: false
-T, --temporary-directory Directory which will be used for storing
temporary files.
Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
-t, --threads Number of threads to use.
Default: 8
filter Usage: filter [options]
Options:
-c, --charset The character set encoding to use
for both reading input and writing
output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column
of the input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second
column of the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating
features.
-fef, --filter-entry-freq Minimum entry pair frequency
threshold.
Default: 0.0
-fep, --filter-entry-pattern Regular expresion that accepted
entries must match.
-few, --filter-entry-whitelist Whitelist file containing entries
of interest. (All others will be
ignored)
-fvf, --filter-event-freq Minimum event frequency threshold.
Default: 0.0
-fff, --filter-feature-freq Minimum feature pair frequency
threshold.
Default: 0.0
-ffp, --filter-feature-pattern Regular expresion that accepted
features must match.
-ffw, --filter-feature-whitelist Whitelist file containing features
of interest. (All others will be
ignored)
-h, --help Display this help message.
Default: false
* -ie, --input-entries Input entry frequencies file.
* -iv, --input-events Input event frequencies file.
* -if, --input-features Input features frequencies file.
* -oe, --output-entries Output entry frequencies file
* -ov, --output-events Output event frequencies file.
* -of, --output-features Output features frequencies file.
-T, --temp-dir Temorary directory which will be
used during filtering.
Default: /var/folders/_4/lnfd6yv910q4gdqbrpvs1vzr0000gp/T/tmp-<uniqueid>.tmp
unindex-sims Usage: unindex-sims [options]
Options:
-c, --charset The character set encoding to use for
both reading input and writing output files.
Default: UTF-8
-Xe, --entries-index-file Index file for enumerating entries.
-Ee, --enumerated-entries Whether tokens in the first column of the
input file are indexed.
Default: false
-Ef, --enumerated-features Whether entries in the second column of
the input file are indexed.
Default: false
-et, --enumerator-type Options: [Memory, JDBC]
Default: Memory
-Xf, --features-index-file Index file for enumerating features.
-h, --help Display this help message.
Default: false
* -i, --input Source file that will be read
* -o, --output Destination file that will be writen to.
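Going by the usage listing above, a staged run could be sketched roughly as follows. Only switches that appear in that listing are used, but the file names are hypothetical and the listing does not say whether additional indexing or sorting steps are needed between stages, so treat this as an outline rather than a verified recipe. Commands are printed instead of executed when `tools.sh` is not in the current directory.

```shell
steps=0

# Run a stage, or print it when tools.sh is not available (dry run).
run() {
    steps=$((steps + 1))
    if [ -x ./tools.sh ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# 1. Count entry, feature, and event frequencies.
run ./tools.sh count -i instances.tsv -oe entries -oef events -of features
# 2. Filter out low-frequency entries.
run ./tools.sh filter -ie entries -iv events -if features \
    -oe entries.filtered -ov events.filtered -of features.filtered -fef 10
# 3. Compute all-pairs similarities.
run ./tools.sh allpairs -i events.filtered -ie entries.filtered \
    -if features.filtered -o sims -m Lin
# 4. Keep the k nearest neighbours of each entry.
run ./tools.sh knn-sims -i sims -o neighbours -k 100
```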