Data File Formats
This document is an overview of data file formats used during the generation of distributional thesauri.
There are a number of file format options for use with the Thesaurus Generation Software (TGS). Although we have been using one particular format in the University of Sussex (UoS) team's previous work, that format is not necessarily optimal and is entirely open to change for this project. To discuss specific formats, it is necessary to give an abstract overview of the data types and structures stored at each stage of the TGS pipeline.
There should be a compressed archive file distributed with this document that contains examples of some of the files discussed. Any specific files, referenced by name in the text, should be available within the archive.
Each datum held by the structures of a distributional thesaurus will be one of four atomic types: base entries, features, weights, and similarities. The following subsections give an overview of each type.
Entries are the things that are "looked up" in the resulting thesaurus, and naturally they are also the things that are retrieved. Taking the example of a traditional curated thesaurus, looking up the word "happy" might return "blissful", "glad", "perky", and "tickled" --- all of which are entries. Usually (though not always) entries are morphemes, tokens, phrases, or lemmas that occur in the input corpus. The entry used to search the thesaurus is commonly known as the base entry, while each resulting pseudo-synonym is called a neighbour. In the thesaurus output, looking up a base entry returns an ordered list of neighbours with a high degree of similarity to the looked-up entry.
Entries are discrete symbols represented as strings.
The context of entries in the corpus is described or represented by features. Typical examples of features are co-occurrent tokens, grammatical relations, Parts-Of-Speech (POS), or a combination. In some situations a system might have numeric features, but these must be discretised for the purpose of thesaurus generation. Discretisation can be achieved by quantising (binning) real numbers, or by bounding an integer range.
Features are discrete symbols represented as strings.
Weights denote how strongly features and entries are associated with each other. These associations can be between base entries and features, or priors for independent features and entries.
Weights are (usually positive) integer or real numbers represented in double precision floating point.
Similarity is the degree to which a pair of entries can be considered equivalent with respect to their feature distributions. Similarity is calculated using a similarity measure function, which is determined by the notion of similarity required for the thesaurus. The details of the similarity measure and the corresponding similarity vary between tasks. One commonly used notion of similarity is synonymy, although most distributional thesauri struggle to disambiguate synonyms from antonyms and some types of hypo/hypernymy. Another notion of similarity is that of substitutability; similar entries are those that can be interchanged with minimal loss of semantics or syntactic correctness.
Similarity is usually represented as a real-number proximity value between 0 and 1, where 0 indicates minimal similarity and 1 indicates maximal similarity. Some measures produce a distance-based similarity value, where 0 indicates maximal similarity and minimal similarity is represented by positive infinity. Similarity values may be asymmetric, e.g. in thesauri where a notion of hierarchy is required.
Similarities are real numbers represented in double precision floating point.
The software currently makes use of two distinct file formats. Most data is read sequentially, so a plain-text Tab-Separated-Values (TSV) format is sufficient. In some cases random-access indexing is required, and for these purposes the JDBM3 library is used, which has a bespoke binary format.
The input, output, and most of the intermediate files are stored in plain-text TSV. This consists of one record per line, where a record's values are delimited by the tab character. Each value will be one of the atomic types described above.
Rather than dealing with complex strings throughout the whole processing pipeline, the strings from the input instances file are given a fixed enumeration at the start of the pipeline. During this stage every entry and feature type is given a unique integer id. The mapping from strings to integers is stored in an index, which is then used at the end of the pipeline to re-introduce the strings into the output files. The enumeration is done using disk-based data structures from the JDBM3 library. This process produces two separate enumeration data structures: one for entries and one for features. Each requires two files, a data store and an index.
For a particular input instances file named <inputname>, these enumeration files will typically be named as follows:
<inputname>.entry-index.d.0
<inputname>.entry-index.i.0
<inputname>.feature-index.d.0
<inputname>.feature-index.i.0
The files use a bespoke binary format, defined by the JDBM3 project authors, and so are not suitable for inspection.
All intermediate files are now stored using string enumeration, so regrettably they cannot easily be inspected. The advantage is substantially reduced file sizes and associated I/O overhead. In addition the memory footprint is improved, since the enumeration mappings are no longer needed in memory, and performance is increased because strings are not repeatedly looked up in the index.
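The following is a simplified in-memory sketch of the enumeration idea; the actual implementation backs these mappings with JDBM3's disk-based data structures, whose API is not reproduced here:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified in-memory analogue of the string enumeration; the TGS
// itself stores these mappings in JDBM3 disk-based structures.
public final class StringEnumerator {
    private final Map<String, Integer> toId = new HashMap<>();
    private final List<String> toString = new ArrayList<>();

    int idOf(String s) {                      // enumerate: string -> id
        Integer id = toId.get(s);
        if (id == null) {
            id = toString.size();
            toId.put(s, id);
            toString.add(s);
        }
        return id;
    }

    String stringOf(int id) {                 // de-enumerate: id -> string
        return toString.get(id);
    }

    public static void main(String[] args) {
        StringEnumerator entries = new StringEnumerator();
        System.out.println(entries.idOf("apple_NN"));  // 0
        System.out.println(entries.idOf("orange_NN")); // 1
        System.out.println(entries.idOf("apple_NN"));  // 0 again: ids are stable
        System.out.println(entries.stringOf(1));       // orange_NN
    }
}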
In addition to enumeration, the software makes use of skip indexing to further reduce data size. This involves storing the difference between the current id and the previous one, rather than the raw integer value. For example the sequence (996, 1001, 1005, 1012) can be compressed to (996, 5, 4, 7). Skip indexing is only effective when the numbers are stored sequentially in ascending order, and this ordering is not always possible, so the technique is applied only to selected columns in the data. This technique further impedes easy inspection of the intermediate files, but the benefit is further reduced file sizes and associated I/O overhead.
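A minimal sketch of this encoding and its inverse, using the example sequence above:

// Delta-encode an ascending id sequence (skip indexing), and decode it.
public final class SkipIndexing {
    static int[] encode(int[] ids) {
        int[] out = new int[ids.length];
        int prev = 0;
        for (int i = 0; i < ids.length; i++) {
            out[i] = ids[i] - prev;   // store the difference from the previous id
            prev = ids[i];
        }
        return out;
    }

    static int[] decode(int[] deltas) {
        int[] out = new int[deltas.length];
        int prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];        // accumulate to recover the raw id
            out[i] = prev;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] enc = encode(new int[]{996, 1001, 1005, 1012});
        System.out.println(java.util.Arrays.toString(enc));         // [996, 5, 4, 7]
        System.out.println(java.util.Arrays.toString(decode(enc))); // [996, 1001, 1005, 1012]
    }
}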
Most of the data is stored in a simple TSV format, with one row per record. Each record consists of a tuple of well defined types. It is often the case that the first element from a tuple is repeated in subsequent records, which leads to a substantial space overhead. For example consider the following data snippet:
... ...
x a
x b
y c
y d
y e
x f
x g
y h
y e
z j
... ...
The first column, containing the entry, is often repeated on subsequent lines. The compact format is a way of reducing this overhead by only starting a new record when the first column's value changes. Continuing the example, we can compress the snippet to the following:
... ...
x a b
y c d e
x f g
y h e
z j
... ...
Here each record consists of a single column 1 value, followed by one or more column 2 values.
This compact format can reduce the disk space requirements by up to 50% for files with the correct ordering.
One of the advantages of this format is that it is a direct super-type of the more structured expanded format. This allows a single input parser to read both file formats correctly.
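A sketch of such a parser is given below (hypothetical code, assuming tab-delimited lines where the first value is the entry and all remaining values are features):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Because a compact record is a super-type of an expanded one, a single
// parser handles both: the first value on a line is the entry, and every
// subsequent value is a feature paired with that entry.
public final class InstanceReader {
    interface PairHandler { void pair(String entry, String feature); }

    static void read(BufferedReader in, PairHandler handler) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String[] values = line.split("\t");
            for (int i = 1; i < values.length; i++)
                handler.pair(values[0], values[i]);
        }
    }

    public static void main(String[] args) throws IOException {
        // a mixture of expanded and compact records parses identically
        String data = "x\ta\nx\tb\ny\tc\td\te\n";
        read(new BufferedReader(new StringReader(data)),
             (e, f) -> System.out.println(e + "\t" + f));
    }
}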
The following is an overview of the files used during thesaurus creation, in the order they appear in the processing pipeline.
The input for the TGS is a sequence of base entry / feature pair instances. For each feature found by the feature extractor, a single instance pair is produced. This is the preferred input form for the TGS, since it allows maximal flexibility when it comes to feature frequency counting and weighting. Precomputed feature vectors can be used, but they preclude some of the more effective similarity measures and optimisation strategies.
The instances file format is TSV: a plain-text, tab-delimited flat file with one record per line. Each record is composed of two values: a single entry, and a single feature:
base entry feature
base entry feature
base entry feature
... ...
The software makes no assumptions about the morphology of the entry and feature strings, other than that they do not contain the tab, newline, or null characters used for value, record, and file delimitation respectively. The empty string is also not permitted, since it has no useful semantics for the task at hand. The file need not be sorted or organised in any particular way.
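These constraints amount to a simple validity check on each value; a hypothetical sketch:

// Checks a value against the delimiter constraints: no tab, newline,
// or null characters, and no empty strings.
public final class ValueCheck {
    static boolean isValid(String value) {
        return !value.isEmpty()
            && value.indexOf('\t') < 0
            && value.indexOf('\n') < 0
            && value.indexOf('\0') < 0;
    }

    public static void main(String[] args) {
        System.out.println(isValid("apple_NN"));  // true
        System.out.println(isValid(""));          // false: empty strings forbidden
    }
}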
A simple example instances file for a proximity thesaurus could include the following snippet, where fruit entries are described by co-occurrent tokens:
... ...
apple peel
apple green
orange eat
orange juicy
orange orange
apple steve
apple the
orange phone
orange colour
fig eat
... ...
A more complex dependency thesaurus (from a parsed corpus, using parts of speech, lemmatisation, and grammatical relations) might include the following snippet:
... ...
apple_NN ccomp:peel+ing
apple_NN ncmod:green_ADJ
orange_NN obj:eat+s_VB
orange_NN juicy+st_ADJ
orange_NN ncmod:orange_ADJ
apple_NN conj:steve_NN
apple_NN ta:phone+_NN
orange_NN ta:phone+_NN
orange_NN xmod:colour_ADJ
fig_NN obj:eat+ed_VB
... ...
The sample file src/test/resources/uk/ac/susx/mlcl/byblo/bnc-gramrels-fruit is an example of such a raw instances file. Each base entry is a type of fruit, and features are grammatical relations found in a RASP parse of the British National Corpus (BNC).
The input instances file may be provided in the compact format (as specified above). This would reduce the example input snippet to the following:
... ...
apple_NN ccomp:peel+ing ncmod:green_ADJ
orange_NN obj:eat+s_VB juicy+st_ADJ ncmod:orange_ADJ
apple_NN conj:steve_NN ta:phone+_NN
orange_NN ta:phone+_NN xmod:colour_ADJ
fig_NN obj:eat+ed_VB
... ...
The following structures are used as intermediaries between the various steps of the processing pipeline. It is not critical for the user to understand their format, but they may be useful for experimentation and analysis. Since these files are primarily for internal use, their format, and even their existence, are particularly subject to change.
The raw instances input is converted by the TGS into three separate frequency count files. The entries file consists of information about the base entries independent of features. Likewise the features file consists of information about features independent of base entries. The pairs file consists of information about base entry / feature co-occurrence. Together these files constitute the feature vectors and context prior vectors. The following sections provide a more detailed description.
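First, though, the relationship between the instance pairs and the three counts can be sketched in miniature (an in-memory simplification; the real implementation works with files on disk):

import java.util.HashMap;
import java.util.Map;

// Sketch of the three frequency counts derived from instance pairs:
// entry totals, feature totals, and (entry, feature) event totals.
public final class Counts {
    final Map<String, Integer> entries = new HashMap<>();
    final Map<String, Integer> features = new HashMap<>();
    final Map<String, Integer> events = new HashMap<>();

    void add(String entry, String feature) {
        entries.merge(entry, 1, Integer::sum);
        features.merge(feature, 1, Integer::sum);
        events.merge(entry + "\t" + feature, 1, Integer::sum);
    }

    public static void main(String[] args) {
        Counts c = new Counts();
        c.add("apple", "det:a");
        c.add("apple", "det:an");
        c.add("apple", "det:a");
        System.out.println(c.entries);  // {apple=3}: instance frequency, not corpus frequency
        System.out.println(c.events);   // apple/det:a=2, apple/det:an=1
    }
}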
The entries file is composed of (entry, frequency) tuples. The frequency is the total number of occurrences of a unique entry in the instances file, not the number of times it occurred in the corpus; since an entry may occur with multiple features the instance frequency is always equal to or greater than the corpus frequency.
apple 540
apricot 73
avocado 35
banana 178
bilberry 13
... ...
The file format is TSV with one tuple per line. The sample file src/test/resources/uk/ac/susx/mlcl/byblo/bnc-gramrels-fruit.entries is an example of such an entries file.
Note that the actual entries file may make use of both enumerated strings and skip indexing, so the actual structure may be somewhat different from the example above.
The features file is similar to the entries file. Each tuple consists of a feature and its frequency in the input instances file.
In previous work the TSV file format was used, consisting of one pair per line. The sample file src/test/resources/uk/ac/susx/mlcl/byblo/bnc-gramrels-fruit.features is an example, and the following is a snippet from this file:
aux:be 375
det:a 1897
det:an 436
det:the 5269
iobj:from 654
ncmod:from 404
ncmod:future 115
ncmod:in 440
... ...
Note that the actual features file may make use of both enumerated strings and skip indexing, so the actual structure may be somewhat different from the example above.
The events file consists of triples: an entry, a feature, and the frequency of their co-occurrence within the instances file. This file most closely represents what could be thought of as the feature vectors. The file format is TSV, consisting of one triple per line. The sample file src/test/resources/uk/ac/susx/mlcl/byblo/bnc-gramrels-fruit.pairs is an example of this format.
apple det:a 55
apple det:an 171
apple det:the 325
apple iobj:of 55
banana det:a 55
banana det:the 73
cherry det:the 91
date aux:be 323
date aux:can 85
date aux:have 84
... ... ...
Note that the actual events file may make use of the compact format, enumerated strings, and skip indexing, so the actual structure may be somewhat different from the snippet above. For example, you may see something like:
0 0 55 1 171 1 325 1 55
1 0 55 2 73
1 2 91
1 4 323 1 85 1 84
...
The frequency files (entries, features, and events) are optionally filtered to produce three more frequency count files in an identical format. Filters are used primarily to improve computational performance, but can also be used to focus on application interests, or for crude data smoothing. Commonly used filters include frequency thresholds, string pattern matching, and vocabulary list matching. Filters are applied iteratively, so removing a low frequency entry will also remove its corresponding events, and any features that occurred only with that entry. Likewise, removing low frequency features can result in entries being removed if they end up with an empty feature vector. The file format for filtered frequencies is the same as unfiltered in previous work; i.e. a tab-delimited flat file.
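A minimal in-memory sketch of a frequency-threshold filter with the cascading removal described above (hypothetical code; the TGS streams the count files from disk):

import java.util.Iterator;
import java.util.Map;

// Drops entries below a frequency threshold, removes their events, and
// then removes any feature whose total falls to zero as a result; this
// is why filtering is described as iterative. The symmetric pass (low
// frequency features causing empty-vector entries to be dropped) is
// analogous.
public final class FrequencyFilter {
    static void filterEntries(Map<String, Integer> entries,
                              Map<String, Integer> features,
                              Map<String, Integer> events,   // key: entry + "\t" + feature
                              int minFrequency) {
        entries.values().removeIf(f -> f < minFrequency);
        Iterator<Map.Entry<String, Integer>> it = events.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> e = it.next();
            String[] key = e.getKey().split("\t");
            if (!entries.containsKey(key[0])) {
                // subtract the removed event's count from its feature total
                features.merge(key[1], -e.getValue(), Integer::sum);
                it.remove();
            }
        }
        features.values().removeIf(f -> f <= 0);
    }
}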
The final internal pre-processing step involves converting the three filtered frequency count files into the weighted feature vectors used during the thesaurus matrix generation process. This process is highly dependent on the required similarity measure. Example weightings include: TF-IDF, probability estimations, unit-vector normalisation, and the PMI discrepancy between a feature vector and the corpus language model. Because this step is tightly coupled with the similarity measure, it is usually done "on the fly" during similarity matrix production.
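As an illustration of one such weighting, the following sketch computes a simple pointwise mutual information (PMI) estimate from the three count files (a simplified textbook formulation; the exact weighting used by the TGS may differ):

// Simplified PMI estimate for an (entry, feature) event:
//   PMI = log( p(entry, feature) / (p(entry) * p(feature)) )
//       = log( (eventCount * total) / (entryCount * featureCount) )
// where `total` is the number of instances in the input file.
public final class PmiWeighting {
    static double pmi(long eventCount, long entryCount, long featureCount, long total) {
        return Math.log((double) eventCount * total
                / ((double) entryCount * featureCount));
    }

    public static void main(String[] args) {
        // counts for (apple, det:a) from the snippets above; the total of
        // 100000 instances is a made-up figure for illustration only
        System.out.println(pmi(55, 540, 1897, 100000));
    }
}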
The TGS takes all pairwise combinations of the filtered, weighted feature vectors, passing them through a similarity function, to produce a similarity matrix. This matrix is the thesaurus output, but it is almost never produced in a dense form. If every pairwise combination were written to disk, the size would be quadratic in the number of entries - easily consuming terabytes of storage. To compress the output matrix, the similarities are thresholded and only non-zero pairs are produced, in a sparse format.
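A sketch of this thresholded, sparse production step, using set-based Jaccard similarity as a stand-in measure (the real implementation operates on weighted vectors and supports other measures):

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Emits only entry pairs whose similarity exceeds a threshold, keeping
// the output matrix sparse. Jaccard here is |A intersect B| / |A union B|
// over feature sets. Only the upper triangle is computed; for symmetric
// measures the mirrored pair can be written out as well.
public final class SparseSimilarity {
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        if (inter.isEmpty()) return 0.0;
        return (double) inter.size() / (a.size() + b.size() - inter.size());
    }

    static void allPairs(List<String> entries, Map<String, Set<String>> vectors,
                         double threshold) {
        for (int i = 0; i < entries.size(); i++)
            for (int j = i + 1; j < entries.size(); j++) {
                double sim = jaccard(vectors.get(entries.get(i)),
                                     vectors.get(entries.get(j)));
                if (sim > threshold)
                    System.out.println(entries.get(i) + "\t" + entries.get(j) + "\t" + sim);
            }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> v = Map.of(
            "apple", Set.of("peel", "green", "the"),
            "orange", Set.of("eat", "juicy", "the"));
        allPairs(List.of("apple", "orange"), v, 0.0); // apple  orange  0.2
    }
}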
In most cases the ordered pair of the two base entry strings uniquely identifies the record; however, there are unusual settings that could allow identity pairs to be produced (a pair where both base entry strings are the same). Hence an assumption of uniqueness should be avoided.
base entry base entry similarity weight
base entry base entry similarity weight
base entry base entry similarity weight
... ... ...
There will usually be two or more output files in this format: a raw similarities file and a k-nearest-neighbours file. The raw similarities file is unsorted and will contain all pairwise combinations of base entries that have passed through the various filtering stages. The k-nearest-neighbours file is produced from the pairs file by sorting in order of the first entry ascending, then in order of similarity weight descending. For each entry in the first column, only the first k pairs are produced, i.e. the k nearest neighbours.
apple banana 0.119724
apple tomato 0.109928
apple orange 0.106146
apricot strawberry 0.091673
apricot pear 0.088619
apricot blackberry 0.085583
avocado grapefruit 0.087378
avocado mango 0.08259
avocado gooseberry 0.079608
banana apple 0.119724
banana tomato 0.11059
banana melon 0.109468
... ... ...
The software outputs the matrix as triples consisting of a pair of base entries and a similarity weight, written to a TSV file. The sample src/test/resources/uk/ac/susx/mlcl/byblo/bnc-gramrels-fruit.jaccard is an example output similarities file in this format.
Note that the full similarities file may make use of string enumeration, skip-indexing, and the compact format (as discussed above). The neighbours file should have the original strings re-instantiated but may also use the compact format.
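Finally, here is a sketch of the k-nearest-neighbours selection described above, assuming the raw similarity triples have already been grouped by their first entry (hypothetical in-memory code):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Groups similarity triples by the first entry (ascending), sorts each
// group by similarity weight descending, and keeps only the top k
// neighbours, mirroring the k-nearest-neighbours file described above.
public final class KnnSelection {
    record Pair(String neighbour, double similarity) {}

    static Map<String, List<Pair>> knn(Map<String, List<Pair>> byEntry, int k) {
        Map<String, List<Pair>> out = new TreeMap<>();   // entries ascending
        byEntry.forEach((entry, pairs) -> {
            List<Pair> sorted = new ArrayList<>(pairs);
            sorted.sort(Comparator.comparingDouble(Pair::similarity).reversed());
            out.put(entry, sorted.subList(0, Math.min(k, sorted.size())));
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Pair>> raw = new TreeMap<>();
        raw.put("apple", List.of(
            new Pair("tomato", 0.109928),
            new Pair("banana", 0.119724),
            new Pair("orange", 0.106146)));
        System.out.println(knn(raw, 2)); // banana then tomato: the 2 nearest neighbours
    }
}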