Item Cache
The Item Cache stores the items that the classifier uses for training and classification, along with the extracted features used during classification. It provides two levels of caching: a persistent cache of items and their features over a given time period, and an in-memory cache of the subset of the persistent cache containing the items that will actually be classified.
The persistent Item Cache is not ephemeral; it is Winnow’s permanent database. The current implementation provides no explicit mechanism for regenerating the persistent cache, though it would be possible to make your application capable of using Winnow to do so.
Winnow accrues items in the persistent Item Cache as they are classified and keeps them permanently, meaning that the database grows ever larger over time. This fits with using Winnow from an application that permanently keeps a set of items that grows over time, where nothing is thrown away.
If instead your application uses Winnow on a stream of items where old items expire once they reach a certain age, you will probably want to prune the cache periodically to keep the database from growing to an unmanageable size. The cache cannot be pruned while the classifier is in use; the classifier must be taken down first. Pruning frequently keeps each pruning run short and avoids excessive classifier downtime. For more information, see the Ruby purge scripts in the /bin directory.
The time window covered by the in-memory cache can be specified as a command-line parameter. By default, items from the last 30 days are kept in memory. If your application keeps items for a longer period of time, you will want to change this default so it matches the window used by your application. Note that a larger in-memory cache increases both Winnow’s startup time and the time required for each classification job.
When the classifier is first started it loads these items into the in-memory cache from the persistent cache. The in-memory cache maintains two indexes over these items: one that allows fast retrieval of an item given its id, and one that orders items by time. The first index is used for fast retrieval of training items when they happen to be in the in-memory cache. The second index ensures that classification occurs in chronological order, so that a job which only needs to process items added since a given time can stop processing when it reaches that point.
The in-memory item cache contains only the id, the extracted features, and a timestamp for each item. Extracted features are stored in a Judy array that maps each feature id to its value. Currently, the feature id is the atomized version of a token, i.e. an int32, and the value is the frequency of that token within the item. The persistent cache, by contrast, stores both the extracted features and the raw data of the item.
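The in-memory record can be pictured roughly as follows. This is a hypothetical sketch, not Winnow’s actual code: the real classifier stores features in a Judy array keyed by the atomized token id, whereas this sketch substitutes plain parallel arrays with a linear-scan lookup.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical sketch of an in-memory cache entry. The real implementation
 * uses a JudyL array mapping feature id -> frequency; plain arrays are used
 * here only to illustrate the shape of the data. */
typedef struct {
    int32_t  id;            /* item id (entries.id in catalog.db)       */
    time_t   updated;       /* timestamp used by the time-ordered index */
    int      num_features;
    int32_t *feature_ids;   /* atomized token ids                       */
    int16_t *frequencies;   /* frequency of each token within the item  */
} CachedItem;

/* Linear scan standing in for the JudyL lookup: returns the frequency of a
 * feature in the item, or 0 if the feature is absent. */
static int16_t item_feature_value(const CachedItem *item, int32_t feature_id) {
    for (int i = 0; i < item->num_features; i++)
        if (item->feature_ids[i] == feature_id)
            return item->frequencies[i];
    return 0;
}
```

A Judy array gives the real classifier the same id-to-value mapping with much better lookup and memory behavior than a linear scan over a sparse feature space.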
The Item Cache is populated via the Atom Publishing Protocol. Each item in the cache corresponds to an Atom entry that is published into the cache. The cache also includes a representation of feeds, which are used solely for grouping collections of items together.
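A published item is an ordinary Atom entry along the lines of the following. The id URN, title, and content here are hypothetical placeholders; only the general Atom entry shape (per the Atom Syndication Format) is assumed.

```xml
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:example:entry-12345</id>
  <title>Example item</title>
  <updated>2008-05-01T12:00:00Z</updated>
  <content type="html">&lt;p&gt;Raw item text that will be tokenized.&lt;/p&gt;</content>
</entry>
```

The atom:id becomes the item’s full_id, atom:updated its updated time, and the content is what gets stored raw and later tokenized.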
As items are published to the cache they use a processing chain within the classifier as shown in this diagram:

In the above diagram, the parallelograms represent internal queues within the classifier. These are used so that tokenization and database insertion can be serialized and performed asynchronously from the Atom Publishing Protocol (APP) requests that add the items. When an APP request adds an item, the item is immediately saved in its raw state in the SQLite database. This allows the HTTP request-response cycle to complete very quickly, while the tokenization and the updating of the in-memory item cache (which requires locking out all classification jobs) occur in the background.
The Item Cache also provides a callback that is triggered when either a certain number of items have been added to the cache or a certain amount of time has passed since the last item was added to the cache. By default this is triggered after either 200 items have been added or 60 seconds have passed since the last addition.
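The trigger condition can be sketched as a simple count-or-timeout check. This is an illustrative sketch, not Winnow’s actual code; the function name and the rule that an empty batch never fires are assumptions, while the thresholds (200 items, 60 seconds) come from the text above.

```c
#include <stdbool.h>
#include <time.h>

/* Defaults described in the text: fire after 200 additions, or 60 seconds
 * after the most recent addition. */
#define CALLBACK_ITEM_THRESHOLD 200
#define CALLBACK_IDLE_SECONDS   60

/* Hypothetical decision function: fire the callback when either enough items
 * have accumulated or enough time has passed since the last addition.
 * (Assumption: no callback fires when nothing has been added.) */
static bool should_fire_callback(int items_added, time_t last_added_at, time_t now) {
    if (items_added >= CALLBACK_ITEM_THRESHOLD)
        return true;
    return items_added > 0 && (now - last_added_at) >= CALLBACK_IDLE_SECONDS;
}
```

The two conditions together bound both the batch size and the latency: a busy stream flushes every 200 items, while a quiet stream still flushes within a minute of the last arrival.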
A separate thread runs nightly to purge the in-memory cache of items older than 30 days. There is currently no mechanism for purging the persistent cache, but since that cache is getting quite large (it currently holds almost 350K items and the database is 3GB in size), we probably need to solve this soon.
Cache persistence is implemented using a set of SQLite databases. The persistent cache consists of a directory containing the following SQLite databases:
top-level/
catalog.db
atom.db
tokens.db
The catalog.db database contains the metadata for each item in the cache and the map from token ids to token strings. The atom.db file contains the source Atom entry XML for each item in the database. The tokens.db file contains the tokenized form of each item in the database.
The database schema is shown here:

All datetime columns are represented as real numbers in the Julian Day format as recommended in the SQLite documentation.
Table and column definitions are as follows:
The entries table contains metadata for each entry in the database.
- id: An integer surrogate key used to identify the item internally to the item cache.
- full_id: A URI for the item that is defined in the atom:id element for the item. This is a global identifier for the item.
- updated: The datetime the item was updated, according to the atom:updated element for the item.
- created_at: The datetime the item was added to the cache.
- last_used_at: The datetime the item was last used to train a classifier, or null if never used.
Specifies a list of items in the entries table that should be used for the random background within the classifier.
- entry_id: Id of the entry to use in the random background.
Provides a mapping between the atomized version of a token and the string version of the token.
- id: The atomized integer version of the token.
- token: The original textual version of the token.
This table also has unique constraints on each column to ensure that you can’t have duplicate tokens or mappings.
Stores the raw atom xml content for each entry in the catalog.
- id: The id of the entry. This is a foreign key to the entries.id column.
- atom: The Atom xml for the entry. This is the raw content of the item that is used to generate tokens.
Stores the tokenized representation of the entries in the catalog.
- id: The id of the entry. This is a foreign key to the entries.id column.
- tokens: The tokenized representation of the item.
The tokens column is a binary format in which each token is represented by six bytes, so the size of the BLOB is always a multiple of six. The first four bytes of each token hold a 32-bit integer containing the atomized token id for that feature; the next two bytes hold a 16-bit short integer giving the frequency (or value) of that feature within the item. Token data is stored in network byte order (big endian), so programs wishing to read or write token files must perform their own conversions to their native byte order. The classifier uses the ntoh family of functions to perform this conversion in a portable way.
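The six-byte layout can be encoded and decoded with the standard byte-order functions. The function names below are illustrative assumptions, but the wire format (4-byte big-endian id followed by 2-byte big-endian frequency) is exactly the one described above, and htonl/htons/ntohl/ntohs are the conversion functions the text mentions.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl, htons, ntohl, ntohs */

/* Write one six-byte token record: 32-bit atomized token id followed by a
 * 16-bit frequency, both in network byte order (big endian). */
static void encode_token(uint8_t out[6], uint32_t token_id, uint16_t frequency) {
    uint32_t id_be   = htonl(token_id);
    uint16_t freq_be = htons(frequency);
    memcpy(out,     &id_be,   4);
    memcpy(out + 4, &freq_be, 2);
}

/* Read one six-byte token record back into native byte order. */
static void decode_token(const uint8_t in[6], uint32_t *token_id, uint16_t *frequency) {
    uint32_t id_be;
    uint16_t freq_be;
    memcpy(&id_be,   in,     4);
    memcpy(&freq_be, in + 4, 2);
    *token_id  = ntohl(id_be);
    *frequency = ntohs(freq_be);
}
```

Using memcpy rather than pointer casts avoids unaligned access, since a record in the middle of the BLOB starts at an offset that is a multiple of six, not of four.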