Jason Mimick edited this page Sep 12, 2016 · 4 revisions

A customer has collections containing a variety of data elements and wants to provide a fast "Google-like" search capability. We want to support case-insensitive and partial matching. Additionally, we will require at least 3 characters for a search term. There are multiple strategies for implementing this kind of scenario. In this note, we compare and contrast two options:

  • using $text indexes
  • precomputing an optimized set of search terms.

tl;dr: As of 3.2.x, the precomputed option is much faster.

References: http://jam.sg/blog/efficient-partial-keyword-searches/

We will start with a collection like this:

> db.items.findOne()
{
       	"_id" : ObjectId("57d6b86e49eca9bd7d2620e5"),
       	"keys" : [
       		"Zd57aDDl",
       		"t6FBgHZl"
       	]
}

Here, the keys key contains the items we want to search over. In our example, we have the search terms in an array, but this example can be easily modified for cases where the search terms occur across multiple keys in a document.
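For the multi-key case, a small helper could flatten the relevant fields into a single array of terms before suffix generation. A minimal sketch (the function name `collectTerms` and its field list are our own, not part of the customer's schema):

```javascript
// Gather search terms from several fields of a document into one array,
// ready for suffix generation. Handles both string and array-of-string fields.
function collectTerms(doc, fields) {
    const terms = [];
    for (const field of fields) {
        const value = doc[field];
        if (Array.isArray(value)) {
            terms.push(...value.filter(v => typeof v === "string"));
        } else if (typeof value === "string") {
            terms.push(value);
        }
    }
    return terms;
}

collectTerms({ keys: ["Zd57aDDl", "t6FBgHZl"], title: "monitor" }, ["keys", "title"]);
// → ["Zd57aDDl", "t6FBgHZl", "monitor"]
```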

Approach #1 - Using $text

Let's create a text index on the keys field:

> db.items.createIndex({"keys":"text"})
{
       	"createdCollectionAutomatically" : false,
       	"numIndexesBefore" : 3,
       	"numIndexesAfter" : 4,
       	"ok" : 1
}
> db.items.getIndexes()
[
       	{
       		"v" : 1,
       		"key" : {
       			"_id" : 1
       		},
       		"name" : "_id_",
       		"ns" : "test.items"
       	},
       	{
       		"v" : 1,
       		"key" : {
       			"_fts" : "text",
       			"_ftsx" : 1
       		},
       		"name" : "keys_text",
       		"ns" : "test.items",
       		"weights" : {
       			"keys" : 1
       		},
       		"default_language" : "english",
       		"language_override" : "language",
       		"textIndexVersion" : 3
       	}
]

Let's run a search:

> db.items.find({"keys": /.*fbg.*/i }).count()
348

But,

> db.items.find({"keys": /.*fbg.*/ }).explain("executionStats")
...
    	"executionStats" : {
       		"executionSuccess" : true,
       		"nReturned" : 42,
       		"executionTimeMillis" : 4471,
       		"totalKeysExamined" : 1600000,
       		"totalDocsExamined" : 800000,
...
       			"inputStage" : {
       				"stage" : "IXSCAN",
       				"nReturned" : 800000,
       				"executionTimeMillisEstimate" : 1540,
       				"works" : 1600001,
       				"advanced" : 800000,
       				"needTime" : 800000,
       				"needYield" : 0,
       				"saveState" : 12500,
       				"restoreState" : 12500,
       				"isEOF" : 1,
       				"invalidates" : 0,
       				"keyPattern" : {
       					"keys" : 1
       				},
       				"indexName" : "keys_1",
...
       					"keys" : [
       						"[\"\", {})",
       						"[/.*fbg.*/i, /.*fbg.*/i]"
       					]
       				},

This is not so great. Since this was a regular-expression query, MongoDB could not use the text index at all; it fell back to the regular index on keys (note the indexName of keys_1 in the explain output) and still had to examine all 1,600,000 index entries (our sample has 2 values in keys for each document). The entire index must be scanned because our regular expression is not anchored with a "starts with". We also want to support partial matches in our search, but text indexes in MongoDB do not support partial words; they only support stemming (running = run). Lastly, note the runtime of ~4.5 seconds, which is not acceptable from a usability perspective.
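To see why anchoring matters, here is a small stand-alone illustration (plain JavaScript, not MongoDB internals): a B-tree index keeps its keys sorted, so a "starts with" predicate maps to one contiguous range of entries, while an unanchored regex forces a test of every entry.

```javascript
// A sorted array stands in for the sorted keys of a B-tree index.
const index = ["ADDL", "BGHZL", "FBGHZL", "FBGX", "GHZL", "ZD57ADDL"];

// Binary search for the first entry >= target.
function lowerBound(arr, target) {
    let lo = 0, hi = arr.length;
    while (lo < hi) {
        const mid = (lo + hi) >> 1;
        if (arr[mid] < target) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// An anchored prefix becomes the range [prefix, prefix-with-last-char-bumped),
// e.g. "FBG" → ["FBG", "FBH") — exactly the index bounds explain() would show.
function prefixRange(arr, prefix) {
    const end = prefix.slice(0, -1) +
        String.fromCharCode(prefix.charCodeAt(prefix.length - 1) + 1);
    return arr.slice(lowerBound(arr, prefix), lowerBound(arr, end));
}

prefixRange(index, "FBG");               // touches only the matching range
index.filter(k => /FBG/.test(k));        // unanchored: must test all entries
```

Both expressions return the same matches here, but the anchored form inspects only the entries inside the range, while the unanchored form inspects every key.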

Approach #2 - Precomputing search terms

To optimize this we need to

  • Figure out all possible suffixes for each term. For example, suppose we have terms, "apple" and "monitor". Searches like "apple", "app", and "pLe" should all find the document with "apple". Similarly, searching "onIT" or "Itor" or "tor" should return "monitor".
  • Convert each suffix to upper case so we can support case-insensitive searches
  • Store these suffixes in another key in each document; we'll call this key suffixes
  • Create an index on suffixes

Sample Javascript code to do this is available: https://github.com/jamestyj/mongo-scratch/blob/master/blog/suffix-generator.js.
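The core of such a generator can be sketched in a few lines (a minimal sketch following the rules above; the linked script is the full version):

```javascript
// For each term, emit every upper-cased suffix of at least minLength
// characters (3, per the requirement above), de-duplicated across terms.
function generateSuffixes(terms, minLength = 3) {
    const suffixes = new Set();
    for (const term of terms) {
        const upper = term.toUpperCase();
        for (let i = 0; i <= upper.length - minLength; i++) {
            suffixes.add(upper.slice(i));
        }
    }
    return [...suffixes];
}

generateSuffixes(["t6FBgHZl"]);
// → ["T6FBGHZL", "6FBGHZL", "FBGHZL", "BGHZL", "GHZL", "HZL"]
```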

Once we do this, our documents look like:

> db.items.findOne()
{
       	"_id" : ObjectId("57d6b86e49eca9bd7d2620e5"),
       	"keys" : [
       		"Zd57aDDl",
       		"t6FBgHZl"
       	],
       	"suffixes" : [
       		"T6FBGHZL",
       		"6FBGHZL",
       		"FBGHZL",
       		"BGHZL",
       		"GHZL",
       		"HZL",
       		"ZD57ADDL",
       		"D57ADDL",
       		"57ADDL",
       		"7ADDL",
       		"ADDL",
       		"DDL"
       	]
}

Let's try the same search again:

> db.items.find({"suffixes":/^FBG/}).count()
348
> db.items.find({"suffixes":/^FBG/}).explain("executionStats")
{
...
      	"executionStats" : {
       		"executionSuccess" : true,
       		"nReturned" : 348,
       		"executionTimeMillis" : 1,
       		"totalKeysExamined" : 349,
       		"totalDocsExamined" : 348,
...
       				"keyPattern" : {
       					"suffixes" : 1
       				},
       				"indexName" : "suffixes_1",
       				"isMultiKey" : true,
...
       				"direction" : "forward",
       				"indexBounds" : {
       					"suffixes" : [
       						"[\"FBG\", \"FBH\")",
       						"[/^FBG/, /^FBG/]"
       					]
       				},
...
      	"ok" : 1
}

This now runs in 1 ms versus 4,471 ms - over 4,000 times faster.
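On the query side, the user's raw input needs to be upper-cased, length-checked, and anchored before it hits the suffixes index. A hedged sketch (the helper name `buildSuffixQuery` is our own):

```javascript
// Turn raw user input into the anchored, upper-cased prefix query that
// can use the suffixes index. Enforces the 3-character minimum and
// escapes regex metacharacters so input is matched literally.
function buildSuffixQuery(input) {
    if (typeof input !== "string" || input.length < 3) {
        throw new Error("search term must be at least 3 characters");
    }
    const escaped = input.toUpperCase().replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return { suffixes: new RegExp("^" + escaped) };
}

// Usage in the shell or a driver:
//   db.items.find(buildSuffixQuery("fBg")).count()   // → 348 in our sample data
```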

The trade-off here is storage size:

> db.items.stats(1024*1024)
...
       	"indexSizes" : {
       		"_id_" : 6,
       		"keys_1" : 21,
       		"suffixes_1" : 90,
       		"keys_text" : 34
       	}
...

But storage is cheap. To implement this, we need only compute the suffixes on each write to the items collection.
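That write-path step can be kept as a small pure function applied before every insert or replace (the name `withSuffixes` is ours, not from the linked script):

```javascript
// Recompute suffixes from keys before writing the document, so the
// suffixes index always reflects the current search terms.
function withSuffixes(doc) {
    const suffixes = new Set();
    for (const term of doc.keys || []) {
        const upper = term.toUpperCase();
        for (let i = 0; i <= upper.length - 3; i++) {
            suffixes.add(upper.slice(i));
        }
    }
    return { ...doc, suffixes: [...suffixes] };
}

// Usage (mongo shell or Node driver):
//   db.items.insertOne(withSuffixes({ keys: ["Zd57aDDl", "t6FBgHZl"] }));
```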

This concludes this note.
