Jason Mimick edited this page Sep 12, 2016 · 4 revisions

A customer has collections containing a variety of data elements and wants to provide a fast "Google-like" search capability. We want to support case-insensitive and partial matching. Additionally, we will require at least 3 characters for a search term. There are multiple strategies for implementing this kind of scenario. In this note, we compare and contrast two options:

  • using $text indexes
  • precomputing an optimized set of search terms.

tl;dr: As of 3.2.x, the precomputed option is much faster.

References: http://jam.sg/blog/efficient-partial-keyword-searches/

We will start with a collection like this:

> db.items.findOne()
{
       	"_id" : ObjectId("57d6b86e49eca9bd7d2620e5"),
       	"keys" : [
       		"Zd57aDDl",
       		"t6FBgHZl"
       	]
}

Here, the keys key contains the items we want to search over. In our example, we have the search terms in an array, but this example can be easily modified for cases where the search terms occur across multiple keys in a document.
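For the multi-key case, a small helper could flatten the relevant fields into a single array of terms before suffix generation. A minimal sketch (the function name `collectTerms` and its field list are our own, not part of the customer's schema):

```javascript
// Gather search terms from several fields of a document into one array,
// ready for suffix generation. Handles both string and array-of-string fields.
function collectTerms(doc, fields) {
    const terms = [];
    for (const field of fields) {
        const value = doc[field];
        if (Array.isArray(value)) {
            terms.push(...value.filter(v => typeof v === "string"));
        } else if (typeof value === "string") {
            terms.push(value);
        }
    }
    return terms;
}

collectTerms({ keys: ["Zd57aDDl", "t6FBgHZl"], title: "monitor" }, ["keys", "title"]);
// → ["Zd57aDDl", "t6FBgHZl", "monitor"]
```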

Approach #1 - Using $text

Let's create a text index on the keys field:

> db.items.createIndex({"keys":"text"})
{
       	"createdCollectionAutomatically" : false,
       	"numIndexesBefore" : 3,
       	"numIndexesAfter" : 4,
       	"ok" : 1
}
> db.items.getIndexes()
[
       	{
       		"v" : 1,
       		"key" : {
       			"_id" : 1
       		},
       		"name" : "_id_",
       		"ns" : "test.items"
       	},
       	{
       		"v" : 1,
       		"key" : {
       			"_fts" : "text",
       			"_ftsx" : 1
       		},
       		"name" : "keys_text",
       		"ns" : "test.items",
       		"weights" : {
       			"keys" : 1
       		},
       		"default_language" : "english",
       		"language_override" : "language",
       		"textIndexVersion" : 3
       	}
]

Let's run a search:

> db.items.find({"keys": /.*fbg.*/i }).count()
348

But,

> db.items.find({"keys": /.*fbg.*/ }).explain("executionStats")
...
    	"executionStats" : {
       		"executionSuccess" : true,
       		"nReturned" : 42,
       		"executionTimeMillis" : 4471,
       		"totalKeysExamined" : 1600000,
       		"totalDocsExamined" : 800000,
...
       			"inputStage" : {
       				"stage" : "IXSCAN",
       				"nReturned" : 800000,
       				"executionTimeMillisEstimate" : 1540,
       				"works" : 1600001,
       				"advanced" : 800000,
       				"needTime" : 800000,
       				"needYield" : 0,
       				"saveState" : 12500,
       				"restoreState" : 12500,
       				"isEOF" : 1,
       				"invalidates" : 0,
       				"keyPattern" : {
       					"keys" : 1
       				},
       				"indexName" : "keys_1",
...
       					"keys" : [
       						"[\"\", {})",
       						"[/.*fbg.*/i, /.*fbg.*/i]"
       					]
       				},

This is not so great. Since this was a regular-expression query, MongoDB could not use the text index at all; it fell back to the regular index on keys (note the indexName of keys_1 in the explain output) and still had to examine all 1,600,000 index entries (our sample has 2 values in keys for each document). The entire index must be scanned because our regular expression is not anchored with a "starts with". We also want to support partial matches in our search, but text indexes in MongoDB do not support partial words; they only support stemming (running = run). Lastly, note the runtime of ~4.5 seconds, which is not acceptable from a usability perspective.
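To see why anchoring matters, here is a small stand-alone illustration (plain JavaScript, not MongoDB internals): a B-tree index keeps its keys sorted, so a "starts with" predicate maps to one contiguous range of entries, while an unanchored regex forces a test of every entry.

```javascript
// A sorted array stands in for the sorted keys of a B-tree index.
const index = ["ADDL", "BGHZL", "FBGHZL", "FBGX", "GHZL", "ZD57ADDL"];

// Binary search for the first entry >= target.
function lowerBound(arr, target) {
    let lo = 0, hi = arr.length;
    while (lo < hi) {
        const mid = (lo + hi) >> 1;
        if (arr[mid] < target) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// An anchored prefix becomes the range [prefix, prefix-with-last-char-bumped),
// e.g. "FBG" → ["FBG", "FBH") — exactly the index bounds explain() would show.
function prefixRange(arr, prefix) {
    const end = prefix.slice(0, -1) +
        String.fromCharCode(prefix.charCodeAt(prefix.length - 1) + 1);
    return arr.slice(lowerBound(arr, prefix), lowerBound(arr, end));
}

prefixRange(index, "FBG");               // touches only the matching range
index.filter(k => /FBG/.test(k));        // unanchored: must test all entries
```

Both expressions return the same matches here, but the anchored form inspects only the entries inside the range, while the unanchored form inspects every key.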

Approach #2 - Precomputing search terms

To optimize this we need to

  • Figure out all possible suffixes for each term. For example, suppose we have terms, "apple" and "monitor". Searches like "apple", "app", and "pLe" should all find the document with "apple". Similarly, searching "onIT" or "Itor" or "tor" should return "monitor".
  • Convert each suffix to upper case so we can support case-insensitive searches
  • Store these suffixes in another key in each document; we'll call this key suffixes
  • Create an index on suffixes

Sample Javascript code to do this is available: https://github.com/jamestyj/mongo-scratch/blob/master/blog/suffix-generator.js.
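The core of such a generator can be sketched in a few lines (a minimal sketch following the rules above; the linked script is the full version):

```javascript
// For each term, emit every upper-cased suffix of at least minLength
// characters (3, per the requirement above), de-duplicated across terms.
function generateSuffixes(terms, minLength = 3) {
    const suffixes = new Set();
    for (const term of terms) {
        const upper = term.toUpperCase();
        for (let i = 0; i <= upper.length - minLength; i++) {
            suffixes.add(upper.slice(i));
        }
    }
    return [...suffixes];
}

generateSuffixes(["t6FBgHZl"]);
// → ["T6FBGHZL", "6FBGHZL", "FBGHZL", "BGHZL", "GHZL", "HZL"]
```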

Once we do this, our documents look like:

> db.items.findOne()
{
       	"_id" : ObjectId("57d6b86e49eca9bd7d2620e5"),
       	"keys" : [
       		"Zd57aDDl",
       		"t6FBgHZl"
       	],
       	"suffixes" : [
       		"T6FBGHZL",
       		"6FBGHZL",
       		"FBGHZL",
       		"BGHZL",
       		"GHZL",
       		"HZL",
       		"ZD57ADDL",
       		"D57ADDL",
       		"57ADDL",
       		"7ADDL",
       		"ADDL",
       		"DDL"
       	]
}

Let's try the same search again:

> db.items.find({"suffixes":/^FBG/}).count()
348
> db.items.find({"suffixes":/^FBG/}).explain("executionStats")
{
...
      	"executionStats" : {
       		"executionSuccess" : true,
       		"nReturned" : 348,
       		"executionTimeMillis" : 1,
       		"totalKeysExamined" : 349,
       		"totalDocsExamined" : 348,
...
       				"keyPattern" : {
       					"suffixes" : 1
       				},
       				"indexName" : "suffixes_1",
       				"isMultiKey" : true,
...
       				"direction" : "forward",
       				"indexBounds" : {
       					"suffixes" : [
       						"[\"FBG\", \"FBH\")",
       						"[/^FBG/, /^FBG/]"
       					]
       				},
...
      	"ok" : 1
}

This now runs in 1 ms versus 4,471 ms - over 4,000 times faster.
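On the query side, the user's raw input needs to be upper-cased, length-checked, and anchored before it hits the suffixes index. A hedged sketch (the helper name `buildSuffixQuery` is our own):

```javascript
// Turn raw user input into the anchored, upper-cased prefix query that
// can use the suffixes index. Enforces the 3-character minimum and
// escapes regex metacharacters so input is matched literally.
function buildSuffixQuery(input) {
    if (typeof input !== "string" || input.length < 3) {
        throw new Error("search term must be at least 3 characters");
    }
    const escaped = input.toUpperCase().replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return { suffixes: new RegExp("^" + escaped) };
}

// Usage in the shell or a driver:
//   db.items.find(buildSuffixQuery("fBg")).count()   // → 348 in our sample data
```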

The trade-off here is storage size:

> db.items.stats(1024*1024)
...
       	"indexSizes" : {
       		"_id_" : 6,
       		"keys_1" : 21,
       		"suffixes_1" : 90,
       		"keys_text" : 34
       	}
...

But storage is cheap. To implement this, we need only compute the suffixes on each write to the items collection.
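That write-path step can be kept as a small pure function applied before every insert or replace (the name `withSuffixes` is ours, not from the linked script):

```javascript
// Recompute suffixes from keys before writing the document, so the
// suffixes index always reflects the current search terms.
function withSuffixes(doc) {
    const suffixes = new Set();
    for (const term of doc.keys || []) {
        const upper = term.toUpperCase();
        for (let i = 0; i <= upper.length - 3; i++) {
            suffixes.add(upper.slice(i));
        }
    }
    return { ...doc, suffixes: [...suffixes] };
}

// Usage (mongo shell or Node driver):
//   db.items.insertOne(withSuffixes({ keys: ["Zd57aDDl", "t6FBgHZl"] }));
```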

This concludes this note.
