04 August 2017

How to Use Full-Text Search in MongoDB?

By Hidora

MongoDB, one of the leading NoSQL databases, is well known for its fast performance, flexible schema, scalability and great indexing capabilities. At the core of this performance are MongoDB indexes, which support efficient query execution by avoiding full-collection scans and so limiting the number of documents MongoDB has to examine.


MongoDB introduced full-text search with text indexes as an experimental feature in version 2.4. The feature has since become an integral part of the product and is no longer experimental. With MongoDB full-text search, you can define a text index on any field in the document whose value is a string or an array of strings. When you create a text index on a field, MongoDB tokenizes and stems the indexed field’s text content and builds the index entries accordingly.


In this tutorial we are going to explore the full-text search functionality of MongoDB.


Creation of MongoDB server on HIDORA


First of all we need to install MongoDB, so let’s see how quickly and easily MongoDB can be installed on Hidora PaaS:

- Log in to the Hidora dashboard with your credentials.

- Click Create Environment in the upper left corner of the dashboard.




- In the Environment topology dialog, pick MongoDB as the database you want to use (it can be found in the NoSQL databases drop-down list). Set the cloudlet limits for this node, type a name for your environment and confirm the creation.




- Wait a minute until the process is completed.





Connection to MongoDB with SSH

Now let’s see how you can access your Hidora account with all of its environments and containers.


Note. SSH access is provided to the whole account, not to a separate environment.


- Open the Hidora dashboard and navigate to the upper toolbar.

- Click the Settings button.




- In the opened Account settings tab, navigate to the SSH Keychain > Public option.


Note. This option is available only to billing customers. If you need this access during the trial period, just let us know and we’ll grant it to you.


- Click the link in the note to open your SSH gate. As a result, you’ll access Shell Handler via console automatically. Alternatively, copy the given command line and run it in your console (SSH client).



Creation of Sample data


Data in MongoDB has a flexible schema. Unlike SQL databases, where you must determine and declare a table’s schema before inserting data, MongoDB’s collections do not enforce document structure. This flexibility facilitates the mapping of documents to an entity or an object. Each document can match the data fields of the represented entity, even if the data has substantial variation. In practice, however, the documents in a collection share a similar structure.


The key challenge in data modeling is balancing the needs of the application, the performance characteristics of the database engine, and the data retrieval patterns. When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.


MongoDB stores data records as BSON documents. BSON is a binary representation of JSON documents, though it contains more data types than JSON. For the BSON spec, see bsonspec.org.




MongoDB stores BSON documents, i.e. data records, in collections, and collections in databases. In other words, databases hold collections of documents.




To select a database to use, in the mongo shell, issue the use <db> statement, as in the following example:

    use myDB

Create a Database


If a database does not exist, MongoDB creates the database when you first store data for that database. As such, you can switch to a non-existent database and perform the following operation in the mongo shell:

    use myNewDB
    db.myNewCollection1.insertOne( { x: 1 } )

The insertOne() operation creates both the database myNewDB and the collection myNewCollection1 if they do not already exist.


MongoDB stores documents in collections. Collections are analogous to tables in relational databases.


Create a Collection


If a collection does not exist, MongoDB creates the collection when you first store data for that collection.

    db.myNewCollection2.insertOne( { x: 1 } ) 
    db.myNewCollection3.createIndex( { y: 1 } )

Both the insertOne() and the createIndex() operations create their respective collections if they do not already exist.


Explicit Creation


MongoDB provides the db.createCollection() method to explicitly create a collection with various options, such as setting the maximum size or the document validation rules. If you do not need these options, you do not have to create the collection explicitly, since MongoDB creates new collections when you first store data in them.
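As a minimal sketch of explicit creation, the helper below creates a capped collection with a maximum size. The collection name "events" and the 1 MiB limit are hypothetical values chosen for illustration. The function takes the database handle as a parameter, so it can be pasted into the mongo shell and called with the global db object.

```javascript
// Sketch only: "events" and the 1 MiB size are illustrative assumptions.
function createCappedCollection(db) {
  // capped: true together with size (in bytes) sets the maximum size
  return db.createCollection("events", { capped: true, size: 1024 * 1024 });
}

// In the mongo shell: createCappedCollection(db)
```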


Searching documents


MongoDB 3.2 introduced version 3 of the text index.

MongoDB provides text indexes to support text search queries on string content. Text indexes can include any field whose value is a string or an array of string elements.


Create Text Index


IMPORTANT: A collection can have at most one text index.


To create a text index, use the db.collection.createIndex() method. To index a field that contains a string or an array of string elements, include the field and specify the string literal "text" in the index document, as in the following example:

    db.reviews.createIndex( { comments: "text" } )

You can index multiple fields for the text index. The following example creates a text index on the fields subject and comments:

    db.reviews.createIndex(
      {
        subject: "text",
        comments: "text"
      }
    )

A compound index can include text index keys in combination with ascending/descending index keys. In order to drop a text index, use the index name.


Specify Weights


For a text index, the weight of an indexed field denotes the significance of the field relative to the other indexed fields in terms of the text search score.


For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document.
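The weighted sum described above can be modelled in a few lines of plain JavaScript. This is a simplified illustration of the description only, not MongoDB's actual scoring code; the function name and the sample match counts are hypothetical.

```javascript
// Simplified model of the weighted sum, NOT MongoDB's real scoring formula.
function weightedMatchSum(matchesPerField, weights) {
  let sum = 0;
  for (const field of Object.keys(matchesPerField)) {
    // Fields without an explicit weight use a weight of 1
    const weight = weights[field] !== undefined ? weights[field] : 1;
    sum += matchesPerField[field] * weight;
  }
  return sum;
}

// One match in subject (weight 10) and two matches in comments (weight 1)
console.log(weightedMatchSum({ subject: 1, comments: 2 }, { subject: 10 })); // 12
```

MongoDB derives the final score from such a sum, so a match in a heavily weighted field moves a document further up the ranking than a match in a field with the default weight.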


The default weight is 1 for the indexed fields. To adjust the weights for the indexed fields, include the weights option in the db.collection.createIndex() method.
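A minimal sketch of the weights option follows. The weight values 10 and 5 and the index name "ReviewTextIndex" are assumptions made for illustration; the reviews collection and its subject and comments fields are taken from the earlier example.

```javascript
// Sketch only: the weights and the index name are illustrative assumptions.
function createWeightedTextIndex(db) {
  return db.reviews.createIndex(
    { subject: "text", comments: "text" },
    {
      // A match in subject counts twice as much as a match in comments
      weights: { subject: 10, comments: 5 },
      name: "ReviewTextIndex"
    }
  );
}

// In the mongo shell: createWeightedTextIndex(db)
```

Because a collection can have only one text index, an existing text index on reviews has to be dropped before this one can be created.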


Wildcard Text Indexes


When creating a text index on multiple fields, you can also use the wildcard specifier ($**). With a wildcard text index, MongoDB indexes every field that contains string data for each document in the collection. The following example creates a text index using the wildcard specifier:

    db.collection.createIndex( { "$**": "text" } )

This index allows for text search on all fields with string content. Such an index can be useful with highly unstructured data if it is unclear which fields to include in the text index or for ad-hoc querying.


Wildcard text indexes are text indexes on multiple fields. As such, you can assign weights to specific fields during index creation to control the ranking of the results.


Wildcard text indexes, as with all text indexes, can be part of a compound index. For example, the following creates a compound index on the field a as well as the wildcard specifier:

    db.collection.createIndex( { a: 1, "$**": "text" } )

As with all compound text indexes, because the field a precedes the text index key, the query predicate must include an equality match condition on a in order to perform a $text search with this index.
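A query against this compound index could look like the following sketch. The helper name and its parameters are hypothetical; db.collection stands for your own collection, as in the index example above.

```javascript
// Sketch only: helper name and parameters are illustrative assumptions.
function searchWithEqualityOnA(db, aValue, searchTerms) {
  return db.collection.find({
    a: aValue,                       // equality match on the preceding key
    $text: { $search: searchTerms }  // text search on the indexed string fields
  });
}

// In the mongo shell: searchWithEqualityOnA(db, 1, "coffee")
```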


Case Insensitivity


The version 3 text index supports the common C, simple S, and, for Turkish languages, the special T case foldings as specified in the Unicode 8.0 Character Database Case Folding.


The case folding expands the case insensitivity of the text index to include characters with diacritics, such as é and É, and characters from non-Latin alphabets, such as “И” and “и” in the Cyrillic alphabet.


Version 3 of the text index is also diacritic insensitive. As such, the index also does not distinguish between é, É, e, and E.


Previous versions of the text index are case insensitive for [A-z] only; i.e. case insensitive for non-diacritics Latin characters only. For all other characters, earlier versions of the text index treat them as distinct.


Diacritic Insensitivity


With version 3, the text index is diacritic insensitive. That is, the index does not distinguish between characters that contain diacritical marks and their non-marked counterparts, such as é, ê, and e. More specifically, the text index strips the characters categorized as diacritics in the Unicode 8.0 Character Database Prop List.


Version 3 of the text index is also case insensitive to characters with diacritics. As such, the index also does not distinguish between é, É, e, and E.


Previous versions of the text index treat characters with diacritics as distinct.
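The combined effect of case and diacritic insensitivity can be approximated in plain JavaScript (runnable in Node.js). This is only an illustration: it uses NFD normalization, removal of combining marks and toLowerCase() as stand-ins, whereas the index itself relies on the Unicode case folding and diacritic tables mentioned above. The function name is hypothetical.

```javascript
// Approximation only: NFD decomposition plus removal of combining marks
// (U+0300 to U+036F) stands in for the Unicode-based folding of the index.
function foldForComparison(text) {
  return text
    .normalize("NFD")                 // split "é" into "e" + combining accent
    .replace(/[\u0300-\u036f]/g, "")  // strip the combining marks
    .toLowerCase();                   // approximate case folding
}

console.log(foldForComparison("É") === foldForComparison("e")); // true
console.log(foldForComparison("И") === foldForComparison("и")); // true
```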


Tokenization Delimiters


For tokenization, the version 3 text index uses the delimiters categorized under Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space in the Unicode 8.0 Character Database Prop List.


For example, if given a string "Il a dit qu'il «était le meilleur joueur du monde»", the text index treats «, », and spaces as delimiters.


Previous versions of the index treat « as part of the term "«était" and » as part of the term "monde»".
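The difference can be reproduced with a small JavaScript sketch. It is deliberately simplified: the first function splits only on the delimiters named in the example («, », and white space) rather than on the full Unicode categories, and the second models earlier index versions as splitting on white space only, which is an assumption made for illustration. Both function names are hypothetical.

```javascript
// Simplified: only the delimiters named in the example («, », white space).
function tokenizeV3Like(text) {
  return text.split(/[«»\s]+/).filter(Boolean);
}

// Assumption for illustration: earlier versions modelled as white space only,
// so « and » stay attached to the neighbouring term.
function tokenizeOlderLike(text) {
  return text.split(/\s+/).filter(Boolean);
}

const sentence = "Il a dit qu'il «était le meilleur joueur du monde»";
console.log(tokenizeV3Like(sentence));    // contains "était" and "monde"
console.log(tokenizeOlderLike(sentence)); // contains "«était" and "monde»"
```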


Index Entries


The text index tokenizes and stems the terms in the indexed fields for the index entries. The text index stores one index entry for each unique stemmed term in each indexed field for each document in the collection. The index uses simple language-specific suffix stemming.


Supported Languages and Stop Words


MongoDB supports text search for various languages. Text indexes drop language-specific stop words (e.g. in English, the, an, a, and, etc.) and use simple language-specific suffix stemming. For a list of the supported languages, see Text Search Languages.


If you specify a language value of "none", then the text index uses simple tokenization with no list of stop words and no stemming.
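The language is set with the default_language option of createIndex(). The following sketch builds a text index without stop words and stemming; it reuses the reviews collection and comments field from the earlier examples, and the helper name is hypothetical.

```javascript
// Sketch only: reuses the reviews/comments names from the earlier examples.
function createLanguageNeutralTextIndex(db) {
  return db.reviews.createIndex(
    { comments: "text" },
    { default_language: "none" } // simple tokenization, no stop words, no stemming
  );
}

// In the mongo shell: createLanguageNeutralTextIndex(db)
```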


Sparse Property


Text indexes are sparse by default and ignore the sparse: true option. If a document lacks a text index field (or the field is null or an empty array), MongoDB does not add an entry for the document to the text index. For inserts, MongoDB inserts the document but does not add to the text index.


For a compound index that includes a text index key along with keys of other types, only the text index field determines whether the index references a document. The other keys do not determine whether the index references the documents or not.


Restrictions


One Text Index Per Collection. A collection can have at most one text index.


Text Search and Hints


You cannot use hint() if the query includes a $text query expression.


Text Index and Sort

Sort operations cannot obtain sort order from a text index, even from a compound text index; i.e. sort operations cannot use the ordering in the text index.


Compound Index

A compound index can include a text index key in combination with ascending/descending index keys. However, these compound indexes have the following restrictions:

- A compound text index cannot include any other special index types, such as multi-key or geospatial index fields.

- If the compound text index includes keys preceding the text index key, to perform a $text search, the query predicate must include equality match conditions on the preceding keys.


Drop a Text Index


To drop a text index, pass the name of the index to the db.collection.dropIndex() method. To get the name of the index, run the db.collection.getIndexes() method.
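Both steps can be combined in a small helper, sketched below. It assumes that getIndexes() reports a text index with the key { _fts: "text", _ftsx: 1 }, which is how the mongo shell displays text indexes; the helper name is hypothetical.

```javascript
// Sketch only: finds the text index by its reported key and drops it by name.
function dropTextIndex(collection) {
  const textIndex = collection.getIndexes().find(function (index) {
    return index.key && index.key._fts === "text";
  });
  if (!textIndex) {
    return null; // the collection has no text index
  }
  collection.dropIndex(textIndex.name);
  return textIndex.name;
}

// In the mongo shell: dropTextIndex(db.reviews)
```

If you know the name already, for example the default name comments_text for an index on the comments field, you can pass it to dropIndex() directly.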


Storage Requirements and Performance Costs


Text indexes have the following storage requirements and performance costs:

- Text indexes can be large. They contain one index entry for each unique post-stemmed word in each indexed field for each document inserted.

- Building a text index is very similar to building a large multi-key index and will take longer than building a simple ordered (scalar) index on the same data.

- When building a large text index on an existing collection, ensure that you have a sufficiently high limit on open file descriptors.

- Text indexes will impact insertion throughput because MongoDB must add an index entry for each unique post-stemmed word in each indexed field of each new source document.

- Additionally, text indexes do not store phrases or information about the proximity of words in the documents. As a result, phrase queries will run much more effectively when the entire collection fits in RAM.


Text Search Support


The text index supports $text query operations. For examples of text search, see the $text reference page. For examples of $text operations in aggregation pipelines, see Text Search in the Aggregation Pipeline.
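A typical $text query that also returns and sorts by the relevance score could be sketched as follows. It assumes a text index already exists on the reviews collection from the earlier examples; the helper name and the search terms are hypothetical.

```javascript
// Sketch only: assumes a text index on the reviews collection.
function searchReviews(db, searchTerms) {
  return db.reviews
    .find(
      { $text: { $search: searchTerms } },   // match documents containing the terms
      { score: { $meta: "textScore" } }      // project the relevance score
    )
    .sort({ score: { $meta: "textScore" } }); // most relevant documents first
}

// In the mongo shell: searchReviews(db, "great coffee")
```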


Is there a way to improve performance?

Text search has a number of limitations that you should keep in mind:

- The legacy text command from MongoDB 2.4 did not work well for really large datasets: all matches were returned as a single document, and the command did not support a “skip” parameter to retrieve results page by page. Even when projecting only the “_id” field, a huge set of matches was not returned in its entirety if the result exceeded MongoDB’s 16 MB per-document limit. Queries that use the $text operator with find() return a cursor instead and are not affected by this restriction.

- A compound text index cannot include any other special index type, such as multi-key or geospatial indexes.

- If your compound text index includes any index keys before the text index key, all queries must specify equality conditions on the preceding keys.

- Text indexes create overhead when inserting new documents, which reduces insertion throughput.

- Some queries, such as phrase searches, can be relatively slow.
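When the search is issued with the $text operator through find(), the result is a regular cursor, so large result sets can be read page by page with skip() and limit(). The sketch below assumes a text index on the reviews collection from the earlier examples; the helper name and parameters are hypothetical.

```javascript
// Sketch only: pages through $text results, best matches first.
function searchReviewsPage(db, searchTerms, pageNumber, pageSize) {
  return db.reviews
    .find({ $text: { $search: searchTerms } }, { score: { $meta: "textScore" } })
    .sort({ score: { $meta: "textScore" } })
    .skip((pageNumber - 1) * pageSize) // pageNumber starts at 1
    .limit(pageSize);
}

// In the mongo shell: searchReviewsPage(db, "coffee", 2, 20)
```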


Conclusion

MongoDB’s full-text search is not intended as a complete replacement for dedicated search engines such as Elastic, SOLR, etc. However, it can be used effectively in the majority of applications that are built with MongoDB today.