Tuesday, November 6, 2018

Stemming Search Terms in Sitecore Lucene Indexes

    This is a copy of my article, originally published here. I am reposting it to keep everything in one place.

    Sitecore content search is a great technology that lets you add search to your Sitecore website with minimal effort. One thing that has always disappointed me, though, is that it does not understand word forms. The singular and plural forms of a noun are stored as two separate terms in the index (e.g. “tool” and “tools”). The base, third-person and past-tense forms of a verb result in three different terms (e.g. “deny”, “denies” and “denied”). This makes search results worse: if you search for “deny”, documents that contain only “denies” or “denied” will not be found.

    There are a few ways to “fix” this. The first is to use the similarity parameter in the query: x => x.YourFieldName.Like("tools", 0.8f). It is a quick and dirty solution: content search will now return results that contain similar words. But there is another side to the coin: you will also get similar words where you do not expect them. E.g. a search for “Ireland” will return “Iceland” and “Island” in the results.
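
    To see why a fuzzy match can confuse place names, here is a minimal sketch of an edit-distance-based similarity. The FuzzyDemo class and its Similarity formula are assumptions made for illustration only; Lucene's own fuzzy scoring may normalise the distance differently.

```csharp
using System;

public static class FuzzyDemo
{
    // Classic dynamic-programming Levenshtein (edit) distance:
    // the number of single-character inserts, deletes and substitutions
    // needed to turn a into b.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Simplified similarity in [0, 1] (an assumption, not Lucene's exact formula):
    // one minus the edit distance divided by the shorter word's length.
    public static float Similarity(string a, string b)
    {
        int shorter = Math.Min(a.Length, b.Length);
        if (shorter == 0) return a.Length == b.Length ? 1f : 0f;
        return Math.Max(0f, 1f - (float)EditDistance(a, b) / shorter);
    }

    public static void Main()
    {
        // "ireland" and "iceland" differ by a single substitution (r -> c).
        Console.WriteLine(EditDistance("ireland", "iceland"));
        Console.WriteLine(Similarity("ireland", "iceland") >= 0.8f);
    }
}
```

    Under this simplified measure the two words are one edit apart, so they clear a 0.8 threshold even though they mean completely different things.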

    The other option is stemming. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem. There are different implementations of stemming algorithms; Lucene.Net includes an implementation of the Porter stemming algorithm, which can be used to extend Sitecore content search. We need to implement our own analyzer:
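using Lucene.Net.Analysis;
using System.IO;

namespace Feature.StemmedSearch.Search
{
    // Analyzer that lowercases each token and then applies the Porter stemmer.
    public class PorterStemLowerCaseKeywordAnalyzer : KeywordAnalyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            // Filter chain, innermost first:
            // KeywordTokenizer -> LowerCaseFilter -> PorterStemFilter.
            // Note that KeywordTokenizer emits the whole input as a single token.
            return new PorterStemFilter(new LowerCaseFilter(new KeywordTokenizer(reader)));
        }
    }
}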
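
    A quick way to check what such a filter chain produces is to run a few words through it and print the resulting terms. The sketch below assumes the Lucene.Net 3.0.3 API (ITermAttribute and IncrementToken); StemProbe is a hypothetical helper name, not part of Sitecore or Lucene.Net.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

public static class StemProbe
{
    // Runs the same filter chain as the analyzer above over a single value
    // and returns the last term it emits.
    public static string Stem(string text)
    {
        TokenStream stream = new PorterStemFilter(
            new LowerCaseFilter(new KeywordTokenizer(new StringReader(text))));

        // Assumption: Lucene.Net 3.0.3, where the term text is exposed via ITermAttribute.
        ITermAttribute term = stream.AddAttribute<ITermAttribute>();

        string result = null;
        while (stream.IncrementToken())
        {
            result = term.Term;
        }
        return result;
    }

    public static void Main()
    {
        // Print the indexed form of each word so the stems can be compared by eye.
        foreach (var word in new[] { "tool", "tools", "deny", "denies", "denied" })
        {
            Console.WriteLine("{0} -> {1}", word, Stem(word));
        }
    }
}
```

    If the different forms of a word print the same term, they will be stored as the same term in the index and matched by the same query term.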
    Then register the field mapping in the content search configuration:
<fieldNames>
  <field fieldName="_content" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
    <analyzer type="Feature.StemmedSearch.Search.PorterStemLowerCaseKeywordAnalyzer, Feature.StemmedSearch" />
  </field>
</fieldNames>
    I used the _content field as an example, but it is better not to change the fields that Sitecore provides out of the box and to use your own custom fields instead. After rebuilding the indexes, you can see that all search terms are stored in stemmed form.
    If you turn on verbose logging and check the search queries, you will see that the query terms for the _content field are stemmed as well.
Hurray! Now our Sitecore website search behaves a bit more like Google search. :-)
Stay tuned: in the second part we will do the same for Solr indexes.