Sunday, December 9, 2018

Stemming Search Terms in Sitecore Solr Indexes

This is a copy of my article, initially published here. I am reposting it to keep everything in one place.

I wrote about stemming in Sitecore Lucene content search in my previous article. As a reminder: stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root form. It lets your search return more relevant results: with stemming, a query for "running" also matches documents that contain "run" or "runs". That makes stemming a good and easy way to improve your search.

Configuring stemming in Solr is even easier than configuring it in Sitecore Lucene Content Search. You don’t need to write a single line of code. All you need is configuration.

The configuration of each Solr core contains a schema.xml file. When you open it, you will see that it defines the field type text_en:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EnglishPossessiveFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EnglishPossessiveFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldType>


It contains the solr.PorterStemFilterFactory filter, which stems both your indexed documents and your queries. Compared to Lucene.Net, you have more choice: there are three stemmers available for English (Porter, Lovins, and Porter2). You can also stem documents in other languages: Armenian, Basque, Catalan, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
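
The alternative stemmers and the other languages are available through solr.SnowballPorterFilterFactory, which selects the stemmer with its language attribute. Here is a minimal sketch of how that could look. It is not part of the stock schema: the field type name text_de_stemmed is hypothetical, and the analyzer chain is cut down to the essentials. To my knowledge, language="English" selects Porter2, while "Porter" and "Lovins" select the other two English stemmers.

```xml
<!-- Hypothetical field type: the name text_de_stemmed is an assumption, not part of the stock schema -->
<fieldType name="text_de_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The language attribute picks the Snowball stemmer, e.g. German, French, Russian or English (Porter2) -->
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldType>
```

A single analyzer element without a type attribute applies to both indexing and querying, so documents and search terms are stemmed the same way.
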
To use stemming on a field, change its type from text_general to text_en:

<field name="_content" type="text_en" indexed="true" stored="false" />

Then restart Solr and rebuild your indexes. This one small configuration change can improve search quality on your website.
