Apache Solr Multilingual

The goal of this project is to extend Apache Solr Search Integration for Drupal in a clean way to provide

  • better support for non-English languages
  • support for multilingual search
  • cross-language information retrieval (CLIR)
  • an easy to use administration interface

Apache Solr Multilingual will not replace Apache Solr Search Integration but builds on it.

Motivation

Apache Solr Search Integration provides a (more or less) easy way to use Apache Solr as a powerful search engine for Drupal. Unfortunately, the only language that works well with the module out of the box is English.

So if you run a non-English website, you need to tweak all the configuration files by hand, or you will lose some of the advantages that Solr offers over Drupal's built-in, database-driven search. That tweaking requires deep knowledge of Solr and of search technology in general.

The entire process gets even more complicated if you run a multilingual website.

That is why we started thinking about an additional module called Apache Solr Multilingual to hide most of the complexity from the site administrator of a Drupal website.

Language Specific Problems

  • Stop Words
    Words you want to exclude from your search index are called stop words. This list of words depends strongly on the focus of your website and, of course, on your site's language.
  • Stemming
    Every word in the search index is stored in a reduced form called the word stem. This lets users find content regardless of the inflection of the keyword, e.g. singular or plural. Unfortunately, stemming algorithms differ from language to language.
  • Protected Words
    In some cases you want to exclude certain words from the stemming described above. Like stop words, these protected words are language specific.
  • Compound Word Splitting
    Languages like German combine words into compounds (e.g. "Dampfschifffahrt"). To make the individual parts searchable, you need to split such words using language-specific word catalogs.
  • Spell Checking
    Spell checking has to be language specific as well, because spelling suggestions only make sense within the vocabulary of one language.
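
To make the first four problems concrete, here is a minimal Python sketch of a per-language analysis chain. It is only an illustration of the concepts, not the code Solr or this module uses: the word lists, the suffix rules, and the function names (stem, split_compound, analyze) are all assumptions made up for this example, and real stemmers and compound splitters are considerably more sophisticated.

```python
# Toy analysis chain: stop words, protected words, stemming, compound splitting.
# All word lists and suffix rules are illustrative assumptions, not the lists
# or algorithms shipped with Solr or with Apache Solr Multilingual.

STOP_WORDS = {
    "en": {"the", "a", "of", "and"},
    "de": {"der", "die", "das", "und"},
}

# Words that must never be stemmed.
PROTECTED_WORDS = {
    "en": {"news"},
    "de": set(),
}

# Suffixes stripped by the toy stemmer, tried in this order.
SUFFIXES = {
    "en": ["ies", "es", "s"],
    "de": ["en", "er", "e"],
}

# Word catalog used to split compounds.
COMPOUND_PARTS = {
    "de": {"dampf", "schiff", "fahrt"},
}


def stem(word, lang):
    """Strip the first matching suffix unless the word is protected."""
    if word in PROTECTED_WORDS.get(lang, set()):
        return word
    for suffix in SUFFIXES.get(lang, []):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_compound(word, lang):
    """Greedily split a word into catalog parts; return [word] if that fails."""
    catalog = COMPOUND_PARTS.get(lang, set())
    parts, rest = [], word
    while rest:
        # Longest prefix of the remaining text that is in the catalog.
        match = next(
            (rest[:i] for i in range(len(rest), 0, -1) if rest[:i] in catalog),
            None,
        )
        if match is None:
            return [word]
        parts.append(match)
        rest = rest[len(match):]
    return parts


def analyze(text, lang):
    """Lowercase, drop stop words, split compounds, then stem each part."""
    tokens = []
    for raw in text.lower().split():
        if raw in STOP_WORDS.get(lang, set()):
            continue
        for part in split_compound(raw, lang):
            tokens.append(stem(part, lang))
    return tokens


print(analyze("The news of the ships", "en"))   # ['news', 'ship']
print(analyze("Die Dampfschifffahrt", "de"))    # ['dampf', 'schiff', 'fahrt']
```

Note how "news" survives only because it is protected (the toy stemmer would otherwise reduce it to "new"), and how every table is keyed by language: running English text through the German settings, or the other way round, gives poor results.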

Apache Solr Multilingual tries to solve the language specific problems described above out of the box or supports the site administrator by providing a user interface that hides some of the complexity.

Additionally, Apache Solr Multilingual provides a way to offer language-specific searches for several languages at once on multilingual websites. To do so, it integrates with Drupal's standard multilingual features provided by core modules and the Internationalization module.

As a special feature, Apache Solr Multilingual can be configured to handle the translations of nodes and taxonomies on multilingual sites. That means you can find content in any language, no matter which language was used to enter the search phrase. This is a simple implementation of CLIR, and we plan to extend it.
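
One way to picture translation-based CLIR is as query expansion: each search term is replaced by the set of its known translations. The Python sketch below illustrates that idea only. It is not how the module is implemented, and the translation table and the function names (expand_term, expand_query) are hypothetical; on a real site the translations would come from translated nodes and taxonomy terms.

```python
# Hypothetical translation sets: each entry groups terms that are
# translations of one another (e.g. translated taxonomy terms).
TRANSLATION_SETS = [
    {"en": "steamship", "de": "dampfschiff"},
    {"en": "journey", "de": "fahrt", "fr": "voyage"},
]


def expand_term(term):
    """Return the term together with all known translations, sorted."""
    term = term.lower()
    for translations in TRANSLATION_SETS:
        if term in translations.values():
            return sorted(set(translations.values()))
    return [term]


def expand_query(query):
    """Build one OR group per search term, joined with AND."""
    clauses = []
    for term in query.split():
        quoted = [f'"{variant}"' for variant in expand_term(term)]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


print(expand_query("Fahrt"))
# ("fahrt" OR "journey" OR "voyage")
```

With this expansion, a German search for "Fahrt" also matches English and French content, regardless of the language the phrase was typed in.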

Current State

We have released a third alpha version to gather feedback. Please note that apachesolr_multilingual 6.x-1.x depends on apachesolr 6.x-1.x.
apachesolr 6.x-2.x, which is currently in beta, will be supported by apachesolr_multilingual 6.x-2.x.

See README.txt for installation instructions.
