A Stemming Analyzer for Zend’s PHP Lucene

In my last post I spoke a little about Zend’s Lucene implementation in PHP, and its extensive usefulness for content-oriented PHP web applications. One of the roadblocks to implementing a Google-like search, however, was the absence of a stemming analyzer in the Zend package.

I came across this issue while developing a plug-in for WordPress with PHP Lucene. I wasn’t getting the relevancy I was looking for in test searches, so I decided to develop a stemming analyzer of my own. I decided that this analyzer should:

  • Stem words for greater search relevancy
  • Use the pre-existing Zend lowercase filter
  • Filter out a standard set of stop words using the Zend stop words filter
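
As a rough illustration of those three steps, here is a minimal sketch in plain PHP, independent of Zend. The function names (toy_stem, analyze), the tiny stop-word list, and the suffix rules are all made up for illustration. The suffix stripping is a crude stand-in, not the Porter algorithm, and none of this is the StandardAnalyzer's actual code.

```php
<?php
// Crude stand-in for a stemmer: strips a few common English suffixes.
// This is NOT the Porter algorithm; it only illustrates the idea.
function toy_stem(string $word): string
{
    foreach (['ing', 'es', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip when at least three letters would remain.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// Lowercase the text, drop stop words, then stem what is left.
function analyze(string $text, array $stopWords): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array_filter($tokens, function ($token) use ($stopWords) {
        return !in_array($token, $stopWords, true);
    });
    return array_values(array_map('toy_stem', $tokens));
}

// Hypothetical stop-word list, far shorter than a real one.
$stopWords = ['the', 'a', 'an', 'of', 'for', 'and'];

// Yields: search, index, stemmed, word
print_r(analyze('Searching the Indexes for Stemmed Words', $stopWords));
```

Note that "stemmed" passes through untouched because the toy rules know nothing about "-ed". A real stemmer covers far more cases, which is why the StandardAnalyzer relies on an existing Porter implementation.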

After a couple days working on the issue, I’ve developed an analyzer that performs these tasks. I’ve named it the ‘StandardAnalyzer’ after the implementation Java’s Lucene has. You can download the StandardAnalyzer at its project page.

Just a few notes about its creation:

  • It is not meant to sit within the Zend framework folder. The ‘StandardAnalyzer’ should sit alongside it, and is configured accordingly. The reason for this is to keep what is Zend’s in Zend’s folder, and what is the user’s in his own. I figured that if the StandardAnalyzer was ever integrated into the framework, the good folks at Zend would know best how they would like it.
  • The code provided handles English words only, but it is organized to make adding other languages easy in the future.
  • I must give a special thanks to Richard Heyes, whose Stemming algorithm is used instead of my own. In tests, I found his code to be a bit more elegant and quicker than my own, which was a direct port of the Java stemming algorithm. From what I gather, Richard is a Zend-Certified Engineer, making his code usage very fitting.

Example Usage

I’ve decided to pack the StandardAnalyzer with an example project and index to make things a little easier for those looking to use it. The example project, as well as most user projects, would start off like:

require_once 'Zend/Search/Lucene.php';
require_once 'StandardAnalyzer/Analyzer/Standard/English.php';

As mentioned before, the StandardAnalyzer folder should sit in the same directory as the Zend Framework. Now that you have the power of Zend ready to go, you can proceed to build your index. But don’t forget that to use the StandardAnalyzer, you have to set the default analyzer to an instance of the StandardAnalyzer. So before you index documents or search over the index, you should call:

Zend_Search_Lucene_Analysis_Analyzer::setDefault
( new StandardAnalyzer_Analyzer_Standard_English() );

I folded that line to keep it looking readable.

Anyway, any indexing or searching you do after this line uses the StandardAnalyzer. (I may not have been very clear, but the same analyzer needs to be used when indexing and searching, or else you won’t get many results.) You can also change the getDefault() code in Zend/Search/Lucene/Analysis/Analyzer.php to reference the StandardAnalyzer. But I would rather not change the Framework’s code, and would leave it in untainted form.
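
To see why the analyzer has to match on both sides, here is a small self-contained sketch in plain PHP (no Zend involved). The helper names and the one-rule "stemmer" are hypothetical stand-ins. The index is built from stemmed terms, so a query that skips stemming looks up a term that was never stored.

```php
<?php
// Hypothetical one-rule "stemmer": drops a trailing "s". Not Porter.
function toy_stem(string $word): string
{
    return (strlen($word) > 3 && substr($word, -1) === 's')
        ? substr($word, 0, -1)
        : $word;
}

// Split text into lowercase terms, optionally stemming each one.
function terms(string $text, bool $stem): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $stem ? array_map('toy_stem', $tokens) : $tokens;
}

// Build a tiny inverted index: term => list of document ids.
function build_index(array $docs, bool $stem): array
{
    $index = [];
    foreach ($docs as $id => $text) {
        foreach (array_unique(terms($text, $stem)) as $term) {
            $index[$term][] = $id;
        }
    }
    return $index;
}

// Look up a single-word query, analyzed with or without stemming.
function search(array $index, string $query, bool $stem): array
{
    $term = terms($query, $stem)[0] ?? '';
    return $index[$term] ?? [];
}

$docs  = [1 => 'Analyzers for Lucene', 2 => 'Writing an analyzer'];
$index = build_index($docs, true);            // indexed WITH stemming

var_dump(search($index, 'analyzers', true));  // same analysis: docs 1 and 2
var_dump(search($index, 'analyzers', false)); // mismatched analysis: no hits
```

The second lookup fails only because the raw term "analyzers" never made it into the index. The same thing happens in a real index if the default analyzer is set at indexing time but not at search time, or the other way around.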

So take a look at the StandardAnalyzer project, and the example project. The example was put together fairly quickly, but it should provide a good example of how to use it. I think a synonym filter would make a nice addition in the future, so I might take a look into that.

  • http://enterventure.com Patrick

    Hi,

    Do you have an alpha / beta version of the WpSearch plugin? There doesn’t appear to be anything on the Project page. I’m in the process of learning how to use lucene and would love to see how it works on my local version of wordpress.

    I’m pretty new to all of this so I’d be happy to give whatever credit / donation you were looking for.

    Thanks!

  • Peleg Michaeli

    Hello Kenny,

    Maybe it was due to a fault or maybe due to your decision, but my comment here (from last week) had been deleted.

    Would you be able to explain the reason for this to me? And if not, may I ask my question again?

    Thanks ahead,
    Peleg.

  • Peleg Michaeli

    That’s weird — after posting this comment, my last comment appeared again.
    Sorry for bothering you for no reason.

    Peleg.

  • http://dayg.slingandstoneweb.com dayg

    This is exactly what I was looking for, a Porter Stemmer based analyzer for Zend Lucene.

    Thank you very much. :)

  • Kamal

    Hi, I was trying to add more filters than the ones you have already provided in
    standardanalyzer-1.0.0b/StandardAnalyzer/Analyzer/Standard/English.php between lines 37-39

    $this->addFilter(new Zend_Search_Lucene_Analysis_TokenFilter_LowerCaseUtf8());
    $this->addFilter(new Zend_Search_Lucene_Analysis_TokenFilter_StopWords($this->_stopWords));
    $this->addFilter(new StandardAnalyzer_Analysis_TokenFilter_EnglishStemmer());

    I am only able to add filters which are of type Zend_Search_Lucene_Analysis_TokenFilter. Let’s say I try to add $this->addFilter(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive): it doesn’t work because it’s not of type Zend_Search_Lucene_Analysis_TokenFilter. Can you please add extra functionality to cater for this, or if you have a solution, can you post it or email me?

    Thanks a lot for your help in advance

  • http://www.filife.com Mike

    ‘mortgage’ and ‘mortgages’ stem down to ‘mortgag’ (correct)

    ‘mortgagee’ (as in the party that lends the money for a mortgage) stems down to ‘mortgage’; I was expecting ‘mortgag’

    Is this a bug or correct behavior? Here’s more to the story (I am using zend_search_lucene):

    ———————–

    I’m using Porter Stemmer to stem the words, and here’s a problem I’m running into:

    Word “mortgage” is correctly stemmed to “mortgag”. Word “mortgagee” is (arguably incorrectly) stemmed to “mortgage”.

    There are approximately 100 documents with the word “mortgage”. There is 1 document with the word “mortgagee”.

    When I build an index without putting “mortgagee” in any documents, everything works fine: searching for “mortgage” or “mortgages” or “mortgag” returns all 100 documents.

    When I build an index and one of the documents contains “mortgagee”, searching the index for “mortgage” only returns a single document with “mortgagee” (which was stemmed down to “mortgage”). However, searching for “mortgag” or “mortgages” returns all 100 documents.

    The only logical conclusion I can make from this problem is lucene first searches for the pre-stemmed word, and if it doesn’t find any results, it continues to search for the stemmed word. Thus, when searching for ‘mortgage’, it first finds the ‘mortgage’ that was stemmed from ‘mortgagee’ and stops searching. Is this the correct behavior, or is it a bug?

  • Kenny Katzgrau

    @Mike,

    Normally, Lucene stems all docs as they go into the index. During searches, Lucene stems any words in the search phrase.

    Really, that’s the extent of word stemming.

    I think I may have an idea of what’s going on. All documents which are added to the search results are scored, and that score is normalized to a value between 0 and 1. I’m thinking that in search #1, since you have an exact match between the search phrase and the stemmed word, the score is 1.00 while other matches get a very low score (and aren’t added to the final result list). In fact, “mortgage” and “mortgag” are considered completely different words in the post-analysis operations.

    I could be wrong, but I think it makes sense. Your situation might be an interesting edge case in the StandardAnalyzer search.

    And I think “mortgagee” to “mortgage” might be incorrect, but I’m not sure whether the Porter Stemmer was ever considered anything other than “the best available,” so I guess we’ll have to settle :)

  • tibor

    Hi
    Thanks for the great synthesis about Analysers, got me on the go straight off.
    I’ve looked quite thoroughly around for a French analyser to stem and clean accents using Zend, but it’s not so easy to find.
    I was wondering if by any chance you might know where I can find libs similar to these; they are the Java equivalents:

    ISOLatin1AccentFilter
    FrenchStemFilter

    Of course, I’m looking for a PHP and Zend version.
    best Regards
    Tibor
