In my last post I spoke a little about Zend’s Lucene implementation in PHP, and its extensive usefulness for content-oriented PHP web applications. One of the roadblocks to implementing a Google-like search, however, was the absence of a stemming analyzer in the Zend package.
I came across this issue while developing a plug-in for WordPress with PHP Lucene. I wasn’t getting the relevancy I was looking for in test searches, so I decided to develop a stemming analyzer of my own. I decided that this analyzer should:
- Stem words for greater search relevancy
- Use the pre-existing Zend lowercase filter
- Filter out a standard set of stop words using the Zend stop words filter
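To make those three requirements concrete, here is a toy sketch of the pipeline in plain PHP. It is not the StandardAnalyzer's code: the names toyStem and toyAnalyze, the stop-word list and the crude suffix stripping are all made up for illustration, and a real stemmer such as Porter's handles far more cases.

```php
<?php
// Toy analysis pipeline: lowercase -> stop-word filter -> stem.
// Every name below is hypothetical and exists only for this illustration.

/**
 * Naive suffix stripper standing in for a real stemming algorithm.
 *
 * @param string $word lowercase token
 * @return string the token with one common suffix removed, if any
 */
function toyStem($word)
{
    foreach (array('ing', 'ed', 'es', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip when at least three characters of stem would remain.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

/**
 * Turn raw text into the list of terms that would be indexed or searched.
 *
 * @param string $text      raw input
 * @param array  $stopWords lowercase words to discard
 * @return array list of stemmed terms
 */
function toyAnalyze($text, array $stopWords)
{
    // Lowercase first, then split on anything that is not a letter.
    $tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Discard stop words.
    $kept = array();
    foreach ($tokens as $token) {
        if (!in_array($token, $stopWords, true)) {
            $kept[] = $token;
        }
    }

    // Stem whatever is left.
    return array_map('toyStem', $kept);
}

$stopWords = array('a', 'an', 'and', 'the', 'of', 'to', 'in');

print_r(toyAnalyze('The indexing of searched documents', $stopWords));
print_r(toyAnalyze('Searching indexes', $stopWords));
```

With this toy stemmer, "indexing" and "indexes" both come out as "index", and "searched" and "Searching" both come out as "search", which is the effect that improves relevancy.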
After a couple days working on the issue, I’ve developed an analyzer that performs these tasks. I’ve named it the ‘StandardAnalyzer’ after the implementation Java’s Lucene has. You can download the StandardAnalyzer at its project page.
Just a few notes about its creation:
- It is not meant to sit within the Zend framework folder. The ‘StandardAnalyzer’ should sit alongside it, and is configured accordingly. The reason for this is to keep what is Zend’s in Zend’s folder, and what is the user’s in his own. I figured that if the StandardAnalyzer was ever integrated into the framework, the good folks at Zend would know best how they would like it.
- The code provided handles English words only, but it is organized so that other languages can be added in the future.
- I must give a special thanks to Richard Heyes, whose Stemming algorithm is used instead of my own. In tests, I found his code to be a bit more elegant and quicker than my own, which was a direct port of the Java stemming algorithm. From what I gather, Richard is a Zend-Certified Engineer, making his code usage very fitting.
Example Usage
I’ve decided to pack the StandardAnalyzer with an example project and index to make things a little easier for those looking to use it. The example project, as well as most user projects, would start off like:
require_once 'Zend/Search/Lucene.php';
require_once 'StandardAnalyzer/Analyzer/Standard/English.php';
As mentioned before, the StandardAnalyzer folder should sit in the same directory as the Zend Framework. Now that you have the power of Zend ready to go, you can proceed to build your index. But don’t forget that to use the StandardAnalyzer, you have to set the default analyzer to an instance of the StandardAnalyzer. So before you index documents or search over the index, you should call:
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new StandardAnalyzer_Analyzer_Standard_English()
);
I folded that line to keep it readable.
Anyway, any indexing or searching you do after this call uses the StandardAnalyzer. (To be clear: the same analyzer needs to be used when indexing and when searching, or else you won’t get many results, because the terms produced from the query will not match the stemmed terms stored in the index.) You could also change the getDefault() code in Zend/Search/Lucene/Analysis/Analyzer.php to return the StandardAnalyzer. But I would rather not change the Framework’s code, and leave it in untainted form.
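Putting the pieces together, a minimal index-and-search round trip might look like the sketch below. It assumes the Zend and StandardAnalyzer folders are both on your include path. The index location, the 'title' field and the sample text are placeholders of mine, not taken from the packaged example project.

```php
<?php
require_once 'Zend/Search/Lucene.php';
require_once 'StandardAnalyzer/Analyzer/Standard/English.php';

// Set the analyzer once, before any indexing or searching takes place.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new StandardAnalyzer_Analyzer_Standard_English()
);

// Placeholder location; create() starts a fresh index in that directory.
$indexPath = sys_get_temp_dir() . '/stemming-demo-index';
$index = Zend_Search_Lucene::create($indexPath);

// Index one document. A Text field is tokenized, indexed and stored,
// so its value can be read back from a search hit.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Running a search engine'));
$index->addDocument($doc);
$index->commit();

// Search with the same default analyzer still in effect.
$hits = $index->find('runs');
foreach ($hits as $hit) {
    echo $hit->title, ' (score: ', $hit->score, ")\n";
}
```

Since the same analyzer handles both addDocument() and find(), a query for "runs" should be able to match the title containing "Running", as a Porter-style stemmer reduces both words to "run". To reuse an existing index rather than create a new one, Zend_Search_Lucene::open() takes the same path.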
So take a look at the StandardAnalyzer project, and the example project. The example was put together fairly quickly, but it should provide a good example of how to use it. I think a synonym filter would make a nice addition in the future, so I might take a look into that.