StandardAnalyzer

I wrote a StandardAnalyzer for PHP Lucene for those who need a Google-like search. It performs:

  • Word Stemming
  • Stop word filtering
  • Lowercase filtering
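To give a rough idea of how these three filters fit together, here is a minimal sketch in plain PHP. It is not the StandardAnalyzer code from the download: `analyze_text`, the tiny stop word list, and the toy stemmer (which only strips a trailing "s") are all hypothetical stand-ins chosen for illustration.

```php
<?php
// Hypothetical helper, not part of the StandardAnalyzer project:
// runs the three filters in sequence on a plain string.
function analyze_text(string $text, array $stopWords, callable $stem): array
{
    // Lowercase filtering: normalise case before any comparison.
    $text = strtolower($text);

    // Tokenise on runs of ASCII letters and digits (an assumption for this sketch).
    preg_match_all('/[a-z0-9]+/', $text, $matches);

    $tokens = [];
    foreach ($matches[0] as $token) {
        // Stop word filtering: drop very common words.
        if (in_array($token, $stopWords, true)) {
            continue;
        }
        // Word stemming: delegate to whatever stemmer is supplied.
        $tokens[] = $stem($token);
    }
    return $tokens;
}

// Toy stemmer that only strips a trailing "s"; a stand-in for a real Porter stemmer.
$toyStem = function (string $word): string {
    return (strlen($word) > 3 && substr($word, -1) === 's')
        ? substr($word, 0, -1)
        : $word;
};

$result = analyze_text('The Sinclair Computers', ['the', 'a', 'of'], $toyStem);
print_r($result);
```

The stemmer is passed in as a callable so the same pipeline could be tried with a real stemming library in its place.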

A brief background of the StandardAnalyzer is available in my post which discusses it. There is also a Readme.txt in the project which provides some more information.

Provided below is the sample project mentioned in that post, which includes the StandardAnalyzer code. This is the first release of this code, hence the beta designation. If you happen to find an issue, let me know: katzgrau@gmail.com.

Download

Zip file:
standardanalyzer-1.0.0b.zip

Tarball:
standardanalyzer-1.0.0b.tar.gz

  • Vitor ALmeida

    Hey man, nice work!
    I'm looking for the same functionality for the Portuguese language. Can you give me some ideas?

    How can I implement stemming for Portuguese? What is the process?

    Thank you!

  • http://www.sellmyretro.com Rich Mellor

    This is an excellent project (if a little old now) and still holds up. However, I am having issues with the search results when searching for ZX80 or ZX81.

    When I use $index->find('name: "zx81"') on the original Zend Lucene engine, it returns entries whose name includes just zx (e.g. Sinclair ZX Spectrum), but under your analyzer it returns no results!

    Any ideas what is causing this? Other searches for (say) “ZX Spectrum” or “QL” work fine, so it is something to do with the combination of letters and numbers…

  • katzgrau

    It really depends on the English stemming algorithm being used. This project used the Porter stemming algorithm, which stems zx80 to … zx80. Unfortunately that’s not going to match a search for ‘zx’.

    Your best bet would be to write in a hack that avoids using the stemming library in special cases, if you know what they are ahead of time.
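    One way such a hack could look, as a rough sketch only: wrap the stemmer so that a list of known terms bypasses it. The function name `stem_unless_protected`, the `$protected` list, and the fake stemmer are hypothetical and do not come from the project.

```php
<?php
// Hypothetical wrapper: $stem stands for whatever stemming function the analyzer calls.
function stem_unless_protected(string $token, array $protected, callable $stem): string
{
    $lower = strtolower($token);

    // Known special cases skip the stemmer and are indexed as-is.
    if (in_array($lower, $protected, true)) {
        return $lower;
    }
    return $stem($lower);
}

// Terms known ahead of time (an assumed list for illustration).
$protected = ['zx80', 'zx81'];

// Fake stemmer that tags its input, so it is obvious when it was called.
$fakeStem = function (string $word): string {
    return 'stemmed:' . $word;
};

echo stem_unless_protected('ZX81', $protected, $fakeStem), "\n";
echo stem_unless_protected('Running', $protected, $fakeStem), "\n";
```

    The same wrapper would have to be applied at both indexing and query time, otherwise the terms in the index and the terms in the query would no longer line up.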

  • katzgrau

    I should add that ZX and QL may be stemmed differently in that algorithm. It’s not a bug in the code, more of a flaw in the unpredictable nature of the English language that the algorithm tries to work with.

  • http://www.sellmyretro.com Rich Mellor

    Actually it is something peculiar in Zend_Search_Lucene – if I use the standard Text case-insensitive analyser, then a search for ZX80 shows entries with ZX80 and ZX (as the analyser presumably ignores the numbers).

    However, if I use the standard TextNum case-insensitive analyser (which is similar to your basic analyser without the Porter stemming algorithm), a search for ZX80 returns no results at all, despite rebuilding the index from scratch!

    Weird because I can see the regex allows A-Z, a-z and 0-9 so no idea why ZX80 is being completely ignored…
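    The difference between a letters-only tokenizer and a letters-plus-digits tokenizer can be seen in a few lines of plain PHP. This is only a sketch: the `tokenize` helper and both regex patterns are assumptions made for illustration, not code taken from Zend_Search_Lucene, and it does not explain the empty result set described above.

```php
<?php
// Hypothetical helper for illustration; the patterns are assumptions modelled on
// "letters only" versus "letters and digits" tokenizers, not copied from Zend.
function tokenize(string $text, string $pattern): array
{
    preg_match_all($pattern, strtolower($text), $matches);
    return $matches[0];
}

// Letters only: the digits are dropped, so "ZX80" becomes the term "zx".
$lettersOnly = tokenize('Sinclair ZX80', '/[a-z]+/');

// Letters and digits: "ZX80" survives as the single term "zx80".
$lettersDigits = tokenize('Sinclair ZX80', '/[a-z0-9]+/');

print_r($lettersOnly);
print_r($lettersDigits);
```

    If the index holds the term "zx" while the query is tokenized to "zx80" (or the other way round), an exact term lookup would not match, so it may be worth checking that the same analyzer is active both when the index is built and when the query is parsed.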