A word on Lucene’s PHP port by Zend

Lucene is an open source search engine library written in Java. If you have never heard of it before, listen to this: it lets you build a mini Google-like search over anything. That’s right: anything.

But I’ll be a little more specific. Suppose you run a news website, or a wiki for that matter. How would you let users search the website? For most programmers, the answer is to implement a plain-vanilla SQL search over the title and content of the articles. There are a few issues with this approach:

  • Search time can be fairly lengthy, since a LIKE query with a leading wildcard generally has to scan every row
  • A LIKE query can still be very inaccurate (a search for ‘manager’ over a field containing only ‘manage’ will not be considered a match)
  • The results are not ordered by relevance in any meaningful way
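To make the second point concrete, here is a minimal sketch that emulates a `LIKE '%term%'` test in plain Java. The class and method names are made up for illustration, and the comparison is assumed to be case-insensitive (in a real database that depends on the collation).

```java
import java.util.Locale;

class LikeSearchSketch {
    // Emulates SQL's  column LIKE '%term%'  as a case-insensitive substring test.
    // Hypothetical helper for illustration; it is not part of any library.
    static boolean likeContains(String field, String term) {
        return field.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String field = "Knows how to manage a sales team";
        // The exact substring is present, so this matches.
        System.out.println(likeContains(field, "manage"));
        // A longer form of the same word is not a substring, so this does not match.
        System.out.println(likeContains(field, "manager"));
    }
}
```

A substring test has no notion of word forms, and it gives back only yes or no, so there is nothing to rank results by.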

Lucene is a Java package that lets a Java programmer insert documents into an ‘index’, which is basically the search engine’s database, and search over that index later on. So it is a true search engine that a Java programmer can use to retrieve information at incredible speeds: queries came back in milliseconds in my tests.
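The idea of an index can be sketched in a few lines of plain Java. This is a toy inverted index and not Lucene's API: the `ToyIndex` class and its methods are hypothetical, and it leaves out everything that makes Lucene practical, such as scoring, persistence and analyzers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ToyIndex {
    private final List<String> documents = new ArrayList<>();
    // Maps each term to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds a document to the index and returns its id.
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Looks a single term up directly instead of scanning every document.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("Lucene is a search library written in Java");
        index.add("A search component for PHP");
        System.out.println(index.search("search"));
        System.out.println(index.search("php"));
    }
}
```

The work of splitting text into terms happens once, at indexing time. A query is then a map lookup, which is a large part of why an index answers so quickly.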

The details of Lucene can be found on its Apache project site.

I’ll get to the real point of this post. Considering how useful a tool Lucene would be, you are probably somewhat disappointed that I said it was for Java. After all, many would find something like this most useful if integrated with a server-side language such as PHP.

Zend, a PHP-focused firm best known for the Zend Framework, created a “Search” component as part of its framework, which is a port of Lucene for PHP. This port can be extremely useful for implementing search functionality in a web application. There is a single problem standing in the way of creating a true full-text search, though, and that is the default search functionality provided in PHP Lucene.

Consider a scenario where we are employers searching for prospective employees on a job search board. In a certain applicant’s resume, he states that he has “Managed a software team with great success, and has great managerial skills.”

Let’s assume this guy’s resume, as well as thousands of other resumes, is in a Lucene search index. When an employer executes a search on the job board, the job board code then uses the Lucene API to find documents matching the employer’s search terms, “sales manager”.

Using the standard functionality of PHP Lucene, our employer would likely never find our applicant. Why? Because the words ‘managed’ and ‘managerial’ are not the word ‘manager’. Even though this document is very relevant to the employer’s search, it will be nowhere in the result set.

Java Lucene has a way to overcome this scenario: an analyzer that lowercases and stems words. An analyzer is a component that Java Lucene uses to manipulate data as it goes into a search index. On its own, the Standard Analyzer tokenizes and lowercases text; stemming comes from a separate filter that has to be turned on. With stemming in place, when “Managed a software team with great success, and has great managerial skills” is put in the index, it is stored as something like “manag a software team with great success, and has great manag skill.” The exact stems depend on the stemming algorithm used.

The analyzer is also used on queries. “Sales manager” would become something like “sale manag”. Now a query on these terms would turn up the applicant we just spoke about, because “manag” appears in both the analyzed query and the indexed resume.
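Here is a rough sketch of that idea in plain Java, showing why the same analysis has to run on both documents and queries. It does not use Lucene and it is not the Porter algorithm: the suffix list is a hand-picked assumption that only covers the words in this example, and all names are hypothetical. Real stemmers produce somewhat different stems.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyStemmingAnalyzer {
    // Hand-picked suffixes for this example only; checked in order, longest first.
    private static final String[] SUFFIXES = {"erial", "er", "ed", "s"};

    // Strips the first matching suffix, keeping a stem of at least 3 letters.
    static String stem(String word) {
        if (word.endsWith("ss")) {
            return word; // leave words like "success" alone
        }
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lowercases the text and splits it into words, with no stemming.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Full analysis: tokenize, then stem every token.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            terms.add(stem(token));
        }
        return terms;
    }

    // True if at least one query term also occurs among the document terms.
    static boolean sharesTerm(List<String> docTerms, List<String> queryTerms) {
        Set<String> doc = new HashSet<>(docTerms);
        for (String term : queryTerms) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String resume = "Managed a software team with great success, and has great managerial skills";
        String query = "Sales manager";
        System.out.println(analyze(query));                                  // [sale, manag]
        System.out.println(sharesTerm(tokenize(resume), tokenize(query)));   // false
        System.out.println(sharesTerm(analyze(resume), analyze(query)));     // true
    }
}
```

Without stemming, the resume and the query share no terms at all. Once both go through the same analysis, they meet at “manag”.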

PHP’s Lucene unfortunately does not ship with a stemming analyzer yet. My next post will be about my creation of such an analyzer for PHP Lucene.

  • Alexander Veremyev

    First of all, thank you for the article about PHP Lucene implementation!

    Just to inform:
    The Apache Lucene stemming filter is not used by default and has to be turned on manually, but it may give undefined behaviour for wildcard and fuzzy search queries.

    Zend_Search_Lucene also uses the analyzer paradigm, and a custom stemming analyzer can be implemented, but one is not currently included in the Zend_Search_Lucene package. You are welcome to contribute to it! :)

  • http://codefury.net Kenny Katzgrau

    Alexander,

    Thank you for pointing out the implication I made in my post! It is true that the default analyzer in Apache Lucene is not the Standard Analyzer (Off the top of my head, I think it is actually the Whitespace analyzer).

    I am definitely looking forward to contributing the Standard Analyzer (see the next post) and an article to the Zend Developer Zone.

    Thanks again,
    Kenny Katzgrau

  • http://codysnider.com Cody Snider

    Couldn’t the Porter Stemmer be applied here to get the desired result in PHP?

  • Rajasekar Msr

    How do I start using PEAR Lucene search on Excel data in PHP?