Lucene is an open source search engine library written in Java. If you have never heard of it before, listen to this: it lets you build a mini Google-like search for anything. That’s right: anything.
But I’ll be a little more specific. Suppose you run a news website, or a wiki for that matter. How would you let users search the site? For most programmers, the answer is a plain-vanilla SQL search over the title and content of the articles. There are a few issues with this approach:
- The search can be slow: a LIKE pattern with a leading wildcard (‘%term%’) generally cannot use an ordinary column index, so the database has to scan every row
- A LIKE query can also be very inaccurate (a search for ‘manager’ will not match a field that contains only ‘manage’ or ‘managed’)
- The results are not ordered by relevance in any meaningful way
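To make the LIKE problem concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table name, column names, and sample articles are assumptions made up for illustration; any SQL database would behave much the same way for this kind of substring match.

```python
import sqlite3

# Hypothetical articles table, held in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO articles (title, content) VALUES (?, ?)",
    [
        ("Team news", "She will manage the new sales team."),
        ("Promotion", "He was promoted to manager last week."),
    ],
)


def like_search(term):
    """Plain substring search over title and content, in insertion order."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT title FROM articles "
        "WHERE title LIKE ? OR content LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [title for (title,) in rows]


# 'manager' misses the article that only says 'manage'.
manager_hits = like_search("manager")
# 'manage' matches both, but nothing ranks one above the other.
manage_hits = like_search("manage")
print(manager_hits, manage_hits)
```

The query for ‘manager’ skips the first article even though it is clearly related, and neither query gives the database any basis for ranking one match above another.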
Lucene is a Java package that lets a Java programmer insert documents into an ‘index’ (basically the search engine’s database) and search over that index later on. So it is a true search engine that a Java programmer can use to retrieve information at incredible speeds: milliseconds in my tests.
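The idea of an index can be sketched in a few lines of Python. This is only a toy inverted index (a map from each term to the documents containing it), not Lucene’s actual data structures or API. The class name TinyIndex, its methods, and the sample documents are all hypothetical, and the sketch requires every query term to match, which is a simplification.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Split the text into lower-cased terms and record where each one occurs.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return ids of documents that contain every term in the query.
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = ids if result is None else result & ids
        return sorted(result)


index = TinyIndex()
index.add(1, "Lucene is a search engine library")
index.add(2, "SQL databases support LIKE queries")
index.add(3, "A search over an index is fast")

both_terms = index.search("search index")
one_term = index.search("search")
print(both_terms, one_term)
```

A search only looks up the query terms in the map instead of scanning every document, which is why lookups against an index stay fast as the collection grows.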
The details of Lucene can be found at its Apache project site.
I’ll get to the real point of this post. Considering how useful a tool Lucene would be, you are probably somewhat disappointed that I said it was for Java. After all, many would find something like this most useful if integrated with a server-side language such as PHP.
Zend, the PHP-focused firm best known for the Zend Framework, created a “Search” component as part of its framework, which is a port of Lucene to PHP. This port can be extremely useful for implementing search functionality in a web application. One problem, however, stands in the way of a true full-text search: the default search functionality provided in PHP Lucene.
Consider a scenario where we are employers searching for prospective employees on a job search board. A certain applicant’s resume states: “Managed a software team with great success, and has great managerial skills.”
Let’s assume this applicant’s resume, along with thousands of others, is in a Lucene search index. When an employer runs a search on the job board, the job board code then uses the Lucene API to find documents matching the employer’s search terms, “sales manager”.
Using the standard functionality of PHP Lucene, our employer would likely never find this applicant. Why? Because the words ‘managed’ and ‘managerial’ are not the word ‘manager’. Even though this document is very relevant to the employer’s search, it will be nowhere in the result set.
Java Lucene has a way to overcome this scenario: analyzers. An analyzer is a component that Java Lucene uses to transform text as it goes into a search index. The StandardAnalyzer on its own tokenizes the text, lower-cases it, and removes common stop words. A stemming analyzer (for example the Snowball analyzer, or an analyzer chain that includes a Porter stem filter) additionally reduces each word to its stem. So when “Managed a software team with great success, and has great managerial skills” is put in the index, it will be stored roughly as “manag a software team with great success, and has great manag skill.” The exact stems depend on the stemming algorithm used.
The analyzer is also used on queries. “Sales manager” would become something like “sal manag”. Now a query on these terms would turn up the applicant we just discussed.
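The effect of analyzing both the document and the query can be illustrated with a small Python sketch. The suffix list and the toy_stem function below are deliberately crude assumptions chosen to reproduce the example above. They are not the Porter or Snowball algorithm and not any Lucene API, and a real stemmer may produce different stems.

```python
import re

# Crude, hypothetical suffix list; longer suffixes are tried first.
SUFFIXES = ("erial", "ers", "ing", "ed", "er", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words such as 'success' alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Toy analyzer: tokenize, then stem every token."""
    return [toy_stem(token) for token in tokenize(text)]


resume = (
    "Managed a software team with great success, "
    "and has great managerial skills"
)
query = "Sales manager"

# Without stemming, the query and the resume share no terms.
raw_overlap = set(tokenize(query)) & set(tokenize(resume))
# With the same analyzer applied to both sides, they meet on 'manag'.
stemmed_overlap = set(analyze(query)) & set(analyze(resume))
print(analyze(query), raw_overlap, stemmed_overlap)
```

The important design point is that the same analysis runs at index time and at query time, so both sides are reduced to the same terms before they are compared.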
PHP’s Lucene unfortunately does not have this ability yet. My next post will be about my creation of such an analyzer for PHP Lucene.