Updates on KLogger, wpSearch 1.5.6, and More
Since my last post, I’ve been working on and off on the next version of wpSearch. The last version (1.5.5) stands as somewhat of an official release of the plugin, and I would consider previous versions to be ‘release candidates’ due to the growth wpSearch saw since then.
I’m in my second-to-last semester at school, so the lack of posts over the past two months can be attributed to that, but it doesn’t mean that nothing has happened.
Many of wpSearch’s users have made note of some very important issues in the current release. Because I have been receiving feedback on wpSearch in pretty large quantities lately, I’ve been doing my best to reply to everyone, though some might not get a reply right away (Sorry!).
I’ve taken note of all reported issues so far, and a future release of wpSearch over the holidays is likely. I have also received a code submission from the folks at the Alpha-Beta-Release Blog for KLogger. They have written rolling log capabilities into the class, ensuring that log files never grow beyond a certain size: once a preset limit is reached, logging continues in a new file.
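For anyone curious what rolling logs look like in practice, here is a rough sketch of the idea. It is not the code that was submitted for KLogger: the function name `rollingLogWrite`, the numbered-suffix file naming, and the byte limit are all assumptions made purely for illustration. Note that the size check happens before the write, so in this sketch a file can overshoot the limit by up to one entry.

```php
<?php
// Hypothetical sketch of size-based log rolling; NOT the submitted KLogger code.

/**
 * Append $line to the current log file. Once a file has reached $maxBytes,
 * move on to the next numbered file (log.txt, log.txt.1, log.txt.2, ...).
 * Returns the path that was actually written to.
 */
function rollingLogWrite($basePath, $line, $maxBytes)
{
    // filesize() results are cached by PHP, so clear them before checking.
    clearstatcache();

    $index = 0;
    $path  = $basePath;

    // Skip past every file that has already reached the size limit.
    while (file_exists($path) && filesize($path) >= $maxBytes) {
        $index++;
        $path = $basePath . '.' . $index;
    }

    file_put_contents($path, $line . PHP_EOL, FILE_APPEND | LOCK_EX);

    return $path;
}
```

Calling `rollingLogWrite('log.txt', 'some message', 1048576)` would keep appending to log.txt until it reaches roughly one megabyte, then continue in log.txt.1, and so on.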
Also, I don’t know how many blog readers have any sort of interest in the Google Search Appliance, but I will be developing a PHP “Bridge” for easy code-wise communication with the Appliance. I am writing this library for LTech Consulting (A Google Partner) — they have already written a bridge in C#/.NET. If anyone is interested, I’ve written a lightweight blog post on LTech’s blog detailing the basics of the project.
Lastly, zinkk, a startup development company I am involved in, has grabbed its first couple of contracts over the past few months, and also has a real office. It’s a big step when a small company started by students actually moves into its first office — there are feelings of “Wow, we did it,” and “We are a bona-fide firm now,” and “Whoa, we better keep making money so we can pay the rent!”
Anyway, we have some very talented people at zinkk, and I’m excited to be a part of it. John Bellone, one of our developers, works for CitiGroup in Manhattan. John Crepezzi works at Sun Microsystems. Dan Boston is a doctoral student at NJIT, and Tarcisio has worked for Johnson & Johnson. All of these guys have worked on some impressive projects, many of them open source or publicly licensed.
Oh yeah, Zend put a piece I wrote about writing a Stemming Analyzer for Zend_Lucene in the tutorial section of their website.
That’s it for now — I’m going to talk about a release of a MySQL management class for PHP next time.
wpSearch 1.5.0.5 Released With Features, Fixes
After an exhausting week and a half tracking down the source of a mysterious bug in wpSearch, I think I can finally close the book on the “null result” issue that had me poring over the source code.
wpSearch 1.5.0.5, the first official release after the 1.5 landmark, brings to the forefront some of the features and fixes slated in the last post. wpSearch 1.5 has had the following features implemented:
- Comment Searching
- A behind-the-scenes event logger for easily figuring out user issues
- An upgrade to the underlying Lucene Search
- An upgrade to the underlying StandardAnalyzer (used for relevancy)
And these fixes:
- No more null results after a post is edited
- Foreign character support (or simply, indexing content with ‘UTF-8’ encoding)
- Memory issues for content-heavy posts
wpSearch 1.5.0.5 is a rock-solid release that is starting to make a name for itself in the Wordpress world. The new ‘Phone Home’ feature in wpSearch allows users to report their copy of wpSearch. A few of the blogs with wpSearch currently in use are listed here:
Patrick Cushing at the EnterVenture blog wrote a very detailed comparison of the default Wordpress search’s relevancy vs. wpSearch’s. This article ended up at digg.
Of course, as far as wpSearch has come in its short lifespan, there exists a set of users that deserve credit for pointing out issues and keeping me informed of bugs, needed features, etc. So, in no particular order, I would like to thank:
- ComputerBob, at ComputerBob.com for pointing out the first instance of the empty result issue. He has thoroughly documented his usage with wpSearch at his blog, in a fair and balanced fashion. Furthermore, he has sent his index data back with detailed comments when most users would simply give up on wpSearch. Thanks ComputerBob.
- Robert Irizarry, who has kept the wpSearch thread at the Wordpress repository stuffed with feature ideas and issue notices.
- Olivier, whose 6,000 posts provided the first failed scalability test for wpSearch. His report led to a change allowing for greater scalability: wpSearch 1.5 was tested successfully with up to 7,000 posts. His dedication to detailing these issues has helped wpSearch greatly.
- Karl Heigl, who first mentioned the fact that wpSearch was not handling German accents, and subsequently all foreign (to the U.S.) characters. This also ended up affecting Olivier. This bug was fixed in 1.5.0.5. Thanks Karl!
- A user named Brian, who said, “Thanks for the update. If you need any other information or even help testing, I’d be happy to assist. Just let me know.” Thanks for your support, Brian.
- And to all those who have donated to this project so far!
So, wpSearch 1.5.0.5 wouldn’t be at its current status if it weren’t for those supporting it.
Features coming up for wpSearch include result highlighting, contextual snippets, and a progress meter for index building. I encourage everyone who is reading this but hasn’t installed wpSearch yet to try it out, and see the awesome blog search that you’ve been missing.
KLogger: A Simple Logging Class for PHP
Since the latest release of wpSearch, a couple of issues have cropped up and are slated to be fixed shortly. Some of the issues, however, are a bit harder to catch without a good set of debugging tools for PHP. The classic example of such a tool would be a file-based logger.
As soon as I realized the need for a logger while developing wpSearch, I decided to check whether one already existed on the internet. Someone had surely created a simple logging class and made it available before, or so I thought. I’m a believer in the C programmer’s motto “build upon the work of others”, so checking to see if someone else has done the same thing prior to starting a project comes naturally.
After a little browsing, I couldn’t find what I was looking for. Put plainly, I wanted a logging class that:
- Checked permissions prior to logging
- Had a priority hierarchy built in (Debug, Info, Error, and Fatal message levels)
- Logged to plain old text files
- Managed file handling cleanly (Open the file once, close the file once)
- Managed resources (make sure the file gets closed)
Not too complicated. This logging class would require around 100 lines of code.
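To make that wish list a little more concrete, here is a minimal sketch of a class shaped around those requirements. This is not KLogger’s actual source (that lives on the project page); the class name `TinyLogger` and its method names are hypothetical and exist only for this example.

```php
<?php
// Hypothetical sketch of the requirements above; not KLogger's real source.
class TinyLogger
{
    // File handle, opened once in the constructor (null if logging is unavailable).
    private $handle = null;

    public function __construct($path)
    {
        // Check permissions before logging: an existing file must be writable,
        // and a new file needs a writable parent directory.
        $target = file_exists($path) ? $path : dirname($path);
        if (!is_writable($target)) {
            return; // stay silent rather than break the host application
        }

        $opened = fopen($path, 'a');
        if ($opened !== false) {
            $this->handle = $opened;
        }
    }

    public function isReady()
    {
        return $this->handle !== null;
    }

    public function log($message)
    {
        if ($this->handle !== null) {
            fwrite($this->handle, date('Y-m-d H:i:s') . ' - ' . $message . PHP_EOL);
        }
    }

    public function __destruct()
    {
        // Close the file exactly once, when the logger goes away.
        if ($this->handle !== null) {
            fclose($this->handle);
            $this->handle = null;
        }
    }
}
```

Opening the file in the constructor and closing it in the destructor is one way to get the "open once, close once" behavior without the caller having to remember anything.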
Another option involved using logging functions available in Zend, or the logging class provided in PEAR. These libraries were a little overkill for what I needed, so I passed.
I decided to write the class myself, and it has turned out to be pretty handy. I figured someone else would probably find it useful as well, so I have posted it on its own project page. Click here to go to the KLogger project page.
Using KLogger is very straight-forward. Here’s an example:
    require_once 'KLogger.php';
    ...
    $log = new KLogger ( "log.txt" , KLogger::DEBUG );

    // Do database work that throws an exception
    $log->LogError("An exception was thrown in ThisFunction()");

    // Print out some information
    $log->LogInfo("Internal Query Time: $time_ms milliseconds");

    // Print out the value of some variables
    $log->LogDebug("User Count: $User_Count");

Depending on the priority level that is used when instantiating a new KLogger, only certain messages are actually logged to the file. If the most verbose priority level is used (KLogger::DEBUG), all messages are logged. If the least verbose level is used (KLogger::FATAL), only Fatal-level errors are logged. Here’s a breakdown from the most verbose level to the least verbose:
- Debug (Most Verbose)
- Info …
- Warn …
- Error …
- Fatal (Least Verbose)
A sixth level is also available: KLogger::OFF, so if you need to release a site without logging, you can simply set the priority level to Off, and not have a single message logged. (In fact, a log file will never be opened).
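The filtering itself boils down to a single comparison. The sketch below shows one way it could work; the constant names and their numeric values are assumptions for this example (only their ordering matters), and `shouldLog` is a hypothetical helper rather than a KLogger method.

```php
<?php
// Assumed numeric values: only the ordering (Debug lowest, Off highest) matters here.
const LEVEL_DEBUG = 1;
const LEVEL_INFO  = 2;
const LEVEL_WARN  = 3;
const LEVEL_ERROR = 4;
const LEVEL_FATAL = 5;
const LEVEL_OFF   = 6;

/**
 * A message is written only if its level is at or above the logger's threshold.
 * Because LEVEL_OFF sits above every real message level, nothing passes it.
 */
function shouldLog($messageLevel, $threshold)
{
    return $messageLevel >= $threshold;
}
```

With a threshold of `LEVEL_DEBUG` every message passes; with `LEVEL_FATAL` only Fatal messages do; with `LEVEL_OFF` none do.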
There isn’t too much documentation on the class right now, but it shouldn’t be too hard to figure out.
A couple comments on design: A few log class examples I’ve seen implement a singleton pattern, essentially locking the programmer into using one logger at all times. I usually lean towards letting the developer decide those things: If he wants to use a single logger, he can create a global log object in his application.
A screenshot of KLogger:
I think at this point I’ve actually done a better job describing KLogger in this post than on the project page. A few more details are listed there, along with the download. If you find it useful, leave a comment and let me know.
Click here to go to the KLogger project page.
wpSearch 1.5: The Fastest, Lightest Yet
After its first week in the wild, wpSearch has been run on a number of different versions of Wordpress and PHP, highlighting some places to improve aspects of its core. wpSearch 1.5 has just been released, with a completely rewritten search mechanism to bring search speeds into the milliseconds.
Certain features available in wpSearch 1.x.x.x have been removed in favor of tighter integration with the Wordpress core and raw speed. The search popup is no longer an option, removing the need for 2 javascript libraries, 2 CSS files, and 4 images. This decreases the page load time by close to 500 ms for a first-time page view over a broadband connection.
wpSearch now integrates its results into the Wordpress search using pure Wordpress API. Here are some statistics from version 1.5:
| Indexing Performance: | ~30 minutes for 6,000 posts (5 docs / sec) |
| Typical Search Speed: | 30-100 ms over 1,000 posts |
| “Atypical Search” Speed: | 400 ms over 1,000 duplicate posts with 1,000 matches |
These stats were gathered on a non-dedicated development server with 1 GB RAM and a 3.2 GHz Hyperthreaded Intel P4, running Windows XP and a WAMP installation without any sort of code caching (like Zend Optimizer). Needless to say, this server isn’t the quickest, but it still turns out very impressive search times.
wpSearch 1.5 is also completely compatible with the latest release of Wordpress, 2.6.
Get it here: http://wordpress.org/extend/plugins/wpsearch/
Keep the comments coming!
katzgrau@gmail.com
wpSearch Accepted Into Wordpress Plugins
wpSearch (more info in my previous post), the lucene-powered search plugin for Wordpress, has officially been accepted into the Wordpress plugins repository. You can view and download wpSearch here:
http://wordpress.org/extend/plugins/wpsearch/
The latest version as of right now is 1.1.0.0. Several major features have been added since the original beta release.
- Seamless integration of wpSearch into your blog. After you activate wpSearch and build your blog’s search index, the search box on your blog will now be configured to use wpSearch for searches.
- You can now decide whether you want search results in the page (the standard), or have them loaded into an AJAX search pop-up. (Originally, the AJAX pop-up was the only way to view results.) This option is configurable via the Wordpress admin screen.
- Bloggers can now tweak the importance of things such as title, content, and tags in a blog search. This effectively allows control over what is considered relevant in a blog search.
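To give a feel for what tweaking field importance means, here is a deliberately simplified sketch. It is not how Lucene or wpSearch actually computes relevancy (the real scoring formula is more involved); the function `weightedScore` and the weight values are hypothetical and only illustrate the idea that a match in a heavily weighted field counts for more.

```php
<?php
// Simplified illustration only: Lucene's real scoring formula is more involved.

/**
 * Score a post for a search term by counting occurrences in each field
 * and scaling each count by that field's weight.
 */
function weightedScore(array $post, $term, array $weights)
{
    $score = 0.0;
    foreach ($weights as $field => $weight) {
        $text = isset($post[$field]) ? strtolower($post[$field]) : '';
        // Count how often the term appears in this field, then apply the weight.
        $score += $weight * substr_count($text, strtolower($term));
    }
    return $score;
}

$post = array(
    'title'   => 'Lucene search',
    'content' => 'Search tips and more search',
    'tags'    => 'php',
);

// Favor titles: one title hit is worth three content hits.
$titleHeavy = weightedScore($post, 'search', array('title' => 3, 'content' => 1, 'tags' => 2));

// Favor content instead.
$contentHeavy = weightedScore($post, 'search', array('title' => 1, 'content' => 3, 'tags' => 2));
```

Changing the weights changes which posts float to the top for the same query, which is the kind of control the setting is meant to give.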
So what’s next for wpSearch?
More searchable content: It’s no secret that the best content on a blog is sometimes in the comments. This is especially true for bloggers of tech and programming sites where blog readers often put useful contributions in comments.
The opening of the source: At SourceForge! Sure, PHP is inherently open-source (it’s a scripting language, after all!). But the best future for wpSearch would entail its placement into SourceForge.NET where the coding community can have the opportunity to contribute to the wpSearch project. wpSearch is already registered at SourceForge, and has a project page at:
http://wpsearch.sourceforge.net/. (Right now, there isn’t much set up.)
I plan to have wpSearch developed at SourceForge, and have stable releases be uploaded to the plug-in repository at Wordpress.
There are some other features I plan to add to wpSearch very shortly, one of which is contextual search result content, so you can see the words around the matching content of a search result.
I can’t think of the others off the top of my head. What I would really like to know is if anyone finds wpSearch to be of value so far, and whether they are having any difficulties.
I read on another blog that blogs get xx% more comments if the words “Have your say” are at the end of a post. I think I’ll try that.
Have your say!
A Lucene-based Search Plugin For Wordpress
There are many things I love about Wordpress — the extendability, the ease of use, and large library of themes available online, to name a few. But if there is one aspect of Wordpress that needs a little work, it is the default search functionality.
Recently, I’ve been spending a lot of time working on a search plugin for Wordpress that is based on the Lucene search engine — a very cool and powerful search library used by a lot of big places. The plugin is in its beta stage, and ready for use and evaluation by anyone who would like to check it out. The plugin is currently implemented on my blog, so you can use the search box on the upper-right side to see it in action.
wpSearch uses the PHP port of the library by Zend. It also spawned a sub-project, the PHP StandardAnalyzer. You can read more about that here.
The search currently uses a lightbox floating over the page to allow users to navigate search results. An option to integrate the results into the page may be added in the future.
The major features of wpSearch are:
- Unmatched and customizable search relevancy (that’s the power of Lucene working)
- Very fast search speed
- Wildcard and Boolean operator support
- Easy installation
- Instantly updated searching after a post has been written
- Searching of Posts and Pages
Features for advanced users:
- Customizable interface via CSS
- Access to the internal search service for extendability
wpSearch was written for a development contest at LTech Consulting (a firm specializing in search with Lucene and the Google Search Appliance), but with the full intent of being open source. If anyone is interested in helping develop it, drop me a comment on this post.
Also, if anyone gives this plugin a try and has any suggestions, I would really appreciate your input! Just leave a comment and I’ll get back to you. The plugin will be made available in the Wordpress plugin repository shortly.
Full information about wpSearch (installation instructions, screenshots, etc) is available on wpSearch’s project page.
A Stemming Analyzer for Zend’s PHP Lucene
In my last post I spoke a little about Zend’s Lucene implementation in PHP, and its extensive usefulness for content-oriented PHP web applications. One of the roadblocks to implementing a Google-like search, however, was the absence of a stemming analyzer in the Zend package.
I ran into this issue with PHP Lucene while developing a plug-in for Wordpress. I wasn’t getting the relevancy I was looking for in test searches, so I decided to develop an analyzer of my own. I decided that this analyzer should:
- Stem words for greater search relevancy
- Use the pre-existing Zend lowercase filter
- Filter out a standard set of stop words using the Zend stop words filter
After a couple days working on the issue, I’ve developed an analyzer that performs these tasks. I’ve named it the ‘StandardAnalyzer’ after the implementation Java’s Lucene has. You can download the StandardAnalyzer at its project page.
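The three tasks above form a small pipeline, which can be sketched in plain PHP as follows. This is only an illustration of the flow, not the StandardAnalyzer’s code: the function names are hypothetical, the stop word list is supplied by the caller, and `toyStem` is a deliberately naive suffix stripper standing in for a real stemming algorithm.

```php
<?php
// Toy pipeline: lowercase -> stop-word removal -> crude suffix stripping.
// The stemmer is a naive stand-in, NOT the algorithm the StandardAnalyzer uses.

function toyStem($word)
{
    foreach (array('ing', 'ed', 'es', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip a suffix when at least three letters would remain.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

function toyAnalyze($text, array $stopWords)
{
    // 1. Lowercase the text and split it into alphabetic tokens.
    preg_match_all('/[a-z]+/', strtolower($text), $matches);

    $tokens = array();
    foreach ($matches[0] as $token) {
        // 2. Drop stop words.
        if (in_array($token, $stopWords, true)) {
            continue;
        }
        // 3. Stem whatever is left.
        $tokens[] = toyStem($token);
    }
    return $tokens;
}

$tokens = toyAnalyze('The Dogs are Barking', array('the', 'are', 'a'));
// $tokens is now array('dog', 'bark')
```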
Just a few notes about its creation:
- It is not meant to sit within the Zend framework folder. The ‘StandardAnalyzer’ should sit alongside it, and is configured accordingly. The reason for this is to keep what is Zend’s in Zend’s folder, and what is the user’s in his own. I figured that if the StandardAnalyzer was ever integrated into the framework, the good folks at Zend would know best how they would like it.
- The code provided handles English words only, but it is organized to accommodate other languages in the future.
- I must give a special thanks to Richard Heyes, whose Stemming algorithm is used instead of my own. In tests, I found his code to be a bit more elegant and quicker than my own, which was a direct port of the Java stemming algorithm. From what I gather, Richard is a Zend-Certified Engineer, making his code usage very fitting.
Example Usage
I’ve decided to pack the StandardAnalyzer with an example project and index to make things a little easier for those looking to use it. The example project, as well as most user projects, would start off like:
    require_once 'Zend/Search/Lucene.php';
    require_once 'StandardAnalyzer/Analyzer/Standard/English.php';

As mentioned before, the StandardAnalyzer folder should sit in the same directory as the Zend Framework. Now that you have the power of Zend ready to go, you can proceed to build your index. But don’t forget that to use the StandardAnalyzer, you have to set the default analyzer to an instance of the StandardAnalyzer. So before you index documents or search over the index, you should call:

    Zend_Search_Lucene_Analysis_Analyzer::setDefault (
        new StandardAnalyzer_Analyzer_Standard_English()
    );

I folded that line to keep it looking readable.
Anyway, any indexing or searching you do after this line uses the StandardAnalyzer. (I may not have been very clear, but the same analyzer needs to be used when indexing and searching, or else you won’t get many results.) You can also change the getDefault() code in Zend/Search/Lucene/Analysis/Analyzer to reference the StandardAnalyzer. But I would rather not change the Framework’s code, and leave it in untainted form.
So take a look at the StandardAnalyzer project, and the example project. The example was put together fairly quickly, but it should provide a good example of how to use it. I think a synonym filter would make a nice addition in the future, so I might take a look into that.
A word on Lucene’s PHP port by Zend
Lucene is an open source search engine written in Java. If you have never heard of it prior to now, listen to this: It allows you to create a mini google-like search for anything. That’s right — anything.
But I’ll be a little more specific: Consider you run a news website — or a wiki for that matter. How would you let users search the website? For most programmers, the answer is in implementing a plain-vanilla SQL search over the title and content of the articles. There are a few issues with this approach:
- The search time can be fairly lengthy
- Running a LIKE query can still be very inaccurate (a search for ‘manager’ over a field containing ‘manage’ will not be considered a match)
- There is almost no relevancy relationship in the way the results are ordered
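The second point is easy to reproduce without a database. The sketch below imitates a `LIKE '%term%'` condition with a plain case-insensitive substring test; the helper name `likeMatch` is made up for this example, and it ignores SQL wildcard characters inside the term.

```php
<?php
// Rough PHP stand-in for `WHERE content LIKE '%term%'`: a plain substring test.
function likeMatch($content, $term)
{
    return stripos($content, $term) !== false;
}

$content = 'She can manage large teams.';

$foundManage  = likeMatch($content, 'manage');  // true: the substring is present
$foundManager = likeMatch($content, 'manager'); // false: no such substring, so no match
```

A substring test either matches or it doesn’t, which is also why it gives you nothing to rank results by.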
Lucene is a Java package which lets a Java programmer insert documents into an ‘index’, basically the search engine’s database, and search over that index later on. So it is a true search engine that a Java programmer can use to grab information at incredible speeds: milliseconds in my tests.
The details of Lucene can be found at its Apache incubator Site.
I’ll get to the real point of this post. Considering how useful a tool Lucene would be, you are probably somewhat disappointed that I said it was for Java. After all, many would find something like this most useful if integrated with a server-side language such as PHP.
Zend, a PHP-devoted firm most noted for the Zend Framework, created a “Search” component as part of its framework, which is a port of Lucene for PHP. Using this port can be extremely useful for implementing search functionality in a web application. There is a single problem standing in the way of creating a true full-text search, however, and that is the default search functionality provided in PHP Lucene.
Consider a scenario where we are employers searching for prospective employees on a job search board. In a certain applicant’s resume, he states that he has “Managed a software team with great success, and has great managerial skills.”
Let’s assume this guy’s resume, as well as thousands of other resumes, are in a Lucene search index. When an employer executes a search on the job board, the job board code then uses the Lucene API to find documents matching the employer’s search terms, “sales manager”.
Using the standard functionality of PHP Lucene, our employer would likely never find the applicant we mentioned. Why? Because the words ‘managed’ and ‘managerial’ are not the word ‘manager’. Even though this document is very relevant to the employer’s search, it will be nowhere within the result set.
Java Lucene has a way to overcome this scenario: the Standard Analyzer. The Standard analyzer is a component that Java Lucene can use to manipulate data when it is going into a search index. So when “Managed a software team with great success, and has great managerial skills” is put in the index, it will be stored as “manag a software team with great success, and has great manag skill.” The standard analyzer performs lower casing and word stemming on the data of a document.
The analyzer is also used on queries. “Sales manager” would become “sal manag”. Now a query of these terms would definitely turn up the employee we just spoke about.
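Here is a tiny, self-contained sketch of that idea: documents and queries both pass through the same analysis step before they meet in the index. Nothing in it comes from Lucene itself. The function names are hypothetical, and `crudeStem` is a toy suffix stripper chosen so that the resume example works; a real stemming algorithm is far more careful and may produce different stems.

```php
<?php
// Toy stemmer and index for illustration only; real stemming rules are far more careful.
function crudeStem($word)
{
    foreach (array('erial', 'er', 'ed', 'es', 's') as $suffix) {
        $keep = strlen($word) - strlen($suffix);
        // Only strip a suffix when at least three letters would remain.
        if ($keep >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $keep);
        }
    }
    return $word;
}

// The shared analysis step: lowercase, tokenize, stem.
function analyze($text)
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return array_map('crudeStem', $matches[0]);
}

// Build an inverted index: stemmed term => list of document ids.
function buildIndex(array $documents)
{
    $index = array();
    foreach ($documents as $id => $text) {
        foreach (array_unique(analyze($text)) as $term) {
            $index[$term][] = $id;
        }
    }
    return $index;
}

// Analyze the query the same way, then collect documents matching any term.
function search(array $index, $query)
{
    $hits = array();
    foreach (analyze($query) as $term) {
        if (isset($index[$term])) {
            $hits = array_merge($hits, $index[$term]);
        }
    }
    return array_values(array_unique($hits));
}

$index = buildIndex(array(
    'resume1' => 'Managed a software team with great success, and has great managerial skills',
    'resume2' => 'Designed databases',
));

$results = search($index, 'sales manager'); // array('resume1')
```

If the query skipped the analysis step, the raw word "manager" would never equal the stored term "manag", which is why the same analyzer has to be used on both sides.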
PHP’s Lucene unfortunately does not have this ability yet. My next post will be about my creation of such an analyzer for PHP Lucene.










