Tuesday, July 30, 2013

Integrate PHP application with Solr search engine

Apache Solr
So why do you need a search engine? Isn’t a database enough? For a small website it might not matter, but for medium or large applications it’s often wiser to go for a search engine. That said, even a small website can benefit from Solr if you want highly relevant search results.

Let’s imagine you have to create a search handler for an e-commerce website. A naive approach would be a database query that matches the search phrase against the title and description columns with LIKE '%phrase%'.
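As a rough sketch of that naive handler, the query and its parameters could be built as below. The items table, its column names and both helper functions are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical table and column names; adjust them to your own schema.
function naiveSearchSql(): string
{
    // Two distinct placeholders, because some PDO setups reject a repeated named placeholder.
    return 'SELECT * FROM items WHERE title LIKE :title OR description LIKE :description';
}

// Wrap the raw phrase in wildcards so it matches anywhere inside the column.
function naivePattern(string $phrase): string
{
    return '%' . $phrase . '%';
}

$pattern = naivePattern('iPhone');
$params  = [':title' => $pattern, ':description' => $pattern];

echo naiveSearchSql(), PHP_EOL;
echo $pattern, PHP_EOL; // %iPhone%
```

With a real connection you would hand the SQL to PDO::prepare() and the $params array to execute().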
It might work if the search phrase appears verbatim in a title or a description. In real life items have complex names, for instance: Apple iPhone 4G black 16GB. If somebody looks for “iPhone 16GB”, no results will be returned. You can mitigate this by replacing white space with the “%” character before the phrase is passed to SQL.
That solves the above problem, but what if the phrase is “iPhone 16GB 4G”? A different order of keywords obviously won’t match. I presume you could add an extra column with the words sorted alphabetically, but what about misspellings or synonyms? Coming up with a good search system is a challenging task.
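The three cases above can be reproduced without a database. The sketch below emulates LIKE in plain PHP (only the “%” wildcard, not “_”), so it is an approximation of what the SQL engine does, not a replacement for it. Both function names are made up for this example.

```php
<?php
// Rough emulation of SQL LIKE for illustration: only "%" is handled, "_" is not.
function likeMatch(string $pattern, string $text): bool
{
    $parts = array_map(function ($part) {
        return preg_quote($part, '/');
    }, explode('%', $pattern));

    return preg_match('/^' . implode('.*', $parts) . '$/is', $text) === 1;
}

// The mitigation from the text: collapse white space into "%" wildcards.
function wildcardPattern(string $phrase): string
{
    return '%' . preg_replace('/\s+/', '%', trim($phrase)) . '%';
}

$title = 'Apple iPhone 4G black 16GB';

var_dump(likeMatch('%iPhone 16GB%', $title));                   // bool(false)
var_dump(likeMatch(wildcardPattern('iPhone 16GB'), $title));    // bool(true)
var_dump(likeMatch(wildcardPattern('iPhone 16GB 4G'), $title)); // bool(false)
```

The last call fails because “%” wildcards still require the keywords to appear in the same order as in the title.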
Producing a clever algorithm is not the only problem. Text search is a resource-consuming exercise, and putting too much stress on a database is never a good idea. The main reason is that databases don’t scale easily. You can’t just add another instance as you would for a web server or Memcached. Scaling a database requires preparation, changes in software and configuration, and downtime, and generally speaking is expensive. The good news is that both problems can be solved with Solr.
Solr is an enterprise search platform based on Apache Lucene. It’s fast, stable, has good documentation and scales very well. Solr is a robust solution, and listing all the features it provides is far beyond the scope of this post, but it’s relatively easy to start using it.
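First, download the latest version of Solr from the official site. Solr is written in Java, so you also need a Java Runtime Environment to run it. In the 4.x distribution the bundled example server is started by running java -jar start.jar from the example directory.
After a few seconds you should see the server’s startup log in the console.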
Solr has a web interface, which is available on port 8983. Open a web browser and go to http://localhost:8983/solr/.
If you look at the left-hand side navigation you will find “collection1”. A collection in Solr is roughly similar to a database table: you can query it. Click on the collection and choose “Query” from the submenu.
The first option is called “Request-Handler (qt)”, with the default value “/select”. Request handlers are a sort of pre-defined query. If you look into the Solr config file (solrconfig.xml) you can find all of them.
The second and most interesting parameter is the query (q). The default value “*:*” selects everything. If you click on “Execute Query” you should get something like this:
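The admin form simply builds an HTTP request against the request handler, so the same query can be prepared from PHP. This is a minimal sketch that assumes the local example server and the stock “collection1”. The solrSelectUrl helper and the name:iphone query are illustrative, not part of Solr or of any client library.

```php
<?php
// Assumed defaults: local example server and the stock "collection1".
function solrSelectUrl(
    string $query,
    array $extra = [],
    string $base = 'http://localhost:8983/solr/collection1'
): string {
    // q is the query, wt selects the response format (JSON instead of the default XML).
    $params = array_merge(['q' => $query, 'wt' => 'json'], $extra);

    return $base . '/select?' . http_build_query($params);
}

echo solrSelectUrl('*:*'), PHP_EOL;
echo solrSelectUrl('name:iphone', ['rows' => 5]), PHP_EOL;
```

With the server running, the URL can be fetched with file_get_contents() and the body decoded with json_decode().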
    
    <result name="response" numFound="0" start="0" />
The index is empty, but that’s not a problem. You can quickly insert some example data.
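One way to do that without any client library is to POST an XML update message to the collection’s update handler. The sketch below only builds the message. The buildAddXml helper is hypothetical, and the id and name fields are assumed to exist in the example schema.

```php
<?php
// Build a Solr XML update message: <add><doc><field name="...">value</field></doc></add>
function buildAddXml(array $fields): string
{
    $xml = '<add><doc>';
    foreach ($fields as $name => $value) {
        $xml .= sprintf(
            '<field name="%s">%s</field>',
            htmlspecialchars((string) $name, ENT_QUOTES | ENT_XML1, 'UTF-8'),
            htmlspecialchars((string) $value, ENT_QUOTES | ENT_XML1, 'UTF-8')
        );
    }

    return $xml . '</doc></add>';
}

// Assumed field names from the example schema.
echo buildAddXml(['id' => 'demo-1', 'name' => 'Apple iPhone 4G black 16GB']), PHP_EOL;
```

The resulting string would be sent as the body of a POST request with Content-Type text/xml to http://localhost:8983/solr/collection1/update?commit=true.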
Now you can go back to the query interface. This time one document should be returned.
The collection’s data structure is defined in the schema file (schema.xml).
The file is very well commented and you can easily figure out what’s going on there. If you want to amend the schema, don’t remove the field named “text” (without a good reason). It’s used by other fields, and some request handlers refer to it (including /select, mentioned above).
In a relational database you don’t want to duplicate data. Solr is not a database, so duplication is fine: many fields are copied to the “text” field, and the default request handler searches there.
To access Solr from PHP you need a client. I can recommend the one available on PECL. It’s fast, has a clear API and is well documented. There is one issue with the current version (1.0.2) of the extension: it doesn’t work with Solr 4.x ;) . There is a small difference in protocol between 3.x and 4.x. Don’t worry, I’ve fixed this issue and you can download a working version from https://github.com/lukaszkujawa/php-pecl-solr. I’ve been using this fix for a while now and it feels stable. It introduces a small change to the SolrClient constructor: an additional parameter to specify the version. The patch will go into the official release, so you won’t lose consistency.
Edit your php.ini and add the line that loads the extension (on Linux typically extension=solr.so).
Restart the web server.
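A quick way to confirm that PHP picked up the extension after the restart is to ask the runtime directly. The solrExtensionStatus function name is made up for this example; extension_loaded() and phpversion() are standard PHP functions.

```php
<?php
// Report whether the PECL solr extension is loaded in the current PHP runtime.
function solrExtensionStatus(): string
{
    if (!extension_loaded('solr')) {
        return 'solr extension NOT loaded';
    }

    // phpversion() with an extension name returns that extension's version string.
    return 'solr extension loaded, version ' . phpversion('solr');
}

echo solrExtensionStatus(), PHP_EOL;
```

Run it through the web server rather than the command line if the two use different php.ini files.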
Now we can create a PHP script which will insert something into the index.
If you insert more than one document, commit once at the end. Committing is a resource-consuming process and you don’t want commits to pile up.
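A minimal sketch of such a script, using the stock PECL classes SolrClient and SolrInputDocument, might look like the following. The connection options assume the local example server and “collection1”, the indexItems helper and the field names are illustrative, and the extra version parameter of the patched constructor is left out because its exact form is not shown here.

```php
<?php
// Sketch only: needs the PECL solr extension and a running Solr server.
// Connection options assume the local example server and "collection1".
function indexItems(array $items): int
{
    $client = new SolrClient([
        'hostname' => 'localhost',
        'port'     => 8983,
        'path'     => 'solr/collection1',
    ]);

    $count = 0;
    foreach ($items as $item) {
        $doc = new SolrInputDocument();
        foreach ($item as $field => $value) {
            $doc->addField($field, (string) $value);
        }
        $client->addDocument($doc);
        $count++;
    }

    // One commit for the whole batch instead of one per document.
    $client->commit();

    return $count;
}

// Example call (not executed here):
// indexItems([
//     ['id' => 'item-1', 'name' => 'Apple iPhone 4G black 16GB'],
// ]);
```

The function returns the number of documents sent, which is handy for logging an import run.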
It’s worth knowing how to work with Solr. You can use it in various projects. It has very cool features which let you pull all the required data in one request. You will have to invest some time to master it, but it will pay off. Solr is very well documented and has an active community. If you are serious about using it in your projects, read Apache Solr 3 Enterprise Search Server. It will get you up to speed not only with the service but also with the basics of data mining.
