
So why do you need a search engine? Isn't a database enough? For a
small website it might not matter, but for medium or large
applications it's often wiser to go for a search engine. That said,
even a small website can benefit from Solr if you want highly relevant
search results.
Let's imagine you have to create a search handler for an e-commerce
website. A naive approach would be to create a database query like this:

    SELECT * FROM PRODUCTS
    WHERE LOWER(title) LIKE LOWER('%$phrase%')
    OR LOWER(description) LIKE LOWER('%$phrase%');

It might work if the search phrase appears verbatim in a title or a
description. In real life, items have complex names, for instance:
Apple iPhone 4G black 16GB. If somebody looks for “iPhone 16GB”, no
results will be returned. You can mitigate this by replacing white
spaces with the “%” character before the phrase is passed to SQL.

    $phrase = str_replace(' ', '%', $phrase);

That solves the problem above, but what if the phrase is “iPhone
16GB 4G”? The keywords are now in a different order than in the title,
so the pattern no longer matches. I presume you could add an extra
column with the words ordered alphabetically, but what about
misspellings or synonyms? Coming up with a good solution for a search
system is a challenging task.
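
The limits of the LIKE approach are easy to reproduce without a database. The sketch below emulates the “%” wildcard with a regular expression and compares it with a check that ignores keyword order. The helpers likeMatch and allTokensMatch are hypothetical names written for this illustration only; this is not how Solr matches text, it only shows why keyword order matters for LIKE.

```php
<?php
// Emulates SQL LIKE for the '%' wildcard only (illustrative helper, no database involved).
function likeMatch(string $pattern, string $text): bool
{
    // Quote each literal part, then join the parts with ".*" where '%' stood.
    $parts = array_map(fn($p) => preg_quote($p, '/'), explode('%', $pattern));
    return preg_match('/^' . implode('.*', $parts) . '$/is', $text) === 1;
}

// Order-independent check: every keyword must occur somewhere in the text.
function allTokensMatch(string $phrase, string $text): bool
{
    foreach (preg_split('/\s+/', trim($phrase)) as $token) {
        if ($token !== '' && stripos($text, $token) === false) {
            return false;
        }
    }
    return true;
}

$title = 'Apple iPhone 4G black 16GB';

var_dump(likeMatch('%iPhone 16GB%', $title));       // false: phrase is not contiguous
var_dump(likeMatch('%iPhone%16GB%', $title));       // true: spaces replaced with %
var_dump(likeMatch('%iPhone%16GB%4G%', $title));    // false: keywords in a different order
var_dump(allTokensMatch('iPhone 16GB 4G', $title)); // true: order is ignored
```

Even the order-independent version does nothing for misspellings, synonyms or ranking, which is where a dedicated search engine earns its place.
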
Producing a clever algorithm is not the only problem. Text search is a
resource-consuming exercise, and laying too much stress on a database
is never a good idea. The underlying reason is that databases don't
scale as easily as other parts of the stack. You can't just add another
instance as you would for a web server or Memcached. Scaling a database
requires preparation, changes in software and configuration, downtime,
and is generally expensive.
The good news is that both problems can be solved with Solr.
Solr is an enterprise search platform based on Apache Lucene. It's
fast, stable, has good documentation and scales very well. While Solr
is a robust solution and listing all the features it provides is far
beyond the scope of this post, it's relatively easy to start using it.
First, download the latest version of Solr from the official site.
Solr is written in Java, so you also need a Java Runtime Environment to
run it.

    $ cd solr-4.1.0/example/
    $ java -jar start.jar

After a few seconds you should see something like:

    2013-03-09 18:47:41.177:INFO:oejs.AbstractConnector:Started SocketConnector@0.0.0.0:8983

Solr has a web interface which is available on port 8983. Open a web
browser and go to http://localhost:8983/solr/.
If you look at the left-hand navigation you will find “collection1”.
A collection in Solr is somewhat similar to a database table: you can
query it. Click on the collection and choose “Query” from the submenu.
The first option is called “Request-Handler (qt)” with the default
value “/select”. Request handlers are a sort of pre-defined query. If
you look into the Solr config file you can find all of them.

    $ vim solr-4.1.0/example/solr/collection1/conf/solrconfig.xml

    <requestHandler name="/select" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="echoParams">explicit</str>
            <int name="rows">10</int>
            <str name="df">text</str>
        </lst>
    </requestHandler>

The second and most interesting parameter is the query. The default
value “*:*” selects everything. If you click on “Execute Query” you
should get something like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="indent">true</str>
            <str name="q">*:*</str>
            <str name="wt">xml</str>
        </lst>
    </lst>
    <result name="response" numFound="0" start="0" />
    </response>

The index is empty, but that's not a problem. You can quickly insert some example data.

    $ cd solr-4.1.0/example/exampledocs/
    $ java -jar post.jar monitor.xml

    SimplePostTool version 1.5
    Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
    POSTing file monitor.xml
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/update..

Now you can go back to the query interface. This time one document should be returned.
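
The query form is a thin layer over HTTP: the request handler name becomes the URL path and the form fields become query-string parameters. Below is a minimal PHP sketch of building such a URL. The helper name buildSelectUrl is hypothetical, and the base URL with the collection1 core is an assumption for the stock example setup; the q, wt and indent values are the ones echoed in the response header earlier.

```php
<?php
// Builds the URL for the /select request handler.
// The default base URL and core name are assumptions for the example setup.
function buildSelectUrl(
    string $query,
    array $extra = [],
    string $base = 'http://localhost:8983/solr/collection1'
): string {
    // Later keys win, so $extra can override the defaults below.
    $params = array_merge(['q' => $query, 'wt' => 'xml', 'indent' => 'true'], $extra);
    return $base . '/select?' . http_build_query($params);
}

$url = buildSelectUrl('*:*', ['rows' => 5]);
echo $url, "\n";

// Fetching the URL needs a running Solr instance, for example:
// echo file_get_contents($url);
```

Parameters sent this way override the defaults configured for the handler, so passing rows here replaces the value of 10 set in solrconfig.xml.
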
The collection's data structure is defined in the schema file.

    $ vim solr-4.1.0/example/solr/collection1/conf/schema.xml

The file has very good comments and you can easily figure out what's
going on there. If you want to amend the schema, don't remove the field
named “text” without a good reason. Other fields are copied into it and
some request handlers refer to it (including /select, see above).

    $ grep text solr-4.1.0/example/solr/collection1/conf/schema.xml | grep copy

    <copyField source="cat" dest="text"/>
    <copyField source="name" dest="text"/>
    <copyField source="manu" dest="text"/>
    <copyField source="features" dest="text"/>
    <copyField source="includes" dest="text"/>
    <copyField source="title" dest="text"/>
    <copyField source="author" dest="text"/>
    <copyField source="description" dest="text"/>
    <copyField source="keywords" dest="text"/>
    <copyField source="content" dest="text"/>
    <copyField source="content_type" dest="text"/>
    <copyField source="resourcename" dest="text"/>
    <copyField source="url" dest="text"/>

In a relational database you don't want to duplicate data. Solr is not
a database, and duplication is fine here: many fields are copied into
the text field, and the default request handler searches that field.
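
To make the effect of copyField concrete, here is a small PHP illustration of the idea, not of Solr's internals: the values of selected source fields are gathered into one multi-valued catch-all field. The function name applyCopyFields is hypothetical, and plain arrays stand in for documents.

```php
<?php
// Mimics the idea behind <copyField>: values of the listed source fields
// are appended to a multi-valued destination field. Illustration only;
// Solr performs this itself at index time.
function applyCopyFields(array $doc, array $sources, string $dest = 'text'): array
{
    $copied = [];
    foreach ($sources as $field) {
        if (!array_key_exists($field, $doc)) {
            continue; // a document does not need to have every source field
        }
        foreach ((array) $doc[$field] as $value) {
            $copied[] = $value;
        }
    }
    $doc[$dest] = array_merge((array) ($doc[$dest] ?? []), $copied);
    return $doc;
}

$doc = applyCopyFields(
    ['id' => 100, 'title' => 'Hello World', 'cat' => ['Foo', 'Bar']],
    ['cat', 'title']
);

print_r($doc['text']); // Foo, Bar, Hello World
```

A query against the catch-all field can then match a word regardless of which source field it came from, which is why the /select handler uses it as its default field.
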
To access Solr from PHP you need a client. I can recommend the one
available on PECL. It's fast, has a clear API and is well documented.
There is one issue with the current version (1.0.2) of the extension:
it doesn't work with Solr 4.x, because of a small difference in
protocol between 3.x and 4.x. Don't worry, I've fixed this issue and
you can download a working version from
https://github.com/lukaszkujawa/php-pecl-solr.
I've been using this fix for a while now and it feels stable. It
introduces a small change to the SolrClient constructor: an additional
parameter to specify the version. The patch will go into the official
release, so you won't lose compatibility with it.

    $ git clone https://github.com/lukaszkujawa/php-pecl-solr.git
    $ cd php-pecl-solr/
    $ phpize
    $ whereis php-config
    php-config: /usr/bin/php-config /usr/bin/X11/php-config
    $ ./configure --with-php-config=/usr/bin/php-config
    $ make
    $ make install

Edit your php.ini and add the line that loads the extension (for a
standard build this is extension=solr.so), then restart the web server.

    $ /etc/init.d/apache2 restart

Now we can create a PHP script which will insert something into the index.
  
  
   
    
     
    <?php

    // Connection settings; everything else is left at the extension's defaults.
    $options = array(
        'hostname' => '127.0.0.1',
    );

    // Pass "4.0" for any Solr 4.x; omit the second parameter for earlier versions.
    $client = new SolrClient($options, "4.0");

    $doc = new SolrInputDocument();

    $doc->addField('id', 100);
    $doc->addField('title', 'Hello World');
    $doc->addField('description', 'Example Document');
    // Adding the same field twice gives "cat" two values.
    $doc->addField('cat', 'Foo');
    $doc->addField('cat', 'Bar');

    $response = $client->addDocument($doc);

    // Commit so the new document becomes visible to searches.
    $client->commit();

    /* ------------------------------- */

    $query = new SolrQuery();

    $query->setQuery('hello');

    // Limit the response to the fields we need.
    $query->addField('id')
          ->addField('title')
          ->addField('description')
          ->addField('cat');

    $queryResponse = $client->query($query);

    $response = $queryResponse->getResponse();

    print_r($response->response->docs);

If you insert more than one document, commit once at the end. Committing is a resource-consuming process and you don't want commits to pile up.
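
One way to keep that rule in code is to separate the batching logic from the client calls. The sketch below is an assumption about how you might structure it, not part of the extension: indexInBatches is a hypothetical helper, and $send and $commit are callables you supply. With the PECL client they would wrap the calls that add documents and $client->commit(); here they only record what happened, so the example runs without Solr.

```php
<?php
/**
 * Sends rows in batches and commits exactly once at the end.
 * $send and $commit are caller-supplied callables (hypothetical design),
 * so the batching logic does not depend on a particular Solr client.
 */
function indexInBatches(iterable $rows, callable $send, callable $commit, int $batchSize = 100): int
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }
    $batch = [];
    $total = 0;
    foreach ($rows as $row) {
        $batch[] = $row;
        if (count($batch) === $batchSize) {
            $send($batch);
            $total += count($batch);
            $batch = [];
        }
    }
    if ($batch !== []) {
        $send($batch); // flush the last, partially filled batch
        $total += count($batch);
    }
    if ($total > 0) {
        $commit(); // a single commit for the whole run
    }
    return $total;
}

// Stand-in callables that only record calls instead of talking to Solr.
$sent = [];
$commits = 0;
$count = indexInBatches(
    [['id' => 1], ['id' => 2], ['id' => 3]],
    function (array $batch) use (&$sent) { $sent[] = count($batch); },
    function () use (&$commits) { $commits++; },
    2
);

printf("%d documents, batches: %s, commits: %d\n", $count, implode(',', $sent), $commits);
// prints "3 documents, batches: 2,1, commits: 1"
```
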
It's worth knowing how to work with Solr. You can use it in various
projects. It has very cool features which let you pull all the
required data in one request. You will have to invest some time to
master it, but it will pay off. Solr is very well documented and has an
active community. If you are serious about using it in your projects,
read Apache Solr 3 Enterprise Search Server. It will get you up to
speed not only with the service but also with the basics of data mining.
 