How to build a Search Engine.

Whether you are building a CMS, maintaining an intranet portal, managing a large database of information, or acting as an information aggregator, you will probably need to offer some level of search functionality to your user base. This meditation presents several different options, and offers and in-depth discussion and example of a comprehensive solution.

Search Engine Providers

For basic websites, the easiest and quickest solution is to use an existing search engine provider like Google. Once Google has indexed your site, visitors can search for content using the following Google search syntax:

site:yourdomain.com searchterms
[download]

Google even offers copy-and-paste solutions for adding a search box to your website that will automatically restrict visitors' searches to your domain.

Google now even offers webmaster tools to review how your site is indexed and locate potential problems. You can even notify Google of updates to your site, so they may potentially be indexed sooner.

While this is quick and easy, it has many drawbacks! Your dataset must be a website, it must be publicly available and cannot require any authentication, it must not look like a web application¹, the search results look like a separate website and may include ads, you cannot offer advanced searching options (e.g. limiting search to one section of a site), and Google (or any other provider) will be slow at catching updates and changes to your site.

¹ - Google does not index pages that look like they have an ID or session ID, or make use of more than a few parameters.

So, what do you do if you cannot live with any of these limitations? On a side note, Google does offer commercial search services, but I am cheap!

Databases

If your website stores its data in a database, you may have the option of using full-text catalogues. A full-text catalogue indexes one or more fields in a table for searching. You can then execute an SQL query which includes a search query. The database returns records that are relevant your search.

FTCs are convenient because they automatically update as your data changes and they don't require additional libraries. However, their functionality is often limited in terms of analysis, they are not extensible, and their query languages are often not what people are used to. And since each catalogue is limited to a table, they may not work for you at all if your data is related through many tables. For more information, check out Mysql's Full Text documentation.

Embeddable Search Engines

An embeddable search engine is a library that allows you to build search functionality directly into your program (or website) without using a stand-alone service. This is analogous to an embedded database system, as apposed to a database server like mysql.

Before we dive into our options, let's consider the various features we might be interested in:

Analyzers: In search engine terms, an analyzer is a component that determines how text should be indexed. For example, a Stop Word analyzer indexes all but the most commonly used words – so words like A, An, The, et al. don't effect your queries. A Stemming analyzer reduces multiple inflections of the same word – so a search for Eat would find Eat, Eats, Eating, etc. Normally, for text-based search we want both of these, and maybe more. However, if we offer advanced search functions (like limiting a search to a specific section of the site) we would like to use a very basic analyzer for indexing the section names, so they don't get messed up (more on this later).
Preview: A good search engine will help us generate previews for display in the list of search results. Even better is when it highlights words in the preview related to your query.
Dependencies: More than just what other Perl modules are used, but in some environments, XS/C++ is not an option.
Extensibility: Can we create our own analyzers? Can we create our own query languages? Can we pre/post-process the results? This are things you may need.
Documentation: This requirement is often overlooked when considering products to solve a need.
Performance: The speed of the indexing process can be critical, especially if you have large files to index. Moreover, if you have a large dataset, you may need support for incremental updates (updating only the changed records, instead of reindexing the entire site every time something changes).

A Review of Available Tools

There are several available open source embeddable search engines out there. The defacto standard is Apache's Lucene. While Lucene is written in Java, making it of little use to us as Perl developers, it has driven the design of most alternatives and so it is important to know about. On the Perl side, there are several different ports of Lucene available.

Plucene is a port of Lucene to Perl. I started using this several years ago for a large CMS that I had built and continue to maintain. Its one major pro is that it is all Perl. However, off the shelf, it includes only a very basic analyzer, no preview generation, and is not known for its speed. I had to string together several additional Perl packages to get these features. My biggest issue with Plucen, however, was the lack of documentation. Without a background in how Apache's Lucene worked, I was left to navigate a very large set of PODs to find answers to may questions. Also, I fear that the differences between Plucene and Lucene make Lucene's documentation a misleading reference now and then. Regardless, Plucene was very stable and performed very well during the several years I used it!

Lucene is a Perl binding the C++ port of Lucene, called CLucene. Note: do not confuse this with CLucene, the deprecated Perl bindings for the same library. Being a native port, it is certainly faster than Plucene. It offers a very direct and comprehensive overview, but still leaves you the look at Apache's Lucene's documentation. It offers several analyzers and incremental updates. But, you are still own your own for generating previews, and parsing PDFs, etc.

KinoSearch is a Perl/C++/XS search engine loosley based on Lucene. Unlike the two previous options, its API was designed for Perl, offering much easier and cleaner programming. Being native, it offers very fast indexing, good documentation on their website. It offers stemming and stopword analyzers, generates highlighted previews, you have more control over your index setup and contents, and is very extensible. Despite its low version number, it is very complete. This is the choice I went with and will discuss below.

Apache's Lucy is a brand-new project started by the creators of KinoSearch and a Ruby search engine called Ferret. They plan to create a native search engine with Perl and Ruby bindings. Perhaps, this is the beginning of a new defacto search engine with bindings for most languages (like mysql, and postgres in the database world). But, it has only just started, so it isn't yet an option.

Note: I want to make sure the authors of these modules are aware that I really appreciate their efforts and my criticism of their modules is merely a professional review. Maintaining a port of Lucene is a arduous task at best, and I thank you for all of your efforts!!!

HTDig is a non Lucene-related technology that seems dead. Their own website hasn't announced an update in 2.5 years.

SWISH-e is a very complete and comprehensive search indexing tool. It is also not related to Lucene. SWISH-e offers a command line tool to quickly index a file set or website. It indexes many different file types, and does all of the website crawling work for you. It is fast and SWISH offers Perl bindings. It has a smart analyzer and is extensible through unix-style piping. They have a nice article called How to Index Anything. This is a very quick and complete solution, but I needed more control over how my content was indexed and searched and so would not work for me.

Indexing Your Site

I chose KinoSearch for my needs this time around. However, you can use what you learn here with any Lucene related library.

The first thing you want to do is open an index. I suggest you put the index somewhere outside of the webspace. A minor annoyance with KinoSearch was the fact that you have to tell KinoSearch to create a new index if there isn't one already there. Otherwise it will barf. And you can't just always create a new index if you plan to do incremental updates.

my $index = KinoSearch::InvIndexer->new(
     analyzer => KinoSearch::Analysis::PolyAnalyzer->new(language => '
+en'),
     invindex => $pathToIndex,
     # Create the index if it isn't already there.
     create => not(-e $pathToIndex/segments"),
);
[download]

The KinoSearch::Analysis::PolyAnalyzer is a great feature of KinoSearch. It automatically loads analyzers designed for your specified language, including Stemming features. Use this for "google style" searching.

The next step is to define the structure of what you want to index. This, to me, is one of they most powerful features of Lucene style engines. Think of this step as defining the columns of a table in the database. You indicate what fields you want, which ones are indexed (so they can be searched), which ones are stored (for use during search results), which ones are analyzed, etc.

$index->spec_field(name => 'id', analyzed => 0, vectorized => 0);
$index->spec_field(name => ‘section', analyzed => 0, vectorized => 0);
$index->spec_field(name => 'url', indexed => 0, analyzed => 0, vectori
+zed => 0);  
$index->spec_field(name => 'title', boost => 3, vectorized => 0);
$index->spec_field(name => 'content');
[download]

Let's go over each field.

ID: this will hold the ID of each record. We will use ID later to remove a record when we want to update it. Since we will only ever search for exact ID matches (non-fuzzy), we don't want it analyzed. We also tell it not to vectorize the IDs. Vectorizing is used during preview highlighting. The only fields you should vectorize are the ones that are using in a preview.
Section: the section with in the website. We will use this field to narrow the search to a specific section, if the visitor wants. We can provide a dropdown, for instance, of available sections and add the selection to the query string, if they choose one. Since we will always want only exact section matches, we do not analyze it.
URL: the url the user should be taken to if they click on the result. In my application, the URLs are not very meaningfull, so I don't index them. That means they are not used in the search. Instead, I just store the URLs with the rest of the record for later use.
Title: the title of the page. If something matches in the title, then this record is probably even more relevant than if it just matched in the content. So, we apply a boost factor.
Content: the actually body of the page. Note that we are vectorizing this, since we plan to generate a preview of the relevant portions of the page.

You can add any fields to meet your needs (like Last Updated). These examples should give you everything you need to know to properly configure your fields.

Now we need to add your content. In some cases, you may just dump a bunch of database records, or you may choose to crawl a site or file set. Either way, it is up to you. Here is how you add each record to the index:

my $doc = $index->new_doc;
$doc->set_value(id => $record->{id} || '');
$doc->set_value(section => $record->{section} || '');
$doc->set_value(url => $record->{record} || '');
$doc->set_value(title => $record->{record} || '');
     
my $content = processContent($record->{content});
$doc->set_value(content => $content || '');
$index->add_doc($doc);
[download]

We create a new Document, the record in search engine terminology. We set the corresponding fields values, and then add the document to the index. We may, however, want to do some special processing to the content field first.

It is a good idea to strip out HTML. More over, you may want to only extract the portion of your webpage that is significant, so that ads and navigation don't affect search results.
If the data is coming from a file, you may want to use one of the tools below to extract the text.
In my node KinoSearch & Large Documents, I indicated that I was having trouble indexing very large files (> 3MB). So, for now, I limit the indexing to only the first 512 KChars. Frankly, if the author of the file cannot get the key points of his document across in the first 500,000 chars, than he is probably not a good writer. :-) update: creamygoodness pointed out that the use of $& could be killing my performance. It sure was. Indexing is now 100's of times faster. See KinoSearch & Large Documents for more info!

When we are all done, we finish and optimize the index:

$index->finish(optimize => 1);
[download]

Searching Your Index

Now we want to create a search. Generally, this part would be built into a script (CGI, shell, ModPerl, etc). I will leave that up to you. Here is how you execute the search:

# Open the index
my $index = KinoSearch::InvIndexer->new(
     analyzer => KinoSearch::Analysis::PolyAnalyzer->new(language => '
+en'),
     invindex => $pathToIndex,
);

# Create a highlighter
My $highlighter = KinoSearch::Highlight::Highlighter->new(excerpt_fiel
+d => 'content');

# Execute the search
My $hits = $index->search($query);
[download]

Up to this point there should be no surprises. The highlighter is used to highlight generated previews. You have to tell it which field is used for generating the preview. The highlighter above puts strong tags around words that match. You can customize this to meet your needs. Note: the highlighter is smart enough to highlight terms that match because of your analyzer. So, in a search for Eat: Eating and Eats would also be highlighted if found.

Now we display our results:

# Get the total hits
my $count = $hits->total_hits;

# Get the first 10 records
$hits->seek(0, 10);

# Generate previews
$hits->create_excerpts(highlighter => $highlighter); 

while (my $result = $hits->fetch_hit_hashref)  {
      …
}
[download]

The preview generation step creates previews of each record. By default, they are limited to 200 chars, and show portions of the document that were most relevant to your search.

Inside the while loop, each $result is a hash reference with each of your fields for that record; id, url, section, title, content, etc. You will also have a field called score that has the score of the record, and excerpt which has your preview, all nicely highlighted. Also, the results come out in order of most to least relevant.

In the previous example, I took the query directly from the user and passed it to the index. But what about limiting sections, like I promised? Here we have an additional variable called $section containing their choice of section. I update the query to insure that all results are in that section as follows:

$query = qq/+section:"$section" AND ($query)/;
[download]

Incremental Updates

Sometimes it is easiest to just recrawl your dataset every once and while, or even every time it changes. KinoSearch, CLucene, SWISH-e, and Plucene are all plenty fast for most datasets. But, if you are concerned, or have extremely large sets of data (or a busy server) we can elect to update only the records that have been modified.

First, we open the index. Then for each updated record, we remove the existing entry:

$index->delete_docs_by_term(KinoSearch::Index::Term->new(id => $record
+->{id}));
[download]

This deletes all records with that ID number (in most cases, probably only one record). Then we create a new document, set all of the field values, and add it to the index.

When done updating, you need to finish the index, but you don't need to optimize at that moment. You can wait for several updates if your server is really busy.

Files

If you just want to index a bunch of files (PDFs, DOCs, etc) consider SWISH-e. But, if you have some files on your website, you will want to extract the text from them before you add it to your index. Here are some tools that will help:

HTML::TreeBuilder is a great choice for extracting only relevant portions of HTML files.
pdftotext for PDFs
wvWare for MS Word
Generically, you can get pretty far with some binary files using the strings command

Other Features

For fun, consider using Text::Aspell to support inline google-style spelling corrections.

I hope this helps!

Ted Young

($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

Comment on How to build a Search Engine. Select or Download Code


No such thing as a small change
	PerlMonks