Re: Searching module

by eduardo (Curate)
on Jan 30, 2001 at 00:38 UTC

in reply to Searching module

ZydecoSue said:

I noticed that CPAN contains one called Search-InvertedIndex, but that seems really complicated for what I thought should be a simple task.

And eduardo cringed... I have written search engines for pretty much my entire professional programming life. At every single employer I can think of, all I did was write indexers and search engines for different types of data: relational data, flat data, ISAM data, geographic data, archaic data, encrypted data... Please, do yourself a favor and realize that searching is one of the most time-honored and well-studied fields in computer science. If you pick up Sorting and Searching by the great Knuth, you will realize that if it took him half of a 780-page book, maybe there is more complexity to this entire "searching" thing than there at first seems to be on the surface.

The first and most important thing that you need to do is understand the data that you are searching through. Is it flat files? Is it DBM files? Are you looking at RDBMS tables, or an OORDBMS? What is the "nature" of the data, what is its "thingness"? What does it contain, what does it show you, how can it be indexed?

Most data that you will find can be described by one of two categories:

  • That which has a key
  • That which does not have a key

If you realize that your data can be keyed, then your problems become much easier. There are hundreds if not thousands of mechanisms for searching through keyed information. Your choices include:

  • Create a database with primary keys
  • Create DBM files which you tie
  • Create keyed index files
  • Use some pre-built system (it's amazing what's out there)
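To make the keyed case concrete, here is a minimal sketch of the first option, a table with a primary key. It is written in Python with the standard sqlite3 module purely for illustration; the table name `docs`, the helper names and the sample records are all made up for this example, not taken from anything in the original question.

```python
import sqlite3

def build_keyed_store(records):
    """Load (key, value) pairs into an in-memory SQLite table keyed on `key`."""
    conn = sqlite3.connect(":memory:")
    # The PRIMARY KEY makes SQLite maintain an index on `key`,
    # so lookups by key do not have to scan every row.
    conn.execute("CREATE TABLE docs (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
    conn.executemany("INSERT INTO docs (key, value) VALUES (?, ?)", records)
    conn.commit()
    return conn

def lookup(conn, key):
    """Return the value stored under `key`, or None if the key is absent."""
    row = conn.execute("SELECT value FROM docs WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

# Hypothetical sample records, for illustration only.
records = [
    ("knuth", "Sorting and Searching"),
    ("glimpse", "full text search tool"),
]
conn = build_keyed_store(records)
print(lookup(conn, "knuth"))
print(lookup(conn, "no-such-key"))
```

The point is only that once the data has a key, the storage layer does the searching for you; a tied DBM file or a keyed index file gives you the same shape of lookup.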

If, however, you are doing free-form searching on data that cannot be related as simply as key => value, then the problem is a slight bit more complicated. You are asking for something more "full-text" and open-ended. This is very difficult to implement right, which is why search engines differ so much in quality. A search engine (like Google) does just this: it attempts to intelligently parse the free-form data that exists on the internet. There is *never* a good reason to reinvent the wheel (well, I lie, sometimes for didactic purposes)... if this is the type of data you have, then I suggest you find an indexing / full-text search system:

  • Glimpse is an amazing product for full text searching
  • ht://dig is also pretty good
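For didactic purposes only, here is a toy sketch of the inverted-index idea that a module like Search-InvertedIndex is named after: map each term to the set of documents containing it, then intersect those sets to answer a query. It is in Python, the function names and sample documents are invented for the example, and it makes no claim about how Glimpse or ht://dig work internally. It also ignores everything that makes real full-text search hard (stemming, stop words, ranking, phrase queries, index updates, scale).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents, for illustration only.
docs = {
    1: "Searching and sorting are well studied",
    2: "Full text searching is hard to get right",
    3: "Sorting keyed data is the easy case",
}
index = build_index(docs)
print(sorted(search(index, "searching")))
print(sorted(search(index, "sorting searching")))
```

Even this toy shows where the complexity creeps in: the tokenizer alone decides what counts as a match, and nothing here says which of the matching documents is the best one.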

However, all that I can suggest is: do yourself a favor and accept that this is a more complex thing than just indexing and using grep. Understand your data, understand your structure, understand what it is that you are trying to accomplish. And remember, you can do what merlyn says in his WebTechniques column: use WWW::Search and rely on Altavista to do your searching for you :)