Speed searching HTML docs

by the_slycer (Chaplain)
on Aug 15, 2002 at 19:59 UTC [id://190510]

I've been pondering changing an application that I created a year or so ago to make it somewhat more robust.

The application is a tool that searches through a whole bunch (say 700 or so) of HTML files. It prints a listing of (and links to) the files, sorted by how many times the keyword matches.

So far so good; this sounds easy, right? The problem lies in the fact that the documents are constantly updated (say 6 or 7 files change every day) by multiple users, and the searches need to be as "real-time" as possible.

The way I've solved this in the past was by building two applications. One checks the files for updates (every 5 minutes), parses them, and stores a hash mapping keywords to filenames in a Storable file.
The second is just a CGI interface that loads the stored file and, blazingly fast, finds the "answers" to the search.
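For illustration, here is a minimal sketch of that two-part design. The paths, the crude tag-stripping regex, and the word rule are assumptions for the sketch, not the original code:

    #!/usr/bin/perl
    # Updater sketch: count keyword occurrences per HTML file and
    # freeze the index with Storable.
    use strict;
    use warnings;
    use File::Find;
    use Storable qw(store);

    my %index;    # keyword => { filename => match count }
    find(sub {
        return unless /\.html?$/i;
        open my $fh, '<', $_ or return;
        my $text = do { local $/; <$fh> };     # slurp the file
        $text =~ s/<[^>]*>/ /g;    # crude tag strip; HTML::Parser is safer
        $index{lc $1}{$File::Find::name}++ while $text =~ /(\w+)/g;
    }, 'C:/docs');
    store \%index, 'index.sto';

And the CGI side, which just thaws the index and ranks files by match count:

    #!/usr/bin/perl
    # Search sketch: load the prebuilt index and print ranked hits.
    use strict;
    use warnings;
    use Storable qw(retrieve);
    use CGI qw(param);

    my $keyword = lc(param('q') || '');
    my $hits    = retrieve('index.sto')->{$keyword} || {};
    print "Content-type: text/plain\n\n";
    print "$_: $hits->{$_} matches\n"
        for sort { $hits->{$b} <=> $hits->{$a} } keys %$hits;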

There are a couple of reasons I don't like this approach, the main one being that every once in a while the application checking for updates dies. Sometimes we don't notice, and people are retrieving out-of-date information.

The second reason I don't like this is that the tool monitoring the files runs from a command prompt (yes, this is all on Windows), which requires someone to stay logged in on the server.

The third reason I want to rewrite this is that I finished it when I was much newer to Perl than I am now. There is some really ugly code in it, and since I may be moving to a new job (actually, just losing this one), I want to leave my successor readable code.

So I'm polling for suggestions: given the above scenario, what would you suggest as the best way to accomplish my goals? Those goals, to clarify: a pseudo-real-time search that is very fast and stable.

The ideas I've been kicking around:
  • When a search occurs, check the age of the cached information; if it is too old, kick off a separate process to check for updates and rebuild the cache (sketched below)
  • Offer an update button when creating documents
  • Rebuild the app so it can run as a Windows service to update the cached information
  • Real-time search, possibly using fork() to search multiple files at once
I can see positives and negatives in all of the above, what would you suggest?
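For the first idea, a sketch of the search-time staleness check. On Win32 Perl, system(1, ...) spawns a command and returns immediately, so the search doesn't block on the rebuild; the five-minute threshold and script name are assumptions:

    use strict;
    use warnings;

    my $cache = 'index.sto';
    if (!-e $cache or time - (stat $cache)[9] > 5 * 60) {
        # system(1, ...) is Win32 Perl's asynchronous spawn: it starts
        # the updater and returns without waiting for it to finish.
        system(1, 'perl', 'update_index.pl');
    }
    # ...then answer this search from the existing (possibly stale) cache.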

Replies are listed 'Best First'.
Re: Speed searching HTML docs
by perrin (Chancellor) on Aug 15, 2002 at 20:30 UTC
    You might be better off using one of the many existing open-source search engines for this, but I think it would solve your immediate problems if you just ran your updater as a "scheduled task" every 5 minutes. Then there would be no question of it dying, and no need to stay logged in. In case you haven't heard of it, "Scheduled Tasks" is a cron-like feature of Windows that you can access through the Control Panel.
Re: Speed searching HTML docs
by mp (Deacon) on Aug 15, 2002 at 21:48 UTC
    Are the documents updated through a web interface that you also control? That is, can you drive the rebuild off the update event rather than having to poll the last-updated time? If so, you could rebuild on document update (with a mechanism to avoid multiple concurrent rebuilds).
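    Even without a web interface to hook, Windows can deliver directory-change events, which gets you most of the way to event-driven rebuilds. A sketch using Win32::ChangeNotify from libwin32; the watched path, filter, and updater script are assumptions:

        use strict;
        use warnings;
        use Win32::ChangeNotify;

        # Watch the shared directory (and subdirectories, hence the 1)
        # for file creations, deletions, renames, and writes.
        my $notify = Win32::ChangeNotify->new('C:/docs', 1,
                         'FILE_NAME LAST_WRITE')
            or die "Can't watch directory: $^E";

        while (1) {
            $notify->wait;     # blocks until something changes
            $notify->reset;    # re-arm before the (possibly slow) rebuild
            system 'perl', 'update_index.pl';
        }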

    If you are forced to poll file timestamps to determine when to update, one possibility would be to use the first approach you mention above (updating on demand when a search is requested) with some modifications:
    1. Whenever you rebuild the cache, store the fact that a cache rebuild was initiated at such and such a time. After successful completion of the cache rebuild, store the time that the cache rebuild started.
    2. When a search occurs, you can compare the file timestamps to the cache rebuild start timestamp. If any file is newer than the cache rebuild start time, a rebuild is needed.
    3. To avoid having to do step 2 very often, you can also record the last time you did step 2 and only do it again after some fixed time elapsed (time-to-live).
    4. You would need some mechanism to avoid running multiple cache rebuilds concurrently, but you also need a way to prevent that mechanism from locking out all future cache rebuilds if a cache rebuild failed part way through.
    5. The user that caused a cache rebuild could be returned results from a search against the old keyword cache, so that he doesn't have to wait for the rebuild to take place (if that is acceptable).
    6. You might also need a mechanism for preventing step 2 from running multiple times concurrently.
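    A sketch of steps 1 and 4, assuming a lock file whose age doubles as the "wedged rebuild" detector; the file names and the ten-minute stale-lock timeout are illustrative:

        use strict;
        use warnings;
        use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

        my $lock_file  = 'rebuild.lock';
        my $stamp_file = 'rebuild.stamp';

        sub rebuild_cache {
            # Step 4: a recent lock file means a rebuild is running;
            # an old one means a rebuild died partway, so reclaim it.
            if (-e $lock_file) {
                return if time - (stat $lock_file)[9] < 10 * 60;
                unlink $lock_file;
            }
            sysopen my $lock, $lock_file, O_WRONLY | O_CREAT | O_EXCL
                or return;    # lost the race to a concurrent rebuild

            my $started = time;
            # ... reindex every file whose mtime is newer than the time
            # recorded in $stamp_file (step 2 above) ...

            # Step 1: record the start time only after a clean finish.
            open my $fh, '>', $stamp_file or die "stamp: $!";
            print $fh $started;
            close $fh;
            close $lock;
            unlink $lock_file;
        }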

    This would be much easier on an operating system that had a reliable task scheduler like cron.
Re: Speed searching HTML docs
by Abigail-II (Bishop) on Aug 16, 2002 at 09:11 UTC
    One thing I'm missing in this scenario is how you deal with files that are deleted. As you describe it now, a search might return files that are no longer there.

    If I had control over the publishing system, it would be easy: whenever a file is added, modified, or deleted, you update your index.

    Otherwise, I'd run a scheduled process (for instance from cron, or whatever Windows uses). It should take the list of indexed files (with their timestamps) and compare them with the files and timestamps on the system; all differences need to be reindexed. If this dies halfway once in a while, the changes will be picked up the next time the scheduler fires the process.
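    A sketch of that comparison pass, which also catches deletions; reindex() and deindex() are hypothetical helpers, and the file names are illustrative:

        use strict;
        use warnings;
        use Storable qw(retrieve store);

        # filename => mtime map saved by the previous pass.
        my $mtimes = -e 'mtimes.sto' ? retrieve('mtimes.sto') : {};
        my %seen;

        for my $file (glob 'C:/docs/*.html') {
            my $mtime = (stat $file)[9];
            $seen{$file} = 1;
            next if defined $mtimes->{$file} and $mtimes->{$file} == $mtime;
            reindex($file);              # new or modified since last pass
            $mtimes->{$file} = $mtime;
        }
        for my $gone (grep { !$seen{$_} } keys %$mtimes) {
            deindex($gone);              # deleted since last pass
            delete $mtimes->{$gone};
        }
        store $mtimes, 'mtimes.sto';

        sub reindex { }    # hypothetical: (re)index one file
        sub deindex { }    # hypothetical: drop one file from the index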

    You may also want to completely reindex the site every night or over the weekend.

    Abigail

Re: Speed searching HTML docs
by vladb (Vicar) on Aug 16, 2002 at 07:15 UTC
    There are a dozen modules that can make this job ever so much easier. One that I would recommend is DBIx::FullTextSearch. I've used this module in every project that required some searching, and it has always worked fine for me. Unfortunately, it currently doesn't offer a search-scoring algorithm (actually, I'm still 'working' on it).

    I believe this module would also let you index individual files, even ones containing HTML.
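    Usage looks something like this; the DSN, credentials, and index name here are placeholders:

        use strict;
        use warnings;
        use DBI;
        use DBIx::FullTextSearch;

        my $dbh = DBI->connect('dbi:mysql:webdocs', 'user', 'password',
                               { RaiseError => 1 });

        # Create the index once, treating files as the documents.
        my $fts = DBIx::FullTextSearch->create($dbh, 'doc_fts',
                      frontend => 'file', backend => 'blob');
        $fts->index_document($_) for glob 'C:/docs/*.html';

        # Later (e.g. from the CGI script), reopen and search.
        my $search = DBIx::FullTextSearch->open($dbh, 'doc_fts');
        print "$_\n" for $search->contains('keyword');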

    _____________________
    # Under Construction
Re: Speed searching HTML docs
by the_slycer (Chaplain) on Aug 16, 2002 at 12:55 UTC
    I probably didn't explain it well enough originally: the bit that checks for updates does a file/timestamp check, and if a file has been deleted, I catch that and remove it from the search hash.

    If I had control over the publishing system, I would never have asked this question ;)
    The "publishing system" is currently a shared directory on a server that anybody in the group can update, and I cannot change that process.

    Thanks everybody for the help. I think I'll probably go with a scheduled process; it's a quick and easy modification to the current setup, so there's a big benefit there.
