Sitesearch Lite

by George_Sherston (Vicar)
on Nov 12, 2001 at 17:59 UTC ( #124824=sourcecode )
Category: CGI Programming
Author/Contact Info George_Sherston
Description: This is a simple pair of scripts that caches a single-directory web site and then searches the cache. I'm most grateful to pjf, demerphq and davorg for advice in the CB whilst I was running this up. As well as the two scripts, I've included a snippet that needs to go into a template file that matches the look of the rest of your site. Filepaths: 'your_file_path' is the path on your server, whilst 'your_path' is the full web address of your site's root directory. If the scripts are kept somewhere other than in your_path/cgi-bin/, a little editing is necessary. You'll also need to choose two strings that always occur in your html just before and just after the part of each page that holds the page-specific text; these are 'unique_ident_1' and 'unique_ident_2'. If you don't want to restrict the search this way, there are a couple of lines in the caching script that you'll need to comment out.
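
To see what the two marker strings do, here is a minimal sketch of the trimming step applied to an in-memory page. The sample HTML, the marker comments '<!-- content -->' and '<!-- /content -->', and the helper name extract_text are all hypothetical. Unlike the script below, this sketch matches the markers literally with \Q...\E and drops the start marker rather than keeping it.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the text between two marker strings,
# then replace every HTML tag with a space.
sub extract_text {
    my ($html, $start, $end) = @_;
    $html =~ s/^.*?\Q$start\E//s;   # drop everything up to and including the start marker
    $html =~ s/\Q$end\E.*$//s;      # drop the end marker and everything after it
    $html =~ s/<[^>]*>/ /g;         # replace each tag with a space
    return $html;
}

# Hypothetical page: menu and footer repeat on every page of the site.
my $page = <<'HTML';
<html><head><title>Home</title></head><body>
<div>Site menu</div>
<!-- content --><p>Welcome to the <b>monastery</b>.</p><!-- /content -->
<div>Footer</div>
</body></html>
HTML

my $text = extract_text($page, '<!-- content -->', '<!-- /content -->');
print "$text\n";
```

Only the text between the markers survives, so a search for a word that appears in the shared menu or footer will not match every page.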

Here is the cache-building script, which needs to be run each time the site is updated:

#!/usr/local/bin/perl -w

# this script builds a cache of web pages, which can be searched
# by the search script; should be run each time web site content changes

use strict;
use CGI qw(:standard);
use Data::Dumper;

my $dir = 'your_file_path';    # dir to search.
my $ext = 'htm';               # page types to search.
my $cache = 'sitesearch.dat';  # cache file.
my (@Results,$file,$title);

# optional boundaries for search area, to avoid
# searching on repeated text:
my $startstring = 'unique_ident_1';    
my $endstring = 'unique_ident_2';

chdir $dir or die "could not chdir to $dir $!";

# get all the relevant pages, strip out title, file name and
# searchable text, store in array of hashes:
while (<*.$ext>) {
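    open FILE, $_ or die "could not open $_ $!";
    read FILE, $file, -s(FILE);
    close FILE;
    # fall back to the file name if the page has no <title>:
    $title = $file =~ m#<title>(.*?)</title>#is ? $1 : $_;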
    $file =~ s/^.*$startstring/$startstring/s;  # delete these 2 lines if
    $file =~ s/$endstring.*$//s;                # you want to search the whole page
    $file =~ s/<[^>]*>/ /g;
    push @Results, {filename => $_, title => $title, text => $file};
}

# save the results in the cache file:
open SAVE, ">$cache" or die "could not open $cache $!";
print SAVE Dumper(\@Results);
close SAVE;

# check what has been saved in the cache file, and display it:
open RETRIEVE, $cache or die "could not open $cache $!";
my $data = do { local $/; <RETRIEVE> };
my @Retrieves = @{ my $VAR1; eval $data };
print header,h2('You have successfully cached the following pages:');
print $_->{title},br for @Retrieves;
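
The cache is simply the Data::Dumper text form of the array, and reading it back means evaluating that text as Perl. Here is a minimal sketch of the round trip, assuming a temporary file in place of sitesearch.dat and two made-up page records. Since the cache is executed by eval, this approach is only safe when nobody else can write to the cache file.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);

# Hypothetical sample records, shaped like the ones the script builds.
my @pages = (
    { filename => 'index.htm', title => 'Home',  text => ' Welcome ' },
    { filename => 'about.htm', title => 'About', text => ' Who we are ' },
);

# Write the structure out as Perl source ("$VAR1 = [ ... ];").
my ($fh, $cache) = tempfile(UNLINK => 1);
print {$fh} Dumper(\@pages);
close $fh or die "could not close $cache: $!";

# Slurp the whole file back in one read.
open my $in, '<', $cache or die "could not open $cache: $!";
my $data = do { local $/; <$in> };
close $in;

# Declare $VAR1 so the dumped assignment compiles under strict,
# then check that the eval really produced an array reference.
my $restored = do { my $VAR1; eval $data };
die "cache did not evaluate: $@" unless ref $restored eq 'ARRAY';

print "$_->{filename}: $_->{title}\n" for @$restored;
```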

Here is the search script, which should be called from a form on all the pages on the site:

#!/usr/local/bin/perl -w

# this script searches the given web directory for
# pages containing the given search term.   The cache-building
# script needs to be run first to create the cache file.

use lib 'your_file_path';
use strict;
use CGI qw(:standard);
use HTML::Template;

my $cache = '/u/www/virtual/w01-0785/sitesearch.dat'; # path to cache 
my $param = 'search';            # name field from input tag
                    # in calling web page.

# get the web site data from the cache:
open RETRIEVE, $cache or die "could not open $cache $!";
my $data = do { local $/; <RETRIEVE> };
my @Retrieves = @{ my $VAR1; eval $data };

# search the data and pull out the relevant filenames and page titles:
my $SearchTerm = param($param);    
my @Results;
my $text;
for (@Retrieves) {
    # \Q...\E makes the user's input match literally, not as a regex
    if ($_->{text} =~ /\Q$SearchTerm\E/i) {
       push @Results, {filename => $_->{filename}, title => $_->{title}};
    }
}

# print the search results:
my $results_tmpl = HTML::Template->new(filename => 'siteresults.tmpl');
$results_tmpl->param(
    search_term => $SearchTerm,
    results => [@Results],
);
print header;
print $results_tmpl->output();
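
Because the search term comes straight from the form, it helps to see how a literal match behaves when the term contains regex metacharacters. Here is a minimal sketch, assuming a hypothetical helper search_pages and two made-up cache records. An unquoted term such as 'c++' would be an invalid pattern, whereas \Q...\E turns it into a plain string match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the cached records whose text contains
# the search term, treating the term as a literal string.
sub search_pages {
    my ($term, @pages) = @_;
    return () unless defined $term && length $term;   # no term, no results
    return grep { $_->{text} =~ /\Q$term\E/i } @pages;
}

# Hypothetical cache records.
my @pages = (
    { filename => 'cpp.htm',  title => 'C++ notes',  text => 'Notes on C++ templates' },
    { filename => 'perl.htm', title => 'Perl notes', text => 'Notes on Perl regexes' },
);

my @hits = search_pages('c++', @pages);
print "$_->{title}\n" for @hits;   # prints "C++ notes"
```

Returning an empty list for an empty term is a design choice in this sketch, not something the original script does.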

There also has to be a file called siteresults.tmpl, which should look like the other html files on the site, but contain, at the appropriate point, this:

<!-- send back the sitesearch input box: -->
<form method="get" action="http://your_path/cgi-bin/">
    <input type="text" name="search" size="15" value="<TMPL_VAR SEARCH_TERM>">
</form>


<!-- print out the relevant links: -->
Here are the Pages that match your Search Terms:<br>
<TMPL_LOOP RESULTS>
    <a href="http://your_path/<TMPL_VAR FILENAME>"><TMPL_VAR TITLE></a><br>
</TMPL_LOOP>
