Search Engines for Dummies

by hostile17 (Novice)
on Feb 26, 2001 at 08:33 UTC ( [id://60815] )

hostile17 has asked for the wisdom of the Perl Monks concerning the following question:

I've created my own search engine, for a relatively small set of documents -- less than a hundred -- on my site.

I got the code basics from an old article on webmonkey, at http://hotwired.lycos.com/webmonkey/97/16/index2a.html which you don't have to read to understand my questions, by the way.

Here they are:

  • The author of the article puts his index into a database -- I didn't know how to do that, so I just have a flat text file index in the form:

        apple,0,2,5,8
        banana,2,3,4,5,8
        ...
        zebra,4,5,6,7,8

    where the numbers refer to an array of documents.
  • Is there something inherently better about using a database, for a job of this size? I've got a 400Kb index for a few megs of HTML files.
  • What's the DBM/Berkeley DB format (this is what he used in the article) and how does it compare to SQL? Is the Berkeley DB something that every self-respecting UNIX server should have, or is it likely to be optional?
  • Where can I go to start learning how to make Perl talk to SQL, if it comes to that?
  • When my script runs, it has to open the 400Kb index file and search it, and it does this for each word. Should I be opening and closing the file every time or somehow rewinding it?
  • Is there a RAM issue with this approach? A speed issue?
  • Where do all the old $_'s go?*
  • The index just uses numbers to refer to the documents, as mentioned above -- in order to display titles and URLs I have a rather large list of URLs and another one for document titles, and I just dump them in as arrays:

        @filenames = qw(
            aaaaa.html
            aaaab.html
            ...
            zzzzz.html
        );

    This is probably counterproductive when running the script too, isn't it? Should I maybe put them into their own text files and just read them in every time? Would that be a better or a worse approach than the one I'm using?

TIA,
Spike
* not a serious question, but I suppose I'm right in assuming Perl never holds any part of my 400Kb file in memory at all, except for the current line in $_?

Replies are listed 'Best First'.
(dws)Re: Search Engines for Dummies
by dws (Chancellor) on Feb 26, 2001 at 09:48 UTC
    The benefit of using a database, either an RDBMS or DBM, is that the question "which files use this word" can be answered in a relatively small number of disk reads compared with scanning a large flat file.

    With a flat file, you have to read until you find the word (or determine it isn't there). On average, you'll have to read half of the file. Worst case, the entire file. Assuming a 400Kb file is read in 8K chunks (by underlying OS buffering), that's 25 reads on average; 50 worst case.

    With a database, you'll incur a small number of disk reads to traverse the index, and then a single read to pick up the rest of the record (file numbers, in your example).

    If the search expression contains a number of words, at some point it might be faster to scan a flat file (assuming you're going to search for all terms in a single pass). You may need to determine the break-even point by experimenting.

    Another issue to consider in choosing between flatfile and database-based implementations is ease of update, particularly if you're going to attempt incremental updates. It's going to be faster to make a single pass over a flatfile (writing an updated flatfile to replace it) than it is to scan the database record-by-record.

    With that as background: let's consider your questions.

    Q: Is there something inherently better about using a database?
    A: "It depends", but the answer is probably YES.

    Q: What's the DBM format, and how does it compare to SQL?
    A: Both separate indexes from record data, allowing a "does this key exist" search to be done with a small number of disk accesses.
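
    As a rough illustration of the DBM side, here is a minimal sketch of tying a hash to a Berkeley DB file with DB_File; the file name and the comma-separated value format are just assumptions matching the flat-file index above, not anything prescribed by the module.

        #!/usr/bin/perl
        use strict;
        use Fcntl;
        use DB_File;

        # Tie a hash to a Berkeley DB file: lookups go through the DB's own
        # index instead of a linear scan of a flat text file.
        tie my %index, 'DB_File', 'wordindex.db', O_RDWR | O_CREAT, 0644, $DB_HASH
            or die "Cannot open wordindex.db: $!";

        # Store each word's document-number list as a comma-separated string.
        $index{'apple'} = '0,2,5,8';

        # "Does this key exist?" costs a few disk accesses, not half the file.
        if (exists $index{'apple'}) {
            my @docs = split /,/, $index{'apple'};
            print "apple appears in documents: @docs\n";
        }

        untie %index;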

    Q: Is DBM common?
    A: Yes, but some (older, poor) implementations limit the record size to 1K. Check this on your system.

    Q: Where can I go to learn about Perl talking to SQL?
    A: There are a number of good books and on-line articles that talk about using Perl's DBI interface (DBI.pm) to talk to databases. Take a look at O'Reilly's new ONLAMP site, or search for "Perl DBD" on Google (http://www.google.com/).
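
    For a feel of what that looks like, here is a minimal DBI sketch; the DSN, credentials, table, and column names are all made up for the example.

        #!/usr/bin/perl
        use strict;
        use DBI;

        # Hypothetical MySQL connection and schema.
        my $dbh = DBI->connect('dbi:mysql:database=siteindex', 'user', 'password',
                               { RaiseError => 1 });

        # Which documents contain a given word?
        my $sth = $dbh->prepare('SELECT doc_id FROM word_index WHERE word = ?');
        $sth->execute('apple');

        while (my ($doc_id) = $sth->fetchrow_array) {
            print "found in document $doc_id\n";
        }

        $dbh->disconnect;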

    Q: Should I be opening my 400Kb file for every search term?
    A: No. Write some extra code so that you can do the search in one pass.
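
    One way the single pass might look, assuming the flat index format shown in the question (the file name and the way the query terms arrive are illustrative):

        #!/usr/bin/perl
        use strict;

        # Hypothetical search terms (in practice, from your CGI parameters).
        my @terms  = qw(apple banana);
        my %wanted = map { lc($_) => 1 } @terms;

        my %hits;    # word => [ document numbers ]

        # One pass over the index: open it once, read it line by line.
        open my $idx, '<', 'index.txt' or die "Cannot open index.txt: $!";
        while (my $line = <$idx>) {
            chomp $line;
            my ($word, @docs) = split /,/, $line;
            next unless $wanted{ lc $word };
            $hits{ lc $word } = \@docs;
            last if keys %hits == keys %wanted;    # every term found; stop early
        }
        close $idx;

        for my $term (@terms) {
            my $docs = $hits{ lc $term } || [];
            print "$term: @$docs\n";
        }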

    Q: What about RAM?
    A: If you're scanning the file line by line, RAM shouldn't be an issue for you. Perl will buffer a disk page at a time, from which it extracts lines (which you read into $_, which gets overwritten every time you read a line.)

    Q: What about my data structures?
    A: On first blush, they seem O.K. Having URLs and document titles available as arrays isn't going to significantly increase your script's startup time, and having your indexing process emit a .pl file you can require is an elegant way to load those arrays.

    Let's pretend you also asked...

    Q: Is there anything else I can do to save space or time?
    A: Consider using a set of "stop words" -- words that are so common that they're liable to occur in nearly every document (e.g., "a an are in it is the this") and don't index those words. In the unlikely event that someone searches only on one of those terms, you can either verbally abuse them or do a linear search through your documents. You can probably cut your index size in half.
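
    A sketch of how the indexer might skip stop words; the word list and the data structure are only illustrative.

        #!/usr/bin/perl
        use strict;

        # Words too common to be worth indexing; tune the list to your documents.
        my %stop = map { $_ => 1 }
            qw(a an and are as at be by for from in is it of on or the this to was were with);

        # %index maps word => { document number => 1 }.
        my %index;

        sub add_word {
            my ($word, $doc_number) = @_;
            $word = lc $word;
            return if $stop{$word};              # don't index stop words at all
            $index{$word}{$doc_number} = 1;
        }

        # Example: record that document 3 contains "The" and "apple".
        add_word($_, 3) for qw(The apple);
        print join(',', sort keys %index), "\n";   # prints "apple"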

    P.S. Good first post. Welcome to the Monastery.

      Thank you both very much for your help.
      having your indexing process emit a .pl file you can require is an elegant way to load those arrays.

      Thank you, that's smart, I can add that to the indexer easily. That file just reads:
      @myarrayoffilenames= qw(

      the list of filenames

      );
      right? It doesn't need to be a fully-fledged file with a #! line and everything else?
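
      (For what it's worth, a file pulled in with require just has to be valid Perl whose last statement evaluates to true; no #! line is needed. A minimal emitted file, with illustrative names, could be:

          # filenames.pl -- written out by the indexer; no #! line required.
          @filenames = qw(
              aaaaa.html
              aaaab.html
              zzzzz.html
          );

          1;    # require() insists that the file return a true value

      and the search script loads it with:

          require 'filenames.pl';
          print "first document: $filenames[0]\n";
          # under "use strict" you'd declare the array with "our" or "use vars"

      )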

      Consider using a set of "stop words" -- words that are so common that they're liable to occur in nearly every document (e.g., "a an are in it is the this")

      I've already limited the index to words of four letters or more, with some exceptions, but you're quite right, there are lots of words which are in every file so I can cut major chunks out of the index, by hand, right now!
        I've already limited the index to words of four letters or more,

        Well, there goes "sex" :)

        Seriously, a four-or-more letter rule isn't very good. You risk dropping significant two or three letter terms (e.g., AI, XML), while still cluttering up the index with common words (e.g., were, which).

        Try this simple experiment. Sort your index by the length of each line. Terms that appear in all or nearly all of the documents will rise to the top. Then look at the first 100 or so words. If they're not "significant" (and here you'll have to make the call on what's significant to your domain), then add them to a list of words that the indexer will ignore.
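
        A quick way to do that sort, assuming the index file is called index.txt:

            #!/usr/bin/perl
            use strict;
            # Print index lines longest-first, so the most widespread words rise
            # to the top; pipe the output through "head -100" to see the first hundred.
            open my $idx, '<', 'index.txt' or die "Cannot open index.txt: $!";
            print sort { length($b) <=> length($a) } <$idx>;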

      Assuming a 400Kb file is read in 8K chunks (by underlying OS buffering), that's 25 reads on average; 50 worst case.

      I'm also assuming that you are reading the file from the beginning. Why not try doing a binary search and start in the middle? I'm not sure of the big-O notation for this, but it would reduce the number of reads if you stick with the flat file.

      Update: You could also divide up the file by the frequency of the letters that start words in the language of choice (English?). Or, start the flat file with a listing of how many words start with a, b, ... then compare that to the search word, read that many lines into the file, and maybe do a binary search if there are a lot of words starting with a particular letter.
        You could have the indexer remember the starting position of each letter in the alphabet, and emit a separate require-ready file of these positions. That would indeed cut the search down. Or you could sort the index by word length, and remember those starting positions.
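
        A minimal sketch of the first-letter idea, assuming a sorted index.txt; the suggestion above is to emit the offsets to a require-able file, but here they are simply kept in a hash to keep the example short.

            #!/usr/bin/perl
            use strict;

            # Pass 1 (indexing time): remember the byte offset where each first letter begins.
            my %offset;
            open my $idx, '<', 'index.txt' or die "Cannot open index.txt: $!";
            my $pos = tell $idx;
            while (my $line = <$idx>) {
                my $letter = lc substr($line, 0, 1);
                $offset{$letter} = $pos unless exists $offset{$letter};
                $pos = tell $idx;
            }
            close $idx;

            # Search time: seek straight to that letter's block and scan only within it.
            sub lookup {
                my ($word) = @_;
                my $letter = lc substr($word, 0, 1);
                return unless exists $offset{$letter};

                open my $fh, '<', 'index.txt' or die "Cannot open index.txt: $!";
                seek $fh, $offset{$letter}, 0;
                while (my $line = <$fh>) {
                    chomp $line;
                    my ($w, @docs) = split /,/, $line;
                    last if lc substr($w, 0, 1) ne $letter;   # left this letter's block
                    if (lc $w eq lc $word) { close $fh; return @docs }
                }
                close $fh;
                return;
            }

            print join(',', lookup('banana')), "\n";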

        Note that in either case, you're building an index, and have set one foot on the slippery slope towards using DBM or SQL. Why not go all the way?

(jeffa) Re: Search Engines for Dummies
by jeffa (Bishop) on Feb 26, 2001 at 09:35 UTC

    If a particular site will need to use a database anyway, whether or not a searching ability is needed, then storing the site index in a database table (or two) would be preferable (provided the additional access does not hamper regular traffic).

    In your case, however, I would do as the man says and use a DBM file. Because the site index can be stored in one table (without TOO much redundancy), a database is a bit of overkill -- especially for a small site. In your case, with a 'flat ASCII text 400K file', using a DBM file as a replacement might be a little overkill as well -- but I say take the challenge and learn something new. Then start researching relational databases.

    What is a Berkeley DB file? Well, the easy answer is that, generally speaking, a Berkeley DB file is a more efficient way of storing data than a simple ASCII 'flat file.' A Google search produced this online reference. Give it a look.

    As an added bonus, I found this dusty archived node in the monastery: DBI vs MLDBM/GDBM_File, etc.. You will find a lot of good information from the many posts in that thread concerning which type of data storage is best (which always depends on the situation).

    Minor correction - Perl does not speak to SQL - Perl speaks to a database through a driver, and the language that Perl uses is SQL.

    There will always be memory and speed issues with whichever approach you choose to store your data. A 400K file is a speck of dust to modern PCs -- so you have nothing whatsoever to worry about, except the concept of scalability: what happens when that 400K file becomes 400 gigs? That flat file just is not going to be able to survive the browser timeout.

    Last note - this is NOT an easy task on the whole. Just remember that, because you are probably fixing to take a long journey, and it will get rough at times. Here is another thread for you to ponder: Searching module

    Jeff

    R-R-R--R-R-R--R-R-R--R-R-R--R-R-R--
    L-L--L-L--L-L--L-L--L-L--L-L--L-L--
    
    
Re: Search Engines for Dummies
by jeorgen (Pilgrim) on Feb 26, 2001 at 20:20 UTC
    If you want a ready-made solution I can heartily recommend DBIx::FullTextSearch, which uses MySQL as its backend. It's great, it even decodes:
    apples +bananas -"blue oranges"
    automatically, if you want it to. It can be done as simply as this:
    my @doc_ids = $fts->search(qq{apples +bananas -"blue oranges"});
    It's one of these "just add water" modules at CPAN.
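
    In case it helps, here is roughly what setting one of these indexes up might look like. The DSN, index name, and the frontend/backend options are assumptions based on my recollection of the module's documented interface (create, index_document, contains), so check them against the current docs before relying on them.

        #!/usr/bin/perl
        use strict;
        use DBI;
        use DBIx::FullTextSearch;

        # Hypothetical MySQL connection.
        my $dbh = DBI->connect('dbi:mysql:database=siteindex', 'user', 'password',
                               { RaiseError => 1 });

        # Create the index once; the module builds its tables in the MySQL database.
        my $fts = DBIx::FullTextSearch->create($dbh, 'site_fts',
            frontend => 'string', backend => 'blob');

        # Feed it each document's text under a name of your choosing.
        open my $fh, '<', 'aaaaa.html' or die "Cannot open aaaaa.html: $!";
        my $content = do { local $/; <$fh> };
        close $fh;
        $fts->index_document('aaaaa.html', $content);

        # Later (reopen with DBIx::FullTextSearch->open($dbh, 'site_fts')),
        # ask which documents contain the given words.
        my @docs = $fts->contains('apples', 'bananas');
        print "@docs\n";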

    Another great search engine in perl is ksearch, and it does not need a database (it uses any DBM version you have access to).

    /jeorgen

Re: Search Engines for Dummies
by wardk (Deacon) on Feb 26, 2001 at 20:20 UTC

    I am just finishing up a similar site: end users (and we are talking extreme non-techies) post documents to a site (about 130 docs so far) that has search capabilities.

    The initial plan was to create a database to store everything about the site, including content (in fact the end users did this in Access and it got out of control; they cannot even maintain it, so IT has stepped in to resolve the mess). It turns out the site may be a throw-away within a year, so I tossed the database idea (switching to Oracle was overkill, $$-wise, and no way was I going to entertain leaving it in ASP/Access) and came up with the following. It sounds very similar to what you are working on...

    Here, in a nutshell, is how I implemented it (using only standard perl and CGI.pm):

    • end users build their own docs from well-commented templates
    • all pertinent doc data is in META tags -- stuff that helps build a categorized index (division, jobclass, keywords, among others). Users use a custom CGI that assists in uploading to the unix box (FTP is too complex)
    • a perl job "compiles" the site index from these META tags, creating a base file that produces the index via CGI. This index file is essentially a CSV, with the keywords in a column derived from the META tag keyword=
    • I use a simple grep from the search widget to create a categorized doc listing of the subset of found documents (a rough sketch of this idea follows the list)
    • created some simple tools to generate the minimum required set of META tags (they just paste them into the doc), as well as an on-demand index generator and some other self-serve tools to maintain the site
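
    A rough sketch of the grep idea mentioned above; the CSV layout, file name, and field order here are entirely made up for illustration.

        #!/usr/bin/perl
        use strict;

        # Hypothetical index layout, one line per document:
        #   division,jobclass,space-separated keywords,title,filename
        my $query = 'benefits';    # whatever the user typed into the search widget

        open my $idx, '<', 'siteindex.csv' or die "Cannot open siteindex.csv: $!";
        my @found = grep { /\Q$query\E/i } <$idx>;
        close $idx;

        # Turn the matching lines into a categorized listing.
        for my $line (@found) {
            chomp $line;
            my ($division, $jobclass, $keywords, $title, $file) = split /,/, $line;
            print qq{<a href="$file">$title</a> ($division / $jobclass)\n};
        }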

    The end users got their first test drive last thursday and it did just what they needed without being overly complex.

    Anyway, this method worked well for us, and went from concept to reality in just a few weeks, despite being a "side project".

    ...not sure whether something like this could work for you, but since you seem to be solving the same problem... and of course TIMTOWTDI!

    have fun!
