Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Binary searching with Perl.

by DigitalKitty (Parson)
on May 24, 2003 at 08:55 UTC ( [id://260548]=perlquestion: print w/replies, xml ) Need Help??

DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

Two of the most common search algorithms are the:

linear search and the binary search

Each has it's advantages as well as it's 'faults'. In a linear search, each element of the data structure is examined. Needless to say, in a small array ( for example ) this isn't a problem since there are only a relatively small number of elements to analyze. If the array is large, a binary search would be most effective. In a binary search, the array needs to be sorted before a search is initiated. Once that has been accomplished, there exists a tremendous gain in performance ( please see below ). By establishing a 'mid-point' in the array, we are able to eliminate 1/2 of the array ( the lower or upper portion ) when comparing our key ( the value we want to find ) to it's position in the array and it's relation to the mid-point. With each successive comparison, we divide the array in 1/2 thereby establishing a new mid-point. Then it's a simple matter of establishing whether or not our key is higher or lower than it.

An array with 2_048 elements could be successfully searched using ( at most ) only 11 comparisons ( 2**11 == 2048 ). When compared to the time it would probably take to search the same array using a linear search, the benefits should be readily apparent. The formula for determining the maximum number of comparisons needed is the first power of 2 greater than the number of elements in the array.

#!/usr/bin/perl -w use strict; my @array; my $low = 0; my $mid; my $found_key = 0; my $num; my $key; my $index; @array = ( 5, 10, 30, 4, -3, 18, 101, 2001, 46, 23 ); @array = sort { $a <=> $b } @array; my $high = $#array; print "Please enter key to search for: "; chomp( $key = <STDIN> ); while( ( $low <= $high ) && !$found_key ) { $mid = ( $low + $high ) / 2; if( $key == $array[$mid] ) { $found_key = 1; $index = int( $mid ); } elsif( $key < $array[$mid] ) { $high = $mid - 1; } else { $low = $mid + 1; } } if( $found_key ) { print "$key is at position: $index\n"; } else { print "Sorry. I could not find: $key"; }


This is my first attempt at implementing a binary search in perl. If you could help me improve the code, I would be most appreciative.

Thanks,
-Katie.

Replies are listed 'Best First'.
Re: Binary searching with Perl.
by gmax (Abbot) on May 24, 2003 at 09:14 UTC

    There is a mistake in your algorithm. Your $mid position is being set incorrectly.

    $mid = ( $low + $high ) / 2;

    should become

    $mid = int(( $low + $high ) / 2);

    See also Binary search algorithm for more examples and discussion about this matter

    update Fixed my own mistake. The error was there, but in the wrong place. The original algorithm does not catch the highest index item.

    update (2) Jon Bentley in Programming Pearls advised that binary search can be tricky, and even experienced programmers can fail to get it properly. He is right. I have corrected this script three times, every time in the wrong way. Finally, when I monitored the three variables, I got the (hopefully) right path. :)

    _ _ _ _ (_|| | |(_|>< _|
      > even experienced programmers can fail to get it properly.

      There's always the "CPAN module" option. ;-)

      Search::Binary

      Michele.
Re: Binary searching with Perl.
by Dominus (Parson) on May 25, 2003 at 04:47 UTC
    As more food for thought, I'll present an application of binary search where it doesn't make sense to 'use a hash'. Suppose you're looking up records in a large sorted file. The following search subroutine will search the file efficiently, using a binary search. It gets two arguments: $fh is a filehandle open to the file you want to search, and $cmp is a comparison function. search will call $cmp repeatedly with various records from the file; $cmp should return zero if its argument is the desired record, negative if its argument is earlier than the desired record, and positive if its argument is later than the desired record.

    search then returns the first record in the file that is not earlier than the desired record, and leaves the filehandle positioned at that record. If there is no such record, it returns undef.

    sub search { my ($fh, $cmp) = @_; my ($lo, $hi) = (0, -s $fh); while (1) { my $mid = int(($lo + $hi)/2); if ($mid) { seek $fh, $mid-1, SEEK_SET; my $junk = <$fh>; } else { seek $fh , $mid, SEEK_SET; } my $start = tell $fh; my $rec = <$fh>; return unless defined $rec; chomp($rec); if ($hi == $lo) { seek $fh, $start, SEEK_SET; return $rec }; local $_ = $rec; if ($cmp->($rec) < 0) { $lo = $mid+1 } else { $hi = $mid } } }
    For example, suppose you want to search the dictionary file for words that begin with foo. You might:
    open DICT, "<", "/usr/dict/Web2" or die...; if (search(\*DICT, sub { lc $_ cmp "foo" })) { # DICT is now positioned at the first word beginning with 'foo' while (<DICT>) { last unless /^foo/i; print; } }
    The search call moves the filehandle quickly to the first record that begins with foo, and then the while loop prints out the words, quitting after the last one. Using search is quicker than scanning the file from the beginning looking for the first foo word, unless the file is very small.

    The behavior is similar to that of the standard Search::Dict module, but the code is much simpler, and I think my interface is more flexible. I wrote this as an example for my new class Strategies for Lightweight Databases. Share and enjoy!

    --
    Mark Dominus
    Perl Paraphernalia

      My inclination would still be to use DB_File or a relative to access a BTREE. One of the methods in the API is exactly equivalent to your search interface, but the implementation should be more performance-efficient. (And using their API takes less code.)
        Says tilly:
        My inclination would still be to use DB_File or a relative to access a BTREE.
        That's a good solution in some cases, but not always. The world is full of sorted text files, and it doesn't always make sense to manufacture a much larger secondary file, which must be maintained in parallel with the text file, just to perform searches on the data.

        --
        Mark Dominus
        Perl Paraphernalia

Re: Binary searching with Perl.
by TVSET (Chaplain) on May 24, 2003 at 12:49 UTC
Re: Binary searching with Perl.
by Anonymous Monk on May 24, 2003 at 11:23 UTC

    Noting that this is a learning exercise and "use a hash" advice is unhelpful.

    A binary search only becomes more efficient than a linear search, if the data is already sorted, or if you can sort once and amortise the cost of that sort over many, many searches.

    If the data being search is static within the context of the program, then you should ensure that the program receives that data pre-sorted. If the data is dynamically aquired (gets added to) during the life of the program, then you should maintain the data in it sorted state.

    Maintaining dynamic datasets in an order state means using either a binary search and insert on a linked list or maintaining the data as a binary tree

      Maintaining dynamic datasets in an order state means using either a binary search and insert on a linked list

      Binary search on a linked list? How would you do that efficiently?

      Abigail

        s/binary/linear/

      I believe it is beneficial for a learner to see the other ways of thinking. While my remark was not focused on helping the inquirer to master binary searching it was aimed at helping her to get a broader view of the subject. I don't see why do you describe it as unhelpful.

      The other points of you post are quite valid - but I did not write anything against them. I would just add to it that can use a tied hash to accomplish the goal of maintaining data in a binary tree, by using for example BerkeleyDB::Btree.

A bit OT remark (was Re: Binary searching with Perl.)
by zby (Vicar) on May 24, 2003 at 09:30 UTC
    If you are using the array just as a set of values, i.e. you need three operations: adding a value to the set, deleting a value and checking if the value is in the set, you can use a hash instead of a sorted array. En equivalent of the binary search algorithm is inside the implementation of hashes - so you can use it for free, with no effort.
      A sorted array with binary search allows you to do things efficiently that you cannot do efficient with hashes. For instance, finding the rank of the element you are searching for, if the searched element isn't in the set, find the largest element in the set that is smaller (nearest neighbour), or doing a range query (given two values, give all elements of the set between those values).

      Hashes are fast to do member queries with, and to do updates, but hashes are very limited in what you can do with them.

      Abigail

Re: Binary searching with Perl.
by aquarium (Curate) on May 24, 2003 at 14:29 UTC
    your considerations are very hypothetical (linear vs binary) + there are modified linear and binary (+other) searches, depending on your need. be careful though of making such grand statements as saying binary is so much faster for large datasets. yes, binary is generally faster. one very serious consideration to take into account (considering the high execution speed of programs on most recent pcs) is data access times from media. this involves many variables including block factors and such. also once you get into "huge" datasets than it's a matter of course to use a decent database engine, which has indexes and optimizers etc. anyway, good luck with your binary search algorithm. btw, it's easier (and faster) to traverse a binary linked list.
Re: Binary searching with Perl.
by Anonymous Monk on May 24, 2003 at 15:40 UTC

    Mind you, you could avoid eleven comparisons if you use the hash, Katie.

    Make the data you want to search the key of a hash, and find things in constant time. Of course, it only works if you're looking for exact matches, not for aproximate matches, misspellings, or values which fall within a range.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://260548]
Approved by gmax
Front-paged by larsen
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (7)
As of 2024-04-23 11:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found