Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re^2: Searching text files

by runrig (Abbot)
on Sep 14, 2006 at 21:26 UTC ( [id://573012]=note: print w/replies, xml ) Need Help??


in reply to Re: Searching text files
in thread Searching text files

...doing binary searches.

And guess what, someone already did all the work for you (and I already mentioned it above) ;-)

Replies are listed 'Best First'.
Re^3: Searching text files
by xach (Initiate) on Sep 14, 2006 at 21:50 UTC
    Seems almost like overkill when it can be done in just a few lines. The function below searches three million records very quickly.
    # Usage: record_search($search_string, $filename, 11); # Returns the record number if the record is found, undef otherwise. # Searched file must be sorted. sub record_search { my ($key, $filename, $recordsize) = @_; my $fh = IO::File->new($filename, 'r') or croak("Could not open `$filename': $!"); my $hi = ($fh->stat)[7] / $recordsize; my $lo = 0; while ($lo < $hi) { my $mid = int(($lo + $hi) / 2); $fh->seek($mid * $recordsize, 0); my $target = $fh->getline(); chomp($target); if ($target < $key) { $lo = $mid + 1; } elsif ($target == $key) { return $mid; } else { $hi = $mid; } } }
    Update Forgot to mention: even though this function opens the file every time, it can successfully search 3 million records about 2000 times per second on a 3Ghz Pentium 4. I can't see much justification for anything more complicated. Threads? Bit strings? Crazy! :)
      Sometimes being able to do it in "just a few lines" isn't the point (take Data::Page for example). It's having to always rewrite, correctly, those lines every time you want to solve this particular problem. In this case, the module makes for a more flexible solution, and it's a "few lines" (about 12 actually) that I don't have to rewrite. The downside is that it is about 20% slower, but that's a non-issue when response to any single query is imperceptible.
      I can't see much justification for anything more complicated.


      Optionally searching for strings instead of numbers?
      Not having to have a fixed record size or not having to know what the recordsize is? (note: there is a set recordsize function in File::SortedSeek which claims to improve performance, but I could not see any difference in speed from using it)

      Anyway, here's your function using F:SS instead:

      use File::SortedSeek; sub record_search { my ($key, $filename, $recordsize) = @_; my $fh = IO::File->new($filename, 'r') or croak("Could not open $filename: $!"); # I don't see any difference in speed with the next line # File::SortedSeek::set_line_length($recordsize); numeric( $fh, $key); return File::SortedSeek::was_exact(); }

      Update: Looking at the answers below, I learned about Search::Dict, which would work just about as well, and has been in core perl since at least 5.6.1, but still supports my point about saving just a few lines.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://573012]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (1)
As of 2024-04-19 00:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found