Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Specified Line Searching in file!

by Anonymous Monk
on Jul 29, 2004 at 10:52 UTC ( [id://378340]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How can I fast search specified lines by it's id(at the beginning of it) quckly(10seconds) and with memory usage less than 10mb(minimum memory usage)? File length about 4mb it's about 100000 lines! a BIG-BIG file? :o) What's the optimal algorithm?

Replies are listed 'Best First'.
Re: Specified Line Searching in file!
by beable (Friar) on Jul 29, 2004 at 11:56 UTC

    A 4MB file is not really very big, and 10 seconds is a long time. Here is an example program which creates a 4MB file with 100000 lines, and then searches the file linearly for a line by id. It finishes the search in well under 10 seconds, without any special trickery. I don't believe you need an "optimal algorithm" to meet your stated specifications. Even the standard Un*x tool grep can find a specified line in a file that big in less than 4/100ths of a second.

    You might want to research about premature+optimization, Premature optimization

    #!/usr/bin/perl use strict; use warnings; # make a 4MB file, 100000 lines == 40bytes/line print("creating file...\n"); create_file("testfile.txt", 40, 100000); # find a line in the file print("searching file...\n"); find_line("testfile.txt", "000099000"); print("searching file...\n"); find_line("testfile.txt", "000000100"); print("searching file...\n"); find_line("testfile.txt", "000099999"); sub create_file { my $filename = shift; my $bytes = shift; my $lines = shift; my @chars = map{chr($_)} (32..126); open FILE, ">$filename" or die "can't open $filename: $!"; for (my $i = 0; $i < $lines; $i++) { # make id, 10 bytes my $line = sprintf("%09d ", $i); # make rest of line for (my $j = 0; $j < $bytes-10; $j++) { $line .= $chars[rand(@chars)]; } print FILE "$line\n"; } close FILE or die "can't close $filename: $!"; } sub find_line { my $filename = shift; my $id = shift; my $start = time; open FILE, "<$filename" or die "can't open $filename: $!"; while (my $line = <FILE>) { if ($line =~ m/^$id/) { print "FOUND! $line"; last; } } my $end = time; my $time_taken = $end - $start; if ($time_taken < 10) { print "Finished in $time_taken seconds. ", "That's less than 10 seconds -- mission accomplished!\n"; } } __END__ Output: creating file... searching file... FOUND! 000099000 H_eHMl}XjMhkQysu1h?ON8y1d9loP= Finished in 1 seconds. That's less than 10 seconds -- mission accompli +shed! searching file... FOUND! 000000100 w6M:YgK1{*AmG;Lh5Cp,unCo\[aJQ` Finished in 0 seconds. That's less than 10 seconds -- mission accompli +shed! searching file... FOUND! 000099999 u#656~vOL^HMy{V_[D]@-2e}tH}Auo Finished in 0 seconds. That's less than 10 seconds -- mission accompli +shed!
Re: Specified Line Searching in file!
by bart (Canon) on Jul 29, 2004 at 11:44 UTC
    Under the assumption your "big file" doesn't change that often: create an index of seek/tell values per line.

    Sample code without error checking:

    open IN, "bigfile.txt"; open IDX, ">bigfile.idx"; binmode IDX; print IDX, pack "N", 0; while(<IN>) { print IDX pack "N", tell IN; }

    Now, you can quickly locate line 10000 (the very first line has index 0):

    my $lineno = 10000; open IN, "bigfile.txt"; open IDX, "bigfile.idx"; binmode IDX; seek IDX, 4 * $lineno, 0; sysread IDX, (my $buffer), 4; seek IN, unpack("N", $buffer), 0; my $line = <IN>;

    This will work as long as your file length of the big file fits into 32 bits.

Re: Specified Line Searching in file!
by ccn (Vicar) on Jul 29, 2004 at 11:07 UTC

    If you lines are sorted by it's ID in the file, then use binary search, overvise try:

    #!/usr/bin/perl -nw use constant ID => 'id_tag'; # ID what you searching use constant ID_LENGTH => length(ID); if (substr($_, 0, ID_LENGTH) eq ID ) { print "found:\n$_"; exit; }

    Update: usage: script.pl BIG_FILE.txt

    please tell how fast my code on your data, thanks

    Update: If your big file contains constant data, then you may create an index file. The format of that file is :

    # [id_tag] [byte offset] id_tag1 0 id_tag2 123 id_tag3 463 ...
    Thus you are able to search a tag in the index file and fetch searched line using seek
Re: Specified Line Searching in file!
by periapt (Hermit) on Jul 29, 2004 at 11:53 UTC
    you could tie the file to an array with Tie::File. This module lets you access the lines of a file on disc through an array. You could then use a binary search on the array to find the element. You should need a maximum of 17 or so comparisons for a 100,000 line file to find what you are looking for. That should keep it under 10 seconds even with the cost of disc access. (see the above post for references to a host of discussions on binary searches)

    Good luck

    PJ
    unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)
Re: Specified Line Searching in file!
by davidj (Priest) on Jul 29, 2004 at 11:05 UTC
    The fastest way? Use a database and index it by the id on which you want to run the search.

    davidj
Re: Specified Line Searching in file!
by b4e (Sexton) on Jul 29, 2004 at 11:14 UTC
Re: Specified Line Searching in file!
by davido (Cardinal) on Jul 29, 2004 at 16:19 UTC
    Four Megabytes isn't a BIG-BIG file, unless you're still using a Commodore-64 computer or a PC-XT. You should be able to loop your way through the file reading line by line, and disposing of the line if it's irrelevant before moving onto the next one, like this:

    my( $found, $lineno ); while ( <DATA> ) { if ( /keyID/ ) { $found = $_; $lineno = $.; print "Found $found at line $lineno.\n"; last; } }

    That ought to get through your file in pretty short order, and won't consume copious amounts of memory.


    Dave

      i agree with Dave here, i have search like this that takes a 30MB file running on a 500mhz G4 and it handles it in about 7 seconds with the system doing various other things. no need to get too crazy unless you're in a really performance intensive situation.
Re: Specified Line Searching in file!
by Anonymous Monk on Jul 31, 2004 at 21:34 UTC
    It's the author! Who knows, this program must use processes, it must read file by portions, not to open all at one time! But only 10 seconds!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://378340]
Approved by Corion
Front-paged by ccn
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (5)
As of 2024-04-23 09:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found