Search Efficiency

by treebeard (Acolyte)
on Jul 11, 2001 at 18:43 UTC ( [id://95699] )

treebeard has asked for the wisdom of the Perl Monks concerning the following question:

OK, I have searched "Super Search" for an answer to my issue, and this is what I have gleaned.

I have a process that looks through an extracted pipe-delimited flat file via a while (<FILEHANDLE>) loop for a user id. When I find an id, I split the line on the pipe, examine the data, and, based on the values, move it into a specific keyed hash to be processed in another part of the program.

My issue is that the flat files are enormous (well, for me) at 250+ MB. Is there a way I can tell Perl not to rescan the file each time the while loop goes through?

    #$userid already populated
    open(FILE, "FILE.txt") or die "Can't open FILE.txt: $!";
    while (<FILE>) {
        if (/$userid/o) {
            chomp;
            # below is an example; there are actually 15
            # variables coming out of this split.
            ($inv, $date, $amt) = split(/[|]/);
            # code to build hash.
        }
    }
    # processing of hashes

Is there any way to tell Perl, "hey, you have already gone through x lines, don't start over at the beginning of the file when you loop back through"?

Replies are listed 'Best First'.
Re: Search Efficiency
by bikeNomad (Priest) on Jul 11, 2001 at 18:53 UTC
    You can use tell to save the current position of the file handle, and then use seek to return to it later.
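
    For instance, a minimal sketch of that approach (the id list and loop structure are my own illustration, not treebeard's exact code; the filename is from the original post):

        use strict;

        # hypothetical list of ids to look up, in file order
        my @ids_to_find = qw(u1001 u1002);

        open(FILE, "FILE.txt") or die "Can't open FILE.txt: $!";
        my $pos = 0;                   # saved byte offset, starts at the top

        for my $userid (@ids_to_find) {
            seek(FILE, $pos, 0);       # jump back to where we left off
            while (<FILE>) {
                next unless /\Q$userid\E/;
                $pos = tell(FILE);     # offset just past the matching line
                # ... split and process the record here ...
                last;
            }
        }
        close FILE;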

    Look at DBD::CSV for a solution that will keep you from having to scan the files yourself. However, I don't know how efficient it is in terms of re-scanning.
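
    For example, a rough sketch along the lines of the DBD::CSV documentation (the table mapping and column names below are made up; the real file has 15 fields):

        use strict;
        use DBI;

        my $userid = 'u1001';          # assumed already populated

        # f_dir and csv_sep_char are standard DBD::CSV attributes
        my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_sep_char=|",
                               undef, undef, { RaiseError => 1 });
        $dbh->{csv_tables}{billing} = {
            file      => "FILE.txt",
            col_names => [qw(userid inv inv_date amt)],
        };

        my $sth = $dbh->prepare(
            "SELECT inv, inv_date, amt FROM billing WHERE userid = ?");
        $sth->execute($userid);
        while (my ($inv, $date, $amt) = $sth->fetchrow_array) {
            # build the keyed hash here
        }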

    Also, the BerkeleyDB module should be able to handle flat files efficiently if you use its DB_RECNO type and specify -Delim => '|'.
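
    As a related illustration, here is a minimal sketch using the companion DB_File module's RECNO mode, which ties each record of a flat file to an array element (the module choice and settings here are my assumption, not a tested recipe):

        use strict;
        use DB_File;
        use Fcntl;

        $DB_RECNO->{bval} = "\n";      # record delimiter: one line per element

        my @records;
        tie @records, 'DB_File', 'FILE.txt', O_RDONLY, 0644, $DB_RECNO
            or die "tie failed: $!";

        my $line = $records[41];       # random access to line 42, no full scan
        untie @records;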

Re: Search Efficiency
by Cirollo (Friar) on Jul 11, 2001 at 19:00 UTC
    Perl doesn't read in the entire file each time through that loop. Each time through the loop, Perl just reads the next line from the still-opened file.

    A few tidbits that might come in handy if you're doing anything more complex:

    If you need to know the current line number, look at the $. variable (dollar dot).

    The seek function can be used to go to a specific part of a file without reading the whole thing. The tell function returns the current file position in bytes, based at 0.

      Cirollo is right. And you want to be careful about where your hash is defined. If it's created inside the while loop, it will be regenerated after every match of $userid. If it's created outside the while loop, you can add info to it and do your processing after the while loop ends. I suspect this might be the real problem behind your question:

      Is there any way to tell Perl, "hey, you have already gone through x lines, don't start over at the beginning of the file when you loop back through"?

      by the way, always use strict and perl -w to keep your code on its very best behaviour.
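
      for example, a minimal sketch of the scoping point (filename and fields borrowed from the original post; $userid assumed populated):

          use strict;
          use warnings;

          my $userid = 'u1001';        # assumed already populated
          my %by_user;                 # declared once, outside the loop

          open(FILE, "FILE.txt") or die "Can't open FILE.txt: $!";
          while (<FILE>) {
              next unless /\Q$userid\E/;
              chomp;
              my ($inv, $date, $amt) = split /[|]/;
              push @{ $by_user{$userid} }, [ $inv, $date, $amt ];
          }
          close FILE;

          # %by_user still holds every record here, after the loop ends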

      ~Particle

Re: Search Efficiency
by Abigail (Deacon) on Jul 11, 2001 at 19:47 UTC
    I fail to see your problem. Perl only goes as many times through a file as you tell it to. In the code you give, you scan the file once, and only once. And unless you seek back to the beginning and enter the loop again, Perl isn't going through the file again.

    I get the feeling you didn't give us all the code, but if you do, I expect the answer to be "don't do that then".

    Note that if you want Perl to stop reading the file in its current iteration, look at the "last" statement.

    -- Abigail

Re: Search Efficiency
by VSarkiss (Monsignor) on Jul 11, 2001 at 19:02 UTC

    If I'm reading right, you're doing a linear search through the text file each time looking for a specific $userid.

    If the file is sorted, and you're looking for values in the same order, then you can use seek and tell to move around in the file. Something like this:

    my $last_pos = 0;                # start at the beginning
    # later....
    seek(FILE, $last_pos, 0);        # go to where I was
    while (<FILE>) {                 # start reading line at a time
        if (/$userid/) {             # sure about the /o, BTW?
            $last_pos = tell(FILE);  # remember this spot
            # ... other stuff
        }
    }
    You can see why order is important: I'm assuming you can find the next record by advancing in the file.

    If the requests are in random order, then it may be worthwhile to build your own index (we used to call these ISAM files back in the day ;-). In the beginning of your code, scan the file once, build a hash of positions. Then you can seek to any particular record. Something like this:

    my %index;
    while (<FILE>) {
        my ($userid) = split(...);   # it's in there somewhere, right?
        $index{$userid} = tell(FILE);
    }
    # later...
    seek(FILE, $index{$userid}, 0);
    $_ = <FILE>;
    ($inv, $date, $amt) = split(...);
    You'll have to fiddle with these to make sure you're seeking to the right spot in the file. (Note that tell(FILE) after reading a line returns the position of the *next* record, so you may want to record the position before reading each line.)

    Now the big suggestion: forget everything I just wrote! Get yourself a relational database and get rid of all this seek/tell stuff. If you've got that much data and you're doing random reads, there's just no point in writing your own ISAM stuff.

    HTH

      First, I had read that the /o would help with respect to Perl evaluating the variable; I got the idea from the O'Reilly books... was I correct in my assumption?

      Second, we are extracting a lot of data from an existing billing system, formatting it using Perl, and building a huge flat file that someone (not me :)) is going to load into SQL Server. (Not my idea.)

      We are using Perl DBI scripts to extract the data from Oracle, then we process them using scripts based on the one above. It is not elegant, but we are not working with much time. Therefore our approach is:

      1. Build tables

      2. Extract tables to file system (solaris)

      3. Format files

      4. Send the buggers out.

      I know that this is getting away from my original question, but should we have merged the DBI call scripts and the format scripts?

        Well, the /o says "the variable inside this regular expression will always have the same value". Believing that to be true, Perl will not recompile the RE. If you change the value of the variable, the RE won't change. This may be good or bad depending on what you're doing: good if you really don't change it, because you'll save time on RE recompiles; bad if you do change it, because your matches will appear to behave strangely. ("I changed the variable, why didn't it match?")
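
        A tiny self-contained demonstration of the pitfall (the data is made up):

            use strict;

            my @lines = ("alice|100\n", "bob|200\n");

            for my $userid ('alice', 'bob') {
                for (@lines) {
                    # /o compiles the pattern once, the first time through:
                    # on the second outer pass this still matches 'alice'
                    print "matched: $_" if /$userid/o;
                }
            }
            # prints "matched: alice|100" twice; 'bob' is never found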

        It's hard to answer your second question without a lot more knowledge about what you're doing. But my guess is that things would be a lot easier if you combine the query and reformatting into one program. The general rule: use the database for what it does best (data retrieval and manipulation) and Perl for what it does best (everything else ;-). There's no point in extracting data into a flat file and running back and forth over it: during the extract, get it in the form that makes it easiest to manipulate.
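
        As a rough sketch of that idea (the DSN, credentials, query, and output format below are placeholders, not your actual schema):

            use strict;
            use DBI;

            my $dbh = DBI->connect('dbi:Oracle:billing', 'user', 'pass',
                                   { RaiseError => 1 });

            my $sth = $dbh->prepare(
                'SELECT userid, inv, inv_date, amt FROM invoices');
            $sth->execute;

            open(OUT, ">formatted.txt") or die "Can't write formatted.txt: $!";
            while (my @row = $sth->fetchrow_array) {
                # format each record as it comes off the wire;
                # no intermediate flat file, no second pass
                print OUT join('|', @row), "\n";
            }
            close OUT;
            $dbh->disconnect;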

        That said: if you're loading the final result into SQL Server, have you considered using DTS (built in to SS 7 and above)? There are many painful things about it (programming data transforms in VBScript -- barf), but it's hard to beat for sheer speed in server-to-server data transfers.

        HTH

        what kind of database is this getting loaded into?
        Calm begets calm
Re (tilly) 1: Search Efficiency
by tilly (Archbishop) on Jul 11, 2001 at 19:07 UTC
    If you can make the id be the first column and sort on id, then Search::Dict is your friend.
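
    A minimal sketch of that approach, assuming the file is already sorted on an id that leads each line (filename from the original post; $userid assumed populated):

        use strict;
        use Search::Dict;

        my $userid = 'u1001';          # assumed already populated

        open(FILE, "FILE.txt") or die "Can't open FILE.txt: $!";
        look(*FILE, $userid);          # binary search to first line ge $userid
        my $line = <FILE>;
        if (defined $line and $line =~ /^\Q$userid\E\|/) {
            my ($id, $inv, $date, $amt) = split /[|]/, $line;
            # build the hash here
        }
        close FILE;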

    A database would be an even better friend.

    But if this is something like a log file, you will likely find that File::Tail does the trick for you.

    BTW a warning. Unless your operating system and Perl both support filesizes over 2 GB, your file may wind up losing data. The solution to that is to break it into pieces and rotate it periodically.

Re: Search Efficiency
by tachyon (Chancellor) on Jul 11, 2001 at 19:42 UTC

    I don't understand what you mean by "when your code loops back through", as the code you present does not loop back. If you need to call this loop a lot, then you should probably cache the lines you have found recently and search the cache first. If you can't find the $userid in the cache, then search the whole file. Some code like this will work:

    use Fcntl ':flock';    # for LOCK_EX

    # $userid already populated
    # first check cache
    open(CACHE, "<cache.txt") or die "Oops $!";
    my @cached = <CACHE>;
    close CACHE;

    # now get the most recently cached line,
    # as a line could be cached multiple times
    my @lines = grep { /$userid/ } @cached;
    my $line = pop(@lines) || '';

    # only look through the file if we have not cached the
    # data associated with $userid recently
    if (not $line) {
        open(FILE, "FILE.txt") or die "Oops $!";
        while (<FILE>) {
            if (/$userid/o) {
                $line = $_;
                last;              # exit as soon as we find $userid
            }
        }
        close FILE;
    }

    # do the stuff
    if ($line) {
        chomp $line;
        cache($line);
        my ($inv, $date, $amt) = split(/[|]/, $line);
        # code to build hash.
    }
    else {
        warn "Can't find userid: $userid\n";
    }

    sub cache {
        my $line = shift;
        open(CACHE, ">>cache.txt") or die "Oops unable to open cache to append: $!";
        flock CACHE, LOCK_EX;
        print CACHE "$line\n";
        close CACHE;
    }

    # to limit the cache size you will need to clean it up
    # from time to time, say via a cron job. this sub takes
    # the desired maximum cache size as its argument
    sub clean_cache {
        my $max_length = shift;
        open(CACHE, "+<cache.txt") or die "Oops unable to open cache to R/W: $!";
        flock CACHE, LOCK_EX;
        my @cached = <CACHE>;
        my $start  = @cached - $max_length;
        $start = 0 if $start < 0;
        seek CACHE, 0, 0;
        truncate CACHE, 0;
        print CACHE @cached[$start .. $#cached];
        close CACHE;
    }

    Totally untested, sorry. Something like this should minimise accesses to your 250MB file. At this size you might be better off storing the data in a database rather than a flat file.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
