Search Efficiency

by treebeard (Acolyte)
on Jul 11, 2001 at 18:43 UTC ( [id://95699] )

treebeard has asked for the wisdom of the Perl Monks concerning the following question:

OK, I have searched "Super Search" for an answer to my issue, and this is what I have gleaned.

I have a process that looks through an extracted pipe-delimited flat file via a while (<FILEHANDLE>) loop for a user id. When I find an id, I split the line on the pipe, examine the data, and, based on the values, move it into a specific keyed hash to be processed in another part of the program.

My issue is that the flat files are enormous (well, for me) at 250+ MB. Is there a way I can tell Perl not to rescan the file each time the while loop goes through?

    #$userid already populated
    open(FILE, "FILE.txt") or die "Can't open FILE.txt: $!";
    while (<FILE>) {
        if (/$userid/o) {
            chomp;
            # below is an example; there are actually 15
            # variables coming out of this split.
            ($inv, $date, $amt) = split(/[|]/);
            # code to build hash.
        }
    }
    # processing of hashes

Is there any way to tell Perl, "hey, you have already gone through x lines, don't start over at the beginning of the file when you loop back through"?

Replies are listed 'Best First'.
Re: Search Efficiency
by bikeNomad (Priest) on Jul 11, 2001 at 18:53 UTC
    You can use tell to save the current position of the file handle, and then use seek to return to it later.
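
    For instance, a minimal sketch of that approach (the id list and loop structure are my own illustration, not treebeard's exact code; the filename is from the original post):

        use strict;

        # hypothetical list of ids to look up, in file order
        my @ids_to_find = qw(u1001 u1002);

        open(FILE, "FILE.txt") or die "Can't open FILE.txt: $!";
        my $pos = 0;                   # saved byte offset, starts at the top

        for my $userid (@ids_to_find) {
            seek(FILE, $pos, 0);       # jump back to where we left off
            while (<FILE>) {
                next unless /\Q$userid\E/;
                $pos = tell(FILE);     # offset just past the matching line
                # ... split and process the record here ...
                last;
            }
        }
        close FILE;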

    Look at DBD::CSV for a solution that will keep you from having to scan the files yourself. However, I don't know how efficient it is in terms of re-scanning.
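
    For example, a rough sketch along the lines of the DBD::CSV documentation (the table mapping and column names below are made up; the real file has 15 fields):

        use strict;
        use DBI;

        my $userid = 'u1001';          # assumed already populated

        # f_dir and csv_sep_char are standard DBD::CSV attributes
        my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_sep_char=|",
                               undef, undef, { RaiseError => 1 });
        $dbh->{csv_tables}{billing} = {
            file      => "FILE.txt",
            col_names => [qw(userid inv inv_date amt)],
        };

        my $sth = $dbh->prepare(
            "SELECT inv, inv_date, amt FROM billing WHERE userid = ?");
        $sth->execute($userid);
        while (my ($inv, $date, $amt) = $sth->fetchrow_array) {
            # build the keyed hash here
        }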

    Also, the BerkeleyDB module should be able to handle flat files efficiently if you use its DB_RECNO type and specify -Delim => '|'.
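
    As a related illustration, here is a minimal sketch using the companion DB_File module's RECNO mode, which ties each record of a flat file to an array element (the module choice and settings here are my assumption, not a tested recipe):

        use strict;
        use DB_File;
        use Fcntl;

        $DB_RECNO->{bval} = "\n";      # record delimiter: one line per element

        my @records;
        tie @records, 'DB_File', 'FILE.txt', O_RDONLY, 0644, $DB_RECNO
            or die "tie failed: $!";

        my $line = $records[41];       # random access to line 42, no full scan
        untie @records;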

Re: Search Efficiency
by Cirollo (Friar) on Jul 11, 2001 at 19:00 UTC
    Perl doesn't read in the entire file each time through that loop. Each time through the loop, Perl just reads the next line from the still-opened file.

    A few tidbits that might come in handy if you're doing anything more complex:

    If you need to know the current line number, look at the $. variable (dollar dot).

    The seek function can be used to go to a specific part of a file without reading the whole thing. The tell function returns the current file position in bytes, based at 0.

      Cirollo is right. And you want to be careful about where your hash is defined. If it's created inside the while loop, it will be regenerated after every match of $userid. If it's created outside the while loop, you can add info to it and do your processing after the while loop ends. I suspect this might be the real problem behind your question:

      Is there any way to tell Perl, "hey, you have already gone through x lines, don't start over at the beginning of the file when you loop back through"?

      by the way, always use strict and perl -w to keep your code on its very best behaviour.
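
      for example, a minimal sketch of the scoping point (filename and fields borrowed from the original post; $userid assumed populated):

          use strict;
          use warnings;

          my $userid = 'u1001';        # assumed already populated
          my %by_user;                 # declared once, outside the loop

          open(FILE, "FILE.txt") or die "Can't open FILE.txt: $!";
          while (<FILE>) {
              next unless /\Q$userid\E/;
              chomp;
              my ($inv, $date, $amt) = split /[|]/;
              push @{ $by_user{$userid} }, [ $inv, $date, $amt ];
          }
          close FILE;

          # %by_user still holds every record here, after the loop ends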

      ~Particle

Re: Search Efficiency
by Abigail (Deacon) on Jul 11, 2001 at 19:47 UTC
    I fail to see your problem. Perl only goes as many times through a file as you tell it to. In the code you give, you scan the file once, and only once. And unless you seek back to the beginning and enter the loop again, Perl isn't going through the file again.

    I get the feeling you didn't give us all the code, but if you do, I expect the answer to be "don't do that then".

    Note that if you want Perl to stop reading the file in its current iteration, look at the "last" statement.

    -- Abigail

Re: Search Efficiency
by VSarkiss (Monsignor) on Jul 11, 2001 at 19:02 UTC

    If I'm reading right, you're doing a linear search through the text file each time looking for a specific $userid.

    If the file is sorted, and you're looking for values in the same order, then you can use seek and tell to move around in the file. Something like this:

    my $last_pos = 0;                # start at the beginning
    # later....
    seek(FILE, $last_pos, 0);        # go to where I was
    while (<FILE>) {                 # start reading line at a time
        if (/$userid/) {             # sure about the /o, BTW?
            $last_pos = tell(FILE);  # remember this spot
            # ... other stuff
        }
    }
    You can see why order is important: I'm assuming you can find the next record by advancing in the file.

    If the requests are in random order, then it may be worthwhile to build your own index (we used to call these ISAM files back in the day ;-). In the beginning of your code, scan the file once, build a hash of positions. Then you can seek to any particular record. Something like this:

    my %index;
    while (<FILE>) {
        my ($userid) = split(...);   # it's in there somewhere, right?
        $index{$userid} = tell(FILE);
    }
    # later...
    seek(FILE, $index{$userid}, 0);
    $_ = <FILE>;
    ($inv, $date, $amt) = split(...);
    You'll have to fiddle with these to make sure you're seeking to the right spot in the file. (Note that tell(FILE) after reading a line returns the position of the *next* record, so you may want to record the position before reading each line.)

    Now the big suggestion: forget everything I just wrote! Get yourself a relational database and get rid of all this seek/tell stuff. If you've got that much data and you're doing random reads, there's just no point in writing your own ISAM stuff.

    HTH

      First, I had read that the /o would help with respect to Perl evaluating the variable; I got the idea from the O'Reilly books... was I correct in my assumption?

      Second, we are extracting a lot of data from an existing billing system, formatting it using Perl, and building a huge flat file that someone (not me :)) is going to load into SQL Server. (Not my idea.)

      We are using Perl DBI scripts to extract the data from Oracle, then we process them using scripts based on the one above. It is not elegant, but we are not working with much time. Therefore our approach is:

      1. Build tables

      2. Extract tables to file system (solaris)

      3. Format files

      4. Send the buggers out.

      I know that this is getting away from my original question, but should we have merged the DBI call scripts and the format scripts?

        Well, the /o says "the variable inside this regular expression will always have the same value". Believing that to be true, Perl will not recompile the RE. If you change the value of the variable, the RE won't change. This may be good or bad depending on what you're doing: good if you really don't change it, because you'll save time on RE recompiles; bad if you do change it, because your matches will appear to behave strangely. ("I changed the variable, why didn't it match?")
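
        A tiny self-contained demonstration of the pitfall (the data is made up):

            use strict;

            my @lines = ("alice|100\n", "bob|200\n");

            for my $userid ('alice', 'bob') {
                for (@lines) {
                    # /o compiles the pattern once, the first time through:
                    # on the second outer pass this still matches 'alice'
                    print "matched: $_" if /$userid/o;
                }
            }
            # prints "matched: alice|100" twice; 'bob' is never found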

        It's hard to answer your second question without a lot more knowledge about what you're doing. But my guess is that things would be a lot easier if you combine the query and reformatting into one program. The general rule: use the database for what it does best (data retrieval and manipulation) and Perl for what it does best (everything else ;-). There's no point in extracting data into a flat file and running back and forth over it: during the extract, get it in the form that makes it easiest to manipulate.
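
        As a rough sketch of that idea (the DSN, credentials, query, and output format below are placeholders, not your actual schema):

            use strict;
            use DBI;

            my $dbh = DBI->connect('dbi:Oracle:billing', 'user', 'pass',
                                   { RaiseError => 1 });

            my $sth = $dbh->prepare(
                'SELECT userid, inv, inv_date, amt FROM invoices');
            $sth->execute;

            open(OUT, ">formatted.txt") or die "Can't write formatted.txt: $!";
            while (my @row = $sth->fetchrow_array) {
                # format each record as it comes off the wire;
                # no intermediate flat file, no second pass
                print OUT join('|', @row), "\n";
            }
            close OUT;
            $dbh->disconnect;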

        That said: if you're loading the final result into SQL Server, have you considered using DTS (built in to SS 7 and above)? There are many painful things about it (programming data transforms in VBScript -- barf), but it's hard to beat for sheer speed in server-to-server data transfers.

        HTH

        what kind of database is this getting loaded into?
        Calm begets calm
Re (tilly) 1: Search Efficiency
by tilly (Archbishop) on Jul 11, 2001 at 19:07 UTC
    If you can make the id be the first column and sort on id, then Search::Dict is your friend.
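
    A minimal sketch of that approach, assuming the file is already sorted on an id that leads each line (filename from the original post; $userid assumed populated):

        use strict;
        use Search::Dict;

        my $userid = 'u1001';          # assumed already populated

        open(FILE, "FILE.txt") or die "Can't open FILE.txt: $!";
        look(*FILE, $userid);          # binary search to first line ge $userid
        my $line = <FILE>;
        if (defined $line and $line =~ /^\Q$userid\E\|/) {
            my ($id, $inv, $date, $amt) = split /[|]/, $line;
            # build the hash here
        }
        close FILE;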

    A database would be an even better friend.

    But if this is something like a log file, you will likely find that File::Tail does the trick for you.

    BTW a warning. Unless your operating system and Perl both support filesizes over 2 GB, your file may wind up losing data. The solution to that is to break it into pieces and rotate it periodically.

Re: Search Efficiency
by tachyon (Chancellor) on Jul 11, 2001 at 19:42 UTC

    I don't understand what you mean by "when your code loops back through", as the code you present does not loop back. If you need to call this loop a lot, then you should probably cache the lines you have found recently and search the cache first. If you can't find the $userid in the cache, then search the whole file. Some code like this will work:

    use Fcntl ':flock';    # for LOCK_EX

    # $userid already populated
    # first check cache
    open(CACHE, "<cache.txt") or die "Oops $!";
    my @cached = <CACHE>;
    close CACHE;

    # now get the most recently cached line,
    # as a line could be cached multiple times
    my @lines = grep { /$userid/ } @cached;
    my $line = pop(@lines) || '';

    # only look through the file if we have not cached the
    # data associated with $userid recently
    if (not $line) {
        open(FILE, "FILE.txt") or die "Oops $!";
        while (<FILE>) {
            if (/$userid/o) {
                $line = $_;
                last;              # exit as soon as we find $userid
            }
        }
        close FILE;
    }

    # do the stuff
    if ($line) {
        chomp $line;
        cache($line);
        my ($inv, $date, $amt) = split(/[|]/, $line);
        # code to build hash.
    }
    else {
        warn "Can't find userid: $userid\n";
    }

    sub cache {
        my $line = shift;
        open(CACHE, ">>cache.txt") or die "Oops unable to open cache to append: $!";
        flock CACHE, LOCK_EX;
        print CACHE "$line\n";
        close CACHE;
    }

    # to limit the cache size you will need to clean it up
    # from time to time, say via a cron job. this sub takes
    # the desired maximum cache size as its argument
    sub clean_cache {
        my $max_length = shift;
        open(CACHE, "+<cache.txt") or die "Oops unable to open cache to R/W: $!";
        flock CACHE, LOCK_EX;
        my @cached = <CACHE>;
        my $start  = @cached - $max_length;
        $start = 0 if $start < 0;
        seek CACHE, 0, 0;
        truncate CACHE, 0;
        print CACHE @cached[$start .. $#cached];
        close CACHE;
    }

    Totally untested, sorry. Something like this should minimise accesses to your 250MB file. At this size you might be better off storing the data in a database rather than a flat file.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
