Re: Efficiency and Large Arrays
by splinky (Hermit) on Jul 23, 2000 at 01:57 UTC
Kozz...
The part of your post that sends up a red flag for me is
that you say your app gets incredibly slow, though not CPU
intensive. If you're talking about really large arrays,
it's likely that you're running out of real memory and you're
getting a lot of disk thrashing due to virtual memory.
One big memory waster I see in your method is that you're reading the
whole file into memory at once. Since all your records are
separated by blank lines, you can take advantage of Perl's
"paragraph mode". Just set $/ to the null string,
like so:
$/ = '';
and every $var = <FILE> will read in one record.
Check perldoc perlvar for more information on $/.
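A minimal sketch of paragraph mode in action; the in-memory filehandle and sample data are just for illustration:

```perl
use strict;
use warnings;

# Hypothetical two-record input; a real script would open a file on disk.
my $text = "SERIAL NUMBER 1\nPhone 555-0001\n\nSERIAL NUMBER 2\nPhone 555-0002\n";
open my $fh, '<', \$text or die "open: $!";

local $/ = '';          # paragraph mode: one blank-line-separated record per read
my @records;
while (my $record = <$fh>) {
    chomp $record;      # in paragraph mode, chomp strips the trailing newlines
    push @records, $record;
}
close $fh;

print scalar(@records), " records read\n";   # prints "2 records read"
```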
If that still doesn't take care of it, you might consider
using some kind of database for temporary storage. They
usually do some kind of smart caching in an attempt to
keep the stuff you need in memory and the stuff you're not
using on disk. I'm not particularly knowledgeable on
databases, so I'll defer that question to others.
Have fun!
*Woof*
RE: Efficiency and Large Arrays
by reptile (Monk) on Jul 23, 2000 at 02:31 UTC
1. Read in a file into a scalar
2. Split scalar into an array using split(/\n{2}/, $scalar)
That's rather wasteful. You can probably do it all at once if you're slick, iterating over each line of the file as you go. You can probably get #3 and #4 in there too. For #5 (the phone numbers), keep a temporary hash outside the loop; as you go along, if a number doesn't exist in the hash, add it, and if it does, drop the record. The biggest speed increase, I would imagine, would come from getting all of these points into a single loop over the file, and I believe it can be done.
It's a lot of code, and I don't want to embarrass myself by coming up with something right now, but some ideas:
Iterate over each line of the file, of course. Keep a variable handy to store the current serial number (if any) and whether a record is open or not (in case there's any line noise between records and such, probably not important). Also, your records hash, and a hash of phone numbers that you're just going to get rid of in the end.
When you get a "SERIAL NUMBER (\d+)" line, put $1 in your current serial number variable. When you get a { line, set the open flag to true; when you get a }, set it to false. Anything else, while the record is open, is stuff to shove into the hash. And when you get a phone number, check to see if it already exists in your phone number hash; if it does, you can just delete() the current record out of the records hash (or do the check when you get the closing brace, so that anything after the phone number won't re-enter the record).
Oh, I forgot: when you get a SERIAL NUMBER line, you can do the check there for repeated numbers and figure out a new one. It doesn't matter that you don't have all the records yet; if your new number is taken by a later record, that later record will be renumbered too. However, if that's not the behavior you want, my entire suggestion goes out the window.
I hope you could follow. I could send you some actual code but it would take me some time to write up and test, so /msg me or something if you want some code.
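A rough sketch of that single loop, with the record layout (a SERIAL NUMBER line followed by a { ... } block) assumed from this thread rather than taken from the actual data:

```perl
use strict;
use warnings;

# Made-up input: two records sharing a phone number.
my $data = <<'END';
SERIAL NUMBER 1
{
Phone 555-0001
}
SERIAL NUMBER 2
{
Phone 555-0001
}
END

open my $fh, '<', \$data or die "open: $!";

my (%records, %seen_phone);
my ($serial, $open) = (undef, 0);

while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^SERIAL NUMBER (\d+)/) {
        $serial = $1;                   # remember the current serial
    }
    elsif ($line eq '{') {
        $open = 1;                      # a record body has started
    }
    elsif ($line eq '}') {
        $open = 0;                      # record finished: do the phone dup-check
        my ($phone) = grep { /^Phone/ } @{ $records{$serial} || [] };
        if (defined $phone && $seen_phone{$phone}++) {
            delete $records{$serial};   # duplicate phone: drop the whole record
        }
    }
    elsif ($open && defined $serial) {
        push @{ $records{$serial} }, $line;   # body line: stash it
    }
}
close $fh;

print scalar(keys %records), " unique record(s) kept\n";   # prints "1 unique record(s) kept"
```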
local $_ = "0A72656B636148206C72655020726568746F6E41207473754A";
while(s/..$//) { print chr(hex($&)) }
Re: Efficiency and Large Arrays
by fundflow (Chaplain) on Jul 23, 2000 at 01:56 UTC
Since you say that the process is getting slower but is not using much CPU, it is probably swapping.
To speed it up, you need to avoid loading the whole file (or buy more memory :)
A method that goes through the file several times might be much faster, since the file is presumably on disk rather than tape, and re-reading it is cheap.
According to your description, the record number doesn't matter much to you, so why use the increment-until-unique method? It might be enough to assign any new number. If the method is important to you, then why not do a first pass over the file to find the maximum serial number and then just assign increasing numbers starting from there? If you need to be really strict about this, and the record numbers are usually in increasing order, don't forget to start the search from the last place you stopped.
To summarize:
- Read a record at a time
- Favor multiple passes.
- Try to stuff the least possible into the hashes
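The first-pass-then-renumber idea might be sketched like this (the data and record layout are made up for illustration):

```perl
use strict;
use warnings;

# Made-up input with a duplicate serial (7); a real script would read files.
my $data = "SERIAL NUMBER 7\n\nSERIAL NUMBER 3\n\nSERIAL NUMBER 7\n";

local $/ = '';    # read a record (paragraph) at a time

# Pass 1: find the maximum serial number in use.
my $max = 0;
open my $fh, '<', \$data or die "open: $!";
while (<$fh>) {
    $max = $1 if /^SERIAL NUMBER (\d+)/m && $1 > $max;
}
close $fh;

# Pass 2: keep the first use of each serial; later duplicates just get
# the next number above the maximum (no increment-until-unique search).
my (%seen, @out);
open $fh, '<', \$data or die "open: $!";
while (my $rec = <$fh>) {
    if ($rec =~ /^SERIAL NUMBER (\d+)/m && $seen{$1}++) {
        my $new = ++$max;
        $rec =~ s/^SERIAL NUMBER \d+/SERIAL NUMBER $new/m;
    }
    push @out, $rec;
}
close $fh;

print @out;    # the second "7" has become "SERIAL NUMBER 8"
```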
Re: Efficiency and Large Arrays
by chromatic (Archbishop) on Jul 23, 2000 at 02:43 UTC
The bit about 'increment serial number until it's unique' throws a red flag for me. Two solutions come to mind: either put things in a nice relational database (especially one with a feature like an auto-increment id column) or find some other sort of unique identifier.
If you're creating a new hash already, you can use its reference (yes, you read that right) as a unique ID. (I've seen this used as keys for a Flyweight object; pretty cool!) They're guaranteed to be unique, as they have something or other :) to do with memory locations:
$ perl
my $h_ref = {};
print "$h_ref\n";
HASH(0x80d761c)
You can get rid of everything except the hex digits with a simple tr/// statement: tr/a-f0-9//dc;.
That's quicker than scanning for unique numbers.
Still, there's something I can't quite put my finger on here... perhaps you could show us your intended data structure?
This is real overkill, don't you think?
Also, what happens when you run the script a second time?
Who guarantees that the number you got (a memory location) doesn't already appear
somewhere else in the later records?
If he renumbers anyway, then any number will do; instead of using
perl's heap pointer, it would be easier to pick one explicitly.
Anyway, his problem seems to lie in memory use more than
in the numbering scheme.
Also, what happens when you run the script the second time?
Yes, that's a problem, if the serial numbers need to be maintained. If this is a one-time-per-dataset operation, and the serial numbers are there just while manipulating the data, it doesn't really matter.
Anyway, his problem seems to lie on the memory use more than the numbering scheme.
But the reason he's keeping all the old records around is to make sure he doesn't reuse a number. If he uses a unique identifier (the reference value is unique, automatically generated, and readily available), he doesn't have to keep all of the records around in memory.
The thing that bothered me was using grep to look for already-used phone numbers. What if they were the primary key of the hash? Then, it's a simple lookup to see if one's already used.
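A tiny sketch of that lookup, with made-up phone numbers: keying the hash on the phone number turns the duplicate check into a single hash lookup instead of a grep over every stored record.

```perl
use strict;
use warnings;

my %by_phone;    # phone number => record

# Store a record unless its phone number is already taken.
sub add_record {
    my ($phone, $record) = @_;
    return 0 if exists $by_phone{$phone};   # O(1) duplicate check
    $by_phone{$phone} = $record;
    return 1;
}

add_record('555-0001', 'record A') or warn "dup\n";
add_record('555-0002', 'record B') or warn "dup\n";
add_record('555-0001', 'record C') or warn "dup\n";   # rejected as duplicate

print scalar(keys %by_phone), " records stored\n";    # prints "2 records stored"
```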
$ perl
my $h_ref = {};
print 0+$h_ref,"\n";
print unpack('H*',pack('L',0+$h_ref)),"\n";
135099932
80D761C
__SIG__
use B;
printf "You are here %08x\n", unpack "L!", unpack "P4", pack
"L!", B::svref_2object(sub{})->OUTSIDE;
Some things not mentioned yet
by gryng (Hermit) on Jul 23, 2000 at 09:23 UTC
A few tips that haven't been mentioned (and are also not to be considered
complete).
First, you didn't say whether your data is ordered. If it happens to be
ordered by either field, then you don't need to put much effort at all into
the dup check: just keep the last item found, and that is all you need to
compare against to see whether the next item is a duplicate.
Of course, if the records are not ordered, this won't work as well, but you
should still consider it. For instance, if the files are semi-ordered (say
5-10% of records out of place, but otherwise in the right order), you can
still use the same routine, but treat the last serial as a water-mark value:
if you come across a value that is lower, set it to ++$water_mark.
Also, if the files are completely unordered, you may want to simply sort
them beforehand; this initial cost can easily outweigh the memory cost of your
hash.
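The water-mark renumbering might look like this (the serial list is made up):

```perl
use strict;
use warnings;

# Mostly-ordered serials; '4' is out of place.
my @serials = (1, 2, 5, 4, 6);

my $water_mark = 0;
my @fixed;
for my $s (@serials) {
    if ($s > $water_mark) {
        $water_mark = $s;             # in order: accept it and raise the mark
        push @fixed, $s;
    }
    else {
        push @fixed, ++$water_mark;   # out of order (or duplicate): renumber
    }
}

print "@fixed\n";   # prints "1 2 5 6 7"
```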
Another, much simpler, method is to get rid of the serial numbers in the
file entirely and just count up from 0 or 1 for the first record onward. This
only works if you don't mind your serial numbers changing each time you run
the program. It's good because you can then also use the previously mentioned
idea of sorting by phone number for the phone dup-check, and avoid the serial
dup-check entirely by making up your own numbers.
Like I said, many caveats. But depending on what you are doing, these
can really speed things up.
Oh, one other thing: if you have multiple files and you sort beforehand,
you can keep the files separate, but you'll need to open them all at once
so that you can always read from whichever currently has the lowest value.
Ciao,
Gryn
Re: Efficiency and Large Arrays
by turnstep (Parson) on Jul 23, 2000 at 03:25 UTC
Here's one way to do it. Does not require much memory,
and should be fairly fast. Not tested on 1MB files. :)
#!perl
use strict;

## Files to be loaded:
my @files = qw(file1.txt file2.txt file3.txt);

## New file to create:
my $newfoo = "Bigfile.txt";

my ($file, $number, %phone, %serial, %change);

{
    local $/ = '';    ## paragraph mode: one record per read

    ## First pass: note duplicate phone numbers and colliding serials
    for $file (@files) {
        open(FOO, $file) or die "Could not open $file: $!\n";
        while (<FOO>) {
            ## Check for duplicate phone number
            /^Phone (.*)/m or die "No phone for record $.\n";
            $phone{$1}++ and $change{"$file$."} = -1 and next;
            ## If the phone number is cool, check the serial number:
            /^SERIAL NUMBER (\d+)/m or die "Bad serial number for record $.\n";
            ## Make sure it is unique; if not, find the next free one
            if ($serial{$1}++) {
                $number = $1;
                1 while $serial{++$number};
                $serial{$number}++;
                $change{"$file$."} = $number;
            }
        }
        close(FOO);    ## explicit close resets $. before the next file
    }

    ## Second pass: loop through them all again, this time for keeps :)
    open(NEWFOO, ">$newfoo") or die "Could not create $newfoo: $!\n";
    for $file (@files) {
        open(FOO, $file) or die "Could not open $file: $!\n";
        while (<FOO>) {
            if ($change{"$file$."}) {
                $change{"$file$."} == -1 and next;    ## duplicate phone: drop it
                s/^SERIAL NUMBER \d+/SERIAL NUMBER $change{"$file$."}/m;
            }
            print NEWFOO $_;
        }
        close(FOO);    ## so $. matches the keys from the first pass
    }
    close(NEWFOO);
}    ## end paragraph-mode block
RE: Efficiency and Large Arrays
by lhoward (Vicar) on Jul 23, 2000 at 01:46 UTC
|
If you have the memory to spare and efficiency is of utmost
importance, then by all means load as much as you can into memory.
If memory is tight, or you can deal with longer runtimes
(and more disk IO), then store the data in some
sort of persistent storage (SQL DB, hash tied to a DBM file, etc.).
This is the classic space vs. performance tradeoff. You are
going to have to decide which is more important to you in
your particular circumstances.
Well, it all gets loaded into memory (currently), which is fine by me. Let it gobble the memory if it's necessary. But the issue is that the entire process is still too slow.
*scratches head*
I would recommend using a profiler like the Devel::DProf
module to profile your code and isolate the bottlenecks.
Some solutions I've implemented
by Kozz (Friar) on Jul 23, 2000 at 19:00 UTC
My dear monks:
Thank you so much for all the input and opinions. I'm learning more every day! Since reading all these comments, I've re-written my script to parse one record at a time ("paragraph mode" as suggested by Splinky) from each file, and then discarding that record when it's been printed to STDOUT. I've also created a smaller hash to store only a few needed bits of info. I've also taken the suggestion to simply re-assign the SERIAL numbers as I encounter each record, starting from zero and incrementing once per loop, since it's not necessary that these are sorted in any fashion (Why didn't I think of that before??).
I think I'm also going to use a hash keyed on the phone number, as suggested above -- this should help things along as well. While the script is still slow despite the changes I've made, I'll make some more modifications and perhaps even write to a hash tied to a DBM, as suggested by lhoward. Thanks again, most wise monks!
--Kozz
Re: Efficiency and Large Arrays
by Anonymous Monk on Jul 23, 2000 at 09:31 UTC
Since any of the serial numbers could be duplicated, and you will change one of the duplicates to another arbitrary number anyway, the actual number is clearly not important. Why not just reassign the serial number on _every_ record as you read from top to bottom, starting from 1? That gets rid of the need to worry about duplicate serial numbers and allows a single pass from top to bottom.
To handle duplicated phone_numbers, keep a hash of just the phone numbers (as key). When you read a record, if the phone number key is already in the hash don't print the record out, otherwise set the hash value for that phone no to 1 and print the record.
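That single pass might be sketched like so (the record layout and data are made up for illustration):

```perl
use strict;
use warnings;

# Made-up records; the third duplicates the first record's phone number.
my $data = <<'END';
SERIAL NUMBER 42
Phone 555-0001

SERIAL NUMBER 9
Phone 555-0002

SERIAL NUMBER 17
Phone 555-0001
END

open my $fh, '<', \$data or die "open: $!";
local $/ = '';    # paragraph mode: one record per read

my %seen_phone;
my $serial = 0;
while (my $rec = <$fh>) {
    my ($phone) = $rec =~ /^Phone (.*)/m;
    next if defined $phone && $seen_phone{$phone}++;   # drop duplicate phones
    $serial++;                                         # hand out a fresh serial
    $rec =~ s/^SERIAL NUMBER \d+/SERIAL NUMBER $serial/m;
    print $rec;
}
close $fh;
```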
I agree with you, that's what I was mentioning in my post previously.
You can also reduce memory further if you presort the data on phone
numbers. Then you don't have to keep a hash at all.
Ciao,
Gryn
Re: Efficiency and Large Arrays
by turnstep (Parson) on Jul 25, 2000 at 23:02 UTC
A side note to those who recommend sorting first: yes,
it will help immensely if you are able to do so,
but keep in mind that the sorting itself will be fairly
expensive, especially for files this size. In addition,
a simple unix sort will not be up to the task
(without some further trickery), as each record spans multiple
lines (and 1+ files) and the sort keys
may be in different spots in each record. Not a challenge
for perl, of course, but it probably cancels out any speed
benefit of having it pre-sorted.
Yes, it would be "expensive", but if you don't have the memory to hold all the records, then sorting is not only the faster way to go, it's the only sane way to go. Sorting doesn't have to require large amounts of memory (though a naive in-memory sort would), and perl could do the presorting itself. For small files, presorting would be a waste of time and CPU, but as the input files grow, simply sorting your input beforehand can do wonders.
Ciao,
Gryn
Yes, a simple sort requires a lot of memory, but even
a more complex sort requires, at the very least,
that you read each record once. Why not just grab the information while you are there, as in my example program above? I agree that a sorted file is far better, and you'd
probably want to get it sorted at some point, but I would
not call it the only "sane" way! :)