http://qs321.pair.com?node_id=23929

Kozz has asked for the wisdom of the Perl Monks concerning the following question:

My most respected monks,
I request your insight on how to more efficiently parse and manipulate large arrays, more specifically arrays created from files. Please allow me to present an example:
I have two or more files (file1.txt, file2.txt, etc) which contain multiple instances of things like this:
SERIAL NUMBER 1000 {
Name John Doe
Phone 555-5555
}

SERIAL NUMBER 1001 {
Name Jane Doe
Phone 555-5555
}
Let us suppose that aside from the Name and Phone, there are perhaps a dozen other 'fields' in each 'record', and that there are up to 10,000 records in each file; each of the files is therefore anywhere from 1MB to 1.5MB in size, perhaps larger. I've created a script which does this, but with large files it has become unbelievably slow (though not CPU intensive), and my goal is to make it faster and more efficient (usually those two go hand-in-hand). Right now my script does the following:
  1. Read in a file into a scalar
  2. Split scalar into an array using split(/\n{2}/, $scalar)
  3. Iterate over each record in array, creating a hash whose key is the unique serial number, and whose value is the current record in array
  4. If $hash{serial} already exists, increment serial in current record of array until unique, then create new key/value pair with record
  5. Determine if the phone number in the current record already exists in the values of the hash; if so, discard the record and continue to the next record
#5 above is currently being done by grepping the values of the hash for the phone number, though I know that's rather wasteful (creating a temp array only to be thrown away). What I'm wrestling with is the concept of holding such a large amount of memory hostage (which is expensive) so that I can constantly check that I'm creating a new, unique hash key (and not clobbering an old value), and then grepping the values of the hash for the phone numbers to make sure I don't create another key/value pair containing the same phone. All of this seems quite expensive, and I'm sure that the most wise monks will have some insight. Thanks in advance.
--Kozz

Replies are listed 'Best First'.
Re: Efficiency and Large Arrays
by splinky (Hermit) on Jul 23, 2000 at 01:57 UTC
    Kozz...

    The part of your post that sends up a red flag for me is that you say your app gets incredibly slow, though not CPU intensive. If you're talking about really large arrays, it's likely that you're running out of real memory and you're getting a lot of disk thrashing due to virtual memory.

    One big memory waster I see in your method is that you're reading the whole file into memory at once. Since all your records are separated by blank lines, you can take advantage of Perl's "paragraph mode". Just set $/ to the null string, like so:

    $/ = '';

    and every $var = <FILE> will read in one record.

    Check perldoc perlvar for more information on $/.
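    For example, a minimal sketch of reading one record at a time in paragraph mode (untested; the filename and field layout are assumed from the original post):

    #!perl
    use strict;

    $/ = '';    # paragraph mode: records are separated by blank lines

    open(FILE, "file1.txt") or die "Could not open file1.txt: $!\n";
    while (my $record = <FILE>) {
        my ($serial) = $record =~ /^SERIAL NUMBER (\d+)/;
        my ($phone)  = $record =~ /^Phone (.*)/m;
        ## ... process this one record, then let it go ...
    }
    close(FILE);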

    If that still doesn't take care of it, you might consider using some kind of database for temporary storage. They usually do some kind of smart caching in an attempt to keep the stuff you need in memory and the stuff you're not using on disk. I'm not particularly knowledgeable on databases, so I'll defer that question to others.

    Have fun!

    *Woof*

RE: Efficiency and Large Arrays
by reptile (Monk) on Jul 23, 2000 at 02:31 UTC

    1. Read in a file into a scalar
    2. Split scalar into an array using split(/\n{2}/, $scalar)

    That's rather wasteful. You can probably do it all at once, if you're slick, iterating over each line in the file as you go. You can probably get #3 and #4 in there too. For #5 (the phone numbers), maybe keep a hash outside the loop with the phone numbers in it; as you go along, if a number doesn't exist in the hash, add it, and if it does, drop the record. The biggest speed increase, I would imagine, would come from getting all of these points into a single loop over the file, and I believe it can be done.

    It's a lot of code, and I don't want to embarrass myself by coming up with something right now, but some ideas:

    Iterate over each line of the file, of course. Keep a variable handy to store the current serial number (if any) and whether a record is open or not (in case there's any line noise between records and such, probably not important). Also, your records hash, and a hash of phone numbers that you're just going to get rid of in the end.

    When you get a "SERIAL NUMBER (\d+)" line, put $1 in your current serial number value. When you get a { line, set open to true, and } set it to false. Anything else, while it's open, is stuff to shove into the hash. And, when you get a phone number, check so see if it already exists (in your phone number hash), and if it does, you can just delete() the current record out of the hash (or, do the check when you get a closing brace so if you have stuff after the phone number, it won't re-enter the record).

    Oh, I forgot: when you get a SERIAL NUMBER line, you can do the check there for repeating numbers and figure out a new one. It doesn't matter that you haven't seen all the records yet, since if your new number is taken by a later record, that later record will be incremented too. However, if that's not the behavior you want, my entire suggestion goes out the window.
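    A rough, untested sketch of that loop, just to illustrate (record layout, field names and filename assumed from the original post):

    #!perl
    use strict;

    my (%records, %phones, $serial, $open);

    open(FILE, "file1.txt") or die "Could not open file1.txt: $!\n";
    while (my $line = <FILE>) {
        if ($line =~ /^SERIAL NUMBER (\d+)/) {
            $serial = $1;
            $serial++ while exists $records{$serial};   # bump until unused
        }
        $open = 1 if $line =~ /\{/;                     # a record has started
        if ($open and $line =~ /^Phone (.*)/) {
            if ($phones{$1}++) {                        # phone number already seen
                delete $records{$serial};               # drop the whole record
                $open = 0;                              # and ignore the rest of it
            }
        }
        $records{$serial} .= $line if $open;            # collect this record's lines
        $open = 0 if $line =~ /\}/;                     # record finished
    }
    close(FILE);
    ## %records now holds one entry per unique serial, with duplicate phones dropped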

    I hope you could follow. I could send you some actual code but it would take me some time to write up and test, so /msg me or something if you want some code.

    local $_ = "0A72656B636148206C72655020726568746F6E41207473754A"; while(s/..$//) { print chr(hex($&)) }

Re: Efficiency and Large Arrays
by fundflow (Chaplain) on Jul 23, 2000 at 01:56 UTC
    Since you say that the process is getting slower and is not using much CPU, it is probably swapping.
    In order to speed it up, you need to avoid loading the whole file (or buy more memory :)
    A method that goes through the file several times might be much faster, since you presumably aren't storing the file on tape.

    According to your description, the record number doesn't matter to you much. Why do you use the increment-until-unique method?
    It might be enough to give a new number. If this method is important to you, then why not do a first pass on the file to find the maximum and then just assign an increasing number starting from that?
    If you need to be really strict with this, and the record numbers are usually in increasing order, don't forget to start the search from the last place you stopped.
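    For instance, a rough two-pass sketch of the find-the-maximum approach (untested; paragraph-mode records and one example filename assumed):

    #!perl
    use strict;

    $/ = '';    # read a record at a time

    ## First pass: find the largest serial number already in use
    my $max = 0;
    open(FILE, "file1.txt") or die "Could not open file1.txt: $!\n";
    while (<FILE>) {
        $max = $1 if /^SERIAL NUMBER (\d+)/ and $1 > $max;
    }
    close(FILE);

    ## Second pass: any duplicate simply gets the next number above the maximum
    my %seen;
    open(FILE, "file1.txt") or die "Could not open file1.txt: $!\n";
    while (my $record = <FILE>) {
        if ($record =~ /^SERIAL NUMBER (\d+)/ and $seen{$1}++) {
            $record =~ s/^SERIAL NUMBER \d+/"SERIAL NUMBER " . ++$max/e;
        }
        print $record;
    }
    close(FILE);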


    To summarize:
    - Read a record at a time
    - Favor multiple passes.
    - Try to stuff the least possible into the hashes
Re: Efficiency and Large Arrays
by chromatic (Archbishop) on Jul 23, 2000 at 02:43 UTC
    The bit about 'increment serial number until it's unique' throws a red flag for me. There are two solutions that come to mind. Either put things in a nice relational database (especially one that has a feature like an auto-increment id column) or find some other sort of unique identifier.

    If you're creating a new hash already, you can use its reference (yes, you read that right) as a unique ID. (I've seen this used as keys for a Flyweight object, pretty cool!) They're guaranteed to be unique, as they have something or other :) to do with memory locations:

    ]$ perl
    my $h_ref = {};
    print "$h_ref\n";
    HASH(0x80d761c)
    You can get rid of everything except the hex digits with a simple tr/// statement: tr/a-f0-9//dc;. That's quicker than scanning for unique numbers.
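    For example (continuing from the snippet above):

    my $id = "$h_ref";      # e.g. "HASH(0x80d761c)"
    $id =~ tr/a-f0-9//dc;   # delete everything that isn't a hex digit
    print "$id\n";          # "080d761c"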

    Still, there's something I can't quite put my finger on here... perhaps you could show us your intended data structure?

      This is real overkill, don't you think?

      Also, what happens when you run the script the second time?
      Who guarantees that the number you got (a memory location) doesn't already appear somewhere else in the later records?

      If he renumbers anyway, then any number will do. Instead of using perl's heap pointer, it would be easier to explicitly pick one.

      Anyway, his problem seems to lie on the memory use more than the numbering scheme.
        Also, what happens when you run the script the second time?

        Yes, that's a problem, if the serial numbers need to be maintained. If this is a one-time-per-dataset operation, and the serial numbers are there just while manipulating the data, it doesn't really matter.

        Anyway, his problem seems to lie on the memory use more than the numbering scheme.

        But the reason he's keeping all the old records around is to make sure he doesn't reuse a number. If he uses a unique identifier (the reference value is unique, automatically generated, and readily available), he doesn't have to keep all of the records around in memory.

        The thing that bothered me was using grep to look for already-used phone numbers. What if they were the primary key of the hash? Then, it's a simple lookup to see if one's already used.
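        Something like this, perhaps (assuming @records holds one record per element, in the format from the original post):

        my %by_phone;
        for my $record (@records) {
            my ($phone) = $record =~ /^Phone (.*)/m;
            next if exists $by_phone{$phone};   # duplicate phone number: skip it
            $by_phone{$phone} = $record;        # first sighting: keep it
        }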

      Instead of using the string form of the hash reference, just take the numeric value to begin with. If you want the hex form, just pack/sprintf it.

      ]$ perl
      my $h_ref = {};
      print 0+$h_ref,"\n";
      print unpack('H*',pack('L',0+$h_ref)),"\n";
      135099932
      80D761C

      __SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;
Some things not mentioned yet
by gryng (Hermit) on Jul 23, 2000 at 09:23 UTC
    A few tips that haven't been mentioned (and are also not to be considered complete).

    First, you didn't say if your data was ordered or not. If it happens to be ordered by either field, then you do not need to put much effort at all into that dup check. Just keep the last item found. That will be all you need to check against to see if the next item is a duplicate.

    Of course, if they are not ordered, then this will not be as good a solution. You should still consider it, though. For instance, if the files are semi-ordered -- that is, there may be about 5-10% mis-ordering, but otherwise they're in the right order -- then you can still use the same routine, but instead you use the last value seen as a sort of water mark -- that is, if you come across a value that is lower, it gets set to ++$water_mark.
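    Roughly (variable names assumed):

    if ($serial > $water_mark) {
        $water_mark = $serial;          # in order: raise the water mark
    }
    else {
        $serial = ++$water_mark;        # out of order: give it the next number up
    }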

    Also, if the files are completely unordered, you may want to simply sort them beforehand; this initial cost can easily pay for itself compared to the memory cost of your hash.

    Another, much simpler, method is to completely get rid of the serial numbers in the file and just start at 0 or 1 for the first record and count up. This only works if you don't care about your serial numbers changing each time you run this program. This is good because then you can also use the previously mentioned idea of sorting by phone number to do your dup-check for phone numbers, and avoid the dup-check on serials by simply making up your own.

    Like I said, many caveats. But depending on what you are doing, these can really speed things up.

    Oh, one other thing: if you have multiple files and you sort beforehand, you can keep your files separate, but you'll need to open all the files at once so that you can always read from whichever one currently has the lowest value.
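    A rough merge sketch of that (untested; it assumes each file has already been sorted by phone number, one record per paragraph, and the filenames are just examples):

    #!perl
    use strict;

    $/ = '';                                 # paragraph mode
    my @files = qw(file1.txt file2.txt);     # example names

    ## pull the phone number out of a record
    sub phone { my ($p) = $_[0] =~ /^Phone (.*)/m; return $p }

    ## open every file and read its first record
    my (@handles, @current);
    for my $file (@files) {
        open(my $fh, $file) or die "Could not open $file: $!\n";
        push @handles, $fh;
        push @current, scalar <$fh>;
    }

    ## repeatedly print whichever pending record has the lowest phone number
    while (my @live = grep { defined $current[$_] } 0..$#current) {
        my ($lowest) = sort { phone($current[$a]) cmp phone($current[$b]) } @live;
        print $current[$lowest];
        my $fh = $handles[$lowest];
        $current[$lowest] = <$fh>;
    }

    (Sorting the pending records each time through the loop is not the fastest possible merge, but it keeps the sketch short.)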

    Ciao,
    Gryn

Re: Efficiency and Large Arrays
by turnstep (Parson) on Jul 23, 2000 at 03:25 UTC

    Here's one way to do it. Does not require much memory, and should be fairly fast. Not tested on 1MB files. :)

    #!perl
    use strict;

    ## Files to be loaded:
    my @files = qw(file1.txt file2.txt file3.txt);

    ## New file to create:
    my $newfoo = "Bigfile.txt";

    my ($file, $number, %phone, %serial, %change);

    {
        local $/='';
        for $file (@files) {
            open(FOO, "$file") or die "Could not open $file: $!\n";
            while(<FOO>) {
                ## Check for duplicate phone number
                /^Phone (.*)/m or die "No phone for record $.\n";
                $phone{$1}++ and $change{"$file$."}=-1 and next;
                ## If phone number cool, check the serial number:
                /^SERIAL NUMBER (\d+)/ or die "Bad serial number for record $.\n";
                ## Make sure it is unique, if not, add one
                if ($serial{$1}++) {
                    $number=$1;
                    1 while $serial{++$number};
                    $serial{$number}++;
                    $change{"$file$."}=$number;
                }
            }
            close(FOO);
        } ## go to next file in the list

        ## Loop through them all again, this time for keeps :)
        open(NEWFOO, ">$newfoo") or die "Could not create $newfoo: $!\n";
        for $file (@files) {
            open(FOO, "$file") or die "Could not open $file: $!\n";
            while(<FOO>) {
                if ($change{"$file$."}) {
                    $change{"$file$."}==-1 and next;
                    s/^SERIAL NUMBER (\d+)/SERIAL NUMBER $change{"$file$."}/;
                }
                print NEWFOO $_;
            }
            close(FOO);   ## close so $. restarts for the next file, as in the first pass
        }
        close(NEWFOO);
    } ## end generic loop
RE: Efficiency and Large Arrays
by lhoward (Vicar) on Jul 23, 2000 at 01:46 UTC
    If you have the memory to spare and efficiency is of utmost importance, then by all means load as much as you can into memory. If memory is tight, or you can deal with longer runtimes (and more disk IO), then store the data in some sort of persistent storage (SQL DB, hash tied to a DBM file, etc...). This is the classic space vs. performance tradeoff. You are going to have to decide which is more important to you in your particular circumstances.
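    For example, a minimal sketch of the tied-DBM route (DB_File is just one choice; SDBM_File or GDBM_File work the same way through tie, and the phone-number key here is only an illustration):

    use DB_File;

    my %phone;
    ## %phone now lives on disk in phone.db rather than in memory
    tie %phone, 'DB_File', 'phone.db'
        or die "Could not tie phone.db: $!\n";

    $phone{'555-5555'}++;                          # written through to disk
    print "duplicate\n" if $phone{'555-5555'} > 1;

    untie %phone;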
      Well, it all gets loaded into memory (currently), which is fine by me. Let it gobble the memory if it's necessary. But the issue is that the entire process is still too slow.
      *scratches head*
        I would recommend using a profiler like the Devel::DProf module to profile your code and try to isolate the bottlenecks.
Some solutions I've implemented
by Kozz (Friar) on Jul 23, 2000 at 19:00 UTC
    My dear monks:
    Thank you so much for all the input and opinions. I'm learning more every day! Since reading all these comments, I've re-written my script to parse one record at a time ("paragraph mode", as suggested by Splinky) from each file, discarding each record once it's been printed to STDOUT. I've also created a smaller hash to store only a few needed bits of info. I've also taken the suggestion to simply re-assign the SERIAL numbers as I encounter each record, starting from zero and incrementing once per loop, since it's not necessary that these are sorted in any fashion (why didn't I think of that before??). I think I'm also going to use a hash whose keys are the phone numbers, as suggested above -- this should help things along as well. Since the script is still somewhat slow despite the changes I've made, I'll make some more modifications and perhaps even write to a hash tied to a DBM, as suggested by lhoward. Thanks again, most wise monks!
    --Kozz
Re: Efficiency and Large Arrays
by Anonymous Monk on Jul 23, 2000 at 09:31 UTC

    Since any of the serial numbers could potentially be duplicated and you will change one of the duplicates to another arbitrary number it seems the actual number is not really important. Why not just reallocate the serial number on _every_ record as you read from top to bottom? Start from 1. This gets rid of the need to worry about duplicated serial nos and allows a single pass from top to bottom.

    To handle duplicated phone_numbers, keep a hash of just the phone numbers (as key). When you read a record, if the phone number key is already in the hash don't print the record out, otherwise set the hash value for that phone no to 1 and print the record.
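    A minimal sketch of that single pass (untested; paragraph-mode records and the field layout from the original post assumed, with the input files given on the command line):

    #!perl
    use strict;

    $/ = '';             # one record per read
    my %phone;
    my $count = 0;

    while (my $record = <>) {
        my ($number) = $record =~ /^Phone (.*)/m;
        next if $phone{$number}++;      # phone already seen: drop the record
        $record =~ s/^SERIAL NUMBER \d+/"SERIAL NUMBER " . ++$count/e;
        print $record;                  # renumbered from 1 upwards
    }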

      I agree with you, that's what I was mentioning in my post previously. You can also reduce memory further if you presort the data on phone numbers. Then you don't have to keep a hash at all.

      Ciao,
      Gryn

Re: Efficiency and Large Arrays
by turnstep (Parson) on Jul 25, 2000 at 23:02 UTC

    A side note to those who recommend sorting first: yes, it will help immensely if you are able to do so, but keep in mind that the sorting itself will be fairly expensive, especially for files this size. In addition, a simple unix sort will not be up to the task (without some further trickery), as each record spans multiple lines (and the data spans more than one file), and the sort keys may be in different spots in each record. Not a challenge for perl, of course, but it probably cancels out any speed benefit of having it pre-sorted.

      Yes, it would be "expensive", but if you don't have the memory to keep all the records in, then sorting is not only the faster way to go, it's the only sane way to go. Sorting doesn't require large amounts of memory (though simple sorting would), and perl could do the presorting itself. For small files presorting would be a waste of both your time and CPU time; however, as the input files grow, simply sorting your input beforehand can do wonders.

      Ciao,
      Gryn

        Yes, simple sorting requires a lot of memory, but even a more complex sort requires that at the very least you read each record once. Why not just grab the information while you are there, as in my example program above? I agree that a sorted file is far better, and you'd probably want to get it sorted at some point, but I would not call it the only "sane" way! :)