Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

How to create a compact data structure?

by bart (Canon)
on Dec 19, 2006 at 20:43 UTC ( [id://590773]=perlquestion: print w/replies, xml ) Need Help??

bart has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that builds a large data structure (a hash of records), and I notice that as the data set grows, it's beginning to slow down due to its large memory consumption. So now I'm trying to restrict the amount of memory it needs, by storing as minimal as necessary, and as compact as possible. The record structure is equivalent to:
$record{$key} = { count => $integer, flag => $boolean }
Populating the hash goes something like:
foreach my $item (LARGE_LIST) { $key = property($item); $record{$key}{count}++; if(condition_is_true($item)) { $record{$key}{flag} = 1; } }
Later I'll be only interested to extract those keys for which the flag is set, and the count is 2 or more.

Now, how (and by how much) can I restrict the memory demands more? I think an array instead of a hash for the record structure could help a little, but probably not much.

I've also thought of using a string for the record and vec, but the equivalent of the above probably won't be this simple or as fast, for example, vec prefers that the string is defined — no depending on autovivification.

Any other ideas?

Replies are listed 'Best First'.
Re: How to create a compact data structure?
by Corion (Patriarch) on Dec 19, 2006 at 20:51 UTC

    My approach with such things is to use a tied hash (Berkeley DB in my case), and to pack and unpack the flag and count. It works well enough and is simple enough so I don't make ugly programming errors:

    my $mask = "NN"; foreach my $item (LARGE_LIST) { $key = property($item); my $val = $record{ $key } || pack $mask, 0,0; my ($count, $flag) = unpack $mask, $val; $count++; $flag ||= condition_is_true($item); $record{ $key } = pack $mask, $count, $flag; }

    Even without using a tied hash, you'll almost cut your memory use by half as you don't need to allocate the anonymous hash and the scalar for the flag anymore.

      Since the flag is boolean, why allocate a long for it? A byte should do.
      my $mask ="NC";

      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      My approach with such things is to use a tied hash (Berkeley DB in my case)

      Can you give me an idea on the speed of using a Berkeley DB? Before even bothering to try installing it, I'd like to know by how much it would slow down.

        Can you give me an idea on the speed of using a Berkeley DB?

        Not without profiling your application. You do take the tie hit on hash accesses, but if your working set is so large that it won't fit into memory at once, you avoid thrashing your virtual memory system.

        No matter what approach you take, you'll likely spend more time managing your data structures in an appropriate format, so you'll always pay a penalty over the naïve approach. Your goal is to amortize that over all operations to avoid a sudden slowdown at the end.

Re: How to create a compact data structure?
by perrin (Chancellor) on Dec 19, 2006 at 20:56 UTC
    You can get rid of the hash refs, and use one big hash with exists for the flags. Then you can just assign undef for true, which is cheap.
    foreach my $item (LARGE_LIST) { $key = property($item); $record{$key}++; if(condition_is_true($item)) { $flag{$key} = undef; } } # later if (exists $flag{$key}) { # flag is on }
      This is pretty clever, especially since more recent perls allegedly store common keys for different hashes in one, single place.

      I wonder if (and how) you can reliably use Devel::Size to measure how much memory is used by both hashes...?

        I don't know about 'reliable' but 'how' should work like this:
        use Devel::Size qw(total_size); # code by perrin my $size_rec_hash = total_size(\%record); my $size_flag_hash = total_size(\%flag); print "Total size: $size_rec_hash + $size_flag_hash =", $size_rec_hash + $size_flag_hash, "\n";
        Update: I missed the point, please ignore.

        -- Hofmator

        Code written by Hofmator and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: How to create a compact data structure?
by GrandFather (Saint) on Dec 19, 2006 at 21:13 UTC

    Abstracting the hash helps play with different implementations. A pack/unpack and an array variant (commented out) are shown:

    use warnings; use strict; my @large_list = ( {key => 1, cond => 1}, {key => 2, cond => 0}, {key => 3, cond => 0}, {key => 3, cond => 0}, {key => 1, cond => 0}, ); my $record = bless {}; for my $item (@large_list) { $record->add ($item); } for (sort keys %$record) { print "Item $_ count ", $record->getCount ($_), ', flag ', $record +->getFlag ($_), "\n"; } sub add { my ($self, $item) = @_; my $key = $item->{key}; if (! exists $self->{$key}) { $self->{$key} = pack ('NN', 1, $item->{cond}); } else { my ($count, $flag) = unpack ('NN', $self->{$key}); $self->{$key} = pack ('NN', $count + 1, $flag || $item->{cond} +); } } sub getFlag { return (unpack ('NN', $_[0]->{$_[1]}))[1]; } sub getCount { return (unpack ('NN', $_[0]->{$_[1]}))[0]; } #sub add { # my ($self, $item) = @_; # my $key = $item->{key}; # # if (! exists $self->{$key}) { # $self->{$key} = [1, $item->{cond}] # } else { # $self->{$key}[0]++; # $self->{$key}[1] ||= $item->{cond}; # } #} # # #sub getFlag { # return $_[0]->{$_[1]}[1]; #} # # #sub getCount { # return $_[0]->{$_[1]}[0]; #}

    Prints:

    Item 1 count 2, flag 1 Item 2 count 1, flag 0 Item 3 count 2, flag 0

    DWIM is Perl's answer to Gödel

      Inspired by some of the other replies to the OP (especially samtregar's reply mentioning Devel::Size) here is a comparison of some of the techniques suggested:

      use warnings; use strict; use Devel::Size qw(total_size); use warnings; use strict; my @large_list = ( {key => 1, cond => 1}, {key => 2, cond => 0}, {key => 3, cond => 0}, {key => 3, cond => 0}, {key => 1, cond => 0}, ); my $recordU = bless {}; my $recordA = bless {}; my $recordI = bless {}; my %record; print "Empty sizes:\n"; print 'Packed: ', total_size ($recordU), "\n"; print 'Array: ', total_size ($recordA), "\n"; print 'int: ', total_size ($recordI), "\n"; print 'Hash: ', total_size (\%record), "\n"; for my $item (@large_list) { $recordU->addU ($item); $recordA->addA ($item); $recordI->addI ($item); $record{$item->{key}}{count}++; $record{$item->{key}}{flag} ||= $item->{cond}; } print "\nPopulated sizes:\n"; print 'Packed: ', total_size ($recordU), "\n"; print 'Array: ', total_size ($recordA), "\n"; print 'int: ', total_size ($recordI), "\n"; print 'Hash: ', total_size (\%record), "\n"; for (sort keys %$recordU) { print "Item $_ count ", $recordU->getCountU ($_), ', flag ', $reco +rdU->getFlagU ($_), "\n"; print "Item $_ count ", $recordA->getCountA ($_), ', flag ', $reco +rdA->getFlagA ($_), "\n"; print "Item $_ count ", $recordI->getCountI ($_), ', flag ', $reco +rdI->getFlagI ($_), "\n"; print "Item $_ count ", $record{$_}{count}, ', flag ', $record{$_} +{flag}, "\n"; }

      Prints:

      Empty sizes: Packed: 92 Array: 92 int: 92 Hash: 92 Populated sizes: Packed: 260 Array: 497 int: 209 Hash: 682 Item 1 count 2, flag 1 Item 1 count 2, flag 1 Item 1 count 2, flag 1 Item 1 count 2, flag 1 Item 2 count 1, flag 0 Item 2 count 1, flag 0 Item 2 count 1, flag 0 Item 2 count 1, flag 0 Item 3 count 2, flag 0 Item 3 count 2, flag 0 Item 3 count 2, flag 0 Item 3 count 2, flag 0

      DWIM is Perl's answer to Gödel
Re: How to create a compact data structure?
by BrowserUk (Patriarch) on Dec 20, 2006 at 12:45 UTC

    Reducing the storage requirement of the records will only get you so far.

    Here's an (approximate) table of storage usage(*) for the various options:
    # of records 1e5
    (~MB)
    1e6
    (~MB)
    2e6
    (~MB)
    ~ per record
    (bytes)
    record type:        
    HoHs 25 240 480 250
    HoAs 22 210 420 220
    HoPVs (5 bytes) 15 129 261 136
    HoPVs (3 bytes) 15 120 245 128
    HoIVs 11 90 180 94

    It was my fault!I apologise for the screwiness of the above table, but it displays just fine until PM gets it's hands on it. PM seems to think that there is an imbalanced <tr> tag somewhere, but it isn't in the html I posted!

    As you can see, moving to arrays doesn't save much. Moving to scalars save considerably more, but the type of scalar used is important.

    If you attempt to save space by packing the fields, most of the overhead is in the PV itself, so there is little more to be gained from using smaller representations for your fields. Ie. Using a template of 'nc' makes a negligable difference over using 'Nc'.

    Even if you decided that you could get away with only storing 3 bits--1 for the flag condition and 2 to represent 'none seen', '1 seen', 'more than 2 seen', you'd still need to store a minimum of 1 byte, and if you do that as a PV, you'd still need ~125 bytes per record.

    If you can represent your fields entirely numerically, then avoiding using pack will allow you to use IVs instead of PVs and that yeilds the greatest saving of all. In my example, I used the sign bit of the count to store the 'condition flag'.

    Using an HoIVs allows a 65% memory saving compared to a HOHs with marginal loss of performance. It becomes important to not do anything to cause the IVs to be converted to PVs (like printing them out), as this would negate the memory saving. This can be avoided: for example, by copying the scalar to another local scalar and then printing that.

    Any greater savings would probably mean avoiding using the top level hash entirely, which can be done in some circumstances but requires knowledge of the nature of the keys. Essentially, can they be hashed in some way that will allow you to index into (say) a large scalar bit field?

    The other alternative is to move the hash out of memory, with the performance hit that entails.


    Be aware that using Devel::Size for assessing your memory requirements has it's caveats.

    It will only report the final size of the Ho*; and not tell you how much extra intermediate strorage was required in order to reach that size. Each time the hash reaches it current allocation limits, a new hash has to be allocated (~double the size of the old one I think), and the data copied over. That means that you often need ~half as much storage again as the final hash requires in order to build that hash.

    Theoretically you can do a keys %hash = NNN to avoid some of the reallocation, but I've never worked out how to calculate the value you shoudl set it to.

    Also, Devel::Size can use a substantial amount of memory itself in order to calculate the size of a structure, which precludes using external tools (top/Task Manager)to get an accurate picture of the overall memory used.

    Test code:


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      (in the HoIVs variant):

      if( condition_is_true( $item ) ) { $records{ $key } = - $records{ $key }; }

      Minor nitpick: if condition happens to be true more than once, the flag / sign bit would act like a flip-flop (the electronic circuit, not the shoes style, that is :), i.e. being reset on every other turn of the condition being true. Not sure if this can happen in the OPs data, though.

      While being irrelevant for testing memory consumption, it might yield unexpected results when used as is in production code.

      Anyway, it's easy to fix:

      $records{ $key } = - $records{ $key } if $records{ $key } > 0;

      Other than that, ++ for your analysis and comments, of course :)

Re: How to create a compact data structure?
by samtregar (Abbot) on Dec 19, 2006 at 21:47 UTC
    As a side-note, you can use Devel::Size to get an accurate picture of how you're doing as you optimize. That will work for all in-memory structures.

    -sam

Re: How to create a compact data structure?
by almut (Canon) on Dec 19, 2006 at 22:09 UTC

    You could use a single integer to encode both the count and the flag, e.g. by using the lowest bit for the flag, and counting in steps of two. Then you'd only need one hash for the whole shebang.

    foreach my $item (LARGE_LIST) { $key = property($item); $record{$key} += 2; if(condition_is_true($item)) { $record{$key} |= 1; } }

    Then later, to extract $count and $flag, something like

    for my $key (keys %record) { my $count = $record{$key} >> 1; my $flag = $record{$key} & 1; print "$key\n" if $count >= 2 and $flag; # ... }

    Not the most readable and easily maintable code, but if size is all that matters...

Re: How to create a compact data structure?
by zentara (Archbishop) on Dec 20, 2006 at 12:43 UTC
    It sounds like this might be a good job for PDL. The first few chapters of the book “PDL – Scientific Programming in Perl” book online in HTML format explain the method of doing it. The main point of PDL is to convert data to compact efficient sequential arrays for fast processing.

    You would need to convert the hash to array of arrays, but it should be manageable. I've yet to really grasp the working of piddle slices, but you should be able to fly thru the data, look for the flag and count, then pull out it's piddle slice. Of course there are obstacles to overcome, like all data being numeric, so your ascii data will need to be converted to byte data types.


    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: How to create a compact data structure?
by sgt (Deacon) on Dec 19, 2006 at 22:16 UTC

    First something general. Maybe you can try DBM::Deep for persistent data. And try some home-made struct made with the help C::Binary::Convert (but not pure perl obviously). At least it can give you an idea of what you can save memory-wise and how fast you can go.

    Another idea (that came to me while previewing) is that maybe you could split your record into two - one for counts and one for the condition -- working out the intersection later

    hth --stephan
Re: How to create a compact data structure?
by mkirank (Chaplain) on Dec 20, 2006 at 17:12 UTC
     foreach my $item (LARGE_LIST) { Can you reduce some size from the LARGE_LIST (if they have similar values) by using Data::Reuse ?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://590773]
Approved by GrandFather
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2024-03-29 01:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found