PerlMonks

Re: Memory utilization and hashes

by bfdi533 (Friar)
on Jan 17, 2018 at 21:44 UTC ( #1207436 )


in reply to Memory utilization and hashes

code updated and tested

#!/usr/bin/perl
use warnings;
use strict;
$|++;
use JSON;

my $l;
my @vals;
my $json;
my %pairs;

while (<>) {
    $l = $_;
    chomp $l;
    @vals = split /;/, $l;
    if ($vals[0] =~ /Query/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
    }
    elsif ($vals[0] =~ /Answer/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
        $json = encode_json $pairs{$vals[1]};
        print $json . "\n";
        delete $pairs{$vals[1]};
    }
}
[root@hadron ~]# ./t-1207429.pl t-1207429.txt
{"ip":"1.2.3.4","host":"www.example.com"}
{"ip":"2.3.4.5","host":"www.cnn.com"}
{"ip":"3.4.5.6","host":"www.google.com"}

The real question is: when running this against a 100 GB file with >500,000 hash entries, will delete actually reduce the size of the hash or not?

Or is there a leaner way to do this?

Replies are listed 'Best First'.
Re^2: Memory utilization and hashes
by pryrt (Monsignor) on Jan 17, 2018 at 21:57 UTC

    delete will definitely reduce the size of the hash, because every time you get a first answer for a given query, it will delete the entire entry for that query. Of course, if there's a second answer for the query, it cannot find the entry for the query, so it creates it again, without the host key.

    You might want to expand your example data to include a sample with more than one response (out of order) for the same query (for example, query 2, with two or three rows of answers), and display the output. Then tell us what you want the real output to be, given that set of data. Something like:

    Query;1;host;www.example.com
    Answer;1;ip;1.2.3.4
    Query;2;host;www.cnn.com
    Query;3;host;www.google.com
    Answer;2;ip;2.3.4.5
    Answer;3;ip;3.4.5.6
    Answer;2;ip;9.8.7.6
    Answer;2;ip;5.4.3.2
    -----------------------
    {"host":"www.example.com","ip":"1.2.3.4"}
    {"ip":"2.3.4.5","host":"www.cnn.com"}
    {"ip":"3.4.5.6","host":"www.google.com"}
    {"ip":"9.8.7.6"}
    {"ip":"5.4.3.2"}

    Also, for debugging, add print "DEBUG: ", encode_json \%pairs; just before the end of the while loop: that will let you watch the hash grow and shrink, and will tell you whether or not it's doing the right thing.

      Right, so it is much more complicated in my real code. I create an array for the multiple answers, and I do some funky checks to print out the info because the index number can be reused. So, say index 2 has an answer provided; then 2 can be re-used in another query. I then dump what is left of the hash at the end of the code for those items that did not get re-used and replaced.

      Like I said, it is really messy in "real life".

      I will provide example code that is closer to my real code shortly, but my real question is, I suppose, whether a hash is the right way to do this after all, due to memory issues and such.

        I did try using Devel::Size to see if the memory actually goes down, so I am writing the size of the hash to a log file every time I "dump" a line, and the size has never decreased in my testing.

        Here is an example. The first column is the line count into the file being processed, the second is the index (equivalent to $vals[1]), and the last is the size of the %pairs hash. Here the size is 122MB for the %pairs hash ...

        ...
        424872: e5c651161 (122480629)
        424875: 6d6148148 (122481928)
        424886: 108038067 (122484667)
        424890: 4db238067 (122487257)
        424892: 502c57487 (122488556)
        424895: c53c57539 (122489855)
        424896: 578757487 (122489855)
        424923: 300959147 (122495178)
        424928: a9bb41168 (122496165)
        424936: dfc243245 (122499555)
        424937: 0a9534098 (122499555)
        424944: 666b34098 (122501654)
        424954: 494949982 (122504073)
        424956: 182939296 (122505372)
        424960: c1ad46207 (122507962)
        424962: 3d1249982 (122507962)
        424968: 3c1336561 (122512355)
        424974: b24939296 (122514993)
        424987: 3c7b36561 (122517700)
        424998: eb1544993 (122520311)
        425005: 818a49369 (122521727)
        ...
Re^2: Memory utilization and hashes
by QM (Parson) on Jan 26, 2018 at 10:16 UTC
    I don't think delete shrinks the hash per se. Some hash admin is performed to mark entries unused, etc., and some linked memory (references) may become free.

    But the only way to shrink the hash is to make a new hash, and copy over the "trimmed" old hash, and then throw away the old hash.

    You should be able to make a test case for this, showing the size of a hash does not shrink after deletes, and that total process memory doesn't shrink, but only grows. It is up to you and Perl to make efficient use of an ever growing pile of memory allocated by the OS.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of
