Memory utilization and hashes

by bfdi533 (Friar)
on Jan 17, 2018 at 20:53 UTC

bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I have some code which reads from a file (sometimes 100+ GB) and has to combine rows to create a consolidated output. I used to process the entire file into a hash and then dump the hash at the end of the program.

The problem with that, of course, was that with the very large files the hash would grow humongous and the program would consume all the memory in the system, causing it to crash.

So, trying to solve this problem, I changed the code to output the data as it went, doing my best to make sure that I got all of the row data for consolidation, and then did a delete on the hash, thinking I was clearing up memory. But this does not appear to be the case. Example code:

my $l;
my @vals;
my $json;
while (<>) {
    $l = $_;
    chomp $l;
    @vals = split /;/, $l;
    if ($vals[0] =~ /Query/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
    } elsif {$vals[0] =~ /Answer/) {
        $pairs{$vals[1}{$vals[2]} = $vals[3];
        $json = encode_json $pairs{$vals[1]};
        print $json."\n";
        delete $pairs{$vals[1]};
    }
}
Example data:
Query;1;host;www.example.com
Answer;1;ip;1.2.3.4
Query;2;host;www.cnn.com
Query;3;host;www.google.com
Answer;2;ip;2.3.4.5
Answer;3;ip;3.4.5.6

Does delete actually remove the storage from the hash?

Does the memory the hash is using actually get reduced after delete?

Is there a better way to do this?

Code updated above per the first reply.

Replies are listed 'Best First'.
Re: Memory utilization and hashes
by Laurent_R (Canon) on Jan 17, 2018 at 23:08 UTC
    Given all that you've said so far, especially that it seems you can never be sure you have collected all the answers for a given query, I think I would probably go for a completely different approach.

    I would use the OS's sort utility to reorganize the input file, sorting on the id number (second field). I would then read all the records for a given id number (storing them in an array or a hash), collect the information from the query record and use it to process the answer records. Once I've finished processing an id number, clear the data structures and start again with the next id number's lines.

    This way, the memory usage of your Perl program will be limited by the maximum number of lines there can be for one id number. (Of course, the sort phase will use a lot of memory, but the *nix sort utilities know how to handle that well: they write temporary data to disk to avoid memory overflow.)

    Sorting your large file will take quite a bit of time, but at least you're guaranteed never to exceed your system's available memory.
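    A minimal sketch of that first step, assuming the id really is the second semicolon-separated field and a GNU-style sort(1) is available (the temporary directory is just an example):

#!/usr/bin/perl
use strict;
use warnings;

# Let the operating system group all records for one id together.
#   -t ';'   field separator
#   -k2,2n   sort numerically on the second field (the id)
#   -T /tmp  directory where sort may spill temporary files (just an example)
my $file = shift @ARGV or die "usage: $0 input.txt\n";
open my $sorted, '-|', 'sort', '-t', ';', '-k2,2n', '-T', '/tmp', $file
    or die "cannot start sort: $!";

while (my $line = <$sorted>) {
    chomp $line;
    # all lines for one id now arrive consecutively,
    # so they can be collected and processed one id at a time
}
close $sorted or die "sort reported failure: $?";

    Running the sort with LC_ALL=C usually speeds it up noticeably, since it avoids locale-aware comparisons.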

    An alternative would be to use a database, but I doubt it would be faster.

      Turns out that the Unix sort was exactly the prior step that was missing to help speed this up. With a correct choice of keys, the file is now in sequential order by "ID", and when a new Query comes in it is easy to check whether the current "ID" equals the prior "ID", flush any accumulated hash entries, and continue. In testing, this keeps the hash to no more than 3-7 'extra' keys for each set of "ID"s in the file before the set is dumped.

      Memory usage has stayed small and the processing is now approx 1/4 the total time of the prior runs.
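      In case it is useful to anyone else, the flush-on-ID-change loop I described boils down to something like this (a simplified sketch that assumes the input is already sorted by the ID field; my real code does more bookkeeping):

#!/usr/bin/perl
use strict;
use warnings;
use JSON;

my %pairs;
my $prev_id;

while (<>) {                 # input is already sorted by the ID field
    chomp;
    my ($type, $id, $key, $val) = split /;/, $_, 4;
    if (defined $prev_id && $id ne $prev_id) {
        # the prior ID can no longer gain rows, so emit it and free the memory
        print encode_json($pairs{$prev_id}), "\n";
        delete $pairs{$prev_id};
    }
    if ($type eq 'Query') {
        $pairs{$id}{$key} = $val;
        $pairs{$id}{id}   = $id;
    } elsif ($type eq 'Answer') {
        push @{ $pairs{$id}{Answer} }, { $key => $val };
    }
    $prev_id = $id;
}
# whatever is left after the last line
print encode_json($pairs{$_}), "\n" for keys %pairs;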

        What does this sample of data you provided look like after the *nix sort?

        Query;1;host;www.example.com
        Answer;1;ip;1.2.3.4
        Query;2;host;www.cnn.com
        Query;3;host;www.google.com
        Answer;2;ip;2.3.4.5
        Answer;2;ip;2.3.4.5
        Query;4;host;www.google.com
        Answer;4;ip;3.4.5.6
        Answer;3;ip;3.4.5.6
        Query;2;host;www.example2.com
        Answer;4;ip;1.2.4.5
        Answer;2;ip;2.3.4.5
        poj

        For what is is worth, and if anyone is interested, here are some stats from the processing after I introduced the *nix sort before my perl script.

         elapsed time    | type      |rows after| rows before| pct   | rows/second 
                         |           |processing| processing |smaller| 
         00:03:05.98667  | dns       |  1791555 |    4614653 | 38.82 | 24811.7405403301
         00:03:50.106203 | dns       |  2262736 |    5822777 | 38.86 |  25304.737221708
         00:04:51.91195  | dns       |  2733705 |    7039758 | 38.83 | 24116.0322487654
         00:05:36.348691 | dns       |  3208365 |    8266995 | 38.81 | 24578.6447850335
         00:06:33.947878 | dns       |  3683419 |    9490938 | 38.81 | 24091.8622234589
         00:07:35.58667  | dns       |  4155971 |   10705249 | 38.82 | 23497.7221787459
         00:08:25.086565 | dns       |  4633553 |   11946401 | 38.79 | 23652.1852447214
         00:09:07.952743 | dns       |  5109618 |   13183845 | 38.76 | 24060.1861536808
         00:10:16.250404 | dns       |  5596902 |   14441405 | 38.76 | 23434.3132373833
         00:10:54.578348 | dns       |  6070888 |   15662586 | 38.76 | 23927.7483709253
         00:11:39.012952 | dns       |  6547181 |   16896184 | 38.75 | 24171.4891714911
         00:12:43.13814  | dns       |  7019314 |   18113219 | 38.75 | 23735.1772249255
         00:13:34.23578  | dns       |  7499659 |   19365386 | 38.73 | 23783.5114541392
         00:14:35.939246 | dns       |  7973633 |   20591767 | 38.72 | 23508.2137191967
         00:15:12.223167 | dns       |  8448494 |   21815382 | 38.73 | 23914.5231004641
         00:15:52.951662 | dns       |  8923786 |   23043433 | 38.73 | 24181.1142357817
         00:17:45.637116 | dns       |  9402613 |   24278649 | 38.73 | 22783.2238906363
         00:17:52.402055 | dns       |  9880079 |   25516948 | 38.72 | 23794.1990888856
        
        
Re: Memory utilization and hashes
by BrowserUk (Patriarch) on Jan 17, 2018 at 21:38 UTC

    Your posted code will not run. You have an error in a variable name here: chomp $;. You have unbalanced [] here: %pairs{$l[1}{$l[2]} = $l[3];. And hash element references should start with $, not %.

    In addition, you assign $_ to $l, use it as a scalar: @vals = split /;/, $l;, and then index it as an array: %pairs{$l[1]}{$l[2]} = $l[3];

    Use strict; use warnings; Only post code that compiles.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re: Memory utilization and hashes
by Laurent_R (Canon) on Jan 17, 2018 at 21:27 UTC
    I don't understand your code.
while (<>) {
    $l = $_;
    chomp $;                               # you probably want to chomp $l, or possibly $_ (but you no longer use $_), but not $
    @vals = split /;/, $l;                 # you split your line into @vals, but no longer use that variable. Besides,
                                           # declaring @vals with my would be good practice
    if ($l =~ /Query/) {                   # you could use something like: if $vals[0] eq "Query"
        %pairs{$l[1]}{$l[2]} = $l[3];      # where are $l[1], $l[2] and $l[3] coming from? Also, %pairs{...} is probably a syntax error.
    } elsif {$l =~ /Answer/) {             # again, you could use: if $vals[0] eq "Answer". Also, "elsif {..." is a syntax error.
        %pairs{$l[1}{$l[2]} = $l[3];       # again, where are $l[1], $l[2] and $l[3] coming from? Also a syntax error.
        $json = encode_json $pairs{$l[1]}; # given the previous code, I doubt that you really want to encode $pairs{$l[1]}
        print $json."\n";                  # is your intent to print to the screen?
        delete $pairs{$l[1]};              # not sure it's needed, since you just reuse the same variable in the next iteration
    }
}
    Also, I don't understand what's going on when you have two queries or two answers in a row, as in your data example.

    With the code you're showing, the hash should not grow significantly, even without the call to delete. (Update: but this is no longer true with the updated code posted below.)

      Sorry for the typos in the code; fixing them.

      My actual data consists of data from several hundred MB to several hundred GB so that sample data set is just a sample of the sort of thing I am processing.

      The two queries and two answers in a row is what my real world data contains. Specifically, there can be anywhere from 1 to n answers for each query, the queries and answers occur in any order, and the only guarantee is that each answer will follow (sometime later) the query it goes with.

      Max rows in files to process = 31291204, average lines in files 8707186.

        Just to keep track:
my $l;                                          # all these three variables should probably better be declared within the
my @vals;                                       # while loop. Only %pairs probably needs to be declared before the while
my $json;
while (<>) {
    $l = $_;
    chomp $l;
    @vals = split /;/, $l;
    if ($vals[0] =~ /Query/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];  # %pairs isn't declared anywhere
    } elsif {$vals[0] =~ /Answer/) {            # syntax error: elsif { should be elsif (
        $pairs{$vals[1}{$vals[2]} = $vals[3];
        $json = encode_json $pairs{$vals[1]};   # what do you think is the content of $pairs{$vals[1]}? Probably not what you want to encode.
        print $json."\n";
        delete $pairs{$vals[1]};
    }
}
        This will still not compile.

        Do yourself a favor. Use the following pragmas:

        use strict; use warnings;
        specifically there can be anywhere from 1 to n answers for each query
        Then you can't delete your hash entries as you go, because when a second answer comes in for a given query, you no longer have the information from the query available.
        Even with the fixes that you did in the original post, you still have several syntax errors.
Re: Memory utilization and hashes
by poj (Abbot) on Jan 17, 2018 at 22:11 UTC

    Do you need to store the answers?

#!perl
use strict;
use warnings;
use JSON;

my %host = ();
while (<DATA>) {
  chomp;
  my @f = split /;/, $_;
  if ($f[0] eq 'Query') {
    $host{$f[1]} = $f[3];
  } elsif ($f[0] eq 'Answer') {
    my $json = encode_json { host=>$host{$f[1]}, $f[2]=>$f[3] };
    print $json."\n";
    delete $host{$f[1]};
  }
}
__DATA__
Query;1;host;www.example.com
Answer;1;ip;1.2.3.4
Query;2;host;www.cnn.com
Query;3;host;www.google.com
Answer;2;ip;2.3.4.5
Answer;3;ip;3.4.5.6
    poj

      Since I need all of the query and answers info on one line in the output, yes, I need to collect them up until I have all of the answers.

      Here is a more closely working example of the code. I was trying to keep it simple and focus on the memory usage of the hash, but here we are.

#!/usr/bin/perl
use warnings;
use strict;
$|++;

use JSON;

my $l;
my @vals;
my $json;
my %pairs;
my %pind;
my %flush;

while (<DATA>) {
    $l = $_;
    chomp $l;
    @vals = split /;/, $l;
    if ($vals[0] =~ /Query/) {
        if (! $pairs{$vals[1]}) {
            $pind{$vals[1]} = 0;
        }
        if (!defined $flush{$vals[1]}) {
            $flush{$vals[1]} = " ";
        } elsif ($flush{$vals[1]} ne $vals[1]) {
            $json = encode_json $pairs{$vals[1]};
            print "DEBUG: Flushing \"complete\" answer\n";
            print $json."\n";
            delete $pairs{$vals[1]};
            $flush{$vals[1]} = $vals[1];
            $pind{$vals[1]} = 0;
        }
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
        $pairs{$vals[1]}{id} = $vals[1];
    } elsif ($vals[0] =~ /Answer/) {
        $pairs{$vals[1]}{$vals[0]}[$pind{$vals[1]}++]{$vals[2]} = $vals[3];
    }
}
print "DEBUG: output remaining data ...\n";
foreach my $key (keys %pairs) {
    $json = encode_json $pairs{$key};
    print $json."\n";
}
__DATA__
Query;1;host;www.example.com
Answer;1;ip;1.2.3.4
Query;2;host;www.cnn.com
Query;3;host;www.google.com
Answer;2;ip;2.3.4.5
Answer;2;ip;2.3.4.5
Query;4;host;www.google.com
Answer;4;ip;3.4.5.6
Answer;3;ip;3.4.5.6
Query;2;host;www.example2.com
Answer;4;ip;1.2.4.5
Answer;2;ip;2.3.4.5
      Results in:
DEBUG: Flushing "complete" answer
{"Answer":[{"ip":"2.3.4.5"},{"ip":"2.3.4.5"}],"id":"2","host":"www.cnn.com"}
DEBUG: output remaining data ...
{"Answer":[{"ip":"3.4.5.6"},{"ip":"1.2.4.5"}],"id":"4","host":"www.google.com"}
{"Answer":[{"ip":"1.2.3.4"}],"id":"1","host":"www.example.com"}
{"Answer":[{"ip":"3.4.5.6"}],"id":"3","host":"www.google.com"}
{"Answer":[{"ip":"2.3.4.5"}],"id":"2","host":"www.example2.com"}

        Same idea using one hash.

#!/usr/bin/perl
use strict;
use warnings;
use JSON;

my %query = ();
while (<DATA>) {
  chomp;
  next unless /\S/;   # skip blank lines
  my ($s1,$n,$s2,$v2,undef) = split ';',$_,5;
  if ($s1 eq 'Query') {
    if (exists $query{$n}){   # print and reuse
      output($n);
    }
    $query{$n} = [$v2];
  } elsif ($s1 eq 'Answer') {
    push @{$query{$n}},$v2;
  }
}
# remaining
output($_) for keys %query;

sub output {
  my $n = shift;
  my $host = shift @{$query{$n}};
  print encode_json { id=>$n, host=>$host, ip=>$query{$n} };
  print "\n";
}
        poj
Re: Memory utilization and hashes
by karlgoethebier (Abbot) on Jan 18, 2018 at 10:19 UTC
    "... 100+ GB ...combine rows...consolidated output..."

    Life is hard - so perhaps you'd better go with SQLite?

    See also Re: Reading HUGE file multiple times and Limits In SQLite.

    Best regards, Karl

    P.S.: And remember:

#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use Try::Tiny;

# say $0;

try { ...; } catch { say $_ }

__END__

karls-mac-mini:playground karl$ ./bfdi533.pl
Unimplemented at ./bfdi533.pl line 10.

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Re: Memory utilization and hashes
by bfdi533 (Friar) on Jan 17, 2018 at 21:44 UTC

    Code updated and tested:

#!/usr/bin/perl
use warnings;
use strict;
$|++;

use JSON;

my $l;
my @vals;
my $json;
my %pairs;

while (<>) {
    $l = $_;
    chomp $l;
    @vals = split /;/, $l;
    if ($vals[0] =~ /Query/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
    } elsif ($vals[0] =~ /Answer/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
        $json = encode_json $pairs{$vals[1]};
        print $json."\n";
        delete $pairs{$vals[1]};
    }
}
[root@hadron ~]# ./t-1207429.pl t-1207429.txt
{"ip":"1.2.3.4","host":"www.example.com"}
{"ip":"2.3.4.5","host":"www.cnn.com"}
{"ip":"3.4.5.6","host":"www.google.com"}

    The real question is whether, when running this against a 100 GB file with >500000 hash entries, delete will actually reduce the size of the hash or not?

    Or is there a leaner way to do this?

      delete will definitely reduce the size of the hash, because every time you get a first answer for a given query, it will delete the entire entry for that query. Of course, if there's a second answer for the query, it cannot find the entry for the query, so it creates it again, without the host key.

      You might want to expand your example data to include a sample with more than one response (out of order) for the same query (for example, query 2, with two or three rows of answers), and display the output. Then tell us what you want the real output to be, given that set of data. Something like:

Query;1;host;www.example.com
Answer;1;ip;1.2.3.4
Query;2;host;www.cnn.com
Query;3;host;www.google.com
Answer;2;ip;2.3.4.5
Answer;3;ip;3.4.5.6
Answer;2;ip;9.8.7.6
Answer;2;ip;5.4.3.2
-----------------------
{"host":"www.example.com","ip":"1.2.3.4"}
{"ip":"2.3.4.5","host":"www.cnn.com"}
{"ip":"3.4.5.6","host":"www.google.com"}
{"ip":"9.8.7.6"}
{"ip":"5.4.3.2"}

      Also, for debugging, add print "DEBUG: ", encode_json \%pairs; just before the end of the while loop: that will let you watch the hash grow and shrink, and will tell you whether or not it's doing the right thing.

        Right, so it is much more complicated in my real code. I create an array for the multiple answers and am doing some funky checks to print out the info, because the index number can be reused. So, say index 2 has an answer provided; then 2 can be re-used in another query. I then dump what is left of the hash at the end of the code for those items that did not get re-used and replaced.

        Like I said, it is really messy in "real life".

        I will provide example code that is closer to my real code shortly, but my real question is, I suppose, whether a hash is the right way to do this at all, given the memory issues and such.

      I don't think delete shrinks the hash per se. Certain hash admin is performed to mark hash entries unused, etc. Some linked memory (references) may become free.

      But the only way to shrink the hash is to make a new hash, and copy over the "trimmed" old hash, and then throw away the old hash.

      You should be able to make a test case for this, showing the size of a hash does not shrink after deletes, and that total process memory doesn't shrink, but only grows. It is up to you and Perl to make efficient use of an ever growing pile of memory allocated by the OS.
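      Something along these lines shows it (a sketch, assuming the Devel::Size module from CPAN is installed; the key counts are arbitrary):

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(size total_size);

# size() measures just the hash structure; total_size() also follows the values
my %h;
$h{$_} = { host => "host$_.example.com", ip => "10.0.0.$_" } for 1 .. 100_000;
printf "filled:        size %10d   total_size %10d\n", size(\%h), total_size(\%h);

delete $h{$_} for 1 .. 50_000;
printf "after delete:  size %10d   total_size %10d\n", size(\%h), total_size(\%h);

my %copy = %h;   # rebuild into a fresh hash, as suggested above
printf "fresh copy:    size %10d   total_size %10d\n", size(\%copy), total_size(\%copy);

      How much the numbers drop after the deletes tells you what delete really gives back; process memory as reported by the OS is a separate question, since perl generally keeps freed memory around for reuse.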

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Memory utilization and hashes
by pwagyi (Monk) on Jan 18, 2018 at 02:39 UTC
    I think it may be appropriate to use a database.
Re: Memory utilization and hashes
by QM (Parson) on Jan 26, 2018 at 10:11 UTC
    I have used DBM::Deep to store native Perl hashes on disk persistently. And hashes of hashes, and hashes of arrays of ... you get the idea. It solves your problem.

    It will be some factor slower (say, 5-10x) because of disk writes. There is a maximum file size, so depending on your data, you may need multiple subhashes each mapped to its own file.

    But if your problem is easily solved another way, staying in memory, you'll probably be happier.
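    A rough sketch of how that would plug in here, assuming DBM::Deep is installed (the file name is made up):

#!/usr/bin/perl
use strict;
use warnings;
use DBM::Deep;

# the %pairs-style data lives on disk in this file instead of in RAM;
# nested hashes and arrays are stored transparently
my $db = DBM::Deep->new("pairs.db");

while (<>) {
    chomp;
    my ($type, $id, $key, $val) = split /;/, $_, 4;
    if ($type eq 'Query') {
        $db->{$id}{$key} = $val;
    } elsif ($type eq 'Answer') {
        push @{ $db->{$id}{Answer} }, { $key => $val };
    }
    # delete $db->{$id} works just like on a normal hash
}

    Note that the file itself does not shrink when you delete entries; if I recall correctly, DBM::Deep has an optimize method to rebuild and compact it.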

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Memory utilization and hashes
by ikegami (Patriarch) on Jan 18, 2018 at 19:41 UTC

    You could use a database (like SQLite).

    Upd: Woops, I just noticed someone already suggested this.

Re: Memory utilization and hashes
by Anonymous Monk on Jan 18, 2018 at 15:39 UTC
    SQLite is exactly what I would recommend in this case: "it's just a disk file," but it's ideally suited to this sort of thing. You can import data very rapidly into an SQLite table, and you can also use its ATTACH DATABASE feature to work with more than one database (file ...) at a time. It has a very fast indexer and a good query engine, and it won't blink at all when dealing with this number of rows. And, since you can easily use them with spreadsheets and so-forth, you might well find that your need for custom programming is severely reduced or even eliminated. Hands down, this is the way I would do this.
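    For instance, loading the raw rows into a single table with DBI and DBD::SQLite could look roughly like this (a sketch; the database file, table and column names are made up):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=records.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do("CREATE TABLE IF NOT EXISTS records (rtype TEXT, id INTEGER, attr TEXT, val TEXT)");
my $ins = $dbh->prepare("INSERT INTO records VALUES (?,?,?,?)");

while (<>) {
    chomp;
    $ins->execute(split /;/, $_, 4);   # rtype, id, attr, val
}
$dbh->commit;

# all answers, grouped by query id, without holding anything large in RAM
my $sth = $dbh->prepare(
    "SELECT id, attr, val FROM records WHERE rtype = 'Answer' ORDER BY id");
$sth->execute;
while (my ($id, $attr, $val) = $sth->fetchrow_array) {
    print "$id: $attr=$val\n";
}
$dbh->disconnect;

    From there, the grouping into one JSON object per id can be done either in the query itself or while walking the sorted result set.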

      Not a bad thought but you might notice that I had an array in my hash which I needed in the JSON output:

      {"Answer":[{"ip":"3.4.5.6"},{"ip":"1.2.4.5"}],"id":"4","host":"www.goo +gle.com"}

      This is certainly doable in a database (SQLite or PostgreSQL) but would involve another table and then a complicated query to get it into the proper format to make it into JSON.

      Not as easy as it sounds in my specific use case, but certainly something I had considered at one point.

      Thanks for the pointer in this direction and the friendly reminder.

