Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re: Hash Search is VERY slow

by LanX (Saint)
on Sep 29, 2021 at 16:36 UTC ( [id://11137118]=note: print w/replies, xml ) Need Help??


in reply to Hash Search is VERY slow

My guess is also that your memory consumption leads to excessive swapping.

The general idea is presorting, for instance by iterating multiple times over the file and only processing one IP range after the other.

This is expensive in IO but will create smaller data structures.

Only if ...

... you really need all the data present in memory at once, consider breaking up the ranges into a tree of nested data structures and processing them in linear order.

like $hash->{'192'}{'168'}{'101'}{'208'} or $hash->{'192.168'}{'101.208'} instead of $hash->{'192.168.101.208'} °

If you now process all IPs in order , then Perl (well the OS) will be able to swap all memory-pages with unrelated sub-hashes out. This will be cheap because the number of swaps is minimized by the sorting. (see also Re: Small Hash a Gateway to Large Hash? )

An additional approach is using more compact data structures, hashes are efficient for sparse data. But if your IPs range from 0-255 an array is certainly more efficient.

Furthermore, there is no point in repeating URLs like "logmeinrescue.com" in your array, counting them is more memory efficient.

Anyway 800k lines input doesn't sound heavy though, not sure if we have the full picture (???)

edit

Like choroba already said, preloading the input completely into memory sounds like a waste of resources, you should check how much that costs.

OTOH if you decided to implement my initial idea to process one IP range after the other, it'll reduce IO if (and only if) all fits into memory.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

°) I'm aware that 192.168.*.* is very common

Replies are listed 'Best First'.
Re^2: Hash Search is VERY slow
by Tux (Canon) on Sep 30, 2021 at 07:59 UTC

    If the number of keys is the cause of the memory cunsumption, make the keys smaller and more efficient. It might matter if you are low on memory. Best reason to pack the IP's is the it is much easier to sort

    use List::Util qw( shuffle ); use Devel::Size qw( total_size ); my ($x, %ip0, %ip1, %ip2, %ip3, %ip4, %ip5) = 31; foreach my $i0 ((shuffle 1..254)[0..$x]) { foreach my $i1 ((shuffle 1..254)[0..$x]) { foreach my $i2 ((shuffle 1..254)[0..$x]) { foreach my $i3 ((shuffle 1..254)[0..$x]) { $ip0{"$i0.$i1.$i2.$i3"}++; $ip1{"$i0.$i1"}{"$i2.$i3"}++; $ip2{$i0}{$i1}{$i2}{$i3}++; $ip3{pack "CCCC", $i0, $i1, $i2, $i3}++; $ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3}++; $ip5{pack "C", $i0}{pack "C", $i1}{pack "C", $i2}{pack "C", $i3}++ +; }}}} printf "{192.168.101.208} : %9d\n", total_size (\%ip0); printf "{192.168}{101.208} : %9d\n", total_size (\%ip1); printf "{192}{168}{101}{208} : %9d\n", total_size (\%ip2); printf "{xC0xA8x65xD0} : %9d\n", total_size (\%ip3); printf "{xC0xA8}{x65xD0} : %9d\n", total_size (\%ip4); printf "{xC0}{xA8}{x65}{xD0} : %9d\n", total_size (\%ip5); ==> {192.168.101.208} : 116483356 {192.168}{101.208} : 69879492 {192}{168}{101}{208} : 71176834 {xC0xA8x65xD0} : 106954864 {xC0xA8}{x65xD0} : 69611776 {xC0}{xA8}{x65}{xD0} : 71176690

    On my machine using perl-5.28.0, $ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3} proves to be the most memory-friendly approach


    Enjoy, Have FUN! H.Merijn

      > $ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3} proves to be the most memory-friendly approach

      This might be true for long keys, but the original strings of the keys are only stored once.

      {192.168}{101.208} : 69879492 ... {xC0xA8}{x65xD0} : 69611776

      that's a gain of ~0.3% °

      I'd call this micro optimization, or did I miss something?

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      °) 0.383% to be more precise, so rather ~0.4%

        Your suggestion of using {"192.168"}{"101.208"} is indeed almost as small as my pack version, so it shows close to no gain in memory. Both are however 40% smaller than the original {"192.168.101.208"}.

        I just included all of those to show the differences in size.

        I first thought I'd use Socket::inet_aton, but that of course does not support the {A.B}{C.D} split.

        The win in using pack over plain is not the .3% gain in memory size, but the ease of sorting.


        Enjoy, Have FUN! H.Merijn

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11137118]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (1)
As of 2024-04-25 19:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found