Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Hash Search is VERY slow

by rtjensen (Novice)
on Sep 28, 2021 at 21:09 UTC ( #11137097=perlquestion: print w/replies, xml ) Need Help??

rtjensen has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I'm hitting a brick wall here... I have a script that loads a CSV file of around 800k lines, they're firewall logs, I'm trying to pull out the IP address and the URL they're hitting. I take the IP, check the hash to see if we've seen the Ip already, if not, creates a new entry in the hash for it and creates an array which will hold a list of URLs. If it has seen the Ip before, it pulls the array of URLs from the Hash, adds the next URL to it, and sticks it back in the hash, and moves on. It's fine for say.... a few thousand lines... then it slows to crawl. around 100k, it comes to almost a halt. CPU is high, memory usage is around 7% of system, so fairly low. I let it run with the full dataset and after 30 min it never finished. 50k entries takes about 60 seconds, 100k takes 180 seconds... I feel like it's the 'exists' check on the Hash, but how can I make it faster? Here's the code:
foreach (@list) { # my $entry=time(); $linecounter++; #split the log entry up into an array; source IP is field 7; U +RL is 31; #PALO URL LOGS ONLY! my @message=split(',',$_); my $ip=$message[7]; my $url=$message[31]; #Check if we've seen this IP already in the Hash, if not add i +t to the hash; if (!(exists $ipURL{$ip})) { # print "Doesn't Exist... adding\n"; my @urlList; push(@urlList,$url); $ipURL{$ip}= \@urlList; } else { # print "Defined\n"; my @urlList=@{$ipURL{$ip}}; push (@urlList,$url); $ipURL{$ip}=\@urlList; } if (!($linecounter % 50000)) { print "Lines: $linecounter\n"; } } formatOutput(\%ipURL); # print Dumper \%ipURL;
Here's how the structure is with a very small dataset (4 lines)
perl urlListbyIP.pl List Length:5 Formatting Output... $VAR1 = { '192.168.102.120' => [ '"autodiscover-s.outlook.com/"', '"outlook.office365.com/"' ], 'Source address' => [ 'URL/Filename' ], '192.168.101.208' => [ '"logmeinrescue.com/"', '"logmeinrescue.com/"' ] }; List End:7 Execution Time: 0.01 s
I have another similar script that loads 3.5m lines and it compares each line with a few if $_=~/REGEX/ lines, and that finishes in 25-30 seconds, I dont get why this is so much slower. The delay is definately in the foreach loop on @list, as it never gets to the formatOutput() sub. Please help!

Replies are listed 'Best First'.
Re: Hash Search is VERY slow
by choroba (Archbishop) on Sep 28, 2021 at 21:50 UTC
    When working with a large file, don't read the whole file into an array. Process the file line by line. Replace
    foreach (@list)
    by
    while (<>)

    Also, when populating the hash, Perl can do most of the work for you:

    push @{ $ipURL{$ip} }, $url;
    This line can replace the whole if..else... construct. If the key doesn't exist in the hash, it will be created, and the dereference will create an anonymous array to which the url will be pushed. If the key exists, the url is simply pushed to the referenced array.

    Whether it's also responsible for the slowdown, I have no idea.

    Another place where you can avoid creating a temporary variable is the splitting. You can populate the two variables directly by subscripting the result of split:

    my ($ip, $url) = (split /,/)[7, 31];

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Hash Search is VERY slow
by kcott (Bishop) on Sep 29, 2021 at 00:06 UTC

    G'day rtjensen,

    Welcome to the Monastery.

    "I have a script that loads a CSV file of around 800k lines, they're firewall logs, I'm trying to pull out the IP address and the URL they're hitting."

    I use very large CSV files at $work. In my case, they hold biological data; however, that's completely immaterial. I have one file which I use for volume testing which is over 2Gb. I expect that's probably comparable to your logfiles.

    I would take a different approach to what you show and use Text::CSV (if you also have Text::CSV_XS installed, it will run a lot faster). What follows is example code and data to show the technique; adapt it for your specific needs.

    The data:

    $ cat pm_11137097_csv_parse.csv A,B,IP0,C,D,URL0,E,F A,B,IP1,C,D,URL1,E,F A,B,IP2,C,D,URL2,E,F A,B,IP3,C,D,URL3,E,F A,B,IP9,C,D,URL4,E,F A,B,IP2,C,D,URL5,E,F A,B,IP1,C,D,URL6,E,F A,B,IP0,C,D,URL7,E,F A,B,IP1,C,D,URL8,E,F A,B,IP0,C,D,URL9,E,F

    The code:

    #!/usr/bin/env perl use strict; use warnings; use autodie; use Data::Dump; use Text::CSV; use constant { IP => 2, URL => 5, }; my $csv_file = 'pm_11137097_csv_parse.csv'; my %urls_for_ip; my $csv = Text::CSV::->new(); { open my $fh, '<', $csv_file; while (my $row = $csv->getline($fh)) { push @{$urls_for_ip{$row->[IP]}}, $row->[URL]; } } dd \%urls_for_ip;

    The output:

    { IP0 => ["URL0", "URL7", "URL9"], IP1 => ["URL1", "URL6", "URL8"], IP2 => ["URL2", "URL5"], IP3 => ["URL3"], IP9 => ["URL4"], }

    See also: autodie and Data::Dump.

    — Ken

Re: Hash Search is VERY slow
by johngg (Canon) on Sep 28, 2021 at 22:14 UTC

    Further to what choroba said it looks from your very small sample that you might have the same IP/URL combination cropping up many times, with the result that the arrays will get quite large. It might be better to go for an HoH structure counting those combinations.

    use strict; use warnings; use Data::Dumper; open my $csvFH, q{<}, \ <<__EOD__ or die $!; dsd,sdfjk,192.168.102.120,,,"autodiscover-s.outlook.com/",,, wwfefwe,gfr,192.168.101.208,,,"logmeinrescue.com/",,, xxd,erer,192.168.100.23,,,"a.n.other.net/",,, sir,ferutr,192.168.102.120,,,"outlook.office365.com/",,, thj,,192.168.101.208,,,"logmeinrescue.com/",,' __EOD__ my %ipURL; while ( <$csvFH> ) { my( $ip, $url ) = ( split m{,} )[ 2, 5 ]; $ipURL{ $ip }->{ $url } ++; } close $csvFH or die $!; print Data::Dumper->Dumpxs( [ \ %ipURL ], [ qw{ *ipURL } ] );

    The output.

    %ipURL = ( '192.168.102.120' => { '"autodiscover-s.outlook.com/"' => 1 +, '"outlook.office365.com/"' => 1 }, '192.168.100.23' => { '"a.n.other.net/"' => 1 }, '192.168.101.208' => { '"logmeinrescue.com/"' => 2 } );

    I hope this is helpful.

    Cheers,

    JohnGG

Re: Hash Search is VERY slow
by bliako (Monsignor) on Sep 28, 2021 at 22:34 UTC

    I was trying to find the performance of Perl's hash insert in practice (as i assumed it was O(1) or very near) and hit this A short meditation about hash search performance . Though it's 20 years old almost. Avoiding the dereferencing of the array as choroba suggests will speed things up especially because you will eliminate this round-about way:

    my @urlList=@{$ipURL{$ip}}; push (@urlList,$url); $ipURL{$ip}=\@urlList;

    I may be wrong on how I think Perl handles the above but I will say it creates a new array every time and copies the old one in there. Whereas push @{$ipURL{$ip}}, $url; "magically" doesn't, but it pushes into the reference so-to-speak. As the arrays grow, that's likely the cause of your exponential degradation in performance. You may also find that pushing into an array is slower than inserting into a hash and so you can find johngg's suggestion useful.

    bw, bliako

Re: Hash Search is VERY slow
by Tux (Canon) on Sep 29, 2021 at 09:23 UTC

    Next to the fine comments already given, I suggest using a stream (as already suggested) instead of reading the whole data into a list and add a parsing filter (RFC7111 defines CSV fragments).

    Assuming you only require fields 7 and 31 and the rest is not interesting:

    #!/usr/bin/perl use 5.18.3; use warnings; use Data::Peek; use Text::CSV_XS qw( csv ); my %ipURL; csv ( in => *DATA, out => undef, fragment => "col=7;31", on_in => sub { push @{$ipURL{$_[1][0]}} => $_[1][1]; }, ); DDumper \%ipURL; __END__ 1,2,3,4,5,6,192.168.102.120,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 +,23,24,25,26,27,28,29,30,"autodiscover-s.outlook.com/",32 1,2,3,4,5,6,192.168.102.120,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 +,23,24,25,26,27,28,29,30,"outlook.office365.com/",32 1,2,3,4,5,6,192.168.101.208,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 +,23,24,25,26,27,28,29,30,"logmeinrescue.com/",32 1,2,3,4,5,6,192.168.101.208,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 +,23,24,25,26,27,28,29,30,"logmeinrescue.com/",32

    --->

    { '192.168.101.208' => [ 'logmeinrescue.com/', 'logmeinrescue.com/' ], '192.168.102.120' => [ 'autodiscover-s.outlook.com/', 'outlook.office365.com/' ] }

    As a side note: if a hostname appears more than once for the same IP, using a hash instead of a list might be more memory-friendly and easier to consume:

    my %ipURL; csv ( in => *DATA, out => undef, fragment => "col=7;31", on_in => sub { $ipURL{$_[1][0]}{$_[1][1]}++ }, ); DDumper \%ipURL;

    -->

    { '192.168.101.208' => { 'logmeinrescue.com/' => 2 }, '192.168.102.120' => { 'autodiscover-s.outlook.com/' => 1, 'outlook.office365.com/' => 1 } }

    Enjoy, Have FUN! H.Merijn
Re: Hash Search is VERY slow
by hippo (Bishop) on Sep 28, 2021 at 22:17 UTC
    I feel like it's the 'exists' check on the Hash

    Don't guess, profile.


    🦛

Re: Hash Search is VERY slow
by rtjensen (Novice) on Sep 29, 2021 at 14:38 UTC
    It's working! I used choroba's advice and removed the whole if...else construct. Resultant loop looks like this:
    while (my $row= $csv->getline_hr($fh)) { $linecounter++; my $ip=$row->{'Source address'}; my $url=$row->{'URL/Filename'}; push @{ $ipURL{$ip} }, $url; if (!($linecounter % 50000)) { print "Lines: $linecounter\n"; } } formatOutput(\%ipURL); # print Dumper \%ipURL;
    The entire thing runs in about 45 seconds now:
    perl test-urlListbyIP.pl Lines: 50000 Lines: 100000 Lines: 150000 Lines: 200000 Lines: 250000 Lines: 300000 Lines: 350000 Lines: 400000 Lines: 450000 Lines: 500000 Lines: 550000 Lines: 600000 Lines: 650000 Lines: 700000 Lines: 750000 Lines: 800000 Lines: 850000 Formatting Output... List End:1316 Execution Time: 44.18 s

      I think you'll find that it was choroba's advice to process the file on a line-by-line (CSV-record-by-record in this case) basis that did the trick. :)


      Give a man a fish:  <%-{-{-{-<

        I’m going to bet it was bliako’s observation that the array was getting cloned every time an element was added. That’s where the N^2 behavior came from.
Re: Hash Search is VERY slow
by LanX (Sage) on Sep 29, 2021 at 16:36 UTC
    My guess is also that your memory consumption leads to excessive swapping.

    The general idea is presorting, for instance by iterating multiple times over the file and only processing one IP range after the other.

    This is expensive in IO but will create smaller data structures.

    Only if ...

    ... you really need all the data present in memory at once, consider breaking up the ranges into a tree of nested data structures and processing them in linear order.

    like $hash->{'192'}{'168'}{'101'}{'208'} or $hash->{'192.168'}{'101.208'} instead of $hash->{'192.168.101.208'} °

    If you now process all IPs in order , then Perl (well the OS) will be able to swap all memory-pages with unrelated sub-hashes out. This will be cheap because the number of swaps is minimized by the sorting. (see also Re: Small Hash a Gateway to Large Hash? )

    An additional approach is using more compact data structures, hashes are efficient for sparse data. But if your IPs range from 0-255 an array is certainly more efficient.

    Furthermore, there is no point in repeating URLs like "logmeinrescue.com" in your array, counting them is more memory efficient.

    Anyway 800k lines input doesn't sound heavy though, not sure if we have the full picture (???)

    edit

    Like choroba already said, preloading the input completely into memory sounds like a waste of resources, you should check how much that costs.

    OTOH if you decided to implement my initial idea to process one IP range after the other, it'll reduce IO if (and only if) all fits into memory.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    °) I'm aware that 192.168.*.* is very common

      If the number of keys is the cause of the memory cunsumption, make the keys smaller and more efficient. It might matter if you are low on memory. Best reason to pack the IP's is the it is much easier to sort

      use List::Util qw( shuffle ); use Devel::Size qw( total_size ); my ($x, %ip0, %ip1, %ip2, %ip3, %ip4, %ip5) = 31; foreach my $i0 ((shuffle 1..254)[0..$x]) { foreach my $i1 ((shuffle 1..254)[0..$x]) { foreach my $i2 ((shuffle 1..254)[0..$x]) { foreach my $i3 ((shuffle 1..254)[0..$x]) { $ip0{"$i0.$i1.$i2.$i3"}++; $ip1{"$i0.$i1"}{"$i2.$i3"}++; $ip2{$i0}{$i1}{$i2}{$i3}++; $ip3{pack "CCCC", $i0, $i1, $i2, $i3}++; $ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3}++; $ip5{pack "C", $i0}{pack "C", $i1}{pack "C", $i2}{pack "C", $i3}++ +; }}}} printf "{192.168.101.208} : %9d\n", total_size (\%ip0); printf "{192.168}{101.208} : %9d\n", total_size (\%ip1); printf "{192}{168}{101}{208} : %9d\n", total_size (\%ip2); printf "{xC0xA8x65xD0} : %9d\n", total_size (\%ip3); printf "{xC0xA8}{x65xD0} : %9d\n", total_size (\%ip4); printf "{xC0}{xA8}{x65}{xD0} : %9d\n", total_size (\%ip5); ==> {192.168.101.208} : 116483356 {192.168}{101.208} : 69879492 {192}{168}{101}{208} : 71176834 {xC0xA8x65xD0} : 106954864 {xC0xA8}{x65xD0} : 69611776 {xC0}{xA8}{x65}{xD0} : 71176690

      On my machine using perl-5.28.0, $ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3} proves to be the most memory-friendly approach


      Enjoy, Have FUN! H.Merijn

        > $ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3} proves to be the most memory-friendly approach

        This might be true for long keys, but the original strings of the keys are only stored once.

        {192.168}{101.208} : 69879492 ... {xC0xA8}{x65xD0} : 69611776

        that's a gain of ~0.3% °

        I'd call this micro optimization, or did I miss something?

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        °) 0.383% to be more precise, so rather ~0.4%

Re: Hash Search is VERY slow
by karlgoethebier (Abbot) on Sep 29, 2021 at 11:13 UTC

      Unless you can guarantee that the CSV you are dealing with is absolutely free of fields with embedded newlines, CSV parsing cannot be threaded or parsed in parallel.

      The OP only deals with fields 7 and 31, which are unlikely to contain new-lines, but we have no idea what the other fields may hold.


      Enjoy, Have FUN! H.Merijn
        We do have some idea of what the other fields may hold, in that the original post states the CSV files are firewall logs. I'm not aware of any common firewall log format which includes data that might contain embedded newlines, so it's probably not an issue in this case. But OP would know the specifics of the format they're dealing with better than I do, of course.

        Yes sure. Anyway:

        And yes, i'm aware that i shouldn't split etc.

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        Right, there are no new-lines in there and it's pretty uniform data overall.
Re: Hash Search is VERY slow
by dd-b (Monk) on Oct 02, 2021 at 03:00 UTC

    Ooh, excellent timing, this thread is one thing I needed today. I'm just implementing something using several of the same tools and techniques, and while mine is enough smaller that writing it badly wouldn't be as severely punished--still, it's better to write it right, just for drill!

    (The back of the envelope says my case might grow to 500K bytes, which on a system with 16GB of ram is just not something I'm going to worry about, even if the back of the envelope is off by an order of magnitude in growth. But still.)

    In particular the fact that hash entries auto-vivify when referenced, and using that to make the pushing onto an array held in a hash entry vastly more efficient, is stuff I've seen the bits of repeatedly but mostly don't quite think of clearly enough to put them together in that order myself just by default. (The auto-vivification feels mysterious and I tend to avoid mysterious code when possible; but this case has big benefits.)

    Thanks monks!

Re: Hash Search is VERY slow
by rtjensen (Novice) on Sep 29, 2021 at 14:05 UTC
    Thank you everyone for all the insight. I have tried changing over to Text::CSV_XS, and while it does produce the same output, there was no speed difference. As a test, I removed the if (!(exists)) check on the hash, and it completed in about 45 seconds, so i'm pretty sure that's where the bottleneck is. For the fun of it, I let it run overnight just to see how long it would take to run in its current state. The good news is, now that I've processed the big file with the script, subsequent files are going to be MUCH smaller... but I'm still going to be trying to get this script working more quickly. Here are the results of letting it run overnight:
    perl test-urlListbyIP.pl Lines: 50000 Lines: 100000 Lines: 150000 Lines: 200000 Lines: 250000 Lines: 300000 Lines: 350000 Lines: 400000 Lines: 450000 Lines: 500000 Lines: 550000 Lines: 600000 Lines: 650000 Lines: 700000 Lines: 750000 Lines: 800000 Lines: 850000 Formatting Output... List End:1316 Execution Time: 25977.87 s
    Check out that runtime!!! I'll post updates if I get it going more quickly. THanks again for all the feedback.
      Very strange to me that "if exists" was the culprit but ... you found it!

        May be strange to you, but not to us; you've demonstrated time and again that you don't know anything about any of this.

A reply falls below the community's threshold of quality. You may see it by logging in.
A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11137097]
Approved by LanX
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2021-12-09 01:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    R or B?



    Results (36 votes). Check out past polls.

    Notices?