http://qs321.pair.com?node_id=11137097

rtjensen has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I'm hitting a brick wall here... I have a script that loads a CSV file of around 800k lines of firewall logs, and I'm trying to pull out each source IP address and the URL it's hitting. For each line I take the IP and check a hash to see if we've seen that IP already; if not, I create a new entry in the hash for it holding an array of URLs. If the IP has been seen before, I pull the array of URLs from the hash, add the next URL to it, stick it back in the hash, and move on. It's fine for, say, a few thousand lines... then it slows to a crawl. Around 100k lines it comes almost to a halt. CPU is high; memory usage is around 7% of the system, so fairly low. I let it run with the full dataset and after 30 minutes it never finished. 50k entries takes about 60 seconds, 100k takes 180 seconds... I feel like it's the 'exists' check on the hash, but how can I make it faster? Here's the code:
foreach (@list) {
#    my $entry=time();
    $linecounter++;
    # Split the log entry up into an array; source IP is field 7; URL is 31.
    # PALO URL LOGS ONLY!
    my @message = split(',', $_);
    my $ip  = $message[7];
    my $url = $message[31];
    # Check if we've seen this IP already in the Hash; if not, add it to the hash.
    if (!(exists $ipURL{$ip})) {
#        print "Doesn't Exist... adding\n";
        my @urlList;
        push(@urlList, $url);
        $ipURL{$ip} = \@urlList;
    }
    else {
#        print "Defined\n";
        my @urlList = @{$ipURL{$ip}};
        push(@urlList, $url);
        $ipURL{$ip} = \@urlList;
    }
    if (!($linecounter % 50000)) {
        print "Lines: $linecounter\n";
    }
}
formatOutput(\%ipURL);
# print Dumper \%ipURL;
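[Editor's note] The slowdown is unlikely to be `exists` (a hash lookup is constant-time). The expensive part is `my @urlList = @{$ipURL{$ip}}` followed by `$ipURL{$ip} = \@urlList`: every repeat of an IP copies that IP's entire URL list out of the hash and back in, which makes the loop quadratic in the number of lines per IP. Pushing through the stored reference avoids every copy. A minimal sketch of the same loop (field positions 7 and 31 and the 'Source address' header value are taken from the post; the sample data is synthetic):

```perl
use strict;
use warnings;

# Build synthetic log lines standing in for the real 800k-line @list;
# field 7 is the source IP, field 31 the URL, as in the original script.
sub make_line {
    my ($ip, $url) = @_;
    my @f = ('') x 32;
    @f[7, 31] = ($ip, $url);
    return join ',', @f;
}
my @list = (
    make_line('192.168.1.1', 'example.com/'),
    make_line('192.168.1.1', 'example.org/'),
    make_line('10.0.0.5',    'example.net/'),
);

my %ipURL;
foreach my $line (@list) {
    my ($ip, $url) = (split /,/, $line)[7, 31];
    # Skip the CSV header row, which otherwise becomes a hash entry
    # (its IP column literally contains "Source address").
    next if $ip eq 'Source address';
    # Push through the reference: autovivification creates the array
    # ref on first sight of $ip, and no copy is ever made.
    push @{ $ipURL{$ip} }, $url;
}
```

The hash ends up with the same structure as before (`IP => [urls...]`), so formatOutput() would not need to change.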
Here's how the structure looks with a very small dataset (4 lines):
perl urlListbyIP.pl
List Length:5
Formatting Output...
$VAR1 = {
          '192.168.102.120' => [
                                 '"autodiscover-s.outlook.com/"',
                                 '"outlook.office365.com/"'
                               ],
          'Source address' => [
                                'URL/Filename'
                              ],
          '192.168.101.208' => [
                                 '"logmeinrescue.com/"',
                                 '"logmeinrescue.com/"'
                               ]
        };
List End:7
Execution Time: 0.01 s
I have another, similar script that loads 3.5m lines and compares each line with a few if $_ =~ /REGEX/ checks, and that finishes in 25-30 seconds; I don't get why this one is so much slower. The delay is definitely in the foreach loop over @list, as it never gets to the formatOutput() sub. Please help!
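[Editor's note] The scaling reported above (50k lines in ~60s, 100k in ~180s) is the signature of quadratic growth, which points at the per-line array copy rather than the hash lookup. The difference can be confirmed on any machine with the core Benchmark module; this is an illustrative worst case (every line shares one IP), not the poster's data:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# One IP receiving many URLs is the worst case for the copy-based loop.
my @urls = map { "site$_.example/" } 1 .. 20_000;

timethese(1, {
    # Original approach: copy the whole list out of the hash and back
    # on every line -- O(n^2) total work for n URLs on one IP.
    copy_each_time => sub {
        my %h;
        for my $url (@urls) {
            my @urlList = @{ $h{ip} // [] };
            push @urlList, $url;
            $h{ip} = \@urlList;
        }
    },
    # Reference-based approach: append in place -- O(n) total work.
    push_via_ref => sub {
        my %h;
        for my $url (@urls) {
            push @{ $h{ip} }, $url;
        }
    },
});
```

Absolute timings vary by machine, but the copy-based version should take dramatically longer, and the gap widens as the URL count grows.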