rtjensen has asked for the wisdom of the Perl Monks concerning the following question:
Hello!
I'm hitting a brick wall here...
I have a script that loads a CSV file of around 800k lines of firewall logs. I'm trying to pull out the source IP address and the URL each entry is hitting.
I take the IP and check the hash to see if we've seen that IP already. If not, I create a new entry in the hash for it, along with an array which will hold a list of URLs.
If it has seen the IP before, it pulls the array of URLs from the hash, adds the next URL to it, sticks it back in the hash, and moves on. It's fine for, say, a few thousand lines, then it slows to a crawl. Around 100k lines it comes almost to a halt. CPU is high; memory usage is around 7% of system, so fairly low. I let it run with the full dataset and after 30 minutes it still had not finished. 50k entries take about 60 seconds, 100k take 180 seconds. I feel like it's the 'exists' check on the hash, but how can I make it faster?
Here's the code:
    foreach (@list) {
        # my $entry=time();
        $linecounter++;
        #split the log entry up into an array; source IP is field 7; URL is 31;
        #PALO URL LOGS ONLY!
        my @message=split(',',$_);
        my $ip=$message[7];
        my $url=$message[31];
        #Check if we've seen this IP already in the Hash, if not add it to the hash;
        if (!(exists $ipURL{$ip})) {
            # print "Doesn't Exist... adding\n";
            my @urlList;
            push(@urlList,$url);
            $ipURL{$ip}= \@urlList;
        } else {
            # print "Defined\n";
            my @urlList=@{$ipURL{$ip}};
            push (@urlList,$url);
            $ipURL{$ip}=\@urlList;
        }
        if (!($linecounter % 50000)) {
            print "Lines: $linecounter\n";
        }
    }
    formatOutput(\%ipURL);
    # print Dumper \%ipURL;

Here's how the structure is with a very small dataset (4 lines):

    perl urlListbyIP.pl
    List Length:5
    Formatting Output...
    $VAR1 = {
              '192.168.102.120' => [
                                     '"autodiscover-s.outlook.com/"',
                                     '"outlook.office365.com/"'
                                   ],
              'Source address' => [
                                    'URL/Filename'
                                  ],
              '192.168.101.208' => [
                                     '"logmeinrescue.com/"',
                                     '"logmeinrescue.com/"'
                                   ]
            };
    List End:7
    Execution Time: 0.01 s

I have another, similar script that loads 3.5M lines and checks each line against a few if $_=~/REGEX/ tests, and it finishes in 25-30 seconds, so I don't get why this one is so much slower. The delay is definitely in the foreach loop over @list, as it never gets to the formatOutput() sub. Please help!
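A note on the loop described above: the hash lookup itself is unlikely to be the bottleneck. In the else branch, my @urlList=@{$ipURL{$ip}} copies the entire URL array on every line, so the work for a busy IP grows roughly with the square of the number of URLs it has collected. The following is a minimal sketch, not the poster's script, that pushes onto the stored array in place. The make_row helper and the sample rows are hypothetical and exist only to make the example runnable; field positions 7 and 31 are taken from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: build a 32-field CSV row with the source IP in
# field 7 and the URL in field 31, matching the positions in the post.
sub make_row {
    my ($ip, $url) = @_;
    my @fields = ('') x 32;
    @fields[7, 31] = ($ip, $url);
    return join ',', @fields;
}

# Hypothetical sample data standing in for the 800k-line log file.
my @list = (
    make_row('192.168.102.120', '"autodiscover-s.outlook.com/"'),
    make_row('192.168.101.208', '"logmeinrescue.com/"'),
    make_row('192.168.102.120', '"outlook.office365.com/"'),
    make_row('192.168.101.208', '"logmeinrescue.com/"'),
);

my %ipURL;
for my $line (@list) {
    my @message = split /,/, $line;
    my ($ip, $url) = @message[7, 31];

    # Dereference the stored array and push onto it directly.
    # Autovivification creates the array reference the first time an IP
    # is seen, so no exists check and no copy of the array is needed.
    push @{ $ipURL{$ip} }, $url;
}

for my $ip (sort keys %ipURL) {
    printf "%s: %d URL(s)\n", $ip, scalar @{ $ipURL{$ip} };
}
```

With this change each line costs a single push regardless of how many URLs the IP already has. The plain split on commas is kept from the original and assumes no field contains an embedded comma; logs with quoted commas would need a real CSV parser such as Text::CSV.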
Replies are listed 'Best First'.
Re: Hash Search is VERY slow
by choroba (Cardinal) on Sep 28, 2021 at 21:50 UTC
Re: Hash Search is VERY slow
by kcott (Archbishop) on Sep 29, 2021 at 00:06 UTC
Re: Hash Search is VERY slow
by johngg (Canon) on Sep 28, 2021 at 22:14 UTC
Re: Hash Search is VERY slow
by Tux (Canon) on Sep 29, 2021 at 09:23 UTC
Re: Hash Search is VERY slow
by bliako (Monsignor) on Sep 28, 2021 at 22:34 UTC
Re: Hash Search is VERY slow
by hippo (Bishop) on Sep 28, 2021 at 22:17 UTC
Re: Hash Search is VERY slow
by rtjensen (Novice) on Sep 29, 2021 at 14:38 UTC
by AnomalousMonk (Archbishop) on Sep 29, 2021 at 19:13 UTC
by NERDVANA (Deacon) on Sep 29, 2021 at 22:37 UTC
by LanX (Saint) on Sep 29, 2021 at 23:14 UTC
Re: Hash Search is VERY slow
by karlgoethebier (Abbot) on Sep 29, 2021 at 11:13 UTC
by Tux (Canon) on Sep 29, 2021 at 11:32 UTC
by dsheroh (Monsignor) on Sep 29, 2021 at 12:13 UTC
by karlgoethebier (Abbot) on Sep 29, 2021 at 15:36 UTC
by rtjensen (Novice) on Sep 29, 2021 at 14:16 UTC
Re: Hash Search is VERY slow
by LanX (Saint) on Sep 29, 2021 at 16:36 UTC
by Tux (Canon) on Sep 30, 2021 at 07:59 UTC
by LanX (Saint) on Sep 30, 2021 at 11:19 UTC
by Tux (Canon) on Sep 30, 2021 at 11:54 UTC
by LanX (Saint) on Sep 30, 2021 at 11:59 UTC
Re: Hash Search is VERY slow
by dd-b (Monk) on Oct 02, 2021 at 03:00 UTC
Re: Hash Search is VERY slow
by rtjensen (Novice) on Sep 29, 2021 at 14:05 UTC
by Anonymous Monk on Sep 30, 2021 at 22:07 UTC
by Anonymous Monk on Oct 01, 2021 at 17:41 UTC