Hash Search is VERY slow

rtjensen has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Hash Search is VERY slow by choroba (Cardinal) on Sep 28, 2021 at 21:50 UTC
When working with a large file, don't read the whole file into an array. Process the file line by line. Replace `foreach (@list)` [download] by `while (<>)` [download] Also, when populating the hash, Perl can do most of the work for you: `push @{ $ipURL{$ip} }, $url;` [download] This line can replace the whole if..else... construct. If the key doesn't exist in the hash, it will be created, and the dereference will create an anonymous array to which the url will be pushed. If the key exists, the url is simply pushed to the referenced array. Whether it's also responsible for the slowdown, I have no idea. Another place where you can avoid creating a temporary variable is the splitting. You can populate the two variables directly by subscripting the result of split: `my ($ip, $url) = (split /,/)[7, 31];` [download] Read more... (2 kB) `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l] [select]
Re: Hash Search is VERY slow by kcott (Archbishop) on Sep 29, 2021 at 00:06 UTC
G'day rtjensen, Welcome to the Monastery. "I have a script that loads a CSV file of around 800k lines, they're firewall logs, I'm trying to pull out the IP address and the URL they're hitting." I use very large CSV files at $work. In my case, they hold biological data; however, that's completely immaterial. I have one file which I use for volume testing which is over 2Gb. I expect that's probably comparable to your logfiles. I would take a different approach to what you show and use Text::CSV (if you also have Text::CSV_XS installed, it will run a lot faster). What follows is example code and data to show the technique; adapt it for your specific needs. The data: `$ cat pm_11137097_csv_parse.csv A,B,IP0,C,D,URL0,E,F A,B,IP1,C,D,URL1,E,F A,B,IP2,C,D,URL2,E,F A,B,IP3,C,D,URL3,E,F A,B,IP9,C,D,URL4,E,F A,B,IP2,C,D,URL5,E,F A,B,IP1,C,D,URL6,E,F A,B,IP0,C,D,URL7,E,F A,B,IP1,C,D,URL8,E,F A,B,IP0,C,D,URL9,E,F` [download] The code: `#!/usr/bin/env perl use strict; use warnings; use autodie; use Data::Dump; use Text::CSV; use constant { IP => 2, URL => 5, }; my $csv_file = 'pm_11137097_csv_parse.csv'; my %urls_for_ip; my $csv = Text::CSV::->new(); { open my $fh, '<', $csv_file; while (my $row = $csv->getline($fh)) { push @{$urls_for_ip{$row->[IP]}}, $row->[URL]; } } dd \%urls_for_ip;` [download] The output: `{ IP0 => ["URL0", "URL7", "URL9"], IP1 => ["URL1", "URL6", "URL8"], IP2 => ["URL2", "URL5"], IP3 => ["URL3"], IP9 => ["URL4"], }` [download] See also: autodie and Data::Dump. — Ken	[reply] [d/l] [select]
Re: Hash Search is VERY slow by johngg (Canon) on Sep 28, 2021 at 22:14 UTC
Further to what choroba said it looks from your very small sample that you might have the same IP/URL combination cropping up many times, with the result that the arrays will get quite large. It might be better to go for an HoH structure counting those combinations. use strict; use warnings; use Data::Dumper; open my $csvFH, q{<}, \ <<__EOD__ or die $!; dsd,sdfjk,192.168.102.120,,,"autodiscover-s.outlook.com/",,, wwfefwe,gfr,192.168.101.208,,,"logmeinrescue.com/",,, xxd,erer,192.168.100.23,,,"a.n.other.net/",,, sir,ferutr,192.168.102.120,,,"outlook.office365.com/",,, thj,,192.168.101.208,,,"logmeinrescue.com/",,' __EOD__ my %ipURL; while ( <$csvFH> ) { my( $ip, $url ) = ( split m{,} )[ 2, 5 ]; $ipURL{ $ip }->{ $url } ++; } close $csvFH or die $!; print Data::Dumper->Dumpxs( [ \ %ipURL ], [ qw{ *ipURL } ] ); [download] The output. `%ipURL = ( '192.168.102.120' => { '"autodiscover-s.outlook.com/"' => 1 +, '"outlook.office365.com/"' => 1 }, '192.168.100.23' => { '"a.n.other.net/"' => 1 }, '192.168.101.208' => { '"logmeinrescue.com/"' => 2 } );` [download] I hope this is helpful. Cheers, JohnGG	[reply] [d/l] [select]
Re: Hash Search is VERY slow by Tux (Canon) on Sep 29, 2021 at 09:23 UTC
Next to the fine comments already given, I suggest using a stream (as already suggested) instead of reading the whole data into a list and add a parsing filter (RFC7111 defines CSV fragments). Assuming you only require fields 7 and 31 and the rest is not interesting: #!/usr/bin/perl use 5.18.3; use warnings; use Data::Peek; use Text::CSV_XS qw( csv ); my %ipURL; csv ( in => DATA, out => undef, fragment => "col=7;31", on_in => sub { push @{$ipURL{$_[1][0]}} => $_[1][1]; }, ); DDumper \%ipURL; __END__ 1,2,3,4,5,6,192.168.102.120,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 +,23,24,25,26,27,28,29,30,"autodiscover-s.outlook.com/",32 1,2,3,4,5,6,192.168.102.120,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 +,23,24,25,26,27,28,29,30,"outlook.office365.com/",32 1,2,3,4,5,6,192.168.101.208,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 +,23,24,25,26,27,28,29,30,"logmeinrescue.com/",32 1,2,3,4,5,6,192.168.101.208,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 +,23,24,25,26,27,28,29,30,"logmeinrescue.com/",32 [download] ---> `{ '192.168.101.208' => [ 'logmeinrescue.com/', 'logmeinrescue.com/' ], '192.168.102.120' => [ 'autodiscover-s.outlook.com/', 'outlook.office365.com/' ] }` [download] As a side note: if a hostname appears more than once for the same IP, using a hash instead of a list might be more memory-friendly and easier to consume: `my %ipURL; csv ( in => DATA, out => undef, fragment => "col=7;31", on_in => sub { $ipURL{$_[1][0]}{$_[1][1]}++ }, ); DDumper \%ipURL;` [download] --> `{ '192.168.101.208' => { 'logmeinrescue.com/' => 2 }, '192.168.102.120' => { 'autodiscover-s.outlook.com/' => 1, 'outlook.office365.com/' => 1 } }` [download] Enjoy, Have FUN! H.Merijn	[reply] [d/l] [select]
Re: Hash Search is VERY slow by bliako (Monsignor) on Sep 28, 2021 at 22:34 UTC
I was trying to find the performance of Perl's hash insert in practice (as i assumed it was O(1) or very near) and hit this A short meditation about hash search performance . Though it's 20 years old almost. Avoiding the dereferencing of the array as choroba suggests will speed things up especially because you will eliminate this round-about way: `my @urlList=@{$ipURL{$ip}}; push (@urlList,$url); $ipURL{$ip}=\@urlList;` [download] I may be wrong on how I think Perl handles the above but I will say it creates a new array every time and copies the old one in there. Whereas `push @{$ipURL{$ip}}, $url;` "magically" doesn't, but it pushes into the reference so-to-speak. As the arrays grow, that's likely the cause of your exponential degradation in performance. You may also find that pushing into an array is slower than inserting into a hash and so you can find johngg's suggestion useful. bw, bliako	[reply] [d/l] [select]
Re: Hash Search is VERY slow by hippo (Bishop) on Sep 28, 2021 at 22:17 UTC
I feel like it's the 'exists' check on the Hash Don't guess, profile. 🦛	[reply]
Re: Hash Search is VERY slow by rtjensen (Novice) on Sep 29, 2021 at 14:38 UTC
It's working! I used choroba's advice and removed the whole if...else construct. Resultant loop looks like this: `while (my $row= $csv->getline_hr($fh)) { $linecounter++; my $ip=$row->{'Source address'}; my $url=$row->{'URL/Filename'}; push @{ $ipURL{$ip} }, $url; if (!($linecounter % 50000)) { print "Lines: $linecounter\n"; } } formatOutput(\%ipURL); # print Dumper \%ipURL;` [download] The entire thing runs in about 45 seconds now: `perl test-urlListbyIP.pl Lines: 50000 Lines: 100000 Lines: 150000 Lines: 200000 Lines: 250000 Lines: 300000 Lines: 350000 Lines: 400000 Lines: 450000 Lines: 500000 Lines: 550000 Lines: 600000 Lines: 650000 Lines: 700000 Lines: 750000 Lines: 800000 Lines: 850000 Formatting Output... List End:1316 Execution Time: 44.18 s` [download]	[reply] [d/l] [select]
Re^2: Hash Search is VERY slow by AnomalousMonk (Archbishop) on Sep 29, 2021 at 19:13 UTC
I think you'll find that it was choroba's advice to process the file on a line-by-line (CSV-record-by-record in this case) basis that did the trick. :) Give a man a fish: `<%-{-{-{-<`	[reply] [d/l]
Re^3: Hash Search is VERY slow by NERDVANA (Deacon) on Sep 29, 2021 at 22:37 UTC
I�m going to bet it was bliako�s observation that the array was getting cloned every time an element was added. That�s where the N^2 behavior came from.	[reply]
Re^4: Hash Search is VERY slow by LanX (Saint) on Sep 29, 2021 at 23:14 UTC
Re: Hash Search is VERY slow by karlgoethebier (Abbot) on Sep 29, 2021 at 11:13 UTC
�Die Geschichte wiederholt sich immer zweimal � das erste Mal als Trag�die, das zweite Mal als Farce.� (Karl Marx) See Threads From Hell #2: How To Search A Very Huge File [SOLVED] for further inspiration. Regards, Karl �The Crux of the Biscuit is the Apostrophe�	[reply]
Re^2: Hash Search is VERY slow by Tux (Canon) on Sep 29, 2021 at 11:32 UTC
Unless you can guarantee that the CSV you are dealing with is absolutely free of fields with embedded newlines, CSV parsing cannot be threaded or parsed in parallel. The OP only deals with fields 7 and 31, which are unlikely to contain new-lines, but we have no idea what the other fields may hold. Enjoy, Have FUN! H.Merijn	[reply]
Re^3: Hash Search is VERY slow by dsheroh (Monsignor) on Sep 29, 2021 at 12:13 UTC
We do have some idea of what the other fields may hold, in that the original post states the CSV files are firewall logs. I'm not aware of any common firewall log format which includes data that might contain embedded newlines, so it's probably not an issue in this case. But OP would know the specifics of the format they're dealing with better than I do, of course.	[reply]
Re^3: Hash Search is VERY slow by karlgoethebier (Abbot) on Sep 29, 2021 at 15:36 UTC
Yes sure. Anyway: Read more... (857 Bytes) And yes, i'm aware that i shouldn't split etc. Best regards, Karl �The Crux of the Biscuit is the Apostrophe�	[reply] [d/l]
Re^3: Hash Search is VERY slow by rtjensen (Novice) on Sep 29, 2021 at 14:16 UTC
Right, there are no new-lines in there and it's pretty uniform data overall.	[reply]
Re: Hash Search is VERY slow by LanX (Saint) on Sep 29, 2021 at 16:36 UTC
My guess is also that your memory consumption leads to excessive swapping. The general idea is presorting, for instance by iterating multiple times over the file and only processing one IP range after the other. This is expensive in IO but will create smaller data structures. Only if ... ... you really need all the data present in memory at once, consider breaking up the ranges into a tree of nested data structures and processing them in linear order. like `$hash->{'192'}{'168'}{'101'}{'208'}` or `$hash->{'192.168'}{'101.208'}` instead of `$hash->{'192.168.101.208'}` � If you now process all IPs in order , then Perl (well the OS) will be able to swap all memory-pages with unrelated sub-hashes out. This will be cheap because the number of swaps is minimized by the sorting. (see also Re: Small Hash a Gateway to Large Hash? ) An additional approach is using more compact data structures, hashes are efficient for sparse data. But if your IPs range from 0-255 an array is certainly more efficient. Furthermore, there is no point in repeating URLs like `"logmeinrescue.com"` in your array, counting them is more memory efficient. Anyway 800k lines input doesn't sound heavy though, not sure if we have the full picture (???) edit Like choroba already said, preloading the input completely into memory sounds like a waste of resources, you should check how much that costs. OTOH if you decided to implement my initial idea to process one IP range after the other, it'll reduce IO if (and only if) all fits into memory. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery} �) I'm aware that `192.168..` is very common	[reply] [d/l] [select]
Re^2: Hash Search is VERY slow by Tux (Canon) on Sep 30, 2021 at 07:59 UTC
If the number of keys is the cause of the memory cunsumption, make the keys smaller and more efficient. It might matter if you are low on memory. Best reason to pack the IP's is the it is much easier to sort use List::Util qw( shuffle ); use Devel::Size qw( total_size ); my ($x, %ip0, %ip1, %ip2, %ip3, %ip4, %ip5) = 31; foreach my $i0 ((shuffle 1..254)[0..$x]) { foreach my $i1 ((shuffle 1..254)[0..$x]) { foreach my $i2 ((shuffle 1..254)[0..$x]) { foreach my $i3 ((shuffle 1..254)[0..$x]) { $ip0{"$i0.$i1.$i2.$i3"}++; $ip1{"$i0.$i1"}{"$i2.$i3"}++; $ip2{$i0}{$i1}{$i2}{$i3}++; $ip3{pack "CCCC", $i0, $i1, $i2, $i3}++; $ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3}++; $ip5{pack "C", $i0}{pack "C", $i1}{pack "C", $i2}{pack "C", $i3}++ +; }}}} printf "{192.168.101.208} : %9d\n", total_size (\%ip0); printf "{192.168}{101.208} : %9d\n", total_size (\%ip1); printf "{192}{168}{101}{208} : %9d\n", total_size (\%ip2); printf "{xC0xA8x65xD0} : %9d\n", total_size (\%ip3); printf "{xC0xA8}{x65xD0} : %9d\n", total_size (\%ip4); printf "{xC0}{xA8}{x65}{xD0} : %9d\n", total_size (\%ip5); ==> {192.168.101.208} : 116483356 {192.168}{101.208} : 69879492 {192}{168}{101}{208} : 71176834 {xC0xA8x65xD0} : 106954864 {xC0xA8}{x65xD0} : 69611776 {xC0}{xA8}{x65}{xD0} : 71176690 [download] On my machine using perl-5.28.0, `$ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3}` proves to be the most memory-friendly approach Enjoy, Have FUN! H.Merijn	[reply] [d/l] [select]
Re^3: Hash Search is VERY slow by LanX (Saint) on Sep 30, 2021 at 11:19 UTC
> `$ip4{pack "CC", $i0, $i1}{pack "CC", $i2, $i3}` proves to be the most memory-friendly approach This might be true for long keys, but the original strings of the keys are only stored once. `{192.168}{101.208} : 69879492 ... {xC0xA8}{x65xD0} : 69611776` [download] that's a gain of ~0.3% � I'd call this micro optimization, or did I miss something? Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery} �) 0.383% to be more precise, so rather ~0.4%	[reply] [d/l] [select]
Re^4: Hash Search is VERY slow by Tux (Canon) on Sep 30, 2021 at 11:54 UTC
Re^5: Hash Search is VERY slow by LanX (Saint) on Sep 30, 2021 at 11:59 UTC
Some notes below your chosen depth have not been shown here
Re: Hash Search is VERY slow by dd-b (Monk) on Oct 02, 2021 at 03:00 UTC
Ooh, excellent timing, this thread is one thing I needed today. I'm just implementing something using several of the same tools and techniques, and while mine is enough smaller that writing it badly wouldn't be as severely punished--still, it's better to write it right, just for drill! (The back of the envelope says my case might grow to 500K bytes, which on a system with 16GB of ram is just not something I'm going to worry about, even if the back of the envelope is off by an order of magnitude in growth. But still.) In particular the fact that hash entries auto-vivify when referenced, and using that to make the pushing onto an array held in a hash entry vastly more efficient, is stuff I've seen the bits of repeatedly but mostly don't quite think of clearly enough to put them together in that order myself just by default. (The auto-vivification feels mysterious and I tend to avoid mysterious code when possible; but this case has big benefits.) Thanks monks!	[reply]
Re: Hash Search is VERY slow by rtjensen (Novice) on Sep 29, 2021 at 14:05 UTC
Thank you everyone for all the insight. I have tried changing over to Text::CSV_XS, and while it does produce the same output, there was no speed difference. As a test, I removed the if (!(exists)) check on the hash, and it completed in about 45 seconds, so i'm pretty sure that's where the bottleneck is. For the fun of it, I let it run overnight just to see how long it would take to run in its current state. The good news is, now that I've processed the big file with the script, subsequent files are going to be MUCH smaller... but I'm still going to be trying to get this script working more quickly. Here are the results of letting it run overnight: `perl test-urlListbyIP.pl Lines: 50000 Lines: 100000 Lines: 150000 Lines: 200000 Lines: 250000 Lines: 300000 Lines: 350000 Lines: 400000 Lines: 450000 Lines: 500000 Lines: 550000 Lines: 600000 Lines: 650000 Lines: 700000 Lines: 750000 Lines: 800000 Lines: 850000 Formatting Output... List End:1316 Execution Time: 25977.87 s` [download] Check out that runtime!!! I'll post updates if I get it going more quickly. THanks again for all the feedback.	[reply] [d/l]
Re^2: Hash Search is VERY slow by Anonymous Monk on Sep 30, 2021 at 22:07 UTC
Very strange to me that "if exists" was the culprit but ... you found it!	[reply]
Re^3: Hash Search is VERY slow by Anonymous Monk on Oct 01, 2021 at 17:41 UTC
May be strange to you, but not to us; you've demonstrated time and again that you don't know anything about any of this.	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.
A reply falls below the community's threshold of quality. You may see it by logging in.


"be consistent"
	PerlMonks

Hash Search is VERY slow

Only if ...

edit