PerlMonks
searching in large file

by sabas (Acolyte)
on Jan 06, 2018 at 00:23 UTC ( [id://1206791]=perlquestion )

sabas has asked for the wisdom of the Perl Monks concerning the following question:

I have two text files: the first contains 114 lines, and the second contains ~300K lines of 277 columns each. I stored the 114 lines in an array and would like to search for each of them in the ~300K-line file. What is the best and quickest approach? I want to use three command-line arguments: $ARGV[0] is my first file with the 114 lines, $ARGV[1] is my second file with the ~300K lines, and $ARGV[2] is the output file, where I write each line of file 2 in which a match is found. I appreciate the help; I am about one month into Perl scripting. I did not use $ARGV[1] and $ARGV[2] yet, to minimize my debugging, so feel free to rewrite my code. How can I check for the presence of each of the 114 lines in the ~300K lines without repeatedly re-reading the ~300K-line file and seeking the filehandle back to 0, 0 again?

    use strict;
    use warnings;
    my @sn;                 # store the lines of file1.txt into an array
    my $i    = 0;
    my $lctr = 0;
    my $flag = 1;
    while (<>) {            # read the arg input file "file1.txt" that contains info
        push @sn, split ' ';    # store each line to array
        print "sn[$i]=$sn[$i]\n";
        $i++;
    }
    $sn[$i] = "END";        # mark the end of the array
    print $sn[$i];
    my $wait = <STDIN>;
    my $filename = 'file2.txt';
    open( my $fh, '<:encoding(UTF-8)', $filename ) or die;
    $i = 0;
    while ( $flag == 1 ) {
        while ( my $row = <$fh> ) {
            chomp $row;
            print "Searching for $sn[$i]....";
            if ( index( $row, $sn[$i] ) != -1 ) {
                print $row;
                print "Found $sn[$i]\n";
                my $wait = <STDIN>;
                $i++;
                seek FH, 0, 0;
            }
            $lctr++;
            if ( $sn[$i] eq 'END' ) {
                $flag = 0;
                last;
            }
        }
    }

Replies are listed 'Best First'.
Re: searching in large file
by NetWallah (Canon) on Jan 06, 2018 at 06:33 UTC
    Since your specifications are quite fuzzy, I made assumptions toward simplicity, and offer the following code:
    use strict;
    use warnings;

    # Open the files specified in @ARGV
    $ARGV[0] or die "ERROR: No file for 114 lines";
    $ARGV[1] or die "ERROR: No file for 300K lines";
    $ARGV[2] or die "ERROR: No Output file name";
    open my $smallfile, "<", $ARGV[0] or die "ERROR: Could not open small file $ARGV[0]: $!";
    open my $bigfile,   "<", $ARGV[1] or die "ERROR: Could not open large file $ARGV[1]: $!";
    open my $outfile,   ">", $ARGV[2] or die "ERROR: Could not open output file $ARGV[2]: $!";

    my $search_expression = "";
    while (<$smallfile>) {
        chomp;
        next unless length;            # skip if empty
        $search_expression .= "\Q$_\E|";
    }
    close $smallfile;
    chop $search_expression;           # delete extra "|"
    $search_expression = qr($search_expression);

    my $found_lines = 0;
    # Perform the search
    while (<$bigfile>) {
        next unless m/$search_expression/;
        # We have a matching line
        print $outfile $_;
        $found_lines++;
    }
    close $bigfile;
    close $outfile;
    print "Created $found_lines lines of output in $ARGV[2]\n";

                    We're living in a golden age. All you need is gold. -- D.W. Robertson.

      What excellent code. Thank you, sir! I timed it, and it took less than 7 seconds to complete the process. Unbelievable... If it's not too much to ask, I have a few more questions:

      1. Is while (<$smallfile>) the same as reading the file while not at end of file?
      2. Kindly explain or put a comment on this expression: $search_expression .= "\Q$_\E|";
      3. Also this one: $search_expression = qr($search_expression);
      4. And: next unless m/$search_expression/;

      Respectfully yours, Sabas

        Hello sabas

        To answer your questions.

        1. Yes.
        2. In Perl, these metacharacters need to be escaped if they are to be matched literally: \ | ( ) [ { ^ $ * + ? . (also called "the dirty dozen"). By using \Q ... \E, you escape any possible metacharacters in the variable being interpolated into the regular expression, in this case $_.
        3. He is compiling the regular expression $search_expression. From "Regexp Quote-Like Operators": precompilation of the pattern into an internal representation at the moment of qr() avoids the need to recompile the pattern every time a match /$pat/ is attempted. So this avoids compiling the regular expression each time it is encountered in the while loop below: next unless m/$search_expression/;
        4. Go to the top of the while loop and get the next line, unless the regular expression matches this line. This skips the lines of code below whenever the line does not match.
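        A minimal sketch of points 2 and 3 above (the string and patterns here are made up for illustration):

```perl
use strict;
use warnings;

my $raw = 'a.b+c';               # contains the metacharacters '.' and '+'
my $pat = qr/\Q$raw\E/;          # \Q...\E escapes them; qr// compiles the pattern once

my $lit  = ( 'xa.b+cy' =~ $pat ); # true: matches the literal string "a.b+c"
my $wild = ( 'aXbYc'   =~ $pat ); # false: '.' is NOT a wildcard inside \Q...\E
print "literal=", ( $lit ? 1 : 0 ), " wildcard=", ( $wild ? 1 : 0 ), "\n";
```

Without \Q...\E, the pattern a.b+c would instead match strings like "aXbbc", because '.' and '+' would keep their regex meanings.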
Re: searching in large file
by karlgoethebier (Abbot) on Jan 06, 2018 at 12:24 UTC

    If I guessed your specs right, you are after an intersection.

    Hence something like this using Set::Scalar and Path::Tiny might be a way to go:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Set::Scalar;
    use Path::Tiny;
    use Data::Dump;
    use feature qw(say);

    say path("file01.txt")->slurp_utf8;
    say path("file02.txt")->slurp_utf8;

    my @s = path("file01.txt")->lines_utf8( { chomp => 1 } );
    my @t = path("file02.txt")->lines_utf8( { chomp => 1 } );

    dd \@s;
    dd \@t;

    my $s = Set::Scalar->new(@s);
    my $t = Set::Scalar->new(@t);
    my $i = $s->intersection($t);
    say $i;

    path("out.txt")->spew_utf8( map { "$_\n" } @$i );
    print path("out.txt")->slurp_utf8;

    __END__
    karls-mac-mini:sabas karl$ ./sabas.pl
    foo
    bar
    donald

    nose
    cuke
    donald

    ["foo", "bar", "donald"]
    ["nose", "cuke", "donald"]
    (donald)
    donald

    OK, admittedly very simplified. And 300K isn't really large, IMHO. Threads From Hell #2: How To Search A Very Huge File [SOLVED] might also be of interest. If you want to get rid of this @ARGV stuff, see Re: GetOpt Organization. Some might say that I repeat myself ;-)

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Re: searching in large file
by Laurent_R (Canon) on Jan 06, 2018 at 10:33 UTC
    Hi sabas,

    assuming you want to find the lines of file1 that are in file2 (that's what you say at the beginning of your post), perhaps something like this:

    my %hash;
    open my $IN2, "<", $file2 or die "could not open $file2 $!";
    while ( my $line = <$IN2> ) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip empty lines
        $hash{$line} = 1;
    }
    close $IN2;

    open my $IN1, "<", $file1 or die "could not open $file1 $!";
    open my $OUT, ">", $file3 or die "could not open $file3 $!";
    while ( my $line = <$IN1> ) {    # note: <$IN1>, not <IN1>
        chomp $line;
        next if $line =~ /^\s*$/;
        print $OUT "$line\n" if exists $hash{$line};
    }
    close $IN1;
    close $OUT;
    If you want to find the lines of file2 that are in file1 (as you seem to imply later in your post), then just swap file1 and file2.

    In both cases, it will be pretty fast because a hash lookup is fast (much faster than scanning an entire array each time through the input loop). You could probably do it without the two chomps, but I feel it's a bit safer to have them.

Re: searching in large file
by thanos1983 (Parson) on Jan 06, 2018 at 14:27 UTC

    Hello sabas,

    Welcome to the Monastery. The fellow Monks have already suggested solutions to your question. I just wanted to add one more by using a module.

    A few days ago a similar question was asked: find common data in multiple files. I think the best approach is to use the module setop to compare the files, and to capture its output easily you can use IPC::System::Simple.

    Sample of code below:

    I do not know if this module is the fastest or the most efficient solution, but you can Benchmark it.
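    As a rough sketch of what such a timing comparison could look like with the core Benchmark module: the data below is made up to stand in for the two files, and the two strategies compared (one big alternation regex, as in an earlier reply, versus a hash lookup on the first field) are only examples, not the poster's actual code.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up stand-ins for the two files; the zero-padded keys are chosen
# so that no key is a substring of another, keeping both approaches fair.
my @needles = map { sprintf "id%05d", $_ } 1 .. 114;
my @lines   = map { sprintf "id%05d col2 col3", $_ } 1 .. 5000;

# Strategy 1: one precompiled alternation of all 114 quoted keys
my $re = do { my $alt = join '|', map { quotemeta } @needles; qr/$alt/ };

# Strategy 2: a hash keyed on the search strings, looked up per line
my %hash = map { $_ => 1 } @needles;

my $regex_count = grep { /$re/ } @lines;
my $hash_count  = grep { exists $hash{ ( split ' ' )[0] } } @lines;
print "regex=$regex_count hash=$hash_count\n";    # both find the same 114 lines

cmpthese( 200, {
    regex => sub { my $n = grep { /$re/ } @lines },
    hash  => sub { my $n = grep { exists $hash{ ( split ' ' )[0] } } @lines },
} );
```

Note the two strategies are only equivalent here because each key occupies a whole field; a substring match and an exact hash lookup can disagree on other data.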

    Input data that I used to replicate your question, file1.txt

    Input data that I used to replicate your question, file2.txt

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: searching in large file
by Anonymous Monk on Jan 06, 2018 at 02:52 UTC

    Hi,

    Loading the 114 lines into a hash, then checking whether the lines in the 300k file are present in the hash, will get you started.
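    A minimal sketch of that idea; in-memory filehandles stand in here for the two real files, and the sample contents are made up:

```perl
use strict;
use warnings;

# Sample data standing in for file1 (114 lines) and file2 (~300K lines)
my $file1 = "alpha\nbeta\n";
my $file2 = "alpha x y\nalpha\ngamma\nbeta\n";

# Load the small file's lines into a hash...
open my $small, '<', \$file1 or die $!;
my %wanted;
while ( my $line = <$small> ) {
    chomp $line;
    $wanted{$line} = 1;    # each line becomes a hash key
}
close $small;

# ...then check each line of the big file with an O(1) lookup
open my $big, '<', \$file2 or die $!;
my @hits;
while ( my $line = <$big> ) {
    chomp $line;
    push @hits, $line if exists $wanted{$line};
}
close $big;
print "@hits\n";    # "alpha x y" does not hit: the lookup is a whole-line match
```

Unlike the index() test in the original post, a hash lookup matches the entire line exactly, which may or may not be what is wanted here.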

    J.C.

      What you suggest means that you are trying to figure out whether the lines of file 2 are in file 1. Although it is not entirely clear, the OP apparently was trying to know whether the lines of file 1 are in file 2. So, the other way around.

      So it's probably the second file that should be loaded into a hash (which is not a problem despite the second file being significantly larger, as it is not that large), and then one should lookup each line of file 1 in the hash.

      Update: actually, it is really not clear in which way the OP wants to search: at one point, it says one way, and at another point, it seems to say the other way around.

Approved by stevieb
Front-paged by Corion