sabas has asked for the wisdom of the Perl Monks concerning the following question:
I have two text files: the first contains 114 lines, and the second contains ~300K lines, each with 277 columns.
I stored the 114 lines in an array and would like to check whether each of them appears in the ~300K-line file. What is the best and quickest approach?
I want to use three command-line arguments: $ARGV[0] is my first file (114 lines), $ARGV[1] is my second file (~300K lines), and $ARGV[2] is the output file where I write each line of file 2 where I found a match.
Appreciate the help, please. I am about one month into Perl scripting. I have not used $ARGV[1] and $ARGV[2] yet, to minimize my debugging; feel free to rewrite my code.
How can I check for the presence of each of the 114 lines in the ~300K-line file without repeatedly re-reading the ~300K lines and seeking the filehandle back to 0 each time?
use strict;
use warnings;

my @sn;        # store the tokens of file1.txt
my $i    = 0;
my $lctr = 0;
my $flag = 1;

while (<>) {               # read the ARGV input file ("file1.txt") that contains the info
    push @sn, split ' ';   # store each whitespace-separated token
    print "sn[$i]=$sn[$i]\n";
    $i++;
}
push @sn, "END";           # mark the end of the array
$i = $#sn;
print $sn[$i];
my $wait = <STDIN>;

my $filename = 'file2.txt';
open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "Cannot open $filename: $!";
$i = 0;
while ($flag == 1) {
    while (my $row = <$fh>) {
        chomp $row;
        print "Searching for $sn[$i]....";
        if (index($row, $sn[$i]) != -1) {
            print $row;
            print "Found $sn[$i]\n";
            my $wait = <STDIN>;
            $i++;
            seek $fh, 0, 0;    # rewind the big file ($fh, not the bareword FH)
        }
        $lctr++;
        if ($sn[$i] eq 'END') {
            $flag = 0;
            last;
        }
    }
}
Re: searching in large file
by NetWallah (Canon) on Jan 06, 2018 at 06:33 UTC
Since your specifications are quite fuzzy, I made assumptions toward simplicity, and offer the following code:
use strict;
use warnings;
# Open the files specified in @ARGV
$ARGV[0] or die "ERROR: No file for 114 lines";
$ARGV[1] or die "ERROR: No file for 300K lines";
$ARGV[2] or die "ERROR: No Output file name";
open my $smallfile, "<", $ARGV[0] or die "ERROR: Could not open small file $ARGV[0]: $!";
open my $bigfile,   "<", $ARGV[1] or die "ERROR: Could not open large file $ARGV[1]: $!";
open my $outfile,   ">", $ARGV[2] or die "ERROR: Could not open output file $ARGV[2]: $!";
my $search_expression = "";
while (<$smallfile>) {
    chomp;
    next unless length;              # skip if empty
    $search_expression .= "\Q$_\E|";
}
close $smallfile;
chop $search_expression;             # Delete extra "|"
$search_expression = qr($search_expression);
my $found_lines = 0;
# Perform the search
while (<$bigfile>) {
    next unless m/$search_expression/;
    # We have a matching line
    print $outfile $_;
    $found_lines++;
}
close $bigfile;
close $outfile;
print "Created $found_lines lines of output in $ARGV[2]\n";
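The same technique scales down to a few lines. A minimal sketch with invented data, showing the \Q…\E-quoted alternation compiled once with qr// (the needle and haystack strings here are made up):

```perl
use strict;
use warnings;

# Made-up needles standing in for the 114 search lines
my @needles = ('foo.1', 'bar', 'baz');

# quotemeta (the function behind \Q...\E) escapes metacharacters such
# as the dot; join '|' builds one alternation; qr// compiles it once.
my $alt = join '|', map { quotemeta } @needles;
my $re  = qr/$alt/;

# Made-up lines standing in for the ~300K-line file
my @big  = ('has foo.1 inside', 'nothing here', 'a bar line');
my @hits = grep { /$re/ } @big;
print "$_\n" for @hits;    # the first and third lines match
```

Because the pattern is compiled once up front, the per-line cost in the big loop is a single regex match rather than 114 separate index() calls.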
We're living in a golden age. All you need is gold. -- D.W. Robertson.
What excellent code. Thank you, sir! I timed it, and it took less than 7 seconds to complete the process! Unbelievable... If it's not too much to ask, I have a few more questions:
1. Is while(<$smallfile>) the same as reading the file while not at end-of-file?
2. Kindly explain or put a comment on this expression: $search_expression .= "\Q$_\E|";
3. Also this one: $search_expression = qr($search_expression);
4. And this one: next unless m/$search_expression/;
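(On question 1: yes, while (<$smallfile>) reads one line per iteration until end-of-file. The other three pieces can be watched in isolation in a tiny script; every string here is invented:)

```perl
use strict;
use warnings;

# Q2: \Q...\E escapes regex metacharacters, so "3.14" matches a
# literal dot instead of "any character".
my $raw = '3.14';
print "wild dot matches\n"   if '3x14' =~ /$raw/;       # fires: . matches x
print "quoted dot matches\n" if '3x14' =~ /\Q$raw\E/;   # does not fire

# Q3: qr// compiles the pattern object once, up front,
# instead of recompiling it on every match.
my $re = qr/\Q$raw\E/;

# Q4: "next unless m/.../" skips every line that does not match.
for my $line ('pi is 3.14', 'no match here') {
    next unless $line =~ $re;
    print "$line\n";    # only the first line is printed
}
```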
Respectfully Yours,
Sabas
Re: searching in large file
by karlgoethebier (Abbot) on Jan 06, 2018 at 12:24 UTC
#!/usr/bin/env perl
use strict;
use warnings;
use Set::Scalar;
use Path::Tiny;
use Data::Dump;
use feature qw (say);
say path("file01.txt")->slurp_utf8;
say path("file02.txt")->slurp_utf8;
my @s = path("file01.txt")->lines_utf8( { chomp => 1 } );
my @t = path("file02.txt")->lines_utf8( { chomp => 1 } );
dd \@s;
dd \@t;
my $s = Set::Scalar->new(@s);
my $t = Set::Scalar->new(@t);
my $i = $s->intersection($t);
say $i;
path("out.txt")->spew_utf8( map { "$_\n" } @$i );
print path("out.txt")->slurp_utf8;
__END__
karls-mac-mini:sabas karl$ ./sabas.pl
foo
bar
donald
nose
cuke
donald
["foo", "bar", "donald"]
["nose", "cuke", "donald"]
(donald)
donald
OK, admittedly very simplified. And 300K lines isn't really large, IMHO. Threads From Hell #2: How To Search A Very Huge File [SOLVED] might also be of interest. If you want to get rid of the @ARGV stuff, see Re: GetOpt Organization. Some might say that I repeat myself ;-)
Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
Re: searching in large file
by Laurent_R (Canon) on Jan 06, 2018 at 10:33 UTC
Hi sabas,
Assuming you want to find the lines of file1 that are in file2 (which is what you say at the beginning of your post), perhaps something like this:
my %hash;
open my $IN2, "<", $file2 or die "could not open $file2 $!";
while (my $line = <$IN2>) {
    chomp $line;
    next if $line =~ /^\s*$/;   # skip empty lines
    $hash{$line} = 1;
}
close $IN2;
open my $IN1, "<", $file1 or die "could not open $file1 $!";
open my $OUT, ">", $file3 or die "could not open $file3 $!";
while (my $line = <$IN1>) {     # note: <$IN1>, not the bareword <IN1>
    chomp $line;
    next if $line =~ /^\s*$/;
    print $OUT "$line\n" if exists $hash{$line};
}
close $IN1;
close $OUT;
If you want to find the lines of file2 that are in file1 (as you seem to imply later in your post), then just swap file1 and file2.
In both cases, it will be pretty fast because a hash lookup is fast (much faster than scanning an entire array each time through the input loop). You could probably do it without the two chomps, but I feel it's a bit safer to have them.
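A rough way to see that difference is the core Benchmark module, comparing one hash probe against a linear scan; the data sizes and strings below are made up:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(first);

# Made-up data: 10_000 lines, with the probe near the end,
# which is close to the worst case for a linear scan.
my @lines = map { "line_$_" } 1 .. 10_000;
my %hash  = map { $_ => 1 } @lines;
my $probe = 'line_9999';

cmpthese(-1, {
    hash_lookup => sub { exists $hash{$probe} },             # O(1) probe
    array_scan  => sub { first { $_ eq $probe } @lines },    # O(n) scan
});
```

On typical hardware the hash lookup should come out several orders of magnitude faster per probe, and the gap only widens as the data grows.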
Re: searching in large file
by thanos1983 (Parson) on Jan 06, 2018 at 14:27 UTC
Hello sabas,
Welcome to the Monastery. The fellow Monks have already suggested solutions to your question. I just wanted to add one more by using a module.
A few days ago a similar question was asked: find common data in multiple files. I think the best approach is to use the setop module to compare the files; to capture its output easily, you can use IPC::System::Simple.
Sample of code below:
I do not know whether this module is the fastest or most efficient solution, but you can Benchmark it.
Input data that I used to replicate your question, file1.txt
Input data that I used to replicate your question, file2.txt
Hope this helps, BR.
Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: searching in large file
by Anonymous Monk on Jan 06, 2018 at 02:52 UTC
Hi,
Loading the 114 lines into a hash, then checking whether the lines in the 300K file are present in the hash, will get you started.
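Sketched out with invented data (a real script would read @small and @big from the two files instead):

```perl
use strict;
use warnings;

my @small = ('alpha', 'beta');            # stand-in for the 114 lines
my @big   = ('beta', 'gamma', 'alpha');   # stand-in for the ~300K lines

# Load the small list into a hash once...
my %wanted = map { $_ => 1 } @small;

# ...then test each big-file line with a constant-time lookup.
my @matches = grep { $wanted{$_} } @big;
print "$_\n" for @matches;    # beta, then alpha
```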
J.C.
What you suggest means that you are trying to figure out whether the lines of file 2 are in file 1. Although it is not entirely clear, the OP apparently was trying to know whether the lines of file 1 are in file 2. So, the other way around.
So it's probably the second file that should be loaded into a hash (not a problem: although the second file is significantly larger, it is still small enough to fit comfortably in memory), and then one should look up each line of file 1 in the hash.
Update: actually, it is really not clear in which way the OP wants to search: at one point, it says one way, and at another point, it seems to say the other way around.