PerlMonks
searching in large file

by sabas (Acolyte)
on Jan 06, 2018 at 00:23 UTC ( [id://1206791]=perlquestion )

sabas has asked for the wisdom of the Perl Monks concerning the following question:

I have two text files: the first contains 114 lines, and the second contains ~300K lines of 277 columns each. I stored the 114 lines in an array and would like to search for each of them in the ~300K-line file. What is the best and quickest approach? I want to use three command-line arguments: $ARGV[0] is my first file with the 114 lines, $ARGV[1] is my second file with the ~300K lines, and $ARGV[2] is the output file, where I write each line of file 2 in which a match is found. I appreciate the help; I am about one month into Perl scripting. I did not use $ARGV[1] and $ARGV[2] yet, to minimize my debugging, so feel free to rewrite my code. How can I check for the presence of each of the 114 lines in the ~300K lines without repeatedly re-reading the ~300K-line file and seeking the filehandle back to 0, 0 again?

    use strict;
    use warnings;
    my @sn;                 # store the lines of file1.txt into an array
    my $i    = 0;
    my $lctr = 0;
    my $flag = 1;
    while (<>) {            # read the arg input file "file1.txt" that contains info
        push @sn, split ' ';    # store each line to array
        print "sn[$i]=$sn[$i]\n";
        $i++;
    }
    $sn[$i] = "END";        # mark the end of the array
    print $sn[$i];
    my $wait = <STDIN>;
    my $filename = 'file2.txt';
    open( my $fh, '<:encoding(UTF-8)', $filename ) or die;
    $i = 0;
    while ( $flag == 1 ) {
        while ( my $row = <$fh> ) {
            chomp $row;
            print "Searching for $sn[$i]....";
            if ( index( $row, $sn[$i] ) != -1 ) {
                print $row;
                print "Found $sn[$i]\n";
                my $wait = <STDIN>;
                $i++;
                seek FH, 0, 0;
            }
            $lctr++;
            if ( $sn[$i] eq 'END' ) {
                $flag = 0;
                last;
            }
        }
    }

Replies are listed 'Best First'.
Re: searching in large file
by NetWallah (Canon) on Jan 06, 2018 at 06:33 UTC
    Since your specifications are quite fuzzy, I made assumptions toward simplicity, and offer the following code:
    use strict;
    use warnings;

    # Open the files specified in @ARGV
    $ARGV[0] or die "ERROR: No file for 114 lines";
    $ARGV[1] or die "ERROR: No file for 300K lines";
    $ARGV[2] or die "ERROR: No Output file name";
    open my $smallfile, "<", $ARGV[0] or die "ERROR: Could not open small file $ARGV[0]: $!";
    open my $bigfile,   "<", $ARGV[1] or die "ERROR: Could not open large file $ARGV[1]: $!";
    open my $outfile,   ">", $ARGV[2] or die "ERROR: Could not open output file $ARGV[2]: $!";

    my $search_expression = "";
    while (<$smallfile>) {
        chomp;
        next unless length;            # skip if empty
        $search_expression .= "\Q$_\E|";
    }
    close $smallfile;
    chop $search_expression;           # delete extra "|"
    $search_expression = qr($search_expression);

    my $found_lines = 0;
    # Perform the search
    while (<$bigfile>) {
        next unless m/$search_expression/;
        # We have a matching line
        print $outfile $_;
        $found_lines++;
    }
    close $bigfile;
    close $outfile;
    print "Created $found_lines lines of output in $ARGV[2]\n";

                    We're living in a golden age. All you need is gold. -- D.W. Robertson.

      What excellent code. Thank you, sir! I timed it, and it took less than 7 seconds to complete the process. Unbelievable... If it's not too much to ask, I have a few more questions:

      1. Is while (<$smallfile>) the same as reading the file while not at end of file?
      2. Kindly explain or put a comment on this expression: $search_expression .= "\Q$_\E|";
      3. Also this one: $search_expression = qr($search_expression);
      4. And: next unless m/$search_expression/;

      Respectfully yours, Sabas

        Hello sabas

        To answer your questions.

        1. Yes.
        2. In Perl, these metacharacters need to be escaped if they are to be matched literally: \ | ( ) [ { ^ $ * + ? . (also called "the dirty dozen"). By using \Q ... \E, you escape any possible metacharacters in the variable being interpolated into the regular expression, in this case $_.
        3. He is compiling the regular expression $search_expression. From "Regexp Quote-Like Operators": precompilation of the pattern into an internal representation at the moment of qr() avoids the need to recompile the pattern every time a match /$pat/ is attempted. So this avoids compiling the regular expression each time it is encountered in the while loop below: next unless m/$search_expression/;
        4. Go to the top of the while loop and get the next line, unless the regular expression matches this line. This skips the lines of code below whenever the line does not match.
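        A minimal sketch of points 2 and 3 above (the string and patterns here are made up for illustration):

```perl
use strict;
use warnings;

my $raw = 'a.b+c';               # contains the metacharacters '.' and '+'
my $pat = qr/\Q$raw\E/;          # \Q...\E escapes them; qr// compiles the pattern once

my $lit  = ( 'xa.b+cy' =~ $pat ); # true: matches the literal string "a.b+c"
my $wild = ( 'aXbYc'   =~ $pat ); # false: '.' is NOT a wildcard inside \Q...\E
print "literal=", ( $lit ? 1 : 0 ), " wildcard=", ( $wild ? 1 : 0 ), "\n";
```

Without \Q...\E, the pattern a.b+c would instead match strings like "aXbbc", because '.' and '+' would keep their regex meanings.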
Re: searching in large file
by karlgoethebier (Abbot) on Jan 06, 2018 at 12:24 UTC

    If I guessed your specs right, you are after an intersection.

    Hence something like this using Set::Scalar and Path::Tiny might be a way to go:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Set::Scalar;
    use Path::Tiny;
    use Data::Dump;
    use feature qw(say);

    say path("file01.txt")->slurp_utf8;
    say path("file02.txt")->slurp_utf8;

    my @s = path("file01.txt")->lines_utf8( { chomp => 1 } );
    my @t = path("file02.txt")->lines_utf8( { chomp => 1 } );

    dd \@s;
    dd \@t;

    my $s = Set::Scalar->new(@s);
    my $t = Set::Scalar->new(@t);
    my $i = $s->intersection($t);
    say $i;

    path("out.txt")->spew_utf8( map { "$_\n" } @$i );
    print path("out.txt")->slurp_utf8;

    __END__
    karls-mac-mini:sabas karl$ ./sabas.pl
    foo
    bar
    donald

    nose
    cuke
    donald

    ["foo", "bar", "donald"]
    ["nose", "cuke", "donald"]
    (donald)
    donald

    OK, admittedly very simplified. And 300K isn't really large, IMHO. Threads From Hell #2: How To Search A Very Huge File [SOLVED] might also be of interest. If you want to get rid of this @ARGV stuff, see Re: GetOpt Organization. Some might say that I repeat myself ;-)

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Re: searching in large file
by Laurent_R (Canon) on Jan 06, 2018 at 10:33 UTC
    Hi sabas,

    assuming you want to find the lines of file1 that are in file2 (that's what you say at the beginning of your post), perhaps something like this:

    my %hash;
    open my $IN2, "<", $file2 or die "could not open $file2 $!";
    while ( my $line = <$IN2> ) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip empty lines
        $hash{$line} = 1;
    }
    close $IN2;

    open my $IN1, "<", $file1 or die "could not open $file1 $!";
    open my $OUT, ">", $file3 or die "could not open $file3 $!";
    while ( my $line = <$IN1> ) {    # note: <$IN1>, not <IN1>
        chomp $line;
        next if $line =~ /^\s*$/;
        print $OUT "$line\n" if exists $hash{$line};
    }
    close $IN1;
    close $OUT;
    If you want to find the lines of file2 that are in file1 (as you seem to imply later in your post), then just swap file1 and file2.

    In both cases, it will be pretty fast because a hash lookup is fast (much faster than scanning an entire array each time through the input loop). You could probably do it without the two chomps, but I feel it's a bit safer to have them.

Re: searching in large file
by thanos1983 (Parson) on Jan 06, 2018 at 14:27 UTC

    Hello sabas,

    Welcome to the Monastery. The fellow Monks have already suggested solutions to your question. I just wanted to add one more by using a module.

    A few days ago a similar question was asked: find common data in multiple files. I think the best approach is to use the module setop to compare the files, and to capture its output easily you can use IPC::System::Simple.

    Sample of code below:

    I do not know if this module is the fastest or the most efficient solution, but you can Benchmark it.
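    As a rough sketch of what such a timing comparison could look like with the core Benchmark module: the data below is made up to stand in for the two files, and the two strategies compared (one big alternation regex, as in an earlier reply, versus a hash lookup on the first field) are only examples, not the poster's actual code.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up stand-ins for the two files; the zero-padded keys are chosen
# so that no key is a substring of another, keeping both approaches fair.
my @needles = map { sprintf "id%05d", $_ } 1 .. 114;
my @lines   = map { sprintf "id%05d col2 col3", $_ } 1 .. 5000;

# Strategy 1: one precompiled alternation of all 114 quoted keys
my $re = do { my $alt = join '|', map { quotemeta } @needles; qr/$alt/ };

# Strategy 2: a hash keyed on the search strings, looked up per line
my %hash = map { $_ => 1 } @needles;

my $regex_count = grep { /$re/ } @lines;
my $hash_count  = grep { exists $hash{ ( split ' ' )[0] } } @lines;
print "regex=$regex_count hash=$hash_count\n";    # both find the same 114 lines

cmpthese( 200, {
    regex => sub { my $n = grep { /$re/ } @lines },
    hash  => sub { my $n = grep { exists $hash{ ( split ' ' )[0] } } @lines },
} );
```

Note the two strategies are only equivalent here because each key occupies a whole field; a substring match and an exact hash lookup can disagree on other data.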

    Input data that I used to replicate your question, file1.txt

    Input data that I used to replicate your question, file2.txt

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: searching in large file
by Anonymous Monk on Jan 06, 2018 at 02:52 UTC

    Hi,

    Loading the 114 lines into a hash, then checking whether the lines in the 300k file are present in the hash, will get you started.
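    A minimal sketch of that idea; in-memory filehandles stand in here for the two real files, and the sample contents are made up:

```perl
use strict;
use warnings;

# Sample data standing in for file1 (114 lines) and file2 (~300K lines)
my $file1 = "alpha\nbeta\n";
my $file2 = "alpha x y\nalpha\ngamma\nbeta\n";

# Load the small file's lines into a hash...
open my $small, '<', \$file1 or die $!;
my %wanted;
while ( my $line = <$small> ) {
    chomp $line;
    $wanted{$line} = 1;    # each line becomes a hash key
}
close $small;

# ...then check each line of the big file with an O(1) lookup
open my $big, '<', \$file2 or die $!;
my @hits;
while ( my $line = <$big> ) {
    chomp $line;
    push @hits, $line if exists $wanted{$line};
}
close $big;
print "@hits\n";    # "alpha x y" does not hit: the lookup is a whole-line match
```

Unlike the index() test in the original post, a hash lookup matches the entire line exactly, which may or may not be what is wanted here.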

    J.C.

      What you suggest means that you are trying to figure out whether the lines of file 2 are in file 1. Although it is not entirely clear, the OP apparently was trying to know whether the lines of file 1 are in file 2. So, the other way around.

      So it's probably the second file that should be loaded into a hash (which is not a problem despite the second file being significantly larger, as it is not that large), and then one should lookup each line of file 1 in the hash.

      Update: actually, it is really not clear in which way the OP wants to search: at one point, it says one way, and at another point, it seems to say the other way around.

Approved by stevieb
Front-paged by Corion