PerlMonks  

need help reading through two large files and output matches and non matches

by jlctx (Initiate)
on Apr 21, 2007 at 02:34 UTC (#611250=perlquestion)

jlctx has asked for the wisdom of the Perl Monks concerning the following question:

I have two files that are several hundred MB in size. Each file contains a sorted key. What I would like to do is read the files one line at a time. If a match is found, write it out to a “matched” file; otherwise write it to one of two “nonmatch” files. The main thing I need to avoid is reading the files into memory. I think I understand how I want to test the condition; what I don't understand is how the looping will work.

input files

    file0  file1
    1      2
    2      4
    3      5
    5      8
    7      9
    9

logical test

    1<=>2  -1  next 0, keep 1, print to nomatch0
    2<=>2   0  next 0, next 1, print to match
    3<=>4  -1  next 0, keep 1, print to nomatch0
    5<=>4   1  keep 0, next 1, print to nomatch1
    5<=>5   0  next 0, next 1, print to match
    7<=>8  -1  next 0, keep 1, print to nomatch0
    9<=>8   1  keep 0, next 1, print to nomatch1
    9<=>9   0  next 0, next 1, print to match

output files

    matched  nomatch0  nomatch1
    2        1         4
    5        3         8
    9        7

Replies are listed 'Best First'.
Re: need help reading through two large files and output matches and non matches
by roboticus (Chancellor) on Apr 21, 2007 at 03:05 UTC
Re: need help reading through two large files and output matches and non matches
by GrandFather (Saint) on Apr 21, 2007 at 03:12 UTC

    The following may get you started:

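    use strict;
    use warnings;

    # Create the two sample input files
    open TEST, '>', 'test1.txt' or die "Can't write test1.txt: $!";
    print TEST <<DATA;
    1
    2
    3
    5
    7
    9
    DATA
    close TEST;

    open TEST, '>', 'test2.txt' or die "Can't write test2.txt: $!";
    print TEST <<DATA;
    2
    4
    5
    8
    9
    DATA
    close TEST;

    open IN1, '<', 'test1.txt' or die "Can't read test1.txt: $!";
    open IN2, '<', 'test2.txt' or die "Can't read test2.txt: $!";

    # Hold one line from each file; advance whichever side is behind
    my $in1Line = <IN1>;
    my $in2Line = <IN2>;

    while (defined $in1Line or defined $in2Line) {
        # The defined check on $in1Line keeps the loop from spinning
        # when test1.txt runs out before test2.txt
        if (defined $in1Line and (! defined $in2Line or $in1Line < $in2Line)) {
            print "No match from test1.txt: $in1Line";
            $in1Line = <IN1>;
        } elsif (! defined $in1Line or $in2Line < $in1Line) {
            print "No match from test2.txt: $in2Line";
            $in2Line = <IN2>;
        } else {
            # match
            print "Match: $in1Line";
            $in1Line = <IN1>;
            $in2Line = <IN2>;
        }
    }

    close IN1;
    close IN2;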

    Prints:

    No match from test1.txt: 1
    Match: 2
    No match from test1.txt: 3
    No match from test2.txt: 4
    Match: 5
    No match from test1.txt: 7
    No match from test2.txt: 8
    Match: 9
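
For the three output files asked for in the original question, the same walk might be sketched as below. This is only a sketch under stated assumptions: the sub name `split_sorted_files` and its argument order are made up for illustration, the keys are assumed to be numeric with one key per line, and both inputs are assumed to be sorted in ascending order. It uses `<=>` as in the question's table and treats an exhausted file as always being "behind" the other one.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): walk two numerically sorted
# files in step, holding only one line from each in memory at a time.
sub split_sorted_files {
    my ($file0, $file1, $matched, $nomatch0, $nomatch1) = @_;

    open my $in0, '<', $file0 or die "Can't read $file0: $!\n";
    open my $in1, '<', $file1 or die "Can't read $file1: $!\n";
    open my $out_match, '>', $matched  or die "Can't write $matched: $!\n";
    open my $out0,      '>', $nomatch0 or die "Can't write $nomatch0: $!\n";
    open my $out1,      '>', $nomatch1 or die "Can't write $nomatch1: $!\n";

    my $line0 = <$in0>;
    my $line1 = <$in1>;

    while (defined $line0 or defined $line1) {
        # If one file has run out, every remaining line of the other
        # is a non-match; otherwise compare the two keys numerically.
        my $cmp = !defined $line1 ? -1
                : !defined $line0 ?  1
                :                    $line0 <=> $line1;

        if ($cmp < 0) {          # next 0, keep 1
            print {$out0} $line0;
            $line0 = <$in0>;
        }
        elsif ($cmp > 0) {       # keep 0, next 1
            print {$out1} $line1;
            $line1 = <$in1>;
        }
        else {                   # next 0, next 1
            print {$out_match} $line0;
            $line0 = <$in0>;
            $line1 = <$in1>;
        }
    }

    for my $fh ($in0, $in1, $out_match, $out0, $out1) {
        close $fh or die "close failed: $!\n";
    }
    return;
}
```

With the file names from the question, a call would look like `split_sorted_files('file0', 'file1', 'matched', 'nomatch0', 'nomatch1');`. For string keys, `cmp` would take the place of `<=>`, provided the files are sorted in the same string order.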

    DWIM is Perl's answer to Gödel
Re: need help reading through two large files and output matches and non matches
by thezip (Vicar) on Apr 21, 2007 at 03:16 UTC

    The problem you describe has a high degree of commonality with a problem I was presented with in a university Java Data Structures/Sorting Algorithms class. The algorithm you seek to implement is a variant of an "external merge sort".

    I'll leave the details for you to implement.

    Good luck on your assignment!


    Update: Removed irrelevant statement regarding merging into a single file

    Where do you want *them* to go today?
Re: need help reading through two large files and output matches and non matches
by naikonta (Curate) on Apr 21, 2007 at 03:40 UTC
    WARNING: Untested code follows.
    #---- find-match.pl------
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %match_files = (
        m1 => 'match',
        m2 => 'nomatch0',
        m3 => 'nomatch1',
    );

    my %match_fh;
    for (keys %match_files) {
        open my $fh, '>', $match_files{$_}
            or die "Can't open $match_files{$_}: $!\n";
        $match_fh{$_} = $fh;
    }

    while (<>) {
        my $match_result = test_for_match($_); # whatever it is!
        my $fh = $match_fh{$match_result};
        unless ($fh) { # well, just in case
            warn "File $ARGV at line $. returns unexpected match result: $match_result. Skipped\n";
            # or put to another file?
            next;
        }
        # Name $_ explicitly: a bare "print $fh;" would print the
        # handle itself to STDOUT rather than writing $_ to the file
        print {$fh} $_;
        close(ARGV) if eof(ARGV);
    }
    Execution,
    $ perl find-match.pl file0 file1 fileN
    The diamond operator (<>) reads the files named on the command line one line at a time instead of loading their whole content into memory. The default variable $_ holds the current line in each iteration. Once all of the files are out of lines, the operator returns undef and the while loop terminates.

    The last line in the loop is mostly cosmetic: closing ARGV at the end of each file resets $., so the warning reports the correct line number within the file currently being processed. You might not need either the last line or the unless clause.
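
A small sketch of how `<>`, `$ARGV`, `$.` and `eof` work together, assuming nothing beyond core Perl. The sub name `label_lines` is hypothetical, and it collects its results in a list only so the behaviour is easy to inspect; for files of several hundred MB you would print each line as you go instead.

```perl
use strict;
use warnings;

# Hypothetical helper: read the given files one line at a time through
# the diamond operator and label each line as "file:line_number:text".
sub label_lines {
    my @files = @_;
    my @labels;

    local @ARGV = @files;        # <> reads the files named in @ARGV
    while (my $line = <>) {
        chomp $line;
        # $ARGV is the file currently being read, $. its line number
        push @labels, "$ARGV:$.:$line";
        # Closing ARGV at the end of each file resets $. to 0, so the
        # numbering starts over with the next file
        close ARGV if eof;
    }
    return @labels;
}
```

Without the `close ARGV if eof;` line, `$.` would keep counting across file boundaries, so the first line of the second file would not be reported as line 1.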

    After reading this, you may want to visit at least one of open, perlop, perlvar, and some of the perlfaqs, though I think you should have gone there first. A Simple Search or Super Search on the term "read file" will turn up a bunch of nodes discussing this issue. You may also want to dig a bit into the Q&A section, especially the files category.

    Well, I guess I'm in the mood of something.


    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Re: need help reading through two large files and output matches and non matches
by Krambambuli (Curate) on Apr 21, 2007 at 07:06 UTC
