need help reading through two large files and output matches and non matches

jlctx has asked for the wisdom of the Perl Monks concerning the following question:

I have two files that are several hundred MB in size. Each file contains a sorted key, one per line. What I would like to do is read the files one line at a time and compare the keys. If a match is found, write it out to a “matched” file; otherwise write it to one of two “nonmatch” files. The main thing I need to avoid is reading the files into memory. I think I understand how I want to test the condition; what I don't understand is how the looping will work.

input files
file0    file1    
1         2    
2         4
3         5
5         8
7         9
9

logical test
1<=>2    -1    next 0, keep 1, print to nomatch0
2<=>2     0    next 0, next 1, print to match
3<=>4    -1    next 0, keep 1, print to nomatch0
5<=>4     1    keep 0, next 1, print to nomatch1
5<=>5     0    next 0, next 1, print to match
7<=>8    -1    next 0, keep 1, print to nomatch0
9<=>8     1    keep 0, next 1, print to nomatch1
9<=>9     0    next 0, next 1, print to match

output files
matched    nomatch0   nomatch1
2          1           4
5          3           8
9          7


Replies are listed 'Best First'.
Re: need help reading through two large files and output matches and non matches by roboticus (Chancellor) on Apr 21, 2007 at 03:05 UTC
jlctx: I recently posted a bit of code that does just this, in node Re: How to deal with Huge data. I hope this helps! ...roboticus
Re: need help reading through two large files and output matches and non matches by GrandFather (Saint) on Apr 21, 2007 at 03:12 UTC
The following may get you started:

use strict;
use warnings;

open TEST, '>', 'test1.txt' or die "Can't create test1.txt: $!";
print TEST <<DATA;
1
2
3
5
7
9
DATA
close TEST;

open TEST, '>', 'test2.txt' or die "Can't create test2.txt: $!";
print TEST <<DATA;
2
4
5
8
9
DATA
close TEST;

open IN1, '<', 'test1.txt' or die "Can't open test1.txt: $!";
open IN2, '<', 'test2.txt' or die "Can't open test2.txt: $!";

my $in1Line = <IN1>;
my $in2Line = <IN2>;

while (defined $in1Line or defined $in2Line) {
    # The defined check on $in1Line keeps an exhausted test1.txt from being
    # compared as 0, which would loop forever while test2.txt still has lines
    if (! defined $in2Line or (defined $in1Line and $in1Line < $in2Line)) {
        print "No match from test1.txt: $in1Line";
        $in1Line = <IN1>;
    } elsif (! defined $in1Line or $in2Line < $in1Line) {
        print "No match from test2.txt: $in2Line";
        $in2Line = <IN2>;
    } else {
        # match
        print "Match: $in1Line";
        $in1Line = <IN1>;
        $in2Line = <IN2>;
    }
}

close IN1;
close IN2;

Prints:

No match from test1.txt: 1
Match: 2
No match from test1.txt: 3
No match from test2.txt: 4
Match: 5
No match from test1.txt: 7
No match from test2.txt: 8
Match: 9

DWIM is Perl's answer to Gödel
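For what it's worth, here is a minimal sketch of how the same lock-step loop could write to the three files named in the original question (matched, nomatch0, nomatch1), using `<=>` as in the question's logic table. The sub names `next_key`, `compare_sorted` and `main` are made up for illustration. The sketch assumes numeric keys, one per line, sorted ascending, with no duplicates inside either file; for string keys, `cmp` would replace `<=>`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the next key from a handle; returns undef once the file is exhausted.
sub next_key {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Walk two sorted handles in step and call one callback per key.
# Only one line from each file is held in memory at any time.
sub compare_sorted {
    my ($fh0, $fh1, $on_match, $on_only0, $on_only1) = @_;

    my $k0 = next_key($fh0);
    my $k1 = next_key($fh1);

    while (defined $k0 or defined $k1) {
        # An exhausted file is treated as sorting after everything else.
        my $cmp = !defined($k1) ? -1
                : !defined($k0) ?  1
                :                  ($k0 <=> $k1);   # assumption: numeric keys

        if ($cmp < 0) {          # key only in file 0: next 0, keep 1
            $on_only0->($k0);
            $k0 = next_key($fh0);
        }
        elsif ($cmp > 0) {       # key only in file 1: keep 0, next 1
            $on_only1->($k1);
            $k1 = next_key($fh1);
        }
        else {                   # match: next 0, next 1
            $on_match->($k0);
            $k0 = next_key($fh0);
            $k1 = next_key($fh1);
        }
    }
    return;
}

sub main {
    my ($file0, $file1) = @_;

    open my $in0, '<', $file0 or die "Can't open $file0: $!\n";
    open my $in1, '<', $file1 or die "Can't open $file1: $!\n";
    open my $matched,  '>', 'matched'  or die "Can't open matched: $!\n";
    open my $nomatch0, '>', 'nomatch0' or die "Can't open nomatch0: $!\n";
    open my $nomatch1, '>', 'nomatch1' or die "Can't open nomatch1: $!\n";

    compare_sorted(
        $in0, $in1,
        sub { print {$matched}  "$_[0]\n" },
        sub { print {$nomatch0} "$_[0]\n" },
        sub { print {$nomatch1} "$_[0]\n" },
    );

    for my $out ($matched, $nomatch0, $nomatch1) {
        close $out or die "Can't close output file: $!\n";
    }
    return;
}

# Run only when executed directly with two file names.
unless (caller) {
    if (@ARGV == 2) { main(@ARGV) }
    else            { warn "Usage: $0 file0 file1\n" }
}
```

Saved under a hypothetical name such as compare.pl, it would be run as `perl compare.pl file0 file1`. Passing callbacks keeps the comparison loop separate from the decision about where each key goes, so the same sub can print to the screen or collect keys in arrays instead.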
Re: need help reading through two large files and output matches and non matches by thezip (Vicar) on Apr 21, 2007 at 03:16 UTC
The problem you describe has a high degree of commonality with a problem I was presented with in a university Java Data Structures/Sorting Algorithms class. The algorithm you seek to implement is a variant of an "external merge sort". I'll leave the details for you to implement. Good luck on your assignment!

Update: Removed irrelevant statement regarding merging into a single file

Where do you want them* to go today?*
Re: need help reading through two large files and output matches and non matches by naikonta (Curate) on Apr 21, 2007 at 03:40 UTC
WARNING: Untested code follows.

#---- find-match.pl------
#!/usr/bin/perl
use strict;
use warnings;

my %match_files = (
    m1 => 'match',
    m2 => 'nomatch0',
    m3 => 'nomatch1',
);

my %match_fh;
for (keys %match_files) {
    open my $fh, '>', $match_files{$_}
        or die "Can't open $match_files{$_}: $!\n";
    $match_fh{$_} = $fh;
}

while (<>) {
    my $match_result = test_for_match($_); # whatever it is!
    my $fh = $match_fh{$match_result};
    unless ($fh) { # well, just in case
        warn "File $ARGV at line $. returns unexpected match result: $match_result. Skipped\n";
        # or put to another file?
        next;
    }
    print {$fh} $_; # a bare "print $fh;" would print the handle itself to STDOUT
    close(ARGV) if eof(ARGV);
}

Execution:

$ perl find-match.pl file0 file1 fileN

The diamond operator (`<>`) reads the files one line at a time instead of loading their whole content into memory at once. The default variable `$_` holds the current line in each iteration. Once all files are out of lines, `<>` returns undef and the while loop terminates. The last line in the loop is only cosmetic: closing ARGV at the end of each file resets `$.`, so the warning reports the correct line number within the file currently being processed. You might not need either the last line or the unless clause.

After reading this, you may want to go to at least one of open, perlop, perlvar, and some of the perlfaqs, though I think you should have gone there first. A simple search or Super Search on the term "read file" will turn up a bunch of nodes discussing this issue. You may also want to dig a bit into the Q&A section, especially the files category. Well, I guess I'm in the mood for something.

Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!
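To see what that last line in the loop buys you, here is a small sketch of `<>` together with `$ARGV`, `$.` and `close ARGV if eof`. The sub name `label_lines` is made up for illustration; it just tags every line with the file it came from and its line number within that file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return "file:line: text" labels for every line of the named files,
# reading them one line at a time through the diamond operator.
sub label_lines {
    my @files = @_;
    return () unless @files;   # with an empty @ARGV, <> would read STDIN

    local @ARGV = @files;      # <> works through the files named in @ARGV
    my @labels;
    while (my $line = <>) {
        chomp $line;
        push @labels, "$ARGV:$.: $line";
        close ARGV if eof;     # end of the current file: reset $. to 0
    }
    return @labels;
}

# Print the labels only when the script is executed directly.
unless (caller) {
    print "$_\n" for label_lines(@ARGV);
}
```

Without the `close ARGV if eof` line, `$.` keeps counting across files, so the first line of the second file would be reported with the line number following the last line of the first file.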
Re: need help reading through two large files and output matches and non matches by Krambambuli (Curate) on Apr 21, 2007 at 07:06 UTC
It may also be worth considering File::Sort, File::MergeSort, Sort::Key::Merger or Sort::Merge. There might be other suitable modules too. CPAN is stuffed with useful things, so it's always worth checking.

