PerlMonks
Combining Files

by ImpalaSS (Monk)
on Jan 23, 2001 at 19:18 UTC [id://53698]

ImpalaSS has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
My boss came to me with a large assignment today, and the first step involves combining the data from 2 different files using a number common in both as an index.
The files look as follows
1st file:

    20722|20723|3 1 6 0 1 2484 19488|
    9400|9401|3 1 6 0 1 2484 14087|
    9400|9402|3 1 6 0 1 2484 14088|
    16582|16583|3 1 6 0 1 2484 19239|
    16582|16584|3 1 6 0 1 2484 19240|
    20722|20724|3 1 6 0 1 2484 19489|
    27347|27348|3 1 6 0 1 160 14342|
    25408|25409|3 1 6 0 1 2484 37107|
    25408|25410|3 1 6 0 1 2484 37108|

Second file:

    6724|20722|pa0598KingsManor|26870|567|
    6871|18377|pa0179Ithan|26386|440|
    22372|22373|SITE-0|||
    5293|9079|nj5056Hammonton|166|88|
    19435|30497|nj5007Paulsboro|46|3|
    6871|23106|pa0210Manatawny|26363|636|
    14862|23026|de0742CrossKeys|27865|514|
    1362|29489|SITE-0|||
    14862|18468|pa0376Dartmouth|27345|424|
    10469|29899|PPA051P_WestGate|26885|706|
    6871|30955|pa0075Cedarville|322|163|
This is how the files have to be combined: the first column of the first file corresponds to the second column of the second file. What I need help with is printing the new file as follows: the second column from the first file, then the first column of the first file, then the corresponding third column from the second file.

So, for example,
 20722|20723|3 1 6 0 1 2484 19488| from the first file corresponds to the
6724|20722|pa0598KingsManor|26870|567| in the second file, because the first column in file 1 matches the second column in file 2. In this case, I would want 20723|20722|pa0598KingsManor| printed to the new file.

Thank you very much for all your time and help.

Dipul

Replies are listed 'Best First'.
Re: Combining Files
by chipmunk (Parson) on Jan 23, 2001 at 19:56 UTC
    Have you considered using the Unix join command? This is exactly the kind of task that it handles.

    If you don't already have join, you can get a Perl implementation from the Perl Power Tools project.

      Chipmunk,
      Thanks for your assistance. One more question: how would I go about running a Unix command inside of a Perl script? Thanks

      Dipul
        He was offering it as a suggestion to avoid having to write a Perl script to do this at all. You described it as a problem that you need solved, not as a small piece of a puzzle that you're writing a larger Perl script to solve. If it's a problem in and of itself, just use a simple pre-written non-Perl tool to do it, if that's the best thing to do.
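        For what it's worth, Perl gives you several ways to run a Unix command: system() returns the command's exit status, and backticks capture its output. Below is a minimal, self-contained sketch (the file names are invented for the demo) that sorts two sample records from the question and then shells out to join; note that join(1) requires both inputs to be sorted on the join field.

```perl
#!/usr/bin/perl
# Sketch: shelling out to sort(1) and join(1) from Perl.
# The file names here are made up for the demo.
use strict;

# Write one sample record from each of the question's files.
open(F1, ">f1.txt") or die "f1.txt: $!\n";
print F1 "20722|20723|3 1 6 0 1 2484 19488|\n";
close(F1);
open(F2, ">f2.txt") or die "f2.txt: $!\n";
print F2 "6724|20722|pa0598KingsManor|26870|567|\n";
close(F2);

# system() runs a command and returns its exit status.
system("sort -t'|' -k1,1 f1.txt > f1.sorted") == 0 or die "sort failed\n";
system("sort -t'|' -k2,2 f2.txt > f2.sorted") == 0 or die "sort failed\n";

# Backticks capture the command's standard output.
# -1 1/-2 2: join on field 1 of file 1 and field 2 of file 2;
# -o picks the output columns in the order the question asked for.
my $out = `join -t'|' -1 1 -2 2 -o 1.2,1.1,2.3 f1.sorted f2.sorted`;
print $out;   # 20723|20722|pa0598KingsManor
```

        Note that join emits no trailing separator, so you get 20723|20722|pa0598KingsManor rather than the exact 20723|20722|pa0598KingsManor| format; a one-line Perl filter can add it back if that matters.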
Re: Combining Files
by turnstep (Parson) on Jan 23, 2001 at 20:25 UTC
    Update: After talking with ImpalaSS, modified code to handle multiple entries in the first file with the same matching key.


    Quick solution. Not the best way, as it is not very flexible, but it gets the job done. Requires the two files to be joined as the first and second arguments, and the file to be created as the third. Outputs any "unmatched" records.

    #!/dev/null/perl -- :)

    use strict;

    my $first  = shift or die "Need the first file!\n";
    my $second = shift or die "Need the second file!\n";
    my $third  = shift or die "Need the third file!\n";

    ## May as well test them all now
    open(FIRST,  "$first")  or die "Could not open $first: $!\n";
    open(SECOND, "$second") or die "Could not open $second: $!\n";
    open(THIRD,  ">$third") or die "Could not write $third: $!\n";

    my (%first, %found);

    while(<FIRST>) {
      my ($key, @cols) = split(m#\|# => $_, -1);
      push(@{$first{$key}}, \@cols);
      $found{$key} ||= $.;
    }
    close(FIRST);

    while(<SECOND>) {
      my ($one, $key, @cols) = split(m#\|# => $_, -1);
      if (exists $first{$key}) {
        delete $found{$key};
        for (@{$first{$key}}) {
          print THIRD "$_->[0]|$one|$cols[0]\n";
        }
      }
      else {
        printf "Line %5d: Record %5d exists in $second but not in $first\n",
               $., $key;
      }
    }
    close(SECOND);
    close(THIRD);

    ## Double check the first file:
    for (sort {$found{$a} <=> $found{$b}} keys %found) {
      printf "Line %5d: Record %5d exists in $first but not in $second\n",
             $found{$_}, $_;
    }

      Very similar to code I wrote for a similar project. I found that with a huge amount of data it would bog the machine down. The way around this was to work with sorted data (using the Unix sort command) and then buffer the input: read many lines from the first file, then go to the second file and output all the matching lines, keep reading the second file until the buffer fills, go back to the first, and so on.

      Note, I was working with data I knew and I was able to tune the buffering for the machine it ran on.
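      That buffered, sorted approach can be sketched roughly as a merge of two streams. This is a simplified, self-contained version assuming both files are already sorted numerically on the join key (e.g. with sort -t'|' -n); the sample data comes from the question, the file names are made up, and duplicate keys in the second file are not handled here.

```perl
#!/usr/bin/perl
# Sketch: merging two key-sorted streams without loading either into memory.
use strict;

# Build two tiny pre-sorted sample files (names invented for the demo).
open(A, ">a.sorted") or die "a.sorted: $!\n";
print A "9400|9401|3 1 6 0 1 2484 14087|\n",
        "20722|20723|3 1 6 0 1 2484 19488|\n";
close(A);
open(B, ">b.sorted") or die "b.sorted: $!\n";
print B "6724|20722|pa0598KingsManor|26870|567|\n";
close(B);

open(A, "a.sorted") or die "a.sorted: $!\n";
open(B, "b.sorted") or die "b.sorted: $!\n";
my $line1 = <A>;
my $line2 = <B>;
my @out;
while (defined $line1 and defined $line2) {
    my ($akey, $asecond) = (split /\|/, $line1)[0, 1];
    my ($bkey, $bsite)   = (split /\|/, $line2)[1, 2];
    if    ($akey < $bkey) { $line1 = <A> }  # first file is behind: advance it
    elsif ($akey > $bkey) { $line2 = <B> }  # second file is behind: advance it
    else {                                  # keys match: emit, advance the first
        push @out, "$asecond|$akey|$bsite|\n";
        $line1 = <A>;
    }
}
close(A); close(B);
print @out;   # 20723|20722|pa0598KingsManor|
```

      Because each file is read front to back exactly once, memory use stays constant no matter how large the inputs are; that is the property the buffering trick above exploits.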
Re: Combining Files
by Fastolfe (Vicar) on Jan 23, 2001 at 23:42 UTC
    I mean no disrespect to the poster, but has anyone else noticed how many past posts by this person we've all been happy to write code for? There have been a lot of projects that he's been asked to do, and I see a lot of code written by friendly PerlMonks to solve his problems for him.

    Again, I don't mean any offense to anyone, but perhaps we should focus more on helping him solve his problems rather than writing his code for him?

      Fastolfe
      Let me assure you, no offence taken. I was given this job as a co-op from Drexel University, and I didn't know a thing about Perl; why they hired me is beyond me. In any case, mainly through the help of monks like yourself, in a few short months I have gained a better understanding of Perl. Every time some code is posted for me, I take it, break it up, and place a comment after each line, so that I learn from the code rather than just copy-pasting it. However, I completely agree with you in that I would rather be given help towards my goal than the solution.

      Thanks again for everyone's help

      Dipul
Combining Files (code breakdown)
by turnstep (Parson) on Jan 24, 2001 at 20:06 UTC

    This is an explanation of the code I wrote above, at the request of some users. Again, this is not very flexible code, as it depends on always matching the same fields, but is a demonstration of using hashes, anonymous arrays, and references.

    
    
    
    use strict;
    
    ## "use strict": never leave home without it.
    
    my $first  = shift or die "Need the first file!\n";
    my $second = shift or die "Need the second file!\n";
    my $third  = shift or die "Need the third file!\n";
    
    ## Grab the files we want to parse and create, 
    ## using the shift trick to grab items off of the 
    ## @ARGV array.
    
    open(FIRST,  "$first")  or die "Could not open $first: $!\n";
    open(SECOND, "$second") or die "Could not open $second: $!\n";
    open(THIRD,  ">$third") or die "Could not write $third: $!\n";
    
    ## We try and open all three files first, since the first 
    ## operation can be memory/time-intensive, and there is 
    ## no point in doing that if any of the three files cannot 
    ## be read or opened. Hint: use or, not || after an open
    
    
    my (%first, %found);
    
    ## %first will store the information from the first data 
    ## file, and %found is used to keep track of whether or 
    ## not each item in the first data file is used in the 
    ## second.
    
    while(<FIRST>) {
    
      my ($key, @cols) = split(m#\|# => $_, -1);
    
      ## For every line in the first data file, we split it 
      ## on the "pipe" character. We give split an argument 
      ## of -1 to indicate that we do not want trailing null 
      ## fields to be stripped. Once they are split, the first 
      ## item, or field, from the line is put into the variable 
      ## $key, and the rest are put into the array @cols.
    
      ## Example: for the line:
      ## 12345|8432|FooBar||BAZ|||
      ## $key becomes "12345"
      ## @cols becomes ("8432", "FooBar", "", "BAZ", "", "")
    
      push(@{$first{$key}}, \@cols);
    
      ## This is a little tricky. We are pushing a reference to 
      ## the cols array into an anonymous array, which is the 
      ## value for the hash key $key.
    
      ## To break this down a bit:
      ## my $refarray = \@cols; ## reference to cols array
      ## $first{$key} = (); ## anonymous array
      ## push(@{$first{$key}}, $refarray);
    
      ## We use the anonymous array for the cases where more 
      ## than one line has the same key. If we knew that each 
      ## key was unique and never appeared on more than one 
      ## line, we could use:
      ## $first{$key} = \@cols;
      ## But, since we may have multiples, the value of the hash
      ## points to an array. Each item in that array is another 
      ## array, which contains the information that split grabbed.
    
      ## This part can be confusing at first, but it's not so bad 
      ## once you get the hang of references. Remember that every 
      ## hash value, and every array element, can contain exactly 
      ## one and only one piece of information, be it a string, 
      ## a number, or a reference. Think of a reference as a piece 
      ## of string that leads to another container, usually a 
      ## hash or an array.
    
      $found{$key} ||= $.;
    
      ## Finally moving on. :) This stores the current line number 
      ## of the first data file into the hash %found, at the key 
      ## "$key". We could store this into an array as well, since 
      ## the same key may appear on multiple lines, but in this 
      ## case, grabbing only one of the lines is good enough.
      ## We use the ||= so that only the first line in which the 
      ## key appears is saved, and the rest discarded.
    }
    close(FIRST);
    
    ## At this point, the first data file is completely in memory, 
    ## stored in the hash %first.
    
    
    while(<SECOND>) {
    
      ## Now we read in each line of the second file, similar 
      ## to the way we did the first file. Since the first 
      ## data file is in memory, we do not need to store the 
      ## second file into memory, but can simply parse it 
      ## line by line.
    
      my ($one, $key, @cols) = split(m#\|# => $_, -1);
    
      ## In this case, we split the same as before, but use 
      ## the second field as the matching key. The first field 
      ## goes into $one and the rest of the line into @cols.
    
      if (exists $first{$key}) {
        delete $found{$key};
        for (@{$first{$key}}) {
          print THIRD "$_->[0]|$one|$cols[0]\n";
        }
      }
    
      ## First we make sure that this key was seen before, by 
      ## checking to see if the key exists in the hash %first. 
      ## Remember to always use exists and not simply a 
      ## if ($first{$key}) when checking hash elements, as 
      ## testing for existence is not the same as testing 
      ## for "truth".
    
      ## Next, we delete an element from the hash %found. We do 
      ## not need to check for the existence of this one, as 
      ## it is automatically set when $first{$key} is. However, 
      ## note that if we find a line with the same key in the 
      ## second file, the delete will fail as that key has already 
      ## been removed from the hash. This is not a big deal, but 
      ## to be really precise we could have said:
      ## delete $found{$key} if exists $found{$key};
    
      ## Next, we go into the array pointed to by $first{$key}. 
      ## Recall that every line from the first data file with 
      ## that key will create another item in the array. 
      ## Finally, we print out (to the third file, open for writing) 
      ## the first element of the array, which is the *second* 
      ## "pipe-delimited" field of the first data file. Then we 
      ## print the *first* element of the second data file line 
      ## we are currently reading, followed by the third field 
      ## from the second, which is simply the first element of 
      ## the cols array. (We could even make this a simple scalar 
      ## since we never use the rest of this array).
    
      ## That's the main part of the program. The rest is just 
      ## cleanup and error checking
    
      else {
        printf "Line %5d: Record %5d exists in $second but not in $first\n",
               $.,$key;
    
        ## If we find a line in the second data file that does not 
        ## have a match in the first file, we write a message to 
        ## STDOUT telling the line number and key where this occurred.
    
      }
    }
    close(SECOND);
    close(THIRD);
    
    ## Double check the first file:
    for (sort {$found{$a} <=> $found{$b}} keys %found) {
    
      ## We loop through all the remaining items in the %found hash,
      ## sorting them by the line number for easier output. Keys 
      ## that were found in the second data file have been deleted.
    
      printf "Line %5d: Record %5d exists in $first but not in $second\n",
             $found{$_},$_;
    }
    
    

    Hope this helps. Questions welcome.

Re: Combining Files
by Beatnik (Parson) on Jan 23, 2001 at 20:09 UTC
    DBD::CSV can easily handle text-based data files. Access is done by simple SQL statements, making field order pretty simple.
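    A hypothetical sketch of that approach (the table name, column names, and file name are all invented for the demo; it requires DBD::CSV and exits quietly if the module is not installed):

```perl
#!/usr/bin/perl
use strict;
# Sketch only: bail out quietly if DBD::CSV is not available.
eval { require DBI; require DBD::CSV; 1 } or exit 0;

# Write a sample of the first data file for the demo.
open(OUT, ">links.txt") or die "links.txt: $!\n";
print OUT "20722|20723|3 1 6 0 1 2484 19488|\n",
          "9400|9401|3 1 6 0 1 2484 14087|\n",
          "20722|20724|3 1 6 0 1 2484 19489|\n";
close(OUT);

my $dbh = DBI->connect("dbi:CSV:", undef, undef,
    { f_dir => ".", csv_sep_char => "|", RaiseError => 1 });

# The file has no header row, so supply the column names ourselves.
$dbh->{csv_tables}{links} = {
    f_file         => "links.txt",
    col_names      => [qw(cell sector info tail)],
    skip_first_row => 0,
};

my $sth = $dbh->prepare("SELECT sector, cell FROM links WHERE cell = 20722");
$sth->execute;
while (my ($sector, $cell) = $sth->fetchrow_array) {
    print "$sector|$cell|\n";   # each sector paired with cell 20722
}
```

    The appeal of this route is that changing which fields you pull, or which ones you match on, is just an edit to the SQL string rather than to split/index bookkeeping.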

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Combining Files
by salvadors (Pilgrim) on Jan 23, 2001 at 19:28 UTC

    Is this the latest version of a homework assignment? Doing the work that people are actually getting *paid* to do?

    If you can't understand something and need help on a specific point, that's fine, ask away; but please don't ask us to do your job for you.

    Tony

      Well,
      I guess you're right, it did come off that way, but I didn't want you guys to do it for me, just point me in the right direction. I didn't mean to come off that way.
      My apologies

      Dipul
