PerlMonks
Combining Files

by ImpalaSS (Monk)
on Jan 23, 2001 at 19:18 UTC [id://53698]

ImpalaSS has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
My boss came to me with a large assignment today, and the first step involves combining the data from 2 different files using a number common in both as an index.
The files look as follows
1st file:

    20722|20723|3 1 6 0 1 2484 19488|
    9400|9401|3 1 6 0 1 2484 14087|
    9400|9402|3 1 6 0 1 2484 14088|
    16582|16583|3 1 6 0 1 2484 19239|
    16582|16584|3 1 6 0 1 2484 19240|
    20722|20724|3 1 6 0 1 2484 19489|
    27347|27348|3 1 6 0 1 160 14342|
    25408|25409|3 1 6 0 1 2484 37107|
    25408|25410|3 1 6 0 1 2484 37108|

Second file:

    6724|20722|pa0598KingsManor|26870|567|
    6871|18377|pa0179Ithan|26386|440|
    22372|22373|SITE-0|||
    5293|9079|nj5056Hammonton|166|88|
    19435|30497|nj5007Paulsboro|46|3|
    6871|23106|pa0210Manatawny|26363|636|
    14862|23026|de0742CrossKeys|27865|514|
    1362|29489|SITE-0|||
    14862|18468|pa0376Dartmouth|27345|424|
    10469|29899|PPA051P_WestGate|26885|706|
    6871|30955|pa0075Cedarville|322|163|
This is how the files have to be combined: the first column of the first file corresponds to the second column of the second file. What I need help with is printing the new file as follows: the second column from the first file, then the first column of the first file, then the corresponding third column from the second file.

So, for example,
 20722|20723|3 1 6 0 1 2484 19488| from the first file corresponds to the
6724|20722|pa0598KingsManor|26870|567| in the second file, because the first column in file 1 matches the second column in file 2. In this case, I would want 20723|20722|pa0598KingsManor| printed to the new file.

Thank you very much for all your time and help.

Dipul

Replies are listed 'Best First'.
Re: Combining Files
by chipmunk (Parson) on Jan 23, 2001 at 19:56 UTC
    Have you considered using the Unix join command? This is exactly the kind of task that it handles.

    If you don't already have join, you can get a Perl implementation from the Perl Power Tools project.

      Chipmunk,
      Thanks for your assistance. One more question: how would I go about running a Unix command inside of a Perl script? Thanks

      Dipul
        He was offering it as a suggestion to avoid having to write a Perl script to do this at all. You described it as a problem that you need solved, not as a small piece of a puzzle that you're writing a larger Perl script to solve. If it's a problem in and of itself, just use a simple pre-written non-Perl tool to do it, if that's the best thing to do.
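        For what it's worth, Perl gives you several ways to run a Unix command: system() returns the command's exit status, and backticks capture its output. Below is a minimal, self-contained sketch (the file names are invented for the demo) that sorts two sample records from the question and then shells out to join; note that join(1) requires both inputs to be sorted on the join field.

```perl
#!/usr/bin/perl
# Sketch: shelling out to sort(1) and join(1) from Perl.
# The file names here are made up for the demo.
use strict;

# Write one sample record from each of the question's files.
open(F1, ">f1.txt") or die "f1.txt: $!\n";
print F1 "20722|20723|3 1 6 0 1 2484 19488|\n";
close(F1);
open(F2, ">f2.txt") or die "f2.txt: $!\n";
print F2 "6724|20722|pa0598KingsManor|26870|567|\n";
close(F2);

# system() runs a command and returns its exit status.
system("sort -t'|' -k1,1 f1.txt > f1.sorted") == 0 or die "sort failed\n";
system("sort -t'|' -k2,2 f2.txt > f2.sorted") == 0 or die "sort failed\n";

# Backticks capture the command's standard output.
# -1 1/-2 2: join on field 1 of file 1 and field 2 of file 2;
# -o picks the output columns in the order the question asked for.
my $out = `join -t'|' -1 1 -2 2 -o 1.2,1.1,2.3 f1.sorted f2.sorted`;
print $out;   # 20723|20722|pa0598KingsManor
```

        Note that join emits no trailing separator, so you get 20723|20722|pa0598KingsManor rather than the exact 20723|20722|pa0598KingsManor| format; a one-line Perl filter can add it back if that matters.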
Re: Combining Files
by turnstep (Parson) on Jan 23, 2001 at 20:25 UTC
    Update: After talking with ImpalaSS, modified code to handle multiple entries in the first file with the same matching key.


    Quick solution. Not the best way, as it is not very flexible, but it gets the job done. Requires the two files to be joined as the first and second arguments, and the file to be created as the third. Outputs any "unmatched" records.

    #!/dev/null/perl -- :)

    use strict;

    my $first  = shift or die "Need the first file!\n";
    my $second = shift or die "Need the second file!\n";
    my $third  = shift or die "Need the third file!\n";

    ## May as well test them all now
    open(FIRST,  "$first")  or die "Could not open $first: $!\n";
    open(SECOND, "$second") or die "Could not open $second: $!\n";
    open(THIRD,  ">$third") or die "Could not write $third: $!\n";

    my (%first, %found);

    while(<FIRST>) {
      my ($key, @cols) = split(m#\|# => $_, -1);
      push(@{$first{$key}}, \@cols);
      $found{$key} ||= $.;
    }
    close(FIRST);

    while(<SECOND>) {
      my ($one, $key, @cols) = split(m#\|# => $_, -1);
      if (exists $first{$key}) {
        delete $found{$key};
        for (@{$first{$key}}) {
          print THIRD "$_->[0]|$one|$cols[0]\n";
        }
      }
      else {
        printf "Line %5d: Record %5d exists in $second but not in $first\n",
               $., $key;
      }
    }
    close(SECOND);
    close(THIRD);

    ## Double check the first file:
    for (sort {$found{$a} <=> $found{$b}} keys %found) {
      printf "Line %5d: Record %5d exists in $first but not in $second\n",
             $found{$_}, $_;
    }

      Very similar to code I wrote for a similar project. I found that with a huge amount of data it would bog the machine down. The way around this was to work with sorted data (using the Unix sort command) and then buffer the input: read many lines from the first file, then go to the second file and output all the matching lines, keep reading the second file until the buffer fills, go back to the first, and so on.

      Note, I was working with data I knew and I was able to tune the buffering for the machine it ran on.
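      That buffered, sorted approach can be sketched roughly as a merge of two streams. This is a simplified, self-contained version assuming both files are already sorted numerically on the join key (e.g. with sort -t'|' -n); the sample data comes from the question, the file names are made up, and duplicate keys in the second file are not handled here.

```perl
#!/usr/bin/perl
# Sketch: merging two key-sorted streams without loading either into memory.
use strict;

# Build two tiny pre-sorted sample files (names invented for the demo).
open(A, ">a.sorted") or die "a.sorted: $!\n";
print A "9400|9401|3 1 6 0 1 2484 14087|\n",
        "20722|20723|3 1 6 0 1 2484 19488|\n";
close(A);
open(B, ">b.sorted") or die "b.sorted: $!\n";
print B "6724|20722|pa0598KingsManor|26870|567|\n";
close(B);

open(A, "a.sorted") or die "a.sorted: $!\n";
open(B, "b.sorted") or die "b.sorted: $!\n";
my $line1 = <A>;
my $line2 = <B>;
my @out;
while (defined $line1 and defined $line2) {
    my ($akey, $asecond) = (split /\|/, $line1)[0, 1];
    my ($bkey, $bsite)   = (split /\|/, $line2)[1, 2];
    if    ($akey < $bkey) { $line1 = <A> }  # first file is behind: advance it
    elsif ($akey > $bkey) { $line2 = <B> }  # second file is behind: advance it
    else {                                  # keys match: emit, advance the first
        push @out, "$asecond|$akey|$bsite|\n";
        $line1 = <A>;
    }
}
close(A); close(B);
print @out;   # 20723|20722|pa0598KingsManor|
```

      Because each file is read front to back exactly once, memory use stays constant no matter how large the inputs are; that is the property the buffering trick above exploits.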
Re: Combining Files
by Fastolfe (Vicar) on Jan 23, 2001 at 23:42 UTC
    I mean no disrespect to the poster, but has anyone else noticed how many past posts by this person we've all been happy to write code for? There have been a lot of projects that he's been asked to do, and I see a lot of code written by friendly PerlMonks to solve his problems for him.

    Again, I don't mean any offense to anyone, but perhaps we should focus more on helping him solve his problems rather than writing his code for him?

      Fastolfe
      Let me assure you, no offence taken. I was given this job as a co-op from Drexel University, and I didn't know a thing about Perl; why they hired me is beyond me. In any case, mainly through the help of monks like yourself, in a few short months I have gained a better understanding of Perl. Every time some code is posted for me, I take it, break it up, and place a comment after each line, so that I learn from the code rather than just copy-pasting it. However, I completely agree with you in that I would rather be given help towards my goal than the solution.

      Thanks again for everyone's help

      Dipul
Combining Files (code breakdown)
by turnstep (Parson) on Jan 24, 2001 at 20:06 UTC

    This is an explanation of the code I wrote above, at the request of some users. Again, this is not very flexible code, as it depends on always matching the same fields, but is a demonstration of using hashes, anonymous arrays, and references.

    
    
    
    use strict;
    
    ## "use strict": never leave home without it.
    
    my $first  = shift or die "Need the first file!\n";
    my $second = shift or die "Need the second file!\n";
    my $third  = shift or die "Need the third file!\n";
    
    ## Grab the files we want to parse and create, 
    ## using the shift trick to grab items off of the 
    ## @ARGV array.
    
    open(FIRST,  "$first")  or die "Could not open $first: $!\n";
    open(SECOND, "$second") or die "Could not open $second: $!\n";
    open(THIRD,  ">$third") or die "Could not write $third: $!\n";
    
    ## We try and open all three files first, since the first 
    ## operation can be memory/time-intensive, and there is 
    ## no point in doing that if any of the three files cannot 
    ## be read or opened. Hint: use or, not || after an open
    
    
    my (%first, %found);
    
    ## %first will store the information from the first data 
    ## file, and %found is used to keep track of whether or 
    ## not each item in the first data file is used in the 
    ## second.
    
    while(<FIRST>) {
    
      my ($key, @cols) = split(m#\|# => $_, -1);
    
      ## For every line in the first data file, we split it 
      ## on the "pipe" character. We give split an argument 
      ## of -1 to indicate that we do not want trailing null 
      ## fields to be stripped. Once they are split, the first 
      ## item, or field, from the line is put into the variable 
      ## $key, and the rest are put into the array @cols.
    
      ## Example: for the line:
      ## 12345|8432|FooBar||BAZ|||
      ## $key becomes "12345"
      ## @cols becomes ("8432", "FooBar", "", "BAZ", "", "")
    
      push(@{$first{$key}}, \@cols);
    
      ## This is a little tricky. We are pushing a reference to 
      ## the cols array into an anonymous array, which is the 
      ## value for the hash key $key.
    
      ## To break this down a bit:
      ## my $refarray = \@cols; ## reference to cols array
      ## $first{$key} = (); ## anonymous array
      ## push(@{$first{$key}}, $refarray);
    
      ## We use the anonymous array for the cases where more 
      ## than one line has the same key. If we knew that each 
      ## key was unique and never appeared on more than one 
      ## line, we could use:
      ## $first{$key} = \@cols;
      ## But, since we may have multiples, the value of the hash
      ## points to an array. Each item in that array is another 
      ## array, which contains the information that split grabbed.
    
      ## This part can be confusing at first, but it's not so bad 
      ## once you get the hang of references. Remember that every 
      ## hash value, and every array element, can contain exactly 
      ## one and only one piece of information, be it a string, 
      ## a number, or a reference. Think of a reference as a piece 
      ## of string that leads to another container, usually a 
      ## hash or an array.
    
      $found{$key} ||= $.;
    
      ## Finally moving on. :) This stores the current line number 
      ## of the first data file into the hash %found, at the key 
      ## "$key". We could store this into an array as well, since 
      ## the same key may appear on multiple lines, but in this 
      ## case, grabbing only one of the lines is good enough.
      ## We use the ||= so that only the first line in which the 
      ## key appears is saved, and the rest discarded.
    }
    close(FIRST);
    
    ## At this point, the first data file is completely in memory, 
    ## stored in the hash %first.
    
    
    while(<SECOND>) {
    
      ## Now we read in each line of the second file, similar 
      ## to the way we did the first file. Since the first 
      ## data file is in memory, we do not need to store the 
      ## second file into memory, but can simply parse it 
      ## line by line.
    
      my ($one, $key, @cols) = split(m#\|# => $_, -1);
    
      ## In this case, we split the same as before, but use 
      ## the second field as the matching key. The first field 
      ## goes into $one and the rest of the line into @cols.
    
      if (exists $first{$key}) {
        delete $found{$key};
        for (@{$first{$key}}) {
          print THIRD "$_->[0]|$one|$cols[0]\n";
        }
      }
    
      ## First we make sure that this key was seen before, by 
      ## checking to see if the key exists in the hash %first. 
      ## Remember to always use exists and not simply a 
      ## if ($first{$key}) when checking hash elements, as 
      ## testing for existence is not the same as testing 
      ## for "truth".
    
      ## Next, we delete an element from the hash %found. We do 
      ## not need to check for the existence of this one, as 
      ## it is automatically set when $first{$key} is. However, 
      ## note that if we find a line with the same key in the 
      ## second file, the delete will fail as that key has already 
      ## been removed from the hash. This is not a big deal, but 
      ## to be really precise we could have said:
      ## delete $found{$key} if exists $found{$key};
    
      ## Next, we go into the array pointed to by $first{$key}. 
      ## Recall that every line from the first data file with 
      ## that key will create another item in the array. 
      ## Finally, we print out (to the third file, open for writing) 
      ## the first element of the array, which is the *second* 
      ## "pipe-delimited" field of the first data file. Then we 
      ## print the *first* element of the second data file line 
      ## we are currently reading, followed by the third field 
      ## from the second, which is simply the first element of 
      ## the cols array. (We could even make this a simple scalar 
      ## since we never use the rest of this array).
    
      ## That's the main part of the program. The rest is just 
      ## cleanup and error checking
    
      else {
        printf "Line %5d: Record %5d exists in $second but not in $first\n",
               $.,$key;
    
        ## If we find a line in the second data file that does not 
        ## have a match in the first file, we write a message to 
        ## STDOUT telling the line number and key where this occurred.
    
      }
    }
    close(SECOND);
    close(THIRD);
    
    ## Double check the first file:
    for (sort {$found{$a} <=> $found{$b}} keys %found) {
    
      ## We loop through all the remaining items in the %found hash,
      ## sorting them by the line number for easier output. Keys 
      ## that were found in the second data file have been deleted.
    
      printf "Line %5d: Record %5d exists in $first but not in $second\n",
             $found{$_},$_;
    }
    
    

    Hope this helps. Questions welcome.

Re: Combining Files
by Beatnik (Parson) on Jan 23, 2001 at 20:09 UTC
    DBD::CSV can easily handle text-based data files. Access is done by simple SQL statements, making field order pretty simple.
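    A hypothetical sketch of that approach (the table name, column names, and file name are all invented for the demo; it requires DBD::CSV and exits quietly if the module is not installed):

```perl
#!/usr/bin/perl
use strict;
# Sketch only: bail out quietly if DBD::CSV is not available.
eval { require DBI; require DBD::CSV; 1 } or exit 0;

# Write a sample of the first data file for the demo.
open(OUT, ">links.txt") or die "links.txt: $!\n";
print OUT "20722|20723|3 1 6 0 1 2484 19488|\n",
          "9400|9401|3 1 6 0 1 2484 14087|\n",
          "20722|20724|3 1 6 0 1 2484 19489|\n";
close(OUT);

my $dbh = DBI->connect("dbi:CSV:", undef, undef,
    { f_dir => ".", csv_sep_char => "|", RaiseError => 1 });

# The file has no header row, so supply the column names ourselves.
$dbh->{csv_tables}{links} = {
    f_file         => "links.txt",
    col_names      => [qw(cell sector info tail)],
    skip_first_row => 0,
};

my $sth = $dbh->prepare("SELECT sector, cell FROM links WHERE cell = 20722");
$sth->execute;
while (my ($sector, $cell) = $sth->fetchrow_array) {
    print "$sector|$cell|\n";   # each sector paired with cell 20722
}
```

    The appeal of this route is that changing which fields you pull, or which ones you match on, is just an edit to the SQL string rather than to split/index bookkeeping.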

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Combining Files
by salvadors (Pilgrim) on Jan 23, 2001 at 19:28 UTC

    Is this the latest version of a homework assignment? Doing the work that people are actually getting *paid* to do?

    If you can't understand something and need help on a specific point, that's fine, ask away; but please don't ask us to do your job for you.

    Tony

      Well,
      I guess you're right, it did come off that way, but I didn't want you guys to do it for me, just point me in the right direction. I didn't mean to come off that way.
      My apologies

      Dipul
