http://qs321.pair.com?node_id=32597

Limo has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that will take, amongst other things, a column header as input and return data from just that field(s). Where I want to go with this is:
<open FILE A> #I can do this <read data in column "X" (or "x", "y", "z")> #I can do this <return a subset of data> #I can do this <open FILE B> #I begin to lose it here. <find all matching strings from previous subset> <take as argument column labels> <return data associated with "subset" and "column data"> <create FILE C> <store data using new column labels>
Syntax should be:
merge [ FILE A, $field, FILE B, $field ]
In other words, lets say I have 2 files. File 1 has columns:
full_name rank serial number
File 2 has columns:
last_name address phone number
I want to take as input "full_name" (or whatever column I specify) from File 1, create a subset of data from the column "full_name", search File 2 for any matches, and return corresponding data from the column "address" and/or "phone number", depending on what I specified at the beginning. Write the results to <STDOUT> , labeling the new columns:
new_name current_address current_phone
So, for instance, I want to create a report that tells me the address and phone number of everyone named "Joe", regardless of last name. I suspect that I have to use references of some sort, but as I've been programming Perl for about a month, read about references, but haven't worked with them yet.. Thanks in advance for any light you all can shed on this one.

Replies are listed 'Best First'.
Re: Merging Files: A Different Twist
by chromatic (Archbishop) on Sep 15, 2000 at 06:08 UTC
    This sounds like a case for a full fledged database... something like DBD::RAM could come in very handy.

    If I understand your specification, however, you want to get data from one column in the first file, look for matches in one column of the second file and print the whole row that contains a match.

    This is why hashes were invented:

    my %to_match = (); # hold things to match in here my $output = ''; # hold what we want to print # open first file while (<FILE1>) { my $element = (split(/\t/, $_))[$column1]; $to_match{$element} = 1; } # open second file while (<FILE2>) { my $element = (split(/\t/, $_))[$column2]; if (defined($to_match{$element})) { $output .= $_; } } # open output file and display results
    That's very generic, and you'll probably want to change the split regex, at least. $column1 and $column2 are the 0-based numeric values of the columns you want to grab out of the specific files.

    If you don't have *exact* matches, you'll have to use a regex or String::Approx or something to do a fuzzy match. This ought to get you started, though.

RE: Merging Files: A Different Twist
by Limo (Scribe) on Sep 15, 2000 at 06:49 UTC
    Chromatic, thanks for your reply. But you're a bit off on the spec. I don't necessarily want to print the entire row that contains the returned string. I want to:
    1. Create a subset (an array), based upon columnar data in File A; the + column(s) as specified on the command line. See my example. 2. For each matched index in the array, read File B horizontally, grab +bing data only from the column(s) that I specified on the command lin +e. 3. Creating a new file, File C with new column labels.
    I apologize profusely if I'm lacking clarity. But, I'm not sure how else to explain my spec, other than how I explained it in my original post. Thanks again, for replying!
      Unfortunately, your initial example is flawed. I see no way to tell how to relate the two files. I'll assume that the "name" fields are really supposed to be identical, and you want the names and address of all the generals.

      What it sounds like is you want to implement something like the SQL "select" statement:

      SELECT file1.name,address FROM file1, file2 WHERE file1.name = file2.name AND rank = 'General';
      If this is what you want, then the "use a real DB, like DBD::RAM" suggestion is probably the most general.

      If you really wanted to do this by hand...

      First, an insult to your intelligence... I had to hand-code this type of thing as a homework assigment for a database class I took to get my degree. Given how poorly specced the problem is, it sounds like the same type of thing, except you won't be writing it in perl4 like I did. As such, I won't give you the -best- solution, but I'll give you a working solution. Some parts are downright stupid for how to do it. This is intentional -- If you can fix it, you earned it.

      A relational DB has tables consisting of rows and columns. A join operation basically creates the cartesian product of two tables, and a selection operation creates new table which has a subset of rows from the old, and a projection operation creates a new table which has a subset of columns.

      What it sounds like is you want to join your two files, take a selection, then a projection. Let's solve it that way.

      We first need to read in the two files into tables. You could do something like this to read in each file:

      while (<FILE1>) { chomp push @table1,[ split /\s+/ ]; }
      This effectively creates a 2-dimensional array, where each row is a record in the file, and each column is, well, a column.

      I'm assuming you know which column is which; you aren't relying on things like the first line of the file having the column names. The tables then contain just pure data.

      So now we have @table1 and @table2. Time to join them:

      my @join; for my $i (@table1) { for my $j (@table2) { push @join,@$i,@$j; } }
      Good, now we have joined the two tables. We now need to select from the tables the info we need.

      Let's assume we are looking for things which meet the following criteria:

      • The first columns of the two tables match
      • The second column of table 1 is "Joe"
      given that there are 3 columns in each table...
      my @select; for $i (@join) { push @select,$i if ($i->[0] eq $i->[3]) and ($i->[1] eq "Joe"); }
      Now, that wasn't hard. Final step now, the projection... We want just columns 1 and 3 from table 1, and column 2 from table 2:
      my @project; for $i (@select) { push @project,[ $i->[0,2,4] ]; }
      Now all you have to do is print out @project into file3, and you are done. That I'll leave as an exercise for you.

      If I've done this right, you should be able to simplify this a lot, or at least, use it as a basis for what you are doing.

      And for the curious: No this is not how I did it for my DB class... for that I cheated, and used persistant hashes, not arrays.