Joining two files on common field

ch1 has asked for the wisdom of the Perl Monks concerning the following question:

I trying to join two files on a common field.
File one:
upc,qty
840197083284,2
840197083291,3
840197083307,21
840197083314,0
File two:
product_sku,size,color,upc
121,MD,Black,840197083277
121,LG,Black,840197083284
121,XL,Black,840197083291
121,2XL,Black,840197083307

I'm trying to join on upc.
I've read a number of post.... combining 2 files with a comon field which seems to be close but I can't make it work.

Update: I'm new to perl and to perlmonks... So, First thanks for all the help. Here is the code that worked for me.

open(I1,"Inventory.csv") or die "file1: $!";
$_ = <I1>;   # read column headings

while (<I1>) {  # get data
    chomp;
    my ($upc,$qty) = split /,/;
    $data{$upc} = $qty;
}

open(I2,"Matrix.csv") or die "file2: $!";
$_ = <I2>;   # read column headings

while (<I2>) {  # get data
    chomp;
    my ($sku,$siz,$clr,$upc) = split /,/;
    $data{$upc} .= join ',', '', $sku, $siz, $clr;

}

open(my $fh_out, '>', 'new.csv')
    or die("Can't open output file: $!\n");

for (sort keys %data) {
     my @content = "$data{$_},$_";
    if (grep(/.*,.*,.*,.*$/, @content)){@column = grep(/.*,.*,.*,.*$/,
+ @content)};
    #@column = grep(/.*,.*,.*,.*$/, @content);
    print $fh_out "@column\n";
}
[download]

Edit: g0n - Linkified PM link

Comment on Joining two files on common field Download Code

Replies are listed 'Best First'.
Re: Joining two files on common field by jZed (Prior) on Sep 22, 2005 at 20:38 UTC
Well you didn't say what you want to do with the results, but assuming you want them in an Array of Arrays: `use DBI; my $dbh=DBI->connect('dbi:CSV(RaiseError=1):'); my $results = $dbh->selectall_arrayref(" SELECT * FROM file1 NATURAL JOIN file2 ");` [download]	[reply] [d/l]
Re^2: Joining two files on common field by ch1 (Novice) on Sep 22, 2005 at 22:54 UTC
Here is code but I'm getting an error. `#!/usr/bin/perl -w use DBI; $dbh = DBI->connect("DBI:CSV:f_dir=/home/turn2sp/tmp") or die "Cannot connect: " . $DBI::errstr; $dbh->{'csv_tables'}->{'Matrix'} = { 'file' => 'Matrix.csv'}; $dbh->{'csv_tables'}->{'Inventory'} = { 'file' => 'Inventory.csv'}; $sth = $dbh->prepare("SELECT * from Matrix NATURAL JOIN Inventory"); #$sth = $dbh->prepare("SELECT * FROM Inventory"); $sth->execute(); while ( @row = $sth->fetchrow_array ) { print "@row\n"; }` [download] DBD::CSV::db prepare failed: Parse error near NATURAL JOIN Inventory at /usr/lib/perl5/site_perl/5.8.4/i686-linux/DBD/File.pm line 195. for Statement "SELECT * from Matrix NATURAL JOIN Inventory" at ./j line 7. Can't call method "execute" on an undefined value at ./j line 10.	[reply] [d/l]
Re^3: Joining two files on common field by jZed (Prior) on Sep 22, 2005 at 23:00 UTC
That works for me. What version of SQL::Statement are you using?	[reply]
Re^4: Joining two files on common field by ch1 (Novice) on Sep 23, 2005 at 04:53 UTC
Re^5: Joining two files on common field by jZed (Prior) on Sep 23, 2005 at 05:10 UTC
Some notes below your chosen depth have not been shown here
Re^3: Joining two files on common field by ikegami (Patriarch) on Sep 23, 2005 at 17:35 UTC
To get a better error message, change `$sth = $dbh->prepare("SELECT * from Matrix NATURAL JOIN Inventory");` [download] to `$sth = $dbh->prepare("SELECT * from Matrix NATURAL JOIN Inventory") or die "Cannot prepare: " . $DBI::errstr;` [download]	[reply] [d/l] [select]
Re: Joining two files on common field by ikegami (Patriarch) on Sep 22, 2005 at 20:40 UTC
I believe you could use DBD::AnyData. Or, here's something based on bart's solution in the mentioned thread: use strict; use warnings; my %data; { open(my $fh_in, '<', 'file1.txt') or die("Can't open first input file: $!\n"); <$fh_in>; # Skip header. while (<$fh_in>) { chomp; my ($upc, @fields) = split /,/; push(@{$data{$key}}, @fields); } } { open(my $fh_in, '<', 'file2.txt') or die("Can't open second input file: $!\n"); <$fh_in>; # Skip header. while (<$fh_in>) { chomp; my @fields = split /,/; my $upc = pop(@fields); push(@{$data{$key}}, @fields); } } { open(my $fh_out, '>', 'file3.txt') or die("Can't open output file: $!\n"); foreach (sort keys %data) { print $fh_out (join(',', @$_), "\n"); } } [download]	[reply] [d/l]
Re: Joining two files on common field by davidrw (Prior) on Sep 22, 2005 at 21:59 UTC
if it suits your needs, you can simply do this on the commandline with `join` as well: `[davidrw@host davidrw]$ join -t , -1 1 -2 4 /tmp/f1 /tmp/f2 upc,qty,product_sku,size,color 840197083284,2,121,LG,Black 840197083291,3,121,XL,Black 840197083307,21,121,2XL,Black` [download] see "man join" for the various options to get the output the way you need it (espically if you want to show unmatched lines)	[reply] [d/l] [select]
Re^2: Joining two files on common field by ambrus (Abbot) on Apr 12, 2006 at 12:59 UTC
For examples of other nodes when I or someone else have suggested using textutils, see Re: File comparison, Re: Checking for new files, Re: Joining two files on common field, Re: comparing arrays, Re: help with lists, Re: Seeking guidance on how to approach a task, Re: Delete every n line, Re: duplicates getting omitted while comparing values inside foreach., Re: duplicates getting omitted while comparing values inside foreach., Re: Parsing Pattern Question, (honorary mention) Re: Join lines that match an string, Re: How do you create a tab delimited txt file from two files?, Re: Merging two list with simple operation.	[reply]
Re^2: Joining two files on common field by ch1 (Novice) on Sep 22, 2005 at 23:01 UTC
I tried join first... well becasue it should have been easy but it returns no output.	[reply]
Re^3: Joining two files on common field by davidrw (Prior) on Sep 23, 2005 at 02:08 UTC
what were your attempts? I got nothing at first because i didn't specifiy the comma delimiter with -t (default is tab)... what i posted above worked and that was the actual output... worth a quick glance of the man page in case the params are slightly different for your system?	[reply]
Re^4: Joining two files on common field by ch1 (Novice) on Sep 23, 2005 at 04:57 UTC
Re^5: Joining two files on common field by davidrw (Prior) on Sep 23, 2005 at 12:47 UTC
Re^5: Joining two files on common field by ambrus (Abbot) on Sep 23, 2005 at 16:02 UTC
Re: Joining two files on common field by graff (Chancellor) on Sep 22, 2005 at 22:57 UTC
Yet another way -- just a plain hash: `use strict; my %data; open(I1,"file1") or die "file1: $!" $_ = <I1>; # read column headings while (<I1>) { # get data chomp; my ($upc,$qty) = split /,/; $data{$upc} = $qty; } open(I2,"file2") or die "file2: $!" $_ = <I2>; # read column headings while (<I2>) { # get data chomp; my ($sku,$siz,$clr,$upc) = split /,/; $data{$upc} .= join ',', '', $sku, $siz, $clr; # or: $data{$upc} = join ',', $data{$upc},$sku,$siz,$clr; } for (sort keys %data) { print "$_,$data{$_}\n"; }` [download] That's pretty brittle in a number of ways: it won't handle properly quoted fields in a CSV file (e.g. '121,MD,"Green, Kelly",87654321'); it depends rigidly on a particular ordering and quantity of columns in each file; if file2 has a upc value not found in file1, it'll print ",$sku,$siz,$clr"; if file1 has a upc value not found in file2, it'll print "$upc,$qty". But it's simple and should do the job as you described it.	[reply] [d/l]
Re^2: Joining two files on common field by Skeeve (Parson) on Sep 23, 2005 at 06:10 UTC
You said: That's pretty brittle in a number of ways: ... it depends rigidly on a particular ordering and quantity of columns in each file So let's fix that. `$_ = <I1>; # read column headings` becomes `$_ = <I1>; # read column headings my @columns= split /,/;` [download] And when reading the columns, instead of `my ($upc,$qty) = split /,/;` and `my ($sku,$siz,$clr,$upc) = split /,/;` we put `@line{@columns}= split /,/;` (provided you have a `my %line;` defined) So in full: Read more... (995 Bytes) `$\=~s;s.;q^\|D9JYJ^^qq^\//\\\///^;ex;print`	[reply] [d/l] [select]
Re: Joining two files on common field by graff (Chancellor) on Sep 24, 2005 at 14:44 UTC
Regarding your update, where you show a modified version of the code I suggested, I should point out a minor ~~(harmless)~~ misconception in your final "for" loop to print the results. Here's your version of the for loop: `for (sort keys %data) { my @content = "$data{$_},$_"; if (grep(/.,.,.,.$/, @content)){ @column = grep(/.,.,.,.$/, @content) }; #@column = grep(/.,.,.,.$/, @content); print $fh_out "@column\n"; }` [download] The second line assigns a single string to an array (@content). This is bad form, but perl will "do the right thing" and create the array with a single element, which is the string you supplied: `"$data{$_},$_"` . Next, the "if" condition returns true if the single array element happens to match your regex, where you appear to be testing whether a string contains at least three commas. If it matches, you use grep on the single element in @content to store the string in a new array "@column". (Looks like you have removed "use strict;" because @column appears to be working as an undeclared global -- bad idea.) Anyway, if the match fails, @column remains undefined. UPDATE: rather, @column remains unchanged -- it still holds whatever may have been assigned to it previously, if anything. That could cause a problem in your output. Regardless of what happens with that if condition, you print @column followed by a line-feed (even if it's undefined --update: or contains a string from a previous iteration). All of that would be better stated as follows: `for ( sort keys %data ) { print "$data{$_},$_\n" if ( $data{$_} =~ /,.*?,/ ); }` [download] (that is, if the original value of "$data{$_}" contains two commas, add ",$_\n" at the end and print it; otherwise skip it)	[reply] [d/l] [select]
Re: Joining two files on common field by samtregar (Abbot) on Sep 22, 2005 at 20:33 UTC
Should we try to guess what you might be doing wrong or do you want to show us your code? -sam	[reply]

Back to Seekers of Perl Wisdom