ch1 has asked for the wisdom of the Perl Monks concerning the following question:
I trying to join two files on a common field.
File one:
upc,qty
840197083284,2
840197083291,3
840197083307,21
840197083314,0
File two:
product_sku,size,color,upc
121,MD,Black,840197083277
121,LG,Black,840197083284
121,XL,Black,840197083291
121,2XL,Black,840197083307
I'm trying to join on upc.
I've read a number of post.... combining 2 files with a comon field which seems to be close but I can't make it work.
Update:
I'm new to perl and to perlmonks... So, First thanks for all the help.
Here is the code that worked for me.
open(I1,"Inventory.csv") or die "file1: $!";
$_ = <I1>; # read column headings
while (<I1>) { # get data
chomp;
my ($upc,$qty) = split /,/;
$data{$upc} = $qty;
}
open(I2,"Matrix.csv") or die "file2: $!";
$_ = <I2>; # read column headings
while (<I2>) { # get data
chomp;
my ($sku,$siz,$clr,$upc) = split /,/;
$data{$upc} .= join ',', '', $sku, $siz, $clr;
}
open(my $fh_out, '>', 'new.csv')
or die("Can't open output file: $!\n");
for (sort keys %data) {
my @content = "$data{$_},$_";
if (grep(/.*,.*,.*,.*$/, @content)){@column = grep(/.*,.*,.*,.*$/,
+ @content)};
#@column = grep(/.*,.*,.*,.*$/, @content);
print $fh_out "@column\n";
}
Edit: g0n - Linkified PM link
Re: Joining two files on common field
by jZed (Prior) on Sep 22, 2005 at 20:38 UTC
|
Well you didn't say what you want to do with the results, but assuming you want them in an Array of Arrays:
use DBI;
my $dbh=DBI->connect('dbi:CSV(RaiseError=1):');
my $results = $dbh->selectall_arrayref("
SELECT * FROM file1 NATURAL JOIN file2
");
| [reply] [d/l] |
|
Here is code but I'm getting an error.
#!/usr/bin/perl -w
use DBI;
$dbh = DBI->connect("DBI:CSV:f_dir=/home/turn2sp/tmp")
or die "Cannot connect: " . $DBI::errstr;
$dbh->{'csv_tables'}->{'Matrix'} = { 'file' => 'Matrix.csv'};
$dbh->{'csv_tables'}->{'Inventory'} = { 'file' => 'Inventory.csv'};
$sth = $dbh->prepare("SELECT * from Matrix NATURAL JOIN Inventory");
#$sth = $dbh->prepare("SELECT * FROM Inventory");
$sth->execute();
while ( @row = $sth->fetchrow_array ) {
print "@row\n";
}
DBD::CSV::db prepare failed: Parse error near NATURAL JOIN Inventory at /usr/lib/perl5/site_perl/5.8.4/i686-linux/DBD/File.pm line 195.
for Statement "SELECT * from Matrix NATURAL JOIN Inventory" at ./j line 7.
Can't call method "execute" on an undefined value at ./j line 10.
| [reply] [d/l] |
|
| [reply] |
|
|
|
|
To get a better error message, change
$sth = $dbh->prepare("SELECT * from Matrix NATURAL JOIN Inventory");
to
$sth = $dbh->prepare("SELECT * from Matrix NATURAL JOIN Inventory")
or die "Cannot prepare: " . $DBI::errstr;
| [reply] [d/l] [select] |
Re: Joining two files on common field
by ikegami (Patriarch) on Sep 22, 2005 at 20:40 UTC
|
use strict;
use warnings;
my %data;
{
open(my $fh_in, '<', 'file1.txt')
or die("Can't open first input file: $!\n");
<$fh_in>; # Skip header.
while (<$fh_in>) {
chomp;
my ($upc, @fields) = split /,/;
push(@{$data{$key}}, @fields);
}
}
{
open(my $fh_in, '<', 'file2.txt')
or die("Can't open second input file: $!\n");
<$fh_in>; # Skip header.
while (<$fh_in>) {
chomp;
my @fields = split /,/;
my $upc = pop(@fields);
push(@{$data{$key}}, @fields);
}
}
{
open(my $fh_out, '>', 'file3.txt')
or die("Can't open output file: $!\n");
foreach (sort keys %data) {
print $fh_out (join(',', @$_), "\n");
}
}
| [reply] [d/l] |
Re: Joining two files on common field
by davidrw (Prior) on Sep 22, 2005 at 21:59 UTC
|
if it suits your needs, you can simply do this on the commandline with join as well:
[davidrw@host davidrw]$ join -t , -1 1 -2 4 /tmp/f1 /tmp/f2
upc,qty,product_sku,size,color
840197083284,2,121,LG,Black
840197083291,3,121,XL,Black
840197083307,21,121,2XL,Black
see "man join" for the various options to get the output the way you need it (espically if you want to show unmatched lines) | [reply] [d/l] [select] |
|
For examples of other nodes when I or someone else have suggested using textutils, see
Re: File comparison, Re: Checking for new files, Re: Joining two files on common field,
Re: comparing arrays, Re: help with lists, Re: Seeking guidance on how to approach a task, Re: Delete every n line, Re: duplicates getting omitted while comparing values inside foreach., Re: duplicates getting omitted while comparing values inside foreach., Re: Parsing Pattern Question, (honorary mention) Re: Join lines that match an string, Re: How do you create a tab delimited txt file from two files?, Re: Merging two list with simple operation.
| [reply] |
|
I tried join first... well becasue it should have been easy but it returns no output.
| [reply] |
|
what were your attempts? I got nothing at first because i didn't specifiy the comma delimiter with -t (default is tab)... what i posted above worked and that was the actual output... worth a quick glance of the man page in case the params are slightly different for your system?
| [reply] |
|
|
|
Re: Joining two files on common field
by graff (Chancellor) on Sep 22, 2005 at 22:57 UTC
|
Yet another way -- just a plain hash:
use strict;
my %data;
open(I1,"file1") or die "file1: $!"
$_ = <I1>; # read column headings
while (<I1>) { # get data
chomp;
my ($upc,$qty) = split /,/;
$data{$upc} = $qty;
}
open(I2,"file2") or die "file2: $!"
$_ = <I2>; # read column headings
while (<I2>) { # get data
chomp;
my ($sku,$siz,$clr,$upc) = split /,/;
$data{$upc} .= join ',', '', $sku, $siz, $clr;
# or: $data{$upc} = join ',', $data{$upc},$sku,$siz,$clr;
}
for (sort keys %data) {
print "$_,$data{$_}\n";
}
That's pretty brittle in a number of ways: it won't handle properly quoted fields in a CSV file (e.g. '121,MD,"Green, Kelly",87654321'); it depends rigidly on a particular ordering and quantity of columns in each file; if file2 has a upc value not found in file1, it'll print ",$sku,$siz,$clr"; if file1 has a upc value not found in file2, it'll print "$upc,$qty".
But it's simple and should do the job as you described it. | [reply] [d/l] |
|
You said: That's pretty brittle in a number of ways: ... it depends rigidly on a particular ordering and quantity of columns in each file
So let's fix that.
$_ = <I1>; # read column headings
becomes
$_ = <I1>; # read column headings
my @columns= split /,/;
And when reading the columns, instead of
my ($upc,$qty) = split /,/;
and
my ($sku,$siz,$clr,$upc) = split /,/;
we put
@line{@columns}= split /,/;
(provided you have a my %line; defined)
So in full:
$\=~s;s*.*;q^|D9JYJ^^qq^\//\\\///^;ex;print
| [reply] [d/l] [select] |
Re: Joining two files on common field
by samtregar (Abbot) on Sep 22, 2005 at 20:33 UTC
|
Should we try to guess what you might be doing wrong or do you want to show us your code?
-sam
| [reply] |
Re: Joining two files on common field
by graff (Chancellor) on Sep 24, 2005 at 14:44 UTC
|
Regarding your update, where you show a modified version of the code I suggested, I should point out a minor (harmless) misconception in your final "for" loop to print the results.
Here's your version of the for loop:
for (sort keys %data) {
my @content = "$data{$_},$_";
if (grep(/.*,.*,.*,.*$/, @content)){
@column = grep(/.*,.*,.*,.*$/, @content)
};
#@column = grep(/.*,.*,.*,.*$/, @content);
print $fh_out "@column\n";
}
The second line assigns a single string to an array (@content). This is bad form, but perl will "do the right thing" and create the array with a single element, which is the string you supplied: "$data{$_},$_" .
Next, the "if" condition returns true if the single array element happens to match your regex, where you appear to be testing whether a string contains at least three commas. If it matches, you use grep on the single element in @content to store the string in a new array "@column". (Looks like you have removed "use strict;" because @column appears to be working as an undeclared global -- bad idea.) Anyway, if the match fails, @column remains undefined. UPDATE: rather, @column remains unchanged -- it still holds whatever may have been assigned to it previously, if anything. That could cause a problem in your output.
Regardless of what happens with that if condition, you print @column followed by a line-feed (even if it's undefined --update: or contains a string from a previous iteration).
All of that would be better stated as follows:
for ( sort keys %data ) {
print "$data{$_},$_\n" if ( $data{$_} =~ /,.*?,/ );
}
(that is, if the original value of "$data{$_}" contains two commas, add ",$_\n" at the end and print it; otherwise skip it) | [reply] [d/l] [select] |
|
|