http://qs321.pair.com?node_id=913408

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have to compare 2 CSV files. I sort them by certain columns so that I can find matching rows with the File::Sort module. I have a few issues:

1. Skipping the first (header) row. How can I provide this option in File::Sort?
2. Missing lines. This is something I can see with a simple diff, but I would like to do it in Perl. Which modules should I use for this? I would like to compare things properly.
3. In general, which module is good for comparing CSV files?

Thanks.

Replies are listed 'Best First'.
Re: File::Sort issues
by Somni (Friar) on Jul 09, 2011 at 02:49 UTC
    File::Sort is really not suited to parsing CSV files properly. If they are simple enough, it's possible, but CSV files rarely remain simple enough.

    How to generically compare two CSV files is difficult to answer. It depends on whether you can read the entire files into memory, and whether their fields match. The very simplest method would be to normalize your CSV files, sort them, and then diff them.

    The simplest way of normalizing them is to parse them and then spit them back out; if you do this with the same module for each (using the same options), any rows with the same values should, in theory, come out identical.

    Normalizing with Text::CSV_XS is straightforward:

    #!/usr/bin/perl
    use Text::CSV_XS;
    use warnings;
    use strict;

    {
        die("usage: $0 [<file>]\n") if @ARGV > 1;

        # Read from the named file, or from STDIN if no file is given.
        my ($file, $fh);
        if (@ARGV) {
            $file = $ARGV[0];
            open($fh, '<', $file)
                || die("Unable to open file '$file': $!.\n");
        }
        else {
            $file = '-';
            $fh   = \*STDIN;
        }

        # Parse each row and print it back out in one canonical form.
        my $csv = Text::CSV_XS->new({ binary => 1, eol => "\015\012" });
        while (my $row = $csv->getline($fh)) {
            $csv->print(\*STDOUT, $row);
        }

        die("Error parsing CSV file '$file': ", $csv->error_diag, "\n")
            if $csv->error_diag and not $csv->eof;
    }

    (My first pass used *ARGV, but this results in some odd diagnostics and weird edge cases.)

    At this point, you simply sort the output. Field values and the header are irrelevant; you're simply trying to make all of your CSV files consistent so diff can make some sense of it.

    diff -u <(csv-normalize csv1.csv | sort) <(csv-normalize csv2.csv | sort)

    This is the simplest and quickest way of comparing two CSV files. It has the advantage of being able to work on relatively large CSV files quickly, but it won't work if the field layout differs between them.

      How will this handle the case of missing lines in the file?
        Duplicate lines, missing lines, extra lines, etc. are all displayed by diff, and controlled through its options.

        For example, unified diff (-u) will show +'s for new lines, and -'s for removed lines. If large chunks of lines are added, removed, or moved, diff will show various sets of +'s and -'s; you can use -d (--minimal) to reduce the changes shown.
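
        As a schematic example (the file names and row contents here are made up), the unified output looks something like this:

        --- csv1.normalized
        +++ csv2.normalized
        @@ -10,2 +10,2 @@
         common,row,kept,as,context
        -row,present,only,in,csv1
        +row,present,only,in,csv2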

        The point is, once you've normalized the files you have a wide variety of comparison tools available: straight text from diff; side by side comparison in an editor with vimdiff; byte-by-byte comparison with cmp; sort and uniq to reduce it to some subset; and so on.
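
        For example, assuming neither normalized file contains internal duplicate rows, this prints only the rows that appear in exactly one of the two files:

        sort csv1.normalized csv2.normalized | uniq -u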

Re: File::Sort issues
by vladdrak (Monk) on Jul 09, 2011 at 01:44 UTC
    You should use Text::CSV for this; it's really fast and handles all the gotchas you might encounter, like headers and blanks, etc. It provides a getline_hr() method, which returns each row as a hash reference keyed by your header names. You could use this to build two hashes and compare them manually, as sketched below. Or try CSVDiff on CPAN.
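
    For illustration, here is a minimal sketch of that hash-based comparison; the file names and the unique key column (hypothetically named 'id' here) are assumptions you would adapt:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;

    # Load a CSV file into a hash of rows, keyed by one column.
    sub load_csv {
        my ($file, $key) = @_;
        my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
        open(my $fh, '<', $file) or die("Unable to open '$file': $!\n");
        $csv->column_names($csv->getline($fh));   # consume the header row
        my %rows;
        while (my $row = $csv->getline_hr($fh)) {
            $rows{ $row->{$key} } = $row;
        }
        close($fh);
        return \%rows;
    }

    my $one = load_csv('csv1.csv', 'id');
    my $two = load_csv('csv2.csv', 'id');

    # Report keys present in one file but missing from the other.
    print "missing from csv2: $_\n" for grep { !exists $two->{$_} } sort keys %$one;
    print "missing from csv1: $_\n" for grep { !exists $one->{$_} } sort keys %$two;

    This reads both files into memory, so it suits small-to-medium files; for very large ones the normalize-sort-diff approach above scales better.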
Re: File::Sort issues
by osbosb (Monk) on Jul 11, 2011 at 12:53 UTC
    I have a suggestion: you could always split on your delimiter, grab the rows you would like to compare via a regex, and then print out the diff. A rough sketch follows.
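
    As a minimal sketch of that idea (assuming a simple CSV with no quoted or embedded commas; the file names and pattern are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Collect the lines of interest from a file, keyed by the first column.
    sub grab_rows {
        my ($file, $pattern) = @_;
        open(my $fh, '<', $file) or die("Unable to open '$file': $!\n");
        my %rows;
        while (my $line = <$fh>) {
            chomp $line;
            next unless $line =~ /$pattern/;
            my @fields = split /,/, $line;
            $rows{ $fields[0] } = $line;
        }
        close($fh);
        return \%rows;
    }

    my $one = grab_rows('csv1.csv', qr/^\d+,/);
    my $two = grab_rows('csv2.csv', qr/^\d+,/);

    print "only in csv1: $one->{$_}\n" for grep { !exists $two->{$_} } sort keys %$one;
    print "only in csv2: $two->{$_}\n" for grep { !exists $one->{$_} } sort keys %$two;

    Note Somni's caveat above, though: once fields can contain quoted commas or embedded newlines, a real parser like Text::CSV is the safer route.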