PerlMonks  

File::Sort is not really suited to handling CSV files properly. If the files are simple enough (no quoted fields, no embedded delimiters or newlines) it can work, but CSV files rarely stay that simple.

How to compare two CSV files generically is difficult to answer. It depends on whether you can read each file entirely into memory, and on whether their fields match. The very simplest method is to normalize your CSV files, sort them, and then diff them.

The simplest way of normalizing them is to parse them and then write them back out. If you do this with the same module for each file (using the same options), any rows with the same values should, in theory, come out identical.
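To illustrate the idea, here is a minimal sketch, assuming Text::CSV_XS is installed. The helper name normalize_line and the sample rows are made up for this example; it only shows that two differently quoted spellings of the same row come out the same once they are parsed and re-emitted by the same object.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# One object, one set of options, used for both parsing and writing.
my $csv = Text::CSV_XS->new({ binary => 1 });

# Hypothetical helper: parse a single CSV line and re-emit it.
sub normalize_line {
    my ($line) = @_;
    $csv->parse($line)
        or die("Cannot parse '$line': " . $csv->error_diag() . "\n");
    $csv->combine($csv->fields)
        or die("Cannot re-encode '$line'\n");
    return $csv->string;
}

# The same values, quoted two different ways.
my $first  = normalize_line(q{"1","foo","bar baz"});
my $second = normalize_line(q{1,foo,"bar baz"});

print $first eq $second ? "same\n" : "different\n";   # prints "same"
```

The quoting of the output depends on the options passed to new(), which is why both files need to go through the same settings.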

Normalizing with Text::CSV_XS is straightforward:

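#!/usr/bin/perl
use Text::CSV_XS;
use warnings;
use strict;

{
    die("usage: $0 [<file>]\n") if @ARGV > 1;

    # Read from the named file, or from STDIN if no file was given.
    my($file, $fh);
    if (@ARGV) {
        $file = $ARGV[0];
        open($fh, '<', $file) || die("Unable to open file '$file': $!.\n");
    } else {
        $file = '-';
        $fh = \*STDIN;
    }

    # Parse each row and write it straight back out with consistent
    # quoting and line endings.
    my $csv = Text::CSV_XS->new({ binary => 1, eol => "\015\012" });
    while (my $row = $csv->getline($fh)) {
        $csv->print(\*STDOUT, $row);
    }

    # getline also returns false at end of file, so only complain if we
    # stopped early. scalar() keeps error_diag to just the message.
    die("Error parsing CSV file '$file': ", scalar($csv->error_diag), "\n")
        if $csv->error_diag and not $csv->eof;
}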

(My first pass used *ARGV, but this results in some odd diagnostics and weird edge cases.)

At this point, you simply sort the output. Field values and the header are irrelevant; you're simply trying to make all of your CSV files consistent so diff can make some sense of them.

diff -u <(csv-normalize csv1.csv | sort) <(csv-normalize csv2.csv | sort)

This is the simplest and quickest way of comparing two CSV files. It has the advantage of being able to work on relatively large CSV files quickly, but it won't work if the field layout differs between them.
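If both files fit in memory, the same normalize-and-compare idea can be done in Perl alone. This is only a sketch under that assumption, not a replacement for the pipeline above: the helper names row_counts and compare_counts and the sample data are invented for illustration, and it still assumes both files share the same field layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper: read every row from $fh, re-emit it in
# normalized form, and count how often each normalized row occurs.
sub row_counts {
    my ($fh, $name) = @_;
    my $parser = Text::CSV_XS->new({ binary => 1 });
    my $writer = Text::CSV_XS->new({ binary => 1 });
    my %count;
    while (my $row = $parser->getline($fh)) {
        $writer->combine(@$row)
            or die("Cannot re-encode a row from '$name'\n");
        $count{ $writer->string }++;
    }
    die("Error parsing '$name': " . $parser->error_diag() . "\n")
        unless $parser->eof;
    return \%count;
}

# Hypothetical helper: rows that occur more often on one side.
sub compare_counts {
    my ($left, $right) = @_;
    my (@only_left, @only_right);
    my %seen = (%$left, %$right);
    for my $line (sort keys %seen) {
        my $diff = ($left->{$line} // 0) - ($right->{$line} // 0);
        push @only_left,  ($line) x abs($diff) if $diff > 0;
        push @only_right, ($line) x abs($diff) if $diff < 0;
    }
    return (\@only_left, \@only_right);
}

# Sample data held in memory; real use would open the two files.
my $old = qq{id,name\n1,"Alice"\n2,Bob\n};
my $new = qq{"id","name"\n1,Alice\n3,"Carol"\n};
open(my $old_fh, '<', \$old) or die("Cannot open in-memory data: $!\n");
open(my $new_fh, '<', \$new) or die("Cannot open in-memory data: $!\n");

my ($only_old, $only_new) = compare_counts(
    row_counts($old_fh, 'old'),
    row_counts($new_fh, 'new'),
);
print "- $_\n" for @$only_old;   # rows only in the old data
print "+ $_\n" for @$only_new;   # rows only in the new data
```

Because this compares whole parsed rows rather than lines, a field containing an embedded newline stays with its row, which a line-based sort would not guarantee. The cost is holding every distinct row of both files in memory at once.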


In reply to Re: File::Sort issues by Somni
in thread File::Sort issues by Anonymous Monk
