http://qs321.pair.com?node_id=65094
Category: Text Processing
Author/Contact Info
Description: I am tired of people asking how to handle CSV and not having a good answer that doesn't involve learning DBI first. In particular I don't like Text::CSV. This is called Text::xSV at tye's suggestion since you can choose the character separation. Performance can be improved significantly, but that wasn't the point.

For details you can read the documentation.

UPDATE 1
Fixed minor bug that resulted in quotes in quoted fields remaining doubled up.

UPDATE 2
Fixed missing defined test that caused a warning. Thanks TStanley.

package Text::xSV;
$VERSION = 0.03;
use strict;
use Carp;

sub bind_fields {
  my $self = shift;
  my %field_pos;
  foreach my $i (0..$#_) {
    $field_pos{$_[$i]} = $i;
  }
  $self->{field_pos} = \%field_pos;
}

sub bind_header {
  my $self = shift;
  $self->bind_fields($self->get_row());
  delete $self->{row};
}

sub extract {
  my $self = shift;
  my $row = $self->{row} or confess("No row found (did you call get_ro
+w())?");
  my $lookup = $self->{field_pos}
    or confess("Can't find field info (did you bind_fields or bind_hea
+der?)");
  my @data;
  foreach my $field (@_) {
    if (exists $lookup->{$field}) {
      push @data, $row->[$lookup->{$field}];
    }
    else {
      my @allowed = sort keys %$lookup;
      confess(
        "Invalid field $field for file '$self->{filename}'.\n" .
        "Valid fields are: (@allowed)\n"
      );
    }
  }
  return wantarray ? @data : \@data;
}

# Private block for shared variables in a small "parse engine".
# The concept here is to use pos to step through a string.
# This is the real engine, all else is syntactic sugar.
{
  my ($self, $fh, $line);

  sub get_row {
    $self = shift;
    delete $self->{row};
    $fh = $self->{fh};
    defined($line = <$fh>) or return;
    chomp($line);
    my @row = _get_row();
    $self->{row} = \@row;
    return wantarray ? @row : [@row];
  }

  sub _get_row {
    my @row;
    my $q_sep = quotemeta($self->{sep});
    my $match_sep = qr/\G$q_sep/;
    my $start_field = qr/\G(")|([^"$q_sep]*)/;

    # This loop is the heart of the engine
    while ($line =~ /$start_field/g) {
      if ($1) {
        push @row, _get_quoted();
      }
      else {
        push @row, $2;
      }
      my $pos = pos($line);
      if ($line !~ /$match_sep/g) {
        if ($pos == length($line)) {
          return @row;
        }
        else {
          my $expected = "Expected '$self->{sep}'";
          confess("$expected at $self->{filename}, line $., char $pos"
+);
        }
      }
    }
    confess("I have no idea how parsing $self->{filename} left me here
+!");
  }

  sub _get_quoted {
    my $piece = "";
    while ($line =~ /\G((?:[^"]|"")*)/g) {
      $piece .= $1;
      if ($line =~ /\G"/g) {
        $piece =~ s/""/"/g;
        return $piece;  # EXIT HERE
      }
      else {
        # Must be at end of line
        $piece .= $/;
        defined($line = <$fh>) or
          confess("File $self->{filename} ended inside a quoted field"
+);
        chomp($line);
      }
    }
    confess("I have no idea how parsing $self->{filename} left me here
+!");
  }
}

sub new {
  my $self = bless ({}, shift);
  my @fields = qw(filename fh filter sep);
  # args, required, optional, defaults
  @$self{@fields} = proc_args( +{ @_ },  [], [@fields],  { sep => ',' 
+});
  if (defined($self->{filename}) and not defined($self->{fh})) {
    $self->open_file($self->{filename});
  }
  $self->set_sep($self->{sep});
  return $self;
}

sub open_file {
  my $self = shift;
  my $file = $self->{filename} = shift;
  my $fh = do {local *FH}; # Old trick, not needed in 5.6
  open ($fh, "< $file") or confess("Cannot read '$file': $!");
  $self->{fh} = $fh;
}


# See node 43323 - this could be in a module but I didn't want to
# create external dependencies.
sub proc_args {
  my $args = shift;
  my $req = shift;
  my $opt = shift || [];
  my $default = shift || {};
  my @res;
  foreach my $arg (@$req) {
    if (exists $args->{$arg}) {
      push @res, $args->{$arg};
      delete $args->{$arg};
    }
    else {
      confess("Missing required argument $arg");
    }
  }
  foreach my $arg (@$opt) {
    if (exists $args->{$arg}) {
      push @res, $args->{$arg};
      delete $args->{$arg};
    }
    else {
      push @res, $default->{$arg};
    }
  }
  if (%$args) {
    my $bad = join ", ", sort keys %$args;
    confess("Unrecognized arguments: $bad\n");
  }
  else {
    return @res;
  }
}

sub set_fh {
  $_[0]->{fh} = $_[1];
}

sub set_filename {
  $_[0]->{filename} = $_[1];
}

sub set_sep {
  my $self = shift;
  my $sep = shift;
  if (1 == length($sep)) {
    $self->{sep} = $sep;
  }
  else {
    confess("The separator '$sep' is not of length 1");
  }
}

1;

__END__
=head1 NAME

Text::xSV

=head1 EXAMPLE

  use Text::xSV;
  my $csv = new Text::xSV;
  $csv->open_file("foo.csv");
  $csv->bind_header();
  while ($csv->get_row()) {
    my ($name, $age) = $csv->extract(qw(name age));
    print "$name is $age years old\n";
  }

=head1 ABSTRACT

This module is for reading character separated data.  The most common
example is comma-separated.  However that is far from the only
possibility, the same basic format is exported by Microsoft products
using tabs, colons, or other characters.

The format is a series of rows separated by returns.  Within each row
you have a series of fields separated by your character separator.
Fields may either be unquoted, in which case they do not contain a
double-quote, separator, or return, or they are quoted, in which case
they may contain everything, and will pair double-quotes.

People usually naively solve this with split.  A next step up is to
read a line and parse it.  Unfortunately this choice of interface
(which is made by Text::CSV on CPAN) makes it impossible to handle
returns embedded in a field.  Therefore you may need access to the
whole file.

This module solves the problem by creating a CSV object with access to
the filehandle, if in parsing it notices that a new line is needed, it
can read at will.

=head1 DESCRIPTION

First you set up and initialize an object, then you read the CSV file
through it.  The creation can also do multiple initializations as
well.  Here are the available methods

=over 4

=item C<new>

This is the constructor.  It takes a hash of optional arguments.
They are the I<filename> of the CSV file you are reading, the
I<fh> through which you read, and the one character I<sep> that
you are using.  If the filename is passed and the fh is not, then
it will open a filehandle on that file and sets the fh accordingly.
The separator defaults to a comma.

=item C<set_filename>

=item C<set_fh>

=item C<set_sep>

Set methods corresponding to the optional arguments to C<new>.

=item C<open_file>

Takes the name of a file, opens it, then sets the filename and fh.

=item C<bind_fields>

Takes an array of fieldnames, memorizes the field positions for later
use.  C<bind_headers> is preferred.

=item C<bind_headers>

Reads a row from the file as a header line and memorizes the positions
of the fields for later use.  File formats that carry field informatio
+n
tend to be far more robust than ones which do not, so this is the
preferred function.

=item C<get_row>

Reads a row from the file.  Returns an array or reference to an array
depending on context.  Will also store the row in the row property for
later access.

=item C<extract>

Extracts a list of fields out of the last row read.

=back

=head1 BUGS

When I say single character separator, I mean it.

Performance could be better.  That is largely because the API was
chosen for simplicity of a "proof of concept", rather than for
performance.  One idea to speed it up you would be to provide an
API where you bind the requested fields once and then fetch many
times rather than binding the request for every row.

Also note that should you ever play around with the special variables
$`, $&, or $', you will find that it can get much, much slower.  The
cause of this problem is that Perl avoids calculating those on a match
unless it has seen one of those.  This does many, many matches.

=head1 AUTHOR

Ben Tilly, aka tilly on http://www.perlmonks.org/.
Replies are listed 'Best First'.
Re: Text::xSV
by greenFox (Vicar) on Apr 17, 2001 at 18:32 UTC

    Tilly, out of interest what is it that you "don't like" about Text::CSV? I am also wondering why you chose to write a new module rather than either improve or extend the existing one. I can see Text::xSV has features that Text:CSV does not but would it not be better to add that functionality to Text::CSV? I know people have a lot of reasons for writing yet another mail clinet or wp but given the fairly limited scope and function of these two modules I don't see any gain out of two efforts...

    Sometimes CPAN has so many modules to do similar functions it is quite daunting to know which to choose- I look for recomendations, I look for authors names I know. However if it is possible I stick with the "standard modules", they have been blessed by a "higher authority", they are one less thing to worry about during upgrades etc and I know they will be maintained (not saying you won't). So aside from the reasons relating to duplication of effort I think there are some good reasons to having your code form part of a standard module.

    I'd appreciate hearing your reasoning for going the way that you have

    update: note to self- read the documentation before making assumptions :) Thanks tilly.

    --
    my $chainsaw = 'Perl';

      The API for Text::CSV has you hand lines off to the module to parse. This is incompatible with having the module detect that a return is embedded in a field and so another line needs to be fetched to parse the row. I didn't see any easy way to provide anything like a compatible interface to the current one which handled embedded returns. Therefore I felt that it was best to provide the functionality in a new module with a new API.

      Note that the API for Text::CSV_XS is able to handle embedded returns, and I should update this to mention that. However the API they chose seemed specific to their implementation, so I didn't try to tie out to it.

        Tilly, I'd like to suggest an addition to Text::xSV. It would be useful for the calling program to access the list of field header strings that are found in / bound from the first row of a CSV file; i.e. add a method like "get_header" as follows:
        sub bind_header { my $self = shift; $self->bind_fields($self->get_row()); $self->{field_name} = [ @{$self->{row}} ]; # this is new delete $self->{row}; } # the following method is new: sub get_header { my $self = shift; if ( exists( $self->{field_name} )) { return wantarray ? @{$self->{field_name}} : $self->{field_name +}; } }

        That works for me, but maybe you would want to do it a little differently. Thanks very much for this module, and for your discussions in various threads about CSV data -- you got my vote.

Re: Text::xSV
by dragonchild (Archbishop) on Oct 03, 2003 at 16:36 UTC
    Are there any plans to have this module also output to xSV? Text::CSV reads and writes CSV, so it would make sense (in my twisted world) for Text::xSV to do so, as well ...

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      Yes. There are many plans.

      Computed fields are my next step though, and I think I will have some tuits this weekend.

Re: Text::xSV
by princepawn (Parson) on Apr 17, 2001 at 01:41 UTC
      I believe they are orthogonal to my module.

      Besides which I would not encourage the proliferation of file formats which are by design not extensible or self-documenting. See the note in the documentation that I provided about exactly that. Fixed width file formats are not exactly on my hit parade of great ideas to encourage.

      An incidental question. Even if I did want to use fixed width file formats, what is your value add over the better-known pack and unpack plus Perl's native support for formats? Perl started life dealing with lots of ASCII reports and has quite a few now little-used facilities for doing just that.

        If one is designing from scratch, then you are right, fixed-width is not extensible nor self-documenting. But I know that Valley Media, the company who does fulfillment for Amazon.com and CDNOW and some other people I have worked for, does use fixed-width format and they have too much technology behind it too change in midstream. I would be surprised if they are the only such place using this admittedly inferior format.

        Regarding use pack and unpack instead of Parse::FixedLength, Text::FixedLength, and Text::FixedLength::Extra, if you read "Data Munging with Perl", the answer would be none, because he did not mention the FixedLength module set in his book, but instead focused on pack and unpack.

        I think pack and unpack are excellent for implementation of fixed-width data processing, but in terms of making a human-level description, a higher level interface is needed. For example, let's say I have this:
        field name field width justifiction numeric?
        catalog id 12 L no
        price 6 L yes -- with 2 decimal places

        Here is a high-level description that one could expect a data-entry expert to convert the english above into:

        catalog_id = '12L', price = '6L*2'
        and to turn that into the programmatic representaiton of Parse::FixedLength is a one-step transformation:
        package Business::Logic::Orders; our $line_data = [ { catalog_id => '12L' } , { price => '6L*2' } ]; 1;
        And then the script which plugs into the data description is quite clean, with no low-level Perl pack/unpack:
        ### open D, 'data-file.dat'; while (<D>) { Parse::FixedLength::Parse(\%line_parse, $Business::Logic::Orders); DBIx::Recordset->Insert({ \%dsn }, \%line_parse} ); }
        And of course, internally Parse::FixedLength can implement it's high-level API with pack, unpack, or sprintf as need be.