Command-line utility to apply perl expressions to CSV files

by gpvos (Novice)
on Nov 20, 2022 at 20:16 UTC ( [id://11148278] )

Here's csved, an old script of mine that I use to quickly munge CSV files.

Essentially, it works similarly to perl -a or awk. Each input line is read and parsed into fields, which are put into @F. Then your expression is executed and, unless -n is given, @F is printed again in CSV format. You can mangle @F any way you like: change, reorder or delete entries, or reassign the whole array. You can also use next to skip printing a particular line.
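For example, this sketch (the file name data.csv and its column layout are assumptions) swaps the first two columns and skips rows whose third field is empty:

    # Swap the first two columns; next suppresses printing the row
    csved '@F[0,1] = @F[1,0]; next if $F[2] eq ""' data.csv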

I recently thought it would be useful to be able to address columns by name if the CSV file has field names in its first line, so I added the -h option which, via an egregious abuse of tying that I'm very proud of, also lets you address the fields through the hash %F, with field names taken from the first line of the CSV file. Any access through %F actually affects @F directly, just routed through a fieldname-to-index mapping. You can delete entries by assigning undef to them, or delete $F{fieldname} also works: it doesn't delete anything from the tied hash itself, but immediately deletes the corresponding element of the underlying array @F representing the current line/row.
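A quick sketch of -h in action (the field names id and comment are hypothetical; the $. guard keeps the header row's name intact, since expr also runs on the first line):

    # Uppercase the "id" field and drop the "comment" column by name
    csved -h '$F{id} = uc $F{id} if $. > 1; delete $F{comment}' data.csv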

I also added -b and -e options which work similarly to BEGIN and END, but I haven't really used them yet. Options to read and write CSV with different separator, quote and escape characters (-F, -Q and -E) are also available.
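As a sketch of how -b and -e could combine (assuming the third column is numeric and the file has no header line):

    # Sum the third column; -n suppresses the per-line CSV output,
    # and the -e expression prints the accumulated total at the end
    csved -n -b '$total = 0' -e 'print "$total\n"' '$total += $F[2]' data.csv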

This also called for a (still incomplete) test suite; you can find the script and its tests at https://github.com/gpvos/csved

I hope you will find it useful.

#!/usr/bin/perl -w
#
# csved - apply a Perl expression to all records of a CSV file
#
# Copyright (c) 2003-2022 Gerben Vos. All rights reserved.
# This program is free software; you can redistribute it and/or
# modify it under the same terms as Perl 5.8.0.
#
# Run without arguments for usage information.
#
# Requires the Text::CSV_XS module by Jochen Wiedmann and Alan Citterman,
# available from http://www.cpan.org/ .

use v5.10;
use strict 'vars', 'refs';

# This is a weird usage of tying because the hash is tied only once,
# but the underlying data changes for every processed line.
package Tie::FieldNames;

sub TIEHASH {
    my $self = shift;
    my $i = 0;
    my $tied = {
        HEADERS => { map { ($_, $i++) } @_ },
    };
    die "Duplicate field names: " . join(", ", map { "'$_'" } _dups(@_))
        unless scalar keys %{$tied->{HEADERS}} == scalar @_;
    return bless $tied, $self;
}

sub _dups {
    my %count;
    ++$count{$_} for @_;
    return grep { $count{$_} > 1 } keys %count;
}

sub setcurrent {
    $_[0]->{CURRENT} = $_[1];
}

sub FETCH {
    my ($self, $key) = @_;
    die "$key: No such field name" unless exists $self->{HEADERS}->{$key};
    return $self->{CURRENT}->[$self->{HEADERS}->{$key}];
}

sub STORE {
    my ($self, $key, $value) = @_;
    die "$key: No such field name" unless exists $self->{HEADERS}->{$key};
    $self->{CURRENT}->[$self->{HEADERS}->{$key}] = $value;
}

sub DELETE {
    my ($self, $key) = @_;
    die "$key: No such field name" unless exists $self->{HEADERS}->{$key};
    delete $self->{CURRENT}->[$self->{HEADERS}->{$key}];
}

sub EXISTS {
    my ($self, $key) = @_;
    die "$key: No such field name" unless exists $self->{HEADERS}->{$key};
    return exists $self->{CURRENT}->[$self->{HEADERS}->{$key}];
}

sub FIRSTKEY {
    my ($self) = @_;
    my $dummy = keys %{$self->{HEADERS}};
    each %{$self->{HEADERS}};
}

sub NEXTKEY {
    return each %{$_[0]->{HEADERS}};
}

sub SCALAR {
    return scalar grep { defined } @{$_[0]->{CURRENT}};
}

package main;

use Text::CSV_XS;
use IO::Handle;

my $_silent = 0;
my $_use_headers = 0;
my $_begin = "";
my $_end = "";
my $_sep = ',';
my $_quote = '"';
my $_escape = '"';
my $_progname;
($_progname = $0) =~ s#.*/##;
my $_usage = <<USAGE;
Usage: $_progname [option...] expr [file...]
Apply a Perl expression to all records of a CSV file.
-n      suppress printing
-h      treat first line as field names and allow access through \%F
-b expr execute expr before processing any input
-e expr execute expr after processing all input
-F char set the character that separates fields (default: comma)
-Q char set the character for quoting fields that contain a separator
        character or end-of-line (default: double quote)
-E char set the character for escaping a quote character (default: double quote)
The input is processed line by line.
expr should be a Perl expression modifying the \@F array; indexes are 0-based.
With -h, you can also modify the \%F hash with the same effect. Deleting an
array or hash entry, or setting it to undef, will cause it not to be output.
The value of \@F after executing expr is sent to the standard output.
With -h, field names are read from the first line just before executing expr
on it, so if e.g. field 8 is named "comment", you can use "delete \$F{comment}"
and that would be equivalent to "delete \$F[8]". Duplicate field names in the
input or a field name in expr that does not exist in the input will cause a
fatal error. A field name for \%F refers to a fixed index in \@F, so if you
rearrange the data in \@F, accesses through \%F may refer to incorrect fields.
Works with binary files (for example, with embedded newlines).
You can use next (to suppress printing), last and redo in expr.
\$. works in expr, so you can e.g. use "if (\$. > 1) { ... }" to not apply
some code to the first line.
Variable names starting with _ are reserved, don't use them in expr.
Uses the Text::CSV_XS module by Jochen Wiedmann and Alan Citterman, with
thanks.
USAGE

while (@ARGV > 0 and $ARGV[0] =~ m{^-}) {
    my $opt = shift;
    $_silent = 1, next if $opt eq '-n';
    $_use_headers = 1, next if $opt eq '-h';
    $_begin = shift, next if $opt eq '-b';
    $_end = shift, next if $opt eq '-e';
    $_sep = shift, next if $opt eq '-F';
    $_quote = shift, next if $opt eq '-Q';
    $_escape = shift, next if $opt eq '-E';
    die $_usage;
}
die $_usage unless @ARGV > 0;
my $_expr = shift;
my $_csv = Text::CSV_XS->new({ sep_char => $_sep, quote_char => $_quote,
    escape_char => $_escape, binary => 1, eol => "\n" });
unshift(@ARGV, '-') unless @ARGV;
{ no strict; eval $_begin; die $@ if $@; }
my %F;
my $_tied;
while ($ARGV = shift) {
    open(_IN, $ARGV);
    my $_fields;
    while ($_fields = $_csv->getline(\*main::_IN) and @$_fields) {
        my(@F) = @$_fields;
        if ($_use_headers) {
            if ($. == 1) {
                $_tied = tie %F, 'Tie::FieldNames', @F;
            }
            $_tied->setcurrent(\@F);
        }
        # This means you can use next, last, and redo in expr
        # without excessive noise. Also turn off strictness.
        no warnings "exiting";
        no strict;
        eval $_expr;
        die $@ if $@;
        $_csv->print(STDOUT, [ grep { defined } @F ]) unless $_silent;
    }
}
if ($_use_headers) {
    undef $_tied;
    untie %F;
}
{ no strict; eval $_end; die $@ if $@; }

Re: Command-line utility to apply perl expressions to CSV files
by Tux (Canon) on Nov 21, 2022 at 16:10 UTC

      No, I don't think I did. From a cursory look, these tie the entire CSV, or maybe even load the entire thing into memory. Their documentation says that they have speed problems for some operations. My utility processes the CSV as a stream and only holds one line at a time in memory. Because of that, it is generally lightning fast. Also, I often use it as only one step within a pipeline.

      For cases where it is necessary to load the entire table into memory, I tend to use something like csvsql from the (Python) csvkit. If I just need to delete or reorder some columns, I tend to use csvtool (from OCaml). But my tool shines for the cases in between those two, when you need to do simple data transformations, like converting a column to hexadecimal, or adding a column that is the sum or concatenation of two or more others.
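      Two sketches of that middle ground (the column positions and file name are assumptions):

          # Convert the fourth column to hexadecimal
          csved '$F[3] = sprintf "%x", $F[3]' data.csv

          # Append a column holding the sum of the first two
          csved 'push @F, $F[0] + $F[1]' data.csv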

      It is mostly intended to be similar to awk, but then reading and writing CSV instead of text (and offering the flexibility of Perl). In fact, that's why I added the -b and -e options, to have something similar to BEGIN and END, even though I haven't yet used those options in a real scenario.

      The horrible way I use tie just allows you to address the field values in the current line by their column name using a hash called %F, even though those are actually stored in the array @F, which you can also use directly. The tie just passes the actions through after applying a mapping from field name to array index. But you can also work with only the array and ignore the tied hash, which was in fact the only possibility until two weeks ago.
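      A minimal illustration of that pass-through, using the Tie::FieldNames package from the script above (the field names and data are invented for the example):

          tie my %row, 'Tie::FieldNames', 'id', 'name', 'price';
          my @data = (1, 'widget', 9.99);
          (tied %row)->setcurrent(\@data);
          $row{price} *= 2;     # FETCH and STORE go straight to $data[2]
          print "@data\n";      # prints: 1 widget 19.98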

        In the vein of other tools: for an interactive approach, check out visidata, which gives you a tty spreadsheet editor for pretty much anything (csv, tsv, or xls) that Python can read.

        The cake is a lie.
