Re^2: Suggestions to make this code more Perlish

by TheDamian (Priest)
on Mar 30, 2014 at 11:25 UTC

in reply to Re: Suggestions to make this code more Perlish
in thread Suggestions to make this code more Perlish

Just for interest, here's the same approach in Perl 6:

    use v6;

    my regex fields { \" <( .*? )> \" | <-[,"]>* }

    my $input  = open 'input.csv',  :r;
    my $output = open 'output.tff', :w;

    for $input.lines {
        next unless /<fields>* % ','/;
        $output.print: @<fields>.join(chr 31) ~ chr 30;
    }

    $output.close;

That's not the shortest way to do it in Perl 6, but it's closest to the Perl 5 example above.

A shorter and more Perl6-ish version might be:

    my regex fields { \" <( .*? )> \" | <-[,"]>* }

    lines open 'input.csv'
        ==> map({@<fields>.join(chr 31) ~ chr 30 if /<fields>* % ','/})
        ==> spurt('output.tff');


Re^3: Suggestions to make this code more Perlish
by kcott (Bishop) on Mar 30, 2014 at 15:08 UTC

    Thanks for that, Damian. I'm not really across Perl6 syntax. I looked in the Perl6 Regexes documentation; unfortunately, there are several sections with nothing more than "TODO", including "Alternation" and "Grouping and Capturing", so I pretty much gave up at that point. Can you suggest a better source of documentation?

    Anyway, inspired by your "shorter and more Perl6-ish version", here's a shorter and more Perl5-ish version of my original (this replaces the while loop, everything else remains the same):

        my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*))(?:,|\000)};
        print $tff_fh $_ for map { chomp; s/$re/$+{a}\037/g; $_ } <$csv_fh>;

    Due to the issue described in "Repeated Patterns Matching a Zero-length Substring", I was getting '\037\037' (at the end of $_) after each 's///g': hence the 's/[\037]+$//;' to remove them.

    I found that by replacing ',?' with '(?:,|\000)', I got zero '\037' characters after the 's///g' (so the 's/[\037]+$//;' wasn't needed at all). [Note: '(?:,|)', '(?:,|$)', '(?:,|\z)' and '(?:,|\Z)' all produced '\037\037' after each 's///g'.]

    While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening. As it could be a side effect that might behave differently in another Perl version (I'm using v5.18.1), and not being able to answer the inevitable "How does this work?" question, I left it out of my original solution.

    You, or someone else, may have a quick answer. If not, I was planning to spend a bit more time looking into this and, in the absence of finding a solution, post a more generalised example with a question later in the week.

    -- Ken

      Hi Ken,

      The best place to read up about Perl 6 regexes is the specification itself.

      You mused:

          "While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening."

      No, it's not anything to do with C string terminators.

      The problem with your previous version was that you were matching an optional comma at the end of each field and then replacing it with a definite "\037" every time. So, for the last field in each record (which, of course, isn't followed by a comma), you were nevertheless appending an unwanted "\037".

      The global substitution would then loop one last time, matching a final zero-character field (because of the (?<a>[^,]*) alternative, which can match nothing). The substitution on that empty field then causes a second unnecessary "\037" to be appended.
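
      As a rough illustration of those two steps, here is a minimal sketch. It assumes the earlier pattern was the one quoted in this thread with ',?' in place of '(?:,|\000)'; that reconstruction, the helper name convert_naive and the sample input are hypothetical, not taken from the original post.

      ```perl
      use strict;
      use warnings;

      # Assumed reconstruction of the earlier pattern: every field is
      # followed by an *optional* comma.
      my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*)),?};

      # Hypothetical helper: apply the substitution to one chomped record.
      sub convert_naive {
          my ($line) = @_;
          # Every match, including the final empty one, gets a "\037".
          $line =~ s/$re/$+{a}\037/g;
          return $line;
      }

      # Make the separators visible for printing.
      my $out = convert_naive('a,b,c');
      (my $visible = $out) =~ s/\037/<US>/g;
      print "$visible\n";   # prints a<US>b<US>c<US><US>
      ```

      The third <US> comes from the last field "c" (no comma, but a separator is appended anyway); the fourth comes from the zero-length match at the end of the string.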

      You could fix that by rewriting your original version something like this:

          open my $csv_fh, '<', 'input.csv';
          open my $tff_fh, '>', 'output.tff';

          my $field = qr{ " (?<field> [^"]* ) " | (?<field> [^,"]* ) }x;

          while (my $line = <$csv_fh>) {
              $line =~ s{ $field (?<comma> ,?) }
                        { $+{field} . ($+{comma} && chr 31) }gxe;
              $line =~ s{\n}{chr 30}xe;
              print {$tff_fh} $line;
          }

      This version still matches the optional comma each time, but now only appends a "\037" if there actually was a comma. Which means there are no extras to remove, once the line is complete.
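
      To try the substitution without any files, the same two statements can be wrapped in a small function and run on an in-memory string. This is only a sketch: the helper name convert_line and the sample record are hypothetical, and the <US>/<RS> markers are there purely to make chr 31 and chr 30 visible.

      ```perl
      use strict;
      use warnings;

      my $field = qr{ " (?<field> [^"]* ) " | (?<field> [^,"]* ) }x;

      # Hypothetical helper: convert one CSV line (with its newline) in memory.
      sub convert_line {
          my ($line) = @_;
          # Append chr 31 only when a comma was actually matched.
          $line =~ s{ $field (?<comma> ,?) }
                    { $+{field} . ($+{comma} && chr 31) }gxe;
          # Replace the trailing newline with the record separator.
          $line =~ s{\n}{chr 30}xe;
          return $line;
      }

      my $out = convert_line(qq{"x,y",b,c\n});
      (my $visible = $out) =~ s/\x1f/<US>/g;
      $visible =~ s/\x1e/<RS>/g;
      print "$visible\n";   # prints x,y<US>b<US>c<RS>
      ```

      The final zero-length match still happens, but since its comma capture is empty it contributes nothing to the output.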

      Note that I also removed the chomp and replaced it with an explicit substitution of the trailing newline. I felt that this highlights the transformation more clearly than did your clever (but subtle and "at-a-distance") use of $\.


        Thanks for the documentation link. That certainly has a lot more text than I found where I was previously looking: and no "TODO"s in sight. I have a bit of reading ahead of me.

        Yes, I was aware of why I was getting two \037 characters at the end (in the first solution). What I haven't figured out yet is why I was getting zero \037 characters at the end (when I changed ',?' to '(?:,|\000)' — in the second solution).

        Thanks also for the additional feedback: much appreciated.

        -- Ken
