http://qs321.pair.com?node_id=1095668


in reply to Re^3: Speeds vs functionality
in thread Speeds vs functionality

I believe Modern Perl should have a core module that can easily parse these simple Unicode CSV records. It should handle them in any character encoding scheme of Unicode:  UTF-8, UTF-16, or UTF-32. And it should handle the Unicode byte order mark seamlessly.

Why not?

🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen
🎥🎥 Ethan Coen🎥
🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥

sep_char	🎬	U+1F3AC CLAPPER BOARD (UTF-8: F0 9F 8E AC)
quote_char	🎥	U+1F3A5 MOVIE CAMERA  (UTF-8: F0 9F 8E A5)
escape_char	🎥	U+1F3A5 MOVIE CAMERA  (UTF-8: F0 9F 8E A5)
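
The byte sequences in the table above can be checked with the core Encode module. This is only a minimal sketch: `hex_bytes` is a hypothetical helper name, not part of any module discussed in this thread. It also prints the UTF-16BE and UTF-32BE forms, since all three encoding schemes are mentioned above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one character and return its bytes as
# space-separated uppercase hex.
sub hex_bytes {
    my ($char, $encoding) = @_;
    my $bytes = encode($encoding, $char);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

for my $encoding (qw(UTF-8 UTF-16BE UTF-32BE)) {
    printf "%-8s  sep_char: %-11s  quote_char: %s\n",
        $encoding,
        hex_bytes("\x{1F3AC}", $encoding),    # U+1F3AC CLAPPER BOARD
        hex_bytes("\x{1F3A5}", $encoding);    # U+1F3A5 MOVIE CAMERA
}
```

Both characters lie outside the Basic Multilingual Plane, so each takes four bytes in UTF-8 and a surrogate pair in UTF-16.
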
"Film","Year","Awards","Nominations","Director"
"12 Years a Slave",2013,3,9,"🎥 Steve McQueen"
"Argo",2012,3,7,"🎥 Ben Affleck"
"The Artist",2012,5,10,"🎥 Michel Hazanavicius"
"The King's Speech",2010,4,12,"🎥 Tom Hooper"
"The Hurt Locker",2009,6,9,"🎥 Kathryn Bigelow"
"Slumdog Millionaire",2008,8,10,"🎥 Danny Boyle"
"No Country for Old Men",2007,4,8,"🎥 Joel Coen
🎥 Ethan Coen"
"The Departed",2006,4,5,"🎥 Martin Scorsese"

I recognize that the current XS module for parsing CSV records, Text::CSV_XS (marvelously maintained by Tux), may not be the right module to use as the basis for a new, fully Unicode-capable module. But because Perl's native Unicode capabilities exceed those of most other programming languages, Perl should have a proper FSM-based Unicode CSV parser, even if it's pure Perl and not XS.
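
To make "FSM-based" concrete, here is a rough pure-Perl sketch of a four-state parser for a single record. It is an illustration under assumptions, not a proposal for a module's API: `parse_record` and the state names are hypothetical, the input is assumed to be an already decoded character string with the record terminator removed, and the escape character is assumed to equal the quote character, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: split one decoded record into fields.
# States: start (beginning of a field), unquoted, quoted, and
# quote_seen (a quote character was just read inside a quoted field).
sub parse_record {
    my ($text, $sep, $quote) = @_;
    my @fields;
    my $field = '';
    my $state = 'start';

    for my $ch (split //, $text) {
        if ($state eq 'start') {
            if    ($ch eq $quote) { $state = 'quoted' }
            elsif ($ch eq $sep)   { push @fields, $field; $field = '' }
            else                  { $field .= $ch; $state = 'unquoted' }
        }
        elsif ($state eq 'unquoted') {
            if ($ch eq $sep) { push @fields, $field; $field = ''; $state = 'start' }
            else             { $field .= $ch }
        }
        elsif ($state eq 'quoted') {
            if ($ch eq $quote) { $state = 'quote_seen' }
            else               { $field .= $ch }    # embedded newlines land here
        }
        else {    # quote_seen
            if    ($ch eq $quote) { $field .= $quote; $state = 'quoted' }    # doubled quote
            elsif ($ch eq $sep)   { push @fields, $field; $field = ''; $state = 'start' }
            else                  { die "Unexpected character after closing quote\n" }
        }
    }
    die "Unterminated quoted field\n" if $state eq 'quoted';
    push @fields, $field;
    return \@fields;
}

my ($SEP, $QUOTE) = ("\x{1F3AC}", "\x{1F3A5}");    # CLAPPER BOARD, MOVIE CAMERA
my $record = "${QUOTE}Argo${QUOTE}${SEP}2012${SEP}3${SEP}7${SEP}"
           . "${QUOTE}${QUOTE}${QUOTE} Ben Affleck${QUOTE}";
my $fields = parse_record($record, $SEP, $QUOTE);
print scalar(@$fields), " fields\n";
```

Because the loop compares whole characters rather than bytes, the same code works regardless of which Unicode encoding scheme the data was decoded from.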

I long ago accepted that Unicode conformance and comparative slowness go hand in hand 👫. So what? Look what you're trading a few seconds here and there for:  the technological foundation of World Peace ☮ and Universal Love 💕.

UPDATE:  Removed references to core module. I don't care about that. I just want a Unicode-capable Perl CSV module.
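
For the byte order mark handling asked for above, a minimal sketch using the core Encode module might look like the following. The helper names `sniff_bom` and `decode_with_bom` are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption of this sketch, not something any module in this thread promises.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return (encoding name, BOM length in bytes).
# UTF-32 is tested before UTF-16 because the UTF-32LE BOM begins with
# the same two bytes as the UTF-16LE BOM.
sub sniff_bom {
    my ($bytes) = @_;
    return ('UTF-32LE', 4) if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return ('UTF-32BE', 4) if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return ('UTF-8',    3) if $bytes =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16LE', 2) if $bytes =~ /\A\xFF\xFE/;
    return ('UTF-16BE', 2) if $bytes =~ /\A\xFE\xFF/;
    return ('UTF-8',    0);    # assumption: no BOM means UTF-8
}

# Hypothetical helper: strip the BOM and decode the rest to characters.
sub decode_with_bom {
    my ($bytes) = @_;
    my ($encoding, $skip) = sniff_bom($bytes);
    my $body = substr $bytes, $skip;
    return decode($encoding, $body);
}

my $sample = encode('UTF-16LE', "\x{FEFF}Film\x{1F3AC}Year");
my $text   = decode_with_bom($sample);
printf "%d characters after decoding\n", length $text;
```

One known ambiguity remains: UTF-16LE data whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by the BOM alone.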

Re^5: Speeds vs functionality
by farang (Chaplain) on Jul 31, 2014 at 23:50 UTC

    Text::CSV_PP is able to parse that text, at least in UTF-8.

    use v5.12;
    use warnings;
    use utf8::all;
    use Text::CSV_PP;
    
    my $csv = Text::CSV_PP->new({
        binary      => 1,
        quote_char  => '🎥',
        escape_char => '🎥',
        sep_char    => '🎬',
    }) or die "Cannot use CSV_PP: " . Text::CSV_PP->error_diag();
    
    my @rows;
    my $fh = *DATA;
    while ( my $row = $csv->getline( $fh ) ) {
        push @rows, $row;
    }
    $csv->eof or $csv->error_diag();
    for ( @rows ) {
        printf("%-25s%s\n", $_->[0], $_->[4]);
    }
    __DATA__
    🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
    🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
    🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
    🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
    🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
    🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
    🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
    🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen 🎥🎥 Ethan Coen🎥
    🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥
    

    Output:

    Film                     Director
    12 Years a Slave         🎥 Steve McQueen
    Argo                     🎥 Ben Affleck
    The Artist               🎥 Michel Hazanavicius
    The King's Speech        🎥 Tom Hooper
    The Hurt Locker          🎥 Kathryn Bigelow
    Slumdog Millionaire      🎥 Danny Boyle
    No Country for Old Men   🎥 Joel Coen 🎥 Ethan Coen
    The Departed             🎥 Martin Scorsese
    

      Booyah! farang FTW!

      Here's my test with a very lightly refactored version of the same script:

      use v5.14;
      use strict;
      use warnings;
      use utf8;
      
      use Text::CSV_PP;
      
      binmode STDOUT, ':encoding(UTF-8)';
      
      my $csv = Text::CSV_PP->new({
          sep_char    => '🎬',
          quote_char  => '🎥',
          escape_char => '🎥',
          binary      => 1,
      });
      
      my @rows;
      
      my $fh = *DATA;
      
      while (my $row = $csv->getline($fh)) {
          push @rows, $row;
      }
      
      $csv->eof() or $csv->error_diag();
      
      for my $row (@rows) {
          $row->[4] =~ s/\n\s*/, /g;
      
          printf "%-24s %s\n", $row->[0], $row->[4];
      }
      
      exit 0;
      
      __DATA__
      🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
      🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
      🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
      🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
      🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
      🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
      🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
      🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen
      🎥🎥 Ethan Coen🎥
      🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥
      

      This correctly produces:

      Film                     Director
      12 Years a Slave         🎥 Steve McQueen
      Argo                     🎥 Ben Affleck
      The Artist               🎥 Michel Hazanavicius
      The King's Speech        🎥 Tom Hooper
      The Hurt Locker          🎥 Kathryn Bigelow
      Slumdog Millionaire      🎥 Danny Boyle
      No Country for Old Men   🎥 Joel Coen, 🎥 Ethan Coen
      The Departed             🎥 Martin Scorsese
      

      Notice that this version handles the literal newline (\n, or CR-LF on Windows) embedded in the Coen brothers record, which I replace with ', ' in the output.
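
The newline substitution can be tried on its own. In this small sketch, `flatten_field` is a hypothetical name, and the optional `\r` is an addition for data read without a `:crlf` layer; it is not in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: join the lines of a multi-line field with ', '.
# The optional \r covers CR-LF data read without a :crlf layer
# (an assumption added here, not part of the script above).
sub flatten_field {
    my ($field) = @_;
    (my $flat = $field) =~ s/\r?\n\s*/, /g;
    return $flat;
}

my $directors = flatten_field("\x{1F3A5} Joel Coen\n\x{1F3A5} Ethan Coen");
printf "%d characters after flattening\n", length $directors;
```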

      Thank you, farang. I stand corrected:  there is a Unicode-capable CSV parser/generator Perl module on CPAN. And I think you just solved a very long-lived problem for me.

      Shoot :) I had used the mutators to set the separator and the other values, and I couldn't get it to work :) Thanks, farang.

      It works on a UTF-16 CSV file.

      use v5.14;
      use strict;
      use warnings;
      use utf8;
      
      use autodie qw( open close );
      use Text::CSV_PP;
      
      @ARGV == 1 or die "Usage: perl $0 <CSV file>\n";
      
      my $file = shift;
      
      open my $fh, '<:raw:perlio:encoding(UTF-16):crlf', $file;
      
      my $csv = Text::CSV_PP->new({
          sep_char    => '🎬',
          quote_char  => '🎥',
          escape_char => '🎥',
          binary      => 1,
      });
      
      my @rows;
      
      while (my $row = $csv->getline($fh)) {
          push @rows, $row;
      }
      
      $csv->eof() or $csv->error_diag();
      
      close $fh;
      
      binmode STDOUT, ':raw:perlio:encoding(UTF-16LE):crlf';
      
      for my $row (@rows) {
          $row->[4] =~ s/\n\s*/, /g;
      
          printf "%-24s %s\n", $row->[0], $row->[4];
      }
      
      exit 0;
      

      See these nodes for an explanation of the UTF-16 PerlIO nonsense required on Microsoft Windows.
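
As a rough illustration of what that layer stack does, the sketch below writes a small UTF-16LE file through it, then reads the file back. It uses only core modules. The helper names are hypothetical, and fixing the encoding to UTF-16LE for both directions (with the BOM written and stripped by hand) is a simplification made for this sketch.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Same stack as in the script above, fixed to little-endian so that one
# string serves for both writing and reading.
my $LAYERS = ':raw:perlio:encoding(UTF-16LE):crlf';

# Hypothetical helper: write text as UTF-16LE with a BOM and CR-LF line ends.
sub write_utf16le {
    my ($path, $text) = @_;
    open my $out, ">$LAYERS", $path or die "Cannot write $path: $!";
    print {$out} "\x{FEFF}", $text;    # UTF-16LE does not add a BOM by itself
    close $out or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: read the file back as characters with \n line ends.
sub read_utf16le {
    my ($path) = @_;
    open my $in, "<$LAYERS", $path or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> };
    close $in;
    $text =~ s/\A\x{FEFF}//;           # drop the BOM written above
    return $text;
}

my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close $tmp_fh;

write_utf16le($tmp_path, "Film\x{1F3AC}Year\nArgo\x{1F3AC}2012\n");
my $round_trip = read_utf16le($tmp_path);
printf "%d lines read back\n", scalar(() = $round_trip =~ /\n/g);
```

The order matters: `:crlf` sits above `:encoding`, so line-end translation happens on characters, and the encoding layer then turns CR and LF into two bytes each.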

Re^5: Speeds vs functionality (utf8 csv)
by tye (Sage) on Jul 31, 2014 at 03:09 UTC

    There are already at least a few CSV parsing modules on CPAN that don't just wrap Text::CSV_XS. A pure-Perl CSV parser is likely going to "just work" when given a file handle with the right encoding declared and separator/quote/escape strings properly decoded.
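
A rough sketch of that idea, using only core Perl: the file handle declares the encoding, the separator is decoded into a character string, and after that plain string operations work on whole characters. `split_unquoted_lines` is a hypothetical helper, and it deliberately ignores quoting and escaping, so it is not a CSV parser.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: split each line of UTF-8 encoded data on a
# separator that is also supplied as UTF-8 bytes. Quotes are NOT handled.
sub split_unquoted_lines {
    my ($utf8_bytes, $sep_bytes) = @_;
    my $sep = decode('UTF-8', $sep_bytes);    # separator as characters

    # In-memory file handle with the encoding declared on it.
    open my $fh, '<:encoding(UTF-8)', \$utf8_bytes
        or die "Cannot open in-memory handle: $!";

    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ split /\Q$sep\E/, $line, -1 ];    # -1 keeps trailing empty fields
    }
    close $fh;
    return \@rows;
}

my $clapper = "\xF0\x9F\x8E\xAC";    # UTF-8 bytes of U+1F3AC CLAPPER BOARD
my $rows    = split_unquoted_lines("2013${clapper}3${clapper}9\n2012${clapper}3${clapper}7\n", $clapper);
printf "%d rows, %d fields in the first\n", scalar @$rows, scalar @{ $rows->[0] };
```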

    Parse::CSV and Text::xSV are the first two I would try. My expectation is that both will handle utf-8 just fine. And if either doesn't, I suspect that fixing that problem won't be difficult.

    - tye        

      Parse::CSV says, "The actual parsing is done using Text::CSV_XS." It just wraps Text::CSV_XS.

      Text::xSV says, "When I say single character separator, I mean it." One glance at the source code and it's obvious the author doesn't mean single character; he means single byte. There's nothing at all in the module about any character encoding—least of all about one of the Unicode character encoding schemes (UTF-8, UTF-16, etc.). What's more, the string delimiter character, quote ("), is hardwired into the module. It's not user-configurable.

      I've done my research. I know the landscape. There isn't a module on CPAN that will parse the example Unicode CSV records in my post; nothing even comes close. If there were one, I'd be using it, and I wouldn't have written what I wrote.

      If you prove me wrong by demonstrating how to parse the Academy Award Best Picture winners Unicode CSV records using an existing CPAN module, I'll thank you profusely for finally solving my problem, I'll publicly apologize to you for suggesting you were wrong, and I'll 🙊.

        Ah. I was fooled by Parse::CSV making a big deal that "other modules" wrapped Text::CSV_XS. Thanks for the correction.

        "When I say single character separator, I mean it." One glance at the source code and it's obvious the author doesn't mean single character; he means single byte. There's nothing at all in the module about any character encoding—least of all about one of the Unicode character encoding schemes

        Yes, that is what I expected. A Perl module doing absolutely nothing about character encodings is the way that a module is most likely to be able to deal with UTF-8 characters just fine. When modules try to deal with UTF-8 characters, then you end up having to deal with how the module author chose to do things rather than just dealing with how Perl chose to deal with UTF-8.

        I've had a hand in getting Unicode support into many layers of quite a few projects and the biggest problems have always been with the modules that try to do stuff with encodings. The only problems I recall with modules that don't deal with encodings is the few that deal with protocols with something like a Content-Length: header where the module naively uses length() when it should have used bytes::length().
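
The Content-Length pitfall is easy to reproduce. The sketch below does not use bytes::length(); as a swapped-in alternative it encodes the body explicitly with the core Encode module and measures the result. `content_length` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: number of bytes the body occupies once it is
# encoded as UTF-8 for the wire.
sub content_length {
    my ($body) = @_;
    return length encode_utf8($body);
}

my $body = "\x{1F3A5}Film\x{1F3A5}";    # MOVIE CAMERA, "Film", MOVIE CAMERA
printf "characters: %d, bytes: %d\n", length($body), content_length($body);
```

length() on the decoded string counts characters, so using it for the header would understate the size of any body containing non-ASCII text.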

        But CSV parsing isn't even close to rocket surgery. There are a few common pitfalls. It takes just a small bit of competence and/or research to implement CSV parsing quite correctly. I really don't see the big deal with Text::CSV_XS needing to be all-singing/all-dancing. That just leads to bloat.

        Heck, if I were implementing a CSV parsing module, I'd probably have separate code for the case of single-character separators, quotes, and escapes. Because the reasonable way to implement CSV parsing efficiently is rather different between when "quote" is a single character and when it is more than 1 character.

        So I see no problem having a whole separate module for dealing with multi-character quotes. Use the standard module if you don't have to deal with such. Use the other module when you do. Each module is simpler because the multi-character one doesn't have to also try to include code to maximize efficiency for when a quote is a single character.

        - tye        

        Being able to parse that example (or similar CSV data) is exactly why I started implementing multi-byte separator characters.

        As said elsewhere in this thread, if I am happy with the result, I'll try to implement multi-byte quotation and escape characters as well, with quotation having a much higher priority than escapes. Current state of development:

        $ cat films.csv
        🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
        🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
        🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
        🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
        🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
        🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
        🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
        🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen
        🎥🎥 Ethan Coen🎥
        🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥
        $ head -1 films.csv | dump
        DUMP 0.6.01
        
        00000000  F0 9F 8E A5 46 69 6C 6D  F0 9F 8E A5 F0 9F 8E AC    ....Film........
        00000010  F0 9F 8E A5 59 65 61 72  F0 9F 8E A5 F0 9F 8E AC    ....Year........
        00000020  F0 9F 8E A5 41 77 61 72  64 73 F0 9F 8E A5 F0 9F    ....Awards......
        00000030  8E AC F0 9F 8E A5 4E 6F  6D 69 6E 61 74 69 6F 6E    ......Nomination
        00000040  73 F0 9F 8E A5 F0 9F 8E  AC F0 9F 8E A5 44 69 72    s............Dir
        00000050  65 63 74 6F 72 F0 9F 8E  A5 0A                      ector.....
        
        $ perl -C3 -MCSV -E'csv (out => *STDOUT, in => csv (in => "films.csv", sep => "\N{CLAPPER BOARD}"))'
        "🎥Film🎥","🎥Year🎥","🎥Awards🎥","🎥Nominations🎥","🎥Director🎥"
        "🎥12 Years a Slave🎥",2013,3,9,"🎥🎥🎥 Steve McQueen🎥"
        "🎥Argo🎥",2012,3,7,"🎥🎥🎥 Ben Affleck🎥"
        "🎥The Artist🎥",2012,5,10,"🎥🎥🎥 Michel Hazanavicius🎥"
        "🎥The King's Speech🎥",2010,4,12,"🎥🎥🎥 Tom Hooper🎥"
        "🎥The Hurt Locker🎥",2009,6,9,"🎥🎥🎥 Kathryn Bigelow🎥"
        "🎥Slumdog Millionaire🎥",2008,8,10,"🎥🎥🎥 Danny Boyle🎥"
        "🎥No Country for Old Men🎥",2007,4,8,"🎥🎥🎥 Joel Coen"
        "🎥🎥 Ethan Coen🎥"
        "🎥The Departed🎥",2006,4,5,"🎥🎥🎥 Martin Scorsese🎥"
        $
        

        Enjoy, Have FUN! H.Merijn
Re^5: Speeds vs functionality
by Anonymous Monk on Jul 31, 2014 at 03:06 UTC