http://qs321.pair.com?node_id=1095606


in reply to Re^2: Speeds vs functionality
in thread Speeds vs functionality

The cache, as currently implemented, was added to achieve a speed boost of (iirc) about 25%. It is needed to reduce access to the object (the $self hash), as those lookups are very expensive.

Unicode whitespace isn't important for this parser, as it is not a special "character" unless it is the separator, the quotation, or the escape character. Unicode whitespace will just end up being binary.

XS is not PP :) Those characters could be int indeed, but that would probably mean that the whole parser (written in 1998 and modified/extended over time) has to be rewritten. It /might/ be worth the effort in the end, but I do not have the time to start that experiment.

Never tried FSM (unless the current state-machine already is an FSM). I simplified the parser as I got it when I took over maintenance. Over time a lot of bugs were fixed and new (required and requested) features were added.

update: added remark about FSM


Enjoy, Have FUN! H.Merijn

Replies are listed 'Best First'.
Re^4: Speeds vs functionality
by salva (Canon) on Jul 31, 2014 at 09:11 UTC
    Is there any reason stopping you from keeping the parser state as a persistent C struct?

    Correct me if I am wrong: currently, the state is kept exclusively on the Perl side and the cache is an (ugly) hack to be able to regenerate the C struct faster.

    Why not just store the state on the C side and keep it as a pointer inside an IV? That's what most XS libs do and I am sure it would improve the parser speed a lot, and at the same time simplify the code!

    Are the module users allowed to modify the object hash directly?

      I am keeping it as a cache, because the user is allowed to alter the behavior of the parser between parses. And that is useful. I also agree that the cache-dealing code has by now grown into a horrid state, and I was already playing with the idea of replacing it with a single PV that holds a memcpy of the csv struct (which might need some extending then).

      In earlier days, the end user was allowed to alter the hash. I now only allow (reliable) changes through the method calls. That implies that I in theory can move all code to XS.

      I however am not yet sure if that would simplify the code, though it probably will :)
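
      A minimal pure-Perl sketch of that idea, assuming attributes change only through accessors: every write drops a cached packed scalar, which is rebuilt on the next read. The class name My::CsvState and the method packed_state are hypothetical and are not part of Text::CSV_XS; the real module keeps its state in XS.

```perl
package My::CsvState;    # hypothetical class, for illustration only
use strict;
use warnings;

my @FIELDS = qw( sep_char quote_char escape_char );

sub new {
    my ($class, %args) = @_;
    return bless {
        sep_char    => ',',
        quote_char  => '"',
        escape_char => '"',
        %args,
    }, $class;
}

# One accessor per attribute; any write invalidates the cached state.
for my $field (@FIELDS) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$field" } = sub {
        my $self = shift;
        if (@_) {
            $self->{$field} = shift;
            delete $self->{_cache};
        }
        return $self->{$field};
    };
}

# Pack the three code points as 32-bit integers into a single scalar,
# rebuilt only when a setter has dropped the previous copy.
sub packed_state {
    my $self = shift;
    $self->{_cache} //= pack 'N3', map { ord( $self->{$_} ) } @FIELDS;
    return $self->{_cache};
}

package main;

my $state = My::CsvState->new;
print join(',', unpack('N3', $state->packed_state)), "\n";   # 44,34,34
$state->sep_char("\x{1F3AC}");
print join(',', unpack('N3', $state->packed_state)), "\n";   # 127916,34,34
```

      Storing code points as integers also means a multi-byte separator costs nothing extra in the packed form.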


      Enjoy, Have FUN! H.Merijn
Re^4: Speeds vs functionality
by Jim (Curate) on Jul 31, 2014 at 00:09 UTC

    I believe Modern Perl should have a core module that can easily parse these simple Unicode CSV records. It should handle them in any character encoding scheme of Unicode:  UTF-8, UTF-16, or UTF-32. And it should handle the Unicode byte order mark seamlessly.

    Why not?

    🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
    🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
    🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
    🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
    🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
    🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
    🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
    🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen
    🎥🎥 Ethan Coen🎥
    🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥
    

    sep_char	🎬	U+1F3AC CLAPPER BOARD (UTF-8: F0 9F 8E AC)
    quote_char	🎥	U+1F3A5 MOVIE CAMERA  (UTF-8: F0 9F 8E A5)
    escape_char	🎥	U+1F3A5 MOVIE CAMERA  (UTF-8: F0 9F 8E A5)
    
    "Film","Year","Awards","Nominations","Director"
    "12 Years a Slave",2013,3,9,"🎥 Steve McQueen"
    "Argo",2012,3,7,"🎥 Ben Affleck"
    "The Artist",2012,5,10,"🎥 Michel Hazanavicius"
    "The King's Speech",2010,4,12,"🎥 Tom Hooper"
    "The Hurt Locker",2009,6,9,"🎥 Kathryn Bigelow"
    "Slumdog Millionaire",2008,8,10,"🎥 Danny Boyle"
    "No Country for Old Men",2007,4,8,"🎥 Joel Coen
    🎥 Ethan Coen"
    "The Departed",2006,4,5,"🎥 Martin Scorsese"
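
    The UTF-8 byte sequences listed for sep_char and quote_char can be checked with the core Encode module. This is only a small sketch; the helper name utf8_hex is made up for the example. It shows that each delimiter is one character but four bytes, which is why a parser that assumes single-byte delimiters cannot handle these records.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Return the UTF-8 encoding of one code point as space-separated hex bytes.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

for my $cp (0x1F3AC, 0x1F3A5) {
    my $byte_count = length encode('UTF-8', chr($cp));
    printf "U+%04X  1 character, %d bytes: %s\n", $cp, $byte_count, utf8_hex($cp);
}
# U+1F3AC  1 character, 4 bytes: F0 9F 8E AC
# U+1F3A5  1 character, 4 bytes: F0 9F 8E A5
```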
    

    I recognize that the current XS core module for parsing CSV records, Text::CSV_XS (marvelously maintained by Tux), may not be the right module to use as the basis for a new, fully Unicode-capable module. But because Perl's native Unicode capabilities exceed those of most other programming languages, Perl should have a proper FSM-based Unicode CSV parser, even if it's pure Perl and not XS.

    I long ago accepted that Unicode conformance and comparative slowness go hand in hand 👫. So what? Look what you're trading a few seconds here and there for:  the technological foundation of World Peace ☮ and Universal Love 💕.

    UPDATE:  Removed references to core module. I don't care about that. I just want a Unicode-capable Perl CSV module.

      Text::CSV_PP is able to parse that text, at least in UTF-8.

      use v5.12;
      use warnings;
      use utf8::all;
      use Text::CSV_PP;
      
      my $csv = Text::CSV_PP->new ( 
          { binary      => 1 , 
            quote_char  => '🎥' ,
            escape_char => '🎥' ,
            sep_char    => '🎬'  } )
        or die "Cannot use CSV_PP: "
         .Text::CSV_PP->error_diag ();
      
      my @rows;
      my $fh = *DATA;
      while ( my $row = $csv->getline( $fh ) ) {
          push @rows, $row;
      }
      $csv->eof or $csv->error_diag();
      for ( @rows ) {
          printf("%-25s%s\n", $_->[0], $_->[4]);
      }
      __DATA__
      🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
      🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
      🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
      🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
      🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
      🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
      🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
      🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen 🎥🎥 Ethan Coen🎥
      🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥
      

      Output:

      Film                     Director
      12 Years a Slave         🎥 Steve McQueen
      Argo                     🎥 Ben Affleck
      The Artist               🎥 Michel Hazanavicius
      The King's Speech        🎥 Tom Hooper
      The Hurt Locker          🎥 Kathryn Bigelow
      Slumdog Millionaire      🎥 Danny Boyle
      No Country for Old Men   🎥 Joel Coen 🎥 Ethan Coen
      The Departed             🎥 Martin Scorsese
      

        Booyah! farang FTW!

        Here's my test with a very lightly refactored version of the same script:

        use v5.14;
        use strict;
        use warnings;
        use utf8;
        
        use Text::CSV_PP;
        
        binmode STDOUT, ':encoding(UTF-8)';
        
        my $csv = Text::CSV_PP->new({
            sep_char    => '🎬',
            quote_char  => '🎥',
            escape_char => '🎥',
            binary      => 1,
        });
        
        my @rows;
        
        my $fh = *DATA;
        
        while (my $row = $csv->getline($fh)) {
            push @rows, $row;
        }
        
        $csv->eof() or $csv->error_diag();
        
        for my $row (@rows) {
            $row->[4] =~ s/\n\s*/, /g;
        
            printf "%-24s %s\n", $row->[0], $row->[4];
        }
        
        exit 0;
        
        __DATA__
        🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
        🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
        🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
        🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
        🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
        🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
        🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
        🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen
        🎥🎥 Ethan Coen🎥
        🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥
        

        This correctly produces:

        Film                     Director
        12 Years a Slave         🎥 Steve McQueen
        Argo                     🎥 Ben Affleck
        The Artist               🎥 Michel Hazanavicius
        The King's Speech        🎥 Tom Hooper
        The Hurt Locker          🎥 Kathryn Bigelow
        Slumdog Millionaire      🎥 Danny Boyle
        No Country for Old Men   🎥 Joel Coen, 🎥 Ethan Coen
        The Departed             🎥 Martin Scorsese
        

        Notice that this version handles the literal newline (\n; CR-LF in the file on Windows) in the Coen brothers record, which I change to ', ' in the output.
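
        A small variation on that substitution, offered as a sketch rather than a required change: if the handle has no :crlf layer, the field still contains the carriage return, and \n\s* would leave it behind. Using \R (available since Perl 5.10) matches CR-LF, LF, or CR as one line break. The helper name flatten_field is made up for the example.

```perl
use strict;
use warnings;

# Collapse each embedded line break, plus any indentation after it, into ", ".
sub flatten_field {
    my ($field) = @_;
    $field =~ s/\R\s*/, /g;
    return $field;
}

my $director = "Joel Coen\r\n  Ethan Coen";
print flatten_field($director), "\n";   # Joel Coen, Ethan Coen
```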

        Thank you, farang. I stand corrected:  there is a Unicode-capable CSV parser/generator Perl module on CPAN. And I think you just solved a very long-lived problem for me.

        shoot :) I had used the mutators to set separator and those values and I couldn't get it to work :) thanks farang

        It works on a UTF-16 CSV file.

        use v5.14;
        use strict;
        use warnings;
        use utf8;
        
        use autodie qw( open close );
        use Text::CSV_PP;
        
        @ARGV == 1 or die "Usage: perl $0 <CSV file>\n";
        
        my $file = shift;
        
        open my $fh, '<:raw:perlio:encoding(UTF-16):crlf', $file;
        
        my $csv = Text::CSV_PP->new({
            sep_char    => '🎬',
            quote_char  => '🎥',
            escape_char => '🎥',
            binary      => 1,
        });
        
        my @rows;
        
        while (my $row = $csv->getline($fh)) {
            push @rows, $row;
        }
        
        $csv->eof() or $csv->error_diag();
        
        close $fh;
        
        binmode STDOUT, ':raw:perlio:encoding(UTF-16LE):crlf';
        
        for my $row (@rows) {
            $row->[4] =~ s/\n\s*/, /g;
        
            printf "%-24s %s\n", $row->[0], $row->[4];
        }
        
        exit 0;
        

        See these nodes for an explanation of the UTF-16 PerlIO nonsense required on Microsoft Windows.
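
        The byte order mark handling behind the generic UTF-16 encoding can be seen in isolation with the core Encode module. A minimal sketch, using an in-memory record rather than a file: the 'UTF-16' decoder reads the BOM, picks the byte order, and drops the BOM from the result, whereas 'UTF-16LE' and 'UTF-16BE' name a fixed byte order and write no BOM.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $text = "\x{1F3A5}Film\x{1F3A5}\x{1F3AC}2013";

# Build the same record in both byte orders, each prefixed with its BOM.
my $le = "\xFF\xFE" . encode('UTF-16LE', $text);
my $be = "\xFE\xFF" . encode('UTF-16BE', $text);

# Decode with the generic name; the BOM decides which byte order is used.
for my $bytes ($le, $be) {
    my $decoded = decode('UTF-16', $bytes);
    print $decoded eq $text ? "round trip ok\n" : "mismatch\n";
}
# round trip ok
# round trip ok
```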

      There are already at least a few CSV parsing modules on CPAN that don't just wrap Text::CSV_XS. A pure-Perl CSV parser is likely going to "just work" when given a file handle with the right encoding declared and separator/quote/escape strings properly decoded.
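
      What "properly decoded" means in practice can be shown without any CSV module. In this sketch a plain split stands in for a parser: the same separator string finds nothing in undecoded UTF-8 bytes but splits the decoded text as expected.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $sep   = "\x{1F3AC}";                               # separator as a decoded character
my $bytes = encode('UTF-8', join($sep, 2013, 3, 9));   # what an undecoded handle returns

# Splitting raw bytes: code point U+1F3AC never appears, so nothing is split.
my @from_bytes = split /\Q$sep\E/, $bytes;

# Splitting decoded characters: the separator matches as one character.
my @from_chars = split /\Q$sep\E/, decode('UTF-8', $bytes);

printf "undecoded: %d field(s), decoded: %d field(s)\n",
    scalar @from_bytes, scalar @from_chars;
# undecoded: 1 field(s), decoded: 3 field(s)
```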

      Parse::CSV and Text::xSV are the first two I would try. My expectation is that both will handle UTF-8 just fine. And if either doesn't, I suspect that fixing that problem won't be difficult.

      - tye        

        Parse::CSV says, "The actual parsing is done using Text::CSV_XS." It just wraps Text::CSV_XS.

        Text::xSV says, "When I say single character separator, I mean it." One glance at the source code and it's obvious the author doesn't mean single character; he means single byte. There's nothing at all in the module about any character encoding—least of all about one of the Unicode character encoding schemes (UTF-8, UTF-16, etc.). What's more, the string delimiter character, quote ("), is hardwired into the module. It's not user-configurable.

        I've done my research. I know the landscape. There isn't a module on CPAN that will parse the example Unicode CSV records in my post—nothing even close. If there was one, I'd be using it, and I wouldn't have written what I wrote.

        If you prove me wrong by demonstrating how to parse the Academy Award Best Picture winners Unicode CSV records using an existing CPAN module, I'll thank you profusely for finally solving my problem, I'll publicly apologize to you for suggesting you were wrong, and I'll 🙊.