PerlMonks  

Re^6: Speeds vs functionality

by Jim (Curate)
on Jul 31, 2014 at 04:17 UTC


in reply to Re^5: Speeds vs functionality (utf8 csv)
in thread Speeds vs functionality

Parse::CSV says, "The actual parsing is done using Text::CSV_XS." It just wraps Text::CSV_XS.

Text::xSV says, "When I say single character separator, I mean it." One glance at the source code and it's obvious the author doesn't mean single character; he means single byte. There's nothing at all in the module about any character encoding—least of all about one of the Unicode character encoding schemes (UTF-8, UTF-16, etc.). What's more, the string delimiter character, quote ("), is hardwired into the module. It's not user-configurable.

I've done my research. I know the landscape. There isn't a module on CPAN that will parse the example Unicode CSV records in my post—nothing even close. If there were one, I'd be using it, and I wouldn't have written what I wrote.

If you prove me wrong by demonstrating how to parse the Academy Award Best Picture winners Unicode CSV records using an existing CPAN module, I'll thank you profusely for finally solving my problem, I'll publicly apologize to you for suggesting you were wrong, and I'll 🙊.

Re^7: Speeds vs functionality (utf8 csv)
by tye (Sage) on Jul 31, 2014 at 07:33 UTC

    Ah. I was fooled by Parse::CSV making a big deal that "other modules" wrapped Text::CSV_XS. Thanks for the correction.

    "When I say single character separator, I mean it." One glance at the source code and it's obvious the author doesn't mean single character; he means single byte. There's nothing at all in the module about any character encoding—least of all about one of the Unicode character encoding schemes

    Yes, that is what I expected. A Perl module doing absolutely nothing about character encodings is the way that a module is most likely to be able to deal with UTF-8 characters just fine. When modules try to deal with UTF-8 characters, then you end up having to deal with how the module author chose to do things rather than just dealing with how Perl chose to deal with UTF-8.

    I've had a hand in getting Unicode support into many layers of quite a few projects and the biggest problems have always been with the modules that try to do stuff with encodings. The only problems I recall with modules that don't deal with encodings are with the few that handle protocols with something like a Content-Length: header, where the module naively uses length() when it should have used bytes::length().
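A minimal sketch of the Content-Length pitfall described above, assuming the body is a decoded character string that will go out on the wire as UTF-8. The variable names are illustrative only. The sketch counts octets by encoding explicitly with Encode rather than calling bytes::length(), because bytes::length() reports the size of Perl's internal representation, which matches the UTF-8 octet count only when the string happens to be stored internally as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A decoded (character) string containing one astral-plane character.
my $body = "Film \x{1F3AC} Year";

# length() on a decoded string counts characters (code points) ...
my $chars = length $body;

# ... but a Content-Length header needs the number of octets sent,
# so encode first and measure the encoded octet string.
my $octets = encode('UTF-8', $body);
my $bytes  = length $octets;

printf "characters: %d, bytes: %d\n", $chars, $bytes;
# prints "characters: 11, bytes: 14"
```

The three-octet difference is the CLAPPER BOARD character, which is one character but four octets in UTF-8.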

    But CSV parsing isn't even close to rocket surgery. There are a few common pitfalls. It takes just a small bit of competence and/or research to implement CSV parsing quite correctly. I really don't see the big deal with Text::CSV_XS needing to be all-singing/all-dancing. That just leads to bloat.

    Heck, if I were implementing a CSV parsing module, I'd probably have separate code for the case of single-character separators, quotes, and escapes. Because the reasonable way to implement CSV parsing efficiently is rather different between when "quote" is a single character and when it is more than 1 character.

    So I see no problem having a whole separate module for dealing with multi-character quotes. Use the standard module if you don't have to deal with such. Use the other module when you do. Each module is simpler because the multi-character one doesn't have to also try to include code to maximize efficiency for when a quote is a single character.

    - tye        

      Heck, if I were implementing a CSV parsing module, I'd probably have separate code for the case of single-character separators, quotes, and escapes. Because the reasonable way to implement CSV parsing efficiently is rather different between when "quote" is a single character and when it is more than 1 character.
      So I see no problem having a whole separate module for dealing with multi-character quotes. Use the standard module if you don't have to deal with such. Use the other module when you do. Each module is simpler because the multi-character one doesn't have to also try to include code to maximize efficiency for when a quote is a single character.

      Do you mean character or byte?

      I think you're using "multi-character" when what you actually mean is a single character (i.e., a single Unicode code point) that is encoded using multiple bytes in any one of the Unicode character encoding schemes:  UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. I don't think you truly mean a user-perceived character that consists of two or more Unicode code points (e.g., g̈ — U+0067 LATIN SMALL LETTER G + U+0308 COMBINING DIAERESIS).
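A minimal sketch of the code point versus user-perceived character distinction, assuming a decoded string. length() counts code points, while the regex escape \X matches one extended grapheme cluster; the variable names are illustrative.

```perl
use strict;
use warnings;

# U+0067 LATIN SMALL LETTER G followed by U+0308 COMBINING DIAERESIS
my $g = "\x{67}\x{308}";

# length() counts code points in a decoded string.
my $code_points = length $g;

# \X matches one extended grapheme cluster; count the matches.
my $graphemes = () = $g =~ /\X/g;

printf "code points: %d, grapheme clusters: %d\n", $code_points, $graphemes;
# prints "code points: 2, grapheme clusters: 1"
```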

      In my Academy Award Best Picture winners example, every CSV metacharacter is a single character. The field separator character is 🎬 (U+1F3AC CLAPPER BOARD), and both the string delimiter character and the string delimiter escape character are 🎥 (U+1F3A5 MOVIE CAMERA). These two characters are, of course, encoded using multiple bytes in every one of the Unicode character encoding schemes. In UTF-8, they're encoded using four bytes. In UTF-16, they're also encoded using four bytes (two surrogate code points). And in UTF-32, they're encoded using four bytes, naturally.
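The byte counts above can be checked with the core Encode module. This is a small sketch; it uses the explicit big-endian variants (UTF-16BE, UTF-32BE) so that no byte order mark is added to the count, and the %bytes hash is just an illustrative way to keep the results.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+1F3AC CLAPPER BOARD and U+1F3A5 MOVIE CAMERA
my @chars = ("\x{1F3AC}", "\x{1F3A5}");

my %bytes;
for my $char (@chars) {
    my $label = sprintf 'U+%04X', ord $char;
    for my $enc (qw(UTF-8 UTF-16BE UTF-32BE)) {
        # Encode the single character and measure the octet string.
        my $octets = encode($enc, $char);
        $bytes{$label}{$enc} = length $octets;
        printf "%s %-8s %d bytes\n", $label, $enc, $bytes{$label}{$enc};
    }
}
```

Every line of output reports 4 bytes: one character each time, four octets in each of the three encoding forms.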

      I'd like to see a truly Unicode-conformant CSV parser/generator module in Perl 5. It would leverage Perl's existing Unicode and character encoding capabilities; it wouldn't roll its own encoding handling. It would parse already-decoded CSV records. The input to the finite-state machine would be Unicode code points, not bytes. (More ambitiously, the input to the FSM might be any arbitrary user-perceived character, or extended grapheme cluster.)
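A rough sketch of what a finite-state machine over code points could look like. parse_csv() is a hypothetical helper written for illustration, not taken from any CPAN module. It assumes the text is already decoded, that the separator and the quote are one code point each, that the escape is a doubled quote, and that records end with "\n". It is lenient about malformed input rather than reporting errors.

```perl
use strict;
use warnings;

# Hypothetical helper: parse already-decoded CSV text one code point at a time.
sub parse_csv {
    my ($text, $sep, $quote) = @_;
    my (@records, @fields);
    my $field = '';
    my $state = 'start';    # start | unquoted | quoted | quote_seen

    my $end_field  = sub { push @fields, $field; $field = ''; $state = 'start' };
    my $end_record = sub { $end_field->(); push @records, [@fields]; @fields = () };

    for my $ch (split //, $text) {
        if ($state eq 'quoted') {
            # Inside quotes everything is literal except the quote itself.
            if ($ch eq $quote) { $state = 'quote_seen' }
            else               { $field .= $ch }
        }
        elsif ($state eq 'quote_seen' && $ch eq $quote) {
            # Doubled quote inside a quoted field: one literal quote.
            $field .= $quote;
            $state = 'quoted';
        }
        elsif ($ch eq $sep)  { $end_field->() }
        elsif ($ch eq "\n")  { $end_record->() }
        elsif ($state eq 'start' && $ch eq $quote) { $state = 'quoted' }
        else                 { $field .= $ch; $state = 'unquoted' }
    }

    # Flush a final record when the input lacks a trailing newline.
    $end_record->() if $state ne 'start' || @fields;
    return \@records;
}

my ($sep, $quote) = ("\x{1F3AC}", "\x{1F3A5}");
my $text = "${quote}Argo${quote}${sep}2012${sep}3${sep}7${sep}"
         . "${quote}${quote}${quote} Ben Affleck${quote}\n"
         . "${quote}a\nb${quote}${sep}c\n";

my $rows = parse_csv($text, $sep, $quote);
printf "%d records, %d fields in the first\n",
    scalar @$rows, scalar @{ $rows->[0] };
```

Because the loop compares whole characters with eq, nothing in it depends on how many bytes the separator or quote occupy in any encoding. Decoding and encoding stay outside the parser, for example in an :encoding(UTF-8) layer on the file handle.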

      Why not?

        I was never considering single-byte anything. Writing code in Perl means that I don't have to (unlike writing code in XS). Yes, I actually meant what I said. Yes, I realized that your example was using multi-byte single-character tokens.

        The reason that single-character vs. multi-character (usually) leads to different approaches is because [^"\\]+ as part of a regex works fine for those single-character quote and escape values (respectively) but isn't even close to what you have to do if either of those is multi-character.
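A small sketch of that difference, assuming decoded strings. The "||" separator and the variable names are made up for illustration. A negated character class excludes individual characters, so it works for a multi-byte single character but stops too early when the token is two characters long; one common alternative for the multi-character case is a negative lookahead.

```perl
use strict;
use warnings;

# Single-character tokens (each multi-byte in UTF-8): a negated
# character class is enough on a decoded string.
my ($quote, $sep) = ("\x{1F3A5}", "\x{1F3AC}");
my ($quote_re, $sep_re) = map { quotemeta } $quote, $sep;
my ($single) = "Argo${quote}rest" =~ /^([^${quote_re}${sep_re}]+)/;

# Multi-character separator "||": the character class excludes any lone "|",
# so it stops at the first "|" even though that is not a separator.
my $multi_sep = '||';
my $multi_re  = quotemeta $multi_sep;
my ($naive)   = 'a|b||c' =~ /^([^${multi_re}]+)/;

# A negative lookahead consumes characters until the full token appears.
my ($correct) = 'a|b||c' =~ /^((?:(?!${multi_re}).)+)/s;

print "single ok\n" if $single eq 'Argo';
print "naive='$naive' correct='$correct'\n";
# prints "naive='a' correct='a|b'"
```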

        And you are quite wrong about:

        One glance at the source code and it's obvious the author doesn't mean single character; he means single byte.

        For one, the author of Text::xSV didn't have to think about multi-byte characters. The module is written in Perl so, unless it does something moderately strange or stupid, multi-byte characters "just work" (provided the user of the module does the little bit of extra work to ensure that Perl has decoded, or will properly decode, the strings/streams being given to the module).

        Looking at the code for Text::xSV in some detail, I see that 90% of the uses of the separator character would work completely fine with a separator that is even composed of more than one multi-byte character. There is one important place where the code would break for a multi-character separator (but that, indeed, continues to work for a separator that is a single multi-byte character):

        my $start_field_ms = qr/\G([^"$q_sep]*)/;

        Now, fixing the unfortunate hard-coding of the quote character is probably quite a simple task. And that would probably be sufficient to make the module work fine on multi-byte quote characters. Certainly much easier than trying to get multi-byte character support into a much more complex XS module.

        Why not?

        Because you haven't done the tiny bit of work to fix Text::xSV? Or the small amount of work to write a simple CSV parser in Perl?

        No matter. I'm almost done writing my new CSV module.

        - tye        

Re^7: Speeds vs functionality
by Tux (Abbot) on Jul 31, 2014 at 06:37 UTC

    Being able to parse that example (or similar CSV data) is exactly why I started implementing multi-byte separator characters.

    As said elsewhere in this thread, if I am happy with the result, I'll try to also implement quotation and escapes the same way, with quotation a much higher priority than escapes. Current state in development:

    $ cat films.csv
    🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
    🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
    🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
    🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
    🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
    🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
    🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
    🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen
    🎥🎥 Ethan Coen🎥
    🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥
    $ head -1 films.csv | dump
    DUMP 0.6.01
    
    00000000  F0 9F 8E A5 46 69 6C 6D  F0 9F 8E A5 F0 9F 8E AC    ....Film........
    00000010  F0 9F 8E A5 59 65 61 72  F0 9F 8E A5 F0 9F 8E AC    ....Year........
    00000020  F0 9F 8E A5 41 77 61 72  64 73 F0 9F 8E A5 F0 9F    ....Awards......
    00000030  8E AC F0 9F 8E A5 4E 6F  6D 69 6E 61 74 69 6F 6E    ......Nomination
    00000040  73 F0 9F 8E A5 F0 9F 8E  AC F0 9F 8E A5 44 69 72    s............Dir
    00000050  65 63 74 6F 72 F0 9F 8E  A5 0A                      ector.....
    
    $ perl -C3 -MCSV -E'csv (out => *STDOUT, in => csv (in => "films.csv", sep => "\N{CLAPPER BOARD}"))'
    "🎥Film🎥","🎥Year🎥","🎥Awards🎥","🎥Nominations🎥","🎥Director🎥"
    "🎥12 Years a Slave🎥",2013,3,9,"🎥🎥🎥 Steve McQueen🎥"
    "🎥Argo🎥",2012,3,7,"🎥🎥🎥 Ben Affleck🎥"
    "🎥The Artist🎥",2012,5,10,"🎥🎥🎥 Michel Hazanavicius🎥"
    "🎥The King's Speech🎥",2010,4,12,"🎥🎥🎥 Tom Hooper🎥"
    "🎥The Hurt Locker🎥",2009,6,9,"🎥🎥🎥 Kathryn Bigelow🎥"
    "🎥Slumdog Millionaire🎥",2008,8,10,"🎥🎥🎥 Danny Boyle🎥"
    "🎥No Country for Old Men🎥",2007,4,8,"🎥🎥🎥 Joel Coen"
    "🎥🎥 Ethan Coen🎥"
    "🎥The Departed🎥",2006,4,5,"🎥🎥🎥 Martin Scorsese🎥"
    $
    

    Enjoy, Have FUN! H.Merijn
