Re: Speeds vs functionality

by Tux (Canon)
on Jul 29, 2014 at 10:42 UTC [id://1095479]


in reply to Speeds vs functionality

OK, let me be more specific.

The module is Text::CSV_XS (surprise), and the new feature is support for a multi-byte sep_char, which includes support for UTF-8 separator characters.

The fact that the separator can now be a multi-byte character instead of a single byte has a huge impact. The first check on every byte in a CSV stream is the check against the separation character, so every extra test on that byte is executed for every single byte in the stream. This is still the fastest way: making that check conditional on the state of the stream would just cause another test (or more) to be executed instead.

The performance drop for the fastest stream test I run measures between 5 and 10%, for all versions of perl I tested with.

At this moment, I think it is worth it, but I am still in doubt.

$ perl -MCSV -C3 -E'csv (out => *STDOUT, in => [[ 1, 2 ]], sep => "\x{060c}")'
1،2
$ perl -MCSV -C3 -E'csv (out => *STDOUT, in => [[ 1, 2 ]], sep => "\N{FULLWIDTH COMMA}")'
1,2
$ perl -MCSV -C3 -E'csv (out => *STDOUT, in => [[ 1, 2 ]], sep => "\N{FULLWIDTH COMMA}")' | \
  perl -MCSV -E'DDumper (csv (in => *STDIN, sep => "\x{ff0c}"))'
[
    [   1,
        2
        ]
    ]
$

Enjoy, Have FUN! H.Merijn

Replies are listed 'Best First'.
Re^2: Speeds vs functionality
by salva (Canon) on Jul 29, 2014 at 11:26 UTC
    I have just looked over the code (this one, right?) and it seems to me that a better approach can be used to check for separators.

Currently you check every character against both possibilities (single- or multi-byte separator):

    if (c == csv->sep_char || is_SEPX (c)) {

    A better way would be to consider the multi-byte separator as a single-byte separator plus a tail:

    /* somewhere in the object constructor */
    csv->sep_tail_len = sep_len - 1;
    csv->sep_tail     = sep + 1;
    csv->sep_char     = *sep;
    ...
    /* then, in the parser */
    if (c == csv->sep_char) {
        if (!csv->sep_tail_len ||
            ((csv->size - csv->used >= csv->sep_tail_len) &&
             !memcmp (csv->bptr + csv->used, csv->sep_tail, csv->sep_tail_len))) {
            /* you have a separator! */
    I think that would minimize the impact of supporting multi-byte separators on the common single-byte separator case.
Re^2: Speeds vs functionality
by BrowserUk (Patriarch) on Jul 29, 2014 at 12:52 UTC
    The first check on every byte in a CSV stream is the check on the separation character. Every extra test on that byte will cause that extra test to be executed for every single byte in the stream.

    Is it really so difficult to lift the single/multi-byte test out of the loop?

    Even if it means that everything inside the loop is duplicated, that needn't imply a maintenance problem.

    You could, for example, make the body of the (now two) loops an inlined function. Inline functions have been part of the C standard for 15 years (since C99), and gcc had them long before that.

    If you really feel the need to support compilers that don't, you could always substitute (another) of those awful multiline macros.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      For speed it is just one single loop. The test for the separation character occurs, besides the check on every next byte, 5 extra times when looking ahead, e.g. after an escape character or a quotation character. Splitting the test out of the loop is currently difficult.

      The code is littered with multi-line macros, and I do not think they are awful at all. They also work on all old compilers, and as I am the maintainer, no one else will have to see them. When digging through perl5 core code, one gets used to multi-line macros. It doesn't bother me.

      I will have another look at the approach salva suggested and see if I can improve speed there. As I also have $paid work, it will not be finished this week though.

      FWIW, all feedback here is warmly welcomed and appreciated, even if I might not agree with some of it.


      Enjoy, Have FUN! H.Merijn
        For speed it is just one single loop.

        My point was that by duplicating that loop you can have the single byte case in one, and the multibyte case in the other and decide which loop to enter, thus neither case carries the burden of the repeated, single/multi bytes tests within the loop, and both cases benefit.

        The inline functions (my preference) or multiline macros (yours?) discussion was simply a way to mitigate some or all of the copy&paste code duplication.


Re^2: Speeds vs functionality
by duelafn (Parson) on Jul 29, 2014 at 11:25 UTC

    For what it's worth, I feel it is worth it.

    I am a Text::CSV_XS user; I use it in large-data (but not performance-critical) situations. I doubt I would ever need this functionality, but I consider the option to be worth the (reasonably small) performance penalty.

    Good Day,
        Dean

Re^2: Speeds vs functionality
by Jim (Curate) on Jul 30, 2014 at 06:58 UTC

      If I am satisfied with sep, quote_char (and maybe even escape_char) will be the next to deal with.

      Even with the rewrite as suggested by salva (more or less), I still see a slowdown of close to 10%.


      Enjoy, Have FUN! H.Merijn
        How are you benchmarking the code? I would like to try some ideas myself.
Re^2: Speeds vs functionality
by oiskuu (Hermit) on Jul 30, 2014 at 08:52 UTC

    Looking at cx_Parse + stuff, some thoughts and questions arise:

    • Is it necessary to fiddle with the cache? Just undef _cache on the perl side to enforce parameter changes; this ought to be a rare event. Stashing an opaque C struct avoids needless copying.
    • Then what about Unicode whitespace characters?
    • quote_char, escape_char, etc., could be ints and default to -1 when undefined. Easier to test against. However ...
    • Have you tried writing this thing as an fsm?
    • A sketch:

    enum { TXT, BIN, WSPACE, UTF8, QUOT, ESC, SEP, CR, NL_EOLX, ..., NN };
    enum { EXFIELD, INFIELD = 1*NN, INQUOT = 2*NN, CRSEEN = 3*NN } state;

    while ((c = getc ()) != EOF) {
        int ctype = cached->xlat[c];
        if () ...                /* perhaps peel the most likely case(s) */
        switch (state + ctype) {
        case WSPACE:  continue;  /* nop: allow_whitespace test in xlat[] */
        case BIN:     error ();  /* again, resolved when constructing xlat[] */
        case TXT:     state = INFIELD; putc (c); continue;
        case INFIELD+TXT:
        case INQUOT+TXT:
        case INQUOT+SEP:
            ... putc (c); ...
        case UTF8:
        case INFIELD+UTF8:
            ...accumulate/xlat...
        case CRSEEN+NL_EOLX: ...; state = 0; continue;
        case CRSEEN+...: error ();
        default: error ();
        }
        ...
      Or possibly:
      enum { EXFIELD, INFIELD = 0x100, INQUOT = 0x200, CRS = 0x300 } state;
      ...
      int action = cached->xlat[state + c];
      decode (action);
      ...

    Ultimately, the (handful of) UTF-8 sequences may also be resolved by walking trie-like state tables.

      The cache, as currently implemented, was added to achieve a speed boost of (IIRC) about 25%. It is needed to reduce access to the object (the $self hash), as those lookups are very, very expensive.

      Unicode whitespace is not important for this parser, as it is not a special character unless it is the separator, the quotation, or the escape character. Unicode whitespace will just end up being binary.

      XS is not PP :) Those characters could indeed be ints, but that would probably mean that the whole parser (written in 1998 and modified/extended over time) would have to be rewritten. It /might/ be worth the effort in the end, but I do not have the time to start that experiment.

      I never tried an FSM (unless the current state machine already is one). I simplified the parser as I got it when I took over maintenance. Over time a lot of bugs were fixed and new (required and requested) features were added.

      update: added remark about FSM


      Enjoy, Have FUN! H.Merijn
        Is there any reason stopping you from keeping the parser state in a persistent C struct?

        Correct me if I am wrong: currently, the state is kept exclusively on the Perl side, and the cache is an (ugly) hack to be able to regenerate the C struct faster.

        Why not just store the state on the C side and keep it as a pointer inside an IV? That's what most XS libs do, and I am sure it would improve the parser speed a lot and at the same time simplify the code!

        Are the module users allowed to modify the object hash directly?

        I believe Modern Perl should have a core module that can easily parse these simple Unicode CSV records. It should handle them in any character encoding scheme of Unicode:  UTF-8, UTF-16, or UTF-32. And it should handle the Unicode byte order mark seamlessly.

        Why not?

        🎥Film🎥🎬🎥Year🎥🎬🎥Awards🎥🎬🎥Nominations🎥🎬🎥Director🎥
        🎥12 Years a Slave🎥🎬2013🎬3🎬9🎬🎥🎥🎥 Steve McQueen🎥
        🎥Argo🎥🎬2012🎬3🎬7🎬🎥🎥🎥 Ben Affleck🎥
        🎥The Artist🎥🎬2012🎬5🎬10🎬🎥🎥🎥 Michel Hazanavicius🎥
        🎥The King's Speech🎥🎬2010🎬4🎬12🎬🎥🎥🎥 Tom Hooper🎥
        🎥The Hurt Locker🎥🎬2009🎬6🎬9🎬🎥🎥🎥 Kathryn Bigelow🎥
        🎥Slumdog Millionaire🎥🎬2008🎬8🎬10🎬🎥🎥🎥 Danny Boyle🎥
        🎥No Country for Old Men🎥🎬2007🎬4🎬8🎬🎥🎥🎥 Joel Coen
        🎥🎥 Ethan Coen🎥
        🎥The Departed🎥🎬2006🎬4🎬5🎬🎥🎥🎥 Martin Scorsese🎥
        

        sep_char	🎬	U+1F3AC CLAPPER BOARD (UTF-8: F0 9F 8E AC)
        quote_char	🎥	U+1F3A5 MOVIE CAMERA  (UTF-8: F0 9F 8E A5)
        escape_char	🎥	U+1F3A5 MOVIE CAMERA  (UTF-8: F0 9F 8E A5)
        
        "Film","Year","Awards","Nominations","Director"
        "12 Years a Slave",2013,3,9,"🎥 Steve McQueen"
        "Argo",2012,3,7,"🎥 Ben Affleck"
        "The Artist",2012,5,10,"🎥 Michel Hazanavicius"
        "The King's Speech",2010,4,12,"🎥 Tom Hooper"
        "The Hurt Locker",2009,6,9,"🎥 Kathryn Bigelow"
        "Slumdog Millionaire",2008,8,10,"🎥 Danny Boyle"
        "No Country for Old Men",2007,4,8,"🎥 Joel Coen
        🎥 Ethan Coen"
        "The Departed",2006,4,5,"🎥 Martin Scorsese"
        

        I recognize that the current XS core module for parsing CSV records, Text::CSV_XS (marvelously maintained by Tux), may not be the right module to use as the basis for a new, fully Unicode-capable module. But because Perl's native Unicode capabilities exceed those of most other programming languages, Perl should have a proper FSM-based Unicode CSV parser, even if it's pure Perl and not XS.

        I long ago accepted that Unicode conformance and comparative slowness go hand in hand 👫. So what? Look what you're trading a few seconds here and there for:  the technological foundation of World Peace ☮ and Universal Love 💕.

        UPDATE:  Removed references to core module. I don't care about that. I just want a Unicode-capable Perl CSV module.
