performance enhancement

by alandev (Scribe)
on Jul 19, 2006 at 13:41 UTC ( [id://562280] : perlquestion )

alandev has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: performance enhancement
by Tanktalus (Canon) on Jul 19, 2006 at 13:49 UTC

    You'll have to give us more to work with. "StripLTSpace" isn't part of the perl core that I'm aware of, so, please tell us what it contains.

    Also, please show how it is the bottleneck in your program - if you're not calling it very much, then speeding it up won't make much of a difference. If it's coupled with I/O of some sort (e.g., reading from a website or even a local hard disk), the overhead of stripping the string is going to be completely dwarfed by the overhead of reading the file anyway.

      The StripLTSpace function is in String::Strip.

        It looks like the guts of String::Strip are written in C, and the algorithm used doesn't look particularly inefficient so I guess you would struggle to get it much quicker.


Re: performance enhancement
by ysth (Canon) on Jul 19, 2006 at 21:38 UTC
    The "normal" way to do this is with two substitutions: $str =~ s/^\s+//; $str =~ s/\s+\z//;. Is that not fast enough? I see a few problems with String::Strip:
    • Doesn't handle unicode whitespace when given utf8 input.
    • Truncates strings that contain null characters. (Also, will violate bounds if fed strings that perl wasn't able to put a "safety" null terminator on.)
    • When stripping leading spaces, ends up copying the whole string - something that the substitutions optimize away - which will be a disadvantage for large strings.
    • When copying the string, relies on overlapping strcpy working - something about which the C standard says "the behavior is undefined."
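For reference, the two-substitution approach can be wrapped up as below. This is only a minimal sketch: `trim` is a hypothetical helper name, not something from String::Strip or the Perl core. It also shows that a string with an embedded null character comes through intact, since Perl strings carry their own length.

```perl
use strict;
use warnings;

# trim() is a hypothetical helper name, not part of String::Strip or the core.
sub trim {
    my ($str) = @_;
    $str =~ s/^\s+//;     # drop leading whitespace
    $str =~ s/\s+\z//;    # drop trailing whitespace, anchored at the true end
    return $str;
}

# An embedded NUL is just another character to the regex engine.
my $with_nul = trim("  a\0b  ");
printf "%d characters\n", length $with_nul;    # prints "3 characters"
```

The function works on a copy, so the caller's string is left untouched; substitute in place on `$str` directly if the copy matters for very large strings.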

      The "normal" way to do this is with two substitutions

      I've often pondered an optimisation of $s =~ s/^\s+|\s+$//g so that this is no longer true. So far it's been over my head, in the sense that implementing it would take too much research time compared to other useful tasks I can do, but maybe one day...

      And for people wondering why this isn't the recommended way: it's because this pattern will be tried at every position in the string. The regex engine isn't currently smart enough to optimise this into trying the pattern only twice.
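A small sketch of the single-substitution form, for comparison. The return value of `s///g` is the number of substitutions made, so it shows that only two matches are ever needed, however many positions the engine may attempt along the way. The sample string is made up for illustration.

```perl
use strict;
use warnings;

my $s = "  a b  ";    # made-up sample with an interior space

# s///g returns the number of substitutions it performed.
my $count = ($s =~ s/^\s+|\s+$//g);

printf "'%s' after %d substitutions\n", $s, $count;
# prints "'a b' after 2 substitutions"
```

The interior space is left alone: `\s+` can match it, but `$` then fails, so that attempt is simply wasted work.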


        Why s/^\s+|\s*$//g rather than s/^\s+|\s+$//g, s/^\s*|\s*$//g or s/^\s*|\s+$//g?

        A benchmark suggests the two-substitution approach is faster than any of the single-substitution approaches, and that there are interesting variations between the different single-substitution options:

                    Rate starstar plusstar plusplus starplus twosub
        starstar 47.0/s       --      -8%     -25%     -28%   -42%
        plusstar 51.2/s       9%       --     -18%     -21%   -37%
        plusplus 62.5/s      33%      22%       --      -4%   -23%
        starplus 65.1/s      39%      27%       4%       --   -20%
        twosub   81.6/s      74%      59%      31%      25%     --

        The benchmark uses a single large string (100_000 characters) with a fairly large run of spaces (1000) at the start and end.
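The benchmark code itself isn't shown in the post, so the following is only a guess at how such a comparison might be set up with the core Benchmark module. Two things are assumptions: the middle of the string is filled with a non-space character (the post doesn't say what it contained), and the names are mapped to patterns by reading "plus"/"star" as the leading and trailing quantifiers in that order.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: 1000 spaces, 100_000 non-space characters, 1000 spaces.
my $input = (' ' x 1000) . ('x' x 100_000) . (' ' x 1000);

# Each trimmer works on a copy, so every run sees the same untrimmed input.
# The name-to-pattern mapping is an assumption (leading quantifier first).
my %trimmers = (
    twosub   => sub { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; $s },
    plusplus => sub { my $s = shift; $s =~ s/^\s+|\s+$//g; $s },
    plusstar => sub { my $s = shift; $s =~ s/^\s+|\s*$//g; $s },
    starplus => sub { my $s = shift; $s =~ s/^\s*|\s+$//g; $s },
    starstar => sub { my $s = shift; $s =~ s/^\s*|\s*$//g; $s },
);

# Wrap each trimmer so Benchmark can call it without arguments.
my %bench = map {
    my $code = $trimmers{$_};
    ($_ => sub { $code->($input) });
} keys %trimmers;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, \%bench);
```

Absolute rates will differ by machine and perl version, and the ordering may too, since the regex optimiser has changed since 2006.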

        DWIM is Perl's answer to Gödel
Re: performance enhancement
by marto (Cardinal) on Jul 19, 2006 at 14:41 UTC