Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Producing a list of offsets efficiently

by BrowserUk (Patriarch)
on May 28, 2005 at 20:02 UTC ( [id://461386]=perlquestion: print w/replies, xml ) Need Help??

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given a string of arbitrary length, what efficient ways are there of producing a list of the offsets at which a given char occurs within that string?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
  • Comment on Producing a list of offsets efficiently

Replies are listed 'Best First'.
Re: Producing a list of offsets efficiently
by tlm (Prior) on May 28, 2005 at 20:31 UTC

    index beats everything else I have tried, including various regexps. The fastest I have found is

    my @o; my $o = -1; push @o, $o while ($o = index($s, 'a', $o+1)) > -1;

    the lowliest monk

Re: Producing a list of offsets efficiently
by dws (Chancellor) on May 28, 2005 at 20:10 UTC

    my $string = "aXbcXdefgXhijXklmnopqrXstuvwXyz"; my $target = 'X'; my @offsets; while ( $string =~ /$target/g ) { push @offsets, pos($string) - length($target); } print "@offsets\n";

    Does the trick for me, and it scales to arbitrary substrings.


    (Update: hoisting length($target) out of the loop is a no-brainer. Alas, I had no brain at the time.)

      pos($string) - length($target) can be replaced with $-[0].

      Problem: If the string is very large with lots of occurances of 'X', your @offsets array is being constantly resized and copied as you add new values.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

        FWIW, push seems to be faster than direct indexing on a pre-grown array, as the following table shows.

        The only difference between the windex and windex_2 alternatives in the following table is that the former uses push and the latter uses direct indexing on a pre-grown array. The size of the string is 100_000, containing about 3800 matches. (Full code within the readmore tags.)

        Rate wregex windex_2 windex wregex 212/s -- -24% -32% windex_2 277/s 31% -- -11% windex 310/s 47% 12% --

        the lowliest monk

        In that case, I'd probably code up an XS module that makes two passes over the target string, allocating the result vector once after the first pass, and filling it in on the second.

        The resize and copies have an amortized constant cost per array element added. Put another way, pushing one element at a time averages out to a O(1) operation.
Re: Producing a list of offsets efficiently
by Forsaken (Friar) on May 28, 2005 at 23:02 UTC
    use strict; use warnings; my $string = 'insertreallylongstringwithlotsandlotsoffunkycharactersan +difonefeelsreallyspecialperhapsevenaspacesomewhere"; my $char_to_be_matched = 'X'; my @matches; my $offset = 0; foreach my $char (split(//, $string)) { if($char eq $char_to_be_matched) #alternatively some regex magic { push(@matches, $offset); } $offset++; }
    I'm pretty certain there's one or more things that I'm not taking into account in this example, but you get the general idea ;-)


    Remember rule one...
Re: Producing a list of offsets efficiently
by injunjoel (Priest) on May 28, 2005 at 21:15 UTC
    Greetings all
    not sure how efficient this is but I thought I would pay homage to TimToady.
    #!/usr/bin/perl -w use strict; my $string = "aXbcXdefgXhijXklmnopqrXstuvwXyz"; my $target = "X"; my $count = 0; my @indices = map{$count++;/$target/? $count-1:();}split //,$string; print "@indices"; exit


    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo

      not sure how efficient this is...

      Not very much at all :) . It corresponds to wmap in the table below:

      Rate wmap count wgrep wregex windex wmap 5.42/s -- -12% -60% -97% -98% count 6.13/s 13% -- -54% -97% -98% wgrep 13.5/s 148% 120% -- -93% -96% wregex 198/s 3558% 3132% 1372% -- -37% windex 317/s 5741% 5062% 2251% 60% --

      the lowliest monk

        Of course wmap() and count() are going to be the slowest, because they go through the string twice: once to split it into individual elements, and again to compare each element to the desired one. The index() approach is very certainly the fastest. I wrote a function called aindex() a long time ago that did just that -- used index to step through a string and return all the indices of a substring in that string.

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Producing a list of offsets efficiently
by sh1tn (Priest) on May 29, 2005 at 00:15 UTC
    Regular expressions.


      How?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://461386]
Approved by Limbic~Region
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2024-04-25 16:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found