Producing a list of offsets efficiently

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Producing a list of offsets efficiently by tlm (Prior) on May 28, 2005 at 20:31 UTC
`index` beats everything else I have tried, including various regexps. The fastest I have found is `my @o; my $o = -1; push @o, $o while ($o = index($s, 'a', $o+1)) > -1;` [download] Read more... (1371 Bytes) the lowliest monk	[reply] [d/l] [select]
Re: Producing a list of offsets efficiently by dws (Chancellor) on May 28, 2005 at 20:10 UTC
`my $string = "aXbcXdefgXhijXklmnopqrXstuvwXyz"; my $target = 'X'; my @offsets; while ( $string =~ /$target/g ) { push @offsets, pos($string) - length($target); } print "@offsets\n";` [download] Does the trick for me, and it scales to arbitrary substrings. (Update: hoisting `length($target)` out of the loop is a no-brainer. Alas, I had no brain at the time.)	[reply] [d/l] [select]
Re^2: Producing a list of offsets efficiently by nobull (Friar) on May 28, 2005 at 20:20 UTC
`pos($string) - length($target)` can be replaced with `$-[0]`.	[reply] [d/l] [select]
Re^2: Producing a list of offsets efficiently by BrowserUk (Patriarch) on May 28, 2005 at 20:12 UTC
Problem: If the string is very large with lots of occurances of 'X', your `@offsets` array is being constantly resized and copied as you add new values. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l] [select]
Re^3: Producing a list of offsets efficiently by tlm (Prior) on May 28, 2005 at 21:27 UTC
FWIW, `push` seems to be faster than direct indexing on a pre-grown array, as the following table shows. The only difference between the `windex` and `windex_2` alternatives in the following table is that the former uses `push` and the latter uses direct indexing on a pre-grown array. The size of the string is 100_000, containing about 3800 matches. (Full code within the readmore tags.) `Rate wregex windex_2 windex wregex 212/s -- -24% -32% windex_2 277/s 31% -- -11% windex 310/s 47% 12% --` [download] Read more... (1113 Bytes) the lowliest monk	[reply] [d/l] [select]
Re^3: Producing a list of offsets efficiently by dws (Chancellor) on May 28, 2005 at 20:14 UTC
In that case, I'd probably code up an XS module that makes two passes over the target string, allocating the result vector once after the first pass, and filling it in on the second.	[reply]
Re^3: Producing a list of offsets efficiently by tilly (Archbishop) on May 29, 2005 at 18:32 UTC
The resize and copies have an amortized constant cost per array element added. Put another way, pushing one element at a time averages out to a O(1) operation.	[reply]
Re^4: Producing a list of offsets efficiently by BrowserUk (Patriarch) on May 29, 2005 at 21:38 UTC
Re^5: Producing a list of offsets efficiently by tilly (Archbishop) on May 30, 2005 at 05:20 UTC
Some notes below your chosen depth have not been shown here
Re: Producing a list of offsets efficiently by Forsaken (Friar) on May 28, 2005 at 23:02 UTC
`use strict; use warnings; my $string = 'insertreallylongstringwithlotsandlotsoffunkycharactersan +difonefeelsreallyspecialperhapsevenaspacesomewhere"; my $char_to_be_matched = 'X'; my @matches; my $offset = 0; foreach my $char (split(//, $string)) { if($char eq $char_to_be_matched) #alternatively some regex magic { push(@matches, $offset); } $offset++; }` [download] I'm pretty certain there's one or more things that I'm not taking into account in this example, but you get the general idea ;-) Remember rule one...	[reply] [d/l]
Re: Producing a list of offsets efficiently by injunjoel (Priest) on May 28, 2005 at 21:15 UTC
Greetings all not sure how efficient this is but I thought I would pay homage to TimToady. `#!/usr/bin/perl -w use strict; my $string = "aXbcXdefgXhijXklmnopqrXstuvwXyz"; my $target = "X"; my $count = 0; my @indices = map{$count++;/$target/? $count-1:();}split //,$string; print "@indices"; exit` [download] -InjunJoel "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo	[reply] [d/l]
Re^2: Producing a list of offsets efficiently by tlm (Prior) on May 28, 2005 at 22:00 UTC
not sure how efficient this is... Not very much at all :) . It corresponds to `wmap` in the table below: `Rate wmap count wgrep wregex windex wmap 5.42/s -- -12% -60% -97% -98% count 6.13/s 13% -- -54% -97% -98% wgrep 13.5/s 148% 120% -- -93% -96% wregex 198/s 3558% 3132% 1372% -- -37% windex 317/s 5741% 5062% 2251% 60% --` [download] Read more... (1131 Bytes) the lowliest monk	[reply] [d/l] [select]
Re^3: Producing a list of offsets efficiently by japhy (Canon) on May 29, 2005 at 17:11 UTC
Of course wmap() and count() are going to be the slowest, because they go through the string twice: once to split it into individual elements, and again to compare each element to the desired one. The index() approach is very certainly the fastest. I wrote a function called aindex() a long time ago that did just that -- used index to step through a string and return all the indices of a substring in that string. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply]
Re^4: Producing a list of offsets efficiently by tlm (Prior) on May 29, 2005 at 17:57 UTC
Re: Producing a list of offsets efficiently by sh1tn (Priest) on May 29, 2005 at 00:15 UTC
Regular expressions.	[reply]
Re^2: Producing a list of offsets efficiently by BrowserUk (Patriarch) on May 29, 2005 at 00:39 UTC
How? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l]


We don't bite newbies here... much
	PerlMonks