perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:
I'm prepared to be embarrassed (as much as one can be), but in my code I often want to search for the first character that is NOT something. For example: what is the index of the first non-space character?
To find the first incidence of a character, we can use
my $start = 1+index $_,q(");
die "No quoted string found" unless $start;
$_ = substr $_,$start;
my $stop = index $_, q(");
$_ = substr $_,0,$stop;
I was noticing in a section of code, a place where I get rid of leading space and have been forced into
s/^\s*//;
which tends to be notably slower than equivalent index/substr code.
So I was wondering how to find the first character that is *not* a given character, like:
my $start = 1+nindex $_," ";
Though certainly at least a set would be useful, like:
my $start = 1+nindex $_,"[\t ]";
as long as it didn't invoke the regex engine.
Has anyone else run into a need for this type of paradigm: finding the first byte
that is NOT the listed byte (or one of the listed bytes)?
Is there a need for a 2nd set of instructions to skip bytes until 'not equal', vs. index's skip bytes until 'equal'?
Thanks!
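For illustration, here is a minimal pure-Perl sketch of what such an `nindex` might look like, built only from index and substr. The name `nindex`, its argument order, and the -1 "not found" return value are assumptions modelled on index; nothing like it is built into Perl. Because the loop runs in Perl rather than in C, it may well be slower than a regex on long runs, so treat it as a statement of the intended semantics rather than a speed-up.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl builtin): return the offset of the first
# character in $str, at or after $pos, that is NOT one of the characters
# in $set. Returns -1 if there is no such character, mirroring index().
sub nindex {
    my ($str, $set, $pos) = @_;
    $pos = 0 if !defined $pos || $pos < 0;
    for my $i ($pos .. length($str) - 1) {
        # index() returns -1 when the current character is not in $set
        return $i if index($set, substr($str, $i, 1)) < 0;
    }
    return -1;
}

my $line  = " \t  hello";
my $start = nindex($line, " \t");
print substr($line, $start), "\n" if $start >= 0;
```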
Re: opposite of index+rindex? How-to? Needed?
by choroba (Cardinal) on Aug 29, 2019 at 22:20 UTC
You can use tr///c to replace all characters other than space and tab with some marker character, and then use index to search for that marker.
#!/usr/bin/perl
use warnings;
use strict;
sub regex_pos {
my ($string) = @_;
$string =~ /[^ \t]/g;
return pos($string) - 1
}
sub tr_pos {
my ($string) = @_;
(my $tr = $string) =~ tr/ \t/!!/c;
return index $tr, '!'
}
for my $s (" \t x ", "\t\t\t\t\t\t \xff...\t ") {
regex_pos($s) == tr_pos($s) or die;
}
use Benchmark qw{ cmpthese };
my $s = ' ' x 200 . "\t" x 200 . "\x01" . " " x 200;
cmpthese(-3, {
regex => sub { regex_pos($s) },
tr => sub { tr_pos($s) },
});
On my machine, it seems about 3 times faster than regex in 5.26.1:
          Rate regex    tr
regex 316530/s    --  -67%
tr    954355/s  202%    --
but only marginally faster in blead perl:
          Rate regex    tr
regex 896519/s    --   -6%
tr    954359/s    6%    --
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
but only marginally faster in blead perl
From 5.30.0 perldelta:
Regular expression pattern matching of things like C<qr/[^I<a>]/> is
significantly sped up, where I<a> is any ASCII character. Other classes
can get this speed up, but which ones is complicated and depends on the
underlying bit patterns of those characters, so differs between ASCII
and EBCDIC platforms, but all case pairs, like C<qr/[Gg]/> are included,
as is C<[^01]>.
Dave.
Re: opposite of index+rindex? How-to? Needed?
by davido (Cardinal) on Aug 30, 2019 at 02:57 UTC
String::Index provides ncindex and ncrindex, which I think provide what you're after.
my $pos = ncindex($string, 'c'); # find first non-c
The needle string ('c' in this case) can be more than one character, in which case it will find the position of the first character in $string that isn't found in the string of search characters.
Re: opposite of index+rindex? How-to? Needed? (updated)
by AnomalousMonk (Archbishop) on Aug 29, 2019 at 22:00 UTC
... what is the index of the first non-space character?
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = '    a  ';
;;
$s =~ m{ \S }xms;
print $-[0];
"
4
... as long as it didn't invoke the regex engine.
What's the point of that?
Update: More generally:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = 'xyzzyBARxkcdFOOyz';
;;
my $vowel = qr{ [AEIOUaeiou] }xms;
;;
print qq{offset of first vowel: }, $s =~ m{ ($vowel) }xms ? $-[1] : 'none';
;;
print qq{offset of last vowel: }, $s =~ m{ .* ($vowel) }xms ? $-[1] : 'none';
;;
print qq{offset of first vowel: }, 'XYZ' =~ m{ ($vowel) }xms ? $-[1] : 'none';
"
offset of first vowel: 6
offset of last vowel: 14
offset of first vowel: none
See @- @+ in perlvar.
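Tying this back to the original leading-whitespace case, the match offset in $-[0] can be fed straight to substr instead of doing a substitution. This is only a sketch: the helper name `ltrim_by_offset` is made up for illustration, and no claim is made here about how its speed compares with s/^\s*//.

```perl
use strict;
use warnings;

# Hypothetical helper: return $str without its leading whitespace, using
# the match offset in $-[0] rather than a substitution.
sub ltrim_by_offset {
    my ($str) = @_;
    # \S matches the first non-whitespace character, if there is one.
    return '' unless $str =~ /\S/;
    # $-[0] is the offset where the last successful match started.
    return substr $str, $-[0];
}

print '[', ltrim_by_offset("   \tabc  "), "]\n";
```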
Give a man a fish: <%-{-{-{-<
And alternatively with look-ahead:
;;
print qq{offset of first vowel: }, $s =~ m{ (?= $vowel ) }xgms ? pos $s : 'none';
;;
print qq{offset of last vowel: }, $s =~ m{ .* (?= $vowel ) }xgms ? pos $s : 'none';
Note 'g' modifier.
To negate, one can use negative look-ahead (?! $vowel ). Update: Not really! See AnomalousMonk's reply.
c:\@Work\Perl\monks>perl -wMstrict -le
"my $vowel = qr{ [AEIOUaeiou] }xms;
;;
my $s = 'xyzzyBARxkcdFOOyz';
;;
print 'offset of first vowel: ', $s =~ m{ (?= $vowel) }xms ? $+[0] : 'none';
print 'offset of last vowel: ', $s =~ m{ .* (?= $vowel) }xms ? $+[0] : 'none';
print 'offset of first vowel: ', 'XYZ' =~ m{ (?= $vowel) }xms ? $+[0] : 'none';
"
offset of first vowel: 6
offset of last vowel: 14
offset of first vowel: none
I like the look-ahead idea because it avoids a capture and so may be slightly faster.
To negate, one can use negative look-ahead (?! $vowel ).
I think there's a problem here:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $vowel = qr{ [AEIOUaeiou] }xms;
;;
my $s = 'aePDQioVWXua';
;;
print 'offset of first non-vowel: ', $s =~ m{ (?! $vowel) }xmsg ? pos($s) : 'none';
print 'offset of last non-vowel: ', $s =~ m{ .* (?! $vowel) }xmsg ? pos($s) : 'none';
;;
$s = 'aei';
print 'offset of first non-vowel: ', $s =~ m{ (?! $vowel) }xmsg ? pos($s) : 'none';
"
offset of first non-vowel: 2
offset of last non-vowel: 12
offset of first non-vowel: 3
The second and third cases give questionable results because there's always a place in a string where a negative look-ahead will succeed if it is true nowhere else: just beyond the end of the string. ($+[0] has the same problem in these cases as pos.)
Update: However, a positive look-ahead to a negated char class works:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $non_vowel = qr{ [^AEIOUaeiou] }xms;
;;
my $s = 'aePDQioVWXua';
;;
print 'offset of first non-vowel: ', $s =~ m{ (?= $non_vowel) }xmsg ? pos($s) : 'none';
print 'offset of last non-vowel: ', $s =~ m{ .* (?= $non_vowel) }xmsg ? pos($s) : 'none';
;;
$s = 'aei';
print 'offset of first non-vowel: ', $s =~ m{ (?= $non_vowel) }xmsg ? pos($s) : 'none';
"
offset of first non-vowel: 2
offset of last non-vowel: 9
offset of first non-vowel: none
The positive assertion requires that something be present, so the overall match can fail. ($+[0] works as well as pos.)
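The negated-class approach can be wrapped into a pair of index/rindex-style helpers. The names `first_not_in` and `last_not_in` and the -1 "not found" convention are assumptions for illustration, not an existing API. The sketch assumes $set is a non-empty string of literal characters; \Q...\E keeps characters such as ] or - from being treated as class syntax.

```perl
use strict;
use warnings;

# Hypothetical helpers: offsets of the first and last character of $str
# that is not in $set, or -1 if every character of $str is in $set.
# Assumes $set is non-empty.
sub first_not_in {
    my ($str, $set) = @_;
    return $str =~ /[^\Q$set\E]/ ? $-[0] : -1;
}

sub last_not_in {
    my ($str, $set) = @_;
    # Greedy .* runs to the end, then backtracks to the last character
    # that the negated class can match; /s lets . cross newlines.
    return $str =~ /.*([^\Q$set\E])/s ? $-[1] : -1;
}

print first_not_in('aePDQioVWXua', 'AEIOUaeiou'), "\n";
print last_not_in('aePDQioVWXua', 'AEIOUaeiou'), "\n";
print first_not_in('aei', 'AEIOUaeiou'), "\n";
```

Because the class must consume a real character, neither helper can "succeed" just past the end of the string the way the negative look-ahead does.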
Re: opposite of index+rindex? How-to? Needed?
by perl-diddler (Chaplain) on Aug 31, 2019 at 00:02 UTC
To answer the question of why I was avoiding the regex engine, and to try some alternatives for space skipping (ones that would work in most perls, not just the latest, though it's great that 5.30 got that boost), I wrote a benchmark program. The first part is the positive case: using index and substr to do the parsing instead of a regex.
Note that for many things 'tr' seems to work a lot like the regex engine, so the more general the case one uses with 'tr', the more it behaves like a regex instead of a map (for example, mapping all characters except the ones you want). Unless you enumerate all the characters so that perl builds a translation table, it might, if you use a 'class' for example, fall back on the regular regex engine and pull in its relative slowness. I say relative, as these examples show:
#!/usr/bin/perl
use strict; use warnings;
# vim=:SetNumberAndWidth
################################################################################
use Benchmark qw(:all);
our @words;
my $count=@ARGV?$ARGV[0]:50000;
my $str=q( <location href="debug/noarch/post-build-checks-debugsource-84.87+git20170929.5b244d1-1.1.noarch.rpm"/>);
# ss1: plain index+substr
sub getss1() {
local $_ = $str;
my $start = 1+index $_, q(");
my $end = index $_, q("), $start;
# substr takes a length, not an end offset
$_ = substr $_, $start, $end - $start;
}
use String::Index qw(cindex ncindex);
# ss2: same as ss1, but with cindex from String::Index
sub getss2() {
local $_ = $str;
my $start = 1+cindex $_, q(");
my $end = cindex $_, q("), $start;
$_ = substr $_, $start, $end - $start;
}
# ss3: split on whitespace first, then index+substr on the second word
sub getss3() {
@words = split q( ),$str;
local $_ = $words[1];
my $start = 1+index $_, q(");
my $end = index $_, q("), $start;
$_ = substr $_, $start, $end - $start;
}
sub getsub1() {
local $_ = $str;
s/^[^"]*"([^"]+)".*$/$1/;
$_;
}
sub getsub2() {
local $_ = $str;
m{^[^"]*"([^"]+)".*$};
$1;
}
cmpthese($count, {
'ss1' => 'getss1',
'ss2' => 'getss2',
'ss3' => 'getss3',
'sub1' => 'getsub1',
'sub2' => 'getsub2',
});
# vim: ts=2 sw=2 ai number
And a run:
> /tmp/bench
          Rate sub1 sub2  ss2  ss3  ss1
sub1  384615/s   -- -38% -38% -54% -69%
sub2  625000/s  62%   --  -0% -25% -50%
ss2   625000/s  62%   0%   -- -25% -50%
ss3   833333/s 117%  33%  33%   -- -33%
ss1  1250000/s 225% 100% 100%  50%   --
ss1 uses plain index+substr to isolate the string and is by far the fastest. Substitution (sub1) is the slowest; a plain regex match (sub2) is about 60% faster than that, but still only half the speed of index+substr.
ss2 uses the cindex routine from String::Index. I'd suspect ncindex would run at about the same speed, which would make it a good choice for a general 'nindex'.
But for the cases I mentioned with space or whitespace, using 'split' with its single-space literal arg incurs the least overhead (apart from not using it at all, as in ss1). For the general case of looking at different fields in my input that are separated by blanks, and for removing leading space, split seems to be the optimal choice for narrowing down the words (I get rid of the '<' after the spaces, then use a hash of the first 4 chars of the tag).
If I needed it faster (though this is not really worth the effort at this point), I could set up constants equivalent to the first 4 characters that map to numbers, then call tag-specific routines based on an array rather than a hash.
So in exploring some of the suggestions here and writing a reply, I think I stumbled onto what I'll use for now, which is 'split'. Its speed, and possibly a related algorithm, has likely been incorporated into 5.30's regex engine for the leading-whitespace case.
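As a rough sketch of that split-based approach: the input line below is a made-up, shortened stand-in for the real data, and the variable names are illustrative only. The relevant behaviour is that split with a single-space string (rather than a regex) skips leading whitespace and splits on runs of whitespace.

```perl
use strict;
use warnings;

# Made-up input resembling the benchmark string (shortened for illustration)
my $line = qq(   <location href="debug/noarch/example.rpm"/>);

# A single-space string as the separator is special-cased by split:
# leading whitespace is skipped and fields are split on whitespace runs.
my @words = split ' ', $line;

# $words[0] is the tag with its leading '<'; drop that one character.
my $tag = substr $words[0], 1;

print "$tag\n";
print scalar(@words), "\n";
```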
Thanks for the hints...