creamygoodness has asked for the wisdom of the Perl Monks concerning the following question:
Greets,
In the Perl core, there's a function called Perl_sv_utf8_upgrade_flags_grow. Many paths ultimately lead to this function; I got there by way of the XS macro SvPVutf8, but the easiest way to invoke it from Perl-space is via utf8::upgrade.
It turns out that this function doesn't play nice with capture variables like $1. After it is invoked on them, they no longer capture properly:
    marvin@smokie:~/perltest $ bleadperl dollar_one_utf8_upgrade.pl
    Without utf8::upgrade...
    a
    b
    c
    With utf8::upgrade...
    a
    a
    a
    Problem persists...
    a
    a
    a
    marvin@smokie:~/perltest $
Here's the test script:
    use strict;
    use warnings;

    my $text = "a b c";

    print "Without utf8::upgrade...\n";
    while ( $text =~ /(\S)/g ) {
        print "$1\n";
    }

    print "With utf8::upgrade...\n";
    while ( $text =~ /(\S)/g ) {
        print "$1\n";
        utf8::upgrade($1);
    }

    print "Problem persists...\n";
    my $more_text = "d e f";
    while ( $more_text =~ /(\S)/g ) {
        print "$1\n";
    }
The problem appears to be that Perl_sv_utf8_upgrade_flags_grow turns on the SVf_POK flag by way of SvPV_force. It doesn't seem to have anything to do with whether the SVf_UTF8 flag is on in either the string being regexed or the capture variable itself.
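To watch the flags from Perl-space rather than from a debugger, something like the following sketch could help. It uses the core B module to read an SV's flag word; the helper names sv_flags and show are made up for this example, and what gets printed for $1 will depend on the perl being run, so no output is claimed here. The utf8::upgrade($1) call is wrapped in eval purely as a precaution.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report whether the public string flag (SVf_POK)
# and the UTF-8 flag (SVf_UTF8) are set on the scalar behind $ref.
sub sv_flags {
    my ($ref) = @_;
    my $flags = B::svref_2object($ref)->FLAGS;
    return {
        POK  => ( $flags & B::SVf_POK() )  ? 1 : 0,
        UTF8 => ( $flags & B::SVf_UTF8() ) ? 1 : 0,
    };
}

# Hypothetical helper: print one labelled line of flag state.
sub show {
    my ( $label, $ref ) = @_;
    my $f = sv_flags($ref);
    printf "%-20s POK=%d UTF8=%d\n", $label, $f->{POK}, $f->{UTF8};
}

my $plain = "abc";
show( 'plain string', \$plain );

if ( "a b c" =~ /(\S)/ ) {
    show( '$1 before upgrade', \$1 );
    # Guarded with eval in case a given perl objects to upgrading $1.
    eval { utf8::upgrade($1); 1 } or warn "utf8::upgrade(\$1) failed: $@";
    show( '$1 after upgrade', \$1 );
}
```

Comparing the "before" and "after" lines for $1 on an affected perl should show whether SVf_POK is the flag that changes.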
Applying the following patch to sv.c in blead appears to kill the bug at the source. All of Perl's test cases still pass after it is applied.
    marvin@smokie:~/projects/perl-git $ git diff
    diff --git a/sv.c b/sv.c
    index a53669a..280f064 100644
    --- a/sv.c
    +++ b/sv.c
    @@ -3231,6 +3231,8 @@ Perl_sv_utf8_upgrade_flags_grow(pTHX_ register SV *const sv, const I32 flags,
                 if (extra) SvGROW(sv, SvCUR(sv) + extra);
                 return len;
             }
    +    } else if (SvGMAGICAL(sv) && mg_find(sv, PERL_MAGIC_sv)) {
    +        ;
         } else {
             (void) SvPV_force(sv,len);
         }
I'd like to solve this issue and supply a working patch via perlbug, so I can say that I solved a UTF-8 bug in the Perl core. :) However, I'm not sure that this patch is legit, because I don't understand exactly what PERL_MAGIC_sv is all about.
I think what's going on is that $1 and friends are magical variables that should never have the SVf_POK flag on, since that indicates that they contain real strings. The regex engine probably doesn't use the standard string assignment interface and goes through the magic interface instead, hence its work is no longer visible once the standard channel is open. But is it safe to have the SVf_POK flag off for the remainder of Perl_sv_utf8_upgrade_flags_grow? SvPV_force was probably called for a reason, after all.
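Until the core question is settled, one way to sidestep the problem from Perl-space is to never hand $1 itself to utf8::upgrade and to upgrade a copy instead. This is only a workaround sketch, not a fix for the underlying bug; the function name upgraded_captures is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: collect each non-whitespace capture as an
# upgraded copy, leaving $1 itself untouched.
sub upgraded_captures {
    my ($text) = @_;
    my @out;
    while ( $text =~ /(\S)/g ) {
        my $copy = $1;           # ordinary scalar, no capture magic attached
        utf8::upgrade($copy);    # upgrade the copy, not $1
        push @out, $copy;
    }
    return @out;
}

print join( ' ', upgraded_captures("a b c") ), "\n";    # prints "a b c"
```

Because $1 is only ever read, the capture variable's flags are left alone and later matches keep working.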
Replies are listed 'Best First'.
Re: utf8::upgrade and $1
by Anonymous Monk on Aug 30, 2009 at 06:58 UTC
by creamygoodness (Curate) on Aug 30, 2009 at 12:45 UTC
by Your Mother (Archbishop) on Aug 30, 2009 at 17:15 UTC
by creamygoodness (Curate) on Aug 30, 2009 at 22:42 UTC
Re: utf8::upgrade and $1
by ikegami (Patriarch) on Aug 31, 2009 at 03:43 UTC
by creamygoodness (Curate) on Aug 31, 2009 at 05:25 UTC
by ikegami (Patriarch) on Aug 31, 2009 at 18:06 UTC