Hello, fellow seekers of enlightenment,

I'm trying to construct a simple regex that checks whether a variable contains only characters that are valid in a Unix path. The regex works as it should when there are no umlauts in the string, but while testing different inputs I noticed that it refuses to match any umlauts. What bugs me is that it does match the exact same string when I assign it as a literal in the script, but not when the string is handed down by $ENV{'PATH_TRANSLATED'}, which is probably a non-encoded 8-bit string. A shortened example:

$testString = "/usr/home/december/public_html/experiments/html/files/blëh.txt";
$fileAsked = $ENV{'PATH_TRANSLATED'};

print "Trying with: $testString\n";
print "Trying with: $fileAsked\n";

print "VALID1\n" if ($testString =~ /^([\w\s\/.]+)$/);
print "VALID2\n" if ($fileAsked =~ /^([\w\s\/.]+)$/);

print "SUCCEEDED1\n" if (utf8::upgrade($testString));
print "SUCCEEDED2\n" if (utf8::upgrade($fileAsked));

print "VALID3\n" if ($testString =~ /^([\w\s\/.]+)$/);
print "VALID4\n" if ($fileAsked =~ /^([\w\s\/.]+)$/);

prints:

Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
SUCCEEDED1
SUCCEEDED2
VALID3

Note that the two strings are exactly the same, and so are the regexes, yet after the conversion one matches and the other doesn't. I suspect a UTF-8 problem, or that the wrong character set is being used for \w. The Perl version is 5.8.3.

How do I make \w match umlauts consistently? Do I need to set a locale even for UTF-8? This behavior doesn't seem logical to me.

In reply to problems matching umlauts in env vars by december
