problems matching umlauts in env vars

december has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow seekers of enlightenment,

I'm trying to construct a simple regex that checks whether a variable contains only characters valid in a Unix path. The regex works as it should when there are no umlauts in the string, but while testing different inputs I noticed it refuses to match any umlauts. What bugs me is that it does match the exact same string when I assign it to a variable myself, but not when it is handed down by $ENV{'PATH_TRANSLATED'}, which is probably a non-encoded 8-bit string. A shortened example:

$testString = "/usr/home/december/public_html/experiments/html/files/blëh.txt";
$fileAsked = $ENV{'PATH_TRANSLATED'};

print "Trying with: $testString\n";
print "Trying with: $fileAsked\n";

print "VALID1\n" if ($testString =~ /^([\w\s\/.]+)$/);
print "VALID2\n" if ($fileAsked =~ /^([\w\s\/.]+)$/);

print "SUCCEEDED1\n" if (utf8::upgrade($testString));
print "SUCCEEDED2\n" if (utf8::upgrade($fileAsked));

print "VALID3\n" if ($testString =~ /^([\w\s\/.]+)$/);
print "VALID4\n" if ($fileAsked =~ /^([\w\s\/.]+)$/);

prints:

Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
SUCCEEDED1
SUCCEEDED2
VALID3

Note that both the strings and the regexes are exactly the same, but after conversion one matches and the other doesn't. I suspect a UTF-8 problem, or a wrong charset being used for \w. Perl version is 5.8.3.

How do I make the \w match umlauts consistently? Do I need to set a locale even for utf8? This behavior doesn't seem logical to me.


Replies are listed 'Best First'.
Re: problems matching umlauts in env vars by borisz (Canon) on Jul 23, 2004 at 01:43 UTC
Your problem is, IMHO, that your locale is already UTF-8. This means that your env var is in UTF-8 but your $testString is in Latin-1. If this is the case, you need to `use Encode;` and decode the bytes from your environment into characters: `$fileAsked = Encode::decode(utf8 => $ENV{'aa'});`
Boris
Re^2: problems matching umlauts in env vars by december (Pilgrim) on Jul 23, 2004 at 04:31 UTC
It's not in UTF-8. When I convert it like you told me, the file can't be found anymore (the ë changes to a UTF-8 sequence when I print the variable, and the filesystem doesn't like UTF-8 filenames). Both strings seem to be iso-8859-1, probably just plain 8-bit. When I convert it to UTF-8, though, it's the DOT it doesn't want to match on, not the umlaut.

/^([\w\s.]+)$/    # unescaped
/^([\w\s\.]+)$/   # or escaped

Is there something wrong with the dot in the regex?
Re^3: problems matching umlauts in env vars by borisz (Canon) on Jul 23, 2004 at 08:08 UTC
No, inside `[]` a `.` is the same as `\.`.
Boris
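A quick way to check this claim is to run both character classes against a name that contains a dot and one that contains some other punctuation. This is only a minimal sketch; the variable names and the sample file names are made up for illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Inside a bracketed character class, "." is a literal dot,
# with or without a backslash in front of it.
my $unescaped = qr/^[\w\s.]+$/;
my $escaped   = qr/^[\w\s\.]+$/;

# 'file!txt' would match if "." still meant "any character" inside [].
for my $name ('file.txt', 'file!txt') {
    my $u = $name =~ $unescaped ? 'match' : 'no match';
    my $e = $name =~ $escaped   ? 'match' : 'no match';
    print "$name: unescaped=$u escaped=$e\n";
}
```

Both patterns should agree on every input, so a failure to match is more likely caused by another character in the string than by the dot.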
Re^4: problems matching umlauts in env vars by beable (Friar) on Jul 23, 2004 at 08:20 UTC
Re^5: problems matching umlauts in env vars by borisz (Canon) on Jul 23, 2004 at 08:26 UTC
Re: problems matching umlauts in env vars by allolex (Curate) on Jul 23, 2004 at 07:20 UTC
You need to define a locale that contains ä/ö/ü for \w to include them. You need to do this even for UTF-8. UTF-8 is just a standard way of representing characters, not the set of characters that can make up words in a particular language.

use locale;
use POSIX 'locale_h';
my $loc = 'de_DE.utf8';    # German locale, for example. Run 'locale -a' to get the exact locale name
setlocale(LC_CTYPE, $loc) or die "Invalid locale $loc";

Either that, or use this little trick off of my home node: `[A-Za-zÀ-ÿœŒ]` instead of \w :)

I probably should add that the German locale will likely not match 'ë', since it does not exist in German. Maybe Dutch or French...

-- Damon Allen Davison http://www.allolex.net
Re^2: problems matching umlauts in env vars by december (Pilgrim) on Aug 02, 2004 at 04:37 UTC
Thanks for your reply. I have set the locale now, and that solves at least this problem. German locale should be using the iso-8859-1 (or rather iso-8859-15) charset, which does contain an e with umlauts. Standard French language doesn't have umlauts, but Dutch (my native language) does. Either way, all Western European countries use the same charset, which should be iso-8859-15 (that's latin1 plus euro). The problem now is that I don't know which charset will be given to me in the request... Could be pretty much anything.
Re: problems matching umlauts in env vars by beable (Friar) on Jul 23, 2004 at 01:25 UTC
Here are the results I got:

Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
SUCCEEDED1
SUCCEEDED2
VALID3
VALID4

I'd suggest you add a line like this as the third line of your program, to check that the strings are the same:

die "strings are different!" if ($testString ne $fileAsked);
Re^2: problems matching umlauts in env vars by december (Pilgrim) on Jul 23, 2004 at 03:56 UTC
They are equal (tested with 'eq'). Yet one matches and the other doesn't, with the same regex. How bizarre. >:-\|
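One way two strings can compare equal and still be stored differently is the internal UTF8 flag: `eq` compares characters, not the underlying storage. The sketch below is only a diagnostic aid, not an explanation of what happened on the poster's system. The helper name `describe` and the sample string are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal UTF8 flag and the code points of a string.
sub describe {
    my ($s) = @_;
    my $flag  = utf8::is_utf8($s) ? 'on' : 'off';
    my $chars = join ' ', map { sprintf '%02X', ord($_) } split //, $s;
    return "utf8 flag $flag: $chars";
}

my $bytes    = "bl\xEBh";    # 8-bit string; the e with diaeresis is the single byte 0xEB
my $upgraded = $bytes;
utf8::upgrade($upgraded);    # same characters, different internal storage

print describe($bytes),    "\n";
print describe($upgraded), "\n";
print "eq: ", ($bytes eq $upgraded ? 'yes' : 'no'), "\n";
```

Printing both strings through such a helper before the regex test shows whether they really are in the same state, which a plain `eq` or `ne` check cannot reveal.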
Re: problems matching umlauts in env vars by graff (Chancellor) on Jul 24, 2004 at 08:29 UTC
You said:

it does match the exact same string when I use a variable, but not when handed down by $ENV{'PATH_TRANSLATED'} - which probably is a non-encoded 8bit string. (emphasis added)

Meanwhile, the docs for "utf8" have this to say about the "upgrade" call:

Note that this should not be used to convert a legacy byte encoding to Unicode: use Encode for that.

So, if your environment variable's value is actually set via some single-byte European character encoding ("Latin1"), then passing it to utf8::upgrade amounts to just calling it utf8 when it really is not. The upgrade call returns the number of octets in the "converted" string (which doesn't really get converted -- it just gets its utf8 flag turned on, I think). So you'll get a non-zero return unless the string is completely empty.

(I confess I'm a bit confused by the docs for "utf8::upgrade" -- especially its behavior wrt "characters in the range 0x80-0xFF". There are odd things about this range and its treatment in perl 5.8 that I still need to understand better.)

Anyway, try this:

use Encode;
# ...
$fileAsked = decode( "iso8859-1", $fileAsked);

and then see whether "VALID4" shows up. Check the Encode man page for more options (e.g. trapping character conversion failures using eval).
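The difference between the two calls can be sketched in a few lines. This assumes the environment value arrives as ISO-8859-1 bytes; the sample value in `$raw` is made up for illustration, and a real script would read `$ENV{'PATH_TRANSLATED'}` instead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: a file name as ISO-8859-1 bytes (illustrative value only).
my $raw = "bl\xEBh.txt";

# utf8::upgrade changes the internal storage of the same characters and
# returns the number of octets needed to hold the string as UTF-8.
my $copy   = $raw;
my $octets = utf8::upgrade($copy);

# decode() interprets the bytes in a named encoding and returns a character string.
my $text = decode('iso-8859-1', $raw);

print "upgrade returned $octets\n";
print "decoded length: ", length($text), "\n";
print "VALID\n" if $text =~ /^[\w\s\/.]+$/;
```

Because the return value of `utf8::upgrade` is an octet count, the `SUCCEEDED` lines in the original test print for any non-empty string and say nothing about whether the encoding was right.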
Re^2: problems matching umlauts in env vars by december (Pilgrim) on Aug 02, 2004 at 04:56 UTC
Yeah, it does, thanks. I don't understand how perl succeeds in converting the 8bit string in PATH_TRANSLATED to the one given in decode though... I've tested several browsers; some send utf-8, and some seem to send iso-8859-1 or iso-8859-15. The $fileAsked variable could be in any charset, really. Oh well... I hope these internationalisation issues will resolve themselves in the next couple of years, not just for Perl, but for all software really.
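When the incoming charset is unknown, one common heuristic is to try a strict UTF-8 decode first and fall back to ISO-8859-1 if the bytes are not valid UTF-8. The sketch below is only that, a heuristic: it cannot tell ISO-8859-1 from ISO-8859-15, and a Latin-1 string that happens to be valid UTF-8 will be guessed wrong. The helper name `decode_guess` and the sample inputs are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8 first, fall back to ISO-8859-1.
# Returns the decoded text and the name of the encoding that was guessed.
sub decode_guess {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ($text, 'UTF-8') if defined $text;
    return (decode('iso-8859-1', $bytes), 'iso-8859-1');
}

# The same file name, once as UTF-8 bytes and once as ISO-8859-1 bytes.
for my $raw ("bl\xC3\xABh.txt", "bl\xEBh.txt") {
    my ($text, $guess) = decode_guess($raw);
    my $ok = $text =~ /^[\w\s\/.]+$/ ? 'valid' : 'invalid';
    print "$guess: $ok, ", length($text), " characters\n";
}
```

Given the earlier report that the file could no longer be found after conversion, it is probably safest to use the decoded text only for validation and to keep the original bytes for the actual filesystem access.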

