problems matching umlauts in env vars

by december (Pilgrim)
on Jul 23, 2004 at 01:08 UTC (#376768=perlquestion)

december has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow seekers of enlightenment,

I'm trying to construct a simple regex that checks whether a variable contains only characters valid in a Unix path. The regex works as it should when there are no umlauts in the string, but while testing different inputs I noticed that it refuses to match any umlauts. What bugs me is that it does match the exact same string when I assign it to a variable myself, but not when it is handed down by $ENV{'PATH_TRANSLATED'}, which is probably a non-encoded 8-bit string. A shortened example:

    $testString = "/usr/home/december/public_html/experiments/html/files/blëh.txt";
    $fileAsked = $ENV{'PATH_TRANSLATED'};
    print "Trying with: $testString\n";
    print "Trying with: $fileAsked\n";
    print "VALID1\n" if ($testString =~ /^([\w\s\/.]+)$/);
    print "VALID2\n" if ($fileAsked =~ /^([\w\s\/.]+)$/);
    print "SUCCEEDED1\n" if (utf8::upgrade($testString));
    print "SUCCEEDED2\n" if (utf8::upgrade($fileAsked));
    print "VALID3\n" if ($testString =~ /^([\w\s\/.]+)$/);
    print "VALID4\n" if ($fileAsked =~ /^([\w\s\/.]+)$/);


    Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
    Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
    SUCCEEDED1
    SUCCEEDED2
    VALID3

Note that both the strings and the regexes are exactly the same, but after conversion one matches and the other doesn't. I suspect some utf8 problem, or a wrong charset being used for \w. Perl version is 5.8.3.

How do I make the \w match umlauts consistently? Do I need to set a locale even for utf8? This behavior doesn't seem logical to me.

Replies are listed 'Best First'.
Re: problems matching umlauts in env vars
by borisz (Canon) on Jul 23, 2004 at 01:43 UTC
    Your problem, IMHO, is that your locale is already utf8. This means that your env var is in utf8 but your $testString is in latin1. If this is the case, you need to use Encode; and decode the bytes from your environment to utf8.
    $fileAsked = Encode::decode(utf8 => $ENV{'aa'});

      It's not in utf. When I decode it like you told me, it can't find the file anymore (the umlaut is changed to a utf sequence when I print the variable, and the filesystem doesn't like utf filenames). Both strings seem to be iso-8859-1, probably just plain 8-bit. When I upgrade it to utf, though, it's the DOT it doesn't want to match on, not the umlaut.

          /^([\w\s.]+)$/   # unescaped
          /^([\w\s\.]+)$/  # or escaped

      Is there something wrong with the dot in the regex?

        No, inside the [] a . is the same as \.
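A quick way to check that claim is to run both character classes against the same inputs. This is only a minimal sketch; the sample strings and the %verdict hash are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Compare the unescaped and escaped dot inside a character class.
# The sample strings are hypothetical test inputs.
my %verdict;
for my $s ('a.b', 'a b', 'a/b', 'a*b') {
    my $plain   = ($s =~ /^[\w\s.]+$/)  ? 1 : 0;   # dot unescaped
    my $escaped = ($s =~ /^[\w\s\.]+$/) ? 1 : 0;   # dot escaped
    $verdict{$s} = [$plain, $escaped];
    print "'$s': unescaped=$plain escaped=$escaped\n";
}
```

Both columns should agree for every input, because a dot inside [] is a literal dot and not the "any character" metacharacter.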
Re: problems matching umlauts in env vars
by allolex (Curate) on Jul 23, 2004 at 07:20 UTC

    You need to set a locale that contains these characters for \w to include them. You need to do this even for UTF-8. UTF-8 is just a standard way of representing characters, not the set of characters that can make up words in a particular language.

        use locale;
        use POSIX 'locale_h';
        # German locale, for example. Run 'locale -a' to get the exact locale name.
        my $loc = 'de_DE.utf8';
        setlocale(LC_CTYPE, $loc) or die "Invalid locale $loc";

    Either that, or use this little trick off of my home node: [A-Za-z-] instead of \w :)

    I should probably add that the German locale will likely not match 'ë', since it does not exist in German. Maybe Dutch or French...

    Damon Allen Davison

      Thanks for your reply. I have set the locale now, and that solves at least this problem.

      German locale should be using the iso-8859-1 (or rather iso-8859-15) charset, which does contain an e with umlauts. Standard French language doesn't have umlauts, but Dutch (my native language) does. Either way, all Western European countries use the same charset, which should be iso-8859-15 (that's latin1 plus euro).

      The problem now is that I don't know which charset will be given to me in the request... Could be pretty much anything.

Re: problems matching umlauts in env vars
by beable (Friar) on Jul 23, 2004 at 01:25 UTC
    Here are the results I got:
        Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
        Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
        SUCCEEDED1
        SUCCEEDED2
        VALID3
        VALID4

    I'd suggest you add a line like this as the third line of your program, to check that the strings are the same:

    die "strings are different!" if ($testString ne $fileAsked);
      They are equal (tested with 'eq'). Yet one matches and the other doesn't, with the same regex. How bizarre. >:-|
Re: problems matching umlauts in env vars
by graff (Chancellor) on Jul 24, 2004 at 08:29 UTC
    You said:
    it does match the exact same string when I use a variable, but not when handed down by $ENV{'PATH_TRANSLATED'} - which probably is a non-encoded 8bit string.
    (emphasis added). Meanwhile, the docs for "utf8" have this to say about the "upgrade" call:
    Note that this should not be used to convert a legacy byte encoding to Unicode: use Encode for that.
    So, if your environment variable's value is actually set via some single-byte European character encoding ("Latin1"), then just passing it to utf8::upgrade amounts to just calling it utf8 when it really is not. The upgrade call returns the number of octets in the "converted" string (which doesn't really get converted -- it just gets its utf8 flag turned on, I think). So you'll get a non-zero return unless the string is completely empty.

    (I confess I'm a bit confused by the docs for "utf8::upgrade" -- especially its behavior wrt "characters in the range 0x80-0xFF". There are odd things about this range and its treatment in perl 5.8 that I still need to understand better.)
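For what it's worth, the effect of utf8::upgrade on that 0x80-0xFF range can be observed directly. The following is a minimal sketch, assuming a default setup with no "use locale" and no unicode_strings feature in scope; the short path and the byte 0xEB (e with diaeresis in ISO-8859-1) are stand-ins for the real environment value.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the value from the environment:
# 0xEB is "e with diaeresis" in ISO-8859-1.
my $bytes = "/files/bl\xEBh.txt";

# Match while the string is still a plain byte string.
my $before = ($bytes =~ /^[\w\s\/.]+$/) ? 1 : 0;

# Upgrade a copy and run the same regex again.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $after = ($upgraded =~ /^[\w\s\/.]+$/) ? 1 : 0;

print "matches before upgrade: $before\n";
print "matches after upgrade:  $after\n";
print "compare equal with eq:  ", ($bytes eq $upgraded ? "yes" : "no"), "\n";
```

The two strings compare equal with eq, yet \w treats the high character differently depending on the internal representation, which fits the behaviour reported for $testString in the original question.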

    Anyway, try this:

        use Encode;
        # ...
        $fileAsked = decode( "iso8859-1", $fileAsked);
    and then see whether "VALID4" shows up. Check the Encode man page for more options (e.g. trapping character conversion failures using eval).
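Trapping conversion failures with eval could look roughly like the sketch below. It assumes the incoming bytes are either valid UTF-8 or ISO-8859-1, which is a simplification; decode_path is a hypothetical helper name and the sample byte strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: try strict UTF-8 first and fall back to ISO-8859-1
# if the bytes are not well-formed UTF-8.
sub decode_path {
    my ($raw) = @_;
    my $copy = $raw;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    return $text if defined $text;
    return decode('iso-8859-1', $raw);
}

# The same name, once as ISO-8859-1 bytes and once as UTF-8 bytes.
for my $raw ("bl\xEBh.txt", "bl\xC3\xABh.txt") {
    my $text = decode_path($raw);
    printf "%d characters, matches \\w: %s\n",
        length($text), ($text =~ /^[\w.]+$/ ? 'yes' : 'no');
}
```

Only the decoded copy is meant for validation here. As noted earlier in the thread, the filesystem may still expect the original bytes, so the raw value would be kept for opening the file.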

      Yeah, it does, thanks. I don't understand how perl succeeds in converting the 8-bit string in PATH_TRANSLATED to the one given in decode though... I've tested several browsers; some send utf-8, and some seem to send iso-8859-1 or iso-8859-15. The $fileAsked variable could be in any charset, really.

      Oh well... I hope these internationalisation issues will resolve themselves in the next couple of years, not just for Perl, but for all software really.
