You need to set a locale whose character set includes those accented characters for \w to match them. You need to do this even for UTF-8: UTF-8 is just a standard way of encoding characters, not a definition of which characters can make up words in a particular language.

use locale;
use POSIX 'locale_h';

# German locale, for example. Run 'locale -a' to get the exact locale name.
my $loc = 'de_DE.utf8';
setlocale(LC_CTYPE, $loc) or die "Invalid locale $loc";

Either that, or use this little trick off of my home node: [A-Za-z-] instead of \w :)

I should probably add that the German locale will likely not match 'ë', since that character does not exist in German. Maybe Dutch or French...

by december (Pilgrim) on Aug 02, 2004 at 04:37 UTC

    Thanks for your reply. I have set the locale now, and that solves at least this problem.

    German locale should be using the iso-8859-1 (or rather iso-8859-15) charset, which does contain an e with an umlaut. Standard French doesn't have umlauts, but Dutch (my native language) does. Either way, all Western European countries use the same charset, which should be iso-8859-15 (that's latin1 plus the euro sign).

    The problem now is that I don't know which charset will be given to me in the request... Could be pretty much anything.
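
For the unknown-charset case, one possible approach (a sketch only, not something proposed in the thread) is to decode the incoming bytes into Perl characters first and then match with Unicode rules instead of relying on a locale. This assumes the request declares its charset somewhere (for example in a Content-Type header) and that Perl 5.14 or later is available for the /u modifier. The helper name decode_request_text and the iso-8859-15 fallback are hypothetical choices made for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: decode raw request bytes using the charset the client
# declared, falling back to a default when the name is missing or unknown.
sub decode_request_text {
    my ($bytes, $charset, $fallback) = @_;
    $fallback = 'iso-8859-15' unless defined $fallback;   # assumed default

    # find_encoding returns undef for names Encode does not recognise.
    my $enc = (defined $charset && find_encoding($charset))
           || find_encoding($fallback);
    return $enc->decode($bytes);
}

# The same word arriving in two different encodings.
my $latin = decode_request_text("No\xEBl",     'iso-8859-15');
my $utf8  = decode_request_text("No\xC3\xABl", 'UTF-8');

for my $word ($latin, $utf8) {
    # /u forces Unicode semantics, so \w covers accented letters
    # regardless of the current locale.
    print $word =~ /\A\w+\z/u ? "word\n" : "not a word\n";
}
```

Once decoded, both inputs are the same four-character string, so the regex no longer needs to know which charset the request used. If the request carries no charset at all, the fallback is only a guess and may decode the bytes incorrectly.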