trouble with umlauts

nefertari has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, perhaps one of you can help me:

Here at our university I am the author of a Perl CGI script that searches for preprints on a preprint server. Until last week it worked (apart from server upgrades and not being granted the privilege to run CGIs). Since last week there has been trouble with umlauts. It seems to be due to an upgrade of the server from Debian Potato to Woody (I don't know either of these).

This morning I changed it to die if the input didn't match /^((?:[a-zA-ZäöüÄÖÜß\s])*)$/, so that I could see where our problems are. Now the funny part: "ö" matches only every second time! (But even then no matches with the preprint list are found, although we have two authors with ö in their names.)

To make it possible to search for people without umlauts on the keyboard, we decided that oe should match ö in the search. So if I search for oe I again get no matches.

Does any of you have an idea what could be wrong?

Our data are in an XML file, and umlauts are encoded in a very ugly way: <UL>o</UL> stands for ö. (I didn't design this part.) I parse the file via XML::Parser and store only the matching preprints. One problem could be that I write ö and the other umlauts directly in the Perl script. But I don't know another way to do this. If you know one, I would be glad if you could tell me how to achieve this.


Replies are listed 'Best First'.
Re: trouble with umlauts by Biker (Priest) on Mar 19, 2002 at 16:31 UTC
"One problem could be that I write ö and the other umlauts directly in the Perl script." Try working with the hex values instead of typing the national characters in the Perl script. Everything will go worng!
perldoc perlunicode? by RMGir (Prior) on Mar 19, 2002 at 16:25 UTC
I'm not sure where the answer would lie, since I've never had to deal with accented characters in Perl. But perldoc perlunicode may be a good place to start. I recall a great deal of traffic about utf8 and regular expressions on p5p, so this may be an area where it's important to make sure you're using a recent (recentest?) perl. Which perl version just got installed? It may have known bugs (known to someone else, that is, not me). I realize this isn't a very helpful answer, but I hope it points you to something that does help... -- Mike
Re: trouble with umlauts by MZSanford (Curate) on Mar 19, 2002 at 16:56 UTC
I think Biker is on the right track with the hex values. But since I just finished dealing with a problem like this at work, I thought I would add one thing. Mine was not a CGI form, so this may not apply to you, but I found that Windows installed as German and Windows installed as English both have the ü character, but with different hex values, which may make things difficult. from the frivolous to the serious
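A minimal sketch of the hex-value idea, assuming the form data arrives as ISO-8859-1 (Latin-1) bytes: the check from the original question can be written with hex escapes, so the script file itself stays pure ASCII and no longer depends on the editor's encoding. The helper name is_valid_query is hypothetical and not taken from the original script.

```perl
use strict;
use warnings;

# Latin-1 code points of the German special characters, as hex escapes:
#   \xE4 = a-umlaut   \xF6 = o-umlaut   \xFC = u-umlaut
#   \xC4 = A-umlaut   \xD6 = O-umlaut   \xDC = U-umlaut   \xDF = eszett
my $allowed = qr/^[a-zA-Z\xE4\xF6\xFC\xC4\xD6\xDC\xDF\s]*$/;

# Hypothetical helper: returns 1 if the input contains only allowed
# characters, 0 otherwise.
sub is_valid_query {
    my ($input) = @_;
    return $input =~ $allowed ? 1 : 0;
}

print is_valid_query("Sch\xF6n"), "\n";        # Latin-1 byte for o-umlaut: prints 1
print is_valid_query("Sch\xC3\xB6n"), "\n";    # UTF-8 bytes for o-umlaut: prints 0
```

The second call shows why a browser that sends UTF-8 fails this check: the two bytes C3 B6 are seen as two separate characters, neither of which is in the allowed set.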
Re: Re: trouble with umlauts by nefertari (Chaplain) on Mar 19, 2002 at 17:38 UTC
Something like this could be the problem here. One of our root people checked what happened to an ö, and it somehow changed into two characters. But why does it match only every second time? By the way, where can I find the hex codes for the umlauts and the eszett?
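To check what a character has turned into, a small sketch like the following may help. It prints every byte of a string in hex; hex_bytes is a hypothetical helper name, and the two sample strings are only illustrations. A Latin-1 o-umlaut is the single byte F6, while its UTF-8 form is the two bytes C3 B6.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bytes of a string as space-separated
# two-digit hex values.
sub hex_bytes {
    my ($str) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $str;
}

print hex_bytes("\xF6"), "\n";        # prints "F6"
print hex_bytes("\xC3\xB6"), "\n";    # prints "C3 B6"
```

Writing the output of such a helper to the server log for each incoming search term would show which of the two forms the browser sent.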
Re: Re: Re: trouble with umlauts by Biker (Priest) on Mar 19, 2002 at 18:40 UTC
"where can I find the hex codes for the umlauts and the eszett?" `sprintf("%lx",ord('X')); # That's an ell, not a one.` That should give you the hex value of the character 'X'. Everything will go worng!
Re: trouble with umlauts by mirod (Canon) on Mar 19, 2002 at 17:31 UTC
My guess would be a change in either the Perl version or the XML::Parser version. If the change in behaviour comes from the Perl version, then you can probably re-install the old version as perl5-xxx and use it for this CGI. If it is an XML::Parser problem, then maybe Potato used XML::Parser 2.27 and Woody upgraded to XML::Parser 2.30, which has a problem with marking strings as UTF-8 (actually not marking them properly, I believe). You could re-install the old version; it is still on CPAN. You can also check the version of CGI.pm, but I don't think it should have an impact, as it is pretty good at keeping backward compatibility.
Re: Re: trouble with umlauts by nefertari (Chaplain) on Mar 19, 2002 at 17:35 UTC
The main problem is that I have only user rights on the system here. By the way, when I tested it in offline mode it worked. But that is because the web server and "my" computer are different machines. I will try the tips tomorrow, when I'm back at the university (it is evening here now).
Re: trouble with umlauts by Anonymous Monk on Mar 19, 2002 at 16:59 UTC
This is all IIRC, and I don't Recall Correctly fairly often. One of the remaining problems with Unicode is that the regex engine decides whether to match as Unicode or non-Unicode based solely on whether it thinks the regex is Unicode. Try forcing the matter by starting the regex with a Unicode character. Adding ö{0} to the beginning should do it. Try reading the "fixed bugs" section of a newer perl to see what bugs you have.
Re: Re: trouble with umlauts by theorbtwo (Prior) on Mar 19, 2002 at 18:54 UTC
BTW, the anonymous post above is mine; I didn't realise that I wasn't logged in yet. I just remembered something that I had forgotten and that I don't think anybody else mentioned: Have you `use`d utf8? Is the file actually being stored as utf8 (your script, that is)? We are using here a powerful strategy of synthesis: wishful thinking. -- The Wizard Book
Re: Re: Re: trouble with umlauts by nefertari (Chaplain) on Mar 20, 2002 at 08:45 UTC
How can I see which encoding it is stored in? That would be interesting to know, although I have found a solution to the problem.
Re: Re: Re: Re: trouble with umlauts by theorbtwo (Prior) on Mar 21, 2002 at 07:33 UTC
Re: Re: Re: Re: Re: trouble with umlauts by nefertari (Chaplain) on Mar 21, 2002 at 08:28 UTC
Re: trouble with umlauts - update by nefertari (Chaplain) on Mar 20, 2002 at 08:29 UTC
I know a little bit more: the exit via die every second time is caused by Konqueror, version 2.1.1. It encodes the ö either as %C3%B6, which causes the regex not to match, or as %F6, which matches there but yields no matches later. I think I will try to change the umlauts to our own scheme, and change them back in the output. Hopefully this will work. Thank you for your help and your ideas.
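One way to cope with a browser that sends either form, as a sketch only: decode the incoming bytes as UTF-8 if they are valid UTF-8, otherwise treat them as ISO-8859-1, and then fold umlauts to plain ASCII digraphs before comparing. This assumes a perl that ships the Encode module (5.8 or later), and that the CGI layer has already turned %C3%B6 or %F6 into raw bytes. The helper names decode_input and fold_umlauts are hypothetical, and the digraph folding merely stands in for whatever private scheme is chosen.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn raw form bytes into a character string.
# Strict UTF-8 is tried first; if the bytes are not valid UTF-8,
# they are interpreted as ISO-8859-1 instead.
sub decode_input {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()) };
    $text = Encode::decode('ISO-8859-1', $bytes) unless defined $text;
    return $text;
}

# Assumed folding table: umlauts and eszett to ASCII digraphs.
my %digraph = (
    "\xE4" => 'ae', "\xF6" => 'oe', "\xFC" => 'ue',
    "\xC4" => 'Ae', "\xD6" => 'Oe', "\xDC" => 'Ue',
    "\xDF" => 'ss',
);

# Hypothetical helper: replace each umlaut or eszett by its digraph.
sub fold_umlauts {
    my ($text) = @_;
    $text =~ s/([\xE4\xF6\xFC\xC4\xD6\xDC\xDF])/$digraph{$1}/g;
    return $text;
}

for my $raw ("Sch\xC3\xB6n", "Sch\xF6n", "Schoen") {
    print fold_umlauts(decode_input($raw)), "\n";    # prints "Schoen" each time
}
```

If the author names from the XML file are folded the same way, a search for oe and a search for ö end up comparing identical ASCII strings. The guess is not foolproof: Latin-1 input that happens to form valid UTF-8 byte sequences would be decoded as UTF-8.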