Re: Re: Sorting utf-8

by Anonymous Monk
on Apr 24, 2003 at 11:03 UTC


in reply to Re: Sorting utf-8
in thread Sorting utf-8

>current locale if use locale is in effect. See perllocale.

But locales and unicode don't mix well:

perldoc perlunicode:

"Use of locales with Unicode data may lead to odd results.
Currently, Perl attempts to attach 8-bit locale info to characters
in the range 0..255, but this technique is demonstrably incorrect for
locales that use characters above that range when mapped into
Unicode.  Perl's Unicode support will also tend to run slower.  Use of
locales with Unicode is discouraged."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I would have thought Unicode::Collate is the correct way to go when sorting utf8-encoded data. But, looking at the docs, I couldn't make head or tail of it: it assumes you know an awful lot about "Unicode Technical Standard #10".

If your current locale is, say, es_ES, how do you actually instantiate the correct Unicode::Collate object for that locale?

Replies are listed 'Best First'.
Re: Re: Re: Sorting utf-8
by dakkar (Hermit) on Apr 24, 2003 at 11:58 UTC

    Looking at the docs (and guessing at the UTS#10), I'd say that the Unicode collation algorithm is locale independent: it is supposed to give a collation key for any Unicode string (the fact that they are encoded in utf-8 is immaterial, BTW)

    So I'd just use Unicode::Collate->new()->sort(@list)

    If you want to customize the results, then you'll have to understand the UTS#10, but otherwise it should "just work"
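The one-liner above can be expanded into a small runnable sketch. It assumes Unicode::Collate is installed together with its default collation data, and the word list is made up purely for illustration. It compares Perl's built-in sort (plain code point order) with the untailored collator:

```perl
use strict;
use warnings;
use utf8;                       # this source file contains UTF-8 literals
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample data, not taken from the thread.
my @words = ('zoo', 'Äpfel', 'apple', 'Zebra');

# Built-in sort compares code points: uppercase ASCII first, then
# lowercase ASCII, then anything above 127.
my @by_codepoint = sort @words;

# Default (untailored) Unicode collation: letters are compared first,
# accents and case only break ties.
my $collator = Unicode::Collate->new();
my @collated = $collator->sort(@words);

print "code point order: @by_codepoint\n";
print "collated order:   @collated\n";
```

Note that the strings must already be decoded character strings (here via `use utf8`; for file or network input, via an `:encoding(UTF-8)` layer or `Encode::decode`). The collator works on characters, not on raw utf-8 bytes.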

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)

      Hmm, pretty cool if that's how it works - from looking at the constructor arguments I was wondering if you'd need a hash defining per-locale options, but maybe not.

      I would pay a LOT of money for a book that explained Unicode in a non-geeky manner :)

Re: Re: Re: Sorting utf-8
by webelan (Acolyte) on Apr 24, 2003 at 13:37 UTC
    Thank you, and also Dakkar, plus all the other kind people who posted,

    The solution is indeed to use "Unicode::Collate" together with a file called "allkeys.txt". No locales needed.

    Just put "use Unicode::Collate;" at the top (module names are case-sensitive) and add the lines:

    my %tailoring;   # left empty: default collation, no tailoring
    my $Collator = Unicode::Collate->new(%tailoring);
    @char = $Collator->sort(@char);

    and sorting works "automagically"; French is now totally correct; for the other character sets such as Greek it looks logical but I'll get our translators to check the order for me just in case there are still some quirks.

    However, Swedish and Finnish no longer sort correctly, because in those languages Ä, Ö, etc. are considered to come after "Z", so it looks like I'm going to have to do an if/else with "normal" sort and "collate". But who cares, I'm a huge step forward from where I was this morning.

    Thanks a lot guys, Anne
      Good to hear it's kind of working for you!

      But there will be a correct way for handling Swedish and Finnish unicode collation, so I wouldn't start switching between sort and collate in such cases until you've exhausted the "correct" way(s?) of doing this.
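One possible "correct way" is sketched below. It assumes a Unicode::Collate distribution recent enough to include the Unicode::Collate::Locale module, which postdates this thread and is bundled with later Perl releases; check your installed version before relying on it. The word list is hypothetical. The same constructor accepts other locale names (for example `es_ES`), and `getlocale` reports which tailoring was actually selected.

```perl
use strict;
use warnings;
use utf8;                       # this source file contains UTF-8 literals
use Unicode::Collate;
use Unicode::Collate::Locale;   # assumption: shipped with your Unicode::Collate

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample data, not taken from the thread.
my @words = ('Öl', 'Zebra', 'Äpple', 'Apa', 'Ål');

# Untailored collation treats Å, Ä and Ö as accented forms of A and O.
my @default = Unicode::Collate->new()->sort(@words);

# Swedish tailoring treats Å, Ä and Ö as separate letters after Z.
my $sv = Unicode::Collate::Locale->new(locale => 'sv');
my @swedish = $sv->sort(@words);

print "tailoring in use: ", $sv->getlocale, "\n";
print "default order:    @default\n";
print "Swedish order:    @swedish\n";
```

With a per-locale collator like this, there should be no need to switch between the built-in sort and the collator depending on the language: only the `locale` argument changes.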

      Perhaps you could ask SADAHIRO Tomoyuki, the Unicode::Collate author?

      It would certainly be good if you can post any reply you get here, since this is the kind of stuff perl developers are going to have to know more and more about - I'm of course talking about the folks who don't know it already :)
