PerlMonks  

Sorting utf-8

by webelan (Acolyte)
on Apr 24, 2003 at 09:06 UTC ( [id://252806]=perlquestion )

webelan has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that takes input from a utf-8 text file which contains text in 10 languages. The data is treated variously, but the bit posing a problem is a "simple" alphabetical sort, language by language. English works fine, no surprise. However, in French the É is placed after Z whereas it should be after E; and I won't even go into what it does with Greek ;)

Here's a snippet:

    use strict;
    use warnings;
    use diagnostics;
    use utf8;
    use open ':utf8';
    use FileHandle;

    # lots of treatment in between .... then
    foreach $term_name (sort keys %term) {
        if ( defined $term{$term_name}{$LL}{term} ) {
            $text = $term{$term_name}{$LL}{term};
            $char_1 = uc(substr($text,0,1));
            push @char, $char_1;
        }
    }
    remove_duplicates(@char);
    @char = sort(@char);

where $LL is the language code, and term is the term in that language. The subroutine remove_duplicates() just gets rid of excess letters so that @char contains only unique letters.

I'm using ActiveState Perl 5.8.0 on Windows NT4. I've read the docs on Unicode and utf-8; unfortunately they all seem to imply that sort() works as is.

I'm looking either to sort in true alphabetical fashion, where É follows E, or to fold anything commencing with an accented character into the corresponding non-accented character. Any tips would be appreciated.

Thank you
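
For the second option in the question (folding an accented initial into its unaccented letter), one possible approach is to decompose the string with Unicode::Normalize and strip the combining marks. This is only a sketch: base_initial and @terms are hypothetical names, not part of the original program, and letters with no canonical decomposition (for example Maltese Ħ) are left unchanged.

```perl
use strict;
use warnings;
use utf8;                          # string literals below are UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: upper-cased first letter of $text with accents removed.
sub base_initial {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. "É" becomes "E" + U+0301
    $decomposed =~ s/\p{Mn}+//g;   # drop the nonspacing (combining) marks
    return uc substr($decomposed, 0, 1);
}

# Hypothetical sample data standing in for the terms of one language.
my @terms = ('École', 'Eau', 'État', 'Zèbre', 'abri');

# Collect each initial once, then sort the plain letters.
my %seen;
my @initials = sort grep { !$seen{$_}++ } map { base_initial($_) } @terms;
```

With this folding, terms starting with É end up under E, so a plain sort of the resulting initials is enough for the index letters.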

Replies are listed 'Best First'.
Re: Sorting utf-8
by Jaap (Curate) on Apr 24, 2003 at 09:53 UTC
    You will have to create your own sorting algorithm. Default sort's cmp operator compares the numeric character codes (the 'ascii value', or the code point for Unicode strings). Therefore é comes way after z.

    Chapter 3.2.153 of 'Learning Perl' by O'Reilly covers using your own algorithms for sort. The essence is to have a subroutine return -1 if $a < $b, return 0 if $a == $b and return 1 if $a > $b, where $a and $b are the two array elements to compare.

    You could perhaps make a hash like this:
    my %characterOrder = (
        'a' => 1, 'â' => 1, 'ä' => 1,
        ...
        'b' => 2, 'c' => 3, 'Ç' => 4,
        ...
    );
    You can then compare the values of $characterOrder{$a} with $characterOrder{$b} like this:
    sub GoodSort { $characterOrder{$a} <=> $characterOrder{$b} }
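
A fuller version of this idea might look like the sketch below. The rank table is a small hypothetical excerpt (a real one would need every letter of every alphabet involved), and rank_of and by_character_order are illustrative names. Letters missing from the table are pushed to the end, and ties between letters of equal rank fall back to cmp.

```perl
use strict;
use warnings;
use utf8;   # string literals below are UTF-8

# Hypothetical excerpt of a rank table: accented forms share their base letter's rank.
my %character_order = (
    'A' => 1, 'À' => 1, 'Â' => 1,
    'B' => 2,
    'C' => 3, 'Ç' => 3,
    'E' => 5, 'É' => 5, 'È' => 5,
    'Z' => 26,
);

# Rank of a single letter; unknown letters sort after everything in the table.
sub rank_of {
    my ($char) = @_;
    return exists $character_order{$char} ? $character_order{$char} : 1_000;
}

# Comparison routine for sort: by rank first, then by code point to break ties.
sub by_character_order {
    return rank_of($a) <=> rank_of($b) || $a cmp $b;
}

my @letters = ('Z', 'É', 'B', 'E', 'A');
my @sorted  = sort by_character_order @letters;   # A B E É Z
```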
      Hi,

      Thanks for the reply. I thought of doing a listing, but then I realised that I need alphabets for Latin 1, Latin 2, Greek, Cyrillic and Maltese. If all else fails it is something to fall back on, but before that I wanted other thoughts/options.

      Anne
Re: Sorting utf-8
by PodMaster (Abbot) on Apr 24, 2003 at 10:03 UTC
    perldoc -f sort ... If SUBNAME or BLOCK is omitted, "sort"s in standard string comparison order.

    perldoc perlop ... Equality Operators ...

    Binary ``cmp'' returns -1, 0, or 1 depending on whether the left argument is stringwise less than, equal to, or greater than the right argument.

    ``lt'', ``le'', ``ge'', ``gt'' and ``cmp'' use the collation (sort) order specified by the current locale if use locale is in effect. See perllocale.

    perldoc perllocale ...
    SYNOPSIS
        @x = sort @y;      # ASCII sorting order
        {
            use locale;
            @x = sort @y;  # Locale-defined sorting order
        }
        @x = sort @y;      # ASCII sorting order again


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      >current locale if use locale is in effect. See perllocale.

      But locales and unicode don't mix well:

      perldoc perlunicode:

      "Use of locales with Unicode data may lead to odd results.
      Currently, Perl attempts to attach 8 bit locale info to characters
      in the range 0..255, but this technique is demonstrably incorrect for
      locales that use characters above that range when mapped into
      Unicode.  Perl's Unicode support will also tend to run slower.  Use of
      locales with Unicode is discouraged."
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
      I would have thought Unicode::Collate is the correct way to go when sorting utf8 encoded data. But, looking at the docs, I couldn't make head or tail of it: it assumes you know an awful lot about "Unicode Technical Standard #10".

      If your current locale is, say, es_ES, how do you actually instantiate the correct Unicode::Collate object for that locale?

        Looking at the docs (and guessing at the UTS#10), I'd say that the Unicode collation algorithm is locale independent: it is supposed to give a collation key for any Unicode string (the fact that they are encoded in utf-8 is immaterial, BTW)

        So I'd just use Unicode::Collate->new()->sort(@list)

        If you want to customize the results, then you'll have to understand the UTS#10, but otherwise it should "just work"
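
A minimal sketch of that suggestion, assuming an installed Unicode::Collate with its default collation table available. The sample letters are made up for illustration. The second collator uses the documented level option: at level 1 only base letters are compared, so accents and case are ignored.

```perl
use strict;
use warnings;
use utf8;   # string literals below are UTF-8
use Unicode::Collate;

# Hypothetical sample: initials as they might come out of the French terms.
my @letters = ('Z', 'É', 'B', 'E', 'A');

# Default collation: É sorts directly after E instead of after Z.
my $collator = Unicode::Collate->new();
my @sorted   = $collator->sort(@letters);   # A B E É Z

# Level 1 compares base letters only, so É and E count as the same letter.
my $base_only   = Unicode::Collate->new(level => 1);
my $same_letter = $base_only->eq('É', 'E');
```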

        -- 
                dakkar - Mobilis in mobile
        

        Most of my code is tested...

        Perl is strongly typed, it just has very few types (Dan)

        Thank you, and also Dakkar, plus all the other kind people who posted,

        The solution is indeed to use "Unicode::Collate" together with a file called "allkeys.txt". No locales needed.

        Just put "use Unicode::Collate;" and add the lines:
            my %tailoring;
            my $Collator;
            $Collator = Unicode::Collate->new(%tailoring);
            @char = $Collator->sort(@char);

        and sorting works "automagically"; French is now totally correct; for the other character sets such as Greek it looks logical but I'll get our translators to check the order for me just in case there are still some quirks.

        However, Swedish and Finnish no longer sort correctly, because in those languages Ä, Ö etc are considered to come after "Z" so it looks like I'm going to have to do an if/else with "normal" sort and "collate". But who cares, I'm a huge step forward from where I was this morning.

        Thanks a lot guys, Anne
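
For the Swedish and Finnish ordering mentioned above, later releases of Unicode::Collate ship a Unicode::Collate::Locale module with per-language tailorings, which may avoid the if/else with plain sort. This is a sketch that assumes a Perl recent enough to include that module (it did not exist when this thread was written); the sample letters are made up.

```perl
use strict;
use warnings;
use utf8;   # string literals below are UTF-8
use Unicode::Collate::Locale;

# Swedish tailoring: Å, Ä and Ö are separate letters that follow Z.
my $swedish = Unicode::Collate::Locale->new(locale => 'sv');

my @letters = ('Ö', 'Z', 'Ä', 'A', 'Å');
my @sorted  = $swedish->sort(@letters);   # A Z Å Ä Ö
```

The same constructor accepts other language codes, so one code path could pick the collator from the language code instead of switching between sort and collate.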
      Hi, Thanks for your reply. I thought using locales might work, although I have never used them before. But when I ran
          use locale;
          print +(sort grep /\w/, map { chr() } 0..255), "\n";

      to find out exactly what kind of ordering I would get, my was it weird. These are just the first few characters:

      _01╣2▓3│456789aAß┴Ó└Ô┬õ─Ò├Õ┼µãbBcCþÃdD­ðeE

      Nonetheless, I tried the sort with
          use locale;
          @char = sort(@char);
          no locale;

      and I got the ordering

      A É B C D E

      for the French. Not quite what I expected. Now I'm going to take a look at "Unicode::Collate" as mentioned by another post.

      Thanks for your help though. It is definitely a steep learning curve.
      Anne
