Sorting utf-8

webelan has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that takes input from a utf-8 text file which contains text in 10 languages. The data is treated variously, but the bit posing a problem is a "simple" alphabetical sort, language by language. English works fine, no surprise. However, in French the É is placed after Z whereas it should be after E; and I won't even go into what it does with Greek ;)

Here's a snippet:

use strict;
use warnings;
use diagnostics;
use utf8;
use open ':utf8'; 
use FileHandle;

# lots of treatment in between .... then

foreach my $term_name (sort keys %term) {
    if ( defined $term{$term_name}{$LL}{term} ) {
        my $text   = $term{$term_name}{$LL}{term};
        my $char_1 = uc( substr( $text, 0, 1 ) );    # first letter, upper-cased
        push @char, $char_1;
    }
}

remove_duplicates(@char);

@char = sort(@char);

where $LL is the language code, and term is the term in that language. The subroutine remove_duplicates() just gets rid of excess letters so that @char contains only unique letters.

I'm using Perl ActiveState 5.8.0 on Windows NT4. I've read the docs on unicode and utf-8; unfortunately they all seem to imply that sort() works as is.

I'm looking either to sort in true alphabet fashion where É follows E, or to combining anything commencing with an accented character into the non-accented character. Any tips would be appreciated.

Thank you

Re: Sorting utf-8
by Jaap (Curate) on Apr 24, 2003 at 09:53 UTC

my %characterOrder = (
'a' => 1,
'â' => 1,
'ä' => 1,
...
'b' => 2,
'c' => 3,
'Ç' => 4,
...
);

sub GoodSort
{
  $characterOrder{$a} <=> $characterOrder{$b}
}
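
A minimal, runnable sketch of the lookup-table idea above. The table contents, the choice to give accented letters the same rank as their base letter, and the names rank_of and by_character_order are assumptions made for illustration; a real table would need an entry for every letter in every language being sorted.

```perl
use strict;
use warnings;
use utf8;                                  # this source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, deliberately tiny rank table: accented forms share the
# rank of their base letter.
my %character_order = (
    'a' => 1, 'â' => 1, 'ä' => 1,
    'b' => 2,
    'c' => 3, 'ç' => 3,
    'e' => 5, 'é' => 5, 'è' => 5,
    'z' => 26,
);

# Look up the rank of a single letter, ignoring case.
# Letters missing from the table get a large rank, so they sort last.
sub rank_of {
    my ($char) = @_;
    my $key = lc $char;
    return exists $character_order{$key} ? $character_order{$key} : 1_000;
}

# Sort comparator: rank first, then code point order to break ties.
sub by_character_order {
    return rank_of($a) <=> rank_of($b) || $a cmp $b;
}

my @letters = sort by_character_order qw(z é b e a ç);
print "@letters\n";    # a b ç e é z
```

The tie-break on code point keeps the result deterministic when two letters share a rank. The obvious cost of this approach is maintaining the table by hand for ten languages.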

Re: Re: Sorting utf-8

by Anonymous Monk on Apr 24, 2003 at 10:31 UTC

Re: Sorting utf-8
by PodMaster (Abbot) on Apr 24, 2003 at 10:03 UTC

If SUBNAME or BLOCK is omitted, "sort"s in standard string comparison order.

perldoc perlop ... Equality Operators ...

Binary "cmp" returns -1, 0, or 1 depending on whether the left argument is stringwise less than, equal to, or greater than the right argument.
"lt", "le", "ge", "gt" and "cmp" use the collation (sort) order specified by the current locale if "use locale" is in effect. See perllocale.

SYNOPSIS
        @x = sort @y;       # ASCII sorting order
        {
            use locale;
            @x = sort @y;   # Locale-defined sorting order
        }
        @x = sort @y;       # ASCII sorting order again

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Re: Sorting utf-8

by Anonymous Monk on Apr 24, 2003 at 11:03 UTC

But locales and unicode don't mix well:

perldoc perlunicode:

"Use of locales with Unicode data may lead to odd results.
Currently, Perl attempts to attach 8 bit locale info to characters
in the range 0..255, but this technique is demonstrably incorrect for
locales that use characters above that range when mapped into
Unicode.  Perl's Unicode support will also tend to run slower.  Use of
locales with Unicode is discouraged."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If your current locale is, say, es_ES, how do you actually instantiate the correct Unicode::Collate object for that locale?

Re: Re: Re: Sorting utf-8

by dakkar (Hermit) on Apr 24, 2003 at 11:58 UTC

Looking at the docs (and guessing at UTS #10), I'd say that the Unicode collation algorithm is locale independent: it is supposed to give a collation key for any Unicode string (the fact that the strings are encoded in utf-8 is immaterial, BTW).

So I'd just use Unicode::Collate->new()->sort(@list).

If you want to customize the results, then you'll have to understand UTS #10, but otherwise it should "just work".
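
To make the suggestion concrete, here is a small sketch that compares Perl's built-in sort with a default Unicode::Collate object. The word list is made up for illustration, and the script assumes a Perl installation where Unicode::Collate can load its default collation table.

```perl
use strict;
use warnings;
use utf8;                                  # this source contains UTF-8 literals
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';

my @words = qw(Zèbre Été Fin École Eau);   # illustrative data only

# Built-in sort compares code points, so É (U+00C9) lands after Z (U+005A).
my @by_codepoint = sort @words;

# The default collator compares base letters first and accents second.
my @by_uca = Unicode::Collate->new->sort(@words);

print "code point order: @by_codepoint\n";   # Eau Fin Zèbre École Été
print "UCA order:        @by_uca\n";         # Eau École Été Fin Zèbre
```

The input strings must be decoded character strings (here via use utf8; for file data, via an input layer such as the one set up with use open), otherwise the collator sees individual UTF-8 bytes rather than letters.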

-- 
        dakkar - Mobilis in mobile

Most of my code is tested...

Perl is strongly typed, it just has very few types (Dan)

Re: Re: Re: Re: Sorting utf-8

by Anonymous Monk on Apr 24, 2003 at 12:54 UTC

Re: Re: Re: Sorting utf-8

by webelan (Acolyte) on Apr 24, 2003 at 13:37 UTC

use Unicode::Collate;

my %tailoring;    # optional constructor parameters; empty means default collation
my $Collator = Unicode::Collate->new(%tailoring);

@char = $Collator->sort(@char);
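
For the other option raised in the original question, folding an accented initial into its unaccented letter, one possible sketch uses Unicode::Normalize to decompose the first character and drop the combining marks. The helper name base_initial and the sample terms are hypothetical, and letters that have no decomposition (ø or æ, for example) pass through unchanged.

```perl
use strict;
use warnings;
use utf8;                                  # this source contains UTF-8 literals
use Unicode::Normalize qw(NFD);
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';

# Return the upper-cased first letter of $text with combining accents removed.
sub base_initial {
    my ($text) = @_;
    my $first = NFD( substr( $text, 0, 1 ) );   # "É" becomes "E" + U+0301
    $first =~ s/\p{Mn}//g;                      # drop the combining marks
    return uc $first;
}

my @terms = qw(école eau Été zèbre fin);        # illustrative data only

# Collect each initial once, then sort the unique letters.
my %seen;
my @char = grep { !$seen{$_}++ } map { base_initial($_) } @terms;
@char = Unicode::Collate->new->sort(@char);

print "@char\n";    # E F Z
```

Deduplicating with a %seen hash in the same pass would also stand in for a separate remove_duplicates() step.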

Re: Re: Re: Re: Sorting utf-8

by Anonymous Monk on Apr 24, 2003 at 15:08 UTC

Re: Re: Re: Re: Sorting utf-8

by richyboy (Acolyte) on Apr 24, 2003 at 21:58 UTC

Re: Re: Sorting utf-8

by webelan (Acolyte) on Apr 24, 2003 at 12:53 UTC

use locale;
    print +(sort grep /\w/, map { chr() } 0..255), "\n";

#use locale;      
@char = sort(@char);
#no locale;