webelan has asked for the wisdom of the Perl Monks concerning the following question:
I have a program that takes input from a utf-8 text file which contains text in 10 languages. The data is treated variously, but the bit posing a problem is a "simple" alphabetical sort, language by language. English works fine, no surprise. However, in French the É is placed after Z whereas it should be after E; and I won't even go into what it does with Greek ;)
Here's a snippet:
use strict;
use warnings;
use diagnostics;
use utf8;
use open ':utf8';
use FileHandle;
# lots of treatment in between .... then
foreach $term_name (sort keys %term) {
    if ( defined $term{$term_name}{$LL}{term} ) {
        $text = $term{$term_name}{$LL}{term};
        $char_1 = uc(substr($text,0,1));
        push @char, $char_1;
    }
}
remove_duplicates(@char);
@char = sort(@char);
where $LL is the language code, and term is the term in that language. The subroutine remove_duplicates() just gets rid of excess letters so that @char contains only unique letters.
I'm using Perl ActiveState 5.8.0 on Windows NT4. I've read the docs on unicode and utf-8; unfortunately they all seem to imply that sort() works as is.
I'm looking either to sort in true alphabetical fashion, where É follows E, or to fold anything that starts with an accented character into the unaccented character. Any tips would be appreciated. Thank you.
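The second option (folding accented initials into the plain letter) could be sketched roughly as below. This is only an illustration: base_initial() is a hypothetical helper, the sample terms are made up, and it relies on Unicode::Normalize to decompose a letter into base letter plus combining marks. Letters that have no such decomposition (ø, æ, ß and the like) would pass through unchanged.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: unaccented, upper-cased first letter of a term.
sub base_initial {
    my ($text) = @_;
    my $first = NFD(substr($text, 0, 1));   # "É" becomes "E" + combining acute
    $first =~ s/\p{Mn}+//g;                 # drop the combining marks
    return uc $first;
}

# Made-up sample terms; the hash filters out duplicate letters.
my %seen;
my @char = sort grep { !$seen{$_}++ }
           map  { base_initial($_) } ('école', 'Été', 'zèbre', 'arbre');

print "@char\n";
```

With this approach a plain sort() is enough afterwards, because only unaccented letters remain in @char.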
Re: Sorting utf-8
by Jaap (Curate) on Apr 24, 2003 at 09:53 UTC
You will have to write your own sort routine. The default sort uses the cmp operator, which compares the numeric code (the 'ascii value') of each character. Therefore é comes way after z.
Chapter 3.2.153 of 'Learning Perl' by O'Reilly covers using your own algorithms for sort.
The essence is to have a subroutine return -1 if $a < $b, return 0 if $a == $b and return 1 if $a > $b, where $a and $b are the two array elements to compare.
You could perhaps make a hash like this:
my %characterOrder = (
    'a' => 1,
    'â' => 1,
    'ä' => 1,
    ...
    'b' => 2,
    'c' => 3,
    'Ç' => 4,
    ...
);
You can then compare the values of $characterOrder{$a} with $characterOrder{$b} like this:
sub GoodSort
{
    $characterOrder{$a} <=> $characterOrder{$b}
}
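A runnable sketch of this rank-table idea might look like the following. The ranks, the rank() helper and the by_character_order name are illustrative assumptions, not a complete alphabet. Here accented variants share the rank of their base letter, only the first character is ranked, and unknown characters are sent to the end.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Illustrative ranks only; a real table needs an entry for every letter.
my %character_order = (
    'a' => 1, 'â' => 1, 'ä' => 1,
    'b' => 2,
    'c' => 3, 'ç' => 3,
    'e' => 5, 'é' => 5, 'è' => 5,
    'z' => 26,
);

# Rank of the lower-cased first character; unknown characters sort last.
sub rank {
    my $c = lc substr($_[0], 0, 1);
    return exists $character_order{$c} ? $character_order{$c} : 1_000;
}

# Compare by rank first, then fall back to cmp so ties are stable.
sub by_character_order { rank($a) <=> rank($b) or $a cmp $b }

my @sorted = sort by_character_order ('Z', 'É', 'E', 'B', 'A');
print "@sorted\n";
```

The cmp fallback matters: without it, letters of equal rank (E and É here) could come out in either order.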
Hi,
Thanks for the reply. I thought of doing a listing, but then I realised that I need alphabets for Latin 1, Latin 2, Greek, Cyrillic and Maltese. If all else fails it is something to fall back on, but before that I wanted other thoughts/options.
Anne
Re: Sorting utf-8
by PodMaster (Abbot) on Apr 24, 2003 at 10:03 UTC
perldoc -f sort ... If SUBNAME or BLOCK is omitted, "sort"s in standard string comparison order.
perldoc perlop ... Equality Operators ...
Binary ``cmp'' returns -1, 0, or 1 depending on whether the left argument is stringwise less than, equal to, or greater than the right argument.
``lt'', ``le'', ``ge'', ``gt'' and ``cmp'' use the collation (sort) order specified by the current locale if use locale is in effect. See perllocale.
perldoc perllocale ...
SYNOPSIS
@x = sort @y;      # ASCII sorting order
{
    use locale;
    @x = sort @y;  # Locale-defined sorting order
}
@x = sort @y;      # ASCII sorting order again
"Use of locales with Unicode data may lead to odd results.
Currently, Perl attempts to attach 8 bit locale info to characters
in the range 0..255, but this technique is demonstrably incorrect for
locales that use characters above that range when mapped into
Unicode. Perl's Unicode support will also tend to run slower. Use of
locales with Unicode is discouraged."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I would have thought Unicode::Collate is the correct way to go when sorting utf8 encoded data. But, looking at the docs, I couldn't make head or tail of it - it assumes you know an awful lot about "Unicode Technical Standard #10".
If your current locale is, say, es_ES, how do you actually instantiate the correct Unicode::Collate object for that locale?
Looking at the docs (and guessing at the UTS#10), I'd say that the Unicode collation algorithm is locale independent: it is supposed to give a collation key for any Unicode string (the fact that they are encoded in utf-8 is immaterial, BTW)
So I'd just use Unicode::Collate->new()->sort(@list)
If you want to customize the results, then you'll have to understand the UTS#10, but otherwise it should "just work"
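A minimal sketch of that suggestion, set next to the built-in sort for comparison. The sample letters are made up, and the collator uses the default collation table that ships with the module.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my @letters  = ('Z', 'É', 'E', 'B', 'A');
my $collator = Unicode::Collate->new();

my @plain    = sort @letters;              # code point order: É lands after Z
my @collated = $collator->sort(@letters);  # collation order: É follows E

print "sort:    @plain\n";
print "collate: @collated\n";
```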
--
dakkar - Mobilis in mobile
Thank you, and also Dakkar, plus all the other kind people who posted,
The solution is indeed to use "Unicode::Collate" together with a file called "allkeys.txt". No locales needed.
Just put "use Unicode::Collate;" (module names are case-sensitive) and add the lines:
my %tailoring;
my $Collator;
$Collator = Unicode::Collate->new(%tailoring);
@char = $Collator->sort(@char);
and sorting works "automagically"; French is now totally correct; for the other character sets such as Greek it looks logical but I'll get our translators to check the order for me just in case there are still some quirks.
However, Swedish and Finnish no longer sort correctly, because in those languages Ä, Ö etc are considered to come after "Z" so it looks like I'm going to have to do an if/else with "normal" sort and "collate". But who cares, I'm a huge step forward from where I was this morning.
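For the Swedish and Finnish case, one possible alternative to an if/else is a language-specific tailoring. The sketch below assumes the Unicode::Collate::Locale module, which was not mentioned in this thread and only ships with newer releases of Unicode::Collate (bundled with Perl from 5.14 on, as far as I know), so it would need a newer module than the one in Perl 5.8.0. The sample words are made up.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my @words = ('zebra', 'äpple', 'apa', 'öga');

my $default = Unicode::Collate->new();
my $swedish = Unicode::Collate::Locale->new(locale => 'sv');

my @default_order = $default->sort(@words);  # ä and ö sort near a and o
my @swedish_order = $swedish->sort(@words);  # ä and ö sort after z

print "default: @default_order\n";
print "swedish: @swedish_order\n";
```

The collator could then be chosen per language code, for example by keeping one collator object per value of $LL and falling back to the untailored one.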
Thanks a lot guys,
Anne
Hi,
Thanks for your reply. I thought using locales might work, although I have never used them before. But when I ran
use locale;
print +(sort grep /\w/, map { chr() } 0..255), "\n";
to find out exactly what kind of ordering I would get, my was it weird. These are just the first few characters:
_01╣2▓3│456789aAß┴Ó└Ô┬õ─Ò├Õ┼µãbBcCþÃdDðeE
Nonetheless, I tried the sort with
#use locale;
@char = sort(@char);
#no locale;
and I got the ordering
A É B C D E
for the French. Not quite what I expected.
Now I'm going to take a look at "Unicode::Collate" as mentioned by another post.
Thanks for your help though. It is definitely a steep learning curve.
Anne