Diacritic-Insensitive and Case-Insensitve Sorting

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Diacritic-Insensitive and Case-Insensitve Sorting by Roger (Parson) on Jan 05, 2004 at 05:51 UTC
`use strict; use warnings; use Unicode::Collate; # Level 2 collation: diacritic ordering my $collator = Unicode::Collate->new(level => 2); @sorted = $collator->sort(@not_sorted);` [download]	[reply] [d/l]
Re: Diacritic-Insensitive and Case-Insensitve Sorting by davido (Cardinal) on Jan 05, 2004 at 04:46 UTC
That's an interesting question. I was hesitant to post my solution because I'm thinking someone will come along with a POSIX module solution that normalizes diacritic symbols to their base character automatically. But in case that response doesn't come along, I'll give it a go... Use `tr///` to transliterate characters with diacritic symbols into their base character. Put the transliteration into a function to keep your code clean. And then use a Schwartzian Transform on your sort to ensure that conversions only happen once, for improved speed, and so that your original text is left in tact. `sub sterilize { my $char = shift; $char =~ tr/EéÉêÊè/e/; # All of your other transliterations go here too. return $char; } my @array = ( # Your stuff to be sorted goes here ); my @sorted; @sorted = map { $_->[1] } sort { ( $a->[0] cmp $b->[0] ) or ( $a->[1] cmp $b->[1] ) } map { [ sterilize($_), $_ ] } @array;` [download] As you can see, there are still a few minor blanks you have to fill in for this to be fully functional. You'll have to work out how to apply this to your database-tied hash, and you'll need to add the transliterations for the other diacritic symbols that I didn't enumerate (I don't know how to type them). But the idea should be clear enough. Good luck. ...now I'll sit back and watch for a more elegant solution. Update: Added short-circuit to the sort routine to force a defined order in cases where sorted strings are rendered equal by stripping diacritics. Dave	[reply] [d/l] [select]
Re: Re: Diacritic-Insensitive and Case-Insensitve Sorting by graff (Chancellor) on Jan 05, 2004 at 06:05 UTC
I was hesitant to post my solution because I'm thinking someone will come along with a POSIX module solution that normalizes diacritic symbols to their base character automatically. The same thought crossed my mind, in terms of using unicode character classes. I checked the perlfaqs in 5.8.1, and the perlunicode man page, and didn't find anything relevant, though I think there was a discussion about this sort of thing (removing accents) on the perl-unicode mail list within the last couple weeks. In any case, I would hesitate to look for that sort of solution -- there's a reasonably good chance that operations using unicode character classes will end up being slower than just doing plain old "tr///" on plain old latin1 bytes. That, combined with the fact that AM would need to convert everything to utf8 first, tends to make this somewhat unlikely to succeed as an "optimization". As for a POSIX (as opposed to unicode) module, I would guess that if someone decided to do this in "pure perl", it would end up as just a "tr///" statement...	[reply]
Re: Diacritic-Insensitive and Case-Insensitve Sorting by graff (Chancellor) on Jan 05, 2004 at 05:38 UTC
It would be interesting to see how "monstrous" your temporary sort function is -- it might not be as bad as you think, or might at least be on the right track. If you use "tr///" to normalize the case and accents, it's not clear to me that this would be bad in terms of scalability or generality (assuming that your "general" case always involves latin1 characters). Using "s///" is likely to be slower, of course, so you certainly don't want to do that. Is it the case that you have profiled the code and determined that your sort routine is really taking too much time? If you haven't tested the performance and confirmed that your sort is the bottleneck, it might not be worthwhile to try to optimize it (i.e. the real bottleneck might be somewhere else, like file i/o). You say the set of strings to sort will number in the few hundreds of thousands -- but how big are the strings? Are you sure you need to tie them into a DB_File, as opposed to simply having them in a memory-resident hash? Does your sort function look like the following? This is how I would do it, just off the top of my head -- I don't know how it would perform on large quantities of data. It might be more "optimal" to fold everything into a single big "tr///", but it just seemed quicker/easier to code it this way (mostly in terms of avoiding slips with too many or too few replacement characters on the RHS): `sub fold_sort { my ( $x, $y ) = ( $a, $b ); for ( $x, $y ) { tr/A-Z/a-z/; tr/ÀÁÂÃÄÅàáâãäå/a/; tr/ÈÉÊËèéêë/e/; tr/ÌÍÎÏìíîï/i/; tr/ÒÓÔÕÖØòóôõöø/o/; tr/ÙÚÛÜùúûü/u/; tr/ÇçÑñÝýÿ/ccnnyyy/; s/ß/ss/g; } $x cmp $y; }` [download] (update: added the s///g; for the double-S symbol) (My 5.8.1 under SuSE linux doesn't have any trouble with treating/keeping the latin1 data as-is, but folks with Red Hat and 5.8.0 might have to add a "use bytes" pragma to make it work.)	[reply] [d/l]
Re: Diacritic-Insensitive and Case-Insensitve Sorting by davido (Cardinal) on Jan 05, 2004 at 05:48 UTC
I dug deeper into the POD's and found that a variant on my original solution is actually discussed in the POD. See perlebcdic. The SORTING section is very educational. From the POD: MONO CASE then sort data. In order to minimize the expense of mono casing mixed test try to `tr///` towards the character set case most employed within the data. If the data are primarily UPPERCASE non Latin 1 then apply `tr/[a-z]/[A-Z]/` then sort(). If the data are primarily lowercase non Latin 1 then apply `tr/[A-Z]/[a-z]/` before sorting. If the data are primarily UPPERCASE and include Latin-1 characters then apply: `tr/[a-z]/[A-Z]/; tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ +Þ]/; s/ß/SS/g;` [download] then sort(). Do note however that such Latin-1 manipulation does not address the ÿ y WITH DIAERESIS character that will remain at code point 255 on ASCII machines, but 223 on most EBCDIC machines where it will sort to a place less than the EBCDIC numerals. With a Unicode enabled Perl you might try: `tr/^?/\x{178}/;` [download] The strategy of mono casing data before sorting does not preserve the case of the data and may not be acceptable for that reason. The POD method mentions the fact that transliteration will obliterate the original string's diacritic symbols and case. It's for that reason that I also used a Schwartzian Transform in my strategy. However, as Roger mentioned in the CB, that method is memory-expensive. Of course what we're doing is more than sorting normalized case, it's also normalized diacritic symbols. But the idea is similar. Dave	[reply] [d/l] [select]
Re: Diacritic-Insensitive and Case-Insensitve Sorting by Willard B. Trophy (Hermit) on Jan 05, 2004 at 15:02 UTC
If you `use locale;`, you'll get most of the way there. It doesn't exactly consider all accented versions of a character to be identical, but it does sort all versions of A before B. This might be good enough for you. It certainly got me 95% of the way when my job was sorting multilingual dictionaries. Using a code example from the perllocale pod, this is what I get as the collation order for my locale, en_CA: 0 1 2 3 4 5 6 7 8 9 _ A a À à Á á Â â Ã ã Ä ä Å å Æ æ B b C c Ç ç D d Ð ð E e È è É é Ê ê Ë ë F f G g H h I i Ì ì Í í Î î Ï ï J j K k L l M m N n Ñ ñ O o Ò ò Ó ó Ô ô Õ õ Ö ö Ø ø P p Q q R r S s ß T t U u Ù ù Ú ú Û û Ü ü V v W w X x Y y Ý ý ÿ Z z Þ þ Using locale is a bit slower than an unadorned sort, but it's far faster and has fewer pitfalls than rolling your own locale-emulation system. `lc` and `uc` do exactly what you'd expect under locale, too. -- bowling trophy thieves, die!	[reply] [d/l] [select]
Re: Diacritic-Insensitive and Case-Insensitve Sorting by ysth (Canon) on Jan 05, 2004 at 08:32 UTC
This doesn't help this questioner, but here's what I use for removing Unicode accents: use 5.008; use charnames (); sub deaccent { # split it into characters, then loop through them converting one +by one my @chars = split //, $_[0]; for my $char (@chars) { # look up the name (e.g. "LATIN SMALL LETTER O WITH TILDE") my $name = charnames::viacode(ord($char)); # only try to convert it if it was a valid char and had " WITH + " if ($name && $name =~ m/(.*) WITH /) { # take off the " WITH foo" and see if that is a valid char my $neword = charnames::vianame("$1"); $char = chr($neword) if $neword; } } return join '', @chars; } [download]	[reply] [d/l]
Re: Diacritic-Insensitive and Case-Insensitve Sorting by snellm (Monk) on Jan 05, 2004 at 13:24 UTC
TMTOWTDI - this is a more general solution that remaps characters in a string in a fairly (cough) efficient manner. You'd want to use this in conjunction with a Swartzian transform. my @lowervec = ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, ' ','!','"','#','$','%','&','\'','','','*','+','','','','', '0','1','2','3','4','5','6','7','8','9',':',';','<','=','>','?', '@','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o', 'p','q','r','s','t','u','v','w','x','y','z','[','\\',']','^','_', '`','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o', 'p','q','r','s','t','u','v','w','x','y','z','{','\|','}','~',127, 'c','u','e','a','a','a','a','c','e','e','e','i','i','i','a','a', 'e',145,'','o','o','o','u','u','y','o','u',155,156,157,158,159, 'a','i','o','u','n','n',166,167,168,169,170,'',172,173,174,175, 176,177,178,179,180,181,182,183,184,185,186,'',188,189,190,191, 192,193,194,195,196,197,198,199,200,'e',202,203,204,205,206,207, 208,209,210,211,212,213,'o',215,216,217,218,219,220,221,222,223, 'a','a',226,227,'a',229,230,'c',232,'e',234,'e',236,'i',238,239, 240,241,242,'o','o',245,'o',247,248,249,250,251,'u','y',254,255, ); # Lowercases string, remaps accented chars and removes some non-alphab +etic chars sub cleanstring($) { return join('', map { $lowervec[ord($_)] } split('', shift)); } [download] -- Michael Snell -- michael@snell.com	[reply] [d/l]
Re: Diacritic-Insensitive and Case-Insensitve Sorting by jweed (Chaplain) on Jan 05, 2004 at 04:54 UTC
If you're sure that the keys will only contain Alphabetic and diacritic characters, you could try a `s/[^A-Za-z]/,/` in davido's sterilize function. That turns ugly chars into a comma, which serves as a placeholder for any icky bits. Still incredibly hackish, but better than specifying all the weird marks by hand... Turns out this post was based on a misunderstanding b/c of an incorrect encoding on my browser. Please see Roger's solution, below. Who is Kayser Söze? Code is (almost) always untested.	[reply] [d/l]
Re: Re: Diacritic-Insensitive and Case-Insensitve Sorting by davido (Cardinal) on Jan 05, 2004 at 05:05 UTC
But tr/// is faster than s///, and also, not all characters with diacritic symbols are equal. For example, simply substituting "A~" with "," and "E`" with ",", means that "Sera'" and "Sere'" would both turn into "Ser,", and both be sorted as equals. That's probably not desired behavior. In other words, simply turning anything "ugly" into a comma will lose important information needed for an accurate sort. Remember that tr/ABC/A/ turns A, B, or C into A. It's the same as tr/ABC/AAA/. So in my example, I'm turning all of the "e"-like characters with diacritic symbols into an 'e'. I'm also retaining the original string so that in cases where Avo^ and Avo' would be sorted as equals, a defined order is retained. Dave	[reply]
Re: Re: Re: Diacritic-Insensitive and Case-Insensitve Sorting by BUU (Prior) on Jan 05, 2004 at 05:16 UTC
I think the biggest problem is that "E like characters" is human defined, there is no technological solution to determine what is "E like" and what is not. Therefor you basically always have to specify your transformation list by hand.	[reply]
Re: Re: Re: Diacritic-Insensitive and Case-Insensitve Sorting by jweed (Chaplain) on Jan 05, 2004 at 05:21 UTC
From the original node: For example, characters e E Ã?Â© Ã?? Ã?Âª Ã?? Ã?Â¨ should be considered equivalent So actually, my behavior is probably...desired behavior Who is Kayser Söze? Code is (almost) always untested.	[reply]
Re: Re: Re: Re: Diacritic-Insensitive and Case-Insensitve Sorting by davido (Cardinal) on Jan 05, 2004 at 05:27 UTC
Re: Re: Re: Re: Re: Diacritic-Insensitive and Case-Insensitve Sorting by jweed (Chaplain) on Jan 05, 2004 at 06:00 UTC