Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Diacritic-Insensitive and Case-Insensitve Sorting

by Anonymous Monk
on Jan 05, 2004 at 04:17 UTC ( [id://318769]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This Perl Beginner would appreciate advice on how to tackle this problem ...

I need to sort lists of strings -- alphabetically, but insensitive to case or the presence of diacritics. For example, characters e E é É ê Ê è should be considered equivalent. I have a temporary sort function which I consider a monstrous hack -- it converts the diacritic characters to their non-diacritic equivalent, then lower-casing before performing the cmp comparison. It works, but it's inefficient and non-scaleable. There must be a better way -- anyone know what it is?

Background:
  • The characters are all standard 8-bit Latin-1 ISO 8859-1 (no Unicode).
  • The diacritics are those used in European languages: French, Portuguese, Spanish, German, etc.
  • The strings are the values of a large hash, presently about 15,000 but expected to grow to at least 100,000.
  • I use DB_File to store the hash - it is always tied.
  • I need to use the same sort function in other contexts -- the problem is about how to compare strings rather than how to sort a hash by value.
  • Comment on Diacritic-Insensitive and Case-Insensitve Sorting

Replies are listed 'Best First'.
Re: Diacritic-Insensitive and Case-Insensitve Sorting
by Roger (Parson) on Jan 05, 2004 at 05:51 UTC
    use strict; use warnings; use Unicode::Collate; # Level 2 collation: diacritic ordering my $collator = Unicode::Collate->new(level => 2); @sorted = $collator->sort(@not_sorted);
Re: Diacritic-Insensitive and Case-Insensitve Sorting
by davido (Cardinal) on Jan 05, 2004 at 04:46 UTC
    That's an interesting question. I was hesitant to post my solution because I'm thinking someone will come along with a POSIX module solution that normalizes diacritic symbols to their base character automatically. But in case that response doesn't come along, I'll give it a go...

    Use tr/// to transliterate characters with diacritic symbols into their base character. Put the transliteration into a function to keep your code clean. And then use a Schwartzian Transform on your sort to ensure that conversions only happen once, for improved speed, and so that your original text is left in tact.

    sub sterilize { my $char = shift; $char =~ tr/EéÉêÊè/e/; # All of your other transliterations go here too. return $char; } my @array = ( # Your stuff to be sorted goes here ); my @sorted; @sorted = map { $_->[1] } sort { ( $a->[0] cmp $b->[0] ) or ( $a->[1] cmp $b->[1] ) } map { [ sterilize($_), $_ ] } @array;

    As you can see, there are still a few minor blanks you have to fill in for this to be fully functional. You'll have to work out how to apply this to your database-tied hash, and you'll need to add the transliterations for the other diacritic symbols that I didn't enumerate (I don't know how to type them). But the idea should be clear enough.

    Good luck. ...now I'll sit back and watch for a more elegant solution.

    Update: Added short-circuit to the sort routine to force a defined order in cases where sorted strings are rendered equal by stripping diacritics.


    Dave

      I was hesitant to post my solution because I'm thinking someone will come along with a POSIX module solution that normalizes diacritic symbols to their base character automatically.

      The same thought crossed my mind, in terms of using unicode character classes. I checked the perlfaqs in 5.8.1, and the perlunicode man page, and didn't find anything relevant, though I think there was a discussion about this sort of thing (removing accents) on the perl-unicode mail list within the last couple weeks.

      In any case, I would hesitate to look for that sort of solution -- there's a reasonably good chance that operations using unicode character classes will end up being slower than just doing plain old "tr///" on plain old latin1 bytes. That, combined with the fact that AM would need to convert everything to utf8 first, tends to make this somewhat unlikely to succeed as an "optimization".

      As for a POSIX (as opposed to unicode) module, I would guess that if someone decided to do this in "pure perl", it would end up as just a "tr///" statement...

Re: Diacritic-Insensitive and Case-Insensitve Sorting
by graff (Chancellor) on Jan 05, 2004 at 05:38 UTC
    It would be interesting to see how "monstrous" your temporary sort function is -- it might not be as bad as you think, or might at least be on the right track. If you use "tr///" to normalize the case and accents, it's not clear to me that this would be bad in terms of scalability or generality (assuming that your "general" case always involves latin1 characters). Using "s///" is likely to be slower, of course, so you certainly don't want to do that.

    Is it the case that you have profiled the code and determined that your sort routine is really taking too much time? If you haven't tested the performance and confirmed that your sort is the bottleneck, it might not be worthwhile to try to optimize it (i.e. the real bottleneck might be somewhere else, like file i/o).

    You say the set of strings to sort will number in the few hundreds of thousands -- but how big are the strings? Are you sure you need to tie them into a DB_File, as opposed to simply having them in a memory-resident hash?

    Does your sort function look like the following? This is how I would do it, just off the top of my head -- I don't know how it would perform on large quantities of data. It might be more "optimal" to fold everything into a single big "tr///", but it just seemed quicker/easier to code it this way (mostly in terms of avoiding slips with too many or too few replacement characters on the RHS):

    sub fold_sort { my ( $x, $y ) = ( $a, $b ); for ( $x, $y ) { tr/A-Z/a-z/; tr/ÀÁÂÃÄÅàáâãäå/a/; tr/ÈÉÊËèéêë/e/; tr/ÌÍÎÏìíîï/i/; tr/ÒÓÔÕÖØòóôõöø/o/; tr/ÙÚÛÜùúûü/u/; tr/ÇçÑñÝýÿ/ccnnyyy/; s/ß/ss/g; } $x cmp $y; }
    (update: added the s///g; for the double-S symbol)

    (My 5.8.1 under SuSE linux doesn't have any trouble with treating/keeping the latin1 data as-is, but folks with Red Hat and 5.8.0 might have to add a "use bytes" pragma to make it work.)

Re: Diacritic-Insensitive and Case-Insensitve Sorting
by davido (Cardinal) on Jan 05, 2004 at 05:48 UTC
    I dug deeper into the POD's and found that a variant on my original solution is actually discussed in the POD. See perlebcdic. The SORTING section is very educational.

    From the POD:

    MONO CASE then sort data. In order to minimize the expense of mono casing mixed test try to tr/// towards the character set case most employed within the data. If the data are primarily UPPERCASE non Latin 1 then apply tr/[a-z]/[A-Z]/ then sort(). If the data are primarily lowercase non Latin 1 then apply tr/[A-Z]/[a-z]/ before sorting. If the data are primarily UPPERCASE and include Latin-1 characters then apply:

    tr/[a-z]/[A-Z]/; tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ +Þ]/; s/ß/SS/g;

    then sort(). Do note however that such Latin-1 manipulation does not address the ÿ y WITH DIAERESIS character that will remain at code point 255 on ASCII machines, but 223 on most EBCDIC machines where it will sort to a place less than the EBCDIC numerals. With a Unicode enabled Perl you might try:

    tr/^?/\x{178}/;

    The strategy of mono casing data before sorting does not preserve the case of the data and may not be acceptable for that reason.

    The POD method mentions the fact that transliteration will obliterate the original string's diacritic symbols and case. It's for that reason that I also used a Schwartzian Transform in my strategy. However, as Roger mentioned in the CB, that method is memory-expensive.

    Of course what we're doing is more than sorting normalized case, it's also normalized diacritic symbols. But the idea is similar.


    Dave

Re: Diacritic-Insensitive and Case-Insensitve Sorting
by Willard B. Trophy (Hermit) on Jan 05, 2004 at 15:02 UTC

    If you use locale;, you'll get most of the way there. It doesn't exactly consider all accented versions of a character to be identical, but it does sort all versions of A before B. This might be good enough for you. It certainly got me 95% of the way when my job was sorting multilingual dictionaries.

    Using a code example from the perllocale pod, this is what I get as the collation order for my locale, en_CA:

    0 1 2 3 4 5 6 7 8 9 _ A a À à Á á Â â Ã ã Ä ä Å å Æ æ B b C c Ç ç D d Ð ð E e È è É é Ê ê Ë ë F f G g H h I i Ì ì Í í Î î Ï ï J j K k L l M m N n Ñ ñ O o Ò ò Ó ó Ô ô Õ õ Ö ö Ø ø P p Q q R r S s ß T t U u Ù ù Ú ú Û û Ü ü V v W w X x Y y Ý ý ÿ Z z Þ þ

    Using locale is a bit slower than an unadorned sort, but it's far faster and has fewer pitfalls than rolling your own locale-emulation system. lc and uc do exactly what you'd expect under locale, too.

    --
    bowling trophy thieves, die!

Re: Diacritic-Insensitive and Case-Insensitve Sorting
by ysth (Canon) on Jan 05, 2004 at 08:32 UTC
    This doesn't help this questioner, but here's what I use for removing Unicode accents:
    use 5.008; use charnames (); sub deaccent { # split it into characters, then loop through them converting one +by one my @chars = split //, $_[0]; for my $char (@chars) { # look up the name (e.g. "LATIN SMALL LETTER O WITH TILDE") my $name = charnames::viacode(ord($char)); # only try to convert it if it was a valid char and had " WITH + " if ($name && $name =~ m/(.*) WITH /) { # take off the " WITH foo" and see if that is a valid char my $neword = charnames::vianame("$1"); $char = chr($neword) if $neword; } } return join '', @chars; }
Re: Diacritic-Insensitive and Case-Insensitve Sorting
by snellm (Monk) on Jan 05, 2004 at 13:24 UTC
    TMTOWTDI - this is a more general solution that remaps characters in a string in a fairly (cough) efficient manner. You'd want to use this in conjunction with a Swartzian transform.
    my @lowervec = ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, ' ','!','"','#','$','%','&','\'','','','*','+','','','','', '0','1','2','3','4','5','6','7','8','9',':',';','<','=','>','?', '@','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o', 'p','q','r','s','t','u','v','w','x','y','z','[','\\',']','^','_', '`','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o', 'p','q','r','s','t','u','v','w','x','y','z','{','|','}','~',127, 'c','u','e','a','a','a','a','c','e','e','e','i','i','i','a','a', 'e',145,'','o','o','o','u','u','y','o','u',155,156,157,158,159, 'a','i','o','u','n','n',166,167,168,169,170,'',172,173,174,175, 176,177,178,179,180,181,182,183,184,185,186,'',188,189,190,191, 192,193,194,195,196,197,198,199,200,'e',202,203,204,205,206,207, 208,209,210,211,212,213,'o',215,216,217,218,219,220,221,222,223, 'a','a',226,227,'a',229,230,'c',232,'e',234,'e',236,'i',238,239, 240,241,242,'o','o',245,'o',247,248,249,250,251,'u','y',254,255, ); # Lowercases string, remaps accented chars and removes some non-alphab +etic chars sub cleanstring($) { return join('', map { $lowervec[ord($_)] } split('', shift)); }

    -- Michael Snell
    -- michael@snell.com

Re: Diacritic-Insensitive and Case-Insensitve Sorting
by jweed (Chaplain) on Jan 05, 2004 at 04:54 UTC
    If you're sure that the keys will only contain Alphabetic and diacritic characters, you could try a s/[^A-Za-z]/,/ in davido's sterilize function. That turns ugly chars into a comma, which serves as a placeholder for any icky bits. Still incredibly hackish, but better than specifying all the weird marks by hand... Turns out this post was based on a misunderstanding b/c of an incorrect encoding on my browser. Please see Roger's solution, below.


    Who is Kayser Söze?
    Code is (almost) always untested.
      But tr/// is faster than s///, and also, not all characters with diacritic symbols are equal. For example, simply substituting "A~" with "," and "E`" with ",", means that "Sera'" and "Sere'" would both turn into "Ser,", and both be sorted as equals. That's probably not desired behavior.

      In other words, simply turning anything "ugly" into a comma will lose important information needed for an accurate sort.

      Remember that tr/ABC/A/ turns A, B, or C into A. It's the same as tr/ABC/AAA/. So in my example, I'm turning all of the "e"-like characters with diacritic symbols into an 'e'. I'm also retaining the original string so that in cases where Avo^ and Avo' would be sorted as equals, a defined order is retained.


      Dave

        I think the biggest problem is that "E like characters" is human defined, there is no technological solution to determine what is "E like" and what is not. Therefor you basically always have to specify your transformation list by hand.
        From the original node:
        For example, characters e E �© �? �ª �? �¨ should be considered equivalent
        So actually, my behavior is probably...desired behavior


        Who is Kayser Söze?
        Code is (almost) always untested.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://318769]
Approved by graff
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (2)
As of 2024-04-20 02:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found