PerlMonks  

Re^2: The Björk Situation

by thundergnat (Deacon)
on Feb 15, 2006 at 19:02 UTC


in reply to Re: The Björk Situation
in thread The Björk Situation

You can speed this up considerably by transliterating everything you can and then only substituting characters that need it.

my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';
print deaccent($string);

sub deaccent {
    my $phrase = shift;
    return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    my %trans = (
        'Æ' => 'AE',
        'æ' => 'ae',
        'Þ' => 'TH',
        'þ' => 'th',
        'Ð' => 'TH',
        'ð' => 'th',
        'ß' => 'ss'
    );
    $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
    return $phrase;
}

Benchmarking puts it at about 6 times faster. Moving the hash assignment outside the sub speeds both versions up by about the same amount, so the ratio between them stays roughly the same.
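A minimal sketch of what moving the hash assignment outside the sub could look like. It assumes the source file is saved as UTF-8 (hence the use utf8 and binmode lines, which are not in the post above), and %TRANS is only an illustrative name:

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8

# Built once when the file is loaded, not on every call to deaccent().
my %TRANS = (
    'Æ' => 'AE', 'æ' => 'ae',
    'Þ' => 'TH', 'þ' => 'th',
    'Ð' => 'TH', 'ð' => 'th',
    'ß' => 'ss',
);

sub deaccent {
    my $phrase = shift;

    # Short-circuit: no characters in \xC0-\xFF means nothing to do.
    return $phrase unless $phrase =~ /[\xC0-\xFF]/;

    # One-to-one replacements are handled by tr ...
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;

    # ... and only the one-to-many cases go through s///.
    $phrase =~ s/([ÆæÞþÐðß])/$TRANS{$1}/g;
    return $phrase;
}

binmode STDOUT, ':encoding(UTF-8)';
print deaccent('Björk Guðmundsdóttir'), "\n";    # prints "Bjork Guthmundsdottir"
```

The hash is a file-scoped lexical, so the sub closes over it and the per-call cost of rebuilding it goes away.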

use Benchmark qw( cmpthese );

my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';

cmpthese( -5, {
    deaccent => sub {
        my $phrase = $string;
        return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
        $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
        my %trans = (
            'Æ' => 'AE',
            'æ' => 'ae',
            'Þ' => 'TH',
            'þ' => 'th',
            'Ð' => 'TH',
            'ð' => 'th',
            'ß' => 'ss'
        );
        $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
        return $phrase;
    },
    deaccent2 => sub {
        my %acc = qw(
            À A Á A Â A Ã A Ä A Å A Æ AE Ç C
            È E É E Ê E Ë E Ì I Í I Î I Ï I
            Ð TH Ñ N Ò O Ó O Ô O Õ O Ö O Ø O
            Ù U Ú U Û U Ü U Ý Y Þ TH ß ss
            à a á a â a ã a ä a å a æ ae ç c
            è e é e ê e ë e ì i í i î i ï i
            ð th ñ n ò o ó o ô o õ o ö o ø o
            ù u ú u û u ü u ý y þ th ÿ y
        );
        my $text = $string;
        $text =~ s/(.)/$acc{$1}?$acc{$1}:$1/eg;
        return $text;
    },
});

Returns on my system:

             Rate deaccent2  deaccent
deaccent2  4316/s        --      -86%
deaccent  30859/s      615%        --

With data that has fewer accented characters, the disparity should grow much greater, since deaccent short-circuits when there are no characters to be transliterated.

Re^3: The Björk Situation
by rhesa (Vicar) on Feb 15, 2006 at 19:12 UTC
    I thought I'd add Text::Unidecode in the mix:
    use Text::Unidecode;
    ...
    unidecode => sub { return unidecode($string) },
    The benchmark returns this on my system:
                   Rate deaccent2  deaccent unidecode
    deaccent2    8614/s        --      -83%      -97%
    deaccent    50243/s      483%        --      -81%
    unidecode  267338/s     3003%      432%        --

      Actually, now that I've had a moment to look at it, unidecode DOESN'T fare so well, strictly from a speed point of view.

      You made the mistake of modifying $string directly, so in all but the first call there are NO characters left to transliterate, and it benchmarked much faster. Once that is fixed, it doesn't have such a big lead. (Actually, none at all ;-) )

      unidecode => sub{ my $text = $string; return unidecode($text); },
      Yields:
      
                   Rate unidecode deaccent2  deaccent
      unidecode  6797/s        --       -3%      -87%
      deaccent2  6979/s        3%        --      -86%
      deaccent  50687/s      646%      626%        --
      

      Nevertheless, unidecode probably IS the best choice, as it handles Unicode up to \x{FFFF}, not just up to \xFF.

        You made the mistake of modifying $string directly so that in all but the first call, there are NO characters that need to be transliterated so it is much faster. Once that is fixed, it doesn't have such a big lead.

        Whoops! You're right, I hadn't expected it to modify $string in-place. I suppose that's due to Benchmark imposing a void context on the return.
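A small sketch of the effect being described, using only core Perl. The shout function is a hypothetical stand-in, not Text::Unidecode itself: it edits its argument in place when called in void context and returns a copy otherwise. The point is that the expression after return is evaluated in whatever context the enclosing sub was called in, so a wrapper called in void context passes that void context down:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a function that is destructive in void context.
sub shout {
    if ( !defined wantarray ) {    # void context: nobody wants a return value
        $_[0] = uc $_[0];          # @_ aliases the caller's variable
        return;
    }
    return uc $_[0];               # scalar or list context: return a copy
}

my $string  = 'abc';
my $wrapper = sub { return shout($string) };

my $copy = $wrapper->();           # scalar context reaches shout()
my $after_scalar_call = $string;   # still 'abc'

$wrapper->();                      # void context reaches shout()
my $after_void_call = $string;     # now 'ABC'

print "$copy $after_scalar_call $after_void_call\n";    # prints "ABC abc ABC"
```

Copying into a lexical first, as in the corrected benchmark entry above, keeps the shared input intact whichever context the sub ends up being called in.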

        My lesson learned today: Never trust your own benchmarks :)

      Good point, though Text::Unidecode transliterates eth (ð) as d rather than the more generally accepted th. That's just quibbling, though; you really shouldn't be using ANY of these functions lightly, since they destroy information and change the meaning of the text.

        More quibbling ;)

        http://en.wikipedia.org/wiki/Eth_(letter) says "the letter had its origin as a d with a cross-stroke added". I don't think d is such a bad transliteration then.

        In my view, it's the thorn (þ) that should become th. And in fact, Text::Unidecode does so.

        I do agree with you though that all these transliterations lose information. But that makes them well suited for internal representations, especially in text searches.

        Another advantage of Text::Unidecode is that it handles a lot more than what's in the Latin-1 supplement. This quote from the perldoc describes it best: "In other words, Unidecode's approach is broad (knowing about dozens of writing systems), but shallow (not being meticulous about any of them)."

        So for speed and generality, I'd recommend it. If you need precision, then transliteration may not be such a good idea altogether.
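For the text-search use mentioned above, one way to keep the information loss contained is to use the transliterated form only as a lookup key and store the original text alongside it. A rough sketch, assuming a UTF-8 source file; search_key and its deliberately tiny character table are hypothetical, and any of the functions discussed in this thread could take its place:

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8

# Hypothetical folding helper covering only a handful of letters.
sub search_key {
    my $text = shift;
    $text =~ tr/ÁÉÍÓÖÚÝáéíóöúý/AEIOOUYaeioouy/;
    return lc $text;    # the letters handled above are plain ASCII by now
}

my @names = ( 'Björk', 'Sigur Rós', 'Múm' );

# Index: folded key => list of original, unmodified names.
my %index;
push @{ $index{ search_key($_) } }, $_ for @names;

# An unaccented query finds the accented original.
my @hits = @{ $index{ search_key('bjork') } || [] };

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @hits;    # prints "Björk"
```

The lossy form never leaves the index, so what is displayed or stored is still the original spelling.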

        As an Icelander I just wish to point out that we always transliterate 'ð' as 'd', not 'th'.

        So, as usual, the standard Perl module does the right thing.


        --
        Regards,
        Helgi Briem
        hbriem AT f-prot DOT com
