Actually, I shouldn't say it "crashed"; it just can't detect the encoding when there are two consecutive "ö".
Here is the code to reproduce the problem. A single "ö" is fine, two "ö" separated by a space ("ö ö") are fine, and "öñ" (%F6%F1) is fine,
but "öö" (%F6%F6) is not:
use utf8;
use Text::Unaccent;
use Encode::Detect::Detector;
## my $author = "Sch%F6ttl";
my $author = "Sch%F6%F6ttl";
$author =~ s/%([a-zA-Z0-9][a-zA-Z0-9])/pack('C',hex($1))/eg;
my $encoding = Encode::Detect::Detector::detect($author);
print "encoding: $encoding: $author <br>\n";
if($encoding){
$author = unac_string($encoding, $author);
print "after unac: $author<br>\n";
}
I'm guessing the bug is in Text::Unaccent, but it's directly using the iconv C library, so I can't easily say for sure.
However, maybe this can work:
use strict;
use feature qw(unicode_strings say);
use Unicode::Normalize 'NFD';
my $author = "Sch\x{f6}\x{f6}ttl";
$author = NFD $author;
$author =~ s/\p{Combining_Diacritical_Marks}//g;
say $author;
This doesn't include any decode() or encode() of the incoming/outgoing strings. Also, I think it can break in cases where there are multiple combining characters.
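For completeness, here is a rough sketch of how the decode()/encode() steps might be wrapped around the NFD approach. The helper name strip_accents_bytes is made up for illustration, the charset is assumed to be known already, and \p{Mn} (all nonspacing marks) is swapped in for the Combining_Diacritical_Marks block, so treat it as one possible way to do it rather than a drop-in replacement for Text::Unaccent.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not part of any module): takes raw bytes plus a
# charset name that is assumed to be correct, and returns UTF-8 bytes
# with the accents removed.
sub strip_accents_bytes {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);   # bytes -> Perl characters
    $text = NFD($text);                    # split base letters from combining marks
    $text =~ s/\p{Mn}+//g;                 # drop every nonspacing mark
    return encode('UTF-8', $text);         # characters -> UTF-8 bytes
}

print strip_accents_bytes("Sch\xF6\xF6ttl", 'ISO-8859-1'), "\n";   # prints "Schoottl"
```

Letters that have no canonical decomposition (for example "ø" or "ß") pass through unchanged, so this is not a full equivalent of what unac_string does.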
Thanks for your code, and sorry I didn't explain the problem clearly enough.
The input could be encoded in iso-8859-1 (\x{f6}\x{f6}) or maybe in utf-8 (\x{c3}\x{b6}), so I have to find out the charset first.
I am using Encode::Detect::Detector to find out whether the string's charset is utf-8 or iso-8859-1.
The logic is like:
my $charset = Encode::Detect::Detector::detect($input);
if ($charset eq 'UTF-8') {
    # do NFC ...
} elsif ($charset eq 'iso-8859-1') {
    # do NFD ...
}
In my case I call Text::Unaccent's unac_string($charset, $str). Text::Unaccent works well when Detector finds the correct charset; it fails when Detector fails, of course, because then there is no charset.
Encode::Detect::Detector normally works well, but it fails if the input is \x{f6}\x{f6}.
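Until the detector handles this input, one possible workaround is to skip detection for the two charsets in question: try a strict UTF-8 decode first and fall back to iso-8859-1 if that fails. This is only a sketch. The helper name decode_guess is invented here, and the fallback charset is an assumption about the data, not something any module reports. It relies on the fact that the bytes \x{f6}\x{f6} are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (decoded text, charset name used).
# Assumes the input is either UTF-8 or ISO-8859-1 and nothing else.
sub decode_guess {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its input
    # Strict decode: dies on malformed UTF-8, which eval turns into undef.
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ($text, 'UTF-8') if defined $text;
    # Every byte sequence is valid ISO-8859-1, so this fallback cannot fail.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

my ($text, $charset) = decode_guess("Sch\xF6\xF6ttl");
print "charset: $charset\n";   # prints "charset: ISO-8859-1"
```

Pure ASCII input is reported as UTF-8 here, which is harmless since the decoded text is the same either way.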
Well, that would be a bug ... please go to CPAN and report it.