PerlMonks  

Unicode surrogate is illegal in UTF-8

by Rodster001 (Pilgrim)
on Aug 03, 2015 at 18:14 UTC

Rodster001 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have some invalid UTF-8. I want to catch the warning messages and then either undefine or redefine the string.

For example:
    print Dumper $text;
    # $VAR1 = { 'string' => "Text\x{daed}", };
    print $text->{'string'};
    # Unicode surrogate U+DAED is illegal in UTF-8 at line 5.
I've tried something like:
    local $SIG{__WARN__} = sub { print "WARNING!\n"; };
Which catches the warning, but I lose context. I want to detect the warning then just set $text->{'string'} = "bad utf-8 encoding" or better yet, remove the offending character (regex?) so I end up with $text->{string} = "Text".

Thanks for your help.

Replies are listed 'Best First'.
Re: Unicode surrogate is illegal in UTF-8
by choroba (Cardinal) on Aug 03, 2015 at 18:40 UTC
    You can use eval with FATAL warnings:
        #!/usr/bin/perl
        use strict;
        use warnings;
        use open OUT => ':encoding(UTF-8)', ':std';
        use warnings FATAL => 'utf8';

        my $text = { string => "t\x{daed}\x{ffff}\x{daee}\x{c8}\n" };

        1 until eval {
            print $text->{string};
            1;
        } or do {
            my ($charcode) = $@ =~ /U\+(\S+)/ or die $@;
            print STDERR "Removing $charcode because of $@";
            $text->{string} =~ s/\x{$charcode}//g;
            0; # Try again!
        };

    Update: handles both "non-character" and "surrogate" cases. I wasn't able to trigger the "non_unicode" warnings.
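
If the warning text itself is not needed, the cleanup asked about in the question (removing the offending characters with a regex) could also be sketched without the eval loop. This is only an illustration, not code from the thread: the helper name strip_illegal is hypothetical, and the character class covers the surrogate range plus the noncharacters in the Basic Multilingual Plane only (the U+xFFFE/U+xFFFF pairs of the higher planes are not handled).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return a copy of the string
# with surrogates and BMP noncharacters deleted.
sub strip_illegal {
    my ($string) = @_;
    $string =~ s/[\x{D800}-\x{DFFF}   # surrogates
                  \x{FDD0}-\x{FDEF}   # noncharacter block
                  \x{FFFE}\x{FFFF}    # last two code points of the BMP
                 ]//gx;
    return $string;
}

my $text = { string => "Text\x{daed}" };
$text->{string} = strip_illegal($text->{string});
```

Because nothing is printed, no warning is raised and no handler is needed; the trade-off is that the list of "bad" code points is hard-coded instead of being taken from Perl's own diagnostics.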

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      This doesn't seem to work for me. It reports the warning, but $@ does not get set.
        Never mind, this worked for me (unrelated typo). Thanks!
      Could I generate/detect this warning without using "print" (i.e. so I could fix/replace silently)?
        You can print to a filehandle that doesn't lead anywhere:
            open my $VOID, '>', \ my $void;
            1 until eval {
                print {$VOID} $text->{string};
                # ...
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Could I generate/detect this warning without using "print" (i.e. so I could fix/replace silently)?

        Have a look at Handling Malformed Data.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
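
"Handling Malformed Data" is a section of the Encode documentation, so one possible reading of that pointer is to let Encode::encode do the check instead of print. The sketch below assumes the strict 'UTF-8' encoding together with the FB_CROAK and LEAVE_SRC check flags; the helper name is_encodable_utf8 is hypothetical and not from the thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the thread): true if the string can be
# encoded as strict UTF-8, false otherwise. Nothing is printed.
sub is_encodable_utf8 {
    my ($string) = @_;
    my $ok = eval {
        # FB_CROAK makes encode() die on data it cannot map;
        # LEAVE_SRC keeps encode() from modifying $string in place.
        Encode::encode('UTF-8', $string,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $text = { string => "Text\x{daed}" };
$text->{string} = 'bad utf-8 encoding'
    unless is_encodable_utf8($text->{string});
```

This detects the problem silently and leaves the decision (replace the whole string, or strip characters) to the caller. Which code points the strict encoder rejects beyond surrogates may depend on the installed Encode version.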
