Re: Safely removing Unicode zero-width spaces and other non-printing characters

by kcott (Archbishop)
on Dec 04, 2019 at 06:04 UTC


in reply to Safely removing Unicode zero-width spaces and other non-printing characters

G'day mldvx4,

From the examples you've shown, it looks like you may have a conflict between the UTF-16 used internally by MSWin and the UTF-8 used internally by Perl. That's a guess, but it's the type of issue I've seen in the past; if you're on a different OS that uses some other UTF encoding internally, you could well see similar problems.

The utf8 pragma only relates to your Perl source code. Have a look through that documentation for more details; and do note the emboldened text near the start of the DESCRIPTION.
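For example, this minimal sketch shows that the pragma affects only literals in the source itself, not runtime data:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the source below contains a literal non-ASCII character

# "café" is stored in this file as UTF-8 bytes; the pragma tells perl
# to decode that literal into characters at compile time. It does
# nothing to data your program reads at runtime.
my $word = "café";
print length($word), "\n";    # 4 (characters), not 5 (bytes)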

Different versions of Perl have different levels of Unicode support. Check your version and see if its support (or lack thereof) might be related to your problems.

You don't indicate the source of your input data or the target for the output. You may need to convert one or both within your script.
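As a sketch only (the filenames and encodings here are assumptions; substitute whatever your data actually uses), PerlIO layers can do such a conversion:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical filenames; pick the encodings your data actually uses.
open my $in,  '<:encoding(UTF-16)', 'input.txt'  or die "open: $!";
open my $out, '>:encoding(UTF-8)',  'output.txt' or die "open: $!";

# Each line is decoded to characters on read, re-encoded on write.
while (my $line = <$in>) {
    print {$out} $line;
}
close $out or die "close: $!";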

Take a look at the perl manpage. Under Reference Manual, you'll see a lot of links like "perluni*" — pick ones that are appropriate for your level of Unicode knowledge and read on from there.

With more information regarding OS, Perl version, I/O handling and so on, along with some sample code and input/output data, you may get a better answer.

Addendum: Regarding the substitution at the start of your post, I ran this quick test:

$ perl -E 'my $x = "A\N{NO-BREAK SPACE}B\N{NO-BREAK SPACE}C"; $x =~ s/\x{00A0}/ /g; say $x'
A B C

Note that the source code contains only 7-bit ASCII characters. There is no need for the utf8 pragma here.

— Ken

Replies are listed 'Best First'.
Re^2: Safely removing Unicode zero-width spaces and other non-printing characters
by mldvx4 (Friar) on Dec 04, 2019 at 09:30 UTC

    The source of the data is a large number of RSS feeds which point to an even larger number of individual web pages. The latter are what are harvested and processed with a few scripts. So normalizing the data at the source is not an option, since so few webmasters even publish mail addresses, let alone fix their sites.

    Maybe there is a CPAN module or simple method to forcibly convert the incoming data (or the outgoing data) to UTF? Just calling it UTF-8 fails, too:

    binmode(STDOUT, ":encoding(utf8)");

    Is there a way to find out whether it should be labeled UTF-16 instead? If so, how do I force that mode?

    $ apt-cache policy perl | head -n 3
    perl:
      Installed: 5.28.1-6
      Candidate: 5.28.1-6

      It is input decoding that matters here. There is no way to convert incoming data to UTF without handling the original encoding of each individual input. The issue with harvesting from different sites is that their encodings can be 1) different and 2) simply broken for a few of the sites.

      Your code snippet s/\x{00A0}/ /gm; just works if all input has been properly decoded into Perl's "character" semantics (I avoid calling it UTF-something because that is misleading), protected by the error handling of the Encode module.

      Of course, you need to encode your output, too. binmode(STDOUT, ":encoding(utf8)") converts Perl's characters into a valid UTF-8 stream.
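      A minimal sketch of that decode step with Encode's error handling (the $charset and $octets values are hypothetical stand-ins for one fetched page):

      use strict;
      use warnings;
      use Encode qw(decode);

      my $charset = 'UTF-8';      # whatever the site claims (assumption)
      my $octets  = "caf\xE9";    # raw Latin-1 bytes, malformed as UTF-8

      # FB_CROAK makes decode() die on malformed input instead of
      # silently inserting substitution characters.
      my $text = eval { decode($charset, $octets, Encode::FB_CROAK) };
      if (!defined $text) {
          warn "decoding as $charset failed: $@";
          # fall back here: retry with another encoding, or skip the page
      }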

      The source of the data is a large number of RSS feeds which point to an even larger number of individual web pages.

      Well, RSS is XML, and XML files should specify the encoding in the XML declaration, and XML parsers such as XML::LibXML do respect that declaration. However, it's possible that the XML declaration is missing or incorrect. In cases like that, one thing you might try is Encode::Guess, keeping in mind that it's just a guess. Or, if you're getting these feeds from web servers, you might look at the response headers for a hint.
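      A hedged sketch of the Encode::Guess route (the suspect list is an assumption; add whichever encodings your sources plausibly use):

      use strict;
      use warnings;
      use Encode::Guess qw(latin1);    # suspects beyond the built-in defaults

      my $octets = "caf\xE9";          # hypothetical raw bytes from one page

      # guess() returns an Encode::Encoding object on success, or a
      # diagnostic string when no single suspect matches.
      my $decoder = Encode::Guess->guess($octets);
      if (ref $decoder) {
          my $text = $decoder->decode($octets);
          print 'guessed ', $decoder->name, "\n";
      } else {
          warn "no reliable guess: $decoder";
      }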

        Yes, the RSS reads fine of course.

        The problem is with the pages the RSS points to. HTML and XHTML are a hot mess. Even when a respectable CMS is used, the authors can still paste in something weird. It looks like I may have to treat each site individually, and making individual filters might not be worth the effort. However, I am hoping for an automated way to normalize incoming text.
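        One automated route, assuming the pages are fetched with LWP, might be HTTP::Message's decoded_content, which applies the charset declared by the server before handing back characters:

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua  = LWP::UserAgent->new;
        my $res = $ua->get('https://example.com/article.html');    # hypothetical URL
        die $res->status_line, "\n" unless $res->is_success;

        # decoded_content returns Perl characters rather than raw bytes,
        # decoding with the charset the server declared.
        my $html = $res->decoded_content;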
