
Safely removing Unicode zero-width spaces and other non-printing characters

by mldvx4 (Friar)
on Dec 04, 2019 at 05:15 UTC [id://11109644]

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

(I've got a lot of data that I'm not able to duplicate inline here. When I try to paste it in, the problems go away. However, since this is for several automated scripts, manual intervention is not an option.)

When I try to do a replacement s/\x{00A0}/ /gm; on my data, many other characters go bad, and end up looking like this:

But he,<C2> along with so many of his<C2> fellow

Note the <C2>s wedged in there. I do not have a "use utf8;" in place in this script, because if I add it, then it screws up nearly all the UTF-8 characters, in this way for an example:

Desktop Thatâ<U+0080><U+0099>s More Elegant

Which should look like this instead:

Desktop That’s More Elegant

How should I go about approaching this problem? I think the solution might be to turn on UTF-8, but then what do I do to prevent the data from getting completely ruined?

Replies are listed 'Best First'.
Re: Safely removing Unicode zero-width spaces and other non-printing characters
by haukex (Archbishop) on Dec 04, 2019 at 08:30 UTC

    It sounds to me like perhaps some of your strings were not decoded properly when you loaded them into Perl. Note that you can still provide an SSCCE: at the very least inspect your strings (and post them here) using Data::Dumper with $Data::Dumper::Useqq=1; or with Data::Dump, or even better, use hexdump or od to show your input files, and Devel::Peek for the strings; I gave an example here. As for posting on PerlMonks, you can post Unicode as long as you put it in <pre> instead of <code> tags (you'll have to escape <, >, and & manually though).
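A minimal sketch of the inspection suggested above, using the apostrophe from the question as sample data. The helper name `dump_string` is hypothetical; `Data::Dumper` and `$Data::Dumper::Useqq` are real. It shows how the same text looks before and after decoding, which is usually enough to tell whether a string still holds raw bytes.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode);

# Hypothetical helper: dump a string with non-ASCII content shown as escapes.
sub dump_string {
    my ($str) = @_;
    local $Data::Dumper::Useqq  = 1;   # escape non-printable and non-ASCII
    local $Data::Dumper::Terse  = 1;   # omit the "$VAR1 = " prefix
    local $Data::Dumper::Indent = 0;   # single line, no trailing newline
    return Dumper($str);
}

# Raw UTF-8 bytes, as read from a file without any decoding layer.
my $bytes = "That\xE2\x80\x99s More Elegant";
# The same text after decoding: one character U+2019 instead of three bytes.
my $chars = decode('UTF-8', $bytes);

print dump_string($bytes), "\n";   # the apostrophe appears as three escaped bytes
print dump_string($chars), "\n";   # the apostrophe appears as one \x{2019} escape
```

If the dump of a string you believe is text shows three byte escapes where one character should be, that string was never decoded.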

    I do not have a "use utf8;" in place in this script, because if I add it, then it screws up nearly all the UTF-8 characters

    That's strange, since utf8 only affects how your source code is interpreted. If you have any non-ASCII characters in your source, then I'd strongly recommend to make sure the file is properly encoded as UTF-8 and then use utf8;. To look at the source file and verify its encoding, you might also be interested in my script enctool.

    And as kcott said, this also may depend on the Perl version you're using, for example, there's The 'unicode_strings' feature.

      Ah. Only one of the scripts has some non-ASCII in its output. In fact it is the one script which seems to be the trouble. The Bitbucket link leads to a blank page though. Is there another site?

        The Bitbucket link leads to a blank page though. Is there another site?

        It works fine on my end, but try this link instead.

Re: Safely removing Unicode zero-width spaces and other non-printing characters
by kcott (Archbishop) on Dec 04, 2019 at 06:04 UTC

    G'day mldvx4,

    From the examples you've shown, it looks like you may have a conflict between the UTF-16 used internally by MSWin and the UTF-8 used internally by Perl. That's a guess but it's the type of issue that I've seen in the past; you may have a different OS using UTF-? but that could well have similar problems.

    The utf8 pragma only relates to your Perl source code. Have a look through that documentation for more details; and do note the emboldened text near the start of the DESCRIPTION.

    Different versions of Perl have different levels of Unicode support. Check your version and see if its support (or lack thereof) might be related to your problems.

    You don't indicate the source of your input data nor the target for the output. You may need to convert one or both within your script.

    Take a look at the perl manpage. Under Reference Manual, you'll see a lot of links like "perluni*" — pick ones that are appropriate for your level of Unicode knowledge and read on from there.

    With more information regarding OS, Perl version, I/O handling and so on; along with some sample code and input/output data; you may get a better answer.

    Addendum: Regarding the substitution at the start of your post. I ran this quick test:

    $ perl -E 'my $x = "A\N{NO-BREAK SPACE}B\N{NO-BREAK SPACE}C"; $x =~ s/\x{00A0}/ /g; say $x'
    A B C

    Note that the source code only contains 7-bit ASCII characters. There is no need for the utf8 pragma here.

    — Ken

      The source of the data is a large number of RSS feeds used which point to an even larger number of individual web pages. The latter are what are harvested and processed with a few scripts. So normalizing the data at the source is not an option, since so few webmasters even publish mail addresses let alone fix their sites.

      Maybe there is a CPAN module or simple method to forcibly convert the incoming data (or outgoing data) to UTF? Just calling it UTF-8 fails, too: binmode(STDOUT, ":encoding(utf8)"); Is there a way to find out if it should be labeled UTF-16 instead? If so then how to force that mode?

      $ apt-cache policy perl | head -n 3
      perl:
        Installed: 5.28.1-6
        Candidate: 5.28.1-6

        It is input decoding which matters here. There is no way to convert incoming data to UTF without handling the original encoding of each individual input. The issue with harvesting from different sites is that the encoding of these sites can be 1) different and 2) just broken for a few of the sites.

        Your code snippet s/\x{00A0}/ /gm; works as long as all input has been properly decoded into Perl's "character" semantics (I avoid calling it UTF-something because that is misleading), with the decoding protected by the error handling of the Encode module.

        Of course, you need to encode your output, too. binmode(STDOUT, ":encoding(utf8)") converts Perl's characters into a valid UTF-8 stream.
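The decode, substitute, encode sequence described above can be sketched as follows. This assumes the input really is UTF-8; the helper names `replace_nbsp` and `replace_nbsp_undecoded` are hypothetical. The second helper is there only to show how the stray C2 byte from the question can arise when the substitution runs on undecoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: bytes in, bytes out, substitution done on characters.
sub replace_nbsp {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input instead of passing it through.
    # With a CHECK value decode() may modify its argument; $bytes is a local copy.
    my $text = decode('UTF-8', $bytes, Encode::FB_CROAK);
    $text =~ s/\x{00A0}/ /g;          # matches the character U+00A0
    return encode('UTF-8', $text);    # back to bytes for output
}

# Without decoding, U+00A0 is the byte pair C2 A0, and the pattern
# matches only the A0 byte, leaving C2 behind.
sub replace_nbsp_undecoded {
    my ($bytes) = @_;
    $bytes =~ s/\x{00A0}/ /g;
    return $bytes;
}

my $input = "But he,\xC2\xA0along";   # UTF-8 bytes for "But he,<NBSP>along"
print replace_nbsp($input), "\n";     # prints "But he, along"
```

Calling `replace_nbsp_undecoded` on the same input returns the bytes `But he,` followed by C2 and a plain space, which matches the damage shown in the question.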

        The source of the data is a large number of RSS feeds used which point to an even larger number of individual web pages.

        Well, RSS is XML, and XML files should specify the encoding in the XML declaration, and XML parsers such as XML::LibXML do respect that declaration. However, it's possible that the XML declaration is missing or incorrect. In cases like that, one thing you might try is Encode::Guess, keeping in mind that it's just a guess. Or, if you're getting these feeds from web servers, you might look at the response headers for a hint.
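A hedged sketch of the Encode::Guess fallback mentioned above. `guess_encoding` is the function that module exports; the wrapper `decode_by_guess` is a hypothetical name. With no extra suspects the module only considers ASCII, UTF-8 and BOM-marked UTF-16/32, so anything else has to be passed in explicitly, and ambiguous input still fails.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

# Hypothetical helper: returns decoded text, or undef if no single guess was possible.
sub decode_by_guess {
    my ($bytes, @extra_suspects) = @_;
    my $enc = guess_encoding($bytes, @extra_suspects);
    # On failure guess_encoding() returns an error string, not an encoding object.
    return unless ref $enc;
    return $enc->decode($bytes);
}

my $text = decode_by_guess("That\xE2\x80\x99s More Elegant");
print defined $text ? "decoded " . length($text) . " characters\n"
                    : "could not guess the encoding\n";
```

Since this is a guess, a declared encoding from the XML declaration or the HTTP response headers should take priority when one is available.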

Re: Safely removing Unicode zero-width spaces and other non-printing characters
by ikegami (Patriarch) on Dec 05, 2019 at 14:54 UTC

    For starters, U+00A0 is not a zero-width space; it's a (normal-width) non-breaking space.

    Furthermore, as a normal-width space, it isn't a non-printing character. That is to say, it is a printing character.

    On to your question. To remove NBSP and non-printing characters, you can use the following:

    s/[\N{NBSP}\P{Print}]//g

    (In lieu of \N{NBSP}, one can use \xA0 or \x{A0} or \N{U+A0} or ...)
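A small sketch of that substitution applied to already-decoded text. It uses \x{A0} in place of \N{NBSP}, which the reply lists as equivalent; the helper name `strip_nbsp_and_nonprinting` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper; expects decoded text (characters), not bytes.
sub strip_nbsp_and_nonprinting {
    my ($text) = @_;
    $text =~ s/[\x{A0}\P{Print}]//g;
    return $text;
}

# U+00A0 is removed explicitly; U+0007 (BEL) is removed as a non-printing character.
print strip_nbsp_and_nonprinting("A\x{A0}B\x{07}C"), "\n";   # prints "ABC"
```

Tabs and newlines are control characters, so this also removes them; split the text into lines first if line structure matters. Which zero-width characters \P{Print} covers is worth checking on your own Perl, since this sketch does not assume any particular answer.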

    The above expects Unicode characters (decoded text). You are providing encoded text instead (bytes). You need to properly decode your inputs and encode your outputs.

    For example, if your source code is encoded using UTF-8 rather than ASCII, you want:

    use utf8;

    For example, the following causes STDIN, STDOUT and STDERR to be decoded/encoded automatically, and it sets the default encoding for files opened in scope:

    use open ':std', ':encoding(UTF-8)';
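The same layers can also be given per filehandle in the mode argument of open. The sketch below does that with in-memory filehandles so it runs without any files; the helper name `filter_utf8_bytes` is hypothetical, and it assumes the input is valid UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: reads UTF-8 bytes through a decoding layer,
# edits characters, and writes them back through an encoding layer.
sub filter_utf8_bytes {
    my ($in_bytes) = @_;
    my $out_bytes = '';
    open my $in,  '<:encoding(UTF-8)', \$in_bytes  or die "open in: $!";
    open my $out, '>:encoding(UTF-8)', \$out_bytes or die "open out: $!";
    while (my $line = <$in>) {
        $line =~ s/\x{00A0}/ /g;   # $line holds characters here, not bytes
        print {$out} $line;
    }
    close $in;
    close $out or die "close out: $!";
    return $out_bytes;
}

my $result = filter_utf8_bytes("caf\xC3\xA9\xC2\xA0ok\n");
print length($result), " bytes written\n";
```

With real files the only change is replacing the scalar references with file names; `use open` simply makes these layers the default so they do not have to be repeated on every open.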

    Failing to properly decode your inputs and encode your outputs explains the results you are seeing.

Re: Safely removing Unicode zero-width spaces and other non-printing characters
by harangzsolt33 (Chaplain) on Dec 04, 2019 at 14:09 UTC
    Desktop Thatâ<U+0080><U+0099>s More Elegant

    Which should look like this instead:

    Desktop That’s More Elegant

    You know, in HTML, it is possible to insert codes that produce UTF characters on the screen, and they exist in case you want the source code to be simple ASCII characters only. No UTF. I prefer that, because as you said, the UTF characters can mess up the code. For example, the above text should be:

    Desktop That&rsquo;s More Elegant

    How to encode UTF characters in HTML

    If I had the same problem, I would write a perl sub that replaces all these specific characters with the HTML equivalent first, and then just remove all 00 characters from the entire text and deal with the spaces and line breaks last.

      in HTML, it is possible to insert codes that produce UTF characters on the screen

      That's a possibility. However, there are also escape codes to allow representing arbitrary Unicode characters, such as "\N{U+NNNN}", which are implemented natively in Perl.

      I would write a perl sub that replaces all these specific characters with the HTML equivalent first

      No need to write a function yourself: HTML::Entities.
