PerlMonks  

Re: Safely removing Unicode zero-width spaces and other non-printing characters

by ikegami (Patriarch)
on Dec 05, 2019 at 14:54 UTC ( [id://11109696]=note )


in reply to Safely removing Unicode zero-width spaces and other non-printing characters

For starters, U+00A0 is not a zero-width space; it's a (normal-width) non-breaking space.

Furthermore, as a normal-width space, it isn't a non-printing character. That is to say, it is a printing character.

On to your question. To remove NBSP and non-printing characters, you can use the following:

s/[\N{NBSP}\P{Print}]//g

(In lieu of \N{NBSP}, one can use \xA0 or \x{A0} or \N{U+A0} or ...)
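To see what the substitution actually removes, here is a minimal sketch. The sub name strip_nbsp_and_nonprinting is made up for illustration, and I'm assuming Perl 5.16 or later so that \N{NBSP} resolves without loading charnames explicitly. Note that \P{Print} matches control characters, so tabs and newlines go too.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps the substitution shown above.
# Expects decoded text (characters), not raw bytes.
sub strip_nbsp_and_nonprinting {
    my ($text) = @_;
    $text =~ s/[\N{NBSP}\P{Print}]//g;
    return $text;
}

# NBSP (U+00A0) and BEL (U+0007) are both removed.
my $cleaned = strip_nbsp_and_nonprinting("foo\x{A0}bar\x{07}baz");

# An ordinary space is a printing character and is kept;
# tab and newline are control characters and are removed.
my $spaces = strip_nbsp_and_nonprinting("a b\tc\n");
```

If you want to keep line endings, chomp the line first and add the newline back when printing.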

The above expects Unicode characters (decoded text). You are providing encoded text instead (bytes). You need to properly decode your inputs and encode your outputs.
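The difference is easy to demonstrate with the core Encode module. This is only a sketch, assuming the input is UTF-8; clean_utf8_bytes is a name I made up. In UTF-8, NBSP is the two bytes C2 A0. Run the substitution on the raw bytes and only the A0 byte is removed, leaving a stray C2 behind.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode, clean, re-encode.
# Assumes the bytes are UTF-8.
sub clean_utf8_bytes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);     # bytes -> characters
    $text =~ s/[\N{NBSP}\P{Print}]//g;
    return encode('UTF-8', $text);          # characters -> bytes
}

my $raw = "a\xC2\xA0b";    # "a", NBSP, "b" as UTF-8 bytes

# Wrong: the regex sees the bytes C2 and A0 as two separate characters
# and removes only A0, leaving "a\xC2b" (no longer valid UTF-8).
(my $wrong = $raw) =~ s/[\N{NBSP}\P{Print}]//g;

# Right: decode first, so C2 A0 is seen as the single character U+00A0.
my $right = clean_utf8_bytes($raw);         # "ab"
```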

For example, if your source code is encoded using UTF-8 rather than ASCII, you want:

use utf8;
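A quick way to see what the pragma changes, assuming this snippet is saved as UTF-8: with use utf8 in effect, a non-ASCII string literal is a string of characters; without it, the same literal is a string of bytes.

```perl
use strict;
use warnings;
use utf8;    # tells perl this source file is UTF-8 (assumed here)

# With "use utf8" the literal is 4 characters.
my $chars = length("café");

# Without it, the same literal is seen as 5 bytes ("é" is C3 A9).
my $bytes;
{
    no utf8;
    $bytes = length("café");
}
```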

Likewise, the following causes STDIN, STDOUT and STDERR to be decoded/encoded automatically, and it sets the default encoding for files opened in scope:

use open ':std', ':encoding(UTF-8)';
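Putting the pieces together, a filter might look something like the sketch below. The sub name clean_stream is hypothetical, and I'm assuming UTF-8 input and output. Each line is chomped before the substitution because \P{Print} would otherwise remove the newline as well.

```perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # decode STDIN, encode STDOUT/STDERR

# Hypothetical helper: copy lines from $in to $out, dropping NBSP and
# non-printing characters. Both handles are expected to carry an
# encoding layer, so the lines read here are decoded text.
sub clean_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;                         # keep the line ending out of the regex
        $line =~ s/[\N{NBSP}\P{Print}]//g;
        print {$out} $line, "\n";
    }
    return;
}

# Typical use as a filter:
# clean_stream(\*STDIN, \*STDOUT);
```

Invoked as something like perl clean.pl <in.txt >out.txt (file names made up), the handles are decoded and encoded by the pragma, so no explicit Encode calls are needed.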

Failing to properly decode your inputs and encode your outputs explains the results you are seeing.
