Re: HTML::Entities - encode all non-alphanumeric and foreign chars?

by Sidhekin (Priest)
on Sep 23, 2007 at 19:28 UTC (#640604)

in reply to HTML::Entities - encode all non-alphanumeric and foreign chars?

Your problem is easier if you invert how you express the requirements: Rather than encode everything non-English + non-alphanumeric, encode everything but the English alphanumerics. Which ought to be something like this, depending on your idea of "English alphanumerics":

$encoded = encode_entities($input, '\W');

or ...

$encoded = encode_entities($input, '^\w');

or ...

$encoded = encode_entities($input, '^a-zA-Z0-9_');

(That these follow the regex character class syntax is not actually documented, but I'd be surprised to see it stop working. Certainly, as you noted, the use of hyphen to denote character ranges is documented ...)
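To illustrate the "depending on your idea of English alphanumerics" caveat, here is a minimal sketch using plain regex character classes rather than HTML::Entities itself. It assumes the second argument really is treated as the inside of a character class, as described above; the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical sample: "Erdős & co", with U+0151 written as an escape so the
# example does not depend on the source file's encoding.
my $input = "Erd\x{151}s & co";

# Characters each class would flag as unsafe, shown as code points.
my @by_w     = map { sprintf 'U+%04X', ord($_) } $input =~ /([^\w])/g;
my @by_ascii = map { sprintf 'U+%04X', ord($_) } $input =~ /([^a-zA-Z0-9_])/g;

print "[^\\w]:         @by_w\n";
print "[^a-zA-Z0-9_]: @by_ascii\n";
```

On a character string like this one, \w also covers non-ASCII letters, so only the explicit a-zA-Z0-9_ form flags the accented letter. If "foreign" characters must be encoded, the explicit range is probably the safer choice.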

print "Just another Perl ${\(trickster and hacker)},"
The Sidhekin proves Sidhe did it!

Re^2: HTML::Entities - encode all non-alphanumeric and foreign chars?
by punch_card_don (Curate) on Sep 23, 2007 at 20:03 UTC
    Hadn't imagined it would take regex elements...

    The first two

    $encoded = encode_entities($input, '\W');
    $encoded = encode_entities($input, '^\w');
    wouldn't work for me. But I tried
    $encoded = encode_entities($input, '\\W'); # note double backslash
    and that did work, with one little picky issue: it was encoding every whitespace char as well, which, while not technically bothersome, is just not needed.

    So I tried the last formulation with a space added to the list. I had to add it as a simple typed space; it wouldn't accept a \s:

    $encoded = encode_entities($input, '^a-zA-Z0-9_ ');
    and that does it perfectly.


      $encoded = encode_entities($input, '\\W'); # note double backslash

      Single backslash works for me. Sure you weren't trying with a double-quoted string?

      ('\w', '\\w', "\\w" should all be the same string, \w — whereas "\w" is just w.)

      Oh, and the same goes for \s. It should Just Work in a single-quoted string, but in a double-quoted string, you'll need to double the backslash.
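A quick way to check the quoting rules for yourself, as a minimal core-Perl sketch (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $single         = '\w';    # single quotes keep the backslash: \ then w
my $single_doubled = '\\w';   # '\\' collapses to one backslash, so also \ then w
my $double_doubled = "\\w";   # same two characters again

# In double quotes \w is an unrecognized escape; Perl warns and leaves just w.
my $double_plain = do { no warnings 'misc'; "\w" };

printf "%-3s length %d\n", $_, length($_)
    for $single, $single_doubled, $double_doubled, $double_plain;
```

The first three strings compare equal and are two characters long; the last one is the single letter w, which is why a double-quoted "\W" ends up asking the module to encode only the letter W.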

      print "Just another Perl ${\(trickster and hacker)},"
      The Sidhekin proves Sidhe did it!

        Yes, absolutely right - double quotes. Replacing with single quotes makes '\W' work like a charm. Thanks.

        But, just to be finicky and difficult, '\W\s' is still converting spaces to &#32;.


        Ya, of course it was. This is the list of UNSAFE characters to be encoded. So if I include '\W\s', that specifically tells it to encode spaces. What I want is '^\w\s' - anything that's not a word char or a space. Works perfectly now.

        UPDATE 2

        OK, now this is very cool. With this formulation, I can create a very well defined list of what is and is not to be encoded. For example (what I'm using):

        $encoded = encode_entities($input, '^\w\s.\-');
        encodes everything that is NOT a word char, or a space, or a period, or a dash (backslash needed to escape 'cause the dash is part of the module's syntax)
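For anyone wanting to try this, a minimal self-contained sketch follows. It assumes HTML::Entities is installed from CPAN; the sample string is made up, and the exact form of the numeric entities (decimal or hex) may differ between module versions.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Hypothetical sample input with a mix of safe and unsafe characters.
my $input = 'R&D <fast> well-made. 100%';

# Unsafe set: anything that is NOT a word char, whitespace, period or dash.
my $encoded = encode_entities($input, '^\w\s.\-');
print "$encoded\n";

# Decoding should give back the original string.
my $roundtrip = decode_entities($encoded);
print $roundtrip eq $input ? "round trip ok\n" : "round trip differs\n";
```

Words, spaces, the period and the dash pass through untouched, while &, <, > and % come out as entities.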
