http://qs321.pair.com?node_id=251954

oakbox has asked for the wisdom of the Perl Monks concerning the following question:

Okay, I have some web forms that some of my Dutch customers use. My problem is that they occasionally use high bit characters like ë ï é and í. When I redisplay those pages, the system craps out with control characters A<<x0.

So I think, "Aha! I'll use a module to escape those pesky characters into HTML entities". So I run HTML::Entities' encode_entities on my text field inputs.

My problem is that HTML::Entities also escapes HTML characters that I WANT my text entry people to be able to use. I want them to be able to use <p> and <br> to have some basic formatting control.

HTML::Entities allows you to force only some characters to be encoded and to leave others alone. But there's no easy way to complement that list. In other words, there's no function for 'encode everything BUT <> in the incoming string'.

I'm hoping you might be able to save me from creating a whole manual lookup table :)

oakbox

Re: High bit character encoding in HTML
by crenz (Priest) on Apr 21, 2003 at 09:00 UTC

    I had the same problem with German and Chinese pages. I just keep the original input and instead add an appropriate charset header in the HTML head. For the German pages, I use:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

    For the Chinese pages, I use utf8. I believe all the characters used in Dutch should be in iso-8859-1, so you could just use that one.

      *slap hand to forehead*

      Don't I feel like a silly goose. Due to bad design, I've got a lot of print Content statements peppering that particular script. Sure enough, some of them include the charset=iso-8859-1 and some do not. I popped in the appropriate metatags on all of them and presto, everything works as expected.

      This exercise has also pointed out the need for me to centralize my outputs in a sane place, probably at the module level. ++ to crenz for the splash of cold water in my face! :)

      oakbox
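One way the "centralize my outputs" idea could look, as a minimal sketch only. The helper name `html_start` and the `$CHARSET` variable are made up for illustration and are not from oakbox's script; the title is assumed to be trusted text that needs no escaping.

```perl
use strict;
use warnings;

# Hypothetical single source of truth for the page charset.
my $CHARSET = 'iso-8859-1';

# Hypothetical helper: builds the HTTP header and the opening HTML,
# so every page announces the same charset in both places.
sub html_start {
    my ($title) = @_;
    return "Content-Type: text/html; charset=$CHARSET\r\n\r\n"
         . "<html><head>\n"
         . qq{<meta http-equiv="Content-Type" content="text/html; charset=$CHARSET">\n}
         . "<title>$title</title>\n"
         . "</head><body>\n";
}

print html_start('Order form');
```

Every scattered print of a Content-Type line would then be replaced by one call, so a page cannot forget the charset.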

      Hmmm... couldn't you just use utf8 for everything then?

        Yes, you are right. I will be transitioning to UTF-8 as soon as I have the time :) It's actually quite easy, it's just not a priority for me right now. And I recommended iso-8859-1 for him, because that's still the standard all tools (browsers, Perl) can deal with.
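For what the UTF-8 transition might involve on the Perl side, here is a rough sketch using the core Encode module. It assumes the stored form data is ISO-8859-1 bytes; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café" as ISO-8859-1 bytes, standing in for stored form input (assumption).
my $latin1_bytes = "caf" . chr(0xE9);

# Decode the bytes into a Perl character string...
my $text = decode('ISO-8859-1', $latin1_bytes);

# ...then encode explicitly when raw UTF-8 bytes are needed,
my $utf8_bytes = encode('UTF-8', $text);

# or let the output layer encode everything that gets printed.
binmode(STDOUT, ':encoding(UTF-8)');
print "Content-Type: text/html; charset=utf-8\r\n\r\n";
print "<p>$text</p>\n";
```

The charset in the Content-Type header has to change along with the output encoding, otherwise the browser will still misread the bytes.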

Re: High bit character encoding in HTML
by PodMaster (Abbot) on Apr 21, 2003 at 08:04 UTC
    Let me give it a shot ;)
        use HTML::Entities;
        my $crazyhtml = "<p> asdf ".chr(243)." asdf </p>";
        die encode_entities($crazyhtml, "[^><]" );
        __END__
        <p> asdf ó asdf </p> at - line 3.
    Hmm, doesn't appear to work. Well, the good news is, HTML::Entities also exports "%char2entity and the %entity2char hashes which contain the mapping from all characters to the corresponding entities", so you can write your own.

    update: You know, it wouldn't be a bad idea for the author to bold that sentence in the manual, even if the pod is pretty short.
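A sketch of both routes, for anyone landing here later. As far as I can tell from the HTML::Entities docs, the second argument to encode_entities is dropped *inside* a character class, so the extra brackets in "[^><]" are likely what derails the attempt above. Passing a negated range without brackets should encode only the non-ASCII characters and leave the markup alone. The hand-rolled version uses the exported %char2entity hash; the variable names are mine.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities %char2entity);

# Input as Latin-1 bytes: chr(243) is o-acute.
my $html = "<p> asdf " . chr(243) . " asdf </p>";

# No surrounding brackets: the module adds the character class itself.
# '^\n\x20-\x7e' means "everything except newline and printable ASCII",
# so <, >, & and " pass through untouched.
my $via_module = encode_entities($html, '^\n\x20-\x7e');

# Hand-rolled equivalent with %char2entity, falling back to a numeric
# entity for characters that have no named one.
(my $via_hash = $html) =~
    s/([^\n\x20-\x7e])/$char2entity{$1} || sprintf('&#%d;', ord $1)/ge;

print "$via_module\n";
print "$via_hash\n";
```

Both lines should come out as `<p> asdf &oacute; asdf </p>`. Note that leaving < and > alone means users can type any tag at all, not just <p> and <br>, so this is only reasonable for trusted text entry people.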

