jck has asked for the wisdom of the Perl Monks concerning the following question:

i'm trying to clean up my CGI driven content with the following, e.g.:
$in{'content'} =~ s//’/g;
if you can't see it, the character that is being replaced is a right curly single quote

trouble is, it's not replacing anything, the right curly quote is still there after i do the update.

any suggestions?

Replies are listed 'Best First'.
Re: zapping gremlins
by imp (Priest) on Jul 02, 2006 at 04:55 UTC
    I suspect that the data is coming from MS Word, which loves to use custom quote and dash symbols.

    If you copy the character and run something like this:

    perl -e 'print ord(shift) . "\n"' "’"

    You should get 226 (hex E2, which on a UTF-8 terminal is the first byte of the character's UTF-8 encoding).
    But if you iterate over the characters actually sent in the form data, you will see something different: decimal 146, hex 92.

    You could match this with \x92

    $in{'content'} =~ s/\x92/’/g;
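    To see why the two numbers differ, here is a minimal sketch using the core Encode module. It assumes the form data arrives as Windows-1252 bytes and that the terminal where you pasted the character uses UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The byte a Windows-1252 client sends for a right curly single quote.
my $cp1252_byte = "\x92";

# Decoding it as cp1252 yields the Unicode character U+2019.
my $char = decode('cp1252', $cp1252_byte);
printf "code point: U+%04X\n", ord $char;              # U+2019

# The same character encoded as UTF-8 is three bytes long.
my $utf8 = encode('UTF-8', $char);
printf "utf8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8;   # E2 80 99

# ord() on the byte string only looks at the first byte.
printf "first byte: %d\n", ord $utf8;                  # 226
```

    So a pattern containing the pasted (UTF-8) character looks for the bytes E2 80 99, while the string holds the single byte 92, and the substitution never matches.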
    I do the following for msword character stripping:
    # Somewhere in an init function I define the hex value => string mapping
    # I find it easier to edit this way
    my %seed = (
        '82' => ',',
        '83' => '<em>f</em>',
        '84' => ',,',
        '85' => '...',
        '88' => '^',
        '89' => ' /',
        '8B' => '<',
        '8C' => 'Oe',
        '91' => '`',
        '92' => '\'',
        '93' => '"',
        '94' => '"',
        '95' => '*',
        '96' => '-',
        '97' => '--',
        '98' => '<sup>~</sup>',
        '99' => '<sup>TM</sup>',
        '9B' => '>',
        '9C' => 'oe',
    );

    # Build a mapping of the hex code to string lookup table.
    # I find it to be less error prone than maintaining it manually
    my %msword_replace = ();
    while (my ($hex, $replace_with) = each %seed) {
        $msword_replace{chr(hex($hex))} = $replace_with;
    }

    # Define a list of hex codes that will be used as the search pattern
    # Then create a regex with alternation for these codes
    my @hex_codes = map {'\x' . $_} keys %seed;
    $msword_re = sprintf "(%s)", join('|', @hex_codes);

    # And then later when I need to do the replacing:
    $data =~ s/$msword_re/$msword_replace{$1}/g;
    There is probably a commonly used module for doing this, but hope this example helps.
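    If the goal is numeric entities rather than ASCII stand-ins, one alternative to a hand-maintained table is to let the core Encode module do the lookup. This is only a sketch, not the code above: it assumes the input is a byte string in Windows-1252, and cp1252_to_entities is a hypothetical helper name. Bytes that Windows-1252 leaves undefined are not handled specially here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace each byte in the 0x80-0x9F range with a
# numeric character reference for the character it means in cp1252.
# All other bytes are left untouched.
sub cp1252_to_entities {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\x9F])/sprintf '&#%d;', ord decode('cp1252', "$1")/ge;
    return $bytes;
}

my $sample = "It\x92s \x93quoted\x94";
print cp1252_to_entities($sample), "\n";   # It&#8217;s &#8220;quoted&#8221;
```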
Re: zapping gremlins
by chromatic (Archbishop) on Jul 02, 2006 at 04:39 UTC

    Use a module or even CGI's escapeHTML() function perhaps.

Re: zapping gremlins
by davido (Cardinal) on Jul 02, 2006 at 04:12 UTC

    I suspect that the ' character you're trying to substitute has a different ordinal value or encoding than the one actually contained in the string being acted upon. Could there be a character encoding issue here? I guess we need the actual input data and a little more of a snippet that reproduces the problem so as to provide a more definitive answer.


Re: zapping gremlins
by TedPride (Priest) on Jul 02, 2006 at 07:52 UTC
    Won't this fix your problem?
    use HTML::Entities;
    encode_entities($mystr);
    Granted, you might still want the curly quotes replaced with flat quotes, but if your object is just to make the text display properly, this should do the job.
      I agree. I think when the OP refers to "clean up" he may need decode_entities. It is worth noting that decode_entities returns utf8. It can bite you if you aren't expecting it.

      My rule of thumb is either to not decode (leave the HTML as it is) or, if you're not otherwise working with utf8, to do a manual conversion similar to what imp discusses above.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use HTML::Entities;

      print encode_entities(chr(0x92)), "\n";
      print encode_entities(chr(0x2019)), "\n";
      print sprintf '%02X', ord decode_entities('&rsquo;');

      __DATA__
      &#146;
      &rsquo;
      2019

      update: Misread the question.

Re: zapping gremlins
by TedPride (Priest) on Jul 02, 2006 at 19:33 UTC
    No, the original post shows him changing characters to codes, which is what encode_entities does.