in reply to Re^2: replace string
in thread replace string

No need, XHTML is XML

Re^4: replace string
by sandy1028 (Sexton) on May 18, 2009 at 10:30 UTC
    The input is
    <b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"
    the string should be converted to
    <b></b>Officially called <>“events,”</a>as "never events"
    How can I convert only such characters?
      use strict;
      use warnings;

      # q{} avoids having to escape the double quotes inside the sample text.
      my $input = q{<b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"};
      print "input: $input\n";

      $input =~ s/&ldquo;/“/g;    # change the lines
      $input =~ s/&rdquo;/”/gi;
      $input =~ s/&rsquo;/’/gi;

      print "processed input: $input\n";


      input: <b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"
      processed input: <b></b>Officially called <>“events,”</a> as "never events"

      This was done using your REs and appears to provide the output you are looking for.

      A better idea would be to provide a hex dump of the data, so we know what the actual bytes are:
      echo |hexdump
      00000000: 45 43 48 4F 20 69 73 20 - 6F 6E 2E 0D 0A |ECHO is on.  |
      0000000d;

      echo |od -tx1
      0000000 45 43 48 4f 20 69 73 20 6f 6e 2e 0d 0a
      0000015
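If no hexdump tool is at hand, the same information can be produced from inside Perl. This is only a minimal sketch: `hex_bytes` is a made-up helper name, and the two sample strings are assumptions about what the data might contain (the named entity as plain ASCII text versus the UTF-8 encoding of U+2019).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render each byte of a string as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# The entity written out as ASCII text: seven bytes.
print hex_bytes('&rsquo;'), "\n";

# The character itself, encoded as UTF-8: three bytes.
print hex_bytes(encode('UTF-8', "\x{2019}")), "\n";
```

The two lines of output look nothing alike, which is the point of asking for a dump: it shows which of the two forms is actually in the data.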
Re^4: replace string
by Errto (Vicar) on Jun 02, 2009 at 18:16 UTC

    In case anybody visits this node again, I have a guess what sandy1028 is talking about. HTML (and XHTML) define a set of named character entities such as &nbsp;. Generic XML parsers will not recognize these entities because they are application-specific. So he or she needs, for whatever reason, to translate XHTML-specific named character entities to their corresponding numeric character entities for use in some non-HTML XML application.
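If that guess is right, the translation could be sketched roughly like this. The lookup table is deliberately tiny and the names `%CODE_POINT_FOR`, `numeric_for` and `named_to_numeric` are made up for illustration; a real job would need the full XHTML entity list. Names that are not in the table, including the five entities XML itself predefines, are left alone.

```perl
use strict;
use warnings;

# Hypothetical, incomplete table: entity name => Unicode code point.
my %CODE_POINT_FOR = (
    nbsp  => 160,
    lsquo => 8216,
    rsquo => 8217,
    ldquo => 8220,
    rdquo => 8221,
);

# Return the numeric form for a known name, or the entity unchanged.
sub numeric_for {
    my ($name) = @_;
    return exists $CODE_POINT_FOR{$name}
        ? sprintf('&#%d;', $CODE_POINT_FOR{$name})
        : "&$name;";
}

# Rewrite every named entity in the text that the table knows about.
sub named_to_numeric {
    my ($text) = @_;
    $text =~ s/&([A-Za-z][A-Za-z0-9]*);/numeric_for($1)/ge;
    return $text;
}

print named_to_numeric('<a>&ldquo;events,&rdquo;</a> as &quot;never events&quot;'), "\n";
# prints: <a>&#8220;events,&#8221;</a> as &quot;never events&quot;
```

Leaving unknown names untouched means `&amp;`, `&lt;` and friends survive, so the markup itself is not disturbed.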

    A few years ago I had to do this myself when reformatting some good old-fashioned HTML into something that could be used in an XSLT stylesheet. Man, was that a pain.

Re^4: replace string
by sandy1028 (Sexton) on May 18, 2009 at 09:32 UTC
    How do I encode or decode ’ to ’?

      If you provide sample data you may get more specific guidance. Your description is a bit too vague for anyone to be certain what you have as input and what you want as output. There are many possibilities.

      In addition to the suggestions already given, you may find perlunifaq and Encode helpful. I suspect you don't need Encode for what you are trying to do, but these will give you terminology and context to help you understand about encodings in general and about perl's internal representation of strings, which may be what you are trying to manipulate.

      Perl regular expressions support escape sequences that allow you to specify fairly arbitrary values in your string, including Unicode code points.

      \033        octal char (example: ESC)
      \x1B        hex char (example: ESC)
      \x{263a}    long hex char (example: Unicode SMILEY)
      \cK         control char (example: VT)
      \N{name}    named Unicode character

      It may be that all you need to do is specify the correct characters in your RE, using one of the escapes (probably long hex char or named Unicode character, depending on your preference). But it is possible you will have to decode your input first.
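As a rough illustration of the escapes in use, here is a sketch that assumes the input has already been decoded into Perl characters and that the wanted output is numeric character references. Both are guesses about the original problem, and `quotes_to_numeric` is a made-up name.

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{...} on older perls, harmless on newer ones

# Hypothetical helper: replace curly quotes with numeric character references.
sub quotes_to_numeric {
    my ($text) = @_;
    $text =~ s/\x{201C}/&#8220;/g;                          # long hex char
    $text =~ s/\x{201D}/&#8221;/g;
    $text =~ s/\N{RIGHT SINGLE QUOTATION MARK}/&#8217;/g;   # named Unicode character
    return $text;
}

print quotes_to_numeric("\x{201C}never events\x{201D}"), "\n";
# prints: &#8220;never events&#8221;
```

If the input is still raw bytes, these patterns will not match the multi-byte UTF-8 sequences; that is the case where decoding first matters.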

      If you use Devel::Peek's Dump to dump your input data and post that, then you might get more specific advice.
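A minimal sketch of what that could look like, assuming the data arrives as UTF-8 bytes (the three-byte sample below is an assumption, standing in for whatever the real input is). `Dump` writes to STDERR and shows, among other things, whether the UTF8 flag is set on the string.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode qw(decode);

# Assumed sample: the UTF-8 bytes of U+2019 as they might be read from a file.
my $bytes = "\xE2\x80\x99";

# The same data decoded into a single Perl character.
my $chars = decode('UTF-8', $bytes);

Dump($bytes);    # byte string: three octets, no UTF8 flag
Dump($chars);    # character string: one character, UTF8 flag set

printf "bytes: length %d, chars: length %d\n", length($bytes), length($chars);
```

Posting the STDERR output of both dumps makes it clear which representation a regex would actually be matching against.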

      Buy an encoder?