sed character codes

kettle has asked for the wisdom of the Perl Monks concerning the following question:

I want to match a UTF-8 character in a regex using its code.
Doing something similar with a Unicode code point works fine:

s/\x{00BF}//g;  # delete the upside-down, sentence-initial
                #  question mark used in Spanish (U+00BF)

However, if I try the same thing with the corresponding UTF-8 byte sequence, i.e.,

s/\x{C0BF}//g; # also tried \x{C0 BF} which resulted, as expected,
               #  in an 'Illegal hexadecimal digit...' error message

nothing happens. How can I match these codes in a regex? Thanks!

Update: the UTF-8 code above should be C2 BF.

Re: sed character codes by ikegami (Patriarch) on Mar 30, 2006 at 01:35 UTC
If you wish to search/replace for a UTF-8 sequence, you'll need a string in UTF-8 format. Encode is the module to use to convert the string to UTF-8. Then, you can search for the bytes using `/\xC2\xBF/`. Of course, if the string holds UTF-8 data that was read in as raw bytes (with no decoding layer), it is already in UTF-8, so you should be able to use `/\xC2\xBF/` right away. At least, that's how I understand things. I don't have much experience in this area.
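A minimal sketch of the two approaches described above, assuming the input arrives as undecoded UTF-8 bytes. The sample string and variable names are made up for illustration; `decode` and `encode` come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes, as they would arrive from a file read without a
# decoding layer. "\xC2\xBF" is the UTF-8 encoding of U+00BF.
my $bytes = "\xC2\xBFQu\xC3\xA9 tal?";

# Approach 1: match the two-byte sequence directly in the undecoded string.
(my $stripped_bytes = $bytes) =~ s/\xC2\xBF//g;

# Approach 2: decode to characters, match the code point, then re-encode.
my $chars = decode('UTF-8', $bytes);
$chars =~ s/\x{00BF}//g;
my $stripped_chars = encode('UTF-8', $chars);

print $stripped_bytes eq $stripped_chars ? "same result\n" : "different\n";
```

Which pattern works depends on whether the string holds bytes or decoded characters at the moment of the match; mixing the two is a likely reason for a substitution that silently does nothing.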
Re^2: sed character codes by kettle (Beadle) on Mar 30, 2006 at 01:54 UTC
Thanks! That was exactly what I was looking for. I just needed the formatting convention, which appears to be: \x[0-9A-F]{2}\x[0-9A-F]{2} More generally, do you (or does anyone else) happen to know where I could find this information for other character encodings? joe
Re^3: sed character codes by chanio (Priest) on Mar 30, 2006 at 03:50 UTC
Well, I would first read Encode and its other perldoc links... Landlords production is only eaten by landlords... Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established
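As a rough illustration of what Encode can tell you here, the sketch below prints the `\xNN` byte escapes for one character in a few encodings. `byte_escapes` is a hypothetical helper written for this example, not part of Encode, and the list of encodings is only a sample.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one character and render each resulting
# byte as a \xNN escape suitable for pasting into a regex.
sub byte_escapes {
    my ($encoding, $char) = @_;
    my $bytes = encode($encoding, $char);
    return join '', map { sprintf('\\x%02X', $_) } unpack('C*', $bytes);
}

my $char = "\x{00BF}";    # inverted question mark
for my $encoding ('UTF-8', 'ISO-8859-1', 'UTF-16BE') {
    printf "%-10s %s\n", $encoding, byte_escapes($encoding, $char);
}
```

`Encode->encodings(':all')` lists the encoding names available on a given installation, so the same loop can be pointed at whichever ones matter.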