PerlMonks  

ignore UTF codes

by kettle (Beadle)
on Mar 16, 2006 at 03:54 UTC ( [id://537038]=perlquestion )

kettle has asked for the wisdom of the Perl Monks concerning the following question:

I have a long text document full of entries like the following:
Canciones\251STAMPID\253\277De quien es la cancion "STAND BY ME"*4 the cause

I want:
De quien es la cancion "STAND BY ME"*4 the cause

But Perl seems to treat the numbered codes as special characters; for example,
s/\\253//g;

doesn't do anything. How can I get Perl to treat these character codes as plain text? (I think this is UTF-8, but I'm not entirely sure.)
Thanks!

Replies are listed 'Best First'.
Re: ignore UTF codes
by zer (Deacon) on Mar 16, 2006 at 04:18 UTC
    s/\\253//g

    The '\\' in the pattern matches a single literal backslash character ('\').
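To make the distinction concrete, here is a minimal sketch of the two cases. It assumes (this is not confirmed by the original post) that the data holds either a real backslash followed by digits, or a single byte that a viewer merely displays as \253. The sample strings are shortened and hypothetical.

```perl
use strict;
use warnings;

# Case 1: the text really contains a backslash followed by the digits 2, 5, 3.
my $literal = 'STAMPID\253De quien';
(my $fixed_literal = $literal) =~ s/\\253//g;   # \\ matches one backslash

# Case 2: the text contains the single byte 0xAB (octal 253),
# which some tools display as \253.
my $byte = "STAMPID\xABDe quien";
(my $unchanged  = $byte) =~ s/\\253//g;         # no backslash in the data: no match
(my $fixed_byte = $byte) =~ s/\xAB//g;          # match the byte itself

print "$fixed_literal\n$fixed_byte\n";
```

If the second case applies, the pattern with '\\' never matches, which would explain a substitution that silently does nothing.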

Re: ignore UTF codes
by ayrnieu (Beadle) on Mar 16, 2006 at 04:20 UTC

    I cannot reproduce your problem. However, please see whether 'no locale' helps. Please also study perlunicode; someone else may have a better answer.

    Works for me:

    $ perl -le 'print "hello\253there"' | perl -pe 'tr/\253//d'
    hellothere

Re: ignore UTF codes
by zer (Deacon) on Mar 16, 2006 at 06:09 UTC
    $_="Canciones\\251STAMPID\\253\\277De quien es la cancion \"STAND BY ME\"*4 the cause";
    s/\\[0-9][0-9][0-9]//g;
    print;
      Thanks for the help!

      Actually, the above doesn't work, because '\251' (and all the other similarly structured codes) was being interpreted by Perl as A SINGLE CHARACTER. Which is weird.

      However, it turns out I've found a solution. The problem was that the file was not, in fact, encoded in UTF-8, but in Western (ISO-8859-1).

      I used xemacs to translate the page into UTF-8, and my problems more or less disappeared -- well, perl finally, grudgingly decided to recognize all the odd characters and I was able to get some useful work done!

      Thanks again for the help!
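The same conversion can probably be done inside Perl rather than in an editor. The following is a rough sketch using the core Encode module, assuming the input really is ISO-8859-1 as described above; the sample string is a shortened, hypothetical version of the line from the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they would be read from an ISO-8859-1 file.
my $bytes = "Canciones\xA9STAMPID\xAB\xBFDe quien es la cancion";

# Decode the bytes into Perl characters, naming the real encoding.
my $text = decode('ISO-8859-1', $bytes);

# String functions such as uc now operate on characters, not raw bytes.
my $upper = uc $text;

# Encode to UTF-8 when writing the result out.
my $utf8 = encode('UTF-8', $text);
print length($bytes), " bytes in, ", length($utf8), " bytes out\n";
```

When reading from a file, opening the handle with an encoding layer such as '<:encoding(ISO-8859-1)' should have the same effect as calling decode on each line.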
Re: ignore UTF codes
by kettle (Beadle) on Mar 16, 2006 at 04:35 UTC
    Could it be that my text document is DOS-formatted? Perl does not seem to be recognizing the UTF codes at all. I cannot do anything to access them, and when I try to manipulate the line, most of the time I get a message like this:

    Malformed UTF-8 character (overflow at 0xa0c75a60, byte 0x70, after start byte 0xbf) in uc at ./qNa.pl line 15, <IN> line 25.

    joe
      You can access UTF codes; it depends on your situation. Usually, in my experience, DOS files have been straight ASCII. Let me see if I can find something for you.

      -------------------------------------------
      OK, there is a utf8::is_utf8() function. It tells you whether Perl has flagged a string internally as UTF-8. So, for example:

      $a=chr(0x74); print utf8::is_utf8($a)?"yes":"no";
      $a=chr(0x470); print utf8::is_utf8($a)?"yes":"no";

      The output is "noyes".

      If you can provide some more code, I can give you a more specific answer.

        Thanks! I also tried converting the file with dos2unix, but that did nothing to solve my problem :-(
        In the example line I gave before:

        Canciones\251STAMPID\253\277De quien es la cancion "STAND BY ME"*4 the cause

        I would simply like to delete the '\253' from the line (there is more that I will eventually want to do, but if I could complete this simple action, the rest ought to be a piece of cake). My first attempt at this was:

        $_ = s/\\253//g;

        This failed miserably. The problem is that the '\253' is being treated as a single character (i.e., if I try to highlight just one digit, it highlights the entire four-character sequence). I'm trying to write a C++ program to convert the codes to ASCII.
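One further detail in the attempt above may be worth checking: s/// edits $_ in place and returns the number of substitutions it made, so assigning its result back to $_ discards the text. The sketch below illustrates this, assuming the '\253' is the single byte 0xAB; the sample string is a shortened, hypothetical version of the line.

```perl
use strict;
use warnings;

my $line = "Canciones\xA9STAMPID\xAB\xBFDe quien es la cancion";

# Assigning the result of s/// back to $_ replaces the text with the
# substitution count (or an empty string when nothing matched).
$_ = $line;
$_ = s/\xAB//g;
my $clobbered = $_;      # the count, not the edited text

# Without the assignment, $_ keeps the edited text.
$_ = $line;
s/\xAB//g;
my $edited = $_;

print "clobbered: $clobbered\n";
print "edited is ", length($line) - length($edited), " character shorter\n";
```

Dropping the leading "$_ =" and matching the byte itself (for example with \xAB) should be enough to delete the character from the line.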
