PerlMonks
Re: Two octal values for eacute?
by haukex (Archbishop) on May 23, 2020 at 21:16 UTC [id://11117185]
The character U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) is encoded in Latin-1, Latin-9, and CP-1252 as the single byte \xE9 (\351), but when encoded with UTF-8, it's the two-byte sequence \xC3\xA9 (\303\251). In other words, some of your files are encoded with one of the single-byte encodings, others are encoded with UTF-8, and you'll have to specify the correct encoding when opening them, as in e.g. open my $fh, '<:raw:encoding(UTF-8)', $filename or die "$filename $!"; (see "open" Best Practices). That way, when you read the data into Perl, the characters are correctly decoded and you'll always have the correct characters (e.g. "\N{U+00E9}") in your Perl strings.

If you don't know the encoding of the input files, you could use a module like Encode::Guess, or I've written a tool that tries to be a little smarter: enctool - it allows you to narrow down the guesses by specifying what characters are expected to appear in the input file using e.g. the --one-of='\xE9' option. Some files, like HTML and XML, will often include a definition of the character set in their source, and (except for the cases where that declaration is incorrect) the appropriate parser modules (e.g. XML::LibXML) should honor that encoding.

As an aside, if you're putting Unicode characters in your Perl source, you should save it as UTF-8 and add use utf8; at the top of the file. If you're writing Unicode characters to the console, add use open qw/:std :utf8/;. And of course always Use strict and warnings, and a recent version of Perl is strongly recommended when working with Unicode.

If you have further issues with encodings when reading files, please see the tips for posting questions in this node.

By the way, why are you looking for "é" characters in the first place? Maybe there's a more efficient way to do what you're doing with your regex, if you tell us what the task is.
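
To make the two byte sequences mentioned at the top concrete, here is a minimal sketch using the core Encode module. It works on in-memory strings instead of your files, and the helper name octal_escapes is just something made up for this illustration. The :encoding(...) layer on open does the same kind of decoding for you when reading from a file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00E9 as a Perl character string (exactly one character).
my $char = "\N{U+00E9}";

# Encode that one character into bytes with two different encodings.
my $latin1 = encode('ISO-8859-1', $char);   # one byte
my $utf8   = encode('UTF-8',      $char);   # two bytes

# Hypothetical helper: show each byte as a three-digit octal escape.
sub octal_escapes {
    my ($bytes) = @_;
    return join '', map { sprintf('\\%03o', ord($_)) } split //, $bytes;
}

print octal_escapes($latin1), "\n";   # prints \351
print octal_escapes($utf8),   "\n";   # prints \303\251

# Decoding each byte string with its matching encoding
# gives back the same single character.
my $from_latin1 = decode('ISO-8859-1', $latin1);
my $from_utf8   = decode('UTF-8',      $utf8);
print "both decode to U+00E9\n"
    if $from_latin1 eq $char && $from_utf8 eq $char;
```

So the two octal values are the same character in two different encodings, and once the input is decoded correctly, a single "\N{U+00E9}" in your regex matches both.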