PerlMonks  

Re: Two octal values for eacute?

by haukex (Archbishop)
on May 23, 2020 at 21:16 UTC


in reply to Two octal values for eacute?

Please see "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The character U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) is encoded in Latin-1, Latin-9, and CP-1252 as the single byte \xE9 (\351), but when encoded with UTF-8, it's the two-byte sequence \xC3\xA9 (\303\251).
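To see those byte values directly, here is a minimal sketch using the core Encode module. The helper name dump_bytes is made up for this illustration; the encoding names are the ones mentioned above.

```perl
use strict;
use warnings;
use Encode qw/encode/;

# dump_bytes is a hypothetical helper: encode $chars with $encoding and
# format every resulting byte with the given sprintf format.
sub dump_bytes {
    my ($encoding, $chars, $format) = @_;
    my $bytes = encode($encoding, $chars,
        Encode::FB_CROAK | Encode::LEAVE_SRC);
    return join '', map { sprintf($format, ord $_) } split //, $bytes;
}

my $eacute = "\N{U+00E9}";
for my $enc (qw/ Latin-1 Latin-9 CP-1252 UTF-8 /) {
    printf "%-8s hex %-9s octal %s\n", $enc,
        dump_bytes($enc, $eacute, '\\x%02X'),
        dump_bytes($enc, $eacute, '\\%03o');
}
```

The three single-byte encodings should all report \xE9 (\351), and UTF-8 should report \xC3\xA9 (\303\251), which are the two octal forms the question was about.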

In other words, some of your files are encoded with one of the single-byte encodings and others with UTF-8, so you'll have to specify the correct encoding when opening each of them, e.g. open my $fh, '<:raw:encoding(UTF-8)', $filename or die "$filename: $!"; (see "open" Best Practices). That way, the data is decoded as you read it into Perl, and your Perl strings will always contain the correct characters (e.g. "\N{U+00E9}").
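As a rough sketch of that idea: the snippet below writes the same word to two files in different encodings, then reads each one back with the matching layer. The file names old.txt and new.txt and the helper slurp_decoded are made up for this example.

```perl
use strict;
use warnings;

# slurp_decoded is a hypothetical helper: read a whole file, decoding it
# from the given encoding while reading.
sub slurp_decoded {
    my ($filename, $encoding) = @_;
    open my $fh, "<:raw:encoding($encoding)", $filename
        or die "$filename: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    return $text;
}

# Made-up sample files: the same word written in two encodings.
my %sample = ( 'old.txt' => 'CP-1252', 'new.txt' => 'UTF-8' );
for my $file (sort keys %sample) {
    open my $out, ">:raw:encoding($sample{$file})", $file
        or die "$file: $!";
    print $out "caf\N{U+00E9}\n";
    close $out;
}

for my $file (sort keys %sample) {
    my $text = slurp_decoded($file, $sample{$file});
    printf "%s: %d bytes on disk, %d characters, e-acute found: %s\n",
        $file, (-s $file), length($text),
        ($text =~ /\N{U+00E9}/ ? 'yes' : 'no');
}
```

The files differ on disk, but once each is decoded with its own encoding the two Perl strings are identical, so a single regex for \N{U+00E9} matches both.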

If you don't know the encoding of the input files, you could use a module like Encode::Guess, or I've written a tool that tries to be a little smarter: enctool - it allows you to narrow down the guesses by specifying what characters are expected to appear in the input file using e.g. the --one-of='\xE9' option. Some files, like HTML and XML, will often include a definition of the character set in their source, and (except for the cases where that declaration is incorrect) the appropriate parser modules (e.g. XML::LibXML) should honor that encoding.
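For illustration, here is a small sketch of calling Encode::Guess on raw bytes, with Latin-1 added as an extra suspect. The helper guess_name is made up for this example, and what the comments say about the results is my understanding of the module rather than a guarantee.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# guess_name is a hypothetical helper: return the guessed encoding's name,
# or the message Encode::Guess gives when it cannot settle on one encoding.
sub guess_name {
    my ($bytes) = @_;
    my $guess = guess_encoding($bytes, 'latin1');
    return ref($guess) ? $guess->name : "no single guess ($guess)";
}

# A lone \xE9 is not valid UTF-8, so only Latin-1 should remain.
print "single-byte: ", guess_name("caf\xE9"), "\n";

# \xC3\xA9 is valid UTF-8, but also a valid (if odd) Latin-1 string,
# so the module should report that it cannot decide.
print "two-byte:    ", guess_name("caf\xC3\xA9"), "\n";
```

Almost any byte sequence is valid in a single-byte encoding, so guessing among them usually needs extra hints about which characters you expect to see.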

As an aside, if you're putting Unicode characters in your Perl source, you should save it as UTF-8 and add use utf8; at the top of the file. If you're writing Unicode characters to the console, add use open qw/:std :utf8/;. And of course, always use strict and warnings; a recent version of Perl is also strongly recommended when working with Unicode.
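Put together, the top of such a script might look like the sketch below. It assumes the file itself is saved as UTF-8 and that the terminal expects UTF-8.

```perl
use strict;
use warnings;
use utf8;                   # the source code of this file is UTF-8
use open qw/:std :utf8/;    # UTF-8 layers by default, incl. STDIN/STDOUT/STDERR

my $word = "café";          # literal non-ASCII character in the source

# With "use utf8", the literal is four characters ending in U+00E9.
print "length: ", length($word), "\n";
print "ends with U+00E9: ", ($word =~ /\N{U+00E9}\z/ ? "yes" : "no"), "\n";
print "$word\n";
```

Without use utf8, Perl would treat the literal as the individual bytes of the file, so the length would come out as 5 instead of 4.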

If you have further issues with encodings when reading files, please see the tips for posting questions in this node.

By the way, why are you looking for "é" characters in the first place? Maybe there's a more efficient way to do what you're doing with your regex, if you tell us what the task is.

Re^2: Two octal values for eacute?
by pianomonious (Novice) on May 23, 2020 at 22:09 UTC

    Thank you haukex!

    I feel like I once skimmed that first link you posted, but that was years ago. I do have some refreshing to do, then.

    You are correct - I do not know the encodings of the text files that I'm reading. They were probably exported as CSV from Excel or created by a Perl script from reading an OpenOffice .ods file: tab-delimited text files created in different ways over the course of 20+ years. That would make sense, though, since it's the older files that have a single-byte eacute, and then all of a sudden the two-byte eacute is the only variety found.

    I will read through the links, "best practices" and all. Much appreciated!!

    Oh, and I was looking for a small set of extended ascii characters to "flatten" (if you will) to an ascii counterpart, as I could not reliably reproduce them - again pointing to the fact that they were probably encoded differently. I used a small subroutine to turn the two differently encoded eacutes into an 'e' to mitigate these headaches. The same sub also translated ellipses to '...', curved left/right double quotes to straight double quotes, long dashes to normal dashes, and so on: all the things that a spreadsheet program automatically substitutes while you're typing. I didn't think about the encoding so much, but instead found octal regexes that could pluck out each of these characters so that I could insert what I felt was a suitable replacement. Nothing personal against the eacute!

    Thank you so much for your time and expertise!

      I was looking for a small set of extended ascii characters to "flatten" (if you will) to an ascii counterpart

      Sounds very much like Text::Unidecode!

      I do not know the encodings of the text files that I'm reading. They were probably exported as CSV from Excel or created by a Perl script from reading an OpenOffice .ods file: tab-delimited text files created in different ways over the course of 20+ years. That would make sense, though, since it's the older files that have a single-byte eacute, and then all of a sudden the two-byte eacute is the only variety found.

      Yes, that does sound likely. Here's a very simple example of how one might tell the difference between three of the encodings I named. Of course, if you have more encodings than this, things can get more complex, and even if these encodings seem to work, you'll probably need to tweak the heuristics in the example below to fit your actual data.

      #!/usr/bin/env perl
      use warnings;
      use strict;
      use open qw/:std :utf8/;
      use Text::Unidecode;
      use Encode;

      print "# Text::Unidecode demo:\n";
      my $test = "\N{U+201C}test\N{U+201D} \N{U+2013} test\N{U+2026}";
      print " original: ", $test, "\n";
      print "asciified: ", unidecode($test), "\n";

      # set up some test data
      my $str = "\N{U+CF} spent 20\N{U+20AC} \N{U+C3}t the c\N{U+AA}f\N{U+E9}\n";
      {
          open my $fh1, '>:raw:encoding(CP-1252)', 'one.txt' or die $!;
          print $fh1 $str;
          close $fh1;
          open my $fh2, '>:raw:encoding(Latin-9)', 'two.txt' or die $!;
          print $fh2 $str;
          close $fh2;
          open my $fh3, '>:raw:encoding(UTF-8)', 'three.txt' or die $!;
          print $fh3 $str;
          close $fh3;
      }

      my $expected_chars   = qr/[\N{U+20AC}]/u;  # heuristic
      my $unexpected_chars = qr/[\N{U+80}]/u;    # heuristic

      for my $file (qw/ one.txt two.txt three.txt /) {
          # slurp the raw file as undecoded bytes
          open my $fh, '<:raw', $file or die "$file: $!";
          my $bytes = do { local $/; <$fh> };
          close $fh;
          my $string;
          # try different encodings
          for my $enc (qw/ UTF-8 Latin-9 CP-1252 /) {
              $string = eval { decode($enc, $bytes,
                  Encode::FB_CROAK|Encode::LEAVE_SRC) };
              if ( defined $string && $string =~ $expected_chars
                      && $string !~ $unexpected_chars ) {
                  print "### $file looks like $enc\n";
                  last
              }
              else { print "### $file is NOT $enc\n" }
          }
          die "Failed to decode $file" unless defined $string;
          print $string;
          print unidecode($string);
      }

      Output (on a terminal with UTF-8 encoding):

      # Text::Unidecode demo:
       original: “test” – test…
      asciified: "test" - test...
      ### one.txt is NOT UTF-8
      ### one.txt is NOT Latin-9
      ### one.txt looks like CP-1252
      Ï spent 20€ Ãt the cªfé
      I spent 20EUR At the cafe
      ### two.txt is NOT UTF-8
      ### two.txt looks like Latin-9
      Ï spent 20€ Ãt the cªfé
      I spent 20EUR At the cafe
      ### three.txt looks like UTF-8
      Ï spent 20€ Ãt the cªfé
      I spent 20EUR At the cafe
      
