Re^7: UTF8 versus \w in pattern matching

by hippo (Bishop)
on Jul 06, 2021 at 12:47 UTC


in reply to Re^6: UTF8 versus \w in pattern matching
in thread UTF8 versus \w in pattern matching

Using the formula my $re = qr/^([\/\w]+)/; as the pattern has the same problems.

For clarity, the test script which I provided works just as well with this regex. The point is that it demonstrates that there is nothing wrong with the Perl code which does the regex matching, and therefore the only logical conclusion is that your data is not what you think it is.

Are you decoding your UTF-8 data when you read it from the data files in your script? If not, that is the problem.
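
As a minimal sketch of why decoding matters here, the snippet below matches the pattern from this thread against the same text twice: once as raw UTF-8 bytes and once after Encode::decode. The byte string is typed out by hand purely for illustration; when reading from a file, opening it with a layer such as '<:encoding(UTF-8)' should have the same effect.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "tón" as raw bytes (0xC3 0xB3 encodes U+00F3).
my $bytes = "t\xC3\xB3n";
my $re    = qr/^([\/\w]+)/;

# Undecoded: Perl sees four bytes, and under its default rules the bytes
# 0xC3 and 0xB3 are not word characters, so the match stops after "t".
my ($raw_match) = $bytes =~ $re;

# Decoded: Perl sees three characters, and U+00F3 is a word character.
my $chars = decode('UTF-8', $bytes);
my ($decoded_match) = $chars =~ $re;

printf "undecoded: matched %d of %d\n", length($raw_match), length($bytes);
printf "decoded:   matched %d of %d\n", length($decoded_match), length($chars);
# undecoded: matched 1 of 4
# decoded:   matched 3 of 3
```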

If you can provide a real SSCCE then I'm sure all will become clear.


🦛

Re^8: UTF8 versus \w in pattern matching
by mldvx4 (Friar) on Jul 06, 2021 at 13:54 UTC

    I've been trying for a real SSCCE. Here's one more try: When I fetch one of the source files using 'curl' directly to a file, and then import that file using Emacs, whittle it down to a few letters, like in the following, then I get the output $VAR1 = "t\x{f3}n";. That does not look like UTF-8 to me.

    #!/usr/bin/perl
    use utf8;
    use Data::Dumper;
    use warnings;
    use strict;
    my $a = "tón";
    print Dumper($a),qq(\n);

    Is there a standard way to identify 8-bit, legacy text (which has been mislabeled upstream as UTF-8) and convert it into UTF-8 for continued work with regex?

      It doesn't look like UTF-8 because it isn't supposed to look like UTF-8.

      It has nothing to do with legacy 8-bit.

      Data::Dumper shows Unicode codepoints and not encodings.

      If you open the file in Emacs, it will interpret the data using your preferred coding system, which is UTF-8 for current Emacsen. However, Emacs will fall back to ISO-8859-1 if the file doesn't contain valid UTF-8. Look at the Emacs mode line: if the first character is U, the buffer is UTF-8; if it is 1, it is ISO-8859-1.
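
      The same fallback idea can be sketched in Perl. This is only an illustration, not a description of what Emacs does internally: decode_guess is a hypothetical helper that tries a strict UTF-8 decode first and treats the bytes as ISO-8859-1 if that fails. It is a guess either way, since every byte sequence is valid ISO-8859-1 and some legacy text happens to be valid UTF-8 too.

      ```perl
      use strict;
      use warnings;
      use Encode ();

      # Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback.
      # Returns the decoded characters and the name of the encoding that was used.
      sub decode_guess {
          my ($bytes) = @_;
          # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
          # $bytes untouched so it can be decoded again below.
          my $chars = eval {
              Encode::decode('UTF-8', $bytes,
                             Encode::FB_CROAK() | Encode::LEAVE_SRC());
          };
          return ($chars, 'UTF-8') if defined $chars;
          return (Encode::decode('ISO-8859-1', $bytes), 'ISO-8859-1');
      }

      # The same word, once as UTF-8 bytes and once as ISO-8859-1 bytes.
      for my $bytes ("t\xC3\xB3n", "t\xF3n") {
          my ($chars, $guess) = decode_guess($bytes);
          printf "%-10s -> %d characters\n", $guess, length($chars);
      }
      ```

      Both inputs end up as the same three characters, which is the state a regex with \w wants to see.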

      You can enforce the encoding in Emacs with C-x RET f ISO-8859-1 RET. If you execute the file in this encoding, Perl will croak because you said use utf8; and your source code isn't valid UTF-8.

      If you then omit use utf8; with ISO-8859-1 encoding and run the file, you'll get $VAR1 = 't�n'; because now it is your Terminal which expects UTF-8 and gets an 8-bit character.

      If you then add use Encode; and change the last line to print encode('UTF-8',Dumper($a)); (like you should when using a UTF-8 terminal), then you'll get $VAR1 = 'tón';

      I don't recommend Data::Dumper for such diagnostics because it might or might not use \x{} notation, as you just saw. Getting this right isn't easy, but it is rather straightforward if you keep track of the different places where encoding and decoding can occur.
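
      One possible alternative for this kind of diagnostic, offered as a sketch rather than a recommendation from this thread, is to print every element of a string as a hex number. The helper name codepoints is made up for this example. Its output is plain ASCII, so it looks the same whatever the terminal expects, and it makes the difference between decoded characters and encoded bytes visible.

      ```perl
      use strict;
      use warnings;
      use Encode qw(encode);

      # Hypothetical helper: list each element of a string as a hex code point.
      sub codepoints {
          my ($str) = @_;
          return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
      }

      my $chars = "t\x{f3}n";                 # decoded text: three characters
      my $bytes = encode('UTF-8', $chars);    # encoded text: four bytes

      print "characters: ", codepoints($chars), "\n";   # U+0074 U+00F3 U+006E
      print "bytes:      ", codepoints($bytes), "\n";   # U+0074 U+00C3 U+00B3 U+006E
      ```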

        If you then add use Encode; and change the last line to print encode('UTF-8',Dumper($a)); (like you should when using a UTF-8 terminal), then you'll get $VAR1 = 'tón';

        Assuming the real code is going to use more than one print statement, this suggestion requires calling encode() for every print, which is not DRY programming. An alternative is to call binmode STDOUT, ':encoding(UTF-8)'; once, before any print statements, and then use normal print statements (like print Dumper($a);) throughout. This lets the I/O layer handle the translation from Perl's internal representation to UTF-8-encoded output.
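
        A minimal sketch of that approach, assuming a terminal that expects UTF-8 (the variable name and sample string are only for illustration):

        ```perl
        use strict;
        use warnings;
        use Data::Dumper;

        # Set the output encoding once, before anything is printed.
        binmode STDOUT, ':encoding(UTF-8)';

        my $word = "t\x{f3}n";    # the same three characters as in the thread

        # Plain print statements from here on; the I/O layer does the encoding.
        print "$word\n";
        print Dumper($word);
        ```

        The same layer can be given to open for files, so the encoding is declared once per handle instead of once per print.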
