PerlMonks  

Re^5: UTF8 versus \w in pattern matching

by hippo (Bishop)
on Jul 06, 2021 at 12:20 UTC ( [id://11134703]=note )


in reply to Re^4: UTF8 versus \w in pattern matching
in thread UTF8 versus \w in pattern matching

So there's nothing wrong with your version of perl: it correctly matches the UTF-8 accented characters with \p{Word}, and presumably also with \w if you change the value of $re thus: my $re = qr/^([\/\w]+)/;

Are you definitely decoding the contents of these files when you read them in your perl script?
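To illustrate why decoding matters here, the following is a minimal sketch and not the original poster's script. The sample text, the temporary file, and the helper name first_match are all assumptions made for the example. It reads the same UTF-8 file twice, once as raw bytes and once through a decoding layer, and applies the regex from above to both.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 sample file so the example is self-contained.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':encoding(UTF-8)');
print {$fh} "/cami\x{f3}n/t\x{f3}n rest\n";   # "/camión/tón rest"
close $fh;

my $re = qr/^([\/\w]+)/;

# Hypothetical helper: read the first line through the given PerlIO layer
# and return what the regex captures.
sub first_match {
    my ($file, $layer) = @_;
    open my $in, "<$layer", $file or die "Cannot open $file: $!";
    my $line = <$in>;
    close $in;
    return $line =~ $re ? $1 : undef;
}

my $raw     = first_match($path, ':raw');              # undecoded bytes
my $decoded = first_match($path, ':encoding(UTF-8)');  # decoded characters

binmode(STDOUT, ':encoding(UTF-8)');
printf "raw match length:     %d\n", length $raw;
printf "decoded match length: %d\n", length $decoded;
print  "decoded match:        $decoded\n";
```

With an undecoded byte string, \w typically only matches ASCII word characters under Perl's default rules, so the raw match tends to stop before the first accented letter, while the decoded match covers the whole path.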

It might also be worth checking the actual bytes in the data files with, e.g., hexdump.
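On the command line, hexdump -C on the file shows the bytes directly. The same check can be sketched in Perl. The helper name hex_bytes and the sample word are assumptions for illustration; the point is only that an accented letter occupies two bytes in UTF-8 and one byte in ISO-8859-15.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $word = "t\x{f3}n";   # "tón" as a character string

my $as_utf8   = hex_bytes(encode('UTF-8', $word));
my $as_latin9 = hex_bytes(encode('ISO-8859-15', $word));

print "UTF-8:       $as_utf8\n";     # 74 c3 b3 6e
print "ISO-8859-15: $as_latin9\n";   # 74 f3 6e
```

If the dump of a data file shows a lone f3 where an accented o should be, the file is probably not UTF-8, whatever its label says.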


🦛

Replies are listed 'Best First'.
Re^6: UTF8 versus \w in pattern matching
by mldvx4 (Friar) on Jul 06, 2021 at 12:33 UTC

    Using the formula my $re = qr/^([\/\w]+)/; as the pattern has the same problems. I am quite sure that the input files are UTF-8. However, checking in different terminals, the output renders properly if I switch the terminal from UTF-8 to ISO-8859-15, even with \N{LATIN SMALL LETTER A WITH ACUTE} for the letters. So this may be a terminal problem, except I really wonder why the script, which uses \N{LATIN SMALL LETTER A WITH ACUTE}, is still outputting ISO-8859-15 instead of UTF-8.
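One likely explanation, offered as a sketch and not as a diagnosis of the actual script: when a filehandle has no encoding layer, Perl writes characters below U+0100 as single bytes, which happen to look correct on a Latin-1 or ISO-8859-15 terminal. The example below prints to in-memory filehandles so the bytes can be compared without involving a terminal. The helper name bytes_printed is an assumption.

```perl
use strict;
use warnings;
use charnames ':full';

my $word = "t\N{LATIN SMALL LETTER O WITH ACUTE}n";

# Hypothetical helper: print a string to an in-memory filehandle opened
# with the given layer, and return the bytes that were written, as hex.
sub bytes_printed {
    my ($string, $layer) = @_;
    my $buffer = '';
    open my $out, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$out} $string;
    close $out;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

my $without_layer = bytes_printed($word, '');
my $with_layer    = bytes_printed($word, ':encoding(UTF-8)');

print "no layer:          $without_layer\n";   # 74 f3 6e
print ":encoding(UTF-8):  $with_layer\n";      # 74 c3 b3 6e
```

In a real script the usual remedy is binmode(STDOUT, ':encoding(UTF-8)'); or use open qw(:std :encoding(UTF-8)); near the top, so that output is encoded explicitly.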

      Using the formula my $re = qr/^([\/\w]+)/; as the pattern has the same problems.

      For clarity, the test script which I provided works just as well with this regex. The point is that it demonstrates that there is nothing wrong with your perl code which does the regex matching and therefore the only logical conclusion is that your data is not what you think it is.

      Are you decoding your UTF-8 data when you read it from the data files in your script? If not, that is the problem.

      If you can provide a real SSCCE then I'm sure all will become clear.


      🦛

        I've been trying for a real SSCCE. Here's one more try: When I fetch one of the source files using 'curl' directly to a file, and then import that file using Emacs, whittle it down to a few letters, like in the following, then I get the output $VAR1 = "t\x{f3}n";. That does not look like UTF-8 to me.

        #!/usr/bin/perl
        use utf8;
        use Data::Dumper;
        use warnings;
        use strict;

        my $a = "tón";
        print Dumper($a),qq(\n);
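A note on reading that Dumper output, as a sketch under the assumption that the source file really is UTF-8: \x{f3} is the Unicode code point U+00F3 of a decoded character, not a byte from the file. The example below avoids literal accented characters and starts from the four bytes that UTF-8 "tón" occupies on disk.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode);

my $bytes = "t\xC3\xB3n";              # "tón" as stored in a UTF-8 file
my $chars = decode('UTF-8', $bytes);   # the same text as characters

printf "bytes: length %d\n", length $bytes;                  # 4
printf "chars: length %d\n", length $chars;                  # 3
printf "middle character: U+%04X\n", ord(substr($chars, 1, 1));  # U+00F3

# Compare this with the Dumper output quoted in the post above.
print Dumper($chars);
```

So a decoded string showing \x{f3} is consistent with correctly decoded UTF-8 input. It does not by itself show that the source was legacy 8-bit text.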

        Is there a standard way to identify 8-bit, legacy text (which has been mislabeled upstream as UTF-8) and convert it into UTF-8 for continued work with regex?
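One common heuristic, sketched here under stated assumptions and not as a definitive answer: try a strict UTF-8 decode first, and only if that fails treat the bytes as the legacy encoding. The helper name decode_with_fallback and the choice of ISO-8859-15 as the fallback are assumptions for illustration; the fallback has to be whatever the upstream source actually uses.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode bytes as strict UTF-8 when they are
# well-formed, otherwise decode them as the assumed legacy encoding.
sub decode_with_fallback {
    my ($bytes, $fallback) = @_;
    $fallback //= 'ISO-8859-15';

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $bytes intact so it can still be decoded with the fallback.
    my $chars = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars if defined $chars;

    return Encode::decode($fallback, $bytes);
}

my $from_utf8   = decode_with_fallback("t\xC3\xB3n");   # valid UTF-8
my $from_latin9 = decode_with_fallback("t\xF3n");       # legacy 8-bit

binmode(STDOUT, ':encoding(UTF-8)');
print "from UTF-8 bytes:  $from_utf8\n";
print "from legacy bytes: $from_latin9\n";
```

Both calls yield the same three-character string, which can then be matched with \w and written out through a UTF-8 output layer. This is only a heuristic: pure ASCII passes either way, and some legacy byte sequences are also valid UTF-8 by coincidence. The core module Encode::Guess is another option when the set of candidate encodings is known.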
