Re^7: UTF8 versus \w in pattern matching

Using the formula `my $re = qr/^([\/\w]+)/;` as the pattern has the same problems.

For clarity, the test script which I provided works just as well with this regex. The point is that it demonstrates that there is nothing wrong with the Perl code which does the regex matching. The only logical conclusion is therefore that your data is not what you think it is.

Are you decoding your UTF-8 data when you read it from the data files in your script? If not, that is the problem.
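To illustrate why decoding matters, here is a minimal sketch rather than the original test script. It runs the regex from this thread against the same data twice: once as raw UTF-8 bytes and once after `Encode::decode`. The sample string and the variable names are made up for the example, and it assumes default regex semantics (no `use feature 'unicode_strings'`, no `/u` or `/a` modifier).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The pattern under discussion in this thread.
my $re = qr/^([\/\w]+)/;

# Hypothetical sample data: the UTF-8 encoding of "/tón/x",
# as it would arrive from a file that was read without decoding.
my $bytes = "/t\xC3\xB3n/x";

# The same data after decoding: a string of characters, not bytes.
# Reading the file through '<:encoding(UTF-8)' would have the same effect.
my $chars = decode('UTF-8', $bytes);

my ($from_bytes) = $bytes =~ $re;
my ($from_chars) = $chars =~ $re;

# On undecoded bytes, \w stops at the first byte above 0x7F.
printf "undecoded: matched %d characters\n", length $from_bytes;  # 2
printf "decoded:   matched %d characters\n", length $from_chars;  # 6
```

The regex is identical in both cases; only the state of the data differs.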

If you can provide a real SSCCE then I'm sure all will become clear.

🦛

Re^8: UTF8 versus \w in pattern matching by mldvx4 (Friar) on Jul 06, 2021 at 13:54 UTC
I've been trying for a real SSCCE. Here's one more try: when I fetch one of the source files using 'curl' directly to a file, then import that file using Emacs and whittle it down to a few letters, as in the following, I get the output `$VAR1 = "t\x{f3}n";`. That does not look like UTF-8 to me.

`#!/usr/bin/perl use utf8; use Data::Dumper; use warnings; use strict; my $a = "tón"; print Dumper($a),qq(\n);`

Is there a standard way to identify 8-bit, legacy text (which has been mislabeled upstream as UTF-8) and convert it into UTF-8 for continued work with regex?
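One common heuristic for the question above, offered as a sketch and not as a standard or guaranteed method: attempt a strict UTF-8 decode and, if that fails, assume ISO-8859-1. The helper name `decode_guess` is hypothetical, and the choice of ISO-8859-1 as the fallback is an assumption. Because every byte sequence is valid ISO-8859-1, the fallback can never fail, so it may silently misread other legacy encodings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded text and the encoding that was used.
sub decode_guess {
    my ($bytes) = @_;

    # decode() with a CHECK argument may modify its input, so work on a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ($text, 'UTF-8') if defined $text;

    # Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

# "tón" encoded as UTF-8 and as ISO-8859-1, respectively.
for my $bytes ("t\xC3\xB3n", "t\xF3n") {
    my ($text, $encoding) = decode_guess($bytes);
    printf "%-10s -> %d characters\n", $encoding, length $text;
}
```

Either way the result is a character string on which `\w` behaves as expected. Encode it back to UTF-8 only on output.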
Re^9: UTF8 versus \w in pattern matching by haj (Vicar) on Jul 06, 2021 at 18:21 UTC
It doesn't look like UTF-8 because it isn't supposed to look like UTF-8. It has nothing to do with legacy 8-bit. Data::Dumper shows Unicode code points and not encodings.

If you open the file in Emacs, it will use your preferred coding system to interpret the data; for current Emacsen this is UTF-8. However, Emacs will fall back to ISO-8859-1 if the file doesn't contain valid UTF-8. Look at the Emacs mode line: if the first character is `U`, then it is UTF-8; if it is `1`, then it is ISO-8859-1. You can enforce the encoding in Emacs with `C-x RET f ISO-8859-1 RET`.

If you execute the file in this encoding, Perl will croak because you said `use utf8;` and your source code isn't valid UTF-8. If you then omit `use utf8;` with ISO-8859-1 encoding and run the file, you'll get `$VAR1 = 't�n';` because now it is your terminal which expects UTF-8 and gets an 8-bit character. If you then add `use Encode;` and change the last line to `print encode('UTF-8',Dumper($a));` (as you should when using a UTF-8 terminal), then you'll get `$VAR1 = 'tón';`.

I don't recommend Data::Dumper for such diagnostics because it might or might not use `\x{}` notation, as you just saw. It isn't easy, but it is rather straightforward if you keep track of the different places where encoding might occur.
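As one possible alternative to Data::Dumper for this kind of diagnosis, the following sketch prints the code points of a character string next to the bytes of its encodings. The helper names `codepoints` and `hex_bytes` are made up for the example. The string is written with `\x{f3}` so that the result does not depend on how the source file itself is encoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helpers: show a string as code points, or as hex bytes.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0] }
sub hex_bytes  { join ' ', map { sprintf '%02X',   ord($_) } split //, $_[0] }

my $text   = "t\x{f3}n";                    # the characters t, ó, n
my $utf8   = encode('UTF-8',      $text);   # bytes as a UTF-8 terminal expects them
my $latin1 = encode('ISO-8859-1', $text);   # bytes in the legacy 8-bit encoding

print 'code points: ', codepoints($text),  "\n";  # U+0074 U+00F3 U+006E
print 'UTF-8:       ', hex_bytes($utf8),   "\n";  # 74 C3 B3 6E
print 'ISO-8859-1:  ', hex_bytes($latin1), "\n";  # 74 F3 6E
```

The `\x{f3}` in the Dumper output corresponds to the first line: it is the code point of ó, which coincides with its ISO-8859-1 byte value but is not itself an encoding.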
Re^10: UTF8 versus \w in pattern matching by pryrt (Abbot) on Jul 06, 2021 at 18:49 UTC
If you then add `use Encode;` and change the last line to `print encode('UTF-8',Dumper($a));` (like you should when using an UTF_8 terminal), then you'll get `$VAR1 = 'tón';`

Assuming the real code is going to use more than one print statement, this suggestion will require calling `encode()` for every print, which is not DRY programming.

Alternative: use the binmode function, as in `binmode STDOUT, ':encoding(UTF-8)';`, sometime before any print statements, and just use normal print statements (like `print Dumper($a);`) throughout. This lets the I/O layer handle the translation from Perl's internal representation to UTF-8-encoded output.
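A minimal sketch of the `binmode` approach follows. It uses an in-memory filehandle instead of STDOUT, purely so that the bytes produced by the encoding layer can be inspected afterwards. That substitution and the variable names are choices made for this example. In a real script the one line `binmode STDOUT, ':encoding(UTF-8)';` would be enough.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "t\x{f3}n";   # the characters t, ó, n

# In a real script the handle would be STDOUT:
#     binmode STDOUT, ':encoding(UTF-8)';
# An in-memory handle is used here only so the output bytes can be examined.
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";
binmode $out, ':encoding(UTF-8)';

# Ordinary print statements; the I/O layer does the encoding each time.
print {$out} $text;
print {$out} "\n";
close $out;

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
print "$hex\n";   # 74 C3 B3 6E 0A
```

The layer is set once and then applies to every later `print` on that handle, so no `encode()` call is needed per statement.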
Re^11: UTF8 versus \w in pattern matching by ikegami (Patriarch) on Jul 06, 2021 at 21:01 UTC


	PerlMonks