Matching non-ASCII file contents with file name.

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

My goal is to use the substitution operator (s///) to replace occurrences of a question mark (?) with an inverted question mark (¿) on specific lines in a large number of files. My trouble is that what actually gets substituted inside the file does not match what ends up in the file name in the file system. I am grateful for any tips or guidance on how to make what is inside the files match the various file names out in the file system. Perhaps it is a matter of encoding, again?

It is claimed that the inverted question mark is \x00BF, which strangely is C2 BF in UTF-8 according to a "Unicode Character Table" site.
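For reference, the two values describe the same character at different levels: U+00BF is the code point, and C2 BF is the byte sequence that UTF-8 uses to store it. A minimal sketch with the core Encode module shows both (the variable names are illustrative and not taken from the thread):

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: U+00BF INVERTED QUESTION MARK.
my $char = "\N{U+BF}";

# Encoding turns the character into the bytes that end up on disk.
my $bytes = encode('UTF-8', $char);

# Format each byte as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

printf "code point:  U+%04X\n", ord($char);
printf "UTF-8 bytes: %s\n", $hex;
```

This prints U+00BF for the code point and C2 BF for the bytes, so both figures quoted above are consistent with each other.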

In the shell (Bash) on an ext4 file system, that seems to be the case, and the Perl utility rename seems to work that way, too.

$ touch ¿
$ ls ? > zz
$ xxd zz
00000000: c2bf 0a

$ echo '¿' > yy
$ xxd yy
00000000: c2bf 0a  

$ touch xx
$ rename -v 's/xx/¿/;' xx
xx not renamed: ¿ already exists
$ rename --version
/usr/bin/rename using File::Rename version 1.13, File::Rename::Options version 1.10

And those files show up in Apache2's access logs containing the escape sequence "%C2%BF" in the URL in place of the inverted question mark.
Yet, shouldn't that be \x00BF all the way through? My own Perl scripts work differently:

$ perl -e 'use utf8; print "¿\n"' > ww
$ xxd ww
00000000: bf0a

$ perl -e 'use utf8; $c="¿\n"; utf8::upgrade($c); print $c' > vv
$ xxd vv
00000000: bf0a

Though if I leave out the use utf8 part, then I get the "right" result, at least according to xxd:

$ perl -e 'print "¿\n"' > uu
$ xxd uu
00000000: c2bf 0a 

$ curl --silent --head 'http://localhost/' | grep 'Content-Type'
Content-Type: text/html; charset=utf-8

While keeping UTF-8, how can I get "¿" inside the files to match the "¿" out in the file name and still look right?


Replies are listed 'Best First'.
Re: Matching non-ASCII file contents with file name. by Corion (Patriarch) on Dec 22, 2022 at 12:07 UTC
By having `use utf8;` in your code, you only tell Perl that your source code is in UTF-8 (so the inverted question mark gets recognized as that), not what the input and output should be encoded in. Perl treats your output handle as (say) Latin-1, and since it can convert the Unicode string it read from the UTF-8 source to Latin-1, it does so when printing. I find the approach of explicitly specifying the encodings for filenames the easiest way to get consistent results:

#!perl
use strict;
use warnings;
use charnames ':full';
binmode STDOUT, ':encoding(UTF-8)';
print "\N{INVERTED QUESTION MARK}\n"
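To make the effect of the output layer visible without touching the terminal, here is a small sketch that prints the same one-character string through two in-memory filehandles, one without an encoding layer and one with `:encoding(UTF-8)`. The variable names are made up for illustration, and the in-memory handles are only a stand-in for real files:

```perl
use strict;
use warnings;

my $char = "\N{U+BF}";    # one character, U+00BF

# Handle without an encoding layer: the character goes out as one byte.
my $raw = '';
open my $raw_fh, '>:raw', \$raw or die "in-memory handle: $!";
print {$raw_fh} $char;
close $raw_fh;

# Handle with an explicit UTF-8 layer: the character is encoded on output.
my $utf8 = '';
open my $utf8_fh, '>:encoding(UTF-8)', \$utf8 or die "in-memory handle: $!";
print {$utf8_fh} $char;
close $utf8_fh;

printf "no encoding layer: %s\n", unpack 'H*', $raw;
printf "UTF-8 layer:       %s\n", unpack 'H*', $utf8;
```

The first line shows bf and the second c2bf, which matches the difference between the ww/vv files and the yy file in the original question.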
Re^2: Matching non-ASCII file contents with file name. by mldvx4 (Friar) on Dec 23, 2022 at 06:39 UTC
Thanks, again! I see now the mistake but don't understand it. The following is what I had but which was not producing the right result:

my ($fh, $tempfile) = tempfile();
binmode( $fh, ":utf8" );

With your corrections, the following produces the right character:

my ($fh, $tempfile) = tempfile();
binmode( $fh, ":encoding(UTF-8)" );

What would be the difference between `binmode( $fh, ":utf8" );` and `binmode( $fh, ":encoding(UTF-8)" );` in regards to the output? I don't understand the difference.
Re^3: Matching non-ASCII file contents with file name. by Corion (Patriarch) on Dec 23, 2022 at 07:16 UTC
Maybe the problem is elsewhere? Because binmode says: "To mark FILEHANDLE as UTF-8, use `:utf8` or `:encoding(UTF-8)`. `:utf8` just marks the data as UTF-8 without further checking, while `:encoding(UTF-8)` checks the data for actually being valid UTF-8." I read this as saying that the two should behave identically (except for warnings). Maybe someone else knows where the differences come from.
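One way to check the output side of this directly is to write the same string through both layers and compare the bytes. The sketch below does that with in-memory handles; `bytes_written` is a hypothetical helper invented for this example, and the test only covers output of a valid character, not input validation:

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through $layer, return the bytes as hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return unpack 'H*', $buffer;
}

my $text = "\N{U+BF}\n";
for my $layer (':utf8', ':encoding(UTF-8)') {
    printf "%-18s %s\n", $layer, bytes_written($layer, $text);
}
```

Both layers give c2bf0a here, which supports the reading that they do not differ for output of valid characters and that the cause of the earlier difference probably lies elsewhere in the script.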
Re^4: Matching non-ASCII file contents with file name. by hippo (Bishop) on Dec 23, 2022 at 08:26 UTC
Re: Matching non-ASCII file contents with file name. by haukex (Archbishop) on Dec 22, 2022 at 12:40 UTC
"It is claimed that the inverted question mark is \x00BF, which strangely is C2 BF in UTF-8 ... And those files show up in Apache2's access logs containing the escape sequence '%C2%BF' in the URL in place of the inverted question mark. Yet, shouldn't that be \x00BF all the way through?"

`C2 BF` is the correct UTF-8 encoding for the Unicode character `U+00BF INVERTED QUESTION MARK`.

This question is complicated by the fact that we don't know what your shell/terminal's encoding is, which is why I suspect most of the examples you showed aren't representative. Plus, AFAIK, file name encodings are a very complicated topic, and therefore I think you might do yourself a favor by not using "¿" in filenames and URLs.

I'll take your "While keeping UTF-8" to mean you want UTF-8 everywhere, so below is one way of getting that, at least on *NIX, where many tools assume UTF-8 filenames anyway and my shell and terminal are also UTF-8; these kinds of filenames may be even more complicated on Windows, I'm not sure.

BTW, my personal preference is using Perl's `\x` notation only for bytes in the range `00-FF`, while using `\N{}` for Unicode, as I do below. This is why I don't need the `use utf8;`: this source code is entirely ASCII (but you could `use utf8;` and then use `¿` instead of `\N{U+BF}` if you wanted). If you wanted this code to write UTF-8 to `STDOUT`, like for example `print "Wrote file $newname\n";`, you'd need to add a `use open qw/:std :encoding(UTF-8)/;`.
$ cat test.pl
#!/usr/bin/env perl
use warnings;
use strict;

my $fname = 'test.txt';
my $newname = "new\N{U+BF}.txt";

open my $fh, '>:raw:encoding(UTF-8)', $fname or die "$fname: $!";
print $fh "Hello?\n";
close $fh;

open my $ofh, '>:raw:encoding(UTF-8)', $newname or die "$newname: $!";
open my $ifh, '<:raw:encoding(UTF-8)', $fname or die "$fname: $!";
while ( my $line = <$ifh> ) {
    $line =~ s/\?/\N{U+BF}/g;
    print $ofh $line;
}
close $ifh;
close $ofh;

$ perl test.pl
$ hexdump -C new¿.txt
00000000  48 65 6c 6c 6f c2 bf 0a                           |Hello...|
00000008
Re^2: Matching non-ASCII file contents with file name. by mldvx4 (Friar) on Dec 23, 2022 at 06:39 UTC
Thanks! The detailed explanation helped and is appreciated.

"Plus, AFAIK, file name encodings are a very complicated topic, and therefore I think you might do yourself a favor by not using '¿' in filenames and URLs."

Of course. However, there are several reasons:

Use of non-ASCII characters like Ö, Ř, Ó, Ô, 月, 日, or even ¿ or ¡ is to be expected these days, even in file names and thus URLs.

The `rename` utility listed above deals with the renaming, and seems to match what can be produced manually via a local terminal emulator, a local console, or a remote ssh+tmux connection. So it was my script which was the odd man out and therefore needed correction.

The file names, minus the inverted question mark, are the result of using `wget` to scrape the output from some legacy PHP scripts which are not / cannot be maintained any more. Aside from the very long file names, the method works reasonably well for converting the whole mess to a static HTML archive. Unfortunately, that leaves a question mark in the file name, and that is not tolerated by web servers, which use it to delimit the start of a query string and the end of the file name. So a replacement character is needed and ¿ seems the least problematic semantically.
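For the renaming step itself, one cautious approach is to treat file names as bytes at the system boundary: decode the name into characters, do the substitution, and encode the result back to UTF-8 before handing it to the OS. The sketch below assumes UTF-8 file names on the file system; `new_name_for` and the sample name `index.php?id=1` are hypothetical, and the actual `rename` call is left commented out:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: map a decoded (character) file name to its new name.
sub new_name_for {
    my ($name) = @_;
    (my $new = $name) =~ s/\?/\N{U+BF}/g;
    return $new;
}

# Hypothetical sample name, as bytes, the way readdir would return it.
my $old_bytes = 'index.php?id=1';

# Decode, substitute on characters, then encode back to UTF-8 bytes.
my $new_chars = new_name_for( decode('UTF-8', $old_bytes) );
my $new_bytes = encode('UTF-8', $new_chars);

printf "%s\n", unpack 'H*', $new_bytes;

# Assumption: the file system stores UTF-8 names. Uncomment to rename.
# rename $old_bytes, $new_bytes or die "rename: $!";
```

The same decoded name can then be written into the file contents through a handle with an `:encoding(UTF-8)` layer, so the name on disk and the text in the file end up as the same bytes.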