PerlMonks  

Re: help with cyrillic characters in odd places

by Anonymous Monk
on Feb 06, 2019 at 13:29 UTC


in reply to help with cyrillic characters in odd places

Q1) Can I alter my perltidy command so that these chars are not problematic?
This doesn't seem to be mentioned in perldoc perltidy, but see what I found in Perl-Tidy-20181120/docs/perltidy.html:

-enc=s, --character-encoding=s

where s=none or utf8. This flag tells perltidy the character encoding of both the input and output character streams. The value utf8 causes the stream to be read and written as UTF-8. The value none causes the stream to be processed without special encoding assumptions. At present there is no automatic detection of character encoding (even if there is a 'use utf8' statement in your code) so this flag must be set for streams encoded in UTF-8. Incorrectly setting this parameter can cause data corruption, so please carefully check the output.

The default is none.

The abbreviations -utf8 or -UTF8 are equivalent to -enc=utf8. So to process a file named file.pl which is encoded in UTF-8 you can use:

perltidy -utf8 file.pl

Regarding your Cyrillic path issues, Path::Tiny seems to work with file names as byte-strings, not character-strings. Allow me to demonstrate:

$ ls -l
итого 0
-rw-r--r-- 1 user user 0 фев  6 16:20 привет
$ perl -MData::Dump=dd -MPath::Tiny=path -E'dd path(".")->children'
bless([
  pack("H*","d0bfd180d0b8d0b2d0b5d182"),
  pack("H*","d0bfd180d0b8d0b2d0b5d182"),
], "Path::Tiny")
$ perl -MData::Dump=dd -E'dd "привет"'
pack("H*","d0bfd180d0b8d0b2d0b5d182")
The Data::Dump::dd output is the same for $path->children and for a plain string consisting of UTF-8-encoded bytes. Perl wide-character strings look different when dumped using Data::Dump::dd or Data::Dumper::Dumper:
$ perl -MData::Dump=dd -Mutf8 -E'dd "привет"'
"\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}"
You seem to be using a PerlIO layer that encodes everything printed to STDOUT from wide characters to UTF-8. In Perl, wide strings and byte strings are mostly the same data type, except that wide strings can contain ord values > 255, so the encoding layer cannot tell whether it is encoding actual wide characters into UTF-8 or wrongly encoding UTF-8 bytes as if they were Unicode code points. So when you print file names, they undergo an unnecessary second conversion and get garbled in the process. Your options include decoding them back into wide characters before printing (beware: file names can contain invalid UTF-8 and arbitrary bytes!) or disabling the I/O layer that encodes the strings.
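To make the difference concrete, here is a minimal sketch using only the core Encode module. It starts from the same bytes shown in the dumps above; the helper name decode_filename is made up for this illustration, and returning undef for undecodable names is just one possible policy.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "привет", as the file system hands them back.
my $bytes = pack("H*", "d0bfd180d0b8d0b2d0b5d182");

# Encoding the raw bytes again treats every byte as a code point:
# this is the garbling an encoding layer applies to undecoded names.
my $double = encode("UTF-8", $bytes);

# Hypothetical helper: decode strictly, return undef for names that
# are not valid UTF-8 instead of dying. LEAVE_SRC keeps $raw intact.
sub decode_filename {
    my ($raw) = @_;
    return eval { decode("UTF-8", $raw, Encode::FB_CROAK | Encode::LEAVE_SRC) };
}

my $chars = decode_filename($bytes);   # six wide characters
my $round = encode("UTF-8", $chars);   # back to the original bytes
my $bad   = decode_filename("\xff\xfe"); # invalid UTF-8, so undef

printf "bytes: %d, chars: %d, double-encoded: %d\n",
    length $bytes, length $chars, length $double;
```

Only the decoded string ($chars) is safe to print through a handle that has an encoding layer; the raw bytes should go to a handle without one.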

with the advent of quantum computing, public key encryption would be passé
See Post-quantum_cryptography. This may be true for "classical" asymmetric cryptography which may be easy to break after we get powerful enough quantum computers (we don't, yet), but new approaches are already being developed that wouldn't rely on integer factorization / discrete logarithm / elliptic-curve discrete logarithm problems to be secure and also wouldn't be vulnerable to quantum computers.

Replies are listed 'Best First'.
Re^2: help with cyrillic characters in odd places
by Aldebaran (Curate) on Feb 16, 2019 at 02:20 UTC
    perltidy -utf8 file.pl

    Thanks again for your generous comments, Anonymous Monk. I changed my perltidy command in .bash_aliases :

    $ cat .bash_aliases
    alias pt='perltidy -i=2 -utf8 -b '

    I was able to replicate your results and see for myself. I thought I was gonna beat it by using Path::Tiny methods, but I seem only to have dug myself in deeper:

    I'm completely fanning on getting ->is_dir to work for me, encoded or decoded....

      A good strategy for you would be either to start splitting your code into smaller and smaller parts until some of them start working - and seeing which change made a difference - or combining small self-contained reproducible examples we provide back into a whole that resembles your current code - and seeing when it stops working.

      Right now, your encoding handling is doing the wrong thing. Let's start with a small file and get it to output UTF-8 from Perl wide characters:

      use warnings;
      binmode STDOUT, ":utf8";
      print "\x{44b}\n";

      No warnings, the string literal is definitely wide, and the output is evidently UTF-8. This was achieved by adding a perl IO layer to STDOUT that encodes wide characters to UTF-8 bytes. We can verify that:

      use Data::Dumper;
      binmode STDOUT, ":utf8";
      use PerlIO;
      print Dumper [ PerlIO::get_layers \*STDOUT, output => 1 ];
      __END__
      $VAR1 = [
                'unix',
                'perlio',
                'utf8'
              ];

      Your code,

      use open qw/:std :utf8/;
      use open OUT => ':encoding(UTF-8)', ':std';
      adds the UTF-8 encoding layer multiple times:
      $VAR1 = [ 'unix', 'perlio', 'utf8', 'encoding(utf-8-strict)', 'utf8' ];

      That would be one of the reasons why you are getting Mojibake instead of Cyrillic characters. It may be helpful to use simpler and more explicit code for now, until you better understand the machinery that makes it all tick. Start with binmode STDOUT, ":utf8" and get your code to output correctly encoded UTF-8 to STDOUT (after reading your code, I think you are almost there: everywhere you get UTF-8 bytes, you decode them correctly before printing). Once that works, start adding pragmas like open that save you typing.

      I am not sure why your code would (appear to) entirely skip non-ASCII files and directories, but perhaps we could shed some light on it once we get the Unicode display problem resolved.

        Hmm, no, that was wrong: use open qw/:std :utf8/; use open OUT => ':encoding(UTF-8)', ':std'; alone doesn't cause Mojibake on my system (instead, I get proper UTF-8 output). Something else is going on.
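When it is unclear what a handle is really doing, it may help to look at the layers and at the bytes that come out the other end. The following is a small sketch along those lines. It uses an in-memory filehandle rather than STDOUT so the written bytes can be inspected directly; that substitution is an assumption made for the illustration, and the layer names reported for a real STDOUT differ by platform.

```perl
use strict;
use warnings;
use PerlIO;

# In-memory handle: whatever is printed ends up as bytes in $buf.
my $buf = '';
open(my $fh, '>', \$buf) or die "cannot open in-memory handle: $!";

my @before = PerlIO::get_layers($fh, output => 1);

# Same call as in the small example above, applied to $fh.
binmode($fh, ':utf8') or die "binmode failed: $!";
my @after = PerlIO::get_layers($fh, output => 1);

# One wide character (U+044B) plus a newline.
print {$fh} "\x{44b}\n";
close($fh) or die "close failed: $!";

print "layers before: @before\n";
print "layers after:  @after\n";
print "bytes written: ", unpack("H*", $buf), "\n";
```

If the hex dump shows more than two bytes for the single Cyrillic character, the string was encoded more than once somewhere between the print and the handle.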
