PerlMonks  

Perl, DOS and encodings

by siberia-man (Friar)
on Apr 29, 2020 at 17:53 UTC ( [id://11116231]=perlquestion )

siberia-man has asked for the wisdom of the Perl Monks concerning the following question:

Hello dear Monks,

I have come with an issue I encountered recently.

I have a Perl script, git-md-toc, that generates a table of contents (TOC) from a markdown file and embeds it into the original file.

It worked fine with the Latin charset. Later I found that it doesn't work with other encodings, so I extended it to accept a particular encoding via an additional command line option. That works fine as well (I tested it under Cygwin). However, it fails in a DOS session when a TOC title has to be added to a file written in a non-Latin charset/encoding.

For example, there is a test file in UTF-8 containing some Cyrillic text. I need to update it by adding a TOC with a title in Russian.

This command in bash works fine (Perl 5.30 shipped with Cygwin):

git-md-toc -ut "some-text-in-russian" -Tutf8 "utf8-cyrillic.md"

But it fails in DOS sessions: the title is added in the wrong encoding. To resolve the issue I have to use one more option (standalone StrawberryPerl 5.30):

git-md-toc --title-transcode=cp1251 -ut "some-text-in-russian" -Tutf8 +"utf8-cyrillic.md"

What confuses me is that the default DOS code page is 866, yet the encoding I have to specify for the title is 1251.

My questions are:

  1. Is this something specific to DOS, to Perl, or to the combination of both?
  2. How does the script behave on other systems (Windows, Linux), especially with encodings different from mine?

Replies are listed 'Best First'.
Re: Perl, DOS and encodings
by haj (Vicar) on Apr 29, 2020 at 19:37 UTC

    This is indeed related to weirdness of the cmd.exe command box. You can control the output encoding of Perl programs with the chcp command, but this does not affect the encoding of Perl's @ARGV.

    I found this superuser.com answer helpful to find out the default encoding for my machine, and it is this encoding which is applied to parameters which you pass to Perl programs, regardless of your chcp settings. So, most probably, your Windows system is using the Cyrillic default encoding of codepage 1251 for input - but defaults to codepage 866 for output.
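
A minimal sketch of what an option like --title-transcode=cp1251 presumably has to do, using the core Encode module. The helper name transcode_arg and the sample word are hypothetical and not taken from git-md-toc; the sketch only assumes that the argument arrives as raw cp1251 bytes and that the target file is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of git-md-toc): re-encode an argument that
# arrived as raw bytes in $from into the encoding $to of the target file.
sub transcode_arg {
    my ($bytes, $from, $to) = @_;
    # FB_CROAK dies on malformed input; LEAVE_SRC keeps the input untouched.
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $chars = decode($from, $bytes, $check);   # bytes -> Perl characters
    return encode($to, $chars, $check);          # characters -> target bytes
}

# The Russian word for "hello" as six cp1251 bytes, standing in for what
# cmd.exe would pass in @ARGV on a system whose Windows code page is 1251.
my $argv_bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";
my $utf8_bytes = transcode_arg($argv_bytes, 'cp1251', 'UTF-8');

# Print the resulting bytes in hex so the result is visible in any console.
print join(' ', map { sprintf '%02X', ord } split //, $utf8_bytes), "\n";
```

Writing $argv_bytes into a UTF-8 file without this step would store six single-byte values that are not valid UTF-8, which matches the "wrong encoding" symptom described in the question.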

    Cygwin is another story, of course. Contemporary Unix/Linux terminals are using UTF-8 as default encoding, and this is applied when you pass data from bash (the Cygwin shell) to your Perl program.

      Thank you for your response. I've just tested the suggestion from the superuser.com answer. To be honest, without your explanation that answer doesn't give many clues. Just compare it with the output below.

      As I have already said, the code page defaults to 866 (or IBM CP866, the old code page since MSDOS 4.01). BodyName = koi8-r refers to yet another code page, 20866. How it actually works, I don't know; cmd.exe is definitely painful.
      C:\>chcp
      Active code page: 866

      C:\>powershell -c "[System.Text.Encoding]::Default"

      IsSingleByte      : True
      BodyName          : koi8-r
      EncodingName      : Cyrillic (Windows)
      HeaderName        : windows-1251
      WebName           : windows-1251
      WindowsCodePage   : 1251
      IsBrowserDisplay  : True
      IsBrowserSave     : True
      IsMailNewsDisplay : True
      IsMailNewsSave    : True
      EncoderFallback   : System.Text.InternalEncoderBestFitFallback
      DecoderFallback   : System.Text.InternalDecoderBestFitFallback
      IsReadOnly        : True
      CodePage          : 1251

      I tested the command from my opening post with different code pages, setting chcp to 1251 or 65001 (UTF-8). The only encoding that works for Cyrillic text on the command line is 1251. The default encoding in Cygwin is en_US.UTF-8.
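
To illustrate why the choice of code page matters, here is a small sketch (not from the thread) that encodes a single Cyrillic letter in each of the encodings mentioned above. The letter and the variable names are arbitrary; the encoding names are those known to Perl's Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+0416 CYRILLIC CAPITAL LETTER ZHE, chosen arbitrarily.
my $text = "\x{0416}";

# Collect the byte sequence (as hex) that each encoding produces.
my %bytes_for;
for my $enc (qw(cp866 cp1251 koi8-r UTF-8)) {
    my $bytes = encode($enc, $text);
    $bytes_for{$enc} = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "%-7s %s\n", $enc, $bytes_for{$enc};
}
```

Every encoding yields a different byte sequence for the same letter, so bytes produced under one code page and interpreted under another come out as different characters or as invalid input.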

      Updated:

      I tested the script by invoking it from a shell script and from a batch script. It works correctly if the title's encoding matches the encoding of the shell script. Code page 1251 only has to be specified in the batch script, regardless of the encoding of the batch file itself.

        The relevant information is the WindowsCodePage entry. This is the encoding which is used by cmd.exe to pass Cyrillic characters from your terminal input to your Perl program, and you cannot change it using chcp.

        According to my experiments, which may be totally bogus, things get even more interesting if you write your command, including command line parameters with Cyrillic characters, into a .bat file and execute that. In that case, the chcp setting will be used to decode the batch file - but still the Perl program will receive its @ARGV in the WindowsCodePage encoding.

        So, if your batch file is UTF-8 encoded, you need to chcp 65001 and use --title-transcode=cp1251 if you pass the title as a command line parameter.
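
A script could also try to guess the encoding of @ARGV by itself instead of requiring an option. The following is only a sketch under assumptions: argv_encoding is a hypothetical helper, Win32::GetACP() from the Win32 module is used to query the Windows code page on a native Windows Perl, and everywhere else (Cygwin, Linux) a UTF-8 terminal is simply assumed rather than detected.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: guess the encoding in which @ARGV bytes arrive.
sub argv_encoding {
    if ($^O eq 'MSWin32') {
        # Win32::GetACP() returns the Windows (ANSI) code page number,
        # e.g. 1251; chcp does not change this value.
        my $acp = eval { require Win32; Win32::GetACP() };
        # Use it only if Encode actually knows an encoding of that name.
        return "cp$acp" if $acp && find_encoding("cp$acp");
    }
    # Assumption: a UTF-8 terminal on every other platform.
    return 'UTF-8';
}

my $enc  = argv_encoding();
my @args = map { decode($enc, $_) } @ARGV;   # bytes -> Perl characters
```

After this step the arguments are Perl character strings and can be encoded to whatever encoding the target file uses, independently of the console settings.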

Re: Perl, DOS and encodings
by Anonymous Monk on Apr 29, 2020 at 23:43 UTC
      Thank you for sharing a lot of links. I will read them later.
