
strftime does not handle Unicode characters in format argument properly (at least, not consistently)

by Bruder Savigny (Initiate)
on Sep 21, 2020 at 18:36 UTC ( [id://11122019]=perlquestion )

Bruder Savigny has asked for the wisdom of the Perl Monks concerning the following question:

I tried to use a UTF-8 non-breaking space (between day and name of month) in the format argument of POSIX::strftime, and hit (with Perl v5.32.0 and a UTF-8-encoded script file, without any non-default encoding settings) upon the following two oddities:

  1. a non-breaking space alone comes out as something unprintable (according to Emacs, Unicode 65533 (decimal) REPLACEMENT CHARACTER, but when examined in a hex-mode, looks like hexadecimal EFBFBD)
  2. when other non-ASCII characters figure in the format, they come out correctly, and this seems "infectious": the non-breaking space then comes out correctly as well! However, in that case, a string that is concatenated to what strftime returns gets garbled (perhaps erroneously encoded from an assumed iso-latin-1 (but really already utf-8) to utf-8), which does not happen in case 1

These behaviours can be demonstrated with the following script. (The comments refer to the invisible space character in the format; the innocent-looking inner, i.e. non-syntactic, quotes in the third and fourth print statements are Unicode LEFT and RIGHT SINGLE QUOTATION MARK, the same as in $string.)

    use POSIX qw(strftime);
    $string = 'hailed an über ‘cab’ on ';
    @t = (0, 0, 0, 23, 5, 2020, 4);
    print $string . strftime(  '%d/%b', @t), "\n";
    print $string . strftime(  '%d %b', @t), "\n";  # UTF-8 nbsp
    print $string . strftime('‘%d %b’', @t), "\n";  # UTF-8 nbsp
    print $string . strftime('‘%d %b’', @t), "\n";  # ASCII space

This outputs (line numbers added):

    1 hailed an über ‘cab’ on 23/Jun
    2 hailed an über ‘cab’ on 23�Jun
    3 hailed an über âcabâ on ‘23 Jun’
    4 hailed an über âcabâ on ‘23 Jun’

Note that

  • the &#65533; entity in line 2 of the output is what the HTML renderer here makes of the character I described above (apparently agreeing with Emacs)
  • the space in line 3 between 23 and Jun is a proper UTF-8 non-breaking space (as in the code), and
  • the one in line 4, an ASCII space (also as in the code).

(I have deleted complaints about the wide characters in print for line 3 and 4 for brevity.)

I am guessing, rather vaguely, that this comes down to strftime essentially being the C function, which is not Unicode-aware, and perhaps also to the way Perl identifies how strings are encoded and then "upgrades" some of them to harmonise their encodings (in this case under a wrong assumption), but:

The behaviour with a non-breaking space alone vs. (also) other non-ASCII characters seems definitely inconsistent. Why is the behaviour different between the non-breaking space and typographical quotation marks, which are all outside the ASCII block?

Also, can anything be done about it? That is, is it possible to use non-breaking spaces in a format for strftime such that they come out correctly (without having to resort to inserting extra, likely unwanted, non-ASCII characters), and is it possible to use any non-ASCII character in that format argument without confusing Perl? (Actually, I can think only of non-breaking spaces as useful, but other cultures may very plausibly have other use cases.)
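
The garbling of $string in output lines 3 and 4 is at least consistent with the "upgrade" guess above. Here is a minimal sketch of that mechanism, assuming (and this is only an assumption) that strftime handed back a character string while $string still held raw UTF-8 bytes. The variable names and byte values are illustrative and not taken from the script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes of "‘cab’", as Perl sees such a literal without "use utf8".
my $bytes = "\xE2\x80\x98cab\xE2\x80\x99";

# A character string containing a code point above 255
# (U+2018 LEFT SINGLE QUOTATION MARK), standing in for a decoded result.
my $chars = "\x{2018}23";

# Concatenation treats every byte of $bytes as one Latin-1 character,
# so 0xE2 becomes U+00E2 ("â") instead of the start of a quotation mark.
my $mixed = $bytes . $chars;

# Writing that out as UTF-8 encodes each former byte a second time.
my $garbled = encode('UTF-8', $mixed);

# Decoding the bytes first keeps both operands in the same form.
my $clean = encode('UTF-8', decode('UTF-8', $bytes) . $chars);

printf "%d bytes garbled, %d bytes clean\n", length $garbled, length $clean;
```

The sketch only accounts for the "âcabâ" part of the output. It does not explain why a lone non-breaking space behaves differently.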


Replies are listed 'Best First'.
Re: strftime does not handle Unicode characters in format argument properly (at least, not consistently)
by choroba (Cardinal) on Sep 21, 2020 at 19:08 UTC
    When I add use utf8; and correctly set the encoding of the output, it seems to work:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use open OUT => ':encoding(UTF-8)', ':std';

    use POSIX qw(strftime);

    my $string = 'hailed an über ‘cab’ on ';
    my @t = (0, 0, 0, 23, 5, 2020, 4);
    my $nbsp = chr 160;
    print $string . strftime(      '%d/%b', @t), "\n";
    print $string . strftime( "%d$nbsp%b", @t), "\n";
    print $string . strftime("‘%d$nbsp%b’", @t), "\n";
    print $string . strftime(    '‘%d %b’', @t), "\n";

    Update: I used $nbsp here because PerlMonks replaces the non-breaking space with a normal ASCII space, but it works with the nbsp character directly in the script, too.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: strftime does not handle Unicode characters in format argument properly (at least, not consistently)
by perlfan (Vicar) on Sep 21, 2020 at 18:53 UTC
    strftime is implemented in glibc or your standard C library, so I don't think this issue is related specifically to Perl. Perhaps you can use POSIX::strftime for _just_ the numerical/time bits, then feed this into sprintf.
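
A minimal sketch of that approach, assuming the locale's abbreviated month name is plain ASCII or comes back correctly decoded. The helper name day_month is hypothetical, and the utf8/open pragmas follow the other reply in this thread.

```perl
use strict;
use warnings;
use utf8;                                    # source literals are UTF-8 characters
use open OUT => ':encoding(UTF-8)', ':std';  # encode output on STDOUT/STDERR
use POSIX qw(strftime);

# Hypothetical helper: let strftime format each field on its own with an
# ASCII-only format, then join the pieces in Perl with any separator.
sub day_month {
    my ($sep, @t) = @_;
    my $day   = strftime('%d', @t);
    my $month = strftime('%b', @t);
    return sprintf '%s%s%s', $day, $sep, $month;
}

my @t = (0, 0, 0, 23, 5, 120);   # 23 June 2020; the year field counts from 1900
print 'hailed an über ‘cab’ on ', day_month("\x{A0}", @t), "\n";
```

This way the non-breaking space never passes through the C library, so only Perl's own string handling is involved.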

      Many thanks to both of you, and sorry for the somewhat belated answer. I have to admit I had worked under the impression that UTF-8 works out of the box with Perl, and had to read up a lot on that. I have now understood that you should not rely on that, even if it mostly looks like it. The fix using both use utf8 and use open with ':encoding(UTF-8)' worked perfectly for me as well. The suggestion of using strftime for the numbers only would definitely have been a workable fallback solution that I hadn't thought of.

      The only thing that really puzzles me is the different outcome between the non-breaking space and the other non-ASCII characters. But then, it seems Perl has to do quite complex things around Unicode.

      Thanks again, and best wishes!
