
Diiference between these two filenames / strings

by Anonymous Monk
on Sep 12, 2019 at 17:07 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

What is the difference between the two filenames/strings below? The first one doesn't seem to need any UTF-8 encoding, but the second one does.

How can I check the difference between these two filenames/strings, or what condition would distinguish the two files?

  1. test1℗ὓ.txt #this works without further encoding
  2. 1669-SCC-HôpitauxdeSaint-Maurice-POC.PIF #but this needs utf-8 encoding.

Thanks.

<P> tags added by Grandfather to improve readability


Replies are listed 'Best First'.
Re: Difference between these two filenames / strings
by LanX (Saint) on Sep 12, 2019 at 18:50 UTC
    Please follow the advice given in SO:how-can-i-dump-a-string-in-perl-to-see-if-there-are-any-character-differences and dump both strings either with

     perl -MDevel::Peek -e 'Dump "ABC"'

    or

     use Data::Dumper;
     local $Data::Dumper::Useqq = 1;
     print(Dumper("ABC"));

    to see what's happening.
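As a rough sketch of that advice applied to the two names from the question: the code below assumes both names have already been decoded to characters (they are written with `\x{...}` escapes so the example does not depend on how this page or your editor encoded them), and `non_ascii_codepoints` is a hypothetical helper, not something from the original post.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # show non-ASCII characters as escapes
$Data::Dumper::Terse = 1;    # drop the "$VAR1 = " prefix

# Hypothetical helper: list the code points of all non-ASCII characters.
sub non_ascii_codepoints {
    my ($str) = @_;
    return map  { sprintf('U+%04X', ord($_)) }
           grep { ord($_) > 0x7F }
           split //, $str;
}

# Assumed to be the decoded forms of the two names in the question.
my @names = (
    "test1\x{2117}\x{1F53}.txt",
    "1669-SCC-H\x{F4}pitauxdeSaint-Maurice-POC.PIF",
);

for my $name (@names) {
    print Dumper($name);
    print "  non-ASCII: ", join(' ', non_ascii_codepoints($name)), "\n";
}
```

Running the same loop on the strings your program actually holds (rather than on literals) shows whether they contain characters or raw UTF-8 bytes.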

    We can't tell how your data-sources encoded your strings and you haven't even used code-tags to help us at least a bit.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery
    FootballPerl is like chess, only without the dice

    Update

    changed tiitle to make a difference

Re: Diiference between these two filenames / strings
by roboticus (Chancellor) on Sep 12, 2019 at 18:08 UTC

    Anonymous Monk:

    It's kind of hard to point to the difference from here: The strings you posted could have been encoded/decoded in various places between getting into your text editor and getting into the PerlMonks site. It could be that your local code page[1] supports the first string but not the second (forcing it to UTF-8).

    As a native ASCIIan, I tend to find Unicode frustrating. Not because it's frustrating in and of itself, but because with enough systems to go through, there's frequently a monkey in the middle causing difficulty.

    It's the same experience I had with XML--too many programmers[2] think you can create XML simply by doing a little string manipulation to wrap some stuff in tags and/or quotes. Then they insist that their "XML" file is valid, even when it violates *dozens* of rules laid out in the standard, and multiple XML validators insist that it isn't valid.

    Notes:

    [1] Assuming you're running Windows.

    [2] At least some 10+ years ago when I had to deal extensively with XML. The situation may suck less nowadays.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Diiference between these two filenames / strings
by daxim (Curate) on Sep 13, 2019 at 10:05 UTC
    All characters of string 2 are in Latin1, but some characters in string 1 are beyond codepoint 0xff.

    I'm making a guess: your program does not handle decoding properly, so you end up with string 2 encoded as Latin1 and string 1 decoded into Perl's internal format, which, when passed to the world outside the program, accidentally does the correct thing.

    This explanation fits the symptoms, but since you did not show any code, we can't be sure.
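One condition that separates the two names, sketched under the assumption that both are already decoded character strings: test whether any character lies above U+00FF. `fits_in_latin1` is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character is at or below U+00FF,
# i.e. the string could be represented in Latin-1.
sub fits_in_latin1 {
    my ($str) = @_;
    return $str !~ /[^\x00-\xFF]/;
}

# Assumed decoded forms of the two names from the question.
my $name1 = "test1\x{2117}\x{1F53}.txt";
my $name2 = "1669-SCC-H\x{F4}pitauxdeSaint-Maurice-POC.PIF";

for my $name ($name1, $name2) {
    printf "%d chars, fits in Latin-1: %s\n",
        length($name), fits_in_latin1($name) ? 'yes' : 'no';
}
```

This only describes the strings themselves; it does not tell you where in the program the decoding went wrong.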

      Hi Daxim,

      I think your hint did the trick.

      The following code, suggested by another monk, seems to work fine for me.

      I have checked it on filenames in different languages, such as Chinese, Japanese, Danish, Polish, Spanish and, of course, English.

      use Encode;

      $filename = decode_it($filename);
      $filename = encode('UTF-8', $filename);

      #---------------------------------------
      # Try strict UTF-8 first; if that croaks, fall back to Latin-1.
      sub decode_it {
          my $s = shift;
          eval {
              $s = decode('UTF-8', $s, 1);    # 1 == Encode::FB_CROAK
              1;
          } or do {
              $s = decode('latin1', $s, 1);
          };
          return $s;
      }

      Thank you. Cheers.

Re: Diiference between these two filenames / strings
by Anonymous Monk on Sep 12, 2019 at 20:03 UTC

    There are a few complicating factors here. First you need to understand how character encoding works. For that I would suggest a thorough reading of https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.

    With that in mind, your source code files are usually saved in UTF-8, and strings that come into your program through filehandles or command line arguments also arrive as UTF-8-encoded bytes, unless you set layers that decode them. The source code itself is automatically decoded if you "use utf8;". Input strings can be decoded either with an ":encoding(UTF-8)" layer on the handle or explicitly with Encode. It's usually a good idea to work with decoded strings, because then your string logically contains the characters you think it does, instead of the bytes that represent them in UTF-8.

    But the problem is that you're dealing with filenames here, and filenames are all sorts of broken in Perl.* Perl essentially treats them as their *internal* bytes, like a buggy XS module might, regardless of what the *logical* contents of the string are. That's why there is a difference here even though you didn't do anything wrong. The first string can't be represented in your native encoding, so its internal bytes are accidentally the correct UTF-8 encoding of your filename. The second string can, so its internal bytes are not the same as the UTF-8 encoding, unless you utf8::upgrade it to change the internal encoding, or use Encode to explicitly change its logical contents to the UTF-8 encoding of the string. My recommendation would be: use decoded strings in general, and encode them explicitly for use as a filename.

    * https://rt.perl.org/Public/Bug/Display.html?id=130831
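A minimal sketch of that recommendation, assuming the filesystem expects UTF-8 filenames (typical on Linux; other systems may need a different encoding). `filename_bytes` is a hypothetical helper, and the files are created in a temporary directory only for demonstration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: turn a decoded name into the bytes handed to the OS.
# Assumption: the filesystem wants UTF-8.
sub filename_bytes {
    my ($name) = @_;
    return encode('UTF-8', $name);
}

# Decoded (character) forms of the two names from the question.
my @names = (
    "test1\x{2117}\x{1F53}.txt",
    "1669-SCC-H\x{F4}pitauxdeSaint-Maurice-POC.PIF",
);

my $dir = tempdir(CLEANUP => 1);    # scratch directory, removed at exit
my @paths;

for my $name (@names) {
    # Encode explicitly, so the result does not depend on how Perl
    # happens to store the string internally.
    my $path = File::Spec->catfile($dir, filename_bytes($name));
    open(my $fh, '>', $path) or die "Cannot create file: $!";
    close($fh);
    push @paths, $path;
}

printf "created %d files\n", scalar @paths;
```

With this approach both names take the same code path, so neither depends on the accident of internal representation described above.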

Re: Diiference between these two filenames / strings
by Anonymous Monk on Sep 14, 2019 at 03:12 UTC

    Thanks Monks, for all your suggestions and information.

