Re^6: treat files with umlauts (utf)

by hazylife (Monk)
on Apr 01, 2014 at 14:00 UTC (id://1080566)


in reply to Re^5: treat files with umlauts (utf)
in thread treat files with umlauts (utf)

It has nothing to do with the: encoding of STDIN ... encoding of @ARGV elements
Correct.
It's only about the text used to write the script and how Perl should parse that source text.
Yes, so...
use utf8;
my $scandir = 'something with umlauts it it';
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ...is this string literal not part of the source?
It has nothing to do with flagging variables
#!/usr/bin/perl

use strict;
use Devel::Peek;

{
    use utf8;
    my $var = 'für';
    print Dump \$var;
}

my $var = 'für';
print Dump \$var;
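
For readers without Devel::Peek at hand, roughly the same difference can be observed with the core functions utf8::is_utf8 and length. This is only a minimal sketch, assuming the script file itself is saved as UTF-8; the helper describe() is a name introduced here for illustration and is not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report the UTF8 flag and the character count of a string.
sub describe {
    my ($str) = @_;
    return sprintf 'utf8_flag=%d length=%d',
        (utf8::is_utf8($str) ? 1 : 0), length $str;
}

my ($with, $without);
{
    use utf8;          # the literal is decoded from UTF-8 into characters
    $with = 'für';
}
$without = 'für';      # the literal is kept as the raw bytes found in the file

print 'use utf8: ', describe($with),    "\n";
print 'no utf8:  ', describe($without), "\n";
```

With a UTF-8 source file, the first line should report the flag set and a length of 3, the second no flag and a length of 4, which matches the two Dump results the code above produces.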

Replies are listed 'Best First'.
Re^7: treat files with umlauts (utf)
by kcott (Archbishop) on Apr 02, 2014 at 01:30 UTC
    use utf8; my $scandir = 'something with umlauts it it';

    That's exactly the same code you invented, five nodes back, in your original post in this thread: "Re^2: treat files with umlauts (utf)". It is not code the OP posted (or even described in his narrative). My response is unchanged.

    # ...is this string literal not part of the source?

    That string literal is only part of the source you've invented.

    ... use Devel::Peek; ...

    Posting code without explaining why you're doing so is not particularly helpful.

    If you're referring to the output from that containing:

    FLAGS = (PADMY,POK,pPOK,UTF8)

    Then the UTF8 part of that is caused by the umlaut in 'für'. But, the OP's posted code contains no umlauts. Only your invented code contains umlauts.

    Change 'für' to 'fur', and you'll get:

    FLAGS = (PADMY,POK,pPOK)

    Just like the OP's posted code, this does not contain any umlauts and there's no UTF8 in the output.

    You can keep inventing code that requires use utf8 all you want but the OP's posted code contains no umlauts (or any other characters) that require use utf8.

    Please be very clear on these points:

    • The OP's posted code does not contain umlauts.
    • The OP's posted code does not include an assignment to $scandir.
    • The OP's posted code does not require use utf8;.

    -- Ken

      The OP's posted code does not include an assignment to $scandir
      It does not, but the OP does mention UTF-8, so $scandir being UTF-8 is a possibility, to say the least.
      The OP's posted code does not contain umlauts
      It doesn't have to be umlauts, what matters is whether $scandir is UTF-8. Make this change to your code and see what happens:
      use utf8; # just for utf8::upgrade
      # bytewise, this is already UTF-8...
      my $scandir = './pm_1080490_utf8_readdir';
      #... but we need to flag it as such for
      # the problem to manifest itself:
      utf8::upgrade $scandir;
      # now on to readdir
      the OP's posted code does not require use utf8
      Right, it does not. And use utf8 is not absolutely necessary in the above test code - use -CS/binmode or -CA to initialize $scandir.
      If you're referring to the output from that containing: FLAGS = (PADMY,POK,pPOK,UTF8) Then the UTF8 part of that is caused by the umlaut in 'für'.
      Does that code not answer your:
      [use utf8] has nothing to do with the:... flagging variables
      # under 'use utf8'
      FLAGS = (PADMY,POK,pPOK,UTF8)
      ...
      "f\303\274r"\0 [UTF8 "f\x{fc}r"]
      # no utf8
      FLAGS = (PADMY,POK,pPOK)
      ...
      "f\303\274r"\0
      Does the first variable have the UTF8 flag or does it not? What about the second variable? Aren't those two strings exactly the same?
      I'm out of this thread.
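
The utf8::upgrade step from the test code in this reply can be checked in isolation. The following is a minimal sketch that reuses the directory name purely as a string (no directory is read); the variable names $original, $flag_before and $flag_after are introduced here for illustration. It suggests that an ASCII-only value can carry the UTF8 flag without its characters changing.

```perl
use strict;
use warnings;

my $scandir  = './pm_1080490_utf8_readdir';   # ASCII only
my $original = $scandir;                      # unflagged copy for comparison

my $flag_before = utf8::is_utf8($scandir) ? 1 : 0;
utf8::upgrade($scandir);                      # switch the internal representation
my $flag_after  = utf8::is_utf8($scandir) ? 1 : 0;

print "flag before: $flag_before\n";
print "flag after:  $flag_after\n";
print 'still equal: ', ($scandir eq $original ? 'yes' : 'no'), "\n";
```

The flag goes from 0 to 1 while the string still compares equal to its unflagged copy, so the difference only shows up once the value is combined with byte strings, for example the names returned by readdir.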

        Hello all!

        Thank you all for your ideas and for the discussion, which taught me more about the internals of UTF-8 handling. I hope I will be able to work in the umlaut field without further problems.

        As to the cause of the problem (as I understand it now): you gave me the correct hints. It was not a problem with readdir but a problem with $scandir.

        I have a configuration xml file, which I read in using XML::Simple. $scandir is read from this file using something like my $scandir = $config->{external_systems}->{filesIN}.

        Now, the config file is stored in ISO-8859-1. It seems that with this construction, $scandir is not stored as UTF-8 but as ISO-8859-1, although there are no umlauts in the directory name!

        Now, when I concatenate $scandir with the result of readdir, it seems that a non-UTF-8 value (from the XML file) is concatenated with a UTF-8 value (from readdir). And as soon as there is an umlaut in the filename, the resulting string is invalid, causing "-f" to say "this is not a file".

        I solved it by writing

        my $scandir = …;
        utf8::downgrade($scandir);
        Then I could successfully read, copy and move the files.

        Hoping that this is the "correct" way of dealing with the problem and again thanks very much

        Mike
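
The mixing effect behind this fix can be reproduced without touching the file system. The sketch below rests on two assumptions: that readdir returned the file name as raw UTF-8 bytes, and that $scandir carried the UTF8 flag (the state that utf8::downgrade reverses), simulated here with utf8::upgrade. The directory and file name are made up for illustration.

```perl
use strict;
use warnings;
use bytes ();   # only for bytes::length; the pragma itself is not enabled

# Hypothetical file name as readdir might return it: raw UTF-8 bytes of "für.txt".
my $name = "f\xC3\xBCr.txt";

# Simulate a $scandir that arrived with the UTF8 flag set.
my $scandir = './dir';
utf8::upgrade($scandir);

my $flagged_path = "$scandir/$name";   # the byte string is upgraded as Latin-1
my $flagged_len  = bytes::length($flagged_path);

utf8::downgrade($scandir);             # the fix from the post
my $plain_path = "$scandir/$name";     # both parts are plain byte strings now
my $plain_len  = bytes::length($plain_path);

print "flagged:    $flagged_len bytes internally\n";
print "downgraded: $plain_len bytes internally\n";
```

The flagged path is two bytes longer internally because each of the two umlaut bytes is re-encoded during the implicit upgrade. Since file tests such as -f hand that internal buffer to the operating system, the name no longer matches the file on disk; after utf8::downgrade the bytes pass through unchanged.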
