http://qs321.pair.com?node_id=889217

lingaraj has asked for the wisdom of the Perl Monks concerning the following question:

sub http_fetch {
    my $tmp_file = $_[0];
    my ( $http_fetch_file ) = $_[1];
    my @tmp_content = ();
    #------------------------------------------------------ HTTP Fetch
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new;
    $ua->timeout(80);
    #$ua->show_progress(1);
    #$ua->agent('Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)');
    #$ua->proxy(["http"], "http://proxy.msat:80");
    #$ua->env_proxy;
    eval {
        my $response_hash = $ua->get($tmp_file);
        open(HTMLFH,">$http_fetch_file");
        print HTMLFH $response_hash->content;
        close(HTMLFH);
    };
    if ($@) {
        print "HTTP Error\n";
    }
    #------------------------------------------------------ Load Content Into Array
    else {
        open(HTMLFH, $http_fetch_file);
        @tmp_content = <HTMLFH>;
        close(HTMLFH);
    }
    return @tmp_content;
}
.... ... ... ... ... .... ....
open XMLFH ,">:utf8","$record{Curr_XML_File}";
print XMLFH $final_content;
close XMLFH;
Printing to the terminal works fine, but writing to the file gives some illegal characters.

Replies are listed 'Best First'.
Re: utf8 writing
by Eliya (Vicar) on Feb 20, 2011 at 22:21 UTC

    Your problem most likely is that Perl doesn't know that $final_content is encoded in UTF-8, because the content has never been decoded.

    Compare the following 4 cases.  Let's say you have the character Ω (Omega), Unicode number U+03A9. The UTF-8 encoding of this character is the two bytes CE A9.  Let's also assume you have a terminal that expects characters to be encoded in UTF-8.

    Case 1:

    my $text = "\xCE\xA9";   # Omega, UTF-8 encoded
    print $text;             # prints OK
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;          # wrong: C3 8E C2 A9

    This is what you have (presumably).  Perl doesn't know it is (or should be) handling an Omega, because it's never been told the two bytes CE A9 are supposed to represent an Omega.

    Thus, it treats it as two separate bytes when printing them to the terminal. The terminal sees CE A9, and, as it expects text to be encoded in UTF-8, renders them correctly.

    Not so, however, when you print to the file handle, which you've declared to be ":utf8". Here, Perl assumes the two bytes are two characters encoded in Latin-1 (the default assumption), and encodes them into UTF-8, producing the junk C3 8E (= 'Î'), and C2 A9 (= '©'), instead of the correct UTF-8 encoding for Omega, which would be CE A9.
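    The double encoding in case 1 can be reproduced without a terminal or a file by doing by hand roughly what the ":utf8" layer does on output. This is only a minimal sketch, assuming the core Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two undecoded bytes: the UTF-8 encoding of Omega, never decoded.
my $bytes = "\xCE\xA9";

# Roughly what a ">:utf8" layer does on output: each element of the string
# is taken as one character (U+00CE and U+00A9 here) and encoded to UTF-8.
my $double_encoded = encode("UTF-8", $bytes);

# Show the resulting bytes in hex.
my $hex = join " ", map { sprintf "%02X", ord($_) } split //, $double_encoded;
print "$hex\n";   # prints: C3 8E C2 A9
```

    The four bytes printed are the same junk that ends up in the file in case 1.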

    Case 2:

    use Encode;
    my $text = decode("UTF-8", "\xCE\xA9");
    print $text;             # wrong: "Wide character in print at..."
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;          # OK

    Here, we're telling Perl the input is UTF-8 encoded, by decoding it. So, Perl treats it as one character (Omega), and prints it correctly to the file. However, we've forgotten to tell Perl that the terminal expects UTF-8, so it warns "Wide character in print".  With that fixed, we get

    Case 3:

    use Encode;
    my $text = decode("UTF-8", "\xCE\xA9");
    binmode STDOUT, ":utf8";
    print $text;             # OK
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;          # OK

    That's how everything is supposed to be — no errors or warnings.

    But there's another one:

    Case 4:

    my $text = "\xCE\xA9";
    print $text;             # OK
    open FH, ">", "myfile" or die $!;   # no PerlIO encoding layer
    print FH $text;          # OK

    This also renders correctly in the terminal, and produces the right content in the file.  However, although this appears to be correct, it isn't, at least not if you want to treat the content as text. For example, if you wanted to match against Omega (i.e. \x{03A9})

    print "is Omega" if $text =~ /\x{03A9}/; # doesn't match!

    it wouldn't work, because Perl here (in case 4) internally handles two separate bytes, instead of one character.  The same line of code would work fine in case 3.
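    The difference between the undecoded bytes of case 4 and the decoded character of case 3 can be checked directly. The following is a small sketch using only the core Encode module; the variable names are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw  = "\xCE\xA9";              # as in case 4: two undecoded bytes
my $text = decode("UTF-8", $raw);   # as in case 3: one character, U+03A9

my $raw_len      = length $raw;
my $text_len     = length $text;
my $raw_matches  = $raw  =~ /\x{03A9}/ ? "yes" : "no";
my $text_matches = $text =~ /\x{03A9}/ ? "yes" : "no";

print "raw:     length $raw_len, matches Omega: $raw_matches\n";
print "decoded: length $text_len, matches Omega: $text_matches\n";
```

    The raw string has length 2 and does not match, while the decoded string has length 1 and does.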

    Note that although I'm using Encode's decode() routine in the examples, there are several other ways to decode data.  E.g., when reading from a file, you'd normally use a PerlIO layer with open, such as "<:encoding(UTF-8)".

    (See UTF8 related proof of concept exploit released at T-DOSE for why "<:encoding(UTF-8)", and not "<:utf8", when used as input layer.)
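    Decoding on input and encoding on output with PerlIO layers might look like the sketch below. It uses scratch files from the core File::Temp module so that it is self-contained; the file handling and variable names are assumptions for illustration, not taken from the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file holding the raw UTF-8 bytes for Omega (CE A9).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;                          # raw bytes, no encoding layer
print {$tmp_fh} "\xCE\xA9\n";
close $tmp_fh or die "close failed: $!";

# Read it back through a decoding layer: the data arrives as characters.
open my $in, "<:encoding(UTF-8)", $tmp_name
    or die "Cannot read $tmp_name: $!";
my $line = <$in>;
close $in;
chomp $line;

# Write it out through an encoding layer and check the size on disk.
my ($out_fh, $out_name) = tempfile(UNLINK => 1);
binmode $out_fh, ":encoding(UTF-8)";
print {$out_fh} $line;
close $out_fh or die "close failed: $!";

my $size = -s $out_name;                  # size of the written file in bytes
printf "characters read: %d, bytes written: %d\n", length($line), $size;
```

    One character is read and two bytes are written, so the Omega makes the round trip without being encoded twice.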

Re: utf8 writing
by GrandFather (Saint) on Feb 20, 2011 at 19:38 UTC

    Can you copy and paste a small part of the original text and the matching sample file contents into a code block in your node so we can see what the bad character is and where it came from?

    Are you sure you are using a UTF8 aware viewer to examine the file?

    True laziness is hard work
Re: utf8 writing
by ikegami (Patriarch) on Feb 20, 2011 at 20:44 UTC
    open(HTMLFH, $http_fetch_file);
    should be
    open(HTMLFH, '<:utf8', $http_fetch_file);

    Update: Fixed typo (s/utt8/utf8/)

Re: utf8 writing
by Monkomatic (Sexton) on Feb 21, 2011 at 00:58 UTC

    I use IO::Tee for these kinds of things. I'm not sure if it will fix your problem, though; the documentation says nothing about utf8.

    http://search.cpan.org/~kenshan/IO-Tee-0.64/Tee.pm

    use IO::Tee;
    open my $ofh, '>>', 'LOGFILE.txt' or die "Cannot append to 'LOGFILE.txt':$!";
    my $tee = IO::Tee->new(\*STDOUT, $ofh);   # Prints to both file and stdout
    #my $tee = IO::Tee->new(\*$ofh);          # Prints to file only
    print $tee "Opening $name";

    Worth a shot. Hope it helps. Very nice writeup in the previous reply; that was a hell of a lot of work.