PerlMonks  

Parsing UTF-8 characters (Å is changed to Ã)

by ashesh28 (Initiate)
on Aug 24, 2016 at 02:57 UTC

ashesh28 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I'm confused and have no clue what is causing this problem. I have written a Perl script that extracts rows of data from a SharePoint list using the SOAP::Lite module. The extraction works properly, but in certain cases special characters are changed when I open the CSV file created by the script. Below is the part of the code that extracts the mentioned columns and writes them to a CSV file. I am already handling the data as UTF-8.

    my $element_rowlimit = name( 'rowLimit' => 10000 );

    #print $soap->serializer->envelope( 'method' => 'GetListItems', $element_listname, $element_query, $element_rowlimit );
    my $som = $soap->GetListItems( $element_listname, $element_query, $element_rowlimit );
    my @results = $som->dataof('//GetListItemsResult/listitems/data/row');

    my $oc = Text::CSV->new({sep_char => ',', eol => $/ })
        or die Text::CSV->error_diag();
    open my $of, '>', 'Load_Data.csv' or die $!;
    binmode $of, ':utf8';

    chomp @results;
    foreach my $data (@results) {
        my $item = $data->attr;
        chomp $item;
        $oc->print($of, [@$item{qw( ows_Job_x0020_ID ows_Justification )}]);
    }
    close $of;

Let's say the value of the Justification column in SharePoint is: "RMS Roughness (Rq) is ~3.7Å for both wafers." When extracted by Perl, the comment is changed to the following: "RMS Roughness (Rq) is ~3.7Ã for both wafers."

Replies are listed 'Best First'.
Re: Parsing UTF-8 characters (Å is changed to Ã)
by ablanke (Monsignor) on Aug 24, 2016 at 08:16 UTC
    Hi,

    You have to make sure that Perl knows the right encoding of your input data as well.

    Hexdump your data and you can be more certain about what you've got.

    This article could be helpful: http://perlmeister.com/lme/prod-0708.pdf
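As a minimal sketch of the hex-dump suggestion, the following prints the code units of a string in hex so you can see whether it holds decoded characters or encoded bytes. The helper name `hexdump` and the sample strings are assumptions made for illustration; they are not part of the original script.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: return each element of a string (character or byte)
# as two-digit hex, separated by spaces.
sub hexdump {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

my $decoded = "\x{C5}";                     # "Å" as a single character (U+00C5)
my $encoded = encode('UTF-8', $decoded);    # the same text as UTF-8 bytes
my $double  = encode('UTF-8', $encoded);    # UTF-8 bytes encoded a second time

print hexdump($decoded), "\n";    # C5
print hexdump($encoded), "\n";    # C3 85
print hexdump($double),  "\n";    # C3 83 C2 85
```

Running the same helper on `$item->{ows_Justification}` before it is written should show which of these three forms the data is in.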

Re: Parsing UTF-8 characters (Å is changed to Ã)
by ikegami (Patriarch) on Aug 24, 2016 at 18:46 UTC

    It's probably a case of double-encoding.

    "Å" encoded using UTF-8 and then encoded using UTF-8 a second time would appear as "Ã" (followed by a control character) on a terminal expecting UTF-8.

    For example, the following produces something that looks like "Ã□" on my terminal:

    perl -e'
        use open ":std", ":encoding(UTF-8)";
        use utf8;
        use feature qw( say );
        $_ = "Å";
        utf8::encode($_);  # XXX Bug. Already handled by ":encoding".
        say;
    '

    In context, this would indicate that $item->{ows_Justification} contains text encoded using UTF-8 rather than decoded text as one would expect.
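If that diagnosis holds, one possible fix is to decode each value before handing it to a file handle that has an encoding layer. The sketch below assumes the field really contains UTF-8 bytes; `decode_utf8_field` is a hypothetical helper, and `$raw` is a stand-in for `$item->{ows_Justification}`. Values that are not valid UTF-8 are returned unchanged, which is only a heuristic.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: treat $value as UTF-8 bytes and return decoded text.
# If the value is not valid UTF-8, return it unchanged.
sub decode_utf8_field {
    my ($value) = @_;
    return $value unless defined $value;
    my $bytes = $value;    # copy: decode() with a CHECK flag may modify its argument
    my $text  = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return defined $text ? $text : $value;
}

my $raw  = "~3.7\xC3\x85";    # stand-in for $item->{ows_Justification}
my $text = decode_utf8_field($raw);

# prints "5 characters, last is U+00C5"
printf "%d characters, last is U+%04X\n", length($text), ord(substr($text, -1));
```

With the values decoded this way, the existing `binmode $of, ':utf8'` layer would encode them exactly once on output.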
