suaveant has asked for the wisdom of the Perl Monks concerning the following question:

A company distributing data via spreadsheet (grrr) or PDF (double grrr) has decided that it'd be a good idea to upgrade to Excel 2007 format. First I had to realize this and switch from Spreadsheet::ParseExcel to Spreadsheet::XLSX. That works pretty well, but I am running into character encoding issues: a string which should be "OP«’ES" is coming out as "OP√√ES".

Spreadsheet::XLSX gives the ability to pass a converter to the parser, but I have no idea what conversion I should use. Anyone have any idea?

                - Ant
                - Some of my best work - (1 2 3)

Re: Character coding issues with Spreadsheet::XLSX
by graff (Chancellor) on Feb 20, 2009 at 01:50 UTC
    If M$ somehow decided that Excel 2007 would not change the way Unicode is handled in spreadsheets, then this might help you out: xls2tsv uses the old Spreadsheet::ParseExcel, but if the Unicode handling hasn't changed, it should give you a consistent clue about when you need to decode() from UTF-16BE into utf8 to get what you want.

    Then again, if M$ did decide to change their Unicode handling in Excel, you might need to get some sort of hex-dump picture of the character data in the cells of interest. Save a spreadsheet with known non-ASCII characters in selected cells, and you should be able to work out what needs to be done.
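    One way to get that hex-dump picture is a small helper along these lines. This is only a sketch: hex_dump is a hypothetical name (it is not part of either spreadsheet module), and the sample string stands in for whatever value you pull out of a cell.

```perl
use strict;
use warnings;

# Hypothetical helper: show each element of a string as a hex number.
# On an undecoded byte string you see one value per byte; on a decoded
# character string you see one code point per character.
sub hex_dump {
    my ($val) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $val);
}

# Sample value (an assumption): "caf" followed by the two UTF-8 bytes for e-acute.
my $val = "caf\xC3\xA9";
print hex_dump($val), "\n";    # prints "63 61 66 C3 A9"
```

    If a known non-ASCII character shows up as two or three values in the C2..EF range followed by values in 80..BF, the string is most likely raw UTF-8 bytes that still need decoding. A single value such as E9 or 2019 suggests it has already been decoded into characters.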

      Well... that wasn't exactly it but was close enough to get me there... decode('utf8',$val) did it for me, thanks!

                      - Ant
                      - Some of my best work - (1 2 3)

        Where did you get the decode() routine from? Which Perl module did you use? We are having the same problem, but your answer was incomplete. Thanks in advance for your help, Jodyman
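        For anyone landing here with the same question: a decode() with that calling convention is exported by the Encode module, which ships with Perl. The thread does not say which module was actually used, so treat the following as a sketch under that assumption. The sample bytes are made up for illustration; in practice $val would be the raw string read from the spreadsheet cell.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample input (an assumption): "OP", then the three UTF-8 bytes for
# U+2019 RIGHT SINGLE QUOTATION MARK, then "ES".
my $val = "OP\xE2\x80\x99ES";

# Turn the raw UTF-8 bytes into a Perl character string.
my $text = decode('utf8', $val);

printf "%d bytes in, %d characters out\n", length($val), length($text);
# prints "7 bytes in, 5 characters out"
```

        When printing the decoded text, it usually also helps to set an output layer first, for example binmode(STDOUT, ':encoding(UTF-8)'), so that Perl encodes the characters on the way out instead of warning about wide characters.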