PerlMonks  

Win32::OLE and Word checkbox characters

by Cody Fendant (Hermit)
on Apr 04, 2019 at 23:07 UTC ( [id://1232168]=perlquestion )

Cody Fendant has asked for the wisdom of the Perl Monks concerning the following question:

I've got a script which exports Word documents to HTML. It works well for the most part, but when there are checkboxes in the content, they don't appear in the HTML version.

The Word document definitely has different characters, because in the original content I can see (not literal characters, my ASCII-art version of them):

[X] Yes
[ ] No

When it gets converted to HTML, all I get, for both [X] and [ ] chars, is (hex) C2A0, which I believe is just "non-breaking space".

Is there anything I can do about this? I have the latest version of Win32::OLE as far as I can see, 0.1712, but that's five years old.

Is there a flag I can set, is it a Unicode thing, is there some other approach to this? Any ideas gratefully received.

Replies are listed 'Best First'.
Re: Win32::OLE and Word checkbox characters
by kcott (Archbishop) on Apr 05, 2019 at 06:22 UTC

    G'day Cody Fendant,

    I can comment on the "characters" part. I'm not an MSWin user, so I'm unable to help with the "Win32::OLE and Word" part.

    '... all I get, ..., is (hex) C2A0, which I believe is just "non-breaking space".'

    C2 is LATIN CAPITAL LETTER A WITH CIRCUMFLEX (Â). A0 is NO-BREAK SPACE ( ). You can see both in the PDF: Unicode Code Chart: C1 Controls and Latin-1 Supplement.

    C2A0 (슠) is in the PDF: Unicode Code Chart: Hangul Syllables. There are no formal names shown for any characters in that block of Unicode characters (AC00–D7AF).

    My gut feeling is that this is related to different encodings in the Word and HTML documents. Another monk may be able to help further with that. If you supplied some code showing the conversion from Word to HTML you might get a better answer.

    — Ken

      C2 A0 is the UTF-8 encoding of U+00A0 NO-BREAK SPACE:
      $ echo -e '\xc2\xa0'| perl -Mcharnames=:full -CI -wnE 'say charnames::viacode(ord)'
      NO-BREAK SPACE
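
The same check can be written as a small script. This is only a sketch: it assumes the two bytes from the question really are UTF-8 output, and the variable names are made up for illustration. It decodes the byte pair C2 A0 with the core Encode module and, for contrast, looks up the single code point U+C2A0.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ':full';

# The two bytes reported in the question (assumed to be UTF-8 output).
my $bytes = "\xC2\xA0";

# Decoded as UTF-8, the two bytes collapse into one character.
my $char = decode('UTF-8', $bytes);
printf "As UTF-8 bytes: U+%04X %s\n", ord($char), charnames::viacode(ord $char);

# Read as a single code point instead, C2A0 is a different character entirely.
printf "As code point:  U+%04X %s\n", 0xC2A0, charnames::viacode(0xC2A0) // '(unnamed)';
```

The first line of output should name NO-BREAK SPACE, which matches the one-liner above: the HTML contains an ordinary non-breaking space encoded as UTF-8, not a Hangul syllable.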

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Win32::OLE and Word checkbox characters
by Veltro (Hermit) on Apr 05, 2019 at 07:54 UTC

    From what I understand of the documentation, you can change the code page that Win32::OLE uses. From the docs: "The default value is CP_ACP, which is the default ANSI codepage. Other possible values are CP_OEMCP, CP_MACCP, CP_UTF7 and CP_UTF8. These constants are not exported by default."

    I am not sure how to do this, but try:

    use Win32::OLE qw( CP_UTF8 );
    Win32::OLE->Option( CP => CP_UTF8 );
    my $word = Win32::OLE->new ('Word.Application', ...
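
To illustrate why the code page could matter, here is a rough sketch that runs without Word or Win32::OLE. It rests on two assumptions that the thread does not confirm: that the checked box is U+2612 BALLOT BOX WITH X, and that the ANSI code page behaves like cp1252. It shows only that a single-byte code page cannot carry such a character while UTF-8 can; it does not claim to reproduce what the original script does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: the checked box is U+2612 BALLOT BOX WITH X.
my $box = "\x{2612}";

# cp1252 is a hypothetical stand-in for the default ANSI code page.
# It has no slot for U+2612, so Encode falls back to a substitute character.
my $ansi = encode('cp1252', $box);

# UTF-8 can represent every Unicode character, so nothing is lost.
my $utf8 = encode('UTF-8', $box);

printf "cp1252 bytes: %s\n", uc unpack('H*', $ansi);
printf "UTF-8 bytes:  %s\n", uc unpack('H*', $utf8);   # E29892
```

If the real export loses its checkboxes for a similar reason, switching the CP option to CP_UTF8 as suggested above would be the first thing to try.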

Node Type: perlquestion [id://1232168]
Approved by Paladin