http://qs321.pair.com?node_id=305320

zakzebrowski has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a pseudo-Perl-related question: how do I bulk convert WordPerfect documents to text? I've tried two approaches, neither of which works in all cases... One approach is to use the program "wpd2sxw", which takes a WordPerfect document and converts it to an sxw file, which really is a jar and an XML description file. Using File::Find and a bunch of system calls, I was able to bulk convert documents to text reasonably well. The problem is that this method does not support earlier WordPerfect versions.

I then tried using Microsoft Word with Win32::OLE, File::Find, Win32::Process, and Win32::Process::Info to import WordPerfect documents and export them as text. I do this by starting Microsoft Word via OLE, opening a document, copying and pasting the text contents (Word is strange) into a new document, and saving that document as text, and it seems to work fine. The problem is that I've been getting a bunch of error messages which, when they occur, require special handling. (In one case a dialogue box comes up and you have to press OK, which I can't trap for, and in another case an OLE exception is raised, which I can trap for...) Plus this method takes a *long* time to process, so I'm looking for something better...

This has led me to yet another Google search. From what I can tell, Corel ships "cv.exe" and "convert.exe", which convert wpd documents to text; however, they are not available in the current WordPerfect installation (or they are not available in the install we're using)...

So, before I start the while(1){$me->bangHeadAgainstWall();} loop, is there any better way of doing this?

Cheers, Zak


----
Zak
undef$/;$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\nutsr\nuts";open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print;

Replies are listed 'Best First'.
Re: (Sorta OT) - Historic Wordperfect Conversion
by Rich36 (Chaplain) on Nov 07, 2003 at 16:24 UTC

    Google for "PerfectScript" (maybe two words) and VB. PerfectScript is Corel's language for accessing their applications/documents. There are examples out there on how to call PerfectScript from VB using OLE. So you shouldn't have any problems porting that over to Win32::OLE.

    I'm pretty sure you have to have WordPerfect installed to use PerfectScript, though.


    «Rich36»
Re: (Sorta OT) - Historic Wordperfect Conversion
by talexb (Chancellor) on Nov 07, 2003 at 15:59 UTC

    I wrote a C program that converted from WP5 to sorta XML/XHTML ages ago -- I just reverse engineered the format, it wasn't too hard.

    I would probably approach the problem the same way today -- write a piece of C code that does the conversion, rather than trying to simulate C code in Perl. Of course, your time line may prevent you from doing that.

    --t. alex
    Life is short: get busy!

      Heh, I did exactly the same thing! I still remember parts of the specs.

      There's a series of header blocks that define global properties of the document, and can be pretty much safely ignored. In the first dozen bytes or so there's a word (i.e. 16 bit value) that contains the offset in the file that is the first byte past the header blocks. I don't remember the offset, nor do I remember the endianness.

      After the header blocks it's pretty simple. Characters 0x00-0x7f are emitted as is. There are also fixed-length and variable-length blocks embedded in the stream that you have to deal with. (These correspond to the WordPerfect markup, à la "Reveal Codes").

      Fixed-length blocks are introduced with a marker byte in the range of 0xc0-0xcf. There follows a fixed number of characters, followed by a matching 0xc0-0xcf.

      Variable-length blocks are a bit trickier. They are introduced with a marker byte in the range of 0xb0-0xbf. The first word after the marker byte encodes the length of the block. At the end of the block you also have a matching 0xb0-0xbf marker byte.

      With a hex-viewer you should be able to puzzle it out. It's funny how time changes things. The format used to be documented in several places on the Web. Right now I can't find a single useful/complete definition any more. Hmm, nor do I have my old source code to this application anymore either. Hmmmm.
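
      A minimal Perl sketch of the body-stream rules described above. It follows the byte ranges as remembered in this post and makes several assumptions that would need checking against real files with a hex viewer: the caller supplies the body offset (`$body_start`, a hypothetical parameter, since the header word's location isn't given), the length word is little-endian and counts only the payload bytes, a fixed-length block is skipped by scanning for the next matching marker byte (the per-marker lengths aren't listed here), and any other byte above 0x7f is simply dropped. `wp_body_to_text` is an illustrative name, not part of any module.

```perl
use strict;
use warnings;

# Extract the plain text from the body of a WP5-style byte stream.
#   $data       - the whole file as a byte string (read with binmode)
#   $body_start - offset of the first byte past the header blocks
#                 (ASSUMPTION: supplied by the caller)
sub wp_body_to_text {
    my ($data, $body_start) = @_;
    my $len  = length $data;
    my $pos  = $body_start;
    my $text = '';

    while ($pos < $len) {
        my $byte = ord substr($data, $pos, 1);

        if ($byte <= 0x7f) {
            # Plain characters are emitted as is.
            $text .= chr $byte;
            $pos++;
        }
        elsif ($byte >= 0xc0 && $byte <= 0xcf) {
            # Fixed-length block. ASSUMPTION: the payload never contains
            # the marker byte, so we can scan for the matching close.
            my $end = index($data, chr($byte), $pos + 1);
            last if $end < 0;              # truncated block: stop
            $pos = $end + 1;
        }
        elsif ($byte >= 0xb0 && $byte <= 0xbf) {
            # Variable-length block: marker, 16-bit length, payload, marker.
            # ASSUMPTION: little-endian length that counts payload bytes only.
            last if $pos + 3 > $len;
            my $block_len = unpack 'v', substr($data, $pos + 1, 2);
            my $end = $pos + 3 + $block_len;   # where the closing marker should be
            last if $end >= $len;
            last if ord(substr($data, $end, 1)) != $byte;  # layout guess was wrong
            $pos = $end + 1;
        }
        else {
            # Anything else above 0x7f is not covered by the description; skip it.
            $pos++;
        }
    }
    return $text;
}
```

      To try it on a file, read the file in raw mode (`open my $fh, '<:raw', $path`), slurp it into a scalar, and pass that along with whatever body offset the hex viewer suggests.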

Re: (Sorta OT) - Historic Wordperfect Conversion
by jZed (Prior) on Nov 07, 2003 at 16:06 UTC
    Well, it ain't perl and it ain't open and it ain't free, but it works and has a batch mode: Conversions Plus.
      FYI, I've now processed many, many documents with this method, and its speed is beautiful compared to the method I wrote up.... Thanks! :)


      ----
      Zak
      undef$/;$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\nutsr\nuts";open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print;
Re: (Sorta OT) - Historic Wordperfect Conversion
by Anonymous Monk on Nov 10, 2003 at 05:33 UTC
    Hi Zak,

    Get the cvt utility.

    I've been converting hundreds of documents per day for years on an SCO machine running WP 5.1. If I recall correctly, it worked on a Linux system using WP 8 as well. The cvt utility provided by WP is very reliable.

    I use it as follows:
    $WP5 = "/u/wp/shbin/cvt51";
    $WP8 = "/u/wp8/shbin10/cvt";
    # Use the WP 5.1 converter if it is installed, otherwise the WP 8 one.
    $CVT = -e $WP5 ? $WP5 : $WP8;
    system("$CVT $TMP1 -o $TMP2 asci >/dev/null </dev/null");
    I'm not sure where you can get WP5/8. I think I once found one on eBay.
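
    For bulk conversion, the same command can be driven from File::Find. This is only a sketch under some assumptions: the `.wpd` extension, the `.txt` output name, and the helper names (`find_wpd_files`, `shell_quote`, `cvt_command`, `convert_tree`) are all made up for illustration, and the cvt arguments are copied from the command line above rather than from any documentation.

```perl
use strict;
use warnings;
use File::Find;

# Collect every *.wpd file under $root (ASSUMPTION: files use that extension).
sub find_wpd_files {
    my ($root) = @_;
    my @files;
    find({
        no_chdir => 1,    # keep $_ as the full path so -f works without chdir
        wanted   => sub { push @files, $File::Find::name if -f && /\.wpd\z/i },
    }, $root);
    my @sorted = sort @files;
    return @sorted;
}

# Quote a path for /bin/sh so names with spaces or quotes survive.
sub shell_quote {
    my ($s) = @_;
    $s =~ s/'/'\\''/g;
    return "'$s'";
}

# Build the same command line as in the post, for one input/output pair.
sub cvt_command {
    my ($cvt, $in, $out) = @_;
    return "$cvt " . shell_quote($in) . " -o " . shell_quote($out)
         . " asci >/dev/null </dev/null";
}

# Convert every file under $root; returns the inputs whose conversion failed.
sub convert_tree {
    my ($cvt, $root) = @_;
    my @failed;
    for my $in (find_wpd_files($root)) {
        (my $out = $in) =~ s/\.wpd\z/.txt/i;   # foo.wpd -> foo.txt
        system(cvt_command($cvt, $in, $out)) == 0
            or push @failed, $in;
    }
    return @failed;
}
```

    Calling `convert_tree($CVT, "/some/dir")` would then hand back the list of files that cvt refused, which can be retried with one of the other methods in this thread.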

    Bob drb@chicopmr.org
Re: (Sorta OT) - Historic Wordperfect Conversion
by Anonymous Monk on Nov 09, 2003 at 13:15 UTC
    Take a look at:

    http://www.sandelman.ottawa.on.ca/SSW/wp2x/wp2x.html

    This is an older unix program with source code that will convert WP to most anything. If it doesn't do the job, it may save some time dissecting your WP files.