PerlMonks  

Parse::RecDescent grammar for RTF

by Courage (Parson)
on Nov 01, 2002 at 07:04 UTC (id://209647)

Courage has asked for the wisdom of the Perl Monks concerning the following question:

I'm searching for a ready-to-use grammar for Parse::RecDescent that can process RTF files.

Unfortunately, the existing RTF parser on CPAN is quite old and, what is most unpleasant, not very stable (it refuses to parse a valid RTF file because of some unbalanced thingies or something like that).

Could someone please help with my search?

Thanks in advance,
Courage, the Cowardly Dog

Replies are listed 'Best First'.
Re: Parse::RecDescent grammar for RTF
by Willard B. Trophy (Hermit) on Nov 02, 2002 at 13:50 UTC
    I don't know of an RTF grammar for Parse::RecDescent. RTF is a bit of a mess, structurally, so parsing isn't trivial. Even major applications write RTF that isn't quite standard.

    I, too, have been burnt by the RTF parser on CPAN (RTF::Parser); it nearly did what I wanted, but is exceedingly hard to customize. Basically, if RTF::Parser doesn't do exactly what you want out of the box (and it can; its HTML output is pretty cool), look elsewhere. Thus speaks the voice of bitter experience.

    A low-level solution which works for me is RTF::Tokenizer, on which I've based a production system for converting RTF dictionary data to Quark XPress tags. RTF::Tokenizer has its quirks; give me a yell if you need help.

    --
    $,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"

Re: Parse::RecDescent grammar for RTF
by PodMaster (Abbot) on Nov 02, 2002 at 15:15 UTC
    How's your C knowledge?

    You can find the RTF specification at http://www.wotsit.org/ (as you can for almost any file format a programmer might need help with), and it includes a sample C reader (Appendix A).

    It doesn't look like it'd be too hard to develop a grammar, although it looks like it'd be lots easier to just develop a parser ;)
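
    To give a feel for what developing a parser by hand involves, here is a minimal sketch of a tokenizer using only core Perl. It is not the C reader from the specification and not the API of any CPAN module; the function name tokenize_rtf and the token layout [type, value, parameter] are made up for illustration. It assumes the basic RTF syntax (groups in braces, control words, control symbols, hex escapes, plain text) and ignores harder cases such as \bin data.

```perl
use strict;
use warnings;

# Hypothetical helper: split an RTF string into a flat list of tokens.
# Each token is an array ref: [type, value, parameter].
sub tokenize_rtf {
    my ($rtf) = @_;
    my @tokens;
    for ($rtf) {    # alias $_ so every \G match below shares one position
        while (1) {
            if    (/\G\{/gc) { push @tokens, ['group_start'] }
            elsif (/\G\}/gc) { push @tokens, ['group_end'] }
            # control word: backslash, letters, optional signed number,
            # plus one optional delimiting space that is swallowed
            elsif (/\G\\([A-Za-z]+)(-?\d+)?[ ]?/gc) {
                push @tokens, ['control', $1, $2];
            }
            # hex escape such as \'e9 (must be tried before control symbols)
            elsif (/\G\\'([0-9A-Fa-f]{2})/gc) { push @tokens, ['hex', $1] }
            # control symbol: backslash plus one non-letter (\~ \{ \\ ...)
            elsif (/\G\\([^A-Za-z])/gcs) { push @tokens, ['symbol', $1] }
            # bare line breaks in the file are skipped
            elsif (/\G[\r\n]+/gc) { }
            elsif (/\G([^\\{}\r\n]+)/gc) { push @tokens, ['text', $1] }
            else { last }
        }
        # anything left over (e.g. a lone trailing backslash) is an error
        my $pos = pos($_) // 0;
        die "Unparsed input at offset $pos\n" if $pos < length($_);
    }
    return @tokens;
}

# Usage: print one token per line.
for my $t (tokenize_rtf('{\rtf1\ansi Hello\~world\par}')) {
    print join(' ', map { defined $_ ? $_ : '' } @$t), "\n";
}
```

    A real reader would still have to track group nesting and destination state on top of this token stream, which is where most of the work lies.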

    Anybody looking for a project this smells like a good one.

    update:

    If you're looking for strategy, try looking at a LaTeX parser, because LaTeX and RTF look very similar, if you ask me.

    I'm surprised there isn't an open-source library already out there to do this (I know there is a non-free one that looks like it'd be useful).

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      I'd revise your statement:

      > Anybody looking for a project this smells like a good one

      to:

      Anybody looking for a project, this smells.

      RTF is an unpleasant format. The basic spec might be okay for creating a writer, but creating a reader that will handle arbitrary RTF is quite hard.

      Thus speaks someone who has just spent the last five months dealing with RTF parsing.

      --
      $,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"

      Thank you for a very interesting link, I'll save it for the future.

      Since it looks like a ready-to-use grammar does not currently exist, I'll try writing one myself and will post it on this site.
      However, I expect it to be extremely slow at parsing.

      I'll let you know about my further results.

      Courage, the Cowardly Dog

Re: Parse::RecDescent grammar for RTF
by graff (Chancellor) on Nov 02, 2002 at 13:48 UTC
    Having just installed RTF::Parser on my Linux laptop and read the README for that module, I don't think it's in a usable state -- e.g. there doesn't appear to be any documentation on how to use it. (There also wasn't any explicit mention of how old it is, but the downloaded files under .cpan/build date from July 1999.) This looks like a dead-end, orphaned module. (Should CPAN have something like a garbage-collection process to clear away stuff like this?)

    On the other hand, RTF::Tokenizer seems to be quite current, and is documented. You didn't mention what you need to do, but maybe this module will be able to help you out.

      I've chatted with Pete Sergeant; I don't think that RTF::Tokenizer will be developed further. It is useful, though, if you are careful to preprocess special characters before running the tokenizer. It really, really doesn't like \~ codes for non-breaking spaces.
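
      One way such a preprocessing pass might look, as a sketch only: rewrite every \~ into the hex escape \'a0 before the text reaches the tokenizer. The function name protect_nbsp is hypothetical, the choice of \'a0 assumes an ANSI code page in which 0xA0 is the non-breaking space, and whether a given tokenizer is happier with that form is also an assumption to check against your own data.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the RTF control symbol \~ with \'a0.
# An escaped backslash (\\) is matched first and kept as-is, so the "~"
# in "\\~" (a literal backslash followed by a tilde) is left alone.
sub protect_nbsp {
    my ($rtf) = @_;
    $rtf =~ s/(\\\\)|\\~/defined $1 ? $1 : "\\'a0"/ge;
    return $rtf;
}

# Usage: clean the raw RTF first, then hand it to the tokenizer.
my $clean = protect_nbsp('{\rtf1 one\~two}');
print "$clean\n";
```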

      --
      $,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"

Approved by toma
Front-paged by Aristotle