Courage has asked for the wisdom of the Perl Monks concerning the following question:
I'm searching for a ready-to-use grammar for Parse::RecDescent that can process RTF files.
Unfortunately, the existing RTF parser on CPAN is quite old and, more annoyingly, not very stable (it refuses to parse a valid RTF file because of some unbalanced thingies or something like that).
Could someone please help with my search?
Thanks in advance,
Courage, the Cowardly Dog
Re: Parse::RecDescent grammar for RTF
by Willard B. Trophy (Hermit) on Nov 02, 2002 at 13:50 UTC
I don't know of an RTF grammar for Parse::RecDescent. RTF is a bit of a mess, structurally, so parsing isn't trivial. Even major applications write RTF that isn't quite standard.
I, too, have been burnt by the RTF parser on CPAN (RTF::Parser); it nearly did what I wanted, but is exceedingly hard to customize. Basically, if RTF::Parser doesn't do exactly what you want out of the box (and it can; its HTML output is pretty cool), look elsewhere. Thus speaks the voice of bitter experience.
A low-level solution which works for me is RTF::Tokenizer, on which I've based a production system for converting RTF dictionary data to Quark XPress tags. RTF::Tokenizer has its quirks; give me a yell if you need help.
--
$,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc"))
{$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"
Re: Parse::RecDescent grammar for RTF
by PodMaster (Abbot) on Nov 02, 2002 at 15:15 UTC
How's your C knowledge?
You can find the RTF specification at http://www.wotsit.org/ (as you can for almost any file format a programmer might need help with), and it includes a sample C reader (Appendix A).
It doesn't look like it'd be too hard to develop a grammar, although it looks like it'd be lots easier to just develop a parser ;)
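As a rough illustration of the hand-written route, here is a minimal tokenizer sketch in core Perl. It is not taken from the RTF specification's sample reader or from any CPAN module; the name tokenize_rtf and the token layout are made up for this example. It only splits input into group delimiters, control words, control symbols, and text, and it deliberately ignores harder cases such as \bin data and the fallback characters that follow \u.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any CPAN module): split an RTF string
# into a flat list of [type, value, parameter] array refs.
sub tokenize_rtf {
    my ($rtf) = @_;
    my @tokens;
    pos($rtf) = 0;
    while (pos($rtf) < length $rtf) {
        # Raw CR/LF in the file carry no meaning and are skipped.
        if    ($rtf =~ /\G[\r\n]+/gc) { next }
        elsif ($rtf =~ /\G\{/gc)      { push @tokens, [ group => 'open' ] }
        elsif ($rtf =~ /\G\}/gc)      { push @tokens, [ group => 'close' ] }
        # Control word: letters, optional signed number, optional one-space delimiter.
        elsif ($rtf =~ /\G\\([A-Za-z]+)(-?\d+)? ?/gc) {
            push @tokens, [ control => $1, $2 ];
        }
        # Hex-escaped character: \'hh
        elsif ($rtf =~ /\G\\'([0-9A-Fa-f]{2})/gc) {
            push @tokens, [ control => "'", $1 ];
        }
        # Control symbol: backslash plus one non-letter, e.g. \~ \{ \} \\
        elsif ($rtf =~ /\G\\(.)/gcs)  { push @tokens, [ control => $1 ] }
        elsif ($rtf =~ /\G([^\\{}\r\n]+)/gc) { push @tokens, [ text => $1 ] }
        else  { die "Stray backslash at end of input\n" }
    }
    return @tokens;
}

# Print one token per line, with '-' standing in for a missing field.
for my $tok (tokenize_rtf(q({\rtf1\ansi Hello\~world{\b bold}}))) {
    print join(' ', map { defined $_ ? $_ : '-' } @$tok), "\n";
}
```

A real reader still has to track group nesting, destinations, and character sets on top of this token stream, which is where most of the work lies.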
Anybody looking for a project this smells like a good one.
update:
If you're looking for a strategy, try looking at a LaTeX parser, because LaTeX and RTF look very similar, if you ask me.
I'm surprised there isn't an open-source library already out there to do this (I know there is a non-free one that looks like it'd be useful).
____________________________________________________ ** The Third rule of perl club is a statement of fact: pod is sexy.
|
I'd revise your statement:
> Anybody looking for a project this smells like a good one
to:
Anybody looking for a project, this smells.
RTF is an unpleasant format. The basic spec might be okay for creating a writer, but creating a reader that will handle arbitrary RTF is quite hard.
Thus speaks someone who has just spent the last five months dealing with RTF parsing.
--
$,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc"))
{$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"
|
Thank you for a very interesting link; I'll save it for future reference.
Since it looks like a ready-to-use grammar does not currently exist, I'll try writing one myself and will post it on this site.
However, I expect it to be extremely slow at parsing.
I'll let you know about my further results.
Courage, the Cowardly Dog
Re: Parse::RecDescent grammar for RTF
by graff (Chancellor) on Nov 02, 2002 at 13:48 UTC
Having installed RTF::Parser on my Linux laptop just now, and having looked at the README for that module, it doesn't look like it's in a usable state; e.g., there doesn't appear to be any documentation on how to use it. (And there wasn't any explicit mention of how old it is, but the downloaded files under .cpan/build date from July 1999.) This looks like a dead-end, orphaned module. (Should CPAN have something like a garbage-collection process to clear away stuff like this?)
On the other hand, RTF::Tokenizer seems to be quite current, and is documented. You didn't mention what you need to do, but maybe this module will be able to help you out.
|
I've chatted with Pete Sergeant; I don't think that RTF::Tokenizer will be developed further. It is useful, though, if you are careful to preprocess special characters before running the tokenizer. It really, really doesn't like \~ codes for non-breaking spaces.
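A minimal sketch of that kind of preprocessing, in core Perl, might look like the following. The function name replace_nbsp_symbol is made up for this example, and the default replacement \'a0 is an assumption: it presumes the document uses a code page where 0xA0 is the non-breaking space (such as Windows-1252) and that the tokenizer copes with \'hh escapes. The one subtlety it handles is leaving \\~ (an escaped backslash followed by a literal tilde) untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the \~ control symbol before tokenizing.
# The default replacement (\'a0) is an assumption, not a recommendation
# from RTF::Tokenizer; pass a different one if it does not suit your data.
sub replace_nbsp_symbol {
    my ($rtf, $replacement) = @_;
    $replacement = q(\'a0) unless defined $replacement;

    # Match an escaped backslash first and copy it through unchanged, so
    # that the second backslash of \\ is never mistaken for the start of \~.
    $rtf =~ s{(\\\\)|\\~}{ defined $1 ? $1 : $replacement }ge;
    return $rtf;
}

print replace_nbsp_symbol(q({\rtf1 10\~km})), "\n";
```

The cleaned string can then be handed to whichever tokenizer you use; other troublesome control symbols could be handled by adding further alternatives to the same substitution.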
--
$,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc"))
{$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"