PerlMonks  

Rewrapping Net::NNTP output

by hacker (Priest)
on Feb 24, 2005 at 14:50 UTC ( [id://434084]=perlquestion )

hacker has asked for the wisdom of the Perl Monks concerning the following question:

Lately, I've been spending a lot of time getting very familiar with XML, specifically with RSS, RDF, and Atom feeds.

To that end, I've written a script that uses Net::NNTP to fetch news articles and creates an RSS feed out of them. From there, I can convert that feed to HTML, which I then convert to a format suitable for display on a Palm handheld device, using Plucker. I do this in two formats because I specifically need it in both RSS and HTML formats simultaneously.

So far, so good.

My question is: how do I take the body of the news article I receive and "rewrap" the text so it fits within a known width? I know about Text::Wrap, but getting it right with the quoted material would take a bit more thought (custom regexes?).

The body of an article generally has quoted material buried somewhere within it, with '>' at the beginning of the quoted lines. This becomes a problem when the lines are quoting quoted material, like this:

> This is a sentence that might contain some of the original
> person's quoted text. Its a first-level quoted mesage.
This is a reply to that quoted material
>> This is some text from the very first original post that
>> wraps onto another line.
> And someone here is replying to that original quoted
> text.
And this is the current poster's.

What I'd like to figure out is how to rewrap this text while keeping the same quoting structure, based on the width of the target device (which I will know before I convert it). For example, wrapping the text to a maximum width of 320 pixels, or a maximum width of 160 pixels, and so on.
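One way to approach the question above is to group consecutive lines by quote depth, join each group, and hand it to Text::Wrap with the quote marker as the indent. The following is only a sketch under assumptions: `rewrap_quoted` and `columns_for_pixels` are hypothetical helper names, and the 5-pixels-per-character default is a guess, not a property of Plucker or of any particular Palm font.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: estimate a column count from a pixel width.
# The default of 5 pixels per character is an assumption for illustration.
sub columns_for_pixels {
    my ($pixels, $px_per_char) = @_;
    $px_per_char ||= 5;
    return int($pixels / $px_per_char);
}

# Rewrap an article body, keeping each run of lines at its own quote depth.
# Text::Wrap keeps every output line to at most $columns - 1 characters.
sub rewrap_quoted {
    my ($body, $columns) = @_;
    local $Text::Wrap::columns  = $columns;
    local $Text::Wrap::unexpand = 0;    # do not turn runs of spaces into tabs

    my (@out, @run);
    my $depth = 0;                      # quote depth of the current run

    my $flush = sub {
        return unless @run;
        my $prefix = $depth ? ('>' x $depth) . ' ' : '';
        push @out, split /\n/, wrap($prefix, $prefix, join(' ', @run));
        @run = ();
    };

    for my $line (split /\n/, $body) {
        # Leading quote markers, accepting "> >" as well as ">>".
        my ($marks, $text) = $line =~ /^((?:>\s*)*)(.*)$/;
        my $d = ($marks =~ tr/>//);     # count the '>' characters

        if ($text !~ /\S/) {            # blank line: paragraph break
            $flush->();
            push @out, '>' x $d;
            $depth = $d;
            next;
        }
        $flush->() if $d != $depth;     # quote depth changed: new run
        $depth = $d;
        push @run, $text;
    }
    $flush->();
    return join("\n", @out) . "\n";
}
```

With these helpers, something like `rewrap_quoted($body, columns_for_pixels(160))` would reflow the sample above to the narrower device. Note that this sketch joins every run of same-depth lines into one paragraph, so it will also merge lines that were meant to stay separate, such as headings and their underlines.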

Replies are listed 'Best First'.
Re: Rewrapping Net::NNTP output
by eieio (Pilgrim) on Feb 24, 2005 at 15:15 UTC
    Text::Autoformat works great for this very purpose. I use it to rewrap e-mail messages that are heavily quoted. It can be heavily customized for your specific application.
      I was aaaaaaalmost sold, until I tried it on a Usenet post that was already wrapped incorrectly. Observe a snippet from comp.text.pdf:
      > 3. Scripting Languages, such as Python or Perl
      > ===============================================
      >
      > ReportLab Toolkit (Python) and PDF::API2 (Perl).
      >
      > 4. Lower-Level Programming languages such as C
      > ===============================================
      >
      > Look at PDFlib lite (simple version of the commercial one, not for commercial
      > use!) and ClibPDF. You will need a C compiler and some experienced C programmers
      > though.

      This will be reflowed to the specified width, using Text::Autoformat, but the broken lines aren't cuddled back up to their previous lines before reflowing the text. It looks like this:

      > 3. Scripting Languages, such as
      > Python or Perl
      > ===============================================
      >
      > ReportLab Toolkit (Python) and
      > PDF::API2 (Perl).
      >
      > 4. Lower-Level Programming languages
      > such as C
      > ===============================================
      >
      > Look at PDFlib lite (simple version
      > of the commercial one, not for commercial
      > use!) and ClibPDF. You will need a C
      > compiler and some experienced C programmers
      > though.

      The right margin lands at the right width, but the text is still broken up. I wish there were a way to avoid this kind of behavior.

      I also tried Text::Reform and Text::Reflow with similar (negative) results.

      How are you handling cases like this in your code? You seem to be doing something similar to what I'm doing here also.
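One possible pre-processing step for the "cuddling" problem described above is to rejoin overflow lines before handing the text to any reflow module. The sketch below is a heuristic, not something Text::Autoformat does: `rejoin_overflow` is a hypothetical name, and the 70-character threshold for "this line was probably broken by a mail client" is an assumption that would need tuning per newsgroup.

```perl
use strict;
use warnings;

# Heuristic sketch: a line is treated as overflow from the line before it
# when both share the same quote depth and the previous line was at least
# $min_full characters long (an assumed threshold, 70 by default).
sub rejoin_overflow {
    my ($body, $min_full) = @_;
    $min_full ||= 70;

    my @out;
    my ($prev_depth, $prev_full) = ('', 0);

    for my $line (split /\n/, $body) {
        my ($prefix, $text) = $line =~ /^((?:>\s?)*)(.*)$/;
        (my $depth = $prefix) =~ tr/>//cd;    # keep only the '>' characters

        # Blank lines, rules such as "=====", and list items stay on their own.
        my $structural = $text !~ /\S/
                      || $text =~ /^\s*(?:[-=*_]{3,}|\d+\.\s|[-*]\s)/;

        if (@out && $prev_full && !$structural && $depth eq $prev_depth) {
            $out[-1] .= ' ' . $text;          # cuddle up to the previous line
        }
        else {
            push @out, $line;
        }

        # Judge "fullness" on the original line, not on the joined result.
        $prev_full  = !$structural && length($line) >= $min_full;
        $prev_depth = $depth;
    }
    return join("\n", @out) . "\n";
}
```

On the comp.text.pdf snippet above, this would merge the "Look at PDFlib lite" line with its two overflow lines while leaving the headings and "=====" rules alone, so a later reflow sees one paragraph instead of three fragments. Text that was wrapped at a narrower width than the threshold passes through unchanged.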

        Text::Autoformat isn't perfect. I'm not using Text::Autoformat in a completely automated situation. I'm able to manually intervene and do any final tweaks to the text. Regardless, it makes my life much easier.

        In fact, I'm not sure it can be perfect, because of the ambiguity in how many of us write. For example, given the following text:

        > this text should be considered a single paragraph
        > that was hard wrapped such that not every line has a quote character.
        We would want it to be formatted as:
        > this text should be considered a single
        > paragraph that was hard wrapped such
        > that not every line has a quote character.
        However, given this text:
        > (Sir Galahad approaches the Bridgekeeper)
        Stop! What is your name?
        Sir Galahad of Camelot.
        > What is your quest?
        I seek the Grail.
        > What is your favorite color?
        Blue. No yellow...
        we would want it to be formatted as:
        > (Sir Galahad approaches the Bridgekeeper)
        > Stop! What is your name?
        Sir Galahad of Camelot.
        > What is your quest?
        I seek the Grail.
        > What is your favorite color?
        Blue. No yellow...
        and not:
        > (Sir Galahad approaches the Bridgekeeper)
        > Stop! What is your name?
        > Sir Galahad of Camelot.
        > What is your quest?
        > I seek the Grail.
        > What is your favorite color?
        > Blue. No yellow...
        How is Text::Autoformat to tell the difference?

        geoff
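The ambiguity described above probably cannot be resolved in general, but a length-based guess covers some cases. The sketch below assumes the original text was hard wrapped at a known width (72 by default, which is only an assumption) and asks whether the first word of the unquoted line could have fit on the quoted line above it. `looks_like_overflow` is a hypothetical name, not part of Text::Autoformat.

```perl
use strict;
use warnings;

# Heuristic sketch: an unquoted line is probably hard-wrap overflow if its
# first word would not have fit on the quoted line above it at the assumed
# original wrap width. Returns 1 for "overflow", 0 for "separate reply".
sub looks_like_overflow {
    my ($quoted_line, $next_line, $width) = @_;
    $width ||= 72;

    return 0 if $next_line =~ /^\s*>/;           # still quoted: not this case

    my ($first_word) = $next_line =~ /^\s*(\S+)/;
    return 0 unless defined $first_word;         # blank line: paragraph break

    # Would "quoted line + space + first word" have exceeded the width?
    return length($quoted_line) + 1 + length($first_word) > $width ? 1 : 0;
}
```

In the Bridgekeeper exchange, "> What is your quest?" is short enough that "I" would have fit after it, so the next line is kept as a separate reply. The guess fails whenever the original wrap width is unknown or narrower than assumed, which is the same ambiguity in another form.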
