Programatically reparagraphinating text

by hacker (Priest)
on Feb 16, 2007 at 01:24 UTC ( [id://600342]=perlquestion )

hacker has asked for the wisdom of the Perl Monks concerning the following question:

Boy is that a mouthful...

What I'm trying to do is take a series of old "e-zines" (phrack, t@p and such; I have 288 of them, for a total of 9,899 issues), which are stored in plain old 7-bit ASCII text (think BBS era), and reflow them so I can then wrap some XML around the elements and convert them to HTML (yes, XML... then HTML).

Here's the catch: unless I go through them manually after they've been reformatted, with my own human eyes, I'll never know whether sections that should NOT have been touched were.

For example, some of them have ASCII diagrams of pinouts, ASCII representations of block diagrams and other things, which I'd like to keep intact, but the paragraphs of text before and after them should be reflowed. Here's an example:

   _  _       _______
  | \/ |     / _____/
  |_||_|etal/ /hop
  _________/ /
 /__________/
 (314)432-0756
 24 Hours A Day, 300/1200 Baud

Here's another:

0000 00 0c 29 04 7d 25 00 50 56 c0 00 01 08 00 45 00  ..).}%.PV.....E.
0010 00 58 bf 58 00 00 00 11 25 89 0a 0a 0a 01 c0 a8  .X.X....%.......
0020 01 01 00 89 04 02 00 44 00 00 00 03 85 80 00 00  .......D........
0030 00 01 00 00 00 00 20 46 48 45 50 46 43 45 4c 45  ...... FHEPFCELE
0040 48 46 43 45 50 46 46 46 41 43 41 43 41 43 41 43  HFCEPFFFACACACAC
0050 41 43 41 43 41 42 4c 00 00 01 00 01 00 01 51 80  ACACABL.......Q.
0060 00 04 c0 a8 01 4d

And one more...

Specifications

When interfacing the CRT with a null modem cable, your cable should fit the diagram below.

┌────┐                                        ┌────┐
│1  O┼────────────────────────────────────────┼O 1 │
│2  O┼────────────────────────────────────────┼O 3 │
│3  O┼────────────────────────────────────────┼O 2 │
│4  O┼────────────────┬───────────────────────┼O 8 │
│5  O┼────────────────┘ ┌─────────────────────┼O 20│
│6  O┼──────────────────┘ ┌───────────────────┼O 7 │
│7  O┼────────────────────┘         ┌─────────┼O 4 │
│8  O┼──────────────────────────────┴─────────┼O 5 │
│20 O┼────────────────────────────────────────┼O 6 │
└────┘                                        └────┘

Pin Definitions

1. Ground             6. Data Set Ready
2. Transmit Data      7. Ground
3. Receive Data       8. Data Carrier Detect
4. Request to Send    9. Data Terminal Ready
5. Clear to Send

So some rudimentary rules need to be set. Lines that end in, say, \w\s+\w$\w are probably the ends of sentences, and not part of a diagram.
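A minimal sketch of an "ends like a sentence" test, for illustration only. The pattern here is a stand-in and not the one quoted above: it assumes that a prose line ends in a word character followed by terminal punctuation. The helper name `looks_like_sentence_end` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not the pattern from the post: true when a line ends
# with a word character, then '.', '!' or '?', then an optional closing
# quote or parenthesis.
sub looks_like_sentence_end {
    my ($line) = @_;
    return $line =~ /\w[.!?]["')]?\s*$/ ? 1 : 0;
}

for my $line ("sections that should NOT have been touched, were.",
              "|_||_|etal/ /hop") {
    printf "%d  %s\n", looks_like_sentence_end($line), $line;
}
```

On its own this is a weak signal: a hex dump line whose ASCII column happens to end in something like "E." would pass too, so it probably needs to be combined with other checks.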

I'm not really asking for the actual code, and I know this'll be a huge pile of regexes and unit tests, but what I AM asking for is a list of the proper modules I can use to do this: things like Text::Wrap, XML::LibXML, Text::Autoformat, and others. Thanks in advance, my fellow brethren...

Replies are listed 'Best First'.
Re: Programatically reparagraphinating text
by blahblahblah (Priest) on Feb 16, 2007 at 03:03 UTC
    A small suggestion based on your samples above: maybe you could use a dictionary to identify non-English blocks of text. Something like: if X% of the "words" in a block of text are not found in a dictionary, then assume a higher likelihood that it should be left unchanged.
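A rough sketch of that dictionary check, under some assumptions: tokens are split on whitespace, punctuation is trimmed from both ends, and the dictionary is a plain hash. The helper name `unknown_word_ratio`, the toy word list and the 0.5 cutoff are all made up for illustration; a real run would load a proper word list and tune the threshold.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of whitespace-separated tokens in $text that
# are not keys of the %$dict hash (after lowercasing and trimming punctuation).
sub unknown_word_ratio {
    my ($text, $dict) = @_;
    my @tokens = split ' ', $text;
    return 1 unless @tokens;    # nothing word-like at all: treat as non-prose
    my $unknown = 0;
    for my $token (@tokens) {
        (my $word = lc $token) =~ s/^[^a-z0-9]+|[^a-z0-9]+$//g;
        $unknown++ unless exists $dict->{$word};
    }
    return $unknown / @tokens;
}

# Toy dictionary for illustration only.
my %dict = map { $_ => 1 } qw(your cable should fit the diagram below);

my $threshold = 0.5;    # assumed value standing in for the "X%" above
for my $block ("Your cable should fit the diagram below.",
               "0010 00 58 bf 58 00 00 00 11 25 89 0a 0a") {
    my $ratio = unknown_word_ratio($block, \%dict);
    printf "%.2f %s\n", $ratio, $ratio > $threshold ? "leave alone" : "reflow";
}
```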

    Also, note that your links to modules above would work better with a "[cpan://..." style link, like: Text::Wrap XML::LibXML Text::Autoformat

    Joe

Re: Programatically reparagraphinating text
by ww (Archbishop) on Feb 16, 2007 at 14:45 UTC
    Interesting project!

    blahblahblah ++ re use of a dictionary. Coupled with the regex in the OP (or, perhaps, one that's rather more specific and insistent on the presence of periods), you may have something of a start on that part of the problem.

    It does seem to me that reflowing text (horizontally) around ASCII art will be problematic at best. Perhaps it would be better to settle for a less design-oriented target and leave anything determined to be ASCII art as an inline item (a takeout box or drop-in, to offer a couple of terms that may clarify my intent), with the reformatted text above and below.

    eg, NOT:

    test here yada ya da   0000 01 02 03 04...
    ya da'ing continues    0010 0f 0e 0d...

    but rather:

    test here yada ya da

    0000 01 02 03 04...
    0010 0f 0e 0d...

    ya da'ing continues
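A minimal sketch of that layout using the core Text::Wrap module. It assumes the input has already been split into blocks and that some earlier step has set a `keep` flag on the blocks to leave alone; the block structure and the `render_blocks` name are hypothetical.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical renderer: prose blocks are reflowed, blocks flagged with
# keep => 1 are emitted untouched, and everything is separated by blank lines.
sub render_blocks {
    my ($width, @blocks) = @_;
    local $Text::Wrap::columns  = $width;
    local $Text::Wrap::unexpand = 0;    # never turn runs of spaces into tabs
    my @out;
    for my $block (@blocks) {
        if ($block->{keep}) {
            push @out, $block->{text};
            next;
        }
        (my $flat = $block->{text}) =~ s/\s+/ /g;    # undo the old hard wraps
        $flat =~ s/^ | $//g;
        (my $wrapped = wrap('', '', $flat)) =~ s/\s+\z//;
        push @out, $wrapped;
    }
    return join "\n\n", @out;
}

print render_blocks(72,
    { text => "test here yada\nya da" },
    { text => "0000 01 02 03 04...\n0010 0f 0e 0d...", keep => 1 },
    { text => "ya da'ing\ncontinues" },
), "\n";
```

Keeping the art as its own block means the reflow step never sees it, which sidesteps the problem of wrapping prose around it.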

    My next notion may be unmanageable, but might be worth exploring: Would creation of a second dictionary containing such common elements as the address fragments at the beginning of each line of a hex dump (2nd example) and the multiple spaces initiating each line in the BBS logo be worth the effort?
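    One way such a second "dictionary" might look is a short list of named patterns rather than words. Both patterns below are illustrative guesses based only on the two samples mentioned, and the names `@signatures` and `signature_of` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical "second dictionary": named patterns for lines that belong to
# blocks we do not want to reflow. Checked in order; first match wins.
my @signatures = (
    # an offset of 4 to 8 hex digits, then at least two hex byte pairs
    [ hex_dump => qr/^\s*[0-9a-f]{4,8}:?\s+(?:[0-9a-f]{2}\s+)+[0-9a-f]{2}\b/i ],
    # two or more leading spaces, as in the BBS logo
    [ indented => qr/^\s{2,}\S/ ],
);

sub signature_of {
    my ($line) = @_;
    for my $sig (@signatures) {
        my ($name, $pattern) = @$sig;
        return $name if $line =~ $pattern;
    }
    return '';    # no signature matched: treat as ordinary prose
}

for my $line ("0000 00 0c 29 04 7d 25 00 50",
              "  | \\/ |     / _____/",
              "Here's another:") {
    printf "%-10s %s\n", signature_of($line) || 'prose', $line;
}
```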

    and <big grin> while use of a dictionary might not have returned this result; the mouthful of the title might have been reduced by using "reparagraphing"?

Re: Programatically reparagraphinating text
by dsheroh (Monsignor) on Feb 16, 2007 at 16:01 UTC
    A couple ideas which come to mind based on your examples, although I don't really expect them to catch 100% of the cases which should be left alone:

    - If presented with N or more lines of the same length, it's likely a binary dump or a pinout diagram, so leave it alone. (I'd probably go with N=3, at least initially, but most dumps/diagrams tend to be longer than that, so you could probably use a larger value of N safely.)

    - Multiple consecutive lines with leading whitespace are likely to be ASCII art or columnar text, so leave them alone. (Just one line with leading whitespace is more likely to be the start of a paragraph. For extra credit, if a block of indented lines includes one non-indented line, leave it alone, too, since it's likely part of the ASCII art.)
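    A rough sketch of those two rules, assuming the input is already split into lines. It flags runs of N or more equal-length lines and runs of two or more indented lines; the "extra credit" case is not handled. The names `mark_runs` and `protect_flags` are made up for this example.

```perl
use strict;
use warnings;

# Flag every run of at least $min_run consecutive lines for which
# $same_run->(previous line, current line) holds between neighbours.
sub mark_runs {
    my ($keep, $min_run, $same_run, @lines) = @_;
    my $start = 0;
    for my $i (1 .. scalar @lines) {
        next if $i < @lines && $same_run->($lines[$i - 1], $lines[$i]);
        if ($i - $start >= $min_run) {
            $keep->[$_] = 1 for $start .. $i - 1;
        }
        $start = $i;
    }
}

# Returns one flag per line: 1 = leave alone, 0 = candidate for reflowing.
sub protect_flags {
    my ($min_same_length, @lines) = @_;
    my @keep = (0) x @lines;
    # Rule 1: N or more non-empty lines of identical length.
    mark_runs(\@keep, $min_same_length,
        sub { length($_[0]) > 0 && length($_[0]) == length($_[1]) }, @lines);
    # Rule 2: two or more consecutive lines with leading whitespace.
    mark_runs(\@keep, 2,
        sub { $_[0] =~ /^\s+\S/ && $_[1] =~ /^\s+\S/ }, @lines);
    return @keep;
}

my @lines = ("Some prose that", "runs on a bit.",
             "0000 01 02 03", "0010 0f 0e 0d", "0020 aa bb cc",
             "More prose here");
print join('', protect_flags(3, @lines)), "\n";
```

    Fully justified prose could trip the equal-length rule, so a larger N, as suggested above, may be the safer starting point.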

    (I know this isn't modules, which is what you said you're looking for, but it looked like you may be looking for rules, too.)

Node Type: perlquestion [id://600342]
Approved by liverpole