http://qs321.pair.com?node_id=781720

All,
A while back, I announced to the CB that I had a couple of days to write code for anyone with an interesting project. I got a couple of requests. The one germane to this thread was from tye who asked if I could write a routine that, when passed in the body of a post would detect if the post was unformatted and format it. At least those were his initial instructions (not wanting to influence my approach). They were later refined to:

I failed at this task for a number of reasons. First, I aimed too high. I considered too many formatting issues to fix. It is always better to have a working program that doesn't do much than to have something that doesn't work at all. I should have asked clarifying questions about the scope of the project earlier on. The second reason was that I considered way too many edge cases. For instance, I considered joining two adjacent code blocks into one set of tags but dismissed it because some people want distinct download tags. Again, it is better to have a solution that works 90% of the time than one that doesn't run at all. The last reason is that I ran out of time. I spent too much time in my head and not enough time writing code. Spending time thinking about a problem up front almost always saves you time in the long run, but I should have realized it would have been better to turn over something with a "to consider" section that someone else could have continued than nothing at all.

Since this is something that would benefit the entire site, I am posting a bit of failure "lessons learned". I believe tye is still interested in someone coming up with the routine so this would be a perfect opportunity for someone to help out the site without even needing to be a devil.

KISS

If there are tags (BR, P, div, C, bold or italic, etc) then it is formatted. It may be formatted wrong but anything you do now may not be the intention of the author. I know it would be tempting to convert PRE tags to C tags but the primary goal is to fix the completely unformatted posts of newbies not to create an auto-format tool. Refinements can be added later on.
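A minimal sketch of that "any known tag means formatted" test. The sub name `looks_formatted` and the tag list are my assumptions for illustration, not the site's actual rules. Matching a fixed list rather than anything shaped like `<word>` keeps Perl constructs such as `<STDIN>` from being mistaken for markup.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the body contains one of a fixed list of
# formatting tags. The list is an illustrative guess; extend it as needed.
my $KNOWN_TAG = qr{</?(?:p|br|div|c|code|pre|b|i|em|strong|ul|ol|li|blockquote)\b[^<>]*>}i;

sub looks_formatted {
    my ($body) = @_;
    return $body =~ $KNOWN_TAG ? 1 : 0;
}

# prints "unformatted": no tags at all
print looks_formatted("my \$x = 1;\nprint \$x;") ? "formatted\n" : "unformatted\n";
# prints "formatted": contains a P tag
print looks_formatted("<p>Hello</p>") ? "formatted\n" : "unformatted\n";
```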

Split the body into paragraph chunks. Determine if chunk is a code block. If not, insert appropriate P tags. If yes, insert C tags. If two or more adjacent blocks are all code, only insert 1 set of code tags to preserve the author's intentional vertical whitespace in code. Again, refinements can be added later.
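The outline above might look something like the following. This is only a sketch: `format_body` is a made-up name, and the "is this code?" decision is passed in as a callback so the hard part stays pluggable. Adjacent code chunks are merged into one set of C tags and rejoined with a single blank line.

```perl
use strict;
use warnings;

# Hypothetical driver: split on blank lines, classify each chunk with a
# caller-supplied predicate, and merge runs of adjacent code chunks.
sub format_body {
    my ($body, $is_code) = @_;
    $body =~ s/\s+\z//;                       # drop trailing whitespace
    my @chunks = grep { /\S/ } split /\n(?:[ \t]*\n)+/, $body;

    my (@out, @run);                          # @run holds pending code chunks
    my $flush = sub {
        push @out, "<c>\n" . join("\n\n", @run) . "\n</c>" if @run;
        @run = ();
    };
    for my $chunk (@chunks) {
        if ($is_code->($chunk)) {
            push @run, $chunk;
            next;
        }
        $flush->();
        push @out, "<p>$chunk</p>";
    }
    $flush->();
    return join "\n", @out;
}

# Usage with a deliberately naive predicate: any indented line means code.
my $post = "Why does this fail?\n\n    my \$x = 1;\n\n    print \$x\n\nThanks";
print format_body($post, sub { $_[0] =~ /^[ \t]+\S/m }), "\n";
```

With that predicate the two indented chunks end up inside a single pair of C tags, and the question and the "Thanks" each get their own P tags.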

Determining if a chunk contains code is not terribly difficult (though also not entirely accurate). There are a lot of telltale indications (indentation, shebang line, $var, semi-colons, braces, etc). The two difficult things are telling if a chunk contains both code and regular text as well as where the code begins and ends. The author might not put a blank line between the last sentence and the introduction of the code. This means you need to handle both P and C tags in 1 chunk. The indicators are not perfect for telling you where the code boundaries are either. A good script will probably start with a shebang line and a couple of use pragmas (strict and warnings) but these are newbies we are talking about. Indentation won't help and semi-colons can be red herrings.
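To make the "indicators" idea concrete, here is a rough scoring sketch. The indicator list, the names `code_score` and `is_probably_code`, and the threshold of 2 are all illustrative guesses on my part, not anything that was tested against real nodes. It only classifies a whole chunk; it does nothing about finding code boundaries inside a mixed chunk.

```perl
use strict;
use warnings;

# Hypothetical indicators: each pattern that matches adds one point.
my @INDICATORS = (
    qr/^#!.*\bperl\b/m,                     # shebang line
    qr/^\s*use\s+(?:strict|warnings)\b/m,   # common pragmas
    qr/[\$\@%]\w+/,                         # sigil followed by a name
    qr/;[ \t]*$/m,                          # line ending in a semi-colon
    qr/[{}][ \t]*$/m,                       # line ending in a brace
    qr/^[ \t]+\S/m,                         # indented line
);

sub code_score {
    my ($chunk) = @_;
    return scalar grep { $chunk =~ $_ } @INDICATORS;
}

# Require two indicators so that a single red herring (such as a sentence
# ending in a semi-colon) is not enough on its own.
sub is_probably_code {
    my ($chunk) = @_;
    return code_score($chunk) >= 2 ? 1 : 0;
}

print is_probably_code("my \$x = 1;\nprint \$x;")  ? "code\n" : "text\n";
print is_probably_code("I tried it; it failed;")   ? "code\n" : "text\n";
```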

Setting aside __END__ tags - I considered adding/removing lines from @is_code and using things like Deparse, PerlTidy, perl -c, etc but syntax errors are prevalent in newbies and perl itself is too eager to try and DWYM.

I hope you succeed where I failed.

Cheers - L~R

Replies are listed 'Best First'.
Re: Adding Minimal Formatting To Unformatted Newbie Posts
by Argel (Prior) on Jul 20, 2009 at 19:21 UTC
    Thanks for the attempt and an interesting post!

    I wonder if it might help to set some goals/milestones? For example, I would think the first goal should be to address broken formatting, such as missing closing tags (e.g. bold and italics) and long lines within a PRE block since they can push nodelets right off the screen (see Nodelets on the left? for a workaround).

    Then the next goal might be to address trying to put code into code tags, etc.

    Another possibility is to detect potential issues and then present them to the user to correct. Maybe they would just be ignored, but maybe not. Regardless, that would at least allow for separating the project into the easier problem of detecting potential issues and the harder problems of trying to determine if they really are an issue and if so how to resolve them.

    Elda Taluta; Sarks Sark; Ark Arks

      I disagree. Those are separate problems and shouldn't be combined.

      Mis-closed tags are already detected and rendered harmless by the "HTML nesting enforcement" squad (enable them at Display Settings while they remain optional awaiting a couple of bug fixes). They also present the problems found to the user in hopes that they will correct them (control how much they will present to you at the same location).

      Inappropriate line lengths in PRE tags likewise won't be part of "fix an unformatted node".

      And the original problem here is pretty simple, despite Limbic~Region's admittedly aiming too high. I just added a reply clarifying how simple (nay, simplistic) I would like the solution to be.

      - tye        

        Point taken. I guess I would prioritize as follows: First make sure one node doesn't affect the formatting of another node or the rest of the page (such as long lines within PRE tags, etc.), and then move on to other possible formatting issues. That was my underlying thought process.

        Elda Taluta; Sarks Sark; Ark Arks

Re: Adding Minimal Formatting To Unformatted Newbie Posts
by ELISHEVA (Prior) on Jul 20, 2009 at 20:45 UTC

    First, many thanks for trying.

    IMHO the KISS solution is:

    1. if there are no <pre> tags, surround the entire post with <code> tags.
    2. if there are <pre> tags and no other formatting
      1. replace <pre> tags with <code> tags
      2. if there is any text above the first <pre> tag, surround it with <code> tags
      3. if any text is between each pair of </pre> and <pre>, surround it with <code> tags
      4. if any text is below the last <pre> tag, surround it with <code> tags

      Note: I omitted the obvious simplification (remove the <pre> tags and surround the entire post with <code> tags) on the assumption that the post author sees something conceptually distinct from the rest of the text in whatever is surrounded with <pre> tags. It is possible that it would be valuable to allow that section to have its own download link.
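A possible sketch of the steps above, assuming the caller has already checked that the post contains no formatting other than pre tags. Because both the pre sections and the text around them end up in code tags, it is enough to split on the pre tags and wrap every non-empty piece. The name `codify_post` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: every piece of the post, whether it was inside
# <pre> or not, gets its own pair of <code> tags. With no <pre> tags at
# all, split returns the whole post as one piece.
sub codify_post {
    my ($body) = @_;
    my @pieces = split m{</?pre\b[^<>]*>}i, $body;
    return join '', map { "<code>$_</code>" } grep { /\S/ } @pieces;
}

print codify_post("print 1;"), "\n";
print codify_post("intro<pre>print 1;</pre>outro"), "\n";
```

Skipping whitespace-only pieces avoids emitting empty code tags when a post starts or ends with a pre block.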

    The above algorithm won't make the text "pretty", but it will deal with the major sources of pain from badly formatted posts:

    • People who don't know how to use html (or aren't comfortable with it) tend to space text as they want to see it. We will, most likely, be preserving the user's original formatting. We have a halfway decent chance of making things look readable.
    • If the text contains embedded code, we will actually be able to read it - imagine that!
    • <pre> tags are the main reason for janitorial emergencies. Getting rid of these on otherwise unformatted posts would save the janitors time. For the rest of us, we don't have to wait on the janitors to get access to our precious sidebars.

    Attempting to insert both <c> and <p> tags is actually quite a difficult task because it requires us to distinguish between code and text. That is non-trivial. Since Perl borrows many words from English, it requires parsing not just the words but their context. I'm not surprised that you found the task too hard to do to your satisfaction in 2 days or so.

    Best, beth

    Update: explained why KISS doesn't include a very obvious simplification.

      That is extremely simple (just put code tags around and s=</?pre>=</code><code>=g, to restate it more tersely). I find the POD-like approach slightly more complex (see my earlier reply). However, I think your solution will almost never format the post as desired. I hope my POD-like solution will often format posts "correctly".

      Further, my solution makes it extremely simple to get a post formatted "correctly". Just separate code with blank lines and indent code. If we keep the idea that simple, then we should be able to get a lot of people to be able to swap in the requirements when posting.

      As for code in the middle of text, the main problem is square brackets. And I have a lot of ideas for making that mostly DWIM so I'm not concerned with that here (the square bracket problem needs to be fixed for chatter as well, which makes it almost orthogonal to this formatting problem).

      - tye        

        Yes it is very simple, even simplistic - by intent. It wasn't intended as a final solution but rather a stop gap (or scaffolding as you will) to address the major source of pain. Even if you were to reuse code from pod's implementation, I think you would find that applying the pod approach will require a certain amount of tweaking (see addendum below). Having a lot of DWIM ideas is great, but each idea adds implementation time however short. It adds up. And at least a few of those may turn into a "small matter of programming". I'm just saying while the kinks of those ideas are being worked out and tested, a very simple solution would buy us a great deal.

        The worst effects of this simple solution are that (a) we get ugly Courier text instead of Times New Roman (b) normal text may look choppy because we lose the ability to wrap paragraphs (c) downloads may include a bunch of irrelevant stuff that needs to be deleted or commented out.

        But at least we will be able to read the post. Most of the unreadability in unformatted posts (and I mean here text with *no* tags except "pre") comes from the fact that unformatted text collapses all runs of whitespace outside "pre" tags into a single space. This makes code samples look like one long breathless mess. Only an obfu expert can read that.

        Best, beth

        Addendum: Although Pod has an algorithm for distinguishing text from code, even without markup, we can't just use it "as is". It relies on the assumption that the user will have normal text flush at the beginning of the line and code will be indented. This is often not the case if the user is cut&pasting from their code files. Here is a sample of code from node Can't call method "getAttribute" followed by output from pod2html sample.pod > foo.out. As you can see the output is unreadable:

        # this line added to bypass pod2html's error checking
        =head Dummy title

        # remainder is taken from node id=781506
        #!/usr/bin/perl

        #################################################################
        # Yahoo Weather Rss Information Atomizer
        # Version 0.7.1
        # Loud-Soft.com
        # Provided As Is
        #################################################################

        use strict;
        use XML::XPath;
        use LWP::Simple;
        use XML::XPath::XMLParser;
        use Getopt::Long;
        use File::Copy;

        #################################################################
        # Variables
        #################################################################
        # Constants (Change these to localize)
        my $zipcode = "60642";
        my $unit = "F";
        my $scripthome = "/Library/prlprograms/yweather.pl";
        my $icondir = $scripthome."images/";
        my $datadir = $scripthome."data/";
        my $datafile = $datadir."weather.xml";
        my $imagefile = $icondir."weather.png";

        # Constants (Do not change these)
        my $pre="yweather";
        my $uri="http://xml.weather.yahoo.com/ns/rss/1.0";
        my $url="http://xml.weather.yahoo.com/forecastrss?p=$zipcode&u=$unit";
        my %data;
        my $xp;

        The output looks like this:

        Dummy title

        #!/usr/bin/perl

        ################################################################# # Yahoo Weather Rss Information Atomizer # Version 0.7.1 # Loud-Soft.com # Provided As Is #################################################################

        use strict; use XML::XPath; use LWP::Simple; use XML::XPath::XMLParser; use Getopt::Long; use File::Copy;

        ################################################################# # Variables ################################################################# # Constants (Change these to localize) my $zipcode = ``60642''; my $unit = ``F''; my $scripthome = ``/Library/prlprograms/yweather.pl''; my $icondir = $scripthome.``images/''; my $datadir = $scripthome.``data/''; my $datafile = $datadir.``weather.xml''; my $imagefile = $icondir.``weather.png'';

        # Constants (Do not change these) my $pre=``yweather''; my $uri=``http://xml.weather.yahoo.com/ns/rss/1.0''; my $url=``http://xml.weather.yahoo.com/forecastrss?p=$zipcode&u=$unit''; my %data; my $xp;

Re: Adding Minimal Formatting To Unformatted Newbie Posts (clarifying)
by tye (Sage) on Jul 21, 2009 at 02:57 UTC

    Some other considerations that aren't really part of the primary goal and so might not be part of the accepted solution, but might help direct effort in more successful directions.

    I would like the rules for how we detect that a post "isn't formatted" and for how we transform it into a formatted post to be very simple to communicate. People being surprised by a post being declared "unformatted" or not should be rare. People being surprised by how formatting got added should be rare. I'm fine with people more often being disappointed with how well the formatting was intuited, especially if they are disappointed but immediately understand why the very simple rules involved lead to those results.

    This is meant as a fall-back for formatting nodes, not a replacement.

    I find some key concepts in POD to be ubiquitous in "simple formatting" schemes and so I think these would be very useful both for very often getting at "what the posting monk meant" and making the rules easy to understand.

    These are: 1) Formatting is only done to paragraphs. 2) Paragraphs are separated by blank lines. 3) A line of only whitespace characters is "blank". 4) "code" has indented lines. Then you enclose each paragraph in either P tags or C tags, depending on whether it is "code" or not.

    Yes, please join adjacent code "paragraphs" together so that they end up with only one set of C tags around the whole run and with the original interparagraph spacing preserved.
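One way the four rules plus the joining requirement could be sketched. The name `pod_like_format` is hypothetical, and I am assuming, as POD does for verbatim paragraphs, that a paragraph counts as "code" when its first line starts with whitespace. The separators are captured by `split` so that a run of code paragraphs can be glued back together with exactly the blank lines the author typed.

```perl
use strict;
use warnings;

# Hypothetical formatter following the POD-like rules: paragraphs are
# separated by blank (whitespace-only) lines, and an indented first line
# marks a paragraph as code.
sub pod_like_format {
    my ($body) = @_;
    $body =~ s/\s+\z//;
    # The capture group makes split return the separators as well.
    my @parts = split /(\n(?:[ \t]*\n)+)/, $body;

    my @out;
    my ($code, $gap);                 # open code run and the separator after it
    while (@parts) {
        my $para = shift @parts;
        my $sep  = shift(@parts) // '';
        next unless $para =~ /\S/;
        if ($para =~ /^[ \t]/) {      # first line indented: code
            $code = defined $code ? $code . $gap . $para : $para;
            $gap  = $sep;
        }
        else {
            push @out, "<c>\n$code\n</c>" if defined $code;
            undef $code;
            push @out, "<p>$para</p>";
        }
    }
    push @out, "<c>\n$code\n</c>" if defined $code;
    return join "\n", @out;
}

print pod_like_format("Q:\n\n  foo();\n\n\n  bar();\n\nWhy?"), "\n";
```

In the example the two code paragraphs are separated by two blank lines, and both blank lines survive inside the single pair of C tags.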

    I wouldn't add "the little stuff" nor would I consider the presence of "little stuff" to be an indication of "the node is formatted". "Little stuff" is tags like A, B, I, EM, STRONG, maybe BR. A single <p> or <code> or <c> should probably count as "already formatted".
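A sketch of that detection rule, where only the three tags named above count and the "little stuff" is ignored. The name `is_already_formatted` is hypothetical, and whether other block-level tags should join the list is left open.

```perl
use strict;
use warnings;

# Hypothetical check: only an opening P, CODE, or C tag marks a node as
# "already formatted". A, B, I, EM, STRONG, and BR are ignored.
sub is_already_formatted {
    my ($body) = @_;
    return $body =~ m{<(?:p|code|c)\b[^<>]*>}i ? 1 : 0;
}

print is_already_formatted("<b>Help!</b> my script dies")  ? "formatted\n" : "unformatted\n";
print is_already_formatted("<p>My script dies</p>")        ? "formatted\n" : "unformatted\n";
```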

    - tye        

Re: Adding Minimal Formatting To Unformatted Newbie Posts
by ambrus (Abbot) on Jul 20, 2009 at 23:07 UTC

    As a minimalistic solution, couldn't you just wrap the whole post in code tags if it doesn't match /<[a-zA-Z]/?
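Spelled out as a sub (the name `wrap_if_tagless` is made up here), that suggestion might look like the sketch below. One thing to watch: Perl code that reads from a filehandle, such as `<STDIN>`, also matches `/<[a-zA-Z]/`, so a post containing it would be left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap the whole post in code tags unless something
# in it looks like the start of an HTML tag.
sub wrap_if_tagless {
    my ($body) = @_;
    return $body =~ /<[a-zA-Z]/ ? $body : "<code>$body</code>";
}

print wrap_if_tagless("print 1;"), "\n";            # wrapped in code tags
print wrap_if_tagless("while (<STDIN>) {}"), "\n";  # left alone: <S looks like a tag
```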

Re: Adding Minimal Formatting To Unformatted Newbie Posts
by jrsimmon (Hermit) on Jul 20, 2009 at 19:02 UTC
    Did you have a node or group of nodes that were to be used as examples?
      jrsimmon,
      I used two sources. The first was Nodes To Consider which is always changing. The second was by removing the HTML formatting from existing nodes. There are more surgically precise tools out there besides HTML::Strip but you get the idea. You should be able to figure out how to do this with WWW::Mechanize or the like but I didn't want to post code and add any unnecessary stress on the site (code is at home now anyway).

      Cheers - L~R

        ...code is at home now anyway - with slippers on, a cold beer and its feet up, watching television perhaps ? :-D

        A user level that continues to overstate my experience :-))
Re: Adding Minimal Formatting To Unformatted Newbie Posts
by Anonymous Monk on Jul 20, 2009 at 20:10 UTC
    I think I have more KISS, just slow them down
    #~ my(@pun) = sort qw! ~ { } [ \ ] ^ _ ` : ; < = > @ # $ % ( ) * + / !;
    #~ my(@pun2) = qw( - , . ? ! & ' " );
    #~ my $pun = join '', map quotemeta, @pun;
    #~ $pun = qr~[$pun]~;
    # three or more consecutive code-like punctuation characters
    my $pun = qr~[\#\$\%\(\)\*\+\/\:\;\<\=\>\@\[\\\]\^_\`\{\}\~]{3,}~;
    if ( $node_text =~ /$pun/ ){
        # check $node_text itself (not $_) and refuse only when neither tag is present
        if( $node_text !~ /<\/?c>/ and $node_text !~ /<\/?code>/ ){
            die "Can't ask question without putting your code in code tags!";
        }
    }
    __END__
Re: Adding Minimal Formatting To Unformatted Newbie Posts
by wazoox (Prior) on Jul 21, 2009 at 16:31 UTC

    This isn't ambitious enough! The Right Thing To Do is to write some auto-learning algorithm that will crunch through the whole history of considered nodes, and learn by itself what constitutes bad formatting (needing correction) and good formatting (how the problem was overridden). I don't know if you could achieve that in a couple of days but at least you'd have tried :)

      The Right Thing To Do is to write some auto-learning algorithm....
      LOL. I couldn't help but think of this famous quote:
      I'm sorry Dave, I can't do that.
      ^_^

      Elda Taluta; Sarks Sark; Ark Arks