http://qs321.pair.com?node_id=781720

All,
A while back, I announced in the CB that I had a couple of days to write code for anyone with an interesting project. I got a couple of requests. The one germane to this thread was from tye, who asked if I could write a routine that, when passed the body of a post, would detect whether the post was unformatted and, if so, format it. At least those were his initial instructions (he did not want to influence my approach). They were later refined to:

I failed at this task for a number of reasons. First, I aimed too high: I considered too many formatting issues to fix. It is always better to have a working program that doesn't do much than to have something that doesn't work at all. I should have asked clarifying questions about the scope of the project earlier on. Second, I considered way too many edge cases. For instance, I considered joining two adjacent code blocks into one set of tags but dismissed it because some people want distinct download tags. Again, it is better to have a solution that works 90% of the time than one that doesn't run at all. Last, I ran out of time. I spent too much time in my head and not enough time writing code. Spending time thinking about a problem up front almost always saves you time in the long run, but I should have realized that turning over something with a "to consider" section that someone else could continue would have been better than turning over nothing at all.

Since this is something that would benefit the entire site, I am posting my "lessons learned" from the failure. I believe tye is still interested in someone coming up with the routine, so this would be a perfect opportunity for someone to help out the site without even needing to be a devil.

KISS

If there are tags (BR, P, div, C, bold or italic, etc.) then the post is formatted. It may be formatted wrong, but anything you do now may not be the intention of the author. I know it would be tempting to convert PRE tags to C tags, but the primary goal is to fix the completely unformatted posts of newbies, not to create an auto-format tool. Refinements can be added later on.
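The "already formatted" check above can be sketched in a few lines. This is only an illustration, written in Python for brevity and easy to port to Perl; the function name looks_formatted and the exact tag list are my assumptions, not anything tye specified.

```python
import re

# Assumed list of tags that count as "the author already formatted this".
# Deliberately excludes things like <STDIN> or <FH>, which are Perl, not markup.
FORMAT_TAG = re.compile(
    r"</?(?:br|p|div|code|c|pre|b|i|strong|em|ul|ol|li|blockquote)\b[^<>]*>",
    re.IGNORECASE,
)

def looks_formatted(body):
    """Return True if the post body contains any known formatting tag."""
    return FORMAT_TAG.search(body) is not None
```

If looks_formatted returns True, the routine would leave the post alone. It can still be fooled by code that happens to resemble a tag, which is the sort of refinement that can wait.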

Split the body into paragraph chunks. Determine whether each chunk is a code block. If not, insert the appropriate P tags. If so, insert C tags. If two or more adjacent chunks are all code, insert only one set of code tags to preserve the author's intentional vertical whitespace in code. Again, refinements can be added later.
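Here is a rough sketch of that chunking pass, again in Python purely for illustration. The names format_body and naive_is_code are hypothetical, and naive_is_code is a deliberately crude stand-in classifier so the example runs on its own; a real one would need better heuristics.

```python
import re

def naive_is_code(chunk):
    # Placeholder classifier (an assumption): call a chunk code if any
    # of its lines ends in ';', '{' or '}'.
    return any(line.rstrip().endswith((";", "{", "}"))
               for line in chunk.splitlines())

def format_body(body, is_code=naive_is_code):
    """Wrap text chunks in <p> tags and runs of code chunks in one <c> pair."""
    # Split on one or more blank lines; leading indentation of the
    # following chunk is left untouched.
    chunks = [c for c in re.split(r"\n(?:[ \t]*\n)+", body.strip("\n"))
              if c.strip()]
    out = []
    code_run = []  # adjacent code chunks waiting to be emitted together

    def flush_code():
        if code_run:
            # Rejoin with a blank line so the vertical whitespace survives.
            out.append("<c>\n" + "\n\n".join(code_run) + "\n</c>")
            code_run.clear()

    for chunk in chunks:
        if is_code(chunk):
            code_run.append(chunk)
        else:
            flush_code()
            out.append("<p>" + chunk.strip() + "</p>")
    flush_code()
    return "\n".join(out)
```

Passing the classifier in as a parameter keeps the chunking logic separate from the hard part, which is deciding what is code.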

Determining if a chunk contains code is not terribly difficult (though also not entirely accurate). There are a lot of telltale indications (indentation, shebang line, $var, semi-colons, braces, etc.). The two difficult things are telling whether a chunk contains both code and regular text, and telling where the code begins and ends. The author might not put a blank line between the last sentence and the introduction of the code. This means you need to handle both P and C tags in one chunk. The indicators are not perfect for telling you where the code boundaries are either. A good script will probably start with a shebang line and a couple of use pragmas (strict and warnings), but these are newbies we are talking about. Indentation won't help and semi-colons can be red herrings.
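To make the indicator idea concrete, here is one possible line-by-line scoring sketch (Python, for illustration only). The weights, the threshold of 2, and the names line_score, is_code_line and split_mixed are all my assumptions; none of this has been tuned against real posts.

```python
import re
from itertools import groupby

# (pattern, weight) pairs for the indicators mentioned above.
# The weights are guesses, not measured values.
INDICATORS = [
    (re.compile(r"^#!.*\bperl\b"), 3),                  # shebang line
    (re.compile(r"^\s*use\s+(?:strict|warnings)\b"), 3),  # common pragmas
    (re.compile(r"^\s*[{}()\[\];,]+\s*$"), 2),          # punctuation-only line
    (re.compile(r"[$@%]\w+"), 1),                       # sigil plus name
    (re.compile(r";\s*$"), 1),                          # trailing semi-colon
    (re.compile(r"[{}]\s*$"), 1),                       # trailing brace
    (re.compile(r"^(?: {2,}|\t)"), 1),                  # indentation
]

def line_score(line):
    """Sum the weights of every indicator that matches the line."""
    return sum(weight for rx, weight in INDICATORS if rx.search(line))

def is_code_line(line, threshold=2):
    return line_score(line) >= threshold

def split_mixed(chunk):
    """Group consecutive lines of a chunk into ('text'|'code', lines) runs."""
    runs = []
    for is_code, lines in groupby(chunk.splitlines(), key=is_code_line):
        runs.append(("code" if is_code else "text", "\n".join(lines)))
    return runs
```

For example, given a chunk where the author ran prose straight into code:

```python
chunk = (
    "Here is what I tried so far:\n"
    "my $total = 0;\n"
    "foreach my $n (@nums) {\n"
    "    $total += $n;\n"
    "}\n"
    "It prints nothing; why?"
)
for kind, text in split_mixed(chunk):
    print(kind, repr(text))
```

this yields a text run, a code run of four lines, and a final text run. The mid-sentence semi-colon in the last line is ignored because only a trailing one scores. A sloppier post would defeat it easily, so treat it as a starting point.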

Setting aside __END__ tags, I considered adding/removing lines from @is_code and using things like Deparse, PerlTidy, perl -c, etc., but syntax errors are prevalent in newbie code and perl itself is too eager to try to DWYM.

I hope you succeed where I failed.

Cheers - L~R