PerlMonks  

AutoMagic HTML

by Cody Pendant (Prior)
on Jun 24, 2003 at 03:03 UTC ( [id://268377]=perlquestion )

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on code to render HTML automatically from text files.

What I'd like is for a text file containing just linebreaks and some special characters to be rendered as HTML by some regexes.

I need to do three things:

  1. render some special-character shortcuts as bold and italic, i.e. in
        this is
        i italic text
    the line which starts with an i-space gets rendered as italics.

  2. I also need to do some multi-line ones, ( see 266313 ) to render lists.

  3. My final wish is to render the content, if it's not in one of those lists, into HTML-compliant paragraphs (at the moment it's just a bunch of text in a TD with double-BR-tags separating the "paragraphs").

So I can hack together regexes for each one of those, but I'm in TMTOWTDI mode at the moment, pondering various ways of doing it.

I could work on the lines of the text file as an array, I could put them together as one long string, or I could do a bit of both -- say, do the single-line ones iterating over the array, then join() and do the multi-line ones.

What do monks think?



“Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
M-J D

Replies are listed 'Best First'.
Re: AutoMagic HTML
by BrowserUk (Patriarch) on Jun 24, 2003 at 04:09 UTC
    I _always_ favoured /markup/ done like *this*.

    becoming

    I always favoured markup done like this.

    Paragraphs are simply blocks of text with no intervening blank lines.
    Paragraphs are simply blocks of text with no intervening blank lines.
    Paragraphs are simply blocks of text with no intervening blank lines.
    Paragraphs are simply blocks of text with no intervening blank lines.

    This is the start of a new para.

    Paragraphs are simply blocks of text with no intervening blank lines. Paragraphs are simply blocks of text with no intervening blank lines. Paragraphs are simply blocks of text with no intervening blank lines. Paragraphs are simply blocks of text with no intervening blank lines.

    This is the start of a new para.

    1 This is an H1 header

    Header lines are single lines starting with a numeric with blank lines above and below.

    -list item 1
    -list item 2
    --nested list item 1
    --nested list item 2

    This is a para subordinate to the second item in the nested list.

    and another.

    --nested list item 3

    The nested list ends with the -- line below.
    --
    -list item 3
    -
    =================
    Any line consisting of say half a dozen or more =s becomes an HR.

    This is the final paragraph in this example.

    This is an H1 header

    Header lines are single lines starting with a numeric with blank lines above and below.

    • list item 1
    • list item 2
      • nested list item 1
      • nested list item 2

        This is a para subordinate to the second item in the nested list.

        And another.

      • nested list item 3

        The nested list ends with the -- line below.

    • list item 3

    Any line consisting of say half a dozen or more =s becomes an HR.

    This is the final paragraph in this example.

    This seems easy and intuitive to type, relatively easy for the human eye to parse and see the intent in its raw form, and uses simple enough rules to make it fairly simple to perform the conversion process.
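A minimal sketch of the inline and paragraph rules described above, not BrowserUk's own code. It assumes that `_x_`, `/x/` and `*x*` mean underline, italic and bold (the rendered example does not say which is which), and the names `inline_markup` and `paragraphs` are made up for illustration. Headers, lists and HR lines are left out.

```perl
use strict;
use warnings;

# Assumed mapping from delimiter to HTML tag.
my %TAG = ('_' => 'u', '/' => 'i', '*' => 'b');

# Convert _x_, /x/ and *x* spans within a block of text.
sub inline_markup {
    my ($text) = @_;
    # Escape HTML metacharacters first, so the only tags are the ones we add.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    # One pass for all three delimiters, so an inserted "</u>" is never
    # rescanned and mistaken for the start of an italic span.
    $text =~ s{(?<![A-Za-z0-9])([_/*])(\S(?:.*?\S)?)\1(?![A-Za-z0-9])}{<$TAG{$1}>$2</$TAG{$1}>}g;
    return $text;
}

# Split on blank lines and wrap each non-empty block in <p>...</p>.
sub paragraphs {
    my ($text) = @_;
    my @blocks = grep { /\S/ } split /\n\s*\n/, $text;
    return join "\n", map {
        my $p = inline_markup($_);
        $p =~ s/\s+\z//;          # drop trailing whitespace/newline
        "<p>$p</p>";
    } @blocks;
}

print paragraphs("I _always_ favoured /markup/ done like *this*.\n\nThis is the start of a new para.\n"), "\n";
```

The delimiter test is deliberately crude: a URL containing slashes can be mistaken for an italic span, so a real version would need an escape or a stricter rule.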


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: AutoMagic HTML
by hossman (Prior) on Jun 24, 2003 at 03:10 UTC
    You might want to look into HTML::FromText ... in particular, take a look at the "See Also" section.
Re: AutoMagic HTML
by jmcnamara (Monsignor) on Jun 24, 2003 at 08:22 UTC

    This idea seems similar to the scheme used by various Wikis where simple text markup is converted to HTML.

    See for example C2 Wiki formatting, Usemod formatting or the Kwiki formatting rules.

    The above examples are written in Perl and the source code is readily available should you wish to use their formatting.

    --
    John.

      Thanks jmcnamara, that was very useful.

      The Wiki people, as I found out here, start with the text as one long string.

      We have the text of a page in one big string which we split into lines to be processed individually. This colors our TextFormattingRules, especially those dealing with bullet lists. Since we've now forced authors to be newline conscious, we give them the opportunity to escape newlines with a back-slash (\) which we substitute with a blank.
      sub PrintBodyText {
          s/\\\n/ /g;
          foreach (split(/\n/, $_)) {

      (there follows a long long list of regexes) which is kind of an interesting answer to my question, "should I work on an array or a block of text?", which I hadn't considered. It's kind of a "both".
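To make the "both" idea concrete, here is a small sketch in the same spirit. It is not the Wiki source: the function name `render_body` and the `* item` bullet syntax are assumptions for illustration. The whole text is first treated as one string (to join backslash-continued lines), then split and handled line by line with a little state for lists.

```perl
use strict;
use warnings;

# One big string in, HTML string out.
sub render_body {
    my ($text) = @_;
    $text =~ s/\\\n/ /g;                  # whole-string step: backslash-newline joins lines
    my @out;
    my $in_list = 0;
    for my $line (split /\n/, $text) {    # line-by-line step
        if ($line =~ /^\*\s+(.*)/) {      # assumed bullet syntax: "* item"
            push @out, '<ul>' unless $in_list++;
            push @out, "<li>$1</li>";
            next;
        }
        if ($in_list) {                   # first non-bullet line closes the list
            push @out, '</ul>';
            $in_list = 0;
        }
        push @out, $line;
    }
    push @out, '</ul>' if $in_list;       # list ran to the end of the text
    return join "\n", @out;
}

print render_body("intro\n* one\\\ncontinued\n* two\noutro\n"), "\n";
```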



Re: AutoMagic HTML
by erikharrison (Deacon) on Jun 24, 2003 at 06:36 UTC

    Well, you've got a simple markup. So what you're really doing is hacking up a parser for it. And we all know what happens to hacks . . .

    So, in the interest of elegance, ease, and extensibility, I'd use an existing parser generator, and keep the language definition around to extend. I'd use Parse::RecDescent myself, but use whatever you like of course.

    Cheers,
    Erik

    Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchett

Re: AutoMagic HTML
by bart (Canon) on Jun 24, 2003 at 13:17 UTC
    the line which starts with an i-space gets rendered as italics.
    You don't ever expect normal text to start with "i"+space?
    i like tea.
    I'd take something more exceptional for the markup, or provide a way to escape it.
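One possible escape, sketched here as an assumption rather than anything the thread settled on: a leading backslash switches the shortcut off for that line. The name `markup_line` is hypothetical.

```perl
use strict;
use warnings;

# "i text" / "b text" become <i>/<b>; "\i text" stays as literal "i text".
sub markup_line {
    my ($line) = @_;
    # Escaped marker: drop the backslash and leave the rest untouched.
    return $line if $line =~ s/^\\(?=[ib]\s)//;
    $line =~ s{^([ib])\s+(.*)}{<$1>$2</$1>};
    return $line;
}

print markup_line($_), "\n" for 'i like tea.', '\i like tea.', 'plain text';
```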
      If e.e. cummings ever uses my software, that could be an issue. Otherwise I won't sweat it.

Re: AutoMagic HTML
by Cody Pendant (Prior) on Jun 24, 2003 at 04:33 UTC
    Well thanks, but those posts aren't really what I was hoping for -- perhaps I should abstract the problem a bit more to make it more Monk-y?

    I have some replacements to perform on a text file which are best done as line-by-line processing on the file.

    I have some more which are best done as applying patterns to a block of text which happens to contain newlines.

    I could slurp it into an array, work on the array to do some, then join it and work on the others, or start by slurping it into one big string and do everything with multi-line regexes. Which is best?
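A third option that sits between the two, sketched under assumptions: read in paragraph mode (`$/ = ''`), so each record is one blank-line-separated block, and apply the line-level rules inside each block with `/m`. The function name `render_paragraphs` is made up, and the in-memory filehandle only stands in for a real file.

```perl
use strict;
use warnings;

sub render_paragraphs {
    my ($text) = @_;
    # In-memory filehandle, standing in for the real text file.
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = '';                 # paragraph mode: one record per block
    my @html;
    while (my $block = <$fh>) {
        $block =~ s/\n+\z//;       # strip the separator newlines
        # Line-level rule, applied to every line of the block via /m.
        $block =~ s{^([ib])[ \t]+(.*)}{<$1>$2</$1>}mg;
        push @html, "<p>$block</p>";
    }
    return join "\n", @html;
}

print render_paragraphs("first line\ni second line\n\n\nnext para\n"), "\n";
```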



      > which is best?

      Best is what YOU think is best. TMTOWTDI and no one can tell what's the best way if you don't define "best" ;-)
      Is best the code that is

      • fastest
      • shortest
      • least memory consuming
      • clearest to understand
      • most obfuscated
      • ...

      Having said that, I would do it on a line-by-line basis, something like this

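      while (<>) {
          # replace italics and bold
          s{^([ib])\s+(.*)}[<$1>$2</$1>];

          # find ordered lists: the flip-flop is true from a "[" line to the next "]" line
          if (my $hit = /^\[\s*$/ .. /^\]\s*$/) {
              if ($hit == 1)     { print "<ol>\n";  next; }   # first line of the range: "["
              if ($hit =~ /e0/i) { print "</ol>\n"; next; }   # last line: sequence number ends in "E0"
              s[^][<li>];
              s[$][</li>];
          }
          print;   # emit the (possibly rewritten) line
      }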
      This won't work with cascaded, ordered lists, but it's a starting point.
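For nested lists, one way to go beyond the flip-flop is to keep a stack instead. This is only a sketch, assuming the same `[` and `]` marker lines as the code above. The name `render_lists` is hypothetical, and every non-marker line inside a list is treated as an item.

```perl
use strict;
use warnings;

# Takes a list of lines, returns a list of HTML lines.
sub render_lists {
    my @out;
    my @open;    # one entry per open <ol>; true while it has an unclosed <li>
    for my $line (@_) {
        chomp(my $l = $line);
        if ($l =~ /^\[\s*$/) {                  # "[" opens a (possibly nested) list
            push @out, '<ol>';
            push @open, 0;
        }
        elsif (@open && $l =~ /^\]\s*$/) {      # "]" closes the innermost list
            push @out, '</li>' if pop @open;
            push @out, '</ol>';
        }
        elsif (@open) {                         # any other line in a list is an item
            push @out, '</li>' if $open[-1];    # close the previous item first
            push @out, "<li>$l";                # left open so a nested list can follow
            $open[-1] = 1;
        }
        else {
            push @out, $l;                      # outside any list: pass through
        }
    }
    while (@open) {                             # close lists left open at end of input
        push @out, '</li>' if pop @open;
        push @out, '</ol>';
    }
    return @out;
}

print "$_\n" for render_lists('[', 'one', '[', 'one-a', ']', 'two', ']');
```

Items are left open until the next item or the end of the list, so a nested `<ol>` lands inside its parent `<li>`. A `[` that directly follows another `[` would still produce a list nested directly in a list, which a real version should guard against.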
        • fastest
        • shortest
        • least memory consuming

        Sorry, good point, I meant "fastest", as this rendering is supposed to happen on the fly as HTML is output to the browser.
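If "fastest" is the goal, measuring is probably more reliable than guessing. Below is a rough sketch using the core Benchmark module to compare a per-line loop with a single `/mg` substitution over the whole string. The sample text, the function names and the iteration count are all made up for illustration, and results will depend on the real input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample input: 100 marked lines interleaved with 100 plain ones.
my $text = join '', map { "i italic line $_\nplain line $_\n" } 1 .. 100;

# Line-by-line: split into lines (keeping newlines), rewrite each, rejoin.
sub by_line {
    my @lines = split /^/, $_[0];
    s{^([ib])[ \t]+(.*)}{<$1>$2</$1>} for @lines;
    return join '', @lines;
}

# Whole string: one substitution, with /m so ^ matches at every line start.
# [ \t]+ rather than \s+ keeps the match from running across a newline.
sub by_string {
    my $copy = $_[0];
    $copy =~ s{^([ib])[ \t]+(.*)}{<$1>$2</$1>}mg;
    return $copy;
}

# Prints a comparison table; the iteration count is arbitrary.
cmpthese(2000, {
    by_line   => sub { by_line($text) },
    by_string => sub { by_string($text) },
});
```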



