http://qs321.pair.com?node_id=445791

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

OK, the title may be confusing, but I don't know how to write a summary for this problem.

I'm writing a really simple Wiki-like engine. The problem is, Wiki is much like Perl's double-quote mark, but sometimes you want a single-quote:

This is **bold**, but ``this is **not** bold``.

becomes

This is <b>bold</b>, but this is **not** bold.

In short, I want the text between two pairs of backticks not to be processed. I can't think of any way to do this with simple regexps, so please help me. :)

By the way, I'm using something like s/\*\*(.+?)\*\*/<b>$1</b>/gs for the tag processing; does anyone have a better idea?

Replies are listed 'Best First'.
Re: Interpolate Text Not Inside a Certain Tag
by dragonchild (Archbishop) on Apr 07, 2005 at 17:17 UTC
    I can't think of any way to do this with simple regexps, . . .

    Then don't use a regex. Why does everyone think that all string manipulation should be handled with a regex?!? Use a simple character-by-character parser, noting state as you move through the string.

    # UNTESTED !!!
    my $in_quote = 0;
    my $in_bold = 0;
    my $final_string = '';
    foreach my $char ( split //, $string ) {
        if ( $char eq "'" ) {
            # Toggle "inside quotes" state; the quote mark itself is dropped.
            $in_quote = 1 - $in_quote;
        }
        elsif ( $char eq '*' && !$in_quote ) {
            # Emit <b> or </b> depending on whether bold is already open.
            $final_string .= '<';
            $final_string .= $in_bold ? '/' : '';
            $final_string .= 'b>';
            $in_bold = 1 - $in_bold;
        }
        else {
            $final_string .= $char;
        }
    }
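The snippet above toggles on a single quote character and a single asterisk, while the question uses the two-character markers `` and **. A minimal sketch of the same state-machine idea adapted to those markers might look like this; the name render_wiki is hypothetical and not from the thread, and unbalanced markers are not handled specially.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): a character-by-character
# scanner that looks at two characters at a time for `` and **.
sub render_wiki {
    my ($string) = @_;
    my ($in_quote, $in_bold) = (0, 0);
    my $out = '';
    my $i   = 0;
    my $len = length $string;

    while ($i < $len) {
        my $pair = substr($string, $i, 2);
        if ($pair eq '``') {
            # Toggle literal mode and drop the marker itself.
            $in_quote = !$in_quote;
            $i += 2;
        }
        elsif ($pair eq '**' && !$in_quote) {
            # Open or close a bold run.
            $out .= $in_bold ? '</b>' : '<b>';
            $in_bold = !$in_bold;
            $i += 2;
        }
        else {
            # Ordinary character (or a * inside literal mode): copy it.
            $out .= substr($string, $i, 1);
            $i++;
        }
    }
    return $out;
}

print render_wiki('This is **bold**, but ``this is **not** bold``.'), "\n";
```

For the example in the question this prints "This is <b>bold</b>, but this is **not** bold."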
Using Multiple m/\G.../gc to Tokenize
by ikegami (Patriarch) on Apr 07, 2005 at 17:28 UTC

    An easily extendable solution:

    {
        for ($text) {   # Alias $_ to $text.
            /\G `` (.*?) `` /gcsx && do { print($1); redo };
            /\G \*\* (.*?) \*\* /gcsx && do { print("<b>$1</b>"); redo };
            /\G (
                .             # Catchall.
                (?:           # These four lines are optional.
                    (?!``)    # They are here to speed things up
                    (?!\*\*)  # by avoiding calling print for
                .)*           # single characters.
            ) /gcsx && do { print($1); redo };
        }
    }

    Handles mismatched ** and `` by treating them as normal characters.
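As a rough illustration of that behaviour, here is a sketch of the same \G tokenizing idea that builds a string instead of printing. The wrapper name tokenize_wiki is an assumption made for this example, and the optional speed-up in the catchall is left out, so this is a simplified variant rather than the code above.

```perl
use strict;
use warnings;

# Hypothetical wrapper (name not from the post): same \G ... /gc idea,
# but collecting the output in a string.
sub tokenize_wiki {
    my ($text) = @_;
    my $out = '';
    pos($text) = 0;    # start scanning at the beginning
    while (pos($text) < length $text) {
        # /c keeps pos() where it was when a match attempt fails,
        # so the next alternative is tried at the same position.
        if    ($text =~ /\G``(.*?)``/gcs)     { $out .= $1 }           # literal span
        elsif ($text =~ /\G\*\*(.*?)\*\*/gcs) { $out .= "<b>$1</b>" }  # bold span
        elsif ($text =~ /\G(.)/gcs)           { $out .= $1 }           # catchall
    }
    return $out;
}

print tokenize_wiki('This is **bold**, but ``this is **not** bold``.'), "\n";
print tokenize_wiki('a lone ** stays as it is'), "\n";
```

The second call has an unmatched **, so neither of the first two patterns can match at that position and the catchall copies the asterisks through one character at a time.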

    Update:

    • Removed "\n"s from prints.
    • Added /s option to regexps, since I'm guessing newlines are not special.
    • The 1st .*? was ((?:(?!``).)*
    • The 2nd .*? was ((?:(?!\*\*).)*

    Tested:

Re: Interpolate Text Not Inside a Certain Tag
by tlm (Prior) on Apr 07, 2005 at 17:29 UTC
Re: Interpolate Text Not Inside a Certain Tag
by jonadab (Parson) on Apr 07, 2005 at 17:25 UTC

    First off, dragonchild's answer is well worth considering, and probably the better choice. But for the sake of interest... I think it may be possible to do this with a regex, provided the problem really is as simple as the way you have stated it and not complicated by additional nesting or somesuch. Something along these lines...

    s!(?:(?:([']{2})([^']+)[']{2})|(?:([*]{2})([^'*]+)[*]{2}))!($3 eq '**')?"<b>$4</b>":$2!ge;

    ...might work. (No, lots of parentheses don't bother me. Yep, I knew a lisp variant before I learned Perl.) But dragonchild's solution is easier to read and maintain.

    update: fixed silly paren-counting error

    "In adjectives, with the addition of inflectional endings, a changeable long vowel (Qamets or Tsere) in an open, propretonic syllable will reduce to Vocal Shewa. This type of change occurs when the open, pretonic syllable of the masculine singular adjective becomes propretonic with the addition of inflectional endings."  — Pratico & Van Pelt, BBHG, p68
      It doesn't work for the simple case "**bold**". I haven't tried anything else.

        Yeah, I wasn't thinking and used the match variables as if there were only two sets of parens, rather than four. The updated version works for that simple case. However, I'd worry about unexpected data screwing it up potentially rather badly, and it only handles one type of quote mark; if you're allowed to have both single and double quote marks and nest them and escape quote marks within quotes with backslashes, stuff gets messy fast.
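For the simple, non-nested case, one way to sidestep the paren counting is to use only two capture groups and test which one is defined. This is a sketch under that assumption, using the backtick markers from the question rather than single quotes; the name render_one_pass is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): one substitution with two
# alternatives. $1 is defined only when the `` alternative matched.
sub render_one_pass {
    my ($text) = @_;
    $text =~ s{ ``(.*?)`` | \*\*(.+?)\*\* }
              { defined $1 ? $1 : "<b>$2</b>" }gsex;
    return $text;
}

print render_one_pass('This is **bold**, but ``this is **not** bold``.'), "\n";
```

Because the substitution scans left to right and consumes a whole backtick span in one match, any ** inside that span is never seen by the second alternative. It makes no attempt at nesting or backslash escapes.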

Re: Interpolate Text Not Inside a Certain Tag
by Anonymous Monk on Apr 07, 2005 at 17:55 UTC

    Alright, thanks all!

    I've already thought of using a "character-by-character parser" as dragonchild suggested, but I guess Perl scripts are just less sexy when you have to use techniques too common for other languages ;) . Just kidding; the truth is, as this is meant to be a really simple script (for personal use), regex seemed to be a good option: short and, well, usually simple. Of course, I'd choose char-by-char parsing over using overly-complicated regular expressions, so, anyway, thanks again.

Re: Interpolate Text Not Inside a Certain Tag
by satchm0h (Beadle) on Apr 07, 2005 at 20:39 UTC
    I realize you already have a solution, but what about this:
    sub boldify {
        local $/ = undef;
        my $input = shift;
        # Splitting on `` leaves the text outside the backtick pairs at even indices.
        my @parts = split /``/, $input;
        foreach my $i (0..$#parts) {
            $parts[$i] =~ s/\*\*(.+?)\*\*/<b>$1<\/b>/gs if ($i % 2 == 0);
        }
        return join '', @parts;
    }

    Here's a test:

      Quite interesting, but I don't think it's flexible enough. For example, if later I decide that `` enclosed by spaces ( /\s``\s/ ) shouldn't be recognized as an "escape mark", well, how can we detect it?
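One possible answer, sketched under the assumption that "enclosed by spaces" means whitespace on both sides of the ``: keep the split approach, but make the split pattern refuse to match such a marker. The name boldify_strict is hypothetical, and runs of three or more backticks are not considered.

```perl
use strict;
use warnings;

# Hypothetical variant of the split approach (name not from the thread).
sub boldify_strict {
    my ($input) = @_;
    # Match `` unless it has whitespace immediately before AND after it.
    # The limit of -1 keeps trailing empty fields so the odd/even
    # alternation stays intact.
    my @parts = split /(?!(?<=\s)``\s)``/, $input, -1;
    for my $i (0 .. $#parts) {
        # Even-indexed parts are outside the backtick pairs.
        $parts[$i] =~ s/\*\*(.+?)\*\*/<b>$1<\/b>/gs if $i % 2 == 0;
    }
    return join '', @parts;
}

print boldify_strict('This is **bold**, but ``this is **not** bold``.'), "\n";
print boldify_strict('a `` b **c**'), "\n";
```

In the second call the `` sits between two spaces, so it is left in the text and **c** is still processed.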