
Golf: Fix de facto HTML comments

by tye (Sage)
on Jul 18, 2004 at 18:58 UTC ( [id://375412] : perlmeditation )

Your challenge is to 'golf' some Perl code (produce code that requires the fewest [key] strokes -- fewest characters) that mostly just does s/--/-/g, but with some simple restrictions. I was surprised that I implemented this simple task over a dozen times before I finally got it right. I golfed mine down to 80 characters, so I wanted to see what y'all can come up with. Getting a correct solution may be a bigger challenge than golfing the solution.


A 'de facto HTML comment' is started by "<!--" and ended by "-->" and can contain anything between those two delimiters except, of course, "-->". This is such a nice, simple, easy-to-parse definition that it has advantages over a standard HTML comment.
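As a rough illustration of that definition (a sketch only, not any poster's solution), a non-greedy match with /s picks out each de facto comment. The sub name `comment_bodies` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between each "<!--" and the
# nearest following "-->" (the de facto definition described above).
sub comment_bodies {
    my ($html) = @_;
    # /s lets "." cross newlines; ".*?" stops at the first "-->".
    # In list context, a /g match returns every captured body.
    return $html =~ /<!--(.*?)-->/gs;
}

my @bodies = comment_bodies("a<!-- x -->b<!--y\nz-->c");
print scalar(@bodies), " comments found\n";
```

Note that the opener and closer cannot share dashes under this reading, so "<!-->" on its own is not a complete comment.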

Some (notorious but still very popular) browsers only handle de facto HTML comments. Many browsers only handle standard HTML comments. [1]

Your task is to golf some code that will adjust de facto HTML comments so that they are also standard HTML comments. I'll let those who are curious about the details of standard HTML comments visit Google. The only detail we need to worry about for the golf is that "--" inside of a de facto HTML comment is the problem.

Although "<!-- foo -- -- bar -->" is a valid HTML comment according to both the standard and de facto definitions, I'll make the task much easier by just requiring that all occurrences of "--" be replaced inside of the de facto comments. But we want to change as few pixels as possible so we'll transform the above comment to something like "<!-- foo -¬ -¬ bar -->".
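Spelled out without any golfing, the basic transformation might look like the sketch below. It assumes the replacement character is the Latin-1 "not" sign ("\xAC"), and the names `naive_fix` and `$NOT` are invented for illustration. It is deliberately naive and is not claimed to satisfy all of the rules.

```perl
use strict;
use warnings;

my $NOT = "\xAC";   # assumed replacement character (Latin-1 "not" sign)

# Hypothetical, un-golfed sketch: inside each de facto comment,
# turn every "--" into "-" followed by the replacement character.
sub naive_fix {
    my ($html) = @_;
    $html =~ s{<!--(.*?)-->}{
        (my $body = $1) =~ s/--/-$NOT/g;
        "<!--$body-->"
    }gse;
    return $html;
}

print naive_fix("<!-- foo -- -- bar -->"), "\n";
```

Text outside comments is left alone, but at least one edge case slips through: in "<!-- x --->" the body is " x -", which contains no "--", so nothing changes even though the first "--" after the opener is not the start of "-->".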

If you can code a solution that changes even fewer characters but still makes sure each de facto comment ends up also being a standard comment, then you'll get bonus points (in the tradition of Whose Line Is It Anyway).

I chose "¬" (the "not" symbol, "\xAC", &#xAC;=¬) because it looks a lot like "-" in most fonts and is still in Latin-1. The soft hyphen (&#xAD;=&shy;) looks even closer to "-" but shouldn't be displayed at all in most cases, so I rejected it. The en dash is "–", &#x2013;, &ndash;, and is "\x96" in Windows-1252 (Microsoft's extension to Latin-1 which is nearly the de facto interpretation of "Latin-1") and it also looks even more like "-". But some browsers are still standards-compliant enough that they won't display that. How does your browser display it ()?
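To make the encoding side of that choice concrete, here is a small sketch using the core Encode module. It lists the three candidate characters with their code points and UTF-8 bytes, then decodes byte 0x96 as Windows-1252. The hash and variable names are just for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The three candidate characters, by Unicode code point.
my %candidate = (
    'not sign'    => "\x{AC}",
    'soft hyphen' => "\x{AD}",
    'en dash'     => "\x{2013}",
);

for my $name (sort keys %candidate) {
    my $utf8 = encode('UTF-8', $candidate{$name});
    printf "%-11s U+%04X  UTF-8 bytes: %s\n",
        $name, ord $candidate{$name},
        join ' ', map { sprintf '%02X', ord } split //, $utf8;
}

# Byte 0x96 is only an en dash when the bytes are read as Windows-1252.
my $from_cp1252 = decode('cp1252', "\x96");
printf "cp1252 0x96 decodes to U+%04X\n", ord $from_cp1252;
```

The "not" sign and the soft hyphen each fit in a single Latin-1 byte, while the en dash lies outside Latin-1 and only has a one-byte form in Windows-1252.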

The rules

  1. Insert as few characters as possible into the following code:
    #!/usr/bin/perl -w
    use strict;
    $| = 1;
    $/ = '';
    for( <DATA> ) {
    #2345678 1 2345678 2 2345678 3 2345678...
    # Replace this line with your code
    ;
        print;
    }
    Some sample input is shown later.
  2. Your code must make it so that, for each "<!--" that starts a de facto HTML comment, the next occurrence of "--"s after it is the first two characters of "-->" (which ends the comment). Bonus points for instead making each comment valid according to the HTML standards.
  3. Your code should change as few characters as possible.
    • So it should not change any characters outside of de facto HTML comments. (If there is a "<!--" that is never followed by a "-->" then your code can either treat the rest of the string as being inside a comment or outside, whatever makes your code shorter.)
    • Rerunning your code on output from your code should make no changes.
    • Your code must only change "-" to "¬". So running tr/\x95/-/ on the input and output should give the same results.
    Points deducted for changing too many characters but even more points deducted for not producing comments that fit both definitions.
  4. You can assume the input and output are 8-bit Latin-1. Or you can assume utf-8 strings if you prefer. Other encodings might be legal though I can't think of any advantage.
  5. You get penalized for causing global side effects. This means that using "$a" instead of "my $x" isn't going to be a net win here. You can use global variables for their intended purposes but you'll get a small penalty if you change them and don't change them back (either to their previous value or to their standard default value).
  6. You get penalized for causing warnings.
  7. Please hide your solutions like spoilers (such as using a table or similar to set identical foreground and background colors and/or using READMORE tags and putting "spoilers" in your node title).
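For anyone who wants to test a candidate before posting, rules 2 and 3 can be checked mechanically. The sketch below is one literal reading of those rules, not the promised test code. The names `rule2_ok`, `rule3_ok` and `$NOT` are invented here, "after it" is taken to mean after the four-character opener, and the replacement character is assumed to be "\xAC" (set `$NOT` to whichever byte your solution actually uses).

```perl
use strict;
use warnings;

my $NOT = "\xAC";   # assumed replacement character (Latin-1 "not" sign)

# Rule 2, read literally: after each "<!--", the next "--" must be the
# start of "-->". An unterminated comment is accepted here.
sub rule2_ok {
    my ($s) = @_;
    my $pos = 0;
    while ((my $open = index($s, '<!--', $pos)) >= 0) {
        my $dashes = index($s, '--', $open + 4);
        return 1 if $dashes < 0;                       # never closed
        return 0 if substr($s, $dashes, 3) ne '-->';   # stray "--"
        $pos = $dashes + 3;
    }
    return 1;
}

# Rule 3: only "-" may become the replacement character, and a second
# pass must change nothing. $fix is a code ref to the solution under test.
sub rule3_ok {
    my ($fix, $in) = @_;
    my $out = $fix->($in);
    (my $in_plain  = $in)  =~ s/\Q$NOT\E/-/g;
    (my $out_plain = $out) =~ s/\Q$NOT\E/-/g;
    return $in_plain eq $out_plain && $fix->($out) eq $out;
}
```

A candidate wrapped as `sub { local $_ = shift; ...; $_ }` can be passed to `rule3_ok`, and its output to `rule2_ok`. Neither check covers the "change as few characters as possible" part of rule 3.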

Later I'll post my solution and some test code that covers some of the rules. For now, I don't want to hint at techniques to try.

Here is some test data (but don't assume this is the only data you need to handle):

__END__ ---<!-- -->---> <--!-- <!-- -- --> --> <!---->--<!----->-<!------>---<!-------> <!---><!----> <!--->--<!----> <!--->---<!----> <!--->----<!----> -<!-->--<!-->--<!-->---<!--> <!--><!-->-<!-->--<!-->--<!-->---<!-->-- <!-- - - --> <!--- ---> <!---- ---->

[1] Some browsers don't manage to get either definition right. I have a copy of Opera that appears to require < and > to be balanced inside of HTML comments. Opera impresses me both with its nice features and how it manages to have bugs that are just so, well, stupid. (:

- tye        

Replies are listed 'Best First'.
Re: Golf: Fix de facto HTML comments (spoiler)
by blokhead (Monsignor) on Jul 18, 2004 at 19:26 UTC
    This solution seems too straight-forward and obvious, so I feel like I must have missed an important detail...

    51 characters:

    Update: I did miss an important detail, so here's a revision (62 chars):


Re: Golf: Fix de facto HTML comments
by dws (Chancellor) on Jul 18, 2004 at 22:11 UTC

    51 characters, though I'm sure there's a golf trick that could drive that down further.

    # 345678 1 2345678 2 2345678 3 2345678 4 2345678 5
    s/<!--(.+?)-->/(my$x=$1)=~s#--#-¬#g;"<!--$x-->"/seg

    Hm... Looks like a nearly identical approach to blokhead's first approach. (Evil minds think alike.) Here's a tweak that takes it to 53 characters.

    # 345678 1 2345678 2 2345678 3 2345678 4 2345678 5 23
    s/<!--(.+?)-->/(my$x=$1)=~s#--#-¬#g;"<!-- $x -->"/seg

    Changed once more (to 49), since I was a doofus and didn't read tye's instructions carefully enough. I think this one loses points for changing too much, but wins for making the comments legal both ways.

    # 345678 1 2345678 2 2345678 3 2345678 4 2345678
    s/<!--(.+?)-->/(my$x=$1)=~s#-#¬#g;"<!--$x-->"/seg

      I was thinking along the same lines too, but I had a different approach to it, which saved me three characters.

      46 (but changes too much)

      He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

      Chady |

      Simple changes give 43. I was a little worried about regex engine reentrancy, but it seems to work OK.

      # 345678 1 2345678 2 2345678 3 2345678 4 2345678
      s/<!--(.+?)-->/$_=$1;y#-#¬#;"<!--$_-->"/seg
Re: Golf: Fix de facto HTML comments
by bageler (Hermit) on Jul 19, 2004 at 02:12 UTC
    here's 56. I used a period since the other char acted funky in my editor.

    edit: trimmed two chars, removed unnecessary character replacements.
Re: Golf: Fix de facto HTML comments
by ysth (Canon) on Jul 18, 2004 at 22:00 UTC
    Here's what I got (63, but much clearer than blokhead's, at least to me): Update: Sigh; I see I fell prey to the same trap blokhead did. I'll see if I can fix it.
Re: Golf: Fix de facto HTML comments
by BrowserUk (Patriarch) on Jul 19, 2004 at 07:14 UTC
    #2345678 1 2345678 2 2345678 3 2345678 4 2345678 5 2345678 6 2345
    s/(?<=<!--)([^>]*?)-->/local$_=$1;s!-(\s*)-!-$1\x95!g;"$_-->"/eg;

    #2345678 1 2345678 2 2345678 3 2345678 4 2345678 5 2345678 6 234
    s/<!--([^>]*?)-->/local$_=$1;s!-(\s*)-!-$1\x95!g;"<!--$_-->"/eg;

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Golf: Fix de facto HTML comments
by Chady (Priest) on Jul 19, 2004 at 07:47 UTC


    I believe this does it right.

    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady |
Re: Golf: Fix de facto HTML comments
by Wassercrats (Initiate) on Jul 18, 2004 at 23:40 UTC
    What would be the count on this if I took advantage of the idioms and defaults and stuff?
    while ($input =~ s/(.*?)(<.*?>)//) {
        $w=$1;
        $q=$2;
        1 while $q =~ s/(--.*?)-(.*?-->)/$1$2/sg;
        $output .= "$w$q";
    }
      Looks about 116.
      You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.

Node Type: perlmeditation [id://375412]
Approved by dws
Front-paged by ysth