http://qs321.pair.com?node_id=713858

bobafifi has asked for the wisdom of the Perl Monks concerning the following question:

I've got an HTML page I'm trying to do a find/replace on. For example
perl -i -pe 's/<TD><FONT FACE=arial SIZE=-1>/widget/g' * test.html
but there's a carriage return in the HTML between
<TD> <FONT FACE=arial...
Can somebody please tell me how to code for the carriage return?
Thanks!
Bob

Re: Removing the carriage return in a Find & Replace?
by psini (Deacon) on Sep 26, 2008 at 11:51 UTC

    Per the HTML standard, <CR>, <LF>, and <space> are interchangeable separators in an HTML document. Moreover, a run of two or more separators is treated as a single separator.

    So, if you want to catch <TD><FONT FACE=arial SIZE=-1> with a regex you should expect 0 or more separators wherever a separator is optional and 1 or more wherever it is required. That said, I think that /<TD>\s*<FONT\s+FACE=arial\s+SIZE=-1>/ should be enough.
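    As a quick sanity check of the pattern itself, here is a minimal sketch that tests it against an in-memory string rather than a file. The sample string is an assumption made up for illustration; it simply puts a newline between the two tags.

    ```shell
    out=$(perl -e '
        # Sample string (an assumption): the two tags separated by a newline.
        my $s = "<TD>\n<FONT FACE=arial SIZE=-1>";
        # \s* allows zero or more separators, \s+ requires at least one.
        print "matched\n" if $s =~ /<TD>\s*<FONT\s+FACE=arial\s+SIZE=-1>/;
    ')
    echo "$out"
    ```

    This only shows that the regex matches once both tags are in the same string; it says nothing yet about how the file is being read.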

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

      Thanks for the quick reply psini!
      Using your suggestion, I just tried
      perl -i -pe 's/<TD>\s*<FONT FACE=arial SIZE=-1>/widget/g' * test.php
      unfortunately it didn't work.

      However, when I remove the carriage return in the html and run
      perl -i -pe 's/<TD><FONT FACE=arial SIZE=-1>/widget/g' * test.php
      no problem. Not sure why, but the \s* doesn't seem to be recognized.

      Thanks again,
      Bob

        Because perl is reading the file a line at a time (line-by-line is the default, and you haven't told it to do otherwise), $_ will only contain <TD>\n, and the next line will have <FONT ....>. At no point does $_ hold the entire text you expect to match, so the match never happens and the substitution never triggers.

        See the documentation for the -0 switch in perlrun, specifically the part about turning on paragraph mode.
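        A minimal sketch of what that could look like, using a throwaway temp file with made-up contents (both are assumptions for illustration). Paragraph mode is -00, which reads blank-line-separated chunks and so works as long as no blank line falls between the two tags. The sketch uses -0777 instead, which slurps the whole file into $_ and does not depend on where blank lines are.

        ```shell
        # Hypothetical sample file with a line break between the two tags.
        tmp=$(mktemp)
        printf '<TD>\n<FONT FACE=arial SIZE=-1>cell</FONT></TD>\n' > "$tmp"

        # -0777 reads the whole file as one record, so \s* can span the newline.
        perl -0777 -i -pe 's/<TD>\s*<FONT\s+FACE=arial\s+SIZE=-1>/widget/g' "$tmp"

        result=$(cat "$tmp")
        echo "$result"
        rm -f "$tmp"
        ```

        Slurping loads the entire file into memory, which is fine for a typical HTML page but worth keeping in mind for very large files.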

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

        Are you sure it is a CR and not some evil non-printable character used by MS?

        Try editing the file with a text editor (not a word processor!), delete the current newline character, insert a CR and try again. If it works, the remaining problem is finding out which newline character the file actually uses.
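        One way to find out without opening an editor is to count each kind of line ending directly. This is only a sketch: the temp file and its Windows-style contents are assumptions standing in for the real page.

        ```shell
        # Hypothetical sample file using CRLF line endings.
        tmp=$(mktemp)
        printf '<TD>\r\n<FONT FACE=arial SIZE=-1>\r\n' > "$tmp"

        # Slurp the file and count CRLF pairs, lone CRs, and lone LFs.
        counts=$(perl -0777 -ne '
            my $crlf = () = /\r\n/g;
            my $cr   = () = /\r(?!\n)/g;
            my $lf   = () = /(?<!\r)\n/g;
            print "CRLF=$crlf CR=$cr LF=$lf\n";
        ' "$tmp")
        echo "$counts"
        rm -f "$tmp"
        ```

        Note that \s matches both CR and LF, so once the whole text is in $_ the \s* pattern copes with any of these endings.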

        Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."