PerlMonks  

Replacing Text

by pyro.699 (Novice)
on Apr 05, 2007 at 22:37 UTC ( [id://608561]=perlquestion )

pyro.699 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am looking for a way to easily replace text. I know how to use s/"find"/"replace"/g, but I think I need something a little more advanced. I need a way to take this: <a href="./somelink.html">This is a url :O </a> and turn it into this: This is a url :O so basically it is just stripping the link tag and keeping its text. There are 300 shtml files containing invalid links, and while I am working on the site I need to remove the links, but I do not want to go into every file, find all the invalid URLs, and replace them. Thanks a ton :) ~Cody Woolaver

Replies are listed 'Best First'.
Re: Replacing Text
by bobf (Monsignor) on Apr 05, 2007 at 22:54 UTC
Re: Replacing Text
by shmem (Chancellor) on Apr 05, 2007 at 22:57 UTC
    I guess the hrefsub example file inside the HTML::Parser distribution is what you are looking for, at least as a very good starting point.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Replacing Text
by mreece (Friar) on Apr 05, 2007 at 22:50 UTC
    expect a lot of answers directing you towards various html scrubbers and token parsers, but here is the dirty regex:
    s{(?:<a\s[^>]+>)([^<]+)(?:</a>)}{$1}g;
    there are a lot of assumptions in there: the link label doesn't contain < (e.g., an <img> tag), there are no line breaks, the HTML is well-formed, there are no embedded comments (<a <!--oops--> href="">...</a>), etc, etc. but if all your use cases are like the example there, it will work.
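A quick way to see where those assumptions bite is to run the substitution over a few sample strings. The sketch below is only an illustration: `strip_simple_links` is a hypothetical helper name and the sample inputs are made up. Note that the character classes `[^>]` and `[^<]` do match newlines, so the "no line breaks" assumption matters mainly when the input is read one line at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution from the post above
# to a copy of the input and returns the result.
sub strip_simple_links {
    my ($html) = @_;
    $html =~ s{(?:<a\s[^>]+>)([^<]+)(?:</a>)}{$1}g;
    return $html;
}

# Plain text label: the tags are removed and the label is kept.
print strip_simple_links('<a href="./somelink.html">This is a url :O </a>'), "\n";

# Label containing another tag: [^<]+ cannot match it, so the link is left untouched.
print strip_simple_links('<a href="./pic.html"><img src="pic.png"></a>'), "\n";

# Link split across lines: handled only if the whole text is in one string.
print strip_simple_links(qq{<a\nhref="./split.html">split\nlabel</a>}), "\n";
```

The second case is the one that matters for pages with image links: the regex does no harm there, but it also leaves the link in place.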
      Well, some of them do have images in the link, and I have a feeling that will make it even more complicated... lol. Thanks
Re: Replacing Text
by saintly (Scribe) on Apr 06, 2007 at 01:58 UTC
    $contents_of_file =~ s/\<a.*?\>(.*?)\<\/a\>/$1/igs;
    And you're done. No need to get all complicated and stuff...

    If the links aren't ever broken across multiple lines, you don't even need to write a script:
    $ perl -pi -e 's/\<a.*?\>(.*?)\<\/a\>/$1/igs' *.html
    If there are links broken across multiple lines, like
    <a href="something">
    foo!
    </a>
    those would require manual cleanup, but you can just grep for 'href' to see if there are any left. If it looks like there are a lot, then you can bust out a script to do it.
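For links that do span several lines, one option not mentioned in the post is to have perl read each file whole (the `-0777` switch), so that the `/s` modifier can let `.` cross newlines. The sketch below assumes that approach; `strip_links` is a hypothetical helper name, and the `\b` after `<a` is an added tightening so that tags such as `<abbr>` are not mistaken for anchors.

```perl
use strict;
use warnings;

# Hypothetical helper: strips <a ...> and </a> from a whole document held in one string.
# The \b is an added assumption so that <abbr> or <address> are not treated as <a>.
sub strip_links {
    my ($html) = @_;
    $html =~ s/<a\b[^>]*>(.*?)<\/a>/$1/gis;
    return $html;
}

my $page = qq{<p>intro</p>\n<a\n  href="something">\nfoo!\n</a>\n<abbr title="x">y</abbr>\n};
print strip_links($page);

# Roughly equivalent one-liner, reading each file whole instead of line by line
# (adjust the glob to your files, e.g. *.shtml; -i.bak keeps a backup copy):
#   perl -0777 -pi.bak -e 's/<a\b[^>]*>(.*?)<\/a>/$1/gis' *.shtml
```

This is still a regex over HTML, so it shares the other caveats raised in this thread (comments, a `>` inside an attribute value, malformed markup).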
      No need to get all complicated and stuff...
      And then you go on to do precisely that. With a "manual cleanup" and another script to boot. And I bet it would still break.

      What is so hard about using a parser?

      #!/usr/bin/perl
      use strict;
      use warnings;
      use HTML::TokeParser::Simple;

      my $p = HTML::TokeParser::Simple->new(*DATA);
      my $html;
      while (my $t = $p->get_token) {
          # skip the opening and closing anchor tags, keep everything else
          next if $t->is_start_tag('a') or $t->is_end_tag('a');
          $html .= $t->as_is;
      }
      print "$html\n";

      __DATA__
      <p>some text</p>
      <a href="./somelink.html">This is a url :O </a>
      <p>some more text</p>

      output:
      <p>some text</p>
      This is a url :O 
      <p>some more text</p>
      Somebody else has done all the hard work, so why give yourself pain?

      In my opinion using any regex on any html is the way to madness.

        Your solution is more easily maintainable and easy to understand, both admirable qualities and useful for production-level code.

        However, for a simple task like this that may be run only once, it may not be worth it to create a tool that does it. My first example would fit in with the script he already had to remove all the links in the file.

        An even simpler alternative is to just do the operation from the command line. No need to even open an editor. When I tested the above command line on a variety of web pages, it only missed one link: an href split across multiple lines (note: these are usually made by someone hand-coding HTML!). After fixing it by hand, the whole task was done.

        Remember that St. Wall has declared laziness to be one of the three great virtues of the programmer. Using the command line perl interpreter to get the job done is an even lazier way to solve the problem. I'm not claiming it's a better solution than yours, but it often helps to know several ways to solve the same problem so you can use the right tool for the job.

        Choosing the 'right' solution for even a simple task like this can take some consideration:
        • Will you need to perform this task again in the (near?) future?
        • Are the source files in good shape, with well-formed HTML?
        • Are there other projects you need to be working on, so you'd want to do this task quickly?
        • Is your tag-stripper going to be part of a larger application that some poor sap is going to have to maintain in 10 years?
        So for my estimation of the author's situation: 'no, yes, yes, no', I'd just use the command line, check my work with grep, fix any stray tags and then move on. No need to install CPAN modules for a trivial task. I'd consider writing a larger app for this to be like sandblasting a soupcracker. But I don't dismiss people who write an app to do this. They're the sort of people who will still have the app in 2 years when I want to do this task again, and they'll remember where they put it.
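The "check my work with grep" step could look something like the sketch below. It is only an illustration: `leftover_anchor_lines` is a hypothetical helper name, and it reports the line numbers that still contain an opening anchor tag after the substitution has run.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 1-based line numbers that still contain
# an opening <a ...> tag, similar in spirit to `grep -n -i href file`.
sub leftover_anchor_lines {
    my ($text) = @_;
    my @hits;
    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        push @hits, $line_no if $line =~ /<a\b/i;
    }
    return @hits;
}

# Made-up sample: a link split across lines that a line-by-line pass would miss.
my $sample = "<p>clean</p>\n<a\nhref=\"x\">missed</a>\n<p>done</p>\n";
my @left = leftover_anchor_lines($sample);
print @left ? "anchors left on line(s): @left\n" : "no anchors left\n";
```

Matching on `<a\b` rather than on `href` catches the opening tag even when the attribute sits on the following line.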
