PerlMonks  

Re^2: Replacing Text

by wfsp (Abbot)
on Apr 06, 2007 at 06:08 UTC


in reply to Re: Replacing Text
in thread Replacing Text

No need to get all complicated and stuff...
And then you go on to do precisely that. With a "manual cleanup" and another script to boot. And I bet it would still break.

What is so hard about using a parser?

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(*DATA);
my $html;
while (my $t = $p->get_token){
    next if $t->is_start_tag('a') or $t->is_end_tag('a');
    $html .= $t->as_is;
}
print "$html\n";
__DATA__
<p>some text</p>
<a href="./somelink.html">This is a url :O </a>
<p>some more text</p>

output:

<p>some text</p>
This is a url :O
<p>some more text</p>

Somebody else has done all the hard work, so why give yourself pain?

In my opinion using any regex on any html is the way to madness.

Re^3: Replacing Text
by saintly (Scribe) on Apr 06, 2007 at 14:16 UTC
    Your solution is more easily maintainable and easy to understand, both admirable qualities and useful for production-level code.

    However, for a simple task like this that may be run only once, it may not be worth it to create a tool that does it. My first example would fit in with the script he already had to remove all the links in the file.

    An even simpler alternative is to just do the operation from the command line. No need to even open an editor. When I tested the above command line on a variety of web pages, it only missed one link: an href split across multiple lines (note: these are usually made by someone hand-coding HTML!). After fixing it by hand, the whole task was done.
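
    The command line itself does not appear in this copy of the thread. Purely as an illustration, and not saintly's actual command, the sketch below assumes a line-by-line regex substitution of the kind a `perl -pe` one-liner performs. The helper `strip_anchor_tags` and the sample lines are made up for the example. It shows why a link whose start tag is split across lines slips through.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): strip <a ...> and </a> tags
# from a single line of text, leaving the link text in place.
sub strip_anchor_tags {
    my ($line) = @_;
    $line =~ s{</?a\b[^>]*>}{}gi;
    return $line;
}

# A link written entirely on one line is handled.
my @one_line = ( qq{<a href="./somelink.html">This is a url</a>\n} );

# A start tag split across two lines: neither line holds a complete <a ...>,
# so only the closing </a> on the second line is removed.
my @split = ( qq{<a\n}, qq{   href="./other.html">Another url</a>\n} );

print map { strip_anchor_tags($_) } @one_line, @split;
```

    The one-line link comes out as plain text, while the split one leaves `<a` and the `href="./other.html">` remnant behind, which is the sort of stray tag that then needs fixing by hand.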

    Remember that St. Wall has declared laziness to be one of the three great virtues of the programmer. Using the command line perl interpreter to get the job done is an even lazier way to solve the problem. I'm not claiming it's a better solution than yours, but it often helps to know several ways to solve the same problem so you can use the right tool for the job.

    Choosing the 'right' solution for even a simple task like this can take some consideration:
    • Will you need to perform this task again in the (near?) future?
    • Are the source files in good shape, with well-formed HTML?
    • Are there other projects you need to be working on, so you'd want to do this task quickly?
    • Is your tag-stripper going to be part of a larger application that some poor sap is going to have to maintain in 10 years?
    By my estimate of the author's situation ('no, yes, yes, no'), I'd just use the command line, check my work with grep, fix any stray tags and then move on. No need to install CPAN modules for a trivial task. I'd consider writing a larger app for this to be like sandblasting a soup cracker. But I don't dismiss people who write an app to do this. They're the sort of people who will still have the app in 2 years when I want to do this task again, and they'll remember where they put it.
      We'll have to agree to disagree. The horrible truth is that _is_ how I do quick and dirty one-offs!

      I do a lot of html parsing/rewriting. Machine generated, FrontPage generated (shudder), user generated, even beautifully hand knitted wfsp generated - and it always ends in tears.

      50KB or 50 bytes, I don't care. "Where's the parser!" As with everything the more you do it the quicker it gets.

      And anyway, my one-offs (in a case such as the OP's) always lead to "I wonder if any pages didn't have any links?", "How many?", "Which ones?", "Most popular link/least popular link?" And then, as night follows day, "What about a nice report? Sorted by file name/frequency and link/frequency?" You just can't predict what the site owner is going to come up with next. :-)
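
      As a rough sketch of how such follow-up questions might be folded into the parser approach, the example below strips the links and tallies each href as it goes. It assumes HTML::TokeParser::Simple is installed; the helper name `strip_and_count` and the sample page are invented for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): remove <a> tags from an HTML
# string and count every href seen along the way.
sub strip_and_count {
    my ($source) = @_;
    my $p    = HTML::TokeParser::Simple->new( \$source );
    my $html = '';
    my %count;
    while ( my $t = $p->get_token ) {
        if ( $t->is_start_tag('a') ) {
            my $href = $t->get_attr('href');
            $count{$href}++ if defined $href;
            next;
        }
        next if $t->is_end_tag('a');
        $html .= $t->as_is;
    }
    return ( $html, \%count );
}

# Made-up sample page for the example.
my $page = <<'HTML';
<p><a href="./a.html">one</a> and <a href="./b.html">two</a></p>
<p><a href="./a.html">one again</a></p>
HTML

my ( $stripped, $count ) = strip_and_count($page);
print $stripped;

# Report: most frequently used link first, ties broken by name.
for my $href ( sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count ) {
    printf "%3d  %s\n", $count->{$href}, $href;
}
```

      A page with no links simply returns an empty tally, so "which pages had none?" becomes a check on `keys %$count` per file.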

      I've settled on HTML::TokeParser::Simple because I think it is as writable as it is readable (Ovid++), far more writable, readable, robust etc. than any regex is going to be.

      Oh, and by the way, what's the command line? :-)

      update:
      As I was writing this the node expanded a tad! I agree with many of the latter points but my main view still stands.
