Re^3: Batch remove "404 Not Found" URLs

"Here's the what I've been using ... 's/s/http://example.com/[404 Not Found]/g'"

I doubt it. That won't even compile:

$ perl -MO=Deparse -e 's/s/http://example.com/[404 Not Found]/g'
Bareword found where operator expected at -e line 1, near "404 Not"
    (Missing operator before Not?)
syntax error at -e line 1, near "404 Not Found"
-e had compilation errors.

Even assuming the initial "s/s/" was a typo and should have been just "s/", it still doesn't compile:

$ perl -MO=Deparse -e 's/http://example.com/[404 Not Found]/g'
Bareword found where operator expected at -e line 1, near "404 Not"
    (Missing operator before Not?)
Regexp modifiers "/a" and "/l" are mutually exclusive at -e line 1, at end of line
syntax error at -e line 1, near "404 Not Found"
-e had compilation errors.

Perhaps you meant something closer to this:

$ perl -MO=Deparse -e 's{http://example.com}{[404 Not Found]}g'
s[http://example.com][[404 Not Found]]g;
-e syntax OK
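
Both failures have the same cause: "/" is the delimiter, so the slashes in "http://" end the pattern and replacement early. In the second attempt, "s/http://" is already a complete substitution (pattern "http:", empty replacement), so Perl reads "example" as a run of modifiers, hence the "/a" and "/l" message. A minimal sketch of two ways around that, using the example URL from the post; the variable names are made up for illustration:

```perl
use strict;
use warnings;

my $text = 'See http://example.com for details.';

# Option 1: choose delimiters that don't occur in the pattern.
(my $with_braces = $text) =~ s{http://example\.com}{[404 Not Found]}g;

# Option 2: keep "/" as the delimiter but interpolate the URL from a
# variable; \Q...\E escapes regex metacharacters such as "." in the URL.
my $url = 'http://example.com';
(my $with_quotemeta = $text) =~ s/\Q$url\E/[404 Not Found]/g;

print "$with_braces\n";
print "$with_quotemeta\n";
```

Either form works; the braces are easier to read when the URL is written out literally, while \Q...\E is the safer choice when the URL comes from data.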

You really need to copy and paste your code verbatim. Typing by hand, or making guesses, is extremely error-prone; we can only respond to what you posted, not to something different that may have been intended but was not actually written. Unfortunately, when one such problem is found, it raises the question of whether other parts are also not true representations of the real code, data, output, and so on.

While you probably could still do this with a one-liner, it's getting a bit complicated for that and I'd recommend a script. For a simple text substitution, a regex is probably fine; if it's actually more complex than your post suggests, you should find an alternative tool (see "Parsing HTML/XML with Regular Expressions" for a whole raft of options).

You talk about doing this in two passes; that seems wasteful to me and one pass is easy anyway. You say you want to end up with "<a>[404 Not Found]</a>"; use whatever you want but, in the code below, I've used "<span class="bad-url">[404 Not Found]</span>": that will render as plain text as it is, but allows you to highlight it with CSS if you so desire.

In the code below I've used Inline::Files purely for demonstration purposes. I'm assuming you're familiar with open. You can presumably get your list of HTML files with "*.htm" on the command line (the find and xargs seem overkill to me, but maybe you have a reason); using glob within your script is another option; there's also readdir; and there are many modules you could use instead. I've also assumed that your "list of 300 URLs" is in a file somewhere; however, it's far from clear whether that's actually the case.

In the code below, the technique I'm demonstrating involves creating a hash from your list of URLs once, then substituting links which match one of those URLs. Do note that your post suggests that the href value is the same as the <a> tag content: my code reflects that; modify if necessary.

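#!/usr/bin/env perl

use strict;
use warnings;

# Same effect as the -l switch (print appends a newline). Set here because
# "env perl -l" on the shebang line is not split into two arguments on Linux.
$\ = "\n";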

use Inline::Files;

my %bad_url;

while (<URLLIST>) {
    chomp;
    ++$bad_url{$_};
}

my $re = qr{(?x:
    (               # capture entire element to \$1
        <a          # match start of 'a' start tag
        \s+         # match whitespace after element name
        href="      # match start of href attribute
        (           # capture href value to \$2
            [^"]+   # match anything that isn't a "
        )           # end \$2 capture
        "           # match closing "
        \s*         # match optional whitespace
        >           # match end of 'a' start tag
        \s*         # match optional whitespace
        \g2         # match href value (captured in \$2)
        \s*         # match optional whitespace
        </a>        # match 'a' end tag
    )               # end \$1 capture
)};

my $replace = '<span class="bad-url">[404 Not Found]</span>';

for my $fh (\*HTM1, \*HTM2) {
    my $html = do { local $/; <$fh> };
    print '*** ORIGINAL ***';
    print $html;
    $html =~ s/$re/exists $bad_url{$2} ? $replace : $1/eg;
    print '*** MODIFIED ***';
    print $html;
}

__URLLIST__
http://bad1.com/
http://bad2.com/
http://bad3.com/
http://bad4.com/
__HTM1__
<h1>HTM1</h1>
<a href="http://bad1.com/">http://bad1.com/</a>
<a href="http://good.com/">http://good.com/</a>
<a href="http://bad2.com/">http://bad2.com/</a>
__HTM2__
<h1>HTM2</h1>
<a href="http://good.com/">http://good.com/</a>
<a href="http://bad2.com/">
    http://bad2.com/
</a>
<a href="http://good.com/">
    http://good.com/
</a>
<a 
    href="http://bad3.com/"
>http://bad3.com/</a>
<a href="http://bad4.com/">http://bad3.com/</a>
<a href="http://bad4.com/">http://bad4.com/</a>

Output:

*** ORIGINAL ***
<h1>HTM1</h1>
<a href="http://bad1.com/">http://bad1.com/</a>
<a href="http://good.com/">http://good.com/</a>
<a href="http://bad2.com/">http://bad2.com/</a>

*** MODIFIED ***
<h1>HTM1</h1>
<span class="bad-url">[404 Not Found]</span>
<a href="http://good.com/">http://good.com/</a>
<span class="bad-url">[404 Not Found]</span>

*** ORIGINAL ***
<h1>HTM2</h1>
<a href="http://good.com/">http://good.com/</a>
<a href="http://bad2.com/">
    http://bad2.com/
</a>
<a href="http://good.com/">
    http://good.com/
</a>
<a 
    href="http://bad3.com/"
>http://bad3.com/</a>
<a href="http://bad4.com/">http://bad3.com/</a>
<a href="http://bad4.com/">http://bad4.com/</a>

*** MODIFIED ***
<h1>HTM2</h1>
<a href="http://good.com/">http://good.com/</a>
<span class="bad-url">[404 Not Found]</span>
<a href="http://good.com/">
    http://good.com/
</a>
<span class="bad-url">[404 Not Found]</span>
<a href="http://bad4.com/">http://bad3.com/</a>
<span class="bad-url">[404 Not Found]</span>
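
To connect the demonstration to real files, here is a rough sketch that swaps Inline::Files for open and glob, as mentioned earlier. Everything about the file layout is an assumption: the scratch directory, the names "bad_urls.txt", "a.htm" and "b.htm", and the write_file helper exist only so the example runs on its own. It reads and prints the modified HTML but does not write anything back to the original files.

```perl
use strict;
use warnings;
use autodie;                    # open/close die with a message on failure
use File::Temp qw(tempdir);

# Hypothetical setup so the sketch runs on its own: a scratch directory
# with a URL list and two HTML files. Real paths and names will differ.
my $dir = tempdir(CLEANUP => 1);
write_file("$dir/bad_urls.txt", "http://bad1.com/\nhttp://bad2.com/\n");
write_file("$dir/a.htm", qq{<a href="http://bad1.com/">http://bad1.com/</a>\n});
write_file("$dir/b.htm", qq{<a href="http://good.com/">http://good.com/</a>\n});

# Build the lookup hash once from the URL list (one URL per line).
my %bad_url;
open my $url_fh, '<', "$dir/bad_urls.txt";
while (my $url = <$url_fh>) {
    chomp $url;
    ++$bad_url{$url};
}
close $url_fh;

# Same pattern as in the demonstration, written compactly.
my $re      = qr{(<a\s+href="([^"]+)"\s*>\s*\g2\s*</a>)};
my $replace = '<span class="bad-url">[404 Not Found]</span>';

# glob stands in for "*.htm" on the command line.
my @html_files = glob("$dir/*.htm");

my %modified;
for my $file (sort @html_files) {
    # Slurp the whole file, as the demonstration does for each handle.
    open my $fh, '<', $file;
    my $html = do { local $/; <$fh> };
    close $fh;

    $html =~ s/$re/exists $bad_url{$2} ? $replace : $1/eg;
    $modified{$file} = $html;
    print "--- $file ---\n$html";
}

# Helper used only to create the sample files above.
sub write_file {
    my ($path, $content) = @_;
    open my $out, '>', $path;
    print {$out} $content;
    close $out;
}
```

Keeping the results in a hash rather than overwriting the files makes it easy to inspect the changes first; whether and how to write them back is left open here.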

— Ken

Re^4: Batch remove "404 Not Found" URLs by bobafifi (Beadle) on Oct 27, 2017 at 13:37 UTC

Thank you Ken! My apologies for the initial typos in the one-liner, it's been awhile since I've used this Perlmonks interface. Good suggestion on the span tags and CSS, I hadn't thought about that as I was really more focused on simply getting the text 404 Not Found to not be hyperlinked. I'll check out your script. Thanks again!

Re^5: Batch remove "404 Not Found" URLs by kcott (Archbishop) on Oct 27, 2017 at 17:22 UTC

"Thank you Ken!"

You're very welcome.

"My apologies for the initial typos in the one-liner, it's been awhile since I've used this Perlmonks interface."

Fixing typos is, in itself, perfectly fine; however, please indicate what changes you've made. See "How do I change/delete my post?" for why and how.

You can use the "preview" button as many times as you want before creating the post. This is actually the easier option: you still make the same changes but, as no one can see the post yet, it won't become messy with deletions using `<strike>` tags (e.g. ~~original text, that turned out to be wrong, and now completely removed~~) and updates with `<del>` and `<ins>` tag pairs (e.g. ~~original text~~corrected text). You also won't need to write additional "Update: ..." paragraphs explaining the changes.

— Ken

