in reply to Batch remove URLs
G'day bobafifi,
"I know how to remove individual URLs from the pages using a find/replace one liner, but doing them all in one pass has so far eluded me."
If you'd posted the part that you know, we could suggest how to extend that. Here's an example one-liner to change multiple lines in multiple files:
$ cat ABC
A old A
B old B
C old C
$ cat DEF
D old D
E old E
F old F
$ perl -pi -e 's/old/new/' ABC DEF
$ cat ABC
A new A
B new B
C new C
$ cat DEF
D new D
E new E
F new F
See perlrun for information on the -i and -p switches that I used.
— Ken
Re^2: Batch remove "404 Not Found" URLs
by bobafifi (Beadle) on Oct 27, 2017 at 06:21 UTC
Here's what I've been using:

find . -type f -name "*.htm" -print | xargs perl -i -pe 's/http:\/\/example\.com\/[404 Not Found]/g'

I'm afraid I haven't described what I'm trying to accomplish very well, sorry.

1.) I have a list of 300 URLs.
2.) I have a folder on my desktop with 100 .htm pages.
3.) I want to run that list against those 100 pages and remove the URLs.
4.) This will leave the <a href tags in place, with the text [404 Not Found] instead of the URL (for example, <a href="[404 Not Found]">[404 Not Found]</a>).

My plan then (since some of her links have descriptive text and others just the link text) was/is to render those dummy tags in the HTML inactive by doing another find/replace, leaving just <a>[404 Not Found]</a> to display "404 Not Found" or the link's descriptive text in the browser.

Thanks again, Ken - I'll check out the perlrun link.
by hippo (Bishop) on Oct 27, 2017 at 08:41 UTC
Assuming that you just want to get the job done and are not pursuing this as an academic exercise, I would abandon the one-liner approach. It can be done that way, but the more you throw into it, the messier it gets. Here's one plan: You can now test the inner subroutine in isolation on a test file to your heart's content, getting it exactly right without destroying the initial content. Consider quotemeta for the search terms. If you get stuck with that approach, come back with specific questions, ideally as an SSCCE. Good luck.
by kcott (Archbishop) on Oct 27, 2017 at 08:42 UTC
"Here's the what I've been using ... 's/s/http://example.com/[404 Not Found]/g'" I doubt it. That won't even compile:
Even assuming the initial "s/s/" was a typo, and should have been just "s/"; it still doesn't compile:
Perhaps you meant something closer to this:
You really need to copy and paste verbatim code. Typing by hand, or making guesses, is extremely error-prone; we can only respond to what you posted (not something different that was perhaps intended but not actually written). Unfortunately, when one such problem is found, it raises the question of whether other parts are true representations of the real code, data, output, and so on.

While you probably could still do this with a one-liner, it's getting a bit complicated for that and I'd recommend a script. For a simple text substitution, a regex is probably fine; if it's actually more complex than your post suggests, you should find an alternative tool (see "Parsing HTML/XML with Regular Expressions" for a whole raft of options).

You talk about doing this in two passes; that seems wasteful to me, and one pass is easy anyway. You say you want to end up with "<a>[404 Not Found]</a>"; use whatever you want but, in the code below, I've used "<span class="bad-url">[404 Not Found]</span>": that will render as plain text as it is, but allows you to highlight it with CSS if you so desire.

In the code below I've used Inline::Files purely for demonstration purposes. I'm assuming you're familiar with open. You can presumably get your list of HTML files with "*.htm" on the command line (the find and xargs seem overkill to me, but maybe you have a reason); using glob within your script is another option; there's also readdir; and there are many modules you could use as well. I've also assumed that your "list of 300 URLs" is in a file somewhere; however, it's far from clear whether that's actually the case.

In the code below, the technique I'm demonstrating involves creating a hash from your list of URLs once, then substituting links which match one of those URLs. Do note that your post suggests that the href value is the same as the <a> tag content: my code reflects that; modify if necessary.
Output:
— Ken
by bobafifi (Beadle) on Oct 27, 2017 at 13:37 UTC
by kcott (Archbishop) on Oct 27, 2017 at 17:22 UTC