Batch Google URL Removal Script

by pileofrogs (Priest)
on Jan 22, 2010 at 01:26 UTC

Hi all! I just set up a new SourceForge project for a Perl script I wrote, and I wanted to toot my proverbial horn here on PM.

Need to submit a batch of URLs to Google's URL removal request form? Try tasty new http://urlremove.sourceforge.net.

I'm a network admin at a community college. We recently had an adventure where some instructors thought it would be helpful to post student phone numbers on web pages. We took down the pages containing the phone numbers, but the numbers were still showing up in Google searches, in the summary paragraph of the search results. Google will naturally re-spider those pages and drop the offending material from its results eventually, but we wanted it gone faster. Google has a form you can fill out to submit a URL, and Google will then try to spider that URL sooner. Unfortunately, the form only accepts one URL at a time, which makes it tedious. Since I had a whole batch of URLs to remove, I decided that rather than go insane submitting them all by hand, I'd write a Perl script to do it for me with WWW::Mechanize. And since I didn't find any other tools out there that automate this task, I went and put it on SourceForge.
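The gist of the approach is a simple WWW::Mechanize loop, roughly like this minimal sketch (note: the form URL and the 'urls' field name below are made-up placeholders for illustration, not Google's actual values; see the script on SourceForge for the real thing):

    #!/usr/bin/perl
    # Minimal sketch of the batch-submission loop. The form URL and
    # field name are placeholders -- adapt them to the actual form.
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $form_url = 'http://www.google.com/webmasters/tools/removals';  # placeholder

    my $mech = WWW::Mechanize->new( autocheck => 1 );

    # Read one URL per line from STDIN or a file named on the command line.
    while ( my $url = <> ) {
        chomp $url;
        next unless $url =~ /\S/;

        $mech->get($form_url);
        $mech->submit_form( with_fields => { urls => $url } );  # assumed field name

        print "Submitted: $url\n";
        sleep 2;    # be polite; don't hammer the form
    }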

So, if you or anyone you know needs to submit a stack of URLs to Google's URL removal request form, give this script a try.

And if you just like looking at other people's perl and giving advice, please take a gander and let me know what you think.

Thanks PM! Thanks Perl!

http://urlremove.sourceforge.net

--Pileofrogs

Replies are listed 'Best First'.
Re: Batch Google URL Removal Script
by ambrus (Abbot) on Jan 22, 2010 at 21:36 UTC

    As a quick solution, wouldn't it work if you put up a robots.txt that asked robots to stay away from all of those pages (you'd also take the pages down, of course), then submitted only one page via Google's removal form, hoping that Google would load the robots.txt and figure out that it should remove all the other pages as well?
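    For instance (the path here is made up), a robots.txt like this would ask all well-behaved crawlers to stay away from everything under one directory:

        User-agent: *
        Disallow: /staff/phone-lists/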

      Isn't it better to leave the page up, but with blank content, rather than removing it? That way it gets overwritten in Google's cache, rather than frozen there in perpetuity.

      Many times I've come across pages where the host no longer serves the URL, but its ghost is available from Google's cache for weeks afterward.

        You can do it either way. When you submit the form, you tell Google whether you've edited or removed the page, so I assume it handles the two cases differently.

        The idea is that by submitting this form you don't have to wait around for weeks, because Google spiders the site sooner.
