in reply to how do I extract contact data from websites?
•Re: Re: how do I extract contact data from websites?
by merlyn (Sage) on Jul 15, 2002 at 04:55 UTC
For those of you who are anti-spammers, I am not into spamming. I only wish to create a list that I can use to market my package for one of the construction trades.
That is spamming.
It's spamming because those people did not ask you to send them email. They published their address so that customers could write them. Not so they could have ads shoved down their throat.
You are evil.
I will not help you.
I know many hundreds of others that will not help you either.
May you rot in hell.
-- Randal L. Schwartz, Perl hacker
Re: Re: how do I extract contact data from websites?
by moof1138 (Curate) on Jul 15, 2002 at 05:23 UTC
Technically speaking, SPAM is defined as unsolicited bulk email. So if you are building a list to mass-mail contacts who have not requested the email, I think it is safe to say that would be SPAMming.
Re: Re: How do I extract contact data from websites?
by cjf (Parson) on Jul 15, 2002 at 06:01 UTC
To retrieve the web pages and parse them for email addresses, you should take a look at
the following modules:
- LWP::Simple - To grab the web pages.
- HTML::TokeParser - To parse the HTML.
- Email::Find - To find email addresses in plain text.
- Email::Valid - To validate the email addresses.
- Mail::Bulkmail - For sending out emails.
As for postal mail addresses, they'll be a lot harder to find due to the number of possible
formats they could be in. Try looking for keywords in certain orders (e.g. number, street,
city). This will be very difficult, and unless you're willing to spend a lot of time
on it, you probably won't have much luck.
Now, as one of our illustrious members articulately observed,
SPAM is bad. It is also, in all probability,
not the best way to go about obtaining new customers. Chances are you'll only get
blacklisted and obtain a less-than-desirable reputation. So please consider your methods
And some extra reading:
That said, I find the Anti-Spam fanatics far more annoying than
spam. Flaming a few possible (there are many false positives) spammers on a Perl forum
won't do anything to solve the problem. If you want to be productive, write better filters.
If you want to be productive, write better filters.
This is functionally equivalent to the spammers' cry of
"just hit delete". I don't pay for bandwidth on the user
side of my sendmail installation; I pay for
bandwidth on the user side of my ISP's internal gateway.
Filters do nothing to reduce my costs; they just turn spam
from theft of resources (using my resources to harass my users
against my will) into a DoS attack (wasting my resources and
my time). That's like turning burglary into
vandalism: it's not necessarily "as bad" (though it might
be worse if the RBL blacklists a busy server, or a huge
bolus of spam takes down my machine), but it's still
Now, I'm willing to accept some spam as the cost
of teaching people about how the net works -- it's pretty
much inevitable, and generally speaking the more knowledge
that's out there, the better. But let's not trivialize the
cost of spam to "Just Hit Delete" or "Use Filters".
Update: Looks like I misread cjf's stance. My
The hell with paco, vote for Erudil!
And I upvoted your post for completely valid criticism.
Keeping that in mind, these aren't the spammers you're looking for, move along. I suppose I could justify the post by saying any one of the following:
- I was trying to help him learn Perl better by pointing him to some relevant modules.
- He could have been very inarticulate and failed to mention a legitimate goal he was trying to accomplish.
- The information in the post may be of use to someone with a legitimate goal. Spidering an intranet to retrieve statistics about the availability of contact information perhaps.
- A couple hundred spams is nothing compared to the major offenders. Even if the post directly causes these couple hundred spams, by helping out people with legitimate goals it more than balances out.
I'm sure I could think up many more, but to be honest, they're not very good reasons. My point was that responses like •Re: Re: how do I extract contact data from websites? do nothing to help the situation and only push this site closer to Slashdot-like discussion levels (which is a bad thing). So write filters, lobby your government, write secure software to reduce the number of open relays, but don't waste time with posts like that.