in reply to Re: how do I extract contact data from websites?

To retrieve the web pages and parse them for email addresses, take a look at the following modules:

  1. LWP::Simple - To grab the web pages.
  2. HTML::TokeParser - To parse the HTML.
  3. Email::Find - To find email addresses in plain text.
  4. Email::Valid - To validate the email addresses.
  5. Mail::Bulkmail - For sending out emails.
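As a rough illustration of how the parsing, finding, and validating modules might fit together, here is a minimal sketch. It works on a hypothetical in-memory HTML string (the sample page and the address in it are made up); on a site you are entitled to crawl, such as your own intranet, the string could instead come from LWP::Simple's get($url). Treat it as a starting point under those assumptions, not a finished tool.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use Email::Find;
use Email::Valid;

# Hypothetical sample page, used here so the sketch needs no network access.
my $html = <<'HTML';
<html><body>
<p>Questions? Write to <b>webmaster@example.com</b> or call us.</p>
</body></html>
HTML

# Step 1: strip the markup, keeping only the text tokens.
my $parser = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
my $text = '';
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'T';    # 'T' marks a text token
    $text .= $token->[1] . ' ';
}

# Step 2: collect anything in the text that looks like an address.
my %seen;
my $finder = Email::Find->new(sub {
    my ($address, $original) = @_;     # Mail::Address object, matched text
    $seen{ lc $address->address } = 1; # de-duplicate, ignoring case
    return $original;                  # leave the scanned text unchanged
});
$finder->find(\$text);

# Step 3: keep only the candidates that Email::Valid accepts.
my @valid = grep { Email::Valid->address($_) } sort keys %seen;
print "$_\n" for @valid;
```

Email::Find already filters its matches for syntactic validity, so the Email::Valid pass above mostly matters if you want to add stricter checks of your own later.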

As for postal addresses, they'll be a lot harder to find because of the number of possible formats they could be in. Try looking for keywords in certain orders (e.g. number, street, city). This will be very difficult, and unless you're willing to spend a lot of time on it, you probably won't have much luck.

Now, as one of our illustrious members articulately observed, spam is bad. It is also, in all probability, not the best way to go about obtaining new customers. Chances are you'll only get blacklisted and earn a less than desirable reputation. So please consider your methods carefully.

And some extra reading:

That said, I find the anti-spam fanatics far more annoying than spam. Flaming a few possible spammers (there are many false positives) on a Perl forum won't do anything to solve the problem. If you want to be productive, write better filters.

(OT: The Cost of Spam) Re(3): How do I extract contact data from websites?
by FoxtrotUniform (Prior) on Jul 15, 2002 at 22:32 UTC
      If you want to be productive, write better filters.

    This is functionally equivalent to the spammers' cry of "just hit delete". I don't pay for bandwidth on the user side of my sendmail installation; I pay for bandwidth on the user side of my ISP's internal gateway. Filters do nothing to reduce my costs; they just turn spam from theft of resources (using my resources to harass my users against my will) into a DoS attack (wasting my resources and my time). That's like turning burglary into vandalism: it's not necessarily "as bad" (though it might be worse if the RBL blacklists a busy server, or a huge bolus of spam takes down my machine), but it's still criminal.

    Now, I'm willing to accept some spam as the cost of teaching people about how the net works -- it's pretty much inevitable, and generally speaking the more knowledge that's out there, the better. But let's not trivialize the cost of spam to "Just Hit Delete" or "Use Filters".

    Update: Looks like I misread cjf's stance. My apologies.

    The hell with paco, vote for Erudil!

Re: Re: Re: How do I extract contact data from websites?
by chromatic (Archbishop) on Jul 15, 2002 at 16:57 UTC

    I downvoted this node. I find it morally wrong to help someone commit theft while trying to justify it by saying, "This may not be what the potential thief confessed." Is unsolicited bulk commercial e-mail somehow less a theft of resources or somewhat more my responsibility if I don't use filters?

      And I upvoted your post for completely valid criticism.

      Keeping that in mind, these aren't the spammers you're looking for, move along. I suppose I could justify the post by saying any one of the following:

      • I was trying to help him learn Perl better by pointing him to some relevant modules.
      • He could have been very inarticulate and failed to mention a legitimate goal he was trying to accomplish.
      • The information in the post may be of use to someone with a legitimate goal: spidering an intranet to retrieve statistics about the availability of contact information, perhaps.
      • A couple hundred spams is nothing compared to the major offenders. Even if the post directly causes these couple hundred spams, by helping out people with legitimate goals it more than balances out.

      I'm sure I could think up many more, but to be honest, they're not very good reasons. My point was that responses like •Re: Re: how do I extract contact data from websites? do nothing to help the situation and only push this site closer to Slashdot-like discussion levels (which is a bad thing). So write filters, lobby your government, write secure software to reduce the number of open relays, but don't waste time with posts like that.