http://qs321.pair.com?node_id=181688


in reply to how do I extract contact data from websites?


•Re: Re: how do I extract contact data from websites?
by merlyn (Sage) on Jul 15, 2002 at 04:55 UTC
    "For those of you who are anti-spammers, I am not into spamming. I only wish to create a list that I can use to market my package for one of the construction trades."
    That is spamming.

    It's spamming because those people did not ask you to send them email. They published their addresses so that customers could write to them. Not so they could have ads shoved down their throats.

    You are evil.

    I will not help you.

    I know many hundreds of others that will not help you either.

    May you rot in hell.

    -- Randal L. Schwartz, Perl hacker

      Speaking up because I --ed this one. I hate spam as much as the next guy, but this monk evidently believes, misguidedly, that he's not doing anything bad. It would have been more productive to correct him as cjf++ did, rather than simply berating him and adding an unnecessary insult to top it off. Of course he may be putting up a naive facade, but I prefer to err in favour of innocence. If he is indeed being manipulative, then simply refusing him an answer about the task at hand is sufficient.

      Makeshifts last the longest.

Re: Re: how do I extract contact data from websites?
by moof1138 (Curate) on Jul 15, 2002 at 05:23 UTC
    Technically speaking, spam is defined as unsolicited bulk email. So if you are building a list to mass-mail contacts who have not requested the email, I think it is safe to say that would be spamming.
Re: Re: How do I extract contact data from websites?
by cjf (Parson) on Jul 15, 2002 at 06:01 UTC

    To retrieve the web pages and parse them for email addresses, you should take a look at the following modules:

    1. LWP::Simple - To grab the web pages.
    2. HTML::TokeParser - To parse the HTML.
    3. Email::Find - To find email addresses in plain text.
    4. Email::Valid - To validate the email addresses.
    5. Mail::Bulkmail - For sending out emails.

    As for postal (mailing) addresses, they'll be a lot harder to find because of the number of possible formats they could be in. Try looking for keywords in a certain order (e.g. number, street, city). This will be very difficult, and unless you're willing to spend a lot of time on it, you probably won't have much luck.

    Now, as one of our illustrious members articulately observed, spam is bad. It is also, in all probability, not the best way to go about obtaining new customers. Chances are you'll only get blacklisted and earn a less than desirable reputation. So please consider your methods carefully.

    And some extra reading:

    That said, I find the anti-spam fanatics far more annoying than spam. Flaming a few possible spammers (there are many false positives) on a Perl forum won't do anything to solve the problem. If you want to be productive, write better filters.

        If you want to be productive, write better filters.

      This is functionally equivalent to the spammers' cry of "just hit delete". I don't pay for bandwidth on the user side of my sendmail installation; I pay for bandwidth on the user side of my ISP's internal gateway. Filters do nothing to reduce my costs. They just turn spam from theft of resources (using my resources to harass my users against my will) into a DoS attack (wasting my resources and my time). That's like turning burglary into vandalism: it's not necessarily "as bad" (though it might be worse if the RBL blacklists a busy server, or a huge bolus of spam takes down my machine), but it's still criminal.

      Now, I'm willing to accept some spam as the cost of teaching people about how the net works -- it's pretty much inevitable, and generally speaking the more knowledge that's out there, the better. But let's not trivialize the cost of spam to "Just Hit Delete" or "Use Filters".

      Update: Looks like I misread cjf's stance. My apologies.

      --
      The hell with paco, vote for Erudil!
      :wq

      I downvoted this node. I find it morally wrong to help someone commit theft while trying to justify it by saying, "This may not be what the potential thief confessed." Is unsolicited bulk commercial e-mail somehow less a theft of resources, or somehow more my responsibility, if I don't use filters?

        And I upvoted your post for completely valid criticism.

        Keeping that in mind, these aren't the spammers you're looking for, move along. I suppose I could justify the post by saying any one of the following:

        • I was trying to help him learn Perl better by pointing him to some relevant modules.
        • He could have been very inarticulate and failed to mention a legitimate goal he was trying to accomplish.
        • The information in the post may be of use to someone with a legitimate goal, such as spidering an intranet to retrieve statistics about the availability of contact information.
        • A couple hundred spams is nothing compared to the major offenders. Even if the post directly causes these couple hundred spams, by helping out people with legitimate goals it more than balances out.

        I'm sure I could think up many more, but to be honest, they're not very good reasons. My point was that responses like "•Re: Re: how do I extract contact data from websites?" do nothing to help the situation and only push this site closer to Slashdot-like discussion levels (which is a bad thing). So write filters, lobby your government, write secure software to reduce the number of open relays, but don't waste time with posts like that.