Re: how do I extract contact data from websites?

http://qs321.pair.com?node_id=181688

in reply to how do I extract contact data from websites?

This node falls below the community's threshold of quality.

•Re: Re: how do I extract contact data from websites? by merlyn (Sage) on Jul 15, 2002 at 04:55 UTC
"For those of you who are anti-spammers, I am not into spamming. I only wish to create a list that I can use to market my package for one of the construction trades."
That is spamming. It's spamming because those people did not ask you to send them email. They published their address so that customers could write them. Not so they could have ads shoved down their throats. You are evil. I will not help you. I know many hundreds of others that will not help you either. May you rot in hell.
-- Randal L. Schwartz, Perl hacker
Re^3: how do I extract contact data from websites? by Aristotle (Chancellor) on Jul 15, 2002 at 15:05 UTC
Speaking up because I --ed this one. I hate spam as much as the next guy, but obviously this monk is misguidedly thinking he's not doing something bad. It would have been more productive to correct him as cjf++ did, rather than simply berating him and adding an unnecessary insult to top it off. Of course he may be putting up a naive facade, but I prefer to err in favour of innocence, and if he is indeed being manipulative, then simply refusing him an answer about the task at hand is sufficient.
Makeshifts last the longest.
Re: Re: how do I extract contact data from websites? by moof1138 (Curate) on Jul 15, 2002 at 05:23 UTC
Technically speaking, SPAM is defined as unsolicited bulk email. So if you are building a list to mass-mail contacts who have not requested the email, I think it is safe to say that would be SPAMming.
Re: Re: How do I extract contact data from websites? by cjf (Parson) on Jul 15, 2002 at 06:01 UTC
To retrieve the web pages and parse them for email addresses, you should take a look at the following modules:
LWP::Simple - To grab the web pages.
HTML::TokeParser - To parse the HTML.
Email::Find - To find email addresses in plain text.
Email::Valid - To validate the email addresses.
Mail::Bulkmail - For sending out emails.
As for mail addresses, they'll be a lot harder to find due to the number of possible formats they could be in. Try looking for keywords in certain orders (e.g. number, street, city). This will be very difficult, and unless you're willing to spend a lot of time on it you probably won't have much luck.
Now, as one of our illustrious members articulately observed, SPAM is bad. It is also, in all probability, not the best way to go about obtaining new customers. Chances are you'll only get blacklisted and obtain a less than desirable reputation. So please consider your methods carefully. And some extra reading:
Why Unsolicited Bulk Email is Bad Business
SpamAssassin
That said, I find the Anti-Spam fanatics far more annoying than spam. Flaming a few possible (there are many false positives) spammers on a Perl forum won't do anything to solve the problem. If you want to be productive, write better filters.
(OT: The Cost of Spam) Re(3): How do I extract contact data from websites? by FoxtrotUniform (Prior) on Jul 15, 2002 at 22:32 UTC
"If you want to be productive, write better filters."
This is functionally equivalent to the spammers' cry of "just hit delete". I don't pay for bandwidth on the user side of my `sendmail` installation; I pay for bandwidth on the user side of my ISP's internal gateway. Filters do nothing to reduce my costs; they just turn spam from theft of resources (using my resources to harass my users against my will) into a DoS attack (wasting my resources and my time). That's like turning burglary into vandalism: it's not necessarily "as bad" (though it might be worse if the RBL blacklists a busy server, or a huge bolus of spam takes down my machine), but it's still criminal.
Now, I'm willing to accept some spam as the cost of teaching people about how the net works -- it's pretty much inevitable, and generally speaking the more knowledge that's out there, the better. But let's not trivialize the cost of spam to "Just Hit Delete" or "Use Filters".
Update: Looks like I misread cjf's stance. My apologies.
`-- The hell with paco, vote for Erudil! :wq`
Re(4): The Cost of Spam by cjf (Parson) on Jul 16, 2002 at 06:46 UTC
You are correct. As the article I linked to in my original post points out, there are many negative consequences to unsolicited bulk email. My post was not refuting that; rather, I was trying to point out that posts such as •Re: Re: how do I extract contact data from websites? do nothing to stop the problem. So, as I said in Re(4): How do I extract contact data from websites?, 'write filters, lobby your government, write secure software to reduce the number of open relays, but don't waste time with posts like that.' Enough said.
Re: Re: Re: How do I extract contact data from websites? by chromatic (Archbishop) on Jul 15, 2002 at 16:57 UTC
I downvoted this node. I find it morally wrong to help someone commit theft while trying to justify it by saying, "This may not be what the potential thief confessed." Is unsolicited bulk commercial e-mail somehow less a theft of resources or somewhat more my responsibility if I don't use filters?
Re(4): How do I extract contact data from websites? by cjf (Parson) on Jul 15, 2002 at 22:04 UTC
And I upvoted your post for completely valid criticism. Keeping that in mind, these aren't the spammers you're looking for, move along. I suppose I could justify the post by saying any one of the following:
I was trying to help him learn Perl better by pointing him to some relevant modules.
He could have been very inarticulate and failed to mention a legitimate goal he was trying to accomplish.
The information in the post may be of use to someone with a legitimate goal. Spidering an intranet to retrieve statistics about the availability of contact information perhaps.
A couple hundred spams is nothing compared to the major offenders.
Even if the post directly causes these couple hundred spams, by helping out people with legitimate goals it more than balances out.
I'm sure I could think up many more, but to be honest, they're not very good reasons. My point was that responses like •Re: Re: how do I extract contact data from websites? do nothing to help the situation and only push this site closer to Slashdot-like discussion levels (which is a bad thing). So write filters, lobby your government, write secure software to reduce the number of open relays, but don't waste time with posts like that.
