PerlMonks  

Hiding mail addresses in mailto: with JavaScript

by projekt21 (Friar)
on Oct 24, 2003 at 17:19 UTC ( [id://301897]=CUFP )

There are many ways to hide an email address within an HTML page. Some use HTML entities or URI encoding; others use JavaScript. I guess that spammers' harvesting robots can easily decode HTML entities or URI encoding (with Perl), but for JS they would need a JS interpreter, so using JS seems the better way. The snippet uses a regex to replace the whole anchor (<a ...>...</a>) with a JS document.write of the anchor split into short chunks. Maybe this is useful for somebody. Maybe you can direct me to a better regex. I also have a version that uses HTML::TokeParser (slower) instead; contact me if you are interested.
$html =~ s|(<[aA]\b                   # <a
            [^<>]*\b                  # any attributes
            [hH][rR][eE][fF]=\"?      # href=
            [mM][aA][iI][lL][tT][oO]: # mailto:
            ([^\"\s<>]+)              # EMAIL
            \"?                       # href closed
            [^<>]*                    # any attributes
            >                         # anchor closed
            (.+?)                     # TEXT
            </[aA]>)                  # </a>
          |antispamize($1, $2, $3)|sgex;

sub antispamize {
    my ($anchor, $email, $text) = @_;

    #$email =~ s/@/{at}/g;
    #$text  =~ s/@/{at}/g;

    # split the anchor into chunks of up to 4 characters, then escape
    # backslashes and single quotes so each chunk is a valid JS string
    my @chunks = $anchor =~ /(.{1,4})/g;
    s/([\\'])/\\$1/g for @chunks;

    # separate variable: the original redeclared "my $anchor" here
    my $script = "<script language=\"JavaScript\">document.write('"
               . join("'+'", @chunks)
               . "');</script>";

    ## maybe you want to add this
    #$script .= "<noscript>$text ($email)</noscript>";

    return $script;
}
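To see what the chunking step produces on its own, here is a minimal sketch. The helper name js_chunked_write and the sample anchor are made up for illustration and are not part of the posted code; the chunk size of 4 is the one used in the post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): split a string into chunks of
# up to $size characters and build a document.write call that joins them.
sub js_chunked_write {
    my ($html, $size) = @_;
    my @chunks = $html =~ /(.{1,$size})/g;
    # escape backslashes and single quotes inside each chunk
    s/([\\'])/\\$1/g for @chunks;
    return "document.write('" . join("'+'", @chunks) . "');";
}

my $anchor = '<a href="mailto:a@b.c">x</a>';
my $js     = js_chunked_write($anchor, 4);

print "$js\n";

# the contiguous mailto URI no longer appears in the generated source
print "mailto URI is split up\n" if index($js, 'mailto:a@b.c') < 0;
```

Note that short fragments of the address can still land inside a single chunk, so this only breaks up the address as a contiguous string; it does not hide every part of it.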

Replies are listed 'Best First'.
Re: Hiding mail addresses in mailto: with JavaScript
by merlyn (Sage) on Oct 25, 2003 at 12:47 UTC
    Researchers demonstrated just a few months ago that all you have to do is encode some part of your string as an entity, and that suffices to foil all known spam scrapers. Plus it doesn't break on non-JavaScript browsers, as your example does.

    Please don't use this JavaScript solution. You're solving a nonexistent problem.

    And it'll be a long time before spammers go to the trouble of decoding entities on scraped pages. After all, there are already millions of addresses in "XXX@yyy.ZZZ" form on the web that don't require any CPU to decode, and they're after numbers, not quality or cleverness.

    It also suffices to have at least one unusual character in your email address: my email address of <fred&barney@stonehenge.com> has never been spammed, despite appearing in numerous usenet posts and web pages. Yes, <barney@stonehenge.com> has gotten numerous hits from almost the first day the other had appeared, but never the whole thing.

    In summary, write your mailto links like this:

    <a href="mailto:merlyn&#64;stonehenge.com"> Send mail to <tt>merlyn&#64;stonehenge.com</tt>!</a>
    and it not only looks right, it acts right, and yet the spammers don't see it. Don't use Javascript.
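A minimal sketch of automating the entity approach, assuming you want to rewrite existing mailto anchors in a page. The helper name entity_at and the simplified anchor pattern are illustrative assumptions, not something from this thread, and the pattern is no more robust against odd HTML than any other regex.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every "@" with the numeric entity &#64;
sub entity_at {
    my ($s) = @_;
    $s =~ s/\@/&#64;/g;
    return $s;
}

my $html = '<a href="mailto:merlyn@stonehenge.com">'
         . 'Send mail to <tt>merlyn@stonehenge.com</tt>!</a>';

# Rewrite only anchors whose href starts with mailto:, in both the
# href attribute and the link text; everything else is left alone.
$html =~ s{(<a\b[^<>]*\bhref="?mailto:[^<>]*>.*?</a>)}{entity_at($1)}gise;

print "$html\n";
```

A browser decodes the entity, so the link still displays and works as a normal mailto link without JavaScript.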

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      I wouldn't be so convinced that spammers are inevitably incompetent. Just look at how some of them have been studying tools like SpamAssassin and figuring out how to get around the filter. Given that various popular email list to web gateways read the same research that you did and are using HTML encoding to hide addresses, it is only a question of time before that becomes tempting enough for spammers to add a couple of new regular expressions to their web scrapers and catch either &#64; or @ in email addresses.

      Your fred&barney trick is likely safe for a long, long time. There aren't enough people with & in their email addresses to be worth behaviour modification from spammers. The same won't remain true of HTML encoding @.

        I didn't say anything about spammers' incompetence. I'm talking about the ratio of low-hanging fruit to hidden fruit. As long as there are 10,000 times as many "foo@bar.com" in web pages as there are encoded addresses, spammers have no motivation to change.

        The fact that smart spammers are working around SpamAssassin is actually a testimony to the market penetration of such tools, especially by large mail targets like AOL and Hotmail and Earthlink. So, we're probably seeing them worry about 10% of their addresses being undeliverable, not 1/10000 of their addresses not even appearing in the first place. (I could even make the argument that an address that is hard to scrape is also likely to be trapped in other ways as well, so there's really no point in sending to it.)

        Thus, I will continue to recommend at the moment only some html-entity protection, until someone shows me otherwise, in a case of an actual spamscrape.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: Hiding mail addresses in mailto: with JavaScript
by diotalevi (Canon) on Oct 24, 2003 at 17:37 UTC

    Well, no. It isn't better because now I can't use that from my JavaScript-less browser anymore. Also, you might want to deuglify your regex (and improve its power) thusly. You're getting into very, very deep water by trying to parse the other parts of the html document but if you restrict yourself to just the mailto: URIs then you're probably ok.

    s( (['"]) \s* (mailto: (?s:.)+) \1 )( your_function( $2 ) )ge
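As written, the one-liner has no /x flag, so its spaces would be matched literally, and the greedy (?s:.)+ would run on to the last matching quote in the document. Below is a runnable variation under those assumptions: /x added, a non-greedy match, and the quotes put back around the result. The body of your_function is a placeholder invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for your_function(): masks the "@"
sub your_function {
    my ($uri) = @_;
    $uri =~ s/\@/{at}/g;
    return $uri;
}

my $html = q{<a href="mailto:alex@example.com">mail</a> }
         . q{<a href='mailto:bob@example.org'>bob</a>};

$html =~ s{
    (['"])          # opening quote, remembered in $1
    \s*
    (mailto: .+?)   # the URI, non-greedy: stops at the first closing quote
    \1              # the same quote character again
}{
    my ($quote, $uri) = ($1, $2);   # copy before calling any other regex
    $quote . your_function($uri) . $quote
}gsex;

print "$html\n";
```

This only touches quoted mailto: URIs; an unquoted href=mailto:... would slip through.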

      You might have noticed that I've added a commented noscript-part for JavaScript-less browsers. The noscript part would only display the mail address in a form like user{at}domain.com. Other forms can be done easily. In the war against spam we sometimes have to make compromises.

      I dislike the use of JavaScript for important things (navigation et al.). IMHO, a website needs to be functional without JS. On the other side all modern browsers are capable of JavaScript but the harvesters are not.

      For the deep water: yes, you might be right, but as I have control over the HTML code, I might catch every mailto. Or I might miss one and expose an address to a spammer. I can live with this. Your regex won't help, as it catches less than mine. Furthermore, I would not be able to use the JavaScript replacement with it.

      I have an alternative implementation using HTML::TokeParser. This should avoid deep water but performs worse than that regex. Another compromise.

      Anyway, many thanks for your remarks. This code is just a suggestion, I don't force anybody to use it. Besides the JS thing, do you think it's badly implemented?

      alex pleiner <alex@zeitform.de>
      zeitform Internet Dienste

Re: Hiding mail addresses in mailto: with JavaScript
by Chady (Priest) on Oct 24, 2003 at 18:47 UTC

    Also, you might want to consider the /i in your regex to avoid typing the case combinations: /[hH][rR][eE][fF]/ is better done /href/i -- at least it's better for your fingers ;)
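For illustration, this is roughly what the original pattern looks like with /i in place of the hand-written case classes. It is a sketch only: the variable name $mailto_anchor and the sample inputs are invented here.

```perl
use strict;
use warnings;

# Same structure as the original pattern, with /i replacing the
# [hH][rR][eE][fF]-style character classes.
my $mailto_anchor = qr{
    (<a\b               # <a
     [^<>]*\b           # any attributes
     href="?            # href=
     mailto:            # mailto:
     ([^"\s<>]+)        # EMAIL
     "?                 # href closed
     [^<>]*             # any attributes
     >                  # anchor closed
     (.+?)              # TEXT
     </a>)              # closing anchor tag
}six;

my @matched;
for my $html ('<A HREF="MAILTO:alex@example.com">Alex</A>',
              '<a href="mailto:alex@example.com">Alex</a>') {
    if (my ($anchor, $email, $text) = $html =~ $mailto_anchor) {
        push @matched, $email;
        print "email=$email text=$text\n";
    }
}
```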


    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady | http://chady.net/

      Thanks for that hint and for caring about my fingers ;-) but I thought that /[hH][rR][eE][fF]/ was cheaper than /href/i. Am I wrong on that?

      alex pleiner <alex@zeitform.de>
      zeitform Internet Dienste

        Your original regex may be cheaper than the one that uses the i option, but the difference is so minuscule as to not matter. At that point, clarity is more important than speed.
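If you want to measure the difference rather than guess, a sketch using the core Benchmark module might look like this. The test string and iteration count are arbitrary choices made for this example, and the timings will vary by perl version and machine.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Arbitrary sample input: 100 mailto anchors with an upper-case HREF
my $html = '<a HREF="mailto:alex@example.com">Alex</a>' x 100;

my $classes = qr/[hH][rR][eE][fF]=/;
my $flag    = qr/href=/i;

# Sanity check: both patterns should find the same number of matches.
my $n_classes = () = $html =~ /$classes/g;
my $n_flag    = () = $html =~ /$flag/g;
print "classes: $n_classes matches, /i: $n_flag matches\n";

# Time 5000 full scans of the string with each pattern.
timethese(5000, {
    classes => sub { my $n = () = $html =~ /$classes/g },
    flag_i  => sub { my $n = () = $html =~ /$flag/g },
});
```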

        --t. alex
        Life is short: get busy!
