regexp to only allow for formally valid email addresses

by fraktalisman (Hermit)
on Mar 07, 2007 at 17:35 UTC

fraktalisman has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks, excuse me if this is too simple or if it has been asked before. I used search, but it's late after a long day's work, and I need a quick fix against misuse of an existing web feedback script by spammers.

Shouldn't the following
$spammer=1 unless (($f{'Email'} =~ m/^[a-zA-Z_\-.0-9]+@[a-zA-Z_\-.0-9]+$/) or ($f{'Email'} eq ''));
only allow $f{'Email'} to contain a valid email address without newlines in it, and otherwise set $spammer=1? But it doesn't seem to; somehow I seem to be missing the point.

Replies are listed 'Best First'.
Re: regexp to only allow for formally valid email addresses
by Zaxo (Archbishop) on Mar 07, 2007 at 17:50 UTC

    use Email::Valid;
    # . . .
    my $spammer;
    $spammer = 1 unless Email::Valid->address($f{'Email'}) or $f{'Email'} eq '';

    There exists a regex that does what you want, but it is large and complex. Email::Valid uses a small parser.

    Update: Separated my declaration from conditional assignment, was a thinko.

    After Compline,
    Zaxo

      There exists a regex that does what you want, but it is large and complex.

      In another post I mentioned an "impressive example" that works with the newest blead, posted by Abigail in clpmisc; for completeness, I'm pasting it hereafter:

Re: regexp to only allow for formally valid email addresses
by Fletch (Bishop) on Mar 07, 2007 at 17:51 UTC

    See Mail::RFC822::Address which has "the" regex for valid addresses. Right off I see that yours has one of the common problems that tend to tick me off, specifically disallowing "foo+identifier@example.com" style addresses (which lets me have one "foo@example.com" address but give out different "+identifier" tags to different people so I can label/tag/filter/toss accordingly).

    Update: Also see RFC::RFC822::Address for a Parse::RecDescent based parser rather than a regex.

Re: regexp to only allow for formally valid email addresses
by vrk (Chaplain) on Mar 07, 2007 at 17:53 UTC

    Use Mail::RFC822::Address. Also, the Regular Expression Library has some pretty interesting constructs.

    As to your regex, I don't see anything wrong in it, and it works with a couple of test cases as it should:

    $ perl -e 'print "valid\n" if ("foo\@bar" =~ m/^[a-zA-Z_\-.0-9]+@[a-zA-Z_\-.0-9]+$/);'
    valid
    $ perl -e 'print "valid\n" if ("j.random.hacker\@perlmonks.com" =~ m/^[a-zA-Z_\-.0-9]+@[a-zA-Z_\-.0-9]+$/);'

    Of course, tests can never show the absence of errors. But I'm willing to bet you have a problem somewhere else in the program.
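
    A few more cases are worth adding to such a test run. The sketch below checks the pattern from the question against a trailing newline and against a plus-tagged address of the kind Fletch mentions. The helper name looks_valid is made up for illustration. One thing it shows is that $ also matches just before a final newline, so a pattern meant to keep newlines out is safer when anchored with \z.

```perl
use strict;
use warnings;

# The pattern from the question, anchored with $ as originally written.
my $orig   = qr/^[a-zA-Z_\-.0-9]+\@[a-zA-Z_\-.0-9]+$/;
# Same character classes, but \z only matches at the very end of the string.
my $strict = qr/^[a-zA-Z_\-.0-9]+\@[a-zA-Z_\-.0-9]+\z/;

# Hypothetical helper: returns 1 if $addr matches the pattern $re, else 0.
sub looks_valid {
    my ($re, $addr) = @_;
    return $addr =~ $re ? 1 : 0;
}

for my $addr ("foo\@bar", "foo\@bar\n", "foo+tag\@example.com") {
    (my $shown = $addr) =~ s/\n/\\n/g;   # make the newline visible when printing
    printf "%-22s orig=%d strict=%d\n", $shown,
        looks_valid($orig, $addr), looks_valid($strict, $addr);
}
```

    Neither pattern accepts the plus-tagged address, because + is not in the character class.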

    UPDATE: Seems like others beat me to it... I just remembered that there was some nice discussion about this over at The Daily WTF.

    --
    print "Just Another Perl Adept\n";

      vrk wrote:
      Of course, tests can never show the absence of errors. But I'm willing to bet you have a problem somewhere else in the program.

      If we had bet, you'd have won ;)
      There was another parameter that goes into the email header. The $f{'Email'} field validation was not the actual problem ...
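
      Since the culprit was a field that ends up in the email header, a check along the following lines may be what is needed. This is only a sketch: the field names Subject and Name and the helper has_header_injection are assumptions for illustration, not taken from the actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a value contains CR or LF, which would
# let a submitter start a new header line once the value is placed into a
# mail header. Undefined values are treated as harmless.
sub has_header_injection {
    my ($value) = @_;
    return 0 unless defined $value;
    return $value =~ /[\r\n]/ ? 1 : 0;
}

# Example form data; the field names are made up for illustration.
my %f = (
    Email   => 'someone@example.com',
    Subject => "Hello\nBcc: victim\@example.net",
);

# Check every field that is going to be written into a header.
my $spammer = 0;
for my $field (qw(Email Subject Name)) {
    $spammer = 1 if has_header_injection($f{$field});
}
print $spammer ? "rejected\n" : "accepted\n";
```

      The point of the design is to apply the same CR/LF test to every header-bound value, not only to the address field.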

Re: regexp to only allow for formally valid email addresses
by ikegami (Patriarch) on Mar 07, 2007 at 23:03 UTC
    In Perl 5.10, you'll be able to do
    my $email_address = qr{
        (?(DEFINE)
            (?<addr_spec>      (?&local_part) \@ (?&domain))
            (?<local_part>     (?&dot_atom) | (?&quoted_string))
            (?<domain>         (?&dot_atom) | (?&domain_literal))
            (?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)? \] (?&CFWS)?)
            (?<dcontent>       (?&dtext) | (?&quoted_pair))
            (?<dtext>          (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

            # The hyphen is escaped: a bare +-/ would be a range that also admits , and .
            (?<atext>          (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
            (?<atom>           (?&CFWS)? (?&atext)+ (?&CFWS)?)
            (?<dot_atom>       (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
            (?<dot_atom_text>  (?&atext)+ (?: \. (?&atext)+)*)

            (?<text>           [\x01-\x09\x0b\x0c\x0e-\x7f])
            (?<quoted_pair>    \\ (?&text))

            (?<qtext>          (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
            (?<qcontent>       (?&qtext) | (?&quoted_pair))
            (?<quoted_string>  (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                               (?&FWS)? (?&DQUOTE) (?&CFWS)?)

            (?<word>           (?&atom) | (?&quoted_string))
            (?<phrase>         (?&word)+)

            # Folding white space
            (?<FWS>            (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
            (?<ctext>          (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
            (?<ccontent>       (?&ctext) | (?&quoted_pair) | (?&comment))
            (?<comment>        \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
            (?<CFWS>           (?: (?&FWS)? (?&comment))*
                               (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

            # No whitespace control
            (?<NO_WS_CTL>      [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

            (?<ALPHA>          [A-Za-z])
            (?<DIGIT>          [0-9])
            (?<CRLF>           \x0d \x0a)
            (?<DQUOTE>         ")
            (?<WSP>            [\x20\x09])
        )

        (?&addr_spec)
    }x;

    Disallowing CR & LF would simply be a matter of changing
    (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
    to
    (?<FWS>             (?&WSP)+)

    Credit: The regexp was written by Abigail, who also wrote RFC::RFC822::Address.
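
    For readers who have not met (?(DEFINE)...) before, here is a much smaller sketch of the same technique. It is a cut-down grammar that covers only the dot-atom form on both sides of the @, with no comments, quoted strings or folding white space, so it is not a substitute for the full pattern above. The variable name $dot_atom_only is made up for illustration.

```perl
use strict;
use warnings;
use 5.010;   # (?(DEFINE)...) and (?&name) need Perl 5.10 or later

# Named subpatterns are declared inside (?(DEFINE)...), which never matches
# anything itself, and are then called with (?&name).
my $dot_atom_only = qr{
    (?(DEFINE)
        (?<atext>         [A-Za-z0-9!\#\$%&'*+\-/=?^_`{|}~] )
        (?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+ )* )
    )
    \A (?&dot_atom_text) \@ (?&dot_atom_text) \z
}x;

for my $addr ('foo+tag@example.com', 'a..b@example.com', '.a@example.com') {
    say $addr, ': ', ($addr =~ $dot_atom_only ? 'match' : 'no match');
}
```

    Anchoring with \A and \z, instead of ^ and $, also keeps a trailing newline from slipping through.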

Re: regexp to only allow for formally valid email addresses
by Moron (Curate) on Mar 07, 2007 at 19:59 UTC
    I agree with the suggestions to use a ready-made regexp. But if for some reason I had to reinvent one, I'd extend the \w token as much as necessary rather than go from scratch, something like...
    $spammer = length($f{'Email'}) && $f{'Email'} !~ /^(\w|\-|\.)+\@(\w|\-|\.)+$/;

    -M

    Free your mind

      Word characters (\w) might include local characters like German Ä ö ü ß on a German webserver. Although these umlaut characters should be, in theory, valid in email addresses (in my interpretation of RFC 822), I know from experience that their occurrence in email addresses usually causes problems sooner or later. At least one German provider (T-Online) used to allow those characters, but I would rather disallow them and have the user enter an email address which is safe for international use.
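
      Whether \w actually matches an umlaut depends on the locale and Unicode settings in effect, so the following is only a sketch under one set of assumptions: Perl 5.14 or later, with the regex modifiers /u and /a used to make the two behaviours explicit. The sample value is made up.

```perl
use strict;
use warnings;
use 5.014;   # the /u and /a regex modifiers need Perl 5.14 or later

my $local = "m\x{fc}ller";   # "mueller" spelled with U+00FC (u-umlaut)

# Under Unicode rules (/u), \w matches the umlaut.
my $unicode_w = $local =~ /\A\w+\z/u          ? 1 : 0;
# With /a, \w is restricted to ASCII, that is [A-Za-z0-9_].
my $ascii_w   = $local =~ /\A\w+\z/a          ? 1 : 0;
# Spelling the class out has the same effect and also works on older Perls.
my $explicit  = $local =~ /\A[A-Za-z0-9_]+\z/ ? 1 : 0;

say "unicode=$unicode_w ascii=$ascii_w explicit=$explicit";
```

      An explicit character class, as in the original question, avoids the dependence on settings altogether.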

Re: regexp to only allow for formally valid email addresses
by hangon (Deacon) on Mar 07, 2007 at 19:25 UTC

    You need to escape the dots in your regex.

    Update: Never mind this post. I stand corrected and learned Yet Another Perl Nuance. Thanks Thelonius & Fletch.

      $ perl -le '$_ = "oh really?"; print unless /[.]/;'
      oh really?

        Don't you think the character classes are a bit redundant, though? I assume fraktalisman only wants to match \w as well as '-' and '.', since it's for e-mail addresses.

        # my guess is that he's not trying to do this
        =~ /^[.]+@[.]+$/
        # either of these make more sense for matching an e-mail address
        =~ /^[a-zA-Z_\-\.0-9]+@[a-zA-Z_\-\.0-9]+$/
        =~ /^[\w\-\.]+@[\w\-\.]+$/

        Or am I missing something painfully obvious?
