 Welcome to the Monastery PerlMonks

by DaWolf (Curate) on Sep 25, 2004 at 16:46 UTC

DaWolf has asked for the wisdom of the Perl Monks concerning the following question:

I know that this has probably been done zillions of times, but I have the nasty habit of reinventing some wheels (because I think it's the best way to learn), so I'd like to hear your opinions about this:

I have a RegEx that evaluates a form field to see if it's a "valid" e-mail address. So, I came to this:
```
([a-z]|[0-9])+((\.)+([a-z]|[0-9])+)*(\-|\_)*([a-z]|[0-9])*\@{1}([a-z]|[0-9])+\.{1}([a-z]|[0-9]){2,}(\.([a-z|0-9]){2})*
```
Allow me to humbly explain each chunk:
1. ([a-z]|[0-9])+ : One or more letters and/or numbers
2. ((\.)+([a-z]|[0-9])+)* : Possibly a dot. Considering that there is a dot one or more letters and/or numbers must follow
3. (\-|\_)* : Possibly a dash or underscore.
4. ([a-z]|[0-9])* : Possibly more letters and/or numbers
5. \@{1} : Must have one (and only one) '@'
6. ([a-z]|[0-9])+ : Must have more letters and /or numbers followed by
7. \.{1} : a dot that must be followed by
8. ([a-z]|[0-9]){2,} : at least two letters and/or numbers
9. (\.([a-z|0-9]){2})* : and possibly followed by a dot and more letters and/or numbers
So, my questions are:
1. Have I forgotten something here?
2. Is there anything more that I could or should add?
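
For anyone who wants to experiment, the nine chunks listed above can be glued back together in Perl roughly as follows. This is only a sketch: the helper name `looks_like_email` and the sample inputs are made up for illustration, and the pattern is left unanchored, exactly as posted.

```perl
use strict;
use warnings;

# The nine chunks from the post, in order, each paired with a short description.
my @chunks = (
    [ '([a-z]|[0-9])+',         'one or more letters and/or digits' ],
    [ '((\.)+([a-z]|[0-9])+)*', 'dot(s) followed by letters and/or digits' ],
    [ '(\-|\_)*',               'dashes or underscores' ],
    [ '([a-z]|[0-9])*',         'more letters and/or digits' ],
    [ '\@{1}',                  'one at sign' ],
    [ '([a-z]|[0-9])+',         'letters and/or digits' ],
    [ '\.{1}',                  'a dot' ],
    [ '([a-z]|[0-9]){2,}',      'at least two letters and/or digits' ],
    [ '(\.([a-z|0-9]){2})*',    'a dot plus two characters, repeated' ],
);

# Concatenate the chunk sources and compile them once.
my $pattern = join '', map { $_->[0] } @chunks;
my $re      = qr/$pattern/;

# Hypothetical helper: 1 if the pattern matches anywhere in the input, else 0.
sub looks_like_email {
    my ($address) = @_;
    return $address =~ $re ? 1 : 0;
}

print looks_like_email('dawolf@example.com') ? "match\n" : "no match\n";
```

One detail that shows up when the chunks are typed out like this: chunk 9 is written `[a-z|0-9]`, which puts a literal `|` inside the character class rather than acting as an alternation.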

Replies are listed 'Best First'.
by Ovid (Cardinal) on Sep 25, 2004 at 17:04 UTC

It looks good. I just have a slight correction for it:

```
$RFC822PAT = <<'EOF';
[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*@[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)
EOF
```

Use Email::Valid :)

I know that's not exactly what you were looking for, but the following email addresses (from "Mastering Regular Expressions") are all valid:

• Alfred Neuman <Neuman@BBN-TENEXA>
• ":sysmail"@ Some-Group. Some-Org
• Muhammed.(I am the greatest) Ali @(the)Vegas.WBA

Cheers,
Ovid

New address of my CGI Course.

If this is a "slight" correction... *lol*

Thanks, Ovid, but I'd like to make a generic regex so I can use it both inside and outside Perl (in a JavaScript function for websites, for example).

The point that Ovid was making is that if you wish to use a regular expression to validate an email address's format, you've got to make a choice: Either use that whopping big regexp that he provided, or risk mistakenly rejecting valid addresses. If you don't mind rejecting valid addresses, go with a less intricate regular expression, but it won't be a robust solution.

Dave

by Your Mother (Archbishop) on Sep 25, 2004 at 21:10 UTC

I also recommend the "it's too frickin hard so let it be" approach. For most of the stuff I do, I allow the email to be semi-arbitrary and then if the user doesn't get his confirmation mail, etc, it's his responsibility. The UDP nature of email makes it ultimately impossible to trust 100% anyway, so I don't feel bad just handing the user input over to the mail stuff.

Although, surely someone has taken a crack at validation with Parse::RecDescent or somesuch? I'd love to see it if someone's done it.

Although, surely someone has taken a crack at validation with Parse::RecDescent or somesuch? I'd love to see it if someone's done it.

We were just talking about that in the CB. bart pointed out that Parse::RecDescent would probably be way too slow. Parse::YAPP might be better (see Re:x2 Easy Things (a plug for Parse::YAPP)) if you want a pure-Perl solution, but for speed there's no reason not to build a parser with yacc or bison or something of the sort and XS it.

Incidentally, I googled for an RFC822 yacc grammar (I can't believe that nobody's distributing one; it seems like the obvious thing to do) and came across this article. It's rather old, but probably still applies.

--
F o x t r o t U n i f o r m
Found a typo in this node? /msg me
% man 3 strfry

by strat (Canon) on Sep 26, 2004 at 12:31 UTC
You can write
```
([a-z]|[0-9])+
```
also as
```
([a-z0-9])+
```
or
```
[a-z0-9]+
```
which will save you a lot of capturing clauses if you don't need to capture anything.
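
A quick way to convince yourself that the three spellings accept the same strings is sketched below. The `\A` and `\z` anchors and the `verdicts` helper are additions for the demonstration only; they are not part of the original regex.

```perl
use strict;
use warnings;

my @forms = (
    qr/\A([a-z]|[0-9])+\z/,   # as posted: alternation inside a capturing group
    qr/\A([a-z0-9])+\z/,      # one character class, still capturing
    qr/\A[a-z0-9]+\z/,        # no group at all
);

# Hypothetical helper: one 0/1 verdict per form, joined with commas.
sub verdicts {
    my ($input) = @_;
    return join ',', map { $input =~ $_ ? 1 : 0 } @forms;
}

print "$_ => ", verdicts($_), "\n" for qw(abc123 ABC a_b);
# abc123 => 1,1,1
# ABC => 0,0,0
# a_b => 0,0,0
```

All three agree on every input; the only difference is how many capture groups the regex engine has to maintain.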

I didn't analyze the whole regex, but at first glance some questions arose:

x) what's about _ and uppercase letters? -> \w instead of a-z0-9
x) \@{1} is the same as \@ => so {1} can be omitted, also in \.{1}
x) x.@domain.tld seems ok

BTW: with the /x flag you can write regexes more readably and comment them, e.g.
```
$mail =~ /^          # at the beginning
[a-zA-Z0-9]          # one letter or digit
[a-zA-Z0-9_\-\.]*    # opt. letters, digits, underscores, dashes, dots
\@
[a-zA-Z]             # one letter
...
$/x;
```

for scripts in production I like to use Mail::RFC822::Address from CPAN...

Best regards,
perl -e "s>>*F>e=>y)\*martinF)stronat)=>print,print v8.8.8.32.11.32"

x) what's about _ and uppercase letters? -> \w instead of a-z0-9

The underscore is captured on item 3:

3) (\-|\_)* : Possibly a dash or underscore.

x) x.@domain.tld seems ok

Good points. Thanks for pointing them out.

BTW: with the flag /x you can write regexes more readable and comment them

True, but I want to make the regex as generic as possible, so I can use it with other languages such as JavaScript and PHP, for example.

Thanks a lot. This was exactly the kind of reply I was looking for. :)
by DaWolf (Curate) on Sep 25, 2004 at 20:23 UTC
Please guys, let me make some statements here, since I was apparently misunderstood:

I've understood Ovid's post, davido, and really appreciated his contribution (as usual), but I'd like to *learn* to build a regex for that matter.

Just because a "standard" way of building it exists doesn't mean it's immutable, e.g.:

Ovid first allows a whitespace and/or tab to be in the beginning of the e-mail address:
```
[\040\t]*
```
I wouldn't do that. *My* line of thinking and the logic that reigns on the app I'm coding is: if the user has entered a whitespace in the beginning of the webform field, well, too bad for him. Even better, I'd write a JS function to disable whitespaces on that field. Since it's a web app it's better for me (and most probably for the user) to do it this way.

The *only thing* I want to do is to validate an e-mail address (not something like "My name <myemail@myserver.com>") using a regex built by *me*. I want to learn that. I don't care if there is a module that already has this implemented. I don't care if there is a "generally accepted" way of doing it. I want to do it myself because I want to understand the full concept of it and then learn from it.

All I wanted was some advice and tips on how to do it, and some criticism of my regex.

So far in this node, only Ovid has provided that (with the link to the RFC and the address examples from Mastering Regular Expressions).

Cheers,

Personally, I think you're tilting at windmills. That said, I can offer a small correction to Ovid's post: RFC 822 has been superseded by RFC 2822. (Since the RFCs provide what looks like BNF for the actual address spec, I'd suggest using a parser-builder instead of a regex - I guess Parse::RecDescent and Parse::YAPP are pretty good, though I haven't used either. Parsers are worth learning about, too.)

--
F o x t r o t U n i f o r m
Found a typo in this node? /msg me
% man 3 strfry

If you want to see how that regex was built, buy the MRE book. If you're just making toys to decorate your cubicle walls, fine. But if you're truly setting up a web page for general people to visit from other places, you must permit the RFC-permitted addresses. And the regex for that is the longish thing that Ovid posted. Nothing less.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

I've only spent a few minutes looking for it, but my 2nd-edition Mastering Regular Expressions doesn't seem to include that regex. So I can't check my suspicion that the original MRE regex didn't actually match all RFC822 addresses, just most of them - I vaguely remember a disclaimer about comments in hostnames not being recognized, or some such. (Probably something recursive, now that I think about it.) Even Ovid's monstrous regex may not be complete.

--
F o x t r o t U n i f o r m
Found a typo in this node? /msg me
% man 3 strfry

by jdalbec (Deacon) on Sep 26, 2004 at 17:17 UTC
1. ([a-z]|[0-9])+ : One or more letters and/or numbers
2. ((\.)+([a-z]|[0-9])+)* : Possibly a dot. Considering that there is a dot one or more letters and/or numbers must follow
3. (\-|\_)* : Possibly a dash or underscore.
4. ([a-z]|[0-9])* : Possibly more letters and/or numbers
5. \@{1} : Must have one (and only one) '@'
6. ([a-z]|[0-9])+ : Must have more letters and /or numbers followed by
7. \.{1} : a dot that must be followed by
8. ([a-z]|[0-9]){2,} : at least two letters and/or numbers
9. (\.([a-z|0-9]){2})* : and possibly followed by a dot and more letters and/or numbers

Please use <code> tags around Perl code so it displays correctly. You may want to use (?: ... ) for grouping so you don't get a bunch of assignments to $1, $2, etc. Also {1} is redundant. In several cases your comments don't match what the regexp is doing.

In 2. the (\.)+ allows two or more dots in a row. I think you probably just want to drop the + (and the redundant parentheses).

In 3. you have almost the same problem. Try changing * to ? unless you mean to allow multiple dashes/underscores in a row.

In 9. I think you want (?:\.(?:[a-z0-9]){2,})*. I'm leaving the * (even though you said "possibly") since I think you can have any number of domain levels in a FQDN.
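
Putting the three suggestions together might look something like the sketch below. The `\A`/`\z` anchors, the `/x` layout and the helper name `is_plausible` are assumptions made for the demonstration; only the changes to chunks 2, 3 and 9 come from the points above.

```perl
use strict;
use warnings;

my $revised = qr{
    \A                       # assumption: the whole string must match
    [a-z0-9]+                # 1. letters or digits
    (?: \. [a-z0-9]+ )*      # 2. a single dot, then letters or digits, repeated
    [-_]?                    # 3. at most one dash or underscore
    [a-z0-9]*                # 4. more letters or digits
    \@                       # 5. the at sign
    [a-z0-9]+                # 6. letters or digits
    \.                       # 7. a dot
    [a-z0-9]{2,}             # 8. at least two letters or digits
    (?: \. [a-z0-9]{2,} )*   # 9. any number of further domain levels
    \z
}x;

# Hypothetical helper: 1 if the address fits the revised pattern, else 0.
sub is_plausible {
    my ($address) = @_;
    return $address =~ $revised ? 1 : 0;
}

for my $address (qw(
    er.galvao.abbott@somedomain.com
    er..galvao@somedomain.com
    a@b.co.uk
)) {
    print "$address => ", is_plausible($address), "\n";
}
```

With these changes a doubled dot is rejected, while single dots and multi-level domains still pass. An address with two separate underscores is still rejected, because chunk 3 occurs only once in the sequence.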

janitored by ybiC: Balanced <code> tags around regex examples for proper rendering

Please use <code> tags around Perl code so it displays correctly.

Actually I did use them, but since the <code> tags are inside an <ol> tag maybe they aren't displaying correctly...

You may want to use (?: ... ) for grouping so you don't get a bunch of assignments to $1, $2, etc.

True, thanks for pointing this out.

Also {1} is redundant.

Also true. Thanks again, but...

In several cases your comments don't match what the regexp is doing.

Actually, I've done a lot of testing and apparently they do... Could you clarify this part a little? I'm curious.

In 2. the (\.)+ allows two or more dots in a row. I think you probably just want to drop the + (and the redundant parentheses).

and

In 3. you have almost the same problem. Try changing * to ? unless you mean to allow multiple dashes/underscores in a row.

Actually I want to allow multiple dots, dashes and underscores since 'er.galvao.abbott@somedomain.com' or 'er_galvao_abbott@somedomain.com' are valid addresses.

In 9. I think you want (?:\.(?:[a-z0-9]){2,})*. I'm leaving the * (even though you said "possibly") since I think you can have any number of domain levels in a FQDN.

Thanks, I'll give this chunk a try and see what happens.

merlyn says please stop encouraging DaWolf - he's heading down a dangerous path
It's true that you're heading down a dangerous path. Whatever regex you finally come up with, please don't use it in a production application. Use Email::Valid on the server and don't try to do complex processing in JavaScript.

What is the application you want to validate e-mail addresses for? Do you only care that they're syntactically valid or do you want to verify that there's a real person on the other end? In the second case you'll need to do more than just run the address through Email::Valid.

For educational purposes only, I'll point out that your regex rejects 'er_galvao_abbott@somedomain.com' and accepts 'er...galvao...abbott-_-_-_@somedomain.com'. Is that what you want?
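
The behaviour described above is easy to reproduce. The sketch below assumes the whole field is supposed to match, so it wraps the posted regex in `\A ... \z`; the helper names `accepts_whole` and `accepts_anywhere` are made up for the illustration.

```perl
use strict;
use warnings;

# The regex exactly as posted at the top of the thread.
my $original = qr/([a-z]|[0-9])+((\.)+([a-z]|[0-9])+)*(\-|\_)*([a-z]|[0-9])*\@{1}([a-z]|[0-9])+\.{1}([a-z]|[0-9]){2,}(\.([a-z|0-9]){2})*/;

# Assumption: the whole field must match, so anchor at both ends.
sub accepts_whole {
    my ($address) = @_;
    return $address =~ /\A$original\z/ ? 1 : 0;
}

# Without anchors, a match anywhere inside the string counts.
sub accepts_anywhere {
    my ($address) = @_;
    return $address =~ $original ? 1 : 0;
}

for my $address (
    'er.galvao.abbott@somedomain.com',
    'er_galvao_abbott@somedomain.com',
    'er...galvao...abbott-_-_-_@somedomain.com',
) {
    printf "%-45s whole=%d anywhere=%d\n",
        $address, accepts_whole($address), accepts_anywhere($address);
}
```

Anchored, the underscore address fails and the one with runs of dots and dashes passes. Unanchored, all three "pass", because the regex is then satisfied by a substring such as `abbott@somedomain.com`. That may be why testing without anchors makes the pattern look more permissive than the chunk descriptions suggest.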

by muba (Priest) on Sep 27, 2004 at 18:36 UTC

Node Type: perlquestion [id://393804]
Approved by SciDude
Front-paged by grinder