How to extract an email address from a mailto URL?

jdlev has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to extract an email address from a mailto URL? by linuxer (Curate) on Dec 29, 2008 at 20:23 UTC
Check (e.g. with grep) for lines that contain 'mailto:' (be more specific if you like; match the 'href' ...). Then use Regexp::Common together with Regexp::Common::Email::Address to identify the mail address in the matching lines.

PS: Don't remove your original question here.

Update: added Regexp::Common::Email::Address.
Re: How to extract an email address from a mailto URL? by CountZero (Bishop) on Dec 29, 2008 at 21:04 UTC
`grep` is indeed the answer to your question if you can be sure that the whole of the `<a>` ... `</a>` phrase is on the same line.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Extracting email addresses from mailto URIs using HTML and URI parsers instead of regular expressions by dorward (Curate) on Dec 30, 2008 at 15:07 UTC
I don't like using regular expressions on HTML documents, so my approach would be to use a proper HTML parser instead. This has a number of benefits, including the decoding of entities in the HTML representing the email address.

This code uses LWP::UserAgent to fetch the HTML document, HTML::TokeParser to read it, and URI to parse the URIs in it.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;
use URI;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);

my $root_uri = 'http://example.com/';

my $response = $ua->get($root_uri);

if ($response->is_success) {
    my $html = $response->decoded_content;
    my $p = HTML::TokeParser->new( \$html );

    # Walk every <a> start tag in the document.
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless $href;

        # Resolve the href against the page URI, then keep only mailto: links.
        my $uri = URI->new_abs( $href, $root_uri );
        next unless ($uri->scheme eq 'mailto');
        print $uri->to, "\n";
    }
} else {
    die $response->status_line;
}
```
Re: How to extract an email address from a mailto URL? by eye (Chaplain) on Dec 30, 2008 at 07:03 UTC
If you want to differentiate between addresses in anchor tags and other uses of "mailto:" in the file, read the entire file into memory and use the match operator (m//). As suggested previously, you should use Regexp::Common::Email::Address to help compose a regular expression for the email address and enclosing HTML. I would use "\s+" between the "a" and "href" and "\s*" adjacent to the equal sign to match HTML's treatment of whitespace. Note that HTML allows quoting with both single and double quotes. Also, older HTML allowed you to not quote the information after the equal sign in some circumstances.
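A minimal sketch of the whole-file match described above (not part of the original reply), using core Perl only. The address pattern `$addr` is a deliberately simplified stand-in, not the pattern from Regexp::Common::Email::Address. The sub name `extract_mailto_addresses` and the second and third sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in for an email address pattern (assumption, not RFC-complete).
my $addr = qr/[\w.+-]+\@[\w-]+(?:\.[\w-]+)+/;

# Return every address that appears in the href of an anchor tag.
sub extract_mailto_addresses {
    my ($html) = @_;
    my @found;
    while (
        $html =~ m{
            <a\b [^>]*? \s+ href \s* = \s*   # whitespace before href, optional around the equal sign
            (["']?)                          # opening quote: double, single, or none
            mailto: ($addr)                  # the address itself
            [^"'>\s]*                        # trailing part such as ?subject=...
            \1                               # the same quote again (or nothing)
        }gxi
      )
    {
        push @found, $2;
    }
    return @found;
}

# The whole document is held in one string, so tags may span lines.
my $html = <<'HTML';
<td>Fax: (301)931-1285 <br><a href='mailto:KHargrove@servpro1010.com'>KHargrove@servpro1010.com</a></td>
<a
   href = "mailto:info@example.org?subject=Hi">write to us</a>
<p>Not a link: mailto:nobody@example.net</p>
HTML

print "$_\n" for extract_mailto_addresses($html);
```

Run as is, this prints the two addresses found inside anchor tags and skips the bare "mailto:" in the paragraph text. For real input you would slurp the file into `$html` instead of using the heredoc, and a parser-based approach remains more robust against unusual markup.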
Quoting attribute values in HTML by dorward (Curate) on Dec 30, 2008 at 20:47 UTC
"Also, older HTML allowed you to not quote the information after the equal sign in some circumstances."

Newer HTML too; it is only XHTML that makes quoting attribute values mandatory in all circumstances. Since email addresses will include an @ character, the quotes are mandatory in this case.

That said, it isn't as if nobody ever breaks the rules, so it is generally a good idea to write code that can cope (unless you know that the incoming data won't have that problem).
Re^2: How to extract an email address from a mailto URL? by jdlev (Scribe) on Dec 30, 2008 at 13:17 UTC
My experience in Perl is going on about 3 weeks, so some of what you are saying is Greek to me. Can you provide an example of how you would do it? The source file to pull the information from has the tag as follows:

`showTollfree(1010) // --> </script> Fax: (301)931-1285 <br><a href='mailto:KHargrove@servpro1010.com'>KHargrove@servpro1010.com</a> </td>`

I'm sorry to have to be wet nursed through this, but I have learned a ton of stuff over the last few weeks. I feel like my brain is going to explode!
Re^3: How to extract an email address from a mailto URL? by linuxer (Curate) on Dec 30, 2008 at 14:07 UTC
Well, first install these two modules (and their unresolved dependencies, if there are any):

Regexp::Common
Regexp::Common::Email::Address

Then you can do something like this (quick shot, untested):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Regexp::Common qw(Email::Address);
use Email::Address;

my $filename = 'file_to_parse.dat';

open my $rh, '<', $filename or die "$filename: $!";

# Requirement: href=, mailto: and the mail address must be on the same line!
my @addresses =
    map  { m/mailto:($RE{Email}{Address})/ ? $1 : () }   # skip lines where the address does not match
    grep { m/href=.+?mailto:/ }
    <$rh>;

close $rh;

{
    # Print one address per line.
    local $, = local $\ = "\n";
    print @addresses;
}

__END__
```
Re^4: How to extract an email address from a mailto URL? by jdlev (Scribe) on Dec 30, 2008 at 16:09 UTC
Re: How to extract an email address from a mailto URL? by Anonymous Monk on Dec 30, 2008 at 09:46 UTC
Please use code tags, see Markup in the Monastery


	PerlMonks