PerlMonks  

Re: [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1

by wfsp (Abbot)
on Sep 26, 2010 at 13:51 UTC ( [id://862082]=note )


in reply to [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
in thread Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1

To get the email address as well, replace my while loop with:
my (@text, $found_start);
while (my $t = $p->get_token){
  my $txt;
  if ($t->is_text){
    $txt = $t->as_is;
    for ($txt){
      s/^\s+//;
      s/\s+$//;
    }
    next unless $txt;
    $found_start++ if $txt =~ /^Hit/;
  }
  elsif (
    $found_start and
    $t->is_start_tag(q{a}) and
    $t->get_attr(q{href})
  ) {
    my $href = $t->get_attr(q{href});
    if ($href =~ /mailto:/i){
      $txt = $href;
    }
    else {
      next;
    }
  }
  else{
    next;
  }
  next unless $found_start;
  push @text, $txt;
  last if $txt =~ /Listed since/;
}
Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664 , Fax: 048 + 334555667
MyWeb-Nummer: 222237520031111
Webmaster: mailto: webmaster@demosite.fr master
Listed since: 20.08.2002
"All the output should be written in only one new text file."
Well, open a new text file for writing. :-) See open (perldoc -f open) for how to do that.
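
For what it's worth, here is a minimal sketch of that step. The file name results.txt and the sample contents of @text are made up for illustration; in the real script @text would be whatever the while loop collected.

```perl
use strict;
use warnings;

# Hypothetical output file name; use whatever suits you.
my $out_file = 'results.txt';

# Stand-in for the @text array filled by the while loop.
my @text = ('Hit 7 out of 120517', 'Listed since: 20.08.2002');

# Three-argument open with a lexical filehandle.
# '>' truncates an existing file; '>>' would append instead.
open my $out, '>', $out_file
    or die "Cannot open '$out_file' for writing: $!";

# One collected item per line.
print {$out} "$_\n" for @text;

close $out
    or die "Cannot close '$out_file': $!";
```

If the script is run several times and the results should accumulate, opening with '>>' instead of '>' is probably what you want.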

Bart has given some excellent tips on how to get a list of HTML files so that you can loop over them.
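
Putting the two together could look roughly like the sketch below. This is only one way to do it and not necessarily what Bart suggested: it assumes all the pages sit in a single directory and uses glob to list them. The names html_pages, all_results.txt, extract_text and write_all are made up for the example, and extract_text is a placeholder where the parsing loop would go.

```perl
use strict;
use warnings;

# Placeholder for the per-file parsing. In the real script this would
# run the token loop on $file and return the collected @text.
sub extract_text {
    my ($file) = @_;
    return ("parsed: $file");
}

# Write the results for every *.html file in $dir to one output file.
# Returns the number of files processed.
sub write_all {
    my ($dir, $out_file) = @_;

    # Open the output file once, before the loop, so that every
    # page ends up in the same file.
    open my $out, '>', $out_file
        or die "Cannot open '$out_file' for writing: $!";

    my @files = sort { $a cmp $b } glob("$dir/*.html");
    for my $file (@files) {
        print {$out} "$_\n" for extract_text($file);
        print {$out} "\n";    # blank line between records
    }

    close $out
        or die "Cannot close '$out_file': $!";
    return scalar @files;
}

# Hypothetical directory and output file names.
my $count = write_all('html_pages', 'all_results.txt');
print "Processed $count file(s)\n";
```

Note that glob splits its pattern on whitespace, so a directory name containing spaces would need different handling (opendir/readdir or File::Find, for instance).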

Good luck!

Replies are listed 'Best First'.
Re^2: [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by Perlbeginner1 (Scribe) on Sep 26, 2010 at 18:45 UTC
    Hello Bart, hello wfsp!

    Many, many thanks for your help! I will try out these hints and your code.

    I will come back and report all results.

    Until soon.

    Best regards,
    perlbeginner1
