PerlMonks  

Re^6: getting LWP and HTML::TokeParser to run

by Perlbeginner1 (Scribe)
on Oct 10, 2010 at 19:33 UTC (#864509=note)


in reply to Re^5: getting LWP and HTML::TokeParser to run
in thread getting started with LWP and HTML::TokeParser

Hello Marto, hello Marshall, good evening!

My plan is to visit each of these five thousand pages - I do not intend to hammer the server ;-)

I agree with Marshall: "On the other hand, this site has "bandwith to burn". I don't think that they will notice 5,000 pages. But do testing with a small set of pages."

This governmental site has a very, very big server!

Well - if I get all the pages with Mechanize, do I have to use HTML::TokeParser as well for the parsing, in order to get the information out of all the single pages? I have read on the CPAN site for Mechanize:

$mech->find_all_inputs( ... criteria ... )

find_all_inputs() returns an array of all the input controls in the current form whose properties match all of the regexes passed in. The controls returned are all descended from HTML::Form::Input.

If no criteria are passed, all inputs will be returned.
If there is no current page, there is no form on the current page, or there are no submit controls in the current form then the return will be an empty array.

You may use a regex or a literal string:
    # get all textarea controls whose names begin with "customer"
    my @customer_text_inputs = $mech->find_all_inputs(
        type       => 'textarea',
        name_regex => qr/^customer/,
    );

    # get all text or textarea controls called "customer"
    my @customer_text_inputs = $mech->find_all_inputs(
        type_regex => qr/^(text|textarea)$/,
        name       => 'customer',
    );
Well, it would be great if I could run Mechanize together with some additional code for the parsing part! If this is possible it would be great!
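As far as I understand the docs, find_all_inputs() only deals with form controls, not with page data - so for scraping I would still hand the fetched page over to HTML::TokeParser. A minimal sketch of combining the two modules; the URL is only a placeholder and the <td>-based layout of the target page is an assumption:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTML::TokeParser;

    # Placeholder URL - the real school-detail URL would go here.
    my $url = 'http://example.com/schule?id=04139579';

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);

    # Mechanize fetched the page; TokeParser now parses the HTML it holds.
    my $p = HTML::TokeParser->new( \$mech->content );
    while ( my $tag = $p->get_tag('td') ) {
        my $text = $p->get_trimmed_text('/td');
        print "$text\n";
    }

So Mechanize does the fetching and navigation, and TokeParser does the parsing of $mech->content - no conflict between the two.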

@Marshall: I can have a look whether they provide the information I need in an easier format than web pages. But I guess I will have to fetch the data page by page - and I guess it is best to do this in a nightly job!
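A throttled loop for such a nightly job could look like this - the IDs and the URL are made up, the sleep keeps the load low, and as Marshall suggested one would test with a small set first:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # Hypothetical list of page IDs - test with a handful before all 5081.
    my @ids = qw(04139579 04139580 04139581);

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    for my $id (@ids) {
        $mech->get("http://example.com/schule?id=$id");   # placeholder URL
        # ... hand $mech->content to the parsing code here ...
        sleep 2;    # be polite: pause between requests
    }
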

I will come back and report all findings! See you soon!

regards Perlbeginner1


What is the aim: 17 lines of text. This information set is wanted 5081 times:


See an example here - Allgemeine Daten der Schule / Behörde (general data of the school / authority):

Schul-/Behördenname: Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule
Schulart: Öffentliche Schule (04139579)
Hausadressse: Ebersbacher Str. 20, 88361 Altshausen
Postfachadresse: Keine Angabe
Telefon: 07584/92270
Fax: 07584/922729
E-Mail: poststelle@04139579.schule.bwl.de
Internet: www.hpv-altshausen.de
Übergeordnete Dienststelle: Staatliches Schulamt Markdorf
Schulleitung: Mößle, Georg
Stellv. Schulleitung: Schneider, Cornelia
Anzahl Schüler: 456
Anzahl Klassen: 19
Anzahl Lehrer: 39
Kreis: Ravensburg
Schulträger: <kein Eintrag> (Ohne Zuordnung)
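Once the page text is extracted, lines like the ones above split cleanly on the first colon into label => value pairs. A small self-contained sketch with three of the sample lines hard-coded (in the real job the lines would of course come from the fetched page):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sample lines as they appear on the page; in the real job
    # these come out of the HTML::TokeParser step.
    my @lines = (
        'Telefon: 07584/92270',
        'Anzahl Lehrer: 39',
        'Kreis: Ravensburg',
    );

    my %record;
    for my $line (@lines) {
        # limit 2: split only on the first colon, values may contain more
        my ( $label, $value ) = split /:\s*/, $line, 2;
        $record{$label} = $value;
    }

    print "$record{'Kreis'}\n";    # prints "Ravensburg"

The limit of 2 on split matters: a value such as a URL or a time could itself contain a colon.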



This is a true Perl job. I think Perl can do this kind of job with ease! All 5081 pages are human-readable - but if I tried to click page by page and read all the data, it would take me more than a month.

If I can do it with Perl, then I only need to write the parsing code once!
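And once the parsing works, the same record can be appended to one output file per run, e.g. as tab-separated lines - the field names and values below are just an example record, the real one comes from the parsing step:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumed: %record was filled by the parsing step.
    my %record = (
        'Kreis'         => 'Ravensburg',
        'Telefon'       => '07584/92270',
        'Anzahl Lehrer' => '39',
    );

    # Fixed field order so every line of the output file lines up.
    my @fields = ( 'Kreis', 'Telefon', 'Anzahl Lehrer' );

    open my $out, '>', 'schulen.tsv'
        or die "cannot write schulen.tsv: $!";
    print {$out} join( "\t", @record{@fields} ), "\n";
    close $out;

With all 5081 records in one tab-separated file, the data can then go straight into a spreadsheet or database.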
