Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Hello Marto, hello Marshall, good evening!

My plan is to visit each of these five thousand - i do not intend to hammer the server;-)

I agree with Marshall: "On the other hand, this site has "bandwith to burn". I don't think that they will notice 5,000 pages. But do testing with a small set of pages."

This governmental site has a very very big server!

Well - if i get all the pages with "Mechanize" do i have to use HTML::TokeParser as well!? - for the parsing - in order to get the information of all single pages? i have read on the CPAN-Site for Mechanize:

$mech->find_all_inputs( ... criteria ... )

find_all_inputs() returns an array of all the input controls in the current form whose properties match all of the regexes passed in. The controls returned are all descended from HTML::Form::Input.

If no criteria are passed, all inputs will be returned.
If there is no current page, there is no form on the current page, or there are no submit controls in the current form then the return will be an empty array.

You may use a regex or a literal string:
# get all textarea controls whose names begin with "customer" my @customer_text_inputs = $mech->find_all_inputs( type => 'textarea', name_regex => qr/^customer/, ); # get all text or textarea controls called "customer" my @customer_text_inputs = $mech->find_all_inputs( type_regex => qr/^(text|textarea)$/, name => 'customer', );
Well that would be great if i can run Mechanize with some additional jobs for the parsing-part! If this is possibie it would be great!

@Marshall: i can have a look if they provide the information i need in an easier format than web pages? But i guess that i have to go the way to fetch page by page--- guess that it is the best way to do this in a nightly job!

i come back and report all findings! Untill soon!

regards Perlbeginner1


what is aimed: 17 lines of text. This information-set is wanted - 5081 Times:


see an example here: Allgemeine Daten der Schule / Behörde:

Schul-/Behördenname: Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule
Schulart: Öffentliche Schule (04139579)
Hausadressse: Ebersbacher Str. 20, 88361 Altshausen
Postfachadresse: Keine Angabe
Telefon: 07584/92270
Fax: 07584/922729
E-Mail: poststelle@04139579.schule.bwl.de
Internet: www.hpv-altshausen.de
Übergeordnete Dienststelle: Staatliches Schulamt Markdorf
Schulleitung: Mößle, Georg
Stellv. Schulleitung: Schneider, Cornelia
Anzahl Schüler: 456
Anzahl Klassen: 19
Anzahl Lehrer: 39
Kreis: Ravensburg
Schulträger: <kein Eintrag> (Ohne Zuordnung)



this is a true PERL-Job. I think that PERL can do this kind of job with ease! All those 5081 pages are human readable - but if i try to click page by page and read all data - it would take me more than a month -

If i can do it with PERL then i only need to have the code for Parsing it once!!!

In reply to Re^6: getting LWP and HTML::TokeParser to run by Perlbeginner1
in thread getting started with LWP and HTML::TokeParser by Perlbeginner1

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (3)
As of 2021-12-08 01:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    R or B?



    Results (34 votes). Check out past polls.

    Notices?