PerlMonks  

Re^3: getting LWP and HTML::TokeParser to run

by Marshall (Canon)
on Oct 10, 2010 at 14:22 UTC (#864480)


in reply to Re^2: getting LWP and HTML::TokeParser to run
in thread getting started with LWP and HTML::TokeParser

I would go with marto's advice about WWW::Mechanize. I haven't used it yet, but I hear that it is great. I suspect that you will find it easier to use than any advice I could give about decoding the raw HTML to get the next pages to "click" on. You are getting about 5K pages from a huge government website that performs very well. I wouldn't worry too much about fancy error recovery with retries unless you are going to run this program often.
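As a starting point, here is a minimal WWW::Mechanize sketch. I have not run this against the live site; the start URL is lifted from the example further down in this thread and may have changed, and the fetch is left commented out so nothing hits the network by accident.

```perl
use strict;
use warnings;

# Start URL taken from the example later in this thread (may be stale).
my $start_url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html';

sub fetch_page {
    my ($url) = @_;
    require WWW::Mechanize;                       # install from CPAN
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);                             # dies on HTTP errors
    return $mech->content;                        # raw HTML of the page
}

# my $html = fetch_page($start_url);              # uncomment to actually fetch
```

With autocheck enabled, any failed GET dies immediately, which is usually what you want for a batch job like this.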

Update:
You can of course parse the HTML content of the search results with regex, but this is a mess...

    my (@hrefs) = $mech->content =~ m|COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php\?id=\d+|g;
    print "$_\n" foreach @hrefs;   # there are 5081 of these
    # these COMPLETEHREFs can be appended to a main URL like this:
    my $example_url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04146900';
Then things get hairy and you will want to whip out some of that HTML parser voo-doo to parse the resulting table. Also, the character encodings aren't consistent: for example, the page has a literal ä, but ü appears as the HTML entity &uuml;
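One way to paper over that encoding inconsistency is to decode the handful of German entities by hand. `HTML::Entities::decode_entities()` from CPAN is the general-purpose solution; the hand-rolled map below is only a sketch covering the umlauts seen on these pages.

```perl
use strict;
use warnings;
use utf8;

# Minimal normalizer for mixed literal/entity umlauts.  Anything not in
# the map is passed through untouched.
my %entity = (
    '&auml;'  => 'ä', '&ouml;' => 'ö', '&uuml;' => 'ü',
    '&Auml;'  => 'Ä', '&Ouml;' => 'Ö', '&Uuml;' => 'Ü',
    '&szlig;' => 'ß',
);

sub normalize_umlauts {
    my ($text) = @_;
    $text =~ s/(&[A-Za-z]+;)/exists $entity{$1} ? $entity{$1} : $1/ge;
    return $text;
}
```

After normalizing, every detail page should use literal characters consistently, which makes the later field matching much less fragile.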

Replies are listed 'Best First'.
Re^4: getting LWP and HTML::TokeParser to run
by Perlbeginner1 (Scribe) on Oct 10, 2010 at 17:56 UTC
    Hello marto, hello Marshall,

    Many thanks for the hints. I am going to run some tests with Mechanize, and I will use Mechanize instead of plain LWP!

    By the way, I can read on CPAN:
    Features include:

    * All HTTP methods
    * High-level hyperlink and HTML form support, without having to parse HTML yourself
    * SSL support
    * Automatic cookies
    * Custom HTTP headers


    Mech supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited. (end of citation)
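The fetch/follow/back cycle described above can be sketched like this. I have not tested it against a live site; `follow_link()`, `uri()` and `back()` are standard Mech methods, and following the first link is just a placeholder for real link selection.

```perl
use strict;
use warnings;

# Fetch a page, follow its first link, then step back through Mech's
# built-in history -- the workflow quoted from the CPAN docs above.
sub walk_first_link {
    my ($start_url) = @_;
    require WWW::Mechanize;                    # install from CPAN
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($start_url);
    $mech->follow_link( n => 1 );   # follow the first link on the page
    my $inner_url = $mech->uri;     # where we landed
    $mech->back;                    # revisit the previous page
    return $inner_url;
}
```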

    Well - does this mean that I do not have to parse a result page with HTML::TokeParser? In other words, in the feature list I can read: "High-level hyperlink and HTML form support, without having to parse HTML yourself" - I can hardly believe it! Does this mean that I do not have to parse the fetched HTML pages myself?

    Can I get the data set of each of the 5000 pages with Mechanize?

    Well, I have to run some tests! And perhaps someone can set me straight here.

    BTW: you are right, Marshall: this is a "huge government website that performs very well." I do not think that I will run into any trouble...

    After the first trials I will come back and report all my findings.

    Until soon!

    perlbeginner!

      If your plan is to visit each of these five thousand or so links, please don't hammer the server.

      Well, as far as use policy goes, do check. When I run automated scripts, I do it at night during low-load times, and I often put in a sleep() after some number of requests to slow things down.

      One thing to investigate is whether this site provides the information that you need in an easier format than web pages. Many big sites do. Some sites I use actually have a separate URL for automated requests and even provide tools to use their more efficient computer-to-computer methods.

      On the other hand, this site has "bandwidth to burn". I don't think that they will notice 5,000 pages. But do testing with a small set of pages.
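A throttling loop along those lines might look like the sketch below. The batch size and pause length are invented numbers; tune them for the site, and note the network code is never called here.

```perl
use strict;
use warnings;

# Sleep after every batch of requests so the crawl stays polite.
# Batch size and pause are made-up values -- adjust to taste.
my $BATCH_SIZE = 50;
my $PAUSE_SECS = 10;

sub fetch_politely {
    my (@urls) = @_;
    require WWW::Mechanize;                    # install from CPAN
    my $mech  = WWW::Mechanize->new( autocheck => 0 );
    my @pages;
    my $count = 0;
    for my $url (@urls) {
        my $res = $mech->get($url);
        push @pages, $mech->content if $res->is_success;
        sleep $PAUSE_SECS if ++$count % $BATCH_SIZE == 0;
    }
    return @pages;
}
```

Turning autocheck off here means a single failed page is skipped instead of killing the whole overnight run.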

        Hello marto, hello Marshall, good evening!

        My plan is to visit each of these five thousand pages - I do not intend to hammer the server ;-)

        I agree with Marshall: "On the other hand, this site has "bandwidth to burn". I don't think that they will notice 5,000 pages. But do testing with a small set of pages."

        This governmental site has a very, very big server!

        Well - if I get all the pages with Mechanize, do I have to use HTML::TokeParser as well for the parsing, in order to get the information out of all the single pages? I have read on the CPAN page for Mechanize:

        $mech->find_all_inputs( ... criteria ... )

        find_all_inputs() returns an array of all the input controls in the current form whose properties match all of the regexes passed in. The controls returned are all descended from HTML::Form::Input.

        If no criteria are passed, all inputs will be returned.
        If there is no current page, there is no form on the current page, or there are no submit controls in the current form then the return will be an empty array.

        You may use a regex or a literal string:
            # get all textarea controls whose names begin with "customer"
            my @customer_text_inputs = $mech->find_all_inputs(
                type       => 'textarea',
                name_regex => qr/^customer/,
            );

            # get all text or textarea controls called "customer"
            my @customer_text_inputs = $mech->find_all_inputs(
                type_regex => qr/^(text|textarea)$/,
                name       => 'customer',
            );
        Well, it would be great if I could run Mechanize and have it do some of the parsing work as well!
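The usual division of labour is: Mechanize fetches, and an HTML parser digests `$mech->content`. The sketch below feeds the fetched HTML to HTML::TokeParser; extracting the `<title>` is just a stand-in for the real table-walking code.

```perl
use strict;
use warnings;

# Parse HTML that Mechanize (or anything else) has already fetched.
sub extract_title {
    my ($html) = @_;
    require HTML::TokeParser;            # comes with HTML::Parser on CPAN
    my $p = HTML::TokeParser->new( \$html );
    $p->get_tag('title') or return undef;
    return $p->get_trimmed_text('/title');
}
```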

        @Marshall: I can have a look at whether they provide the information I need in an easier format than web pages. But I guess I will have to fetch the pages one by one - and I guess it is best to do that in a nightly job!

        I will come back and report all findings! Until soon!

        regards, Perlbeginner1


        What I am aiming at: 17 lines of text. This information set is wanted 5081 times:


        See an example here - Allgemeine Daten der Schule / Behörde (general data of the school/authority):

        Schul-/Behördenname: Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule
        Schulart: Öffentliche Schule (04139579)
        Hausadresse: Ebersbacher Str. 20, 88361 Altshausen
        Postfachadresse: Keine Angabe
        Telefon: 07584/92270
        Fax: 07584/922729
        E-Mail: poststelle@04139579.schule.bwl.de
        Internet: www.hpv-altshausen.de
        Übergeordnete Dienststelle: Staatliches Schulamt Markdorf
        Schulleitung: Mößle, Georg
        Stellv. Schulleitung: Schneider, Cornelia
        Anzahl Schüler: 456
        Anzahl Klassen: 19
        Anzahl Lehrer: 39
        Kreis: Ravensburg
        Schulträger: <kein Eintrag> (Ohne Zuordnung)



        This is a true Perl job. I think Perl can do this kind of job with ease! All those 5081 pages are human-readable, but if I tried to click page by page and read all the data, it would take me more than a month.

        If I can do it with Perl, then I only need to write the parsing code once!
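Once a detail page has been boiled down to the 17 "Label: value" lines shown above (the boiling-down is where the HTML parsing happens), turning them into a hash is plain core Perl. The sample lines below are adapted from the record above.

```perl
use strict;
use warnings;

# Split each "Label: value" line on the first colon and collect the
# fields into a hash reference.  Lines without a colon are skipped.
sub parse_record {
    my ($text) = @_;
    my %rec;
    for my $line ( split /\n/, $text ) {
        my ($field, $value) = $line =~ /^\s*([^:]+):\s*(.*?)\s*$/ or next;
        $rec{$field} = $value;
    }
    return \%rec;
}

my $sample = "Telefon: 07584/92270\nAnzahl Schueler: 456\nKreis: Ravensburg\n";
my $rec = parse_record($sample);
# $rec->{'Kreis'} is now 'Ravensburg'
```

Run once per fetched page, this gives 5081 hashes ready to be written out as CSV or loaded into a database.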
