PerlMonks
Re: getting LWP and HTML::TokeParser to run

by Marshall (Canon)
on Oct 10, 2010 at 11:09 UTC


in reply to getting started with LWP and HTML::TokeParser

Go to: http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html

You will have to submit the search form with a Suchbegriff (search term) of "*".
That will return a page of 5081 results. To get the sub-pages you want, you will have to "click" via LWP or whatever to follow these links, all 5081 of them.

Start with trying to submit the search term of "*" on the main page and see if you can do that.


Replies are listed 'Best First'.
Re^2: getting LWP and HTML::TokeParser to run
by Perlbeginner1 (Scribe) on Oct 10, 2010 at 12:25 UTC
    Hello Marshall,

    many thanks for the reply! I can do as you advised; I can see the 5081 results.

    Now I have to get the sub-pages. I have to "click" via LWP to follow these links, all 5081 of them.
    And then I have to do the job with HTML::TreeBuilder or HTML::TokeParser!
    I for one prefer HTML::TokeParser, since I know it a little bit.

    I have very, very little experience with HTML::TokeParser, so I guess that the parser part will be somewhat over my skills.

    But as first things come first: how should the LWP part look?

    Any and all help will be greatly appreciated!

    perlbeginner1

      Here is an example which uses WWW::Mechanize to visit the page, populate the field and submit the form. Error checking is left as an exercise for you; this is a short example to get you started:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use WWW::Mechanize;

      my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
      my $mech = WWW::Mechanize->new();
      $mech->get( $url );
      $mech->field('einfache_suche', '*');
      $mech->submit();
      # $mech->content now contains the results page.

      I can't read German, so you'd better check that you're not breaking any site policy regarding automation.

        I'm not an expert in robots.txt but I would understand http://www.kultusportal-bw.de/robots.txt as 'no agents allowed'.

        # robots.txt von 17-8 Uhr
        # email Sammler draussenbleiben
        User-agent: EmailCollector
        Disallow: /

        # Robots die durchdrehen fliegen raus
        User-agent: GagaRobot
        Disallow: /

        # Allow anything
        User-agent: *
        Disallow:
        Disallow: *ROOT=1161830$
        Disallow: */servlet/PB/-s/*
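As an aside, libwww-perl ships with WWW::RobotRules, which can answer the "am I allowed?" question programmatically. Here is a minimal, self-contained sketch using a simplified subset of the rules quoted above (the wildcard lines are omitted, and the agent name MyExampleBot is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::RobotRules;

# Simplified subset of the site's robots.txt quoted above:
# email collectors are banned, everyone else is allowed.
my $robots_txt = <<'END';
User-agent: EmailCollector
Disallow: /

User-agent: *
Disallow:
END

my $rules = WWW::RobotRules->new('MyExampleBot/1.0');   # made-up agent name
$rules->parse('http://www.kultusportal-bw.de/robots.txt', $robots_txt);

my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
print $rules->allowed($url) ? "allowed\n" : "disallowed\n";
```

With these rules a generic agent is allowed; swap in your own agent name and the live robots.txt before crawling.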
      I would go with marto's advice about WWW::Mechanize. I haven't used it yet, but I hear that it is great. I suspect that you will find it easier to use than any advice I could give about decoding the raw HTML to get the next pages to "click" on. You are getting about 5K pages from a huge government website that performs very well. I wouldn't worry too much about fancy error recovery with retries unless you are going to run this program often.

      Update:
      You can of course parse the HTML content of the search results with regex, but this is a mess...

      my @hrefs = $mech->content =~ m|COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php\?id=\d+|g;
      print "$_\n" foreach @hrefs;   # there are 5081 of these

      # these COMPLETEHREFs can be appended to a main URL like this:
      my $example_url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04146900';
      Then things get hairy and you will want to whip out some of that HTML parser voo-doo to parse the resulting table. Also, the character encoding isn't consistent: for example, an ü appears literally in some places but is encoded as an HTML entity elsewhere.
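Since the thread keeps coming back to HTML::TokeParser, here is a minimal self-contained sketch of the kind of table parsing meant above. The two-column layout and the field names (Schulname, Ort) are invented for illustration; the real detail pages must be inspected first:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up fragment standing in for one detail page's table.
my $html = <<'END';
<table>
  <tr><td>Schulname</td><td>Beispielschule</td></tr>
  <tr><td>Ort</td><td>Stuttgart</td></tr>
</table>
END

# TokeParser can read from a reference to a string.
my $p = HTML::TokeParser->new(\$html);

my %row;
while ($p->get_tag('tr')) {
    $p->get_tag('td');
    my $key = $p->get_trimmed_text('/td');   # first cell: label
    $p->get_tag('td');
    my $val = $p->get_trimmed_text('/td');   # second cell: value
    $row{$key} = $val;
}

print "$_ => $row{$_}\n" for sort keys %row;
```

The same get_tag / get_trimmed_text loop generalizes to any regular label/value table once you know the real markup.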
        Hello Marto, hello Marshall,

        many thanks for the hints. I am going to make some tests with Mechanize! I will use Mechanize instead of plain LWP!

        By the way, on CPAN I can read:
        Features include:

        * All HTTP methods
        * High-level hyperlink and HTML form support, without having to parse HTML yourself
        * SSL support
        * Automatic cookies
        * Custom HTTP headers


        Mech supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited. (end of citation)

        Well, does this mean that I do not have to parse a result page with HTML::TokeParser? In other words, the feature list says "High-level hyperlink and HTML form support, without having to parse HTML yourself". Unbelievable! I can hardly believe it! Does this mean that I do not have to parse the fetched HTML pages myself?

        Can I get the data set of each of the 5000 pages with Mechanize?
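To illustrate what "without having to parse HTML yourself" means in practice, here is a small self-contained sketch. It fetches a local stand-in page via a file:// URL (the file content is invented; the COMPLETEHREF link pattern is taken from the posts above) and collects the matching links with find_all_links:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use WWW::Mechanize;

# Tiny stand-in for the results page (invented content).
my ($fh, $file) = tempfile(SUFFIX => '.html');
print $fh <<'END';
<html><body>
<a href="index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=1">one</a>
<a href="index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=2">two</a>
<a href="other.html">not a result</a>
</body></html>
END
close $fh;

my $mech = WWW::Mechanize->new();
$mech->get("file://$file");

# Mechanize has already parsed the page; just filter the links.
my @links = $mech->find_all_links(url_regex => qr/COMPLETEHREF=/);
print scalar(@links), " result links\n";

# Against the live site you would then loop:
#   $mech->get($_->url_abs) for @links;
```

So the answer is: Mechanize extracts the links and forms for you, but pulling the data fields out of each fetched detail page is still a job for an HTML parser such as HTML::TokeParser.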

        Well, I have to make some tests! And perhaps someone can set me straight here!

        BTW: you are right, Marshall, this is a "very huge government website that performs very well". I do not think that I will run into any trouble...

        After the first trials I will come back and report all my findings.

        Until soon!

        perlbeginner1
