Re^2: getting LWP and HTML::TokeParser to run

by Perlbeginner1 (Scribe)
on Oct 10, 2010 at 12:25 UTC ( [id://864474] )


in reply to Re: getting LWP and HTML::TokeParser to run
in thread getting started with LWP and HTML::TokeParser

Hello Marshall

Many thanks for the reply! I can do as you advised, and I can see the 5081 results.

Now I have to get the sub-pages. I have to "click" via LWP to follow these links - all 5081 of them.
And then I have to do the job with HTML::TreeBuilder or HTML::TokeParser!
I for one prefer HTML::TokeParser, since I know it a little bit.

I have very, very little experience with HTML::TokeParser, so I guess the parser part will be somewhat over my skills.

But first things first: how should the LWP part look!?

Any and all help will be greatly appreciated!

perlbeginner1

Re^3: getting LWP and HTML::TokeParser to run
by marto (Cardinal) on Oct 10, 2010 at 13:22 UTC

    Here is an example which uses WWW::Mechanize to visit the page, populate the field and submit the form. Error checking is left as an exercise for you; this is a short example to get you started:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
    my $mech = WWW::Mechanize->new();
    $mech->get( $url );
    $mech->field('einfache_suche','*');
    $mech->submit();
    # $mech->content now contains the results page.
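
    (Not part of marto's post: a minimal sketch of the error checking left as an exercise above, checking by hand via Mech's success/status methods; the form field name is taken from the example.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';

    # autocheck => 0 turns off Mech's built-in die-on-error, so we check each step ourselves
    my $mech = WWW::Mechanize->new( autocheck => 0 );

    $mech->get( $url );
    die "GET failed: ", $mech->response->status_line, "\n" unless $mech->success;

    $mech->field( 'einfache_suche', '*' );
    $mech->submit();
    die "Form submission failed: ", $mech->response->status_line, "\n" unless $mech->success;

    # $mech->content now contains the results page.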

    I can't read German, so you'd better check that you're not breaking any site policy regarding automation.

      I'm not an expert on robots.txt, but I would read http://www.kultusportal-bw.de/robots.txt as 'no agents allowed'.

      # robots.txt von 17-8 Uhr
      # email Sammler draussenbleiben
      User-agent:EmailCollector
      Disallow: /
      # Robots die durchdrehen fliegen raus
      User-agent: GagaRobot
      Disallow: /
      # Allow anything
      User-agent: *
      Disallow:
      Disallow: *ROOT=1161830$
      Disallow: */servlet/PB/-s/*

        Really, IIRC this looks as though only the user agents 'EmailCollector' and 'GagaRobot' are disallowed entirely. All other user agents are disallowed only from '*ROOT=1161830$' and '*/servlet/PB/-s/*', as specified under their '# Allow anything' comment.

        Update: In my previous post I was really warning about documented terms of use; as I said, I can't read German, so I am unable to tell if the site has any.

        Funny, now they serve a different robots.txt:

        # cat robots.txt.8-17
        # robots.txt Tagsueber von 8-17 Uhr
        # Disallow robots thru 17
        User-agent: kmcrawler
        Disallow:
        User-agent: *
        Disallow: /
        Disallow: *ROOT=1161830$
        Disallow: */servlet/PB/-s/*

        Apart from that, I'm not sure about the ROOT and servlet lines. They look like patterns and not like URL path prefixes. Robots don't have to implement pattern matching, and most probably don't, even if Google's does. So many robots may consider these lines junk and simply ignore them.

        With the 17-8 robots.txt, only EmailCollector and GagaRobot are excluded from the entire site, and all other robots are expected to avoid only URLs matching the ROOT and servlet patterns. Robots without a pattern-matching engine will see those two lines as junk and ignore them.

        With the 8-17 robots.txt, only kmcrawler is allowed; all other robots have to avoid the site.
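
        (Not from the thread: if the script should respect whichever robots.txt the site happens to be serving, WWW::RobotRules, which ships with libwww-perl, can do the interpretation for you. A rough sketch; the user-agent name is just an example:)

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use WWW::RobotRules;

        my $agent_name = 'pb1-spider/0.1';   # example name, pick your own
        my $robots_url = 'http://www.kultusportal-bw.de/robots.txt';

        my $ua  = LWP::UserAgent->new( agent => $agent_name );
        my $res = $ua->get($robots_url);
        die "Could not fetch robots.txt: ", $res->status_line, "\n" unless $res->is_success;

        # let WWW::RobotRules parse the file and answer allowed/disallowed questions
        my $rules = WWW::RobotRules->new($agent_name);
        $rules->parse( $robots_url, $res->decoded_content );

        my $target = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
        print $rules->allowed($target) ? "allowed\n" : "disallowed - don't fetch\n";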

        From the text fragments it is obvious that you are expected to spider only at night, and that you should behave: don't collect e-mail addresses, don't waste server resources, don't cause a large server load.

        There is an imprint claiming some (equivalents of) copyrights; in particular, non-private use of the layout and the content is prohibited, except for press releases. There is also a contact page that you should use when in doubt.


        Rough translations of the text fragments:

        von 17-8 Uhr
        from 17:00 to 08:00 (local time in Germany, I think)
        email Sammler draussenbleiben
        e-mail collector(s) stay outside
        Robots die durchdrehen fliegen raus
        robots running amok are kicked out
        Tagsueber von 8-17 Uhr
        during the day from 08:00 to 17:00

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^3: getting LWP and HTML::TokeParser to run
by Marshall (Canon) on Oct 10, 2010 at 14:22 UTC
    I would go with marto's advice about WWW::Mechanize. I haven't used it yet, but I hear that it is great. I suspect that you will find it easier to use than any advice I could give about decoding the raw HTML to get the next pages to "click" on. You are getting about 5K pages from a huge government website that performs very well. I wouldn't worry too much about fancy error recovery with retries unless you are going to run this program often.

    Update:
    You can of course parse the HTML content of the search results with regex, but this is a mess...

    my (@hrefs) = $mech->content =~ m|COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php\?id=\d+|g;
    print "$_\n" foreach @hrefs;   # there are 5081 of these

    # these COMPLETEHREF's can be appended to a main url like this:
    my $example_url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04146900';
    Then things get hairy and you will want to whip out some of that HTML parser voo-doo to parse the resulting table. Also, the character encodings aren't consistent: for example, the page contains a literal ä, but not a literal ü, which is coded as an HTML entity instead.
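
    (Not part of Marshall's post: since HTML::TokeParser was mentioned as the preferred parser, here is a rough sketch of pulling the detail links out of the results page with it instead of a regex. It assumes the COMPLETEHREF string appears inside the href attributes of anchor tags:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTML::TokeParser;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html');
    $mech->field( 'einfache_suche', '*' );
    $mech->submit();

    my $html = $mech->content;                  # the search results page
    my $p    = HTML::TokeParser->new( \$html ) or die "Cannot create parser\n";

    my @detail_urls;
    while ( my $token = $p->get_tag('a') ) {    # walk every <a> start tag
        my $href = $token->[1]{href} || '';
        push @detail_urls, $href if $href =~ /COMPLETEHREF=/;   # keep only detail-page links
    }
    print scalar(@detail_urls), " detail links found\n";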
      Hello Marto, hello Marshall,

      Many thanks for the hints. I am going to make some tests with Mechanize! I will use Mechanize instead of LWP!!

      By the way... I can read on CPAN:
      Features include:

      * All HTTP methods
      * High-level hyperlink and HTML form support, without having to parse HTML yourself
      * SSL support
      * Automatic cookies
      * Custom HTTP headers


      Mech supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited. (end of citation)
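
      (Not from the CPAN citation: a small sketch of what that feature looks like in practice - Mech parses the results page itself, and find_all_links() hands back the matching links without any manual HTML parsing; the COMPLETEHREF pattern is taken from Marshall's regex:)

      #!/usr/bin/perl
      use strict;
      use warnings;
      use WWW::Mechanize;

      my $mech = WWW::Mechanize->new();
      $mech->get('http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html');
      $mech->field( 'einfache_suche', '*' );
      $mech->submit();

      # Mech has already parsed the results page; just ask it for the links.
      my @links = $mech->find_all_links( url_regex => qr/COMPLETEHREF=/ );
      print "Found ", scalar(@links), " detail links\n";

      # each element is a WWW::Mechanize::Link object
      print $_->url_abs, "\n" for @links;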

      Well - does this mean that I do not have to do the parsing of a result page with HTML::TokeParser!? In other words, in the feature list I can read: "High-level hyperlink and HTML form support, without having to parse HTML yourself" - unbelievable!!!! Well, I cannot believe this! Does this mean that I do not have to parse the fetched HTML pages?

      Can I get the data set of each of the 5000 pages with Mechanize!?

      Well, I have to make some tests! And perhaps someone can set me straight here!!

      BTW: you, Marshall, are right: this is a "very huge government website that performs very well". I do not think that I will run into any trouble....

      After the first trials I will come back and report all my findings.

      See you soon!!

      perlbeginner!

        If your plan is to visit each of these five thousand or so links, please don't hammer the server.

        Well, as far as use policy goes, do check. When I run automated scripts, I do it at night during low-load times. And I often put in a sleep() after some number of requests to slow things down.

        One thing to investigate is whether or not this site provides the information that you need in an easier format than web pages. Many big sites do that. Some sites I use actually have a separate URL for automated requests, and even provide tools to use their more efficient computer-to-computer methods.

        On the other hand, this site has "bandwidth to burn". I don't think that they will notice 5,000 pages. But do testing with a small set of pages.
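
        (Not from the thread: roughly what a polite fetch loop could look like, combining the earlier find_all_links() idea with a small test set and a sleep(); the batch size and delay are arbitrary examples:)

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get('http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html');
        $mech->field( 'einfache_suche', '*' );
        $mech->submit();

        my @links = $mech->find_all_links( url_regex => qr/COMPLETEHREF=/ );
        @links = @links[ 0 .. 9 ] if @links > 10;   # test with a small set of pages first

        my $count = 0;
        for my $link (@links) {
            $mech->get( $link->url_abs );           # fetch one detail page
            # ... hand $mech->content to your parser here ...
            sleep 2 unless ++$count % 5;            # pause after every few requests
        }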
