
Re: getting LWP and HTML::TokeParser to run

by Marshall (Canon)
on Oct 10, 2010 at 11:09 UTC

in reply to getting started with LWP and HTML::TokeParser

Go to:

You will have to do a submit with a Suchbegriff (search term) of "*".
That will result in a page of 5081 results. To get the sub-pages you want, you will have to "click" via LWP or whatever to follow these links, all 5081 of them.

Start with trying to submit the search term of "*" on the main page and see if you can do that.
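
A minimal sketch of that first step with plain LWP might look like the following. The search URL is a placeholder and the field name 'einfache_suche' is borrowed from the WWW::Mechanize example further down, so both are assumptions to verify against the real form (which may also expect GET rather than POST):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder URL -- use the real search form's action URL here.
my $search_url = 'http://example.invalid/suche/index.html';

my $ua = LWP::UserAgent->new( agent => 'my-crawler/0.1' );

# Submit the form with a Suchbegriff of "*".
# 'einfache_suche' is the field name from the Mechanize example below;
# check the page source for the real name and any other required fields.
my $response = $ua->post( $search_url, { einfache_suche => '*' } );

die "Request failed: ", $response->status_line, "\n"
    unless $response->is_success;

my $html = $response->decoded_content;   # the page listing the 5081 results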

Re^2: getting LWP and HTML::TokeParser to run
by Perlbeginner1 (Scribe) on Oct 10, 2010 at 12:25 UTC
    Hello Marshall,

    Many thanks for the reply! I can do as you advised - I can see the 5081 results.

    Now I have to get the sub-pages. I have to "click" via LWP to follow these links - all 5081 of them.
    And then I have to do the job with HTML::TreeBuilder or HTML::TokeParser!
    I for one prefer HTML::TokeParser, since I know it a little.

    I have very, very little experience with HTML::TokeParser (not too much - so I guess that the parser part will be somewhat beyond my skills).

    But first things first: how should the LWP part look?

    Any and all help will be greatly appreciated!


      Here is an example which uses WWW::Mechanize to visit the page, populate the search field, and submit the form. Error checking is left as an exercise for you; this is a short example to get you started:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use WWW::Mechanize;

      my $url  = ' +dler_yno/index.html';
      my $mech = WWW::Mechanize->new();
      $mech->get( $url );
      $mech->field( 'einfache_suche', '*' );
      $mech->submit();
      # $mech->content now contains the results page.

      I can't read German, so you'd better check that you're not breaking any site policy regarding automation.

        I'm not an expert on robots.txt, but I would understand it as 'no agents allowed'.

        # robots.txt von 17-8 Uhr                (robots.txt from 17:00 to 08:00)
        # email Sammler draussenbleiben          (email collectors stay out)
        User-agent: EmailCollector
        Disallow: /

        # Robots die durchdrehen fliegen raus    (robots that run wild get thrown out)
        User-agent: GagaRobot
        Disallow: /

        # Allow anything
        User-agent: *
        Disallow:
        Disallow: *ROOT=1161830$
        Disallow: */servlet/PB/-s/*
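
        If you want to check programmatically what that robots.txt permits for your user-agent, a small sketch along these lines should work (WWW::RobotRules ships with libwww-perl; the base URL is a placeholder):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use WWW::RobotRules;

        # Placeholder base URL -- substitute the real site.
        my $base       = 'http://example.invalid';
        my $robots_url = "$base/robots.txt";

        # Parse the site's robots.txt for our agent name.
        my $rules = WWW::RobotRules->new('my-crawler/0.1');
        $rules->parse( $robots_url, get($robots_url) || '' );

        # Ask whether specific URLs may be fetched.
        for my $url ( "$base/index.html", "$base/servlet/PB/-s/whatever" ) {
            printf "%s => %s\n", $url, $rules->allowed($url) ? 'allowed' : 'disallowed';
        }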
      I would go with marto's advice about WWW::Mechanize. I haven't used it yet, but I hear that it is great. I suspect that you will find it easier to use than any advice I could give about decoding the raw HTML to get the next pages to "click" on. You are getting about 5K pages from a huge government website that performs very well. I wouldn't worry too much about fancy error recovery with retries unless you are going to run this program often.

      You can of course parse the HTML content of the search results with regex, but this is a mess...

      my (@hrefs) = $mech->content =~ m|COMPLETEHREF= +/did_abfrage/detail.php\?id=\d+|g;
      print "$_\n" foreach @hrefs;   # there are 5081 of these

      # these COMPLETEHREFs can be appended to a main URL like this:
      my $example_url = ' +27/index.html?COMPLETEHREF= +.php?id=04146900';
      Then things get hairy and you will want to whip out some of that HTML parser voodoo to parse the resulting table. Also, the character encodings are not consistent: some umlauts appear literally on the page (e.g. ü) while others are coded as HTML entities.
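
      For that table-parsing step, a rough HTML::TokeParser sketch could look like this; the table structure is an assumption, so the tags and nesting will need adjusting to the real detail pages:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use HTML::TokeParser;

      # $html would be $mech->content from a fetched detail page;
      # here it is read from STDIN just to keep the sketch self-contained.
      my $html = do { local $/; <STDIN> };

      my $p = HTML::TokeParser->new( \$html );

      # Walk to the first <table>, then print the cell text of each row.
      if ( $p->get_tag('table') ) {
          while ( my $tag = $p->get_tag( 'tr', '/table' ) ) {
              last if $tag->[0] eq '/table';
              my @cells;
              while ( my $cell = $p->get_tag( 'td', 'th', '/tr' ) ) {
                  last if $cell->[0] eq '/tr';
                  push @cells, $p->get_trimmed_text( '/td', '/th' );
              }
              print join( ' | ', @cells ), "\n" if @cells;
          }
      }

      As a side benefit, get_trimmed_text decodes HTML entities for you, which should help with the inconsistent umlaut coding mentioned above.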
        Hello marto, hello Marshall,

        Many thanks for the hints. I am going to run some tests with Mechanize - I will use Mechanize instead of LWP!

        By the way, I can read on CPAN:
        Features include:

        * All HTTP methods
        * High-level hyperlink and HTML form support, without having to parse HTML yourself
        * SSL support
        * Automatic cookies
        * Custom HTTP headers

        Mech supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited. (end of citation)

        Well - does this mean that I do not have to parse a result page with HTML::TokeParser? In other words, the feature list says "High-level hyperlink and HTML form support, without having to parse HTML yourself" - unbelievable! I can hardly believe it. Does this mean that I do not have to parse the fetched HTML pages at all?

        Can I get the data set of each of the 5000 pages with Mechanize?

        Well, I have to run some tests! And perhaps someone can set me straight here!

        BTW: You are right, Marshall: this is a "huge government website that performs very well." I do not think I will run into any trouble...

        After the first trials I will come back and report all my findings.

        Until soon!
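
        As a rough answer to the Mechanize question above: Mech will find and follow the result links for you, but the content of each detail page still has to be parsed separately (e.g. with HTML::TokeParser). A sketch under those assumptions, with a placeholder URL and a url_regex guessed from the detail.php pattern Marshall showed:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new( autocheck => 1 );

        # Placeholder start URL and field name -- adjust to the real site.
        $mech->get('http://example.invalid/suche/index.html');
        $mech->field( 'einfache_suche', '*' );
        $mech->submit();

        # Collect every result link on the current results page; the regex is
        # a guess based on the detail.php?id=... pattern shown earlier.
        my @links = $mech->find_all_links( url_regex => qr/detail\.php\?id=\d+/ );
        print scalar(@links), " result links found on this page\n";

        for my $link (@links) {
            $mech->get( $link->url_abs );
            my $html = $mech->content;   # still has to be parsed, e.g. with HTML::TokeParser
            # ... extract the data you need from $html here ...
            $mech->back;                 # go back to the results page
            sleep 1;                     # be polite to the server
        }

        If the 5081 results are spread over several result pages, you would also need an outer loop that follows the "next page" link before collecting more detail links.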

