PerlMonks
Re: getting LWP and HTML::TokeParser to run

by Marshall (Canon)
on Oct 10, 2010 at 11:09 UTC


in reply to getting started with LWP and HTML::TokeParser

Go to: http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html

You will have to submit the search form with a Suchbegriff (search term) of "*".
That will return a page of 5081 results. To get the sub-pages you want, you will have to "click" via LWP or whatever to follow these links, all 5081 of them.

Start with trying to submit the search term of "*" on the main page and see if you can do that.


Replies are listed 'Best First'.
Re^2: getting LWP and HTML::TokeParser to run
by Perlbeginner1 (Scribe) on Oct 10, 2010 at 12:25 UTC
    Hello Marshall,

    many thanks for the reply! I can do as you advised; I can see the 5081 results.

    Now I have to get the sub-pages. I have to "click" via LWP to follow these links, all 5081 of them.
    And then I have to do the job with HTML::TreeBuilder or HTML::TokeParser!
    I for one prefer HTML::TokeParser, since I know it a little bit.

    I have very, very little experience with HTML::TokeParser, so I guess that the parser part will be somewhat over my skills.

    But as first things come first: how should the LWP part look?

    Any and all help will be greatly appreciated!

    perlbeginner1

      Here is an example which uses WWW::Mechanize to visit the page, populate the field and submit the form. Error checking is left as an exercise for you; this is a short example to get you started:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use WWW::Mechanize;

      my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
      my $mech = WWW::Mechanize->new();
      $mech->get( $url );
      $mech->field('einfache_suche', '*');
      $mech->submit();
      # $mech->content now contains the results page.

      I can't read German, so you'd better check that you're not breaking any site policy regarding automation.

        I'm not an expert in robots.txt but I would understand http://www.kultusportal-bw.de/robots.txt as 'no agents allowed'.

        # robots.txt von 17-8 Uhr
        # email Sammler draussenbleiben
        User-agent: EmailCollector
        Disallow: /

        # Robots die durchdrehen fliegen raus
        User-agent: GagaRobot
        Disallow: /

        # Allow anything
        User-agent: *
        Disallow:
        Disallow: *ROOT=1161830$
        Disallow: */servlet/PB/-s/*
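As an aside, libwww-perl ships with WWW::RobotRules, which can answer the "am I allowed?" question programmatically. Here is a minimal, self-contained sketch using a simplified subset of the rules quoted above (the wildcard lines are omitted, and the agent name MyExampleBot is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::RobotRules;

# Simplified subset of the site's robots.txt quoted above:
# email collectors are banned, everyone else is allowed.
my $robots_txt = <<'END';
User-agent: EmailCollector
Disallow: /

User-agent: *
Disallow:
END

my $rules = WWW::RobotRules->new('MyExampleBot/1.0');   # made-up agent name
$rules->parse('http://www.kultusportal-bw.de/robots.txt', $robots_txt);

my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
print $rules->allowed($url) ? "allowed\n" : "disallowed\n";
```

With these rules a generic agent is allowed; swap in your own agent name and the live robots.txt before crawling.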
      I would go with marto's advice about WWW::Mechanize. I haven't used it yet, but I hear that it is great. I suspect that you will find it easier to use than any advice I could give about decoding the raw HTML to get the next pages to "click" on. You are getting about 5K pages from a huge government website that performs very well. I wouldn't worry too much about fancy error recovery with retries unless you are going to run this program often.

      Update:
      You can of course parse the HTML content of the search results with regex, but this is a mess...

      my @hrefs = $mech->content =~ m|COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php\?id=\d+|g;
      print "$_\n" foreach @hrefs;   # there are 5081 of these

      # these COMPLETEHREFs can be appended to a main URL like this:
      my $example_url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04146900';
      Then things get hairy and you will want to whip out some of that HTML parser voo-doo to parse the resulting table. Also, the character encoding isn't consistent: for example, an ü appears literally in some places but is encoded as an HTML entity elsewhere.
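Since the thread keeps coming back to HTML::TokeParser, here is a minimal self-contained sketch of the kind of table parsing meant above. The two-column layout and the field names (Schulname, Ort) are invented for illustration; the real detail pages must be inspected first:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up fragment standing in for one detail page's table.
my $html = <<'END';
<table>
  <tr><td>Schulname</td><td>Beispielschule</td></tr>
  <tr><td>Ort</td><td>Stuttgart</td></tr>
</table>
END

# TokeParser can read from a reference to a string.
my $p = HTML::TokeParser->new(\$html);

my %row;
while ($p->get_tag('tr')) {
    $p->get_tag('td');
    my $key = $p->get_trimmed_text('/td');   # first cell: label
    $p->get_tag('td');
    my $val = $p->get_trimmed_text('/td');   # second cell: value
    $row{$key} = $val;
}

print "$_ => $row{$_}\n" for sort keys %row;
```

The same get_tag / get_trimmed_text loop generalizes to any regular label/value table once you know the real markup.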
        Hello Marto, hello Marshall,

        many thanks for the hints. I am going to make some tests with Mechanize! I will use Mechanize instead of plain LWP!

        By the way, on CPAN I can read:
        Features include:

        * All HTTP methods
        * High-level hyperlink and HTML form support, without having to parse HTML yourself
        * SSL support
        * Automatic cookies
        * Custom HTTP headers


        Mech supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited. (end of citation)

        Well, does this mean that I do not have to parse a result page with HTML::TokeParser? In other words, the feature list says "High-level hyperlink and HTML form support, without having to parse HTML yourself". Unbelievable! I can hardly believe it! Does this mean that I do not have to parse the fetched HTML pages myself?

        Can I get the data set of each of the 5000 pages with Mechanize?
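To illustrate what "without having to parse HTML yourself" means in practice, here is a small self-contained sketch. It fetches a local stand-in page via a file:// URL (the file content is invented; the COMPLETEHREF link pattern is taken from the posts above) and collects the matching links with find_all_links:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use WWW::Mechanize;

# Tiny stand-in for the results page (invented content).
my ($fh, $file) = tempfile(SUFFIX => '.html');
print $fh <<'END';
<html><body>
<a href="index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=1">one</a>
<a href="index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=2">two</a>
<a href="other.html">not a result</a>
</body></html>
END
close $fh;

my $mech = WWW::Mechanize->new();
$mech->get("file://$file");

# Mechanize has already parsed the page; just filter the links.
my @links = $mech->find_all_links(url_regex => qr/COMPLETEHREF=/);
print scalar(@links), " result links\n";

# Against the live site you would then loop:
#   $mech->get($_->url_abs) for @links;
```

So the answer is: Mechanize extracts the links and forms for you, but pulling the data fields out of each fetched detail page is still a job for an HTML parser such as HTML::TokeParser.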

        Well, I have to make some tests! And perhaps someone can set me straight here!

        BTW: you are right, Marshall, this is a "very huge government website that performs very well". I do not think that I will run into any trouble...

        After the first trials I will come back and report all my findings.

        Until soon!

        perlbeginner1
