Re^2: getting LWP and HTML::TokeParser to run

by Perlbeginner1 (Scribe)
on Oct 10, 2010 at 12:25 UTC ( [id://864474] )


in reply to Re: getting LWP and HTML::TokeParser to run
in thread getting started with LWP and HTML::TokeParser

Hello Marshall

Many thanks for the reply! I can do as you advised, and I can see the 5081 results.

Now I have to get the sub-pages. I have to "click" via LWP to follow these links - all 5081 of them.
And then I have to do the job with HTML::TreeBuilder or HTML::TokeParser!
I for one prefer HTML::TokeParser, since I know it a little bit.

I have very, very little experience with HTML::TokeParser, so I guess the parser part will be somewhat over my skills.

But first things first: how should the LWP part look!?

Any and all help will be greatly appreciated!

perlbeginner1

Re^3: getting LWP and HTML::TokeParser to run
by marto (Cardinal) on Oct 10, 2010 at 13:22 UTC

    Here is an example which uses WWW::Mechanize to visit the page, populate the field and submit the form. Error checking is left as an exercise for you; this is a short example to get you started:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
    my $mech = WWW::Mechanize->new();
    $mech->get( $url );
    $mech->field('einfache_suche','*');
    $mech->submit();
    # $mech->content now contains the results page.
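
    (Not part of marto's post: a minimal sketch of the error checking left as an exercise above, checking by hand via Mech's success/status methods; the form field name is taken from the example.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';

    # autocheck => 0 turns off Mech's built-in die-on-error, so we check each step ourselves
    my $mech = WWW::Mechanize->new( autocheck => 0 );

    $mech->get( $url );
    die "GET failed: ", $mech->response->status_line, "\n" unless $mech->success;

    $mech->field( 'einfache_suche', '*' );
    $mech->submit();
    die "Form submission failed: ", $mech->response->status_line, "\n" unless $mech->success;

    # $mech->content now contains the results page.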

    I can't read German, so you'd better check that you're not breaking any site policy regarding automation.

      I'm not an expert on robots.txt, but I would read http://www.kultusportal-bw.de/robots.txt as 'no agents allowed'.

      # robots.txt von 17-8 Uhr
      # email Sammler draussenbleiben
      User-agent:EmailCollector
      Disallow: /
      # Robots die durchdrehen fliegen raus
      User-agent: GagaRobot
      Disallow: /
      # Allow anything
      User-agent: *
      Disallow:
      Disallow: *ROOT=1161830$
      Disallow: */servlet/PB/-s/*

        Really, IIRC this looks as though only the user agents 'EmailCollector' and 'GagaRobot' are disallowed entirely. All other user agents are disallowed only from '*ROOT=1161830$' and '*/servlet/PB/-s/*', as specified under their '# Allow anything' comment.

        Update: In my previous post I was really warning about documented terms of use; as I said, I can't read German, so I am unable to tell if the site has any.

        Funny, now they serve a different robots.txt:

        # cat robots.txt.8-17
        # robots.txt Tagsueber von 8-17 Uhr
        # Disallow robots thru 17
        User-agent: kmcrawler
        Disallow:
        User-agent: *
        Disallow: /
        Disallow: *ROOT=1161830$
        Disallow: */servlet/PB/-s/*

        Apart from that, I'm not sure about the ROOT and servlet lines. They look like patterns and not like URL path prefixes. Robots don't have to implement pattern matching, and most probably don't, even if Google's does. So many robots may consider these lines junk and simply ignore them.

        With the 17-8 robots.txt, only EmailCollector and GagaRobot are excluded from the entire site, and all other robots are expected to avoid only URLs matching the ROOT and servlet patterns. Robots without a pattern-matching engine will see those two lines as junk and ignore them.

        With the 8-17 robots.txt, only kmcrawler is allowed; all other robots have to avoid the site.
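
        (Not from the thread: if the script should respect whichever robots.txt the site happens to be serving, WWW::RobotRules, which ships with libwww-perl, can do the interpretation for you. A rough sketch; the user-agent name is just an example:)

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use WWW::RobotRules;

        my $agent_name = 'pb1-spider/0.1';   # example name, pick your own
        my $robots_url = 'http://www.kultusportal-bw.de/robots.txt';

        my $ua  = LWP::UserAgent->new( agent => $agent_name );
        my $res = $ua->get($robots_url);
        die "Could not fetch robots.txt: ", $res->status_line, "\n" unless $res->is_success;

        # let WWW::RobotRules parse the file and answer allowed/disallowed questions
        my $rules = WWW::RobotRules->new($agent_name);
        $rules->parse( $robots_url, $res->decoded_content );

        my $target = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
        print $rules->allowed($target) ? "allowed\n" : "disallowed - don't fetch\n";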

        From the text fragments it is obvious that you are expected to spider only at night, and that you should behave: don't collect e-mail addresses, don't waste server resources, don't cause a large server load.

        There is an imprint claiming some (equivalents of) copyrights; in particular, non-private use of the layout and the content is prohibited, except for press releases. There is also a contact page that you should use when in doubt.


        Rough translations of the text fragments:

        von 17-8 Uhr
        from 17:00 to 08:00 (local time in Germany, I think)
        email Sammler draussenbleiben
        e-mail collector(s) stay outside
        Robots die durchdrehen fliegen raus
        robots running amok are kicked out
        Tagsueber von 8-17 Uhr
        during the day from 08:00 to 17:00

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^3: getting LWP and HTML::TokeParser to run
by Marshall (Canon) on Oct 10, 2010 at 14:22 UTC
    I would go with marto's advice about WWW::Mechanize. I haven't used it yet, but I hear that it is great. I suspect that you will find it easier to use than any advice I could give about decoding the raw HTML to get the next pages to "click" on. You are getting about 5K pages from a huge government website that performs very well. I wouldn't worry too much about fancy error recovery with retries unless you are going to run this program often.

    Update:
    You can of course parse the HTML content of the search results with regex, but this is a mess...

    my (@hrefs) = $mech->content =~ m|COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php\?id=\d+|g;
    print "$_\n" foreach @hrefs;   # there are 5081 of these

    # these COMPLETEHREF's can be appended to a main url like this:
    my $example_url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04146900';
    Then things get hairy and you will want to whip out some of that HTML parser voo-doo to parse the resulting table. Also, the character encodings aren't consistent: for example, the page contains a literal ä, but not a literal ü, which is coded as an HTML entity instead.
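
    (Not part of Marshall's post: since HTML::TokeParser was mentioned as the preferred parser, here is a rough sketch of pulling the detail links out of the results page with it instead of a regex. It assumes the COMPLETEHREF string appears inside the href attributes of anchor tags:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTML::TokeParser;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html');
    $mech->field( 'einfache_suche', '*' );
    $mech->submit();

    my $html = $mech->content;                  # the search results page
    my $p    = HTML::TokeParser->new( \$html ) or die "Cannot create parser\n";

    my @detail_urls;
    while ( my $token = $p->get_tag('a') ) {    # walk every <a> start tag
        my $href = $token->[1]{href} || '';
        push @detail_urls, $href if $href =~ /COMPLETEHREF=/;   # keep only detail-page links
    }
    print scalar(@detail_urls), " detail links found\n";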
      Hello Marto, hello Marshall,

      Many thanks for the hints. I am going to make some tests with Mechanize! I will use Mechanize instead of LWP!!

      By the way... I can read on CPAN:
      Features include:

      * All HTTP methods
      * High-level hyperlink and HTML form support, without having to parse HTML yourself
      * SSL support
      * Automatic cookies
      * Custom HTTP headers


      Mech supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited. (end of citation)
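
      (Not from the CPAN citation: a small sketch of what that feature looks like in practice - Mech parses the results page itself, and find_all_links() hands back the matching links without any manual HTML parsing; the COMPLETEHREF pattern is taken from Marshall's regex:)

      #!/usr/bin/perl
      use strict;
      use warnings;
      use WWW::Mechanize;

      my $mech = WWW::Mechanize->new();
      $mech->get('http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html');
      $mech->field( 'einfache_suche', '*' );
      $mech->submit();

      # Mech has already parsed the results page; just ask it for the links.
      my @links = $mech->find_all_links( url_regex => qr/COMPLETEHREF=/ );
      print "Found ", scalar(@links), " detail links\n";

      # each element is a WWW::Mechanize::Link object
      print $_->url_abs, "\n" for @links;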

      Well - does this mean that I do not have to do the parsing of a result page with HTML::TokeParser!? In other words, in the feature list I can read: "High-level hyperlink and HTML form support, without having to parse HTML yourself" - unbelievable!!!! Well, I cannot believe this! Does this mean that I do not have to parse the fetched HTML pages?

      Can I get the data set of each of the 5000 pages with Mechanize!?

      Well, I have to make some tests! And perhaps someone can set me straight here!!

      BTW: you, Marshall, are right: this is a "very huge government website that performs very well". I do not think that I will run into any trouble....

      After the first trials I will come back and report all my findings.

      See you soon!!

      perlbeginner!

        If your plan is to visit each of these five thousand or so links, please don't hammer the server.

        Well, as far as use policy goes, do check. When I run automated scripts, I do it at night during low-load times. And I often put in a sleep() after some number of requests to slow things down.

        One thing to investigate is whether or not this site provides the information that you need in an easier format than web pages. Many big sites do that. Some sites I use actually have a separate URL for automated requests, and even provide tools to use their more efficient computer-to-computer methods.

        On the other hand, this site has "bandwidth to burn". I don't think that they will notice 5,000 pages. But do testing with a small set of pages.
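
        (Not from the thread: roughly what a polite fetch loop could look like, combining the earlier find_all_links() idea with a small test set and a sleep(); the batch size and delay are arbitrary examples:)

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get('http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html');
        $mech->field( 'einfache_suche', '*' );
        $mech->submit();

        my @links = $mech->find_all_links( url_regex => qr/COMPLETEHREF=/ );
        @links = @links[ 0 .. 9 ] if @links > 10;   # test with a small set of pages first

        my $count = 0;
        for my $link (@links) {
            $mech->get( $link->url_abs );           # fetch one detail page
            # ... hand $mech->content to your parser here ...
            sleep 2 unless ++$count % 5;            # pause after every few requests
        }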
