PerlMonks  

getting started with LWP and HTML::TokeParser

by Perlbeginner1 (Scribe)
on Oct 10, 2010 at 08:46 UTC ( #864461=perlquestion )

Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks, I have a problem with LWP and HTML::TokeParser.

I want to access a site that has a great many very similar pages, each with content of interest. For getting the content of a particular URL, the simplest way is to use LWP::Simple's functions.

In Perl we can call its get($url) function, which tries to fetch that URL's content. If it works, it returns the content; if there is an error, it returns undef.

So what is the problem? If you look at this page: http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html
and search for everything, you get a page with a list of links like these:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309

The IDs at the end run from 04126159 to roughly 0490000. Many of them are empty, so we would have to run from zero to 06000000 to get them all. In other words: to get all the pages we have to count the ID in the URL up from roughly 041000000 to 04999999, or even better to 06000000.
If I manage that, that is, counting up with LWP running well, then I need to parse the content with HTML::TokeParser, HTML::TreeBuilder, XML::LibXML or something like that in order to get the content out of the pages.
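
The counting idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the 8-digit zero padding is inferred from the example ID 04133309, the helper name detail_url and the one-second pause are made up here, and get() returning undef only signals a failed request, so an ID without a school may still return a page that has to be checked separately.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Base URL taken from the post; the numeric ID is appended to it.
my $BASE = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html'
         . '?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=';

# Build the detail-page URL for a numeric ID, zero-padded to 8 digits
# (assumption based on the example ID 04133309).
sub detail_url {
    my ($id) = @_;
    return $BASE . sprintf('%08d', $id);
}

# Usage: perl this_script.pl FIRST_ID LAST_ID
# Without two arguments nothing is fetched.
if (@ARGV == 2) {
    my ($first, $last) = map { int $_ } @ARGV;
    for my $id ($first .. $last) {
        my $url  = detail_url($id);
        my $html = get($url);
        unless (defined $html) {
            warn "skipping $url (request failed)\n";
            next;
        }
        printf "%08d: %d bytes\n", $id, length $html;
        sleep 1;    # pause between requests
    }
}
```

Called with, for example, 4133309 and 4133311 as arguments, this would request three detail pages one after another.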

This is the content I want from each page:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309

Allgemeine Daten der Schule / Behörde:



Schul-/Behördenname: Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule
Schulart: Öffentliche Schule (04139579)
Hausadressse: Ebersbacher Str. 20, 88361 Altshausen
Postfachadresse: Keine Angabe
Telefon: 07584/92270
Fax: 07584/922729
E-Mail: poststelle@04139579.schule.bwl.de
Internet: www.hpv-altshausen.de
Übergeordnete Dienststelle: Staatliches Schulamt Markdorf
Schulleitung: Mößle, Georg
Stellv. Schulleitung: Schneider, Cornelia
Anzahl Schüler: 456
Anzahl Klassen: 19
Anzahl Lehrer: 39
Kreis: Ravensburg
Schulträger: <kein Eintrag> (Ohne Zuordnung)




See a HTML-page - with the results:
    04126159 http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309
    <!-- WRAPPED CONTENT -->
    <table id="wrappedcontent">
    <tr><td>
    <br/>
    <br>
    <p><a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/schnellsuche.php">Schnellsuche</a>
     | <a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/maske.php">Erweiterte Suche</a>
     | <a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/hilfe.php">Hilfe</a><script language="javascript">
    document.write(' | <a href="javascript:history.back()">zur&uuml;ck zur Trefferliste</a>');
    </script>
    </p><h1>Allgemeine Daten der Schule / Beh&ouml;rde:</h1>&nbsp;<table border="0" bgcolor="#EFEFEF" leftmargin="15" topmargin="5">
    <tr> <td><strong>Schul-/Behördenname:</strong>&nbsp;</td> <td width=500> Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule </td></tr>
    <tr> <td><strong>Schulart:</strong>&nbsp;</td> <td width=500> Öffentliche Schule (04139579) </td></tr>
    <tr><td><strong>Hausadressse:</strong>&nbsp;</td><td>Ebersbacher Str. 20,&nbsp;88361&nbsp;Altshausen</td></tr>
    <tr> <td><strong>Postfachadresse:</strong>&nbsp;</td> <td> Keine Angabe </td></tr>
    <tr> <td><strong>Telefon:</strong>&nbsp;</td> <td> 07584/92270 </td></tr>
    <tr> <td><strong>Fax:</strong>&nbsp;</td> <td> 07584/922729 </td></tr>
    <tr> <td><strong>E-Mail:</strong>&nbsp;</td> <td> <a href="mailto:poststelle@04139579.schule.bwl.de" TARGET="_blank">poststelle@04139579.schule.bwl.de</a> </td></tr>
    <tr> <td><strong>Internet:</strong>&nbsp;</td> <td> <a href="http://www.hpv-altshausen.de" target="_blank">www.hpv-altshausen.de</a><br> </td></tr>
    <tr> <td><strong>&Uuml;bergeordnete Dienststelle:</strong>&nbsp;</td> <td> <a href="http://www.schulamt-markdorf.de" target="_blank">Staatliches Schulamt Markdorf </a><br> </td></tr>
    <tr> <td><strong>Schulleitung:</strong>&nbsp;</td> <td> M&ouml;&szlig;le, Georg </td></tr>
    <tr> <td><strong>Stellv. Schulleitung:</strong>&nbsp;</td> <td> Schneider, Cornelia </td> </td></tr>
    <tr> <td><strong>Anzahl Sch&uuml;ler:</strong>&nbsp;</td> <td> 456 </td></tr>
    <tr> <td><strong>Anzahl Klassen:</strong>&nbsp;</td> <td> 19 </td></tr>
    <tr> <td><strong>Anzahl Lehrer:</strong>&nbsp;</td> <td> 39 </td></tr>
    <tr> <td><strong>Kreis:</strong>&nbsp;</td> <td> Ravensburg </td></tr>
    <tr> <td><strong>Schulträger:</strong>&nbsp;</td> <td> &lt;kein Eintrag&gt; (Ohne Zuordnung) </td></tr>
    </table><!--<table border="0"> <tr> <td><br>
    <p>Die Adressdaten (Hausadresse, Postfachadresse, Telefon, Fax und Internet) werden vom Kultusministerium (Referat 15, Information und Kommunikation, Iuk-Verfahren in Schulen und Schulverwaltung) zur Verfügung gestellt - Änderungswünsche können Sie per E-Mail <a href="mailto:sc@schule.bwl.de?subject=Meldung service-bw-Schuladressdatenänderung">an das Service Center SVN</a> übermitteln. </p>
    <p>Für die Änderung aller anderen Angaben wenden Sie sich bitte an Ihre obere Schulaufsichtsbehörde. </p>
    <p>Die Schüler-, Lehrer- und Klassenzahlen beruhen auf Daten der letzten amtlichen Schulstatistik (Ende Januar).</p>//--><!-- </td> </tr></table>//-->
    </td></tr>
    </table>
    <!-- WRAPPED CONTENT END -->
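
As a rough sketch of how the label/value rows in the HTML above might be read with HTML::TokeParser: each label sits in a <strong> element and its value in the <td> that follows. The helper name parse_school_table is made up for this illustration, and the three sample rows are copied from the page source quoted above. get_trimmed_text decodes entities, so &uuml; comes back as ü (and &nbsp; as a no-break space, which is not treated specially here).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Return a list of [label, value] pairs, one per table row that has a
# <strong>label:</strong> cell followed by a <td>value</td> cell.
sub parse_school_table {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    my @pairs;
    while ($p->get_tag('strong')) {
        my $label = $p->get_trimmed_text('/strong');
        $label =~ s/:\z//;              # drop the trailing colon
        $p->get_tag('td') or last;      # the next <td> opens the value cell
        my $value = $p->get_trimmed_text('/td');
        push @pairs, [$label, $value];
    }
    return @pairs;
}

# Three rows copied from the page source quoted above.
my $sample = <<'HTML';
<table><tr> <td><strong>Telefon:</strong>&nbsp;</td> <td> 07584/92270 </td></tr>
<tr> <td><strong>Anzahl Sch&uuml;ler:</strong>&nbsp;</td> <td> 456 </td></tr>
<tr> <td><strong>Kreis:</strong>&nbsp;</td> <td> Ravensburg </td></tr></table>
HTML

binmode STDOUT, ':encoding(UTF-8)';
printf "%s => %s\n", @$_ for parse_school_table($sample);
```

Passing a reference to the fetched string (\$html) is what makes TokeParser parse the content; a plain scalar would be taken as a file name.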


This is what I have already:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use diagnostics;
    use LWP::Simple;
    use HTML::TokeParser;

    # Just an example: the URL where we have to count up. In order to get
    # all the pages we have to count the URL from somewhat 041000000 to
    # 04999999 or even better to 06000000.
    my $url = ' ';

    my $content = get $url;
    die "Couldn't get $url" unless defined $content;

    # Then go do things with $content, like this:
    # start a new parser job on the fetched content (a string reference;
    # a plain $url would be treated as a file name)
    my $p = HTML::TokeParser->new(\$content) or die "Can't parse $url";

    # find the tags 'xyz'
    while (my $tag = $p->get_tag('div', '/html')) {
        last if $tag->[0] eq '/html';
        print $p->get_trimmed_text('/div'), "\n";
    }

    # my output... !!
    my $out_file = './output.xml';


Dear Monks, can I go further with this approach? Any and all help is greatly appreciated! Your perlbeginner1

Replies are listed 'Best First'.
Re: getting LWP and HTML::TokeParser to run
by Marshall (Canon) on Oct 10, 2010 at 11:09 UTC
    Go to: http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html

    You will have to submit the form with a Suchbegriff (search term) of "*".
    That will result in a page of 5081 results. To get the sub-pages you want, you will have to "click" via LWP or whatever to follow these links, all 5081 of them.

    Start with trying to submit the search term of "*" on the main page and see if you can do that.

      Hello Marshall

      Many thanks for the reply! I can do as you advised. I can see the 5081 results.

      Now I have to get the sub-pages. I have to "click" via LWP to follow these links, all 5081 of them.
      And then I have to do the job with HTML::TreeBuilder or use HTML::TokeParser!
      I for one prefer HTML::TokeParser since I know it a little bit.

      I have very little experience with HTML::TokeParser, so I guess that the parser part will be somewhat beyond my skills.


      But first things first: how should the LWP part look?

      any and all help will be greatly appreciated!

      perlbeginner1

        Here is an example which uses WWW::Mechanize to visit the page, populate the field and submit the form. Error checking is left as an exercise for you; this is a short example to get you started:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use WWW::Mechanize;

            my $url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html';
            my $mech = WWW::Mechanize->new();
            $mech->get( $url );
            $mech->field('einfache_suche','*');
            $mech->submit();
            # $mech->content now contains the results page.

        I can't read German, so you'd better check that you're not breaking any site policy regarding automation.

        I would go with marto's advice about WWW::Mechanize. I haven't used it yet, but I hear that it is great. I suspect that you will find it easier to use than any advice I could give about decoding the raw HTML to get the next pages to "click" on. You are getting about 5K pages from a huge government website that performs very well. I wouldn't worry too much about fancy error recovery with retries unless you are going to run this program often.

        Update:
        You can of course parse the HTML content of the search results with regex, but this is a mess...

            my (@hrefs) = $mech->content =~ m|COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php\?id=\d+|g;
            print "$_\n" foreach @hrefs;
            # there are 5081 of these
            # these COMPLETEHREF's can be appended to a main url like this:
            my $example_url = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04146900';
        Then things get hairy and you will want to whip out some of that HTML parser voodoo to parse the resulting table. Also, the character encodings aren't consistent: for example, the page contains a literal ä, but ü is encoded as the entity &uuml;
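
As an alternative to the regex, the detail links could also be collected with HTML::TokeParser by walking the <a> tags and looking at each href. This is only a sketch: it assumes the results page carries the detail.php?id=... URLs inside ordinary href attributes, the helper name detail_ids is made up, and the sample markup below is invented for the demonstration (only the ID 04146900 comes from the thread).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the unique IDs found in detail.php links, in order of appearance.
sub detail_ids {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    my (%seen, @ids);
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href} or next;             # skip anchors without href
        next unless $href =~ m{detail\.php\?id=(\d+)};  # keep only detail links
        push @ids, $1 unless $seen{$1}++;               # drop duplicates
    }
    return @ids;
}

# Invented sample markup, shaped like the links described in the thread.
my $results = <<'HTML';
<p><a href="index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04146900">School A</a>
<a href="index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/hilfe.php">Hilfe</a>
<a href="index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04146900">School A again</a></p>
HTML

print "$_\n" for detail_ids($results);
```

With WWW::Mechanize the same function could be fed $mech->content after the form has been submitted.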
Re: getting started with LWP and HTML::TokeParser
by BrimBorium (Friar) on Oct 10, 2010 at 18:36 UTC

    I just want to advise you to check whether you are allowed to use your script on a government website. Just because it's possible does not always mean it's a good idea. Be sensitive about data privacy when automating web things. When using the Swiss army chainsaw, be aware you might cut your leg off if you're careless.

    BTW: I don't feel comfortable with people posting valid mail addresses and phone numbers instead of example data ... because I hate spam.

    Did you read Choosing a username? Do you really want to stay Perlbeginner1 your whole life? Just out of curiosity ;-)

      Hi there Brimborium

      Thanks for sharing your ideas!

      Since I am a teacher and have been working in the field of education for years, I know very well what I am doing! I have no qualms about parsing this governmental site.

      The data I am trying to get are publicly readable, so I am just mechanizing the reading...


      BTW, one word regarding the data: these are official address data, names and numbers of schools, nothing else.

      Just general address sets that contain nothing really sensitive!

      But again, thanks for sharing your ideas. BTW: what is wrong with my username? I am a beginner.

      regards - perl beginner1

        There is nothing wrong with your name, but you will stay a beginner forever, at least with your name...

        I just want to point out that you should choose the right way to do things, to avoid causing more damage than benefit. You can use a club to get a fly off your friend's shoulder, but he may not recover from your favour. You have a powerful tool with Perl; I just want to be sure that you use it wisely.

        Reading a lot of files in a short time from a public server could be misinterpreted... if you are a teacher you should be aware of the consequences ... you might kill the server with a buggy script. I just want to prevent you from having to say "Oh, I did not WANT that, it was really not my intention".

        A phonebook is also available to the public, but I dislike the idea of having it machine-readable for dialing bots ...

        I have been a software developer for many years, and if you're using example or test code on real data on a public server, you really do not know what you are doing, from my point of view.
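
One way to address the concern about hammering a public server is LWP::RobotUA from the libwww-perl distribution, which consults the site's robots.txt and enforces a pause between requests to the same host. The following is a minimal sketch, not a recommendation for specific values: the agent name, the contact address and the five-second delay are placeholders that do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Build a user agent that honours robots.txt and waits between requests.
# Agent name and contact address are placeholders; use your own.
sub make_polite_ua {
    my $ua = LWP::RobotUA->new(
        agent => 'school-list-fetcher/0.1',
        from  => 'you@example.com',
    );
    $ua->delay(5 / 60);    # delay is given in minutes: about 5 seconds per host
    $ua->timeout(30);      # give up on a request after 30 seconds
    return $ua;
}

# Usage: perl this_script.pl URL [URL ...]
# Without arguments nothing is fetched.
my $ua = make_polite_ua();
for my $url (@ARGV) {
    my $res = $ua->get($url);
    if ($res->is_success) {
        printf "%s: %d bytes\n", $url, length($res->decoded_content // '');
    }
    else {
        warn "$url: ", $res->status_line, "\n";
    }
}
```

Because the waiting happens inside the user agent, the fetching loop itself stays as simple as with a plain LWP::UserAgent.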
