 
PerlMonks  

Dear Perl Monks, I have a problem with LWP and HTML::TokeParser.

I want to access a URL under which there are many very similar pages with content of interest. For this job, getting the content from a particular URL, the simplest way is to use LWP::Simple's functions.

In Perl, we can call LWP::Simple's get($url) function, which tries to fetch that URL's content. If it works, it returns the content; if there is any error, it returns undef.
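Because get() reports every failure the same way, with undef, a loop over many pages can simply skip the undefined results. A minimal sketch of that idea follows. The helper name fetch_all and the example.invalid URLs are made up for illustration, and the getter is passed in as a code reference so the logic can be tried offline with a stand-in; in real use the getter would be \&LWP::Simple::get.

```perl
use strict;
use warnings;

# fetch_all is a hypothetical helper: it calls $getter->($url) for each URL
# and keeps only the pages that came back defined.
sub fetch_all {
    my ($getter, @urls) = @_;
    my %page_for;
    for my $url (@urls) {
        my $content = $getter->($url);
        if (!defined $content) {
            warn "Couldn't get $url, skipping\n";
            next;
        }
        $page_for{$url} = $content;
    }
    return \%page_for;
}

# Real use (needs LWP::Simple installed and network access):
#   use LWP::Simple ();
#   my $pages = fetch_all(\&LWP::Simple::get, @urls);

# Offline demonstration: a stand-in getter that fails for one URL.
my $fake_get = sub {
    my ($url) = @_;
    return $url =~ /missing/ ? undef : "<html>$url</html>";
};

my $pages = fetch_all(
    $fake_get,
    'http://example.invalid/a',
    'http://example.invalid/missing',
);

print "$_\n" for sort keys %$pages;
```

Only the URL that returned content ends up in the hash; the failed one is reported on STDERR and skipped.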

So here is the problem. If you open this page: http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html
and press "all", you get a page with lines of links like this:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309


The ids at the end run from 04126159 to roughly 04900000. Many of them are empty, so to get all of them we have to run from zero to 06000000. In other words: in order to get all the pages, we have to count the id in the URL from roughly 04100000 to 04999999, or even better up to 06000000.
If I can get this counting to work and LWP runs well, then I need to parse the content with HTML::TokeParser, HTML::TreeBuilder, XML::LibXML, or something like that, in order to get the content out of the pages.
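The counting part could be sketched as follows. This is only an illustration: url_for_id is a made-up helper name, and the base URL is the detail link quoted above with the id cut off. One thing to watch out for is that the ids carry a leading zero. Writing 04126159 as a number in Perl source would be read as an octal literal, so the sketch counts with plain integers and lets sprintf('%08d', ...) put the zero back.

```perl
use strict;
use warnings;

# Base of the detail URL, copied from the post; the id is appended to it.
my $base = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html'
         . '?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=';

# url_for_id is a hypothetical helper name. sprintf '%08d' pads the number
# to eight digits, which restores the leading zero of the id.
sub url_for_id {
    my ($id) = @_;
    return $base . sprintf('%08d', $id);
}

# Demonstration with a tiny range. A real run would use the bounds from the
# post (for example 4100000 .. 4999999) and should pause between requests.
for my $id (4126159 .. 4126161) {
    print url_for_id($id), "\n";
}
```

Each generated URL can then be handed to get(), and ids whose page turns out to be empty can be skipped.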

This is the content I want from each page:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309

Allgemeine Daten der Schule / Behörde:



Schul-/Behördenname: Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule
Schulart: Öffentliche Schule (04139579)
Hausadressse: Ebersbacher Str. 20, 88361 Altshausen
Postfachadresse: Keine Angabe
Telefon: 07584/92270
Fax: 07584/922729
E-Mail: poststelle@04139579.schule.bwl.de
Internet: www.hpv-altshausen.de
Übergeordnete Dienststelle: Staatliches Schulamt Markdorf
Schulleitung: Mößle, Georg
Stellv. Schulleitung: Schneider, Cornelia
Anzahl Schüler: 456
Anzahl Klassen: 19
Anzahl Lehrer: 39
Kreis: Ravensburg
Schulträger: <kein Eintrag> (Ohne Zuordnung)




Here is an HTML page with the results:
<code>
04126159
http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309
<!-- WRAPPED CONTENT -->
<table id="wrappedcontent">
<tr><td>
<br/>
<br>
<p><a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/schnellsuche.php">Schnellsuche</a> | <a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/maske.php">Erweiterte Suche</a> | <a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/hilfe.php">Hilfe</a><script language="javascript">
document.write(' | <a href="javascript:history.back()">zur&uuml;ck zur Trefferliste</a>');
</script>
</p><h1>Allgemeine Daten der Schule / Beh&ouml;rde:</h1>&nbsp;<table border="0" bgcolor="#EFEFEF" leftmargin="15" topmargin="5"><tr>
<td><strong>Schul-/Behördenname:</strong>&nbsp;</td> <td width=500> Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule </td></tr><tr>
<td><strong>Schulart:</strong>&nbsp;</td> <td width=500> Öffentliche Schule (04139579) </td></tr><tr>
<td><strong>Hausadressse:</strong>&nbsp;</td><td>Ebersbacher Str. 20,&nbsp;88361&nbsp;Altshausen</td></tr><tr>
<td><strong>Postfachadresse:</strong>&nbsp;</td> <td> Keine Angabe </td></tr><tr>
<td><strong>Telefon:</strong>&nbsp;</td> <td> 07584/92270 </td></tr><tr>
<td><strong>Fax:</strong>&nbsp;</td> <td> 07584/922729 </td></tr><tr>
<td><strong>E-Mail:</strong>&nbsp;</td> <td> <a href="mailto:poststelle@04139579.schule.bwl.de" TARGET="_blank">poststelle@04139579.schule.bwl.de</a> </td></tr><tr>
<td><strong>Internet:</strong>&nbsp;</td> <td> <a href="http://www.hpv-altshausen.de" target="_blank">www.hpv-altshausen.de</a><br> </td></tr><tr>
<td><strong>&Uuml;bergeordnete Dienststelle:</strong>&nbsp;</td> <td> <a href="http://www.schulamt-markdorf.de" target="_blank">Staatliches Schulamt Markdorf </a><br> </td></tr><tr>
<td><strong>Schulleitung:</strong>&nbsp;</td> <td> M&ouml;&szlig;le, Georg </td></tr><tr>
<td><strong>Stellv. Schulleitung:</strong>&nbsp;</td> <td> Schneider, Cornelia </td> </td></tr><tr>
<td><strong>Anzahl Sch&uuml;ler:</strong>&nbsp;</td> <td> 456 </td></tr><tr>
<td><strong>Anzahl Klassen:</strong>&nbsp;</td> <td> 19 </td></tr><tr>
<td><strong>Anzahl Lehrer:</strong>&nbsp;</td> <td> 39 </td></tr><tr>
<td><strong>Kreis:</strong>&nbsp;</td> <td> Ravensburg </td></tr><tr>
<td><strong>Schulträger:</strong>&nbsp;</td> <td> &lt;kein Eintrag&gt; (Ohne Zuordnung) </td></tr></table><!--<table border="0">
<tr>
<td><br><p>Die Adressdaten (Hausadresse, Postfachadresse, Telefon, Fax und Internet) werden vom Kultusministerium (Referat 15, Information und Kommunikation, Iuk-Verfahren in Schulen und Schulverwaltung) zur Verfügung gestellt - Änderungswünsche können Sie per E-Mail <a href="mailto:sc@schule.bwl.de?subject=Meldung service-bw-Schuladressdatenänderung">an das Service Center SVN</a> übermitteln. </p><p>Für die Änderung aller anderen Angaben wenden Sie sich bitte an Ihre obere Schulaufsichtsbehörde. </p><p>Die Schüler-, Lehrer- und Klassenzahlen beruhen auf Daten der letzten amtlichen Schulstatistik (Ende Januar).</p>//--><!-- </td>
</tr></table>//-->
</td></tr>
</table>
<!-- WRAPPED CONTENT END -->
</code>
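In the markup above, every label sits in a <strong> element and its value in the table cell that follows. Under that assumption, the label/value pairs could be pulled out with HTML::TokeParser roughly as below. This is a sketch, not a finished scraper: extract_fields is a made-up helper name, and the HTML in the example is a shortened copy of a few rows from the page above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# A shortened copy of a few rows of the detail table shown in the post.
my $html = <<'HTML';
<table border="0" bgcolor="#EFEFEF"><tr>
<td><strong>Telefon:</strong>&nbsp;</td> <td> 07584/92270 </td></tr><tr>
<td><strong>Internet:</strong>&nbsp;</td> <td> <a href="http://www.hpv-altshausen.de" target="_blank">www.hpv-altshausen.de</a><br> </td></tr><tr>
<td><strong>Anzahl Lehrer:</strong>&nbsp;</td> <td> 39 </td></tr></table>
HTML

# extract_fields is a hypothetical helper. It assumes each label is inside
# <strong> and the matching value is the text of the next <td>.
sub extract_fields {
    my ($doc) = @_;

    # A scalar reference makes the parser read the string itself;
    # a plain string would be taken as a file name.
    my $p = HTML::TokeParser->new(\$doc) or die "Can't parse the HTML";

    my %field;
    while ($p->get_tag('strong')) {
        my $label = $p->get_trimmed_text('/strong');
        $label =~ s/:\z//;                    # drop the trailing colon
        $p->get_tag('td') or last;            # move on to the value cell
        $field{$label} = $p->get_trimmed_text('/td');
    }
    return \%field;
}

my $fields = extract_fields($html);
print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

The same helper could be applied to the content returned by get() for each id, with one hash per school.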


This is what I have already:
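<code>
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use LWP::Simple;
use HTML::TokeParser;

# Placeholder: the detail-page URL goes here. To get all the pages, the id
# at the end has to be counted up from roughly 04100000 to 04999999, or
# even better to 06000000.
my $url = ' ';

my $content = get($url);
die "Couldn't get $url" unless defined $content;

# Start a new parser job on the fetched content. A plain string passed to
# new() is treated as a file name, so pass a reference to the content.
my $p = HTML::TokeParser->new(\$content)
    or die "Can't parse the content of $url";

# My output file.
my $out_file = './output.xml';
open my $out, '>', $out_file or die "Can't write $out_file: $!";

# Find the tags of interest: each label is in <strong>, its value in the next <td>.
while (my $tag = $p->get_tag('strong')) {
    my $label = $p->get_trimmed_text('/strong');
    $p->get_tag('td') or last;
    my $value = $p->get_trimmed_text('/td');
    print {$out} "$label $value\n";
}
close $out;
</code>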


Dear Monks, can I go further with this approach? Any and all help is greatly appreciated! Your Perlbeginner1


In reply to getting started with LWP and HTML::TokeParser by Perlbeginner1
