Dear Perl Monks, I have a problem with LWP and HTML::TokeParser

I want to fetch a URL, and the site behind it has many very similar pages with content of interest. For this job, getting the content of a particular URL, the simplest way is to use LWP::Simple's functions.

LWP::Simple exports a get($url) function. It tries to fetch that URL's content. If that works, it returns the content; if there is any error, it returns undef.

So here is the problem. If you open this page: http://www.kultusportal-bw.de/servlet/PB/menu/1188427_pfhandler_yno/index.html
and press "all", you get a page with a list of links like these:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309

The ids at the end run from 04126159 to somewhat 0490000. Many of them are empty, so we have to run from zero to 06000000 to get them all. In other words: in order to get all the pages we have to count the id in the URL up from somewhat 041000000 to 04999999, or even better to 06000000.
If I manage to count up like this and LWP runs well, then I need to parse the content with HTML::TokeParser, HTML::TreeBuilder, XML::LibXML or something like that, in order to get the content out of the pages.
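Counting the URLs up could be sketched roughly as below. This is only a sketch under assumptions: it assumes the id is always eight digits with leading zeros (as in 04133309), and the names $base and detail_url are made up for illustration. Note that the ids must be handled as plain integers and padded with sprintf, because a literal such as 04133309 would be read by Perl as an (invalid) octal number.

```perl
use strict;
use warnings;

# Base of the detail URL as it appears in the post; the id is appended.
my $base = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html'
         . '?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=';

# Hypothetical helper: build the detail URL for a numeric id.
# Assumption: ids are zero-padded to eight digits.
sub detail_url {
    my ($n) = @_;
    return $base . sprintf('%08d', $n);
}

# Tiny demo range; the real run would cover the whole id range.
for my $n (4133309 .. 4133311) {
    print detail_url($n), "\n";
}
```

For the full run only the bounds of the loop change; the range operator does not build the whole list in memory when used in a foreach loop.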

This is the content I want from each page:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309

Allgemeine Daten der Schule / Behörde:



Schul-/Behördenname: Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule
Schulart: Öffentliche Schule (04139579)
Hausadressse: Ebersbacher Str. 20, 88361 Altshausen
Postfachadresse: Keine Angabe
Telefon: 07584/92270
Fax: 07584/922729
E-Mail: poststelle@04139579.schule.bwl.de
Internet: www.hpv-altshausen.de
Übergeordnete Dienststelle: Staatliches Schulamt Markdorf
Schulleitung: Mößle, Georg
Stellv. Schulleitung: Schneider, Cornelia
Anzahl Schüler: 456
Anzahl Klassen: 19
Anzahl Lehrer: 39
Kreis: Ravensburg
Schulträger: <kein Eintrag> (Ohne Zuordnung)




Here is the HTML of one such page, with the results:
04126159 http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04133309
<!-- WRAPPED CONTENT -->
<table id="wrappedcontent">
<tr><td>
<br/> <br>
<p><a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/schnellsuche.php">Schnellsuche</a> | <a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/maske.php">Erweiterte Suche</a> | <a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/hilfe.php">Hilfe</a><script language="javascript"> document.write(' | <a href="javascript:history.back()">zur&uuml;ck zur Trefferliste</a>'); </script> </p>
<h1>Allgemeine Daten der Schule / Beh&ouml;rde:</h1>&nbsp;<table border="0" bgcolor="#EFEFEF" leftmargin="15" topmargin="5">
<tr> <td><strong>Schul-/Behördenname:</strong>&nbsp;</td> <td width=500> Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule </td></tr>
<tr> <td><strong>Schulart:</strong>&nbsp;</td> <td width=500> Öffentliche Schule (04139579) </td></tr>
<tr><td><strong>Hausadressse:</strong>&nbsp;</td><td>Ebersbacher Str. 20,&nbsp;88361&nbsp;Altshausen</td></tr>
<tr> <td><strong>Postfachadresse:</strong>&nbsp;</td> <td> Keine Angabe </td></tr>
<tr> <td><strong>Telefon:</strong>&nbsp;</td> <td> 07584/92270 </td></tr>
<tr> <td><strong>Fax:</strong>&nbsp;</td> <td> 07584/922729 </td></tr>
<tr> <td><strong>E-Mail:</strong>&nbsp;</td> <td> <a href="mailto:poststelle@04139579.schule.bwl.de" TARGET="_blank">poststelle@04139579.schule.bwl.de</a> </td></tr>
<tr> <td><strong>Internet:</strong>&nbsp;</td> <td> <a href="http://www.hpv-altshausen.de" target="_blank">www.hpv-altshausen.de</a><br> </td></tr>
<tr> <td><strong>&Uuml;bergeordnete Dienststelle:</strong>&nbsp;</td> <td> <a href="http://www.schulamt-markdorf.de" target="_blank">Staatliches Schulamt Markdorf </a><br> </td></tr>
<tr> <td><strong>Schulleitung:</strong>&nbsp;</td> <td> M&ouml;&szlig;le, Georg </td></tr>
<tr> <td><strong>Stellv. Schulleitung:</strong>&nbsp;</td> <td> Schneider, Cornelia </td> </td></tr>
<tr> <td><strong>Anzahl Sch&uuml;ler:</strong>&nbsp;</td> <td> 456 </td></tr>
<tr> <td><strong>Anzahl Klassen:</strong>&nbsp;</td> <td> 19 </td></tr>
<tr> <td><strong>Anzahl Lehrer:</strong>&nbsp;</td> <td> 39 </td></tr>
<tr> <td><strong>Kreis:</strong>&nbsp;</td> <td> Ravensburg </td></tr>
<tr> <td><strong>Schulträger:</strong>&nbsp;</td> <td> &lt;kein Eintrag&gt; (Ohne Zuordnung) </td></tr></table>
<!--<table border="0"> <tr> <td><br><p>Die Adressdaten (Hausadresse, Postfachadresse, Telefon, Fax und Internet) werden vom Kultusministerium (Referat 15, Information und Kommunikation, Iuk-Verfahren in Schulen und Schulverwaltung) zur Verfügung gestellt - Änderungswünsche können Sie per E-Mail <a href="mailto:sc@schule.bwl.de?subject=Meldung service-bw-Schuladressdatenänderung">an das Service Center SVN</a> übermitteln. </p><p>Für die Änderung aller anderen Angaben wenden Sie sich bitte an Ihre obere Schulaufsichtsbehörde. </p><p>Die Schüler-, Lehrer- und Klassenzahlen beruhen auf Daten der letzten amtlichen Schulstatistik (Ende Januar).</p>//--><!-- </td> </tr></table>//-->
</td></tr>
</table>
<!-- WRAPPED CONTENT END -->
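Given markup like the above, one possible way to pull out the label/value pairs with HTML::TokeParser is sketched below. It relies on an assumption taken only from this one sample page: every label sits in a <strong> element and its value is in the next <td>. The names parse_school and clean are hypothetical helpers, and the demo fragment is a shortened copy of three rows from the page.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: turn non-breaking spaces into plain spaces,
# collapse runs of whitespace and trim both ends.
sub clean {
    my ($text) = @_;
    $text =~ s/\x{A0}/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Hypothetical helper: return a hash ref of label => value for one page.
# Assumption: each label is inside <strong>, its value in the following <td>.
sub parse_school {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse HTML";
    my %record;
    while ($p->get_tag('strong')) {
        my $label = clean($p->get_text('/strong'));
        $label =~ s/:$//;                    # drop the trailing colon
        $p->get_tag('td') or last;           # move on to the value cell
        my $value = clean($p->get_text('/td'));
        $record{$label} = $value if length $label;
    }
    return \%record;
}

# Demo on a shortened fragment of the sample page.
my $sample = <<'HTML';
<table><tr><td><strong>Hausadressse:</strong>&nbsp;</td><td>Ebersbacher Str. 20,&nbsp;88361&nbsp;Altshausen</td></tr>
<tr> <td><strong>Telefon:</strong>&nbsp;</td> <td> 07584/92270 </td></tr>
<tr> <td><strong>Kreis:</strong>&nbsp;</td> <td> Ravensburg </td></tr></table>
HTML

my $record = parse_school($sample);
print "$_ = $record->{$_}\n" for sort keys %$record;
```

Note that the parser is given a reference to the HTML string. For pages fetched over the network, labels with umlauts (such as Schulträger) depend on the content being decoded correctly first, which this sketch does not handle.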


This is what I have already:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use LWP::Simple;
use HTML::TokeParser;

# Just an example: the URL where we have to count up. In order to get all
# the pages we have to count the URL from somewhat 041000000 to 04999999,
# or even better to 06000000.
my $url = ' ';

my $content = get $url;
die "Couldn't get $url" unless defined $content;

# Then go do things with $content, like this:
# start a new parser job. HTML::TokeParser->new takes a file name or a
# reference to a string, not a URL, so pass the fetched content.
my $p = HTML::TokeParser->new(\$content) or die "Can't parse content of $url";

# find the tags 'xyz'
while (my $tag = $p->get_tag('div', '/html')) {
    last if $tag->[0] eq '/html';
    # extract the data following $tag here
}

# my output... !!
my $out_file = './output.xml';
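Going further with this approach, the fetch loop over all ids might look something like the sketch below. Several things here are assumptions, not facts about the site: that an "empty" id can be recognised by the missing Schulart label, that one request per second is an acceptable rate, and the helper names make_fetcher, has_record and crawl, which are invented for illustration. The fetcher is passed in as a code reference so the loop can be tried out without touching the network.

```perl
use strict;
use warnings;

my $BASE = 'http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html'
         . '?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=';

# Hypothetical helper: returns a code ref that fetches one detail page
# and yields the decoded HTML, or undef on any HTTP error.
sub make_fetcher {
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new(timeout => 20);
    return sub {
        my ($id) = @_;
        my $res = $ua->get($BASE . $id);
        return $res->is_success ? $res->decoded_content : undef;
    };
}

# Assumption: a page without the "Schulart:" label holds no record.
sub has_record {
    my ($html) = @_;
    return (defined($html) && $html =~ /<strong>Schulart:/) ? 1 : 0;
}

# Walk the ids from $first to $last, keep only pages with a record,
# and pause $delay seconds between requests.
sub crawl {
    my ($first, $last, $fetch, $delay) = @_;
    my %pages;
    for my $n ($first .. $last) {
        my $id   = sprintf '%08d', $n;
        my $html = $fetch->($id);
        $pages{$id} = $html if has_record($html);
        sleep $delay if $delay;
    }
    return \%pages;
}

# Offline demo with a fake fetcher; a real run would be something like
#   my $pages = crawl(4126159, 4126259, make_fetcher(), 1);
my $fake = sub {
    my ($id) = @_;
    return $id eq '04133309' ? '<strong>Schulart:</strong>' : '<html></html>';
};
my $pages = crawl(4133308, 4133310, $fake, 0);
print join(',', sort keys %$pages), "\n";   # prints 04133309
```

Counting through several million ids at one request per second takes weeks, so it may be worth checking first whether the "all" result list already contains every valid id.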


Dear Monks, can I go further with this approach? Any and all help is greatly appreciated! Your perlbeginner1


In reply to getting started with LWP and HTML::TokeParser by Perlbeginner1
