Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Good day, dear monks,

I am trying to do a parsing job with HTML::TreeBuilder::XPath. I tried to find the positions with XPather, but that was a bit too heavy, so I decided to do it with a simple example...



use strict;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);

my ($name)        = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type)        = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress)      = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two)  = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone)   = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax)         = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees)   = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker)      = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});

print $name->as_text;
print $type->as_text;
print $adress->as_text;
print $adress_two->as_text;
print $telephone->as_text;
print $fax->as_text;
print $internet->as_text;
print $officer->as_text;
print $employees->as_text;
print $offices->as_text;
print $worker->as_text;
print $country->as_text;
print $the_council->as_text;

is this all right ? See one of the example sites: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488


In the grey-shaded block you can see the wanted information: 17 lines. Note: I have 5000 different HTML files that are all structured in the very same way!

That means i would be happy to have a template that can be runned with HTML::TokeParser::Simple and DBI. That would be great!!

love to hear from you

pb1

Replies are listed 'Best First'.
Re: Xpather running against a simple HTML-document - testing and evaluation
by ww (Archbishop) on Oct 16, 2010 at 14:31 UTC
    1. "i would be happy to have a template that can be runned with"
      Sorry, this is NOT a free-coding service. But some of those who responded to your prior post might be "happy" to take on that job for $$$
    2. "a simple example... is this all right ?"
      Around here, we favor the practice "Try it and see; then bring a question (if any) with specifics of what you tried and how it failed."
    3. This new thread appears to be a follow-up to "Re^4: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..". At a minimum, you could make life easier for those who would help (for free) by referencing that... and indeed, the entire previous thread. And some would argue that this should be merely a new node in that thread.
Re: Xpather running against a simple HTML-document - testing and evaluation
by morgon (Priest) on Oct 16, 2010 at 16:35 UTC
    My earlier example, which you reuse here, uses an XPath that works against the transformed HTML (using only line 999, remember).

    If you want to parse the original HTML, you of course have to use a different XPath.

    Which path to use is easy enough to determine with XPather (which is NOT a Perl module but a Firefox plugin): right-click on what you want and select "Show in XPather".

    I won't write the whole script for you but if you want you can hire me...

      Good evening morgon and ww, and hello dear Perl Monks! Great to hear from you!


      I am parsing this page: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

      I now use some HTML::TableExtract magic, taking advantage of the table's background color to select it. Conclusion: the parsing job can be considered done! Now I have to work on saving the data to the database.


      Here is the code that does all the work:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use HTML::TableExtract;
      use YAML;

      my $te = HTML::TableExtract->new(
          attribs => {
              border     => 0,
              bgcolor    => '#EFEFEF',
              leftmargin => 15,
              topmargin  => 5,
          },
      );
      $te->parse_file('kultus-bw.html');

      my ($table) = $te->tables;
      for my $row ( $table->rows ) {
          cleanup(@$row);
          print "@$row\n";
      }

      sub cleanup {
          for ( @_ ) {
              s/\s+//;
              s/[\xa0 ]+\z//;
              s/\s+/ /g;
          }
      }




      The output:

      Schul-/Behördenname: Abendgymnasium Ostwürttemberg
      Schulart: Privatschule (04313488)
      Hausadressse: Friedrichstr.70, 73430 Aalen
      Postfachadresse: Keine Angabe
      Telefon: 07361/680040
      Fax: 07361/680040
      E-Mail: Keine Angabe
      Internet: www.abendgymnasium-ostwuerttemberg.de
      ÜbergeordneteDienststelle: Regierungspräsidium Stuttgart Abteilung 7 Schule und Bildung
      Schulleitung: Keine Angabe
      Stellv.Schulleitung: Keine Angabe
      AnzahlSchüler: 259
      AnzahlKlassen: 8
      AnzahlLehrer: Keine Angabe
      Kreis: Ostalbkreis
      Schulträger: <Verband/Verein> (Verband/Verein)
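
A note on the output above: labels such as "AnzahlKlassen:" and "Stellv.Schulleitung:" appear without a space. One likely cause is the first substitution in cleanup, `s/\s+//`, which is not anchored and therefore removes the first run of whitespace wherever it occurs in the cell. The sketch below compares that behaviour with an anchored variant. It assumes the original cell text was "Anzahl Klassen:" with no leading whitespace, which the thread does not show; the names `cleanup_as_posted` and `cleanup_anchored` are made up for illustration.

```perl
use strict;
use warnings;

# Same three substitutions as the posted cleanup(), applied to one string.
# The first one is unanchored, so it deletes the first whitespace run
# wherever it occurs, not only at the start of the string.
sub cleanup_as_posted {
    my ($text) = @_;
    $text =~ s/\s+//;
    $text =~ s/[\xa0 ]+\z//;
    $text =~ s/\s+/ /g;
    return $text;
}

# Hypothetical variant: \A anchors the first substitution to the start
# of the string, so only leading whitespace is stripped.
sub cleanup_anchored {
    my ($text) = @_;
    $text =~ s/\A\s+//;
    $text =~ s/[\xa0 ]+\z//;
    $text =~ s/\s+/ /g;
    return $text;
}

my $label = "Anzahl Klassen:";    # assumed original cell text
print cleanup_as_posted($label), "\n";
print cleanup_anchored($label),  "\n";
```

If the cells do carry leading whitespace, the anchored form still strips it, and the inner spaces of the labels are kept.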



      See the HTML::TableExtract that does the magic!

      Also, the documentation for HTML::TableExtract is available. It does what it says it does: Extracts specific tables from HTML source code. And it does that really well.
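
Since the next step mentioned is saving to a database, it may help to collect the two-column rows into a hash first. The following is only a sketch under assumptions: the inline HTML is a made-up miniature of the grey table (the real page is not reproduced in this thread), the two sample rows are copied from the output above, and `%record` is a name invented here. It uses the `attribs` constructor option and the `tables` and `rows` methods of HTML::TableExtract, as in the code above.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up miniature of the grey table; the real page layout is an assumption.
my $html = <<'HTML';
<table border="0" bgcolor="#EFEFEF">
<tr><td>Telefon:</td><td>07361/680040</td></tr>
<tr><td>Kreis:</td><td>Ostalbkreis</td></tr>
</table>
HTML

# Select the table by its background color, as in the script above.
my $te = HTML::TableExtract->new( attribs => { bgcolor => '#EFEFEF' } );
$te->parse($html);
$te->eof;

# Turn each "label: value" row into one hash entry.
my %record;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $label, $value ) = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for $label, $value;    # trim both ends
        $label =~ s/:$//;                     # drop the trailing colon
        $record{$label} = $value;
    }
}

print "$_ => $record{$_}\n" for sort keys %record;
```

A hash keyed by label keeps the later database step independent of the row order on the page.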

      BTW: I want (need) to do this with another table/site:


      see this page: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=672.8924536341191

      Note: tick all checkboxes at the bottom of the page; you then get a results page with more than 6400 schools.
      On the right of each result there is a link "Weitere Informationen anzeigen" ("show further information"); clicking it gives detailed information:

      9 (or 10) lines:

      Schuldaten.
      Schulnummer:
      Amtliche Bezeichnung:
      Strasse:
      Plz und Ort:
      Telefon:
      Fax:
      E-Mail-Adresse:
      Schuldaten ändern]  (is this UTF-8 encoded, or what?)
      Schülergesamtzahl (is this UTF-8 encoded, or what?)


      Question: can HTML::TableExtract be applied here too, on the results page with more than 6400 schools? http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=672.8924536341191


      I guess so!
Re: Xpather running against a simple HTML-document - testing and evaluation
by Anonymous Monk on Oct 16, 2010 at 14:00 UTC
    is this all right ?

    Doesn't appear to be; they are all the same wrong path.
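
To illustrate the point about identical paths: each field needs its own row index in the XPath, otherwise every variable receives the same node. The sketch below is not the path for the real page (that still has to be determined, for example with XPather); it runs against a made-up three-row table, and the field names and the `%record` hash are assumptions introduced here. It uses `findvalue` from HTML::TreeBuilder::XPath, which returns the text of the matched nodes.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up document; the structure of the real page is an assumption.
my $html = <<'HTML';
<html><body><table>
<tr><td>Name:</td><td>Example School</td></tr>
<tr><td>Type:</td><td>Private</td></tr>
<tr><td>Phone:</td><td>01234/567890</td></tr>
</table></body></html>
HTML

# One field name per table row, in page order (hypothetical names).
my @fields = qw(name type telephone);

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

my %record;
for my $i ( 0 .. $#fields ) {
    my $row = $i + 1;    # XPath positions start at 1
    # The row index changes per field; the value is in the second cell.
    $record{ $fields[$i] } = $tree->findvalue("//table//tr[$row]/td[2]");
}
$tree->delete;

print "$_: $record{$_}\n" for @fields;
```

Driving the row index from a list of field names also replaces the thirteen near-identical findnodes lines with a single loop.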