Xpather running against a simple HTML-document

Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Good day, dear monks,

I am trying to write a parser using HTML::TreeBuilder::XPath. I tried to find the node positions with XPather, but that turned out to be too difficult, so I decided to start with a simple example:

use strict;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name)       = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type)       = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone)  = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax)      = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet)   = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees)  = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});


print $name->as_text;
print $type->as_text;
print $adress->as_text;
print $adress_two->as_text;
print $telephone->as_text;
print $fax->as_text;
print $internet->as_text;
print $officer->as_text;
print $employees->as_text;
print $offices->as_text;
print $worker->as_text;
print $country->as_text;
print $the_council->as_text;

Is this all right? See one of the example pages: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

In the grey-shaded block you can see the information I want: 17 lines. Note: I have 5000 different HTML files, and all of them are structured in exactly the same way.

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI. That would be great!

love to hear from you

pb1


Replies are listed 'Best First'.
Re: Xpather running against a simple HTML-document - testing and evaluation by ww (Archbishop) on Oct 16, 2010 at 14:31 UTC

"I would be happy to have a template that can be run with"

Sorry, this is NOT a free-coding service. But some of those who responded to your prior post might be "happy" to take on that job for $$$.

"a simple example... Is this all right?"

Around here, we favor the practice "Try it and see; then bring a question (if any) with specifics of what you tried and how it failed."

This new thread appears to be a follow-up to "Re^4: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..". At a minimum, you could make life easier for those who would help (for free) by referencing that, and indeed the entire previous thread. And some would argue that this should be merely a new node in that thread.
Re: Xpather running against a simple HTML-document - testing and evaluation by morgon (Priest) on Oct 16, 2010 at 16:35 UTC
My earlier example, which you reuse here, uses an XPath that works against the transformed HTML (using only line 999, remember). If you want to parse the original HTML, you of course have to use a different XPath.

Which path to use is easy enough to determine with XPather (which is NOT a Perl module but a Firefox plugin): right-click on what you want and select "Show in XPather".

I won't write the whole script for you, but if you want you can hire me...
Re^2: Xpather running against a simple HTML-document - testing and evaluation by Perlbeginner1 (Scribe) on Oct 16, 2010 at 19:32 UTC
Good evening morgon and ww, and hello dear Perl Monks! Great to hear from you!

For parsing this page: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488 I now use some HTML::TableExtract magic, taking advantage of the table's attributes, in particular its background color.

Conclusion: the parsing job can be considered done. Now I have to work on saving the data to the database. Here is the code that does all the work:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new(
    attribs => {
        border     => 0,
        bgcolor    => '#EFEFEF',
        leftmargin => 15,
        topmargin  => 5,
    },
);
$te->parse_file('kultus-bw.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
    cleanup(@$row);
    print "@$row\n";
}

sub cleanup {
    for ( @_ ) {
        s/\s+//;
        s/[\xa0 ]+\z//;
        s/\s+/ /g;
    }
}

See the output:

Schul-/Behördenname: Abendgymnasium Ostwürttemberg
Schulart: Privatschule (04313488)
Hausadressse: Friedrichstr.70, 73430 Aalen
Postfachadresse: Keine Angabe
Telefon: 07361/680040
Fax: 07361/680040
E-Mail: Keine Angabe
Internet: www.abendgymnasium-ostwuerttemberg.de
ÜbergeordneteDienststelle: Regierungspräsidium Stuttgart Abteilung 7 Schule und Bildung
Schulleitung: Keine Angabe
Stellv.Schulleitung: Keine Angabe
AnzahlSchüler: 259
AnzahlKlassen: 8
AnzahlLehrer: Keine Angabe
Kreis: Ostalbkreis
Schulträger: <Verband/Verein> (Verband/Verein)

HTML::TableExtract does the magic here! Its documentation is also available. It does what it says it does: it extracts specific tables from HTML source code, and it does that really well.
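A side note on the cleanup routine above: labels such as "AnzahlSchüler" and "Stellv.Schulleitung" in the output suggest that the first substitution, s/\s+//, removes the first run of whitespace wherever it occurs rather than only leading whitespace. The following is a minimal sketch, assuming the intent was to trim both ends and collapse inner whitespace; the sample row is made up for illustration and is not taken from the real page.

```perl
use strict;
use warnings;

# Trim leading/trailing whitespace (including non-breaking spaces, \xa0)
# and collapse inner whitespace runs to a single space.
# The arguments are modified in place because @_ aliases the caller's values.
sub cleanup {
    for (@_) {
        s/\A[\s\xa0]+//;    # anchored at the start: leading whitespace only
        s/[\s\xa0]+\z//;    # anchored at the end: trailing whitespace only
        s/\s+/ /g;          # collapse every remaining whitespace run
    }
}

# Hypothetical sample row, not real data from the page
my @row = ( "  Anzahl   Klassen:\xa0", " 8 " );
cleanup(@row);
print join( '|', @row ), "\n";
```

With the \A anchor the space inside a label is kept, so a cell like "Anzahl Klassen:" would stay readable.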
BTW: I want (need) to do this with another table/site. See this page: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=672.8924536341191

Note: tick all the checkboxes at the bottom of the page. You then get a result page with more than 6400 schools. On the right-hand side, "Weitere Informationen anzeigen" (show further information) leads to the detailed information: nine (or ten) lines of "Schuldaten" (school data).

Schulnummer:
Amtliche Bezeichnung:
Strasse:
Plz und Ort:
Telefon:
Fax:
E-Mail-Adresse:
Schuldaten ändern (is this UTF-8 encoded, or what?)
Schülergesamtzahl (is this UTF-8 encoded, or what?)

Question: can HTML::TableExtract be applied here too, to the result page of more than 6400 schools: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=672.8924536341191 I guess so!
Re: Xpather running against a simple HTML-document - testing and evaluation by Anonymous Monk on Oct 16, 2010 at 14:00 UTC
"Is this all right?"

It doesn't appear to be: they all use the same wrong path.
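To illustrate the point about the identical paths, here is a minimal sketch of reading a different row for each field with HTML::TreeBuilder::XPath. The inline HTML and its field names are hypothetical stand-ins, not the structure of the real page; the actual paths still have to be worked out against the original files. Keep in mind that a path copied from a browser tool may contain elements (tbody, for example) that the browser added and that are not in the raw HTML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical two-column label/value table, standing in for one real file
my $html = <<'HTML';
<html><body><table>
<tr><td>Name:</td><td>Example School</td></tr>
<tr><td>Type:</td><td>Private</td></tr>
<tr><td>Telephone:</td><td>01234/5678</td></tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Walk the rows instead of hard-coding one path per field:
# the first cell is taken as the label, the second as the value.
my %record;
for my $row ( $tree->findnodes('//table//tr') ) {
    my ( $label, $value ) = map { $_->as_text } $row->findnodes('./td');
    next unless defined $value;    # skip rows without two cells
    $label =~ s/:\s*\z//;          # drop the trailing colon
    $record{$label} = $value;
}

print "$_ = $record{$_}\n" for sort keys %record;
$tree->delete;
```

If fixed positions are preferred, the same idea can be written with one index per field, for example tr[1]/td[2] for the name and tr[2]/td[2] for the type, again assuming the table really has that layout.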