
Hello and good morning, dear Monks,

First of all: I have learned a lot here! Yesterday Morgon gave me some hints on working with XPather. Now I am trying to take some first steps and apply what I have learned.

I am currently working on a parser script. I have to parse all the detail pages of this site: <url>http://www.educa.ch/dyn/79362.asp?action=search#0</url> There are several ways to do it. I have to strip away a lot of markup and keep only the text data from each page. The detail page is very simple, see: <url>http://www.educa.ch/dyn/79376.asp?id=1187</url> The output I want:

Altes Schulhaus Ossingen
Guntibachstrasse 10
8475  Ossingen
sekretariat.psossingen@bluewin.ch
Tel:052 317 15 45
Fax:052 317 04 42

So I need a small Perl script to get these six lines of text out of the HTML page. If I can parse one page, I can do it for all of the 5000 to 6000 available pages, and I have to parse all of them. I am sure Perl can do this job with ease. As for how: personally I like HTML::TreeBuilder::XPath, which has to be installed from CPAN. Here is how we would extract the name from one of the saved files with it:

Note: I am not sure which XPath expressions I have to use. See my attempts below.

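use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

# use the real file name here
open(my $fh, "<", "file.html") or die "Cannot open file.html: $!";

$tree->parse_file($fh);
close($fh);

# findnodes returns a list; take the first matching node
my ($name) = $tree->findnodes(q{/html/body/table/tr[1]/td[2]});
die "No node matched the XPath expression\n" unless $name;

print $name->as_text, "\n";

$tree->delete;    # free the parse tree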

As we can see, we simply use an XPath expression to identify the node we want.
So how do we determine that expression?

I tried a Firefox plugin called XPather, which lets us click on an HTML element and extract the corresponding XPath.
So we load the file we want to parse in Firefox, click on the item we want, copy the XPath, and use it in the Perl script.
I am not sure that I used XPather correctly. I tried to find the expressions for the following detail page, which is very simple:
http://www.educa.ch/dyn/79376.asp?id=1187
The full search page is here:
http://www.educa.ch/dyn/79363.asp?action=search#62

Below are the expressions I found with XPather. Are they really the expressions that will help me parse the detail page mentioned above?

/html/body/div[3]/text()
/html/body/div[4]/text()
/html/body/div[6]/text()
/html/body/div[7]/text()
/html/body/div[9]/a/text()
/html/body/div[10]/text()
/html/body/div[11]/text()[1]
/html/body/div[11]/text()[2]
/html/body/div[12]/text()[1]
/html/body/div[12]/text()[2]
/html/body/div[13]/text()
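
One way to check expressions like these is to evaluate each one against a saved copy of the page and look at what comes back. This is only a minimal sketch: probe_xpaths is a hypothetical helper name, the file name is a placeholder, and the tree that HTML::TreeBuilder builds may not match the DOM that Firefox shows, so the indices reported by XPather may need adjusting.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Return a hash mapping each XPath expression to the string value it selects.
# (probe_xpaths is a hypothetical helper, not part of any module.)
sub probe_xpaths {
    my ($html, @xpaths) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);
    my %found;
    for my $xpath (@xpaths) {
        $found{$xpath} = $tree->findvalue($xpath);
    }
    $tree->delete;    # free the parse tree
    return %found;
}

# Usage: perl probe.pl saved_page.html   (file name is a placeholder)
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    my @xpaths = (
        '/html/body/div[3]/text()',
        '/html/body/div[4]/text()',
        '/html/body/div[9]/a/text()',
    );
    my %found = probe_xpaths($html, @xpaths);
    printf "%-30s => '%s'\n", $_, $found{$_} for @xpaths;
}
```

An expression that prints an empty string or the wrong line is a hint that its index is off by one or more.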

Here is the HTML code of http://www.educa.ch/dyn/79376.asp?id=1187:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
+www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" con
+tent="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de">
+<title>educa.ch</title><meta http-equiv="Content-Type" content="text/
+html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><scri
+pt src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin
+="0" marginwidth="0" marginheight="0" onload="check();"><table cellsp
+acing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" 
+class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></t
+d><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz<
+/td><td width="20" class="popuphead" valign="middle"><a href="#" titl
+e="Print" onclick="window.print(); return false;"><img src="../pics/p
+rint16x13.gif" alt="Drucken" width="16" height="13"></a></td><td widt
+h="20" class="popuphead" valign="middle"><a href="#" title="close" on
+click="window.close(); return false;"><img src="../pics/close21x13.gi
+f" alt="Schliessen" width="21" height="13"></a></td></tr>


<tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="
+1" height="1"></td></tr></table><div class="leerzeile">&#160;</div><d
+iv class="leerzeile"><img src="/0.gif" alt="" width="15"height="8">Al
+tes Schulhaus Ossingen    </div><div class="leerzeile">&#160;</div><d
+iv><img src="/0.gif" alt="" width="15" height="8">Guntibachstrasse 10
+</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div>
+<img src="/0.gif" alt="" width="15" height="8">8475 &#160;Ossingen</d
+iv><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" w
+idth="15" height="8"><a href="" target="_blank"></a></div><div><img s
+rc="/0.gif" alt="" width="15" height="8"><a href="mailto: sekretariat
+.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div><d
+iv class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width=
+"15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">052
+ 317 15 45 </div><div><img src="/0.gif" alt="" width="15" height="8">
+Fax:<img src="/0.gif" alt="" width="4" height="8">052 317 04 42 </div
+><div>&#160;</div></body></html>

If I can identify the right XPath expressions for this page (http://www.educa.ch/dyn/79376.asp?id=1187), then I can do the job!
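
Since all the wanted text sits in div elements directly under body, one possible approach is to skip the per-line expressions and walk every /html/body/div, keeping the non-empty text. The sketch below runs against a trimmed-down copy of the HTML shown above. It assumes every detail page uses the same div layout, which I have not verified, and extract_lines is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed-down copy of the detail page (assumption: all pages share this layout).
my $html = <<'HTML';
<html><body>
<table><tr><td class="popuphead">Adresse - Schulen in der Schweiz</td></tr></table>
<div class="leerzeile">&#160;</div>
<div class="leerzeile"><img src="/0.gif" alt="">Altes Schulhaus Ossingen    </div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="">Guntibachstrasse 10</div>
<div><img src="/0.gif" alt=""></div>
<div><img src="/0.gif" alt="">8475 &#160;Ossingen</div>
<div><img src="/0.gif" alt=""><a href="mailto: sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div>
<div><img src="/0.gif" alt="">Tel:<img src="/0.gif" alt="">052 317 15 45 </div>
<div><img src="/0.gif" alt="">Fax:<img src="/0.gif" alt="">052 317 04 42 </div>
<div>&#160;</div>
</body></html>
HTML

# Collect the text of every top-level div, skipping spacer divs.
# (extract_lines is a hypothetical helper, not part of any module.)
sub extract_lines {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($content);
    my @lines;
    for my $div ($tree->findnodes('/html/body/div')) {
        my $text = $div->as_text;
        $text =~ s/\x{A0}/ /g;        # &#160; decodes to a non-breaking space
        $text =~ s/\s+/ /g;           # collapse runs of whitespace
        $text =~ s/^\s+|\s+$//g;      # trim both ends
        push @lines, $text if length $text;
    }
    $tree->delete;    # free the parse tree
    return @lines;
}

print "$_\n" for extract_lines($html);
```

The spacer divs contain only a non-breaking space or an image, so they end up empty after trimming and are dropped.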

Note: if I can do it for one page, I can do it for more than 5000, since I have to parse all of them. ;-) We see that there are three tasks:

a. fetching the pages
b. parsing them
c. storing the results in a database

For the first task we can use LWP::UserAgent or WWW::Mechanize, for the second an HTML parser such as HTML::Parser, and for the third we need some knowledge of DBI.
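
To make the three tasks concrete, here is a rough sketch of how they might fit together. Several things in it are my own assumptions and not facts about the site: that the detail pages can be reached by changing the id in the URL, that DBD::SQLite is installed, and the database file name schools.db, the table name schools, and all helper names. For the parsing step it uses HTML::TreeBuilder::XPath, as in the script above.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use DBI;

# Build the detail-page URL for one id (URL pattern taken from the page above).
sub detail_url {
    my ($id) = @_;
    return "http://www.educa.ch/dyn/79376.asp?id=$id";
}

# a. fetch one page; returns the decoded HTML, or undef on failure
sub fetch_page {
    my ($ua, $id) = @_;
    my $res = $ua->get(detail_url($id));
    return $res->is_success ? $res->decoded_content : undef;
}

# b. parse: keep the non-empty text of every top-level div
sub parse_page {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);
    my @lines;
    for my $div ($tree->findnodes('/html/body/div')) {
        (my $text = $div->as_text) =~ s/[\s\x{A0}]+/ /g;   # collapse whitespace
        $text =~ s/^ | $//g;                               # trim both ends
        push @lines, $text if length $text;
    }
    $tree->delete;
    return @lines;
}

# c. store the lines of one page as a single text field
sub store_lines {
    my ($dbh, $id, @lines) = @_;
    $dbh->do('INSERT OR REPLACE INTO schools (id, raw_text) VALUES (?, ?)',
             undef, $id, join("\n", @lines));
}

# Runs only when ids are given, e.g.: perl fetch.pl 1187
if (@ARGV) {
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS schools
              (id INTEGER PRIMARY KEY, raw_text TEXT)');
    for my $id (@ARGV) {
        my $html = fetch_page($ua, $id) or next;   # skip pages that fail
        store_lines($dbh, $id, parse_page($html));
        sleep 1;                                   # be polite to the server
    }
    $dbh->disconnect;
}
```

Storing the raw lines first keeps the fetching separate from the question of which line is the name, the street, and so on. That split can be done later without downloading the pages again.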

In reply to HTML::TreeBuilder:: identifing xpath-expression - first attempt by Perlbeginner1
