in reply to Re: HTML::TreeBuilder:: identifing xpath-expression - first attempt
in thread HTML::TreeBuilder:: identifing xpath-expression - first attempt

With help from htmltreexpather.pl - xpath helper, creates xpath search strings from html
#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

#~ $XML::XPathEngine::DEBUG = 1;
my $tree = HTML::TreeBuilder::XPath->new;

# the page source, pasted in as a here-document
$tree->parse_content(<<'__HTML__');
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de">
<title>educa.ch</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="101.htm"><script src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();">
<table cellspacing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></td>
<td width="99%" class="popuphead">Adresse - Schulen in der Schweiz</td>
<td width="20" class="popuphead" valign="middle"><a href="#" title="Print" onclick="window.print(); return false;"><img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"></a></td>
<td width="20" class="popuphead" valign="middle"><a href="#" title="close" onclick="window.close(); return false;"><img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"></a></td></tr>
<tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="1" height="1"></td></tr></table>
<div class="leerzeile">&#160;</div>
<div class="leerzeile"><img src="/0.gif" alt="" width="15"height="8">Altes Schulhaus Ossingen </div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8">Guntibachstrasse 10</div>
<div><img src="/0.gif" alt="" width="15" height="8"></div>
<div><img src="/0.gif" alt="" width="15" height="8">8475 &#160;Ossingen</div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8"><a href="" target="_blank"></a></div>
<div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto: sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">052 317 15 45 </div>
<div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8">052 317 04 42 </div>
<div>&#160;</div></body></html>
__HTML__

# you can delete html/body
for my $query ( qw!
    /html/body/div[2]
    /html/body/div[4]
    /html/body/div[6]
    /html/body/div[9]
    /html/body/div[11]
    /html/body/div[12]
! ) {
    # print each query followed by the text it selects
    print $query,"\n",$tree->findvalue($query),"\n\n";
}
__END__
/html/body/div[2]
Altes Schulhaus Ossingen

/html/body/div[4]
Guntibachstrasse 10

/html/body/div[6]
8475  Ossingen

/html/body/div[9]
sekretariat.psossingen@bluewin.ch

/html/body/div[11]
Tel:052 317 15 45

/html/body/div[12]
Fax:052 317 04 42
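
The positional queries above only fit pages whose div elements come in exactly this order. As a rough sketch (not part of htmltreexpather.pl), some fields could probably be selected by content instead, using XPath 1.0 string functions with findvalue. The cut-down HTML sample is an assumption made for illustration: it keeps the shape of the page above but drops the spacer images and empty divs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Assumption: a cut-down sample in the shape of the page above.
my $html = <<'__HTML__';
<html><body>
<div class="leerzeile">&#160;</div>
<div class="leerzeile">Altes Schulhaus Ossingen </div>
<div><a href="mailto: sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div>
<div>Tel:052 317 15 45 </div>
<div>Fax:052 317 04 42 </div>
</body></html>
__HTML__

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Select by content rather than by position in the page.
my $email = $tree->findvalue('//a[starts-with(@href, "mailto:")]');
my $tel   = $tree->findvalue('//div[starts-with(., "Tel:")]');
my $fax   = $tree->findvalue('//div[starts-with(., "Fax:")]');

# Drop the "Tel:" / "Fax:" label and any trailing blanks.
s/^\w+:\s*|\s+$//g for $tel, $fax;

print "email: $email\ntel: $tel\nfax: $fax\n";
$tree->delete;    # free the parse tree
```

Whether this is more robust than div[9] and div[11] depends on how uniform the real pages are, so it is worth checking against a handful of them first.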

Re^3: HTML::TreeBuilder:: identifing xpath-expression - first attempt
by Perlbeginner1 (Scribe) on Oct 17, 2010 at 13:35 UTC
    Hello Ken (kcott), viveksnv, Khen1950fx, and hello Anonymous Monk,

    This is a great place for learning. I am so happy about the answers. They show me that this community is alive and great at lending a helping hand.

    This is a great experience!

    I will read all the answers later, since I have to leave the house at the moment.

    I will come back later today.
    Meanwhile, many thanks to all!


    update



    Well, if I am able to identify the XPath expressions for this page, http://www.educa.ch/dyn/79376.asp?id=1187, then I am able to do the job!

    Note: if I can do it for one page, I can do it for more than 5000, since I have to parse all of them ;-) As we can see, there are three tasks:


    a. fetching the pages
    b. parsing them
    c. storing the results in a database



    For the first task we can use LWP::UserAgent or WWW::Mechanize; for the second task we can use HTML::Parser; for the third task we need some knowledge of DBI.
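
The three tasks could be wired together roughly as follows. This is only a sketch under several assumptions: the field names, the `schools` table, the `schools.db` SQLite file (which needs DBD::SQLite) and the sub names are all made up for illustration, and the XPath queries are the positional ones from the script earlier in this thread, so they only fit pages with exactly that layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use DBI;

# Hypothetical field names mapped to the queries from the script above.
my %xpath_for = (
    name   => '/html/body/div[2]',
    street => '/html/body/div[4]',
    city   => '/html/body/div[6]',
    email  => '/html/body/div[9]',
    tel    => '/html/body/div[11]',
    fax    => '/html/body/div[12]',
);

# a. fetch one page; dies on an HTTP error
sub fetch_page {
    my ( $ua, $url ) = @_;
    my $res = $ua->get($url);
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    return $res->decoded_content;
}

# b. parse the HTML and return a hash ref of field => text
sub parse_address {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);
    my %row;
    for my $field ( keys %xpath_for ) {
        my $value = $tree->findvalue( $xpath_for{$field} );
        $value =~ s/\x{A0}/ /g;      # non-breaking spaces to plain spaces
        $value =~ s/^\s+|\s+$//g;    # trim both ends
        $row{$field} = $value;       # tel/fax still carry their "Tel:"/"Fax:" label
    }
    $tree->delete;                   # free the parse tree
    return \%row;
}

# c. store one row using placeholders
sub store_address {
    my ( $dbh, $url, $row ) = @_;
    $dbh->do(
        'INSERT INTO schools (url, name, street, city, email, tel, fax)
         VALUES (?, ?, ?, ?, ?, ?, ?)',
        undef, $url, @{$row}{qw(name street city email tel fax)}
    );
}

sub main {
    my @urls = @_;
    my $ua   = LWP::UserAgent->new( timeout => 30 );
    # Assumption: an SQLite file; any other DBI driver should work the same way.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=schools.db', '', '',
        { RaiseError => 1, AutoCommit => 1 } );
    $dbh->do(
        'CREATE TABLE IF NOT EXISTS schools
         (url TEXT, name TEXT, street TEXT, city TEXT, email TEXT, tel TEXT, fax TEXT)'
    );
    for my $url (@urls) {
        my $row = eval { parse_address( fetch_page( $ua, $url ) ) };
        if ( !$row ) { warn "skipping $url: $@"; next; }
        store_address( $dbh, $url, $row );
        sleep 1;    # be gentle with the server when looping over thousands of pages
    }
    $dbh->disconnect;
}

# Run only when URLs are given on the command line.
main(@ARGV) if @ARGV;
```

Keeping the three steps in separate subs means the parsing step can be tried on a saved copy of one page before any fetching or database work is involved.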

Re^3: HTML::TreeBuilder:: identifing xpath-expression - first attempt
by Perlbeginner1 (Scribe) on Oct 17, 2010 at 17:29 UTC
    Hello dear Anonymous Monk! I am trying to understand your posting.


    You refer to the page that explains and helps with finding XPaths. That is very interesting! I am trying to learn something here.

    You use this link: http://www.perlmonks.org/?node_id=865792

    It leads to this code!

    This is a great tutorial and a great tool. Let me ask you if I got this right: with it I can determine the paths; in other words, I can find out all the paths in an HTML file?

    $ perl htmltreexpather.pl select.html _tag option
    HTML::Element=HASH(0xb139ec) 0.1.1.0.0
    Chose Some aaa
    /html/body/form/select/option
    /html/body/form/select/option
    /html/body[@bgcolor='red']/form[@action='/foo.cgi' and @name='queryfoo']/select[@name='singlelist']/option[@value='aaa']
    ------------------------------------------------------------------



    Question: does the code mentioned above help to print out the paths of a (general) HTML document?

    At least you make use of it here, in the script from your post above: it parses the page source with HTML::TreeBuilder::XPath and then calls findvalue on /html/body/div[2], /html/body/div[4], /html/body/div[6], /html/body/div[9], /html/body/div[11] and /html/body/div[12] to print the school name, street, town, e-mail address, phone and fax number.



    That is very impressive. I am trying to understand this code, and how you used the example you were referring to, the htmltreexpather.pl run on select.html quoted above.


    If I understand you correctly, I can use this script in many cases to get the XPaths out. Is this right?

    I look forward to hearing from you! I guess that I can learn a lot. Please help me here!

      If I understand you correctly, I can use this script in many cases to get the XPaths out. Is this right?

      No, not at all, I'm an asshole, I make shit up, I'm not here to help anyone, I just post stuff to tease, it never works :/
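
On the question of whether the paths of a general HTML document can be listed: the idea can be sketched in a few lines with plain HTML::TreeBuilder. This is not the code of htmltreexpather.pl, only a minimal illustration. The sub names `collect_paths` and `element_paths` are made up, and the sample form is merely modelled on the select.html output quoted in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Recursively collect one positional XPath per element at and below $elem.
sub collect_paths {
    my ( $elem, $path, $paths ) = @_;
    push @$paths, $path;
    my %count;    # how many children with each tag name were seen so far
    for my $child ( $elem->content_list ) {
        next unless ref $child;    # plain strings are text, not elements
        my $tag = $child->tag;
        next if $tag =~ /^~/;      # skip pseudo-elements such as comments
        collect_paths( $child,
            sprintf( '%s/%s[%d]', $path, $tag, ++$count{$tag} ), $paths );
    }
    return $paths;
}

# Parse a string of HTML and return the list of element paths.
sub element_paths {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $paths = collect_paths( $tree, '/' . $tree->tag, [] );
    $tree->delete;    # free the parse tree
    return @$paths;
}

print "$_\n" for element_paths(
    '<html><body><form><select name="singlelist">'
      . '<option value="aaa">aaa</option><option value="bbb">bbb</option>'
      . '</select></form></body></html>'
);
```

Each printed path can then be passed to findvalue from HTML::TreeBuilder::XPath to see which text it selects. Note that HTML::TreeBuilder adds implicit elements such as head, so the list may contain paths that are not written out in the source file.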