Hi Monks, I am trying to learn HTML parsing with Perl and need some help. I am working with http://lawyerlist.com.au/1385-AIM-Legal.aspx, which I have saved to disk as an HTML document. I am extracting the table at depth => 2, which shows the phone numbers, fax, email, etc. The code is below:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

my $te;
my $ts;
my $html_string;
my $filename = '1385-AIM-Legal.aspx.html';
my $row;
my $col;

open(my $fh, '<', $filename) or die "cannot open file $filename";
{
    local $/;
    $html_string = <$fh>;
}
close($fh);

my $headers = ['Phone'];
$te = HTML::TableExtract->new(depth => 2);
$te->parse($html_string);

foreach $ts ( $te->tables() ) {
    foreach $row ( $ts->rows() ) {
        print join ( "\t", @$row ), "\n";
    }
}
The output is not clean: the phone numbers are not shown properly, the email does not show at all, and the address and some other fields contain JavaScript. I just want to extract the main information and display it neatly. Any help would be appreciated. The output of this script is given below:
Use of uninitialized value in join or string at ./parsehtml.perl line 32.
AIM Legal
Phone	03 9...setTimeout("document.getElementById('Phone1').innerHTML='03 9482 4607'",1000);
Fax	03 9...setTimeout("document.getElementById('Phone2').innerHTML='03 9482 4607'",1000);
Email	var s='=b!isfg>(nbjmup;bjnmfhbmAcjhqpoe/dpn(?bjnmfhbmAcjhqpoe/dpn=0b?';var i;for (i=0;i<s.length;i++) document.write( String.fromCharCode(s.charCodeAt(i)-1));
Street Address	Michael StreetsetTimeout("document.getElementById('Address1').innerHTML='14 Michael Street'",1000); Fitzroy North. VIC 3065
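
Looking at that output, the real values appear to be sitting inside the scripts themselves: the phone, fax and address cells carry an innerHTML='...' assignment, and the email cell holds a string whose characters are each shifted up by one (the script subtracts 1 from every character code). A minimal sketch of cleaning the cells after extraction could look like the following. The helper names reveal_timeout_value, decode_shifted and clean_cell are made up for illustration and are not part of HTML::TableExtract, and the sketch assumes every cell follows the same pattern as the sample output, with all text before setTimeout( being placeholder text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace a placeholder plus its setTimeout script with
# the value the script would have written into the page.
# Assumption: everything before "setTimeout(" in the cell is placeholder text.
sub reveal_timeout_value {
    my ($cell) = @_;
    if ( $cell =~ /setTimeout\("[^"]*innerHTML='([^']*)'",\d+\);(.*)\z/s ) {
        return $1 . $2;    # revealed value, then any text after the script
    }
    return $cell;          # no script found: leave the cell unchanged
}

# Hypothetical helper: undo the "every character shifted up by one" encoding
# seen in the email cell, then drop the HTML tags from the decoded link.
sub decode_shifted {
    my ($cell) = @_;
    return $cell unless $cell =~ /var s='([^']*)'/;
    my $decoded = join '', map { chr( ord($_) - 1 ) } split //, $1;
    $decoded =~ s/<[^>]*>//g;
    return $decoded;
}

# Clean one cell. Undefined cells become empty strings, which also avoids
# the "Use of uninitialized value in join" warning.
sub clean_cell {
    my ($cell) = @_;
    return '' unless defined $cell;
    return decode_shifted($cell) if $cell =~ /String\.fromCharCode/;
    return reveal_timeout_value($cell);
}

# Sample rows copied from the output above (the undef stands in for an empty cell).
my @rows = (
    [ 'AIM Legal', undef ],
    [ 'Phone', q{03 9...setTimeout("document.getElementById('Phone1').innerHTML='03 9482 4607'",1000);} ],
    [ 'Fax',   q{03 9...setTimeout("document.getElementById('Phone2').innerHTML='03 9482 4607'",1000);} ],
    [ 'Email', q{var s='=b!isfg>(nbjmup;bjnmfhbmAcjhqpoe/dpn(?bjnmfhbmAcjhqpoe/dpn=0b?';var i;for (i=0;i<s.length;i++) document.write( String.fromCharCode(s.charCodeAt(i)-1));} ],
    [ 'Street Address', q{Michael StreetsetTimeout("document.getElementById('Address1').innerHTML='14 Michael Street'",1000); Fitzroy North. VIC 3065} ],
);

for my $row (@rows) {
    print join( "\t", map { clean_cell($_) } @$row ), "\n";
}
```

In the script above, the same idea would amount to changing the print line to print join( "\t", map { clean_cell($_) } @$row ), "\n"; so that each cell is cleaned before it is joined. If other pages on the site format their scripts differently, the two patterns would need adjusting.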

In reply to Help with HTML::TableExtract by saeen
