saeen has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am trying to learn HTML parsing with Perl and need some help. I am using http://lawyerlist.com.au/1385-AIM-Legal.aspx, which I have saved to disk as an HTML document. I am extracting the table at depth => 2, which shows the phone number, fax, email, etc. The code is below:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

my $te;
my $ts;
my $html_string;
my $filename='1385-AIM-Legal.aspx.html';
my $row;
my $col;

open(my $fh, '<', $filename) or die "cannot open file $filename";
{
    local $/;
    $html_string = <$fh>;
}
close($fh);

my $headers = ['Phone'];
$te = HTML::TableExtract->new(depth => 2);
$te->parse($html_string);
foreach $ts ( $te->tables() ) {
    foreach $row ( $ts->rows() ) {
        print join ( "\t", @$row ), "\n";
    }
}
This doesn't give me neat output: the phone numbers don't display properly, the email doesn't show at all, and the address and some other fields have JavaScript in them. I just want to extract the main information and display it neatly. Any help would be appreciated. The output of this script is given below:
Use of uninitialized value in join or string at ./parsehtml.perl line 32.
AIM Legal
Phone 03 9...setTimeout("document.getElementById('Phone1').innerHTML='03 9482 4607'",1000);
Fax 03 9...setTimeout("document.getElementById('Phone2').innerHTML='03 9482 4607'",1000);
Email var s='=b!isfg>(nbjmup;bjnmfhbmAcjhqpoe/dpn(?bjnmfhbmAcjhqpoe/dpn=0b?';var i;for (i=0;i<s.length;i++) document.write( String.fromCharCode(s.charCodeAt(i)-1));
Street Address Michael StreetsetTimeout("document.getElementById('Address1').innerHTML='14 Michael Street'",1000); Fitzroy North. VIC 3065

Re: Help with HTML::TableExtract
by Corion (Patriarch) on Mar 12, 2015 at 09:15 UTC

    You will need to learn how Javascript and HTML interact. The email is decoded from the Javascript string.

    Either you learn how to rewrite the Javascript algorithm in Perl or you run the Javascript code and fetch its result.

    I recommend rewriting the Javascript algorithm in Perl.
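    For the phone, fax and address cells, the real value appears to be the string that the setTimeout call assigns to innerHTML. A minimal sketch, assuming every such cell follows the pattern shown in the output above; real_value is a hypothetical helper, not part of HTML::TableExtract:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): given the text of one
# table cell, return the value the page's setTimeout call would write into it.
sub real_value {
    my ($cell) = @_;
    return unless defined $cell;    # empty cells come back as undef

    # Capture the single-quoted string assigned to innerHTML.
    if ( $cell =~ /\.innerHTML\s*=\s*'([^']*)'/ ) {
        return $1;
    }
    return $cell;                   # no script in this cell, keep it as is
}

# Cell text copied from the output in the question.
my $phone = q{03 9...setTimeout("document.getElementById('Phone1').innerHTML='03 9482 4607'",1000);};
print real_value($phone), "\n";
```

    Note that this keeps only the innerHTML value, so any text that follows the script in the same cell (the suburb in the address cell, for example) is dropped and would need separate handling. Skipping undefined cells before the join would also avoid the "uninitialized value" warning.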

      Thanks, Corion. I am not too sure how to rewrite JavaScript in Perl. Is there any resource you can point me to? Thanks

        It's pretty straightforward in this case. The Javascript code in question is something along these lines:

        var s='=b!isfg>(fybnqmfAfybnqmf/dpn(?fybnqmfAfybnqmf/dpn=0b?';var i;for (i=0;i<s.length;i++) document.write( String.fromCharCode(s.charCodeAt(i)-1));

        Even without knowing any Javascript, it's not too difficult to tell what this does: it goes through the string one character at a time and replaces each character with the one whose character code is one lower (in whatever character set Javascript uses by default).

        You could do the same in Perl more or less verbatim (using the chr and ord functions), but instead of using a loop, it's perhaps more idiomatic to resort to split, map and join:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use feature qw/say/;

        my $s = '=b!isfg>(fybnqmfAfybnqmf/dpn(?fybnqmfAfybnqmf/dpn=0b?';
        say join "", map { chr(ord($_) - 1) } split //, $s;

        This outputs:

        $ perl 1119765.pl
        <a href='example@example.com'>example@example.com</a>
        $
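        The decoder above works on a string that has already been copied out by hand. To apply it to the cell text that HTML::TableExtract returns, the string first has to be pulled out of the script. A rough sketch, assuming the script always assigns it as var s='...'; decode_email is a hypothetical helper name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: find the obfuscated string in a cell's script text,
# shift every character down by one, and strip the tags from the result.
sub decode_email {
    my ($cell) = @_;

    # Assumption: the string is assigned as  var s='...';
    my ($s) = $cell =~ /var\s+s\s*=\s*'(.*?)';/ or return;

    # Same decoding step as above: each character code minus one.
    my $html = join '', map { chr( ord($_) - 1 ) } split //, $s;

    # The decoded text is an <a> element; keep only the link text.
    ( my $text = $html ) =~ s/<[^>]+>//g;
    return $text;
}

# Cell text modelled on the Javascript snippet quoted above.
my $cell = q{var s='=b!isfg>(fybnqmfAfybnqmf/dpn(?fybnqmfAfybnqmf/dpn=0b?';var i;for (i=0;i<s.length;i++) document.write( String.fromCharCode(s.charCodeAt(i)-1));};
print decode_email($cell), "\n";
```

        For this input it prints example@example.com. Stripping tags with a regex is only reasonable here because the decoded fragment is a single short link; for anything larger a real HTML parser would be the safer choice.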

        I would look at the MDN, but that requires that you already know enough Javascript.