PerlMonks  

HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

by Starman (Initiate)
on Aug 23, 2003 at 00:11 UTC ( [id://285963]=perlquestion )

Starman has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract a snippet of HTML from a web page on a daily basis. I can grab the page and store it; however, I am still learning how to extract the HTML, so pointers in the right direction would be helpful. Here's a sample from the page. The table tag is the first instance on the page of the tag in this exact format. I want the complete text, including tags, output to a file. Thoughts?
<table border=0 align=center> <tr align=center> <td colspan=3><a href="/a.gif"><img src="/a.png" alt="a"></a> </td></tr> </TABLE>

Replies are listed 'Best First'.
Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?
by allolex (Curate) on Aug 23, 2003 at 08:07 UTC

    Here's an example using Ovid's HTML::TokeParser::Simple, the one module you didn't mention in your post ;)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple ();

    use constant SKIP => 0;
    use constant COPY => 1;

    die "usage: $0 inputfile > outputfile\n" if @ARGV != 1;

    my $p = HTML::TokeParser::Simple->new(shift);
    my @results;
    my $state = SKIP;

    while ( my $t = $p->get_token ) {
        # Start copying at a <table> start tag with border=0 and align=center.
        if ( $state == SKIP && $t->is_start_tag('table') ) {
            my $border = $t->return_attr->{border};
            my $align  = $t->return_attr->{align};
            # Guard against tables that lack either attribute.
            $state = COPY
                if defined $border && $border =~ /^0$/
                && defined $align  && $align  =~ /center/;
        }

        if ( $state == COPY && $t->is_end_tag('table') ) {
            push @results, $t->as_is;    # keep the closing table tag as well
            $state = SKIP;
        }
        elsif ( $state == COPY ) {
            push @results, $t->as_is;
        }
        elsif ( $state == SKIP ) {
            next;
        }
        else {
            die "I'm confused about my state ($state) at token " . $t->as_is;
        }
    }

    print "$_\n" for @results;

    Thanks to Aristotle for helping me with a similar problem months ago.

    --
    Allolex

Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?
by jeffa (Bishop) on Aug 23, 2003 at 13:35 UTC
    If the HTML you are parsing happens to be valid XML (they call that XHTML these days ;)) then you can use the uber module XML::Twig:
    use strict;
    use XML::Twig;

    my $t = XML::Twig->new(
        twig_handlers => { table => \&handler },
        pretty_print  => 'indented',
    );
    $t->parse(\*DATA);

    sub handler {
        my ($t, $table) = @_;
        $table->flush
            if  $table->att('border') == 0
            and $table->att('align') eq 'center';
    }

    __DATA__
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>XML::Twig table extract test</title>
    </head>
    <body>
    <table><tr><td>
    <table border="0" align="center">
    <tr align="center">
    <td colspan="3"><a href="/a.gif"><img src="/a.png" alt="a" /> </a></td>
    </tr>
    </table>
    </td></tr></table>
    </body>
    </html>
    But the HTML you have posted is not valid XHTML. I ran your HTML through HTML Tidy and embedded it inside another table for testing purposes. You can always fetch the web page and call HTML Tidy externally, or you can install XML::LibXML and use the technique merlyn presents in "HTML tidy, using XML::LibXML" to clean up the HTML you have to parse.
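
Calling HTML Tidy externally, as suggested above, might look roughly like the sketch below. It assumes a `tidy` executable is on the PATH and that the page has already been saved to a local file. The sub names `tidy_command` and `tidy_to_xhtml` are made up for illustration, and the flag selection (`-asxhtml`, `-numeric`, `-quiet`) is an assumption about what suits this page rather than anything from the thread.

```perl
use strict;
use warnings;

# Build the argument list for an external HTML Tidy run (hypothetical helper).
#   -asxhtml  convert the input to XHTML
#   -numeric  emit numeric rather than named entities
#   -quiet    suppress nonessential output
sub tidy_command {
    my ($file) = @_;
    return ( 'tidy', '-asxhtml', '-numeric', '-quiet', $file );
}

# Run tidy on a saved page and return the cleaned-up XHTML as one string.
sub tidy_to_xhtml {
    my ($file) = @_;
    open my $fh, '-|', tidy_command($file)
        or die "cannot run tidy: $!";
    my $xhtml = do { local $/; <$fh> };    # slurp all of tidy's output
    close $fh;                             # sets $? to tidy's exit status
    # tidy exits with 1 for warnings and 2 for errors; only errors are fatal here.
    die "tidy reported errors for $file\n" if ( $? >> 8 ) >= 2;
    return $xhtml;
}

# Hypothetical usage: perl tidy_page.pl page.html > page.xhtml
print tidy_to_xhtml( $ARGV[0] ) if @ARGV;
```

The resulting XHTML could then be handed to an XML parser in place of the DATA section used in the example above.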

    So, why use something like XML::Twig instead of an HTML::Parser? Because you are extracting a whole subset of the HTML instead of individual tags or text. Another good candidate module for this kind of work is XML::XPath. The XPath language was designed to "address parts of an XML document". Here is a quick example that uses XPath (and the same DATA filehandle):

    use strict;
    use XML::XPath;
    my $xpath = XML::XPath->new(ioref => \*DATA);
    my $nodes = $xpath->find('//table[@border=0][@align="center"]');
    print XML::XPath::XMLParser::as_string($nodes->get_nodelist);
    How's that for 5 lines of code? ;) You can find a good XPath tutorial at http://www.zvon.org/xxl/XPathTutorial/General/examples.html, by the way.

    It is important to use the right tool for the job, and I think that XML::Twig and XML::XPath are better tools for this job than the HTML parsers.

    Hope this helps :)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?
by BrowserUk (Patriarch) on Aug 23, 2003 at 00:30 UTC

    On the basis that it is the first table on the page, isn't nested, and doesn't contain nested tables, a one-liner will do it.

    get the.url.com/path | perl -0777 -ne" print m[(<table.*?</table>)]si" > file
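
For a page that has already been fetched and stored, the same idea could be written as a small script. This is only a sketch under the same assumptions as the one-liner (first table on the page, no nesting). The sub name `first_table` and the file names in the usage comment are hypothetical, and the `\b` after `table` is an addition that is not in the one-liner.

```perl
use strict;
use warnings;

# Return the first <table>...</table> span in $html, or undef if there is none.
# The non-greedy .*? stops at the first closing tag, so a table that contains
# another table would be cut short.
sub first_table {
    my ($html) = @_;
    return $html =~ m{(<table\b.*?</table>)}si ? $1 : undef;
}

# Hypothetical usage: perl first_table.pl page.html > table.html
if (@ARGV) {
    my $file = shift @ARGV;
    open my $in, '<', $file or die "cannot open $file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file, like -0777
    close $in;

    my $table = first_table($html);
    defined $table or die "no table found in $file\n";
    print $table, "\n";
}
```

The `/s` modifier lets `.` cross line breaks and `/i` accepts the mixed-case `</TABLE>` seen in the sample.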

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      Unfortunately, it is not the first table; however, it is the only one that uses the attributes "border=0 align=center". There are no tables nested within it, although it is itself nested within another table.

        In that case, adding those attributes to the regex should do the trick. This is obviously untested.

        get the.url.com/path |
          perl -0777 -ne" print m[(<table border=0 align=center>.*?</table>)]si" > file

        That is still a one-liner, but I split it across a few lines as the auto codewrap did horrible things to it.
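
The literal start tag in that regex only matches if the attributes appear in exactly that order, unquoted, with single spaces. A somewhat more tolerant pattern is sketched below. It is an assumption about what variation the page might show, not something from the thread, and the names `$TABLE_RE` and `find_table` are hypothetical. It still relies on the stated condition that no tables are nested inside the target.

```perl
use strict;
use warnings;

# Match a <table ...> start tag that carries border=0 and align=center in any
# order, quoted or unquoted, then everything up to the first closing table tag.
# Each lookahead scans only inside the start tag, because [^>]* cannot cross ">".
my $TABLE_RE = qr{
    (
        <table\b
        (?= [^>]* \b border \s*=\s* ["']? 0      ["']? (?: \s | > ) )
        (?= [^>]* \b align  \s*=\s* ["']? center ["']? (?: \s | > ) )
        [^>]* >
        .*?
        </table \s* >
    )
}six;

# Return the matching table as a string, or undef if there is none.
sub find_table {
    my ($html) = @_;
    return $html =~ $TABLE_RE ? $1 : undef;
}

# Hypothetical usage: perl find_table.pl page.html > table.html
if (@ARGV) {
    my $html  = do { local $/; <> };    # slurp the first named file
    my $table = find_table($html);
    print "$table\n" if defined $table;
}
```

Because the enclosing table lacks these attributes, the lookaheads fail on it and the match starts at the inner table.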



Node Type: perlquestion [id://285963]
Approved by sauoq