HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

Starman has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by allolex (Curate) on Aug 23, 2003 at 08:07 UTC
Here's an example using Ovid's HTML::TokeParser::Simple, the one module you didn't mention in your post ;)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple ();

    use constant SKIP => 0;
    use constant COPY => 1;

    die "usage: $0 inputfile > outputfile\n" if @ARGV != 1;

    my $p = HTML::TokeParser::Simple->new(shift);
    my @results;
    my $state = SKIP;

    while ( my $t = $p->get_token ) {
        if (   $state == SKIP
            && $t->is_start_tag('table')
            && (   $t->return_attr->{border} =~ /^0$/
                && $t->return_attr->{align} =~ /center/ ) )
        {
            $state = COPY;
        }
        if ( $state == COPY && $t->is_end_tag('table') ) {
            $state = SKIP;
        }
        elsif ( $state == COPY ) {
            push @results, $t->as_is;
        }
        elsif ( $state == SKIP ) {
            next;
        }
        else {
            die "I'm confused about my state ($state) at token " . $t->as_is;
        }
    }

    print "$_\n" for @results;

Thanks to Aristotle for helping me with a similar problem months ago.

-- Allolex
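A possible variation on the state machine above, offered as a sketch rather than as allolex's method: it counts nesting depth so that a table nested inside the target would not end the copy early, checks that the attributes are defined before comparing them, and copies the closing tag as well. The sub name `extract_table` and the sample markup are made up for illustration, and it assumes a version of HTML::TokeParser::Simple that provides `get_attr`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Return the markup of the first table with border=0 and align=center,
# including any tables nested inside it. (Hypothetical helper name.)
sub extract_table {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( \$html );    # scalar ref: parse a string
    my $depth = 0;                                      # 0 means "outside the target table"
    my @out;

    while ( my $t = $p->get_token ) {
        if ( $depth == 0 ) {
            # Skip everything until the opening tag of the target table.
            next unless $t->is_start_tag('table');
            my $border = $t->get_attr('border');
            my $align  = $t->get_attr('align');
            next unless defined $border && $border eq '0';
            next unless defined $align  && lc($align) eq 'center';
            $depth = 1;
        }
        elsif ( $t->is_start_tag('table') ) { $depth++ }
        elsif ( $t->is_end_tag('table') )   { $depth-- }

        push @out, $t->as_is;
        last if $depth == 0;    # the target's closing tag was just copied
    }
    return join '', @out;
}

# Made-up sample: the target table sits inside an outer table.
my $sample = '<table><tr><td><table border=0 align=center>'
           . '<tr><td>wanted</td></tr></table></td></tr></table>';
print extract_table($sample), "\n";
```

Tracking a depth counter instead of a two-value state costs one extra variable and keeps the loop correct if the page layout later gains a nested table.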
Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by jeffa (Bishop) on Aug 23, 2003 at 13:35 UTC
If the HTML you are parsing happens to be valid XML (they call that XHTML these days ;)) then you can use the uber module XML::Twig:

    use strict;
    use XML::Twig;

    my $t = XML::Twig->new(
        twig_handlers => { table => \&handler },
        pretty_print  => 'indented',
    );
    $t->parse(\*DATA);

    sub handler {
        my ($t, $table) = @_;
        $table->flush
            if  $table->att('border') == 0
            and $table->att('align') eq 'center';
    }

    __DATA__
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>XML::Twig table extract test</title>
    </head>
    <body>
    <table><tr><td>
    <table border="0" align="center">
    <tr align="center">
    <td colspan="3"><a href="/a.gif"><img src="/a.png" alt="a" />
    </a></td>
    </tr>
    </table>
    </td></tr></table>
    </body>
    </html>

But the HTML you have posted is not valid XHTML. I ran your HTML through HTML Tidy and embedded it inside another table for testing purposes. You can always fetch the web page and call HTML Tidy externally, or you can install XML::LibXML and use the technique merlyn presents at "HTML tidy, using XML::LibXML" to clean up the HTML you have to parse.

So, why use something like `XML::Twig` instead of an HTML::Parser? Because you are extracting a whole subset of HTML instead of individual tags or text.

Another good candidate module for this kind of work is XML::XPath. The XPath language was designed to "address parts of an XML document". Here is a quick example that uses XPath (and the same `DATA` filehandle):

    use strict;
    use XML::XPath;
    my $xpath = XML::XPath->new(ioref => \*DATA);
    my $nodes = $xpath->find('//table[@border=0][@align="center"]');
    print XML::XPath::XMLParser::as_string($nodes->get_nodelist);

How's that for 5 lines of code? ;) You can find a good XPath tutorial at http://www.zvon.org/xxl/XPathTutorial/General/examples.html, by the way.

It is important to use the right tool for the job, and I think that `XML::Twig` and `XML::XPath` are better tools for this job than the HTML parsers.

Hope this helps :)

jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
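For HTML that is not valid XHTML, one way to combine the clean-up and XPath steps might look like the sketch below. It is not merlyn's code: it simply assumes a reasonably recent XML::LibXML (one that has `load_html`) and lets libxml2's forgiving HTML parser repair the markup before the same kind of XPath query is run. The sub name `extract_tables` and the sample markup are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Parse possibly untidy HTML and return the serialized form of every
# table with border="0" and align="center". (Hypothetical helper name.)
sub extract_tables {
    my ($html) = @_;
    my $doc = XML::LibXML->load_html(
        string  => $html,
        recover => 2,    # recover from malformed markup without printing warnings
    );
    return map { $_->toString }
        $doc->findnodes('//table[@border="0"][@align="center"]');
}

# Made-up sample with unquoted attributes and missing closing tags.
my $sample = '<table><tr><td><table border=0 align=center>'
           . '<tr><td>wanted</table></table>';
print "$_\n" for extract_tables($sample);
```

The serialized table comes back as repaired markup, so it will not be byte-for-byte identical to the source page.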
Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by BrowserUk (Patriarch) on Aug 23, 2003 at 00:30 UTC
On the basis that it is the first table on the page, isn't nested, and doesn't contain nested tables, a one-liner will do it.

    get the.url.com/path | perl -0777 -ne" print m[(<table.*?</table>)]si" > file

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.
Re: Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by Starman (Initiate) on Aug 23, 2003 at 00:41 UTC
Unfortunately it is not the first table; however, it is the only one that uses these attributes: "border=0 align=center". There are no tables nested within it, although it is itself nested within another table.
Re: Re: Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by BrowserUk (Patriarch) on Aug 23, 2003 at 01:19 UTC
In that case, adding those attributes to the regex should do the trick. This is obviously untested.

    get the.url.com/path | perl -0777 -ne" print m[(<table border=0 align=center>.*?</table>)]si" > file

That is still a one-liner, but I split it across a few lines as the auto codewrap did horrible things to it.
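The pattern above only matches when the opening tag is written exactly as `<table border=0 align=center>`. If the page quotes the attribute values, lists them in the other order, or adds further attributes, a more tolerant pattern along these lines might help. This is a sketch, not BrowserUk's code; the sub name `extract_table` and the sample markup are made up, and it still relies on the target table containing no nested tables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Match a table whose opening tag carries border=0 and align=center,
# in either order, with or without quotes around the values.
my $table_re = qr{
    (
        <table\b
        (?= [^>]* \b border \s* = \s* ["']? 0      ["']? (?: \s | > ) )
        (?= [^>]* \b align  \s* = \s* ["']? center ["']? (?: \s | > ) )
        [^>]* >          # rest of the opening tag
        .*?              # table body, as short as possible
        </table \s* >    # first closing tag after it
    )
}six;

# Return the first matching table, or undef. (Hypothetical helper name.)
sub extract_table {
    my ($html) = @_;
    my ($table) = $html =~ $table_re;
    return $table;
}

# Made-up sample: quoted values, reversed order, upper-case tag names.
my $sample = <<'HTML';
<table><tr><td>
<TABLE align="center" border="0">
<tr><td>wanted</td></tr>
</TABLE>
</td></tr></table>
HTML

my $table = extract_table($sample);
print "$table\n" if defined $table;
```

The two lookaheads are confined to the opening tag by `[^>]*`, so attributes on some later tag cannot satisfy them by accident.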
Re: Re: Re: Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by Starman (Initiate) on Aug 24, 2003 at 18:58 UTC
Re: Re: Re: Re: Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by BrowserUk (Patriarch) on Aug 24, 2003 at 19:51 UTC

