regexp for stripping tables

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: regexp for stripping tables by zby (Vicar) on Sep 09, 2003 at 08:56 UTC
If all you need is to strip the table tags, you can use Regexp::Token (see: Regexp::Token -- Use regular expressions to match tokens). If you want to strip the tables together with their content, there are two possibilities:

1. The HTML you expect has some predefined structure, and you can use this additional structure to write the regexps (but we need to know that structure to help you here).
2. There is no additional structure in the HTML. In this case you need to use an HTML parser (like HTML::TableExtract or the whole HTML::Parser).

You'll find more discussion of these techniques in: Scraping HTML: orthodoxy and reality, Regexps to change HTML tags/attributes, Parsing nested HTML with just regex
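For the first case above (removing only the table tags and keeping the text between them), a plain substitution may be enough. This is a minimal sketch and not Regexp::Token itself; the helper name strip_table_tags and the sample string are made up for illustration, and the pattern assumes no attribute value contains a literal ">".

```perl
use strict;
use warnings;

# Hypothetical helper: removes table-related tags but keeps the text between them.
sub strip_table_tags {
    my ($html) = @_;
    # Match opening or closing tags, case-insensitively, with optional attributes.
    # \b stops "tr" from matching the start of a longer tag name such as "track".
    $html =~ s{</?(?:table|thead|tbody|tfoot|tr|th|td|caption)\b[^>]*>}{}gi;
    return $html;
}

# Made-up sample input.
my $html = '<p>Hello</p><TABLE border="1"><tr><td>cell</td></tr></table ><p>World</p>';
print strip_table_tags($html), "\n";   # prints <p>Hello</p>cell<p>World</p>
```

This only deletes the tags. It cannot tell where a nested table ends, so it is no help when the table contents have to go as well.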
Re: regexp for stripping tables by tachyon (Chancellor) on Sep 09, 2003 at 13:17 UTC
Don't try to do it with a regex. Nested structures are very hard to do with REs. Nested HTML is probably as hard as it gets. Here is how to do it right using HTML::Parser. Yes, the Version 2 API takes a little getting used to, but it is very easy to use once you get your head around it.

All we do is increment a counter when we find an opening table tag and decrement it when we find a closing table tag. If we have a value > 0 in the counter we are in a table, so we don't add the original text to our data. If we have a value of 0 we are outside of the tables, so we add the origtext.

    use strict;
    use warnings;

    {
        package MyParser;
        use base 'HTML::Parser';

        # Opening tag: count table depth first, so <table> itself is dropped.
        sub start {
            my($self, $tagname, $attr, $attrseq, $origtext) = @_;
            $self->{table}++ if $tagname eq 'table';
            $self->{data} .= $origtext unless $self->{table};
        }

        # Closing tag: test before decrementing, so </table> itself is dropped.
        sub end {
            my($self, $tagname, $origtext) = @_;
            $self->{data} .= "</$tagname>" unless $self->{table};
            $self->{table}-- if $tagname eq 'table';
        }

        # Text is kept only while we are outside every table.
        sub text {
            my($self, $origtext, $is_cdata) = @_;
            $self->{data} .= $origtext unless $self->{table};
        }

        # Comments are dropped; uncomment the line below to keep them.
        sub comment {
            my($self, $origtext ) = @_;
            #$self->{data} .= $origtext if $want_comments
        }
    }

    my $p = MyParser->new;
    $p->parse_file(*DATA);
    my $data = $p->{data};
    print $data;

    __DATA__
    <html>
    <head>
    <title></title>
    </head>
    <body>
    <p>Hello
    <tablE>
    .....
    </taBle>
    <p>World
    <table>
    <tr><TABLE>
    Nested
    <table>
    Nested some more
    </table>
    </table>
    </tr>
    </table >
    <p>REs can be useful, but HTML parser rocks!
    </body>
    </html>

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
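For anyone who would rather not subclass, the same counter idea can be sketched with the handler-based (version 3) API of HTML::Parser. This is an illustration, not tachyon's code: the helper name strip_tables is made up, and unlike the subclass version it keeps comments found outside tables and ignores a stray closing table tag instead of letting the counter go negative.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns $html with every table (and its contents) removed.
sub strip_tables {
    my ($html) = @_;
    my $depth = 0;     # number of currently open <table> elements
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # tagname is lower-cased by the parser; text is the original source text.
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $depth++ if $tagname eq 'table';
            $out .= $text unless $depth;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            $out .= $text unless $depth;
            $depth-- if $tagname eq 'table' && $depth > 0;
        }, 'tagname, text' ],
        # Everything without its own handler: text, comments, declarations.
        default_h => [ sub { $out .= $_[0] unless $depth }, 'text' ],
    );

    $p->parse($html);
    $p->eof;           # flush anything the parser is still buffering
    return $out;
}

print strip_tables('<p>a</p><table><tr><td>x</td></tr></table><p>b</p>'), "\n";
```

Keeping the state in closure variables rather than in the parser object is a matter of taste; the counting logic is the same as in the subclass above.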
Re: Re: regexp for stripping tables by Anonymous Monk on Sep 10, 2003 at 01:39 UTC
Dear tachyon, Your posting was extremely helpful. I was able to plug your example into my program, and my problem went away. Without your example I would have been quite lost, since I have thus far avoided Perl OOP and event-based parsers. So, in addition to fixing my small problem, I also got a little tutorial on OOP and event-based parsing. Capital. Thank you, Daniel
Re: regexp for stripping tables by seattlejohn (Deacon) on Sep 09, 2003 at 18:07 UTC
HTML::TableExtract subclasses HTML::Parser and provides pretty rich mechanisms for getting whatever information you may want. $perlmonks{seattlejohn} = 'John Clyman';
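As a rough illustration of what that looks like in practice, here is a small sketch that walks every table in a document and prints its rows. The sample HTML is made up, and the constructor is called with no constraints on the assumption that this matches all tables; check the module's documentation for the headers, depth and count options before relying on it.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample document.
my $html = <<'HTML';
<p>Intro</p>
<table>
<tr><td>apple</td><td>3</td></tr>
<tr><td>pear</td><td>5</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new;   # no constraints given (assumed: match every table)
$te->parse($html);
$te->eof;

for my $ts ($te->tables) {
    # coords returns the depth and count that identify this table.
    print 'Table (', join(',', $ts->coords), "):\n";
    for my $row ($ts->rows) {
        # Each row is an array reference of cell contents; empty cells may be undef.
        print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

Note that this goes in the opposite direction from the question: it pulls the tables out of the page rather than removing them from it.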
Re: regexp for stripping tables by Anonymous Monk on Sep 09, 2003 at 23:56 UTC
Thank you all very much for your excellent suggestions, and for helping me solve my tables problem. Once again I am indebted to the perlmonks community. Daniel.

