
regexp for stripping tables

by Anonymous Monk
on Sep 09, 2003 at 08:30 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a Perl script to strip tables from HTML files. There are many tables, some of them nested, and I can't figure out how to write the regexp. Can anyone help this regexp newbie? Thanks. Daniel.

Replies are listed 'Best First'.
Re: regexp for stripping tables
by zby (Vicar) on Sep 09, 2003 at 08:56 UTC
Re: regexp for stripping tables
by tachyon (Chancellor) on Sep 09, 2003 at 13:17 UTC

    Don't try to do it with a regex. Nested structures are very hard to handle with REs, and nested HTML is probably as hard as it gets. Here is how to do it right using HTML::Parser. Yes, the Version 2 API takes a little getting used to, but it is very easy to use once you get your head around it. All we do is increment a counter when we find an opening table tag and decrement it when we find a closing one. If the counter is > 0 we are inside a table, so we don't add the original text to our data. If it is 0 we are outside all tables, so we add the origtext.

    use strict;
    use warnings;

    {
        package MyParser;
        use base 'HTML::Parser';

        # Opening tag: bump the counter first so the <table> tag itself is dropped.
        sub start {
            my($self, $tagname, $attr, $attrseq, $origtext) = @_;
            $self->{table}++ if $tagname eq 'table';
            $self->{data} .= $origtext unless $self->{table};
        }

        # Closing tag: test before decrementing so the </table> tag is dropped too.
        sub end {
            my($self, $tagname, $origtext) = @_;
            $self->{data} .= "</$tagname>" unless $self->{table};
            # Only decrement while inside a table, so a stray </table>
            # cannot push the counter below zero.
            $self->{table}-- if $tagname eq 'table' && $self->{table};
        }

        # Text is kept only while we are outside all tables.
        sub text {
            my($self, $origtext, $is_cdata) = @_;
            $self->{data} .= $origtext unless $self->{table};
        }

        # Comments are discarded; uncomment the line below to keep them.
        sub comment {
            my($self, $origtext) = @_;
            #$self->{data} .= $origtext if $want_comments
        }
    }

    my $p = MyParser->new;
    $p->parse_file(*DATA);
    my $data = $p->{data};
    print $data;

    __DATA__
    <html>
    <head>
    <title></title>
    </head>
    <body>
    <p>Hello
    <tablE>
    .....
    </taBle>
    <p>World
    <table>
    <tr><TABLE>
    Nested
    <table>
    Nested some more
    </table>
    </table>
    </tr>
    </table >
    <p>REs can be useful, but HTML parser rocks!
    </body>
    </html>
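
    For comparison, the same counter idea can probably also be written against the newer Version 3 handler interface of HTML::Parser, which avoids the subclass. This is only a sketch, not the code from the post above: the function name strip_tables and the variables $depth and $out are made up for illustration, and it assumes HTML::Parser is installed.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return $html with every <table>...</table> region
# (nested ones included) removed.
sub strip_tables {
    my ($html) = @_;
    my $depth = 0;     # number of <table> elements currently open
    my $out   = '';    # everything found outside of tables

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $depth++ if $tagname eq 'table';
            $out .= $text unless $depth;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            if ($tagname eq 'table') {
                # Never emit </table>; ignore a stray one at depth 0.
                $depth-- if $depth > 0;
                return;
            }
            $out .= $text unless $depth;
        }, 'tagname, text' ],
        # Text, comments and declarations pass through only outside tables.
        default_h => [ sub { $out .= $_[0] unless $depth }, 'text' ],
    );
    $p->parse($html);
    $p->eof;    # flush any text still buffered by the parser
    return $out;
}

print strip_tables('<p>Hello <tablE><tr><td>x <table>y</table></td></tr></taBle><p>World'), "\n";
```

    Unlike the subclass version, this sketch keeps comments that appear outside tables, because the default handler passes them through; add a separate comment_h handler if they should be dropped.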

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Dear tachyon,

      Your posting was extremely helpful. I was able to plug your example into my program, and my problem went away. Without your example I would have been quite lost, since I have thus far avoided Perl OOP and event-based parsers. So, in addition to fixing my small problem, I also got a little tutorial on OOP and event-based parsing. Capital.

      Thank you,

      Daniel

Re: regexp for stripping tables
by seattlejohn (Deacon) on Sep 09, 2003 at 18:07 UTC
    HTML::TableExtract subclasses HTML::Parser and provides pretty rich mechanisms for getting whatever information you may want.
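
    As a rough illustration of what that looks like, the sketch below pulls the cell text out of every table in a document. Note that this goes the other way from the original question: it extracts table contents rather than stripping them. The sample HTML and the @rows variable are invented for the example, and it assumes HTML::TableExtract is installed.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample input for the sketch.
my $html = <<'HTML';
<p>Intro</p>
<table>
  <tr><th>Name</th><th>Role</th></tr>
  <tr><td>Alice</td><td>Admin</td></tr>
</table>
HTML

# With no headers/depth/count constraints, every table is matched.
my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;

# Collect each row as an array ref of cell strings.
my @rows;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        push @rows, [ map { defined $_ ? $_ : '' } @$row ];
    }
}

print join(' | ', @$_), "\n" for @rows;
```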

            $perlmonks{seattlejohn} = 'John Clyman';

Re: regexp for stripping tables
by Anonymous Monk on Sep 09, 2003 at 23:56 UTC
    Thank you all very much for your excellent suggestions, and for helping me solve my tables problem. Once again I am indebted to the PerlMonks community.

    Daniel.
