LostS has asked for the wisdom of the Perl Monks concerning the following question:

Hey, I asked a question the other day and got that working. Now I have found another section I need to parse out of a page I am grabbing via LWP::Simple. I have the page stored in a variable, $webpage, and I need to drop this section of code from that variable:
<!-- Begin MRTG Block --> <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0> <TR> <TD WIDTH=63><A HREF="http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/"><IMG BORDER=0 SRC="http://hamburg.harbinger.net/mrtg/cen/mrtg-l.png" WIDTH=63 HEIGHT=25 ALT="MRTG"></A></TD> <TD WIDTH=25><A HREF="http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/"><IMG BORDER=0 SRC="http://hamburg.harbinger.net/mrtg/cen/mrtg-m.png" WIDTH=25 HEIGHT=25 ALT=""></A></TD> <TD WIDTH=388><A HREF="http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/"><IMG BORDER=0 SRC="http://hamburg.harbinger.net/mrtg/cen/mrtg-r.png" WIDTH=388 HEIGHT=25 ALT="Multi Router Traffic Grapher"></A></TD> </TR> </TABLE> <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0> <TR VALIGN=top> <TD WIDTH=88 ALIGN=RIGHT><FONT FACE="Arial,Helvetica" SIZE=2> version 2.9.6</FONT></TD> <TD WIDTH=388 ALIGN=RIGHT><FONT FACE="Arial,Helvetica" SIZE=2> <A HREF="http://ee-staff.ethz.ch/~oetiker/">Tobias Oetiker</A> <A HREF="mailto:oetiker@ee.ethz.ch">&lt;oetiker@ee.ethz.ch&gt;</A> and &nbsp; <A HREF="http://www.bungi.com/">Dave&nbsp;Rand</A>&nbsp; <A HREF="mailto:dlr@bungi.com">&lt;dlr@bungi.com&gt;</A></FONT> </TD> </TR> </TABLE> <!-- End MRTG Block -->
How would you suggest I do this? I just want to get rid of that part, so I will be replacing it with nothing.

Replies are listed 'Best First'.
Re: Grabing a Page and Need to Parse it..
by larsen (Parson) on May 07, 2001 at 20:12 UTC
    General HTML parsing can be done with HTML::Parser and its relatives. Looking at your snippet of HTML, it seems that you will find HTML::TableExtract useful.
      I found it...
      $traffictotals =~ s/<!-- Begin MRTG Block -->(.*?)<!-- End MRTG Block -->//s;
      Works great :)
        Looks like you're parsing output from Multi Router Traffic Grapher, a neat SNMP tool that fetches bandwidth utilization info and generates HTML pages from those stats.

        Instead of parsing its HTML output back into text, you might consider using UCD SNMP to query the device(s) directly. UCD-SNMP includes snmpwalk and other command-line tools that are pretty slick.

        For a more perlish solution, Net::SNMP would also do the trick. "(code) mind your snmPs & Qs" shows yet another perlish approach, this time using the CPAN module SNMP to query devices for info.

        There are any number of possible reasons why these wouldn't work in your situation, but they seem worth mentioning.
            cheers,
            Don
            striving toward Perl Adept
            (it's pronounced "why-bick")

Re: Grabing a Page and Need to Parse it..
by swngnmonk (Pilgrim) on May 07, 2001 at 20:20 UTC
    Will the block always be wrapped by those comments?
    Unless I misunderstand your question, a simple regexp will remove all of that.
    $webpage =~ s/<!-- Begin MRTG Block -->.*<!-- End MRTG Block -->//os;
    Does the table occur more than once in the webpage? If so, add the 'g' (global - as often as it's encountered) option to the Regexp.
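If the block can appear more than once, the difference between the greedy `.*` and the non-greedy `.*?` starts to matter. The following is a minimal sketch, assuming a made-up page with two marked blocks (the `$page` contents are hypothetical, standing in for whatever LWP::Simple returned):

```perl
use strict;
use warnings;

# Hypothetical page with two MRTG blocks; the real $webpage would come from LWP::Simple.
my $page = join "\n",
    'top',
    '<!-- Begin MRTG Block -->', '<TABLE>one</TABLE>', '<!-- End MRTG Block -->',
    'middle',
    '<!-- Begin MRTG Block -->', '<TABLE>two</TABLE>', '<!-- End MRTG Block -->',
    'bottom';

# Greedy: .* runs from the first Begin marker to the LAST End marker,
# so the text between the two blocks is removed as well.
(my $greedy = $page) =~ s/<!-- Begin MRTG Block -->.*<!-- End MRTG Block -->//gs;

# Non-greedy: .*? stops at the nearest End marker, so each block goes on its own.
(my $lazy = $page) =~ s/<!-- Begin MRTG Block -->.*?<!-- End MRTG Block -->//gs;

print "greedy keeps 'middle': ", ($greedy =~ /middle/ ? "yes" : "no"), "\n";
print "lazy keeps 'middle': ",   ($lazy   =~ /middle/ ? "yes" : "no"), "\n";
```

With a single block on the page both forms behave the same; the non-greedy form is simply the safer default when combined with 'g'.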
        Also keep in mind that '.' does not match newlines, so you may be looking for [.\n]*? instead.

        update: I should have known that I was missing something. It seemed most out of character for merlyn to miss something like that.
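The detail being retracted here is presumably the /s modifier on the earlier substitution: with /s, '.' does match newlines. Separately, inside a character class a dot is a literal dot, so [.\n] is not an "anything including newline" match. A small sketch with made-up test strings:

```perl
use strict;
use warnings;

my $text = "a\nb";

my $dot_plain = $text =~ /a.b/     ? 1 : 0;  # without /s, '.' will not match the newline
my $dot_s     = $text =~ /a.b/s    ? 1 : 0;  # with /s, '.' matches the newline too
my $class_nl  = $text =~ /a[.\n]b/ ? 1 : 0;  # the class does match a newline...
my $class_dot = "a.b" =~ /a[.\n]b/ ? 1 : 0;  # ...and a literal dot...
my $class_x   = "axb" =~ /a[.\n]b/ ? 1 : 0;  # ...but not any other character

print "dot_plain=$dot_plain dot_s=$dot_s ",
      "class_nl=$class_nl class_dot=$class_dot class_x=$class_x\n";
```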

        Ok, this is beyond the scope of the initial question, but I'm curious anyways - the /o caches the Regexp so it doesn't need to be re-compiled, correct? I'm not familiar with the innards of the interpreter, but in the event we returned to this RE, wouldn't that be an (albeit extremely minimal) optimization?
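On the /o question: as perlop describes it, /o only matters when the pattern interpolates a variable, in which case the pattern is compiled once and later changes to the variable are ignored. A pattern with no variables in it, like the one above, is compiled once regardless, so /o should make no difference there. A minimal sketch of the interpolation case, using a hypothetical loop variable `$word`:

```perl
use strict;
use warnings;

my (@hits_plain, @hits_once);

for my $word (qw(foo bar)) {
    # Without /o the pattern tracks the current value of $word.
    push @hits_plain, ("foo bar" =~ /($word)/  ? $1 : 'none');

    # With /o the pattern is fixed at its first value ("foo"),
    # so the second pass still matches "foo", not "bar".
    push @hits_once,  ("foo bar" =~ /($word)/o ? $1 : 'none');
}

print "plain: @hits_plain\n";
print "once:  @hits_once\n";
```

This "frozen pattern" effect is the usual reason /o is avoided with changing variables; qr// is the more common way to precompile a pattern.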