Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm doing a content migration (pushing the content of one set of web pages in .html format to another set of pages in .asp format), and while doing that I need to remove certain parts of the existing content. Specifically, I need to remove the following chunk of code from each old page before moving the content to the new page:
<td align="left" valign="top" bgcolor="#04047c">
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#04047c" width="116">
<!--#include virtual="/bus_nav_all_graphic.ssi"-->
</table>
<p>&nbsp; </p>
</td>
I tried searching for a row that starts with <td align=left ... bgcolor=#04047c ... and ends with </td>, but it doesn't work. Here is the code I used:
#grabbing and deleting the left nav
if (/<td align="left valign="top" bgcolor="#04047c.*?>/i ... /<\/td.*?>/i){
    # this is the line that starts the left nav
    # extract everything between the open and close td and delete it
    $leftnav_temp = $_;
    $leftnav_temp =~ s/(.*?)\<td align="left valign="top" bgcolor="#04047c\>(.*?)\<\/td\>/$2/i;
    chomp($leftnav_temp);
    $leftnav = "$leftnav_temp" ;
    $leftnav = "";
    # Write the title to the output file
    print OUTFILE $title . "\n";
    next
}

Replies are listed 'Best First'.
Enough! (was Re: deleting content)
by chromatic (Archbishop) on May 02, 2001 at 23:27 UTC
    You've posted at least six similar questions over the past few days. The highest-ranked response to your first question recommended using the HTML::TokeParser module on the CPAN.

    Other monks told you that using a regex is tricky.

    I think you've demonstrated that by now.

    Additionally, I see that good advice about style and such has been largely neglected. Replies to this thread question your additional code. Other monks besides myself gave you code that didn't blindly copy variables back and forth, only to throw away their contents.

    If you've ignored the advice so far, why should anyone bother explaining anything else?

    There's no shame in being a new programmer. There's no shame in not understanding something.

    There should be more shame in repeatedly asking the same question and then ignoring the answer.
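
For readers who want to see what the HTML::TokeParser recommendation might look like for this particular deletion, here is a minimal sketch. It assumes HTML::TokeParser is installed from CPAN and that the `<td>` tags in the page are balanced. The helper name `strip_colored_td` is hypothetical, not something from the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns $html without any <td> whose bgcolor equals
# $color, dropping everything nested inside that cell as well.
sub strip_colored_td {
    my ($html, $color) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser";
    my $out   = '';
    my $depth = 0;    # greater than 0 while inside a cell being dropped

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' && $token->[1] eq 'td') {
            if ($depth) { $depth++; next }    # nested cell inside the dropped one
            my $bg = $token->[2]{bgcolor} // '';
            if (lc $bg eq lc $color) { $depth = 1; next }
        }
        elsif ($type eq 'E' && $token->[1] eq 'td' && $depth) {
            $depth--;                         # closing a dropped or nested cell
            next;
        }
        next if $depth;                       # skip everything inside the cell
        # The original source text is the second element of a text token
        # and the last element of every other token type.
        $out .= $type eq 'T' ? $token->[1] : $token->[-1];
    }
    return $out;
}
```

Because the parser tracks start and end tags rather than matching characters, a cell that itself contains further `<td>` elements is removed as a whole, which is the case a non-greedy regex gets wrong.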

Re: deleting content
by AidanLee (Chaplain) on May 02, 2001 at 22:53 UTC

    I'm not familiar with the use of ... between two regular expressions but that could just be my ignorance. Once again I'd suggest something closer to my response to one of your previous posts:

    s|<td [^>]*bgcolor="#04047c"[^>]*>.*?</td>||gs

    From your posts, it seems you need to make many of these deletions from different parts of your file before passing the content on. To follow the same method for all your other deletions, use the s|||g pattern to get rid of the offending code: put whatever is unique about the offending code between the first two '|', and nothing between the middle and the last '|'. If you aren't sure about the case (upper/lower) of the text involved, use s|||gi instead.
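
As an illustration of the substitution approach described above, here is a minimal sketch. It assumes the whole page has been read into a single string (for example by locally undefining `$/`), since a line-by-line loop never sees the opening and closing tags together. The helper name `delete_colored_td` and the sample page are made up for the example. Note that a dot inside a character class is a literal dot, so the sketch spans newlines with a plain `.*?` under the `/s` modifier, and uses `[^>]*` to stay inside the opening tag.

```perl
use strict;
use warnings;

# Hypothetical helper: deletes every <td> cell whose opening tag carries the
# given bgcolor, up to the first </td> that follows it.
sub delete_colored_td {
    my ($content, $color) = @_;
    # \Q...\E quotes any regex metacharacters in $color.
    # /s lets . match newlines, /i ignores case, /g deletes every match.
    $content =~ s|<td\b[^>]*\bbgcolor="\Q$color\E"[^>]*>.*?</td\s*>||gis;
    return $content;
}

# Made-up sample page containing the cell from the question.
my $page = <<'HTML';
<tr>
<td align="left" valign="top" bgcolor="#04047c">
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#04047c" width="116">
<!--#include virtual="/bus_nav_all_graphic.ssi"-->
</table>
<p>&nbsp; </p>
</td>
<td>main content</td>
</tr>
HTML

print delete_colored_td($page, '#04047c');
```

The non-greedy `.*?` stops at the first `</td>`, so this only works when the cell being deleted contains no further `<td>` elements, as is the case for the chunk in the question.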