I've looked everywhere for an answer I can understand and haven't found one, so I'm hoping for something clear and concise--something I, as a total newbie, can comprehend.

I have inherited 1300 HTML files full of garbage. I want to strip out a lot of code and several divs.

I have been using perl -pi.bak -e "s|oldstuff|newstuff|g" *.html to remove small parts and lines of code with limited success--I keep running into ugly character strings that are difficult to delete and require extensive backslashing.

I can live with the painstaking removal method above, but I have run into a situation where it won't work. I want to delete a complete div, but its closing tag is not unique: it is just </div> on a line by itself. I don't want to remove all the closing div tags from the files, so I'm not sure what to do at this point.

The unwanted lines fall on the same line numbers in each file, but I'm not sure whether this will be the case later on with other divs that may need to be removed--the data in the files is similar, but not exactly the same.

Each of the unwanted divs at this point starts with <div class="topsearchbar"> and ends with </div>. They are on lines 16-25 of the file.

Any pointers in the right direction would be appreciated.


In reply to Delete multiple lines of text from a file? by Erika
