Erika has asked for the wisdom of the Perl Monks concerning the following question:

I've looked everywhere for an answer I can understand and haven't found one, so I'm hoping for something clear and concise--something I, as a total newbie, can comprehend.

I have inherited 1300 HTML files full of garbage. I want to strip out a lot of code and several divs.

I have been using perl -pi.bak -e "s|oldstuff|newstuff|g" *.html to remove small parts and lines of code with limited success--I keep running into ugly character strings that are difficult to delete and require extensive backslashing.

I can deal with the painstaking removal method above but have run into a situation where it won't work. I want to delete a complete div, but the div ends non-uniquely with just </div> on a line by itself. I don't want to remove all the closing div tags from the files, so I'm not sure what to do at this point.

The lines appear on the same line number in each file, but I'm not sure whether this will be the case later on with other divs that may need to be removed--the data in the files is similar, but not exactly the same.

Each of the unwanted divs at this point starts with <div class="topsearchbar"> and ends with </div>. They are on lines 16-25 of the file.

Any pointers in the right direction would be appreciated.

Replies are listed 'Best First'.
Re: Delete multiple lines of text from a file?
by repellent (Priest) on Feb 20, 2010 at 06:31 UTC
      Each of the unwanted divs at this point starts with <div class="topsearchbar"> and ends with </div>. They are on lines 16-25 of the file.

    If you'd like to avoid printing lines 16-25, it could be as simple as:
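    perl -i.bak -ne 'print unless 16 .. 25; close ARGV if eof' *.html
    (The "close ARGV if eof" resets the line counter $. at the end of each file; without it, the numbering continues across all the files and only the first one loses lines 16-25.)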
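    If the block does not always sit on the same line numbers, the range operator also accepts patterns in place of numbers. The sketch below is only an illustration under stated assumptions: the opening and closing tags each sit on their own line, the block contains no nested <div>, and the sample text and the helper name strip_block are made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; the real files would be read instead.
my $sample = <<'HTML';
<h1>Headline above search bar</h1>
<div class="topsearchbar">
<a href="foo.htm">foo</a>
</div>
<p>more stuff</p>
<div class="somethingelse">
<p>stuff</p>
</div>
HTML

# Hypothetical helper: return the text without the topsearchbar block.
sub strip_block {
    my ($text) = @_;
    my $kept = '';
    open my $in, '<', \$text or die "Cannot read string: $!";
    while ( my $line = <$in> ) {
        # The range is true from the opening line through the first
        # line containing </div>, so those lines are skipped.
        next if $line =~ /<div class="topsearchbar">/ .. $line =~ m{</div>};
        $kept .= $line;
    }
    close $in;
    return $kept;
}

print strip_block($sample);
```

    As a one-liner in the style used above, the same idea would read: perl -i.bak -ne 'print unless /<div class="topsearchbar">/ .. m{</div>}' *.html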

    However, if you wish to parse out specific <div> section(s), it is best to use a well-tested module:
Re: Delete multiple lines of text from a file?
by ww (Archbishop) on Feb 20, 2010 at 05:46 UTC

    First, welcome! You've set yourself a real challenge... and come to a good place for ideas.

    Here's one (others will likely offer better ways; particularly those founded on the advice that one should use a well-tested html parser rather than regexen).

    Nonetheless, in this limited case, one might consider using a match clause with minimally greedy matching (see perlretut) and a substitution clause invoked on the match. But do this ONLY IF YOU ARE CERTAIN that none of the <div class="topsearchbar"> will contain any other <div>...</div> that you might want to retain.

    #!/usr/bin/perl
    use strict;
    use warnings;
    # 824315

    my $html;
    {
        local $/;    # undefine the record separator to slurp the data (file for your application)
        $html = <DATA>;
    }

    my $delete = "<div class=\"topsearchbar\">.*?</div>";
    my $tobedeleted;

    if ( $html =~ m|($delete)|s ) {
        $tobedeleted = $1;
        print "\n---------\n \$tobedeleted: $tobedeleted\n--------\n\n";
        # above for info only; remove for production
        # \Q...\E treats the captured text literally, not as a regex
        $html =~ s|\Q$tobedeleted\E| |;
    }
    else {
        print "\n No match for class topsearchbar \n";
    }
    print $html;

    __DATA__
    <html>
    <head>
    <title>something</title>
    </head>
    <body>
    <h1>Headline above search bar</h1>
    <div class="topsearchbar">
    <a href="foo.htm">foo</a>
    <a href="bar.shtml">bar</a>
    <img src="logo.png width="240" height="110" alt="logo for xyz corp">
    <a href="baz.htm">baz</a>
    </div>
    <p>more stuff</p>
    <div class="somethingelse">
    <p>stuff</p>
    </div>
    </body>
    </html>

    Output:

    ---------
     $tobedeleted: <div class="topsearchbar">
    <a href="foo.htm">foo</a>
    <a href="bar.shtml">bar</a>
    <img src="logo.png width="240" height="110" alt="logo for xyz corp">
    <a href="baz.htm">baz</a>
    </div>
    --------

    <html>
    <head>
    <title>something</title>
    </head>
    <body>
    <h1>Headline above search bar</h1>

    <p>more stuff</p>
    <div class="somethingelse">
    <p>stuff</p>
    </div>
    </body>
    </html>
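    To see why the warning about nested divs matters, here is a small sketch with made-up input (the inner div is hypothetical and does not come from the real files). The non-greedy .*? stops at the first </div> it meets, so a nested div leaves the tail of the outer block behind.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: a topsearchbar div that contains another div.
my $nested = <<'HTML';
<div class="topsearchbar">
<div class="inner">nested content</div>
<a href="baz.htm">baz</a>
</div>
<p>more stuff</p>
HTML

# Same non-greedy pattern as in the script above.
( my $stripped = $nested ) =~ s|<div class="topsearchbar">.*?</div>| |s;

# The match ended at the inner </div>, so the baz link and the outer
# closing tag are still present in $stripped.
print $stripped;
```

    In a case like this the regex approach leaves an orphaned </div> in the file, which is the situation an HTML parser handles properly.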

    You'll want to put the names of the files you want to modify in an array and loop over it, rather than reading from __DATA__ as this example does. You'll also, of course, want to rename each original to ".bak" before saving the output under the original name, but you seem to have that well under control.

    And, of course, the print command which produces the info section (between the dashed lines) is not for production; it is here solely for illustration.
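    The loop over files described above might look something like this. It is a minimal sketch, not a finished tool: the helper name strip_topsearchbar and the script name in the usage comment are made up, it removes only the first matching block per file, and it still assumes the block contains no nested <div>.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);

# Same non-greedy pattern as above; /s lets . match newlines.
my $delete = qr{<div class="topsearchbar">.*?</div>}s;

# Hypothetical helper: copy $path to "$path.bak", then rewrite $path
# with the first topsearchbar block replaced by a single space.
# Returns 1 if a block was removed, 0 otherwise.
sub strip_topsearchbar {
    my ($path) = @_;

    copy( $path, "$path.bak" ) or die "Cannot back up $path: $!";

    open my $in, '<', $path or die "Cannot read $path: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    my $removed = ( $html =~ s/$delete/ / ) ? 1 : 0;

    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $html;
    close $out or die "Cannot close $path: $!";

    return $removed;
}

# Usage (script name is hypothetical): perl strip_div.pl *.html
for my $file (@ARGV) {
    my $changed = strip_topsearchbar($file);
    print "$file: ", ( $changed ? "removed block" : "no match" ), "\n";
}
```

    Taking the file names from @ARGV lets the shell expand *.html; on a shell that does not expand wildcards, the list would have to be built some other way.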