in reply to kill all lines that don't start with something

You should check out HTML::Parser, HTML::Element and, heck, all of the HTML modules. :)
Update: See Ovid's reply below.

Re: Re: kill all lines that don't start with something
by chicks (Scribe) on May 10, 2002 at 17:54 UTC
    I'm quite fond of that entire set of modules, but in this case I was expanding on the functionality of a sed script, so keeping with the search-and-replace model fit quite nicely with the rest of the program.

      I understand your point, but the following will break your regex:

      <TD class="foo">   # you don't allow for attributes
      <td>               # you assumed upper-case
      <TD                # it's annoying, but legal, to have a newline there
      >

      If the last example seems contrived, I can assure you that it's not. I've had the misfortune of dealing with HTML written like that :) Further, that's the example which pretty much guarantees that no tweaks to your regex will handle that case. Sad, but true.

      If it makes you feel any better, you can get an idea of the scope of the problem of using regular expressions with HTML by reading about my sordid history making the same darned mistake.

      Cheers,
      Ovid

      Update: chicks has updated the original code snippet so that my comments and those of Mr. Muskrat don't appear to make sense. I think it would have been appropriate for chicks to make note of that. The original snippet resembled the following (I can't recall it exactly):

      $content =~ s/^(?!\s*<TD>).*$//mg;
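To make the failure cases concrete, here is a minimal sketch that runs the recalled pattern, plus one loosened variant, over a few sample lines. The sample input, the loosened pattern, and the helper name `kept` are assumptions made for illustration; none of them come from the original snippet.

```perl
use strict;
use warnings;

# Hypothetical sample input: the <TD> variants discussed above, plus filler.
my $content = join "\n",
    '<TD>plain</TD>',
    '<TD class="foo">attributes</TD>',
    '<td>lower case</td>',
    '<TD',
    '>split tag</TD>',
    '<P>not a cell</P>';

# The pattern as recalled above: blank every line that does not start with <TD>.
(my $strict = $content) =~ s/^(?!\s*<TD>).*$//mg;

# An assumed tweak: ignore case and allow attributes before the closing >.
(my $loose = $content) =~ s/^(?!\s*<TD\b[^>]*>).*$//mgi;

# Return the non-empty lines that survived a substitution.
sub kept { return grep { length $_ } split /\n/, shift }

print "strict: $_\n" for kept($strict);
print "loose:  $_\n" for kept($loose);
```

With this input the recalled pattern keeps only the plain `<TD>` line. The loosened one also keeps the attribute and lower-case lines, but for the split tag it keeps the dangling `<TD` line and still blanks the line holding the `>` and the cell text, so the cell ends up mangled. The substitution is applied one line at a time, which is why a tag spread over two lines trips it up.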


        The point of my post was not meant to be HTML-related in any way. I've been down those roads too. I'm working with HTML generated by a database, and it's very consistent. I can also assure you that the work involved in doing it with the "proper" HTML tools would have far outweighed throwing a handful of regexes at it.