Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello brother monks...

I have a series of lines throughout some generated output which come in the form of:

</a>
<!-- rem_me --></li>
<li>

I planned on replacing all instances of the middle line:

<!-- rem_me --></li>

with <ul>

One would think that in my situation this would be a simple case of:

$html =~ s/<!-- rem_me --><\/li>/<ul>/sig;

But strangely this didn't work... A quick search with a hex editor shows nothing out of the ordinary... There are two 0A characters on either side of the <!-- rem_me --></li> line, which can only be attributed to the \n characters...

So what am I missing here?

Regards,

Fib Jones

Replies are listed 'Best First'.
Re: Weird situation...
by GrandFather (Saint) on Mar 19, 2008 at 23:42 UTC

    The general answer is "Parsing HTML is hard. Use a tool for it." Have a look on CPAN; there are plenty of HTML modules there. HTML::Sanitizer or HTML::Parser is likely to be the most useful in this case.


    Perl is environmentally friendly - it saves trees
Re: Weird situation...
by ww (Archbishop) on Mar 20, 2008 at 02:27 UTC
    Just checking to make sure you *REALLY* want to do that:

    Opening a new <ul> as a replacement for the comment+</li> will result in a nested list... which will be doubly indented and which will require an additional </ul> at some point.

    If you're dealing with that, separately, then you'll be fine, but if not, your results may not be what you expect, and your html will surely be ill-formed/non-compliant.
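
    A quick way to check the point above is to count opening and closing list tags after the substitution. This is only a minimal sketch: the sample markup and the helper name ul_balance are made up for illustration, and counting tags with a regex is a rough sanity check, not a validator.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number of <ul> open tags and </ul> close tags.
sub ul_balance {
    my ($html) = @_;
    my $opened = () = $html =~ /<ul\b/gi;      # count every opening <ul ...>
    my $closed = () = $html =~ /<\/ul\s*>/gi;  # count every closing </ul>
    return ($opened, $closed);
}

# Made-up sample: a one-level list containing a single marker line.
my $html = "<ul>\n<li><a href=\"#\">x</a>\n<!-- rem_me --></li>\n<li>y</li>\n</ul>\n";

# The substitution from the question: the marker and its </li> become <ul>.
$html =~ s/<!-- rem_me --><\/li>/<ul>/g;

my ($opened, $closed) = ul_balance($html);
printf "opened=%d closed=%d\n", $opened, $closed;
print "unbalanced: ", $opened - $closed, " <ul> left open\n"
    if $opened != $closed;
```

    If the counts differ, something else in the generator has to emit the matching </ul> (and the </li> that the substitution removed).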

Re: Weird situation...
by fibonacci_jones (Initiate) on Mar 19, 2008 at 23:23 UTC
    Let me try this again... the code didn't show up properly!!!!

    Hello brother monks... I have a series of lines throughout some generated output which come in the form of:


    </a>
    <!-- rem_me --></li>
    <li>

    I planned on replacing all instances of the middle line:

    <!-- rem_me --></li>

    with <ul>

    One would think that in my situation this would be a simple case of:

    $html =~ s/<!-- rem_me --><\/li>/<ul>/sig;

    But strangely this didn't work... A quick search with a hex editor shows nothing out of the ordinary... There are two 0A characters on either side of the line, which can only be attributed to the \n characters...

    So what am I missing here?

    Regards, Fib Jones

      Works for me, as I would expect it to.

      my $html = do { local $/ = undef; <DATA> };
      $html =~ s/<!-- rem_me --><\/li>/<ul>/sig;
      print($html);
      __DATA__
      </a>
      <!-- rem_me --></li>
      <li>

      </a>
      <ul>
      <li>

      What's rem_me in reality?

        It (<!-- rem_me -->) was a marker for the generated output... not all instances of:

        </a>
        </li>
        <li>
        were in need of being changed... so this marker was created in the hope that the above replacement would pick it up... Why wouldn't it work on my end? There's about 5 different instances, and there's no way I can just hardcode this directly into the page as the output is dynamic.
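
        Since the posted snippet does match, one way to narrow this down is to look at what actually surrounds the marker in the real output and to compare a strict pattern with a whitespace-tolerant one. The sketch below is only a guess at the kind of difference involved: the sample string (extra spaces, CRLF line endings) and the helper names visible and replace_marker_tolerant are assumptions for illustration, not something taken from the real generated page.

```perl
use strict;
use warnings;

# Hypothetical helper: render anything outside printable ASCII as \xNN.
sub visible {
    my ($s) = @_;
    $s =~ s/([^\x20-\x7e])/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

# Hypothetical helper: replace the marker while tolerating optional
# whitespace inside the comment and before the closing </li>.
sub replace_marker_tolerant {
    my ($html) = @_;
    my $count = ($html =~ s/<!--\s*rem_me\s*-->\s*<\/li\s*>/<ul>/gi) || 0;
    return ($html, $count);
}

# Made-up sample that differs slightly from the text shown in the question.
my $html = "</a>\r\n<!--  rem_me --> </li>\r\n<li>\n";

# s///g returns the number of substitutions made (false if none).
my $strict = (my $copy = $html) =~ s/<!-- rem_me --><\/li>/<ul>/g;
print "strict pattern replaced ", ($strict || 0), "\n";

# Show a few characters on either side of each marker, escapes visible.
while ($html =~ /(.{0,10}rem_me.{0,15})/sg) {
    print "context: ", visible($1), "\n";
}

my ($fixed, $loose) = replace_marker_tolerant($html);
print "tolerant pattern replaced $loose\n";
```

        If the strict count is zero while the tolerant one is not, the generated text is not byte-for-byte what was posted. If both are zero, it is worth checking that $html holds the generated output at the point where the substitution runs.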