in reply to html tag matching confusion

1) You could try running B::Deparse from the command line (perl -MO=Deparse yourscript.pl) to see what Perl makes of the script you are trying to use.

2) It is possible you are hitting a bug if you have an old Perl. At any rate, to stay sane, why not make up your own tag names like "\nSTART\t" and do a global replace on the data first? Then you could read the data yourself and have less trouble debugging.

3) Also, you just don't want to use dot-star. Really. Even the non-greedy ".*?" is dangerous, especially for finding things with quotes embedded in them, as that link (Ovid's) will show.
Ovid suggests a negated character class. You could also use an available HTML parser, or to begin with just strip out all the bad stuff first (you need to be sure you are not stripping good data by accident). You could also inch through the data using pos to parse a bit at a time.
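
To make the last two suggestions concrete, here is a minimal sketch, not a replacement for a real HTML parser. The sample string and the variable names are made up for illustration, and the patterns assume double-quoted attribute values only. It first contrasts ".*?" with a negated character class on a tag whose attribute value contains a ">", then inches through the same data with pos, using \G and the /gc modifiers.

```perl
use strict;
use warnings;

# Hypothetical input: the first attribute value contains a '>' inside quotes.
my $html = q{<font face="a>b" size="2">hello</font>};

# Non-greedy dot-star stops at the first '>' it meets, even inside the quotes.
my ($lazy) = $html =~ /<font(.*?)>/;

# Negated character class: an attribute value is a quote, any run of
# non-quote characters, then the closing quote.
my ($strict) = $html =~ /<font((?:\s+\w+="[^"]*")*)\s*>/;

print "lazy:   [$lazy]\n";      # [ face="a]
print "strict: [$strict]\n";    # [ face="a>b" size="2"]

# Inching through the data: \G anchors each attempt where the previous match
# ended, and /c keeps pos() in place when an attempt fails.
my @tokens;
pos($html) = 0;
while (pos($html) < length($html)) {
    if ($html =~ /\G(<[^>"]*(?:"[^"]*"[^>"]*)*>)/gc) {
        push @tokens, "TAG:$1";     # a whole tag, quoted values included
    }
    elsif ($html =~ /\G([^<]+)/gc) {
        push @tokens, "TEXT:$1";    # plain text up to the next '<'
    }
    else {
        die "cannot parse at offset " . pos($html) . "\n";
    }
}
print "$_\n" for @tokens;
```

The die branch is the useful part while debugging: when the data is malformed (an unbalanced quote, say), it reports the offset where parsing got stuck instead of silently matching the wrong thing.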

Move SIG!

Re: Re: html tag matching confusion
by jarich (Curate) on Nov 25, 2001 at 14:26 UTC
    I think that in this particular instance, .*? ought to be fine, as embedded font tags are not legal (whereas in Ovid's example, embedded double quotes are fine).

    In this case we're looking for stuff between <font ...> and </font>, so .*? works a charm. Although, if your HTML contained:

    <font ..> text <font ..> more text </font> and some </font>
    you'd get
    text <font ..> more text
    out. This would be awkward, but a negated character class won't save us either: it can exclude single characters, not a multi-character string like </font>. If it is possible that you're getting insane HTML, then you have to expect bugs in any regexp we come up with.
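
A small sketch of the nested case, with a made-up input string and hypothetical variable names. It only illustrates the point above: neither pattern recovers the full contents of the outer tag.

```perl
use strict;
use warnings;

# Hypothetical nested (illegal) input, shaped like the example above.
my $html = '<font size="2"> text <font size="3"> more text </font> and some </font>';

# Non-greedy match: ends at the first </font>, cutting the outer tag short.
my ($lazy) = $html =~ m{<font[^>]*>(.*?)</font>}s;
print "lazy:    [$lazy]\n";       # [ text <font size="3"> more text ]

# [^<]* cannot cross the inner '<', so the outer tag cannot match at all
# and the regex settles on the innermost pair instead.
my ($negated) = $html =~ m{<font[^>]*>([^<]*)</font>};
print "negated: [$negated]\n";    # [ more text ]
```

Getting " text <font ..> more text </font> and some " out would need something that counts open and close tags, which is one more argument for a real parser if nested tags can actually turn up.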