- Complication: The closing LI tag is optional.
- Simplification: Only a few tags can close an LI element, each of them always closes it, and one of them must be present. They are: </LI>, <LI>, </OL> and </UL>.
- Simplification: OL, UL and LI elements cannot be placed inside of a SPAN element.
- Complication: OL, UL and (indirectly) LI elements can be placed inside of an LI element.
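# Matches one complete UL or OL element, including any lists nested inside it.
# Declared with "our" so the deferred (??{ ... }) block can see it at match time.
our $ul_or_ol;
local $ul_or_ol = qr{
    (?:
        <ul \b
        (?:
            (??{ $ul_or_ol })        # a nested list
        |
            (?! </ul \b ) .          # or any character that does not start </ul
        )*
        </ul \b [^>]* >
    |
        <ol \b
        (?:
            (??{ $ul_or_ol })        # a nested list
        |
            (?! </ol \b ) .          # or any character that does not start </ol
        )*
        </ol \b [^>]* >
    )
}xis;   # /s added so "." also matches the newlines inside a list

# $1 captures everything after the leading SPAN up to the end of the LI.
my $re = qr{
    <li \b [^>]* >
    <span \s+ class="[^"]+" >        # \s+ because /x ignores a literal space
    [^<]+
    </span>
    (
        (?:
            $ul_or_ol
        |
            (?! < (?:li|/li|/ol|/ul) \b ) .
        )*
    )
    (?:
        </li \b [^>]* >              # explicit close
    |
        (?= < (?:li|/ol|/ul) \b )    # implied close
    )
}xis;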
- Bug/Assumption: The above can fail if there's a < or a > inside an attribute value, a comment, a script or a style.
- Bug/Assumption: The above can fail if some valid SGML constructs are used. Fortunately, no one uses them.
- Bug: Unmaintainable.
- Note: It can be simplified if more assumptions are made. It can be optimized as well.
Untested.
Update: Fixed to handle nested lists in the "MATCH HERE" portion.
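Since the pattern is untested, here is a minimal sketch of how it might be exercised. It restates the two patterns in condensed form so that it runs on its own, and it assumes two details: `\s+` between `span` and `class` (under /x a literal space in the pattern is ignored) and the /s modifier. The sample markup and the names `$html` and `@captured` are made up for illustration.

```perl
use strict;
use warnings;

# Condensed restatement of the list pattern: one UL or OL, nested lists included.
our $ul_or_ol;
$ul_or_ol = qr{
    (?:
        <ul \b (?: (??{ $ul_or_ol }) | (?! </ul \b ) . )* </ul \b [^>]* >
    |
        <ol \b (?: (??{ $ul_or_ol }) | (?! </ol \b ) . )* </ol \b [^>]* >
    )
}xis;

# Condensed restatement of the item pattern: LI, leading SPAN, then the captured rest.
my $re = qr{
    <li \b [^>]* >
    <span \s+ class="[^"]+" > [^<]+ </span>
    ( (?: $ul_or_ol | (?! < (?:li|/li|/ol|/ul) \b ) . )* )
    (?: </li \b [^>]* > | (?= < (?:li|/ol|/ul) \b ) )
}xis;

# Hypothetical sample: the first LI has a nested list and an explicit </li>,
# the second LI is closed implicitly by </ul>.
my $html = '<ul><li><span class="a">One</span> text <ul><li>inner</ul> more</li>'
         . '<li><span class="b">Two</span>tail</ul>';

my @captured;
while ($html =~ /$re/g) {
    push @captured, $1;
}

print "[$_]\n" for @captured;
# prints:
# [ text <ul><li>inner</ul> more]
# [tail]
```

The nested `<li>inner` is not reported as a separate item: it is consumed as part of the first match, and it has no leading SPAN in any case.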