in reply to question about lookaheads and threatexpert/html parsing

Don't even think of using RegExps to parse HTML. It won't work reliably.

(Sure, if you generate the HTML yourself, you can write it in a way that can be "parsed" by RegExps. But then you could simply generate the data in a format that does not need a complex parser.)

CPAN has several HTML parsers. One that is not that obvious is XML::LibXML. Its main purpose is parsing and generating XML, but it can also parse (and, to some extent, generate) HTML. It supports XPath, which makes tasks like "find all LI elements inside UL elements" easy. From there, extracting the text from the LI elements is trivial.
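A minimal sketch of that idea, untested against your actual pages. The sub name li_texts and the sample markup are made up for illustration, and I'm assuming a reasonably recent XML::LibXML that has load_html(). The XPath expression '//ul//li' matches LI elements at any depth below a UL; use '//ul/li' if you only want direct children.

```perl
use strict;
use warnings;
use XML::LibXML;

# li_texts() is a hypothetical helper, not part of XML::LibXML.
# It returns the trimmed text of every LI found inside a UL.
sub li_texts {
    my ($html) = @_;

    # recover/suppress_errors: real-world HTML is rarely well-formed,
    # so let libxml2 repair what it can and keep quiet about it.
    my $doc = XML::LibXML->load_html(
        string          => $html,
        recover         => 1,
        suppress_errors => 1,
    );

    my @texts;
    for my $li ($doc->findnodes('//ul//li')) {
        # textContent concatenates the text of all descendant nodes,
        # so inline markup such as <b> inside the LI is flattened.
        my $text = $li->textContent;
        $text =~ s/^\s+|\s+$//g;
        push @texts, $text;
    }
    return @texts;
}

# Made-up sample input; the LI inside the OL must not be reported.
my $sample = <<'HTML';
<html><body>
<ul>
  <li>first</li>
  <li>second <b>bold</b></li>
</ul>
<ol><li>not inside a UL</li></ol>
</body></html>
HTML

print "$_\n" for li_texts($sample);
```

No lookaheads, no guessing where a tag ends. The parser deals with attributes, nesting and sloppy markup, and the XPath expression only says which elements you want.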

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)