in reply to Operator for "these expressions, in any order"

I seem to always be the one doing this, but here goes:

Why are you matching HTML with regexes? It's dangerous and fraught with peril, as well as being impossible to maintain or get right. Why not use something like, oh, HTML::Parser and have it deal with the problem of figuring out what goes where, so that you just ask it "Does tag ABC have attributes X, Y, and Z?"

Or ... attack the problem another way. Either these pages are static or they're not. If they are, then read them by hand. No matter how many you have, so long as they don't change, you'll finish eventually. (If there is a very large number of them, see solution #1.)

If they're generated in some fashion, then don't examine the output, examine the generator! A quick code review with a colleague and a whiteboard will quickly tell you if you're double-generating attributes. Now, if the code is dense and impenetrable, that's a good reason to rewrite it, and in the process guarantee that this issue is a non-starter.

Now, you might have issues with the idea of HTML being embedded in the code. Get it out and use templates. HTML doesn't belong in code, and vice-versa.

Of course, this entire discussion begs the question - why aren't you using CSS?

------
We are the carpenters and bricklayers of the Information Age.

Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Re: Operator for "these expressions, in any order"
by jsalvata (Initiate) on Feb 17, 2004 at 23:14 UTC
    I thought I would get one of these. I should have answered all of this in my initial posting, but it takes long to explain and I was too lazy.

    > Why are you matching HTML with regexes?

    Because I'm only interested in very small parts of the file, and by hard experience I've learned this is the fastest way.

    Background information:

    Well, I have to admit that my question was a Perl Regexp question, but not a Perl question. The problem at hand is coding for Apache Jakarta JMeter, a Java load/performance testing application. There are several situations in which we need to analyze HTML, but it's never the whole thing, but just small bits of it. For example obtaining values in a particular hidden field to pass in a later request, or obtaining the URLs of embedded elements (images, CSSs, etc.) to download them too.
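
The hidden-field case described above could be sketched roughly as follows. This is an illustration only, not JMeter's actual code: the class name `HiddenFieldExtractor`, the method `hiddenValue`, and the sample field names are hypothetical, and the sketch assumes double-quoted attribute values with no `>` inside them.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the value of one hidden <input> out of a page
// without parsing the rest of the document.
final class HiddenFieldExtractor {
    // Matches a single <input ...> tag and captures its raw attribute text.
    // Assumption: no attribute value contains '>'.
    private static final Pattern INPUT_TAG =
        Pattern.compile("<input\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Builds a pattern for attrName="value"; group 1 is the value.
    // The lookbehind keeps "name" from matching inside "data-name".
    private static Pattern attr(String attrName, String valueRegex) {
        return Pattern.compile(
            "(?<![\\w-])" + attrName + "\\s*=\\s*\"(" + valueRegex + ")\"",
            Pattern.CASE_INSENSITIVE);
    }

    static Optional<String> hiddenValue(String html, String fieldName) {
        Pattern type = attr("type", "hidden");
        Pattern name = attr("name", Pattern.quote(fieldName));
        Pattern value = attr("value", "[^\"]*");

        Matcher tag = INPUT_TAG.matcher(html);
        while (tag.find()) {
            String attrs = tag.group(1);
            // Each attribute is checked on its own, so their order is irrelevant.
            if (!type.matcher(attrs).find() || !name.matcher(attrs).find()) {
                continue;
            }
            Matcher v = value.matcher(attrs);
            if (v.find()) {
                return Optional.of(v.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<input value=\"abc123\" name=\"token\" type=\"hidden\">";
        System.out.println(hiddenValue(html, "token").orElse("(not found)"));
    }
}
```

Checking each attribute separately against the captured tag text sidesteps the ordering problem entirely, at the cost of one extra pass per attribute.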

    Hope this answers your question on why we need to examine the output and not the generator.

    As for not doing the wrong thing upon occurrence of double attributes, I'm not too worried about that -- I currently live happily with (?:X|Y|Z){3} -- it's more of a "how would I do it?" question than a "how do I do it?" question.
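
For comparison, here is a rough sketch of the `(?:X|Y|Z){3}` idiom next to one common way of ruling out duplicates, namely one lookahead per required attribute. The attribute names `src`, `alt`, and `width` stand in for X, Y, and Z and are purely illustrative, as is the class name `AnyOrderAttrs`. Neither pattern is formally correct for arbitrary HTML; both assume double-quoted values that contain no `>`.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of "these expressions, in any order".
final class AnyOrderAttrs {
    // One attribute from the allowed set, preceded by whitespace.
    private static final String ONE_ATTR = "\\s+(?:src|alt|width)=\"[^\"]*\"";

    // The {3} idiom: any three attributes from the set, so a repeated
    // attribute (src, src, alt) is accepted as well.
    static final Pattern LOOSE =
        Pattern.compile("<img(?:" + ONE_ATTR + "){3}\\s*/?>");

    // Same body, but each lookahead demands that one particular attribute
    // appears somewhere before the closing '>'. With exactly three slots
    // and three required names, duplicates no longer fit.
    static final Pattern STRICT = Pattern.compile(
        "<img"
        + "(?=[^>]*\\ssrc=\")"
        + "(?=[^>]*\\salt=\")"
        + "(?=[^>]*\\swidth=\")"
        + "(?:" + ONE_ATTR + "){3}\\s*/?>");

    public static void main(String[] args) {
        String shuffled = "<img alt=\"a\" width=\"1\" src=\"x.png\">";
        String doubled = "<img src=\"a\" src=\"b\" alt=\"c\">";
        System.out.println(LOOSE.matcher(shuffled).matches());
        System.out.println(STRICT.matcher(shuffled).matches());
        System.out.println(LOOSE.matcher(doubled).matches());
        System.out.println(STRICT.matcher(doubled).matches());
    }
}
```

The lookaheads are zero-width, so they add the "each one at least once" constraint without consuming input; the `{3}` group still does the actual matching.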

    Still, JMeter is a test tool, and it should help you detect problems in your code (most relevantly performance problems and problems that only happen under load). Code review, as you suggest, is another way -- a complementary one, not an alternative one.

    > Why not use something like, oh, HTML::Parser ... ?

    We have implemented three alternative solutions: one based on HtmlParser, one on JTidy, and a crappy one I wrote using regexps. I am aware that the latter can never be formally correct, but it is currently the fastest of the three and, I'm proud to say, the most reliable in real-world situations so far.

    Hope I've addressed all your relevant concerns.

    Salut,

    Jordi.