Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm still attempting to narrow down the cause of this problem.

I have some code which constructs a large HTML table where I need to color adjacent cells if their content meets certain criteria. Thankfully, the entire table is constructed before being sent to STDOUT, so I can easily construct a regular expression which searches for the errant combination and colors it appropriately before displaying. The substitution is similar to the following:

$table =~ s!<td>(bad condition text 1 \[<a href=".+?">toggle</a>\])</td><td>(bad condition text 2 \[<a href=".+?">toggle</a>\])</td>!<td style="background-color:orange">$1</td><td style="background-color:orange">$2</td>!gs;
This code lives in a CGI file. The first time it executes, it colors the adjacent bad combinations correctly, but if the user clicks on specific links that cause the table to be redisplayed, then instead of coloring only adjacent bad combinations, it appears to color disjoint (non-adjacent) table elements.

My guess (which may be completely wrong) is that some state variables used by the regular expression engine are not being reinitialized upon subsequent displays of the table -- hence, why disjoint elements may be colored.

I may be completely wrong in this hypothesis, but I can't come up with an alternative explanation for the behavior seen.

Any insight you may have which explains this odd behavior would certainly be appreciated.

Replies are listed 'Best First'.
Re: initializing internal regex variables?
by moritz (Cardinal) on Nov 17, 2009 at 07:45 UTC
    My guess (which may be completely wrong) is that some state variables used by the regular expression engine are not being reinitialized upon subsequent displays of the table

    If these tables are built in different HTTP requests, and you really use CGI as you wrote (and not FastCGI or mod_perl), then each of these requests is handled in a different process. There's no way an internal variable can accidentally be persistent across two processes, unless your OS is really buggy.

    Any insight you may have which explains this odd behavior would certainly be appreciated.

    Chances are that on subsequent requests the table is different, and your regex therefore works differently.

    I recommend not building the table wrongly and then correcting it later on, but instead building it in its desired form straight away.
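A minimal sketch of that suggestion, under stated assumptions: the row data, the `is_bad_pair` predicate, and `render_row` are all hypothetical names, and the cell text is assumed to be HTML-safe already. The idea is to decide which cells are orange while the row is being built, so no regex over the finished HTML is needed.

```perl
use strict;
use warnings;

# Hypothetical data: each row is a list of cell texts (assumed HTML-safe).
my @rows = (
    [ 'ok', 'bad condition text 1', 'bad condition text 2', 'ok' ],
    [ 'bad condition text 1', 'ok', 'bad condition text 2', 'ok' ],
);

# Hypothetical predicate: do these two neighbouring cells form a bad pair?
sub is_bad_pair {
    my ($left, $right) = @_;
    return ( $left  =~ /^bad condition text 1/
          && $right =~ /^bad condition text 2/ ) ? 1 : 0;
}

# Render one row, marking both cells of every adjacent bad pair.
sub render_row {
    my ($cells) = @_;
    my %orange;
    for my $i (0 .. $#{$cells} - 1) {
        if (is_bad_pair($cells->[$i], $cells->[$i + 1])) {
            $orange{$i}     = 1;
            $orange{$i + 1} = 1;
        }
    }
    my $html = '';
    for my $i (0 .. $#{$cells}) {
        my $style = $orange{$i} ? ' style="background-color:orange"' : '';
        $html .= "<td$style>$cells->[$i]</td>";
    }
    return "<tr>$html</tr>";
}

print "<table>\n", join("\n", map { render_row($_) } @rows), "\n</table>\n";
```

Because adjacency is checked on the cell list by index, cells that are merely in the same row but not neighbours can never be colored together.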

Re: initializing internal regex variables?
by JavaFan (Canon) on Nov 17, 2009 at 10:59 UTC
    it appears that disjoint combinations (not adjacent) table elements are colored instead.
    Well, there's nothing in the regex that requires the rows to be adjacent. /".+?"/ will happily match billions and billions of rows if required.

      Your comment appears to be what I needed to hear, as changing the non-greedy match from /".*?"/ to /"[^"]*?"/ appears to work correctly. The negated character class was the trick.

      I'm still a bit confused about why there is such a difference in what is matched, but I'll think about it some more.

      Thanks for pointing me in the right direction.
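The difference can be seen in a small comparison. This is only a sketch: the cell text and href values below are made up, and the two patterns are shortened versions of the one in the question that differ only in the href part.

```perl
use strict;
use warnings;

# Hypothetical rows; the cell text and href values are made up for illustration.
my $adjacent = '<td>bad 1 [<a href="t?c=1">toggle</a>]</td>'
             . '<td>bad 2 [<a href="t?c=2">toggle</a>]</td>';
my $disjoint = '<td>bad 1 [<a href="t?c=1">toggle</a>]</td>'
             . '<td>fine [<a href="t?c=2">toggle</a>]</td>'
             . '<td>bad 2 [<a href="t?c=3">toggle</a>]</td>';

# Same pattern twice; only the way the href value is matched differs.
my $with_dot   = qr{<td>(bad 1 \[<a href=".+?">toggle</a>\])</td><td>(bad 2 \[<a href=".+?">toggle</a>\])</td>}s;
my $with_class = qr{<td>(bad 1 \[<a href="[^"]*">toggle</a>\])</td><td>(bad 2 \[<a href="[^"]*">toggle</a>\])</td>};

for my $case ([adjacent => $adjacent], [disjoint => $disjoint]) {
    my ($name, $html) = @$case;
    printf "%-8s  .+? %-8s  [^\"]* %s\n", $name,
        ($html =~ $with_dot   ? 'matches' : 'no match'),
        ($html =~ $with_class ? 'matches' : 'no match');
}
```

With `.+?` the first href is allowed to run past its closing quote and swallow the whole intervening cell, so the disjoint row still matches. `[^"]*` cannot cross a double quote, so it can only match a single href value.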

        I think you expect

        perl -e'
           $_ = qq{...\n}
              . qq{<a href="foo">foo</a>\n}
              . qq{<a href="bar">bar</a>\n}
              . qq{...\n};
           s!(<a href=")(.*?)(">bar</a>)!$1\[$2]$3!s;
           print;
        '

        to output

        ...
        <a href="foo">foo</a>
        <a href="[bar]">bar</a>
        ...

        but that's wrong. It outputs

        ...
        <a href="[foo">foo</a>
        <a href="bar]">bar</a>
        ...

        The pattern says:

        • Match the start of the string,
          • followed by as few characters as possible (implicit leading /.*?/),
            • followed by the string '<a href="',
              • followed by as few characters as possible,
                • followed by the string '">bar</a>'.

        Keeping in mind that "as few characters as possible" is zero characters, let's check if the string matches:

        • Starting at the beginning of the string,
          • Do 0 characters follow? Yes, so try to match the next atom.
            • Does the string '<a href="' follow? No, so backtrack.
          • Does 1 character follow? Yes, so try to match the next atom.
            • Does the string '<a href="' follow? No, so backtrack.
          • Do 2 characters follow? Yes, so try to match the next atom.
            • Does the string '<a href="' follow? No, so backtrack.
          • ...
          • Do 4 characters follow? Yes, so try to match the next atom.
            • Does the string '<a href="' follow? Yes, so try to match the next atom.
              • Do 0 characters follow? Yes, so try to match the next atom.
                • Does the string '">bar</a>' follow? No, so backtrack.
              • Does 1 character follow? Yes, so try to match the next atom.
                • Does the string '">bar</a>' follow? No, so backtrack.
              • Do 2 characters follow? Yes, so try to match the next atom.
                • Does the string '">bar</a>' follow? No, so backtrack.
              • ...
              • Do 25 characters follow? Yes, so try to match the next atom.
                • Does the string '">bar</a>' follow? Yes, so try to match the next atom.
                  • We have a match!
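The two numbers in the trace above (4 characters skipped, then 25 characters consumed by the non-greedy group) can be checked with a short script. It uses the same string and pattern as the one-liner; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Same input as the one-liner above.
my $s = qq{...\n}
      . qq{<a href="foo">foo</a>\n}
      . qq{<a href="bar">bar</a>\n}
      . qq{...\n};

my ($start, $captured);
if ($s =~ m!(<a href=")(.*?)(">bar</a>)!s) {
    $start    = $-[0];   # offset at which the overall match begins
    $captured = $2;      # what the non-greedy group ended up holding
    printf "match starts at offset %d\n", $start;
    printf "non-greedy group holds %d characters: %s\n",
        length $captured, $captured;
}
```

The match starts at the first `<a href="` because the engine tries start positions from left to right; the non-greedy group then grows until `">bar</a>` finally follows.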
      Okay, maybe I'm reading more into your response than I should, but here are two questions:
      • is there any difference between /".+?"/ and /".*?"/? Yes, + matches one or more of the previous pattern, and * matches zero or more of the previous pattern, but given that all strings seen in the table are more than one character in length, is there any difference since I am specifying that the pattern is non-greedy?
      • is not the regular expression originally quoted non-greedy?
      Thanks for any insight shared.
        is not the regular expression originally quoted non-greedy?
        It is. But what do you expect non-greedy to be? Some people think that non-greedy means "match as short a string as possible", without anything else. But there is just one such string, and that's the empty string.

        Non-greedy does not mean "don't match where you would match otherwise". If a pattern matches with greedy submatches, it will match with non-greedy submatches. And if a pattern doesn't match with non-greedy submatches, it will not match with greedy submatches.

        All greedy/non-greedy will do is change $&; it will not change whether or not a pattern matches.
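A small illustration of that point, as a sketch with made-up test strings: `matched_text` is a hypothetical helper that returns `$&` on success and `undef` on failure.

```perl
use strict;
use warnings;

# Hypothetical helper: the matched text ($&) on success, undef on failure.
sub matched_text {
    my ($string, $re) = @_;
    return $string =~ $re ? $& : undef;
}

my $quoted   = 'x "a" y "b" z';
my $unquoted = 'no quotes here';

my $greedy = matched_text($quoted, qr/".*"/);    # runs to the last quote
my $lazy   = matched_text($quoted, qr/".*?"/);   # stops at the next quote

print "greedy: $greedy\n";
print "lazy:   $lazy\n";

# A string that fails one variant fails the other as well.
for my $re (qr/".*"/, qr/".*?"/) {
    print "unquoted vs $re: ",
        (defined matched_text($unquoted, $re) ? 'match' : 'no match'), "\n";
}
```

Both variants match the first string and both fail on the second; only the extent of the matched text differs.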

Re: initializing internal regex variables?
by gmargo (Hermit) on Nov 17, 2009 at 13:49 UTC

    Just checking, but when

    the user clicks on specific links which causes the table to be redisplayed,
    is he firing off a JavaScript routine that runs within the browser?

      No, but the (Fast)CGI calls itself, causing the table to be redrawn.