Ninth Prince has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that fetches a series of web pages, all of which have the same format. On each page I match a regular expression as many times as it occurs. The basic form is while ($content =~ m%pattern%gs) { Do something }. Here's my problem.

The match works fine on some web pages -- putting out any and all matches. On some web pages, however, it puts out all of the matches and then "hangs". It just keeps running and running and running and never moves on to the next web page.

So, my questions are, what do you think is going on here? Is it some sort of infinite loop? Also, what can I do to debug the code?

The confusing thing for me is that it works fine on some pages and then hangs on others. Also, to my naked eye, the pages that it hangs on don't seem (overtly) any different from the ones that it doesn't hang on.

Thanks in advance for your help!

Replies are listed 'Best First'.
Re: Regex infinite loop?
by moritz (Cardinal) on Oct 16, 2008 at 18:18 UTC
    If the pattern contains nested quantifiers, it might take a very long time to match. Ages.

    Consider this script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(timethis);

    my $str = 'a' x shift;
    timethis(1, sub { $str =~ m/([abc]*[ab]*){2,12109}\d/; });

    For a string of length 12 it takes 2.4 seconds to determine that there's no match, for 13 it's 8.3 seconds, and for 14 it's 28 seconds. The time grows exponentially with the length.
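A minimal sketch for trying this on shorter strings, so that it finishes quickly. The helper name time_failed_match and the range of lengths are made up for illustration; actual timings depend on the Perl version and the machine, so none are claimed here.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: time one failed match against a string of $n 'a's.
# The pattern is the one from the script above; it can never match because
# the string contains no digit, so the engine must try every way of
# splitting the 'a's between the nested quantifiers before giving up.
sub time_failed_match {
    my ($n) = @_;
    my $str     = 'a' x $n;
    my $start   = time;
    my $matched = $str =~ m/([abc]*[ab]*){2,12109}\d/ ? 1 : 0;
    return ($matched, time - $start);
}

for my $n (6 .. 10) {
    my ($matched, $elapsed) = time_failed_match($n);
    printf "length %2d: matched=%d, %.4f s\n", $n, $matched, $elapsed;
}
```

If the elapsed time multiplies by a roughly constant factor for each extra character, the pattern is backtracking exponentially on that input.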

      I can't give you a technical answer to this, but I can tell you that I have been running this code for many weeks now. It runs once a day at the same time. When I run the program manually, it makes each match fairly quickly (a few seconds, at most).

      One thing, though. The code has been running fine for weeks, but whoever runs the website made some minor changes to the HTML. This forced me to go back and make a couple of (extremely) minor changes to the matching code. That said, I would note that the code does match, and match quickly, for probably 90% of the pages that I am pulling. It only hangs on a couple. The problem appears page-specific because when I change the order in which I pull the pages, it hangs on the same pages (regardless of where they are in my list of pages).

        Are both the regex and the HTML of the page that hangs your code a secret? I'm quite sure someone would already have found the solution if you had posted the regex and the page code (or the URL of that page).

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Regex infinite loop?
by ikegami (Patriarch) on Oct 16, 2008 at 18:03 UTC
    Something's changing pos($content)? The following will confirm or rule out that diagnosis:
    while (
        print(STDERR "[", pos($content), "]"),  # DEBUG
        $content =~ m%pattern%gs
    ) {
        ...
    }
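A self-contained sketch of the same trace, assuming made-up sample text (the real page content is not shown in this thread) and a hypothetical helper named traced_matches. It uses the defined-or operator, so it assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Made-up sample data standing in for a fetched page.
my $content = "One-two-three4five six-seven-this One-A-threeBfive six-C-this";

# Hypothetical helper: run the match loop and print pos() to STDERR
# before every match attempt, including the final failing one.
sub traced_matches {
    my ($text) = @_;
    my @rows;
    while (
        print(STDERR "[", pos($text) // 'undef', "]"),  # DEBUG: position before this attempt
        $text =~ m%One(.*?)three(.?)five\s+six(.*?)this%gs
    ) {
        push @rows, [ $1, $2, $3 ];
    }
    print STDERR "\n";
    return @rows;
}

print join("\t", @$_), "\n" for traced_matches($content);
```

In a healthy loop the bracketed positions only increase. If a position drops back to an earlier value, something in the loop body has reset pos, for example an assignment to the variable being matched, and the loop will find the same matches again.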

      I think I need to give everyone more to go on, so here it is. My matching code looks like the following.

      while (content =~ m%One(.*?)three(.?)five\s+six(.*?)this%gs) {
          $var1 = $1;
          $var2 = $2;
          $var3 = $3;
          print "$var1\t$var2\t$var3\n";
      }

      It gives me output that looks like you would expect, say, for example:

      two   four   seven

      When the code hangs, it prints out the same thing, but then nothing more.

      Now, I've made the following change in keeping with your suggestion.

      while (content =~ m%One(.*?)three(.?)five\s+six(.*?)this%gs) {
          print(STDERR "[", pos($content), "]");
          $var1 = $1;
          $var2 = $2;
          $var3 = $3;
          print "$var1\t$var2\t$var3\n";
      }

      I'm relatively new to Perl and not a professional coder, so I'm not sure what this is supposed to produce, but here's an example of what I am getting.

      [ two four seven 48430]

      On another match, however, it gives me:

      [48226] three more numbers

      When it hangs, it gives me the following.

      [50757] first three numbers second three numbers third three numbers [51826][52896]

      Does this help with understanding the problem?

        Why are you so reluctant to give us the regex (and not a similar regex), the data on which it hangs, and the rest of the program that might hang?

        That leaves us pointlessly fishing in the fog.

        "Sir, can you help me? my car is too slow" - "So, what kind of car is it?" - "a black one"

        Now, I've made the following change in keeping with your suggestion.

        No you didn't. You moved the print statement.

        Also, is that really the code you tested with?

        while (content =~ m%One(.*?)three(.?)five\s+six(.*?)this%gs) {
               ^^^^^^^

        A function that's called over and over and over? I doubt you'd get the output you got if that was the case. Since you're showing us code that has no relevance to yours, it's hard to help.
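A small sketch of how "use strict" would flag the missing "$" at compile time. The snippet is compiled inside a string eval so that the failure can be reported instead of aborting the script; the exact wording of the error may vary between Perl versions.

```perl
use strict;
use warnings;

# Compile a tiny snippet containing the same typo: "content" without the "$".
# Under "use strict" a bareword used as a value should be rejected,
# so the eval is expected to fail and leave the reason in $@.
my $ok = eval q{
    use strict;
    my $hit = content =~ m%One(.*?)three%s;
    1;
};

if ($ok) {
    print STDERR "snippet compiled; strict did not catch the bareword\n";
}
else {
    print STDERR "compile failed: $@";
}
```

Putting "use strict; use warnings;" at the top of the real program gives the same check for free.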

        If you try again, I'd switch to using the following:

        $|=1;
        while (
            print("[", pos($content), "]"),
            $content =~ m%One(.*?)three(.?)five\s+six(.*?)this%gs
        ) {
            $var1 = $1;
            $var2 = $2;
            $var3 = $3;
            print "$var1\t$var2\t$var3\n";
        }

        But an educated guess based on what I've seen leads me to think it's not a problem with your loop.

Re: Regex infinite loop?
by JavaFan (Canon) on Oct 16, 2008 at 18:17 UTC
    As you describe it, it gets all the matches and then doesn't move on to the next page. That suggests the problem isn't in while ($content =~ m%pattern%gs). It may be in "Do something". It may be in whatever you are using to get the next page. It may be a network problem. It may be a problem on the server you're fetching from. And if you haven't ruled it out already, it may be that the pattern takes a really long time to determine there's no match.

    But I have to say, you give extremely little information. We can only guess at what could be wrong with your program.

      I understand that I have given you very little to go on. I'm thinking that it's neither a network problem nor a server problem because it always hangs on the same page. Even when I change the order in which I retrieve the pages, it still hangs on the same page. So, it seems like the problem is page-specific.

      I also have a line immediately following my {Do something} block that lets me know that I've exited the match. On the pages where my code hangs, it never exits the {Do something} block.

      On the time to match part, the match happens fairly quickly whether or not there are zero matches, one match, or multiple matches.

      One thing I should probably have mentioned, but neglected to, is that I've been running this code for a while now (it runs automatically once a day). Whoever runs the website that I'm pulling from made some minor changes to the HTML, so I needed to go back and change my matching pattern. But, like I said, it matches fine on some web pages, but then hangs on others.

        Talk code. Show us the regex, and the data on which it hangs.
        But if it "hangs", does it hang in the loop? You have code showing that it exits the loop, but do you know it entered the loop? What I would do is first determine where the program "hangs": print a message before fetching a page; print a message after the page was retrieved; print a message before attempting the pattern; print a message when entering the body of the while; print a message just before exiting the body; print a message when exiting the while construct.

        And to avoid buffer problems, print those messages to STDERR.
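The checkpoints described above can be sketched as follows. Everything here is illustrative: fetch_page is a hypothetical stand-in that returns canned text instead of downloading anything, process_page is a made-up wrapper, and the URLs are placeholders.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real download step; it returns canned
# text so that the sketch runs without a network connection.
sub fetch_page {
    my ($url) = @_;
    return "One-two-three4five six-seven-this";
}

# Hypothetical wrapper: one STDERR message per checkpoint, so the last
# message printed shows where the program stopped making progress.
sub process_page {
    my ($url) = @_;
    my $count = 0;

    print STDERR "fetching $url\n";
    my $content = fetch_page($url);
    print STDERR "fetched ", length($content), " characters\n";

    print STDERR "starting match loop\n";
    while ($content =~ m%One(.*?)three(.?)five\s+six(.*?)this%gs) {
        print STDERR "  entered loop body, pos=", pos($content), "\n";
        print "$1\t$2\t$3\n";
        $count++;
        print STDERR "  leaving loop body\n";
    }
    print STDERR "left match loop for $url\n";

    return $count;
}

process_page($_) for qw(http://example.com/page1 http://example.com/page2);
```

If the last line on STDERR is "leaving loop body", the program is stuck in the match attempt that follows the final successful match; if it is "fetching ...", the download itself is the problem.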