Otogi has asked for the wisdom of the Perl Monks concerning the following question:

A script I created seems to cause a segmentation fault. Here is the problem regex:
(?:.*\n)*? ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.10\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.11\.\d*[ ]=[ ]-*\d*[ ]\(INTEG +ER\)\n)+) ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.12\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.13\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.14\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.15\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.16\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.17\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) (?:.*\n)*? ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.21\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) (?:.*\n)*? ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.25\.\d*[ ]=[ ]\d*[ ]\(INTEGER +\)\n)+) (?:.*\n)*? ((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.28\.\d*[ ]=[ ][^\n]*[ ]\(OCTE +TSTR\)\n)+)
When I run this in a simple script that loops over all the files and applies the regex to the input, it does not cause a segmentation fault. In the main script I created, however, it does. The difference between the two is that the main script gets the regex from a database and applies more regexes to more files. Changing this particular regex even a little changes where the segmentation fault occurs, and running gdb perl core shows it occurs in S_regmatch (prog=0x82fc628) at regexec.c:2312. When I remove this regex from the list of regexes I apply, my script runs perfectly well. Has anyone run across this problem before? Does anyone have any tips on how to proceed and what to look for? I tried this on numerous perl versions (5.8.3 5.8.4 5.8.5 5.8.6 5.8.7 5.8.8) and all cause the same segfault. Thanks for the help.

Replies are listed 'Best First'.
Re: regex causing segmentation fault (core dump)
by hv (Prior) on Mar 30, 2006 at 23:58 UTC

    As Tanktalus says, this is almost certainly caused by recursion going past the end of the available stack space. I expect that it is caused specifically by the very first fragment of your regexp:

    (?:.*\n)*?
    .. the first time you try to use it on a file that has a few thousand lines not matching the next part of the pattern.

    Since that fragment has no effect other than to ensure the rest of the pattern is anchored to the start of the line, I would suggest replacing:

    m/(?:.*\n)*?rest of pattern/
    with:
    m/^rest of pattern/m
    which I believe will solve your problem.
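
    A minimal sketch of that replacement, for illustration only. The sample text and the $rest pattern (one capture group borrowed from the question) are assumptions standing in for the real data and the full "rest of pattern":

```perl
use strict;
use warnings;

# Hypothetical input: many lines that do not match, then the line we want.
my $text = join('', map { "noise line $_\n" } 1 .. 5000)
         . "\t.1.3.6.1.4.1.9.2.2.1.1.10.1 = 5 (INTEGER)\n";

# Stand-in for "rest of pattern": the first capture group from the question.
my $rest = qr/((?:\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.10\.\d*[ ]=[ ]\d*[ ]\(INTEGER\)\n)+)/;

# With /m, ^ matches at the start of any line, so the engine can try each
# line start in turn instead of consuming lines through (?:.*\n)*?.
my ($block) = $text =~ /^$rest/m;

print defined $block ? "matched: $block" : "no match\n";
```

    The same idea would apply to the later (?:.*\n)*? fragments only if the lines they skip do not matter, which the question does not say.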

    Hugo

Re: regex causing segmentation fault (core dump)
by graff (Chancellor) on Mar 31, 2006 at 04:33 UTC
    I really like hv's suggestion, but I don't think he took it far enough, especially since you have additional cases of  (?:.*\n)*? later on in that monster.

    Let's see if I understand the situation here... You have a multi-line string stored as a scalar, and you are trying to capture 11 substrings. Each substring to be captured represents one whole line of text, and apart from a couple of minor variations, all these lines are identical except for the value of a two-digit number, which follows an otherwise identical initial string.

    I don't know what you need to do with all those captures once you get the match, but I expect it would not be hard to tweak the subsequent code just a little so that the overall process is a lot more compact and sensible (and doesn't crash).

    But then again, since you say you are "getting the regex from a database" and "doing more regexes against more files", the overall system must be a lot more complicated than I would expect, and maybe a simpler strategy would involve a rather large amount of refactoring. (But maybe that wouldn't be such a bad thing?)

    Still, I wonder if something like this might be a step in the right direction:

    my @captures = grep { /^\t\Q.1.3.6.1.4.1.9.2.2.1.1.\E(\d{2}).*?\([A-Z]+\)$/ and $1 =~ /1[0-7]|2[158]/ } split /\n/;
    It depends on how important it is for the regex match to be as extensive and explicit as you seem to want it to be. In other words, is your regex so long and cumbersome because there's a chance that some of these single-line patterns might occur at various points (out of sequence) throughout the input file, and you must be sure to match the full sequence in proper order?

    Or is it long and cumbersome because you just happen to be including all the details that you know about, even when they are redundant and/or unnecessary in terms of assuring correct matches?

    (update: removed spurious close paren from code snippet)

      Well, the regex is supposed to get, as you figured, a number value which represents a specific interface, and a digit or a string as the second value, depending on what the series of digits at the beginning represents. hv's suggestion still does not help: I applied it to this regex and, following the same reasoning, to the previous regexes, and it still crashed. Debugging in gdb (using bt) shows it crashes in regexec.c. I am starting to wonder if this is caused by something else entirely, but you all seem to agree it has to do with the regex. (If you have a suggestion that would conclusively prove that's the case, it would be helpful.) I am trying to apply the suggested patch, since I do not want to use the development version, but changing my regexec.c to look like the patched version gives lots of errors. Removing those errors will be a pain, if it is even possible with my limited experience.
        the regex is supposed to get... a number value which represent a specific interface and a digit or a string as the second value depends on what that series of digits represent in the beginning

        I'm not sure I get all that, but it sounds like you should just focus on simplifying the process. Split the scalar up into lines, and look for matches one line at a time. I would expect that to simplify a lot of things for you. If you have trouble with that, tell us more about what you're actually trying to accomplish, show us a little bit of input data, what you want to extract from it, etc.
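
        A rough sketch of the line-at-a-time approach, under stated assumptions: the sample data is invented to fit the regex in the question, and the names $column and $interface are only a guess at what the last two numbers of each OID mean. Unlike the original regex, this does not check that the lines appear in a particular order:

```perl
use strict;
use warnings;

# Hypothetical sample data in the format implied by the regex in the question.
my $text = <<"END";
some header line
\t.1.3.6.1.4.1.9.2.2.1.1.10.1 = 100 (INTEGER)
\t.1.3.6.1.4.1.9.2.2.1.1.10.2 = 200 (INTEGER)
\t.1.3.6.1.4.1.9.2.2.1.1.11.1 = -5 (INTEGER)
\t.1.3.6.1.4.1.9.2.2.1.1.28.1 = uplink to core (OCTETSTR)
\t.1.3.6.1.4.1.9.2.2.1.1.99.1 = 7 (INTEGER)
END

# The sub-identifiers the original regex looks for.
my %wanted = map { $_ => 1 } 10 .. 17, 21, 25, 28;

my %values;    # $values{$column}{$interface} = value

for my $line (split /\n/, $text) {
    # One small match per line: no nested quantifiers spanning the whole file.
    next unless $line =~ /^\t\.1\.3\.6\.1\.4\.1\.9\.2\.2\.1\.1\.(\d+)\.(\d+) = (.*) \((?:INTEGER|OCTETSTR)\)$/;
    my ($column, $interface, $value) = ($1, $2, $3);
    next unless $wanted{$column};
    $values{$column}{$interface} = $value;
}

# Print what was collected, in numeric order.
for my $column (sort { $a <=> $b } keys %values) {
    for my $interface (sort { $a <=> $b } keys %{ $values{$column} }) {
        print "$column.$interface = $values{$column}{$interface}\n";
    }
}
```

        If the order of the blocks matters, it could be checked afterwards on the collected data rather than inside one large regex.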

Re: regex causing segmentation fault (core dump)
by Tanktalus (Canon) on Mar 30, 2006 at 22:56 UTC

    I'd be willing to bet you just found one of those cases that would be solved by the iterative regexp matcher announced earlier today. If you're feeling adventurous, you could try getting dave_the_m's patch and backporting it to 5.8, and rebuilding your perl... ;-)

      I will try that and see what happens. Thank you.
Re: regex causing segmentation fault (core dump)
by duff (Parson) on Mar 30, 2006 at 22:58 UTC

    Simplify the RE perhaps? I think you're running afoul of the pathology mentioned in this article.

      I tried every possible version of the regex that will match, and tweaked it even where it didn't make any sense, just to see what happens. It began to look like it has cursed numbers or something. I am currently applying the code mentioned in the article to see if it fixes it.