PerlMonks  

regexp causes segfault

by Anonymous Monk
on Mar 05, 2003 at 21:25 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This regexp works fine for short strings but causes a segfault (due to stack overflow, I suppose) on long strings.
$n = 40000; $_ = "'" . "=" x $n . "'"; printf "Match %d chars\n", length($&) if /'(\\'|''|[^'])+'/;
Can anyone give me an equivalent regexp with a smaller stack requirement, one that can match up to 4M chars between the opening and closing quotes? Thanks in advance.

Replies are listed 'Best First'.
Re: regexp causes segfault
by tachyon (Chancellor) on Mar 06, 2003 at 00:46 UTC

    Do it C style using pos and substr. This is laser fast and tested to 40MB

    # proof it behaves right, uncomment to see
    # $_ = " '==\\'==' '==5==' '\\'\\'' '\\'3' '\\'' '1' '' " x 2;
    $n = 40000000;
    $_ = "'" . "=" x $n . "'";
    my @pos;
    while ( /(?<!\\)'/gc ) {
        push @pos, pos;
    }
    for ( my $i= 0; $i <@pos; $i +=2 ) {
        my $begin = $pos[$i];
        my $end = $pos[$i+1]-1;
        my $str = substr $_, $begin, ($end -$begin);
        # check what we have found using test string commented out
        #print "$begin $end '$str'\n";
        print length($str), "\n";
    }

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      tachyon,
      Should a beast of a regular expression like this be put into an EVAL block?
      Just wondering.


      Shirkdog
      That was me, forgot I was not logged in:-)

        I have no idea what you mean. All this regex does is walk the string (char by char) with a 1 char buffer (the last char). When the last char is not \\ and the current char eq ', we have a match and record the position. This is hardly a regex at all!

        Here it is completely C-ified - no regexes in sight. Possibly faster than the original post to boot but I can't be bothered to test.

        $str = " '==\\'==' '==5==' '\\'\\'' '\\'3' '\\'' '1' '' ";
        my $pos = 0;
        my $len = length $str;
        my $last = '';
        my $char;
        while ( $pos < $len ) {
            $char = substr $str, $pos, 1;
            push @pos, $pos if $char eq "'" and $last ne "\\";
            $pos++;
            $last = $char;
        }
        for ( my $i= 0; $i <@pos; $i +=2 ) {
            my $begin = $pos[$i]+1;
            my $end = $pos[$i+1];
            my $str = substr $str, $begin, ($end -$begin);
            print "$begin $end |$str|\n";
        }

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
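For comparison, the same scan can be sketched with Perl's built-in index, which jumps from quote to quote instead of looking at every character. This is only an illustration that follows the same rule as the code above (a quote counts unless the character right before it is a backslash). The sub name unescaped_quote_positions is made up for this example, and it has not been benchmarked against either version above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offsets of every single quote that is
# not immediately preceded by a backslash.
sub unescaped_quote_positions {
    my ($text) = @_;
    my @positions;
    my $from = 0;
    while ( ( my $at = index( $text, "'", $from ) ) >= 0 ) {
        my $escaped = $at > 0 && substr( $text, $at - 1, 1 ) eq "\\";
        push @positions, $at unless $escaped;
        $from = $at + 1;
    }
    return @positions;
}

my $text      = " '==\\'==' '==5==' ";
my @positions = unescaped_quote_positions($text);

# Pair the offsets up as (opening quote, closing quote) and print the
# text between them; a trailing unpaired quote is ignored.
for ( my $i = 0; $i + 1 < @positions; $i += 2 ) {
    my $begin = $positions[$i] + 1;
    my $end   = $positions[ $i + 1 ];
    print "$begin $end |", substr( $text, $begin, $end - $begin ), "|\n";
}
```

Like the versions above, this treats any backslash before a quote as an escape, so a doubled backslash right before a closing quote would still hide that quote.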

Re: regexp causes segfault
by hv (Prior) on Mar 05, 2003 at 21:37 UTC

    This will do it, but may be slow when failing:

    /'(?:[^\\']*(?:\\.|'')?)*'/

    I think the slow failure mode can be overcome with a cut operator:

    /'(?:(?>[^\\']*)(?:\\.|'')?)*'/

    Hugo
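A minimal sketch of trying the second pattern on the size asked about in the question, assuming the alternation is meant to read \\.|'' with no space inside it. The variable names and the 4,000,000-character input are made up for the test. On this input the character class swallows the whole run of = in a single step, so the outer group only needs a couple of iterations rather than one per character.

```perl
use strict;
use warnings;

# Hypothetical test input: 4,000,000 '=' characters between single quotes.
my $n    = 4_000_000;
my $text = "'" . ( "=" x $n ) . "'";

# The pattern with the cut operator (?>...) from the reply above.
my $quoted = qr/'(?:(?>[^\\']*)(?:\\.|'')?)*'/;

# Length of the match (including both quotes), or -1 if nothing matched.
my $len = $text =~ $quoted ? $+[0] - $-[0] : -1;

printf "Match %d chars\n", $len;
```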

      I should clarify that this won't match exactly the same strings as code in the original post, since it always treats a backslash as escaping the following character, so that "'\\'" would be treated as a valid quoted string of a single escaped backslash, whereas the original code would ignore the first backslash and then treat the second backslash as escaping the quote. (And then fail, and backtrack, and do the right thing anyway: "'\\''" would probably be a better example.)

      Hugo
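The difference described above can be sketched on a short string, where the original pattern is still safe to run. The input here is the five raw characters quote, backslash, backslash, quote, quote, built by concatenation to keep the escaping readable. The helper matched_text is invented for this illustration, and the second pattern is assumed to have no space in its alternation.

```perl
use strict;
use warnings;

my $original = qr/'(\\'|''|[^'])+'/;                # pattern from the question
my $unrolled = qr/'(?:(?>[^\\']*)(?:\\.|'')?)*'/;   # pattern with the cut operator

# Hypothetical helper: return the text a pattern matched, or undef.
sub matched_text {
    my ( $re, $text ) = @_;
    return $text =~ $re ? substr( $text, $-[0], $+[0] - $-[0] ) : undef;
}

my $bs    = "\\";                        # one literal backslash
my $input = "'" . $bs . $bs . "''";      # raw chars: ' \ \ ' '

my $from_original = matched_text( $original, $input );
my $from_unrolled = matched_text( $unrolled, $input );

# The original pattern reads the second backslash as escaping a quote and
# takes all five characters; the unrolled one reads the two backslashes as
# an escaped backslash and stops at the first quote after them.
printf "original: %d chars\n", length $from_original;
printf "unrolled: %d chars\n", length $from_unrolled;
```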
Re: regexp causes segfault
by pg (Canon) on Mar 06, 2003 at 02:31 UTC
    Just to add one point.

    When I tested your regexp with AS5.8.0, it didn't core dump; I guess you used an older version.

    It didn't work with AS5.8.0 either, but it failed more gracefully and gave a different message: "Complex regular subexpression recursion limit (32766) exceeded".

    This makes sense, as they need a way to:
    1. avoid dead loop (better call it dead recursion)
    2. avoid memory allocation problems

      This varies depending on platform: perl's configuration script tries to determine the right limit, but doesn't always get it right, and when the limit is too high you'll get the coredump when the real limit is hit.

      It is currently the intention to remove this limitation altogether for perl-5.10.0, by rewiring the regular expression engine to use a new internal stack (which can be grown as needed) rather than the system stack, but it isn't clear yet whether we can do that without slowing down the engine.

      Hugo
Re: regexp causes segfault
by shirkdog_perl (Beadle) on Mar 06, 2003 at 05:44 UTC
    That takes care of it Tach


    Cheers
Re: regexp causes segfault
by Weathros (Novice) on Mar 06, 2003 at 09:18 UTC
    See the escaping now... Regex probably isn't the best way to solve 4MB matches ;) If you're not careful your computer can spend a looong time trying to match ;)
Re: regexp causes segfault
by bart (Canon) on Mar 06, 2003 at 12:16 UTC
    Try non-capturing parentheses.
    /'(?:\\'|''|[^'])+'/
    It should offer some improvement.

    Update: Well, if it doesn't, it won't be much. I still get the same error messages as pg, on Windows. (Indigoperl 5.6.1, so it's not just a 5.8.0 thing.)

Node Type: perlquestion [id://240722]
Approved by Mr. Muskrat
Front-paged by Enlil