acid06 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse the Newick tree format on my own. Yes, I know there are modules on CPAN that do it. But since it's a fairly simple format, I decided to try it myself using some of the nicer regex features, in order to learn a bit more about parsing.

So there I was, really proud of my regexp, when it started segfaulting right after I had finally solved all the syntax issues.

I'm pretty sure it's a bug, since I tested it on Linux with perl 5.8.0 and now on Win32 with perl 5.8.7 (maybe someone can test it with 5.8.8, just in case). I really just want to know whether I should expect segfaults all over the place when using regexps with embedded code.
Here's the snippet:
    my $str = "((A,B,(C,D,(E,F),G),H),(I,J,K))";
    our @nodes;
    our $re;
    $re = qr{
        \(
        (?:
            (?:
                ((?> [^()]+ ))    # Non-parens without backtracking
                (?{ /./ })        # <--- any regex here segfaults
            )+
            |
            (??{ $re })           # Group with matching parens
        )*
        \)
    }x;
Lately I've been feeling somewhat frustrated with perl (the interpreter).
Every time I try to use some advanced feature, I end up with segfaults all over my face. Not really fun.

In case someone wants to suggest another way of doing it, the original code is below, verbatim.
It segfaults at the split() inside the process_children function. I think I could do without this function, but I couldn't figure out a way to put that logic into the regexp itself.

This Bio::Tree::Node module isn't the one available on CPAN, and it's irrelevant to the problem itself, so I won't post it here, but it's fairly short and simple code.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use lib '.';
    use Bio::Tree::Node;

    my $str = " \n (\n(A:0.333,B,(C,D,(E, \nF),G),H):0.456,(I,J,K):0.123) \n ";
    my $current = Bio::Tree::Node->new;
    my $last;
    our $re = qr{
        \( (?{ $current = $current->add_new_child })
        (?:
            (?:
                ((?> [^()]+ ))    # Non-parens without backtracking
                (?{ process_children($current, $^N) })
            )+
            |
            (??{ $re })           # Group with matching parens
        )*
        \)(?{ $current = $current->parent })
    }x;

    $str =~ $re;

    sub process_children {
        my ($node_obj, $str) = @_;
        my @nodes = split '\s*,\s*', $str;
        for my $node (@nodes) {
            my ($tag, $contents) = split ':', $node;
            if ($tag) {
                $node_obj->add_new_child($tag, $contents)
            }
            else {
                $node_obj->contents($contents);
            }
        }
    }
Updated: typo.


acid06
perl -e "print pack('h*', 16369646), scalar reverse $="

Replies are listed 'Best First'.
Re: Perl segfaulting when using a hairy regex
by Paladin (Vicar) on Mar 09, 2006 at 23:02 UTC
    This has to do with the fact that the regex engine is not re-entrant: you can't run another regex inside a (?{}) construct while the outer match is still in progress. Fixing this is on the TODO list for the current Perl, and the work Nicholas Clark is currently doing may be a first step towards it.
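A minimal sketch of one way to live with that restriction, assuming the goal is only to keep regex operations out of the code blocks: the blocks just record what was matched, and the split() runs after the match has finished. The @events array, the event names and the outline format are illustrative assumptions, not part of the original code.

```perl
use strict;
use warnings;

# Package variables, so the code blocks in the pattern can reach them.
our @events;
our $re;
$re = qr{
    \(  (?{ push @events, ['open'] })
    (?:
        ((?> [^()]+ ))    # non-paren run, no backtracking into it
        (?{ push @events, [ text => $^N ] })
      |
        (??{ $re })       # nested group with matching parens
    )*
    \)  (?{ push @events, ['close'] })
}x;

my $str = "((A,B,(C,D,(E,F),G),H),(I,J,K))";
$str =~ $re or die "no balanced group found\n";

# The match is over, so split (a regex operation) is safe to use here.
my $depth = 0;
my @outline;    # [ depth, label ] pairs, in input order
for my $event (@events) {
    my ($type, $text) = @$event;
    if    ($type eq 'open')  { $depth++ }
    elsif ($type eq 'close') { $depth-- }
    else {
        push @outline,
            map  { [ $depth, $_ ] }
            grep { length }
            split /\s*,\s*/, $text;
    }
}

# Print each label indented by its nesting depth.
print '  ' x $_->[0], $_->[1], "\n" for @outline;
```

One caveat with this sketch: events pushed during a match attempt that later fails are not undone, so it is only meant for well-formed input.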
Re: Perl segfaulting when using a hairy regex
by GrandFather (Saint) on Mar 09, 2006 at 22:59 UTC

    After removing the cruft, the regex our $re = qr/(??{$re})/; looks a little self-referential, and fails to compile.

    Update: read the actual question rather than just the code :(


    DWIM is Perl's answer to Gödel
Re: Perl segfaulting when using a hairy regex
by QM (Parson) on Mar 10, 2006 at 17:52 UTC
    To solve your original problem, do you need nested regexen? Since you're already in a regex context, you should be able to work up some patterns + embedded code to DWYM.
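A different route, sketched here under assumptions: drop both the recursion and the embedded code, scan the string token by token with \G, and keep the nesting on an explicit stack. The parse_newick name and the hash layout (name, length, children) are hypothetical and stand in for the poster's Bio::Tree::Node class, which isn't shown.

```perl
use strict;
use warnings;

# Hypothetical helper: parses Newick text into nested hashes of the form
# { name => ..., length => ..., children => [...] }.
sub parse_newick {
    my ($text) = @_;
    my $root  = { children => [] };    # sentinel that holds the top-level group
    my @stack = ($root);               # $stack[-1] is the innermost open group
    my $last;                          # node a following label or ":length" applies to

    while (1) {
        if ($text =~ /\G\s+/gc || $text =~ /\G;/gc) {
            next;                      # skip whitespace and a terminating ';'
        }
        elsif ($text =~ /\G\(/gc) {
            my $node = { children => [] };
            push @{ $stack[-1]{children} }, $node;
            push @stack, $node;
            undef $last;
        }
        elsif ($text =~ /\G\)/gc) {
            die "unbalanced ')'\n" if @stack == 1;
            $last = pop @stack;        # a label or length may follow the group
        }
        elsif ($text =~ /\G,/gc) {
            undef $last;
        }
        elsif ($text =~ /\G:\s*([^\s(),:;]+)/gc) {
            unless ($last) {           # unnamed leaf such as "(:0.1,B)"
                $last = { children => [] };
                push @{ $stack[-1]{children} }, $last;
            }
            $last->{length} = $1;
        }
        elsif ($text =~ /\G([^\s(),:;]+)/gc) {
            if ($last) {               # label following a closing paren
                $last->{name} = $1;
            }
            else {                     # ordinary leaf
                $last = { name => $1, children => [] };
                push @{ $stack[-1]{children} }, $last;
            }
        }
        else {
            last;                      # end of input, or something unexpected
        }
    }

    die "unbalanced '('\n" if @stack > 1;
    my $pos = pos($text) // 0;
    die "unexpected input at offset $pos\n" if $pos < length $text;
    return $root->{children}[0];
}

my $tree = parse_newick(" \n (\n(A:0.333,B,(C,D,(E, \nF),G),H):0.456,(I,J,K)) \n ");
print scalar @{ $tree->{children} }, " top-level groups\n";
```

Every regex here runs on its own, one after another, so nothing depends on the engine being re-entrant.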

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of