I'm trying to parse the Newick tree format on my own. Yes, I know there are modules on CPAN that do it. But since it's a fairly simple format, I decided to try it myself using some of Perl's nicer regexp features, in order to learn a bit more about parsing.

So there I was, really proud of my regexp, when it started segfaulting just as I had finally solved all the syntax issues.

I'm pretty sure it's a bug, since I tested it on Linux with perl 5.8.0 and now on Win32 with perl 5.8.7 (maybe someone can test it with 5.8.8, just in case). I'd really like to know whether I should expect segfaults everywhere when using regexps with embedded code.

Here's the snippet:

my $str = "((A,B,(C,D,(E,F),G),H),(I,J,K))";
our @nodes;
our $re;
$re = qr{
    \(
    (?:
        (?:
            ((?> [^()]+ ))    # Non-parens without backtracking
            (?{  /./ })    # <--- any regex here segfaults
        )+
    |
        (??{ $re })     # Group with matching parens
    )*
    \)
}x;

$str =~ $re;    # the segfault happens when the match is actually run

Lately I've been feeling somewhat frustrated with perl (the interpreter).
Every time I try to use some advanced feature, I end up with segfaults all over my face. Not really fun.

In case someone wants to suggest another way of doing it, the original code is below, verbatim.

It segfaults at the split() inside the process_children function. I think I could do without this function, but I couldn't figure out a way to put that logic into the regexp itself.

This Bio::Tree::Node module isn't the one available on CPAN, and it's irrelevant to the problem itself, so I won't post it here (it's fairly short and simple code).

#!/usr/bin/perl

use warnings;
use strict;

use lib '.';

use Bio::Tree::Node;

my $str = "  \n   (\n(A:0.333,B,(C,D,(E,  \nF),G),H):0.456,(I,J,K):0.123)    \n   ";

my $current = Bio::Tree::Node->new;
my $last;

our $re = qr{
    \(
    (?{ $current = $current->add_new_child })
    (?:
        (?:
            ((?> [^()]+ ))    # Non-parens without backtracking
            (?{ process_children($current, $^N) })
        )+
    |
        (??{ $re })     # Group with matching parens
    )*
    \)(?{ $current = $current->parent })
}x;

$str =~ $re;

sub process_children {
    my ($node_obj, $str) = @_;
    my @nodes = split '\s*,\s*', $str;
    for my $node (@nodes) {
        my ($tag, $contents) = split ':', $node;
        if ($tag) {
            $node_obj->add_new_child($tag, $contents) 
        }
        else {
            $node_obj->contents($contents);
        }
    }
}

Updated: typo.
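
One thing I may try is doing the splitting by hand, so that nothing called from inside the code block touches the regex engine. This is only a sketch, assuming the crash comes from running a regex (or split) inside (?{ ... }), which the /./ test suggests but which I haven't confirmed. The helper names split_on_char, trim_ws and parse_fields are made up for illustration, and unlike my split-based version it skips empty fields outright.

```perl
use strict;
use warnings;

# Hypothetical helper: split $str on a single literal character using only
# index() and substr(), so the regex engine is never entered.
sub split_on_char {
    my ($str, $sep) = @_;
    my @fields;
    my $start = 0;
    while ((my $pos = index($str, $sep, $start)) >= 0) {
        push @fields, substr($str, $start, $pos - $start);
        $start = $pos + 1;
    }
    push @fields, substr($str, $start);    # whatever follows the last separator
    return @fields;
}

# Hypothetical helper: strip leading and trailing whitespace without a regex.
sub trim_ws {
    my ($s) = @_;
    my ($from, $to) = (0, length $s);
    $from++ while $from < $to && index(" \t\r\n\f", substr($s, $from, 1)) >= 0;
    $to--   while $to > $from && index(" \t\r\n\f", substr($s, $to - 1, 1)) >= 0;
    return substr($s, $from, $to - $from);
}

# Turn a chunk such as "A:0.333,B," into a list of [tag, contents] pairs.
# Fields that are empty after trimming are skipped; contents is undef when
# the field has no colon.
sub parse_fields {
    my ($str) = @_;
    my @pairs;
    for my $field (split_on_char($str, ',')) {
        $field = trim_ws($field);
        next if $field eq '';
        my $colon = index($field, ':');
        my ($tag, $contents) = $colon < 0
            ? ($field, undef)
            : (substr($field, 0, $colon), substr($field, $colon + 1));
        push @pairs, [$tag, $contents];
    }
    return @pairs;
}
```

process_children would then loop over parse_fields($str) instead of calling split, keeping the same if ($tag) test to decide between add_new_child and contents.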

acid06
perl -e "print pack('h*', 16369646), scalar reverse $="

In reply to Perl segfaulting when using a hairy regex by acid06
