flaviusm has asked for the wisdom of the Perl Monks concerning the following question:

I have a string following the pattern:
"aaa. bbb ccc nr. ddd23-56" where a, b, c, d are alphanumeric

Given "(aaa. bbb ccc) [nr.] ddd(23)-56":
I want to extract whatever is between the parentheses ().
The word between [] can be present or not.

This is what I have done and I need help with:
($street, $n, $number) = ($address =~ /(.*?)(nr\.)*\b\D*(\d+)[\d\w-]+\b$/i);
The above regexp extracts $number="23" but it doesn't extract the $street.

The correct results should be:
$street == "aaa. bbb ccc"
$number == "23"

Thank you very much.
P.S. The requirement is to do it in only one regexp

Re: Need help with perl regexp
by ikegami (Patriarch) on Oct 22, 2007 at 18:32 UTC
    (.*?)          ""
    (nr\.)*        ""
    \b\D*          "aaa. bbb ccc nr. ddd"
    (\d+)          "23"
    [\d\w-]+\b$    "-56"

    It doesn't even need to backtrack, since it found a solution on the first try!
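The breakdown above can be checked directly. A minimal sketch, assuming the sample address from the question; the regex is the original one, unchanged:

```perl
use strict;
use warnings;

# Sample address from the question.
my $address = 'aaa. bbb ccc nr. ddd23-56';

# The original regex, unchanged.
my ($street, $n, $number) =
    ($address =~ /(.*?)(nr\.)*\b\D*(\d+)[\d\w-]+\b$/i);

# (.*?) is lazy and matches nothing; (nr\.)* matches zero times,
# so $n stays undefined; \D* swallows everything up to the first digit.
printf "street=[%s] n=[%s] number=[%s]\n",
    $street, defined $n ? $n : 'undef', $number;
# prints: street=[] n=[undef] number=[23]
```

Because \D* is allowed to consume the whole street, nothing forces the lazy first group to grow.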

    I'd divide the work into two steps:

    my ($street, $number) = $address =~ /^(.*?)(\d+)[\d\w-]+\b$/;
    $street =~ s/(?:nr\.)*$//g;

    The second regex is not quite sufficient, but I don't have time to perfect it right now.
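One way the cleanup step might be finished is sketched below. The first statement is taken from the post; the substitution is an assumed variant, not the poster's, and it presumes the street is followed by whitespace, an optional "nr.", and a run of letters.

```perl
use strict;
use warnings;

my $address = 'aaa. bbb ccc nr. ddd23-56';

# Step 1 (from the post): split at the first run of digits.
my ($street, $number) = $address =~ /^(.*?)(\d+)[\d\w-]+\b$/;

# Step 2 (assumed variant, not from the post): drop the trailing
# whitespace, the optional "nr.", and the letters before the number.
# [^\W\d] means "a word character that is not a digit".
$street =~ s/\s+(?:nr\.\s*)?[^\W\d]*$//i;

print "street=[$street] number=[$number]\n";
# prints: street=[aaa. bbb ccc] number=[23]
```

This uses two regexes, so it does not meet the one-regexp requirement stated in the question.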

Re: Need help with perl regexp
by mwah (Hermit) on Oct 22, 2007 at 18:43 UTC

    There is more than one possible solution depending on the "real" data structure.

    Something has been said on your regex by others, so I'll add another version (only)...

    ...
    my $address = ... ;
    my ($street, $n, $number) = $address =~ /(.+?)           # first group
                                             (nr\.\s+ | \s+) # number prefix?
                                             [A-z]+          # letters preceding number
                                             (\d+)-          # number
                                            /ix;
    print "STREET: $street, N: $n, NUMBER: $number\n"
    ...

    BTW, is this "$n" really needed?

    Regards

    mwa

      Thanks a lot. It works great. 
      
      The $n is not needed, but I didn't know how to handle the parameters if  "nr." is present or absent.
        The $n is not needed ...

        then the whole expression may be written more compactly:

        ...
        my ($street, $number) = $address =~ /(.+?)          # first group
                                             (?:\s+ nr \.)? # any number prefix?
                                             \s+ [A-z]+
                                             (\d+)          # number follows alpha
                                            /xi;
        print "STREET: $street, NUMBER: $number\n";
        ...

        (although it might backtrack more now ;-)

        Regards

        mwa

Re: Need help with perl regexp
by Krambambuli (Curate) on Oct 22, 2007 at 18:53 UTC
    Is the following doing what you wanted ?
    use warnings;
    use strict;

    while (my $address = <DATA>) {
        my ($street, $number) = ( $address =~ /
            \A                              # start of string
            ( \w{3} \. \s \w{3} \s \w{3} )  # 'aaa. bbb ccc'
            \s                              # followed by a space
            (?:nr\. \s )?                   # optional 'nr. '
            \w{3}                           # 'ddd'
            (\d+)                           # 'ddd...'
        /ix );
        print "Street: $street, Number: $number\n";
    }
    exit;

    __DATA__
    aaa. bbb ccc nr. ddd23-56


    Krambambuli
    ---
    Enjoying Mark Jason Dominus' (aka Dominus) "Higher-Order Perl"

      Thank you, Krambambuli. I don't completely understand the regexp you submitted, but it seems that the version submitted by "mwah" is more generic.

      I forgot to mention in the statement of the problem that "aaa" is in fact "a+": a given token can contain any number of characters, not just the three I used in my example.

      I appreciate the help of you all.
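The any-length requirement can be tried against the more compact pattern posted earlier in the thread. A minimal sketch: the sample addresses are hypothetical, and [a-z] stands in for the original [A-z] (the same for letters under /i, but without the punctuation characters that sit between Z and a in ASCII).

```perl
use strict;
use warnings;

# Hypothetical sample addresses: tokens of varying length,
# with and without the optional "nr." prefix.
my @addresses = (
    'aaa. bbb ccc nr. ddd23-56',
    'aaaaa. b cccc dd7-1',
    'a. bbbbbb cc nr. dddd105-2',
);

my @results;
for my $address (@addresses) {
    my ($street, $number) = $address =~ /
        (.+?)            # street: as little as possible
        (?:\s+ nr \.)?   # optional "nr." prefix
        \s+ [a-z]+       # letters directly before the number
        (\d+)            # first run of digits after those letters
    /xi;
    push @results, [$street, $number];
    print "STREET: $street, NUMBER: $number\n";
}
```

The lazy first group keeps growing until the rest of the pattern can match, so the street ends just before the optional "nr." and the final letter token, whatever the token lengths are.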

Re: Need help with perl regexp
by gamache (Friar) on Oct 22, 2007 at 18:34 UTC
    The (.*?) at the beginning of your regex is matching minimally, as requested. The problem is that you end up with a zero-length match. I'd use ^(.*?)\s+ instead, which will not only grab the street name like you intended, it will also truncate trailing spaces from it.

    The rest of your regex could use some love too, but this should get you started.

    I'd write:

    my ($street, $number) = $address=~/
        ^(.+?)            # capture street name
        \s+               # ignore following whitespace
        (?:nr\.\s*){0,1}  # ignore optional "nr." and trailing whitespace
        \w+?              # ignore alphanumerics which come before...
        (\d+)             # capture digits followed by...
        -\d+ \s* $        # ignore a hyphen, extra digits and end of line
    /x;
    (edited example to be more robust on street names like "33-44th St")
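The pattern above can be exercised on the sample address and on a street name that itself contains digits and a hyphen. A minimal sketch: parse_address is a hypothetical helper name, and the second and third inputs are made up for illustration.

```perl
use strict;
use warnings;

# parse_address is a hypothetical helper; the pattern is the one
# from the reply above. In list context it returns (street, number),
# or an empty list when the address does not match.
sub parse_address {
    my ($address) = @_;
    return $address =~ /
        ^(.+?)            # capture street name
        \s+               # ignore following whitespace
        (?:nr\.\s*){0,1}  # ignore optional "nr." and trailing whitespace
        \w+?              # ignore alphanumerics which come before...
        (\d+)             # capture digits followed by...
        -\d+ \s* $        # a hyphen, extra digits and end of line
    /x;
}

for my $address (
    'aaa. bbb ccc nr. ddd23-56',
    '33-44th St nr. ddd23-56',
    'aaa. bbb ccc ddd23-56',
) {
    my ($street, $number) = parse_address($address);
    print "street=[$street] number=[$number]\n";
}
```

Anchoring the tail with -\d+ \s* $ is what lets digits inside the street name pass through the first group untouched.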

      That doesn't help.

      ^(.*?)         "aaa."
      \s+            " "
      (nr\.)*        ""
      \b\D*          "bbb ccc nr. ddd"
      (\d+)          "23"
      [\d\w-]+\b$    "-56"
        That's not my regex that you just quoted.