jeanluca has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I have the following regexp problem:
#! /usr/bin/perl
$str = "aaaaa\nbbbbb\nccccc\naaaaa\nddddd\neeeee\n" ;
@a = $str =~ /(aaaaa.*?)/gm ;
print "$a[0]\n" ;
What I need is for $a[0] to print:
aaaaa
bbbbb
ccccc
and $a[1] to print:
aaaaa
ddddd
eeeee
I just read that '.' doesn't match a newline character... :(
Somehow I assume that this should not be too complicated, but I missed something.
Any suggestions ?

Thanks in advance
Luca

2005-12-21

Replies are listed 'Best First'.
Re: multi-line regexp
by prasadbabu (Prior) on Dec 21, 2005 at 11:07 UTC

    If I understood your question correctly, this will work.

    '.' matches newline character when you use the 's' option modifier in your regex. Also take a look at perlre.

    #! /usr/bin/perl
    $str = "aaaaa\nbbbbb\nccccc\naaaaa\nddddd\neeeee\n" ;
    @a = $str =~ /aaaaa(?:(?!aaaaa).)*/gs ;
    print "$a[0]\n$a[1]" ;

    updated: removed extra grouping.

    Thanks in advance

    Prasad

      Yes, it all makes more sense now, but your regexp is complex...
      I would like to understand what's going on there with all the ?: parts. Could you add a description of what each piece does?

      Thanks a lot
      Luca
        jeanluca, '(?:...)' groups part of a regex without storing the matched string in the capture variables $1, $2, etc. '(?!...)' is a negative look-ahead assertion: it succeeds only if the enclosed pattern does not match at that point. Take a look at perlre.

        Thanks in advance

        Prasad
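A minimal sketch of the two constructs described above, run on a made-up string (the text "foobar foobaz" and the variable names are only for illustration, they are not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, chosen only to show the constructs.
my $text = "foobar foobaz";

# Capturing group: the parenthesised part is stored in $1 and
# returned by the match in list context.
my ($captured) = $text =~ /foo(bar|baz)/;

# Non-capturing group: same grouping for the alternation, but nothing
# is stored, so a successful list-context match just returns (1).
my @no_capture = $text =~ /foo(?:bar|baz)/;

# Negative look-ahead: match "foo" only where "bar" does NOT follow.
# The look-ahead consumes no characters.
my $match_offset;
if ($text =~ /foo(?!bar)/) {
    $match_offset = $-[0];    # offset where the match started
}

print "captured:   $captured\n";
print "no_capture: @no_capture\n";
print "foo not followed by bar starts at offset $match_offset\n";
```

The first "foo" is skipped by the last pattern because "bar" follows it; the match lands on the second "foo".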

        maybe this helps:
        perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/(aaaaa(?:(?:(?!aaaaa).)*))/s)->explain'

        The regular expression:

        (?s-imx:(aaaaa(?:(?:(?!aaaaa).)*)))

        matches as follows:

        NODE                     EXPLANATION
        ----------------------------------------------------------------------
        (?s-imx:                 group, but do not capture (with . matching
                                 \n) (case-sensitive) (with ^ and $ matching
                                 normally) (matching whitespace and #
                                 normally):
        ----------------------------------------------------------------------
          (                        group and capture to \1:
        ----------------------------------------------------------------------
            aaaaa                    'aaaaa'
        ----------------------------------------------------------------------
            (?:                      group, but do not capture:
        ----------------------------------------------------------------------
              (?:                      group, but do not capture (0 or more
                                       times (matching the most amount
                                       possible)):
        ----------------------------------------------------------------------
                (?!                      look ahead to see if there is not:
        ----------------------------------------------------------------------
                  aaaaa                    'aaaaa'
        ----------------------------------------------------------------------
                )                        end of look-ahead
        ----------------------------------------------------------------------
                .                        any character
        ----------------------------------------------------------------------
              )*                       end of grouping
        ----------------------------------------------------------------------
            )                        end of grouping
        ----------------------------------------------------------------------
          )                        end of \1
        ----------------------------------------------------------------------
        )                        end of grouping
        ----------------------------------------------------------------------

        I sat all day trying to understand prasadbabu's code, then I asked the id-perl list for help.

        Someone named Jacinta Richardson explained it to me, and she said:

        /
        aaaaa           # Find me aaaaa
        (?:             # Followed by, but do not capture
          (?:           # Group but do not capture
            (?!aaaaa)   # Something which is not the start of aaaaa
            .           # and any char including newline
          )*            # As many as possible
        )
        /gsx            # Repeat the match (g), dots can include newlines (s);
                        # x is needed only to allow this commented layout

        The first grouping is unnecessary, but not a problem.

        Negative look-aheads ask the regular expression to look at the next value and only include it in the match if it does not match that part of the expression.

        Thus the regular expression finds: aaaaa\nbbbbb\nccccc\n

        in its first run, stopping just before the second "aaaaa", where the negative look-ahead fails, and then in its second run finds: aaaaa\nddddd\neeeee\n
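To see the two runs described above, here is a small sketch that walks the matches one at a time and records where each one starts and stops. It uses the string from the question and prasadbabu's pattern; the loop and the @spans array are my own additions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "aaaaa\nbbbbb\nccccc\naaaaa\nddddd\neeeee\n";

# In scalar context, each /g match resumes where the previous one ended.
# @- and @+ hold the start and end offsets of the last successful match.
my @spans;
while ($str =~ /aaaaa(?:(?!aaaaa).)*/gs) {
    push @spans, [ $-[0], $+[0] ];
    printf "match from offset %d up to %d\n", $-[0], $+[0];
}
```

The first match ends exactly where the second "aaaaa" begins, because that is the first position at which (?!aaaaa) refuses to let the dot consume another character.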

        That's what she said, and then I realized that Jacinta Richardson is known as jarich here.

        Thanks for your time, jarich, and I hope this helps jeanluca too.

      hi

      Your code works here and looks nice (at least to me). To be honest, I need more time to understand it; I wish you could explain how it works (while I am reading my notes about pattern matching).

      Anyway, I tried other ways, and so far I have made a little code like:

      $str = "aa\nbb\ncc\naa\ndd\nee\n";
      @a = ($str =~ /aa\n.*\n.*/g);
      print "1 = $a[0]\n2 = $a[1]\n";

      And the other is:

      $str = "aa\nbb\ncc\naa\ndd\nee\n";
      @a = ($str =~ /(aa.*).?(aa.*)/gs);
      print "1 = $a[0]\n2 = $a[1]\n";

      But anyway, prasadbabu's code is nicer, which is why I asked for the explanation. Or do you see something bad in my code?

      Update: pKai's code is simpler and easier for me to understand :)

      thanks, zak

        Well, this code of yours makes some very specific assumptions about the input string:
        • /aa\n.*\n.*/g assumes that every aa-line is followed by exactly 2 other lines, which have to be extracted in addition to the aa-lead.
        • /(aa.*).?(aa.*)/gs extracts exactly 2 fields beginning with aa from the string.
        The regexes with look-ahead were proposed to cover a wider range of input strings.
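A quick sketch of the point above, using a made-up input in which the first aa-record has one following line and the second has three (this input is an assumption for illustration, not data from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: records of uneven length.
my $str = "aa\nbb\naa\ncc\ndd\nee\n";

# Fixed-shape pattern: always takes the aa-line plus exactly two more lines,
# so it swallows the second "aa" as if it were ordinary data.
my @fixed = $str =~ /aa\n.*\n.*/g;

# Look-ahead pattern: takes everything up to the next "aa", however long.
my @lookahead = $str =~ /aa(?:(?!aa).)*/gs;

print "fixed shape found ", scalar(@fixed), " record(s)\n";
print "look-ahead found  ", scalar(@lookahead), " record(s)\n";
```

Note that the look-ahead version has its own assumption: a literal "aa" anywhere inside a record would also be treated as the start of a new one.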
Re: multi-line regexp
by Happy-the-monk (Canon) on Dec 21, 2005 at 11:13 UTC

    Any suggestions ?

    See in perldoc perlre on line 30 what the m-switch does.
    Without the s-switch, the dot (.) actually doesn't match a newline.

    Cheers, Sören
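A small sketch of what the two switches actually change, on a shortened version of the string from the question (the variable names are mine): /s affects what the dot matches, while /m affects where ^ and $ match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "aaaaa\nbbbbb\n";

# Without /s the dot stops at the first newline, with or without /m.
my ($plain)  = $str =~ /(aaaaa.*)/;
my ($with_m) = $str =~ /(aaaaa.*)/m;

# With /s the dot also matches "\n", so .* runs to the end of the string.
my ($with_s) = $str =~ /(aaaaa.*)/s;

# /m changes the anchors instead: ^ and $ match at every line boundary.
my @lines_m    = $str =~ /^(\w+)$/mg;
my @lines_no_m = $str =~ /^(\w+)$/g;

printf "plain: %d chars, /m: %d chars, /s: %d chars\n",
    length $plain, length $with_m, length $with_s;
printf "anchored lines with /m: %d, without /m: %d\n",
    scalar @lines_m, scalar @lines_no_m;
```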

Re: multi-line regexp
by blazar (Canon) on Dec 21, 2005 at 11:39 UTC

    Your post is slightly confusing to me, and I suggest trying to be more accurate: e.g. s/muli/multi/, and use <code> (or <c>) tags!

    If I get it right, though, you're just confusing /m with /s, which is not uncommon after all. When in doubt, always check perldoc perlre!

Re: multi-line regexp
by pKai (Priest) on Dec 21, 2005 at 17:56 UTC
    Some variation of the theme:

    /(aaaaa.*?)(?=aaaaa|$)/sg

    A little bit more straightforward, in that it avoids negative look-ahead, which also confuses me on various occasions ;-)
Re: multi-line regexp
by GrandFather (Saint) on Dec 21, 2005 at 19:50 UTC

    This uses a non-greedy (minimal) match .*? and a positive look-ahead (?=...) containing the alternation aaaaa|$ to do the job:

    #! /usr/bin/perl
    use strict;
    use warnings;

    my $str = "aaaaa\nbbbbb\nccccc\naaaaa\nddddd\neeeee\n" ;
    my @a = $str =~ /(aaaaa.*?)(?=aaaaa|$)/gs ;
    print join "\n", @a;

    Prints:

    aaaaa
    bbbbb
    ccccc

    aaaaa
    ddddd
    eeeee

    DWIM is Perl's answer to Gödel

      And I think in this case, there is no difference between:

      my @a = $str =~ /(aaaaa.*?)(?=aaaaa|$)/gs ;

      and

      my @a = $str =~ /aaaaa.*?(?=aaaaa|$)/gs ;
      right ?

        Interesting. Yes you are right and I've learned something. Thank you :)
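A short sketch confirming the observation above: in list context, a /g match with no capturing parentheses returns each whole match, so wrapping the entire consumed part in one pair of parentheses changes nothing here. The comparison code is my own addition.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "aaaaa\nbbbbb\nccccc\naaaaa\nddddd\neeeee\n";

# One capture group: list-context //g returns what the group captured.
my @with_parens    = $str =~ /(aaaaa.*?)(?=aaaaa|$)/gs;

# No capture group: list-context //g returns each whole match instead.
# The look-ahead consumes nothing, so the whole match is the same text.
my @without_parens = $str =~ /aaaaa.*?(?=aaaaa|$)/gs;

my $same = @with_parens == @without_parens
    && !grep { $with_parens[$_] ne $without_parens[$_] } 0 .. $#with_parens;

print $same ? "identical results\n" : "results differ\n";
```

One side effect worth knowing: without /m, $ also matches just before a final newline, so the last element here ends at "eeeee" with no trailing "\n".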


        DWIM is Perl's answer to Gödel