ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've got a question =) Currently, I have this code:

my @headers;
while ($post->{post_message} =~ m|\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s]+)\=\=|g) {
    my $section = get_section_code($1,$post->{post_message});
    my @subheaders;
    while ($section =~ m|^\=\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s\-]+)\=\=\=|g) {
        push @subheaders, $1;
    }
    push @headers, { Section => $1, SubLoop => \@subheaders },
}


This worked fine until I started using ===sub header=== and ====sub-sub header====

(i.e. using 3 ='s and 4 ='s)

Is there a way I can edit the regex so that it only matches when the header starts either at the *beginning* of the variable or right after a \n?

Here is some example content of $post->{post_message}:

==Introduction==
sdfsfsdfsdf dfgdgdg
===sub header===
[[testing]]
====sub-sub header====
bla bla
===sub header===
[[testing]] sub-sub header bla bla
==Brief History==
sdfsdfsdfsdf [[testing]] [[testing]]
==Geography==
little edit here
==Regions==
==Cities==
==Sights and Activites==
==Weather==
==Getting There==
==END Getting There==


TIA :)

Andy

Replies are listed 'Best First'.
Re: Regex question - either starts, or has newline prefixing?
by BioLion (Curate) on Jul 13, 2009 at 12:36 UTC

    I am not really sure what output you want. Can you give an example of what it should be?

    Do you want to capture the newline, or skip it? If you want to match whether it is there or not, you could try:

    while ($section =~ m|(?:\n)?\=\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s\-]+)\=\=\=|g) {
        push @subheaders, $1;
    }

    But this makes no sense. I think your regex may no longer be working because your $section (the example data given?) is one big string with \n's in it.
    You are doing a global match (m//g) with ^ in the pattern, but you are matching against one big string rather than line by line (where ^ would be applicable). Without the /m modifier, ^ only matches at the very start of the string, i.e. the "==Introduction==" bit, regardless of the /g modifier.

    Wow, went on a bit there. In short, your pattern is failing because you are matching against a slurp, not line by line, so the ^ breaks your pattern.
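
    A minimal sketch of both points, assuming a made-up $section string and a cut-down character class (neither is the poster's real data). It compares the ^-anchored pattern with the optional-newline variant:

```perl
use strict;
use warnings;

# Hypothetical slurped section, shaped like the sample data in the question.
my $section = "===one===\ntext\n====deeper====\n===two===\n";

# ^ without /m anchors only at the start of the whole string,
# so /g can find at most one header.
my @anchored = $section =~ /^===([a-z ]+)===/g;

# An optional newline prefix constrains nothing: the pattern can still
# start anywhere, including inside the ==== of a sub-sub header.
my @optional = $section =~ /(?:\n)?===([a-z ]+)===/g;

print "anchored: @anchored\n";
print "optional: @optional\n";
```

    With this input the anchored pattern should find only "one", while the optional-newline pattern also picks up "deeper" from the four-equals line, which is probably not what is wanted.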

    This perlfaq might help.

    Alternatively, I have totally missed the point and should pipe down.

    Just a something something...
      Hi,

      Thanks for the reply =) You got it exactly right. This seems to have done the trick:

      my @headers;
      while ($post->{post_message} =~ m|(?:\n)?\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s]+)\=\=|g) {
          my $section = get_section_code($1,$post->{post_message});
          my @subheaders;
          while ($section =~ m|\n\=\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s\-]+)\=\=\=|g) {
              push @subheaders, $1;
          }
          push @headers, { Section => $1, SubLoop => \@subheaders },
      }


      Thanks again for your time :)

      Cheers

      Andy

        No worries! Thanks for reading my impressively long-winded explanation... I tend to ramble... like I am now...

        Just a something something...
Re: Regex question - either starts, or has newline prefixing?
by johngg (Canon) on Jul 13, 2009 at 12:46 UTC

    If I've understood correctly, you might want to try a look-behind assertion. Also, you can use a quantifier, capture and back-reference for your equals signs. It looks like you need to preserve your first capture as it will be obliterated by subsequent matches. Something like (not tested)

    ...
    my $rxHeader = qr
        {(?x)
            (?: (?<=\A) | (?<=\n) )
            ([=]{2,})
            ([ your big character class ]+)
            \1
        };
    ...
    while ( $post->{ post_message } =~ m{$rxHeader}g )
    {
        my $section = get_section_code( $2 ,$post->{ post_message } );
        my @subheaders;
        while ( $section =~ m{$rxHeader}g )
        {
            push @subheaders, $2;
            ...
        }
        push @headers, { Section => $section, SubLoop => \@subheaders }
    }

    I hope I've understood correctly and this is useful.

    Cheers,

    JohnGG

    Update: Fixed typo, s/behaind/behind/

    Update 2: Fixed typo in code, s/{(?x}/{(?x)/
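
    For anyone who wants to try the idea, here is a self-contained sketch of the look-behind plus back-reference approach. It is not the code above verbatim: the character class is replaced by a simple [^=\n] stand-in, the sample text is made up, and the start-of-string case uses a plain \A instead of a look-behind. All of these are assumptions for illustration.

```perl
use strict;
use warnings;

# Simplified stand-in for the thread's big character class (an assumption).
my $rxHeader = qr{
    (?: \A | (?<=\n) )   # start of string, or just after a newline
    (={2,})              # first capture: the run of equals signs
    ([^=\n]+)            # second capture: the header text
    \1                   # the same run of equals signs again
    (?= \n | \z )        # nothing else on the line
}x;

# Hypothetical sample text.
my $text = "==Intro==\nbody\n===sub header===\n====sub-sub header====\n==History==\n";

my @found;
while ( $text =~ /$rxHeader/g ) {
    # Copy the captures at once; a later match would overwrite them.
    my ( $marks, $title ) = ( $1, $2 );
    push @found, length($marks) . ":$title";
}

print "$_\n" for @found;
```

    The length of the first capture gives the header level (2, 3 or 4), so one pattern could serve for headers, sub headers and sub-sub headers.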

      Thanks - that gave me a fatal error though:

      Sequence (?x...) not recognized in regex; marked by <-- HERE in m/(?x <-- HERE k+/

      Cheers

      Andy
        Never mind - I decided to take the slightly more long-winded route, and use:

        my @headers;
        my @tmp_split = split /\n/, $post->{post_message};
        foreach (@tmp_split) {
            chomp;
            if ($_ =~ /^\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s]+)\=\=$/) {
                my $header = $1;
                # print qq|ok, got header of $1<br />\n|;
                my $section = get_section_code($header,$post->{post_message});
                my @tmp_2 = split /\n/, $section;
                my @subheaders;
                foreach my $tmp (@tmp_2) {
                    # print qq|checking sub-section for $header, line is: ""|
                    chomp $tmp;   # chomp the loop variable, not $_
                    if ($tmp =~ /^\=\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s]+)\=\=\=$/) {
                        push @subheaders, $1;
                    }
                }
                push @headers, { Section => $header, SubLoop => \@subheaders };
                # No need to reset @subheaders or @tmp_2 here: both are fresh
                # lexicals on each pass, and assigning undef to @subheaders would
                # overwrite the array that \@subheaders in @headers points to.
            }
        }


        Seems to work fine :)

        Cheers

        Andy
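
        A possible single-pass variant of the same line-by-line idea, sketched under some assumptions: the sample text is made up, the header pattern is cut down to [^=], and sub headers are attached to the most recent == header instead of going through get_section_code.

```perl
use strict;
use warnings;

# Hypothetical sample text; the header pattern is simplified (an assumption).
my $message = join "\n",
    '==Introduction==', 'text', '===sub header===', '[[testing]]',
    '====sub-sub header====', 'bla bla', '==Brief History==', 'more text';

my @headers;
for my $line ( split /\n/, $message ) {
    if ( $line =~ /^==([^=]+)==$/ ) {
        # Exactly two equals signs on each side: a new top-level section.
        push @headers, { Section => $1, SubLoop => [] };
    }
    elsif ( @headers && $line =~ /^===([^=]+)===$/ ) {
        # Exactly three: attach to the most recent top-level section.
        push @{ $headers[-1]{SubLoop} }, $1;
    }
}

for my $h (@headers) {
    print "$h->{Section}: @{ $h->{SubLoop} }\n";
}
```

        Because [^=]+ cannot begin with an equals sign, a === or ==== line never matches the == pattern, so no extra check for the header level is needed.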

        Sorry, that's because there's a typo in my code :-(

        This

        {(?x}

        Should have been this

        {(?x)

        Cheers,

        JohnGG

Re: Regex question - either starts, or has newline prefixing?
by ikegami (Patriarch) on Jul 13, 2009 at 18:54 UTC
    I believe you want ^ with the m modifier.
    while ("abc\ndef\nghi" =~ /^(.)/mg) {
        print("$1\n");   # a,d,g
    }
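
    Applied to wiki-style headers, the same modifier might be used roughly like this. The text and the cut-down [^=\n] class are assumptions for the sketch, not the original poster's data or pattern.

```perl
use strict;
use warnings;

# Hypothetical slurped text shaped like the sample in the question.
my $text = "==Intro==\n===sub header===\nbody\n==Regions==\n====deep====\n";

# /m lets ^ and $ match at every line boundary. [^=\n]+ cannot start with
# "=", so === and ==== lines are not mistaken for == headers.
my @top = $text =~ /^==([^=\n]+)==$/mg;
my @sub = $text =~ /^===([^=\n]+)===$/mg;

print "top: @top\n";
print "sub: @sub\n";
```

    This keeps the text as one string, so no split into lines is needed.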
Re: Regex question - either starts, or has newline prefixing?
by ww (Archbishop) on Jul 13, 2009 at 23:49 UTC

    You said: or the line starts with \n.

    This makes me wonder if part of the problem is conceptual: generally it is better, IMO, to think of \n as the end of a line (and, in the case of the data you present, the end of the previous line).

    Combining that notion with the comment above re slurping, would something like this demo be simpler?

    my $flag = 0;   # TRUE when in a sub header or sub-sub header
    my @rry = <DATA>;
    for my $rry(@rry) {
        if ( $rry =~ /\A={3,4}sub.*?={3,4}$/ ) {
            $flag = 1;
            print $rry . "\n";
            next;
        }
        if ( $rry =~ /\A[^=].*/ && $flag == 1 ) {
            print $rry . "\n";
        } else {
            $flag=0;
            next;
        }
    }

    ... and it seems to work. Output with your sample as __DATA__:

    ===sub header===
    [[testing]]
    ====sub-sub header====
    bla bla
    ===sub header===
    [[testing]] sub-sub header bla bla

    For your purposes, this (slightly modified) might well be in a sub, triggered by a match on more than two "=" signs.

    Update:   Attributed quote in the first graf and added last para.