Regex question - either starts, or has newline prefixing?

ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regex question - either starts, or has newline prefixing? by BioLion (Curate) on Jul 13, 2009 at 12:36 UTC
I am not really sure what output you want. Can you give an example? Do you want to capture the newline, or skip it? If you want to match whether it is there or not, you could try: `while ($section =~ m|(?:\n)?\=\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s\-]+)\=\=\=|g) { push @subheaders, $1; }` But this makes no sense, since an optional newline does not restrict where the pattern can match. I think your regex may no longer be working because your $section (the example data given?) is one big string with \n's in it. Including ^ in a global pattern match (m//g) breaks your otherwise successful pattern, because you are not matching against your input line by line (where ^ would be applicable) but against one big string. It will therefore only match at the very start of the string, i.e. the "==Introduction==" bit, regardless of the /g modifier. Wow, went on a bit there. In short, your pattern is failing because you are matching against a slurp, not line by line, so the ^ breaks your pattern. This perlfaq might help. Alternatively, I have totally missed the point and should pipe down. Just a something something...
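To illustrate the point about `^` on a slurped string, here is a minimal sketch. The sample text and the simplified character class `[^=\n]` are assumptions made for illustration; they are not the poster's real data or pattern.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for a slurped post body.
my $text = "==Introduction==\nsome text\n==Details==\nmore text\n";

# Without /m, ^ anchors only at the very start of the string,
# so /g can find at most one match here.
my @no_m = $text =~ /^==([^=\n]+)==/g;

# Dropping the anchor finds both headers, but would also accept
# a "==...==" pair in the middle of a line.
my @unanchored = $text =~ /==([^=\n]+)==/g;

print "anchored, no /m: @no_m\n";
print "unanchored:      @unanchored\n";
```

The anchored pattern should find only the first header, while the unanchored one finds both.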
Re^2: Regex question - either starts, or has newline prefixing? by ultranerds (Hermit) on Jul 13, 2009 at 12:45 UTC
Hi, Thanks for the reply =) You got it exactly right. This seems to have done the trick: `my @headers; while ($post->{post_message} =~ m|(?:\n)?\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s]+)\=\=|g) { my $section = get_section_code($1,$post->{post_message}); my @subheaders; while ($section =~ m|\n\=\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s\-]+)\=\=\=|g) { push @subheaders, $1; } push @headers, { Section => $1, SubLoop => \@subheaders }, }` Thanks again for your time :) Cheers Andy
Re^3: Regex question - either starts, or has newline prefixing? by BioLion (Curate) on Jul 13, 2009 at 12:53 UTC
No worries! Thanks for reading my impressively long-winded explanation... I tend to ramble... like I am now... Just a something something...
Re: Regex question - either starts, or has newline prefixing? by johngg (Canon) on Jul 13, 2009 at 12:46 UTC
If I've understood correctly, you might want to try a look-behind assertion. Also, you can use a quantifier, capture and back-reference for your equals signs. It looks like you need to preserve your first capture, as it will be obliterated by subsequent matches. Something like (not tested) `... my $rxHeader = qr {(?x) (?: (?<=\A) | (?<=\n) ) ([=]{2,}) ([ your big character class ]+) \1 }; ... while ( $post->{ post_message } =~ m{$rxHeader}g ) { my $section = get_section_code( $2 ,$post->{ post_message } ); my @subheaders; while ( $section =~ m{$rxHeader}g ) { push @subheaders, $2; ... } push @headers, { Section => $section, SubLoop => \@subheaders } }` I hope I've understood correctly and this is useful. Cheers, JohnGG Update: Fixed typo, `s/behaind/behind/` Update 2: Fixed typo in code, `s/{(?x}/{(?x)/`
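A runnable sketch of the look-behind and back-reference idea might look like the following. It makes several assumptions: the sample text is invented, `[^=\n]` stands in for the big character class, `\A` is used directly in place of `(?<=\A)`, and a trailing look-ahead (not in the suggestion above) requires the header to end its line.

```perl
use strict;
use warnings;

# Hypothetical sample data; not taken from the original question.
my $text = "==Intro==\ntext\n===Sub one===\nmore\n==Next==\n";

my $rxHeader = qr{
    (?: \A | (?<=\n) )   # at the start of the string, or just after a newline
    (={2,})              # $1: the run of equals signs that opens the header
    ([^=\n]+)            # $2: the header text
    \1                   # the same run of equals signs closes it
    (?= \n | \z )        # assumed extra: the header must end the line
}x;

my @found;
while ( $text =~ /$rxHeader/g ) {
    # Copy the captures straight away; a later successful match overwrites them.
    my ( $marks, $title ) = ( $1, $2 );
    push @found, { level => length($marks), title => $title };
}

printf "%d %s\n", $_->{level}, $_->{title} for @found;
```

Recording the length of `$1` is one way to tell headers (two equals signs) from sub-headers (three) with a single pattern.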
Re^2: Regex question - either starts, or has newline prefixing? by ultranerds (Hermit) on Jul 13, 2009 at 13:41 UTC
Thanks - that gave me a fatal error though: Sequence (?x...) not recognized in regex; marked by <-- HERE in m/(?x <-- HERE k+/ Cheers Andy
Re^3: Regex question - either starts, or has newline prefixing? by ultranerds (Hermit) on Jul 13, 2009 at 13:56 UTC
Never mind - I decided to take the slightly more long-winded route, and use: `my @headers; my @tmp_split = split /\n/, $post->{post_message}; foreach (@tmp_split) { chomp; if ($_ =~ /^\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s]+)\=\=$/) { my $header = $1; # print qq|ok, got header of $1<br />\n|; my $section = get_section_code($1,$post->{post_message}); my @tmp_2 = split /\n/, $section; my @subheaders; foreach my $tmp (@tmp_2) { # print qq|checking sub-section for $header, line is: ""| chomp; if ($tmp =~ /^\=\=\=([a-z0-9_ &\[\]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûùA-Z?!;«»()"\s]+)\=\=\=$/) { push @subheaders, $1; } } push @headers, { Section => $header, SubLoop => \@subheaders }; @subheaders = undef; @tmp_2 = undef; } }` Seems to work fine :) Cheers Andy
Re^3: Regex question - either starts, or has newline prefixing? by johngg (Canon) on Jul 13, 2009 at 19:28 UTC
Sorry, that's because there's a typo in my code :-( This `{(?x}` should have been this `{(?x)` Cheers, JohnGG
Re: Regex question - either starts, or has newline prefixing? by ikegami (Patriarch) on Jul 13, 2009 at 18:54 UTC
I believe you want `^` with the `m` modifier. `while ("abc\ndef\nghi" =~ /^(.)/mg) { print("$1\n"); # a,d,g }`
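Applied to the header problem, the `m` modifier could be used roughly as follows. This is only a sketch: the sample text is made up, and `[^=\n]` is a simplified stand-in for the large character class in the question.

```perl
use strict;
use warnings;

# Hypothetical sample data; not taken from the original question.
my $text = "==Intro==\ntext\n===Sub one===\nmore\n==Next==\n";

# With /m, ^ and $ match at the start and end of every line,
# so the string can stay slurped and /g still finds each header.
my @headers    = $text =~ /^==([^=\n]+)==$/mg;
my @subheaders = $text =~ /^===([^=\n]+)===$/mg;

print "headers:    @headers\n";
print "subheaders: @subheaders\n";
```

Because the class excludes `=`, the two-sign pattern does not pick up the three-sign sub-header line.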
Re: Regex question - either starts, or has newline prefixing? by ww (Archbishop) on Jul 13, 2009 at 23:49 UTC
You said: "or the line starts with \n". This makes me wonder if part of the problem is conceptual: it is generally better, IMO, to think of `\n` as the end of a line (and, in the case of the data you present, the end of the previous line). Combining that notion with the comment above re slurping, would something like this demo be simpler? `my $flag = 0; # TRUE when in a sub header or sub-sub header my @rry = <DATA>; for my $rry (@rry) { if ( $rry =~ /\A={3,4}sub.*?={3,4}$/ ) { $flag = 1; print $rry . "\n"; next; } if ( $rry =~ /\A[^=]./ && $flag == 1 ) { print $rry . "\n"; } else { $flag=0; next; } }` ... and it seems to work. Output with your sample as `__DATA__`: `===sub header=== [[testing]] ====sub-sub header==== bla bla ===sub header=== [[testing]] sub-sub header bla bla` For your purposes, this (slightly modified) might well be in a sub, triggered by a match on more than two "=" signs. Update: Attributed quote in the first graf and added last para.