Matching n characters with m//g

medium.dave has asked for the wisdom of the Perl Monks concerning the following question:

Re: Matching n characters with m//g
by Athanasius (Archbishop) on Feb 22, 2016 at 03:44 UTC

Hello medium.dave, and welcome to the Monastery!

Within a regex, the special character ^ matches the beginning of the string,¹ but you want to match the position where the previous match left off. For that, you need \G, on which see “Assertions” in perlre#Regular-Expressions.

Update: ¹or at the beginning of the line if the regex has an /m modifier.
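To illustrate the point about \G, here is a minimal sketch. The sample string and the two-character "header" are made up for the example and are not taken from the original question. Each scalar-context /g match records where it stopped, and \G anchors the next attempt at that position:

```perl
use strict;
use warnings;

my $str = "abcdefgh";   # hypothetical sample data
my @chunks;

# Consume a 2-character header first; /g in scalar context records pos($str).
$str =~ /\Gab/g or die "header not found";

# \G anchors each attempt at pos($str), so matching resumes after the header.
while ($str =~ /\G(.{3})/gs) {
    push @chunks, $1;
}

print join('|', @chunks), "\n";   # prints "cde|fgh"
```

With ^ in place of \G, the loop body would never run, because ^ can only match at position 0 and the search now starts at position 2.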

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: Matching n characters with m//g

by medium.dave (Novice) on Feb 22, 2016 at 05:06 UTC

It still doesn't match, and I'm at a loss here. Any ideas on how to advance pos() by X characters? Then I could just use substr.

my new regex:

$m =~ /\G(.{$sz})/g;

Re^3: Matching n characters with m//g

by medium.dave (Novice) on Feb 22, 2016 at 05:08 UTC

OK, I've managed to solve this. What I now do is use substr() and advance pos() myself, like so:

my $cur_pos = pos($m);
print $FH substr($m, $cur_pos, $sz);
pos($m) = $cur_pos + $sz;

works great!
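For readers who want to try this approach in isolation, here is a self-contained sketch. The input string and the "{N}" count format are hypothetical stand-ins for the real data, and the payload is captured into a variable instead of being printed to $FH:

```perl
use strict;
use warnings;

# Hypothetical input: a "{N}" count followed by N characters of payload.
my $m = "TEXT {5}hello)rest";

# A scalar-context /g match leaves pos($m) just after the closing brace.
$m =~ /\{(\d+)\}/g or die "no count found";
my $sz = $1;

my $cur_pos = pos($m);
my $payload = substr($m, $cur_pos, $sz);   # copy exactly $sz characters
pos($m) = $cur_pos + $sz;                  # step past the payload

# A later \G match resumes right after the payload.
$m =~ /\G(.)/gs or die "nothing after payload";
my $next = $1;

print "$payload $next\n";   # prints "hello )"
```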

Re: Matching n characters with m//g
by kcott (Archbishop) on Feb 22, 2016 at 03:55 UTC

G'day medium.dave,

Welcome to the Monastery.

Braces are special in regexes: .{32760} is a quantifier that matches 32,760 characters; it does not match the literal text "{32760}". See perlre: Quantifiers for more complete details.

What you need to do is escape the braces. Something like this:

$ perl -wE 'my $x = q[X{123}X]; say $x; $x =~ /([{]123[}])/; say $1'
X{123}X
{123}

Although you haven't shown the complete context of your code, the caret (^) in your regex looks dubious. Perhaps take a step back and read "perlretut - Perl regular expressions tutorial".

— Ken

Re^2: Matching n characters with m//g

by medium.dave (Novice) on Feb 22, 2016 at 04:25 UTC

Thanks Ken. I am indeed attempting to extract that many characters, so the use of the curly brackets in this instance is correct. Thanks for following up though, appreciated!

Re^3: Matching n characters with m//g

by kcott (Archbishop) on Feb 22, 2016 at 06:05 UTC

OK, it looks like I confused "{32760}" and "{234565}"; however, I don't see "{234565}" anywhere in your code, although you state:

"... when I come to the part "RFC822.TEXT {234565}" I now need to read exactly 234565 from the string."

The string data you posted is multi-line (i.e. it contains newlines) so you'll need the 's' modifier if you want '.' to match newlines as well as other characters (see perlre: Modifiers). Here's a rough example of what I think you're looking for:

$ perl -wE 'my $x = qq{abcd\nefgh\nijkl}; say ">$x<"; say ">$1<" while $x =~ /(.{3})/gs'
>abcd
efgh
ijkl<
>abc<
>d
e<
>fgh<
>
ij<

[Note how I've wrapped each string in angle brackets, so that you can see not just the start and end of each string, but also the placement of newlines within them.]

The caret (in your original code) still looks dubious. If it isn't, and you want to match the start of lines (in a multi-line string), you'll need the 'm' modifier (see perlre: Modifiers). Also, if you want to match the start of the entire string, use '\A' (see perlre: Assertions).
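As a rough illustration of the difference between the three anchors just mentioned (the sample text is invented for the example):

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree";   # hypothetical multi-line sample

my @plain = $text =~ /^(\w+)/g;    # ^ matches only at the start of the string
my @multi = $text =~ /^(\w+)/mg;   # with /m, ^ also matches after each newline
my @abs   = $text =~ /\A(\w+)/mg;  # \A is the start of the string, even under /m

print scalar(@plain), " ", scalar(@multi), " ", scalar(@abs), "\n";   # prints "1 3 1"
```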

— Ken

Re: Matching n characters with m//g
by AnomalousMonk (Archbishop) on Feb 22, 2016 at 08:13 UTC

I'm not sure just where you're going with this, but it's possible to bite off more than 32,766 repetitions in one chunk; you just have to grind on it a bit:

c:\@Work\Perl\monks>perl -wMstrict -le
"use constant K => 234_565;
 ;;
 my $s = 'xxx' . 'a' x K . 'yyy';
 print substr $s, 0, 10;
 print substr $s, -10;
 print length $s;
 ;;
 use constant MAX => 32_760;
 ;;
 use integer;
 my $n = K / MAX;  print $n;
 my $m = K % MAX;  print $m;
 ;;
 my ($t) = $s =~ m{ xxx ((?: .{${ \MAX }}){$n} .{$m}) yyy }xms;
 print substr $t, 0, 10;
 print substr $t, -10;
 print length $t;
"
xxxaaaaaaa
aaaaaaayyy
234571
7
5245
aaaaaaaaaa
aaaaaaaaaa
234565

$s

my $s = 'xxx' . 'a' x K . 'a' . 'yyy';

You can use the m//gc modifier pairing (the /c modifier is the new one here) to parse through your text to locate and extract the large number that is the count of the block of characters following. (Update: Please see Using regular expressions in Perl in perlretut for further explanation of the /g and /c modifiers.) If you can get to the point in your string at which you can extract the large number of characters you need next, you can figure the quantifier counts necessary to do a "compound counted quantifier".
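A small sketch of what /c changes, using a made-up input string (the field names only loosely mirror the ones in the question): a failed /gc match leaves pos() where it was, so the next alternative can be tried from the same spot, whereas a failed plain /g match resets pos().

```perl
use strict;
use warnings;

my $s = "UID 42 FLAGS seen";   # hypothetical sample input

$s =~ /\GUID\s+(\d+)\s*/gc or die "no UID";
my $uid       = $1;
my $after_uid = pos($s);

# This pattern does not match here; because of /c, pos($s) is left untouched.
my $matched_rfc = ($s =~ /\GRFC/gc) ? 1 : 0;
my $after_fail  = pos($s);

$s =~ /\GFLAGS\s+(\w+)/gc or die "no FLAGS";
my $flags = $1;

# Without /c, a failed /g match resets pos($s) to undef.
$s =~ /\GRFC/g;
my $reset = defined pos($s) ? "kept" : "reset";

print "$uid $flags $after_uid $after_fail $reset\n";   # prints "42 seen 7 7 reset"
```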

That said, it might be better to go the route of an honest-to-goodness parser instead.

Give a man a fish: <%-{-{-{-<

Re: Matching n characters with m//g
by AnomalousMonk (Archbishop) on Feb 22, 2016 at 08:52 UTC

Here's something to illustrate perhaps a little better what I had in mind:

c:\@Work\Perl\monks>perl -wMstrict -le
"use constant MAX => 32_760;
 ;;
 my $s = 'FLAGS xyzzy UID 42 RFC {234565} ' . 'a' x 234565 . 'Y yyyy';
 print substr $s, 0, 50;
 print substr $s, -50;
 print length $s;
 ;;
 my %piece;
 ;;
 PARSE: {
   if ($s =~ m{ \G RFC \s* \{ (\d+) \} \s* }xmsgc) {
     my $total = $1;
     use integer;
     my $n = $total / MAX;
     my $m = $total % MAX;
     $s =~ m{ \G ((?: .{${ \MAX }}){$n} .{$m}) \s* }xmsg or die 'no count';
     $piece{rfc} = $1;
     redo PARSE;
     }
   elsif ($s =~ m{ \G \s* Y \s* (\w+) \s* }xmsgc) {
     $piece{y} = $1;
     redo PARSE;
     }
   elsif ($s =~ m{ \G UID \s* (\d+) \s* }xmsgc) {
     $piece{uid} = $1;
     redo PARSE;
     }
   elsif ($s =~ m{ \G \s* FLAGS \s* (\w+) \s* }xmsgc) {
     $piece{flags} = $1;
     redo PARSE;
     }
   else {
     last PARSE;
     }
   }
 ;;
 print qq{flags '$piece{flags}'  uid '$piece{uid}'  y '$piece{y}'};
 printf qq{start rfc: '%s' \n}, substr $piece{rfc}, 0, 30;
 printf qq{  end rfc: '%s' \n}, substr $piece{rfc}, -30;
 print 'length rfc: ', length $piece{rfc};
"
FLAGS xyzzy UID 42 RFC {234565} aaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaY yyyy
234603
flags 'xyzzy'  uid '42'  y 'yyyy'
start rfc: 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
  end rfc: 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
length rfc: 234565

Give a man a fish: <%-{-{-{-<

Re: Matching n characters with m//g
by ikegami (Patriarch) on Feb 24, 2016 at 12:11 UTC

substr is so much simpler here.

while( $sz > 32760 ) {
   print $FH substr($m, pos($m), 32760);
   pos($m) += 32760;
   $sz -= 32760;
}

print $FH substr($m, pos($m), $sz);
pos($m) += $sz;

Or if you don't mind destroying $m,

while( $sz > 32760 ) {
   print $FH substr($m, 0, 32760, '');
   $sz -= 32760;
}

print $FH substr($m, 0, $sz, '');

But I fail to see how those are any different than just

print $FH substr($m, pos($m), $sz);
pos($m) += $sz;

and

print $FH substr($m, 0, $sz, '');
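For anyone unfamiliar with the four-argument form of substr used above, here is a minimal sketch with made-up data. It captures the chunk in a variable instead of printing it to $FH:

```perl
use strict;
use warnings;

my $m  = "hello world";   # hypothetical sample data
my $sz = 5;

# Four-argument substr returns the extracted text and replaces it in place;
# replacing with '' removes the chunk from the front of $m.
my $chunk = substr($m, 0, $sz, '');

print "[$chunk][$m]\n";   # prints "[hello][ world]"
```

Because $m shrinks from the front each time, there is no pos() to keep track of, which is the trade-off for destroying the original string.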

Re: Matching n characters with m//g
by Anonymous Monk on Feb 22, 2016 at 14:58 UTC

One minor point that I have not seen mentioned (or maybe I was just inattentive) is the fact that if you want '.' to truly match any character you need to specify the /s modifier on the regular expression. Without this, it will not match a newline.