medium.dave has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I have a "really long string" with many parts to it (something like an IMAP response that needs to be manually processed). So it has sections like:

FLAGS (\Seen) UID 42 RFC822.TEXT {234565} this is a bunch of string data, but it also contains some binary data...it's not actually IMAP code, it's an internal app that works similarly, but a lot more people are more familiar with IMAP... ..snip.. a lot more string RFC822.SIZE 234565 INTERNALDATE " 21 FEB 2016 12:23:23 +1000")
So, I need to work through each part of the string, which I am doing just fine. However, when I come to the part "RFC822.TEXT {234565}" I now need to read exactly 234565 from the string. I am processing this with the m//g operator, so each call to m//g starts matching where the previous match left off. Because a counted quantifier is limited to 32766 repetitions, the code I have to work through the string is the following:
while( $sz > 32760 ) {
    $m =~ /(^.{32760})/g;
    print $FH $1;
    $sz -= 32760;
}
$m =~ /(^.{$sz})/g;
print $FH $1;

...except the m/(^.{32760})/g isn't matching, and the string pos() is being reset.

Any ideas why? Your input would be gratefully accepted, I've been banging away at this with various incarnations for more hours than I care to admit :-(

Replies are listed 'Best First'.
Re: Matching n characters with m//g
by Athanasius (Archbishop) on Feb 22, 2016 at 03:44 UTC

    Hello medium.dave, and welcome to the Monastery!

    Within a regex, the special character ^ matches the beginning of the string,[1] but you want to match the position where the previous match left off. For that, you need \G, on which see “Assertions” in perlre#Regular-Expressions.

    Update: [1] or at the beginning of the line if the regex has an /m modifier.
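A minimal sketch of the difference, assuming a made-up sample string and a hypothetical helper named anchor_demo (neither comes from the original post). It consumes a header with //g, shows that a ^-anchored match then fails and resets pos(), and shows that \G picks up where the previous match stopped:

```perl
use strict;
use warnings;

# Returns (caret_matched, pos_was_reset, chunk, new_pos) for a sample string.
sub anchor_demo {
    my $str = 'HEAD:abcdefgh';    # hypothetical sample data

    # Consume the header with /g so pos($str) is left just after it.
    $str =~ /\AHEAD:/g or die 'header not found';
    my $after_header = pos $str;

    # ^ anchors at the start of the string, not at pos(), so this fails.
    # A failed //g match without /c also resets pos() to undef.
    my $caret_matched = ( $str =~ /^(.{3})/g ) ? 1 : 0;
    my $pos_was_reset = defined pos($str) ? 0 : 1;

    # Restore the position and anchor with \G instead.
    pos($str) = $after_header;
    $str =~ /\G(.{3})/gs or die 'match at \G failed';
    my $chunk = $1;

    return ( $caret_matched, $pos_was_reset, $chunk, pos $str );
}

my ( $caret, $reset, $chunk, $pos ) = anchor_demo();
print "caret matched: $caret, pos reset: $reset, chunk: $chunk, pos: $pos\n";
```

The \G match is done in scalar context on purpose: in list context //g would keep matching until it fails, and that final failure would reset pos() again.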

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      It still doesn't match. I'm at a loss here. Any ideas how to advance pos() by X characters? Then I could just use substr.

      my new regex:

      $m =~ /\G(.{$sz})/g;

        OK, I've managed to solve this, what I now do is use substr() and advance the position of pos(), like so:

        my $cur_pos = pos($m);
        print $FH substr($m, $cur_pos, $sz);
        pos($m) = $cur_pos + $sz;

        works great!
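For anyone landing here later, a rough sketch of how that substr()/pos() approach might sit inside a larger m//gc parse. The helper name take_literal and the sample buffer are assumptions made up for illustration; they are not the original poster's code or data format:

```perl
use strict;
use warnings;

# Hypothetical helper: with pos() sitting on a "{N}" marker, return the N
# characters that follow it and leave pos() just past them.
sub take_literal {
    my ($ref) = @_;                              # reference to the buffer
    $$ref =~ /\G\{(\d+)\}[ ]?/gc or return;      # /c keeps pos() on failure
    my $count = $1;
    my $start = pos $$ref;
    return if $start + $count > length $$ref;    # not enough data in buffer
    my $data = substr $$ref, $start, $count;     # no quantifier limit here
    pos($$ref) = $start + $count;                # advance past the block
    return $data;
}

# Made-up sample buffer; the counted block contains a newline.
my $buf = "UID 42 RFC822.TEXT {11} hello\nworld RFC822.SIZE 11";

$buf =~ /\GUID (\d+) RFC822\.TEXT /gc or die 'unexpected header';
my $uid  = $1;
my $text = take_literal( \$buf ) // die 'no counted block';
$buf =~ /\G RFC822\.SIZE (\d+)/gc or die 'unexpected trailer';

print "uid=$uid size=$1 text length=", length($text), "\n";
```

Because substr() takes the length as a plain number, there is no need to split the read into chunks, and newlines or binary bytes in the block need no special handling.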

Re: Matching n characters with m//g
by kcott (Archbishop) on Feb 22, 2016 at 03:55 UTC

    G'day medium.dave,

    Welcome to the Monastery.

    The braces are special in regexes. .{32760} matches any 32,760 characters; it does not match the literal text {32760}. See perlre: Quantifiers for more complete details.

    What you need to do is escape the braces. Something like this:

    $ perl -wE 'my $x = q[X{123}X]; say $x; $x =~ /([{]123[}])/; say $1'
    X{123}X
    {123}

    Although you haven't shown the complete context of your code, the caret (^) in your regex looks dubious. Perhaps take a step back and read "perlretut - Perl regular expressions tutorial".

    — Ken

      Thanks Ken, I am indeed attempting to extract that many characters, so the utilisation of the curly brackets in this instance is correct. Thanks for following up though, appreciated!

        OK, it looks like I confused "{32760}" and "{234565}"; however, I don't see "{234565}" anywhere in your code, although you state:

        "... when I come to the part "RFC822.TEXT {234565}" I now need to read exactly 234565 from the string."

        The string data you posted is multi-line (i.e. it contains newlines) so you'll need the 's' modifier if you want '.' to match newlines as well as other characters (see perlre: Modifiers). Here's a rough example of what I think you're looking for:

        $ perl -wE 'my $x = qq{abcd\nefgh\nijkl}; say ">$x<"; say ">$1<" while $x =~ /(.{3})/gs'
        >abcd
        efgh
        ijkl<
        >abc<
        >d
        e<
        >fgh<
        >
        ij<

        [Note how I've wrapped each string in angle brackets, so that you can see not just the start and end of each string, but also the placement of newlines within them.]

        The caret (in your original code) still looks dubious. If it isn't, and you want to match the start of lines (in a multi-line string), you'll need the 'm' modifier (see perlre: Modifiers). Also, if you want to match the start of the entire string, use '\A' (see perlre: Assertions).
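A small sketch of how those three anchors differ, reusing the sample string from the one-liner above. The helper count_matches is a hypothetical name introduced only for this illustration:

```perl
use strict;
use warnings;

# Count how many times a pattern matches in a string.
sub count_matches {
    my ( $re, $string ) = @_;
    my $count = () = $string =~ /$re/g;
    return $count;
}

my $text = "abcd\nefgh\nijkl";

print count_matches( qr/^\w/,   $text ), "\n";    # ^ without /m: start of string only
print count_matches( qr/^\w/m,  $text ), "\n";    # ^ with /m: start of every line
print count_matches( qr/\A\w/m, $text ), "\n";    # \A: start of string, even with /m
```

This should print 1, 3 and 1. None of these anchors tracks pos(), which is why a ^ in a loop of m//g matches looks suspicious.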

        — Ken

Re: Matching n characters with m//g
by AnomalousMonk (Archbishop) on Feb 22, 2016 at 08:13 UTC

    I'm not sure just where you're going with this, but it's possible to bite off more than 32,766 repetitions in one chunk, you just have to grind on it a bit:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use constant K => 234_565;
     ;;
     my $s = 'xxx' . 'a' x K . 'yyy';
     print substr $s, 0, 10;
     print substr $s, -10;
     print length $s;
     ;;
     use constant MAX => 32_760;
     ;;
     use integer;
     my $n = K / MAX;
     print $n;
     my $m = K % MAX;
     print $m;
     ;;
     my ($t) = $s =~ m{ xxx ((?: .{${ \MAX }}){$n} .{$m}) yyy }xms;
     print substr $t, 0, 10;
     print substr $t, -10;
     print length $t;
    "
    xxxaaaaaaa
    aaaaaaayyy
    234571
    7
    5245
    aaaaaaaaaa
    aaaaaaaaaa
    234565

    You can test this match by making the  $s string
        my $s = 'xxx' . 'a' x K . 'a' . 'yyy';
    instead; the match will fail.

    You can use the  m//gc modifier pairing (the  /c modifier is the new one here) to parse through your text to locate and extract the large number that is the count of the block of characters following. (Update: Please see Using regular expressions in Perl in perlretut for further explanation of the  /g and  /c modifiers.) If you can get to the point in your string at which you can extract the large number of characters you need next, you can figure the quantifier counts necessary to do a "compound counted quantifier".
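The "compound counted quantifier" arithmetic could be wrapped up roughly like this. The helper name exactly_n_chars and the constant name MAX_REPEAT are made up for this sketch; the chunk size of 32,760 is simply the one used earlier in the thread:

```perl
use strict;
use warnings;

use constant MAX_REPEAT => 32_760;    # chunk size used in this thread

# Hypothetical helper: build a pattern that matches exactly $count
# characters (newlines included) as full-size chunks plus a remainder.
sub exactly_n_chars {
    my ($count) = @_;
    my $full = int( $count / MAX_REPEAT );    # number of full chunks
    my $rest = $count % MAX_REPEAT;           # leftover characters
    my $max  = MAX_REPEAT;
    return qr/(?:.{$max}){$full}.{$rest}/s;
}

my $s  = 'xxx' . 'a' x 234_565 . 'yyy';
my $re = exactly_n_chars(234_565);

my ($block) = $s =~ /xxx($re)yyy/;
print defined $block ? length $block : 'no match', "\n";
```

For 234,565 characters this builds seven chunks of 32,760 plus a remainder of 5,245, the same numbers the one-liner above prints.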

    That said, it might be better to go the route of an honest-to-goodness parser instead.


    Give a man a fish:  <%-{-{-{-<

Re: Matching n characters with m//g
by AnomalousMonk (Archbishop) on Feb 22, 2016 at 08:52 UTC

    Here's something to illustrate perhaps a little better what I had in mind:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use constant MAX => 32_760;
     ;;
     my $s = 'FLAGS xyzzy UID 42 RFC {234565} ' . 'a' x 234565 . 'Y yyyy';
     print substr $s, 0, 50;
     print substr $s, -50;
     print length $s;
     ;;
     my %piece;
     ;;
     PARSE: {
       if ($s =~ m{ \G RFC \s* \{ (\d+) \} \s* }xmsgc) {
         my $total = $1;
         use integer;
         my $n = $total / MAX;
         my $m = $total % MAX;
         $s =~ m{ \G ((?: .{${ \MAX }}){$n} .{$m}) \s* }xmsg or die 'no count';
         $piece{rfc} = $1;
         redo PARSE;
         }
       elsif ($s =~ m{ \G \s* Y \s* (\w+) \s* }xmsgc) {
         $piece{y} = $1;
         redo PARSE;
         }
       elsif ($s =~ m{ \G UID \s* (\d+) \s* }xmsgc) {
         $piece{uid} = $1;
         redo PARSE;
         }
       elsif ($s =~ m{ \G \s* FLAGS \s* (\w+) \s* }xmsgc) {
         $piece{flags} = $1;
         redo PARSE;
         }
       else {
         last PARSE;
         }
       }
     ;;
     print qq{flags '$piece{flags}' uid '$piece{uid}' y '$piece{y}'};
     printf qq{start rfc: '%s' \n}, substr $piece{rfc}, 0, 30;
     printf qq{ end rfc: '%s' \n}, substr $piece{rfc}, -30;
     print 'length rfc: ', length $piece{rfc};
    "
    FLAGS xyzzy UID 42 RFC {234565} aaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaY yyyy
    234603
    flags 'xyzzy' uid '42' y 'yyyy'
    start rfc: 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
     end rfc: 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
    length rfc: 234565


    Give a man a fish:  <%-{-{-{-<

Re: Matching n characters with m//g
by ikegami (Patriarch) on Feb 24, 2016 at 12:11 UTC

    substr is so much simpler here.

    while( $sz > 32760 ) {
        print $FH substr($m, pos($m), 32760);
        pos($m) += 32760;
        $sz -= 32760;
    }
    print $FH substr($m, pos($m), $sz);
    pos($m) += $sz;

    Or if you don't mind destroying $m,

    while( $sz > 32760 ) {
        print $FH substr($m, 0, 32760, '');
        $sz -= 32760;
    }
    print $FH substr($m, 0, $sz, '');

    But I fail to see how those are any different than just

    print $FH substr($m, pos($m), $sz);
    pos($m) += $sz;

    and

    print $FH substr($m, 0, $sz, '');
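A self-contained sketch of the two variants, wrapped in hypothetical helpers (read_at_pos and read_and_remove are names invented here, and the sample buffers are made up) so the effect on the buffer is easy to see:

```perl
use strict;
use warnings;

# Non-destructive: copy $len characters starting at pos() and advance pos().
sub read_at_pos {
    my ( $ref, $len ) = @_;
    my $start = pos($$ref) // 0;
    my $data  = substr $$ref, $start, $len;
    pos($$ref) = $start + length $data;
    return $data;
}

# Destructive: four-argument substr removes the leading characters from
# the buffer and returns them.
sub read_and_remove {
    my ( $ref, $len ) = @_;
    return substr $$ref, 0, $len, '';
}

my $buf = 'HEADpayloadTAIL';
pos($buf) = 4;                                # pretend "HEAD" was already parsed
my $kept = read_at_pos( \$buf, 7 );
print "$kept, pos is now ", pos($buf), ", buffer untouched: $buf\n";

my $copy    = 'payloadTAIL';
my $removed = read_and_remove( \$copy, 7 );
print "$removed, remaining buffer: $copy\n";
```

The first form leaves the buffer intact for further \G matching; the second trades that for a buffer that shrinks as it is consumed.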

Re: Matching n characters with m//g
by Anonymous Monk on Feb 22, 2016 at 14:58 UTC

    One minor point that I have not seen mentioned (or maybe I was just inattentive) is the fact that if you want '.' to truly match any character you need to specify the /s modifier on the regular expression. Without this, it will not match a newline.