Regex to find the last matching set in a long scalar

superwombat has asked for the wisdom of the Perl Monks concerning the following question:

So... I've been working on this for a couple days now and I'm getting nowhere. I'm a bit of a perl novice, but here's the situation. I'm trying to parse a massive scalar that contains a log file. I need to build an expression to extract the last set of data, which will be marked with a header and footer. I've figured out how to get there by converting the scalar to an array, going through it line by line until I find my matches and verify that they are the last ones etc... but it winds up taking a lot of processing time to do it that way. I'm hoping that the proper regular expression will be able to find my result without having to munch through the entire file so many times.

Here's the example code I've been working with to test my regular expression.

my $string = "
start:
end:

test
code
start: real
1
end: real
with
start: real
repeating newlines
and more than
start: real
one

instance
end: real
of the 
start: real
desired
string


end: real


start:
end:";

if ($string =~/(start: real)((.|\n)*(?!start: )(.|\n)*)*(end: real)/){
    print "$&";
    }

The goal is it should match

start: real
desired
string


end: real

and nothing else, but right now it matches from the first "start: real" to the last "end: real".

I can see how to make the expression non-greedy, which would make it possible to capture only the first set, but I don't know how to capture only the last one.


Re: Regex to find the last matching set in a long scalar
by choroba (Cardinal) on Nov 15, 2014 at 08:37 UTC

I can see how to make the expression non-greedy, which would make it possible to capture only the first set

reverse

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


Re^2: Regex to find the last matching set in a long scalar

by superwombat (Acolyte) on Nov 15, 2014 at 08:44 UTC

Maybe. I don't know how to reverse a scalar. I could convert it to an array and reverse that, which is my current (slow) method. My end goal is to avoid having to manipulate this giant scalar as much as possible, and I was hoping that with the proper regex I could extract the data I want with as little handling of the large file as possible.

Sadly, because of the way I'm retrieving this file, I can't just read it from a filehandle; it has to be a single scalar, so elegant solutions like File::ReadBackwards don't work in my case either.


Re^3: Regex to find the last matching set in a long scalar

by Corion (Patriarch) on Nov 15, 2014 at 09:27 UTC

choroba linked to the reverse function. Maybe you want to read the link?
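
For readers following along: in scalar context, reverse returns a string with its characters in the opposite order, so no array conversion is needed. Below is a minimal sketch of the idea, assuming the header and footer each sit on a line of their own. The sample data and variable names are illustrative only, and reversing still copies the whole string once.

```perl
use strict;
use warnings;

# Illustrative sample data, not the original log.
my $string = "start: real\nfirst\nend: real\nnoise\nstart: real\nlast\nend: real\nstart:\nend:";

my $header = 'start: real';
my $footer = 'end: real';

# Assigning to a scalar puts reverse in scalar context: characters are flipped.
my $rev     = reverse $string;
my $rev_hdr = reverse $header;
my $rev_ftr = reverse $footer;

# Escape regex metacharacters in the reversed markers.
my $foot_re = quotemeta $rev_ftr;
my $head_re = quotemeta $rev_hdr;

my $last_block;
# In the reversed text the footer comes first, so the leftmost non-greedy
# match from reversed footer to reversed header is the last block.
if ($rev =~ /^($foot_re$(?:.*?)^$head_re)$/ms) {
    $last_block = reverse $1;    # flip the match back to normal order
}

print "$last_block\n" if defined $last_block;
```

The non-greedy quantifier is what the original question already knew how to use for the first block; reversing the input simply turns "last" into "first".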


Re^4: Regex to find the last matching set in a long scalar

by superwombat (Acolyte) on Nov 15, 2014 at 09:43 UTC

Re: Regex to find the last matching set in a long scalar
by Athanasius (Archbishop) on Nov 15, 2014 at 08:47 UTC

Hello superwombat,

If you precede the expression to match with the greedy match-all .*, the match you get will be the last one:

#! perl
use strict;
use warnings;

my $string = '
start:
end:

test
code
start: real
1
end: real
with
start: real
repeating newlines
and more than
start: real
one

instance
end: real
of the 
start: real
desired
string


end: real


start:
end:';

my $header = 'start: real';
my $footer = 'end: real';

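# The leading greedy .* consumes as much of the string as it can, so the
# captured header..footer block is the last one.
print "$1\n" if $string =~ /.*(^$header$(?:.*)^$footer$)/ms;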

Output:

18:45 >perl 1078_SoPW.pl
start: real
desired
string


end: real

18:45 >

Note the /ms modifiers in the regex. As perlre explains:

Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
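
To see the effect of the leading .* in isolation, here is a small comparison on a made-up three-block string (the data and variable names are illustrative only). Without the prefix the engine reports the leftmost block; with it, the engine backtracks from the end of the string and the first header it can give back is the last one.

```perl
use strict;
use warnings;

# Three blocks labelled A, B and C (illustrative data).
my $log = "start: real\nA\nend: real\nstart: real\nB\nend: real\nstart: real\nC\nend: real\n";

# One header..footer block; /m anchors ^ and $ at line boundaries,
# /s lets . cross newlines.
my $block = qr/^start: real$(?:.*?)^end: real$/ms;

# Without a prefix the leftmost block wins.
my ($first) = $log =~ /($block)/;

# A greedy .* first swallows the whole string, then backtracks until the
# block pattern can match, which happens at the last header.
my ($last) = $log =~ /.*($block)/s;

print "first:\n$first\n\nlast:\n$last\n";
```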

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Regex to find the last matching set in a long scalar

by superwombat (Acolyte) on Nov 15, 2014 at 09:45 UTC

Not sure why my first reply didn't show up. Thanks for your help, that's just what I needed.


Re: Regex to find the last matching set in a long scalar
by Laurent_R (Canon) on Nov 15, 2014 at 18:59 UTC

.*

Complex regular subexpression recursion limit (32766) exceeded at ...

no warnings qw/regexp/;


Re^2: Regex to find the last matching set in a long scalar

by Athanasius (Archbishop) on Nov 16, 2014 at 07:34 UTC

If the log file is huge, it will be better to avoid the scalar variable altogether and instead read the file in backwards, line by line. The File::ReadBackwards module on CPAN is designed for this purpose:

#! perl
use strict;
use warnings;
use File::ReadBackwards;

my $logfile = 'log.txt';
my $header  = 'start: real';
my $footer  = 'end: real';
my $in_data =  0;
my @lines;
my $bw      =  File::ReadBackwards->new($logfile)
    or die "Cannot open file '$logfile' for reading backwards: $!";

while (defined(my $line = $bw->readline))
{
    chomp $line;

    if ($in_data)
    {
        unshift @lines, $line;
        last if $line eq $header;
    }
    elsif ($line eq $footer)
    {
        unshift @lines, $line;
        $in_data = 1;
    }
}

print join("\n", @lines), "\n";

No regex needed. :-)
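
For the case raised earlier in the thread, where the log is only available as a single scalar, the same backward scan can be sketched with the built-in rindex, which searches a string from the end. This is only an illustration, assuming the header and footer each occupy a full line; find_last_line and last_block are hypothetical helper names, not part of any module.

```perl
use strict;
use warnings;

# Position of the last whole-line occurrence of $needle that starts at or
# before $from, or -1 if there is none. Hypothetical helper.
sub find_last_line {
    my ($text_ref, $needle, $from) = @_;
    my $len = length $needle;
    while ($from >= 0) {
        my $pos = rindex($$text_ref, $needle, $from);
        return -1 if $pos < 0;
        my $end    = $pos + $len;
        my $at_bol = $pos == 0 || substr($$text_ref, $pos - 1, 1) eq "\n";
        my $at_eol = $end == length($$text_ref) || substr($$text_ref, $end, 1) eq "\n";
        return $pos if $at_bol && $at_eol;
        $from = $pos - 1;    # partial-line hit: keep looking further left
    }
    return -1;
}

# Last footer in the text, plus the nearest header before it.
# Returns undef (in scalar context) when no complete block exists.
sub last_block {
    my ($text_ref, $header, $footer) = @_;
    my $foot = find_last_line($text_ref, $footer, length $$text_ref);
    return if $foot < 0;
    my $head = find_last_line($text_ref, $header, $foot - 1);
    return if $head < 0;
    return substr($$text_ref, $head, $foot + length($footer) - $head);
}

# Illustrative data modelled on the sample in the question.
my $text = "start:\nend:\nstart: real\n1\nend: real\nwith\nstart: real\ndesired\nstring\n\nend: real\n\nstart:\nend:";

my $block = last_block(\$text, 'start: real', 'end: real');
print "$block\n" if defined $block;
```

Passing a reference avoids copying the large scalar into the subroutine, and only the text from the last footer back to its header is examined.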

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^3: Regex to find the last matching set in a long scalar

by Laurent_R (Canon) on Nov 16, 2014 at 14:17 UTC

Yes, I definitely agree with you that it is usually not the best idea to slurp a huge file into a scalar (or an array), and I usually avoid that even for smaller files, at least when possible. I also agree that iterating backward with the module you mention is probably a better solution.
