Regexp matching on a multiline file: dealing with line breaks

BlueStarry has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Regexp matching on a multiline file: dealing with line breaks
by Athanasius (Archbishop) on Dec 06, 2015 at 04:48 UTC

Hello BlueStarry, and welcome to the Monastery!

If the entire file will fit in memory, a variation on kennethk’s solution is to simply delete the newlines before searching:

#! perl
use strict;
use warnings;

my $target = 'kitten';
my $string =  do { local $/; <DATA>; };
   $string =~ s/\n//g;
my $count  = () = $string =~ /\Q$target/g;

print "The target string '$target' occurs $count times in the file\n";

__DATA__
sushikitten
ilovethekit
tensushithe
kittenisthe

Output:

14:28 >perl 1474_SoPW.pl
The target string 'kitten' occurs 3 times in the file

14:28 >

However, as your input file is 5 GB, this approach is probably impractical. In that case you’re going to have to bite the bullet and implement a solution with “strange buffers”, such as a sliding window technique. Maybe have a look at Data::Iterator::SlidingWindow.

Hope that helps,

Athanasius <°(((>< contra mundum

Re^2: Regexp matching on a multiline file: dealing with line breaks

by Anonymous Monk on Dec 06, 2015 at 13:17 UTC

my $string = do { local $/; <DATA>; };

Re^3: Regexp matching on a multiline file: dealing with line breaks

by Athanasius (Archbishop) on Dec 06, 2015 at 13:43 UTC

We want to read the whole file (in this case, the contents of the __DATA__ section at the end of the script) into the scalar variable $string. By default, the readline operator <DATA>, evaluated in scalar context, reads only the next line from the filehandle.

So to read the whole file at once, we need to tell Perl that a “line” is the whole file. In Perl, the special variable $/ (also called $INPUT_RECORD_SEPARATOR and $RS) specifies what terminates a “line,” and undef is a special value which means “read the whole file at once.” See perlvar#Variables-related-to-filehandles.

Since $/ is a global variable, changing its value can have far-reaching consequences across a large program. It’s therefore good practice to localize any changes to just that part of the code where they’re required. Hence the idiom of localizing the variable with local (which, given no explicit value, also resets it to undef) and limiting the scope of that change by enclosing it in a block. We could say:

my $string;

{
    local $/;
    $string = <DATA>;
}

but wrapping it up in a do block is neater and more concise.
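
To see the scoping in action, here is a minimal sketch. The in-memory filehandle and the sample text are assumptions made only so the example is self-contained; a real script would open the actual file instead.

```perl
#! perl
use strict;
use warnings;

# Stand-in for a real file: an in-memory filehandle over a small string.
my $data = "line one\nline two\nline three\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $first = <$fh>;                    # default $/: reads up to the first newline
my $rest  = do { local $/; <$fh> };   # $/ is undef inside the block: reads all that is left
close $fh;

print "first: $first";
print "rest:  $rest";

# Once the do block has ended, the previous value of $/ is back in force.
print "separator restored\n" if defined $/ && $/ eq "\n";
```

The first read stops at the newline, the second one slurps the remainder, and code after the block still sees the default separator.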

Hope that helps,

Athanasius <°(((>< contra mundum

Re: Regexp matching on a multiline file: dealing with line breaks
by kennethk (Abbot) on Dec 05, 2015 at 22:29 UTC

In this case, I will assume you are loading the entire file into memory; if you are not, you will need to work with "strange buffers and things like that". It looks like your format has no whitespace dependence, so one easy way to do it is to allow arbitrary whitespace in your regex:

/k\s*i\s*t\s*t\s*e\s*n/;

Or, building the same pattern from the target string:

my $string = 'kitten';
my $regex = join '\s*', split //, $string;
/$regex/;
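
A slightly fuller sketch of the same idea, runnable as it stands. The sample text is taken from elsewhere in this thread; escaping each character with quotemeta is an addition here (not part of the snippet above), assumed useful in case the target ever contains regex metacharacters.

```perl
#! perl
use strict;
use warnings;

my $target = 'kitten';

# Escape each character, then allow optional whitespace between them.
my $pattern = join '\s*', map { quotemeta($_) } split //, $target;
my $regex   = qr/$pattern/;

# Sample text with the target split across a line break.
my $text  = "sushikitten\nilovethekit\ntensushithe\nkittenisthe\n";
my $count = () = $text =~ /$regex/g;

print "Pattern: $pattern\n";
print "The target string '$target' occurs $count times\n";
```

On this sample it counts 3, including the occurrence that straddles the second and third lines.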

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Regexp matching on a multiline file: dealing with line breaks
by Laurent_R (Canon) on Dec 06, 2015 at 09:45 UTC

If it does not fit in memory, then you probably have to bite the bullet and use strange buffers and things like that. But don't be too afraid of that: these "strange things", such as a sliding window, do not need to be very complicated and can be implemented in just 3 or 4 lines of code.
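
As a rough illustration, here is one way such a window might look. The sub name count_across_lines is hypothetical, the in-memory filehandle only stands in for the real file, and the sketch assumes the target cannot overlap itself (true for 'kitten') and that individual lines are short.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper: count $target in the text read from $fh, ignoring
# line breaks, while holding only one line plus a short tail in memory.
sub count_across_lines {
    my ($fh, $target) = @_;
    my $keep   = length($target) - 1;   # longest tail that can begin a split match
    my $buffer = '';
    my $count  = 0;

    while (my $line = <$fh>) {
        chomp $line;
        $buffer .= $line;
        $count  += () = $buffer =~ /\Q$target\E/g;

        # Slide the window: keep only the tail that could start the next match.
        if    ($keep == 0)              { $buffer = '' }
        elsif (length($buffer) > $keep) { $buffer = substr($buffer, -$keep) }
    }
    return $count;
}

# Demo on the sample data from this thread.
my $data = "sushikitten\nilovethekit\ntensushithe\nkittenisthe\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $count = count_across_lines($fh, 'kitten');
close $fh;

print "The target string 'kitten' occurs $count times in the file\n";
```

Because the retained tail is one character shorter than the target, it can never contain a complete match by itself, so nothing is counted twice.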

Re^2: Regexp matching on a multiline file: dealing with line breaks

by BlueStarry (Novice) on Dec 06, 2015 at 09:55 UTC

Many thanks to everyone.

I'll go with sliding windows, but first I have an idea of my own, though I don't know if it's correct. My original file is divided into many "paragraphs", each of them starting with a special line like this:

>Header

Re^3: Regexp matching on a multiline file: dealing with line breaks

by Laurent_R (Canon) on Dec 06, 2015 at 10:01 UTC

Yes, by all means, if you can identify sections or chunks where you can be sure that there cannot be an overlapping match on the chunk boundary, then you don't even need a sliding window. Just load and process one chunk after another, the same way you've been shown above for the whole file. It is even simpler than a sliding window.

Re^3: Regexp matching on a multiline file: dealing with line breaks

by Athanasius (Archbishop) on Dec 06, 2015 at 13:12 UTC

As Laurent_R says, this is an excellent strategy. Have a look at the entry for $INPUT_RECORD_SEPARATOR (usually spelled just $/) in perlvar. For example:

#! perl
use strict;
use warnings;

my $target = 'kitten';
my $count  =  0;

{
    local $/ = ">Header\n";

    while (my $string = <DATA>)
    {
        $string =~ s/\n//g;
        print "string is '$string'\n";
        $count += () = $string =~ /\Q$target/g;
    }
}

print "The target string '$target' occurs $count times in the file\n";

__DATA__
>Header
sushikitten
ilovethekit
tensushithe
kittenisthe
>Header
sushikittAn
ilovethekit
tensushithe
kittBnisthe

Output:

23:11 >perl 1474_SoPW.pl
string is '>Header'
string is 'sushikittenilovethekittensushithekittenisthe>Header'
string is 'sushikittAnilovethekittensushithekittBnisthe'
The target string 'kitten' occurs 4 times in the file

23:11 >
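
If you would rather not have the separator show up inside each chunk, chomp can strip it, since chomp removes whatever $/ currently holds. The following is only a sketch: the sub name count_in_chunks is hypothetical, the in-memory filehandle stands in for the real file, and it assumes every header is literally the line ">Header".

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper: count $target chunk by chunk, one header-delimited
# record at a time, ignoring line breaks inside each chunk.
sub count_in_chunks {
    my ($fh, $target) = @_;
    my $count = 0;

    local $/ = ">Header\n";            # a "line" now ends at the next header

    while (my $chunk = <$fh>) {
        chomp $chunk;                  # strips the trailing ">Header\n", if present
        next unless length $chunk;     # first record is empty when the file starts with a header
        $chunk =~ s/\n//g;
        $count += () = $chunk =~ /\Q$target\E/g;
    }
    return $count;
}

# Demo on the same sample data as above.
my $data = ">Header\nsushikitten\nilovethekit\ntensushithe\nkittenisthe\n"
         . ">Header\nsushikittAn\nilovethekit\ntensushithe\nkittBnisthe\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $count = count_in_chunks($fh, 'kitten');
close $fh;

print "The target string 'kitten' occurs $count times in the file\n";
```

The count is 4 again, but no chunk carries a stray ">Header" at its end, so a target can never accidentally match across the boundary text.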

Hope that helps,

Athanasius <°(((>< contra mundum

Re^4: Regexp matching on a multiline file: dealing with line breaks

by BlueStarry (Novice) on Dec 06, 2015 at 14:02 UTC

Re^4: Regexp matching on a multiline file: dealing with line breaks

by BlueStarry (Novice) on Dec 10, 2015 at 17:19 UTC

Re^5: Regexp matching on a multiline file: dealing with line breaks

by choroba (Cardinal) on Dec 10, 2015 at 17:26 UTC
